Reducing Latency between Landscape server and registered device

I’ve DM’d you my email address.

I’m not sure how to specify this for a brand store as the docs aren’t clear.

… but the device is already able to get the serial assertion. Does snap known serial print anything? Does it have the same brand-id as “$OUR_BRANDSTORE_ID”?

In the meantime, I’m looking at what has changed since 244.

@st3v3nmw I sent you the logs over email.

I definitely have a known serial already:

charlee@b40e85f8-1173-11ea-87ce-24418c038a3b:~$ snap known serial
type: serial
authority-id: $BRAND_ID
revision: 23
brand-id: $BRAND_ID
model: $MODEL
serial: b40e85f8-1173-11ea-87ce-24418c038a3b
...

I checked again now and broker is still in some sort of registration loop.

Fix is in the pipeline…

Will let you know when it’s approved, built and on the Edge channel

Hi @Charlee, the fix should now be available in the edge and beta channels.

Your other issue of generating clones is pending but not quite ready yet.

If you could give the latest builds a go and let us know if that resolves your issue that would be great.

Thanks

PS: Ah might have spoken too soon…

@mikecw do you still want me to test this?

FYI we don’t use landscape-client.config; we do everything through defaults in our gadget snap. Is it possible the issue is the same in that case?

Hi @charlee, sorry I know I am being a bit all over the place at the moment. Busy week.

Right - I wouldn’t test that one; I’d hold off for a few days if I were you. We now have PRs that should resolve all your issues. They just need approving and building - then your clones and devices not coming up should all be good. I’ll let you know as soon as those have gone through the system.

Sorry it took so long.

Michael

Dear Michael,

No worries, and thanks a lot for keeping me updated.

Hi Charlee, can you give the latest/edge build a test - that should resolve all the issues with your devices not coming up.

As for your issue with cloned devices, we think we have a fix, but it is not super clean at the moment so it will be a little longer I am afraid.

If this edge build resolves your issues, we can get that one to stable so you can have all your annotations stuff sooner rather than later.

Thanks

Dear Mike,

Didn’t have much time to test this today, but our very basic test that the image comes online on Landscape works! Tested with revision 314.

I did not test changing the exchange intervals or annotations yet, but coming online is a very good start :smiley:

Also, thanks a lot for releasing this before being ready with the cloned stuff.

I think we will be able to manage the cloning for now, the benefits of the other changes are too big! Also, since I know it only happens with Landscape update it is a little more predictable when it will happen.

Fix for cloning is included in the current edge and beta channels.

This will only work going forward; reverting to builds earlier than these could still cause clones. The documentation will be updated.

Perfect. We are currently focussing on some other stuff, but the next step is to update and test the landscape-client in our images and test them on the devices close by.

Probably doing that in the next couple of weeks, so you might hear from us!

I will trigger our basic integration test with the latest version to confirm with you the basics work.

I’m testing this again now, and although the snap now goes online consistently, stuff still seems fairly slow.

In our integration tests, we wait until Landscape reports a “last_ping_time” for the new device, after which we consider it as “online”.

With a landscape instance with these settings:

...
ping_interval = 30
urgent_exchange_interval = 30
exchange_interval = 30

It takes 5 minutes from the device coming online on the internet to Landscape first reporting a “last_ping_time” that is not None.

When looking at the API response I also see two other values with timestamps: last_exchange_time and effective. Is either of these better to use than last_ping_time to establish whether the device is ready to be used?
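For anyone wanting to reproduce the measurement, the wait described above can be sketched as a small polling helper. This is only a sketch under assumptions: `fetch_computer` is a hypothetical callable that returns the device's entry from the get_computers response as a dict, and the timeout and poll interval are arbitrary values, not recommendations. The field names `last_ping_time` and `last_exchange_time` are the ones seen in the API response.

```python
import time
from typing import Callable, Optional


def wait_for_field(
    fetch_computer: Callable[[], dict],  # hypothetical: returns one device's dict
    field: str,
    timeout: float = 600.0,              # assumed value, tune for your setup
    poll_interval: float = 5.0,          # assumed value, tune for your setup
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[float]:
    """Poll until computer[field] is not None.

    Returns the number of seconds waited, or None if the timeout
    expired before the field was populated.
    """
    start = clock()
    while True:
        computer = fetch_computer()
        if computer.get(field) is not None:
            return clock() - start
        if clock() - start >= timeout:
            return None
        sleep(poll_interval)
```

Calling it once with `"last_ping_time"` and once with `"last_exchange_time"` on a freshly booted device would give the two durations being compared here. The `clock` and `sleep` parameters exist only so the helper can be exercised without real waiting.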

After this we will add a test to execute a script and measure the time to completion.

@mikecw What would be the parameters you would suggest to get the nicest user experience without overloading stuff? For example, I now set all the exchanges to 30, but noticed we are sending up monitoring with that frequency as well. That might not be necessary. We just want devices to have a “most accurate online status” and “scripts to be executed as fast as possible”

And what is the best way to determine a device is online from the get_computers API call?

I set up a test: waiting for last_ping_time to be populated after a “new” device first boots takes 5 minutes. If we use last_exchange_time instead, it takes 2.26 minutes.

If I then execute a script, it does get delivered quite quickly: 27 seconds. Then if I wait for the status to go to succeeded, it is pretty quick again: 37 seconds.

So good results. However, I have to say that once in a while scripts don’t seem to get delivered at all, or it takes up to 30 minutes. Are there any edge cases we can hit here? What is the expected behavior if a script gets triggered and a device is offline at the time, for example?

Hi Charlee,

For your first question, I don’t really have a good answer. We have not really played with reducing these parameters too much in the past, so I am afraid your experience is probably as good as/better than ours.

In terms of most up-to-date online status, I don’t think it is deterministic. You can have pings that don’t then initiate exchanges, or exchanges that occur without a ping. The engineer in me says take both and get the minimum.
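One way to read "take both and get the minimum" is to compute how long ago each of the two timestamps was and use the smaller age, i.e. the most recent contact of either kind. The sketch below assumes the API returns `last_ping_time` and `last_exchange_time` as ISO 8601 strings (or None), which should be checked against a real response; the helper names and the 90 second threshold (three times the 30 second intervals mentioned earlier) are illustrative assumptions, not Landscape defaults.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def parse_ts(value: Optional[str]) -> Optional[datetime]:
    """Parse an ISO 8601 timestamp (assumed format); None stays None."""
    if value is None:
        return None
    parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    if parsed.tzinfo is None:
        # Assumption: timestamps without an offset are UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def seconds_since_last_contact(computer: dict, now: datetime) -> Optional[float]:
    """Smallest age across ping and exchange, or None if neither is set."""
    stamps = [
        parse_ts(computer.get(key))
        for key in ("last_ping_time", "last_exchange_time")
    ]
    ages = [(now - stamp).total_seconds() for stamp in stamps if stamp is not None]
    return min(ages) if ages else None


def is_online(
    computer: dict,
    now: datetime,
    max_age: timedelta = timedelta(seconds=90),  # assumed threshold
) -> bool:
    """Treat the device as online if either contact is recent enough."""
    age = seconds_since_last_contact(computer, now)
    return age is not None and age <= max_age.total_seconds()
```

With this reading, a device whose last ping is five minutes old but whose last exchange was 30 seconds ago still counts as online, which matches the observation that pings and exchanges can happen independently of each other.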

The scripts not getting delivered for 30 minutes is a bizarre scenario that I would like to dig into further. With your ping settings, the server should flag there is a script working fairly quickly and then instantly schedule an exchange. If you have any logs from one of those instances, we’d be happy to look into it to see if you are somehow managing a perfect storm of other activities that somehow delay the script.