Reducing Latency between landscape server and registered device

st3v3nmw · October 1, 2024, 9:57am

I’ve DM’d you my email address.

I’m not sure how to specify this for a brand store as the docs aren’t clear.

… but the device is already able to get the serial assertion. Does snap known serial print anything? Does it have the same brand-id as “$OUR_BRANDSTORE_ID”?

In the meantime, I’m looking at what has changed since 244.

Charlee · October 1, 2024, 11:06am

@st3v3nmw I sent you the logs over email.

I definitely have a known serial already:

charlee@b40e85f8-1173-11ea-87ce-24418c038a3b:~$ snap known serial
type: serial
authority-id: $BRAND_ID
revision: 23
brand-id: $BRAND_ID
model: $MODEL
serial: b40e85f8-1173-11ea-87ce-24418c038a3b
...
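To check the brand-id question above without reading the output by eye, a small sketch like the following could be used. It assumes the text comes from `snap known serial`, that the headers are plain `key: value` lines ending at the first blank line, and that only the first assertion matters. The function names and the expected brand value are hypothetical.

```python
from typing import Dict


def parse_assertion_headers(text: str) -> Dict[str, str]:
    """Collect the simple `key: value` headers of the first assertion in `text`."""
    headers: Dict[str, str] = {}
    for line in text.splitlines():
        if not line.strip():
            break  # a blank line ends the header block
        if line.startswith((" ", "\t")) or ": " not in line:
            continue  # skip continuation lines and elisions such as "..."
        key, value = line.split(": ", 1)
        headers[key] = value.strip()
    return headers


def same_brand(text: str, expected_brand_id: str) -> bool:
    """True if the assertion's brand-id header equals the expected brand store ID."""
    return parse_assertion_headers(text).get("brand-id") == expected_brand_id
```

The text could be captured with `subprocess.run(["snap", "known", "serial"], capture_output=True, text=True).stdout` and passed to `same_brand` along with the brand store ID.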

I checked again now and broker is still in some sort of registration loop.

mikecw · October 3, 2024, 3:04pm

Fix is in the pipeline…

github.com/canonical/landscape-client

fix: registration with landscape-client.config on the snap

canonical:main ← st3v3nmw:fix-registration-on-snap-3

opened 03:00PM - 03 Oct 24 UTC

st3v3nmw

+3 -1

After the [landscape-config refactor](https://github.com/canonical/landscape-cli…ent/pull/262/files), registering on new revisions of the snap fails since the `config` app cannot read files in `/sys/bus/xen/devices` & `/sys/class/dmi/id`.

To reproduce:

- `snap install landscape-client --edge`
- `sudo landscape-client.config ...` fails with the following error:

```
Traceback (most recent call last):
  File "/snap/landscape-client/294/usr/bin/landscape-config", line 16, in <module>
    main(sys.argv[1:])
  File "/snap/landscape-client/294/usr/lib/python3.10/site-packages/landscape/client/configuration.py", line 914, in main
    exit_code = attempt_registration(identity, config)
  File "/snap/landscape-client/294/usr/lib/python3.10/site-packages/landscape/client/configuration.py", line 739, in attempt_registration
    client_info = ClientRegistrationInfo.from_identity(identity)
  File "/snap/landscape-client/294/usr/lib/python3.10/site-packages/landscape/client/registration.py", line 61, in from_identity
    vm_info=get_vm_info(),
  File "/snap/landscape-client/294/usr/lib/python3.10/site-packages/landscape/lib/vm_info.py", line 23, in get_vm_info
    if _is_vm_xen(root_path):
  File "/snap/landscape-client/294/usr/lib/python3.10/site-packages/landscape/lib/vm_info.py", line 56, in _is_vm_xen
    return os.path.isdir(sys_xen_path) and os.listdir(sys_xen_path)
PermissionError: [Errno 13] Permission denied: '/sys/bus/xen/devices'
```

---

Fix: Added the `hardware-observe` interface to the `landscape-client.config` app which gives it access to the necessary folders.

The [`hardware-observe`](https://snapcraft.io/docs/hardware-observe-interface) interface will grant read access to these paths:

> - For tools like lspci -A linux-sysfs to get information on files in /sys:
>   /sys/{block,bus,class,devices,firmware}/{,**}
> - For container information:
>   /run/systemd/container

The `landscape-client` app is already connected to `hardware-observe` so no more auto-connection requests are required.

Will let you know when it’s approved, built and on the Edge channel

mikecw · October 4, 2024, 7:47am

Hi @Charlee, the fix should now be available in the edge and beta channels.

Your other issue of generating clones is pending but not quite ready yet.

If you could give the latest builds a go and let us know if that resolves your issue that would be great.

Thanks

PS: Ah might have spoken too soon…

Charlee · October 7, 2024, 2:08pm

@mikecw do you still want me to test this?

FYI we don’t use landscape-client.config; we do everything through defaults in our gadget snap. Is it possible the issue is the same in that case?

mikecw · October 8, 2024, 9:30am

Hi @charlee, sorry I know I am being a bit all over the place at the moment. Busy week.

Right - I wouldn’t test that one, and I’d hold off for a few days if I were you. We now have PRs that should resolve all your issues. They just need approving and building - then your clones and devices not coming up should all be good. I’ll let you know as soon as those have gone through the system.

Sorry it took so long.

Michael

Charlee · October 8, 2024, 9:33am

Dear Michael,

No worries, and thanks a lot for keeping me updated.

mikecw · October 11, 2024, 7:25am

Hi Charlee, can you give the latest/edge build a test - that should resolve all the issues with your devices not coming up.

As for your issue with cloned devices, we think we have a fix, but it is not super clean at the moment so it will be a little longer I am afraid.

If this edge build resolves your issues, we can get that one to stable so you can have all your annotations stuff sooner rather than later.

Thanks

Charlee · October 11, 2024, 3:36pm

Dear Mike,

Didn’t have much time to test this today, but our very basic test that the image comes online on Landscape works! Tested with revision 314.

I did not test changing the exchange intervals or annotations yet, but coming online is a very good start

Charlee · October 11, 2024, 3:37pm

Also, thanks a lot for releasing this before being ready with the cloning fix.

I think we will be able to manage the cloning for now, the benefits of the other changes are too big! Also, since I know it only happens with Landscape update it is a little more predictable when it will happen.

mikecw · October 18, 2024, 9:48am

Fix for cloning is included in the current edge and beta channels.

This will only work going forward; reverting to builds earlier than these can still cause clones. The documentation will be updated.

Charlee · October 18, 2024, 9:56am

Perfect. We are currently focussing on some other stuff, but the next step is to update the landscape-client in our images and test them on the devices we have close by.

Probably doing that in the next couple of weeks, so you might hear from us!

I will trigger our basic integration test with the latest version to confirm with you the basics work.

Charlee · October 22, 2024, 3:14pm

I’m testing this again now, and although the snap now goes online consistently, stuff still seems fairly slow.

In our integration tests, we wait until Landscape reports a “last_ping_time” for the new device, after which we consider it as “online”.

With a landscape instance with these settings:

...
ping_interval = 30
urgent_exchange_interval = 30
exchange_interval = 30

It takes 5 minutes from the device coming online on the internet until Landscape first reports a “last_ping_time” that is not None.

Looking at the API response, I also see two other values with timestamps: last_exchange_time and effective. Is either of these better to use than last_ping_time to establish whether the device is ready to be used?

After this we will add a test to execute a script and measure the time to completion.
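The “wait until online” check described above could be sketched roughly as follows. This is only an illustration: `fetch_computer` is a hypothetical callable that returns the device’s entry from the API as a dict, and the timeout and poll interval are arbitrary. Only the field names `last_ping_time` and `last_exchange_time` come from the API response mentioned above.

```python
import time
from typing import Callable, Optional, Tuple


def wait_until_online(
    fetch_computer: Callable[[], dict],  # hypothetical: returns the device's API entry
    fields: Tuple[str, ...] = ("last_ping_time", "last_exchange_time"),
    timeout: float = 600.0,
    poll_every: float = 10.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[Tuple[str, float]]:
    """Poll until any of `fields` is set.

    Returns (field_name, seconds_waited) for the first field found populated,
    or None if `timeout` seconds pass without any of them being set.
    """
    start = clock()
    while True:
        computer = fetch_computer()
        waited = clock() - start
        for field in fields:
            if computer.get(field) is not None:
                return field, waited
        if waited >= timeout:
            return None
        sleep(poll_every)
```

Recording which field was populated first, and after how long, makes it easier to compare the two candidates for “online” across test runs.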

Charlee · October 23, 2024, 8:28pm

@mikecw What would be the parameters you would suggest to get the nicest user experience without overloading stuff? For example, I now set all the exchanges to 30, but noticed we are sending up monitoring with that frequency as well. That might not be necessary. We just want devices to have a “most accurate online status” and “scripts to be executed as fast as possible”

And what is the best way to determine a device is online from the get_computers API call?

I set up a test: waiting for last_ping_time to be populated for the first time after a “new” device first boots takes 5 minutes. If we use last_exchange_time instead, it takes 2.26 minutes.

If I then execute a script, it gets delivered quite quickly: 27 seconds. Waiting for the status to go to succeeded was pretty quick again: 37 seconds.

So good results. However, I have to say that once in a while scripts don’t seem to get delivered at all, or it takes up to 30 minutes. Are there any edge cases we could be hitting here? What is the expected behavior if a script gets triggered while a device is offline, for example?

mikecw · October 25, 2024, 5:52am

Hi Charlee,

For your first question, I don’t really have a good answer. We have not really played with reducing these parameters too much in the past, so I am afraid your experience is probably as good as/better than ours.

In terms of most up-to-date online status, I don’t think it is deterministic. You can have pings that don’t then initiate exchanges, or exchanges that occur without a ping. The engineer in me says take both and get the minimum.
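One reading of “take both and get the minimum” is to compute how long ago each signal was seen and use the smaller age. A minimal sketch under that reading follows. It assumes the API returns ISO 8601 timestamp strings (possibly with a trailing `Z`) and treats naive timestamps as UTC. Both are assumptions, and the helper names are made up.

```python
from datetime import datetime, timezone
from typing import Optional


def parse_ts(value: Optional[str]) -> Optional[datetime]:
    """Parse an ISO 8601 string; naive timestamps are assumed to be UTC."""
    if value is None:
        return None
    parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def seconds_since_contact(computer: dict, now: datetime) -> Optional[float]:
    """Smaller of the ping age and the exchange age, or None if neither is set."""
    timestamps = (
        parse_ts(computer.get("last_ping_time")),
        parse_ts(computer.get("last_exchange_time")),
    )
    ages = [(now - ts).total_seconds() for ts in timestamps if ts is not None]
    return min(ages) if ages else None
```

A device could then be treated as online when this value is below some threshold tied to the configured intervals, for example a small multiple of `ping_interval`.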

The scripts not getting delivered for 30 minutes is a bizarre scenario that I would like to dig into further. With your ping settings, the server should flag there is a script working fairly quickly and then instantly schedule an exchange. If you have any logs from one of those instances, we’d be happy to look into it to see if you are somehow managing a perfect storm of other activities that somehow delay the script.

Charlee · October 29, 2024, 10:26am

Hey Mike,

Can confirm that the annotations plug is also working nicely.

I saw you guys made an epoch, does that mean that we cannot downgrade from this revision?

Any idea when the snap will go to stable? Do you need to do extra validations? In other words, do you think this is good enough for production?

mikecw · October 29, 2024, 11:09am

Hi Charlee,

That’s good news! As for downgrading you can if you need but there shouldn’t be a need to.

As for the snap, it has been tested and all known issues are resolved and working and tested for production so I think we can go stable with it (famous last words). I’ll do that for you now.

Charlee · October 30, 2024, 8:35am

It happened again that the script took forever to reach the device. I think there might be some edge condition that if the device is exchanging some initial data the script doesn’t arrive.

Picture to show the script triggered:

If I look in the logs I see that at 8:07 UTC the device is exchanging data about what snaps are installed etc:

2024-10-30T08:07:37Z landscape-client.landscape-client[2339]: 2024-10-30 08:07:37,471 DEBUG    [MainThread] Finished firing pre-exchange.
2024-10-30T08:07:37Z landscape-client.landscape-client[2339]: 2024-10-30 08:07:37,491 INFO     [MainThread] Starting urgent message exchange with https://landscape.canonical.com/message-system.
2024-10-30T08:07:37Z landscape-client.landscape-client[2339]: 2024-10-30 08:07:37,539 DEBUG    [PoolThread-twisted.internet.reactor-0] Sending payload:
2024-10-30T08:07:37Z landscape-client.landscape-client[2339]: {'accepted-types': b'3k\xa7\xde\xc7QS-J\xf7\xe6\xfd\x10a\xe1 ',
2024-10-30T08:07:37Z landscape-client.landscape-client[2339]:  'client-api': b'3.8',
2024-10-30T08:07:37Z landscape-client.landscape-client[2339]:  'messages': [{'api': b'3.3',
...
2024-10-30T08:07:37Z landscape-client.landscape-client[2339]:                'snaps': {'installed': [{'config': '{"code": '
2024-10-30T08:07:37Z landscape-client.landscape-client[2339]:                                                   '"b40e85f8-1173-11ea-87ce-24418c038a3b"}',
2024-10-30T08:07:37Z landscape-client.landscape-client[2339]:    
...
2024-10-30T08:07:37Z landscape-client.landscape-client[2339]:  'sequence': 1,
2024-10-30T08:07:37Z landscape-client.landscape-client[2339]:  'server-api': b'3.3',
2024-10-30T08:07:37Z landscape-client.landscape-client[2339]:  'total-messages': 16}
2024-10-30T08:22:50Z landscape-client.landscape-client[2339]: 2024-10-30 08:07:40,183 INFO     [PoolThread-twisted.internet.reactor-0] Sent 65521 bytes and received 232 bytes in 2.68s.
2024-10-30T08:22:50Z landscape-client.landscape-client[2339]: 2024-10-30 08:07:40,184 DEBUG    [PoolThread-twisted.internet.reactor-0] Received payload:

Now that I look more closely at these logs, it also seems like maybe it just takes a very long time to send the first payload. According to journalctl, the last log line (65521 bytes sent) was written at 08:22, but the client’s own timestamp on that same line says it was logged at 08:07 already.

Is it possible that the Landscape server is struggling with some load? I have noticed that we regularly receive server errors on the UI and API.
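To spot this kind of gap across a whole log, the two timestamps on each line can be compared. The sketch below assumes the line layout shown above (journal timestamp, unit name, then the client’s own timestamp) and that both timestamps are in the same timezone. The function name is made up.

```python
import re
from datetime import datetime, timedelta
from typing import Optional

# Journal timestamp, unit name, then the client's own "YYYY-MM-DD HH:MM:SS,mmm" stamp.
LINE = re.compile(
    r"^(?P<journal>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})Z \S+ "
    r"(?P<client>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
)


def journal_lag(line: str) -> Optional[timedelta]:
    """Journal time minus client time, or None for lines without a client timestamp."""
    match = LINE.match(line)
    if match is None:
        return None  # e.g. continuation lines of a dumped payload
    journal = datetime.strptime(match["journal"], "%Y-%m-%dT%H:%M:%S")
    client = datetime.strptime(match["client"], "%Y-%m-%d %H:%M:%S,%f")
    return journal - client
```

For the “Sent 65521 bytes” line this gives a lag of about 15 minutes 10 seconds, while the lines from 08:07:37 show less than a second.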

Also, now my script finally arrived:

And it seems to have some curl errors. Of course the internet problems could be on our side, but I am running this in a virtual QEMU instance where I have never really experienced network issues, and the network for the rest of my laptop works fine.

(this is still revision 244 btw, not sure if that matters)

mikecw · October 30, 2024, 11:54am

Ok, so I have just had a chat with our support people about this and it sounds like an issue they are currently investigating.

It seems that “sometimes” when sending a script, it gets to the device and executes, then it sends its status response to the server. The server is not acknowledging that return message. The client keeps trying on the normal exchange interval (default: 15 mins) to get that message to the server.

Just remind me - was this on our Landscape SaaS instance or were you self-hosting? The problem appears to be on our SaaS, which is under active investigation. Client version shouldn’t matter - it does seem to be server side.

Charlee · October 30, 2024, 12:11pm

Yes, SaaS.

In general, we see that SaaS behavior is very non-deterministic in terms of timings, and it feels like load is not handled properly.