I’m not sure how to specify this for a brand store as the docs aren’t clear.
… but the device is already able to get the serial assertion. Does snap known serial print anything? Does it have the same brand-id as “$OUR_BRANDSTORE_ID”?
In the meantime, I’m looking at what has changed since 244.
Hi @charlee, sorry I know I am being a bit all over the place at the moment. Busy week.
Right - I wouldn’t test that one; I’d hold off for a few days if I were you. We now have PRs that should resolve all your issues. They just need approving and building, and then the problems with clones and with devices not coming up should all be fixed. I’ll let you know as soon as those have gone through the system.
Also, thanks a lot for releasing this before being ready with the cloned stuff.
I think we will be able to manage the cloning for now, the benefits of the other changes are too big!
Also, since I know it only happens with a Landscape update, it is a little more predictable when it will happen.
Fix for cloning is included in the current edge and beta channels.
This will only work going forward; reverting to builds earlier than these could still cause clones. The documentation will be updated.
Perfect. We are currently focussing on some other stuff, but the next step is to update the landscape-client in our images and test them on the devices close by.
Probably doing that in the next couple of weeks, so you might hear from us!
I will trigger our basic integration test with the latest version to confirm with you the basics work.
It takes 5 minutes from the device coming online to Landscape first setting “last_ping_time” to a non-None value.
When looking at the API response I also see two other values with timestamps: last_exchange_time and effective. Is either of these better to use than last_ping_time to establish whether the device is ready to be used?
After this we will add a test to execute a script and measure the time to completion.
@mikecw What parameters would you suggest to get the nicest user experience without overloading anything? For example, I now set all the exchanges to 30, but noticed we are sending monitoring data at that frequency as well. That might not be necessary. We just want devices to have a “most accurate online status” and “scripts to be executed as fast as possible”.
And what is the best way to determine a device is online from the get_computers API call?
I set up a test: waiting for last_ping_time to be populated after a “new” device first boots takes 5 minutes. If we use last_exchange_time instead, it takes 2.26 minutes.
If I then execute a script, it gets delivered quite quickly: 27 seconds.
Waiting for the status to go to succeeded was quick as well: 37 seconds.
So good results. However, I have to say that once in a while scripts don’t seem to get delivered at all, or it takes up to 30 minutes. Are there any edge cases we can hit here? For example, what is the expected behavior if a script gets triggered while a device is offline?
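The timings above (time until the first timestamp appears, time until delivery, time until succeeded) can be measured with a small polling helper. This is only a sketch: `wait_until` and its `check` callable are hypothetical names, and `check` stands in for whatever API query you already use, for example "is last_exchange_time set?" or "is the script status succeeded?".

```python
import time
from typing import Callable, Optional


def wait_until(
    check: Callable[[], bool],
    timeout_s: float,
    poll_s: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[float]:
    """Poll check() until it returns True.

    Returns the elapsed seconds on success, or None if timeout_s passes
    first. clock and sleep are injectable so the helper can be tested
    without real waiting.
    """
    start = clock()
    while True:
        if check():
            return clock() - start
        if clock() - start >= timeout_s:
            return None
        sleep(poll_s)
```

Using a generous timeout (say 45 minutes) makes the slow cases show up as a measured duration rather than a test failure, which is useful when collecting logs for them.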
For your first question, I don’t really have a good answer. We have not really played with reducing these parameters too much in the past, so I am afraid your experience is probably as good as/better than ours.
In terms of most up-to-date online status, I don’t think it is deterministic. You can have pings that don’t then initiate exchanges, or exchanges that occur without a ping. The engineer in me says take both and get the minimum.
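One way to read "take both and get the minimum" is to compute the age of each timestamp and use the smaller one, i.e. the most recent contact of either kind. The sketch below does that for a single entry from get_computers. The field names come from this thread; the ISO 8601 timestamp format, the UTC default, and the 120-second threshold are assumptions, and the function names are hypothetical.

```python
from datetime import datetime, timezone
from typing import Optional


def parse_ts(value: Optional[str]) -> Optional[datetime]:
    """Parse an ISO 8601 timestamp; None or empty input stays None."""
    if not value:
        return None
    # Older Python versions reject a trailing 'Z', so normalise it first.
    dt = datetime.fromisoformat(value.replace("Z", "+00:00"))
    # Assumption: timestamps without an offset are in UTC.
    return dt if dt.tzinfo else dt.replace(tzinfo=timezone.utc)


def seconds_since_last_contact(computer: dict, now: datetime) -> Optional[float]:
    """Smallest age, in seconds, across the ping and exchange timestamps."""
    ages = []
    for field in ("last_ping_time", "last_exchange_time"):
        ts = parse_ts(computer.get(field))
        if ts is not None:
            ages.append((now - ts).total_seconds())
    return min(ages) if ages else None


def is_online(computer: dict, now: datetime, max_age_s: float = 120.0) -> bool:
    """Treat the device as online if either timestamp is recent enough."""
    age = seconds_since_last_contact(computer, now)
    return age is not None and age <= max_age_s
```

The threshold should be somewhat larger than the configured ping and exchange intervals, otherwise a healthy device will flap between online and offline.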
The scripts not getting delivered for 30 minutes is a bizarre scenario that I would like to dig into further. With your ping settings, the server should flag there is a script working fairly quickly and then instantly schedule an exchange. If you have any logs from one of those instances, we’d be happy to look into them to see whether you are hitting a perfect storm of other activities that delays the script.
That’s good news! As for downgrading, you can if you need to, but there shouldn’t be any need.
As for the snap, it has been tested for production and all known issues are resolved, so I think we can go stable with it (famous last words). I’ll do that for you now.
It happened again that the script took forever to reach the device. I think there might be some edge condition where, if the device is still exchanging some initial data, the script doesn’t arrive.
Now that I look more closely at these logs, it also seems like maybe it just takes a very long time to send the first payload; the last log line in journalctl says that 65521 bytes were sent at 08:22. However, in the internal logs it seems like that log entry was already triggered at 08:07?
Is it possible that the Landscape server is struggling with some load? I have noticed that we regularly receive server errors on the UI and the API.
And it seems to have some curl errors.
Of course the internet problems could be on our side, but I am running this in a virtual QEMU instance in which I have never really experienced network issues, and the network for the rest of my laptop works fine.
(this is still revision 244 btw, not sure if that matters)
Ok, so I have just had a chat with our support people about this and it sounds like an issue they are currently investigating.
It seems that “sometimes” when sending a script, it gets to the device and executes, then it sends its status response to the server. The server is not acknowledging that return message. The client keeps trying on the normal exchange interval (default: 15 mins) to get that message to the server.
Just remind me - was this on our Landscape SaaS instance or were you self-hosting? The problem appears to be on our SaaS, which is under active investigation. Client version shouldn’t matter - it does seem to be server side.