Grub.cfg corruption on pc-amd64 gadget

I recently had a system end up with a corrupted grub.cfg file in its EFI partition. The system would not boot past the GRUB console after failing to load the OS. When I printed grub.cfg from the console, the output stopped halfway through with this error: "error: invalid cluster 0". I know the cause could be something in our application, but I was wondering whether anyone has ideas about what could have corrupted the EFI partition, since our system and kernel logs don’t appear to show anything that would have modified this partition.

It seems like the system-boot partition is corrupted. You can take a dd dump of it and then run fsck.vfat on the resulting file to see exactly what is wrong. The system-boot partition (specifically its grub.cfg file) is modified when a new kernel or OS gets installed. However, this is a different partition from the EFI partition, which is not touched by snapd.
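In case it helps, here is a rough Python sketch of the dump-and-check step suggested above. It copies the raw partition into an image file (the same thing dd does) and then runs the checker against the image in read-only mode, so the real partition is never modified. The function names and the device path in the usage comment are hypothetical; the device node on your system will differ, and the script assumes dosfstools is installed and that it runs with enough privileges to read the device.

```python
import shutil
import subprocess


def dump_partition(device, image_path, chunk_size=4 * 1024 * 1024):
    """Copy the raw bytes of `device` into `image_path`, like `dd if=... of=...`."""
    with open(device, "rb") as src, open(image_path, "wb") as dst:
        shutil.copyfileobj(src, dst, chunk_size)
    return image_path


def check_image(image_path):
    """Run fsck.vfat on the image without writing to it.

    The -n flag asks fsck.vfat to only report problems, not repair them.
    Returns the exit code and the combined output of the tool.
    """
    result = subprocess.run(
        ["fsck.vfat", "-n", image_path],
        capture_output=True,
        text=True,
    )
    return result.returncode, result.stdout + result.stderr


# Example usage (hypothetical device node, adjust for your system):
#   image = dump_partition("/dev/sda2", "/tmp/system-boot.img")
#   code, report = check_image(image)
#   print(code, report)
```

Working on a copy also means you keep the image around for later inspection, even after the device itself has been repaired or reflashed.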

I was looking through my post history and thought I’d follow up on this in case anyone searches for similar issues in the future.

The issue was that the file was being touched by our application, and a shutdown occurred before the file was written to disk. We removed that code and made the change in the gadget snap we are using, so editing grub.cfg while the system is running is no longer required. This seems like a more correct method than editing the file while the system is running.
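For anyone who cannot avoid writing a file at runtime, here is a minimal Python sketch of a common way to reduce the window for this kind of partial write: write to a temporary file, fsync it, then rename it over the target. This is not what we did (we removed the runtime edit entirely), and the function name is hypothetical. Also note that a rename on a FAT filesystem may not be atomic the way it is on ext4, so treat this as a mitigation under those assumptions, not a guarantee.

```python
import os
import tempfile


def write_file_durably(path, data):
    """Write `data` (bytes) to `path` via a temp file, fsync, and rename."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file in the same directory so the rename stays on
    # the same filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            # Force the file contents out of the page cache to the disk.
            os.fsync(tmp.fileno())
        # Swap the new file into place only after its contents are on disk.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    # Flush the directory entry as well, so the rename itself is persisted.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

The key point is the order of operations: the old file stays in place until the new contents have been flushed, so a shutdown in the middle leaves either the old file or the new one, at least on filesystems where rename is atomic.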