Failure to refresh the telegraf snap on Kubernetes hosts


On 2021-01-29, the telegraf snap changed its behaviour to run its daemon as the telegraf user instead of root:


root     22025     1  3 Jan17 ?        11:19:05 /snap/telegraf/182/bin/telegraf --config /var/snap/telegraf/182/telegraf.conf --config-directory /var/snap/telegraf/182/telegraf.d


root     31363     1  0 Jan29 ?        00:00:00 sudo -u telegraf /snap/telegraf/206/bin/telegraf --config /var/snap/telegraf/206/telegraf.conf --config-directory /var/snap/telegraf/206/telegraf.d
telegraf 31445 31363  0 Jan29 ?        00:35:36 /snap/telegraf/206/bin/telegraf --config /var/snap/telegraf/206/telegraf.conf --config-directory /var/snap/telegraf/206/telegraf.d

It refreshed successfully on ~575 machines, but I noticed 33 not picking up the new version.
They’re all kubernetes hosts, more precisely kubernetes-masters, kubernetes-workers and etcds, deployed with CDK.

In their dmesg output:

[32066322.871930] audit: type=1400 audit(1612170360.031:39037): apparmor="DENIED" operation="ptrace" profile="/snap/core/10583/usr/lib/snapd/snap-confine" pid=6485 comm="telegraf" requested_mask="tracedby" denied_mask="tracedby" peer="snap.telegraf.telegraf

A manual sudo snap refresh telegraf worked fine.

Given the time correlation, I guessed this was caused by the combination of telegraf performing a sudo -u [...] and the CDK apparmor profiles, but I don’t know that for sure.

I could perform the manual refreshes on my machines in a few minutes, but I would prefer to identify the root cause and make sure this doesn’t hit this snap, or any other, again in the future.
Help investigating this would be appreciated.

This is definitely not the way to run any binary in a snap (and note that classic confinement makes no difference here): you are breaking confinement and are also not using the correct environment this way. To run a snap application, either use snap run <name of the app> or execute it via /snap/bin/<name of the app> (which should have been added to your PATH automatically when snapd was installed). Anything else is broken and wrong.

Also please note that there is no (permitted) way for a snap to create users, or to run a daemon as any specific user other than “snap_daemon” or root …

If the telegraf snap arbitrarily modifies data in the system’s password db, that is wrong behaviour.

I must be missing something; this is a rather common pattern:

root     13895     1  5  2020 ?        12-01:41:42 /snap/kubelet/1500/kubelet --config=/root/cdk/kubelet/config.yaml --container-runtime=docker --dynamic-config-dir=/root/cdk/kubelet/dynamic-config --config=/root/cdk/kubelet/config.yaml --kubeconfig=/root/cdk/kubeconfig --logtostderr --network-plugin=cni --node-ip=<IP> --pod-infra-container-image=<URL>:5000/cdk/pause-amd64:3.1 --v=0
root     21468     1  0 Jan17 ?        00:04:43 /usr/lib/snapd/snapd
root     22025     1  3 Jan17 ?        11:21:41 /snap/telegraf/182/bin/telegraf --config /var/snap/telegraf/182/telegraf.conf --config-directory /var/snap/telegraf/182/telegraf.d
root     32644     1  0  2020 ?        06:01:29 /snap/kube-proxy/1494/kube-proxy --cluster-cidr=<cidr> --hostname-override=<hostname> --kubeconfig=/root/cdk/kubeproxyconfig --logtostderr --master=https://<IP>:443 --v=0
root     32697     1  0  2020 ?        00:39:00 /snap/canonical-livepatch/95/canonical-livepatchd

I used daemon: simple, calling a small wrapper that performs a few checks and then starts $SNAP/bin/telegraf --config [...].
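For context, the packaging looks roughly like this (a minimal sketch of the snapcraft.yaml apps stanza; the wrapper name is illustrative, not the snap's actual one):

```yaml
# Hypothetical sketch, not the actual telegraf snapcraft.yaml.
# The wrapper performs a few checks, then starts
# $SNAP/bin/telegraf --config ... as described above.
apps:
  telegraf:
    command: bin/telegraf-wrapper
    daemon: simple
```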

I take the point regarding user creation, though. I noticed that users implicitly expect the non-root user to be ‘telegraf’, and I didn’t want to break their experience. But I understand that it breaks the snap design, so I will work on moving to snap_daemon.
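As a sketch of what that move would involve (based on my reading of the snapd system-usernames docs; untested for this snap):

```yaml
# Declare the shared snap_daemon system user (a snapd feature):
system-usernames:
  snap_daemon: shared
```

The wrapper would then drop privileges itself, e.g. with setpriv --clear-groups --reuid snap_daemon --regid snap_daemon -- "$SNAP/bin/telegraf" [...], instead of spawning sudo -u telegraf.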

If the snap is running telegraf via a snap service, then the ps output is correct for that: ps will always show /snap/whatever/current/.../bin/telegraf instead of /usr/bin/snap run, since the latter exec()s the former, so /proc reports the process as /snap/whatever/current/...

This is correct and the right way to do it.

Well, since the snap is classic it’s not a big deal, but if it were a strict snap then yes, it would need to be using system-usernames.

Oops, I’m sorry, I missed that the excerpt above was from ps output…

Regarding the user, I beg to disagree here … the snap calls system utilities like “install”, “useradd” and “groupadd” from its hook without shipping them.

Even classic snaps should be, and need to be, self-contained.
What happens if you install the snap on a system that lacks one of these tools? What happens if you install it on a system that has one of them but not the other (i.e. you end up with a half-updated passwd db when the hook fails), or if the version on the host simply uses other or differently named switches? What (or who) removes the user if the snap gets uninstalled, etc.?

While a classic snap can access bits of the host system, it is still good practice to simply stick to the common snap ways of dealing with the runtime environment, e.g. use the snap_daemon user to drop privileges, instead of potentially trashing the host’s password db.

Thanks for the clarifications, I’ll switch to using snap_daemon.

To get back to the original question: is the refresh failure due to the useradd attempt, or should I investigate further?

Something you do during the upgrade seems to call ptrace(), which I personally would not expect to be blocked under classic confinement … probably @ijohnson has an idea here? Otherwise we need to pull in someone from the security team …

Ah well, if the snap is indeed using host utilities then it should not be doing that; it should instead ship all such utilities itself.

Regarding debugging the failed refresh, I’m not sure what the cause is. But since the denial is for snap-confine with a classic snap, perhaps your classic snap is trying to call another classic snap, and the file descriptors your snap opened are being shared with the other classic snap; they have to go through snap-confine first, which doesn’t have permission to access them, hence the denial. If that is indeed the cause, this is a bug.
