From 183ccf294f1c72b5e18eb14dbedaae60cda199a3 Mon Sep 17 00:00:00 2001 From: Konrad Bucheli Date: Tue, 4 Oct 2022 17:06:10 +0200 Subject: [PATCH] document Kerberos issues --- rhel8/kerberos.md | 248 ++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 248 insertions(+) create mode 100644 rhel8/kerberos.md diff --git a/rhel8/kerberos.md b/rhel8/kerberos.md new file mode 100644 index 00000000..4ea58997 --- /dev/null +++ b/rhel8/kerberos.md @@ -0,0 +1,248 @@ +# Kerberos on RHEL 8 + +This document describes the Kerberos issues we encountered during the RHEL 8 introduction. + +In RHEL we are using the `KEYRING` (kernel keyring) cache, +whereas for RHEL8 there was an early wish to use `KCM` (Kerberos Cache Manager) instead. + +The Kerberos documentation contains a [reference for all available cache types](https://web.mit.edu/kerberos/www/krb5-latest/doc/basic/ccache_def.html). + +## Kerberos Use and Test Cases + +- ssh authentication (authentication method `gssapi-with-mic`) +- ssh ticket delegation (with `GSSAPIDelegateCredentials yes`) +- AFS authentication (`aklog`) +- AFS administrative operations where the user switches to a separate admin principal (e.g. `buchel_k-adm`) +- Website authentication (`SPNEGO` with Firefox, Chrome) +- local desktop: ticket renewal after reauthentication on lock screen +- remote desktop with NoMachine NX: ticket renewal after reconnection + + +## KCM + +The `KCM` cache is provided by a dedicated daemon; for RHEL8 this is `sssd_kcm`, which was developed by Red Hat itself. + +### Advantages of KCM + +The advantage of `KCM` is that the caches are permanent and survive daemon restarts and system reboots without the need to fiddle around with files and file permissions. This simplifies daemon and container use cases. +It also renews tickets automatically, which is handy for every use case. 
+ +### User Based vs Session Based + +Intuitively I would expect that something as delicate as authentication is managed per session (ssh, desktop, console login, ...). + +Apparently with `KCM` this is not the case. It provides a default cache which is supposed to be the optimal one for you and which can change at any time. + +Problems I see with this are +- a user may change the principal, e.g. for admin operations (`kinit buchel_k-adm`), which is then used by all sessions +- a user may destroy the cache (it is good security practice to have a `kdestroy` in `.bash_logout` to ensure nobody on the machine can use your tokens after logging out) +- software may put tokens into the cache which suddenly are not there any more +- the magic/heuristic used to select the default cache might not work optimally for all use cases (as we see below, `sssd-kcm` fails horribly...) + +So if we have more than one session on a machine (e.g. people connecting via remote desktop and ssh at the same time), the cross-session side effects can cause unexpected behaviour. + +In contrast to this, for AFS token renewal having access to new tokens is helpful, as this allows prolonging the time a `PAG` (group of processes authenticated against AFS) keeps working, as long as there is at least one valid ticket available. +It even allows recovery when a new ticket becomes available again. + +A way to get `KCM` out of the business of selecting the "optimal" cache is to select it yourself and provide the session/software with one specific cache by setting the `KRB5CCNAME` environment variable accordingly (e.g. `KCM:44951:66120`). Note that when set to `KCM:` it will use the default cache from `KCM`. + + +### Problems of sssd_kcm + +The most obvious and well [known problem](https://github.com/SSSD/sssd/issues/3593) of `sssd-kcm` is that it does not remove expired tokens and credential caches. I agree that it should not have an impact as this is mostly cosmetic. But that is only the case when everything can cope with that... 
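Because expired caches linger, a session that wants to pin `KRB5CCNAME` to one specific cache (as described in the previous section) has to skip them. The following is only a minimal sketch of that idea, not the mechanism used in our setup: the function name `first_valid_cache` is hypothetical, and it assumes that `klist -l` prints the principal in the first column, the cache name in the second, and marks expired caches with `(Expired)`, as in the listings in this document.

```shell
#!/bin/sh
# Hypothetical helper (an illustration, not part of the original setup):
# read `klist -l`-style output on stdin and print the name of the first
# KCM cache that is not marked "(Expired)". Prints nothing if none is found.
first_valid_cache() {
    awk '$2 ~ /^KCM:/ && $0 !~ /\(Expired\)/ { print $2; exit }'
}

# Usage sketch (assumes klist is installed and KCM is the configured cache type):
#   cache=$(klist -l | first_valid_cache)
#   [ -n "$cache" ] && export KRB5CCNAME="$cache"
```

Picking the first non-expired entry is an arbitrary choice for the sketch; with several valid caches (e.g. a separate admin principal) a real selection would also have to check the principal.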
+ +To check the Kerberos credential cache, you can use `klist` to look at the current default cache and `klist -l` to look at all available caches. Note that the first listed cache is the default cache. Of course that is only valid when there is no `KRB5CCNAME` environment variable set or it is `KCM:`. + +#### Use of Expired Credential Caches +In the example below you see that on the ssh login, I got a new default cache. But after a few minutes (there was a desktop login from my side and maybe an automatic AFS token renewal in between), I get an expired cache as default cache. +``` +$ ssh lxdev01.psi.ch +Last login: Tue Oct 4 09:50:33 2022 +[buchel_k@lxdev01 ~]$ klist -l +Principal name Cache name +-------------- ---------- +buchel_k@D.PSI.CH KCM:44951:42923 +buchel_k@D.PSI.CH KCM:44951:12312 (Expired) +buchel_k@D.PSI.CH KCM:44951:42199 (Expired) +buchel_k@D.PSI.CH KCM:44951:40168 +buchel_k@D.PSI.CH KCM:44951:8914 (Expired) +buchel_k@D.PSI.CH KCM:44951:62275 (Expired) +buchel_k@D.PSI.CH KCM:44951:27078 (Expired) +buchel_k@D.PSI.CH KCM:44951:73924 (Expired) +buchel_k@D.PSI.CH KCM:44951:72006 +buchel_k@D.PSI.CH KCM:44951:64449 (Expired) +buchel_k@D.PSI.CH KCM:44951:60061 (Expired) +buchel_k@D.PSI.CH KCM:44951:36925 (Expired) +buchel_k@D.PSI.CH KCM:44951:48361 (Expired) +buchel_k@D.PSI.CH KCM:44951:49651 (Expired) +buchel_k@D.PSI.CH KCM:44951:76984 (Expired) +buchel_k@D.PSI.CH KCM:44951:54227 (Expired) +buchel_k@D.PSI.CH KCM:44951:85800 (Expired) +[buchel_k@lxdev01 ~]$ klist -l +Principal name Cache name +-------------- ---------- +buchel_k@D.PSI.CH KCM:44951:12312 (Expired) +buchel_k@D.PSI.CH KCM:44951:42199 (Expired) +buchel_k@D.PSI.CH KCM:44951:40168 +buchel_k@D.PSI.CH KCM:44951:8914 (Expired) +buchel_k@D.PSI.CH KCM:44951:62275 (Expired) +buchel_k@D.PSI.CH KCM:44951:27078 (Expired) +buchel_k@D.PSI.CH KCM:44951:73924 (Expired) +buchel_k@D.PSI.CH KCM:44951:72006 +buchel_k@D.PSI.CH KCM:44951:64449 (Expired) +buchel_k@D.PSI.CH KCM:44951:60061 (Expired) 
+buchel_k@D.PSI.CH KCM:44951:36925 (Expired) +buchel_k@D.PSI.CH KCM:44951:48361 (Expired) +buchel_k@D.PSI.CH KCM:44951:42923 +buchel_k@D.PSI.CH KCM:44951:49651 (Expired) +buchel_k@D.PSI.CH KCM:44951:76984 (Expired) +buchel_k@D.PSI.CH KCM:44951:54227 (Expired) +buchel_k@D.PSI.CH KCM:44951:85800 (Expired) +[buchel_k@lxdev01 ~]$ +``` +Note that the automatic AFS token renewal was created after we had experienced this issue. + + +#### Busy Loop of goa-daemon +If the [GNOME Online Accounts](https://wiki.gnome.org/Projects/GnomeOnlineAccounts) daemon encounters a number of Kerberos credential caches, it goes into a busy loop and causes `sssd-kcm` to consume 100% of one core. The corresponding bug reports at [Red Hat](https://bugzilla.redhat.com/show_bug.cgi?id=1645624#c113) and [GNOME](https://gitlab.gnome.org/GNOME/gnome-online-accounts/-/issues/79) are happily ignored. + +#### Zombie Caches by NoMachine NX +On a machine with remote desktop access using NoMachine NX I have seen the following cache list in the log: +``` +# /usr/bin/klist -l +Principal name Cache name +-------------- ---------- +fische_r@D.PSI.CH KCM:45334:73632 (Expired) +buchel_k@D.PSI.CH KCM:45334:55706 (Expired) +fische_r@D.PSI.CH KCM:45334:44226 (Expired) +fische_r@D.PSI.CH KCM:45334:40904 (Expired) +fische_r@D.PSI.CH KCM:45334:62275 (Expired) +fische_r@D.PSI.CH KCM:45334:89020 (Expired) +buchel_k@D.PSI.CH KCM:45334:25061 (Expired) +buchel_k@D.PSI.CH KCM:45334:35168 (Expired) +fische_r@D.PSI.CH KCM:45334:73845 (Expired) +fische_r@D.PSI.CH KCM:45334:47508 (Expired) +fische_r@D.PSI.CH KCM:45334:34317 (Expired) +fische_r@D.PSI.CH KCM:45334:52058 (Expired) +fische_r@D.PSI.CH KCM:45334:16150 (Expired) +fische_r@D.PSI.CH KCM:45334:84445 (Expired) +fische_r@D.PSI.CH KCM:45334:69076 (Expired) +buchel_k@D.PSI.CH KCM:45334:87346 (Expired) +fische_r@D.PSI.CH KCM:45334:57070 (Expired) +``` +or on another machine in my personal list: +``` +[buchel_k@pc14831 ~]$ klist -l +Principal name Cache name +-------------- ---------- +buchel_k@D.PSI.CH 
KCM:44951:69748 +buchel_k@D.PSI.CH KCM:44951:18506 (Expired) +buchel_k@D.PSI.CH KCM:44951:5113 (Expired) +buchel_k@D.PSI.CH KCM:44951:52685 (Expired) +buchel_k@D.PSI.CH KCM:44951:13951 (Expired) +PC14831$@D.PSI.CH KCM:44951:43248 (Expired) +PC14831$@D.PSI.CH KCM:44951:58459 (Expired) +buchel_k@D.PSI.CH KCM:44951:14668 (Expired) +buchel_k@D.PSI.CH KCM:44951:92516 (Expired) +[buchel_k@pc14831 ~]$ +``` +Both show principals which I am very sure have not been added manually by the user. So somewhere there is a security issue, either in `sssd-kcm` or in NoMachine NX. + +In another experiment I logged into a machine with `ssh` and did `kdestroy -A` which should destroy all caches: + +``` +[buchel_k@mpc2959 ~]$ kdestroy -A +[buchel_k@mpc2959 ~]$ klist -l +Principal name Cache name +[buchel_k@mpc2959 ~]$ +``` + +Afterwards I log in via NoMachine NX and get a cache which expired more than two months ago: + +``` +[buchel_k@mpc2959 ~]$ klist -l +Principal name Cache name + +buchel_k@D.PSI.CH KCM:44951:16795 (Expired) +buchel_k@D.PSI.CH KCM:44951:69306 +[buchel_k@mpc2959 ~]$ klist +Ticket cache: KCM:44951:16795 +Default principal: buchel_k@D.PSI.CH + +Valid starting Expires Service principal +13.07.2022 11:35:51 13.07.2022 21:26:19 krbtgt/D.PSI.CH@D.PSI.CH +renew until 14.07.2022 11:26:19 +[buchel_k@mpc2959 ~]$ date +Do Sep 22 08:37:41 CEST 2022 +[buchel_k@mpc2959 ~]$ +``` +Note that a non-expired cache is available, but NoMachine NX explicitly sets `KRB5CCNAME` to a specific KCM cache. And that cache contains a ticket which is supposed to be gone. + +So there is a security bug in `sssd-kcm`: it does not fully destroy tickets when being told so. And there is another security issue in the NoMachine NX -> `sssd-kcm` interaction. I assume that it talks with the `KCM` as root and somehow gets (or has saved somewhere) old caches and moves them over into user context. But the cache may not originally belong to the user... 
+ +I have not found a lot concerning Kerberos on the NoMachine website. + +## Solution Attempts + +Ideally we would get to a solution which can do the following: + +- interactive user sessions are isolated and do not interfere with each other +- AFS can get hold of new tickets and inject them into the PAGs as long as the user authenticates regularly in some way +- `systemd --user`, which resides outside of the interactive user sessions, is happy as well +- `goa-daemon` sees only one cache + +### renew-afstoken Script/Daemon + +For AFS we (Achim and I) made the script `renew-afstoken` which is started as a per-PAG daemon by PAM upon login. +Out of the available `KCM` caches it selects a suitable one to regularly get a new AFS token. +This now works very robustly and can also recover from expiration when a new ticket becomes available. + +### Session Isolation with KRB5CCNAME + +#### At the End of PAM +The idea is to set `KRB5CCNAME` to the very cache which has been created while going through the PAM stack. +A self-made PAM module does just this. + +It works well for ssh sessions and might also work well for simple desktop sessions, but not for GNOME. +In GNOME the user programs do not start as children of the login screen and thus do not inherit the environment variables. +They are started by `systemd --user`, which sets `KRB5CCNAME` to `KCM:` instead of using the system default (which results in the same behaviour). +I had a very short look at the `systemd` source code, but could not yet find the place where `KRB5CCNAME` is set. And the RHEL8 version of `systemd` has more than 800 patches compared with upstream... (OK, some might be backports...) + +#### At the Start of PAM +At some point I also made a test by setting `KRB5CCNAME` at the start of PAM to a fixed name of an existing cache, so that the TGT, etc. end up in a well known place. +That worked well; I also tested successfully that authenticating on the screen lock updates the TGT. 
+ +Using a random, non-existing cache name resulted in a failure, not in the creation of that cache as would happen if you did that with `kinit`. +So that self-made PAM module would need to be extended to also create the cache. +I assumed that the "End of PAM" solution would be easier to implement, so I opted for that. + +## Options for Next Steps + +### Try out End of PAM +For the "Start of PAM" experiments I got more into PAM and Kerberos programming with C than I wanted and I think I would get that working in reasonable time. + +### Try out KEYRING +Maybe we can try to create a solution with `KEYRING` which isolates the interactive sessions and still allows the AFS token renewal to access all caches. This then also needs `renew-afstoken` to care about Kerberos ticket renewal. + +For the use cases listed above the caches and tickets do not need to survive reboots. If there is something/someone needing `KCM` for some reason, it can be used specifically and privately and will not interfere with the rest of the system. + +### Red Hat Ticket +I have a [ticket](https://access.redhat.com/support/cases/#/case/03280446) open with Red Hat on this case. In the first part I concentrated on the missing session isolation, but it turned out that this is the intended behaviour of a KCM setup. + +One problem is that our machines have some non-standard software which is not covered by the support, namely YFS for AFS and NoMachine NX. + +Also, the problem is not that easy to reproduce, as it is best seen on a long-running and heavily used system. Creating such a test system with several users and many expired sessions means quite some effort. + +I posted a few strange looking `klist` outputs and asked for an explanation, but that seems not yet to have reached someone with intimate `sssd-kcm` knowledge. + +How to proceed here? Post this document and ask how to proceed? + +### Other Options + +Fill in your ideas. 
+ +## PS +There is an advantage in the broken `sssd-kcm` default cache selection: it forces us to make our stuff robust against `KCM` glitches, which might also occur with a better manager, just way less often, and then they would be harder to explain and to track down. + + +