Kerberos on RHEL 8
This document describes the Kerberos issues we encountered while introducing RHEL 8.
On RHEL 7 we use the KEYRING (kernel keyring) cache,
whereas for RHEL 8 the wish came up early to use KCM (Kerberos Cache Manager) instead.
The Kerberos documentation contains a reference for all available cache types.
Kerberos Use and Test Cases
- ssh authentication (authentication method gssapi-with-mic)
- ssh ticket delegation (with GSSAPIDelegateCredentials yes)
- AFS authentication (aklog)
- AFS administrative operations where the user switches to a separate admin principal (e.g. buchel_k-adm)
- local desktop: get new TGT on login
- local desktop: ticket renewal after reauthentication on lock screen
- remote desktop with NoMachine NX: get new TGT on login
- remote desktop with NoMachine NX: ticket renewal after reconnection
- Website authentication (SPNEGO with Firefox, Chrome)
KCM
The KCM cache is provided by a dedicated daemon. On RHEL 8 this is sssd_kcm, which was developed by Red Hat itself.
Advantages of KCM
The advantage of KCM is that the caches are permanent and survive daemon restarts and system reboots without the need to fiddle around with files and file permissions. This simplifies daemon and container use cases.
It also renews tickets automatically, which is handy for every use case.
User Based vs Session Based
Intuitively I would expect that something as delicate as authentication is managed per session (ssh, desktop, console login, ...).
Apparently this is not the case with KCM. It provides a default cache which is supposed to be the optimal one for you, and which one that is can change at any time.
Problems I see with this are:
- user may change his principal, e.g. for admin operations (kinit buchel_k-adm), which is then used by all sessions
- user may destroy the cache (it is good security practice to have a kdestroy in .bash_logout to ensure nobody on the machine can use your tokens after logging out)
- software may put tokens into the cache which suddenly are not there any more
- the magic/heuristic used to select the default cache might not work optimally for all use cases (as we see below, sssd-kcm fails horribly...)
So if we have more than one session on a machine (e.g. people connecting via remote desktop and ssh at the same time), the cross-session side-effects can cause unexpected behaviour.
In contrast to this, for AFS token renewal having access to new tokens is helpful, as this allows prolonging the time a PAG (group of processes authenticated against AFS) keeps working as long as there is at least one valid ticket available.
It even allows recovering when a new ticket becomes available again.
A way to get KCM out of the business of selecting the "optimal" cache is to select it yourself and provide the session/software with one specific cache by setting the KRB5CCNAME environment variable accordingly (e.g. KCM:44951:66120). Note that when it is set to KCM:, the default cache from KCM is used.
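As a minimal sketch of such pinning, the following Python snippet starts a child process with KRB5CCNAME set to one specific cache. The helper name run_with_cache is hypothetical, the cache name is just the example value from above, and the child only echoes the variable instead of calling a real Kerberos tool.

```python
import os
import subprocess
import sys


def run_with_cache(cache_name, argv):
    """Run argv with KRB5CCNAME pinned to one cache (hypothetical helper)."""
    env = dict(os.environ)          # copy, so the parent keeps its own setting
    env["KRB5CCNAME"] = cache_name  # "KCM:" alone would mean KCM's default cache
    return subprocess.run(argv, env=env, capture_output=True, text=True, check=True)


# Stand-in for a Kerberos-aware program: print what the child process sees.
child = [sys.executable, "-c", "import os; print(os.environ['KRB5CCNAME'])"]
result = run_with_cache("KCM:44951:66120", child)
print(result.stdout.strip())
```

Only processes started this way (and their children) see the pinned cache; anything started by another parent keeps whatever KRB5CCNAME it inherited.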
Problems of sssd_kcm
The most obvious and well-known problem of sssd-kcm is that it does not remove expired tokens and credential caches. I agree that this should not have an impact, as it is mostly cosmetic. But that is only the case when everything can cope with that...
To check the Kerberos credential cache, you can use klist to look at the current default cache and klist -l to look at all available caches. Note that the first listed cache is the default cache. Of course that is only valid when the KRB5CCNAME environment variable is not set or is set to KCM:.
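For scripting such checks, the klist -l output can be parsed. The following is a rough sketch only: the names CacheEntry and parse_klist_l are made up for illustration, it assumes the column layout shown in this document, and it only recognises KCM caches.

```python
from typing import NamedTuple


class CacheEntry(NamedTuple):
    principal: str
    cache: str
    expired: bool


def parse_klist_l(text):
    """Parse `klist -l` output into entries, keeping the listed order."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows look like: <principal> <cache name> [(Expired)]
        # Header and separator rows have no "KCM:" in the second column.
        if len(fields) >= 2 and fields[1].startswith("KCM:"):
            expired = "(Expired)" in fields[2:]
            entries.append(CacheEntry(fields[0], fields[1], expired))
    return entries


sample = """\
Principal name                 Cache name
--------------                 ----------
buchel_k@D.PSI.CH              KCM:44951:42923
buchel_k@D.PSI.CH              KCM:44951:12312 (Expired)
"""
entries = parse_klist_l(sample)
default = entries[0]  # the first listed cache is the default cache
print(default.cache, "expired" if default.expired else "valid")
```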
Use of Expired Credential Caches
In the example below you can see that on ssh login I got a new default cache. But a few minutes later (in between there was a desktop login from my side and maybe an automatic AFS token renewal), an expired cache had become the default cache.
$ ssh lxdev01.psi.ch
Last login: Tue Oct 4 09:50:33 2022
[buchel_k@lxdev01 ~]$ klist -l
Principal name Cache name
-------------- ----------
buchel_k@D.PSI.CH KCM:44951:42923
buchel_k@D.PSI.CH KCM:44951:12312 (Expired)
buchel_k@D.PSI.CH KCM:44951:42199 (Expired)
buchel_k@D.PSI.CH KCM:44951:40168
buchel_k@D.PSI.CH KCM:44951:8914 (Expired)
buchel_k@D.PSI.CH KCM:44951:62275 (Expired)
buchel_k@D.PSI.CH KCM:44951:27078 (Expired)
buchel_k@D.PSI.CH KCM:44951:73924 (Expired)
buchel_k@D.PSI.CH KCM:44951:72006
buchel_k@D.PSI.CH KCM:44951:64449 (Expired)
buchel_k@D.PSI.CH KCM:44951:60061 (Expired)
buchel_k@D.PSI.CH KCM:44951:36925 (Expired)
buchel_k@D.PSI.CH KCM:44951:48361 (Expired)
buchel_k@D.PSI.CH KCM:44951:49651 (Expired)
buchel_k@D.PSI.CH KCM:44951:76984 (Expired)
buchel_k@D.PSI.CH KCM:44951:54227 (Expired)
buchel_k@D.PSI.CH KCM:44951:85800 (Expired)
[buchel_k@lxdev01 ~]$ klist -l
Principal name Cache name
-------------- ----------
buchel_k@D.PSI.CH KCM:44951:12312 (Expired)
buchel_k@D.PSI.CH KCM:44951:42199 (Expired)
buchel_k@D.PSI.CH KCM:44951:40168
buchel_k@D.PSI.CH KCM:44951:8914 (Expired)
buchel_k@D.PSI.CH KCM:44951:62275 (Expired)
buchel_k@D.PSI.CH KCM:44951:27078 (Expired)
buchel_k@D.PSI.CH KCM:44951:73924 (Expired)
buchel_k@D.PSI.CH KCM:44951:72006
buchel_k@D.PSI.CH KCM:44951:64449 (Expired)
buchel_k@D.PSI.CH KCM:44951:60061 (Expired)
buchel_k@D.PSI.CH KCM:44951:36925 (Expired)
buchel_k@D.PSI.CH KCM:44951:48361 (Expired)
buchel_k@D.PSI.CH KCM:44951:42923
buchel_k@D.PSI.CH KCM:44951:49651 (Expired)
buchel_k@D.PSI.CH KCM:44951:76984 (Expired)
buchel_k@D.PSI.CH KCM:44951:54227 (Expired)
buchel_k@D.PSI.CH KCM:44951:85800 (Expired)
[buchel_k@lxdev01 ~]$
Note that the automatic AFS token renewal was created after we had experienced this issue.
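The situation in the two listings above can be expressed as a small check: the default (first listed) cache is expired although a valid cache exists. This is only a sketch; the function name stale_default is hypothetical and the data is abridged from the listings.

```python
def stale_default(listing):
    """True if the default cache is expired although a valid cache exists.

    `listing` is a list of (cache name, expired) pairs in `klist -l` order,
    so the first pair is the default cache.
    """
    if not listing:
        return False
    _, default_expired = listing[0]
    return default_expired and any(not expired for _, expired in listing)


# Abridged from the two listings above, same order as printed by klist -l.
at_login = [
    ("KCM:44951:42923", False),
    ("KCM:44951:12312", True),
    ("KCM:44951:40168", False),
]
minutes_later = [
    ("KCM:44951:12312", True),
    ("KCM:44951:42199", True),
    ("KCM:44951:40168", False),
]

print(stale_default(at_login), stale_default(minutes_later))
```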
Busy Loop of goa-daemon
If GNOME Online Accounts (goa-daemon) encounters a number of Kerberos credential caches, it goes into a busy loop and causes sssd-kcm to consume 100% of one core. The bugs reported for this at Red Hat and GNOME are happily ignored.
Zombie Caches by NoMachine NX
On a machine with remote desktop access using NoMachine NX I have seen the following cache list in the log:
# /usr/bin/klist -l
Principal name Cache name
-------------- ----------
fische_r@D.PSI.CH KCM:45334:73632 (Expired)
buchel_k@D.PSI.CH KCM:45334:55706 (Expired)
fische_r@D.PSI.CH KCM:45334:44226 (Expired)
fische_r@D.PSI.CH KCM:45334:40904 (Expired)
fische_r@D.PSI.CH KCM:45334:62275 (Expired)
fische_r@D.PSI.CH KCM:45334:89020 (Expired)
buchel_k@D.PSI.CH KCM:45334:25061 (Expired)
buchel_k@D.PSI.CH KCM:45334:35168 (Expired)
fische_r@D.PSI.CH KCM:45334:73845 (Expired)
fische_r@D.PSI.CH KCM:45334:47508 (Expired)
fische_r@D.PSI.CH KCM:45334:34317 (Expired)
fische_r@D.PSI.CH KCM:45334:52058 (Expired)
fische_r@D.PSI.CH KCM:45334:16150 (Expired)
fische_r@D.PSI.CH KCM:45334:84445 (Expired)
fische_r@D.PSI.CH KCM:45334:69076 (Expired)
buchel_k@D.PSI.CH KCM:45334:87346 (Expired)
fische_r@D.PSI.CH KCM:45334:57070 (Expired)
or on another machine in my personal list:
[buchel_k@pc14831 ~]$ klist -l
Principal name Cache name
-------------- ----------
buchel_k@D.PSI.CH KCM:44951:69748
buchel_k@D.PSI.CH KCM:44951:18506 (Expired)
buchel_k@D.PSI.CH KCM:44951:5113 (Expired)
buchel_k@D.PSI.CH KCM:44951:52685 (Expired)
buchel_k@D.PSI.CH KCM:44951:13951 (Expired)
PC14831$@D.PSI.CH KCM:44951:43248 (Expired)
PC14831$@D.PSI.CH KCM:44951:58459 (Expired)
buchel_k@D.PSI.CH KCM:44951:14668 (Expired)
buchel_k@D.PSI.CH KCM:44951:92516 (Expired)
[buchel_k@pc14831 ~]$
Both show principals which I am very sure were not added manually by the user. So somewhere there is a security issue, either in sssd-kcm or in NoMachine NX.
In another experiment I logged into a machine with ssh and ran kdestroy -A, which should destroy all caches:
[buchel_k@mpc2959 ~]$ kdestroy -A
[buchel_k@mpc2959 ~]$ klist -l
Principal name Cache name
[buchel_k@mpc2959 ~]$
After I logged in via NoMachine NX, I got a cache which had been expired for more than two months:
[buchel_k@mpc2959 ~]$ klist -l
Principal name Cache name
buchel_k@D.PSI.CH KCM:44951:16795 (Expired)
buchel_k@D.PSI.CH KCM:44951:69306
[buchel_k@mpc2959 ~]$ klist
Ticket cache: KCM:44951:16795
Default principal: buchel_k@D.PSI.CH
Valid starting Expires Service principal
13.07.2022 11:35:51 13.07.2022 21:26:19 krbtgt/D.PSI.CH@D.PSI.CH
renew until 14.07.2022 11:26:19
[buchel_k@mpc2959 ~]$ date
Do Sep 22 08:37:41 CEST 2022
[buchel_k@mpc2959 ~]$
Note that a non-expired cache is available, but NoMachine NX explicitly sets KRB5CCNAME to a specific KCM cache. And that cache contains a ticket which is supposed to be gone.
So there is a security bug in sssd-kcm: it does not fully destroy tickets when told to do so. And there is another security issue in the NoMachine NX -> sssd-kcm interaction. I assume that NoMachine NX talks to the KCM as root, somehow gets hold of old caches (or has saved them somewhere) and moves them over into the user context. But the cache may originally not have belonged to the user...
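To put a number on "more than two months", the expiry time from the klist output above can be subtracted from the time shown by date. This is a small sketch which assumes that both timestamps are in the same local time zone; the format string matches the timestamp layout seen in the transcript.

```python
from datetime import datetime

KLIST_FORMAT = "%d.%m.%Y %H:%M:%S"  # timestamp layout of the klist output above

expires = datetime.strptime("13.07.2022 21:26:19", KLIST_FORMAT)
now = datetime(2022, 9, 22, 8, 37, 41)  # from the `date` output, same zone assumed

age = now - expires
print(f"ticket expired {age.days} days ago")
```

So the ticket handed to the new NoMachine NX session had already been expired for about 70 days.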
I have not found a lot concerning Kerberos on the NoMachine website.
Solution Attempts
Ideally we would get to a solution which can do the following:
- interactive user sessions are isolated and do not interfere with each other
- AFS can get hold of new tickets and inject them into the PAGs as long as the user somehow regularly authenticates
- systemd --user, which resides outside of the interactive user sessions, is happy as well
- goa-daemon sees only one cache
renew-afstoken Script/Daemon
For AFS we (Achim and I) made the script renew-afstoken, which is started by PAM upon login as a per-PAG daemon.
Out of the available KCM caches it selects a suitable one to regularly get a new AFS token.
This now works very robustly and can also recover from expiration when a new ticket becomes available.
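The selection step can be sketched roughly as follows. This is not the actual renew-afstoken code: the function pick_cache and its matching rule (first non-expired cache of the wanted principal) are assumptions made for illustration, and the data is taken from one of the listings above.

```python
def pick_cache(entries, principal):
    """Return the first non-expired cache of `principal`, or None.

    `entries` are (principal, cache name, expired) triples in klist -l order.
    Returning None lets the caller simply retry later, which is how the
    recovery after an expiration could work.
    """
    for entry_principal, cache, expired in entries:
        if entry_principal == principal and not expired:
            return cache
    return None


# Taken from the mpc2959 listing above.
entries = [
    ("buchel_k@D.PSI.CH", "KCM:44951:16795", True),
    ("buchel_k@D.PSI.CH", "KCM:44951:69306", False),
]

chosen = pick_cache(entries, "buchel_k@D.PSI.CH")
print(chosen)  # value to put into KRB5CCNAME before calling aklog
```

Filtering on the principal also keeps caches of other principals, like the ones seen in the zombie cache lists, out of the selection.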
Session Isolation with KRB5CCNAME
At the End of PAM
The idea is to set KRB5CCNAME to the very cache which has been created while going through the PAM stack.
A self-made PAM module just does this.
It works well for ssh sessions and might also work well for simple desktop sessions, but not for GNOME.
In GNOME the user programs do not start as children of the login screen and thus do not inherit its environment variables.
They are started by systemd --user, which sets KRB5CCNAME to KCM: instead of using the system default (which results in the same behaviour).
I had a very short look at the systemd source code, but could not yet find the place where KRB5CCNAME is set. And the RHEL8 version of systemd has more than 800 patches compared with upstream... (OK, some might be backports...)
At the Start of PAM
At some point I also made a test by setting KRB5CCNAME at the start of PAM to a fixed name of an existing cache, so that the TGT, etc. end up in a well known place.
That worked well. I also tested successfully that authenticating on the screen lock updates the TGT.
Using a random, non-existing cache name resulted in a failure, not in the creation of that cache as would happen if you did that with kinit.
So that self-made PAM module would need to be extended to also create the cache.
I assumed that the "End of PAM" solution would be easier to implement, so I opted for that.
Options for Next Steps
Try out End of PAM
For the "Start of PAM" experiments I got deeper into PAM and Kerberos programming in C than I wanted, and I think I could get that working in reasonable time.
Try out KEYRING
Maybe we can try to create a solution with KEYRING which isolates the interactive sessions and still allows the AFS token renewal to access all caches. renew-afstoken would then also need to take care of Kerberos ticket renewal.
For the use cases listed above, the caches and tickets do not need to survive reboots. If something or someone needs KCM for some reason, it can be used specifically and privately and will not interfere with the rest of the system.
Red Hat Ticket
I have a ticket open with Red Hat on this case. In the first part I concentrated on the missing session isolation, but it turned out that this is the intended behaviour of a KCM setup.
One problem is that our machines run some non-standard software which is not covered by the support: YFS for AFS and NoMachine NX.
In addition, the problem is not that easy to reproduce, as it is best seen on a long-running and heavily used system. Creating such a test system with several users and many expired sessions means quite some effort.
I posted a few strange-looking klist outputs and asked for an explanation, but that seems not yet to have reached someone with intimate sssd-kcm knowledge.
How to proceed here? Post this document and ask how to proceed?
Other Options
- another self-made daemon to monitor/clean up sssd-kcm
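What the clean-up part of such a daemon might do is sketched below. It only builds the kdestroy -c commands for the expired caches and does not run them. The helper name cleanup_commands is hypothetical, and whether sssd-kcm then really drops these caches is an open question given the observations above.

```python
def cleanup_commands(entries):
    """Build one `kdestroy -c <cache>` command per expired cache.

    `entries` are (cache name, expired) pairs as listed by klist -l.
    """
    return [["kdestroy", "-c", cache] for cache, expired in entries if expired]


# Abridged from the pc14831 listing above.
entries = [
    ("KCM:44951:69748", False),
    ("KCM:44951:18506", True),
    ("KCM:44951:5113", True),
]

for command in cleanup_commands(entries):
    print(" ".join(command))
```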
Fill in your ideas.
PS
There is an advantage in the broken sssd-kcm default cache selection: it forces us to make our stuff robust against KCM glitches. These might also occur with a better manager, just way less often, and then they would be much harder to explain and to track down.