Kerberos on RHEL 8

This document describes the state of Kerberos on RHEL 8. It covers the currently open issues, a user guide and how we solved the KCM (Kerberos Cache Manager) issues. At the bottom you will find sequence diagrams showing the interactions between authentication and Kerberos.

Open Problems

  • Expired caches need to be cleaned up, otherwise we might end up in a denial-of-service situation. Ideally this is managed by a systemd unit.
  • Kerberos with Firefox does not work yet.

User Guide

Manage Ticket for Admin User

If you need a TGT for your admin user (e.g. buchel_k-adm) for administrative operations, run

OLD_KRB5CCNAME=$KRB5CCNAME
export KRB5CCNAME=KCM:$(id -u):admin
kinit $(id -un)-adm

and once you are done, run

kdestroy
export KRB5CCNAME=$OLD_KRB5CCNAME

to delete your administrative tickets and to get back to your normal credential cache.
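
The two steps above can also be wrapped into one helper, so that the admin ticket never outlives the command it was requested for. This is only a sketch: the function names admin_ccname and with_admin_ticket are made up for this example and are not part of any package. The cache name KCM:<uid>:admin and the -adm suffix follow the convention shown above.

```shell
# Hypothetical helpers, written for POSIX sh.

# Print the name of the dedicated admin cache of the current user.
admin_ccname() {
    printf 'KCM:%s:admin\n' "$(id -u)"
}

# Run a command with an admin TGT and destroy the admin cache afterwards.
# The function body is a subshell, so the changed KRB5CCNAME never reaches
# the calling shell and nothing has to be restored.
with_admin_ticket() (
    KRB5CCNAME=$(admin_ccname)
    export KRB5CCNAME
    kinit "$(id -un)-adm" || exit 1
    "$@"
    status=$?
    kdestroy
    exit "$status"
)
```

For example, `with_admin_ticket klist` asks for the admin password, shows the admin ticket and removes it again, while the session keeps using its normal credential cache.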

Update TGT on Long Running Sessions

The TGT will be automatically renewed for 7 days. Note that a screen unlock or a new connection with NoMachine NX will update the credential cache with a new TGT.

Manual reauthentication is possible as well. Inside the session you can run

kinit

Outside of the session you first need to figure out which credential cache is used. Get the process ID of the process which needs authentication, then run

$ strings /proc/$PID/environ | grep KRB5CCNAME
KRB5CCNAME=KCM:44951:iepgjskbkd
$

and then running

KRB5CCNAME=KCM:44951:iepgjskbkd kinit

will update the given credential cache.
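
The lookup and the kinit call can be combined as sketched below. The function names are hypothetical. The sketch assumes a Linux /proc file system and that you are allowed to read the environment of the target process (same user or root). Note that /proc/<pid>/environ normally reflects the environment the process was started with.

```shell
# Print the KRB5CCNAME value found in a NUL-separated environment file
# such as /proc/<pid>/environ. Prints nothing if the variable is not set.
ccname_from_environ() {
    tr '\000' '\n' < "$1" | sed -n 's/^KRB5CCNAME=//p'
}

# Hypothetical wrapper: get a new TGT into the cache used by process $1.
kinit_for_pid() {
    ccname=$(ccname_from_environ "/proc/$1/environ")
    if [ -z "$ccname" ]; then
        echo "process $1 has no KRB5CCNAME set" >&2
        return 1
    fi
    KRB5CCNAME=$ccname kinit
}
```

With this, `kinit_for_pid $PID` replaces the two manual steps.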

Note that the AFS token renewal looks in all caches for a valid TGT, so logging in on the desktop, or via ssh with password or ticket delegation, is sufficient to keep AFS access working for another week.

List all Credential Caches

KRB5CCNAME=KCM: klist -l

lists all caches and

KRB5CCNAME=KCM: klist -A

also the tickets therein.

Kerberos Use and Test Cases

  • ssh authentication (authentication method gssapi-with-mic)
  • ssh TGT (ticket granting ticket) delegation (with GSSAPIDelegateCredentials yes)
  • AFS authentication (aklog)
  • AFS administrative operations where the user switches to a separate admin principal (e.g. buchel_k-adm)
  • local desktop: get new TGT on login
  • local desktop: TGT renewal after reauthentication on lock screen
  • remote desktop with NoMachine NX: get new TGT on login
  • remote desktop with NoMachine NX: TGT renewal after reconnection
  • website authentication (SPNEGO with Firefox, Chrome)

KCM (Kerberos Cache Manager)

On RHEL 7 we are using the KEYRING (kernel keyring) cache. For RHEL 8 the wish to use KCM instead came up early, and KCM is also the new default there.

The Kerberos documentation contains a reference for all available cache types.

The KCM cache is provided by a dedicated daemon. On RHEL 8 this is sssd_kcm, which was developed by Red Hat itself.

Advantages of KCM

The advantage of KCM is that the caches are permanent and survive daemon restarts and system reboots without the need to fiddle around with files and file permissions. This simplifies daemon and container use cases. It also renews tickets automatically, which is handy for every use case.

User Based vs Session Based

Intuitively I would expect that something as delicate as authentication is managed per session (ssh, desktop, console login, ...).

Apparently this is not the case with KCM. It provides a default cache which is supposed to be the optimal one for you, and that can change at any time.

Problems I see with this are

  • a user may change the principal, e.g. for admin operations (kinit buchel_k-adm), which is then used by all sessions
  • a user may destroy the cache (it is good security practice to have a kdestroy in .bash_logout to ensure nobody on the machine can use your tokens after logging out)
  • software may put tokens into the cache which are suddenly not there any more
  • the heuristic used to select the default cache might not work optimally for all use cases (as we see below, sssd-kcm fails horribly...)

So if we have more than one session on a machine (e.g. people connecting via remote desktop and ssh at the same time), the cross-session side-effects can cause unexpected behaviour.

In contrast, for AFS token renewal it is helpful to have access to new tickets from other sessions: a PAG (group of processes authenticated against AFS) can keep working as long as at least one valid ticket is available, and it can even recover once a new ticket becomes available again.

A way to get KCM out of the business of selecting the "optimal" cache is to select it yourself and give the session or software one specific cache by setting the KRB5CCNAME environment variable accordingly (e.g. KCM:44951:66120). Note that when it is set to just KCM:, the default cache is whichever one KCM believes should be the default. And that can change for whatever reason.

Problems of sssd_kcm

To check the Kerberos credential cache, you can use klist to look at the current default cache and klist -l to look at all available caches. Note that the first listed cache is the default cache. Of course that only holds when the KRB5CCNAME environment variable is not set or is set to KCM:.

No Cleanup of Expired Caches

The most obvious and well-known problem of sssd-kcm is that it does not remove expired tokens and credential caches. One could argue that this is mostly cosmetic and should not have any impact. But that only holds when everything can cope with it...

By default the number of caches is limited to 64 per user. When that limit was hit, it was no longer possible to authenticate on the lock screen:

Okt 05 14:57:11 lxdev01.psi.ch krb5_child[43689]: Internal credentials cache error

So this causes a denial-of-service problem which we need to deal with somehow, e.g. by regularly removing expired caches. Note that these caches are persistent and do not get removed on reboot.
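
A regular cleanup could look roughly like the sketch below. It parses the output of klist -l and calls kdestroy -c for every cache marked as expired. The function names are hypothetical, and the sketch only handles the caches of the user who runs it, so it would have to be run per user (for example from a periodic job).

```shell
# Read `klist -l` output on stdin and print the names of all expired caches.
# The cache name is the field right before the "(Expired)" marker.
expired_caches() {
    awk '$NF == "(Expired)" { print $(NF-1) }'
}

# Destroy every expired KCM cache of the calling user.
cleanup_expired_caches() {
    KRB5CCNAME=KCM: klist -l | expired_caches | while read -r cache; do
        kdestroy -q -c "$cache"
    done
}
```

Calling `cleanup_expired_caches` leaves caches with valid tickets untouched, so running sessions are not affected.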

Use of Expired Credential Caches

In the example below you can see that upon ssh login I got a new default cache. But a few minutes later (there was a desktop login from my side and maybe an automatic AFS token renewal in between), I got an expired cache as default cache.

$ ssh lxdev01.psi.ch
Last login: Tue Oct  4 09:50:33 2022
[buchel_k@lxdev01 ~]$ klist -l
Principal name                 Cache name
--------------                 ----------
buchel_k@D.PSI.CH              KCM:44951:42923
buchel_k@D.PSI.CH              KCM:44951:12312 (Expired)
buchel_k@D.PSI.CH              KCM:44951:42199 (Expired)
buchel_k@D.PSI.CH              KCM:44951:40168
buchel_k@D.PSI.CH              KCM:44951:8914 (Expired)
buchel_k@D.PSI.CH              KCM:44951:62275 (Expired)
buchel_k@D.PSI.CH              KCM:44951:27078 (Expired)
buchel_k@D.PSI.CH              KCM:44951:73924 (Expired)
buchel_k@D.PSI.CH              KCM:44951:72006
buchel_k@D.PSI.CH              KCM:44951:64449 (Expired)
buchel_k@D.PSI.CH              KCM:44951:60061 (Expired)
buchel_k@D.PSI.CH              KCM:44951:36925 (Expired)
buchel_k@D.PSI.CH              KCM:44951:48361 (Expired)
buchel_k@D.PSI.CH              KCM:44951:49651 (Expired)
buchel_k@D.PSI.CH              KCM:44951:76984 (Expired)
buchel_k@D.PSI.CH              KCM:44951:54227 (Expired)
buchel_k@D.PSI.CH              KCM:44951:85800 (Expired)
[buchel_k@lxdev01 ~]$ klist -l
Principal name                 Cache name
--------------                 ----------
buchel_k@D.PSI.CH              KCM:44951:12312 (Expired)
buchel_k@D.PSI.CH              KCM:44951:42199 (Expired)
buchel_k@D.PSI.CH              KCM:44951:40168
buchel_k@D.PSI.CH              KCM:44951:8914 (Expired)
buchel_k@D.PSI.CH              KCM:44951:62275 (Expired)
buchel_k@D.PSI.CH              KCM:44951:27078 (Expired)
buchel_k@D.PSI.CH              KCM:44951:73924 (Expired)
buchel_k@D.PSI.CH              KCM:44951:72006
buchel_k@D.PSI.CH              KCM:44951:64449 (Expired)
buchel_k@D.PSI.CH              KCM:44951:60061 (Expired)
buchel_k@D.PSI.CH              KCM:44951:36925 (Expired)
buchel_k@D.PSI.CH              KCM:44951:48361 (Expired)
buchel_k@D.PSI.CH              KCM:44951:42923
buchel_k@D.PSI.CH              KCM:44951:49651 (Expired)
buchel_k@D.PSI.CH              KCM:44951:76984 (Expired)
buchel_k@D.PSI.CH              KCM:44951:54227 (Expired)
buchel_k@D.PSI.CH              KCM:44951:85800 (Expired)
[buchel_k@lxdev01 ~]$

Note that the automatic AFS token renewal was only created after we had experienced this issue.

Busy Loop of goa-daemon

If the GNOME Online Accounts daemon (goa-daemon) encounters a number of Kerberos credential caches, it goes into a busy loop and causes sssd-kcm to consume 100% of one core. There are bug reports at Red Hat and Gnome, both happily ignored.

Zombie Caches by NoMachine NX

On a machine with remote desktop access using NoMachine NX I have seen the following cache list in the log:

# /usr/bin/klist -l
Principal name                 Cache name
--------------                 ----------
fische_r@D.PSI.CH              KCM:45334:73632 (Expired)
buchel_k@D.PSI.CH              KCM:45334:55706 (Expired)
fische_r@D.PSI.CH              KCM:45334:44226 (Expired)
fische_r@D.PSI.CH              KCM:45334:40904 (Expired)
fische_r@D.PSI.CH              KCM:45334:62275 (Expired)
fische_r@D.PSI.CH              KCM:45334:89020 (Expired)
buchel_k@D.PSI.CH              KCM:45334:25061 (Expired)
buchel_k@D.PSI.CH              KCM:45334:35168 (Expired)
fische_r@D.PSI.CH              KCM:45334:73845 (Expired)
fische_r@D.PSI.CH              KCM:45334:47508 (Expired)
fische_r@D.PSI.CH              KCM:45334:34317 (Expired)
fische_r@D.PSI.CH              KCM:45334:52058 (Expired)
fische_r@D.PSI.CH              KCM:45334:16150 (Expired)
fische_r@D.PSI.CH              KCM:45334:84445 (Expired)
fische_r@D.PSI.CH              KCM:45334:69076 (Expired)
buchel_k@D.PSI.CH              KCM:45334:87346 (Expired)
fische_r@D.PSI.CH              KCM:45334:57070 (Expired)

or on another machine in my personal list:

[buchel_k@pc14831 ~]$ klist -l
Principal name                 Cache name
--------------                 ----------
buchel_k@D.PSI.CH              KCM:44951:69748
buchel_k@D.PSI.CH              KCM:44951:18506 (Expired)
buchel_k@D.PSI.CH              KCM:44951:5113 (Expired)
buchel_k@D.PSI.CH              KCM:44951:52685 (Expired)
buchel_k@D.PSI.CH              KCM:44951:13951 (Expired)
PC14831$@D.PSI.CH              KCM:44951:43248 (Expired)
PC14831$@D.PSI.CH              KCM:44951:58459 (Expired)
buchel_k@D.PSI.CH              KCM:44951:14668 (Expired)
buchel_k@D.PSI.CH              KCM:44951:92516 (Expired)
[buchel_k@pc14831 ~]$ 

Both lists show principals which I am very sure were not added manually by the user. So somewhere there is a security issue, either in sssd-kcm or in NoMachine NX.

In another experiment I logged into a machine with ssh and did kdestroy -A which should destroy all caches:

[buchel_k@mpc2959 ~]$ kdestroy -A
[buchel_k@mpc2959 ~]$ klist -l
Principal name                 Cache name
--------------                 ----------
[buchel_k@mpc2959 ~]$

After I logged in via NoMachine NX, I got a cache which had been expired for more than two months:

[buchel_k@mpc2959 ~]$ klist -l
Principal name                 Cache name
--------------                 ----------
buchel_k@D.PSI.CH              KCM:44951:16795 (Expired)
buchel_k@D.PSI.CH              KCM:44951:69306
[buchel_k@mpc2959 ~]$ klist
Ticket cache: KCM:44951:16795
Default principal: buchel_k@D.PSI.CH

Valid starting       Expires              Service principal
13.07.2022 11:35:51  13.07.2022 21:26:19  krbtgt/D.PSI.CH@D.PSI.CH
        renew until 14.07.2022 11:26:19
[buchel_k@mpc2959 ~]$ date
Do Sep 22 08:37:41 CEST 2022
[buchel_k@mpc2959 ~]$ 

Note that a non-expired cache is available, but NoMachine NX explicitly sets KRB5CCNAME to a specific KCM cache. And that cache contains a ticket which is supposed to be gone.

So there is a security bug in sssd-kcm: it does not fully destroy tickets when told to do so. And there is another security issue in the interaction between NoMachine NX and sssd-kcm. I assume that NoMachine NX talks to the KCM as root, somehow gets hold of old caches (or has saved them somewhere) and moves them over into the user context. But the cache may originally not have belonged to that user...

I have not found a lot concerning Kerberos on the NoMachine website.

Solution Attempts

Ideally we would get to a solution which can do the following:

  • interactive user sessions are isolated and do not interfere with each other
  • AFS can get hold of new tickets and inject them into the PAGs as long as the user somehow regularly authenticates
  • systemd --user which is residing outside of the interactive user sessions is happy as well
  • goa-daemon sees only one cache
  • expired caches get somehow cleaned up

Only One Cache

By default sssd-kcm limits the number of caches to 64 per user, but that can be changed to 1 with the max_uid_ccaches option. Then there would be only one cache, shared by all sessions, but at least the KCM could not serve anything but the latest.

But as already documented above in the chapter "No Cleanup of Expired Caches", some logins do not work any more once the maximum number of caches is hit.
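
For reference, this limit is set in the [kcm] section of the sssd configuration. The helper below just prints such a fragment. The function name is hypothetical, and the target file /etc/sssd/conf.d/kcm.conf used in the usage note is an assumption about the local setup, not something taken from our configuration.

```shell
# Print an sssd configuration fragment that limits every user to one KCM cache.
kcm_single_cache_conf() {
    cat <<'EOF'
[kcm]
max_uid_ccaches = 1
EOF
}
```

To try it out, one would write the output to a configuration snippet as root (e.g. `kcm_single_cache_conf > /etc/sssd/conf.d/kcm.conf`, readable by root only) and restart the sssd-kcm service.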

renew-afstoken Script/Daemon

For AFS we (Achim and I) wrote the script renew-afstoken, which PAM starts upon login as a per-PAG daemon. Out of the available KCM caches it selects a suitable one to regularly get a new AFS token. This now works very robustly and can also recover from expiration when a new ticket becomes available.
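
The selection idea can be illustrated as follows. This is not the renew-afstoken script itself, only a minimal sketch with hypothetical function names. It assumes that "suitable" means the first cache in the klist -l output which belongs to the given principal and is not marked as expired.

```shell
# Read `klist -l` output on stdin and print the first cache which belongs
# to principal $1 and is not marked as expired.
first_valid_cache() {
    awk -v princ="$1" '
        $1 == princ && $2 ~ /^KCM:/ && $NF != "(Expired)" { print $2; exit }'
}

# One renewal attempt: pick a valid cache and get a new AFS token from it.
# Returns non-zero when no valid cache is available (yet).
renew_afs_token_once() {
    cache=$(KRB5CCNAME=KCM: klist -l | first_valid_cache "$1")
    [ -n "$cache" ] || return 1
    KRB5CCNAME=$cache aklog
}
```

A daemon would call something like `renew_afs_token_once buchel_k@D.PSI.CH` in a loop with a sleep in between. A failed attempt is then simply retried later, which is what allows recovery after all tickets have expired.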

Setup Shared or Isolated Caches with KRB5CCNAME in own PAM Module

The self-made PAM module pam_single_kcm_cache.so improves the situation by setting

  • KRB5CCNAME=KCM:$UID:desktop to use a shared credential cache for desktop sessions and systemd --user
  • KRB5CCNAME=KCM:$UID:$RANDOM_LETTERS for text sessions to provide session isolation

and providing a working TGT in these caches.
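
The naming scheme can be sketched in shell as follows. The real logic lives inside the PAM module. The function name and the choice of exactly ten lowercase letters (as seen in the example cache name KCM:44951:iepgjskbkd above) are assumptions made for this illustration.

```shell
# Print the KRB5CCNAME for a new session of the current user.
# $1 is the session kind: "desktop" gives the shared cache,
# anything else gives a private, randomly named cache.
session_ccname() {
    uid=$(id -u)
    if [ "$1" = desktop ]; then
        printf 'KCM:%s:desktop\n' "$uid"
    else
        # Keep only the lowercase letters of 1024 random bytes, use the first ten.
        suffix=$(head -c 1024 /dev/urandom | LC_ALL=C tr -dc 'a-z' | cut -c 1-10)
        printf 'KCM:%s:%s\n' "$uid" "$suffix"
    fi
}
```

`session_ccname desktop` always yields the same name for a given user, which is what allows the desktop session and systemd --user to share one cache.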

So far I have identified two cases of program flow in PAM which need to be handled:

  • TGT delegation as done by sshd with authentication method gssapi-with-mic, where a new cache is created by sshd and then filled with the delegated ticket
  • TGT creation as done by pam_sss.so upon password authentication, where a new TGT is created and placed into the default cache managed by KCM.

Now there is no simple and bulletproof way to tell where the TGT ends up in KCM. It may or may not be the default cache designated by KCM. To work around this, the module iterates through all credential caches provided by the KCM and copies a TGT which is younger than 10 s and has a principal matching the username.
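
The selection rule can be illustrated with a small filter. This is not the module's code: the sketch works on a made-up text format with one line per TGT, consisting of cache name, principal and start time in epoch seconds. The function name is hypothetical, and matching the username against the part of the principal before the @ is an assumption.

```shell
# stdin: one line per TGT, "<cache> <principal> <start time in epoch seconds>"
# $1: username, $2: current time in epoch seconds
# Prints the first cache whose TGT belongs to the user and is younger than 10 s.
select_fresh_tgt() {
    awk -v user="$1" -v now="$2" -v max_age=10 '
        {
            split($2, parts, "@")   # parts[1] is the principal without realm
            age = now - $3
            if (parts[1] == user && age >= 0 && age < max_age) {
                print $1
                exit
            }
        }'
}
```

The age limit keeps old tickets of the same user from being picked up, and the principal check skips tickets of other principals such as the admin or machine principal.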

Note that the reason for systemd --user to use the same credential cache as the desktop sessions is that at least Gnome uses it to start the user programs like Evolution or Firefox.

The code is publicly available on GitHub.

The diagrams below show how PAM, and especially pam_single_kcm_cache.so, interact with the KCM in different use cases.

Login with SSH using Password Authentication

[Sequence diagram: Login with SSH and Password Authentication]

This is more or less the "common" authentication case where all the important work is done in PAM. It is the same for a login on the virtual console or when using su with a password. At the end there is a shell session with a credential cache which is not used by any other session (unless the user somehow shares it manually). This way session isolation is achieved.

Login with SSH using Kerberos Authentication and TGT Delegation

[Sequence diagram: Login with SSH using Kerberos Authentication and TGT Delegation]

This is a bit simpler, as all the authentication is done in sshd and only the session setup is done by PAM. Note that sshd does not use the default cache, but instead always creates a new one with the delegated TGT.

Systemd User Instance

In the diagrams above we see how systemd --user is started. It also uses PAM to set up its own session, but it does not do any authentication.

[Sequence diagram: Systemd User Instance]

Here we use a predefined name for the credential cache so it can be shared with the desktop sessions. The next diagram shows in more detail how systemd --user and the Gnome desktop interact.

Gnome Desktop

This is the most complex use case:

[Sequence diagram: Gnome Desktop]

At the end we have a well-known credential cache shared between Gnome and systemd --user. This is needed because systemd --user is used extensively by Gnome. It is important that the Kerberos setup already happens in the authentication phase, because there is no session setup phase for a screen unlock: the user returns to an already existing session.

With NoMachine NX this is configured similarly.

PS

There is an advantage in the broken default cache selection of sssd-kcm: it forces us to make our stuff robust against KCM glitches. Such glitches might also occur with a better manager, just far less often, and then they would be much harder to explain and to track down.