document Kerberos issues

2022-10-04 17:06:10 +02:00
parent 1e663e1875
commit 183ccf294f

rhel8/kerberos.md Normal file

# Kerberos on RHEL 8
This document describes the Kerberos issues we encountered during the introduction of RHEL 8.
In RHEL we are using the `KEYRING` (kernel keyring) cache,
whereas for RHEL8 the wish to use `KCM` (Kerberos Cache Manager) instead came up early.
The Kerberos documentation contains a [reference for all available cache types](https://web.mit.edu/kerberos/www/krb5-latest/doc/basic/ccache_def.html).
## Kerberos Use and Test Cases
- ssh authentication (authentication method `gssapi-with-mic`)
- ssh ticket delegation (with `GSSAPIDelegateCredentials yes`)
- AFS authentication (`aklog`)
- AFS administrative operations where the user switches to a separate admin principal (e.g. `buchel_k-adm`)
- Website authentication (`SPNEGO` with Firefox, Chrome)
- local desktop: ticket renewal after reauthentication on lock screen
- remote desktop with NoMachine NX: ticket renewal after reconnection
## KCM
The `KCM` cache is provided by a dedicated daemon; for RHEL8 this is `sssd_kcm`, which was developed by Red Hat itself.
### Advantages of KCM
The advantage of `KCM` is that the caches are persistent and survive daemon restarts and system reboots without the need to fiddle around with files and file permissions. This simplifies daemon and container use cases.
It also renews tickets automatically, which is handy for every use case.
### User Based vs Session Based
Intuitively I would expect that something as delicate as authentication is managed per session (ssh, desktop, console login, ...).
Apparently with `KCM` this is not the case. It provides a default cache which is supposed to be the optimal one for you and which can change at any time.
Problems I see with this are:
- a user may change their principal, e.g. for admin operations (`kinit buchel_k-adm`), which is then used by all sessions
- a user may destroy the cache (it is good security practice to have a `kdestroy` in `.bash_logout` to ensure nobody on the machine can use your tokens after logging out)
- software may put tokens into the cache which are suddenly not there any more
- the magic/heuristic used to select the default cache might not work optimally for all use cases (as we see below, `sssd-kcm` fails horribly..)
So if we have more than one session on a machine (e.g. people connecting via remote desktop and ssh at the same time), the cross-session side-effects can cause unexpected behaviour.
In contrast to this, having access to new tokens is helpful for AFS token renewal, as it allows a `PAG` (group of processes authenticated against AFS) to keep working as long as at least one valid ticket is available.
It even allows recovery when a new ticket becomes available again.
A way to get `KCM` out of the business of selecting the "optimal" cache is to select it yourself and provide the session/software with one specific cache by setting the `KRB5CCNAME` environment variable accordingly (e.g. `KCM:44951:66120`). Note that when it is set to `KCM:`, the default cache from `KCM` is used.
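Pinning a cache for one piece of software can be sketched as follows. This is only an illustration: the helper names `is_specific_kcm_cache` and `pinned_env` are made up for this document, and the pattern only accepts names of the form `KCM:<uid>:<id>` as seen in the `klist -l` listings.

```python
import os
import re

# Matches a fully qualified KCM cache name such as "KCM:44951:66120".
# A bare "KCM:" leaves the choice of cache to the daemon, so it is
# rejected here on purpose (assumption: we always want one fixed cache).
_SPECIFIC_KCM = re.compile(r"KCM:\d+:\d+")


def is_specific_kcm_cache(ccname):
    """Return True if ccname names one specific KCM cache."""
    return _SPECIFIC_KCM.fullmatch(ccname) is not None


def pinned_env(ccname, base_env=None):
    """Return a copy of the environment with KRB5CCNAME pinned to ccname."""
    if not is_specific_kcm_cache(ccname):
        raise ValueError(f"not a specific KCM cache name: {ccname!r}")
    env = dict(os.environ if base_env is None else base_env)
    env["KRB5CCNAME"] = ccname
    return env
```

A child process could then be started with `subprocess.run(["klist"], env=pinned_env("KCM:44951:66120"))`, so that it sees only that cache regardless of what `KCM` currently considers the default.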
### Problems of sssd_kcm
The most obvious and well [known problem](https://github.com/SSSD/sssd/issues/3593) of `sssd-kcm` is that it does not remove expired tokens and credential caches. I agree that this should not have an impact, as it is mostly cosmetic. But that is only the case when everything can cope with that...
To check the Kerberos credential cache, you can use `klist` to look at the current default cache and `klist -l` to look at all available caches. Note that the first listed cache is the default cache. Of course that is only valid when no `KRB5CCNAME` environment variable is set or when it is set to `KCM:`.
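For scripted checks, the `klist -l` listing can be parsed into records. The following is a minimal sketch, assuming the two-column layout shown in this document and `KCM:` cache names only; `CacheEntry` and `parse_klist_l` are names invented for this example.

```python
from typing import List, NamedTuple


class CacheEntry(NamedTuple):
    principal: str
    cache: str
    expired: bool


def parse_klist_l(output: str) -> List[CacheEntry]:
    """Parse `klist -l` output; the first entry is the default cache."""
    entries = []
    for line in output.splitlines():
        fields = line.split()
        # Data rows have a principal, a cache name and optionally "(Expired)".
        # The header, the dashed separator and shell prompts are skipped
        # because their second field is not a KCM cache name.
        if len(fields) < 2 or not fields[1].startswith("KCM:"):
            continue
        entries.append(
            CacheEntry(fields[0], fields[1], "(Expired)" in fields[2:])
        )
    return entries


SAMPLE = """\
Principal name                 Cache name
--------------                 ----------
buchel_k@D.PSI.CH              KCM:44951:12312 (Expired)
buchel_k@D.PSI.CH              KCM:44951:42199 (Expired)
buchel_k@D.PSI.CH              KCM:44951:40168
"""

entries = parse_klist_l(SAMPLE)
default = entries[0]
# True when the default cache is expired although a valid one exists.
default_is_stale = default.expired and any(not e.expired for e in entries)
```

With a listing like `SAMPLE`, `default_is_stale` flags exactly the situation where an expired cache is offered as default although a valid one exists.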
#### Use of Expired Credential Caches
In the example below you see that on ssh login I got a new default cache. But after a few minutes (there was a desktop login from my side and maybe an automatic AFS token renewal in between), I got an expired cache as default cache.
```
$ ssh lxdev01.psi.ch
Last login: Tue Oct 4 09:50:33 2022
[buchel_k@lxdev01 ~]$ klist -l
Principal name Cache name
-------------- ----------
buchel_k@D.PSI.CH KCM:44951:42923
buchel_k@D.PSI.CH KCM:44951:12312 (Expired)
buchel_k@D.PSI.CH KCM:44951:42199 (Expired)
buchel_k@D.PSI.CH KCM:44951:40168
buchel_k@D.PSI.CH KCM:44951:8914 (Expired)
buchel_k@D.PSI.CH KCM:44951:62275 (Expired)
buchel_k@D.PSI.CH KCM:44951:27078 (Expired)
buchel_k@D.PSI.CH KCM:44951:73924 (Expired)
buchel_k@D.PSI.CH KCM:44951:72006
buchel_k@D.PSI.CH KCM:44951:64449 (Expired)
buchel_k@D.PSI.CH KCM:44951:60061 (Expired)
buchel_k@D.PSI.CH KCM:44951:36925 (Expired)
buchel_k@D.PSI.CH KCM:44951:48361 (Expired)
buchel_k@D.PSI.CH KCM:44951:49651 (Expired)
buchel_k@D.PSI.CH KCM:44951:76984 (Expired)
buchel_k@D.PSI.CH KCM:44951:54227 (Expired)
buchel_k@D.PSI.CH KCM:44951:85800 (Expired)
[buchel_k@lxdev01 ~]$ klist -l
Principal name Cache name
-------------- ----------
buchel_k@D.PSI.CH KCM:44951:12312 (Expired)
buchel_k@D.PSI.CH KCM:44951:42199 (Expired)
buchel_k@D.PSI.CH KCM:44951:40168
buchel_k@D.PSI.CH KCM:44951:8914 (Expired)
buchel_k@D.PSI.CH KCM:44951:62275 (Expired)
buchel_k@D.PSI.CH KCM:44951:27078 (Expired)
buchel_k@D.PSI.CH KCM:44951:73924 (Expired)
buchel_k@D.PSI.CH KCM:44951:72006
buchel_k@D.PSI.CH KCM:44951:64449 (Expired)
buchel_k@D.PSI.CH KCM:44951:60061 (Expired)
buchel_k@D.PSI.CH KCM:44951:36925 (Expired)
buchel_k@D.PSI.CH KCM:44951:48361 (Expired)
buchel_k@D.PSI.CH KCM:44951:42923
buchel_k@D.PSI.CH KCM:44951:49651 (Expired)
buchel_k@D.PSI.CH KCM:44951:76984 (Expired)
buchel_k@D.PSI.CH KCM:44951:54227 (Expired)
buchel_k@D.PSI.CH KCM:44951:85800 (Expired)
[buchel_k@lxdev01 ~]$
```
Note that the automatic AFS token renewal was created after we had experienced this issue.
#### Busy Loop of goa-daemon
If [GNOME Online Accounts](https://wiki.gnome.org/Projects/GnomeOnlineAccounts) (`goa-daemon`) encounters a number of Kerberos credential caches, it goes into a busy loop and causes `sssd-kcm` to consume 100% of one core. The corresponding bug reports at [Red Hat](https://bugzilla.redhat.com/show_bug.cgi?id=1645624#c113) and [Gnome](https://gitlab.gnome.org/GNOME/gnome-online-accounts/-/issues/79) are happily ignored.
#### Zombie Caches by NoMachine NX
On a machine with remote desktop access using NoMachine NX I have seen the following cache list in the log:
```
# /usr/bin/klist -l
Principal name Cache name
-------------- ----------
fische_r@D.PSI.CH KCM:45334:73632 (Expired)
buchel_k@D.PSI.CH KCM:45334:55706 (Expired)
fische_r@D.PSI.CH KCM:45334:44226 (Expired)
fische_r@D.PSI.CH KCM:45334:40904 (Expired)
fische_r@D.PSI.CH KCM:45334:62275 (Expired)
fische_r@D.PSI.CH KCM:45334:89020 (Expired)
buchel_k@D.PSI.CH KCM:45334:25061 (Expired)
buchel_k@D.PSI.CH KCM:45334:35168 (Expired)
fische_r@D.PSI.CH KCM:45334:73845 (Expired)
fische_r@D.PSI.CH KCM:45334:47508 (Expired)
fische_r@D.PSI.CH KCM:45334:34317 (Expired)
fische_r@D.PSI.CH KCM:45334:52058 (Expired)
fische_r@D.PSI.CH KCM:45334:16150 (Expired)
fische_r@D.PSI.CH KCM:45334:84445 (Expired)
fische_r@D.PSI.CH KCM:45334:69076 (Expired)
buchel_k@D.PSI.CH KCM:45334:87346 (Expired)
fische_r@D.PSI.CH KCM:45334:57070 (Expired)
```
or on another machine in my personal list:
```
[buchel_k@pc14831 ~]$ klist -l
Principal name Cache name
-------------- ----------
buchel_k@D.PSI.CH KCM:44951:69748
buchel_k@D.PSI.CH KCM:44951:18506 (Expired)
buchel_k@D.PSI.CH KCM:44951:5113 (Expired)
buchel_k@D.PSI.CH KCM:44951:52685 (Expired)
buchel_k@D.PSI.CH KCM:44951:13951 (Expired)
PC14831$@D.PSI.CH KCM:44951:43248 (Expired)
PC14831$@D.PSI.CH KCM:44951:58459 (Expired)
buchel_k@D.PSI.CH KCM:44951:14668 (Expired)
buchel_k@D.PSI.CH KCM:44951:92516 (Expired)
[buchel_k@pc14831 ~]$
```
Both show principals which I am very sure have not been added manually by the user. So somewhere there is a security issue, either in `sssd-kcm` or in NoMachine NX.
In another experiment I logged into a machine with `ssh` and did `kdestroy -A` which should destroy all caches:
```
[buchel_k@mpc2959 ~]$ kdestroy -A
[buchel_k@mpc2959 ~]$ klist -l
Principal name Cache name
[buchel_k@mpc2959 ~]$
```
Afterwards I logged in via NoMachine NX and got a cache that had expired more than two months earlier:
```
[buchel_k@mpc2959 ~]$ klist -l
Principal name Cache name
buchel_k@D.PSI.CH KCM:44951:16795 (Expired)
buchel_k@D.PSI.CH KCM:44951:69306
[buchel_k@mpc2959 ~]$ klist
Ticket cache: KCM:44951:16795
Default principal: buchel_k@D.PSI.CH
Valid starting Expires Service principal
13.07.2022 11:35:51 13.07.2022 21:26:19 krbtgt/D.PSI.CH@D.PSI.CH
renew until 14.07.2022 11:26:19
[buchel_k@mpc2959 ~]$ date
Do Sep 22 08:37:41 CEST 2022
[buchel_k@mpc2959 ~]$
```
Note that a non-expired cache is available, but NoMachine NX explicitly sets `KRB5CCNAME` to a specific KCM cache. And that cache contains a ticket which is supposed to be gone.
So there is a security bug in `sssd-kcm`: it does not fully destroy tickets when told to do so. And there is another security issue in the NoMachine NX -> `sssd-kcm` interaction. I assume that it talks to the `KCM` as root and somehow gets (or has saved somewhere) old caches and moves them over into the user context. But the cache may not originally belong to the user...
I have not found a lot concerning Kerberos on the NoMachine website.
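The age of the resurrected ticket in the `mpc2959` transcript above can be checked with a few lines of Python. This is a sketch that assumes the `day.month.year` timestamp format printed by `klist` on that machine; the helper name `expired_for` is made up for this example.

```python
from datetime import datetime

# Timestamp format as printed by klist in the transcript above;
# other locales print other formats.
KLIST_TIME_FORMAT = "%d.%m.%Y %H:%M:%S"


def expired_for(expires: str, now: datetime):
    """Return how long a ticket has been expired (negative if still valid)."""
    return now - datetime.strptime(expires, KLIST_TIME_FORMAT)


# "Expires" column of the krbtgt ticket and the output of `date`.
age = expired_for("13.07.2022 21:26:19", datetime(2022, 9, 22, 8, 37, 41))
print(age.days)  # whole days since expiry
```

For the values from the transcript this gives 70 whole days, i.e. the ticket had been expired for more than two months when it showed up again after `kdestroy -A`.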
## Solution Attempts
Ideally we would get to a solution which can do the following:
- interactive user sessions are isolated and do not interfere with each other
- AFS can get hold of new tickets and inject them into the PAGs as long as the user authenticates regularly in some way
- `systemd --user`, which resides outside of the interactive user sessions, is happy as well
- `goa-daemon` sees only one cache
### renew-afstoken Script/Daemon
For AFS we (Achim and I) made the script `renew-afstoken`, which is started by PAM upon login as a per-PAG daemon.
Out of the available `KCM` caches it selects a suitable one to regularly get a new AFS token.
This now works very robustly and can also recover from expiration when a new ticket becomes available.
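The selection step could look roughly like the sketch below. The actual criteria used by `renew-afstoken` are not described here, so the rule "first non-expired cache belonging to the wanted principal" and the name `pick_cache` are assumptions made purely for illustration.

```python
from typing import Iterable, Optional, Tuple

# (principal, cache name, expired) as it could be read from `klist -l`
Cache = Tuple[str, str, bool]


def pick_cache(caches: Iterable[Cache], principal: str) -> Optional[str]:
    """Return the first non-expired cache of `principal`, or None."""
    for cache_principal, name, expired in caches:
        if cache_principal == principal and not expired:
            return name
    return None


# Excerpt of the pc14831 listing shown earlier in this document.
caches = [
    ("buchel_k@D.PSI.CH", "KCM:44951:18506", True),
    ("PC14831$@D.PSI.CH", "KCM:44951:43248", True),
    ("buchel_k@D.PSI.CH", "KCM:44951:69748", False),
]
chosen = pick_cache(caches, "buchel_k@D.PSI.CH")
```

Returning `None` when no valid cache exists lets a daemon simply wait and retry, which matches the recovery behaviour described above.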
### Session Isolation with KRB5CCNAME
#### At the End of PAM
The idea is to set `KRB5CCNAME` to the very cache which has been created while going through the PAM stack.
A self-made PAM module does just this.
It works well for ssh sessions and might also work well for simple desktop sessions, but not for GNOME.
In GNOME the user programs do not start as children of the login screen and thus do not inherit its environment variables.
They are started by `systemd --user`, which sets `KRB5CCNAME` to `KCM:` instead of using the system default (which results in the same behaviour).
I had a very short look at the `systemd` source code, but could not yet find the place where `KRB5CCNAME` is set. And the RHEL8 version of `systemd` has more than 800 patches compared with upstream... (OK, some might be backports...)
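What `systemd --user` hands to its children can be inspected with `systemctl --user show-environment`, which prints one `NAME=value` pair per line. A minimal parsing sketch follows; the sample text is illustrative and not captured from a real machine, and `parse_show_environment` is a name made up for this example.

```python
def parse_show_environment(output: str) -> dict:
    """Turn NAME=value lines into a dict, ignoring lines without '='."""
    env = {}
    for line in output.splitlines():
        name, sep, value = line.partition("=")
        if sep:
            env[name] = value
    return env


# Illustrative sample, not real output.
sample = "LANG=en_US.UTF-8\nKRB5CCNAME=KCM:\nPATH=/usr/bin\n"
user_env = parse_show_environment(sample)
# True when the user manager overrides the cache set during login.
overrides_ccache = user_env.get("KRB5CCNAME") == "KCM:"
```

On a live system the input would come from `subprocess.run(["systemctl", "--user", "show-environment"], capture_output=True, text=True, check=True).stdout`.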
#### At the Start of PAM
At some point I also made a test by setting `KRB5CCNAME` at the start of PAM to a fixed name of an existing cache, so that the TGT, etc. end up in a well known place.
That worked well; I also tested successfully that authenticating on the screen lock updates the TGT.
Using a random, non-existing cache name resulted in a failure, not in the creation of that cache as would happen if you did that with `kinit`.
So the self-made PAM module would need to be extended to also create the cache.
I assumed that the "End of PAM" solution would be easier to implement, so I opted for that.
## Options for Next Steps
### Try out End of PAM
For the "Start of PAM" experiments I got deeper into PAM and Kerberos programming in C than I wanted, and I think I could get that working in reasonable time.
### Try out KEYRING
Maybe we can try to create a solution with `KEYRING` which isolates the interactive sessions and still allows the AFS token renewal to access all caches. This then also needs `renew-afstoken` to take care of Kerberos ticket renewal.
For the use cases listed above the caches and tickets do not need to survive reboots. If there is something/someone needing `KCM` for some reason, it can be used specifically and privately and will not interfere with the rest of the system.
### Red Hat Ticket
I have a [ticket](https://access.redhat.com/support/cases/#/case/03280446) open with Red Hat on this case. In the first part I concentrated on the missing session isolation, but it turned out that this is the intended behaviour of a KCM setup.
One problem is that our machines have some non-standard software which is not covered by the support, namely YFS for AFS and NoMachine NX.
Also, the problem is not that easy to reproduce, as it is best seen on a long-running, actively used system. Creating such a test system with several users and many expired sessions takes quite some effort.
I posted a few strange-looking `klist` outputs and asked for an explanation, but that seems not yet to have reached someone with intimate `sssd-kcm` knowledge.
How to proceed here? Post this document and ask how to proceed?
### Other Options
Fill in your ideas.
## PS
There is an advantage in the broken `sssd-kcm` default cache selection: it forces us to make our stuff robust against `KCM` glitches, which might also occur with a better manager, just far less often, and then they would be much harder to explain and to track down.