document Kerberos issues

rhel8/kerberos.md (new file, 248 lines)

# Kerberos on RHEL 8

This document describes the Kerberos issues we encountered during the introduction of RHEL 8.

So far we have been using the `KEYRING` (kernel keyring) cache on RHEL,
whereas for RHEL 8 the wish came up early to use `KCM` (Kerberos Cache Manager) instead.

The Kerberos documentation contains a [reference for all available cache types](https://web.mit.edu/kerberos/www/krb5-latest/doc/basic/ccache_def.html).

## Kerberos Use and Test Cases

- ssh authentication (authentication method `gssapi-with-mic`)
- ssh ticket delegation (with `GSSAPIDelegateCredentials yes`)
- AFS authentication (`aklog`)
- AFS administrative operations where the user switches to a separate admin principal (e.g. `buchel_k-adm`)
- Website authentication (`SPNEGO` with Firefox, Chrome)
- local desktop: ticket renewal after reauthentication on lock screen
- remote desktop with NoMachine NX: ticket renewal after reconnection


## KCM

The `KCM` cache is provided by a dedicated daemon. On RHEL 8 this is `sssd_kcm`, which is developed by Red Hat itself.

### Advantages of KCM

The advantage of `KCM` is that the caches are persistent and survive daemon restarts and system reboots without the need to fiddle around with files and file permissions. This simplifies daemon and container use cases.
It also renews tickets automatically, which is handy for every use case.

### User Based vs Session Based

Intuitively I would expect that something as delicate as authentication is managed per session (ssh, desktop, console login, ...).

Apparently this is not the case with `KCM`. It provides a default cache which is supposed to be the optimal one for you, and which can change at any time.

Problems I see with this are:
- the user may change their principal, e.g. for admin operations (`kinit buchel_k-adm`), which is then used by all sessions
- the user may destroy the cache (it is good security practice to have a `kdestroy` in `.bash_logout` to ensure nobody on the machine can use your tickets after you log out)
- software may put tickets into the cache which are suddenly not there any more
- the magic/heuristic used to select the default cache might not work optimally for all use cases (as we see below, `sssd-kcm` fails horribly...)

So if we have more than one session on a machine (e.g. people connecting via remote desktop and ssh at the same time), the cross-session side effects can cause unexpected behaviour.

In contrast, for AFS token renewal it is helpful to have access to new tickets: a `PAG` (group of processes authenticated against AFS) then keeps working as long as at least one valid ticket is available.
It can even recover when a new ticket becomes available again.

A way to get `KCM` out of the business of selecting the "optimal" cache is to select it yourself and provide the session/software with one specific cache by setting the `KRB5CCNAME` environment variable accordingly (e.g. `KCM:44951:66120`). Note that when it is set to `KCM:`, the default cache from `KCM` is used.
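
One way to pin a session to a specific cache could look like the sketch below. It assumes the two-column `klist -l` layout shown in the listings in this document. The helper name `first_valid_cache` is made up, and the rule "take the first cache of my principal that is not marked `(Expired)`" is only an illustration, not what `KCM` or any of our scripts actually does.

```shell
# Print the first non-expired KCM cache of the given principal.
# Reads `klist -l` output on stdin; $1 is the principal to look for.
first_valid_cache() {
    awk -v principal="$1" '
        # Data lines have the principal in column 1 and the cache name in column 2;
        # the header lines never have a second column starting with "KCM:".
        $1 == principal && $2 ~ /^KCM:/ && $0 !~ /\(Expired\)/ { print $2; exit }
    '
}
```

In a session this could be used as `export KRB5CCNAME="$(klist -l | first_valid_cache buchel_k@D.PSI.CH)"`. The result should be checked for being empty before it is exported, because an empty `KRB5CCNAME` is not the same as a pinned cache.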


### Problems of sssd_kcm

The most obvious and well [known problem](https://github.com/SSSD/sssd/issues/3593) of `sssd-kcm` is that it does not remove expired tickets and credential caches. I agree that this should not have an impact, as it is mostly cosmetic. But that is only true when everything can cope with it...

To check the Kerberos credential caches, you can use `klist` to look at the current default cache and `klist -l` to look at all available caches. Note that the first cache listed is the default cache. Of course this only holds when the `KRB5CCNAME` environment variable is either not set or set to `KCM:`.
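
Whether the first entry of `klist -l` is what a session really uses therefore depends on the value of `KRB5CCNAME`. A minimal sketch of that rule follows. The function name `ccache_mode` and its labels are made up for illustration, and it assumes that the system-wide default cache type is `KCM`.

```shell
# Classify how a session picks its credential cache.
# $1 is the value of KRB5CCNAME (empty if the variable is unset).
ccache_mode() {
    case "${1-}" in
        ''|KCM:) echo "kcm-default" ;;  # KCM chooses the default cache
        KCM:*)   echo "pinned" ;;       # one specific KCM cache, e.g. KCM:44951:66120
        *)       echo "other" ;;        # some other cache type, e.g. FILE: or KEYRING:
    esac
}
```

Calling `ccache_mode "${KRB5CCNAME-}"` in a shell then tells whether the "first listed cache is the default" rule applies to that session (`kcm-default`) or not.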

#### Use of Expired Credential Caches
In the example below you can see that on ssh login I got a new default cache. But after a few minutes (in between there was a desktop login from my side and maybe an automatic AFS token renewal), an expired cache has become the default cache.
```
$ ssh lxdev01.psi.ch
Last login: Tue Oct 4 09:50:33 2022
[buchel_k@lxdev01 ~]$ klist -l
Principal name     Cache name
--------------     ----------
buchel_k@D.PSI.CH  KCM:44951:42923
buchel_k@D.PSI.CH  KCM:44951:12312 (Expired)
buchel_k@D.PSI.CH  KCM:44951:42199 (Expired)
buchel_k@D.PSI.CH  KCM:44951:40168
buchel_k@D.PSI.CH  KCM:44951:8914 (Expired)
buchel_k@D.PSI.CH  KCM:44951:62275 (Expired)
buchel_k@D.PSI.CH  KCM:44951:27078 (Expired)
buchel_k@D.PSI.CH  KCM:44951:73924 (Expired)
buchel_k@D.PSI.CH  KCM:44951:72006
buchel_k@D.PSI.CH  KCM:44951:64449 (Expired)
buchel_k@D.PSI.CH  KCM:44951:60061 (Expired)
buchel_k@D.PSI.CH  KCM:44951:36925 (Expired)
buchel_k@D.PSI.CH  KCM:44951:48361 (Expired)
buchel_k@D.PSI.CH  KCM:44951:49651 (Expired)
buchel_k@D.PSI.CH  KCM:44951:76984 (Expired)
buchel_k@D.PSI.CH  KCM:44951:54227 (Expired)
buchel_k@D.PSI.CH  KCM:44951:85800 (Expired)
[buchel_k@lxdev01 ~]$ klist -l
Principal name     Cache name
--------------     ----------
buchel_k@D.PSI.CH  KCM:44951:12312 (Expired)
buchel_k@D.PSI.CH  KCM:44951:42199 (Expired)
buchel_k@D.PSI.CH  KCM:44951:40168
buchel_k@D.PSI.CH  KCM:44951:8914 (Expired)
buchel_k@D.PSI.CH  KCM:44951:62275 (Expired)
buchel_k@D.PSI.CH  KCM:44951:27078 (Expired)
buchel_k@D.PSI.CH  KCM:44951:73924 (Expired)
buchel_k@D.PSI.CH  KCM:44951:72006
buchel_k@D.PSI.CH  KCM:44951:64449 (Expired)
buchel_k@D.PSI.CH  KCM:44951:60061 (Expired)
buchel_k@D.PSI.CH  KCM:44951:36925 (Expired)
buchel_k@D.PSI.CH  KCM:44951:48361 (Expired)
buchel_k@D.PSI.CH  KCM:44951:42923
buchel_k@D.PSI.CH  KCM:44951:49651 (Expired)
buchel_k@D.PSI.CH  KCM:44951:76984 (Expired)
buchel_k@D.PSI.CH  KCM:44951:54227 (Expired)
buchel_k@D.PSI.CH  KCM:44951:85800 (Expired)
[buchel_k@lxdev01 ~]$
```
Note that the automatic AFS token renewal was only created after we had experienced this issue.


#### Busy Loop of goa-daemon
If the [GNOME Online Accounts](https://wiki.gnome.org/Projects/GnomeOnlineAccounts) daemon (`goa-daemon`) encounters a number of Kerberos credential caches, it goes into a busy loop and causes `sssd-kcm` to consume 100% of one core. The bugs reported at [Red Hat](https://bugzilla.redhat.com/show_bug.cgi?id=1645624#c113) and [GNOME](https://gitlab.gnome.org/GNOME/gnome-online-accounts/-/issues/79) have been happily ignored so far.

#### Zombie Caches by NoMachine NX
On a machine with remote desktop access using NoMachine NX I have seen the following cache list in the log:
```
# /usr/bin/klist -l
Principal name     Cache name
--------------     ----------
fische_r@D.PSI.CH  KCM:45334:73632 (Expired)
buchel_k@D.PSI.CH  KCM:45334:55706 (Expired)
fische_r@D.PSI.CH  KCM:45334:44226 (Expired)
fische_r@D.PSI.CH  KCM:45334:40904 (Expired)
fische_r@D.PSI.CH  KCM:45334:62275 (Expired)
fische_r@D.PSI.CH  KCM:45334:89020 (Expired)
buchel_k@D.PSI.CH  KCM:45334:25061 (Expired)
buchel_k@D.PSI.CH  KCM:45334:35168 (Expired)
fische_r@D.PSI.CH  KCM:45334:73845 (Expired)
fische_r@D.PSI.CH  KCM:45334:47508 (Expired)
fische_r@D.PSI.CH  KCM:45334:34317 (Expired)
fische_r@D.PSI.CH  KCM:45334:52058 (Expired)
fische_r@D.PSI.CH  KCM:45334:16150 (Expired)
fische_r@D.PSI.CH  KCM:45334:84445 (Expired)
fische_r@D.PSI.CH  KCM:45334:69076 (Expired)
buchel_k@D.PSI.CH  KCM:45334:87346 (Expired)
fische_r@D.PSI.CH  KCM:45334:57070 (Expired)
```
or on another machine in my personal list:
```
[buchel_k@pc14831 ~]$ klist -l
Principal name     Cache name
--------------     ----------
buchel_k@D.PSI.CH  KCM:44951:69748
buchel_k@D.PSI.CH  KCM:44951:18506 (Expired)
buchel_k@D.PSI.CH  KCM:44951:5113 (Expired)
buchel_k@D.PSI.CH  KCM:44951:52685 (Expired)
buchel_k@D.PSI.CH  KCM:44951:13951 (Expired)
PC14831$@D.PSI.CH  KCM:44951:43248 (Expired)
PC14831$@D.PSI.CH  KCM:44951:58459 (Expired)
buchel_k@D.PSI.CH  KCM:44951:14668 (Expired)
buchel_k@D.PSI.CH  KCM:44951:92516 (Expired)
[buchel_k@pc14831 ~]$
```
Both lists show principals which, I am very sure, have not been added manually by the user. So somewhere there is a security issue, either in `sssd-kcm` or in NoMachine NX.

In another experiment I logged into a machine with `ssh` and ran `kdestroy -A`, which should destroy all caches:

```
[buchel_k@mpc2959 ~]$ kdestroy -A
[buchel_k@mpc2959 ~]$ klist -l
Principal name     Cache name
[buchel_k@mpc2959 ~]$
```

Afterwards I logged in via NoMachine NX and got a cache which had expired more than two months ago:

```
[buchel_k@mpc2959 ~]$ klist -l
Principal name     Cache name

buchel_k@D.PSI.CH  KCM:44951:16795 (Expired)
buchel_k@D.PSI.CH  KCM:44951:69306
[buchel_k@mpc2959 ~]$ klist
Ticket cache: KCM:44951:16795
Default principal: buchel_k@D.PSI.CH

Valid starting       Expires              Service principal
13.07.2022 11:35:51  13.07.2022 21:26:19  krbtgt/D.PSI.CH@D.PSI.CH
        renew until 14.07.2022 11:26:19
[buchel_k@mpc2959 ~]$ date
Do Sep 22 08:37:41 CEST 2022
[buchel_k@mpc2959 ~]$
```
Note that a non-expired cache is available, but NoMachine NX explicitly sets `KRB5CCNAME` to a specific KCM cache. And that cache contains a ticket which is supposed to be gone.
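
A session could detect this situation itself at startup. The following is only a sketch: it relies on `klist -s`, which prints nothing and exits with status 1 if the cache cannot be read or is expired. The function names and the `KLIST` override (useful for dry runs) are assumptions made for this example.

```shell
# Succeed if the cache named in $1 is readable and not expired.
# KLIST may be overridden for dry runs; it defaults to the real klist.
cache_is_usable() {
    "${KLIST:-klist}" -s "$1"
}

# Report whether the cache this session was handed via KRB5CCNAME is usable.
check_session_cache() {
    if [ -z "${KRB5CCNAME-}" ]; then
        echo "KRB5CCNAME is not set, the default cache applies"
    elif cache_is_usable "$KRB5CCNAME"; then
        echo "session cache $KRB5CCNAME is usable"
    else
        echo "session cache $KRB5CCNAME is expired or unreadable"
    fi
}
```

Run in the NoMachine NX session from the transcript above, `check_session_cache` would be expected to report the pinned cache `KCM:44951:16795` as expired or unreadable, although another valid cache exists.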

So there is a security bug in `sssd-kcm`: it does not fully destroy tickets when told to do so. And there is another security issue in the interaction between NoMachine NX and `sssd-kcm`. I assume that NoMachine NX talks to the `KCM` as root, somehow gets hold of old caches (or has saved them somewhere) and moves them over into the user context. But the cache may not originally have belonged to that user...

I have not found much about Kerberos on the NoMachine website.

## Solution Attempts

Ideally we would get to a solution which achieves the following:

- interactive user sessions are isolated and do not interfere with each other
- AFS can get hold of new tickets and inject them into the PAGs as long as the user authenticates somehow on a regular basis
- `systemd --user`, which resides outside of the interactive user sessions, is happy as well
- `goa-daemon` sees only one cache

### renew-afstoken Script/Daemon

For AFS we (Achim and I) wrote the script `renew-afstoken`, which PAM starts upon login as a daemon, one per PAG.
Out of the available `KCM` caches it selects a suitable one to regularly get a new AFS token.
This now works very robustly and can also recover from expiration when a new ticket becomes available.
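
The selection logic can be sketched roughly as follows. This is not the actual `renew-afstoken` code: the function names, the `AKLOG` override and the rule to simply try every non-expired cache in listing order are assumptions made for illustration.

```shell
# List all non-expired KCM caches of a principal.
# Reads `klist -l` output on stdin; $1 is the principal to look for.
valid_caches() {
    awk -v principal="$1" '
        $1 == principal && $2 ~ /^KCM:/ && $0 !~ /\(Expired\)/ { print $2 }
    '
}

# Try each candidate cache until aklog succeeds with one of them.
# Reads `klist -l` output on stdin. AKLOG may be overridden for dry runs.
renew_once() {
    for cache in $(valid_caches "$1"); do
        # Point aklog at exactly this cache instead of the KCM default.
        if KRB5CCNAME="$cache" "${AKLOG:-aklog}"; then
            echo "renewed AFS token from $cache"
            return 0
        fi
    done
    echo "no usable cache for $1" >&2
    return 1
}
```

A per-PAG daemon would then call something like `klist -l | renew_once buchel_k@D.PSI.CH` periodically. Because the cache list is read afresh on every call, a failed round is followed by success as soon as a new valid ticket shows up in any cache.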

### Session Isolation with KRB5CCNAME

#### At the End of PAM
The idea is to set `KRB5CCNAME` to the very cache which has been created while going through the PAM stack.
A self-made PAM module does just this.

It works well for ssh sessions and might also work well for simple desktop sessions, but not for GNOME.
In GNOME the user programs are not started as children of the login screen and thus do not inherit its environment variables.
They are started by `systemd --user`, which sets `KRB5CCNAME` to `KCM:` instead of leaving it unset and relying on the system default (which results in the same behaviour).
I had a very short look at the `systemd` source code, but could not yet find the place where `KRB5CCNAME` is set. And the RHEL 8 version of `systemd` has more than 800 patches compared with upstream... (OK, some might be backports...)

#### At the Start of PAM
At some point I also ran a test which set `KRB5CCNAME` at the start of PAM to the fixed name of an existing cache, so that the TGT etc. end up in a well-known place.
That worked well. I also tested successfully that authenticating on the screen lock updates the TGT.

Using a random, non-existing cache name resulted in a failure, not in the creation of that cache as would happen if you did the same with `kinit`.
So the self-made PAM module would need to be extended to also create the cache.
I assumed that the "End of PAM" solution would be easier to implement, so I opted for that.

## Options for Next Steps

### Try out End of PAM
During the "Start of PAM" experiments I got deeper into PAM and Kerberos programming in C than I wanted, and I think I could get that working in reasonable time.

### Try out KEYRING
Maybe we can try to create a solution with `KEYRING` which isolates the interactive sessions and still allows the AFS token renewal to access all caches. This would then also require `renew-afstoken` to take care of Kerberos ticket renewal.

For the use cases listed above, the caches and tickets do not need to survive reboots. If something or someone needs `KCM` for some reason, it can be used specifically and privately and will not interfere with the rest of the system.

### Red Hat Ticket
I have a [ticket](https://access.redhat.com/support/cases/#/case/03280446) open with Red Hat on this case. In the first part I concentrated on the missing session isolation, but it turned out that this is the intended behaviour of a KCM setup.

One problem is that our machines run some non-standard software which is not covered by the support, namely YFS for AFS and NoMachine NX.

Furthermore, the problem is not that easy to reproduce, as it is best seen on a long-running and heavily used system. Creating such a test system with several users and many expired sessions means quite some effort.

I posted a few strange-looking `klist` outputs and asked for an explanation, but that seems not yet to have reached someone with intimate `sssd-kcm` knowledge.

How to proceed here? Post this document and ask how to proceed?

### Other Options

Fill in your ideas.

## PS
There is an advantage in the broken `sssd-kcm` default cache selection: it forces us to make our stuff robust against `KCM` glitches, which might also occur with a better manager, just far less often, and then they would be much harder to explain and to track down.