update state of Kerberos

2022-10-24 17:14:13 +02:00
parent 14d58979ae
commit 7c07e8f312
+50 -55
@@ -3,7 +3,8 @@
This document describes the Kerberos issues we encountered during RHEL 8 introduction.
In RHEL 7 we are using the `KEYRING` (kernel keyring) cache,
whereas for RHEL 8 the wish to use `KCM` (Kerberos Cache Manager) instead came up early.
whereas for RHEL 8 the wish to use `KCM` (Kerberos Cache Manager) instead came up early,
which is also the [new default](https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html/considerations_in_adopting_rhel_8/identity-management_considerations-in-adopting-rhel-8#kcm-replace-keyring-default-cache_considerations-in-adopting-RHEL-8).
The Kerberos documentation contains a [reference for all available cache types](https://web.mit.edu/kerberos/www/krb5-latest/doc/basic/ccache_def.html).
@@ -17,13 +18,13 @@ The Kerberos documentation contains a [reference for all available cache types](
- local desktop: ticket renewal after reauthentication on lock screen
- remote desktop with NoMachine NX: get new TGT on login
- remote desktop with NoMachine NX: ticket renewal after reconnection
- Website authentication (`SPNEGO` with Firefox, Chrome)
- website authentication (`SPNEGO` with Firefox, Chrome)
## KCM
## `KCM`
The `KCM` cache is provided by a dedicated daemon; on RHEL 8 this is `sssd_kcm`, which was developed by Red Hat itself.
### Advantages of KCM
### Advantages of `KCM`
The advantage of `KCM` is that the caches are permanent and survive daemon restarts and system reboots without the need to fiddle around with files and file permissions. This simplifies daemon and container use cases.
It also automatically renews tickets, which is handy for every use case.
@@ -48,12 +49,20 @@ Or even to recover when a new ticket comes available again.
A way to get `KCM` out of the business of selecting the "optimal" cache is to select it yourself and give the session/software one specific cache by setting the `KRB5CCNAME` environment variable accordingly (e.g. `KCM:44951:66120`). Note that when it is set to `KCM:`, the default cache from `KCM` is used.
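A minimal sketch of pinning one specific cache for a single program, assuming Python is available on the host. The helper names `env_with_cache` and `run_with_cache` are hypothetical, and the cache name used in the usage note is just the example value from the paragraph above.

```python
import os
import subprocess


def env_with_cache(cache_name, base_env=None):
    """Return a copy of the environment with KRB5CCNAME pinned to cache_name."""
    if not cache_name.startswith("KCM:"):
        raise ValueError("expected a KCM cache name such as KCM:44951:66120")
    # Copy, so that the environment of the calling process stays untouched.
    env = dict(os.environ if base_env is None else base_env)
    env["KRB5CCNAME"] = cache_name
    return env


def run_with_cache(cache_name, argv):
    """Run argv with KRB5CCNAME set, so KCM is never asked for its default cache."""
    return subprocess.run(argv, env=env_with_cache(cache_name), check=False)
```

For example, `run_with_cache("KCM:44951:66120", ["klist"])` would list the tickets of exactly that cache, independent of what `KCM` currently considers the default.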
### Problems of sssd_kcm
The most obvious and [well known problem](https://github.com/SSSD/sssd/issues/3593) of `sssd-kcm` is that it does not remove expired tokens and credential caches. I agree that it should not have an impact as this is mostly cosmetic. But that is only the case when everything can cope with that...
### Problems of `sssd_kcm`
To check the Kerberos credential cache, you can use `klist` to look at the current default cache and `klist -l` to look at all available caches. Note that the first listed cache is the default cache. Of course that is only valid when no `KRB5CCNAME` environment variable is set or when it is `KCM:`.
#### No Cleanup of Expired Caches
The most obvious and [well known problem](https://github.com/SSSD/sssd/issues/3593) of `sssd-kcm` is that it does not remove expired tokens and credential caches. I agree that it should not have an impact as this is mostly cosmetic. But that is only the case when everything can cope with that...
By default it is limited to 64 caches, but when that limit was hit, it was no longer possible to authenticate on the lock screen:
```
Okt 05 14:57:11 lxdev01.psi.ch krb5_child[43689]: Internal credentials cache error
```
So this causes a denial-of-service problem which we need to deal with somehow, e.g. by regularly removing expired caches. Note also that these caches are persistent and do not get removed on reboot.
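One way such a regular cleanup could look is sketched below in Python. This is an assumption, not something that has been tested in this setup: it relies on `klist -l` marking expired caches with a trailing `(Expired)` and on `kdestroy -c` to remove a named cache. The function names are hypothetical, and whether it is safe to destroy a cache that a running session still points to has not been verified.

```python
import os
import subprocess


def expired_caches(klist_output):
    """Return the names of all KCM caches that `klist -l` marks as expired."""
    expired = []
    for line in klist_output.splitlines():
        fields = line.split()
        # Data lines look like: <principal> <cache name> [(Expired)]
        if len(fields) >= 3 and fields[1].startswith("KCM:") and fields[-1] == "(Expired)":
            expired.append(fields[1])
    return expired


def destroy_expired_caches():
    """Destroy every expired KCM cache of the calling user."""
    # LC_ALL=C keeps the "(Expired)" marker untranslated.
    env = {**os.environ, "LC_ALL": "C"}
    listing = subprocess.run(["klist", "-l"], capture_output=True, text=True, env=env)
    for name in expired_caches(listing.stdout):
        subprocess.run(["kdestroy", "-c", name], check=False)
```

Keeping the parsing separate from the destructive part makes it possible to check the selection against saved `klist -l` output before anything is removed.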
#### Use of Expired Credential Caches
In the example below you see that on the ssh login I got a new default cache. But after a few minutes (there was a desktop login from my side and maybe an automatic AFS token renewal in between), I get an expired cache as default cache.
```
@@ -104,7 +113,7 @@ buchel_k@D.PSI.CH KCM:44951:85800 (Expired)
Note that the automatic AFS token renewal was created after we had experienced this issue.
#### Busy Loop of goa-daemon
#### Busy Loop of `goa-daemon`
If [GNOME Online Accounts](https://wiki.gnome.org/Projects/GnomeOnlineAccounts) encounters a number of Kerberos credential caches, it goes into a busy loop and causes `sssd-kcm` to consume 100% of one core. The bugs at [Red Hat](https://bugzilla.redhat.com/show_bug.cgi?id=1645624#c113) and [GNOME](https://gitlab.gnome.org/GNOME/gnome-online-accounts/-/issues/79) are happily ignored.
#### Zombie Caches by NoMachine NX
@@ -191,6 +200,15 @@ Ideally we would get to a solution which can do the following:
- AFS can get hold of new tickets and inject them into the PAGs as long as the user somehow regularly authenticates
- `systemd --user` which is residing outside of the interactive user sessions is happy as well
- `goa-daemon` sees only one cache
- expired caches get somehow cleaned up
### Only One Cache
`sssd-kcm` limits the number of caches to 64 by default, but that can be changed to 1 with the `max_uid_ccaches` option.
So there would be only one cache, shared by all sessions, but at least the `KCM` cannot serve anything but the latest.
But some logins no longer work when the maximum number of caches is hit, as already documented above in the chapter "No Cleanup of Expired Caches".
### renew-afstoken Script/Daemon
@@ -198,73 +216,50 @@ For AFS we (Achim and I) made the script `renew-afstoken` which is started as pe
Out of the available `KCM` caches it selects a suitable one to regularly get a new AFS token.
This now works very robustly and can also recover from expiration when a new ticket becomes available.
### Session Isolation with KRB5CCNAME
#### At the End of PAM
The idea is to set `KRB5CCNAME` to the very cache which has been created while going through the PAM stack.
A self-made PAM module just does this.
It works well for ssh sessions and might also work well for simple desktop sessions, but not for GNOME.
In GNOME the user programs are not started as children of the login screen and thus do not inherit the environment variables.
They are started by `systemd --user`, which sets `KRB5CCNAME` to `KCM:` instead of using the system default (which results in the same behaviour).
I had a very short look at the `systemd` source code, but could not yet find the place where `KRB5CCNAME` is set. And the RHEL8 version of `systemd` has more than 800 patches compared with upstream... (OK, some might be backports...)
### Setup Shared or Isolated Caches with KRB5CCNAME in own PAM Module
#### At the Start of PAM
At some point I also made a test by setting `KRB5CCNAME` at the start of PAM to a fixed name of an existing cache, so that the TGT, etc. end up in a well known place.
That worked well; I also tested successfully that authenticating on the screen lock updates the TGT.
A self-made PAM module `pam_single_kcm_cache.so` runs at session setup to set:
Using a random, non-existing cache name resulted in a failure, not in the creation of that cache as it would happen if you do that with `kinit`.
So that self-made PAM module would need to be extended to also create the cache.
I assumed that the "End of PAM" solution would be easier to implement, so I opted for that.
- `KRB5CCNAME=KCM:$UID:desktop` to use a shared credential cache for desktop sessions and `systemd --user`
- `KRB5CCNAME=KCM:$UID:$RANDOM_LETTERS` for text sessions to provide session isolation
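The naming scheme in the list above can be illustrated with a short Python sketch. This is not the code of `pam_single_kcm_cache.so`; the function name and the suffix length of 10 letters are assumptions made only for illustration.

```python
import secrets
import string


def cache_name_for_session(uid, desktop):
    """Build the KRB5CCNAME value for a new session of the user with this uid."""
    if desktop:
        # One well-known cache shared by all desktop sessions and `systemd --user`.
        return f"KCM:{uid}:desktop"
    # Text sessions get their own cache. Letters (not digits) keep these names
    # distinguishable from caches following the KCM:$UID:$RANDOM_NUMBER pattern.
    letters = "".join(secrets.choice(string.ascii_lowercase) for _ in range(10))
    return f"KCM:{uid}:{letters}"
```

With uid 44951, a desktop login would get `KCM:44951:desktop`, while every ssh login would get a different name of the form `KCM:44951:` followed by ten lowercase letters.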
#### Copy Credentials from Automatic Generated Cache to Self-Made Cache Both "Start of PAM" or "End of PAM"
The latest version of my self-made PAM module can be used both in the auth (early) and in the session (late) part of the PAM stack. It sets `KRB5CCNAME` either to a randomly generated cache (e.g. `KCM:44951:bzfandqspm`) or to a given value, e.g. `suffix=desktop` would create `KCM:44951:desktop`.
The credential caches automatically created by `sshd` (ticket delegation) or by `gdm` (created in PAM by `sss.so`, I guess) end up in a new `KCM` cache with the pattern `KCM:$UID:$RANDOM_NUMBER`.
The credentials therein are now copied over into the newly created or already existing cache. The former, automatically created cache is then destroyed.
To make ticket delegation work, it copies the credentials from the default cache to the new cache.
However, there is no simple and bulletproof way to select the automatically created credential cache.
The default cache selected by KCM might be it or not.
To work around this, the module iterates through all credential caches provided by the KCM and selects only those with the pattern `KCM:$UID` or `KCM:$UID:$RANDOM_NUMBER` which have a principal fitting the username.
From all of those it selects the one which is the youngest.
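The selection described above can be sketched in Python as follows. This is only an illustration of the filtering logic, not the module's actual code: the `Cache` record, the function name, and the use of the TGT start time as the measure for "youngest" are assumptions.

```python
import re
from typing import NamedTuple


class Cache(NamedTuple):
    name: str        # e.g. "KCM:44951:66120"
    principal: str   # e.g. "buchel_k@D.PSI.CH"
    auth_time: int   # assumed: start time of the TGT, seconds since the epoch


def pick_source_cache(caches, uid, username):
    """Pick the youngest automatically created cache of this user, or None."""
    # Only KCM:$UID and KCM:$UID:$RANDOM_NUMBER qualify; named caches such as
    # KCM:$UID:desktop are skipped.
    pattern = re.compile(rf"KCM:{uid}(:\d+)?")
    candidates = [
        c for c in caches
        if pattern.fullmatch(c.name) and c.principal.split("@", 1)[0] == username
    ]
    return max(candidates, key=lambda c: c.auth_time, default=None)
```

Returning `None` when nothing matches leaves it to the caller to decide what to do when no source cache exists.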
First experiments with ssh are successful. For the desktop I use a fixed cache `KCM:44951:desktop`. With this, `KRB5CCNAME` is set correctly, but in the current version the credentials are not yet available.
Note that the reason for `systemd --user` to use the same credential cache as the desktop sessions is that at least GNOME uses it to start the user programs like Evolution or Firefox.
TODOs:
- better selection of the source cache as the default cache is not always ideal (I have code for that already in the "End of PAM"-only version)
- do not delete credentials in target cache
- the "random" might not be necessary, at least for ssh it would be sufficient to fix it to the automatically generated cache
- or destroy source cache?
- maybe the early part (auth, "Start of PAM") is not needed
Ideally at the end there exist only caches with the naming pattern `KCM:$UID:desktop` and `KCM:$UID:$RANDOM_LETTERS`.
If there are still some `KCM:$UID:$RANDOM_NUMBER` caches, then they were not caught, e.g. because they use a so far unknown authentication path.
There are still some more experiments required.
You find the code in [GitLab](https://git.psi.ch/linux-infra/pam_single_kcm_cache), where there is currently an [open merge request for the initial commit](https://git.psi.ch/linux-infra/pam_single_kcm_cache/-/merge_requests/1). I plan to make that public on GitHub.
### Only One Cache
`sssd-kcm` limits the number of caches to 64 by default, but that can be changed to 1 with the `max_uid_ccaches` option.
So there would be only one cache, shared by all sessions, but at least the `KCM` cannot serve anything but the latest.
I did not exactly test this, but I tested what happens when all 64 caches are used up.
It was not possible any more to authenticate on the lock screen:
```
Okt 05 14:57:11 lxdev01.psi.ch krb5_child[43689]: Internal credentials cache error
```
So this causes a denial-of-service problem which we need to deal with somehow, e.g. by regularly removing expired caches.
## Open Problems
- for NX and su I do not get a copy of the initial cache (or is there an initial cache?), this needs more investigation
- when getting out of the GNOME screen lock, the new TGT is put into the default KCM cache and not necessarily into `KCM:$UID:desktop`
- cleanup of caches, else we might end up in a DoS situation
- `pam_single_kcm_cache.so` could be extended to destroy cache on end of session => not a good idea with AFS and long running background calculations
- `pam_single_kcm_cache.so` could be extended to optionally destroy all caches at the end of session => useful for `systemd --user`, because that ends after the last user process has ended and would then do a full cleanup. This would also ensure an empty KCM after a clean shutdown.
- alternatively we might do a `systemd --user` unit doing so, maybe also as daemon to clean up old expired caches
## Options for Next Steps
### Continue Experiments with Own PAM Module Setting KRB5CCNAME
I think we can get to a solution here in which we get the KCM out of the business of selecting the best cache while still using the rest of its advantages. But that needs a bit more work and experimentation.
### Continue with `pam_single_kcm_cache.so`
I think we can get to a solution here in which we get the KCM out of the business of selecting the best cache while still using the rest of its advantages.
### Try out KEYRING
Maybe we can try to create a solution with `KEYRING` which isolates the interactive sessions and still allows the AFS token renewal to access all caches. This then also needs `renew-afstoken` to care about Kerberos ticket renewal.
For the use cases listed above the caches and tickets do not need to survive reboots. If there is something/someone needing `KCM` for some reason, it can be used specifically and privately and will not interfere with the rest of the system.
### How to Deal with `systemd --user`?
The `systemd --user` process is started at the beginning of the first session and ends at the end of the last session. And some desktop environments depend heavily on it.
So the ideal solution might be to have just one known "desktop" cache (e.g. `KCM:44951:desktop`) which is shared by `systemd --user` and all desktop sessions.
Additionally I figured out that it is possible to inject environment variables into `systemd --user` with `systemctl --user import-environment` or `systemctl --user set-environment`, but that then affects only newly started software.
Only experiments would show if this is good enough or if some important processes live longer than the desktop session.
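As a minimal sketch of such an injection, the following Python helper builds and runs the `systemctl --user set-environment` call for the shared "desktop" cache. The helper names are hypothetical, and it is an assumption that it runs inside the user's session so that `systemctl --user` can reach the user manager.

```python
import subprocess


def set_environment_argv(uid, suffix="desktop"):
    """Build the systemctl call that points `systemd --user` to one fixed cache."""
    return ["systemctl", "--user", "set-environment", f"KRB5CCNAME=KCM:{uid}:{suffix}"]


def inject_desktop_cache(uid):
    """Set KRB5CCNAME for everything `systemd --user` starts from now on."""
    # Processes which are already running keep their old environment.
    subprocess.run(set_environment_argv(uid), check=True)
```

Calling `inject_desktop_cache(44951)` would be equivalent to running `systemctl --user set-environment KRB5CCNAME=KCM:44951:desktop` by hand.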
### Red Hat Ticket
I have a [ticket](https://access.redhat.com/support/cases/#/case/03280446) open with Red Hat on this case. In the first part I concentrated on the missing session isolation, but it turned out that this is the intended behaviour of a KCM setup.
@@ -274,7 +269,7 @@ Then it is not that easy to reproduce as the problem is best seen in a long runn
I posted a few strange looking `klist` outputs and asked for an explanation, but that seemed not yet to have reached someone with intimate `sssd-kcm` knowledge.
How to proceed here? Post this document and ask how to proceed?
I posted this document, but so far the response was not very helpful.
### Other Options