Ticket 16586 - The auth/jwt plugin does not properly handle the x5c field in a JWK to verify JWT signatures
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: Other
Version: 23.02.1
Hardware: Linux Linux
Importance: --- 4 - Minor Issue
Assignee: Nate Rini
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2023-04-25 15:02 MDT by David Gloe
Modified: 2023-09-22 10:42 MDT

See Also:
Site: CRAY
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: Cray Internal
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed: 23.02.6,23.11.0rc1
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Description David Gloe 2023-04-25 15:02:01 MDT
Passing along an issue reported to me by Zach Crisler internally at HPE:

How to reproduce: Run slurmctld with a configuration that specifies a jwks= option in AuthAltParameters pointing to a JWKS from Keycloak (confirmed with version 21.0.2) whose JWKs contain a valid x5c field. Set and export SLURM_JWT to a valid access_token or id_token obtained from the token endpoint listed in Keycloak's openid-configuration endpoint (e.g., https://keycloak.example.com/realms/master/protocol/openid-connect/token), using a Keycloak client with a scope configured to provide the sun (or username) claim. (Note that the openid scope must be used to obtain an id_token. Use jwt.io to verify that the tokens are valid and contain the necessary claims; see https://slurm.schedmd.com/jwt.html#compatibility.) Run a Slurm client command; for example, sinfo fails with "slurm_load_partitions: Unexpected message received". Then check the slurmctld logs for a failed authentication attempt like:

slurmctld: debug:  auth/jwt: _verify_rs256_jwt: matched on kid '1lYLJR3xYZNf8bUyX0DHXR4MNgaIZ-u3iItxufoEuGM'
slurmctld: error: failed to verify jwt, rc=22
slurmctld: error: could not find matching kid or decode failed
slurmctld: error: slurm_unpack_received_msg: [[slurm]:58038] auth_g_verify: REQUEST_PARTITION_INFO has authentication error: Unspecified error
slurmctld: error: slurm_unpack_received_msg: [[slurm]:58038] Protocol authentication error
slurmctld: error: slurm_receive_msg [172.24.250.250:58038]: Protocol authentication error

The rc=22 indicates an EINVAL error occurred during verification.
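
As a quick cross-check of the rc=22 reading, the errno number can be mapped to its symbolic name. This is only a minimal sketch and assumes a Linux host, since errno numbering is platform-defined:

```python
import errno
import os

rc = 22  # return code reported in the slurmctld log line above

# errno.errorcode maps numeric errno values to their symbolic names.
name = errno.errorcode[rc]

# os.strerror() returns the platform's human-readable message for the code.
print(name, "-", os.strerror(rc))
```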

Analysis: The x5c field, as described in https://datatracker.ietf.org/doc/html/rfc7517#section-4.7, contains the trust chain of DER-encoded certificates, which should be used to verify signatures on JWTs when it is specified in the JWK. The code at https://github.com/SchedMD/slurm/blob/master/src/plugins/auth/jwt/auth_jwt.c#L144-L145 incorrectly parses the x5c field as if it contained a single PEM-encoded certificate. When verifying a JWT, the configured JWKS is searched for the corresponding JWK and jwt_decode() (from libjwt) is called to verify the token with its key; see https://github.com/SchedMD/slurm/blob/master/src/plugins/auth/jwt/auth_jwt.c#L334-L341. When using OpenSSL, libjwt reads the provider's public key from a BIO buffer using PEM_read_bio_PUBKEY() at https://github.com/benmcollins/libjwt/blob/master/libjwt/jwt-openssl.c#L369, which returns NULL and causes EINVAL to be returned up the stack. (Presumably, when using GnuTLS, gnutls_pubkey_import() at https://github.com/benmcollins/libjwt/blob/master/libjwt/jwt-gnutls.c#L308-L311 will exhibit the same issue and also return EINVAL.)
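
To illustrate the encoding mismatch described above: per RFC 7517 section 4.7, each x5c array element is a standard base64 (not base64url) string of one DER certificate, so a chain has to be handled entry by entry. The following is a rough sketch only, not the plugin's code; the function names are made up for illustration, and it re-wraps the entries as PEM without validating that they are real certificates:

```python
import base64


def x5c_entry_to_pem(entry: str) -> str:
    """Wrap one x5c element (standard base64 of a DER certificate) in PEM armor."""
    # validate=True rejects characters outside the standard base64 alphabet,
    # which catches base64url input being passed by mistake.
    der = base64.b64decode(entry, validate=True)
    body = base64.b64encode(der).decode("ascii")
    # PEM bodies are conventionally wrapped at 64 characters per line.
    lines = [body[i:i + 64] for i in range(0, len(body), 64)]
    return "\n".join(
        ["-----BEGIN CERTIFICATE-----", *lines, "-----END CERTIFICATE-----"]
    ) + "\n"


def x5c_chain_to_pem(jwk: dict) -> list[str]:
    """Return one PEM block per certificate in the JWK's x5c chain (leaf first)."""
    return [x5c_entry_to_pem(entry) for entry in jwk.get("x5c", [])]
```

Note that the result is a certificate, not a bare public key: a consumer that calls PEM_read_bio_PUBKEY() would still need the public key extracted from the leaf certificate before it could be used.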

Potential workaround: Since the x5c field is technically optional, we are able to work around this issue by removing it from the JWKs. When a JWK does not contain an x5c field, the auth/jwt plugin generates a PEM-encoded public key from its public key parameters n and e; see https://github.com/SchedMD/slurm/blob/master/src/plugins/auth/jwt/auth_jwt.c#L158.
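
The workaround can be scripted rather than edited by hand. A minimal sketch, assuming the JWKS has the usual top-level "keys" array; the function name and the placeholder key values are hypothetical:

```python
import copy
import json


def strip_x5c(jwks: dict) -> dict:
    """Return a copy of the JWKS with the x5c member removed from every JWK."""
    cleaned = copy.deepcopy(jwks)  # leave the caller's data untouched
    for jwk in cleaned.get("keys", []):
        jwk.pop("x5c", None)  # no-op for keys that have no x5c field
    return cleaned


# Placeholder JWKS for illustration only; real values come from Keycloak.
example = {
    "keys": [
        {"kid": "example-kid", "kty": "RSA", "n": "...", "e": "AQAB", "x5c": ["..."]}
    ]
}
cleaned = strip_x5c(example)
print(json.dumps(cleaned, indent=2))
```

The n and e members are left in place, since those are what the plugin falls back to when x5c is absent.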

Expected functionality: Although generating a PEM-encoded public key from the public key parameters works, comments in https://github.com/SchedMD/slurm/blob/master/src/plugins/auth/jwt/pem_key.c#L188-L190 suggest the pem_from_mod_exp() function is a hack to deal with non-OpenSSL libraries and JWKs without the x5c field. If the auth/jwt plugin is going to support the x5c field (highly recommended), then it should properly support the specified trust chain. Even though the AuthAltParameters parameter requires the jwks= option to specify key providers, it should not implicitly trust them without chaining to the system's trust chain. As described in the Security Considerations section on Key Provenance and Trust, https://datatracker.ietf.org/doc/html/rfc7517#section-9.1, JWTs that contain a reference to their provider JWK should be verified without the auth/jwt configuration needing to explicitly trust specific providers other than widely trusted roots.
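
For readers unfamiliar with the n/e fallback mentioned above, the sketch below shows one way an RSA public key in PEM form can be assembled from a JWK's n and e members. It is an independent pure-Python illustration, not a transcription of Slurm's pem_from_mod_exp(); all helper names are hypothetical and no key validation is performed:

```python
import base64

# DER AlgorithmIdentifier for rsaEncryption (OID 1.2.840.113549.1.1.1, NULL params).
_RSA_ALG_ID = bytes.fromhex("300d06092a864886f70d0101010500")


def _b64url_to_int(value: str) -> int:
    """Decode an unpadded base64url big-endian integer, as used for JWK n and e."""
    padded = value + "=" * (-len(value) % 4)
    return int.from_bytes(base64.urlsafe_b64decode(padded), "big")


def _der(tag: int, content: bytes) -> bytes:
    """Encode one DER element with a definite-length header."""
    size = len(content)
    if size < 0x80:
        length = bytes([size])
    else:
        raw = size.to_bytes((size.bit_length() + 7) // 8, "big")
        length = bytes([0x80 | len(raw)]) + raw
    return bytes([tag]) + length + content


def _der_uint(value: int) -> bytes:
    """Encode a non-negative INTEGER; the extra bit keeps the sign bit clear."""
    return _der(0x02, value.to_bytes(value.bit_length() // 8 + 1, "big"))


def jwk_rsa_to_pem(jwk: dict) -> str:
    """Build a SubjectPublicKeyInfo PEM block from a JWK's n and e members."""
    n = _b64url_to_int(jwk["n"])
    e = _b64url_to_int(jwk["e"])
    rsa_key = _der(0x30, _der_uint(n) + _der_uint(e))  # RSAPublicKey SEQUENCE
    # BIT STRING content starts with the count of unused bits (zero here).
    spki = _der(0x30, _RSA_ALG_ID + _der(0x03, b"\x00" + rsa_key))
    body = base64.b64encode(spki).decode("ascii")
    lines = [body[i:i + 64] for i in range(0, len(body), 64)]
    return "\n".join(
        ["-----BEGIN PUBLIC KEY-----", *lines, "-----END PUBLIC KEY-----"]
    ) + "\n"
```

A key built this way carries no provenance, which is the point being made above: it can verify a signature, but unlike a validated x5c chain it says nothing about who issued the key.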
Comment 1 Nate Rini 2023-04-25 15:08:32 MDT
Parsing of x5c is broken for JWKS in auth/jwt. I have a patchset that I've been debugging on Azure in bug#16168 that should fix the issue.
Comment 2 Nate Rini 2023-04-25 15:09:29 MDT
Which SSL library is libjwt compiled against on the test system?
Comment 3 Zachary Crisler 2023-04-25 17:07:43 MDT
(In reply to Nate Rini from comment #2)
> Which SSL library is libjwt compiled against on the test system?

OpenSSL 1.1.1l
Comment 4 Zachary Crisler 2023-04-25 17:58:05 MDT
(In reply to Nate Rini from comment #1)
> Parsing of x5c is broken for JWKS in auth/jwt. I have a patchset that I've
> been debugging on Azure in bug#16168 that should fix the issue.

I'm happy to test, is it available? I don't see a corresponding "bug" branch at https://github.com/SchedMD/slurm.
Comment 5 Nate Rini 2023-04-26 14:23:25 MDT
(In reply to Zachary Crisler from comment #4)
> (In reply to Nate Rini from comment #1)
> > Parsing of x5c is broken for JWKS in auth/jwt. I have a patchset that I've
> > been debugging on Azure in bug#16168 that should fix the issue.
> 
> I'm happy to test, is it available? I don't see a corresponding "bug" branch
> at https://github.com/SchedMD/slurm.

It is currently not shared. I have requested to be able to share it here. I will update once something changes.

Is Cray using an unmodified Keycloak implementation for generating the JWKS, at least with respect to the resulting JWKS?
Comment 6 Zachary Crisler 2023-04-26 15:33:17 MDT
(In reply to Nate Rini from comment #5)
> Is Cray using an unmodified keycloak implementation for generating JWKS
> atleast w/rt to the resultant JWKS?

I used Keycloak version 21.0.2, deployed via the Kubernetes operator, installed without OLM as documented at https://www.keycloak.org/operator/installation#_installing_by_using_kubectl_without_operator_lifecycle_manager.

The following Kustomization reflects how it was deployed on my test cluster:

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: keycloak-system
resources:
- https://raw.githubusercontent.com/keycloak/keycloak-k8s-resources/21.0.2/kubernetes/keycloaks.k8s.keycloak.org-v1.yml
- https://raw.githubusercontent.com/keycloak/keycloak-k8s-resources/21.0.2/kubernetes/keycloakrealmimports.k8s.keycloak.org-v1.yml
- https://raw.githubusercontent.com/keycloak/keycloak-k8s-resources/21.0.2/kubernetes/kubernetes.yml

Keycloak users and groups are federated from the same LDAP the SLURM cluster uses, so there should be parity between Keycloak and OS users and groups.

I did not customize the JWKS automatically generated by Keycloak, other than to remove the "x5c" field as a workaround to get JWT authentication working with Slurmctld.

I did create a "slurm-jwt" scope that adds the "sun" claim to issued JWTs. For testing, I created a Keycloak client, added the "slurm-jwt" scope to it, and obtained access and ID tokens via:

$ curl -fsSL "https://${KC_HOST}/realms/master/.well-known/openid-configuration" \
| jq -r '.token_endpoint' \
| xargs curl -fsSL -X POST \
    -H 'Content-Type: application/x-www-form-urlencoded' \
    --data-urlencode "client_id=${KC_CLIENT_ID}" \
    --data-urlencode "client_secret=${KC_CLIENT_SECRET}" \
    --data-urlencode 'scope=openid slurm-jwt' \
    --data-urlencode 'grant_type=password' \
    --data-urlencode "username=${KC_USERNAME}" \
    --data-urlencode "password=${KC_PASSWORD}"
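
As an offline alternative to pasting tokens into jwt.io, the header and claims of a returned token can be inspected locally. This is a minimal sketch that decodes the token WITHOUT verifying its signature; the function names are hypothetical, and the claim names checked ("sun", then "username") are the ones mentioned earlier in this ticket:

```python
import base64
import json


def _b64url_json(segment: str) -> dict:
    """Decode one unpadded base64url JWT segment into a JSON object."""
    padded = segment + "=" * (-len(segment) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))


def inspect_jwt(token: str) -> tuple[dict, dict]:
    """Return (header, claims) of a compact JWT. The signature is NOT checked."""
    header_b64, payload_b64, _signature = token.split(".")
    return _b64url_json(header_b64), _b64url_json(payload_b64)


def slurm_user_claim(claims: dict):
    """Return the user name claim, preferring "sun" over "username"."""
    return claims.get("sun") or claims.get("username")
```

For example, calling inspect_jwt() on the value exported as SLURM_JWT shows the "kid" in the header, which should match the kid reported in the slurmctld debug log, and whether the "slurm-jwt" scope actually added the expected claim.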

Also, based on other open issues and what I've seen in the documentation and source w.r.t. authorization and access control using JWTs, I do not plan to expose direct access to Slurmctld or Slurmd. Instead, connections will be proxied using OAuth2 Proxy, which is relatively straightforward to do via the Ingress NGINX controller. OAuth2 Proxy has "--allowed-group" and "--allowed-role" flags, which enable access control based on Keycloak groups and roles, as not every valid JWT should be granted access to every SLURM cluster. I'm curious whether anyone has considered using, e.g., Open Policy Agent (OPA) to handle complex authorization policies with SLURM, but I do not expect to require that level of control in the near term.


OAuth2 Proxy: https://oauth2-proxy.github.io/oauth2-proxy/
Ingress NGINX controller: https://github.com/kubernetes/ingress-nginx
OPA: https://www.openpolicyagent.org/
Comment 7 Nate Rini 2023-09-06 09:24:34 MDT
Zachary

The correction for this bug is still in the works due to a variety of other changes to the auth system. Have there been any issues with just removing the x5c field?

Thanks,
--Nate
Comment 8 Nate Rini 2023-09-21 17:26:47 MDT
(In reply to Zachary Crisler from comment #6)
> (In reply to Nate Rini from comment #5)
> > Is Cray using an unmodified keycloak implementation for generating JWKS
> > atleast w/rt to the resultant JWKS?
> 
> I used Keycloak version 21.0.2, deployed via the Kubernetes operator,
> installed without OLM as documented at

Thanks for providing the details on the keycloak implementation. We used it to verify the fix worked, and it was a lot easier than having to deploy a cluster in one of the cloud providers. We may also eventually add examples to our documentation for other sites as it was a relatively simple process.
 
> Also, based on other open issues and what I've seen in the documentation and
> source w.r.t. authorization and access control using JWTs, I do not plan to
> expose direct access to Slurmctld or Slurmd.

Sorry for the delay in responding, I missed this part as work on this ticket got delayed due to internal discussions about how best to handle x509 certs. We ended up deciding to drop the invalid code that tried to parse the x5c fields. We chose to continue relying on the other RSA values in the JWKS keys as all sites affected had no issue dropping the 'x5c' field in their JWKS files. The potential benefits of parsing the full certificate chain may have us revisit this again in the future. This change is upstream for the upcoming Slurm-23.02.6 release:
> https://github.com/SchedMD/slurm/commit/2674a3a8aa67031e12b73903612a3c36b290f787

Using auth/jwt doesn't work with slurmd; there is an outstanding RFE for it in bug#12618. It doesn't currently have a sponsor, so I have no ETA on when or if it will be implemented. Communications to slurmctld and slurmdbd should work fine with auth/jwt, though. This will mainly limit control of job steps, which may not matter if that control only needs to be done from the batch step, which runs on the compute nodes and will have auth/munge working.

> Instead connections will be
> proxied using OAuth2 Proxy, which is relatively straight-forward to do via
> the Ingress NGINX controller. OAuth2 Proxy has "--allowed-group" and
> "--allowed-role" flags which enables access control based on Keycloak groups
> and roles, as not every valid JWT should be granted access to every SLURM
> cluster.

This sounds like a good implementation. slurmrestd was designed with the hope that existing solutions for proxying could be used to avoid having to reinvent the wheel.

> I'm curious if anyone has considered using e.g. Open Policy Agent
> (OPA) to handle complex authorization policies with SLURM, but I do not
> expect to require that level of control in the near term.

We have not had any sites report using OPA. Most of the sites that have provided any information have implemented an internal auth system based on a one-off user/group/account system specific to their organization, which makes sharing the details unhelpful to other sites.

I'm going to close this ticket. Please respond with any related questions or issues and we can continue the discussion in the ticket.
Comment 9 Zachary Crisler 2023-09-22 09:05:58 MDT
(In reply to Nate Rini from comment #8)
> Sorry for the delay in responding, I missed this part as work on this ticket
> got delayed due to internal discussions about how best to handle x509 certs.

Have you considered SPIFFE and SPIRE? https://spiffe.io/ It is quickly
becoming a de facto standard.

> We ended up deciding to drop the invalid code that tried to parse the x5c
> fields. We chose to continue relying on the other RSA values in the JWKS
> keys as all sites affected had no issue dropping the 'x5c' field in their
> JWKS files. 

I think this is a reasonable workaround for now. At least I won't need to
edit the JWKS downloaded from Keycloak.
Comment 10 Nate Rini 2023-09-22 10:42:53 MDT
(In reply to Zachary Crisler from comment #9)
> (In reply to Nate Rini from comment #8)
> > Sorry for the delay in responding, I missed this part as work on this ticket
> > got delayed due to internal discussions about how best to handle x509 certs.
> 
> Have you considered SPIFFE and SPIRE? https://spiffe.io/ It is quickly
> becoming a defacto standard.

It is not currently under consideration as a new auth system for Slurm. That could change if someone decides to sponsor adding it.