Passing along an issue reported to me by Zach Crisler internally at HPE:

How to reproduce:

Run Slurmctld with a configuration that specifies a jwks_file= option in AuthAltParameters containing a JWKS from Keycloak (confirmed with version 21.0.2) whose JWKs contain a valid x5c field.

Set and export SLURM_JWT to a valid access_token or id_token as obtained from the token endpoint specified in Keycloak's openid-configuration endpoint (e.g., https://keycloak.example.com/realms/master/protocol/openid-connect/token), using a configured Keycloak client with a scope configured to provide the sun (or username) claim. (Note that the openid scope must be used to obtain an id_token. Use jwt.io to verify that the tokens are valid and contain the necessary claims; see https://slurm.schedmd.com/jwt.html#compatibility.)

Run a SLURM client command. For example, sinfo fails with:

slurm_load_partitions: Unexpected message received

Check the Slurmctld logs for a failed authentication attempt like:

slurmctld: debug: auth/jwt: _verify_rs256_jwt: matched on kid '1lYLJR3xYZNf8bUyX0DHXR4MNgaIZ-u3iItxufoEuGM'
slurmctld: error: failed to verify jwt, rc=22
slurmctld: error: could not find matching kid or decode failed
slurmctld: error: slurm_unpack_received_msg: [[slurm]:58038] auth_g_verify: REQUEST_PARTITION_INFO has authentication error: Unspecified error
slurmctld: error: slurm_unpack_received_msg: [[slurm]:58038] Protocol authentication error
slurmctld: error: slurm_receive_msg [172.24.250.250:58038]: Protocol authentication error

The rc=22 indicates that an EINVAL error occurred during verification.

Analysis:

The x5c field, as described in https://datatracker.ietf.org/doc/html/rfc7517#section-4.7, contains the trust chain of DER encoded certificates which should be used to verify signatures on JWTs when it is specified in the JWK.
Lines https://github.com/SchedMD/slurm/blob/master/src/plugins/auth/jwt/auth_jwt.c#L144-L145 incorrectly parse the x5c field as if it contained a single PEM encoded certificate. When verifying a JWT, the configured JWKS is searched for the corresponding JWK and jwt_decode() (from libjwt) is called to verify the token with its key; see https://github.com/SchedMD/slurm/blob/master/src/plugins/auth/jwt/auth_jwt.c#L334-L341. When using OpenSSL, libjwt loads the provider's public key into a BIO buffer and parses it with PEM_read_bio_PUBKEY() at https://github.com/benmcollins/libjwt/blob/master/libjwt/jwt-openssl.c#L369, which returns NULL and causes EINVAL to be returned up the stack. (Presumably, when using GnuTLS, gnutls_pubkey_import() at https://github.com/benmcollins/libjwt/blob/master/libjwt/jwt-gnutls.c#L308-L311 will exhibit the same issue and also return EINVAL.)

Potential workaround:

Since the x5c field is technically optional, we are able to work around this issue by removing it from JWKs. When a JWK does not contain an x5c field, the auth/jwt plugin generates a PEM-encoded public key from its public key parameters n and e; see https://github.com/SchedMD/slurm/blob/master/src/plugins/auth/jwt/auth_jwt.c#L158.

Expected functionality:

Although generating a PEM-encoded public key from the public key parameters works, comments in https://github.com/SchedMD/slurm/blob/master/src/plugins/auth/jwt/pem_key.c#L188-L190 suggest the pem_from_mod_exp() function is a hack to deal with non-OpenSSL libraries and JWKs without the x5c field. If the auth/jwt plugin is going to support the x5c field (highly recommended), then it should properly support the specified trust chain. Even though the AuthAltParameters parameter requires the jwks_file= option to specify key providers, it should not implicitly trust them without chaining to the system's trust chain.
As described in the Security Considerations section on Key Provenance and Trust, https://datatracker.ietf.org/doc/html/rfc7517#section-9.1, JWTs that contain a reference to their provider JWK should be verified without the auth/jwt configuration needing to explicitly trust specific providers other than widely trusted roots.
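As a rough illustration of the encoding difference described in the analysis: per RFC 7517 section 4.7, each x5c entry is a standard base64 (not base64url) encoding of one DER certificate, so it has to be decoded and re-wrapped before anything expecting PEM can read it. The sketch below does that conversion with the Python standard library only. The function name x5c_to_pem_chain is made up for this example, it does not parse or validate the certificates or check the chain against any trust root, and it is not what auth/jwt or libjwt do.

```python
import base64
import ssl


def x5c_to_pem_chain(jwk: dict) -> list:
    """Convert the "x5c" entries of one JWK into PEM certificate strings.

    Hypothetical helper for illustration only: it re-encodes the bytes and
    performs no X.509 parsing or trust-chain validation.
    """
    pem_chain = []
    for entry in jwk.get("x5c", []):
        # x5c uses standard base64 (RFC 4648 section 4), not base64url.
        der = base64.b64decode(entry, validate=True)
        # Wrap the DER bytes in BEGIN/END CERTIFICATE armor.
        pem_chain.append(ssl.DER_cert_to_PEM_cert(der))
    return pem_chain
```

By RFC 7517, the first element of the returned list corresponds to the certificate holding the key itself, and any later elements are the certificates that certify it.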
Parsing of x5c is broken for JWKS in auth/jwt. I have a patchset that I've been debugging on Azure in bug#16168 that should fix the issue.
Which SSL library is libjwt compiled against on the test system?
(In reply to Nate Rini from comment #2)
> Which SSL library is libjwt compiled against on the test system?

OpenSSL 1.1.1l
(In reply to Nate Rini from comment #1)
> Parsing of x5c is broken for JWKS in auth/jwt. I have a patchset that I've
> been debugging on Azure in bug#16168 that should fix the issue.

I'm happy to test. Is it available? I don't see a corresponding "bug" branch at https://github.com/SchedMD/slurm.
(In reply to Zachary Crisler from comment #4)
> (In reply to Nate Rini from comment #1)
> > Parsing of x5c is broken for JWKS in auth/jwt. I have a patchset that I've
> > been debugging on Azure in bug#16168 that should fix the issue.
>
> I'm happy to test, is it available? I don't see a corresponding "bug" branch
> at https://github.com/SchedMD/slurm.

It is currently not shared. I have requested to be able to share it here. I will update once something changes.

Is Cray using an unmodified Keycloak implementation for generating JWKS, at least with respect to the resultant JWKS?
(In reply to Nate Rini from comment #5)
> Is Cray using an unmodified keycloak implementation for generating JWKS
> atleast w/rt to the resultant JWKS?

I used Keycloak version 21.0.2, deployed via the Kubernetes operator, installed without OLM as documented at https://www.keycloak.org/operator/installation#_installing_by_using_kubectl_without_operator_lifecycle_manager. The following Kustomization reflects how it was deployed on my test cluster:

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: keycloak-system
resources:
- https://raw.githubusercontent.com/keycloak/keycloak-k8s-resources/21.0.2/kubernetes/keycloaks.k8s.keycloak.org-v1.yml
- https://raw.githubusercontent.com/keycloak/keycloak-k8s-resources/21.0.2/kubernetes/keycloakrealmimports.k8s.keycloak.org-v1.yml
- https://raw.githubusercontent.com/keycloak/keycloak-k8s-resources/21.0.2/kubernetes/kubernetes.yml

Keycloak users and groups are federated from the same LDAP the SLURM cluster uses, so there should be parity between Keycloak and OS users and groups. I did not customize the JWKS automatically generated by Keycloak, other than to remove the "x5c" field as a workaround to get JWT authentication working with Slurmctld. I did create a "slurm-jwt" scope that adds the "sun" claim to issued JWTs. For testing, I created a Keycloak client, added the "slurm-jwt" scope to it, and obtained access and ID tokens via:

$ curl -fsSL "https://${KC_HOST}/realms/master/.well-known/openid-configuration" \
    | jq -r '.token_endpoint' \
    | xargs curl -fsSL -X POST \
        -H 'Content-Type: application/x-www-form-urlencoded' \
        --data-urlencode "client_id=${KC_CLIENT_ID}" \
        --data-urlencode "client_secret=${KC_CLIENT_SECRET}" \
        --data-urlencode 'scope=openid slurm-jwt' \
        --data-urlencode 'grant_type=password' \
        --data-urlencode "username=${KC_USERNAME}" \
        --data-urlencode "password=${KC_PASSWORD}"

Also, based on other open issues and what I've seen in the documentation and source w.r.t. authorization and access control using JWTs, I do not plan to expose direct access to Slurmctld or Slurmd. Instead, connections will be proxied using OAuth2 Proxy, which is relatively straightforward to do via the Ingress NGINX controller. OAuth2 Proxy has "--allowed-group" and "--allowed-role" flags, which enable access control based on Keycloak groups and roles, as not every valid JWT should be granted access to every SLURM cluster. I'm curious if anyone has considered using e.g. Open Policy Agent (OPA) to handle complex authorization policies with SLURM, but I do not expect to require that level of control in the near term.

OAuth2 Proxy: https://oauth2-proxy.github.io/oauth2-proxy/
Ingress NGINX controller: https://github.com/kubernetes/ingress-nginx
OPA: https://www.openpolicyagent.org/
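As a local alternative to pasting tokens into jwt.io, the following minimal sketch decodes the header and payload of a token such as the ones returned by the curl request above and reports which of the fields mentioned in this ticket are missing. It does not verify the signature. The helper names are hypothetical, and the set of fields checked (kid in the header, sun or username in the payload) is an assumption drawn from this discussion, not a complete list of what Slurm requires.

```python
import base64
import json


def b64url_decode(segment: str) -> bytes:
    """Decode an unpadded base64url segment as used in JWTs."""
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))


def inspect_jwt(token: str) -> dict:
    """Return the decoded header and payload WITHOUT checking the signature."""
    header_b64, payload_b64, _signature_b64 = token.split(".")
    return {
        "header": json.loads(b64url_decode(header_b64)),
        "payload": json.loads(b64url_decode(payload_b64)),
    }


def missing_fields(token: str, user_claims=("sun", "username")) -> list:
    """List the fields discussed in this ticket that the token lacks."""
    decoded = inspect_jwt(token)
    problems = []
    if "kid" not in decoded["header"]:
        problems.append("header has no kid, so no JWK can be matched")
    if not any(claim in decoded["payload"] for claim in user_claims):
        problems.append("payload has none of: " + ", ".join(user_claims))
    return problems
```

For example, calling missing_fields(os.environ["SLURM_JWT"]) before running sinfo gives a quick check that the "slurm-jwt" scope actually added the user claim; an empty list means both fields are present.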
Zachary,

The correction for this bug is still in the works due to a variety of other changes to the auth system. Have there been any issues with just removing the x5c field?

Thanks,
--Nate
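For sites applying the same workaround, removing x5c from a downloaded JWKS can be scripted. This is a minimal sketch, assuming the JWKS has the usual top-level "keys" array; the function names and any file paths passed to them are placeholders for illustration and not something Slurm provides.

```python
import json


def strip_x5c(jwks: dict) -> dict:
    """Return a copy of the JWKS with "x5c" removed from every key.

    All other members (kid, kty, n, e, ...) are left untouched, and the
    input dictionary is not modified.
    """
    cleaned_keys = [
        {name: value for name, value in jwk.items() if name != "x5c"}
        for jwk in jwks.get("keys", [])
    ]
    return {**jwks, "keys": cleaned_keys}


def rewrite_jwks_file(src_path: str, dst_path: str) -> None:
    """Read a JWKS file, drop x5c from each key, and write the result."""
    with open(src_path, encoding="utf-8") as src:
        jwks = json.load(src)
    with open(dst_path, "w", encoding="utf-8") as dst:
        json.dump(strip_x5c(jwks), dst, indent=2)
```

The output file would then be the one referenced by the jwks_file= option, leaving the plugin to fall back to the n and e parameters as described in the original report.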
(In reply to Zachary Crisler from comment #6)
> (In reply to Nate Rini from comment #5)
> > Is Cray using an unmodified keycloak implementation for generating JWKS
> > atleast w/rt to the resultant JWKS?
>
> I used Keycloak version 21.0.2, deployed via the Kubernetes operator,
> installed without OLM as documented at

Thanks for providing the details on the Keycloak implementation. We used it to verify the fix worked, and it was a lot easier than having to deploy a cluster in one of the cloud providers. We may also eventually add examples to our documentation for other sites, as it was a relatively simple process.

> Also, based on other open issues and what I've seen in the documentation and
> source w.r.t. authorization and access control using JWTs, I do not plan to
> expose direct access to Slurmctld or Slurmd.

Sorry for the delay in responding. I missed this part as work on this ticket got delayed due to internal discussions about how best to handle x509 certs.

We ended up deciding to drop the invalid code that tried to parse the x5c fields. We chose to continue relying on the other RSA values in the JWKS keys, as all sites affected had no issue dropping the 'x5c' field in their JWKS files. The potential benefits of parsing the full certificate chain may have us revisit this again in the future. This change is upstream for the upcoming Slurm-23.02.6 release:
> https://github.com/SchedMD/slurm/commit/2674a3a8aa67031e12b73903612a3c36b290f787

Using auth/jwt doesn't work with slurmd; there is an outstanding RFE for it in bug#12618. It doesn't currently have a sponsor, so I have no ETA on when, or whether, it will be implemented. Communications to slurmctld and slurmdbd should work fine with auth/jwt though. This will mainly limit control of job steps, which may not matter if that control only needs to be done from the batch step, which will run on the compute nodes and will have auth/munge working.
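To make "relying on the other RSA values" concrete: an RS256 signature can be checked with nothing but the n and e members of the JWK. The sketch below performs the RSASSA-PKCS1-v1_5 check with SHA-256 (RFC 8017) by hand using the Python standard library. It is purely illustrative and is not how auth/jwt works; per the original report, the plugin builds a PEM key from n and e via pem_from_mod_exp() and hands verification to libjwt. The function names are hypothetical, and claim checks such as expiry are deliberately omitted.

```python
import base64
import hashlib
import hmac

# DER-encoded DigestInfo prefix for SHA-256 (RFC 8017, section 9.2).
SHA256_DIGEST_INFO = bytes.fromhex("3031300d060960864801650304020105000420")


def b64url_decode(text: str) -> bytes:
    """Decode unpadded base64url, as used by both JWT and JWK."""
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def b64url_to_int(text: str) -> int:
    """Decode a base64url big-endian integer such as a JWK "n" or "e"."""
    return int.from_bytes(b64url_decode(text), "big")


def verify_rs256(token: str, jwk: dict) -> bool:
    """Check the RS256 signature of a compact JWT against a JWK's n and e."""
    parts = token.split(".")
    if len(parts) != 3:
        return False
    header_b64, payload_b64, signature_b64 = parts
    n = b64url_to_int(jwk["n"])
    e = b64url_to_int(jwk["e"])
    k = (n.bit_length() + 7) // 8  # modulus length in bytes

    signature = b64url_decode(signature_b64)
    digest = hashlib.sha256(f"{header_b64}.{payload_b64}".encode("ascii")).digest()
    encoded_digest = SHA256_DIGEST_INFO + digest
    if len(signature) != k or k < len(encoded_digest) + 11:
        return False
    s = int.from_bytes(signature, "big")
    if s >= n:
        return False

    # RSA public-key operation, then compare with the expected padded block:
    # 0x00 0x01 FF..FF 0x00 || DigestInfo || SHA-256(signing input)
    recovered = pow(s, e, n).to_bytes(k, "big")
    padding = b"\xff" * (k - len(encoded_digest) - 3)
    expected = b"\x00\x01" + padding + b"\x00" + encoded_digest
    return hmac.compare_digest(recovered, expected)
```

Nothing in this check consults x5c, which is why dropping the field does not affect signature verification; what is lost is only the ability to tie the key to a certificate chain.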
> Instead connections will be
> proxied using OAuth2 Proxy, which is relatively straight-forward to do via
> the Ingress NGINX controller. OAuth2 Proxy has "--allowed-group" and
> "--allowed-role" flags which enables access control based on Keycloak groups
> and roles, as not every valid JWT should be granted access to every SLURM
> cluster.

This sounds like a good implementation. slurmrestd was designed with the hope that existing solutions for proxying could be used to avoid having to reinvent the wheel.

> I'm curious if anyone has considered using e.g. Open Policy Agent
> (OPA) to handle complex authorization policies with SLURM, but I do not
> expect to require that level of control in the near term.

We have not had any sites report using OPA. Most of the sites that have provided any information have implemented an internal auth system based on some one-off user/group/account system specific to their organization, which makes sharing the details unhelpful to other sites.

I'm going to close this ticket. Please respond with any related questions or issues and we can continue the discussion in the ticket.
(In reply to Nate Rini from comment #8)
> Sorry for the delay in responding, I missed this part as work on this ticket
> got delayed due to internal discussions about how best to handle x509 certs.

Have you considered SPIFFE and SPIRE? https://spiffe.io/ It is quickly becoming a de facto standard.

> We ended up deciding to drop the invalid code that tried to parse the x5c
> fields. We chose to continue relying on the other RSA values in the JWKS
> keys as all sites affected had no issue dropping the 'x5c' field in their
> JWKS files.

I think this is a reasonable workaround for now. At least I won't need to edit the JWKS downloaded from Keycloak.
(In reply to Zachary Crisler from comment #9)
> (In reply to Nate Rini from comment #8)
> > Sorry for the delay in responding, I missed this part as work on this ticket
> > got delayed due to internal discussions about how best to handle x509 certs.
>
> Have you considered SPIFFE and SPIRE? https://spiffe.io/ It is quickly
> becoming a defacto standard.

It is not currently under consideration as a new auth system for Slurm. That could change if someone decides to sponsor adding it.