Troubleshoot irrevocable leases
Introduction
Vault creates leases for both dynamic secrets and service tokens, and it maintains the lifecycle of those leases with an internal system called the expiration manager.
The expiration manager handles the revocation of leases when they reach their specified time to live value.
Certain problems can prevent Vault from revoking a lease. For example, leases on secrets issued from a dynamic secrets engine can become irrevocable if Vault cannot communicate with the server configured in the secrets engine.
Challenge
Irrevocable leases accumulate over time and can cause degraded performance at critical stages of Vault operations, such as during startup or when the server assumes active cluster leadership.
Before Vault version 1.8.0, the server would try to revoke all expired leases at once during startup. When tens of thousands of irrevocable leases have accumulated, request handling can degrade while the expiration manager attempts revocation.
Solution
Vault 1.8.0 introduced enhanced expiration manager functionality to internally mark leases as irrevocable after 6 failed attempts at revocation.
This provides a way to stop attempting revocation on leases identified as irrevocable.
An HTTP API and CLI command are also available to assist operators in identifying irrevocable leases.
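As a rough sketch of the HTTP API route, the snippet below builds the URL for the lease count endpoint and wraps a curl call around it. It assumes that VAULT_ADDR and VAULT_TOKEN are exported in your shell and that curl is installed. The helper names lease_count_url and count_irrevocable_leases are illustrative and are not part of Vault.

```shell
# Build the URL for the lease count endpoint, filtered to irrevocable leases.
# Strips one trailing slash from VAULT_ADDR so the path is not doubled.
lease_count_url() {
  printf '%s/v1/sys/leases/count?type=irrevocable\n' "${VAULT_ADDR%/}"
}

# Hypothetical wrapper: needs a reachable Vault server and a token that is
# allowed to read sys/leases/count.
count_irrevocable_leases() {
  curl --silent --header "X-Vault-Token: ${VAULT_TOKEN}" "$(lease_count_url)"
}
```

Calling count_irrevocable_leases against a running server should return the same data that the CLI command used later in this tutorial prints, as JSON.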
Example scenario
You can follow the example scenario in this tutorial to learn more about Vault lease handling and troubleshooting irrevocable leases.
Prerequisites
To perform the steps in this tutorial, you need:
- Docker Desktop
- Extra configuration from the learn-vault-lease-lab repository
Scenario introduction
The example scenario runs in a Docker environment. You will create a Docker network, and run a Vault dev mode server container. The scenario script will create the PostgreSQL container, configure the secrets engine, and create a dynamic credential with leases. This saves time so that you can focus on interpreting log output, and using the new API and CLI functionality.
Scenario environment setup
Before you can explore the scenarios, you need to prepare the environment.
First, define a learn-vault Docker network.
$ docker network create --attachable --subnet 10.42.74.0/24 learn-vault
Start a Vault dev mode container.
$ docker run \
    --name learn-vault \
    --publish 8200:8200 \
    --ip 10.42.74.100 \
    --network learn-vault \
    --detach \
    --rm \
    vault server -dev -dev-root-token-id root
Export environment variables for communicating with the Vault dev mode container using the root token value.
$ export VAULT_ADDR=http://localhost:8200 VAULT_TOKEN=root
Look up your token to ensure that you can communicate with the Vault dev mode container.
$ vault token lookup | grep policies
policies            [root]
Now that the Vault container is ready, you can begin exploring the example lease revocation scenarios.
Retrieve the example scenario scripts by cloning or downloading the hashicorp-education/learn-vault-lease-lab repository from GitHub.
Clone the repository.
$ git clone https://github.com/hashicorp-education/learn-vault-lease-lab
Or download the repository.
This repository holds supporting content for all the Vault learn tutorials. The content specific to this tutorial resides in a sub-directory.
Change your working directory to learn-vault-lease-lab.
$ cd learn-vault-lease-lab
Explore problematic leases
Before you can begin to resolve issues with problematic leases, you should first learn how to identify situations in which Vault is unable to revoke leases.
In this scenario you will identify the appearance of successful and unsuccessful lease revocation entries in the Vault server log, along with identifying an irrevocable lease entry.
The example script starts a PostgreSQL container, configures the Vault container to connect to it, defines a role for creating dynamic credentials, and creates one dynamic credential.
Execute scenario script
Set the dynamic-postgres.sh file to executable.
$ chmod +x dynamic-postgres.sh
With the Vault server running, execute the script.
$ ./dynamic-postgres.sh
Start PostgreSQL container.
572094a9ad9b7d8dd500945f22e7ff5692c378f10ebb7ac1c5de2024f09ac474.
Configure PostgreSQL secrets engine.
Success! Enabled the database secrets engine at: database/
Success! Data written to: database/roles/db-dba
Success! Uploaded policy: db-dba
Create PostgreSQL dynamic credential using DBA token.
Complete.
Explore successful lease revocation message
Wait over 1 minute for the TTL value on the 2 leases to expire, then check the Vault server logs.
$ docker logs 2>&1 learn-vault | grep revoked
You should find a log line indicating successful revocation of the 1 lease created by the script.
2021-07-27T10:52:23.083-0400 [INFO] expiration: revoked lease: lease_id=database/creds/db-dba/3er4DsaHXUbkwj0lh5wDGroJ
The log entry shows that the expiration manager revoked the lease for the credential. Note that the lease_id entry has a prefix to indicate the secrets engine type (database) and has a reference to the role name (db-dba).
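The structure of a lease ID can be picked apart with plain shell parameter expansion. The following is a minimal sketch that uses the lease ID from the log line above and assumes the secrets engine is mounted at a single-segment path such as database. The variable names are illustrative.

```shell
# Lease ID copied from the example log line.
lease_id="database/creds/db-dba/3er4DsaHXUbkwj0lh5wDGroJ"

lease_prefix="${lease_id%/*}"     # everything before the unique suffix
mount_path="${lease_id%%/*}"      # first path segment (assumes a one-segment mount)
role_name="${lease_prefix##*/}"   # last segment of the prefix

printf 'prefix=%s mount=%s role=%s\n' "$lease_prefix" "$mount_path" "$role_name"
```

The prefix value is the part you later pass to the lease revocation endpoints when you clean up leases in bulk.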
You are ready to examine a case where revocation is failing so you can understand how that situation appears in the server logs.
Explore unsuccessful lease revocation message
First, disable the database secrets engine that the script enabled. This will remove all associated configuration so that you can then reconfigure it with a second execution of the script.
$ vault secrets disable database
Success! Disabled the secrets engine (if it existed) at: database/
Then stop the PostgreSQL container.
$ docker stop learn-postgres
Note
The script starts the containers with the remove flag --rm so the container will be automatically cleaned up when you stop it.
Execute the dynamic-postgres.sh script again, but this time stop the PostgreSQL container after the script execution completes.
$ ./dynamic-postgres.sh ; docker stop learn-postgres
By stopping the PostgreSQL container, you prevent Vault from connecting to it and revoking the lease when it reaches expiration.
Wait a minute for the TTL value on the leases to expire, then check the Vault server logs.
$ docker logs 2>&1 learn-vault | grep 'failed to revoke lease'
You should find an [ERROR] line indicating failure to revoke the lease.
2021-07-27T10:56:05.428-0400 [ERROR] expiration: failed to revoke lease: lease_id=database/creds/db-dba/k7yLujVyreaN8HWXNKnpWTz1 error="failed to revoke entry: resp: (*logical.Response)(nil) err: dial tcp [::1]:5432: connect: connection refused"
The information making up the lease_id value has details about the secrets engine type and role name.
Note also that there is an error message, which states that Vault failed to revoke the entry, with more detail provided in the response. In this case, Vault cannot connect to the PostgreSQL server at [::1]:5432 because you stopped the Docker container.
Since Vault cannot connect to PostgreSQL, it cannot issue the revocation statements required to revoke the credentials and associated lease.
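When many leases fail, it can help to reduce the log to just the affected lease IDs. The helper below is a small sketch that assumes the log line format shown in the example above. The function name failed_lease_ids is hypothetical.

```shell
# Read Vault server log lines on stdin and print the lease ID from every
# "failed to revoke lease" entry. Lines that do not match are dropped.
failed_lease_ids() {
  sed -n 's/.*failed to revoke lease: lease_id=\([^ ]*\).*/\1/p'
}
```

For example, `docker logs learn-vault 2>&1 | failed_lease_ids | sort | uniq -c` would print each failing lease ID alongside the number of failed attempts logged for it.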
Irrevocable lease behavior
When Vault encounters irrevocable leases, it behaves differently depending on the version in use.
For versions before 1.8.0, Vault always tries to revoke all expired leases. In a scenario like the one you just explored, where the database server is unavailable, Vault periodically and indefinitely attempts to connect to that server to revoke the credentials.
For versions 1.8.0 and later, Vault tries to revoke an expired lease 6 times. If the sixth attempt fails, Vault internally marks the lease as irrevocable and stops attempting revocation. You can identify such leases with the CLI.
For this scenario, after several minutes have elapsed, you can check the logs again to learn if the expiration manager has attempted to revoke the lease at least 6 times.
Note
The time taken for revocation attempts is considerable because Vault uses exponential backoff to avoid overloading the PostgreSQL server with revocation requests.
$ docker logs 2>&1 learn-vault | grep 'failed to revoke lease' | wc -l
6
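To get a feel for why six attempts take several minutes, the sketch below sums the waits between attempts for a simple doubling backoff. This tutorial does not state Vault's actual base delay or growth factor, so the 10 second base and the doubling are assumptions chosen only for illustration, and the function name is hypothetical.

```shell
# Sum the waits between attempts for a doubling backoff.
# $1 = base delay in seconds (assumed), $2 = number of attempts.
backoff_total_seconds() {
  delay=$1
  attempts=$2
  total=0
  i=1
  # There is one wait between each pair of consecutive attempts.
  while [ "$i" -lt "$attempts" ]; do
    total=$((total + delay))
    delay=$((delay * 2))
    i=$((i + 1))
  done
  echo "$total"
}

# Assumed values: 10 second base delay, 6 attempts.
backoff_total_seconds 10 6
```

With these assumed values the waits are 10, 20, 40, 80, and 160 seconds, for a total of 310 seconds, a little over 5 minutes between the first and sixth attempt.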
Once you have observed that 6 revocation attempts have occurred and failed, use the vault CLI to report on the irrevocable leases.
$ vault read sys/leases/count type=irrevocable
Key            Value
---            -----
counts         map[database_23ec392d:1]
lease_count    1
The result is one irrevocable lease associated with the database secrets engine accessor 23ec392d.
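If several mounts hold irrevocable leases, the counts map is easier to read with one entry per line. The function below is a sketch that parses the table output shown above. It assumes the map[accessor:count ...] layout from this example, and the name irrevocable_counts is illustrative.

```shell
# Read `vault read sys/leases/count type=irrevocable` table output on stdin
# and print one "<mount accessor> <count>" pair per line.
irrevocable_counts() {
  sed -n 's/^counts[[:space:]]*map\[\(.*\)\].*$/\1/p' |
    tr ' ' '\n' |
    sed '/^$/d' |
    tr ':' ' '
}
```

You could pipe the CLI output straight into it, for example `vault read sys/leases/count type=irrevocable | irrevocable_counts`. To match an accessor to a mount path, compare it with the Accessor column of `vault secrets list`.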
Clean up irrevocable leases
You can clean up leases by revoking them based on their prefix.
In this case, the prefix corresponds to the path you have observed in the lease ID, database/creds/db-dba.
Try to revoke the irrevocable lease by its prefix.
$ vault write -force sys/leases/revoke-prefix/database/creds/db-dba
This fails with an error that is similar to the one logged when the expiration manager cannot revoke the lease.
Error writing data to sys/leases/revoke-prefix/database/creds/db-dba: Error making API request.

URL: PUT http://localhost:8200/v1/sys/leases/revoke-prefix/database/creds/db-dba
Code: 400. Errors:

* failed to revoke "database/creds/db-dba/k7yLujVyreaN8HWXNKnpWTz1" (1 / 1): failed to revoke entry: resp: (*logical.Response)(nil) err: dial tcp [::1]:5432: connect: connection refused
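How can you clean up this irrevocable lease, then?
You can use the Revoke Force API instead. Unlike revoke-prefix, it ignores errors from the secrets engine during revocation, so Vault removes the lease even though it cannot reach PostgreSQL.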
Try to forcibly revoke the lease.
$ vault write -force /sys/leases/revoke-force/database/creds/db-dba
Success! Data written to: sys/leases/revoke-force/database/creds/db-dba
CAUTION
This operation will revoke all leases at the specified prefix.
Try to list irrevocable leases again, and you should find that the 1 lease has now been forcibly revoked.
$ vault read sys/leases/count type=irrevocable
Key            Value
---            -----
counts         map[]
lease_count    0
Note
When you revoke large batches of leases, you can set the sync parameter to false so that the request returns without waiting for every revocation to complete.
Token leases
You can confirm that Vault cleaned up revoked token leases by listing the lookup path and noting that it no longer returns any leases.
$ vault list sys/leases/lookup/auth/$PREFIX
No value found at sys/leases/lookup/auth/$PREFIX
Metrics
Besides exploring the Vault server logs for indications of lease revocation issues, there are other key Vault telemetry metrics related to the expiration manager that you can monitor and alert on.
Metric | Description | Unit | Type |
---|---|---|---|
vault.expire.fetch-lease-times | Time taken to fetch lease times | ms | summary |
vault.expire.fetch-lease-times-by-token | Time taken to fetch lease times by token | ms | summary |
vault.expire.num_leases | Number of all leases which are eligible for eventual expiry | leases | gauge |
vault.expire.leases.by_expiration (cluster,gauge,expiring,namespace) | Number of leases set to expire, grouped by a time interval. This time interval and total number of time intervals are configurable via lease_metrics_epsilon and num_lease_metrics_buckets in the telemetry stanza of a vault server configuration. The default values for these are 1hr and 168 respectively, so the metric will report the number of leases that will expire each hour from the current time to a week from the current time. One can also group lease expiration by namespace by setting add_lease_metrics_namespace_labels to true in the configuration file (default is false). | leases | gauge |
vault.expire.lease_expiration | Count of lease expirations | leases | counter |
vault.expire.job_manager.total_jobs | Total pending revocation jobs | leases | sample |
vault.expire.job_manager.queue_length | Total pending revocation jobs by auth method | leases | sample |
vault.expire.lease_expiration.time_in_queue | Time taken for lease to get to the front of the revoke queue | ms | summary |
vault.expire.lease_expiration.error | Count of lease expiration errors | errors | counter |
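One way to spot-check these metrics from a shell is to filter Prometheus-formatted output for a single metric name. The helper below is a sketch under several assumptions: that your server exposes metrics in Prometheus text format, that metric names appear with dots replaced by underscores (for example vault_expire_num_leases), and that each sample line ends with its value and carries no trailing timestamp. The function name metric_values is hypothetical.

```shell
# Read Prometheus text-format metrics on stdin and print the value of every
# sample whose name is exactly $1, with or without a {label="..."} set.
metric_values() {
  awk -v name="$1" '$1 == name || index($1, name "{") == 1 { print $NF }'
}
```

Assuming the metrics endpoint is enabled, usage might look like `curl --silent --header "X-Vault-Token: $VAULT_TOKEN" "$VAULT_ADDR/v1/sys/metrics?format=prometheus" | metric_values vault_expire_num_leases`.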
Cleanup
Follow these steps to clean up your example scenario environment.
Stop the PostgreSQL and Vault containers.
$ docker stop learn-postgres learn-vault
Remove the Docker network.
$ docker network rm learn-vault
Summary
You learned about the Vault expiration manager and lease handling behavior along with how to identify irrevocable leases, and resolve issues with them.
You also learned about some key Vault telemetry metrics related to the expiration manager and lease handling, which you can monitor and alert on.