TTL as a Service: Automatic Revocation of Stale Privileges
Aaron Loo, Engineering Manager
Nov 19, 2018
Security and usability are often at odds with one another, a fact that is best illustrated by access control. Deny everyone, and you’ll have a super secure system that no one can use; allow everyone, and you’ll maximize usability at the cost of security.
The Principle of Least Privilege balances security and usability by giving users only the minimum access they need to do their job. This reduces the attack surface by preventing attackers from leveraging a compromised user’s important, albeit unused, privileges for vertical or horizontal escalation.
The Problem
That said, there are a few key reasons why least privilege is hard to enforce:
- No one asks for their access to be taken away.
  Developer velocity is important to us. As long as there’s an audit trail, we generally allow people access to the resources necessary to do their job. However, once their task is complete, be it one day or several years later, they move on and the access stays behind. This often results in people accumulating access like scout badges throughout their tenure.
  It’s much more common for people to complain about not having access than about having too much access. That’s just human nature.
- There is no singular governing system of access control.
  Enterprises are composed of many different systems, including external vendors and internal builds/hosts, and each of these may have its own access control management system. An employee’s holistic set of privileges includes their access to each one of these different systems.
  While it may technically be possible to have one single, centralized system that maps every user to everything they have access to, this can quickly become unwieldy given the level of granularity that a solid access control system should provide.
- Audits are painfully manual.
  Manual audits are a necessary burden to ensure that the state of the world is as we know it to be. However, they are also very time-consuming and become increasingly difficult to scale as the number of privileges in a company grows.
The Solution
To address this issue, we designed “TTL-as-a-Service” (Time-To-Live): a system to identify and flag users and their stale privileges. The premise is simple: if you haven’t used your access in X days, you probably don’t need it anymore and won’t notice if we take it away.
At its core, this requires two things:
- Knowledge of every time a user has used a given privilege.
- The ability to revoke access upon detecting staleness.
Embracing the UNIX philosophy of portability and minimalism, this system is designed to simply ingest logs, then perform a daily scan to process and identify stale privileges. Upon detecting staleness, it will fire off an alert to execute custom integrations or automatically generate tickets to revoke the identified user’s stale privilege.
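As a rough illustration of the staleness check itself, here is a minimal Python sketch. It is not our Splunk implementation: the function name `find_stale`, the dictionary shapes, the permission names, and the choice to fall back to the grant date for never-used privileges are all assumptions made for the example.

```python
from datetime import date, timedelta


def find_stale(grants, last_used, today, ttl_days):
    """Return sorted (user, permission) pairs not used within ttl_days.

    grants:    dict mapping (user, permission) -> date the access was granted
    last_used: dict mapping (user, permission) -> date of the most recent use
    """
    cutoff = today - timedelta(days=ttl_days)
    stale = []
    for pair, granted_on in grants.items():
        # Assumption: a privilege that was never used is measured from its
        # grant date, so a brand-new grant is not flagged immediately.
        most_recent = last_used.get(pair, granted_on)
        if most_recent < cutoff:
            stale.append(pair)
    return sorted(stale)


# Hypothetical data for illustration only.
grants = {
    ("alice", "prod-db-read"): date(2018, 1, 10),
    ("alice", "deploy"): date(2018, 1, 10),
    ("bob", "deploy"): date(2018, 11, 1),
}
last_used = {
    ("alice", "prod-db-read"): date(2018, 3, 2),
    ("alice", "deploy"): date(2018, 11, 15),
}

stale_pairs = find_stale(grants, last_used, today=date(2018, 11, 19), ttl_days=90)
print(stale_pairs)
```

Only alice’s `prod-db-read` access is flagged here: it was last used well before the 90-day cutoff, while bob’s unused `deploy` access is too recent a grant to count as stale.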
The architectural diagram below provides a clearer image of our implementation:
We use Splunk as both a log ingestor and alerting mechanism, powered by savedsearches.
- Splunk ingests logs from various sources, including our access control system and osquery. These upstream log providers are also configured to log each use of a permission.
- On a daily basis, our customized saved searches are triggered to perform two things:
  - Aggregate daily usage and store it in a summary index.
    This allows us to perform more efficient searches, since we merely need to know whether a person has used a permission on a given day, rather than every single time they used it.
  - Query the last X days to detect new stale permissions.
    This rolling report is the essence of this solution, as it automates away much of the manual effort that periodic audits would otherwise require.
- If stale permissions are detected, actions are automatically triggered. By default, this creates a JIRA ticket; however, the results can also be uploaded to an S3 bucket for downstream consumption.
- An example of downstream consumption is a batch worker for our access control system. On a daily basis, this pulls the latest changes from S3 and subsequently revokes access for the (user, permission) pairs listed.
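To make the daily aggregation step more concrete, the sketch below collapses raw usage events into one row per (day, user, permission). It is plain Python rather than a Splunk saved search, and the event field names (`timestamp`, `user`, `permission`) are assumptions for illustration, not our actual log schema.

```python
from datetime import datetime


def summarize_daily(events):
    """Collapse raw usage events into unique (day, user, permission) rows."""
    summary = set()
    for event in events:
        # Keep only the calendar day; the exact time of each use is not needed.
        day = datetime.fromisoformat(event["timestamp"]).date()
        summary.add((day.isoformat(), event["user"], event["permission"]))
    return sorted(summary)


# Hypothetical raw events for illustration only.
raw_events = [
    {"timestamp": "2018-11-18T09:00:00", "user": "alice", "permission": "deploy"},
    {"timestamp": "2018-11-18T09:05:00", "user": "alice", "permission": "deploy"},
    {"timestamp": "2018-11-18T17:30:00", "user": "alice", "permission": "deploy"},
    {"timestamp": "2018-11-18T11:00:00", "user": "bob", "permission": "prod-db-read"},
    {"timestamp": "2018-11-19T08:00:00", "user": "alice", "permission": "deploy"},
]

daily_summary = summarize_daily(raw_events)
for row in daily_summary:
    print(row)
```

Five raw events shrink to three summary rows. The rolling staleness query then only has to scan the much smaller summary instead of every individual use.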
This system can be applied to a variety of access control systems simply by feeding in their access logs and consuming the resulting alerts. These alerts can be extended through optional custom integrations that read from S3 and revoke privileges accordingly. Taken together, this allows us to assert that anyone with a given privilege has actively used it within the last X days.
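A custom integration of this kind might look roughly like the following. This is a hedged sketch: the JSON-lines report format, the function names, and the `revoke` callback are assumptions, and fetching the report from S3 is deliberately left out so the example stays self-contained.

```python
import json


def parse_revocations(report_text):
    """Parse a JSON-lines report into unique (user, permission) pairs, in order."""
    pairs = []
    seen = set()
    for line in report_text.splitlines():
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        pair = (record["user"], record["permission"])
        if pair not in seen:  # skip duplicate entries in the report
            seen.add(pair)
            pairs.append(pair)
    return pairs


def apply_revocations(pairs, revoke):
    """Call revoke(user, permission) for each pair, collecting failures."""
    failures = []
    for user, permission in pairs:
        try:
            revoke(user, permission)
        except Exception as exc:
            # One failed revocation should not block the rest of the batch.
            failures.append((user, permission, str(exc)))
    return failures


# Hypothetical report contents; a real worker would download this text first.
report = """
{"user": "alice", "permission": "prod-db-read"}
{"user": "carol", "permission": "deploy"}
{"user": "alice", "permission": "prod-db-read"}
"""

revoked = []
failures = apply_revocations(
    parse_revocations(report),
    lambda user, permission: revoked.append((user, permission)),
)
print(revoked, failures)
```

Passing the revocation function in as a callback keeps the worker independent of any particular access control system, which is the property that lets the same alert feed drive several integrations.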
Issue: Cold Start
The Cold Start issue occurs when a system has not processed enough data to make accurate judgements at the level of an individual user. In the case of least privilege, this occurs when a user is first granted new permissions, or when a permission is exercised infrequently or irregularly. With no prior knowledge of expected use cases, how do we know the right time to remove a privilege?
To address cold start issues, we leverage anomaly detection techniques and try to bootstrap our knowledge of an individual by comparing their permission usage against the rest of their team and the company as a whole. For example, we attempted to identify “unusual” permissions by aggregating a given team’s permission set. If 95% of the team has a given permission, it would suggest they need it for their job. On the flip side, if only 1% of the team has a given permission, it might suggest an anomaly that should be more closely investigated.
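The team-level comparison can be sketched as a simple prevalence count. The 95% and 1% cut-offs come from the example above; the function name, the three output buckets, and the sample team are assumptions made for illustration.

```python
from collections import Counter


def classify_team_permissions(team_permissions, common=0.95, rare=0.01):
    """Bucket each permission by the share of the team that holds it.

    team_permissions: dict mapping user -> set of permissions they hold
    """
    result = {"expected": [], "unusual": [], "unclear": []}
    team_size = len(team_permissions)
    if team_size == 0:
        return result

    holders = Counter()
    for permissions in team_permissions.values():
        holders.update(set(permissions))

    for permission, count in sorted(holders.items()):
        share = count / team_size
        if share >= common:
            result["expected"].append(permission)  # most of the team has it
        elif share <= rare:
            result["unusual"].append(permission)   # worth a closer look
        else:
            result["unclear"].append(permission)   # needs human judgement
    return result


# Hypothetical 100-person team for illustration only.
team = {}
for i in range(100):
    permissions = set()
    if i < 97:
        permissions.add("deploy")
    if i < 40:
        permissions.add("prod-db-read")
    if i == 0:
        permissions.add("billing-admin")
    team[f"user{i}"] = permissions

report = classify_team_permissions(team)
print(report)
```

The “unclear” bucket is where manager input matters most, since prevalence alone cannot say whether a permission held by 40% of a team is needed.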
With the additional assistance of on-the-ground managers to process and validate this data analysis, we’re able to answer the following questions:
- Which employees currently have privileges they should not need to do their job?
- Given these usage statistics, what seems like an appropriate upper bound for a permission to be considered stale for the entire team?
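One possible way to propose an upper bound from usage statistics is to look at the gaps between consecutive uses across a team and take a high percentile. The sketch below is a heuristic chosen for illustration, not the analysis we actually ran; the function names, the nearest-rank percentile method, and the sample dates are all assumptions.

```python
import math
from datetime import date


def usage_gaps(use_dates):
    """Days between consecutive uses for one user, given their dates of use."""
    ordered = sorted(set(use_dates))
    return [(later - earlier).days for earlier, later in zip(ordered, ordered[1:])]


def suggest_ttl_days(team_usage, percentile=95):
    """Nearest-rank percentile of all gaps seen across the team, or None.

    team_usage: dict mapping user -> list of dates a permission was used
    """
    gaps = sorted(gap for dates in team_usage.values() for gap in usage_gaps(dates))
    if not gaps:
        return None  # not enough history: the cold start case
    rank = math.ceil(percentile / 100 * len(gaps))
    return gaps[max(rank, 1) - 1]


# Hypothetical usage history for a single permission.
team_usage = {
    "alice": [date(2018, 1, 1), date(2018, 1, 8), date(2018, 2, 7)],
    "bob": [date(2018, 3, 1), date(2018, 3, 15), date(2018, 6, 13)],
}

suggested = suggest_ttl_days(team_usage)
print(suggested)
```

A number produced this way is only a starting point for the conversation with the team’s manager, who can tell whether the longest gaps reflect a genuine periodic need.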
Though we’re unable to completely avoid manual processing, this solution has helped ensure that we only have to do it once. As a potential future improvement, we could also train a machine learning (ML) model to improve on our statistical analysis.
Issue: Other Edge Cases
No project implementation is complete without a few hiccups along the way. Some edge cases to consider include:
- Emergency-only Privileges
  There are certain privileges that are only used in an emergency or in rare, time-sensitive situations. By definition, these will be flagged by the system as “stale,” yet removing them may be inadvisable if re-acquiring them adds overhead at the moment they are actually needed.
  However, this varies from case to case: it depends on how easily a permission can be acquired in your system when necessary.
- Periodic Usage
  Some activities are only done periodically, e.g., once a quarter. The gap between uses may therefore exceed the X days configured for your staleness definition. Depending on your implementation, you can either revoke anyway (requiring the user to request the permission again every period) or create an exception for these privileges.
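Both edge cases can be handled with a small per-permission policy layered over the default threshold. The following is a minimal sketch: the policy format, the permission names, and the 90-day default are hypothetical and chosen only to illustrate exemptions and longer windows.

```python
from datetime import date

DEFAULT_TTL_DAYS = 90  # assumed default staleness window

# Hypothetical policy: exempt emergency access, widen the window for
# quarterly tasks, and let everything else use the default.
POLICY = {
    "break-glass-root": {"exempt": True},
    "quarterly-close": {"ttl_days": 120},
}


def is_revocable(permission, last_used, today, policy=POLICY,
                 default_ttl=DEFAULT_TTL_DAYS):
    """Return True if the permission is stale and not covered by an exception."""
    rule = policy.get(permission, {})
    if rule.get("exempt", False):
        return False  # emergency-only access is never auto-revoked
    ttl_days = rule.get("ttl_days", default_ttl)
    return (today - last_used).days > ttl_days


today = date(2018, 11, 19)
last_used = date(2018, 8, 1)  # 110 days before `today`

for name in ("deploy", "quarterly-close", "break-glass-root"):
    print(name, is_revocable(name, last_used, today))
```

With 110 days of non-use, only `deploy` is revocable; the quarterly permission is still inside its longer window and the emergency permission is exempt.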
In general, we found that a smooth, auditable process to quickly and securely reinstate an employee’s privileges was incredibly helpful, allowing us to be more aggressive in revoking privileges. For example, if it only takes a couple of hours to restore privileges that were revoked after ninety days of non-use, people are more willing to give up stale access.
Takeaways
The ability to quickly provision user access is important, especially for a high-growth company. In the same way, it’s important to be able to quickly deprovision users when access is no longer needed. Unfortunately, the latter is a lot harder to manage and scale.
Using this system, we’re able to identify and subsequently revoke stale privileges without hindering developer velocity. This allows us to confidently assert that users will not hold unused access for longer than X days, thereby systematically enforcing least privilege with minimal manual effort.
Finally, through the process of building and rolling this out, we learned that it is also beneficial to have a smooth, speedy process in place for restoring revoked privileges, as it will reduce friction when trying to establish this new process.
Contributors
I would like to credit the following people (in alphabetical order) for their hard work in building this system and in continuing to bolster Yelp’s security.
Security Engineering at Yelp
Want to build automated systems to reduce manual effort, and help keep the Yelps secure? Apply to join!