Introduction: The Invisible System Behind Every “Share” Button
Every day, we click “Share” on a Google Doc, grant access to a private photo album, or set a YouTube video to “unlisted.” We implicitly trust that the right people—and only the right people—will see our content. Behind this simple act of trust is a colossal engineering effort. At Google, this trust is underwritten by a single, unified authorization system called Zanzibar.
Zanzibar is the global system that powers permissions for hundreds of Google services, including Drive, Photos, Cloud, Calendar, Maps, and YouTube. It handles trillions of access control lists with millisecond latency and extreme reliability. This article explores four of the most surprising and impactful engineering takeaways from how this planet-scale system operates.
——————————————————————————–
Takeaway 1: A Scale That’s Hard to Comprehend
The sheer magnitude of Zanzibar’s operational load is staggering. It’s one thing to build a permissions system for a single application; it’s another to build one that serves the entire globe for hundreds of diverse products. The numbers alone explain why a dedicated, unified system is not just a good idea, but an absolute necessity.
- Access Control Lists: Zanzibar stores more than two trillion access control lists (ACLs).
- Request Volume: It serves more than 10 million client queries per second (QPS), which in turn generates a staggering 22 million internal “delegated” RPCs per second as checks fan out across the system.
- Global Infrastructure: The system is deployed across more than 10,000 servers in dozens of clusters around the world.
- Data Footprint: It manages close to 100 terabytes of permission data (called “relation tuples”), which is fully replicated in more than 30 global locations.
This immense scale is not just a vanity metric; it is the direct cause of extreme engineering challenges, from ensuring data consistency across the globe to fighting off performance bottlenecks at every layer.
Takeaway 2: Solving the “New Enemy” Problem
Zanzibar provides a critical guarantee of consistency to prevent subtle but dangerous permission errors. The most important of these is what the system’s designers call the “new enemy” problem.
Imagine this scenario: Alice removes Bob from a shared document’s access list. A moment later, she adds new, sensitive information to that same document. A system with weak consistency might process the content update first and check permissions against a stale access list from before Bob was removed. The result? Bob, the “new enemy,” could mistakenly gain access to the sensitive content.
Zanzibar’s elegant solution is an opaque token called a zookie. A zookie is a small piece of data that encodes a precise, globally consistent timestamp. This is made possible by storing ACLs in Spanner, Google’s globally distributed database, whose TrueTime mechanism provides the causally meaningful timestamps that Zanzibar encodes into each zookie. When content such as a document is updated, the client application requests a new zookie from Zanzibar and persists it alongside the content change in a single atomic write to its own storage. When a user later tries to access that content, the client sends the zookie along with the permission check request.
This enables Zanzibar’s “at-least-as-fresh” guarantee. The system ensures that the permission check is performed using ACL data that is at least as recent as the content being accessed. This perfectly respects the causal order of events—the ACL removal is always seen before the new content—and slams the door on the “new enemy” problem.
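The protocol above can be sketched as a toy model. This is illustrative only, not Zanzibar’s real API: integer versions stand in for TrueTime timestamps, and the names (`write_acl`, `content_zookie`, `check`) are invented for the sketch. The key rule is that a check may only be answered from a snapshot at least as fresh as the content’s zookie.

```python
import itertools

class ToyZanzibar:
    """Toy model of the zookie protocol; versions stand in for
    TrueTime timestamps. Names are illustrative, not the real API."""

    def __init__(self):
        self._clock = itertools.count(1)
        self._log = []  # entries: (version, (obj, rel, user), present)

    def write_acl(self, obj, rel, user, present=True):
        v = next(self._clock)
        self._log.append((v, (obj, rel, user), present))
        return v

    def content_zookie(self):
        # Issued when the client writes new content; the client stores
        # it atomically alongside that content.
        return next(self._clock)

    def check(self, obj, rel, user, zookie, snapshot):
        # The at-least-as-fresh rule: a replica may only answer from a
        # snapshot at least as fresh as the content's zookie.
        if snapshot < zookie:
            raise ValueError("snapshot too stale for this zookie")
        present = False
        for v, tup, p in self._log:
            if v <= snapshot and tup == (obj, rel, user):
                present = p  # later entries override earlier ones
        return present

# The "new enemy" scenario:
z = ToyZanzibar()
z.write_acl("doc:plan", "viewer", "user:bob")         # Bob can view
z.write_acl("doc:plan", "viewer", "user:bob", False)  # Alice removes Bob
zk = z.content_zookie()                               # Alice adds secrets

# Any snapshot Zanzibar is allowed to consult already reflects
# Bob's removal, so the "new enemy" never sees the new content:
assert z.check("doc:plan", "viewer", "user:bob", zk, snapshot=zk) is False
```

A snapshot older than the zookie (one taken before Bob’s removal) is simply refused, which is exactly what closes the window the “new enemy” would otherwise slip through.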
Crucially, preventing the “new enemy” problem requires Zanzibar to respect causal ordering even when it spans updates to different ACLs or objects, and even when the causality is established through channels invisible to Zanzibar, such as one user telling another out of band that new content is ready.
Takeaway 3: Waging a Constant War on “Hot Spots”
Zanzibar is relentlessly optimized to handle performance bottlenecks, or “hot spots.” A hot spot occurs when a single piece of data is needed for a massive number of simultaneous permission checks. This could be the membership list for a very large Google Group or the permissions for a viral YouTube video.
This can lead to a “cache stampede,” where thousands of concurrent requests for the same uncached permission data all hit the database at once, potentially overwhelming it. To defend against this, Zanzibar employs a multi-layered defense system:
- Distributed Caching: Cache entries are spread across thousands of servers using consistent hashing, so no single server bears the full load.
- Request Deduplication: A lock table tracks requests that are already in flight. If 1,000 requests for the same permission arrive at once, only one is actually sent to the database; the other 999 wait for the result.
- Timestamp Quantization: Instead of evaluating requests at microsecond-level timestamps, Zanzibar rounds timestamps to a coarser granularity (e.g., one or ten seconds). This allows vast numbers of recent requests to share the exact same cached results, dramatically increasing cache hit rates.
- Dynamic Prefetching: For extremely popular objects, Zanzibar’s monitoring detects the hot spot and proactively reads and caches all relation tuples of ⟨object#relation⟩ for the hot object, trading a temporary burst of database reads for much higher, sustained cache performance.
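Two of these defenses, the lock table and timestamp quantization, can be sketched together. This is an illustrative toy, not Zanzibar’s actual implementation: requests in the same time bucket share one cache key, and only the first request for a key actually reaches the backend, while the rest wait on it.

```python
import threading
import time

QUANTUM_S = 10  # round evaluation timestamps into 10-second buckets

class LockTable:
    """Toy lock table: deduplicates concurrent reads for the same
    (object#relation, time-bucket) key."""

    def __init__(self, backend_read):
        self._backend_read = backend_read
        self._lock = threading.Lock()
        self._inflight = {}   # key -> threading.Event for the in-flight read
        self._cache = {}      # key -> cached result
        self.backend_calls = 0

    def read(self, obj_relation, eval_ts):
        # Quantization: all requests in the same bucket share a key,
        # so they can share one cached result.
        key = (obj_relation, int(eval_ts // QUANTUM_S))
        with self._lock:
            if key in self._cache:
                return self._cache[key]
            done = self._inflight.get(key)
            if done is None:
                # First request for this key becomes the leader.
                done = self._inflight[key] = threading.Event()
                leader = True
            else:
                leader = False
        if leader:
            result = self._backend_read(obj_relation)
            with self._lock:
                self.backend_calls += 1
                self._cache[key] = result
                del self._inflight[key]
            done.set()
            return result
        done.wait()            # followers wait instead of hitting the DB
        with self._lock:
            return self._cache[key]

def slow_db_read(obj_relation):
    time.sleep(0.05)  # simulate a database round trip
    return {"user:10", "user:11"}

table = LockTable(slow_db_read)
now = time.time()
threads = [threading.Thread(target=table.read,
                            args=("doc:readme#viewer", now))
           for _ in range(100)]
for t in threads: t.start()
for t in threads: t.join()
# 100 concurrent requests, but only one backend read was issued.
```

In a real deployment the cache would also be spread across servers via consistent hashing and entries would expire, but the stampede-suppression idea is the same.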
While the cache hit rates may seem modest (e.g., around 10% for checks on the delegate side), the paper reports that these mechanisms together eliminate roughly 500,000 internal RPCs per second that would otherwise fuel hot spots, with the lock table deduplicating a further 12% of requests.
Takeaway 4: Building Infinite Flexibility from a Simple Lego Brick
Zanzibar’s data model is both profoundly simple and incredibly flexible. The entire system, with its trillions of entries, is built on a single concept: the relation tuple. Every permission in Zanzibar is stored in a simple structure that can be represented as:
object#relation@user
This one structure is used for everything. A user being a viewer of a document might be doc:readme#viewer@user:10. A user being a member of a group might be group:eng#member@user:11.
Critically, the user can be a user id or another userset, which is itself a complete object#relation pair. This is how Zanzibar unifies the concepts of direct permissions and groups. A tuple like doc:readme#viewer@group:eng#member means that any user who is a member of group:eng is also a viewer of doc:readme.
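This unification makes permission checking naturally recursive: when a tuple’s user field is a userset, the check simply recurses into it. A minimal sketch (with an in-memory tuple set standing in for Zanzibar’s Spanner storage; the function name is invented here):

```python
# Each relation tuple is (object, relation, user), where user is either
# a user id like "user:10" or a userset like "group:eng#member".
tuples = {
    ("doc:readme", "viewer", "group:eng#member"),  # group-based grant
    ("group:eng", "member", "user:11"),            # group membership
    ("doc:readme", "viewer", "user:10"),           # direct grant
}

def check(obj, relation, user):
    for (t_obj, t_rel, t_user) in tuples:
        if (t_obj, t_rel) != (obj, relation):
            continue
        if t_user == user:            # direct match
            return True
        if "#" in t_user:             # userset: recurse into it
            ref_obj, ref_rel = t_user.split("#")
            if check(ref_obj, ref_rel, user):
                return True
    return False

assert check("doc:readme", "viewer", "user:10")  # direct grant
assert check("doc:readme", "viewer", "user:11")  # via group:eng#member
assert not check("doc:readme", "viewer", "user:12")
```

Nested groups fall out for free: a tuple granting `group:eng#member` to `group:eng-leads#member` would make every lead a member of eng, and the same recursion resolves it.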
This simple model is made even more powerful through configurable userset rewrite rules. Instead of changing the core system, clients can define their own complex, application-specific logic. For example, a client can configure a rule stating that anyone who is an editor of a document is automatically a viewer, or that a document automatically inherits the viewer list from its parent folder. This is a brilliant design choice: it keeps the core authorization engine simple and fast, while empowering clients to build rich and complex access control policies on top of it.
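A rewrite rule like “editors are automatically viewers” can be sketched as configuration layered on top of stored tuples. The paper expresses these rules in a protobuf-like namespace config (as unions of `computed_userset` and similar operations); the Python form below is a loose illustration, with all names invented for the sketch.

```python
# Rewrite rules per (namespace, relation): the listed relations'
# members are folded into the relation being checked. Here: anyone
# with "editor" on a doc implicitly has "viewer" as well.
REWRITES = {
    ("doc", "viewer"): ["editor"],  # roughly a computed_userset rule
}

tuples = {
    ("doc:readme", "editor", "user:42"),
}

def check_with_rewrites(obj, relation, user):
    # 1. Stored tuples for this exact relation.
    if (obj, relation, user) in tuples:
        return True
    # 2. Rewrite rules: check each relation folded into this one.
    namespace = obj.split(":")[0]
    for implied_from in REWRITES.get((namespace, relation), []):
        if check_with_rewrites(obj, implied_from, user):
            return True
    return False

assert check_with_rewrites("doc:readme", "viewer", "user:42")
```

Folder inheritance works the same way: a rule can say “also check the `viewer` relation on whatever object this document’s `parent` tuple points to,” all without the core engine knowing anything about documents or folders.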
——————————————————————————–
Conclusion: The Hidden Foundation of Digital Trust
Zanzibar is a masterclass in distributed systems engineering. Its design reveals a relentless focus on the fundamentals: operating at an unbelievable scale, providing bulletproof consistency guarantees with its zookie protocol, waging a constant war on performance hot spots, and achieving incredible flexibility from a simple, elegant data model. It is the hidden foundation that makes digital sharing and collaboration work safely for billions of users.
What other critical, “invisible” infrastructure do you rely on every day without a second thought?