How AWS S3 is built - The Pragmatic Engineer Recap
Podcast: The Pragmatic Engineer
Published: 2026-01-21
Duration: 1 hr 18 min
Summary
In this episode, Mai-Lan, VP of Data and Analytics at AWS, dives into the massive scale and engineering complexity of AWS S3, detailing its evolution from an eventually consistent to a strongly consistent storage service. The conversation highlights the sheer volume of data S3 handles and its foundational architecture.
What Happened
Mai-Lan begins by illustrating the incredible scale of AWS S3: it currently holds over 500 trillion objects and processes over a quadrillion requests each year. This vast infrastructure is built on tens of millions of hard drives, which, if stacked, would reach the International Space Station. That scale is difficult to comprehend, even for customers whose own data lakes can contain exabytes of information.
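To put "a quadrillion requests a year" into more familiar terms, a quick back-of-the-envelope conversion (assuming 10^15 requests spread evenly across a 365-day year) gives the average request rate:

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # ~31.5 million seconds

requests_per_year = 1_000_000_000_000_000  # one quadrillion (10^15)
requests_per_second = requests_per_year / SECONDS_PER_YEAR

print(f"{requests_per_second:,.0f} requests/second on average")
# roughly 31.7 million requests per second, averaged over the year
```

The real peak rate would be higher still, since traffic is never perfectly even.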
The discussion shifts to the origins of S3. Work began in 2005, ahead of the 2006 launch, as engineers sought to create a reliable storage service for unstructured data like PDFs and images. Initially designed with an eventual consistency model to optimize for durability and availability, S3 served the needs of early cloud adopters, including companies like Netflix and Pinterest. As the landscape evolved, so did S3's architecture: it transitioned to a strongly consistent model, enabling businesses to build extensive data lakes and adopt new analytics frameworks like Apache Iceberg, which allows for decentralized analytics architectures.
Key Insights
- AWS S3 holds over 500 trillion objects and hundreds of exabytes of data.
- The initial design of S3 prioritized durability and availability with eventual consistency.
- The shift to strongly consistent storage marked a significant evolution in S3's architecture.
- Apache Iceberg has become crucial for customers seeking flexible analytics solutions.
Key Questions Answered
What is the current scale of AWS S3?
Mai-Lan shares that AWS S3 currently holds over 500 trillion objects, storing hundreds of exabytes of data, and processes over a quadrillion requests annually. The infrastructure behind those numbers runs on tens of millions of hard drives, a scale of storage that many customers may not fully grasp.
How did S3 evolve from eventual consistency to strong consistency?
Initially, S3 was designed around eventual consistency: a newly written or overwritten object might not be immediately visible to subsequent reads. This model worked well for early users, particularly e-commerce platforms, where a brief delay was acceptable. As customer needs evolved, however, S3 transitioned to a strongly consistent model to better support complex data operations and analytics.
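The difference can be illustrated with a toy in-memory model (not AWS code): in an eventually consistent store, a read served by a replica that has not yet received the latest write can return stale data, while a strongly consistent store guarantees every read reflects the last acknowledged write.

```python
import random

class EventuallyConsistentStore:
    """Toy model: a write is acknowledged once it reaches one replica,
    then propagates to the others in the background."""
    def __init__(self, num_replicas=3):
        self.replicas = [{} for _ in range(num_replicas)]

    def put(self, key, value):
        self.replicas[0][key] = value  # acknowledged after one replica

    def replicate(self):
        # Background propagation of replica 0's state to the others.
        for replica in self.replicas[1:]:
            replica.update(self.replicas[0])

    def get(self, key):
        # A read may hit any replica, so it can miss a recent write.
        return random.choice(self.replicas).get(key)

class StronglyConsistentStore:
    """Toy model: every read reflects the last acknowledged write."""
    def __init__(self):
        self.data = {}

    def put(self, key, value):
        self.data[key] = value

    def get(self, key):
        return self.data.get(key)

ec = EventuallyConsistentStore()
ec.put("report.pdf", b"v1")
# Before replication runs, ec.get() may return None instead of b"v1".

sc = StronglyConsistentStore()
sc.put("report.pdf", b"v1")
assert sc.get("report.pdf") == b"v1"  # always visible immediately
```

Real S3 achieves strong consistency across a distributed system, which is far harder than this single-process sketch suggests; the point is only what the guarantee means for a reader.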
What role does Apache Iceberg play in S3's architecture?
Apache Iceberg has become integral for customers using S3, allowing them to implement decentralized analytics architectures. This open table format gives businesses the flexibility to choose among analytics engines while preserving compatibility, ultimately improving their ability to manage and analyze large datasets effectively.
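The mechanism that lets multiple engines safely share one table is Iceberg's use of immutable snapshots plus an atomic swap of a metadata pointer on commit. A simplified Python sketch of that idea (not the pyiceberg API; all names here are illustrative):

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Snapshot:
    """An immutable list of the data files making up one table version."""
    snapshot_id: int
    data_files: tuple

@dataclass
class IcebergLikeTable:
    """Readers see whichever snapshot the pointer referenced when they started."""
    current: Snapshot = field(default_factory=lambda: Snapshot(0, ()))

    def append(self, new_files):
        # A commit builds a new immutable snapshot, then atomically
        # swaps the table's metadata pointer to it.
        self.current = Snapshot(
            self.current.snapshot_id + 1,
            self.current.data_files + tuple(new_files),
        )

table = IcebergLikeTable()
reader_view = table.current          # a reader pins the current snapshot
table.append(["s3://bucket/file-1.parquet"])
# The reader's pinned view is unaffected by the concurrent commit.
assert reader_view.data_files == ()
assert table.current.data_files == ("s3://bucket/file-1.parquet",)
```

Because old snapshots stay immutable, any engine that understands the format can read a consistent table version while another engine commits new data.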
What was the initial purpose behind building AWS S3?
AWS S3 was developed to provide a cost-effective and reliable storage solution for unstructured data types that were prevalent in early cloud usage. Launched in 2006, the service aimed to allow engineers to store data without worrying about the underlying infrastructure, which was essential for the growth of applications like e-commerce websites.
What are the latest features introduced in S3?
In December 2024, AWS introduced S3 Tables, enhancing the service's support for managing tabular data. Alongside this, the launch of S3 Vectors in July and the addition of over 15 new features in the same period show AWS's commitment to evolving S3 to meet customer demands and technological advances.