Azure Data Lake Store: a hyperscale distributed file service for big data analytics | the morning paper
Azure data lake store: a hyperscale distributed file service for big data analytics Douceur et al., SIGMOD’17
Today’s paper takes us inside Microsoft Azure’s distributed file service called the Azure Data Lake Store (ADLS). ADLS is the successor to an internal file system called Cosmos, and marries Cosmos semantics with HDFS, supporting both Cosmos and Hadoop workloads. Microsoft are in the process of migrating all Cosmos data and workloads onto ADLS.
Virtually all groups across the company, including Ad platforms, Bing, Halo, Office, Skype, Windows and XBOX, store many exabytes of heterogenous data in Cosmos, doing everything from exploratory analysis and stream processing to production workflows.
ADLS is not just used internally of course, it’s a part of the Azure cloud offerings, complementing Azure Data Lake Analytics.
ADLS is the first public PaaS cloud service that is designed to support full filesystem functionality at extreme scale… The largest Hadoop clusters that we are aware of are about 5K nodes; Cosmos clusters exceed 50K nodes each ; individual jobs can execute over more than 10K nodes. Every day, we process several hundred petabytes of data, and deliver tens of millions of compute hours to thousands of internal users.
Several aspects of the design stand out as noteworthy to me:
- ADLS stores files across multiple storage tiers with support for partial overlapping
- ADLS is architected as a collection of microservices, which themselves need to be scalable, highly available, have low-latency, and be strongly consistent. “The approach we have taken to solve the hard scalability problem for metadata management differs from typical filesystems in its deep integration of relational database and distributed systems technologies.”
- ADLS is designed ground up for security.
- ADLS has a fast path for small appends
Let’s go into each of these areas in more detail.
The structure of a file in ADLS
An ADFS file is referred to by URL and comprises a sequence of extents, each of which is in turn a sequence of blocks. Extents are the units of locality, blocks are the units of append atomicity and parallelism. All extents but the last one are sealed. Only an unsealed last extent may be appended to.
The notion of tiered storage is integral to ADLS. Any part of a file can be in one or more of several storage tiers.
In general, the design supports local tiers (including local SSD and HDD tiers), whose data is distributed across ADLS nodes for easy access during job computation, and remote tiers, whose data is stored outside the ADLS cluster.