Moving an Elephant: Large Scale Hadoop Data Migration at Facebook
Users share billions of pieces of content daily on Facebook, and it’s the data infrastructure team’s job to analyze that data so we can present it to those users and their friends in the quickest and most relevant manner. This requires a lot of infrastructure and supporting data, so much so that we need to move that data periodically to ever larger data centers. Just last month, the data infrastructure team finished our largest data migration ever – moving dozens of petabytes of data from one data center to another.
During the past two years, the number of shared items has grown exponentially, and the corresponding requirements for the analytics data warehouse have increased as well. As the majority of the analytics is performed with Hive, we store the data on HDFS — the Hadoop distributed file system. In 2010, Facebook had the largest Hadoop cluster in the world, with over 20 PB of storage. By March 2011, the cluster had grown to 30 PB — that’s 3,000 times the size of the Library of Congress! At that point, we had run out of power and space to add more nodes, necessitating the move to a larger data center.
Move the Boxes or Mirror the Data?
The scale of this migration exceeded all previous ones, and we considered a couple of different migration strategies. One was a physical move of the machines to the new data center. We could have moved all the machines within a few days with enough hands at the job. However, this was not a viable option as our users and analysts depend on the data 24/7, and the downtime would be too long.
Another approach was to set up a replication system that mirrors changes from the old cluster to the new, larger cluster. Then at the switchover time, we could simply redirect everything to the new cluster. This approach is more complex as the source is a live file system, with files being created and deleted continuously. Due to the unprecedented cluster size, a new replication system that could handle the load would need to be developed. However, because replication minimizes downtime, it was the approach that we decided to use for this massive migration.
Replication It Is
Once the required systems were developed, the replication approach was executed in two steps. First, a bulk copy transferred most of the data from the source cluster to the destination. Most of the directories were copied via DistCp — an application shipped with Hadoop that uses a MapReduce job to copy files in parallel. Our Hadoop engineers made code and configuration changes to handle special cases with Facebook’s dataset, including the ability for multiple mappers to copy a single large file, and for the proper handling of directories with many small files. After the bulk copy was done, file changes after the start of the bulk copy were copied over to the destination cluster through the new replication system. File changes were detected through a custom Hive plug-in that recorded the changes to an audit log. The replication system continuously polled the audit log and copied modified files so that the destination would never be more than a couple of hours behind. The plug-in recorded Hive metadata changes as well, so that metadata modifications such as the last accessed time of Hive tables and partitions were propagated. Both the plug-in and the replication system were developed in-house by members of the Hive team.
- Follow this posting on Facebook…
- Find other postings from Facebook…