Best Practices for Selecting Apache Hadoop Hardware
We get asked a lot of questions about how to select Apache Hadoop worker node hardware. During my time at Yahoo!, we bought a lot of nodes with 6*2TB SATA drives, 24GB RAM, and 8 cores in a dual-socket configuration. This has proven to be a pretty good configuration. This year, I've seen systems with 12*2TB SATA drives, 48GB RAM, and 8 cores in dual-socket configurations, and I expect a move to 3TB drives before the year is out.
Which configuration makes sense for a given organization depends on factors such as the storage-to-compute ratio of its workload, and those questions cannot be answered in a generic way. Further, the hardware industry moves quickly. In this post I'll try to outline the principles that have generally guided Hadoop hardware configuration choices over the last six years. All of these thoughts are aimed at designing medium to large Apache Hadoop clusters; Scott Carey made a good case the other day on the Apache mailing list for smaller machines in small clusters.
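To make the storage-to-compute trade-off concrete, here is a minimal back-of-the-envelope sizing sketch in Python. The replication factor of 3 is the HDFS default; the temporary-space reserve, node specs, and dataset size are hypothetical placeholders you would swap for your own numbers.

    import math

    REPLICATION = 3       # HDFS default block replication factor
    TEMP_RESERVE = 0.25   # fraction of raw disk held back for MapReduce temp/spill space (assumption)

    def nodes_needed(dataset_tb, drives_per_node, tb_per_drive):
        """Rough count of worker nodes required to hold dataset_tb of unique data."""
        raw_per_node = drives_per_node * tb_per_drive        # e.g. 12 drives * 2 TB = 24 TB raw
        usable_per_node = raw_per_node * (1 - TEMP_RESERVE)  # space actually available for HDFS blocks
        unique_per_node = usable_per_node / REPLICATION      # unique data per node after 3x replication
        return math.ceil(dataset_tb / unique_per_node)

    # Example: 600 TB of unique data on nodes with 12 x 2 TB drives -> about 100 nodes
    print(nodes_needed(600, 12, 2))

Running the same arithmetic in the other direction tells you how much unique data a planned cluster can hold, which is usually the first number to pin down before worrying about cores and RAM.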
The key for Hadoop clusters is to buy quality commodity equipment. Most Hadoop purchasers are cost-conscious, and as clusters grow their cost becomes significant. When thinking about cost, consider the whole system, including network, power, and the extra components bundled into many high-end servers. Remember that Hadoop is built to handle component failure well and to scale out on low-cost gear: RAID cards, redundant power supplies, and other per-component reliability features are not needed. Do buy error-correcting (ECC) RAM and SATA drives with good MTBF numbers. ECC RAM lets you trust the results of your computations, and hard drives are the largest source of failures, so buy decent ones.
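As a concrete illustration of running without RAID, HDFS DataNodes are normally pointed at each drive's mount point directly (JBOD) in hdfs-site.xml, so a failed disk costs only that disk's blocks rather than the whole node. The property below is the standard dfs.datanode.data.dir (dfs.data.dir in older releases); the /grid/* mount paths are just example placeholders.

    <!-- hdfs-site.xml: one data directory per physical drive, no RAID (paths are examples) -->
    <property>
      <name>dfs.datanode.data.dir</name>
      <value>/grid/0/hdfs/data,/grid/1/hdfs/data,/grid/2/hdfs/data,/grid/3/hdfs/data,/grid/4/hdfs/data,/grid/5/hdfs/data</value>
    </property>

With each drive listed separately, HDFS spreads blocks across them, and later releases can even keep a DataNode running after individual drive failures (dfs.datanode.failed.volumes.tolerated), which is exactly the failure model that makes per-component RAID redundancy unnecessary.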