How-to: Improve Apache HBase Performance via Data Serialization with Apache Avro – Cloudera Engineering Blog
Taking a thoughtful approach to data serialization can yield significant performance improvements for HBase deployments.
The question of using tall versus wide tables in Apache HBase is a commonly discussed design pattern (see reference here and here). However, there are more considerations than that simple choice. Because HBase stores each column value as an individual key-value pair (cell) in the underlying HFiles, storing small pieces of information can incur significant storage overhead. For example, storing a single Boolean column can write 35 or more bytes to disk, depending on the row-key length, column family name, qualifier, and so on. This overhead can become quite significant for overall I/O and network utilization, especially when multiple columns are read and written together in a single operation.
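To see where a figure like 35 bytes comes from, consider a simplified sketch of HBase's on-disk KeyValue layout: each cell carries fixed framing (key and value lengths, row length, family length, timestamp, and key type) plus the row key, family, and qualifier bytes, on top of the actual value. The sizes below reflect the standard KeyValue format; the example row key, family, and qualifier names are hypothetical.

```java
// Simplified sketch of the per-cell on-disk cost of an HBase KeyValue.
public class CellOverhead {
    // Fixed framing: key length (4) + value length (4) + row length (2)
    // + family length (1) + timestamp (8) + key type (1) = 20 bytes.
    static final int FIXED = 4 + 4 + 2 + 1 + 8 + 1;

    static int cellSize(int rowKeyLen, int familyLen, int qualifierLen, int valueLen) {
        // Total bytes for one cell, before any block compression.
        return FIXED + rowKeyLen + familyLen + qualifierLen + valueLen;
    }

    public static void main(String[] args) {
        // Hypothetical example: row key "user123" (7 bytes), family "d" (1 byte),
        // qualifier "active" (6 bytes), one-byte Boolean value.
        System.out.println(cellSize(7, 1, 6, 1)); // prints 35
    }
}
```

In other words, a one-byte payload pays roughly 34 bytes of per-cell framing here, which is exactly the overhead that packing several columns into one serialized value amortizes.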
A simple solution is to serialize data that is accessed together: that is, serialize multiple columns into a single value and store it in one HBase column. In this post, I'll describe an implementation of this concept by Cloudera's HBase team and the performance improvements it achieved in our testing.
First, let's consider serialization. To be as efficient as possible in our implementation, we used Apache Avro to serialize the data. Avro uses schemas, which must be presented when reading and writing data. This approach permits each datum to be written with no per-value overhead, making serialization fast and the serialized output compact. (For more information about using Avro, see "Avro Usage" in the CDH documentation.)
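As a concrete (hypothetical) illustration, several small columns that are always accessed together could be described by one Avro record schema; the record name and fields here are invented for the example:

```json
{
  "type": "record",
  "name": "UserFlags",
  "namespace": "com.example",
  "fields": [
    {"name": "active",     "type": "boolean"},
    {"name": "verified",   "type": "boolean"},
    {"name": "loginCount", "type": "int"}
  ]
}
```

Because both reader and writer hold this schema, Avro's binary encoding stores only the bare field values: each boolean as a single byte and the int as a variable-length zig-zag integer, with no field names or type tags in the output. The resulting byte array can then be stored in a single HBase column instead of three separate cells.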