Introducing Apache Arrow: A Fast, Interoperable In-Memory Columnar Data Structure Standard – Cloudera Engineering Blog
Engineers from across the Apache Hadoop community are collaborating to establish Arrow as a de-facto standard for columnar in-memory processing and interchange. Here’s how it works.
Apache Arrow is an in-memory data structure specification for use by engineers building data systems. It has several key benefits:
- A columnar memory-layout permitting O(1) random access. The layout is highly cache-efficient in analytics workloads and permits SIMD optimizations with modern processors. Developers can create very fast algorithms which process Arrow data structures.
- Efficient and fast data interchange between systems without the serialization costs associated with other systems like Thrift, Avro, and Protocol Buffers.
- A flexible structured data model supporting complex types that handles flat tables as well as real-world JSON-like data engineering workloads.
Arrow isn’t a standalone piece of software but rather a component used to accelerate analytics within a particular system and to allow Arrow-enabled systems to exchange data with low overhead. It is sufficiently flexible to support most complex data models.
For the Python and R communities, Arrow is extremely important, as data interoperability has been one of the biggest roadblocks to tighter integration with big data systems (which largely run on the JVM).
In this post, we’ll explain some of the motivations for Arrow, how Cloudera will use it in its projects, and how the open source community can get involved to create open source analytical software that is both faster and more interoperable.
Moving Data Efficiently Between Systems
In the Apache Hadoop ecosystem, the data is often stored in one of the following ways:
- In HDFS in a binary format like Apache Parquet or a text format such as sequence files or CSV/TSV
- An online storage system for structured data like Apache Cassandra, Apache HBase, or Apache Kudu (incubating)
Computation engines (such as Apache Impala [incubating] or Apache Spark) request data from these systems, converting it as it arrives to its native in-memory data structures in order to perform analytics or data transformations. Generally, the in-memory data structures within each computation system are specific to that system, and adapter code converts to and from file formats and marshals data to and from the wire-protocol formats of the online Hadoop storage systems.
Arrow improves the performance for data movement within a cluster in these ways:
- Two processes utilizing Arrow as their in-memory data representation can “relocate” the data from one process to the other without serialization or deserialization. For example, Spark could send Arrow data to a Python process for evaluating a user-defined function.
- Arrow data can be received from Arrow-enabled database-like systems without costly deserialization on receipt. For example, Kudu could send Arrow data to Impala for analytics purposes.
How Arrow’s in-memory columnar memory layout enables better performance