In this guest post, Skool’s architects at BT Group explain its origins, design, and functionality.
With increased adoption of big data comes the challenge of integrating existing data sitting in various relational and file-based systems with Apache Hadoop infrastructure. Although open source connectors (such as Apache Sqoop) and utilities (such as HttpFS/curl on Linux) make it easy to exchange data, data engineering teams often spend an inordinate amount of time writing code for this purpose. This time investment is usually needed for two main reasons: the many different data structures involved, and the need to support enterprise security frameworks such as Kerberos.
For example, at BT, one common use for Hadoop involves analyzing the performance of broadband lines. This process involves importing data from multiple relational database systems containing columns for product, location, faults, orders, configuration, and so on, spanning hundreds or even thousands of tables. Furthermore, many different file feeds containing various network performance parameters have to be ingested, in real time, into HDFS for analysis. This effort also includes creating Apache Oozie workflow jobs for milestone and incremental pulls with different latencies, and then creating the requisite Apache Hive tables.
To facilitate this process, BT’s data engineering team wanted a reusable framework that would generate deployment-ready scripts, support automated regression testing, and offer the flexibility to add desired customizations. They evaluated several commercial and open source tools to fill that need; however, all of them were ruled out for one reason or another. For example: