Skool: An Open Source Data Integration Tool for Apache Hadoop from BT Group – Cloudera Engineering Blog

In this guest post, Skool’s architects at BT Group explain its origins, design, and functionality.

With increased adoption of big data comes the challenge of integrating existing data sitting in various relational and file-based systems with Apache Hadoop infrastructure. Although open source connectors (such as Apache Sqoop) and utilities (such as HttpFS and curl on Linux) make it easy to exchange data, data engineering teams often spend an inordinate amount of time writing code for this purpose. This time investment is usually needed for two main reasons: the many different data structures involved, and the need to support enterprise security frameworks such as Kerberos.

For example, at BT, one common use for Hadoop involves analyzing the performance of broadband lines. This process involves importing data from many relational database systems containing columns for product, location, faults, orders, configuration, and so on, spanning hundreds and thousands of tables. Furthermore, many file feeds containing various network performance parameters have to be ingested, in real time, into HDFS for analysis. This effort also includes creating Apache Oozie workflow jobs for milestone and incremental pulls with different latencies, and then creating the requisite Apache Hive tables.
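To make the per-table effort concrete, here is a minimal Python sketch of how import commands for milestone (full) and incremental pulls might be generated from a table list. It is an illustration only, not Skool's actual implementation: `TableSpec`, `sqoop_import_args`, `render_script`, the JDBC URL, the paths, and the table names are all hypothetical. The Sqoop options used (`--connect`, `--username`, `--password-file`, `--table`, `--target-dir`, `--incremental`, `--check-column`, `--last-value`) are standard `sqoop import` arguments.

```python
import shlex
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TableSpec:
    """Hypothetical description of one source table to pull into HDFS."""
    name: str
    check_column: Optional[str] = None  # column used to detect new rows
    last_value: Optional[str] = None    # high-water mark from the previous pull


def sqoop_import_args(jdbc_url: str, username: str, password_file: str,
                      target_root: str, table: TableSpec) -> List[str]:
    """Build the argument list for one `sqoop import` invocation."""
    args = [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--username", username,
        "--password-file", password_file,
        "--table", table.name,
        "--target-dir", f"{target_root.rstrip('/')}/{table.name.lower()}",
    ]
    # With a check column and a previous high-water mark, do an incremental
    # pull; otherwise fall back to a full (milestone) pull.
    if table.check_column and table.last_value is not None:
        args += [
            "--incremental", "append",
            "--check-column", table.check_column,
            "--last-value", table.last_value,
        ]
    return args


def render_script(commands: List[List[str]]) -> str:
    """Render the commands as a shell script, quoting every argument."""
    lines = ["#!/bin/bash", "set -euo pipefail"]
    lines += [" ".join(shlex.quote(arg) for arg in cmd) for cmd in commands]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Assumed example values; replace with real connection details.
    tables = [
        TableSpec("FAULTS"),                                         # milestone pull
        TableSpec("ORDERS", check_column="ORDER_ID", last_value="1000"),  # incremental
    ]
    commands = [
        sqoop_import_args("jdbc:mysql://db.example.com/network", "etl_user",
                          "/user/etl/.db_password", "/data/raw", t)
        for t in tables
    ]
    print(render_script(commands))
```

Even in this toy form, every new table adds connection details, target paths, and pull-mode decisions, and the same would have to be repeated for the Oozie workflows and Hive table definitions. Multiplied across hundreds of tables, that is the hand-written effort the team wanted to avoid.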

To facilitate this process, BT’s data engineering team wanted a reusable framework that would generate deployment-ready scripts, support automated regression testing, and offer the flexibility to add desired customizations. They evaluated several commercial and open source tools to fill that need; however, all of them were ruled out for one reason or another. For example:

  • Gobblin is more focused on data flow scheduling than on ingestion or extraction.
  • Apache NiFi does not cover the requisite end-to-end flow, nor does it integrate with Kerberos.
  • Oracle Data Integrator has limited support for big data sources.

Source: Skool: An Open Source Data Integration Tool for Apache Hadoop from BT Group – Cloudera Engineering Blog

2015 © Big Data Cloud Inc. All Rights Reserved.