Data Loader for NOSQL Databases
In one of my recent projects, I had to load product data from a CSV file into HBase and also index it for search. I decided to separate out the loader part of the project as a standalone tool and make it available as open source. It is currently hosted on GitHub. It supports HBase today. I will be adding support for Cassandra soon, and I am working on Solr indexing right now.
Introduction
The tool is generic and configurable. It takes a CSV file as input and writes to HBase, with Cassandra as a planned second target.
The CSV file could have been generated from queries on Oracle or MySQL. So it could be used to migrate data from RDBMS to NOSQL databases.
It also takes a JSON file, which defines the mapping between the columns in the CSV and the NOSQL column family and column, along with other metadata.
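To make the idea of a mapping file concrete, here is a minimal sketch in Python. It is not the tool's actual code or JSON schema: the field names (csvIndex, family, qualifier), the table name, and the map_row helper are all hypothetical, chosen only to illustrate how a JSON mapping could drive the conversion of a CSV row into HBase-style cells.

```python
import csv
import io
import json

# Hypothetical mapping format; the tool's real JSON schema may differ.
MAPPING_JSON = """
{
  "table": "product",
  "columns": [
    {"csvIndex": 1, "family": "info", "qualifier": "name"},
    {"csvIndex": 2, "family": "price", "qualifier": "list"}
  ]
}
"""


def map_row(row, mapping):
    """Return a list of (family, qualifier, value) cells for one CSV row."""
    return [
        (col["family"], col["qualifier"], row[col["csvIndex"]])
        for col in mapping["columns"]
    ]


mapping = json.loads(MAPPING_JSON)

# In-memory sample data standing in for a real CSV file.
sample_csv = "P100,Widget,9.99\nP200,Gadget,19.50\n"
for row in csv.reader(io.StringIO(sample_csv)):
    # The first CSV column is used as the row identifier in this sketch.
    print(row[0], map_row(row, mapping))
```

Keeping the mapping in data rather than in code is what would let the same loader handle different CSV layouts without recompiling anything.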
Here is a quick summary of the features. The terminology I am using is based on HBase.
- Loads data from a CSV file.
- The mapping between CSV columns and NOSQL column family and column is provided in a JSON file.
- There is a many-to-many association between CSV columns and NOSQL column families and columns.
- The row key for NOSQL could be created by concatenating multiple CSV columns.
- Solr indexing of data as it’s being loaded.
The indexing feature is not implemented yet. I will be working on it next. The many-to-many association works in both directions: a CSV column can be split into multiple parts and used to populate multiple NOSQL columns, and, on the flip side, multiple CSV columns can be consolidated to populate one NOSQL column.
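The row key and many-to-many features can be sketched in a few lines of Python. This is only an illustration under assumptions, not the tool's implementation: the function names, the ":" and " " separators, the "x" delimiter, and the sample row are all made up for the example.

```python
def build_row_key(row, key_indexes, sep=":"):
    """Concatenate several CSV columns into one row key."""
    return sep.join(row[i] for i in key_indexes)


def split_column(value, delimiter, targets):
    """Split one CSV value into parts, one per (family, qualifier) target."""
    parts = value.split(delimiter)
    return {target: part for target, part in zip(targets, parts)}


def consolidate_columns(row, indexes, sep=" "):
    """Join several CSV columns into a single NOSQL column value."""
    return sep.join(row[i] for i in indexes)


# Hypothetical CSV row: vendor, product id, dimensions, name, variant.
row = ["acme", "P100", "12x30x5", "Widget", "Deluxe"]

row_key = build_row_key(row, [0, 1])
dims = split_column(
    row[2],
    "x",
    [("dim", "length"), ("dim", "width"), ("dim", "height")],
)
title = consolidate_columns(row, [3, 4])

print(row_key, dims, title)
```

Here one CSV column (dimensions) populates three columns in a "dim" family, while two CSV columns (name and variant) are merged into one value, and the row key comes from the first two columns.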
