Analytic is your Doctor’s Friend

In this post, I will venture into the medical domain and show how big data analytics can play a crucial role in the complex and daunting world of health care. There is a kind of cancer that affects the male population above a certain age. There are also other important contributing factors like race, family history etc. In this post, I will provide a Hadoop based machine learning solution to predict t ...

Read more

Explore Customer Churn with Cramer Index

Classification problems involve predicting a response variable based on a set of feature variables for some entity. But there is another problem whose solution is a prerequisite for solving a classification problem. We may want to know which among the set of feature variables are most strongly correlated with the response variable. Once we have identified those, we may only want to use that subset of the fea ...

Read more

Get Social with Pearson Correlation

In one of my earlier posts, I discussed using Pearson correlation for making social recommendations. In this post we will delve deeper into it, including the Hadoop MapReduce implementation. There are many correlation techniques, including cosine distance, slope one etc. These are already implemen ...

Read more

Making Hive Squawk like a Real Database

Hive is great for large scale data warehousing applications. In one of my recent projects I was handed the interesting and challenging task of making Hive behave like an OLTP system, i.e., support updates and deletes. To be more specific, the Hive database needed to be kept in near real time synchronization with multiple OLTP systems. In this post, I will discuss the high level features of the solution ba ...

Read more

It’s a lonely life for outliers

In this post, I am back to outliers and fraud analytics. In an earlier post, I gave an overview of the outlier detection techniques that are being implemented with Hadoop in my open source project beymani. In another earlier post, I talked about a multivariate distribution model based implementation in beymani. We postulated that data points falling in the low frequency histogram bins are potentially outliers. In ...

Read more

Big Web Analytic

I had started on a Hadoop based web analytics open source project some time ago. Recently I did some work on it and decided to blog about the development I did on the project. The project is called visitante and it’s available on github. Its goal is twofold. First, there is a set of MR jobs for various descriptive analytics metrics, e.g., bounce rate, checkout abandonment etc. I find the blog site of Avinas ...

Read more

Hive Plays Well with JSON

Hive is an abstraction on Hadoop MapReduce. It provides a SQL-like interface for querying HDFS data, which accounts for most of its popularity. In Hive, table structured data in HDFS is encapsulated as a table, as in an RDBMS. The DDL for table creation in Hive looks very similar to table creation DDL in an RDBMS. In one of my recent projects, I had a need for storing and querying JSON formatted hierarchical da ...

Read more

Socially Accepted Recommendation

All my earlier posts on recommendation systems focused on so-called content based recommendation. These systems rely on finding similarities between the attributes of entities, e.g., between products. They are useful for addressing the so-called cold start problem, when sufficient user engagement data is not available. However, they are limited in use, e.g., you cannot make cross-sell recommendations with ...

Read more

Redis as Messaging Middleware

I needed a simple middleware for fluxua, my Hadoop workflow engine project on github. In fluxua, you define a set of Hadoop jobs running as a workflow. The dependency between the jobs is represented as a directed acyclic graph (DAG). The workflow engine will execute the Hadoop jobs in the proper order. Since the Hadoop jobs take a significant amount of time to execute, I needed a simple async messaging middle ...

Read more

Cassandra Range Query Made Simple

In Cassandra, rows are hash partitioned by default. If you want data sorted by some attribute, the column name sorting feature of Cassandra is usually exploited. If you look at the Cassandra slice range API, you will find that you can specify only the range start, range end and an upper limit on the number of columns fetched. However, in many applications the need is to paginate through the data, i.e., each call ...

Read more

2013 © Big Data Cloud Inc. All Rights Reserved.

Hadoop and the Hadoop elephant logo are trademarks of the Apache Software Foundation.
