Comparing Cloud-based Machine Learning Platforms — Amazon ML vs Microsoft Azure ML vs Databricks (Spark) Cloud.
In this blog post, data scientists at Third Eye Consulting Services compare leading cloud-based machine learning platforms. We will start by sharing the framework we used to compare the platforms, and then we will apply it to Amazon ML, Microsoft Azure ML and Databricks Cloud.
Updated as of:
(Note that these services have an aggressive release cycle, so if you’re reading this a few months from now, please refer to the official documentation. Our framework should still help you.)
A machine learning project is a four-step process: 1) Data Preparation, 2) Data Selection, 3) Algorithms, 4) Optimize Algorithm. For each stage, we will ask the following questions to evaluate the machine learning platforms:
Note that this is not a waterfall process; it’s usually iterative. You might have to keep iterating, for example by adding or removing features and switching algorithms.
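To make the iteration concrete, here is a minimal Python sketch of the four steps run as a loop. Everything in it is an assumption made for illustration: the toy data, the feature names (`visits`, `spend`), and the two threshold "algorithms" are hypothetical and do not come from any of the platforms discussed here. It also scores models on the same rows it fits them on, which a real project would replace with held-out data.

```python
from itertools import product
from statistics import mean, median

# Hypothetical toy data set: two features and a binary label.
RAW = [
    {"visits": 1, "spend": 50, "label": 0},
    {"visits": 2, "spend": 5, "label": 0},
    {"visits": 3, "spend": 40, "label": 0},
    {"visits": None, "spend": 20, "label": 1},  # row with a missing value
    {"visits": 8, "spend": 10, "label": 1},
    {"visits": 9, "spend": 45, "label": 1},
    {"visits": 10, "spend": 15, "label": 1},
]


def prepare(rows):
    """Step 1, data preparation: drop rows with missing values."""
    return [r for r in rows if all(v is not None for v in r.values())]


def select(rows, feature):
    """Step 2, data selection: keep one feature plus the label."""
    return [(r[feature], r["label"]) for r in rows]


def fit_threshold(pairs, center):
    """Step 3, algorithm: predict 1 when the value exceeds a central value."""
    cut = center([x for x, _ in pairs])
    return lambda x: 1 if x > cut else 0


def accuracy(model, pairs):
    """Fraction of rows the model labels correctly."""
    return mean(1 if model(x) == y else 0 for x, y in pairs)


def iterate(rows):
    """Step 4, optimization: try every feature/algorithm pair, keep the best."""
    clean = prepare(rows)
    features = ["visits", "spend"]
    algorithms = [("mean", mean), ("median", median)]
    results = []
    for feature, (name, center) in product(features, algorithms):
        pairs = select(clean, feature)
        model = fit_threshold(pairs, center)
        results.append((accuracy(model, pairs), feature, name))
    return max(results)


best_accuracy, best_feature, best_algorithm = iterate(RAW)
```

The point of the sketch is the shape of the loop rather than the toy models: swapping a feature or an algorithm sends you back through the earlier steps, which is why the process is iterative rather than a waterfall.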
Now that we have a framework, let’s apply it to three leading cloud-based machine learning platforms:
- Amazon ML
- Microsoft Azure ML
- Databricks Cloud
Amazon ML
Amazon ML was announced in April 2015, so it’s a relatively young offering, and it’s understandable that its capabilities and algorithms are limited. It seems that Amazon launched a version 1 of its ML product to help existing AWS customers get started, and if customers demand more, I think the service will evolve over time. Here are a few more things you should know about Amazon ML:
- Amazon ML has a wizard that walks you through each step, so developers without ML know-how can get started
- Amazon ML supports data sources available on the AWS platform, such as Redshift and S3, so you will have to move your data to AWS before you can use Amazon ML. That’s great if you are an existing customer!
- Amazon ML supports basic data cleaning and transformation tasks, but for intermediate to complex needs you will have to do the heavy lifting of cleaning and transforming data somewhere else.
- Amazon ML currently supports the following ML problems:
- Binary and Multi-class classification
- Amazon ML does not let the developer select the algorithm for the problem at hand. For instance, if you have a binary classification problem, it automatically uses the logistic regression algorithm for you. It doesn’t let you change the algorithm to something like a two-class SVM or a two-class decision forest
- For each algorithm, you can set only some training and evaluation parameters, so it’s limiting for advanced users
- Amazon ML does give you common metrics to evaluate your model’s performance. For example, if you are building a binary classification model, it gives you the binary AUC.
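For readers unfamiliar with the binary AUC mentioned above, the following is a minimal Python sketch of one common way to compute it: the fraction of (positive, negative) pairs in which the positive example gets the higher score, with ties counted as half. The function name `binary_auc` and the sample scores are assumptions made for illustration; this is not Amazon ML’s API or necessarily how the service computes the metric internally.

```python
def binary_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    labels: iterable of 0/1 ground-truth classes.
    scores: iterable of model scores, higher meaning "more likely 1".
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    # A positive outranking a negative counts 1, a tie counts 0.5.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical scores from a binary classifier.
example_auc = binary_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

An AUC of 1.0 means every positive example outranks every negative one, while 0.5 is what random scoring would produce on average.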
So with that, let’s use our framework to evaluate Amazon ML:
Microsoft Azure ML
Microsoft Azure ML was made generally available in February 2015, so like Amazon ML it’s a relatively young offering, but it’s feature-rich! There’s something for everyone, from beginners to advanced users. Users who are just starting out get a workflow that helps them get going quickly, and intermediate and advanced users get support for R and IPython notebooks. Here’s more information about Azure ML:
- Azure ML has a workflow and a visual editor that beginners can easily follow to build their first ML project!
- Azure ML supports the following data sources: CSV, SQL Database tables and RData, among others. You can check out the list here: https://azure.microsoft.com/en-us/documentation/articles/machine-learning-data-science-import-data/. You can also automate data import using Azure Data Factory, another service in the Azure cloud offering.
- Azure ML has common data cleaning and transformation tasks that you can use, or you can build the data pipeline using R code with Azure ML
- Azure ML supports the following problems:
- Binary & Multiclass classification
- Anomaly detection
- For each problem, Azure ML gives you the option to try multiple algorithms. You can also bring other algorithms supported in R or IPython (or build your own!)
- Azure ML also helps you tune the parameters for each algorithm. In fact, it has a “sweep parameter” task that iterates over multiple input options for each algorithm parameter and identifies the optimal parameter setting for your problem
- Azure ML also makes it easy to compare the performance of different algorithms and helps you select the best one for the problem at hand!
- It also supports R and IPython notebooks, so you can port your existing R/Python code and use the Azure platform to operationalize your machine learning project
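The idea behind a parameter sweep can be sketched in a few lines of plain Python. This is only an illustration of the general technique (an exhaustive grid search scored on a validation set), not Azure ML’s implementation or API. The `sweep` helper, the toy k-nearest-neighbour scorer, and the data are all hypothetical.

```python
from itertools import product


def sweep(param_grid, score_fn):
    """Score every parameter combination; return (best_score, best_params)."""
    names = sorted(param_grid)
    best = None
    for values in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        # Keep the first combination that reaches the highest score.
        if best is None or score > best[0]:
            best = (score, params)
    return best


# Hypothetical 1-D data: (value, label). The point (3.5, 1) is deliberate noise.
TRAIN = [(1.0, 0), (2.0, 0), (3.0, 0), (3.5, 1), (4.0, 0),
         (6.0, 1), (7.0, 1), (8.0, 1), (9.0, 1)]
VALID = [(1.5, 0), (2.5, 0), (3.4, 0), (4.4, 0), (6.5, 1), (7.5, 1)]


def knn_accuracy(k, weighted):
    """Validation accuracy of a k-nearest-neighbour vote on the toy data."""
    correct = 0
    for x, y in VALID:
        nearest = sorted(TRAIN, key=lambda t: abs(t[0] - x))[:k]
        if weighted:
            # Closer neighbours get a larger say in the vote.
            votes = sum((1 if lbl == 1 else -1) / (abs(px - x) + 1e-9)
                        for px, lbl in nearest)
        else:
            votes = sum(1 if lbl == 1 else -1 for _, lbl in nearest)
        predicted = 1 if votes > 0 else 0
        correct += int(predicted == y)
    return correct / len(VALID)


best_score, best_params = sweep(
    {"k": [1, 3, 5], "weighted": [False, True]},
    knn_accuracy,
)
```

Here `k=1` is fooled by the noisy training point, so the sweep settles on a larger `k`. A hosted sweep does the same bookkeeping for you across real algorithms and much larger grids.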
So with that, let’s use our framework to evaluate Azure ML:
Databricks (Spark) Cloud:
Databricks Cloud was made generally available in June 2015, but don’t let the recent release distract you! Here’s why: Databricks makes it easy for companies to use Apache Spark (which is now a de facto standard in big data, just like Hadoop), so it can leverage the years of development work that went into Spark, and that makes Databricks Cloud a compelling offering. It supports the notebooks approach and lets data scientists write R, Python or Scala code to build data and ML pipelines. Here are a few more details about Databricks Cloud:
- Databricks cloud supports a number of data sources
- You can build sophisticated data pipelines using Databricks cloud
- They don’t have a wizard so there’s a learning curve for beginners
- They support a number of algorithms which can be extended using R/Python/Scala code
- The notebooks approach lets data scientists quickly explore and visualize the data, making it easy to iterate through the process
- You can tune the algorithm parameters
- It also helps you evaluate the algorithms by looking at performance metrics
- Overall, it’s a great service for intermediate to advanced users working on a machine learning project
So with that, let’s apply the framework to Databricks cloud:
So we saw two different approaches: point & click (aka wizard-based) and the notebooks approach. Let’s classify them:
Point & click: Amazon ML, Azure ML
Notebooks: Azure ML, Databricks Cloud.
And to wrap up:
Amazon ML may be sufficient for:
- customers that already have their data residing on AWS
- cases where simpler and fewer options are acceptable
Azure ML has a strong usability and workflow approach and provides a reasonable cross-section of algorithms for casual and intermediate users
Databricks Cloud is a comprehensive offering:
- Variety, performance and configurability of algorithms
- Richness of the capabilities of the notebooks
- Options and configurability of the hosting clusters/environment
You can also view the webinar on this topic. Please find the link below
Third Eye is a direct vendor to Microsoft, Amazon & Google. Third Eye has implemented numerous Big Data projects for them over the last 3 years. Third Eye is NOT a reseller of the cloud services of these companies. Third Eye does NOT financially benefit from making any of the recommendations in this post. This work is purely meant as a technical evaluation of the ML platforms and should not be construed for any other purpose.
About the Author:
Paras Doshi is an analytics & data science professional with a passion for helping businesses extract actionable insights out of their data. You can read his blog posts at http://insightextractor.com/
Thank you for reading!