
Apache Kafka and the four challenges of production machine learning systems – O’Reilly Media

Machine learning has become mainstream, and suddenly businesses everywhere are looking to build systems that use it to optimize aspects of their product, processes, or customer experience. The cartoon version of machine learning sounds quite easy: you feed in training data made up of examples of good and bad outcomes, and the computer automatically learns from these and spits out a model that can make similar ...
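That cartoon loop, examples in, fitted model out, looks roughly like the following minimal scikit-learn sketch. The data and features are invented for illustration; the article is about what happens after this step, in production:

# Minimal sketch of the "cartoon version" of machine learning:
# examples of good/bad outcomes in, fitted model out.
from sklearn.linear_model import LogisticRegression

# Each row is an example; the label marks a good (1) or bad (0) outcome.
X_train = [[5.0, 1.2], [3.1, 0.4], [6.7, 2.2], [1.0, 0.1]]
y_train = [1, 0, 1, 0]

model = LogisticRegression()
model.fit(X_train, y_train)         # "the computer automatically learns"

print(model.predict([[4.0, 1.0]]))  # a model that can make similar calls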

Read more

Go Python, Go! Stream Processing for Python · Wallaroo Labs

We’ve been hard at work for 18 months on a new processing engine called Wallaroo for deploying and operating big data, fast data, and machine learning applications. We designed Wallaroo to make the infrastructure virtually disappear, so you get rapid deployment and easy-to-operate applications. It provides a simple model for building fast applications that scale automatically across any number of workers. W ...
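The programming model being described, a pure per-event computation plus keyed state that the engine spreads across workers, can be sketched in plain Python. This is an illustration of the model only, not Wallaroo's actual API:

# Per-event function plus keyed state. An engine like Wallaroo shards
# the state by key and runs the function across many workers; this is
# NOT Wallaroo's API, just the shape of the computation it scales.
from collections import defaultdict

state = defaultdict(int)  # keyed state the engine would partition

def handle_event(event):
    # Decode the event, update the state for its key, emit a result.
    key = event["user"]
    state[key] += 1
    return key, state[key]

for ev in [{"user": "a"}, {"user": "b"}, {"user": "a"}]:
    print(handle_event(ev))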

Read more

Exactly-once Semantics is Possible: Here’s How Apache Kafka Does it

I’m thrilled that we have hit an exciting milestone the Kafka community has long been waiting for: we have introduced exactly-once semantics in Apache Kafka in the 0.11 release. In this post, I’d like to tell you what exactly-once semantics mean in Apache Kafka, why it is a hard problem, and how the new idempotence and transactions features in Kafka enable correct exactly-once stream processing using Kafka’s ...
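In client terms, the two features look like this. A minimal sketch using the confluent-kafka Python client, with the broker address, transactional.id, and topic name assumed for illustration; requires Kafka 0.11 or later:

# Kafka's exactly-once producer features via confluent-kafka (Python).
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "localhost:9092",  # assumed broker address
    "enable.idempotence": True,         # dedupes broker-side retries
    "transactional.id": "etl-writer-1"  # enables atomic multi-write commits
})

producer.init_transactions()
producer.begin_transaction()
try:
    producer.produce("events", value=b"exactly once, not at-least-once")
    producer.commit_transaction()       # all-or-nothing across the writes
except Exception:
    producer.abort_transaction()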

Read more

Azure Data Lake Store: a hyperscale distributed file service for big data analytics | the morning paper

Azure Data Lake Store: a hyperscale distributed file service for big data analytics, Douceur et al., SIGMOD ’17. Today’s paper takes us inside Microsoft Azure’s distributed file service, the Azure Data Lake Store (ADLS). ADLS is the successor to an internal file system called Cosmos, and marries Cosmos semantics with HDFS, supporting both Cosmos and Hadoop workloads. Microsoft are in the process of migrating ...

Read more

Introducing Apache Arrow: A Fast, Interoperable In-Memory Columnar Data Structure Standard – Cloudera Engineering Blog

Engineers from across the Apache Hadoop community are collaborating to establish Arrow as a de facto standard for columnar in-memory processing and interchange. Here’s how it works. Apache Arrow is an in-memory data structure specification for use by engineers building data systems. It has several key benefits: a columnar memory layout permitting O(1) random access; the layout is highly cache-efficient in a ...
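The layout is easy to see from Python. A minimal pyarrow sketch, with column names invented for illustration:

# Columnar in-memory layout: values of one column sit contiguously,
# so random access by index is O(1) and scans are cache-friendly.
import pyarrow as pa

table = pa.table({
    "user_id": pa.array([101, 102, 103]),
    "score":   pa.array([0.9, 0.4, 0.7]),
})

col = table.column("score")  # one contiguous column, no row objects
print(col[1])                # O(1) random access into the column
print(table.schema)          # the shared spec other systems can read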

Read more

Playing with 80 Million Amazon Product Review Ratings Using Apache Spark

Amazon product reviews and ratings are serious business. Customers on Amazon often make purchasing decisions based on those reviews, and a single bad review can cause a potential purchaser to reconsider. A couple of years ago, I wrote a blog post titled A Statistical Analysis of 1.2 Million Amazon Reviews, which was well-received. Back then, I was limited to only 1.2M reviews because attempting to process ...
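At the 80M-row scale the post describes, the aggregation work moves to Spark. A PySpark sketch of the kind of query involved; the file path and column names are assumptions:

# Distribution of star ratings across all reviews, at cluster scale.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("amazon-reviews").getOrCreate()

reviews = spark.read.csv("reviews.csv", header=True, inferSchema=True)

(reviews.groupBy("star_rating")
        .agg(F.count("*").alias("n_reviews"))
        .orderBy("star_rating")
        .show())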

Read more

Apache Beam and Spark: New coopetition for squashing the Lambda Architecture? | ZDNet

The nice thing about open source projects and standards is that there are so many of them to choose from. And on January 10, the Apache community welcomed Beam as its latest "top level" project (getting top level means your project has made it to prime time in Apache). Google traditionally kept its technology to itself, typically publishing research papers that the open source community would then reinvent ...
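Beam's pitch against the Lambda Architecture is one portable pipeline definition instead of separate batch and streaming codebases. A minimal Python SDK sketch; it runs on the DirectRunner by default, and the same code can target Spark, Flink, or Dataflow runners:

# One Beam pipeline definition, runnable on multiple runners.
import apache_beam as beam

with beam.Pipeline() as p:  # DirectRunner unless configured otherwise
    (p
     | "Read"  >> beam.Create(["kafka spark beam", "beam spark"])
     | "Split" >> beam.FlatMap(str.split)
     | "Pair"  >> beam.Map(lambda w: (w, 1))
     | "Count" >> beam.CombinePerKey(sum)
     | "Print" >> beam.Map(print))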

Read more

This company is using Amazon Snowmobile to transfer petabytes of data to the cloud

One of the most dramatic announcements from Amazon Web Services at its 2016 re:Invent conference was Snowmobile: a 45’ semi truck that trailers a data center on wheels. Customers can load it up with up to 100 petabytes of data per Snowmobile, which is then driven to an AWS data center and loaded into the company’s cloud. It raises the question: who’s actually using this? DigitalGlobe ...
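Back-of-envelope arithmetic explains the truck: even a fully saturated 10 Gbps link (an assumed rate, purely for illustration) takes years to move 100 PB:

# Why a truck beats the network at this scale.
capacity_bits = 100e15 * 8            # 100 petabytes in bits
link_bps = 10e9                       # 10 Gbps, assumed fully utilized
seconds = capacity_bits / link_bps
print(f"{seconds / 86400:.0f} days")            # ~926 days
print(f"{seconds / (86400 * 365):.1f} years")   # ~2.5 years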

Read more

Data Wrangling at Slack

For a company like Slack that strives to be as data-driven as possible, understanding how our users use our product is essential. The Data Engineering team at Slack works to provide an ecosystem to help people in the company quickly and easily answer questions about usage, so they can make better, data-informed decisions: “Based on a team’s activity within its first week, what is the probability that it ...
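A question like that reduces to a conditional probability over a teams table. A hypothetical pandas sketch, with the table and column names invented for illustration:

# P(conversion | first-week activity bucket), on toy data.
import pandas as pd

teams = pd.DataFrame({
    "team_id":             [1, 2, 3, 4, 5, 6],
    "first_week_messages": [5, 250, 40, 900, 12, 600],
    "converted":           [0, 1, 0, 1, 0, 1],
})

teams["activity_bucket"] = pd.cut(teams["first_week_messages"],
                                  bins=[0, 50, 500, float("inf")],
                                  labels=["low", "medium", "high"])

print(teams.groupby("activity_bucket", observed=True)["converted"].mean())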

Read more

Achieving a 300% speedup in ETL with Apache Spark – Cloudera Engineering Blog

A common design pattern often emerges when teams begin to stitch together existing systems and an enterprise data hub (EDH) cluster: file dumps, typically in a format like CSV, are regularly uploaded to the EDH, where they are then unpacked, transformed into an optimal query format, and tucked away in HDFS where various EDH components can use them. When these file dumps are large or happen very often, these simple steps can significantly ...
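The pattern itself is a few lines of Spark. A PySpark sketch of the CSV-to-columnar step, with paths and column names assumed for illustration:

# Pick up raw CSV dumps, clean them, and land them in a
# query-optimized columnar format (Parquet) on HDFS.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("csv-to-parquet-etl").getOrCreate()

raw = spark.read.csv("hdfs:///dumps/incoming/*.csv",
                     header=True, inferSchema=True)

cleaned = (raw
           .withColumn("event_date", F.to_date("event_ts"))
           .dropDuplicates(["event_id"]))

# Partitioned columnar output that downstream EDH components can query.
(cleaned.write.mode("append")
        .partitionBy("event_date")
        .parquet("hdfs:///warehouse/events"))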

Read more
