Data Pipelines

When every drop counts: Schneider Electric transforms agriculture with the Internet of Things for sustainable farming – Transform

In the grassy Canterbury Plains of New Zealand, Craig Blackburn raises cattle and sheep in a line of work with a long tradition, in which he keeps a close eye on crops, land, weather and water. But Blackburn blends modern technology with his agricultural roots to manage the 990-acre Blackhills farm, a complex, bustling operation with 2,100 cattle and 800 sheep. The farm runs on irrigated water from the scen ...

Azure Data Lake Store: a hyperscale distributed file service for big data analytics | the morning paper

Douceur et al., SIGMOD’17. Today’s paper takes us inside Microsoft Azure’s distributed file service called the Azure Data Lake Store (ADLS). ADLS is the successor to an internal file system called Cosmos, and marries Cosmos semantics with HDFS, supporting both Cosmos and Hadoop workloads. Microsoft are in the process of migra ...

Stop overdoing it when cleaning your big data – TechRepublic

Enough is enough: your big data might actually be getting too clean. Find out why it can be useful to keep bad, garbage data. When you got a job as a data scientist, I bet you didn't imagine you'd spend so much time cleaning up bad data. Don't feel badly; none of us did. When data science rolled on the scene, many of us who were already in the data warehousing and b ...

How FICO scores big with cloud-based collaboration and data solutions – Microsoft Enterprise

Data rules everything around us. From traffic lights to medical records, those ones and zeros quickly dictate everything from the ads we see to the music we hear. As more and more organizations adopt big data strategies, we as consumers see exciting new innovations and solutions that leverage these possibilities. Smart phones, driver-less cars—this wave of data analytics brings the future to life in excitin ...

Apache Beam and Spark: New coopetition for squashing the Lambda Architecture? | ZDNet

The nice thing about open source projects and standards is that there are so many of them to choose from. And on January 10, the Apache community welcomed Beam as its latest "top level" project (getting top level means your project has made it to prime time in Apache). Google traditionally kept its technology to itself, typically publishing research papers that the open source community would then reinvent ...

Data Wrangling at Slack

For a company like Slack that strives to be as data-driven as possible, understanding how our users use our product is essential. The Data Engineering team at Slack works to provide an ecosystem to help people in the company quickly and easily answer questions about usage, so they can make better, data-informed decisions: “Based on a team’s activity within its first week, what is the probability that it ...

Build Apache Spark workflows with Databricks

Today we are excited to announce Notebook Workflows in Databricks. Notebook Workflows is a set of APIs that allow users to chain notebooks together using the standard control structures of the source programming language — Python, Scala, or R — to build production pipelines. This functionality makes Databricks the first and only product to support building Apache Spark workflows directly from notebooks, off ...

Skool: An Open Source Data Integration Tool for Apache Hadoop from BT Group – Cloudera Engineering Blog

In this guest post, Skool’s architects at BT Group explain its origins, design, and functionality. With increased adoption of big data comes the challenge of integrating existing data sitting in various relational and file-based systems with Apache Hadoop infrastructure. Although open source connectors (such as Apache Sqoop) and utilities (such as Httpfs/Curl on Linux) make it easy to exchange data, data en ...

Why some Data Lakes are built to last

Hadoop-based data lakes can be game changers, but too many are underperforming. Here's a checklist to make your data lake a wild success. Hadoop-based data lakes can be game changers: better, cheaper and faster integrated enterprise information. Knowledge workers can access data directly, where project cycles are measured in days rather than months, and business users can leverage a shared data source rath ...

33x Faster Queries on Google Cloud’s Dataproc

I'd like to thank Felipe Hoffa, Developer Advocate at Google, and Dennis Huo, Tech Lead and Manager of Google Cloud's Dataproc, for the research and insights they shared with me on getting better performance out of Dataproc. A few weeks ago I published a blog post outlining how to launch a 5-node Dataproc cluster on Google's Cloud service. In that post I hadn't taken the time to tune any table or storage ba ...

2015 © Big Data Cloud Inc. All Rights Reserved.

Hadoop, the Hadoop elephant logo, and Spark are trademarks of the Apache Software Foundation.
