Batch Processing

Open Sourcing Vespa, Yahoo’s Big Data Processing and Serving Engine

Ever since we open sourced Hadoop in 2006, Yahoo – and now, Oath – has been committed to opening up its big data infrastructure to the larger developer community. Today, we are taking another major step in this direction by making Vespa, Yahoo's big data processing and serving engine, available as open source on GitHub. Building applications increasingly means dealing with huge a ...

Unite Real-Time and Batch Analytics Using the Big Data Lambda Architecture, Without Servers! | AWS Big Data Blog

The Big Data Lambda Architecture seeks to provide data engineers and architects with a scalable, fault-tolerant data processing architecture and framework using loosely coupled, distributed systems. At a high level, the Lambda Architecture is designed to handle both real-time and historically aggregated batched data in an integrated fashion. It separates the duties of real-time and batch processing so purpo ...
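The separation of batch and real-time duties described above can be illustrated with a minimal sketch in plain Python. It is not taken from the linked article and does not use any AWS service; the function names, the event shape, and the sample data are all hypothetical. A batch view is computed from historical events, a real-time view from recent events, and a query merges the two.

```python
from collections import Counter


def build_batch_view(historical_events):
    """Batch layer (hypothetical): count events per key over the full history."""
    return Counter(event["key"] for event in historical_events)


def build_realtime_view(recent_events):
    """Speed layer (hypothetical): count only events not yet in the batch view."""
    return Counter(event["key"] for event in recent_events)


def query(key, batch_view, realtime_view):
    """Serving layer (hypothetical): merge both views to answer a query."""
    return batch_view[key] + realtime_view[key]


# Illustrative sample data, not from the article.
historical = [{"key": "page_a"}, {"key": "page_a"}, {"key": "page_b"}]
recent = [{"key": "page_a"}, {"key": "page_c"}]

batch_view = build_batch_view(historical)
realtime_view = build_realtime_view(recent)
```

In this sketch the batch view can be rebuilt from scratch at any time, while the real-time view only needs to cover events that arrived since the last batch run.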

Using Apache Spark for large-scale language model training | Engineering Blog | Facebook Code

Processing large-scale data is at the heart of what the data infrastructure group does at Facebook. Over the years we have seen tremendous growth in our analytics needs, and to satisfy those needs we either have to design and build a new system or adopt an existing open source solution and improve it so it works at our scale. For some of our batch-processing use cases we decided to use Apache Spark, a fast- ...

Working with UDFs in Apache Spark – Cloudera Engineering Blog

User-defined functions (UDFs) are a key feature of most SQL environments to extend the system’s built-in functionality. UDFs allow developers to enable new functions in higher-level languages such as SQL by abstracting their lower-level language implementations. Apache Spark is no exception, and offers a wide range of options for integrating UDFs with Spark SQL workflows. In this blog post, we’ll review s ...
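To make the general idea of a UDF concrete, here is a minimal sketch that registers a Python function with a SQL engine and calls it from a query. It uses Python's standard-library sqlite3 module as a stand-in, not Spark SQL, and it is not the code from the linked post; the `initials` function, the `people` table, and the sample rows are hypothetical.

```python
import sqlite3


def initials(full_name):
    """Hypothetical UDF: return the upper-case initials of a name."""
    if full_name is None:
        return None
    return "".join(part[0].upper() for part in full_name.split())


conn = sqlite3.connect(":memory:")

# Register the Python function under a SQL-visible name taking one argument.
conn.create_function("initials", 1, initials)

# Illustrative sample data, not from the article.
conn.execute("CREATE TABLE people (name TEXT)")
conn.executemany(
    "INSERT INTO people VALUES (?)",
    [("ada lovelace",), ("alan turing",)],
)

# The UDF is now callable from SQL like a built-in function.
rows = conn.execute(
    "SELECT name, initials(name) FROM people ORDER BY name"
).fetchall()
```

The pattern is the same one the excerpt describes: the logic lives in a lower-level language implementation, and the SQL layer only sees a function name.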

IoT And Big Data At Caterpillar: How Predictive Maintenance Saves Millions Of Dollars

When it comes to big data and Internet of Things (IoT) initiatives, most companies are still in the design or early adoption phases, which makes it hard to get solid return on investment (ROI) figures. So it’s refreshing to share a story of an organization delivering real-world ROI for their customers by vastly ramping up their data collection and predictive maintenance analytics. The Marine Division of Cat ...

How-to: Fuzzy Name Indexing in Apache Hadoop with Rosette and Cloudera Search – Cloudera Engineering Blog

In this guide, learn how to use Cloudera Search with Basis Technology’s Rosette® to perform fuzzy name searches in multiple languages and scripts. Our thanks to the Basis Technology team (Jeanne Le Garrec, Hannah MacKenzie-Margulies and Brian Sawyer) for their support in writing this how-to. Cloudera Search, powered by Apache Solr, brings full-text, interactive search, and scalable indexing to Apache Hadoop by m ...

Nebula as a Storage Platform to Build Airbnb’s Search Backends – Airbnb Engineering & Data Science – Medium

Last year Airbnb grew to the point where a scalable and distributed storage system was required to store data for some applications. For example, personalization data for search grew larger than what a single machine can hold. While we could rebuild just the personalization service to scale up, we foresaw that other services would have similar requirements and decided to build a common platform to simplify such tasks ...

Why some Data Lakes are built to last

Hadoop-based data lakes can be game changers, but too many are underperforming. Here's a checklist to make your data lake a wild success. Hadoop-based data lakes can be game changers: better, cheaper and faster integrated enterprise information. Knowledge workers can access data directly, where project cycles are measured in days rather than months, and business users can leverage a shared data source rath ...

Uber’s case for incremental processing on Hadoop – O’Reilly Media

Uber’s mission is to provide “transportation as reliable as running water, everywhere, for everyone.” To fulfill this promise, Uber relies on making data-driven decisions at every level, and most of these decisions can benefit from faster data processing. Examples include using data to understand areas for growth, or giving the city operations team access to fresh data to debug each city. Needless to say, the cho ...

How-to: Analyze Fantasy Sports using Apache Spark and SQL – Cloudera Engineering Blog

In the United States, many diehard sports fans morph into amateur statisticians to get an edge over the competition in their fantasy sports leagues. Depending on one’s technical chops, this “edge” is usually no more sophisticated than simple spreadsheet analysis, but some particularly intense people go to the extent of creating their own player rankings and projection systems. Online tools can provide simil ...

2015 © Big Data Cloud Inc. All Rights Reserved.

Hadoop, the Hadoop elephant logo, and Spark are trademarks of the Apache Software Foundation.
