Using Apache Spark for large-scale language model training | Engineering Blog | Facebook Code

Processing large-scale data is at the heart of what the data infrastructure group does at Facebook. Over the years we have seen tremendous growth in our analytics needs, and to satisfy those needs we either have to design and build a new system or adopt an existing open source solution and improve it so it works at our scale. For some of our batch-processing use cases we decided to use Apache Spark, a fast- ...

Working with UDFs in Apache Spark – Cloudera Engineering Blog

User-defined functions (UDFs) are a key feature of most SQL environments, extending the system’s built-in functionality. UDFs allow developers to enable new functions in higher-level languages such as SQL by abstracting their lower-level language implementations. Apache Spark is no exception, and offers a wide range of options for integrating UDFs with Spark SQL workflows. In this blog post, we’ll review s ...

IoT And Big Data At Caterpillar: How Predictive Maintenance Saves Millions Of Dollars

When it comes to big data and Internet of Things (IoT) initiatives, most companies are still in the design or early adoption phases, which makes it hard to get solid return on investment (ROI) figures. So it’s refreshing to share a story of an organization delivering real-world ROI for their customers by vastly ramping up their data collection and predictive maintenance analytics. The Marine Division of Cat ...

How-to: Fuzzy Name Indexing in Apache Hadoop with Rosette and Cloudera Search – Cloudera Engineering Blog

In this guide, learn how to use Cloudera Search with Basis Technology’s Rosette® to perform fuzzy name searches in multiple languages and scripts. Our thanks to the Basis Technology team (Jeanne Le Garrec, Hannah MacKenzie-Margulies and Brian Sawyer) for their support in writing this how-to blog. Cloudera Search, powered by Apache Solr, brings full-text, interactive search, and scalable indexing to Apache Hadoop by m ...

Nebula as a Storage Platform to Build Airbnb’s Search Backends – Airbnb Engineering & Data Science – Medium

Last year Airbnb grew to the point where a scalable, distributed storage system was required to store data for some applications. For example, personalization data for search grew larger than what a single machine can hold. While we could rebuild just the personalization service to scale up, we foresaw that other services would have similar requirements and decided to build a common platform to simplify such tasks ...

Why some Data Lakes are built to last

Hadoop-based data lakes can be game changers, but too many are underperforming. Here's a checklist to make your data lake a wild success. Hadoop-based data lakes can be game changers: better, cheaper and faster integrated enterprise information. Knowledge workers can access data directly, where project cycles are measured in days rather than months, and business users can leverage a shared data source rath ...

Uber’s case for incremental processing on Hadoop – O’Reilly Media

Uber’s mission is to provide “transportation as reliable as running water, everywhere, for everyone.” To fulfill this promise, Uber relies on making data-driven decisions at every level, and most of these decisions can benefit from faster data processing. Examples include using data to understand areas for growth, or giving the city operations team access to fresh data to debug each city. Needless to say, the cho ...

How-to: Analyze Fantasy Sports using Apache Spark and SQL – Cloudera Engineering Blog

In the United States, many diehard sports fans morph into amateur statisticians to get an edge over the competition in their fantasy sports leagues. Depending on one’s technical chops, this “edge” is usually no more sophisticated than simple spreadsheet analysis, but some particularly intense people go to the extent of creating their own player rankings and projection systems. Online tools can provide simil ...

Apache Apex Is Promoted To Top-Level Project – InformationWeek

Streaming and batch big data analytics technology Apache Apex has been elevated to a Top-Level Project by the Apache Software Foundation. Used by organizations including Capital One and GE, the technology can help developers more quickly create apps that leverage real-time data. The rise of interest in Apache Spark has demonstrated just how important streaming data has become in the big data ecosystem. Real ...

Introducing Azure Cool Blob Storage | Blog | Microsoft Azure

Data in the cloud is growing at an exponential pace, and we have been working on ways to help you manage the cost of storing this data. An important aspect of managing storage costs is tiering your data based on attributes like frequency of access, retention period, etc. A common tier of customer data is cool data which is infrequently accessed but requires similar latency and performance to hot data. Today ...

2015 © Big Data Cloud Inc. All Rights Reserved.

Hadoop, the Hadoop elephant logo, and Spark are trademarks of the Apache Software Foundation.
