
Introducing Apache Arrow: A Fast, Interoperable In-Memory Columnar Data Structure Standard – Cloudera Engineering Blog

Engineers from across the Apache Hadoop community are collaborating to establish Arrow as a de facto standard for columnar in-memory processing and interchange. Here’s how it works. Apache Arrow is an in-memory data structure specification for use by engineers building data systems. It has several key benefits: A columnar memory layout permitting O(1) random access. The layout is highly cache-efficient in a ...


Using Apache Spark for large-scale language model training | Engineering Blog | Facebook Code

Processing large-scale data is at the heart of what the data infrastructure group does at Facebook. Over the years we have seen tremendous growth in our analytics needs, and to satisfy those needs we either have to design and build a new system or adopt an existing open source solution and improve it so it works at our scale. For some of our batch-processing use cases we decided to use Apache Spark, a fast- ...


Data Wrangling at Slack

For a company like Slack that strives to be as data-driven as possible, understanding how our users use our product is essential. The Data Engineering team at Slack works to provide an ecosystem to help people in the company quickly and easily answer questions about usage, so they can make better, data-informed decisions: “Based on a team’s activity within its first week, what is the probability that it ...


Skool: An Open Source Data Integration Tool for Apache Hadoop from BT Group – Cloudera Engineering Blog

In this guest post, Skool’s architects at BT Group explain its origins, design, and functionality. With increased adoption of big data comes the challenge of integrating existing data sitting in various relational and file-based systems with Apache Hadoop infrastructure. Although open source connectors (such as Apache Sqoop) and utilities (such as Httpfs/Curl on Linux) make it easy to exchange data, data en ...


Hadoop Still Beats Spark In These Cases | Bowen Gong | Pulse | LinkedIn

Summary: There has been much talk about Spark replacing Hadoop in the big data space due to its speed and ease of use. While there are major benefits to using Spark (I am one of its advocates), it is far from a replacement for Hadoop for two reasons. One, Spark does not have the HDFS component. Two, Spark is not more scalable or fault-tolerant than Hadoop. Spark Strengths: Although this article is to show yo ...


Data Warehousing With Google BigQuery

Data warehousing and the resulting business intelligence are basic necessities of business today. Today’s technologies make it possible to have a sophisticated data warehouse up and running in the cloud at a price and scale that were never possible before. This webinar showcases the reasons, ways, and means of developing such modern-day data warehouses using Google BigQuery. ...


Innovative Big Data Application Optimizes Lead Conversions, built on the Google Cloud Platform – CASE STUDY

In the era of Big Data, many enterprise executives are struggling with the sheer volume of available data and how to transform all that information into intelligence they can use to make the best business decisions. Typically, they try to determine exactly what data is available and then apply it to a specific question or problem. Unfortunately, this is an ineffective strategy, returning only fifty cents on ...


Hadoop Market is Expected to Reach USD 37,759.0 Million in 2023: Transparency Market Research

According to a new market report published by Transparency Market Research, "Hadoop Market - Global Industry Analysis, Size, Share, Growth, Trends and Forecast 2015 - 2023," the global Hadoop market was valued at USD 4,120.0 million in 2014 and is expected to grow at a CAGR of 28.4% from 2015 to 2023 to reach USD 37,759.0 million in 2023. Full Research Report on Global Hadoop Market with detailed figures and segmentation ...


Unified Security and Management Solution for BI That Is Native on Hadoop

Traditional BI and visualization tools rely on decentralized security models, which makes extracting and managing data from Hadoop complicated and vulnerable. By delivering a security architecture converged from within Hadoop’s distributed processing framework, rather than imposed from without, Arcadia Data continues to deliver on its vision of on-cluster analytics by eliminating the complexities of redunda ...


Anomaly Detection Using PySpark, Hive, and Hue on Amazon EMR – AWS Big Data Blog

We are surrounded by more and more sensors – some of which we’re not even consciously aware of. As sensors become cheaper and easier to connect, they create an increasing flood of data that’s getting cheaper and easier to store and process. However, sensor readings are notoriously “noisy” or “dirty”. To produce meaningful analyses, we’d like to identify anomalies in the sensor data and remove them before we pe ...


2015 © Big Data Cloud Inc. All Rights Reserved.

Hadoop, the Hadoop elephant logo, and Spark are trademarks of the Apache Software Foundation.
