Introducing Apache Arrow: A Fast, Interoperable In-Memory Columnar Data Structure Standard – Cloudera Engineering Blog

Engineers from across the Apache Hadoop community are collaborating to establish Arrow as a de facto standard for columnar in-memory processing and interchange. Here’s how it works. Apache Arrow is an in-memory data structure specification for use by engineers building data systems. It has several key benefits: a columnar memory layout permitting O(1) random access. The layout is highly cache-efficient in a ...

A CIO’s guide to chatbots: Everything you need to know

Chatbots became a mainstream technology in 2016, so what do CIOs and their subordinates need to know about them this year? Chatbots officially hit the mainstream in 2016 with the news that the biggest online players, including Facebook and Microsoft, are embracing the technology. In contrast to chat-based bots of the past – those primitive agents often incapable of anything but pre-scripted spam advertising ...

Why the Financial Sector in India is betting big on Chatbots

Banking gets smarter with Deep Learning and NLP technologies. Here’s a common scenario in the financial sector – high-volume customer support tasks are being standardized and executed by chatbots or software applications that can be scaled up accordingly and are streamlining operations. From insurance discovery, buying and renewal to handling da ...

Working with UDFs in Apache Spark – Cloudera Engineering Blog

User-defined functions (UDFs) are a key feature of most SQL environments for extending the system’s built-in functionality. UDFs allow developers to enable new functions in higher-level languages such as SQL by abstracting their lower-level language implementations. Apache Spark is no exception, and offers a wide range of options for integrating UDFs with Spark SQL workflows. In this blog post, we’ll review s ...

Achieving a 300% speedup in ETL with Apache Spark – Cloudera Engineering Blog

A common design pattern often emerges when teams begin to stitch together existing systems and an EDH cluster: file dumps, typically in a format like CSV, are regularly uploaded to EDH, where they are then unpacked, transformed into optimal query format, and tucked away in HDFS where various EDH components can use them. When these file dumps are large or happen very often, these simple steps can significant ...

Unified security and Management Solution for BI that is Native on Hadoop

Traditional BI and visualization tools rely on decentralized security models, which makes extracting and managing data from Hadoop complicated and vulnerable. By delivering a security architecture converged from within Hadoop’s distributed processing framework, rather than imposed from without, Arcadia Data continues to deliver on its vision of on-cluster analytics by eliminating the complexities of redunda ...

Apache Hadoop at 10 – Cloudera VISION

2016 marks the 10th Anniversary of Hadoop. This birthday provides us an opportunity to celebrate, and also to reflect on how we got here and where we are going. Hadoop has come to symbolize big data, itself central to this century’s industrial revolution: the digital transformation of business. Ten years ago, digital business was limited to a few sectors, like e-commerce and media. Since then, we have seen ...

DistCp Performance Improvements in Apache Hadoop – Cloudera Engineering Blog

Recent improvements to Apache Hadoop’s native backup utility, which are now shipping in CDH, make that process much faster. DistCp is a popular tool in Apache Hadoop for periodically backing up data across and within clusters. (Each run of DistCp in the backup process is referred to as a backup cycle.) It has grown in popularity despite relatively slow performance. In this post, we’ll provide a ...

Cloudera Submitting Impala and Kudu to the Apache Incubator – Cloudera VISION

Almost 10 years ago I helped found Hadoop at Apache. Since then the project has seen tremendous success, spawning an ecosystem of over 20 projects around it. Institutions throughout the world use Apache Hadoop and related projects to better understand their customers, markets and products. Hadoop has been central to the ongoing shift by enterprises to a new computing platform that is more powerful, scalable ...

How-to: Translate from MapReduce to Apache Spark (Part 2) | Cloudera Engineering Blog

The conclusion to this series covers Combiner-like aggregation functionality, counters, partitioning, and serialization. Apache Spark is rising in popularity as an alternative to MapReduce, in large part due to its expressive API for complex data processing. A few months ago, my colleague Sean Owen wrote a post describing how to translate functionality from MapReduce into Spark, and in this post, I’ll ext ...

2015 © Big Data Cloud Inc. All Rights Reserved.

Hadoop, the Hadoop elephant logo, and Spark are trademarks of the Apache Software Foundation.