You Are Here: Home » Apache Spark

Working with UDFs in Apache Spark – Cloudera Engineering Blog

User-defined functions (UDFs) are a key feature of most SQL environments to extend the system’s built-in functionality.  UDFs allow developers to enable new functions in higher level languages such as SQL by abstracting their lower level language implementations.  Apache Spark is no exception, and offers a wide range of options for integrating UDFs with Spark SQL workflows. In this blog post, we’ll review s ...

Read more

Yahoo supercharges TensorFlow with Apache Spark 

Yahoo, model Apache Spark citizen and developer of CaffeOnSpark, which made it easier for developers building deep learning models in Caffe to scale with parallel processing, is open sourcing a new project called TensorFlowOnSpark. The pairing of Spark and TensorFlow should make the deep learning framework more attractive to developers who are creating models that need to run on large computing clusters. Fo ...

Read more

Distributed Deep Learning with Apache Spark and Keras | Databases at CERN

In the following blog posts we study the topic of Distributed Deep Learning, or rather, how to parallelize gradient descent using data parallel methods. We start by laying out the theory, while supplying you with some intuition into the techniques we applied. At the end of this blog post, we conduct some experiments to evaluate how different optimization schemes perform in identical situations. We also intr ...

Read more

Powering Amazon Redshift Analytics with Apache Spark and Amazon Machine Learning | AWS Big Data Blog

Air travel can be stressful due to the many factors that are simply out of airline passengers’ control. As passengers, we want to minimize this stress as much as we can. We can do this by using past data to make predictions about how likely a flight will be delayed based on the time of day or the airline carrier. In this post, we generate a predictive model for flight delays that can be used to help us pick ...

Read more

Playing with 80 Million Amazon Product Review Ratings Using Apache Spark

Amazon product reviews and ratings are a very important business. Customers on Amazon often make purchasing decisions based on those reviews, and a single bad review can cause a potential purchaser to reconsider. A couple years ago, I wrote a blog post titled A Statistical Analysis of 1.2 Million Amazon Reviews, which was well-received. Back then, I was only limited to 1.2M reviews because attempting to pro ...

Read more

Analysis of the USA election of 2016 with Apache Spark GraphX and Neo4j | articles about programming on mkdev

Fig. 1 - The most popular tweets by Hillary Clinton and Donald Trump after the election has ended Almost right before the election started, I decided that it might have been interesting to analyse what people think and, more importantly, say on this topic. Because, as you know, this election had promised to be an extraordinary one. This is when I came up with an idea to utilise Twitter's streaming API to co ...

Read more

Apache Beam and Spark: New coopetition for squashing the Lambda Architecture? | ZDNet

The nice thing about open source projects and standards is that there are so many of them to choose from. And on January 10, the Apache community welcomed Beam as its latest "top level" project (getting top level means your project has made it to prime time in Apache). Google traditionally kept its technology to itself, typically publishing research papers that the open source community would then reinvent ...

Read more

Data Wrangling at Slack

For a company like Slack that strives to be as data-driven as possible, understanding how our users use our product is essential. The Data Engineering team at Slack works to provide an ecosystem to help people in the company quickly and easily answer questions about usage, so they can make better and data informed decisions: “Based on a team’s activity within its first week, what is the probability that it ...

Read more

Achieving a 300% speedup in ETL with Apache Spark – Cloudera Engineering Blog

A common design pattern often emerges when teams begin to stitch together existing systems and an EDH cluster: file dumps, typically in a format like CSV, are regularly uploaded to EDH, where they are then unpacked, transformed into optimal query format, and tucked away in HDFS where various EDH components can use them. When these file dumps are large or happen very often, these simple steps can significant ...

Read more

Monitoring Real-Time Uber Data Using Spark Machine Learning, Streaming, and the Kafka API (Part 1) | MapR

According to Gartner, by 2020, a quarter of a billion connected cars will form a major element of the Internet of Things. Connected vehicles are projected to generate 25GB of data per hour, which can be analyzed to provide real-time monitoring and apps, and will lead to new concepts of mobility and vehicle usage. One of the 10 major areas in which big data is currently being used to excellent advantage is i ...

Read more

2015 © Big Data Cloud Inc. All Rights Reserved.

Hadoop and the Hadoop elephant logo, Sprark are trademarks of the Apache Software Foundation.

Scroll to top