Getting Started Analyzing Twitter Data in Apache Kafka through KSQL

KSQL is the open source streaming SQL engine for Apache Kafka. It lets you do sophisticated stream processing on Kafka topics, easily, using a simple and interactive SQL interface. In this short article we’ll see how easy it is to get up and running with a sandbox for exploring it, using everyone’s favourite demo streaming data source: Twitter. We’ll go from ingesting the raw stream of tweets, through to fi ...

Unite Real-Time and Batch Analytics Using the Big Data Lambda Architecture, Without Servers! | AWS Big Data Blog

The Big Data Lambda Architecture seeks to provide data engineers and architects with a scalable, fault-tolerant data processing architecture and framework using loosely coupled, distributed systems. At a high level, the Lambda Architecture is designed to handle both real-time and historically aggregated batched data in an integrated fashion. It separates the duties of real-time and batch processing so purpo ...

Using Apache Kafka as a Scalable, Event Driven Backbone for Service Architectures

The last post in this microservices series looked at building systems on a backbone of events, where events become both a trigger as well as a mechanism for distributing state. These have a long history of implementation using a wide range of messaging technologies. But while Apache Kafka™ is a messaging system of sorts, it’s quite different from typical brokers. It comes with both pros and cons and, like a ...

TensorForce: A TensorFlow library for applied reinforcement learning – reinforce.io

This blog post will give an introduction to the architecture and ideas behind TensorForce, a new reinforcement learning API built on top of TensorFlow. This post is about a practical question: How can the applied reinforcement learning community move from collections of scripts and individual examples closer to an API for reinforcement learning (RL), a ‘tf-learn’ or ‘scikit-learn’ for RL? Before discussing ...

Exactly-once Semantics is Possible: Here’s How Apache Kafka Does it

I’m thrilled that we have hit an exciting milestone the Kafka community has long been waiting for: we have introduced exactly-once semantics in Apache Kafka in the 0.11 release. In this post, I’d like to tell you what exactly-once semantics mean in Apache Kafka, why it is a hard problem, and how the new idempotence and transactions features in Kafka enable correct exactly-once stream processing using Kafka’ ...

Baidu employs the PaddlePaddle framework internally for prediction systems, along with Python to make training models and deriving predictions a snap. Many of the latest machine learning and data science tools purport to be easy to work with compared to previous generations of such frameworks and libraries. Chinese search engine giant Baidu now has an open source project in the same vein: a machine learning ...

Azure Data Lake Store: a hyperscale distributed file service for big data analytics | the morning paper

Azure Data Lake Store: a hyperscale distributed file service for big data analytics, Douceur et al., SIGMOD’17. Today’s paper takes us inside Microsoft Azure’s distributed file service called the Azure Data Lake Store (ADLS). ADLS is the successor to an internal file system called Cosmos, and marries Cosmos semantics with HDFS, supporting both Cosmos and Hadoop workloads. Microsoft are in the process of migra ...

Serverless Scaling for Ingesting, Aggregating, and Visualizing Apache Logs with Amazon Kinesis Firehose, AWS Lambda, and Amazon Elasticsearch Service | AWS Database Blog

In 2016, AWS introduced the EKK stack (Amazon Elasticsearch Service, Amazon Kinesis, and Kibana, an open source plugin from Elastic) as an alternative to ELK (Amazon Elasticsearch Service, the open source tool Logstash, and Kibana) for ingesting and visualizing Apache logs. One of the main features of the EKK stack is that the data transformation is handled via the Amazon Kinesis Firehose agent. In this pos ...

Google Spanner: Beginning of the End of the NoSQL World? – ACM SIGMOD Blog

Google has recently announced that its flagship wide-area database named Spanner has been made available on the Google Cloud. Google Spanner is the next generation globally-distributed database built inside Google and announced to the world through the paper published in OSDI 2012 [1]. This article explores the implication of Google Spanner, in particular to the NoSQL world. CAP Theorem: A Quick Recap. The t ...

Introducing Apache Arrow: A Fast, Interoperable In-Memory Columnar Data Structure Standard – Cloudera Engineering Blog

Engineers from across the Apache Hadoop community are collaborating to establish Arrow as a de-facto standard for columnar in-memory processing and interchange. Here’s how it works. Apache Arrow is an in-memory data structure specification for use by engineers building data systems. It has several key benefits: A columnar memory-layout permitting O(1) random access. The layout is highly cache-efficient in a ...

2015 © Big Data Cloud Inc. All Rights Reserved.

Hadoop, the Hadoop elephant logo, and Spark are trademarks of the Apache Software Foundation.
