Go Python, Go! Stream Processing for Python · Wallaroo Labs
We’ve been hard at work for 18 months on a new processing engine called Wallaroo for deploying and operating big data, fast data, and machine learning applications. We designed Wallaroo to make the infrastructure virtually disappear, so you get rapid deployment and easy-to-operate applications. It provides a simple model for building fast applications that scale automatically across any number of workers.
With Wallaroo, you focus on your business algorithms, not your infrastructure, and you can use the Python libraries you’re already familiar with. Wallaroo uses an embedded Python interpreter to run your code rather than calling out to a separate Python process, which makes your application run faster. Wallaroo isn’t built on the JVM, which provides advantages that we will cover in a later blog post. And finally, Wallaroo is open-source.
This blog post will show you how to use Wallaroo’s Python API to build elastic event-by-event processing applications.
The Python API
A Motivating Example
The canonical streaming data processing application is Word Count, in which a stream of input text is analyzed and the total number of times each word has been seen is reported. This description is broad enough to allow developers to make different design tradeoffs in their implementations. You can find this example in its entirety in our GitHub repository.
For this example we will make the following assumptions:
- Incoming messages will come from a TCP connection and be sent to another TCP connection.
- Words are sent to the system in messages that can contain zero or more words.
- Incoming messages consist of a string.
- Outgoing messages consist of a word and the number of times that word has been seen in the event stream.
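The message-handling assumptions above can be sketched as plain Python helpers. (The function names here are illustrative, not part of Wallaroo's API; the punctuation handling is an assumption about how words are normalized.)

```python
def decode(data: bytes) -> str:
    # Incoming messages arrive over TCP as bytes; treat them as UTF-8 strings.
    return data.decode("utf-8")

def split_words(message: str) -> list:
    # A message can contain zero or more words. Strip surrounding
    # punctuation and lowercase so "Hello," and "hello" count together.
    words = (w.strip(".,;:!?\"'").lower() for w in message.split())
    return [w for w in words if w]
```

An empty message simply yields an empty list, satisfying the "zero or more words" assumption.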
In our example, we will also split the state (the number of times each word has been seen) into 26 partitions, one per letter of the alphabet, where each partition handles the words that start with that letter. For example, “acorn” and “among” would go to the “a” partition, while “bacon” would go to the “b” partition.
This application will process each message as it arrives. This contrasts with some other streaming data processing systems, which are designed around processing messages in micro-batches. Handling messages one at a time results in lower latency because no message waits for a batch to fill before being processed.
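Event-by-event processing means each partition's state is updated, and the new total emitted, the moment a word arrives. A minimal sketch of that per-partition state (again using illustrative names, not Wallaroo's state API):

```python
class WordTotals:
    # Per-partition state: running counts for the words this partition owns.
    def __init__(self):
        self.counts = {}

    def update(self, word: str):
        # Process each word as it arrives (no micro-batching):
        # increment the count and emit the new total immediately.
        self.counts[word] = self.counts.get(word, 0) + 1
        return (word, self.counts[word])
```

Each call to `update` produces an outgoing message of the form described above: a word and the number of times it has been seen so far.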