Running Streaming Jobs Once a Day For 10x Cost Savings – The Databricks Blog
This is the sixth post in a multi-part series about how you can perform complex streaming analytics using Apache Spark.
Traditionally, when people think about streaming, terms such as “real-time,” “24/7,” or “always on” come to mind. You may have cases where data only arrives at fixed intervals. That is, data appears every hour or once a day. For these use cases, it is still beneficial to perform incremental processing on this data. However, it would be wasteful to keep a cluster up and running 24/7 just to perform a short amount of processing once a day.
Fortunately, by using the new Run Once trigger feature added to Structured Streaming in Spark 2.2, you will get all the benefits of the Catalyst Optimizer incrementalizing your workload, and the cost savings of not having an idle cluster lying around. In this post, we will examine how to employ triggers to accomplish both.
Triggers in Structured Streaming
In Structured Streaming, triggers are used to specify how often a streaming query should produce results. Once a trigger fires, Spark checks to see if there is new data available. If there is new data, then the query is executed incrementally on whatever has arrived since the last trigger. If there is no new data, then the stream sleeps until the next trigger fires.
The default behavior of Structured Streaming is to run with the lowest latency possible, so triggers fire as soon as the previous trigger finishes. For use cases with lower latency requirements, Structured Streaming supports a ProcessingTime trigger which will fire every user-provided interval, for example every minute.
While this is great, it still requires the cluster to remain running 24/7. In contrast, a RunOnce trigger will fire only once and then will stop the query. As we’ll see below, this lets you effectively utilize an external scheduling mechanism such as Databricks Jobs.