Gnip CEO on the Challenges of Handling the Real-Time, Big Data Firehose
Last fall, Twitter announced a partnership with Gnip, making the latter company the only commercial provider of the Twitter activity stream. And although the “firehose” metaphor has been beaten to death, says Gnip CEO Jud Valeski, it still holds true.
Valeski spoke today at Gluecon about the challenges of handling the firehose – what it means to process high volume, real-time data streams and to be able to do so “in a consistent and predictable manner.”
Recent statistics demonstrate just how high a volume this Twitter data really is. Twitter now sees around 155 million tweets per day. At an average of roughly 2,500 bytes per tweet, that works out to an immense amount of data – about 35 megabits per second – which Twitter (and Gnip) must handle at a sustained rate.
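A quick back-of-the-envelope calculation, using the figures cited above, shows where that throughput number comes from:

```python
# Rough throughput estimate from the figures quoted in the article.
TWEETS_PER_DAY = 155_000_000   # ~155 million tweets/day
BYTES_PER_TWEET = 2_500        # average tweet payload, per the article
SECONDS_PER_DAY = 86_400

bytes_per_second = TWEETS_PER_DAY * BYTES_PER_TWEET / SECONDS_PER_DAY
megabits_per_second = bytes_per_second * 8 / 1_000_000

print(f"{bytes_per_second / 1_000_000:.1f} MB/s "
      f"(~{megabits_per_second:.0f} Mbit/s), sustained")
```

That lands at roughly 4.5 megabytes (about 35 megabits) every second, around the clock – and that is before accounting for protocol overhead or traffic spikes around major events.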
Valeski also spoke about how this big data stream doesn’t work with “the pipes we’re used to.” Rather than typical HTTP services with “standard, highly transient TCP connections,” this sort of real-time big data streaming is a very different scenario – something akin to video streaming. The connections can no longer be transient and small; they are “full blast connections.” The processing dynamics differ as well: the synchronous GET request handling just doesn’t work.
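To make the contrast concrete, here is a minimal sketch of what consuming such a stream looks like. Rather than one request producing one complete response, the consumer parses activities out of a never-ending byte stream as they arrive. The newline-delimited JSON framing and the simulated feed below are illustrative assumptions, not Gnip’s actual wire format:

```python
import io
import json

def consume_stream(stream, chunk_size=4096):
    """Parse activities out of a continuous, newline-delimited JSON stream.

    Unlike a synchronous GET, the connection never "completes": bytes
    keep arriving, and records must be framed and parsed on the fly.
    """
    buffer = b""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:          # a real firehose connection would not hit EOF
            break
        buffer += chunk
        # A chunk may contain several records, or only part of one.
        while b"\n" in buffer:
            line, buffer = buffer.split(b"\n", 1)
            if line.strip():
                yield json.loads(line)

# Simulated feed; in production this would be the body of a single
# long-lived HTTP connection held open indefinitely.
fake_feed = io.BytesIO(
    b'{"id": 1, "text": "hello"}\n'
    b'{"id": 2, "text": "world"}\n'
)
activities = list(consume_stream(fake_feed))
print(len(activities))  # 2
```

The design point is that the buffer-and-split loop, not the request/response cycle, is where the work happens – which is why Valeski says the usual synchronous handling model breaks down.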