apache spark - Google Cloud DataFlow for NRT data application

Keywords: apache spark


I'm evaluating Kafka/Spark/HDFS for developing a near-real-time (NRT, sub-second) Java application that receives data from an external gateway and publishes it, on various topics, to desktop/mobile clients (consumers). At the same time, the data will be fed through streaming and batch (persistent) pipelines for analytics and ML.

For example, the flow would be:

  1. A standalone TCP client reads streaming data from an external TCP server
  2. The client publishes the data to different topics based on the packets (Kafka) and passes it to the streaming pipeline for analytics (Spark)
  3. A desktop/mobile consumer app subscribes to various topics and receives NRT data events (Kafka)
  4. The consumer also receives analytics from the streaming/batch pipelines (Spark)
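
To make steps 1 and 2 concrete, here is a minimal sketch using only the JDK. Everything in it is an assumption for illustration: the class name `GatewayClient`, the line-based packet format `<type>|<payload>`, and the `BiConsumer` publisher, which stands in for whatever actually delivers messages (a Kafka producer or a Pub/Sub publisher would be plugged in there).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.net.Socket;
import java.nio.charset.StandardCharsets;
import java.util.function.BiConsumer;

// Hypothetical sketch: the class name, packet format and topic scheme are assumptions.
class GatewayClient {

    // Assumed packet format: "<type>|<payload>", one packet per line.
    // Packets without a usable type prefix go to a catch-all "unknown" topic.
    static String topicFor(String packet) {
        int sep = packet.indexOf('|');
        return sep > 0 ? packet.substring(0, sep) : "unknown";
    }

    // Returns the part after the separator, or the whole packet if there is no type prefix.
    static String payloadOf(String packet) {
        int sep = packet.indexOf('|');
        return sep > 0 ? packet.substring(sep + 1) : packet;
    }

    // Step 2: read packets until the stream ends and hand each one to the
    // publisher as (topic, payload). Empty lines are skipped.
    // Returns the number of packets published.
    static int route(BufferedReader in, BiConsumer<String, String> publisher) {
        int count = 0;
        try {
            String line;
            while ((line = in.readLine()) != null) {
                if (line.isEmpty()) {
                    continue;
                }
                publisher.accept(topicFor(line), payloadOf(line));
                count++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return count;
    }

    // Step 1: connect to the external TCP server and route everything it sends.
    static int run(String host, int port, BiConsumer<String, String> publisher) {
        try (Socket socket = new Socket(host, port);
             BufferedReader in = new BufferedReader(
                     new InputStreamReader(socket.getInputStream(), StandardCharsets.UTF_8))) {
            return route(in, publisher);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // In-memory demo so the sketch runs without a real gateway.
        BufferedReader sample = new BufferedReader(
                new StringReader("quotes|AAPL 101.5\nnews|market open\n"));
        route(sample, (topic, payload) -> System.out.println(topic + " <- " + payload));
    }
}
```

Keeping the publisher behind a plain `BiConsumer` means the TCP reading and topic routing stay the same whether the messages end up in Kafka or in Pub/Sub; only the lambda passed to `run` would change.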

Kafka clusters have to be managed, configured and monitored for optimal performance and scalability. This may require additional staff and tools to manage operations.

Kafka, Spark and HDFS can optionally be deployed over Amazon EC2 (or Google Cloud using connectors).

I was reading about Google Cloud Dataflow, Cloud Storage, BigQuery and Pub/Sub. Dataflow provides autoscaling and tools to monitor data pipelines in real time, which is extremely useful. But the setup has a few restrictions, e.g. Pub/Sub push requires the client to use an HTTPS endpoint, and the app deployment needs to use a web server, e.g. an App Engine webapp or a web server on GCE.

This may not be as efficient (I'm concerned about latency when using HTTP) as deploying a bidirectional TCP/IP app that can leverage the Pub/Sub and Dataflow pipelines for streaming data.

Ideally, the preferred setup on Google Cloud would be to run the TCP client on GCE, connected to the external gateway, and have it push data via Pub/Sub to the desktop consumer app. In addition, it would leverage the Dataflow pipeline for analytics, and Cloud Storage with Spark for ML (the Prediction API is a bit restrictive), using the Cloudera Spark connector for Dataflow.

One could deploy Kafka/Spark/HDFS etc. on Google Cloud, but that kind of defeats the purpose of leveraging the Google Cloud technology.

I'd appreciate any thoughts on whether the above setup is possible using Google Cloud, or whether I should stay with EC2/Kafka/Spark etc.

2 Answers: 

Speaking about the Cloud Pub/Sub side, there are a couple of things to keep in mind:


From the Dataflow side of things, this sounds like a good fit, particularly as you'll be mixing streaming and batch style analytics. If you haven't yet, check out our Mobile Gaming walkthrough.

I'm not quite sure what you mean about using Cloudera's Dataflow/Spark runner for ML. That runner runs Dataflow code on Spark, but not the other way around.