Structured Streaming Kafka Integration

In this post, I will show you how to create an end-to-end Structured Streaming pipeline. Let's say we have a requirement like this: JSON data is being received in Kafka, and we need to parse the nested JSON, flatten it, store it in a structured Parquet table, and get end-to-end failure guarantees.

//Step-1 Creating a Kafka Source for Streaming Queries
val rawData = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "")  // comma-separated list of brokers, e.g. broker1:9092,broker2:9092
  .option("subscribe", "topic")
  .load()

//Step-2 Parsing the JSON and flattening the nested columns
import org.apache.spark.sql.functions.{col, from_json}

val parsedData = rawData
  .selectExpr("cast (value as string) as json")
  .select(from_json(col("json"), schema).as("data"))  // schema: StructType describing the JSON payload
  .select("data.*")

//Step-3 Writing Data to Parquet
val query = parsedData.writeStream
  .option("checkpointLocation", "/checkpoint")
  .partitionBy("date")
  .format("parquet")
  .start("/parquetTable")

Step-1: Reading Data from Kafka
Specify the Kafka options that configure the source (alternative configurations are sketched below).

How to connect to the Kafka cluster?
kafka.bootstrap.servers => broker1,broker2

What to subscribe to?
subscribe => topic1,topic2,topic3   // fixed list of topics
subscribePattern => topic.*         // dynamic list of topics matching a regex
assign => {"topicA":[0,1]}          // specific partitions

Where to start reading?
startingOffsets => latest (default) / earliest / {"topicA":{"0":23,"1":345}}
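Here is a minimal sketch of what those alternatives look like as code. The broker addresses, topic pattern, and offset JSON are placeholders for illustration, not values from this pipeline.

// Subscribe to a dynamic list of topics matching a regex, starting from the earliest offsets
val byPattern = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
  .option("subscribePattern", "topic.*")
  .option("startingOffsets", "earliest")
  .load()

// Read only specific partitions, starting from explicit per-partition offsets
val byAssign = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
  .option("assign", """{"topicA":[0,1]}""")
  .option("startingOffsets", """{"topicA":{"0":23,"1":345}}""")
  .load()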

Step-2: Transforming Data
Each row in the source (rawData) has the following schema:

Column Type
key binary
value binary
topic string
partition int
offset long
timestamp long
timestampType int

Cast the binary value to a string and name the column json:
//selectExpr("cast (value as string) as json")
Parse the JSON string, expand it into nested columns, and name the result data:
//select(from_json(col("json"), schema).as("data"))
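The schema passed to from_json has to be defined up front. Here is a minimal sketch assuming a hypothetical payload with device, date, and temperature fields; the actual JSON structure is not shown in this post, so adapt the field names and types to your data.

import org.apache.spark.sql.types._

// Hypothetical schema for the incoming JSON messages
val schema = new StructType()
  .add("device", StringType)
  .add("date", StringType)
  .add("temperature", DoubleType)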

Step-3: Writing to Parquet
Save the parsed data as a Parquet table at the given path.
Partition the files by date so that future queries on time slices of the data are fast.
Checkpointing
Enable checkpointing by setting the checkpoint location, where the offset logs are saved for failure recovery:
//.option("checkpointLocation", …)
start actually starts a continuously running StreamingQuery in the Spark cluster (see the sketch after this step):
//.start("/parquetTable/")
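start returns a StreamingQuery handle that can be used to monitor and manage the running query. A small sketch of what that typically looks like, using the query value from Step-3:

query.status             // is the query waiting for data or processing a batch?
query.lastProgress       // metrics of the most recent micro-batch (input rate, durations, ...)
// query.stop()          // stop the query gracefully when needed
query.awaitTermination() // block the driver until the query is stopped or fails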

Stay tuned for the next post. 🙂

Reference: https://spark.apache.org/docs/2.2.0/structured-streaming-kafka-integration.html