Stream & Batch Processing of Tweets (Kafka + Spark)
A Lambda‑style pipeline for tweets built on Kafka, Spark Streaming, and Spark batch jobs: real‑time insights with historical recomputation.
Overview
An end‑to‑end data engineering project that processes tweets in both real time and batch. Kafka handles ingestion, while Spark powers the streaming transformations and the batch recomputation used to correct results.
Problem
Real‑time analytics works well until you need to handle late‑arriving data, adapt to schema changes, or fix logic bugs. The system must support low‑latency insights *and* reliable backfills without rewriting the whole pipeline.
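To make the late‑data problem concrete, here is a minimal sketch in plain Python (not Spark, so it runs without a cluster). It counts events per one‑minute window twice: once in arrival order with a simplified watermark that drops events arriving too late, and once as a batch pass over the full history. The names `streaming_counts` and `batch_counts`, the window size, the lateness tolerance, and the watermark rule are all assumptions made for illustration; they are not taken from the project and do not reproduce Spark's exact semantics.

```python
from collections import Counter
from typing import Iterable

WINDOW_SECONDS = 60     # assumed window size
ALLOWED_LATENESS = 30   # assumed lateness tolerance, in seconds


def window_start(event_time: int) -> int:
    """Return the start of the fixed window containing event_time."""
    return event_time - (event_time % WINDOW_SECONDS)


def streaming_counts(arrivals: Iterable[int]) -> Counter:
    """Count events per window in arrival order, dropping events that are too late."""
    counts: Counter = Counter()
    max_seen = 0  # assumes non-negative event times
    for event_time in arrivals:
        max_seen = max(max_seen, event_time)
        watermark = max_seen - ALLOWED_LATENESS
        window_end = window_start(event_time) + WINDOW_SECONDS
        if window_end <= watermark:
            continue  # window already finalized, so the late event is dropped
        counts[window_start(event_time)] += 1
    return counts


def batch_counts(events: Iterable[int]) -> Counter:
    """Recompute counts over the full history, where arrival order is irrelevant."""
    counts: Counter = Counter()
    for event_time in events:
        counts[window_start(event_time)] += 1
    return counts


# Event times in seconds, listed in arrival order; the last event arrives late.
arrivals = [5, 20, 70, 130, 40]
stream_result = streaming_counts(arrivals)
batch_result = batch_counts(arrivals)
```

In this run the streaming pass misses the late event in the first window, while the batch pass includes it. That gap is what a backfill is meant to close.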
Approach
Built a Kafka ingestion layer, Spark Streaming jobs for near‑real‑time processing, and complementary batch pipelines for historical recomputation. Structured the pipeline so that the streaming and batch paths apply the same transformations. Documented the setup and run steps so the project is reproducible on a fresh machine.
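One common way to keep the two paths consistent is to define each transformation once and call it from both. The sketch below uses plain Python rather than Spark so it stays self‑contained. The names `extract_hashtags`, `process_batch`, and `process_stream` are hypothetical, and the tweet record shape (a dictionary with a `text` field) is an assumption, not the project's actual schema.

```python
import re
from collections import Counter
from typing import Dict, Iterable, Iterator, List

HASHTAG_RE = re.compile(r"#(\w+)")


def extract_hashtags(tweet: Dict[str, str]) -> List[str]:
    """Shared transformation: lowercased hashtags found in the tweet text."""
    return [tag.lower() for tag in HASHTAG_RE.findall(tweet.get("text", ""))]


def process_batch(tweets: Iterable[Dict[str, str]]) -> Counter:
    """Batch path: count hashtags over the full historical dataset."""
    counts: Counter = Counter()
    for tweet in tweets:
        counts.update(extract_hashtags(tweet))
    return counts


def process_stream(tweets: Iterable[Dict[str, str]]) -> Iterator[Counter]:
    """Streaming path: yield a running count after each incoming tweet."""
    counts: Counter = Counter()
    for tweet in tweets:
        counts.update(extract_hashtags(tweet))
        yield Counter(counts)  # copy so earlier snapshots stay unchanged


# Hypothetical sample records for illustration only.
sample = [
    {"text": "Learning #Kafka and #Spark"},
    {"text": "More #spark tips"},
]
final_stream = list(process_stream(sample))[-1]
final_batch = process_batch(sample)
```

Because both paths call `extract_hashtags`, a fix to that function changes streaming output and batch recomputation together, and the final streaming snapshot can be checked against the batch result.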
Impact
Demonstrates modern big‑data thinking: low‑latency delivery, correct recomputation, and a pipeline you can explain in an interview from architecture down to operators.