Projects · Jan 2023 – May 2023 · PES University

Stream & Batch Processing of Tweets (Kafka + Spark)

Lambda‑style pipeline for tweets using Kafka + Spark Streaming and batch jobs — real‑time insights with historical recomputation.

Streaming Systems · Batch Analytics · ETL Pipelines · Scalable Data Processing

Overview

An end‑to‑end data engineering project that processes tweets both in real time and in batch. Kafka handles ingestion, while Spark powers the streaming transformations and the batch recomputation used for correctness.

Problem

Real‑time analytics is great — until you need to fix late data, schema changes, or logic bugs. The system must support low‑latency insights *and* reliable backfills without rewriting everything.

Approach

Built a Kafka ingestion layer, Spark Streaming jobs for near‑real‑time processing, and complementary batch pipelines for historical recomputation. Structured the pipeline so that the streaming and batch paths apply the same transformations and therefore agree on results. Documented the setup and run steps so the project is reproducible on a fresh machine.
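
As an illustration of keeping transformations consistent across the two paths, here is a minimal sketch in plain Python. It is not the project's code: the hashtag-counting logic, the function names (`extract_hashtags`, `batch_counts`, `streaming_counts`) and the sample tweets are all hypothetical, and ordinary Python iterables stand in for Kafka and Spark. The point is only that both paths call one shared function, so a logic fix made there carries over to the next batch recomputation.

```python
import re
from collections import Counter
from typing import Iterable, Iterator, List

# Hypothetical rule: a hashtag is '#' followed by word characters.
HASHTAG = re.compile(r"#(\w+)")


def extract_hashtags(text: str) -> List[str]:
    """Shared transformation: lowercased hashtags found in one tweet's text."""
    return [tag.lower() for tag in HASHTAG.findall(text)]


def batch_counts(tweets: Iterable[str]) -> Counter:
    """Batch path: recompute hashtag counts over the full history in one pass."""
    counts: Counter = Counter()
    for text in tweets:
        counts.update(extract_hashtags(text))
    return counts


def streaming_counts(tweets: Iterable[str]) -> Iterator[Counter]:
    """Streaming path: yield a snapshot of the running counts after each tweet."""
    running: Counter = Counter()
    for text in tweets:
        running.update(extract_hashtags(text))
        yield Counter(running)  # copy, so earlier snapshots are not mutated


# Made-up sample input standing in for messages read from a Kafka topic.
sample_tweets = [
    "Learning #Kafka and #Spark today",
    "#spark streaming is fun",
    "no tags here",
]

final_stream_view = list(streaming_counts(sample_tweets))[-1]
batch_view = batch_counts(sample_tweets)

# Both paths share extract_hashtags, so they agree once the stream catches up.
print(batch_view == final_stream_view)
```

In a Spark setting, a comparable approach would be to put the shared logic in one function that both the streaming job and the batch job import, rather than duplicating it in each.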

Impact

Demonstrates the core trade‑offs of a Lambda‑style design: low‑latency delivery from the streaming path, correct recomputation from the batch path, and a pipeline you can explain in an interview from architecture down to individual operators.