Sydney Traffic Analysis

Big data pipeline analysing NSW road traffic patterns (2006–2025) with PySpark on HDFS, orchestrated via Docker

Raw Records: 3.9M
Traffic Stations: 711
Date Range: 19 yrs
Curated Records: 61,450

Pipeline Architecture

📥 Ingest: 4 CSV files + station ref
→
🗄️ HDFS: NameNode + DataNode
→
⚡ PySpark: Clean + Aggregate
→
🔗 Enrich: Join station metadata
→
📊 Analytics: SQL queries + viz
Python · PySpark · HDFS · Docker Compose · Jupyter · Spark SQL · Google Cloud VM
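The clean, aggregate and enrich stages can be illustrated with a minimal sketch. It uses plain Python rather than PySpark so it runs without a cluster, and every column name, station key, region label and cleaning rule here is an assumption made for illustration; the page does not state the actual schema or filters.

```python
from collections import defaultdict

# Hypothetical raw rows: (station_key, year, month, daily_total).
raw_rows = [
    ("S1", 2024, 1, 1000),
    ("S1", 2024, 1, 1200),
    ("S1", 2024, 2, None),  # missing count, dropped during cleaning
    ("S2", 2024, 1, 500),
    ("S2", 2024, 1, -5),    # negative count, dropped during cleaning
]

# Hypothetical station reference: station_key -> (region, lanes).
stations = {"S1": ("Region A", 4), "S2": ("Region B", 2)}


def clean(rows):
    """Keep rows whose count is present and non-negative (assumed rule)."""
    return [r for r in rows if r[3] is not None and r[3] >= 0]


def aggregate(rows):
    """Sum daily totals into one volume per (station, year, month)."""
    totals = defaultdict(int)
    for station, year, month, count in rows:
        totals[(station, year, month)] += count
    return dict(totals)


def enrich(monthly, station_ref):
    """Inner-join monthly volumes with station metadata."""
    out = []
    for (station, year, month), volume in sorted(monthly.items()):
        if station in station_ref:
            region, lanes = station_ref[station]
            out.append({"station": station, "year": year, "month": month,
                        "volume": volume, "region": region, "lanes": lanes})
    return out


curated = enrich(aggregate(clean(raw_rows)), stations)
for row in curated:
    print(row)
```

In PySpark the same three steps would typically map onto a filter, a `groupBy(...).agg(...)` and a `join` against the station reference.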
Regional Analysis

Urban Pressure Score

Composite of load per station × lane efficiency (2023–2025)
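One possible reading of this composite is sketched below. The page does not give the exact formula, so the plain product, the max-normalisation step and all input figures are assumptions, not the dashboard's actual method.

```python
def urban_pressure(total_volume, station_count, lane_efficiency):
    """Assumed composite: (volume / stations) * volume-per-lane."""
    load_per_station = total_volume / station_count
    return load_per_station * lane_efficiency


# Hypothetical per-region inputs: (total volume, stations, lane efficiency).
regions = {
    "Region A": (1_200_000, 40, 900.0),
    "Region B": (300_000, 30, 400.0),
}

scores = {name: urban_pressure(*values) for name, values in regions.items()}

# Scale to 0..1 so regions can be compared on one axis (assumed step).
top = max(scores.values())
normalised = {name: score / top for name, score in scores.items()}

ranked = sorted(normalised, key=normalised.get, reverse=True)
print(ranked, normalised)
```

Multiplying the two factors means a region scores high only when its stations are both heavily loaded and busy per lane; a weighted sum would be an equally defensible alternative.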

Lane Efficiency by Region

Average monthly volume per lane across all stations
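As a minimal sketch of this metric, each station-month volume can be divided by the station's lane count and then averaged per region. The field names and sample values are hypothetical, and skipping stations with no recorded lanes is an assumption.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical station-month rows after enrichment.
rows = [
    {"region": "Region A", "station": "S1", "volume": 8000, "lanes": 4},
    {"region": "Region A", "station": "S2", "volume": 3000, "lanes": 2},
    {"region": "Region B", "station": "S3", "volume": 1000, "lanes": 2},
    {"region": "Region B", "station": "S4", "volume": 900, "lanes": 0},  # no lane data
]

per_lane = defaultdict(list)
for row in rows:
    if row["lanes"] > 0:  # avoid dividing by zero when lane count is missing
        per_lane[row["region"]].append(row["volume"] / row["lanes"])

lane_efficiency = {region: mean(values) for region, values in per_lane.items()}
print(lane_efficiency)
```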
Trends & Seasonality

Long-term Traffic Growth

Percentage growth from first to last recorded month per region
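The growth figure described here is (last − first) / first × 100, taken over each region's chronologically first and last months. A small sketch with made-up volumes:

```python
def growth_pct(series):
    """Percentage change from the earliest to the latest month.

    `series` is a list of ((year, month), volume) pairs in any order.
    """
    ordered = sorted(series)  # tuples sort chronologically by (year, month)
    first, last = ordered[0][1], ordered[-1][1]
    if first == 0:
        raise ValueError("first month has zero volume; growth is undefined")
    return (last - first) * 100.0 / first


# Hypothetical monthly volumes per region.
volumes = {
    "Region A": [((2025, 1), 250_000), ((2006, 1), 200_000), ((2015, 6), 230_000)],
    "Region B": [((2006, 1), 80_000), ((2025, 1), 72_000)],
}

growth = {region: growth_pct(series) for region, series in volumes.items()}
print(growth)
```

Because only two endpoints are used, the result is sensitive to an unusual first or last month; averaging the first and last year instead would be a more robust variant.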

Seasonal Peaks by Region

Top-3 highest-volume months for each region
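A top-N-per-group selection like this can be sketched as follows. The sample rows are invented, and treating each (year, month) pair as a separate candidate, rather than averaging calendar months across years, is an assumption.

```python
import heapq
from collections import defaultdict

# Hypothetical rows: (region, year, month, volume).
monthly = [
    ("Region A", 2024, 1, 800),
    ("Region A", 2024, 3, 910),
    ("Region A", 2024, 8, 870),
    ("Region A", 2024, 11, 950),
    ("Region A", 2024, 12, 700),
    ("Region B", 2024, 1, 300),
    ("Region B", 2024, 7, 420),
]


def top_months(rows, n=3):
    """Return the n highest-volume (volume, year, month) tuples per region."""
    by_region = defaultdict(list)
    for region, year, month, volume in rows:
        by_region[region].append((volume, year, month))
    return {region: heapq.nlargest(n, items)
            for region, items in by_region.items()}


peaks = top_months(monthly)
for region, items in peaks.items():
    print(region, [(year, month, volume) for volume, year, month in items])
```

In Spark SQL the same result is usually obtained with a `ROW_NUMBER()` window partitioned by region and ordered by volume descending.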
Station Distribution

Stations per Region

Number of permanent counting stations by RMS region
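Counting stations per region amounts to a group-and-count over the station reference. In this sketch the station keys and region labels are hypothetical, and the de-duplication step assumes a station may appear more than once in the reference file.

```python
from collections import Counter

# Hypothetical station reference rows: (station_key, region).
station_ref = [
    ("S1", "Region A"),
    ("S2", "Region A"),
    ("S3", "Region B"),
    ("S1", "Region A"),  # duplicate entry for the same station
]

# Keep one region per station so duplicates are not counted twice.
unique_stations = dict(station_ref)
stations_per_region = Counter(unique_stations.values())

for region, count in stations_per_region.most_common():
    print(f"{region}: {count}")
```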

Data Pipeline Summary

Records at each stage of the ETL pipeline
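Using only the two record counts shown on this page, the relative size of each stage can be worked out as below. The raw figure is the rounded 3.9M from the summary cards, so the percentage is approximate, and intermediate stage counts are not listed here.

```python
# Stage counts taken from the summary cards; 3.9M is a rounded value.
stages = [
    ("Raw records", 3_900_000),
    ("Curated records", 61_450),
]


def stage_summary(stage_counts):
    """Return (name, count, percent of the first stage) for each stage."""
    base = stage_counts[0][1]
    return [(name, count, count / base * 100.0) for name, count in stage_counts]


summary = stage_summary(stages)
for name, count, pct in summary:
    print(f"{name:<16} {count:>10,} {pct:7.2f}%")
```

The curated set comes out at roughly 1.6% of the raw row count, which is consistent with the pipeline aggregating raw rows rather than merely filtering them.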