Sydney Traffic Analysis

Big Data pipeline analysing NSW road traffic patterns (2006–2025) with PySpark on HDFS, orchestrated via Docker

Raw Records: 3.9M

Traffic Stations: 711

Date Range: 19 yrs

Curated Records: 61,450

📥 Ingest: 4 CSV files + station ref
→ 🗄️ HDFS: NameNode + DataNode
→ ⚡ PySpark: Clean + Aggregate
→ 🔗 Enrich: Join station metadata
→ 📊 Analytics: SQL queries + viz
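
The clean, aggregate and enrich stages can be illustrated with a small plain-Python sketch. It is not the project's PySpark code: the field names (station_key, year, month, daily_total, region, lanes), the cleaning rules and the sample values are all assumptions made for illustration.

```python
from collections import defaultdict


def clean(rows):
    """Keep rows with a station key and a non-negative integer count (assumed rules)."""
    cleaned = []
    for row in rows:
        try:
            volume = int(row["daily_total"])
        except (KeyError, TypeError, ValueError):
            continue  # missing or non-numeric count
        if not row.get("station_key") or volume < 0:
            continue
        cleaned.append({**row, "daily_total": volume})
    return cleaned


def aggregate_monthly(rows):
    """Sum counts per (station_key, year, month)."""
    totals = defaultdict(int)
    for row in rows:
        totals[(row["station_key"], row["year"], row["month"])] += row["daily_total"]
    return totals


def enrich(monthly, stations):
    """Inner-join monthly totals with station metadata; unmatched stations are dropped."""
    enriched = []
    for (station_key, year, month), volume in sorted(monthly.items()):
        meta = stations.get(station_key)
        if meta is None:
            continue
        enriched.append(
            {"station_key": station_key, "year": year, "month": month,
             "volume": volume, **meta}
        )
    return enriched


# Hypothetical sample data, not taken from the real dataset.
raw = [
    {"station_key": "S1", "year": 2024, "month": 1, "daily_total": "1200"},
    {"station_key": "S1", "year": 2024, "month": 1, "daily_total": "800"},
    {"station_key": "S1", "year": 2024, "month": 2, "daily_total": "n/a"},
    {"station_key": "", "year": 2024, "month": 1, "daily_total": "500"},
    {"station_key": "S2", "year": 2024, "month": 1, "daily_total": "300"},
]
stations = {"S1": {"region": "Sydney", "lanes": 4}}

curated = enrich(aggregate_monthly(clean(raw)), stations)
print(curated)
```

Each stage shrinks the data: invalid rows are dropped, daily rows collapse into monthly totals, and stations without reference metadata fall out of the join, which is consistent with a curated set far smaller than the raw one.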

Python, PySpark, HDFS, Docker Compose, Jupyter, Spark SQL, Google Cloud VM

Regional Analysis

Composite of load per station × lane efficiency (2023–2025)

Average monthly volume per lane across all stations
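
One plausible reading of this metric is sketched below in plain Python: divide each station's monthly volume by its lane count, then average across stations for each month. The record layout and the sample numbers are assumptions, and the project may define the metric differently.

```python
from collections import defaultdict
from statistics import mean


def avg_volume_per_lane(records):
    """Average, per (year, month), of each station's volume divided by its lanes."""
    per_month = defaultdict(list)
    for rec in records:
        if rec["lanes"] > 0:  # skip stations with no usable lane count
            per_month[(rec["year"], rec["month"])].append(rec["volume"] / rec["lanes"])
    return {month: mean(values) for month, values in sorted(per_month.items())}


# Hypothetical monthly records for two stations.
lane_records = [
    {"station_key": "S1", "year": 2024, "month": 1, "volume": 2000, "lanes": 4},
    {"station_key": "S2", "year": 2024, "month": 1, "volume": 900, "lanes": 2},
    {"station_key": "S1", "year": 2024, "month": 2, "volume": 1200, "lanes": 4},
]

per_lane = avg_volume_per_lane(lane_records)
print(per_lane)
```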

Trends & Seasonality

Percentage growth from first to last recorded month per region
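
The growth figure can be sketched as (last - first) / first × 100, where first and last are a region's volumes in its earliest and latest recorded months. The plain-Python version below assumes input tuples of (region, year, month, volume); the region names and numbers are made up.

```python
def growth_by_region(monthly):
    """Percentage change from each region's first to last recorded month."""
    first_last = {}
    for region, year, month, volume in sorted(monthly, key=lambda r: (r[0], r[1], r[2])):
        if region not in first_last:
            first_last[region] = [volume, volume]  # [first, last]
        else:
            first_last[region][1] = volume  # later months overwrite "last"
    return {
        region: (last - first) * 100 / first
        for region, (first, last) in first_last.items()
        if first  # avoid dividing by a zero baseline
    }


# Hypothetical regional monthly totals, deliberately out of order.
growth_input = [
    ("North", 2006, 1, 1000),
    ("North", 2025, 6, 1250),
    ("North", 2015, 3, 1100),
    ("South", 2006, 1, 2000),
    ("South", 2025, 6, 1800),
]

growth = growth_by_region(growth_input)
print(growth)
```

Because only the two endpoint months are compared, a single unusual first or last month can dominate the result.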

Top-3 highest-volume months for each region
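
A top-N-per-group query like this is commonly written in Spark SQL with ROW_NUMBER() over a window partitioned by region. The plain-Python sketch below shows the same idea with heapq; the tuple layout (region, year, month, volume) and the sample data are assumptions.

```python
import heapq
from collections import defaultdict


def top_months(monthly, n=3):
    """Return the n highest-volume (year, month, volume) entries per region."""
    by_region = defaultdict(list)
    for region, year, month, volume in monthly:
        by_region[region].append((volume, year, month))
    return {
        region: [(year, month, volume) for volume, year, month in heapq.nlargest(n, rows)]
        for region, rows in by_region.items()
    }


# Hypothetical monthly totals for one region.
top_input = [
    ("North", 2024, 1, 900),
    ("North", 2024, 2, 1200),
    ("North", 2024, 3, 1100),
    ("North", 2024, 4, 800),
]

top3 = top_months(top_input)
print(top3)
```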

Station Distribution

Number of permanent counting stations by RMS region

Records at each stage of the ETL pipeline