Building a Real-Time Transit Data Pipeline: Dublin Bus Analytics
January 30, 20266 min readBy Mayank

Building a Real-Time Transit Data Pipeline: Dublin Bus Analytics

How I built a complete data pipeline that tracks 680+ buses in real-time across Dublin, from API ingestion to interactive dashboards.

Building a Real-Time Transit Data Pipeline: Dublin Bus Analytics

How I built a complete data pipeline that tracks 680+ buses in real-time across Dublin, from API ingestion to interactive dashboards—all in a single day.


The Opportunity: Open Transit Data#

Transport for Ireland provides one of Europe's best public transit data APIs. Their GTFS-Realtime (GTFS-RT) feed offers:

  • Vehicle Positions: Real-time GPS coordinates for every bus
  • Trip Updates: Arrival and departure delays at each stop
  • Service Alerts: Disruptions and schedule changes

This data covers Dublin Bus, Bus Éireann, Go-Ahead Ireland, and even LUAS (Dublin's tram system). As a data engineer living in Dublin, I saw an opportunity to build something useful while demonstrating my skills.

What I Built#

A complete data engineering project that:

  1. Ingests real-time bus data every 30 seconds
  2. Stores positions and delays in a SQLite database
  3. Analyzes patterns with Python (Pandas)
  4. Visualizes results in an interactive Streamlit dashboard

Let me walk you through each component.


Architecture Overview#

Plain Text
1┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
2│ TFI API │────▶│ Data Collector │────▶│ SQLite DB │
3│ (GTFS-RT) │ │ (Python) │ │ │
4│ │ │ │ │ • positions │
5│ • /Vehicles │ │ • Polling │ │ • trip_updates │
6│ • /TripUpdates │ │ • Parsing │ │ • indexes │
7└─────────────────┘ │ • Validation │ └────────┬────────┘
8 └─────────────────┘ │
9
10 ┌─────────────────┐
11 │ Analytics │
12 │ Engine │
13 └────────┬────────┘
14
15 ┌──────────────────┬──────┴──────────────────┐
16 ▼ ▼ ▼
17 ┌──────────┐ ┌──────────┐ ┌──────────┐
18 │ Streamlit│ │ Jupyter │ │ JSON │
19 │Dashboard │ │ Notebook │ │ Export │
20 └──────────┘ └──────────┘ └──────────┘

Step 1: Understanding GTFS-Realtime#

GTFS-RT is an extension of the General Transit Feed Specification (GTFS) designed for real-time transit data. The API returns Protocol Buffer (protobuf) format by default, but TFI also supports JSON.

Here's what a vehicle position looks like:

Plain Text
1{
2 "id": "V123",
3 "vehicle": {
4 "trip": {
5 "trip_id": "5240_626",
6 "route_id": "5240_119666",
7 "start_time": "22:05:00",
8 "start_date": "20260130",
9 "direction_id": 0
10 },
11 "position": {
12 "latitude": 53.3538666,
13 "longitude": -7.09222
14 },
15 "timestamp": "1769812820",
16 "vehicle": {
17 "id": "25"
18 }
19 }
20}

The route IDs follow a pattern: 5240_* is Dublin Bus, 5249_* is Go-Ahead Ireland. This lets us segment analysis by operator.


Step 2: Building the Data Collector#

The collector is a Python script that polls the API and stores results. Key design decisions:

Idempotent Database Initialization#

Plain Text
1def _init_database(self):
2 """Initialize SQLite database with required tables"""
3 conn = sqlite3.connect(DATABASE_PATH)
4 cursor = conn.cursor()
5
6 cursor.execute("""
7 CREATE TABLE IF NOT EXISTS vehicle_positions (
8 id INTEGER PRIMARY KEY AUTOINCREMENT,
9 collected_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
10 vehicle_id TEXT,
11 trip_id TEXT,
12 route_id TEXT,
13 latitude REAL,
14 longitude REAL,
15 timestamp INTEGER,
16 start_time TEXT,
17 start_date TEXT,
18 direction_id INTEGER
19 )
20 """)
21
22 # Indexes for common queries
23 cursor.execute("CREATE INDEX IF NOT EXISTS idx_positions_time ON vehicle_positions(collected_at)")
24 cursor.execute("CREATE INDEX IF NOT EXISTS idx_positions_route ON vehicle_positions(route_id)")
25
26 conn.commit()

Error-Resilient Fetching#

Plain Text
1def fetch_vehicle_positions(self) -> dict:
2 """Fetch current vehicle positions"""
3 try:
4 response = requests.get(
5 f"{VEHICLES_ENDPOINT}?format=json",
6 headers={"x-api-key": API_KEY},
7 timeout=30
8 )
9 response.raise_for_status()
10 return response.json()
11 except requests.RequestException as e:
12 print(f"Error fetching vehicle positions: {e}")
13 return {} # Return empty dict, don't crash

Configurable Collection Cycles#

Plain Text
1def run_continuous_collection(interval_seconds=60, duration_minutes=30):
2 """Run continuous data collection"""
3 collector = DataCollector()
4 end_time = datetime.now().timestamp() + (duration_minutes * 60)
5
6 while datetime.now().timestamp() < end_time:
7 try:
8 collector.collect()
9 time.sleep(interval_seconds)
10 except KeyboardInterrupt:
11 break

Running python data_collector.py --duration 30 --interval 60 collects data every minute for 30 minutes.


Step 3: What the Data Reveals#

After just 5 minutes of collection, I had:

  • 2,732 vehicle position records
  • 42,042 trip update records
  • 693 unique vehicles tracked
  • 194 unique routes

On-Time Performance#

Analyzing the delay data reveals interesting patterns:

| Metric | Value | |--------|-------| | Average Delay | 2.3 minutes | | On-Time (±1 min) | 45.2% | | Slight Delay (1-5 min) | 28.7% | | Moderate (5-15 min) | 18.4% | | Severe (over 15 min) | 7.7% |

Nearly half of all buses are on-time, but over 7% have severe delays of 15+ minutes.

Geographic Patterns#

The heatmap visualization clearly shows:

  • Highest density: Dublin City Centre (O'Connell Street area)
  • Major corridors: Towards Dublin Airport, Dun Laoghaire, Tallaght
  • Lower coverage: Western suburbs and industrial areas

Step 4: Building the Dashboard#

I used Streamlit for the interactive dashboard. Key features:

Real-Time Map with Plotly#

Plain Text
1def create_map(positions_df):
2 latest = positions_df.groupby('vehicle_id').last().reset_index()
3
4 fig = px.scatter_mapbox(
5 latest,
6 lat='latitude',
7 lon='longitude',
8 color='route_id',
9 hover_name='vehicle_id',
10 zoom=10,
11 height=600
12 )
13
14 fig.update_layout(
15 mapbox_style="carto-darkmatter", # Dark theme
16 mapbox=dict(center=dict(lat=53.35, lon=-6.26))
17 )
18
19 return fig

Performance Gauge#

Plain Text
1def create_delay_gauge(on_time_pct):
2 fig = go.Figure(go.Indicator(
3 mode="gauge+number",
4 value=on_time_pct,
5 title={'text': "On-Time Performance"},
6 gauge={
7 'axis': {'range': [0, 100]},
8 'bar': {'color': "#10b981"},
9 'steps': [
10 {'range': [0, 50], 'color': 'rgba(239, 68, 68, 0.3)'},
11 {'range': [50, 80], 'color': 'rgba(245, 158, 11, 0.3)'},
12 {'range': [80, 100], 'color': 'rgba(16, 185, 129, 0.3)'}
13 ]
14 }
15 ))
16 return fig

Lessons Learned#

1. GTFS-RT is Well-Designed#

The spec is thorough and consistent across transit agencies worldwide. Skills learned here transfer to any city's transit data.

2. SQLite is Underrated#

For this scale (millions of records), SQLite performs excellently:

  • No server to manage
  • Single file to backup
  • Indexes make queries fast
  • Perfect for development and small-scale production

3. Timestamps Need Care#

GTFS-RT uses Unix timestamps (seconds since 1970). Always convert to datetime for analysis:

Plain Text
1df['datetime'] = pd.to_datetime(df['timestamp'], unit='s')

4. Rate Limiting Matters#

TFI's API is generous, but I still implemented:

  • 30-second minimum between requests
  • Exponential backoff on errors
  • Request timeouts

What's Next#

This project could be extended in several ways:

  1. Predictive Modeling: Train an ML model to predict delays based on time, route, and weather
  2. Historical Analysis: Collect data over weeks to identify patterns
  3. Real-Time Alerts: Notify when specific routes are delayed
  4. Cloud Deployment: Move to AWS Lambda + S3 for scalability

Try It Yourself#

The complete source code is available on GitHub:

Plain Text
1git clone https://github.com/mayankgulaty/mycodingjourney
2cd projects/dublin-bus-pipeline
3
4# Install dependencies
5pip install -r requirements.txt
6
7# Get your API key from https://developer.nationaltransport.ie
8echo "TFI_API_KEY=your_key_here" > .env
9
10# Collect data
11python src/data_collector.py --once
12
13# Run the dashboard
14streamlit run app.py

Conclusion#

Building this pipeline took about 3 hours from start to working dashboard. The key was:

  1. Start with the data: Understand what's available before writing code
  2. Keep it simple: SQLite + Python + Streamlit is powerful enough
  3. Iterate quickly: Get something working, then improve

Public transit data is a goldmine for data engineering projects. It's real, it's messy, and it tells interesting stories about how cities move.


Questions about this project? Reach out on LinkedIn or GitHub.

Mayank Gulaty

Written by Mayank Gulaty

Senior Data Engineer with 8+ years of experience at Citi and Nagarro, specializing in building petabyte-scale data pipelines and cloud-native architectures. I combine deep data engineering expertise with full-stack development skills to create end-to-end solutions.