
Analyzing GTFS Realtime Data for Public Transport Insights

In today’s post, we (that is, Gaspard Merten from Université Libre de Bruxelles and yours truly) are going to dive deep into how to analyze public transport data, using both schedule and real-time information. This collaboration has been made possible by the EMERALDS project.

Previously, I shared news about GTFS algorithms for Trajectools that add GTFS preprocessing tools (incl. route, segment, and stop layer extraction) to the QGIS Processing toolbox.

Today, we’ll discuss how to handle realtime GTFS data and how we approach analytics that combine both data sources.

About Realtime GTFS 

Many of us have come to rely on real-time public transport updates in apps like Google Maps. These apps are powered by standardized data formats that ensure different systems can communicate. Google first introduced GTFS in 2005, a format designed to organize transit schedules, stop locations, and other static transit information. Then, in 2011, they introduced GTFS Realtime (GTFS-RT), which added the capability to include live updates on vehicle positions, delays, speeds, and much more.

However, as the name suggests, GTFS Realtime is all about live data. This means that while GTFS-RT APIs are useful for providing real-time insights, they don’t hold historical data for analytics. Moreover, most transit agencies don’t keep past GTFS-RT records, and even fewer make them available to the public. This can be a significant challenge for anyone looking to analyze past trends and extract valuable insights from the data. For this reason, we had to implement our own solution to efficiently archive GTFS-RT files while making sure the files could be queried easily.

There are two main challenges in the implementation of such a solution:

  • Data Volume: While individual GTFS-RT files are relatively small—typically ranging from 50 KB to 500 KB depending on the public transport network size—the challenge lies in ingestion frequency. With an average file size of 100 KB and updates every 5 seconds, a full day’s worth of data quickly scales up to 1.728 GB (see the quick calculation after this list).
  • Data Usability: GTFS-RT is a deeply nested format based on Protobuf, making direct conversion into a more accessible structure like a DataFrame difficult. Efficiently unnesting the data without losing critical details would significantly improve usability and streamline analysis.
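For a quick sanity check of that volume figure, here is the back-of-the-envelope calculation in Python, assuming the 100 KB average file size and 5-second update interval quoted above:

avg_file_size_kb = 100                   # average snapshot size assumed above
snapshots_per_day = 24 * 60 * 60 // 5    # one fetch every 5 seconds -> 17,280 files
daily_volume_gb = snapshots_per_day * avg_file_size_kb / 1_000_000
print(f"{snapshots_per_day} files/day ≈ {daily_volume_gb:.3f} GB/day")   # 1.728 GB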

Parquet to the Rescue

Storing and analyzing real-time transit data efficiently isn’t just about saving space—it’s about making the data easy to work with. Luckily, modern data formats have come a long way, allowing us to store massive amounts of data while keeping retrieval and analytics processing fast. One of the best tools for the job is Apache Parquet, a columnar storage format originally designed for Hadoop but now widely adopted in data science. With built-in support in libraries like Polars and Pandas, it’s become a go-to choice for handling large datasets efficiently. Moreover, Parquet can be converted to GeoParquet for smoother integration with GIS tools such as GeoPandas.

What makes Parquet particularly well-suited for GTFS Realtime data is the way it compresses columnar data. It leverages multiple compression algorithms and encodings, significantly reducing file sizes while keeping access speeds high. However, to get the most out of Parquet’s compression, we need to be smart about how we structure our data. Simply converting each GTFS-RT file into its own Parquet file might give us around 60% compression, which is decent. But if we group all GTFS-RT records for an entire hour into a single file, we can push that number up to 95%. The reason? A lot of transit data—like trip IDs and stop locations—doesn’t change much within an hour, while other values, such as coordinates, often share common elements. By organizing data in larger batches, we allow Parquet’s compression algorithms to work their magic, drastically reducing storage needs. And with a smaller disk footprint, retrieval is faster, making the entire analytics pipeline more efficient.
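To make the batching effect concrete, here is a minimal sketch (pandas with a Parquet engine such as pyarrow; the snapshot contents are made up) contrasting one file per snapshot with one batched hourly file:

import pandas as pd

# toy stand-in for the 720 snapshots collected in one hour (one every 5 seconds)
snapshots = [
    pd.DataFrame({
        "trip_id": ["trip_1", "trip_2"],            # rarely changes within an hour
        "latitude": [56.95 + i * 1e-4, 56.9601],    # successive values share common digits
        "longitude": [24.1052, 24.11 + i * 1e-4],
        "timestamp": [1714552800 + i * 5] * 2,
    })
    for i in range(720)
]

# option 1: one small Parquet file per snapshot
for i, df in enumerate(snapshots):
    df.to_parquet(f"individual_{i}.parquet")

# option 2: one batched file per hour; repeated values and similar columns
# give the columnar encodings much more to work with
pd.concat(snapshots).to_parquet("hourly.parquet")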

One more challenge to tackle is the structure of the data itself. GTFS-RT files tend to be highly nested, which isn’t an issue for Parquet but can be problematic for most data science tools. While Parquet technically supports nested structures, many analytical frameworks don’t handle them well. To fix this, we apply a lightweight preprocessing step to “unnest” the data. In the original GTFS-RT format, the vehicle position feed is deeply nested, making it difficult to work with. But once unnesting is applied, the structure becomes flat, with clear column names derived from the original hierarchy. This makes it easy to convert the data into a table format, ensuring smooth integration with tools commonly used by data scientists.
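As an illustration of this unnesting step (a sketch, not the exact code used in the pipeline), pandas.json_normalize can flatten a nested vehicle position record into columns named after the original hierarchy:

import pandas as pd

# one simplified, made-up vehicle position entity after the protobuf -> JSON conversion
entity = {
    "id": "vehicle_123",
    "vehicle": {
        "trip": {"trip_id": "AUTO3-18-1", "route_id": "riga_bus_3"},
        "position": {"latitude": 56.9496, "longitude": 24.1052, "speed": 8.3},
        "timestamp": 1714552800,
    },
}

# flatten the nested structure; column names preserve the original hierarchy
df = pd.json_normalize(entity, sep=".")
print(df.columns.tolist())
# e.g. ['id', 'vehicle.trip.trip_id', 'vehicle.trip.route_id', 'vehicle.position.latitude', ...]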

The GTFS-RT Pipelines

With this in mind, let’s walk through the two pipelines we built to store and retrieve GTFS-RT data efficiently.

The entire system relies on two key pipelines that work together. The first pipeline fetches GTFS-RT data from an API every five seconds, processes it, and stores it in an S3 bucket. The second pipeline runs hourly, gathering all the individual files from the past hour, merging them into a single Parquet file, and saving it back to the bucket in a structured format. We will now take a look at each pipeline in more detail.

Pipeline 1: Fetching and Storing Data

The first step in the process is retrieving GTFS-RT data. This is done via an API, which returns files in the Protocol Buffer (ProtoBuf) format. Fortunately, Google provides libraries (such as gtfs-realtime-bindings) that make it easy to parse ProtoBuf and convert it into a more accessible format like JSON. 
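A minimal sketch of this fetching step (the feed URL is a placeholder; gtfs-realtime-bindings provides the FeedMessage definition and protobuf’s MessageToDict does the JSON-style conversion):

import requests
from google.transit import gtfs_realtime_pb2
from google.protobuf.json_format import MessageToDict

FEED_URL = "https://example.com/gtfs-rt/vehicle-positions"  # placeholder URL

# fetch the raw protobuf payload and parse it with the GTFS-RT bindings
response = requests.get(FEED_URL, timeout=10)
feed = gtfs_realtime_pb2.FeedMessage()
feed.ParseFromString(response.content)

# convert the parsed message into plain Python dicts (a JSON-like structure)
feed_dict = MessageToDict(feed, preserving_proto_field_name=True)
entities = feed_dict.get("entity", [])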

Once we have the data in JSON format, we need to split it based on entity type. GTFS-RT files contain different types of data, such as TripUpdate, which provides updated arrival times for stops, and VehiclePosition, which tracks real-time locations and speeds. Not all GTFS-RT feeds contain every entity type, but TripUpdate and VehiclePosition are the most commonly used. The full list of entity types can be found in the GTFS Realtime documentation.

We separate entity types because they have different schemas, making it difficult to store them in a single Parquet file. Keeping each entity type separate not only improves organization but also enhances compression efficiency. Once split, we apply the same unnesting process as described earlier, ensuring the data is structured in a way that’s easy to analyze. After that, we convert the data into a data frame and store it as a Parquet file in memory before uploading it to an S3 bucket. The files follow a structured naming convention like this:

{feed_type}/YYYY-MM-DD/hour/individual_{date-isoformat}.parquet

This format makes it easy to navigate the storage bucket manually while also ensuring seamless integration with the second pipeline.
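A sketch of the split-and-upload step could look like this (the bucket name and the toy entity are assumptions; boto3 and pandas stand in for whatever libraries the actual pipeline uses):

import io
from datetime import datetime, timezone

import boto3
import pandas as pd

s3 = boto3.client("s3")
BUCKET = "my-gtfs-rt-archive"  # placeholder bucket name
now = datetime.now(timezone.utc)

# parsed entities from the fetching step above (a single toy record here)
entities = [{"id": "1", "vehicle": {"trip": {"trip_id": "AUTO3-18-1"},
                                    "position": {"latitude": 56.95, "longitude": 24.11},
                                    "timestamp": "1714552800"}}]

# group entities by feed type (trip_update, vehicle, ...), since their schemas differ
by_type = {}
for entity in entities:
    for feed_type in ("trip_update", "vehicle", "alert"):
        if feed_type in entity:
            by_type.setdefault(feed_type, []).append(entity)

for feed_type, records in by_type.items():
    # unnest and convert to a flat DataFrame
    df = pd.json_normalize(records, sep=".")

    # write Parquet to an in-memory buffer and upload it following the naming convention
    buffer = io.BytesIO()
    df.to_parquet(buffer)
    key = f"{feed_type}/{now:%Y-%m-%d}/{now.hour}/individual_{now.isoformat()}.parquet"
    s3.put_object(Bucket=BUCKET, Key=key, Body=buffer.getvalue())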

Pipeline 2: Merging and Optimizing Storage

The second pipeline’s job is to take all the small Parquet files generated by Pipeline 1 and merge them into a single, optimized file per hour. To do this, it scans the storage bucket for the earliest unprocessed “hour folder” and begins processing from there. This design ensures that if the pipeline is temporarily interrupted, it can easily resume without skipping any data.

Once it identifies the files to merge, the pipeline loads them, assigns a proper timestamp to each record, and concatenates them into a single Parquet table. The final file is then uploaded to the S3 bucket using the following naming convention:

{feed_type}/YYYY-MM-DD/hour/HH.parquet

If any files fail to merge, they are renamed with the prefix unmerged_{date-isoformat}.parquet for manual inspection. After successfully storing the merged file, the pipeline deletes the individual files to keep storage clean and avoid unnecessary clutter.
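A simplified sketch of this hourly merge (placeholder bucket and prefix names; the real pipeline adds the error handling and unmerged_ renaming described above):

import io

import boto3
import pandas as pd

s3 = boto3.client("s3")
BUCKET = "my-gtfs-rt-archive"      # placeholder bucket name
prefix = "vehicle/2024-05-01/08/"  # example "hour folder" to merge

# list all individual snapshot files for that hour
listing = s3.list_objects_v2(Bucket=BUCKET, Prefix=prefix + "individual_")
keys = [obj["Key"] for obj in listing.get("Contents", [])]

frames = []
for key in keys:
    body = s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
    df = pd.read_parquet(io.BytesIO(body))
    # derive the snapshot timestamp from the file name and attach it to every record
    df["snapshot_time"] = key.split("individual_")[1].removesuffix(".parquet")
    frames.append(df)

# concatenate into one hourly Parquet file and upload it
merged = pd.concat(frames, ignore_index=True)
buffer = io.BytesIO()
merged.to_parquet(buffer)
s3.put_object(Bucket=BUCKET, Key=prefix + "08.parquet", Body=buffer.getvalue())

# remove the individual files once the merged file is safely stored
for key in keys:
    s3.delete_object(Bucket=BUCKET, Key=key)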

One critical advantage of converting GTFS-RT data into Parquet early in the process is that it prevents memory overload. If we had to merge raw GTFS-RT files instead of pre-converted Parquet files, we would likely run into memory constraints, especially on standard servers with limited RAM. This makes Parquet not just a storage solution but an enabler of efficient large-scale processing.

Ready for Analytics

In this section, we will explore how to use the GTFS-RT data for public transport analytics. Specifically, we want to compute delays, that is, the difference between the scheduled travel time and the real travel time. 

The previously created Parquet files can be loaded into QGIS as tables without geometries. To turn them into point layers, we use the “Create points layer from table” algorithm from the Processing “Vector creation” toolbox. And once we convert the Unix timestamps to datetimes (using the datetime_from_epoch function), we have a point layer that is ready for use in Trajectools.
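For readers who prefer scripting this outside of QGIS, an equivalent preparation with pandas and GeoPandas might look like this (the column names follow the unnested vehicle feed from the sketches above and may differ for other feeds):

import geopandas as gpd
import pandas as pd

# one merged hourly file downloaded from the bucket
df = pd.read_parquet("08.parquet")

# convert Unix timestamps to datetimes (the QGIS workflow uses datetime_from_epoch for this)
df["t"] = pd.to_datetime(df["vehicle.timestamp"].astype("int64"), unit="s")

# build a point layer from the longitude/latitude columns
gdf = gpd.GeoDataFrame(
    df,
    geometry=gpd.points_from_xy(df["vehicle.position.longitude"],
                                df["vehicle.position.latitude"]),
    crs="EPSG:4326",
)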

Let’s have a look at one bus route. Bus 3 is one of the busiest routes in Riga. We apply a filter to the point layer, which reveals the location of the route.

Computing segment travel times

Computing travel times on public transport segments, i.e. between two scheduled stops, comes with a couple of challenges:

  1. The GTFS-RT location updates are provided rather sparsely, with irregular reporting intervals, so we cannot be sure that we “see” every stop that happens.
  2. We cannot rely solely on stop detection since a vehicle will sometimes not come to a halt at a scheduled stop location (if nobody wants to get off or on).
  3. The stop ID, representing the next stop the vehicle will visit, is not always exact. Updates are often delayed and happen some time after the vehicle has passed the stop.

Here’s an example visualization of the stop ID information of a single trip of bus 3, overlaid on top of the GTFS route and stops (in red):

To compute the desired delays, we decided to compare GTFS-RT travel times based on stop ID info with the scheduled travel times. To get the GTFS-RT travel times, we use Trajectools and create trajectories by splitting at stop ID change using the Split by value change algorithm:
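The Trajectools algorithm wraps MovingPandas; building on the point GeoDataFrame from the earlier sketch, a rough standalone equivalent (column names are assumptions) could be:

import movingpandas as mpd

# gdf: the GTFS-RT point GeoDataFrame with a datetime column "t" (see above)
tc = mpd.TrajectoryCollection(gdf, traj_id_col="vehicle.trip.trip_id", t="t")

# split each trip into sub-trajectories whenever the reported next stop ID changes
split = mpd.ValueChangeSplitter(tc).split(col_name="vehicle.stop_id")
print(len(split))  # number of stop-to-stop segments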

Computing delays

The final step is to compute travel time differences between schedule and real time. For this, we implemented a SQL join that matches GTFS-RT trajectories with the corresponding entry in the GTFS schedule using route information and temporal information: 
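As an illustration only (the table and column names below are assumptions, not the actual schema), such a join could be expressed along these lines, here executed with DuckDB on toy data:

import duckdb
import pandas as pd

# toy stand-ins: real-time segment travel times vs. scheduled segment runtimes
rt_segments = pd.DataFrame({
    "route_id": ["riga_bus_3"], "segment_id": ["stopA-stopB"],
    "trip_id": ["AUTO3-18-1-240501-ab-2230"],
    "start_time": ["08:05:00"], "travel_time_s": [95],
})
schedule_segments = pd.DataFrame({
    "route_id": ["riga_bus_3"], "segment_id": ["stopA-stopB"],
    "window_start": ["07:00:00"], "window_end": ["09:00:00"], "runtime_s": [80],
})

delays = duckdb.sql("""
    SELECT rt.trip_id, rt.segment_id,
           rt.travel_time_s - sched.runtime_s AS delay_s
    FROM rt_segments AS rt
    JOIN schedule_segments AS sched
      ON rt.route_id = sched.route_id
     AND rt.segment_id = sched.segment_id
     AND rt.start_time BETWEEN sched.window_start AND sched.window_end
""").df()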

The temporal information is important since the schedule accounts for different travel times during peak and off-peak hours:

This information is extracted from the GTFS schedule using the Trajectools Extract segments algorithm, if we choose the “Add scheduled speeds” option:

This will add the time windows, speeds, and runtimes per segment to the resulting segment layer: 

Joining the GTFS-RT trajectories with the scheduled segment information, we compute delays for every segment and trip. For example, here are the resulting delays for trip ‘AUTO3-18-1-240501-ab-2230’: 

Red lines mark segments where time is lost compared to the schedule, while blue lines indicate that the vehicle traversed the segment faster than the schedule suggested.

What’s next

When interpreting the results, it is important to acknowledge the effects caused by the timing of the next-stop ID updates in the real-time GTFS feed. Sometimes these updates come very late and thus introduce distortions where one segment’s travel time appears too long and the next one’s too short.

We will continue refining the analytics and related libraries, including the QGIS Trajectools plugin, to facilitate analytics of GTFS-RT & GTFS.

After successful testing of this analytics approach in Riga, we aim to transfer it to other cities. But for this to work, public transport companies need ways to efficiently store their data and, ideally, to release them openly to allow for analysis.

The pipelines we described help keep storage needs low, which drastically reduces costs (a full year of data amounts to only a few gigabytes, which is inexpensive to keep in S3 storage). Let us know if you would be interested in an online platform on which one could register a GTFS-RT feed & GTFS, which would then automatically start being archived (in exchange, the provider would only need to agree to sharing the archives as open data, at no cost to them).

Learn More

Trajectools 2.4 release

In this new release, you will find new algorithms, default output styles, and other usability improvements, in particular for working with public transport schedules in GTFS format, including:

  • Added GTFS algorithms for extracting stops, fixes #43
  • Added default output styles for GTFS stops and segments c600060
  • Added Trajectory splitting at field value changes 286fdbd
  • Added option to add selected fields to output trajectories layer, fixes #53
  • Improved UI of the split by observation gap algorithm, fixes #36

Note: To use this new version of Trajectools, please upgrade your installation of MovingPandas to >= 0.21.2, e.g. using

import pip; pip.main(['install', '--upgrade', 'movingpandas'])

or

conda install movingpandas==0.21.2

Learn More