SGData Platform

Status: Active | Source code: https://github.com/second-order-ai/sgdata_platform

A Kedro-based data platform for sourcing, processing, and analysing Singapore geospatial and government data. It ingests data from five distinct sources, processes each independently through Kedro pipelines, and converges everything into a unified spatial-relationship layer.

Data Sources

| Source | What it provides | Scale |
| --- | --- | --- |
| data.gov.sg | Singapore government open data catalogue + downloads | 5,114 datasets, 467 GeoParquet files |
| Foursquare Open Source Places | Commercial POI dataset for Singapore | 435K places, 434K geocoded |
| OpenStreetMap | Buildings, roads, POIs, land use, boundaries | 155K buildings, 252K road segments |
| Overture Maps | Buildings, places, addresses, transport, divisions | 311K buildings, 134K places, 141K addresses |
| Microsoft Global ML Building Footprints | AI-derived building polygons | 123K building footprints |

Architecture

The platform is structured as a uv workspace with seven independent Kedro sub-projects, each responsible for one data source or processing stage:

  1. datagovsg_scraping: scrapes the data.gov.sg API and downloads files using a RabbitMQ producer/consumer pipeline with Playwright and httpx.
  2. datagovsg-scraping-processing: deduplicates, identifies spatial datasets, geocodes postcode columns, and converts downloads to GeoParquet.
  3. foursquare-processing: downloads and processes Foursquare Open Source Places from HuggingFace, including quality scoring and geocoding.
  4. osm-processing: extracts Singapore features from a local .osm.pbf file using Pyrosm across 21 attribute tags and 6 special functions.
  5. overturemaps-processing: downloads all 15 Overture Maps feature types for Singapore’s bounding box.
  6. global-ml-building-footprints-processing: downloads Microsoft building footprints from Azure Blob Storage across multiple quadkey tiles.
  7. spatial-relationships: the convergence layer. Computes pairwise spatial overlap metadata between every dataset combination (24,099 records).
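
To make the convergence step concrete, here is a minimal sketch of pairwise overlap metadata. It is an illustration under stated assumptions, not the project's implementation: it compares only bounding boxes, uses the standard library alone, and the names `BBox`, `bbox_overlap_area`, `pairwise_overlaps`, and the sample dataset names are all hypothetical.

```python
from itertools import combinations
from typing import NamedTuple


class BBox(NamedTuple):
    """Axis-aligned bounding box (hypothetical helper type)."""
    min_x: float
    min_y: float
    max_x: float
    max_y: float


def bbox_overlap_area(a: BBox, b: BBox) -> float:
    """Return the area shared by two bounding boxes, or 0.0 if they are disjoint."""
    width = min(a.max_x, b.max_x) - max(a.min_x, b.min_x)
    height = min(a.max_y, b.max_y) - max(a.min_y, b.min_y)
    return max(width, 0.0) * max(height, 0.0)


def pairwise_overlaps(extents: dict[str, BBox]) -> list[dict]:
    """Build one metadata record per unordered pair of datasets."""
    records = []
    # combinations() yields each unordered pair exactly once.
    for (name_a, box_a), (name_b, box_b) in combinations(sorted(extents.items()), 2):
        area = bbox_overlap_area(box_a, box_b)
        records.append(
            {"left": name_a, "right": name_b, "overlaps": area > 0, "overlap_area": area}
        )
    return records


# Made-up extents purely for illustration.
extents = {
    "dataset_a": BBox(0, 0, 4, 4),
    "dataset_b": BBox(2, 2, 6, 6),
    "dataset_c": BBox(10, 10, 12, 12),
}
records = pairwise_overlaps(extents)
for record in records:
    print(record)
```

A real pipeline would more likely compare actual geometries (for example with a spatial join) rather than bounding boxes; the sketch only shows the shape of a pairwise record.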

All sub-projects share a single data/ directory and follow Kedro conventions (PartitionedDataset, catalog globals, tagged nodes, parameter files per pipeline).

Scale

| Dataset | Raw | Processed |
| --- | --- | --- |
| data.gov.sg catalogue | 5,114 datasets | 4,618 deduplicated |
| data.gov.sg geospatial downloads | 638 files (~3.5 GB) | 467 GeoParquet files |
| Foursquare Singapore places | 435,239 records | 434,821 geocoded POIs |
| OSM Singapore | 1 .osm.pbf file | 27 GeoParquet files |
| Overture Maps | Singapore bounding box | 15 feature-type GeoParquet files |
| Microsoft Building Footprints | Multiple quadkey tiles | 1 combined GeoParquet |
| Spatial relationships | 511 input datasets | 24,099 pairwise overlap records |
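
The Microsoft Building Footprints entry refers to quadkey tiles. As background, the sketch below shows how a quadkey can be derived from a latitude/longitude pair using the Bing Maps tile system. It assumes the footprints are partitioned with that scheme; the zoom level of 9, the sample coordinate, and the function names `latlon_to_tile` and `tile_to_quadkey` are illustrative assumptions, not taken from the project.

```python
import math

MAX_LAT = 85.05112878  # Web Mercator latitude limit used by the Bing Maps tile system


def latlon_to_tile(lat: float, lon: float, zoom: int) -> tuple[int, int]:
    """Convert WGS84 coordinates to tile X/Y indices at the given zoom level."""
    lat = max(min(lat, MAX_LAT), -MAX_LAT)
    n = 1 << zoom  # number of tiles along each axis
    x = (lon + 180.0) / 360.0
    sin_lat = math.sin(math.radians(lat))
    y = 0.5 - math.log((1 + sin_lat) / (1 - sin_lat)) / (4 * math.pi)
    tile_x = min(max(int(x * n), 0), n - 1)
    tile_y = min(max(int(y * n), 0), n - 1)
    return tile_x, tile_y


def tile_to_quadkey(tile_x: int, tile_y: int, zoom: int) -> str:
    """Interleave the bits of tile X and Y into a base-4 quadkey string."""
    digits = []
    for level in range(zoom, 0, -1):
        mask = 1 << (level - 1)
        digit = 0
        if tile_x & mask:
            digit += 1
        if tile_y & mask:
            digit += 2
        digits.append(str(digit))
    return "".join(digits)


# Approximate centre of Singapore; zoom 9 is an assumption for illustration.
tile_x, tile_y = latlon_to_tile(1.35, 103.82, 9)
print(tile_to_quadkey(tile_x, tile_y, 9))
```

Because a quadkey's prefix identifies its parent tile, tiles covering a bounding box can be enumerated by converting the box corners to tile indices and iterating over the resulting X/Y range.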