Compare and optimize BigQuery and Snowflake performance on GIS queries. Part 2.

by Vladi · 5 min read

Geometry-in-Polygon Query with dynamic geometry.

This second post in the series compares the performance of GIS queries on BigQuery and Snowflake. In the previous post, I tested a Geometry-in-Polygon query with constant geometry and showed that without Snowflake's Enterprise-only Search Optimization, BigQuery significantly outperforms Snowflake thanks to spatial clustering.

However, how often do we see constant geometry in real-life pipelines? In my experience, the geometry usually comes from another dataset.

Get all the roads in Berlin – spatial join.

This practical example covers a common geometry-in-polygon scenario, but this time I query the boundary geometry from one dataset and use it to filter data in another.

View interactive map

BigQuery SQL

WITH berlin_boundary AS (
    SELECT geometry
    FROM `bigquery-public-data.overture_maps.division_area`
    WHERE LOWER(names.primary) = 'berlin'
      AND country = 'DE'
)
SELECT s.id, s.geometry, s.class, s.subtype
FROM `bigquery-public-data.overture_maps.segment` s
JOIN berlin_boundary b
  ON ST_CONTAINS(b.geometry, s.geometry)
WHERE s.subtype = 'road'
  AND s.class IN ('primary', 'secondary', 'tertiary');

Snowflake SQL

WITH berlin_boundary AS (
    SELECT geometry
    FROM OVERTURE_MAPS__DIVISIONS.CARTO.DIVISION_AREA
    WHERE LOWER(names:primary) = 'berlin'
      AND country = 'DE'
)
SELECT s.id, s.geometry, s.class, s.subtype
FROM OVERTURE_MAPS__TRANSPORTATION.CARTO.SEGMENT s
JOIN berlin_boundary b
  ON ST_CONTAINS(b.geometry, s.geometry)
WHERE s.subtype = 'road'
  AND s.class IN ('primary', 'secondary', 'tertiary');

First results

Query/db  | Duration            | Bytes Scanned | Cost
BigQuery  | 3 seconds           | 86.81 GB      | $0.424
Snowflake | 1 minute 51 seconds | 82.53 GB      | $0.080

Query Plans

First observations

  • Both BigQuery and Snowflake perform full-table scans.
  • BigQuery, being highly parallel, completes the query much faster, though at a higher cost.
  • Snowflake takes longer to complete but at a lower cost.

Why does BigQuery perform a full scan?

In the previous example, with static geometry, BigQuery pruned partitions very efficiently using spatial clustering. Here, however, the planner decides to perform a join instead.

BigQuery commonly uses two join strategies:

  1. Broadcast join: used when joining a large table with a small one. The small table is sent to every slot that processes the large table.
  2. Hash join: used when joining two large tables. BigQuery hashes and shuffles both sides so that matching keys land on the same slot for a local join.

Given the small amount of data returned by the first query and the small amount of data shuffled (moved between nodes), BigQuery most likely used a broadcast join here: the small result of the boundary query is sent to every slot processing the large Segment table, leading to fast execution but a full table scan.
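One way to sanity-check this is the job's execution details, where BigQuery reports slot time and the stage breakdown. Below is a minimal sketch (not the exact audit query from this post) for pulling slot time and bytes processed for recent jobs; it assumes the US multi-region, so adjust the region qualifier for your project.

-- A sketch: slot time and bytes processed for the project's recent query jobs.
SELECT
  job_id,
  ROUND(total_slot_ms / 1000 / 60, 1) AS slot_minutes,          -- Total Slot Time
  ROUND(total_bytes_processed / POW(10, 9), 2) AS gb_processed  -- bytes scanned
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
  AND job_type = 'QUERY'
ORDER BY creation_time DESC;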

Can we help the BigQuery planner avoid the join?

Yes. With some hints to the planner, I was able to avoid the join and decrease Total Slot Time from 32 min 16 sec to 16 min 48 sec.

Here I try to give the planner every hint I can:

WITH berlin_boundary AS (
    SELECT geometry
    FROM `bigquery-public-data.overture_maps.division_area`
    WHERE LOWER(names.primary) = 'berlin'
      AND country = 'DE'
    LIMIT 1
)
SELECT s.id, s.geometry, s.class, s.subtype
FROM `bigquery-public-data.overture_maps.segment` s
WHERE s.subtype = 'road'
  AND s.class IN ('primary', 'secondary', 'tertiary')
  AND ST_CONTAINS((SELECT geometry FROM berlin_boundary LIMIT 1), s.geometry);

And indeed, there is no join in this case.

However, the query performs a full scan anyway: BigQuery did not appear to use spatial clustering in this case, and I did not find a way to make this query cheaper.

I’m aware of Michael Entin's post on how to split the query so it runs cheap, but in real-life scenarios, where queries have many CTEs, this kind of manual SQL optimization becomes too hard to maintain.

Can we help the Snowflake planner avoid an expensive join?

Yes. In the first query, although a filter was applied before the GeoJoin, it did not reduce the number of records, so the GeoJoin apparently ran against the full dataset, spilling 11 GB to disk along the way.
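Spilling is easy to confirm after the fact from the account's query history. A minimal sketch, assuming access to the SNOWFLAKE.ACCOUNT_USAGE share (its QUERY_HISTORY view can lag by up to 45 minutes):

-- A sketch: recent queries that spilled, with elapsed time and spill volumes.
SELECT
  query_id,
  total_elapsed_time / 1000 AS seconds,
  ROUND(bytes_spilled_to_local_storage / POW(1024, 3), 1)  AS gb_spilled_local,
  ROUND(bytes_spilled_to_remote_storage / POW(1024, 3), 1) AS gb_spilled_remote
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
WHERE bytes_spilled_to_local_storage > 0
ORDER BY start_time DESC
LIMIT 10;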

Let’s now give the planner some hints:

WITH berlin_boundary AS (
    SELECT geometry
    FROM OVERTURE_MAPS__DIVISIONS.CARTO.DIVISION_AREA
    WHERE LOWER(names:primary) = 'berlin'
      AND country = 'DE'
    LIMIT 1
)
SELECT s.id, s.geometry, s.class, s.subtype
FROM OVERTURE_MAPS__TRANSPORTATION.CARTO.SEGMENT s
WHERE s.subtype = 'road'
  AND s.class IN ('primary', 'secondary', 'tertiary')
  AND ST_CONTAINS((SELECT geometry FROM berlin_boundary), s.geometry);

The query plan now looks very different.

Snowflake now appears to apply the filters more effectively, reducing the table scan to 63.61 GB and cutting execution time.
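If you want to compare plans without spending credits, Snowflake's EXPLAIN compiles a query and prints the plan, including how many partitions it expects to scan. A minimal sketch against the same filters:

-- A sketch: compile the query without running it and print its plan,
-- including estimated partition pruning (partitionsAssigned vs. partitionsTotal).
EXPLAIN USING TEXT
SELECT s.id, s.geometry, s.class, s.subtype
FROM OVERTURE_MAPS__TRANSPORTATION.CARTO.SEGMENT s
WHERE s.subtype = 'road'
  AND s.class IN ('primary', 'secondary', 'tertiary');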

Can we make execution faster with a larger Snowflake warehouse?

Yes. With a Medium warehouse (4x the cost of XS), query execution time dropped to 20 seconds. The query runs faster, but not 4x faster, so the total cost ends up higher.
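For reference, resizing is a one-line change. A sketch, with MY_WH as a hypothetical placeholder for the warehouse used in this test:

-- The new size applies to subsequently submitted queries.
ALTER WAREHOUSE MY_WH SET WAREHOUSE_SIZE = 'MEDIUM';
-- ...rerun the optimized query here...
ALTER WAREHOUSE MY_WH SET WAREHOUSE_SIZE = 'XSMALL';  -- scale back down to control cost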

Final results

Query/db                                   | Duration            | Bytes Scanned        | Cost
BigQuery                                   | 3 seconds           | 86.81 GB             | $0.424
BigQuery, optimized SQL                    | 2 seconds           | 86.81 GB             | $0.424
Snowflake                                  | 1 minute 51 seconds | 82.53 GB             | $0.080
Snowflake, optimized SQL                   | 59 seconds          | 765.81 MB + 63.61 GB | $0.043
Snowflake, optimized SQL, Medium warehouse | 20 seconds          | 765.81 MB + 63.61 GB | $0.05777

Key Insights

  • BigQuery can skip spatial clustering in complex scenarios, leading to costly full table scans.
  • BigQuery is impressively fast, but bytes scanned could not be reduced without splitting the query; query optimization did halve the slot time.
  • Snowflake without hints scanned the entire dataset; the optimized query roughly halved both execution time and cost.
  • Snowflake’s XS warehouse is cost-efficient but slower; larger warehouses trade cost for speed.

Short summary – BigQuery is 10x faster, and Snowflake is 10x cheaper.

What is next?

In my next article, I’ll test Nearest Neighbor Search and Buffer Analysis, diving deeper into GIS performance on BigQuery and Snowflake.

Until then, explore my curated collection of SQL examples for creating maps with free public datasets on both platforms.

👉 Find the examples here