Skip to content

SQL Queries with DuckDB

Cookiecutter-Spatial-Data-Science encourages a particular pattern for SQL-heavy work, particularly with DuckDB: keep query text in discrete .sql files inside a dedicated sql/ sub-package, and load them at runtime with a small helper.

This page describes the rationale, the recommended layout, and the patterns for using DuckDB's spatial extension — including in restricted-network and air-gapped environments.

DuckDB ships with ArcGIS Pro

ArcGIS Pro's bundled Python environment includes the duckdb package out of the box, so projects can use DuckDB in-place alongside arcpy — no extra installation, no separate environment. Just import duckdb and go. The spatial extension, however, is fetched on first use (see below).

Why externalize SQL?

  • IDE support. VS Code (and most editors) provide rich SQL syntax highlighting, formatting, and linting (e.g. SQLTools, SQLFluff) for .sql files. None of that works on SQL embedded inside a Python triple-quoted string.
  • Readable diffs. Long, multi-CTE queries diff cleanly when they live on their own.
  • No escaping pain. No quote-escaping, backslash gymnastics, or f-string {} collisions with SQL or Jinja syntax.
  • Encourages parameterization. Externalizing naturally pushes you toward bind parameters ($name, ?) rather than string interpolation — which sidesteps SQL injection entirely.
  • Reusable. The same .sql file can be consumed by Python, a notebook, dbt, or a CLI without copy-paste.

File layout

Add a sql/ package under your project's source tree:

src/myproject/
└── sql/
    ├── __init__.py        # exposes load_sql()
    ├── nearest_poi.sql
    └── h3_aggregate.sql

Naming: snake_case.sql, named after the operation (spatial_join_blocks.sql, build_h3_index.sql), not the table.

When to externalize vs. inline

Inline (in Python) Externalize (.sql file)
Short ad-hoc queries (≤ ~10 lines) Multi-statement or multi-CTE queries
No user-supplied input Anything parameterized
Used in exactly one place Reused in more than one place
Trivial (SELECT COUNT(*) FROM t) Anything you'd want to lint or format

A small load_sql() helper

A minimal helper that reads a .sql file from the sql/ resource directory and caches the result for the lifetime of the process:

# src/myproject/sql/__init__.py
from __future__ import annotations

from functools import lru_cache
from importlib.resources import files


@lru_cache(maxsize=None)
def load_sql(name: str) -> str:
    """Load a `.sql` file shipped inside this `sql/` resource directory.

    Pass the file *stem* (no `.sql` extension). Always pass user-supplied or
    runtime values via DuckDB bind parameters rather than interpolating them
    into the returned string.
    """
    resource = files(__name__).joinpath(f"{name}.sql")
    if not resource.is_file():
        raise FileNotFoundError(f"SQL resource '{name}.sql' not found.")
    return resource.read_text(encoding="utf-8")

Using it:

import duckdb
from myproject.sql import load_sql

con = duckdb.connect()

df = con.execute(
    load_sql("nearest_poi"),
    {"max_distance_m": 1500, "category": "grocery"},
).fetch_df()

Note

importlib.resources.files() works correctly whether the package is installed editable (pip install -e .), installed normally, or executed from inside a wheel.

Always parameterize

Use DuckDB's named-parameter syntax ($name) and pass a dict to execute(). The DuckDB driver handles type conversion and quoting safely.

-- src/myproject/sql/nearest_poi.sql
SELECT
    poi_id,
    ST_Distance(origin_geom, poi_geom) AS distance_m
FROM
    pois
WHERE
    category = $category
    AND ST_Distance(origin_geom, poi_geom) <= $max_distance_m
ORDER BY
    distance_m
LIMIT
    $limit;
df = con.execute(
    load_sql("nearest_poi"),
    {"category": "grocery", "max_distance_m": 1500, "limit": 10},
).fetch_df()

Never f-string your SQL

Do not build SQL with f-strings or %-formatting, even for "trusted" internal inputs. Today's internal value is tomorrow's user input.

# BAD — vulnerable to SQL injection
con.execute(f"SELECT * FROM pois WHERE category = '{category}'")

# GOOD — bind parameter, type-safe and injection-safe
con.execute(
    "SELECT * FROM pois WHERE category = $category",
    {"category": category},
)

When you can't bind a value

DuckDB (like all SQL engines) only allows binding values, not identifiers (table names, column names) or file paths inside table-valued functions like read_parquet(...). For those cases:

  1. Validate the value against an allow-list of known-safe strings, then
  2. Substitute it via Python str.format, an f-string on the template (not the value), or a templating engine such as Jinja.
ALLOWED_TABLES = {"pois", "blocks", "trips"}


def query_table(con, table: str, limit: int):
    if table not in ALLOWED_TABLES:
        raise ValueError(f"Unknown table: {table!r}")
    sql = load_sql("select_from_table").format(table=table)  # safe: validated
    return con.execute(sql, {"limit": limit}).fetch_df()

Configuration over hardcoding

Paths to Parquet datasets, working CRS WKIDs, H3 resolutions, and similar values belong in your project's configuration (e.g. config/config.yml) and should be passed in as bind parameters — not baked into the .sql file.

from myproject.config import config
from myproject.sql import load_sql

df = con.execute(
    load_sql("load_partitioned"),
    {
        "parquet_path": config.data.points_parquet,
        "h3_resolution": config.spatial.h3_resolution,
    },
).fetch_df()

Spatial Extension

DuckDB's spatial capabilities (ST_* functions, GeoParquet I/O, reprojection) are provided by the official spatial extension. It is not bundled with the core engine — each connection that needs it must INSTALL (once per machine) and LOAD (once per connection) the extension.

One-time install, per-connection load

import duckdb

con = duckdb.connect()
con.install_extension("spatial")   # downloads & caches; safe to call repeatedly
con.load_extension("spatial")      # makes ST_* functions available in this connection

In a project, prefer wrapping this in a small helper so every entry point gets a consistently configured connection:

# src/myproject/db.py
from __future__ import annotations

import duckdb


def get_spatial_connection(database: str = ":memory:") -> duckdb.DuckDBPyConnection:
    """Return a DuckDB connection with the spatial extension loaded."""
    con = duckdb.connect(database)
    con.install_extension("spatial")
    con.load_extension("spatial")
    return con

Common spatial operations

-- src/myproject/sql/buffer_and_count.sql
--
-- Count points within a buffer of each origin, in a projected CRS.
WITH origins_proj AS (
    SELECT
        id,
        ST_Transform(geom, 'EPSG:4326', $working_crs) AS geom
    FROM
        ST_Read($origins_path)
),
points_proj AS (
    SELECT
        ST_Transform(geom, 'EPSG:4326', $working_crs) AS geom
    FROM
        ST_Read($points_path)
)
SELECT
    o.id,
    COUNT(p.geom) AS n_within
FROM
    origins_proj AS o
LEFT JOIN
    points_proj AS p
    ON ST_Intersects(ST_Buffer(o.geom, $buffer_m), p.geom)
GROUP BY
    o.id;
from myproject.config import config
from myproject.sql import load_sql
from myproject.db import get_spatial_connection

con = get_spatial_connection()

df = con.execute(
    load_sql("buffer_and_count"),
    {
        "origins_path": str(config.data.origins_geoparquet),
        "points_path": str(config.data.points_geoparquet),
        "working_crs": config.spatial.crs_working,   # e.g. "EPSG:3857"
        "buffer_m": 1500,
    },
).fetch_df()

Reading and writing spatial formats

The spatial extension exposes ST_Read() (powered by GDAL) and COPY ... TO ... (FORMAT GDAL, ...), giving you read access to most vector formats — Shapefile, GeoPackage, GeoJSON, FlatGeobuf, file geodatabase, and GeoParquet.

-- Read GeoParquet directly (preferred for large datasets)
SELECT * FROM read_parquet($parquet_path);

-- Read any GDAL-supported vector format
SELECT * FROM ST_Read($shapefile_path);

-- Write GeoParquet
COPY (SELECT * FROM origins) TO 'out.parquet' (FORMAT PARQUET);

-- Write any GDAL-supported format
COPY (SELECT * FROM origins) TO 'out.gpkg' (FORMAT GDAL, DRIVER 'GPKG');

Prefer GeoParquet for intermediate spatial data

GeoParquet preserves geometry as binary WKB plus a CRS in the file metadata, is columnar, splittable, and queryable in-place by DuckDB without loading the whole file into memory. Use it for everything between data/raw/ and final delivery.

CRS handling

Always work in a projected CRS (e.g. Web Mercator, a UTM zone, or a local equidistant projection) when computing distances, areas, or buffers. ST_Transform accepts authority codes ('EPSG:4326') or full WKT/PROJ strings. Read the working CRS from configuration — never hardcode it inside a .sql file.

Restricted-network and air-gapped environments

install_extension("spatial") downloads the extension binary once per (DuckDB version, platform) and caches it on disk. After that, every load_extension("spatial") call — across all connections, all databases, and all processes — reads from that cache. The download is not repeated per database or per connection.

The cache lives at:

Platform Path
Windows %USERPROFILE%\.duckdb\extensions\<version>\<platform>\
macOS / Linux ~/.duckdb/extensions/<version>/<platform>/

Because <version> is part of the path, upgrading DuckDB (e.g. when ArcGIS Pro ships a new bundled duckdb) triggers a one-time re-download on first use.

On networks where https://extensions.duckdb.org/ is blocked, choose one of the following patterns:

1. Pre-seed the cache during onboarding

Run con.install_extension("spatial") once on a machine with internet access, then copy the resulting ~/.duckdb/extensions/ directory to user profiles on locked-down machines (via group policy, MDM, or a login script). The cache is just files — no DuckDB process needs to be running on the target machine.

2. Host extensions on an internal mirror

DuckDB supports a custom extension repository:

SET custom_extension_repository = 'https://extensions.internal.example.com';
INSTALL spatial;

IT mirrors the contents of https://extensions.duckdb.org/ (or just the extensions you need) onto an internal HTTPS host once. Document the internal URL in your project config and apply it in get_spatial_connection():

# src/myproject/db.py
from myproject.config import config
import duckdb


def get_spatial_connection(database: str = ":memory:") -> duckdb.DuckDBPyConnection:
    con = duckdb.connect(database)
    repo = getattr(config.duckdb, "extension_repository", None)
    if repo:
        con.execute(f"SET custom_extension_repository = '{repo}';")
    con.install_extension("spatial")
    con.load_extension("spatial")
    return con

3. Bundle the extension with the project (most reliable for air-gap)

Commit the .duckdb_extension binary into the repository (or distribute it with your release artifact) and LOAD it by absolute path — bypassing INSTALL entirely:

vendor/
└── duckdb-extensions/
    └── v1.1.3/
        └── windows_amd64/
            └── spatial.duckdb_extension
from pathlib import Path
import duckdb
from myproject.config import config


def get_spatial_connection(database: str = ":memory:") -> duckdb.DuckDBPyConnection:
    con = duckdb.connect(database)
    ext_path = Path(config.duckdb.spatial_extension_path)  # absolute path on disk
    con.execute(f"LOAD '{ext_path.as_posix()}';")
    return con

Tip

To allow loading extensions from arbitrary paths, you may need SET allow_unsigned_extensions = true; before the LOAD — the bundled binary is signed by DuckDB Labs, so this is usually unnecessary, but worth knowing if you build the extension yourself.

Verifying the cache state

# Windows
Test-Path "$env:USERPROFILE\.duckdb\extensions"
Get-ChildItem "$env:USERPROFILE\.duckdb\extensions" -Recurse -Filter "spatial*"
# macOS / Linux
ls -R ~/.duckdb/extensions/ 2>/dev/null | grep spatial

If the file is present, load_extension("spatial") will succeed offline.

Package-data setup

.sql files are shipped with the wheel via pyproject.toml:

[tool.setuptools.package-data]
"myproject.sql" = ["*.sql"]

If you add a sub-directory under sql/ (e.g. sql/spatial/), add a matching entry:

[tool.setuptools.package-data]
"myproject.sql"         = ["*.sql"]
"myproject.sql.spatial" = ["*.sql"]

…and create an __init__.py inside the new directory so it is a proper sub-package.