Skip to content
Draft
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
235 changes: 235 additions & 0 deletions quickstarts/duckdb.ipynb
Original file line number Diff line number Diff line change
@@ -0,0 +1,235 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Bringing Planetary Computer data into DuckDB",
"",
"This notebook reads the [Sentinel-2 L2A](https://planetarycomputer.microsoft.com/dataset/sentinel-2-l2a) STAC-GeoParquet snapshot with [DuckDB](https://duckdb.org/), which treats the Planetary Computer's snapshots as ordinary SQL tables. Key benefits:",
"",
"1. **Catalog as SQL**: the GeoParquet snapshot reads as a table, so you query it with `SELECT`, `WHERE`, and `GROUP BY`.",
"2. **Filter pushdown**: predicates push to Parquet row-group statistics and the bbox column accelerates spatial filters, so DuckDB skips files that can't match.",
"3. **Joins across sources**: join the catalog against remote GeoParquet, local files, and your own tables in one statement.",
"4. **No download step**: read the remote snapshot in place, with no extract-transform-load.",
"5. **Portable output**: export results straight to Parquet or GeoJSON for BI tools or visualization.",
"",
"We'll read low-cloud Sentinel-2 scenes over Portland, Oregon, join them against [Microsoft Building Footprints](https://planetarycomputer.microsoft.com/dataset/ms-buildings), and export the result.",
"",
"The companion [DuckDB tutorial](../overview/duckdb.md) has the full narrative.",
"",
"> **Note on coverage:** the Sentinel-2 GeoParquet snapshot covers 2015-07 through 2018-10, not the live archive. The examples below use 2017 dates. For current data, query the live STAC API."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Install"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%pip install --quiet duckdb pandas"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Connect and load extensions",
"",
"`spatial` gives you `ST_*` functions and native GeoParquet reads. Without it, geometry columns come back as opaque blobs. `azure` handles the storage authentication below. Use the Python client rather than the CLI so you can refresh the SAS token mid-session.",
"",
"**Expected result:** a `con` connection with both extensions loaded, no output printed."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import duckdb\n",
"\n",
"con = duckdb.connect()\n",
"con.execute(\"INSTALL spatial; LOAD spatial; INSTALL azure; LOAD azure;\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Authenticate to the Planetary Computer",
"",
"DuckDB's Azure extension reads signed Planetary Computer blobs directly. Fetch a container SAS from the token API, register a secret, and switch the Azure transport to curl.",
"",
"Two important details:",
"",
"- `azure_transport_option_type = 'curl'` is required. The default transport fails with an opaque SSL CA-certificate error.",
"- The `SCOPE` matters as soon as you add a second account (the join below). DuckDB routes each read to the secret whose scope is the longest matching prefix. Without scopes, a second Azure secret shadows the first and every read fails to authenticate.",
"",
"**Expected result:** a registered `pc` secret, no output printed."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"import urllib.request\n",
"\n",
"def container_sas(account, container):\n",
" url = f\"https://planetarycomputer.microsoft.com/api/sas/v1/token/{account}/{container}\"\n",
" return json.load(urllib.request.urlopen(url))[\"token\"]\n",
"\n",
"sas = container_sas(\"pcstacitems\", \"items\")\n",
"con.execute(\"SET azure_transport_option_type = 'curl';\")\n",
"con.execute(\n",
" \"CREATE SECRET pc (TYPE azure, PROVIDER config, ACCOUNT_NAME 'pcstacitems', \"\n",
" f\"CONNECTION_STRING 'BlobEndpoint=https://pcstacitems.blob.core.windows.net;SharedAccessSignature={sas}', \"\n",
" \"SCOPE 'az://items')\"\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Read STAC-GeoParquet as a table",
"",
"The snapshot is a directory of monthly Parquet files. Glob them and query as you would any table, pushing your spatial, temporal, and property filters into the read. Predicates push down to Parquet row-group statistics and the bbox column accelerates spatial filters, so DuckDB skips files and row groups that cannot match. The first query spends some time scanning the part files; later queries are quicker.",
"",
"**Expected result:** a DataFrame of a few low-cloud Sentinel-2 scenes over Portland, sorted by cloud cover."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"S2 = \"az://items/sentinel-2-l2a.parquet/*.parquet\"\n",
"\n",
"con.execute(f\"\"\"\n",
" SELECT id, datetime, \"eo:cloud_cover\"\n",
" FROM read_parquet('{S2}')\n",
" WHERE datetime BETWEEN '2017-07-01' AND '2017-08-01'\n",
" AND \"eo:cloud_cover\" < 20\n",
" AND ST_Intersects(geometry, ST_MakeEnvelope(-122.7, 45.5, -122.6, 45.6))\n",
" ORDER BY \"eo:cloud_cover\"\n",
"\"\"\").df()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Join against building footprints",
"",
"The point of pulling the catalog into SQL is to join it against other data. Here, scenes meet [Microsoft Building Footprints](https://planetarycomputer.microsoft.com/dataset/ms-buildings).",
"",
"The footprints live on a different storage account, so they need their own scoped secret. They are not public, despite being a widely used open dataset, which is exactly why the `SCOPE` from the auth step matters: each read routes to the secret whose scope is the longest matching prefix. The footprints are partitioned by quadkey; quadkey `21230223` covers Portland.",
"",
"**Expected result:** a `(building_scene_pairs, scenes)` tuple, with thousands of pairs across about 10 distinct scenes."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"bf_sas = container_sas(\"bingmlbuildings\", \"footprints\")\n",
"con.execute(\n",
" \"CREATE SECRET buildings (TYPE azure, PROVIDER config, ACCOUNT_NAME 'bingmlbuildings', \"\n",
" f\"CONNECTION_STRING 'BlobEndpoint=https://bingmlbuildings.blob.core.windows.net;SharedAccessSignature={bf_sas}', \"\n",
" \"SCOPE 'az://footprints')\"\n",
")\n",
"\n",
"BF = (\"az://footprints/delta/2023-04-25/ml-buildings.parquet/\"\n",
" \"RegionName=UnitedStates/quadkey=21230223/*.parquet\")\n",
"\n",
"con.execute(f\"\"\"\n",
" WITH buildings AS (\n",
" SELECT geometry, meanHeight\n",
" FROM read_parquet('{BF}')\n",
" WHERE ST_Within(geometry, ST_MakeEnvelope(-122.69, 45.51, -122.66, 45.53))\n",
" ),\n",
" scenes AS (\n",
" SELECT id, datetime, geometry\n",
" FROM read_parquet('{S2}')\n",
" WHERE datetime BETWEEN '2017-07-01' AND '2017-09-01'\n",
" AND \"eo:cloud_cover\" < 20\n",
" )\n",
" SELECT count(*) AS building_scene_pairs,\n",
" count(DISTINCT scenes.id) AS scenes\n",
" FROM buildings JOIN scenes\n",
" ON ST_Intersects(buildings.geometry, scenes.geometry)\n",
"\"\"\").fetchone()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Export results",
"",
"Cache the filtered Portland scenes in a table once, then write them out: Parquet (with `zstd` compression) for BI tools, GeoJSON (via the spatial extension's GDAL driver) for visualization.",
"",
"**Expected result:** `portland.parquet` and `portland.geojson` written to the working directory."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"con.execute(f\"\"\"\n",
" CREATE OR REPLACE TABLE portland_scenes AS\n",
" SELECT id, datetime, \"eo:cloud_cover\", geometry\n",
" FROM read_parquet('{S2}')\n",
" WHERE datetime BETWEEN '2017-07-01' AND '2017-08-01'\n",
" AND \"eo:cloud_cover\" < 20\n",
" AND ST_Intersects(geometry, ST_MakeEnvelope(-122.7, 45.5, -122.6, 45.6))\n",
"\"\"\")\n",
"con.execute(\"COPY portland_scenes TO 'portland.parquet' (COMPRESSION zstd)\")\n",
"con.execute(\n",
" \"COPY (SELECT id, datetime, geometry FROM portland_scenes) \"\n",
" \"TO 'portland.geojson' (FORMAT GDAL, DRIVER 'GeoJSON')\"\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## You're done",
"",
"If every cell ran, you read a remote STAC-GeoParquet snapshot as a SQL table, filtered it by space, time, and cloud cover, joined it against building footprints on a second storage account, and exported to Parquet and GeoJSON: remote GeoParquet \u2192 SQL filter and join \u2192 local files.",
"",
"Swap in your own bbox, collection, or a local CSV or Parquet of your own and the same query shape applies. To pull items into Python without writing SQL, see the [rustac tutorial](../overview/rustac.md). For the imagery itself, hand asset URLs from a result to the [async-geotiff tutorial](../overview/async-geotiff.md)."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"name": "python",
"version": "3.12.12"
}
},
"nbformat": 4,
"nbformat_minor": 5
}