microsoft · aboydnw · Jun 29, 2026 · Jul 1, 2026 · Jul 1, 2026 · Jul 1, 2026
diff --git a/quickstarts/duckdb.ipynb b/quickstarts/duckdb.ipynb
@@ -0,0 +1,235 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Bringing Planetary Computer data into DuckDB",
+    "",
+    "This notebook reads the [Sentinel-2 L2A](https://planetarycomputer.microsoft.com/dataset/sentinel-2-l2a) STAC-GeoParquet snapshot with [DuckDB](https://duckdb.org/), which treats the Planetary Computer's snapshots as ordinary SQL tables. Key benefits:",
+    "",
+    "1. **Catalog as SQL**: the GeoParquet snapshot reads as a table, so you query it with `SELECT`, `WHERE`, and `GROUP BY`.",
+    "2. **Filter pushdown**: predicates push to Parquet row-group statistics and the bbox column accelerates spatial filters, so DuckDB skips files that can't match.",
+    "3. **Joins across sources**: join the catalog against remote GeoParquet, local files, and your own tables in one statement.",
+    "4. **No download step**: read the remote snapshot in place, with no extract-transform-load.",
+    "5. **Portable output**: export results straight to Parquet or GeoJSON for BI tools or visualization.",
+    "",
+    "We'll read low-cloud Sentinel-2 scenes over Portland, Oregon, join them against [Microsoft Building Footprints](https://planetarycomputer.microsoft.com/dataset/ms-buildings), and export the result.",
+    "",
+    "The companion [DuckDB tutorial](../overview/duckdb.md) has the full narrative.",
+    "",
+    "> **Note on coverage:** the Sentinel-2 GeoParquet snapshot covers 2015-07 through 2018-10, not the live archive. The examples below use 2017 dates. For current data, query the live STAC API."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Install"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%pip install --quiet duckdb pandas"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Connect and load extensions",
+    "",
+    "`spatial` gives you `ST_*` functions and native GeoParquet reads. Without it, geometry columns come back as opaque blobs. `azure` handles the storage authentication below. Use the Python client rather than the CLI so you can refresh the SAS token mid-session.",
+    "",
+    "**Expected result:** a `con` connection with both extensions loaded, no output printed."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import duckdb\n",
+    "\n",
+    "con = duckdb.connect()\n",
+    "con.execute(\"INSTALL spatial; LOAD spatial; INSTALL azure; LOAD azure;\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Authenticate to the Planetary Computer",
+    "",
+    "DuckDB's Azure extension reads signed Planetary Computer blobs directly. Fetch a container SAS from the token API, register a secret, and switch the Azure transport to curl.",
+    "",
+    "Two important details:",
+    "",
+    "- `azure_transport_option_type = 'curl'` is required. The default transport fails with an opaque SSL CA-certificate error.",
+    "- The `SCOPE` matters as soon as you add a second account (the join below). DuckDB routes each read to the secret whose scope is the longest matching prefix. Without scopes, a second Azure secret shadows the first and every read fails to authenticate.",
+    "",
+    "**Expected result:** a registered `pc` secret, no output printed."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import json\n",
+    "import urllib.request\n",
+    "\n",
+    "def container_sas(account, container):\n",
+    "    url = f\"https://planetarycomputer.microsoft.com/api/sas/v1/token/{account}/{container}\"\n",
+    "    return json.load(urllib.request.urlopen(url))[\"token\"]\n",
+    "\n",
+    "sas = container_sas(\"pcstacitems\", \"items\")\n",
+    "con.execute(\"SET azure_transport_option_type = 'curl';\")\n",
+    "con.execute(\n",
+    "    \"CREATE SECRET pc (TYPE azure, PROVIDER config, ACCOUNT_NAME 'pcstacitems', \"\n",
+    "    f\"CONNECTION_STRING 'BlobEndpoint=https://pcstacitems.blob.core.windows.net;SharedAccessSignature={sas}', \"\n",
+    "    \"SCOPE 'az://items')\"\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Read STAC-GeoParquet as a table",
+    "",
+    "The snapshot is a directory of monthly Parquet files. Glob them and query as you would any table, pushing your spatial, temporal, and property filters into the read. Predicates push down to Parquet row-group statistics and the bbox column accelerates spatial filters, so DuckDB skips files and row groups that cannot match. The first query spends some time scanning the part files; later queries are quicker.",
+    "",
+    "**Expected result:** a DataFrame of a few low-cloud Sentinel-2 scenes over Portland, sorted by cloud cover."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "S2 = \"az://items/sentinel-2-l2a.parquet/*.parquet\"\n",
+    "\n",
+    "con.execute(f\"\"\"\n",
+    "    SELECT id, datetime, \"eo:cloud_cover\"\n",
+    "    FROM read_parquet('{S2}')\n",
+    "    WHERE datetime BETWEEN '2017-07-01' AND '2017-08-01'\n",
+    "      AND \"eo:cloud_cover\" < 20\n",
+    "      AND ST_Intersects(geometry, ST_MakeEnvelope(-122.7, 45.5, -122.6, 45.6))\n",
+    "    ORDER BY \"eo:cloud_cover\"\n",
+    "\"\"\").df()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Join against building footprints",
+    "",
+    "The point of pulling the catalog into SQL is to join it against other data. Here, scenes meet [Microsoft Building Footprints](https://planetarycomputer.microsoft.com/dataset/ms-buildings).",
+    "",
+    "The footprints live on a different storage account, so they need their own scoped secret. They are not public, despite being a widely used open dataset, which is exactly why the `SCOPE` from the auth step matters: each read routes to the secret whose scope is the longest matching prefix. The footprints are partitioned by quadkey; quadkey `21230223` covers Portland.",
+    "",
+    "**Expected result:** a `(building_scene_pairs, scenes)` tuple, with thousands of pairs across about 10 distinct scenes."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "bf_sas = container_sas(\"bingmlbuildings\", \"footprints\")\n",
+    "con.execute(\n",
+    "    \"CREATE SECRET buildings (TYPE azure, PROVIDER config, ACCOUNT_NAME 'bingmlbuildings', \"\n",
+    "    f\"CONNECTION_STRING 'BlobEndpoint=https://bingmlbuildings.blob.core.windows.net;SharedAccessSignature={bf_sas}', \"\n",
+    "    \"SCOPE 'az://footprints')\"\n",
+    ")\n",
+    "\n",
+    "BF = (\"az://footprints/delta/2023-04-25/ml-buildings.parquet/\"\n",
+    "      \"RegionName=UnitedStates/quadkey=21230223/*.parquet\")\n",
+    "\n",
+    "con.execute(f\"\"\"\n",
+    "    WITH buildings AS (\n",
+    "        SELECT geometry, meanHeight\n",
+    "        FROM read_parquet('{BF}')\n",
+    "        WHERE ST_Within(geometry, ST_MakeEnvelope(-122.69, 45.51, -122.66, 45.53))\n",
+    "    ),\n",
+    "    scenes AS (\n",
+    "        SELECT id, datetime, geometry\n",
+    "        FROM read_parquet('{S2}')\n",
+    "        WHERE datetime BETWEEN '2017-07-01' AND '2017-09-01'\n",
+    "          AND \"eo:cloud_cover\" < 20\n",
+    "    )\n",
+    "    SELECT count(*) AS building_scene_pairs,\n",
+    "           count(DISTINCT scenes.id) AS scenes\n",
+    "    FROM buildings JOIN scenes\n",
+    "      ON ST_Intersects(buildings.geometry, scenes.geometry)\n",
+    "\"\"\").fetchone()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Export results",
+    "",
+    "Cache the filtered Portland scenes in a table once, then write them out: Parquet (with `zstd` compression) for BI tools, GeoJSON (via the spatial extension's GDAL driver) for visualization.",
+    "",
+    "**Expected result:** `portland.parquet` and `portland.geojson` written to the working directory."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "con.execute(f\"\"\"\n",
+    "    CREATE OR REPLACE TABLE portland_scenes AS\n",
+    "    SELECT id, datetime, \"eo:cloud_cover\", geometry\n",
+    "    FROM read_parquet('{S2}')\n",
+    "    WHERE datetime BETWEEN '2017-07-01' AND '2017-08-01'\n",
+    "      AND \"eo:cloud_cover\" < 20\n",
+    "      AND ST_Intersects(geometry, ST_MakeEnvelope(-122.7, 45.5, -122.6, 45.6))\n",
+    "\"\"\")\n",
+    "con.execute(\"COPY portland_scenes TO 'portland.parquet' (COMPRESSION zstd)\")\n",
+    "con.execute(\n",
+    "    \"COPY (SELECT id, datetime, geometry FROM portland_scenes) \"\n",
+    "    \"TO 'portland.geojson' (FORMAT GDAL, DRIVER 'GeoJSON')\"\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## You're done",
+    "",
+    "If every cell ran, you read a remote STAC-GeoParquet snapshot as a SQL table, filtered it by space, time, and cloud cover, joined it against building footprints on a second storage account, and exported to Parquet and GeoJSON: remote GeoParquet \u2192 SQL filter and join \u2192 local files.",
+    "",
+    "Swap in your own bbox, collection, or a local CSV or Parquet of your own and the same query shape applies. To pull items into Python without writing SQL, see the [rustac tutorial](../overview/rustac.md). For the imagery itself, hand asset URLs from a result to the [async-geotiff tutorial](../overview/async-geotiff.md)."
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "name": "python",
+   "version": "3.12.12"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}