diff --git a/quickstarts/rustac.ipynb b/quickstarts/rustac.ipynb new file mode 100644 index 0000000..b1a01a7 --- /dev/null +++ b/quickstarts/rustac.ipynb @@ -0,0 +1,257 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Querying Planetary Computer STAC items with rustac", + "", + "This notebook queries the [Sentinel-2 L2A](https://planetarycomputer.microsoft.com/dataset/sentinel-2-l2a) STAC-GeoParquet snapshot with [rustac](https://github.com/stac-utils/rustac-py), a Rust-powered STAC toolkit that reads GeoParquet through a bundled DuckDB engine. Key benefits:", + "", + "1. **Filter pushdown**: spatial, temporal, and property predicates go into the Parquet read, so only matching items cross the network.", + "2. **No download step**: query the snapshot in place instead of pulling the whole file into pandas first.", + "3. **Bundled engine**: the DuckDB reader ships with rustac, so there's no separate database to install.", + "4. **Arrow-native**: results come back as an Arrow table you can write to Parquet or hand to GeoPandas without a Python-dictionary round-trip.", + "5. **Bulk scale**: built for scanning across many items at once, where the STAC API would need many paged requests.", + "", + "We'll pull a handful of low-cloud Sentinel-2 scenes over Portland, Oregon from the historical snapshot, then filter them, export to Parquet, and bridge them into GeoPandas.", + "", + "The companion [rustac tutorial](../overview/rustac.md) has the full narrative.", + "", + "> **Note on coverage:** the Sentinel-2 GeoParquet snapshot is a point-in-time export covering 2015-07 through 2018-10, not the live archive. Use it for bulk historical analysis. For current acquisitions, query the live STAC API with `pystac-client`." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Install" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%pip install --quiet 'rustac[arrow]' pystac-client planetary-computer geopandas pyarrow" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Find the GeoParquet snapshot", + "", + "Every Planetary Computer collection carries a collection-level `geoparquet-items` asset that points at its STAC-items snapshot. The href resolves to a directory of monthly `part-NNNN` Parquet files on the `pcstacitems` storage account.", + "", + "**Expected result:** `abfs://items/sentinel-2-l2a.parquet`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import pystac_client\n", + "\n", + "catalog = pystac_client.Client.open(\"https://planetarycomputer.microsoft.com/api/stac/v1\")\n", + "asset = catalog.get_collection(\"sentinel-2-l2a\").assets[\"geoparquet-items\"]\n", + "asset.href" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Authenticate the DuckDB engine", + "", + "The account name looks public, but reads still need a SAS token. The high-level `rustac.search()` coroutine cannot pass Azure credentials, so use `rustac.DuckdbClient` instead and configure its connection once: fetch a container SAS from the Planetary Computer token API, register an Azure secret, and switch the Azure transport to curl.", + "", + "The `azure_transport_option_type = 'curl'` line is not optional. Without it, DuckDB's default Azure transport fails with an opaque SSL CA-certificate error. The SAS expires after about an hour, so long-running jobs re-fetch it.", + "", + "**Expected result:** a configured `client`, no output printed." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import json\n", + "import urllib.request\n", + "import rustac\n", + "\n", + "sas = json.load(urllib.request.urlopen(\n", + " \"https://planetarycomputer.microsoft.com/api/sas/v1/token/pcstacitems/items\"\n", + "))[\"token\"]\n", + "\n", + "client = rustac.DuckdbClient()\n", + "client.execute(\"INSTALL azure; LOAD azure; SET azure_transport_option_type = 'curl';\")\n", + "client.execute(\n", + " \"CREATE SECRET pc (TYPE azure, PROVIDER config, ACCOUNT_NAME 'pcstacitems', \"\n", + " f\"CONNECTION_STRING 'BlobEndpoint=https://pcstacitems.blob.core.windows.net;SharedAccessSignature={sas}')\"\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Query the snapshot", + "", + "Point `search()` at the snapshot glob and pass the filters you want pushed into the read, so only matching rows cross the network. The glob (`/*.parquet`) spans the monthly part files.", + "", + "`DuckdbClient.search` is synchronous (no `await`); the bundled engine scans the part files for you, which takes a little time on the first call.", + "", + "**Expected result:** a handful of Portland scenes (5)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "SNAPSHOT = \"az://items/sentinel-2-l2a.parquet/*.parquet\"\n", + "\n", + "items = client.search(\n", + " SNAPSHOT,\n", + " collections=[\"sentinel-2-l2a\"],\n", + " bbox=[-122.7, 45.5, -122.6, 45.6],\n", + " datetime=\"2017-07-01/2017-08-01\",\n", + ")\n", + "len(items)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Filter on space, time, and properties", + "", + "`search()` accepts the STAC query parameters you would expect, including CQL2-JSON property filters. Here we add a cloud-cover threshold.", + "", + "**Expected result:** a list of low-cloud scenes over the wider summer window." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "items = client.search(\n", + " SNAPSHOT,\n", + " collections=[\"sentinel-2-l2a\"],\n", + " bbox=[-122.7, 45.5, -122.6, 45.6],\n", + " datetime=\"2017-06-01/2017-09-01\",\n", + " filter={\"op\": \"<\", \"args\": [{\"property\": \"eo:cloud_cover\"}, 20]},\n", + ")\n", + "len(items)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Write results without materializing Python objects", + "", + "For bulk work, skip the round-trip through Python dictionaries. `search_to_arrow` takes the same arguments as `search` and returns an Arrow table. It is backed by `arro3` (rustac's Arrow runtime), so adopt it into PyArrow with `pa.table(...)`, a zero-copy hand-off, before writing it to a new Parquet file.", + "", + "**Expected result:** a `portland-2017.parquet` file written locally; `table.num_rows` is 5." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import pyarrow as pa\n", + "import pyarrow.parquet as pq\n", + "\n", + "table = pa.table(client.search_to_arrow(\n", + " SNAPSHOT,\n", + " collections=[\"sentinel-2-l2a\"],\n", + " bbox=[-122.7, 45.5, -122.6, 45.6],\n", + " datetime=\"2017-07-01/2017-08-01\",\n", + "))\n", + "pq.write_table(table, \"portland-2017.parquet\")\n", + "table.num_rows" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Bridge to GeoPandas", + "", + "To analyze results as a GeoDataFrame, convert the Arrow table. The snapshot geometries are lon/lat, but the CRS is not carried on the Arrow table, so set it explicitly.", + "", + "**Expected result:** a 5-row GeoDataFrame with `Polygon` geometry." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import geopandas\n", + "\n", + "gdf = geopandas.GeoDataFrame.from_arrow(table).set_crs(4326)\n", + "gdf[[\"id\", \"datetime\", \"eo:cloud_cover\"]].head()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## From metadata to pixels", + "", + "A search returns item metadata, not imagery. To read the actual rasters, pull asset hrefs from the results and sign them, then hand them to a reader such as [async-geotiff](../overview/async-geotiff.md) for windowed reads.", + "", + "**Expected result:** a signed HTTPS URL for the B04 (red) band of the first scene." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import planetary_computer\n", + "\n", + "href = planetary_computer.sign(items[0][\"assets\"][\"B04\"][\"href\"])\n", + "href" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## You're done", + "", + "If every cell above ran, you pushed a spatial, temporal, and cloud-cover filter into a remote GeoParquet snapshot, read back only the matching Sentinel-2 scenes, wrote them to Parquet, and bridged them into GeoPandas: STAC-GeoParquet \u2192 filtered Arrow table \u2192 GeoDataFrame, with no full download.", + "", + "Swap in your own bbox, datetime window, or collection and the same pattern applies. For small or current-data queries, the live STAC API through `pystac-client` is the better tool. To ask analytical questions or join the catalog against other datasets, see the [DuckDB tutorial](../overview/duckdb.md), which uses the same authentication. For the imagery behind the metadata, hand a signed asset href to the [async-geotiff tutorial](../overview/async-geotiff.md)." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "name": "python", + "version": "3.12.12" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +}