Ingestion Operations

Local Configuration

Environment variables are loaded automatically from projects/ai_<oem>/.env when dg dev starts. Copy the relevant .env file and fill in the values for your machine before running locally.

Required variables

| Variable | Purpose |
| --- | --- |
| RAW_STORAGE_URI | fsspec-compatible URI where raw JSON assets are written (e.g. temp/raw). |
| ICEBERG_CATALOG_URI | SQLite connection string for the local Iceberg catalog (e.g. sqlite:////absolute/path/temp/iceberg/catalog.db). |
| ICEBERG_WAREHOUSE | Local warehouse root for Iceberg data files (e.g. file:///absolute/path/temp/iceberg/warehouse). Note: three slashes for file:// URIs on absolute paths. |
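
The slash counts in the two Iceberg values are easy to get wrong, so here is a minimal sketch that derives both from a base directory. The helper name `local_iceberg_uris` and the `iceberg/catalog.db` and `iceberg/warehouse` layout are assumptions taken from the examples in the table, not project code.

```python
from pathlib import Path


def local_iceberg_uris(base_dir: str) -> dict[str, str]:
    """Build example values for the two Iceberg variables.

    Hypothetical helper for illustration only; not part of the project.
    """
    base = Path(base_dir).resolve()
    catalog_db = base / "iceberg" / "catalog.db"
    warehouse = base / "iceberg" / "warehouse"
    return {
        # "sqlite:///" followed by an absolute POSIX path (which itself
        # starts with "/") gives the four-slash form shown in the table.
        "ICEBERG_CATALOG_URI": f"sqlite:///{catalog_db.as_posix()}",
        # Path.as_uri() emits "file:///..." for absolute paths.
        "ICEBERG_WAREHOUSE": warehouse.as_uri(),
    }


if __name__ == "__main__":
    for name, value in local_iceberg_uris("temp").items():
        print(f"{name}={value}")
```

Running the script from the repository root prints lines that can be pasted into the .env file, assuming the temp/ layout used in the examples above.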

Optional variables

| Variable | Purpose |
| --- | --- |
| ICEBERG_CATALOG_TYPE | Catalog backend type. Defaults to sql. Override in deployed environments (e.g. for Postgres). |
| ICEBERG_CATALOG_NAME | Name of the catalog entry. Defaults to <oem>_local (e.g. audi_local). Set to the OEM name in deployed environments. |
| ICEBERG_INIT_DIRS | Set to 1 on first setup or after wiping temp/iceberg/ to auto-create the local SQLite and warehouse directories at startup. Leave unset day-to-day. |
| DUCKDB_PATH | Path to a DuckDB file for the transformed tier. Omit to use Dagster's default filesystem IO manager. |
| PPM_SOURCE_OVERRIDES | Path to a YAML file that injects data into raw or transformed assets instead of calling the live API. See below. |
| NESSIE_URI | Nessie REST catalog base URI. When set, the IO manager connects to Nessie instead of the local SQLite catalog. Used in deployed environments; leave unset for local development. |
| NESSIE_READ_ONLY | Set to 1 to read from Nessie but write outputs to the local SQLite/file catalog. Useful for testing transformed or consolidated assets against production raw data locally. |

For NESSIE_AUTH and Auth0 credentials, see Nessie.
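
The interaction between NESSIE_URI and NESSIE_READ_ONLY can be summarized as a small decision table. The sketch below is illustrative only: `choose_catalogs` and `CatalogChoice` are hypothetical names, and it assumes that NESSIE_READ_ONLY has no effect when NESSIE_URI is unset, which the table above does not state explicitly.

```python
from typing import Mapping, NamedTuple


class CatalogChoice(NamedTuple):
    read_from: str  # "nessie" or "local"
    write_to: str   # "nessie" or "local"


def choose_catalogs(env: Mapping[str, str]) -> CatalogChoice:
    """Illustrative decision table for NESSIE_URI / NESSIE_READ_ONLY."""
    if not env.get("NESSIE_URI"):
        # No Nessie URI: reads and writes stay on the local SQLite catalog.
        return CatalogChoice("local", "local")
    if env.get("NESSIE_READ_ONLY") == "1":
        # Read from Nessie, but keep all outputs in the local catalog.
        return CatalogChoice("nessie", "local")
    # Nessie URI set without the read-only flag: use Nessie for both.
    return CatalogChoice("nessie", "nessie")
```

For local testing against production raw data, the middle branch is the relevant one: both variables are set, and nothing is written back to Nessie.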

First-time local setup

Before running for the first time (or after a reset), create the Iceberg directories and initialize the catalog namespace. This takes two steps. First, create the directories:

./scripts/reset_iceberg_local.sh --init   # create dirs without wiping data

Then start dg dev with ICEBERG_INIT_DIRS=1 set for the first startup only — Dagster will create the catalog namespace automatically:

ICEBERG_INIT_DIRS=1 deployments/local/.venv/bin/dg dev

Subsequent starts do not need the flag.
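
To make the effect of the flag concrete, here is a rough sketch of what directory creation at startup could look like. It is not the project's implementation: `init_local_iceberg_dirs` is a hypothetical name, the URI parsing assumes POSIX-style absolute paths, and namespace creation is left out.

```python
import tempfile
from pathlib import Path
from typing import Mapping
from urllib.parse import urlparse


def init_local_iceberg_dirs(env: Mapping[str, str]) -> list[Path]:
    """Create the catalog and warehouse directories when ICEBERG_INIT_DIRS=1."""
    if env.get("ICEBERG_INIT_DIRS") != "1":
        return []  # flag unset: leave the filesystem untouched
    # "sqlite:////abs/path/catalog.db" -> "/abs/path/catalog.db"
    catalog_db = Path(env["ICEBERG_CATALOG_URI"].removeprefix("sqlite:///"))
    # "file:///abs/path/warehouse" -> "/abs/path/warehouse"
    warehouse = Path(urlparse(env["ICEBERG_WAREHOUSE"]).path)
    created = [catalog_db.parent, warehouse]
    for directory in created:
        directory.mkdir(parents=True, exist_ok=True)
    return created


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        env = {
            "ICEBERG_INIT_DIRS": "1",
            "ICEBERG_CATALOG_URI": f"sqlite:///{tmp}/iceberg/catalog.db",
            "ICEBERG_WAREHOUSE": f"file://{tmp}/iceberg/warehouse",
        }
        print(init_local_iceberg_dirs(env))
```

Because `mkdir` is called with `exist_ok=True`, leaving the flag set on a later start would be harmless in this sketch; whether that also holds for the real startup code is not confirmed here.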

Resetting local Iceberg state

To wipe the local catalog and all stored data and start fresh:

./scripts/reset_iceberg_local.sh

Then restart dg dev with ICEBERG_INIT_DIRS=1 as above.

Source overrides

The source override mechanism lets you inject data from BigQuery tables or local JSON files into any raw source or transformed entity asset, without modifying source module code. This is useful for local testing when live API access is unavailable or when you want to replay a known-good dataset.

Create a YAML file (e.g. temp/source_overrides.yaml) and point PPM_SOURCE_OVERRIDES at it:

overrides:
  # Read from a BigQuery partitioned table.
  # partition_date defaults to the asset's partition key if omitted.
  - oem: audi
    source: onegraph
    resource: carlines
    type: bigquery
    table: "project.dataset.table"
    # partition_date: "2026-01-27"

  # Read from a local JSON file previously written by RawJsonIOManager.
  # {partition} is interpolated with the asset's partition key.
  - oem: audi
    source: pss
    resource: dealers
    type: local_file
    path: "temp/raw/audi/raw/pss_dealers/{partition}.json"

  # Override a transformed entity asset; this bypasses the extractor function.
  # The `entity` field distinguishes this from a raw source override.
  # Use partition_range to serve the same file for a range of dates.
  - oem: audi
    source: onegraph
    resource: carlines
    entity: models
    type: local_file
    path: "/absolute/path/to/models_2026-01-27.json"
    partition_range: ["2026-01-27", "2026-01-31"]

Override types:

  • bigquery — runs the query SELECT entry FROM `table` WHERE DATE(_PARTITIONTIME) = '{partition}'. Each row's entry column must be a JSON object; all rows are collected into a list. The partition_date field pins a specific date when you need to replay a partition that differs from the one being materialized.
  • local_file — reads a JSON file at the given path. {partition} in the path is replaced with the asset's partition key.
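
The two override types come down to a pair of string substitutions. The sketch below mirrors the descriptions above; the function names are hypothetical, and the plain string formatting is for illustration only (the project's actual query construction is not shown in this document).

```python
from typing import Optional


def effective_partition(asset_partition: str, partition_date: Optional[str] = None) -> str:
    """partition_date, when given, pins the date; otherwise use the asset's key."""
    return partition_date or asset_partition


def bigquery_override_sql(table: str, partition: str) -> str:
    """Render the query described for `bigquery` overrides."""
    return (
        f"SELECT entry FROM `{table}` "
        f"WHERE DATE(_PARTITIONTIME) = '{partition}'"
    )


def local_file_override_path(path_template: str, partition: str) -> str:
    """Replace every {partition} placeholder with the partition key."""
    return path_template.replace("{partition}", partition)


if __name__ == "__main__":
    day = effective_partition("2026-01-27")
    print(bigquery_override_sql("project.dataset.table", day))
    print(local_file_override_path("temp/raw/audi/raw/pss_dealers/{partition}.json", day))
```

`str.replace` is used rather than `str.format` so that any other braces in a path are left alone; that choice is an assumption of this sketch.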

Additional fields:

  • entity — when present, the override applies to the transformed entity asset instead of the raw source asset. Only TransformedEntityComponent checks for overrides with an entity field.
  • partition_range — a two-element list [start, end] (inclusive). The override only applies to partitions within this range. Multiple overrides for the same source with non-overlapping ranges are allowed. Entries without partition_range match any partition.
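
The matching rules for entity and partition_range can be sketched as a lookup over the parsed YAML entries. This is an illustration under stated assumptions, not the project's code: `override_matches` and `find_override` are hypothetical names, the first matching entry wins, and a raw-asset lookup is assumed to skip entries that carry an entity field.

```python
from datetime import date
from typing import Any, Optional

# Entries as they would look after parsing the example YAML file.
OVERRIDES: list[dict[str, Any]] = [
    {"oem": "audi", "source": "onegraph", "resource": "carlines",
     "type": "bigquery", "table": "project.dataset.table"},
    {"oem": "audi", "source": "onegraph", "resource": "carlines",
     "entity": "models", "type": "local_file",
     "path": "/absolute/path/to/models_2026-01-27.json",
     "partition_range": ["2026-01-27", "2026-01-31"]},
]


def override_matches(entry: dict[str, Any], *, oem: str, source: str,
                     resource: str, partition: str,
                     entity: Optional[str] = None) -> bool:
    if (entry["oem"], entry["source"], entry["resource"]) != (oem, source, resource):
        return False
    # Raw lookups (entity=None) match only entries without an entity field;
    # transformed lookups require the same entity name.
    if entry.get("entity") != entity:
        return False
    partition_range = entry.get("partition_range")
    if partition_range is None:
        return True  # no range: the entry matches any partition
    start, end = (date.fromisoformat(d) for d in partition_range)
    return start <= date.fromisoformat(partition) <= end  # inclusive bounds


def find_override(overrides: list[dict[str, Any]], **lookup: Any) -> Optional[dict[str, Any]]:
    """Return the first matching entry, or None when the live path applies."""
    return next((e for e in overrides if override_matches(e, **lookup)), None)
```

With the example entries, a lookup for the models entity on 2026-01-31 returns the local_file override, while the same lookup on 2026-02-01 returns None because the date falls outside the inclusive range.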

A full local launch command looks like this (environment variables are loaded from .env automatically):

macOS / Linux

cd projects/ai_audi
PPM_SOURCE_OVERRIDES="$(pwd)/../../temp/source_overrides.yaml" \
  .venv/bin/dg launch \
  --assets "audi/raw/onegraph_carlines,audi/transformed/models_from_onegraph_carlines,audi/consolidated/models" \
  --partition 2026-01-27

Windows (PowerShell)

cd projects/ai_audi
$env:PPM_SOURCE_OVERRIDES = "$PWD/../../temp/source_overrides.yaml"
.venv\Scripts\dg.exe launch `
  --assets "audi/raw/onegraph_carlines,audi/transformed/models_from_onegraph_carlines,audi/consolidated/models" `
  --partition 2026-01-27

Viewing Asset Check Results

Asset checks run automatically after each materialization. To view results, open the asset in the Dagster UI and click the Checks tab.