Skip to main content

Ingestion Design

The ingestion layer extracts raw data from OEM-specific external APIs, validates it, and progressively refines it into consolidated entity records partitioned by day. Each OEM runs its own independent Dagster deployment; shared infrastructure lives in a common library so OEM pipelines don't share state or affect each other.

Pipeline Structure

Every ingestion pipeline follows the same three-stage structure:

StageResponsibility
RawFetch data from the source API and store the response exactly as returned
TransformedExtract entity records from the raw response, deduplicate by key group, and validate
ConsolidatedMerge transformed records across all sources for an entity type

Each stage has asset checks that run automatically after materialization.

Dependency Ordering

The asset DAG has two independent ordering dimensions that do not form cycles:

  • Source dependencies operate in the raw and transformed tiers. One raw source may depend on another (e.g. fetching search results requires dealer IDs from a prior fetch). These dependencies are fully resolved by the end of the transformed tier.

  • Entity dependencies begin at the consolidated tier, where foreign_keys allow one entity's consolidation to reference another consolidated entity's same-partition data. For example, consolidated inventory can reference consolidated models to attach a resolved model ID to each row. Since all source-level dependencies are already resolved, entity ordering is independent of source ordering.

This separation guarantees no circular dependencies: sources never depend on entities, and entities never depend on sources.

Asset Key Conventions

Dagster asset keys form a path hierarchy using / as the separator:

StagePatternExample
Source<oem>/sources/<source>_<resource>audi/sources/scs_search
Raw<oem>/raw/<source>_<resource>audi/raw/scs_search
Transformed<oem>/transformed/<name>_<entity>audi/transformed/scs_inventory
Consolidated<oem>/consolidated/<entity>audi/consolidated/inventory

Asset Checks

StageCheck nameWhat it verifiesBlocking
Rawfetch_attemptsAll HTTP requests made by the fetcher returned 2xx; per-status failure patterns surfaced in metadataNo
TransformeddiscrepanciesNo conflicting rows for the same entity ID; same-source field disagreements grouped into discrepancy patternsNo
ConsolidatedintegrityEvery row has a non-null entity ID; records row countNo
Consolidatedunresolved_refsCross-entity reference links resolved successfullyNo
Transformed (highest-priority source)cross_source_discrepanciesCross-source merges produce no conflicting field values for the same entity; field-level disagreements grouped into discrepancy patternsNo

All checks are non-blocking (severity WARN) — they surface issues in the Dagster UI without aborting the run. The fetch_attempts check additionally reports tracking=instrumented or tracking=uninstrumented in its metadata so operators can tell whether the source module is recording per-request outcomes (see Component Reference).

Asset Groups and Tags

Source observability assets are grouped sources. Raw fetch assets are grouped raw. Transformed assets are grouped transformed. Consolidated assets are grouped consolidated. Groups are visible in the Dagster UI's asset catalog and can be used to filter and select assets in bulk.

Every ingestion asset also carries an oem tag for per-OEM filtering:

oem: audi

Partitioning

All ingestion assets use a DailyPartitionsDefinition with end_offset=1. The end_offset=1 means today's date is always a valid partition, so assets can be materialized on the day they are due without waiting for the day to roll over.