DealerConnect Source

The dealerconnect source reads pre-parsed PDF blob data from BigQuery tables populated by the ppm-ordering-services collection pipeline. It is specific to the Stellantis project (projects/ai_stellantis).

BigQuery Tables

| Table pattern | Description | Partition |
| --- | --- | --- |
| inventory_{dealerId} | In-stock vehicles with POC blobs | DAY (fetch_date) |
| sales_{dealerId} | Sold vehicles with POC blobs | None (filtered by fetch_date + sale_date) |
| code_guides | Trim/equipment spec blobs | None (filtered by fetch_date + blob_id dedup) |

Dealer tables are discovered dynamically at runtime by listing all tables in the configured dataset that match the relevant prefix (inventory_ or sales_). There is no static dealer ID list — new dealers are picked up automatically as they are onboarded to the collection pipeline.
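The prefix-based discovery described above can be sketched as follows. This is a minimal illustration, not the component's actual code: the function name discover_dealer_tables, the FakeClient stand-in, and the dealer IDs are hypothetical. The only real API assumed is that the client has a list_tables(dataset) method whose items expose .table_id, as google.cloud.bigquery.Client does.

```python
from types import SimpleNamespace


def discover_dealer_tables(client, dataset, prefix):
    """Return {dealer_id: table_id} for tables in `dataset` starting with `prefix`.

    `client` only needs a list_tables(dataset) method whose items expose
    `.table_id`, which is what google.cloud.bigquery.Client provides.
    """
    found = {}
    for table in client.list_tables(dataset):
        if table.table_id.startswith(prefix):
            # Everything after the prefix is treated as the dealer ID.
            found[table.table_id[len(prefix):]] = table.table_id
    return found


# Stand-in client so the sketch runs without GCP access (hypothetical dealer IDs).
class FakeClient:
    def list_tables(self, dataset):
        names = ["inventory_12345", "sales_12345", "inventory_67890", "code_guides"]
        return [SimpleNamespace(table_id=name) for name in names]


inventory_tables = discover_dealer_tables(FakeClient(), "dealerconnect", "inventory_")
sales_tables = discover_dealer_tables(FakeClient(), "dealerconnect", "sales_")
print(inventory_tables)
print(sales_tables)
```

Because the lookup runs on every materialization, a newly created inventory_ or sales_ table would appear in the result without any configuration change.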

Required Environment Variables

| Variable | Description | Example |
| --- | --- | --- |
| STELLANTIS_GCP_PROJECT | GCP project ID containing the DealerConnect dataset (defaults to ai-app-stellantis) | my-gcp-project |
| GCP_CREDENTIALS | Service account JSON key content (production only; omit to use ADC locally) | {"type":"service_account",...} |

The BigQuery dataset name defaults to dealerconnect and is set as a component YAML attribute (dataset:) on each DealerConnectRawSourceComponent subclass — it is not an environment variable.
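As a rough sketch of how these two settings combine, the snippet below resolves the project from the environment and takes the dataset as an ordinary argument, mirroring the split between env var and YAML attribute. The helper name resolve_bq_location is hypothetical; only the variable name and the two defaults come from this page.

```python
import os

DEFAULT_PROJECT = "ai-app-stellantis"  # documented default for STELLANTIS_GCP_PROJECT
DEFAULT_DATASET = "dealerconnect"      # documented default for the dataset: attribute


def resolve_bq_location(env=None, dataset=DEFAULT_DATASET):
    """Return "<project>.<dataset>" using the env var for the project only."""
    env = os.environ if env is None else env
    project = env.get("STELLANTIS_GCP_PROJECT", DEFAULT_PROJECT)
    return f"{project}.{dataset}"


# With no override, both defaults apply.
default_location = resolve_bq_location(env={})
# An override changes only the project; the dataset still comes from the argument.
custom_location = resolve_bq_location(env={"STELLANTIS_GCP_PROJECT": "my-gcp-project"})
print(default_location)
print(custom_location)
```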

Authentication

In local development, authenticate with Application Default Credentials (ADC):

gcloud auth application-default login

In production, set GCP_CREDENTIALS to the JSON content of a service account key (not a file path). The BigQueryResource registered in definitions.py reads this env var via gcp_credentials=dg.EnvVar("GCP_CREDENTIALS").
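A common mistake is to put a file path in GCP_CREDENTIALS instead of the key content. The check below is a hypothetical pre-flight helper, not part of the project, that distinguishes the two by attempting to parse the value as JSON and looking for the "type": "service_account" field shown in the example above.

```python
import json


def looks_like_key_content(value):
    """Return True if `value` parses as a service account JSON object."""
    try:
        parsed = json.loads(value)
    except (TypeError, ValueError):
        # File paths and other non-JSON strings end up here.
        return False
    return isinstance(parsed, dict) and parsed.get("type") == "service_account"


# Illustrative values only; a real key contains more fields.
content_ok = looks_like_key_content('{"type": "service_account", "project_id": "demo"}')
path_rejected = looks_like_key_content("/secrets/service-account.json")
print(content_ok, path_rejected)
```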

The service account requires the following BigQuery IAM roles on the dataset:

  • roles/bigquery.dataViewer — read table data
  • roles/bigquery.metadataViewer — list tables (required for dealer discovery)
  • roles/bigquery.jobUser — run queries

Dagster Resource Key

The BigQuery client is injected via the standard BIG_QUERY_RESOURCE_KEY resource (value: "big_query_resource") exported from ai_core.components.raw_historical. All DealerConnectRawSourceComponent subclasses declare this key in _required_resource_keys() and obtain the client via context.resources[BIG_QUERY_RESOURCE_KEY].get_client().

Raw Asset Contract

Each raw asset row has the standard RAW_ROW_SCHEMA columns. The _response_body column contains a JSON-serialized BQ row dict with all original columns from the source table, including poc_raw_blob (for inventory/sales) or pdf_raw_blob (for code guides), which may be null if the collection pipeline failed to parse that PDF.
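To illustrate the contract, the sketch below decodes a _response_body value and picks the blob column that matches the source table. The helper name, the mapping constant, and the sample rows (including the vin column and blob text) are assumptions made for the example; only the column names poc_raw_blob and pdf_raw_blob come from the contract above.

```python
import json

# Which blob column each source table family carries, per the contract above.
BLOB_COLUMN = {
    "inventory": "poc_raw_blob",
    "sales": "poc_raw_blob",
    "code_guides": "pdf_raw_blob",
}


def blob_from_response_body(response_body, source):
    """Decode the JSON-serialized BQ row and return its blob, or None if null."""
    row = json.loads(response_body)
    return row.get(BLOB_COLUMN[source])


# Hypothetical rows: one parsed successfully, one where PDF parsing failed.
parsed_row = json.dumps({"vin": "EXAMPLEVIN", "poc_raw_blob": "parsed text"})
failed_row = json.dumps({"vin": "EXAMPLEVIN", "poc_raw_blob": None})

print(blob_from_response_body(parsed_row, "inventory"))
print(blob_from_response_body(failed_row, "sales"))
```

Downstream code should treat a None result as a collection failure to be recorded, not as an error in the raw asset.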

Operational Notes

  • NULL blobs are retained in the raw asset so collection failures are auditable. The transform-tier extractors skip NULL blobs gracefully and log a warning per skip.
  • Missing dealer tables are logged as warnings and skipped, so a partition can materialize successfully even if some dealers have no data for that date.
  • When no matching tables are found, this is logged at info level and the result is an empty DataFrame. This is valid for early backfill dates before any dealers were onboarded.
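The skip-and-warn behaviour for NULL blobs might look roughly like the sketch below. It is an illustration under assumptions, not the extractors' actual code: the function name, logger name, and sample rows are hypothetical, and plain lists stand in for the DataFrame.

```python
import json
import logging

logger = logging.getLogger("dealerconnect.example")  # hypothetical logger name


def non_null_blobs(response_bodies, blob_column="poc_raw_blob"):
    """Return the blobs that are present, logging one warning per NULL skipped."""
    kept = []
    for index, body in enumerate(response_bodies):
        blob = json.loads(body).get(blob_column)
        if blob is None:
            logger.warning("Skipping row %d: %s is NULL", index, blob_column)
            continue
        kept.append(blob)
    return kept


# Hypothetical raw rows: the second one represents a failed PDF parse.
sample_bodies = [
    json.dumps({"poc_raw_blob": "first blob"}),
    json.dumps({"poc_raw_blob": None}),
    json.dumps({"poc_raw_blob": "third blob"}),
]
extracted = non_null_blobs(sample_bodies)
print(extracted)

# An empty input (for example, no tables found for the date) yields an empty result.
print(non_null_blobs([]))
```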