Data Lineage & Provenance¶

IndentiaDB provides built-in data lineage and provenance tracking for every triple stored in the graph. Built on the W3C PROV-O (Provenance Ontology) standard, the lineage system records where data came from, how it was transformed, and who or what was responsible, giving you a complete audit trail from source to destination. This is essential for regulatory compliance (GDPR, SOX, HIPAA), data quality management, and debugging complex ETL pipelines.

Architecture¶

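The lineage system follows the PROV-O entity-activity-agent model. Writes flow from a source through the interceptor and annotator into storage, where the tracer can follow derivations upstream or downstream: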

+--------------+     +------------------+     +--------------+
|   Sources    |---->|   Interceptor    |---->|   Storage    |
| File / API / |     |   + Annotator    |     | PROV-O +     |
| SPARQL / ETL |     |                  |     | SurrealDB    |
+--------------+     +------------------+     +------+-------+
                                                     |
                                                     v
                                               +-------------+
                                               |   Tracer    |
                                               | Upstream /  |
                                               | Downstream  |
                                               +-------------+

Every write operation passes through an ingest interceptor that captures the data source, attaches a confidence score, links the operation to an activity, and stores the provenance metadata alongside the triple.

Core Concepts¶

Entities, Activities, and Agents¶

IndentiaDB maps the three PROV-O pillars to concrete objects:

PROV-O Concept	IndentiaDB Type	Description
Entity	`TripleProvenance`	A triple with provenance metadata attached
Activity	`ProvenanceActivity`	An action that created, modified, or consumed entities (import, transform, query)
Agent	`ProvenanceAgent`	The user, service, or system that performed an activity

Relationships¶

Relationship	Predicate	Meaning
`wasDerivedFrom`	`prov:wasDerivedFrom`	Entity A was derived from Entity B
`wasGeneratedBy`	`prov:wasGeneratedBy`	Entity was generated by an activity
`wasAttributedTo`	`prov:wasAttributedTo`	Entity was attributed to an agent
`used`	`prov:used`	Activity consumed an entity as input
`wasAssociatedWith`	`prov:wasAssociatedWith`	Activity was performed by an agent
`wasInformedBy`	`prov:wasInformedBy`	Activity was informed by another activity
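
As a rough illustration of how the three concepts and their relationships fit together, the sketch below models them as plain Python dataclasses. The class and field names are hypothetical and are not IndentiaDB's actual `TripleProvenance`, `ProvenanceActivity`, or `ProvenanceAgent` types. Attribution is simply read from the generating activity's agent, which is a simplification.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class Agent:
    """PROV-O agent: the user, service, or system behind an activity."""
    id: str


@dataclass
class Activity:
    """PROV-O activity: an import, transform, query, and so on."""
    id: str
    type: str
    was_associated_with: Agent                      # prov:wasAssociatedWith
    used: List[str] = field(default_factory=list)   # prov:used (entity ids)
    was_informed_by: Optional[str] = None           # prov:wasInformedBy (activity id)


@dataclass
class Entity:
    """PROV-O entity: a triple with provenance attached."""
    id: str
    was_generated_by: Activity                      # prov:wasGeneratedBy
    was_derived_from: Optional[str] = None          # prov:wasDerivedFrom (entity id)

    @property
    def was_attributed_to(self) -> Agent:
        # Simplification: attribute the entity to the agent of its activity.
        return self.was_generated_by.was_associated_with


importer = Agent("system://file-importer")
import_run = Activity("act_import", "Import", importer)
original = Entity("triple_original", import_run)

transform_run = Activity("act_transform", "Transform", importer,
                         used=[original.id], was_informed_by=import_run.id)
derived = Entity("triple_derived", transform_run, was_derived_from=original.id)

print(derived.was_attributed_to.id)
```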

Data Sources¶

IndentiaDB tracks seven distinct data source types, each capturing source-specific metadata:

Source Type	Fields	Use Case
File	`path`, `format`, `checksum`	Turtle, N-Triples, JSON-LD, CSV imports
API	`endpoint`, `method`, `request_id`	REST API ingestion
SparqlUpdate	`query`, `user`	SPARQL INSERT/DELETE operations
Pipeline	`pipeline_id`, `step`, `run_id`	ETL/ELT pipeline stages
External	`system`, `identifier`	Federation from external systems
UserAssertion	`user_id`	Manual data entry or assertion
Inferred	`rule_id`, `source_triples`	Inference engine output
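
A client could use the table above to catch misspelled field names before sending a source descriptor. The following is a minimal sketch under that assumption. The helper name `unknown_source_fields` is hypothetical, and because the table does not say which fields are mandatory, the check only flags keys that are not listed for the given type.

```python
# Field names per source type, copied from the table above.
SOURCE_FIELDS = {
    "File": {"path", "format", "checksum"},
    "API": {"endpoint", "method", "request_id"},
    "SparqlUpdate": {"query", "user"},
    "Pipeline": {"pipeline_id", "step", "run_id"},
    "External": {"system", "identifier"},
    "UserAssertion": {"user_id"},
    "Inferred": {"rule_id", "source_triples"},
}


def unknown_source_fields(source: dict) -> set:
    """Return the keys that are not listed for the descriptor's source type.

    Raises ValueError when the type itself is not one of the seven known types.
    """
    source_type = source.get("type")
    if source_type not in SOURCE_FIELDS:
        raise ValueError(f"unknown source type: {source_type!r}")
    return set(source) - {"type"} - SOURCE_FIELDS[source_type]


ok = {"type": "File", "path": "/data/customers.ttl", "format": "turtle"}
typo = {"type": "Pipeline", "pipline_id": "etl-customer-enrichment"}

print(unknown_source_fields(ok))
print(unknown_source_fields(typo))
```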

Confidence Scores¶

Every triple can carry a confidence score between 0.0 and 1.0, representing the reliability of the data:

{
  "score": 0.85,
  "method": "SourceReliability",
  "assessed_at": "2026-03-23T10:00:00Z"
}

Confidence Methods¶

Method	Description
`SourceProvided`	Confidence was supplied by the source system itself
`SourceReliability`	Calculated based on pre-configured source reliability ratings
`MLModel`	Determined by a machine learning model (includes `model_id`)
`UserAsserted`	Manually asserted by a user
`Default`	Fallback value (0.5) when no confidence information is available

Filtering by confidence

Use the `ProvenanceFilter` with `min_confidence` and `max_confidence` to query only high-quality data. For example, setting `min_confidence: 0.8` excludes low-confidence inferred triples from analytical queries.
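
To make the effect of the two bounds concrete, here is a client-side sketch that applies the same kind of range check to records already held in memory. It is not the `ProvenanceFilter` implementation. The function name `within_confidence` and the record layout are assumptions, and treating both bounds as inclusive is also an assumption.

```python
def within_confidence(records, min_confidence=0.0, max_confidence=1.0):
    """Keep records whose confidence score lies in [min_confidence, max_confidence]."""
    return [
        record for record in records
        if min_confidence <= record["confidence"]["score"] <= max_confidence
    ]


records = [
    {"triple_id": "t1", "confidence": {"score": 0.95, "method": "SourceProvided"}},
    {"triple_id": "t2", "confidence": {"score": 0.60, "method": "MLModel"}},
    {"triple_id": "t3", "confidence": {"score": 0.50, "method": "Default"}},
]

high_quality = within_confidence(records, min_confidence=0.8)
print([record["triple_id"] for record in high_quality])
```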

Recording Provenance¶

Starting an Activity¶

Every batch of provenance records is grouped under an activity. Start an activity before ingesting data:

curl -X POST http://localhost:8000/api/lineage/activities \
  -H "Content-Type: application/json" \
  -d '{
    "type": "import",
    "agent_id": "system://file-importer",
    "description": "Import customer data from CRM export"
  }'

Response:

{
  "activity_id": "act_01HY4K8M3X...",
  "started_at": "2026-03-23T10:00:00Z",
  "type": "import",
  "agent_id": "system://file-importer"
}

Activity Types¶

Type	Description
`Import`	Loading data from an external source
`Transform`	Data transformation or enrichment
`Merge`	Merging data from multiple sources
`Delete`	Removing data
`Inference`	Generating new triples via reasoning
`Query`	Read-only access (for audit logging)
`Correction`	Correcting previously recorded data

Recording Triple Provenance¶

Attach provenance metadata when inserting triples:

curl -X POST http://localhost:8000/api/lineage/provenance \
  -H "Content-Type: application/json" \
  -d '{
    "triple_id": "triple_01HY4K...",
    "source": {
      "type": "File",
      "path": "/data/customers.ttl",
      "format": "turtle",
      "checksum": "sha256:e3b0c44298fc1c149..."
    },
    "confidence": {
      "score": 0.85,
      "method": "SourceReliability"
    },
    "activity_id": "act_01HY4K8M3X...",
    "metadata": {
      "batch_id": "batch-2026-03-23",
      "row_number": 42
    }
  }'

Using the File Import Interceptor¶

The interceptor automatically captures provenance during bulk imports:

import requests

BASE_URL = "http://localhost:8000"

# Start an import activity; fail fast if the server rejects the request
activity_response = requests.post(f"{BASE_URL}/api/lineage/activities", json={
    "type": "import",
    "agent_id": "system://python-etl",
    "description": "Daily CRM sync"
})
activity_response.raise_for_status()
activity = activity_response.json()

# Upload a file with automatic provenance tracking
with open("customers.ttl", "rb") as f:
    response = requests.post(
        f"{BASE_URL}/api/import",
        files={"file": f},
        data={
            "format": "turtle",
            "activity_id": activity["activity_id"],
            "confidence_score": 0.9,
            "confidence_method": "SourceReliability"
        }
    )
# Surface HTTP errors before reading the import result
response.raise_for_status()

print(f"Imported {response.json()['triple_count']} triples with provenance")

Automatic checksums

When using the file import interceptor, IndentiaDB automatically computes a SHA-256 checksum of the source file and stores it in the provenance record. This enables integrity verification when re-importing data.
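
To verify integrity on the client side, you can compute the same kind of digest locally and compare it with the stored value. The sketch below uses Python's standard `hashlib`. The function names are hypothetical, and the `sha256:<hex>` string format is assumed from the `checksum` value shown in the provenance example earlier on this page.

```python
import hashlib
import io


def stream_checksum(stream, chunk_size=1024 * 1024):
    """Hash a binary stream in chunks so large files are never fully in memory."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return "sha256:" + digest.hexdigest()


def file_checksum(path):
    """Convenience wrapper that opens the file in binary mode."""
    with open(path, "rb") as f:
        return stream_checksum(f)


# SHA-256 of empty input is a well-known constant, handy as a sanity check.
print(stream_checksum(io.BytesIO(b"")))
# -> sha256:e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
```

Comparing `file_checksum("customers.ttl")` with the `checksum` stored in the provenance record tells you whether the file changed since it was imported.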

Derivation Chains¶

When data is transformed, IndentiaDB records derivation relationships between the original and derived triples. This creates a chain that can be traced upstream (back to sources) or downstream (to all derived data).

Recording Derivations¶

curl -X POST http://localhost:8000/api/lineage/provenance \
  -H "Content-Type: application/json" \
  -d '{
    "triple_id": "triple_derived_01...",
    "source": {
      "type": "Pipeline",
      "pipeline_id": "etl-customer-enrichment",
      "step": "normalize-addresses",
      "run_id": "run-2026-03-23-001"
    },
    "activity_id": "act_transform_01...",
    "derived_from": "prov_original_01..."
  }'

Tracing Upstream Lineage¶

Trace a triple back to its original sources:

curl "http://localhost:8000/api/lineage/trace/upstream/triple_derived_01...?max_depth=5"

Response:

{
  "root": "triple_derived_01...",
  "depth": 3,
  "nodes": [
    {
      "triple_id": "triple_derived_01...",
      "source": {"type": "Pipeline", "pipeline_id": "etl-customer-enrichment"},
      "activity_type": "Transform"
    },
    {
      "triple_id": "triple_intermediate_01...",
      "source": {"type": "SparqlUpdate", "query": "INSERT { ... }"},
      "activity_type": "Transform"
    },
    {
      "triple_id": "triple_original_01...",
      "source": {"type": "File", "path": "/data/customers.ttl"},
      "activity_type": "Import"
    }
  ],
  "relationships": [
    {"from": "triple_derived_01...", "to": "triple_intermediate_01...", "type": "wasDerivedFrom"},
    {"from": "triple_intermediate_01...", "to": "triple_original_01...", "type": "wasDerivedFrom"}
  ]
}
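
The `nodes` array is not necessarily guaranteed to be ordered, so a client that wants the chain from derived triple to original source can follow the `relationships` instead. This is a minimal sketch that assumes the response shape shown above and a linear chain in which each triple has at most one `wasDerivedFrom` parent. The function name `upstream_chain` is hypothetical.

```python
def upstream_chain(trace):
    """Follow wasDerivedFrom edges from the root back to the original source."""
    parent = {
        rel["from"]: rel["to"]
        for rel in trace["relationships"]
        if rel["type"] == "wasDerivedFrom"
    }
    chain, seen, current = [], set(), trace["root"]
    # The 'seen' set guards against accidental cycles in the edge list.
    while current is not None and current not in seen:
        seen.add(current)
        chain.append(current)
        current = parent.get(current)
    return chain


sample_trace = {
    "root": "triple_derived",
    "relationships": [
        {"from": "triple_intermediate", "to": "triple_original", "type": "wasDerivedFrom"},
        {"from": "triple_derived", "to": "triple_intermediate", "type": "wasDerivedFrom"},
    ],
}

print(" -> ".join(upstream_chain(sample_trace)))
# -> triple_derived -> triple_intermediate -> triple_original
```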

Tracing Downstream Lineage¶

Find all data derived from a given triple:

curl "http://localhost:8000/api/lineage/trace/downstream/triple_original_01...?max_depth=10"
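
Conceptually, a downstream trace is a breadth-first walk over derivation edges that stops after `max_depth` hops. The sketch below illustrates that idea on an in-memory edge list. It is not the server's tracer, and the function name and the `(derived, source)` pair format are assumptions.

```python
from collections import deque


def downstream(derivations, start, max_depth):
    """Return ids derived from `start`, nearest first, at most max_depth hops away.

    `derivations` is a list of (derived_id, source_id) pairs.
    """
    children = {}
    for derived_id, source_id in derivations:
        children.setdefault(source_id, []).append(derived_id)

    found, seen = [], {start}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the depth limit
        for child in children.get(node, []):
            if child not in seen:
                seen.add(child)
                found.append(child)
                queue.append((child, depth + 1))
    return found


edges = [("b", "a"), ("c", "b"), ("d", "a")]
print(downstream(edges, "a", max_depth=1))   # -> ['b', 'd']
print(downstream(edges, "a", max_depth=10))  # -> ['b', 'd', 'c']
```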

RDF-Star Annotations¶

IndentiaDB uses RDF-star (RDF 1.2) to embed provenance metadata directly into the graph as annotations on triples. This allows provenance to be queried alongside the data using standard SPARQL-star syntax.

How It Works¶

A regular triple:

ex:alice ex:worksAt ex:acme .

With RDF-star provenance annotation:

<< ex:alice ex:worksAt ex:acme >> prov:wasGeneratedBy ex:activity_import_01 ;
                                   prov:generatedAtTime "2026-03-23T10:00:00Z"^^xsd:dateTime ;
                                   ig:confidence 0.95 ;
                                   ig:source "crm-system" .

Querying Provenance with SPARQL-Star¶

PREFIX prov: <http://www.w3.org/ns/prov#>
PREFIX ig:   <http://indentiagraph.io/ontology/lineage#>

SELECT ?s ?p ?o ?source ?confidence ?when WHERE {
    << ?s ?p ?o >> ig:source     ?source ;
                   ig:confidence ?confidence ;
                   prov:generatedAtTime ?when .
    FILTER(?confidence > 0.8)
}
ORDER BY DESC(?confidence)

Querying Lineage via SPARQL¶

Find All Triples from a Specific Source¶

PREFIX prov: <http://www.w3.org/ns/prov#>
PREFIX ig:   <http://indentiagraph.io/ontology/lineage#>

SELECT ?s ?p ?o WHERE {
    << ?s ?p ?o >> ig:sourceSystem "crm" .
}

Find the Full Provenance Chain for a Triple¶

PREFIX prov: <http://www.w3.org/ns/prov#>
PREFIX ig:   <http://indentiagraph.io/ontology/lineage#>

SELECT ?entity ?derivedFrom ?activity ?agent WHERE {
    ?entity prov:wasDerivedFrom* ?derivedFrom .
    ?entity prov:wasGeneratedBy ?activity .
    ?activity prov:wasAssociatedWith ?agent .
    FILTER(?entity = <urn:triple:abc123>)
}

Find All Activities by a Specific Agent¶

PREFIX prov: <http://www.w3.org/ns/prov#>

SELECT ?activity ?type ?started ?ended WHERE {
    ?activity prov:wasAssociatedWith <system://file-importer> ;
              a prov:Activity ;
              prov:startedAtTime ?started .
    OPTIONAL { ?activity prov:endedAtTime ?ended }
}
ORDER BY DESC(?started)

Identify Low-Confidence Triples¶

PREFIX ig: <http://indentiagraph.io/ontology/lineage#>

SELECT ?s ?p ?o ?confidence WHERE {
    << ?s ?p ?o >> ig:confidence ?confidence .
    FILTER(?confidence < 0.5)
}
ORDER BY ?confidence

Querying Lineage via SurrealQL¶

Lineage data is also accessible through SurrealQL for applications that use the document model.

List Recent Activities¶

SELECT * FROM prov_activity
WHERE started_at > time::now() - 24h
ORDER BY started_at DESC
LIMIT 50;

Find Provenance for a Triple¶

SELECT * FROM triple_provenance
WHERE triple_id = "triple_01HY4K..."
ORDER BY created_at DESC;

Find All Derivations from a Source¶

SELECT * FROM prov_derived
WHERE source_id = "prov_original_01..."
FETCH derived_id;

Filter by Source Type and Confidence¶

SELECT * FROM triple_provenance
WHERE source.type = "File"
  AND confidence_score >= 0.8
ORDER BY created_at DESC
LIMIT 100;

Integration with Bitemporal¶

The lineage system integrates with IndentiaDB's bitemporal time-travel to answer questions about when data was derived and when derivation relationships were recorded:

PREFIX prov: <http://www.w3.org/ns/prov#>
PREFIX ig:   <http://indentiagraph.io/ontology/lineage#>

SELECT ?entity ?source ?confidence ?tx_time WHERE {
    TEMPORAL AS OF TX "2026-01-15T00:00:00Z"

    ?entity prov:wasDerivedFrom ?source .
    << ?entity ?p ?o >> ig:confidence ?confidence .
}

This query returns the lineage graph as it was known on January 15, 2026, even if derivation records were later corrected or deleted.

Retroactive corrections

When a data correction is applied retroactively, the lineage system creates a new activity of type `Correction` with a derivation link to the original provenance record. The bitemporal model ensures the original record remains queryable at its original transaction time.
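
The idea that a corrected record stays queryable at its original transaction time can be sketched with a small in-memory model. This is an illustration of transaction-time lookup in general, not IndentiaDB's storage layout. The `Version` class, its field names, and the `as_of` helper are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Version:
    confidence: float
    tx_from: str                   # transaction time the version was recorded (ISO 8601, UTC)
    tx_to: Optional[str] = None    # transaction time it was superseded; None means current


def as_of(versions: List[Version], tx_time: str) -> Optional[Version]:
    """Return the version that was current at tx_time, or None if none existed yet.

    ISO 8601 timestamps in the same format and zone compare correctly as strings.
    """
    for version in versions:
        if version.tx_from <= tx_time and (version.tx_to is None or tx_time < version.tx_to):
            return version
    return None


history = [
    # Original record, superseded by a correction on 1 February.
    Version(0.60, "2026-01-01T00:00:00Z", "2026-02-01T00:00:00Z"),
    # Correction: remains current.
    Version(0.90, "2026-02-01T00:00:00Z"),
]

print(as_of(history, "2026-01-15T00:00:00Z").confidence)  # -> 0.6
print(as_of(history, "2026-03-01T00:00:00Z").confidence)  # -> 0.9
```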

Compliance & Governance¶

GDPR Right to Erasure¶

To honor an erasure request, trace all data derived from a specific source entity and identify every triple that must be deleted:

curl "http://localhost:8000/api/lineage/trace/downstream/triple_user_01...?max_depth=50" \
  | jq '.nodes[].triple_id'

Audit Report¶

Generate a complete audit report for a time range:

curl "http://localhost:8000/api/lineage/audit?from=2026-01-01&to=2026-03-31" \
  -H "Accept: application/json"

Response:

{
  "period": {"from": "2026-01-01", "to": "2026-03-31"},
  "summary": {
    "total_activities": 1842,
    "total_agents": 12,
    "total_provenance_records": 458293,
    "by_activity_type": {
      "Import": 324,
      "Transform": 1205,
      "Delete": 42,
      "Inference": 271
    },
    "by_source_type": {
      "File": 182304,
      "Api": 94521,
      "Pipeline": 128944,
      "Inferred": 52524
    }
  }
}
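
In the sample response the per-type breakdowns add up to the reported totals. Assuming that holds for every report (the response format does not state it explicitly), a consumer can use it as a quick sanity check. The function name `check_audit_summary` is hypothetical.

```python
def check_audit_summary(summary):
    """Compare each breakdown's sum with the corresponding total."""
    return {
        "activities_match":
            sum(summary["by_activity_type"].values()) == summary["total_activities"],
        "records_match":
            sum(summary["by_source_type"].values()) == summary["total_provenance_records"],
    }


# The "summary" object from the sample response above.
sample_summary = {
    "total_activities": 1842,
    "total_agents": 12,
    "total_provenance_records": 458293,
    "by_activity_type": {"Import": 324, "Transform": 1205, "Delete": 42, "Inference": 271},
    "by_source_type": {"File": 182304, "Api": 94521, "Pipeline": 128944, "Inferred": 52524},
}

print(check_audit_summary(sample_summary))
# -> {'activities_match': True, 'records_match': True}
```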

Data Quality Dashboard Query¶

Find the confidence distribution across the graph:

PREFIX ig: <http://indentiagraph.io/ontology/lineage#>

SELECT
  (COUNT(?s) AS ?total)
  (AVG(?c) AS ?avg_confidence)
  (MIN(?c) AS ?min_confidence)
  (SUM(IF(?c >= 0.9, 1, 0)) AS ?high_quality)
  (SUM(IF(?c < 0.5, 1, 0)) AS ?low_quality)
WHERE {
    << ?s ?p ?o >> ig:confidence ?c .
}
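
For readers less familiar with SPARQL aggregates, the same summary can be expressed over a plain list of scores. The sketch below mirrors the query's thresholds (`>= 0.9` for high quality, `< 0.5` for low quality). The function name is hypothetical, and returning `None` for an empty input is a choice made here, not something the query specifies.

```python
def confidence_summary(scores):
    """Mirror the dashboard query's aggregates for a list of confidence scores."""
    return {
        "total": len(scores),
        "avg_confidence": sum(scores) / len(scores) if scores else None,
        "min_confidence": min(scores, default=None),
        "high_quality": sum(1 for c in scores if c >= 0.9),
        "low_quality": sum(1 for c in scores if c < 0.5),
    }


summary = confidence_summary([0.95, 0.9, 0.7, 0.4])
print(summary["total"], summary["high_quality"], summary["low_quality"])
# -> 4 2 1
```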

Retention Policies¶

Provenance data grows over time. Configure retention policies to automatically clean up old records while preserving required audit history:

[lineage.retention]
# Keep provenance records for 2 years
max_age_days = 730

# Keep activity records for 5 years (compliance)
activity_max_age_days = 1825

# Run cleanup every 24 hours
cleanup_interval_hours = 24

# Archive to cold storage before deleting
archive_before_delete = true
archive_bucket = "s3://lineage-archive"
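
To make the age-based settings concrete, the sketch below shows one plausible reading of them: a record becomes eligible for cleanup once it is strictly older than the configured number of days. How IndentiaDB actually measures age (for example, from which timestamp) is not specified here, so the `is_expired` helper and its comparison are assumptions.

```python
from datetime import datetime, timedelta, timezone

MAX_AGE_DAYS = 730             # provenance records
ACTIVITY_MAX_AGE_DAYS = 1825   # activity records


def is_expired(created_at, now, max_age_days):
    """True when the record is strictly older than max_age_days."""
    return now - created_at > timedelta(days=max_age_days)


now = datetime(2026, 3, 23, tzinfo=timezone.utc)
old_record = datetime(2024, 3, 22, tzinfo=timezone.utc)     # 731 days old
recent_record = datetime(2024, 3, 24, tzinfo=timezone.utc)  # 729 days old

print(is_expired(old_record, now, MAX_AGE_DAYS))            # -> True
print(is_expired(recent_record, now, MAX_AGE_DAYS))         # -> False
print(is_expired(old_record, now, ACTIVITY_MAX_AGE_DAYS))   # -> False
```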

Compliance requirements

Ensure retention periods meet your regulatory requirements before reducing them. GDPR requires deletion records to be maintained, SOX typically requires 7 years of financial data provenance, and HIPAA requires 6 years of access audit trails.

Configuration¶

Full Lineage Configuration¶

[lineage]
# Enable/disable lineage tracking
enabled = true

# Capture provenance for all write operations (recommended)
capture_all_writes = true

# Default confidence score when none is provided
default_confidence = 0.5

# Default confidence method
default_confidence_method = "Default"

# Maximum depth for upstream/downstream tracing
max_trace_depth = 50

# Store RDF-star annotations in the graph
rdf_star_annotations = true

[lineage.interceptor]
# Enable automatic file checksum computation
compute_checksums = true

# Checksum algorithm (sha256, sha512, md5)
checksum_algorithm = "sha256"

# Maximum file size for checksum computation (bytes)
max_checksum_file_size = 1073741824  # 1 GB

[lineage.confidence]
# Source reliability ratings (used by SourceReliability method)
[lineage.confidence.source_ratings]
"crm"          = 0.95
"erp"          = 0.90
"external-api" = 0.70
"user-input"   = 0.80
"inference"    = 0.60

[lineage.retention]
max_age_days = 730
activity_max_age_days = 1825
cleanup_interval_hours = 24
archive_before_delete = false
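
The interplay between `source_ratings`, `default_confidence`, and `default_confidence_method` can be sketched as a simple lookup with a fallback. This is an illustration of the behavior the settings suggest, not the server's code. The function name `resolve_confidence` is hypothetical, and keying the lookup by source system name is an assumption.

```python
# Values copied from the configuration above.
SOURCE_RATINGS = {
    "crm": 0.95,
    "erp": 0.90,
    "external-api": 0.70,
    "user-input": 0.80,
    "inference": 0.60,
}
DEFAULT_CONFIDENCE = 0.5
DEFAULT_CONFIDENCE_METHOD = "Default"


def resolve_confidence(source_system):
    """Use the configured rating when one exists, otherwise fall back to the default."""
    if source_system in SOURCE_RATINGS:
        return {"score": SOURCE_RATINGS[source_system], "method": "SourceReliability"}
    return {"score": DEFAULT_CONFIDENCE, "method": DEFAULT_CONFIDENCE_METHOD}


print(resolve_confidence("crm"))
# -> {'score': 0.95, 'method': 'SourceReliability'}
print(resolve_confidence("legacy-ftp"))
# -> {'score': 0.5, 'method': 'Default'}
```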

SurrealDB Schema¶

The lineage system uses the following SurrealDB tables:

Table	Purpose
`triple_provenance`	Provenance metadata per triple (source, confidence, derivation links)
`prov_activity`	PROV-O activities (import, transform, delete, etc.)
`prov_agent`	PROV-O agents (users, systems, services)
`prov_used`	Usage relationships (activity used entity)
`prov_generated`	Generation relationships (activity generated entity)
`prov_derived`	Derivation relationships (entity derived from entity)

Indexes¶

The schema includes indexes for efficient querying:

Index	Table	Fields	Purpose
`idx_prov_triple`	`triple_provenance`	`triple_id`	Look up provenance by triple
`idx_prov_activity`	`triple_provenance`	`activity_id`	Find triples created by an activity
`idx_prov_source`	`triple_provenance`	`source.type`	Filter by source type
`idx_prov_confidence`	`triple_provenance`	`confidence_score`	Range queries on confidence
`idx_prov_derived`	`triple_provenance`	`derived_from`	Traverse derivation chains
`idx_activity_type`	`prov_activity`	`activity_type`	Filter by activity type
`idx_activity_agent`	`prov_activity`	`agent_id`	Find activities by agent
`idx_activity_time`	`prov_activity`	`started_at`	Time-range queries on activities

Vocabulary Reference¶

IndentiaDB uses two RDF namespaces for lineage:

W3C PROV-O Namespace¶

Prefix: `prov:`, namespace IRI: `http://www.w3.org/ns/prov#`

Term	Type	Description
`prov:Entity`	Class	Data item with provenance
`prov:Activity`	Class	Action that occurs over time
`prov:Agent`	Class	Actor responsible for activities
`prov:wasDerivedFrom`	Property	Entity derivation relationship
`prov:wasGeneratedBy`	Property	Entity was generated by an activity
`prov:wasAttributedTo`	Property	Entity was attributed to an agent
`prov:used`	Property	Activity consumed an entity
`prov:wasInformedBy`	Property	Activity informed by another activity
`prov:wasAssociatedWith`	Property	Activity performed by an agent
`prov:generatedAtTime`	Property	Timestamp when entity was generated
`prov:startedAtTime`	Property	Timestamp when activity started
`prov:endedAtTime`	Property	Timestamp when activity ended

IndentiaGraph Lineage Namespace¶

Prefix: `ig:`, namespace IRI: `http://indentiagraph.io/ontology/lineage#`

Term	Type	Description
`ig:hasProvenance`	Property	Links triple to its provenance record
`ig:source`	Property	Source identifier string
`ig:confidence`	Property	Confidence score (0.0-1.0)
`ig:sourceSystem`	Property	Source system identifier
