Data Lineage & Provenance

IndentiaDB provides built-in data lineage and provenance tracking for every triple stored in the graph. Built on the W3C PROV-O (Provenance Ontology) standard, the lineage system records where data came from, how it was transformed, and who or what was responsible -- giving you a complete audit trail from source to destination. This is essential for regulatory compliance (GDPR, SOX, HIPAA), data quality management, and debugging complex ETL pipelines.


Architecture

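The lineage system follows the PROV-O entity-activity-agent model. Data flows through four stages: sources, the ingest interceptor and annotator, storage, and the tracer: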

+--------------+     +------------------+     +--------------+
|   Sources    |---->|   Interceptor    |---->|   Storage    |
| File / API / |     |   + Annotator    |     | PROV-O +     |
| SPARQL / ETL |     |                  |     | SurrealDB    |
+--------------+     +------------------+     +------+-------+
                                                     |
                                                     v
                                               +-------------+
                                               |   Tracer    |
                                               | Upstream /  |
                                               | Downstream  |
                                               +-------------+

Every write operation passes through an ingest interceptor that captures the data source, attaches a confidence score, links the operation to an activity, and stores the provenance metadata alongside the triple.


Core Concepts

Entities, Activities, and Agents

IndentiaDB maps the three PROV-O pillars to concrete objects:

| PROV-O Concept | IndentiaDB Type | Description |
|---|---|---|
| Entity | TripleProvenance | A triple with provenance metadata attached |
| Activity | ProvenanceActivity | An action that created, modified, or consumed entities (import, transform, query) |
| Agent | ProvenanceAgent | The user, service, or system that performed an activity |

Relationships

| Relationship | Predicate | Meaning |
|---|---|---|
| wasDerivedFrom | prov:wasDerivedFrom | Entity A was derived from Entity B |
| wasGeneratedBy | prov:wasGeneratedBy | Entity was generated by an activity |
| wasAttributedTo | prov:wasAttributedTo | Entity was attributed to an agent |
| used | prov:used | Activity consumed an entity as input |
| wasAssociatedWith | prov:wasAssociatedWith | Activity was performed by an agent |
| wasInformedBy | prov:wasInformedBy | Activity was informed by another activity |
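
The three concepts and their relationships can be pictured as small record types that point at each other. The sketch below is illustrative only: the class names come from the table above, but every field name is an assumption chosen for readability, not IndentiaDB's actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ProvenanceAgent:
    """prov:Agent: the user, service, or system behind an activity."""
    agent_id: str


@dataclass
class ProvenanceActivity:
    """prov:Activity: an import, transform, query, and so on."""
    activity_id: str
    activity_type: str
    agent: ProvenanceAgent                          # prov:wasAssociatedWith
    used: List[str] = field(default_factory=list)   # prov:used (entity ids)
    informed_by: Optional[str] = None               # prov:wasInformedBy


@dataclass
class TripleProvenance:
    """prov:Entity: a triple with provenance metadata attached."""
    triple_id: str
    generated_by: ProvenanceActivity                # prov:wasGeneratedBy
    derived_from: Optional[str] = None              # prov:wasDerivedFrom

    @property
    def attributed_to(self) -> ProvenanceAgent:
        # prov:wasAttributedTo, taken here from the generating activity
        # (a simplifying assumption of this sketch).
        return self.generated_by.agent


importer = ProvenanceAgent("system://file-importer")
load = ProvenanceActivity("act_import", "Import", importer)
original = TripleProvenance("triple_original", load)
clean = ProvenanceActivity(
    "act_transform", "Transform", importer,
    used=[original.triple_id], informed_by=load.activity_id,
)
derived = TripleProvenance("triple_derived", clean, derived_from=original.triple_id)
```

Here `derived` is linked to `original` by derivation, to `clean` by generation, and to `importer` by attribution, which mirrors the six predicates listed above.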

Data Sources

IndentiaDB tracks seven distinct data source types, each capturing source-specific metadata:

| Source Type | Fields | Use Case |
|---|---|---|
| File | path, format, checksum | Turtle, N-Triples, JSON-LD, CSV imports |
| API | endpoint, method, request_id | REST API ingestion |
| SparqlUpdate | query, user | SPARQL INSERT/DELETE operations |
| Pipeline | pipeline_id, step, run_id | ETL/ELT pipeline stages |
| External | system, identifier | Federation from external systems |
| UserAssertion | user_id | Manual data entry or assertion |
| Inferred | rule_id, source_triples | Inference engine output |
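
A client can check a source object against this table before sending it. The following is a minimal sketch: the field lists are copied from the table, while the helper name `missing_source_fields` is hypothetical and whether the server treats every listed field as mandatory is an assumption.

```python
from typing import Any, Dict, List

# Fields per source type, copied from the table above.
SOURCE_FIELDS: Dict[str, List[str]] = {
    "File": ["path", "format", "checksum"],
    "API": ["endpoint", "method", "request_id"],
    "SparqlUpdate": ["query", "user"],
    "Pipeline": ["pipeline_id", "step", "run_id"],
    "External": ["system", "identifier"],
    "UserAssertion": ["user_id"],
    "Inferred": ["rule_id", "source_triples"],
}


def missing_source_fields(source: Dict[str, Any]) -> List[str]:
    """Return the documented fields that are absent from a source object."""
    source_type = source.get("type")
    if source_type not in SOURCE_FIELDS:
        raise ValueError(f"unknown source type: {source_type!r}")
    return [name for name in SOURCE_FIELDS[source_type] if name not in source]


print(missing_source_fields({"type": "File", "path": "/data/customers.ttl"}))
# → ['format', 'checksum']
```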

Confidence Scores

Every triple can carry a confidence score between 0.0 and 1.0, representing the reliability of the data:

{
  "score": 0.85,
  "method": "SourceReliability",
  "assessed_at": "2026-03-23T10:00:00Z"
}

Confidence Methods

| Method | Description |
|---|---|
| SourceProvided | Confidence was supplied by the source system itself |
| SourceReliability | Calculated based on pre-configured source reliability ratings |
| MLModel | Determined by a machine learning model (includes model_id) |
| UserAsserted | Manually asserted by a user |
| Default | Fallback value (0.5) when no confidence information is available |

Filtering by confidence

Use the ProvenanceFilter with min_confidence and max_confidence to query only high-quality data. For example, setting min_confidence: 0.8 excludes low-confidence inferred triples from analytical queries.
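
To make the filter semantics concrete, here is a client-side sketch of the same idea. The `ProvenanceFilter` class below is a hypothetical stand-in, not IndentiaDB's type; it assumes both bounds are inclusive and that each record carries its score under `confidence.score`, as in the JSON shown earlier.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class ProvenanceFilter:
    """Hypothetical stand-in for the filter described above (inclusive bounds assumed)."""
    min_confidence: float = 0.0
    max_confidence: float = 1.0

    def matches(self, record: Dict) -> bool:
        score = record["confidence"]["score"]
        return self.min_confidence <= score <= self.max_confidence


def apply_filter(records: Iterable[Dict], flt: ProvenanceFilter) -> List[Dict]:
    """Keep only the records whose confidence score falls inside the filter bounds."""
    return [record for record in records if flt.matches(record)]


records = [
    {"triple_id": "t1", "confidence": {"score": 0.95}},
    {"triple_id": "t2", "confidence": {"score": 0.60}},
    {"triple_id": "t3", "confidence": {"score": 0.80}},
]
kept = apply_filter(records, ProvenanceFilter(min_confidence=0.8))
print([record["triple_id"] for record in kept])  # → ['t1', 't3']
```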


Recording Provenance

Starting an Activity

Every batch of provenance records is grouped under an activity. Start an activity before ingesting data:

curl -X POST http://localhost:8000/api/lineage/activities \
  -H "Content-Type: application/json" \
  -d '{
    "type": "import",
    "agent_id": "system://file-importer",
    "description": "Import customer data from CRM export"
  }'

Response:

{
  "activity_id": "act_01HY4K8M3X...",
  "started_at": "2026-03-23T10:00:00Z",
  "type": "import",
  "agent_id": "system://file-importer"
}

Activity Types

| Type | Description |
|---|---|
| Import | Loading data from an external source |
| Transform | Data transformation or enrichment |
| Merge | Merging data from multiple sources |
| Delete | Removing data |
| Inference | Generating new triples via reasoning |
| Query | Read-only access (for audit logging) |
| Correction | Correcting previously recorded data |

Recording Triple Provenance

Attach provenance metadata when inserting triples:

curl -X POST http://localhost:8000/api/lineage/provenance \
  -H "Content-Type: application/json" \
  -d '{
    "triple_id": "triple_01HY4K...",
    "source": {
      "type": "File",
      "path": "/data/customers.ttl",
      "format": "turtle",
      "checksum": "sha256:e3b0c44298fc1c149..."
    },
    "confidence": {
      "score": 0.85,
      "method": "SourceReliability"
    },
    "activity_id": "act_01HY4K8M3X...",
    "metadata": {
      "batch_id": "batch-2026-03-23",
      "row_number": 42
    }
  }'
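
When the request body is assembled in application code, it helps to validate the confidence score before anything is sent. The sketch below builds the same JSON structure as the curl example; the function name `build_provenance_payload` is hypothetical, and rejecting out-of-range scores on the client is a design choice of this sketch rather than documented server behavior.

```python
from typing import Any, Dict, Optional


def build_provenance_payload(
    triple_id: str,
    source: Dict[str, Any],
    activity_id: str,
    score: float,
    method: str = "Default",
    metadata: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Build a body for POST /api/lineage/provenance, mirroring the curl example."""
    # Confidence scores are documented as lying between 0.0 and 1.0.
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"confidence score must be within [0.0, 1.0], got {score}")
    payload: Dict[str, Any] = {
        "triple_id": triple_id,
        "source": source,
        "confidence": {"score": score, "method": method},
        "activity_id": activity_id,
    }
    if metadata:
        payload["metadata"] = metadata
    return payload
```

The returned dictionary can be passed as the `json=` argument of an HTTP client call.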

Using the File Import Interceptor

The interceptor automatically captures provenance during bulk imports:

import requests

BASE_URL = "http://localhost:8000"

# Start an import activity; fail fast on HTTP errors
activity_response = requests.post(
    f"{BASE_URL}/api/lineage/activities",
    json={
        "type": "import",
        "agent_id": "system://python-etl",
        "description": "Daily CRM sync",
    },
    timeout=30,
)
activity_response.raise_for_status()
activity = activity_response.json()

# Upload a file with automatic provenance tracking
with open("customers.ttl", "rb") as f:
    response = requests.post(
        f"{BASE_URL}/api/import",
        files={"file": f},
        data={
            "format": "turtle",
            "activity_id": activity["activity_id"],
            "confidence_score": 0.9,
            "confidence_method": "SourceReliability",
        },
        timeout=300,
    )
response.raise_for_status()

print(f"Imported {response.json()['triple_count']} triples with provenance")

Automatic checksums

When using the file import interceptor, IndentiaDB automatically computes a SHA-256 checksum of the source file and stores it in the provenance record. This enables integrity verification when re-importing data.
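
To verify a file against a stored checksum, the same digest can be recomputed locally. This is a minimal sketch using Python's standard `hashlib`; the `algorithm:hexdigest` output format follows the `sha256:...` value in the request example above, and the helper name `file_checksum` is hypothetical.

```python
import hashlib
from pathlib import Path


def file_checksum(path: Path, algorithm: str = "sha256", chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks and return the digest as '<algorithm>:<hex digest>'."""
    digest = hashlib.new(algorithm)
    with path.open("rb") as handle:
        # Read fixed-size chunks so large files are never loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return f"{algorithm}:{digest.hexdigest()}"
```

Comparing `file_checksum(Path("customers.ttl"))` with the checksum in the provenance record shows whether the file has changed since it was imported.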


Derivation Chains

When data is transformed, IndentiaDB records derivation relationships between the original and derived triples. This creates a chain that can be traced upstream (back to sources) or downstream (to all derived data).

Recording Derivations

curl -X POST http://localhost:8000/api/lineage/provenance \
  -H "Content-Type: application/json" \
  -d '{
    "triple_id": "triple_derived_01...",
    "source": {
      "type": "Pipeline",
      "pipeline_id": "etl-customer-enrichment",
      "step": "normalize-addresses",
      "run_id": "run-2026-03-23-001"
    },
    "activity_id": "act_transform_01...",
    "derived_from": "prov_original_01..."
  }'

Tracing Upstream Lineage

Trace a triple back to its original sources:

curl "http://localhost:8000/api/lineage/trace/upstream/triple_derived_01...?max_depth=5"

Response:

{
  "root": "triple_derived_01...",
  "depth": 3,
  "nodes": [
    {
      "triple_id": "triple_derived_01...",
      "source": {"type": "Pipeline", "pipeline_id": "etl-customer-enrichment"},
      "activity_type": "Transform"
    },
    {
      "triple_id": "triple_intermediate_01...",
      "source": {"type": "SparqlUpdate", "query": "INSERT { ... }"},
      "activity_type": "Transform"
    },
    {
      "triple_id": "triple_original_01...",
      "source": {"type": "File", "path": "/data/customers.ttl"},
      "activity_type": "Import"
    }
  ],
  "relationships": [
    {"from": "triple_derived_01...", "to": "triple_intermediate_01...", "type": "wasDerivedFrom"},
    {"from": "triple_intermediate_01...", "to": "triple_original_01...", "type": "wasDerivedFrom"}
  ]
}
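
A trace response like the one above can be walked on the client to list a triple's ancestors in order. The sketch below performs a breadth-first traversal over the `wasDerivedFrom` edges; it assumes the response shape shown above, and the function name `upstream_order` is hypothetical.

```python
from collections import defaultdict, deque
from typing import Dict, List


def upstream_order(trace: Dict) -> List[str]:
    """Return triple ids reachable from the root via wasDerivedFrom, nearest first."""
    parents = defaultdict(list)
    for rel in trace["relationships"]:
        if rel["type"] == "wasDerivedFrom":
            parents[rel["from"]].append(rel["to"])

    order: List[str] = []
    seen = {trace["root"]}
    queue = deque([trace["root"]])
    while queue:
        node = queue.popleft()
        order.append(node)
        for parent in parents[node]:
            if parent not in seen:  # guards against cycles and shared ancestors
                seen.add(parent)
                queue.append(parent)
    return order


trace = {
    "root": "triple_derived",
    "relationships": [
        {"from": "triple_derived", "to": "triple_intermediate", "type": "wasDerivedFrom"},
        {"from": "triple_intermediate", "to": "triple_original", "type": "wasDerivedFrom"},
    ],
}
print(upstream_order(trace))
# → ['triple_derived', 'triple_intermediate', 'triple_original']
```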

Tracing Downstream Lineage

Find all data derived from a given triple:

curl "http://localhost:8000/api/lineage/trace/downstream/triple_original_01...?max_depth=10"

RDF-Star Annotations

IndentiaDB uses RDF-star (RDF 1.2) to embed provenance metadata directly into the graph as annotations on triples. This allows provenance to be queried alongside the data using standard SPARQL-star syntax.

How It Works

A regular triple:

ex:alice ex:worksAt ex:acme .

With RDF-star provenance annotation:

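@prefix ex:   <http://example.org/> .  # placeholder namespace
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix ig:   <http://indentiagraph.io/ontology/lineage#> .
<< ex:alice ex:worksAt ex:acme >> prov:wasGeneratedBy ex:activity_import_01 ;
                                   prov:generatedAtTime "2026-03-23T10:00:00Z"^^xsd:dateTime ;
                                   ig:confidence 0.95 ;
                                   ig:source "crm-system" .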

Querying Provenance with SPARQL-Star

PREFIX prov: <http://www.w3.org/ns/prov#>
PREFIX ig:   <http://indentiagraph.io/ontology/lineage#>

SELECT ?s ?p ?o ?source ?confidence ?when WHERE {
    << ?s ?p ?o >> ig:source     ?source ;
                   ig:confidence ?confidence ;
                   prov:generatedAtTime ?when .
    FILTER(?confidence > 0.8)
}
ORDER BY DESC(?confidence)

Querying Lineage via SPARQL

Find All Triples from a Specific Source

PREFIX prov: <http://www.w3.org/ns/prov#>
PREFIX ig:   <http://indentiagraph.io/ontology/lineage#>

SELECT ?s ?p ?o WHERE {
    << ?s ?p ?o >> ig:sourceSystem "crm" .
}

Find the Full Provenance Chain for a Triple

PREFIX prov: <http://www.w3.org/ns/prov#>
PREFIX ig:   <http://indentiagraph.io/ontology/lineage#>

SELECT ?entity ?derivedFrom ?activity ?agent WHERE {
    VALUES ?entity { <urn:triple:abc123> }
    # Zero-or-more path: ?derivedFrom includes ?entity itself
    ?entity prov:wasDerivedFrom* ?derivedFrom .
    # Activity and agent for each link in the chain
    ?derivedFrom prov:wasGeneratedBy ?activity .
    ?activity prov:wasAssociatedWith ?agent .
}

Find All Activities by a Specific Agent

PREFIX prov: <http://www.w3.org/ns/prov#>

SELECT ?activity ?started ?ended WHERE {
    ?activity prov:wasAssociatedWith <system://file-importer> ;
              a prov:Activity ;
              prov:startedAtTime ?started .
    OPTIONAL { ?activity prov:endedAtTime ?ended }
}
ORDER BY DESC(?started)

Identify Low-Confidence Triples

PREFIX ig: <http://indentiagraph.io/ontology/lineage#>

SELECT ?s ?p ?o ?confidence WHERE {
    << ?s ?p ?o >> ig:confidence ?confidence .
    FILTER(?confidence < 0.5)
}
ORDER BY ?confidence

Querying Lineage via SurrealQL

Lineage data is also accessible through SurrealQL for applications that use the document model:

List Recent Activities

SELECT * FROM prov_activity
WHERE started_at > time::now() - 24h
ORDER BY started_at DESC
LIMIT 50;

Find Provenance for a Triple

SELECT * FROM triple_provenance
WHERE triple_id = "triple_01HY4K..."
ORDER BY created_at DESC;

Find All Derivations from a Source

SELECT * FROM prov_derived
WHERE source_id = "prov_original_01..."
FETCH derived_id;

Filter by Source Type and Confidence

SELECT * FROM triple_provenance
WHERE source.type = "File"
  AND confidence_score >= 0.8
ORDER BY created_at DESC
LIMIT 100;

Integration with Bitemporal

The lineage system integrates with IndentiaDB's bitemporal time-travel to answer questions about when data was derived and when derivation relationships were recorded:

PREFIX prov: <http://www.w3.org/ns/prov#>
PREFIX ig:   <http://indentiagraph.io/ontology/lineage#>

SELECT ?entity ?source ?confidence ?tx_time WHERE {
    TEMPORAL AS OF TX "2026-01-15T00:00:00Z"

    ?entity prov:wasDerivedFrom ?source .
    << ?entity ?p ?o >> ig:confidence ?confidence .
}

This query returns the lineage graph as it was known on January 15, 2026 -- even if derivation records were later corrected or deleted.

Retroactive corrections

When a data correction is applied retroactively, the lineage system creates a new activity of type Correction with a derivation link to the original provenance record. The bitemporal model ensures the original record remains queryable at its original transaction time.


Compliance & Governance

GDPR Right to Erasure

Trace all data derived from a specific source entity to identify every triple that must be deleted:

curl -s "http://localhost:8000/api/lineage/trace/downstream/triple_user_01...?max_depth=50" \
  | jq -r '.nodes[].triple_id'
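
The same extraction can be done in application code, which makes it easy to deduplicate the result before issuing deletes. This sketch assumes the downstream trace response has a `nodes` array whose items carry a `triple_id`, as the jq filter above implies; the function name `erasure_candidates` is hypothetical.

```python
import json
from typing import List


def erasure_candidates(trace_json: str) -> List[str]:
    """Return the unique triple ids in a trace response, preserving first-seen order."""
    trace = json.loads(trace_json)
    # dict.fromkeys drops duplicates while keeping insertion order.
    ordered_unique = dict.fromkeys(node["triple_id"] for node in trace["nodes"])
    return list(ordered_unique)


sample = json.dumps({
    "nodes": [
        {"triple_id": "triple_user"},
        {"triple_id": "triple_profile"},
        {"triple_id": "triple_user"},
    ]
})
print(erasure_candidates(sample))  # → ['triple_user', 'triple_profile']
```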

Audit Report

Generate a complete audit report for a time range:

curl "http://localhost:8000/api/lineage/audit?from=2026-01-01&to=2026-03-31" \
  -H "Accept: application/json"

Response:

{
  "period": {"from": "2026-01-01", "to": "2026-03-31"},
  "summary": {
    "total_activities": 1842,
    "total_agents": 12,
    "total_provenance_records": 458293,
    "by_activity_type": {
      "Import": 324,
      "Transform": 1205,
      "Delete": 42,
      "Inference": 271
    },
    "by_source_type": {
      "File": 182304,
      "Api": 94521,
      "Pipeline": 128944,
      "Inferred": 52524
    }
  }
}
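
In the sample response, each breakdown adds up to its corresponding total. A report consumer can use that as a sanity check, as sketched below; that the breakdowns always sum to the totals is an assumption drawn from this one example, and the function name `audit_inconsistencies` is hypothetical.

```python
from typing import Dict, List


def audit_inconsistencies(summary: Dict) -> List[str]:
    """Compare each breakdown's sum with the matching total and describe any mismatch."""
    checks = [
        ("by_activity_type", "total_activities"),
        ("by_source_type", "total_provenance_records"),
    ]
    problems = []
    for breakdown_key, total_key in checks:
        breakdown_sum = sum(summary[breakdown_key].values())
        if breakdown_sum != summary[total_key]:
            problems.append(
                f"{breakdown_key} sums to {breakdown_sum}, expected {summary[total_key]}"
            )
    return problems


summary = {
    "total_activities": 1842,
    "total_agents": 12,
    "total_provenance_records": 458293,
    "by_activity_type": {"Import": 324, "Transform": 1205, "Delete": 42, "Inference": 271},
    "by_source_type": {"File": 182304, "Api": 94521, "Pipeline": 128944, "Inferred": 52524},
}
print(audit_inconsistencies(summary))  # → []
```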

Data Quality Dashboard Query

Find the confidence distribution across the graph:

PREFIX ig: <http://indentiagraph.io/ontology/lineage#>

SELECT
  (COUNT(?s) AS ?total)
  (AVG(?c) AS ?avg_confidence)
  (MIN(?c) AS ?min_confidence)
  (SUM(IF(?c >= 0.9, 1, 0)) AS ?high_quality)
  (SUM(IF(?c < 0.5, 1, 0)) AS ?low_quality)
WHERE {
    << ?s ?p ?o >> ig:confidence ?c .
}

Retention Policies

Provenance data grows over time. Configure retention policies to automatically clean up old records while preserving required audit history:

[lineage.retention]
# Keep provenance records for 2 years
max_age_days = 730

# Keep activity records for 5 years (compliance)
activity_max_age_days = 1825

# Run cleanup every 24 hours
cleanup_interval_hours = 24

# Archive to cold storage before deleting
archive_before_delete = true
archive_bucket = "s3://lineage-archive"
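
To see what these settings mean for a given cleanup run, the age limits can be turned into cutoff timestamps. The sketch below only illustrates the arithmetic; how IndentiaDB actually selects records for cleanup is not specified here, and the function name `retention_cutoffs` is hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict

# Values copied from the [lineage.retention] example above.
retention = {"max_age_days": 730, "activity_max_age_days": 1825}


def retention_cutoffs(config: Dict[str, int], now: datetime) -> Dict[str, datetime]:
    """Records older than these timestamps would fall outside the retention window."""
    return {
        "provenance": now - timedelta(days=config["max_age_days"]),
        "activity": now - timedelta(days=config["activity_max_age_days"]),
    }


now = datetime(2026, 3, 23, 10, 0, tzinfo=timezone.utc)
cutoffs = retention_cutoffs(retention, now)
print(cutoffs["provenance"].date())  # → 2024-03-23
print(cutoffs["activity"].date())    # → 2021-03-24
```

The activity cutoff lands on March 24 rather than March 23 because the limit is counted in days and the five-year span includes a leap day.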

Compliance requirements

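Check retention periods against your regulatory requirements before reducing them. GDPR requires deletion records to be maintained, SOX typically requires 7 years of financial data provenance, and HIPAA requires 6 years of access audit trails. The example values above (730 and 1825 days, or 2 and 5 years) are shorter than the SOX and HIPAA periods, so raise them if those regulations apply to your data.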


Configuration

Full Lineage Configuration

[lineage]
# Enable/disable lineage tracking
enabled = true

# Capture provenance for all write operations (recommended)
capture_all_writes = true

# Default confidence score when none is provided
default_confidence = 0.5

# Default confidence method
default_confidence_method = "Default"

# Maximum depth for upstream/downstream tracing
max_trace_depth = 50

# Store RDF-star annotations in the graph
rdf_star_annotations = true

[lineage.interceptor]
# Enable automatic file checksum computation
compute_checksums = true

# Checksum algorithm (sha256, sha512, md5)
checksum_algorithm = "sha256"

# Maximum file size for checksum computation (bytes)
max_checksum_file_size = 1073741824  # 1 GiB

[lineage.confidence]
# Source reliability ratings (used by SourceReliability method)
[lineage.confidence.source_ratings]
"crm"          = 0.95
"erp"          = 0.90
"external-api" = 0.70
"user-input"   = 0.80
"inference"    = 0.60

[lineage.retention]
max_age_days = 730
activity_max_age_days = 1825
cleanup_interval_hours = 24
archive_before_delete = false
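
The interaction between `source_ratings` and `default_confidence` can be sketched as a lookup with a fallback. The values below are copied from the configuration above; the function name `resolve_confidence` and the exact fallback rule are assumptions about how the SourceReliability and Default methods relate, not a description of the actual implementation.

```python
from typing import Dict, Union

# Mirrors default_confidence and [lineage.confidence.source_ratings] above.
DEFAULT_CONFIDENCE = 0.5
SOURCE_RATINGS: Dict[str, float] = {
    "crm": 0.95,
    "erp": 0.90,
    "external-api": 0.70,
    "user-input": 0.80,
    "inference": 0.60,
}


def resolve_confidence(source_system: str) -> Dict[str, Union[float, str]]:
    """Use the configured rating for a known source, else fall back to the default."""
    if source_system in SOURCE_RATINGS:
        return {"score": SOURCE_RATINGS[source_system], "method": "SourceReliability"}
    return {"score": DEFAULT_CONFIDENCE, "method": "Default"}


print(resolve_confidence("crm"))      # → {'score': 0.95, 'method': 'SourceReliability'}
print(resolve_confidence("unknown"))  # → {'score': 0.5, 'method': 'Default'}
```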

SurrealDB Schema

The lineage system uses the following SurrealDB tables:

| Table | Purpose |
|---|---|
| triple_provenance | Provenance metadata per triple (source, confidence, derivation links) |
| prov_activity | PROV-O activities (import, transform, delete, etc.) |
| prov_agent | PROV-O agents (users, systems, services) |
| prov_used | Usage relationships (activity used entity) |
| prov_generated | Generation relationships (activity generated entity) |
| prov_derived | Derivation relationships (entity derived from entity) |

Indexes

The schema includes indexes for efficient querying:

| Index | Table | Fields | Purpose |
|---|---|---|---|
| idx_prov_triple | triple_provenance | triple_id | Look up provenance by triple |
| idx_prov_activity | triple_provenance | activity_id | Find triples created by an activity |
| idx_prov_source | triple_provenance | source.type | Filter by source type |
| idx_prov_confidence | triple_provenance | confidence_score | Range queries on confidence |
| idx_prov_derived | triple_provenance | derived_from | Traverse derivation chains |
| idx_activity_type | prov_activity | activity_type | Filter by activity type |
| idx_activity_agent | prov_activity | agent_id | Find activities by agent |
| idx_activity_time | prov_activity | started_at | Time-range queries on activities |

Vocabulary Reference

IndentiaDB uses two RDF namespaces for lineage:

W3C PROV-O Namespace

Prefix: prov: -- http://www.w3.org/ns/prov#

| Term | Type | Description |
|---|---|---|
| prov:Entity | Class | Data item with provenance |
| prov:Activity | Class | Action that occurs over time |
| prov:Agent | Class | Actor responsible for activities |
| prov:wasDerivedFrom | Property | Entity derivation relationship |
| prov:wasGeneratedBy | Property | Entity was generated by an activity |
| prov:wasAttributedTo | Property | Entity was attributed to an agent |
| prov:used | Property | Activity consumed an entity |
| prov:wasInformedBy | Property | Activity informed by another activity |
| prov:wasAssociatedWith | Property | Activity performed by an agent |
| prov:generatedAtTime | Property | Timestamp when entity was generated |
| prov:startedAtTime | Property | Timestamp when activity started |
| prov:endedAtTime | Property | Timestamp when activity ended |

IndentiaGraph Lineage Namespace

Prefix: ig: -- http://indentiagraph.io/ontology/lineage#

| Term | Type | Description |
|---|---|---|
| ig:hasProvenance | Property | Links triple to its provenance record |
| ig:source | Property | Source identifier string |
| ig:confidence | Property | Confidence score (0.0-1.0) |
| ig:sourceSystem | Property | Source system identifier |