Data Lineage & Provenance¶
IndentiaDB provides built-in data lineage and provenance tracking for every triple stored in the graph. Built on the W3C PROV-O (Provenance Ontology) standard, the lineage system records where data came from, how it was transformed, and who or what was responsible -- giving you a complete audit trail from source to destination. This is essential for regulatory compliance (GDPR, SOX, HIPAA), data quality management, and debugging complex ETL pipelines.
Architecture¶
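The lineage system follows the PROV-O entity-activity-agent model. Data flows from sources through an interceptor into storage, where a tracer can walk it upstream or downstream: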
+--------------+     +------------------+     +--------------+
|   Sources    |---->|   Interceptor    |---->|   Storage    |
| File / API / |     |   + Annotator    |     |   PROV-O +   |
| SPARQL / ETL |     |                  |     |  SurrealDB   |
+--------------+     +------------------+     +------+-------+
                                                     |
                                                     v
                                              +-------------+
                                              |   Tracer    |
                                              | Upstream /  |
                                              | Downstream  |
                                              +-------------+
Every write operation passes through an ingest interceptor that captures the data source, attaches a confidence score, links the operation to an activity, and stores the provenance metadata alongside the triple.
Core Concepts¶
Entities, Activities, and Agents¶
IndentiaDB maps the three PROV-O pillars to concrete objects:
| PROV-O Concept | IndentiaDB Type | Description |
|---|---|---|
| Entity | TripleProvenance | A triple with provenance metadata attached |
| Activity | ProvenanceActivity | An action that created, modified, or consumed entities (import, transform, query) |
| Agent | ProvenanceAgent | The user, service, or system that performed an activity |
Relationships¶
| Relationship | Predicate | Meaning |
|---|---|---|
| wasDerivedFrom | prov:wasDerivedFrom | Entity A was derived from Entity B |
| wasGeneratedBy | prov:wasGeneratedBy | Entity was generated by an activity |
| wasAttributedTo | prov:wasAttributedTo | Entity was attributed to an agent |
| used | prov:used | Activity consumed an entity as input |
| wasAssociatedWith | prov:wasAssociatedWith | Activity was performed by an agent |
| wasInformedBy | prov:wasInformedBy | Activity was informed by another activity |
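As a rough illustration of how these relationships look as RDF statements, the sketch below renders a single relationship as an N-Triples line. The `prov_statement` helper and the `urn:triple:` identifiers are hypothetical; only the PROV-O namespace and the predicate names come from this page.

```python
PROV_NS = "http://www.w3.org/ns/prov#"

# Relationship names listed in the Relationships table.
PROV_RELATIONSHIPS = {
    "wasDerivedFrom",
    "wasGeneratedBy",
    "wasAttributedTo",
    "used",
    "wasAssociatedWith",
    "wasInformedBy",
}


def prov_statement(subject: str, relationship: str, obj: str) -> str:
    """Render one PROV-O relationship as an N-Triples statement (hypothetical helper)."""
    if relationship not in PROV_RELATIONSHIPS:
        raise ValueError(f"unknown PROV-O relationship: {relationship}")
    return f"<{subject}> <{PROV_NS}{relationship}> <{obj}> ."


# Illustrative identifiers, not real IndentiaDB triple IDs.
print(prov_statement("urn:triple:b", "wasDerivedFrom", "urn:triple:a"))
```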
Data Sources¶
IndentiaDB tracks seven distinct data source types, each capturing source-specific metadata:
| Source Type | Fields | Use Case |
|---|---|---|
| File | path, format, checksum | Turtle, N-Triples, JSON-LD, CSV imports |
| API | endpoint, method, request_id | REST API ingestion |
| SparqlUpdate | query, user | SPARQL INSERT/DELETE operations |
| Pipeline | pipeline_id, step, run_id | ETL/ELT pipeline stages |
| External | system, identifier | Federation from external systems |
| UserAssertion | user_id | Manual data entry or assertion |
| Inferred | rule_id, source_triples | Inference engine output |
Confidence Scores¶
Every triple can carry a confidence score between 0.0 and 1.0, representing the reliability of the data. Each score is recorded together with the method used to determine it.
Confidence Methods¶
| Method | Description |
|---|---|
| SourceProvided | Confidence was supplied by the source system itself |
| SourceReliability | Calculated based on pre-configured source reliability ratings |
| MLModel | Determined by a machine learning model (includes model_id) |
| UserAsserted | Manually asserted by a user |
| Default | Fallback value (0.5) when no confidence information is available |
Filtering by confidence
Use the ProvenanceFilter with min_confidence and max_confidence to query only high-quality data. For example, setting min_confidence: 0.8 excludes low-confidence inferred triples from analytical queries.
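The effect of such a filter can be sketched client-side. The function below is not the ProvenanceFilter API. It assumes provenance records are plain dictionaries with a `confidence_score` field (the field name used in the SurrealQL example later on this page) and treats both bounds as inclusive, which is also an assumption.

```python
from typing import Iterable, Optional


def filter_by_confidence(
    records: Iterable[dict],
    min_confidence: Optional[float] = None,
    max_confidence: Optional[float] = None,
) -> list:
    """Keep records whose confidence_score lies within the inclusive bounds."""
    kept = []
    for record in records:
        score = record["confidence_score"]
        if min_confidence is not None and score < min_confidence:
            continue
        if max_confidence is not None and score > max_confidence:
            continue
        kept.append(record)
    return kept


# Illustrative records, not real IndentiaDB output.
records = [
    {"triple_id": "t1", "confidence_score": 0.95},
    {"triple_id": "t2", "confidence_score": 0.60},
    {"triple_id": "t3", "confidence_score": 0.80},
]
high_quality = filter_by_confidence(records, min_confidence=0.8)
print([r["triple_id"] for r in high_quality])  # ['t1', 't3']
```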
Recording Provenance¶
Starting an Activity¶
Every batch of provenance records is grouped under an activity. Start an activity before ingesting data:
curl -X POST http://localhost:8000/api/lineage/activities \
-H "Content-Type: application/json" \
-d '{
"type": "import",
"agent_id": "system://file-importer",
"description": "Import customer data from CRM export"
}'
Response:
{
"activity_id": "act_01HY4K8M3X...",
"started_at": "2026-03-23T10:00:00Z",
"type": "import",
"agent_id": "system://file-importer"
}
Activity Types¶
| Type | Description |
|---|---|
| Import | Loading data from an external source |
| Transform | Data transformation or enrichment |
| Merge | Merging data from multiple sources |
| Delete | Removing data |
| Inference | Generating new triples via reasoning |
| Query | Read-only access (for audit logging) |
| Correction | Correcting previously recorded data |
Correction |
Correcting previously recorded data |
Recording Triple Provenance¶
Attach provenance metadata when inserting triples:
curl -X POST http://localhost:8000/api/lineage/provenance \
-H "Content-Type: application/json" \
-d '{
"triple_id": "triple_01HY4K...",
"source": {
"type": "File",
"path": "/data/customers.ttl",
"format": "turtle",
"checksum": "sha256:e3b0c44298fc1c149..."
},
"confidence": {
"score": 0.85,
"method": "SourceReliability"
},
"activity_id": "act_01HY4K8M3X...",
"metadata": {
"batch_id": "batch-2026-03-23",
"row_number": 42
}
}'
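When building this request body programmatically, it can help to validate the confidence fields before sending. The following is a minimal sketch under stated assumptions: `build_provenance_payload` is a hypothetical helper, the identifiers in the usage lines are made up, and the server may apply different or additional validation.

```python
import json
from typing import Optional

# Confidence methods listed in the Confidence Methods table.
CONFIDENCE_METHODS = {
    "SourceProvided",
    "SourceReliability",
    "MLModel",
    "UserAsserted",
    "Default",
}


def build_provenance_payload(
    triple_id: str,
    source: dict,
    activity_id: str,
    score: float,
    method: str,
    metadata: Optional[dict] = None,
) -> dict:
    """Assemble a provenance request body shaped like the curl example (hypothetical helper)."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("confidence score must be between 0.0 and 1.0")
    if method not in CONFIDENCE_METHODS:
        raise ValueError(f"unknown confidence method: {method}")
    payload = {
        "triple_id": triple_id,
        "source": source,
        "confidence": {"score": score, "method": method},
        "activity_id": activity_id,
    }
    if metadata:
        payload["metadata"] = metadata
    return payload


# Illustrative identifiers only.
payload = build_provenance_payload(
    triple_id="triple_example_01",
    source={"type": "File", "path": "/data/customers.ttl", "format": "turtle"},
    activity_id="act_example_01",
    score=0.85,
    method="SourceReliability",
    metadata={"batch_id": "batch-2026-03-23", "row_number": 42},
)
print(json.dumps(payload, indent=2))
```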
Using the File Import Interceptor¶
The interceptor automatically captures provenance during bulk imports:
import requests

# Start an import activity
activity = requests.post("http://localhost:8000/api/lineage/activities", json={
    "type": "import",
    "agent_id": "system://python-etl",
    "description": "Daily CRM sync"
}).json()

# Upload a file with automatic provenance tracking
with open("customers.ttl", "rb") as f:
    response = requests.post(
        "http://localhost:8000/api/import",
        files={"file": f},
        data={
            "format": "turtle",
            "activity_id": activity["activity_id"],
            "confidence_score": 0.9,
            "confidence_method": "SourceReliability"
        }
    )

# Report how many triples were stored with provenance attached
print(f"Imported {response.json()['triple_count']} triples with provenance")
Automatic checksums
When using the file import interceptor, IndentiaDB automatically computes a SHA-256 checksum of the source file and stores it in the provenance record. This enables integrity verification when re-importing data.
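To check a local file against a recorded checksum, the same digest can be computed on the client. This sketch assumes the stored value uses the `sha256:<hex>` form shown in the provenance example on this page; `file_checksum` and `matches_recorded` are hypothetical helper names, not part of IndentiaDB.

```python
import hashlib


def file_checksum(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 digest of a file as 'sha256:<hex>'."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return f"sha256:{digest.hexdigest()}"


def matches_recorded(path: str, recorded: str) -> bool:
    """Compare a local file against a checksum taken from a provenance record."""
    return file_checksum(path) == recorded
```

A mismatch before a re-import indicates that the file has changed since the provenance record was written.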
Derivation Chains¶
When data is transformed, IndentiaDB records derivation relationships between the original and derived triples. This creates a chain that can be traced upstream (back to sources) or downstream (to all derived data).
Recording Derivations¶
curl -X POST http://localhost:8000/api/lineage/provenance \
-H "Content-Type: application/json" \
-d '{
"triple_id": "triple_derived_01...",
"source": {
"type": "Pipeline",
"pipeline_id": "etl-customer-enrichment",
"step": "normalize-addresses",
"run_id": "run-2026-03-23-001"
},
"activity_id": "act_transform_01...",
"derived_from": "prov_original_01..."
}'
Tracing Upstream Lineage¶
Trace a triple back to its original sources. The response lists every node in the derivation chain and the wasDerivedFrom links between them.
Response:
{
"root": "triple_derived_01...",
"depth": 3,
"nodes": [
{
"triple_id": "triple_derived_01...",
"source": {"type": "Pipeline", "pipeline_id": "etl-customer-enrichment"},
"activity_type": "Transform"
},
{
"triple_id": "triple_intermediate_01...",
"source": {"type": "SparqlUpdate", "query": "INSERT { ... }"},
"activity_type": "Transform"
},
{
"triple_id": "triple_original_01...",
"source": {"type": "File", "path": "/data/customers.ttl"},
"activity_type": "Import"
}
],
"relationships": [
{"from": "triple_derived_01...", "to": "triple_intermediate_01...", "type": "wasDerivedFrom"},
{"from": "triple_intermediate_01...", "to": "triple_original_01...", "type": "wasDerivedFrom"}
]
}
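A response of this shape can be turned into an ordered chain by following the `wasDerivedFrom` links from the root. The sketch below assumes each triple has at most one direct parent, as in the sample response; a triple derived from several inputs would need a graph traversal instead. `derivation_chain` is a hypothetical helper.

```python
def derivation_chain(trace: dict) -> list:
    """Follow wasDerivedFrom links from the root back to the original source."""
    # Map each derived triple to the triple it was derived from.
    parent = {
        rel["from"]: rel["to"]
        for rel in trace["relationships"]
        if rel["type"] == "wasDerivedFrom"
    }
    chain = [trace["root"]]
    seen = {trace["root"]}
    while chain[-1] in parent:
        nxt = parent[chain[-1]]
        if nxt in seen:  # guard against cyclic data
            break
        chain.append(nxt)
        seen.add(nxt)
    return chain


# Trimmed copy of the sample response, keeping only the fields the helper reads.
trace = {
    "root": "triple_derived_01...",
    "relationships": [
        {"from": "triple_derived_01...", "to": "triple_intermediate_01...", "type": "wasDerivedFrom"},
        {"from": "triple_intermediate_01...", "to": "triple_original_01...", "type": "wasDerivedFrom"},
    ],
}
print(derivation_chain(trace))
# ['triple_derived_01...', 'triple_intermediate_01...', 'triple_original_01...']
```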
Tracing Downstream Lineage¶
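Find all data derived from a given triple. The downstream trace endpoint, /api/lineage/trace/downstream/<triple_id>, accepts a max_depth query parameter; the GDPR Right to Erasure example under Compliance & Governance shows a full request.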
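A downstream trace can also be requested from Python. The endpoint path and the `max_depth` parameter below are taken from the GDPR example later on this page; the assumption that the response has the same `root` and `nodes` shape as the upstream response is mine, and the helper names are hypothetical.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen

BASE_URL = "http://localhost:8000"  # host and port used in the examples on this page


def downstream_url(triple_id: str, max_depth: int = 50) -> str:
    """Build the downstream trace URL for a triple."""
    path = f"/api/lineage/trace/downstream/{quote(triple_id, safe='')}"
    return f"{BASE_URL}{path}?{urlencode({'max_depth': max_depth})}"


def derived_triple_ids(trace: dict) -> list:
    """Extract triple IDs from a trace response, excluding the root itself.

    Assumes the response carries 'root' and 'nodes' like the upstream example.
    """
    return [
        node["triple_id"]
        for node in trace.get("nodes", [])
        if node["triple_id"] != trace.get("root")
    ]


def trace_downstream(triple_id: str, max_depth: int = 50) -> list:
    """Fetch the downstream trace and return the IDs of derived triples."""
    with urlopen(downstream_url(triple_id, max_depth)) as response:
        return derived_triple_ids(json.load(response))
```

Calling `trace_downstream` requires a running server; the two pure helpers can be used on their own against a saved response.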
RDF-Star Annotations¶
IndentiaDB uses RDF-star (RDF 1.2) to embed provenance metadata directly into the graph as annotations on triples. This allows provenance to be queried alongside the data using standard SPARQL-star syntax.
How It Works¶
A regular triple:
ex:alice ex:worksAt ex:acme .
With RDF-star provenance annotation:
<< ex:alice ex:worksAt ex:acme >> prov:wasGeneratedBy ex:activity_import_01 ;
prov:generatedAtTime "2026-03-23T10:00:00Z"^^xsd:dateTime ;
ig:confidence 0.95 ;
ig:source "crm-system" .
Querying Provenance with SPARQL-Star¶
PREFIX prov: <http://www.w3.org/ns/prov#>
PREFIX ig: <http://indentiagraph.io/ontology/lineage#>
SELECT ?s ?p ?o ?source ?confidence ?when WHERE {
<< ?s ?p ?o >> ig:source ?source ;
ig:confidence ?confidence ;
prov:generatedAtTime ?when .
FILTER(?confidence > 0.8)
}
ORDER BY DESC(?confidence)
Querying Lineage via SPARQL¶
Find All Triples from a Specific Source¶
PREFIX prov: <http://www.w3.org/ns/prov#>
PREFIX ig: <http://indentiagraph.io/ontology/lineage#>
SELECT ?s ?p ?o WHERE {
<< ?s ?p ?o >> ig:sourceSystem "crm" .
}
Find the Full Provenance Chain for a Triple¶
PREFIX prov: <http://www.w3.org/ns/prov#>
PREFIX ig: <http://indentiagraph.io/ontology/lineage#>
SELECT ?entity ?derivedFrom ?activity ?agent WHERE {
?entity prov:wasDerivedFrom* ?derivedFrom .
?entity prov:wasGeneratedBy ?activity .
?activity prov:wasAssociatedWith ?agent .
FILTER(?entity = <urn:triple:abc123>)
}
Find All Activities by a Specific Agent¶
PREFIX prov: <http://www.w3.org/ns/prov#>
SELECT ?activity ?type ?started ?ended WHERE {
?activity prov:wasAssociatedWith <system://file-importer> ;
a prov:Activity ;
prov:startedAtTime ?started .
OPTIONAL { ?activity prov:endedAtTime ?ended }
}
ORDER BY DESC(?started)
Identify Low-Confidence Triples¶
PREFIX ig: <http://indentiagraph.io/ontology/lineage#>
SELECT ?s ?p ?o ?confidence WHERE {
<< ?s ?p ?o >> ig:confidence ?confidence .
FILTER(?confidence < 0.5)
}
ORDER BY ?confidence
Querying Lineage via SurrealQL¶
Lineage data is also accessible through SurrealQL for applications that use the document model:
List Recent Activities¶
Find Provenance for a Triple¶
Find All Derivations from a Source¶
Filter by Source Type and Confidence¶
SELECT * FROM triple_provenance
WHERE source.type = "File"
AND confidence_score >= 0.8
ORDER BY created_at DESC
LIMIT 100;
Integration with Bitemporal¶
The lineage system integrates with IndentiaDB's bitemporal time-travel to answer questions about when data was derived and when derivation relationships were recorded:
PREFIX prov: <http://www.w3.org/ns/prov#>
PREFIX ig: <http://indentiagraph.io/ontology/lineage#>
SELECT ?entity ?source ?confidence ?tx_time WHERE {
TEMPORAL AS OF TX "2026-01-15T00:00:00Z"
?entity prov:wasDerivedFrom ?source .
<< ?entity ?p ?o >> ig:confidence ?confidence .
}
This query returns the lineage graph as it was known on January 15, 2026 -- even if derivation records were later corrected or deleted.
Retroactive corrections
When a data correction is applied retroactively, the lineage system creates a new activity of type Correction with a derivation link to the original provenance record. The bitemporal model ensures the original record remains queryable at its original transaction time.
Compliance & Governance¶
GDPR Right to Erasure¶
Trace all data derived from a specific source entity to identify every triple that must be deleted:
curl "http://localhost:8000/api/lineage/trace/downstream/triple_user_01...?max_depth=50" \
| jq '.nodes[].triple_id'
Audit Report¶
Generate a complete audit report for a time range:
curl "http://localhost:8000/api/lineage/audit?from=2026-01-01&to=2026-03-31" \
-H "Accept: application/json"
Response:
{
"period": {"from": "2026-01-01", "to": "2026-03-31"},
"summary": {
"total_activities": 1842,
"total_agents": 12,
"total_provenance_records": 458293,
"by_activity_type": {
"Import": 324,
"Transform": 1205,
"Delete": 42,
"Inference": 271
},
"by_source_type": {
"File": 182304,
"Api": 94521,
"Pipeline": 128944,
"Inferred": 52524
}
}
}
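Before feeding a report into a dashboard, a quick consistency check can catch truncated or partial results. In the sample response the per-type breakdowns sum exactly to the totals; whether that holds for every report is an assumption, and `check_audit_summary` is a hypothetical helper.

```python
def check_audit_summary(report: dict) -> dict:
    """Compare the per-type breakdowns with the reported totals."""
    summary = report["summary"]
    activity_sum = sum(summary["by_activity_type"].values())
    source_sum = sum(summary["by_source_type"].values())
    # Share of provenance records contributed by each source type.
    shares = {
        source_type: (count / source_sum if source_sum else 0.0)
        for source_type, count in summary["by_source_type"].items()
    }
    return {
        "activities_consistent": activity_sum == summary["total_activities"],
        "sources_consistent": source_sum == summary["total_provenance_records"],
        "source_share": shares,
    }


# The summary from the sample response.
report = {
    "summary": {
        "total_activities": 1842,
        "total_agents": 12,
        "total_provenance_records": 458293,
        "by_activity_type": {"Import": 324, "Transform": 1205, "Delete": 42, "Inference": 271},
        "by_source_type": {"File": 182304, "Api": 94521, "Pipeline": 128944, "Inferred": 52524},
    }
}
result = check_audit_summary(report)
print(result["activities_consistent"], result["sources_consistent"])  # True True
```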
Data Quality Dashboard Query¶
Find the confidence distribution across the graph:
PREFIX ig: <http://indentiagraph.io/ontology/lineage#>
SELECT
(COUNT(?s) AS ?total)
(AVG(?c) AS ?avg_confidence)
(MIN(?c) AS ?min_confidence)
(SUM(IF(?c >= 0.9, 1, 0)) AS ?high_quality)
(SUM(IF(?c < 0.5, 1, 0)) AS ?low_quality)
WHERE {
<< ?s ?p ?o >> ig:confidence ?c .
}
Retention Policies¶
Provenance data grows over time. Configure retention policies to automatically clean up old records while preserving required audit history:
[lineage.retention]
# Keep provenance records for 2 years
max_age_days = 730
# Keep activity records for 5 years (compliance)
activity_max_age_days = 1825
# Run cleanup every 24 hours
cleanup_interval_hours = 24
# Archive to cold storage before deleting
archive_before_delete = true
archive_bucket = "s3://lineage-archive"
Compliance requirements
Ensure retention periods meet your regulatory requirements before reducing them. GDPR requires deletion records to be maintained, SOX typically requires 7 years of financial data provenance, and HIPAA requires 6 years of access audit trails.
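The age-based settings amount to a cutoff timestamp: records created before the cutoff become eligible for cleanup. The sketch below illustrates that arithmetic only. How IndentiaDB evaluates the boundary (strict or inclusive, and in which time zone) is not stated here, so a strict comparison in UTC is an assumption, and the function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def retention_cutoff(now: datetime, max_age_days: int) -> datetime:
    """Return the oldest creation time that is still retained."""
    return now - timedelta(days=max_age_days)


def is_expired(created_at: datetime, now: datetime, max_age_days: int) -> bool:
    """True if a record is older than the retention window (strict comparison)."""
    return created_at < retention_cutoff(now, max_age_days)


now = datetime(2026, 3, 23, 10, 0, tzinfo=timezone.utc)
print(retention_cutoff(now, 730))   # provenance records: max_age_days
print(retention_cutoff(now, 1825))  # activity records: activity_max_age_days
```

Because the settings count days rather than calendar years, 1825 days falls one day short of five calendar years whenever the window contains a leap day.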
Configuration¶
Full Lineage Configuration¶
[lineage]
# Enable/disable lineage tracking
enabled = true
# Capture provenance for all write operations (recommended)
capture_all_writes = true
# Default confidence score when none is provided
default_confidence = 0.5
# Default confidence method
default_confidence_method = "Default"
# Maximum depth for upstream/downstream tracing
max_trace_depth = 50
# Store RDF-star annotations in the graph
rdf_star_annotations = true
[lineage.interceptor]
# Enable automatic file checksum computation
compute_checksums = true
# Checksum algorithm (sha256, sha512, md5)
checksum_algorithm = "sha256"
# Maximum file size for checksum computation (bytes)
max_checksum_file_size = 1073741824 # 1 GB
[lineage.confidence]
# Source reliability ratings (used by SourceReliability method)
[lineage.confidence.source_ratings]
"crm" = 0.95
"erp" = 0.90
"external-api" = 0.70
"user-input" = 0.80
"inference" = 0.60
[lineage.retention]
max_age_days = 730
activity_max_age_days = 1825
cleanup_interval_hours = 24
archive_before_delete = false
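One way to picture how the source ratings and the default interact is a simple lookup with a fallback. This is a sketch of the idea, not IndentiaDB's implementation: the values mirror the configuration above, `resolve_confidence` is a hypothetical function, and matching on an exact source-system key is an assumption.

```python
from typing import Optional, Tuple

DEFAULT_CONFIDENCE = 0.5  # mirrors lineage.default_confidence

# Mirrors [lineage.confidence.source_ratings].
SOURCE_RATINGS = {
    "crm": 0.95,
    "erp": 0.90,
    "external-api": 0.70,
    "user-input": 0.80,
    "inference": 0.60,
}


def resolve_confidence(source_system: Optional[str]) -> Tuple[float, str]:
    """Return (score, method) for a source system, falling back to the default."""
    if source_system in SOURCE_RATINGS:
        return SOURCE_RATINGS[source_system], "SourceReliability"
    return DEFAULT_CONFIDENCE, "Default"


print(resolve_confidence("crm"))      # (0.95, 'SourceReliability')
print(resolve_confidence("unknown"))  # (0.5, 'Default')
```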
SurrealDB Schema¶
The lineage system uses the following SurrealDB tables:
| Table | Purpose |
|---|---|
| triple_provenance | Provenance metadata per triple (source, confidence, derivation links) |
| prov_activity | PROV-O activities (import, transform, delete, etc.) |
| prov_agent | PROV-O agents (users, systems, services) |
| prov_used | Usage relationships (activity used entity) |
| prov_generated | Generation relationships (activity generated entity) |
| prov_derived | Derivation relationships (entity derived from entity) |
Indexes¶
The schema includes indexes for efficient querying:
| Index | Table | Fields | Purpose |
|---|---|---|---|
| idx_prov_triple | triple_provenance | triple_id | Look up provenance by triple |
| idx_prov_activity | triple_provenance | activity_id | Find triples created by an activity |
| idx_prov_source | triple_provenance | source.type | Filter by source type |
| idx_prov_confidence | triple_provenance | confidence_score | Range queries on confidence |
| idx_prov_derived | triple_provenance | derived_from | Traverse derivation chains |
| idx_activity_type | prov_activity | activity_type | Filter by activity type |
| idx_activity_agent | prov_activity | agent_id | Find activities by agent |
| idx_activity_time | prov_activity | started_at | Time-range queries on activities |
Vocabulary Reference¶
IndentiaDB uses two RDF namespaces for lineage:
W3C PROV-O Namespace¶
Prefix: prov: -- http://www.w3.org/ns/prov#
| Term | Type | Description |
|---|---|---|
| prov:Entity | Class | Data item with provenance |
| prov:Activity | Class | Action that occurs over time |
| prov:Agent | Class | Actor responsible for activities |
| prov:wasDerivedFrom | Property | Entity derivation relationship |
| prov:wasGeneratedBy | Property | Entity was generated by an activity |
| prov:wasAttributedTo | Property | Entity was attributed to an agent |
| prov:used | Property | Activity consumed an entity |
| prov:wasInformedBy | Property | Activity informed by another activity |
| prov:wasAssociatedWith | Property | Activity performed by an agent |
| prov:generatedAtTime | Property | Timestamp when entity was generated |
| prov:startedAtTime | Property | Timestamp when activity started |
| prov:endedAtTime | Property | Timestamp when activity ended |
IndentiaGraph Lineage Namespace¶
Prefix: ig: -- http://indentiagraph.io/ontology/lineage#
| Term | Type | Description |
|---|---|---|
| ig:hasProvenance | Property | Links triple to its provenance record |
| ig:source | Property | Source identifier string |
| ig:confidence | Property | Confidence score (0.0-1.0) |
| ig:sourceSystem | Property | Source system identifier |