Architecture

This document describes the internal architecture of IndentiaDB: how data is stored, how queries are executed, and how high availability is achieved.

Storage Layer Overview

┌──────────────────────────────────────────────────────────────────────┐
│                        IndentiaDB Process                            │
│                                                                      │
│  ┌────────────────────────────────────────────────────────────────┐  │
│  │                     Query Router                               │  │
│  │   SurrealQL  │  SPARQL 1.2  │  LPG JSON DSL  │  ES DSL        │  │
│  └────────┬─────────────┬───────────────┬──────────────┬──────────┘  │
│           │             │               │              │              │
│  ┌────────▼──────┐  ┌───▼────────┐  ┌──▼──────────┐  │              │
│  │  SurrealDB    │  │  RDF Triple │  │  LPG Engine │  │              │
│  │  Engine       │  │  Store      │  │  (CSR)      │  │              │
│  └────────┬──────┘  └───┬────────┘  └──────────────┘  │              │
│           │             │                              │              │
│  ┌────────▼─────────────▼──────────────────────────────▼──────────┐  │
│  │              Physical Storage Backend                           │  │
│  │                                                                 │  │
│  │   Option A: SurrealDB embedded (kv-mem / kv-surrealkv)         │  │
│  │   Option B: TiKV distributed (Raft consensus, multi-DC)        │  │
│  └─────────────────────────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────────────────────┘

1. Dual Storage Backends

IndentiaDB supports two physical storage backends. You choose one at deployment time based on your scale and availability requirements.

Option A: SurrealDB Embedded

SurrealDB is embedded directly into the IndentiaDB process. No separate database process is needed. This is the default for development and single-node production deployments.

Storage engines available within SurrealDB embedded:

Engine	Description	Recommended For
`kv-mem`	In-memory only, data lost on restart	Development, testing, ephemeral workloads
`kv-surrealkv`	Persistent on-disk storage using SurrealKV	Production single-node deployments

Characteristics:

Zero operational overhead — one binary, one process
Supports LIVE queries (WebSocket push on data change)
Full ACID transactions with snapshot isolation
Maximum practical dataset size: approximately 1 TB (single-node disk)
Multi-datacenter replication: not supported
High availability: not supported (single node)

Option B: TiKV Distributed

TiKV is an external distributed key-value store based on the Raft consensus protocol, developed by PingCAP. IndentiaDB connects to an existing TiKV cluster as its storage backend via SurrealDB's kv-tikv driver.

Characteristics:

Horizontal scaling across multiple nodes and datacenters
Raft-based replication with automatic failover
Unlimited dataset size (add nodes to scale)
Multi-datacenter support with region-aware placement
High availability: yes (3+ TiKV nodes required for quorum)
LIVE queries: not available (TiKV does not support push notifications)
Operational complexity: high (requires a separately operated TiKV cluster with PD)

TiKV Cluster Topology

A TiKV deployment consists of two components:

┌──────────────────────────────────┐
│   PD (Placement Driver) × 3      │
│   pd-0, pd-1, pd-2               │
│   ─────────────────────────────  │
│   • Raft quorum for metadata     │
│   • Cluster topology coordinator │
│   • Region leader election       │
│   • Port 2379 (client API)       │
│   • Port 2380 (peer replication) │
└────────────────┬─────────────────┘
                 │ registers with PD
┌────────────────▼─────────────────┐
│   TiKV storage nodes × 3         │
│   tikv-0, tikv-1, tikv-2         │
│   ─────────────────────────────  │
│   • Range-based sharding         │
│   • Raft replication per region  │
│   • Port 20160 (gRPC data)       │
│   • Port 20180 (status/metrics)  │
└──────────────────────────────────┘

Minimum cluster: 3 PD nodes + 3 TiKV nodes. At least two of the three TiKV nodes must be available for writes (Raft quorum requires a majority), so the cluster tolerates the loss of one TiKV node.

Configuring IndentiaDB to Use TiKV

Set the SURREAL_URL environment variable to the PD endpoints:

# Single PD node (not recommended for production)
SURREAL_URL=tikv://pd-0:2379

# Three PD nodes (recommended — Raft quorum for metadata)
SURREAL_URL=tikv://pd-0:2379,pd-1:2379,pd-2:2379

With Docker:

docker run -d \
  -e SURREAL_URL="tikv://pd-0:2379,pd-1:2379,pd-2:2379" \
  -p 7001:7001 -p 9200:9200 \
  ghcr.io/indentiaplatform/indentiadb-trial:latest

In config.toml:

[storage]
backend = "tikv"
tikv_pd_endpoints = ["pd-0:2379", "pd-1:2379", "pd-2:2379"]

Docker Compose: Full TiKV Stack

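# Abbreviated listing: pd-1, pd-2, tikv-1 and tikv-2 follow the same pattern as
# pd-0 and tikv-0 (each with its own name and advertise URLs). They must be
# defined before this file will start, because depends_on references them.
services: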
  pd-0:
    image: pingcap/pd:v8.5.0
    command: >
      --name=pd-0
      --client-urls=http://0.0.0.0:2379
      --peer-urls=http://0.0.0.0:2380
      --advertise-client-urls=http://pd-0:2379
      --advertise-peer-urls=http://pd-0:2380
      --initial-cluster=pd-0=http://pd-0:2380,pd-1=http://pd-1:2380,pd-2=http://pd-2:2380

  tikv-0:
    image: pingcap/tikv:v8.5.0
    command: >
      --addr=0.0.0.0:20160
      --advertise-addr=tikv-0:20160
      --pd=pd-0:2379,pd-1:2379,pd-2:2379
    depends_on: [pd-0, pd-1, pd-2]

  indentiadb:
    image: ghcr.io/indentiaplatform/indentiadb-trial:latest
    environment:
      SURREAL_URL: "tikv://pd-0:2379,pd-1:2379,pd-2:2379"
    ports:
      - "7001:7001"
      - "9200:9200"
    depends_on: [tikv-0, tikv-1, tikv-2]

TiKV and LIVE queries

TiKV does not support push notifications. LIVE SELECT statements and DEFINE EVENT handlers are unavailable when using the TiKV backend. Use the embedded kv-surrealkv backend if you need reactive queries.

Comparison Table

Dimension	SurrealDB Embedded	TiKV Distributed
Complexity	Low — single process	High — separate cluster
Scalability	Single node	Horizontal, unlimited
Write Latency	~0.1–1 ms (local disk)	~1–10 ms (network round-trip)
High Availability	No	Yes (Raft quorum)
Max Dataset	~1 TB	Unlimited
LIVE Queries	Yes (WebSocket)	No
Multi-DC	No	Yes
Operational Cost	Minimal	Significant
Recommended For	Development, <1 TB prod	>1 TB, HA required

Recommendation: Start with SurrealDB embedded (kv-surrealkv). Migrate to TiKV when you approach 500 GB of stored data, require cross-datacenter replication, or need automatic failover.

2. The QLever-compatible SPARQL Engine

IndentiaDB's SPARQL subsystem is a Rust-native reimplementation of the core indexing and query-evaluation concepts from QLever — a high-performance SPARQL engine developed at the University of Freiburg. The original C++ QLever project demonstrated that permutation-based triple indexing with vocabulary compression can evaluate SPARQL queries one to two orders of magnitude faster than traditional B-tree approaches.

IndentiaDB embeds those same algorithmic ideas natively in Rust, without a C++ dependency or a separate C++ QLever process. The goals are:

Bit-exact result compatibility with upstream C++ QLever for deterministic SPARQL queries
Native integration with the SurrealDB write path and transaction layer
Extension via RDF-star and SPARQL 1.2 features not yet in upstream QLever

What QLever-compatible Contributes

Concept	C++ QLever Origin	IndentiaDB Implementation
6-permutation triple index	QLever's SPO/SOP/PSO/POS/OSP/OPS layout	Same layout, ZSTD + delta + varint compression
Vocabulary compression	FSST (Fast Static Symbol Table)	FSST-compatible encoding; memory-mapped via `memmap2`
`ql:contains-word` predicate	QLever's text index extension	Full inverted BM25 index integrated in IndentiaDB's FTS layer
Cost-based join ordering	QLever's cardinality estimator	`sparopt` crate — filter pushdown, join reordering
Delta triple tracking	QLever's delta index for updates	IndentiaDB delta layer that tracks insertions/deletions per graph

Query Routing: QLever-compatible Engine vs SurrealDB Engine

Not every query hits the QLever-compatible path. The query router dispatches based on the operation type:

Query Characteristic	Target Engine	Reason
`SELECT / CONSTRUCT / ASK / DESCRIBE` — read only, no FTS	QLever-compatible (Rust)	Permutation indexes optimally serve pure graph reads
`SELECT` with `ql:contains-word` predicate	QLever-compatible + FTS index	QLever-compatible invokes the integrated BM25 text index
`INSERT DATA / DELETE DATA / DELETE…WHERE`	SurrealDB	All writes go through the SurrealDB ACID layer
`SPARQL()` inside a SurrealQL statement	QLever-compatible (inline dispatch)	Results returned as SurrealQL-typed values
Requests to port `9200`	FTS + vector layer	Elasticsearch-compatible API, bypasses SPARQL parser
`SERVICE <external-endpoint>`	Federation engine	Distributed across QLever-compatible + remote SPARQL endpoints

The split means reads are served entirely by the highly optimised permutation indexes, while all state changes go through SurrealDB's transaction log. This maintains ACID guarantees and, on the embedded backend, enables LIVE queries.

The `ql:contains-word` Predicate

C++ QLever introduced the ql:contains-word predicate as an extension for expressing full-text conditions inside SPARQL. IndentiaDB supports it natively:

PREFIX ql: <http://qlever.cs.uni-freiburg.de/builtin/>

SELECT ?article ?title ?score WHERE {
  ?article <http://purl.org/dc/terms/title>   ?title .
  ?article <http://purl.org/dc/terms/subject> ?subject .
  (?title ?score) ql:contains-word "knowledge graph" .
}
ORDER BY DESC(?score)
LIMIT 20

The (?var ?score) pair binds the BM25 score alongside the matched literal. The query is routed to the QLever-compatible engine, which evaluates the triple patterns via permutation lookup and resolves the text predicate against the inverted BM25 index in a single pass.

Dual-Backend HA Architecture (QLever-compatible + SurrealDB)

For large-scale deployments, QLever-compatible read replicas can be run alongside the SurrealDB cluster. The query router sends all reads to the QLever-compatible tier and all writes to SurrealDB, with an asynchronous sync process keeping the QLever-compatible index up to date:

 Clients
    │
    ▼
┌──────────────────────────────────────────────────────────────────┐
│                           Query Router                           │
│   SPARQL reads ──► QLever-compatible replicas (ReadOnlyMany PVC) │
│   Writes       ──► SurrealDB cluster (TiKV backend)              │
└──────────────────────────────────────────────────────────────────┘
              │                               │
              ▼                               ▼
 ┌─────────────────────────┐       ┌─────────────────────┐
 │ QLever-compatible nodes │◄──────│ SurrealDB + TiKV    │
 │ qlever-0,1,2            │ async │ surreal-0,1,2       │
 │ (NFS/CephFS             │ sync  │ + pd-0,1,2          │
 │  shared index)          │       │ + tikv-0,1,2        │
 └─────────────────────────┘       └─────────────────────┘

Use this topology when:

Your RDF dataset exceeds 100 million triples and SPARQL query latency is critical
You need to scale reads and writes independently
You are running analytics-heavy workloads alongside real-time updates

For most deployments under 50 million triples, a single IndentiaDB process with the integrated QLever-compatible engine and kv-surrealkv storage is sufficient.

3. RDF Triple Storage

The RDF triple store is the semantic core of IndentiaDB. It is implemented as a custom storage layer optimized for SPARQL pattern matching.

6-Permutation Index

Every triple (subject, predicate, object) in every named graph is indexed in six orderings:

Index	Ordering	Optimized For
SPO	Subject → Predicate → Object	Lookup all predicates and objects for a subject
SOP	Subject → Object → Predicate	Lookup all predicates connecting a subject to an object
PSO	Predicate → Subject → Object	Lookup all subjects with a given predicate
POS	Predicate → Object → Subject	Lookup all subjects that point to a given object via a predicate
OSP	Object → Subject → Predicate	Lookup all subjects that reference a given object
OPS	Object → Predicate → Subject	Lookup via object then predicate

The query optimizer selects the index permutation that best matches the bound variables in each triple pattern. A fully bound pattern (s, p, o) resolves with a single key lookup. A pattern with an unbound subject, (?, p, o), uses the POS index.
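The selection rule can be illustrated with a small sketch. The function name `choose_permutation` and its tie-breaking order are assumptions made for this example and do not describe IndentiaDB's actual optimizer; the only point is that bound positions form the key prefix, so the lookup becomes a prefix range scan.

```python
def choose_permutation(s_bound: bool, p_bound: bool, o_bound: bool) -> str:
    """Pick an index ordering whose key prefix covers the bound positions.

    Hypothetical helper: bound positions are placed first (predicate, then
    object, then subject when several are bound), unbound positions follow.
    """
    flags = {"S": s_bound, "P": p_bound, "O": o_bound}
    bound = sorted((c for c in "SPO" if flags[c]), key="POS".index)
    free = [c for c in "SPO" if not flags[c]]
    return "".join(bound + free)


# (?, p, o): predicate and object are bound, so the POS ordering is chosen.
print(choose_permutation(False, True, True))
# (s, ?, ?): only the subject is bound, so the SPO ordering is chosen.
print(choose_permutation(True, False, False))
```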

Compression

Triple data is compressed at multiple levels:

ZSTD compression on stored byte sequences (configurable level 1–22, default 3). Level 3 gives a good balance of compression ratio and CPU overhead. Set TRIPLE_STORE_COMPRESSION_LEVEL=9 for higher compression at the cost of write throughput.
Delta encoding for sequences of similar IRIs and literals. Adjacent values in sorted order are stored as deltas rather than full values.
Varint compression on integer deltas — small integers require fewer bytes.
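As a rough illustration of the delta and varint stages, the sketch below encodes a sorted list of integer IDs. It assumes LEB128-style varints and non-negative, ascending IDs; the on-disk format IndentiaDB actually uses is not specified here, and the ZSTD stage is omitted.

```python
def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer, 7 bits per byte, low bits first."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_sorted_ids(ids):
    """Store each ID as the varint-encoded difference from its predecessor."""
    out = bytearray()
    prev = 0
    for value in ids:
        out += encode_varint(value - prev)
        prev = value
    return bytes(out)


def decode_sorted_ids(data: bytes):
    """Inverse of encode_sorted_ids."""
    ids, current, shift, prev = [], 0, 0, 0
    for byte in data:
        current |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            prev += current
            ids.append(prev)
            current, shift = 0, 0
    return ids


ids = [1000, 1003, 1010, 1500]
packed = encode_sorted_ids(ids)
print(len(packed), decode_sorted_ids(packed) == ids)
```

Small gaps such as 3 and 7 fit in one byte each, which is why sorting before delta encoding matters.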

Dual Vocabulary System

All IRI strings and literal values are stored in two vocabulary tables:

IRI vocabulary — maps every unique IRI string to an integer ID. Triple index entries store integer IDs, not raw strings. This means the string http://xmlns.com/foaf/0.1/name (30 bytes) is stored once and referenced by a 4-byte integer everywhere else.
Literal vocabulary — maps literal values (strings, numbers, dates) to integer IDs with datatype and optional language tag.

Vocabulary lookups use memory-mapped I/O via the memmap2 crate, allowing the vocabulary to exceed RAM size while maintaining fast random access through the OS page cache.
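The dictionary-encoding idea can be sketched in a few lines. The `Vocabulary` class below is a hypothetical in-memory stand-in, not the real implementation: it only shows how each distinct string is stored once and triples carry integer IDs.

```python
class Vocabulary:
    """Hypothetical in-memory term dictionary: string <-> integer ID."""

    def __init__(self):
        self._ids = {}    # term -> ID
        self._terms = []  # ID -> term

    def intern(self, term: str) -> int:
        """Return the ID for term, assigning the next free ID on first sight."""
        if term not in self._ids:
            self._ids[term] = len(self._terms)
            self._terms.append(term)
        return self._ids[term]

    def lookup(self, term_id: int) -> str:
        return self._terms[term_id]


iris = Vocabulary()
literals = Vocabulary()
FOAF_NAME = "http://xmlns.com/foaf/0.1/name"

# Two triples sharing a predicate: the predicate string is stored only once.
triples = [
    (iris.intern("http://example.org/alice"), iris.intern(FOAF_NAME), literals.intern("Alice")),
    (iris.intern("http://example.org/bob"), iris.intern(FOAF_NAME), literals.intern("Bob")),
]
print(triples)
```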

Memory-Mapped I/O

The vocabulary files are memory-mapped using memmap2. This means:

The OS page cache serves as an automatic buffer pool
Random reads do not require explicit read() syscalls
The working set of frequently accessed vocabulary entries stays resident in memory (in the OS page cache), so repeated lookups avoid disk I/O
Vocabulary files can be larger than available RAM
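The same access pattern can be sketched with Python's standard `mmap` module in place of the Rust `memmap2` crate. The file layout used here (a count, an offset table, then the UTF-8 bytes) is an assumption made for the example and not IndentiaDB's actual vocabulary format.

```python
import mmap
import struct


def write_vocab(path, terms):
    """Hypothetical layout: u32 count, count+1 u64 offsets, then UTF-8 bytes."""
    blobs = [t.encode("utf-8") for t in terms]
    offsets = [0]
    for blob in blobs:
        offsets.append(offsets[-1] + len(blob))
    with open(path, "wb") as f:
        f.write(struct.pack("<I", len(blobs)))
        f.write(struct.pack(f"<{len(offsets)}Q", *offsets))
        f.write(b"".join(blobs))


class MappedVocab:
    """Read-only vocabulary; pages are loaded by the OS on first access."""

    def __init__(self, path):
        self._file = open(path, "rb")
        self._map = mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_READ)
        (self._count,) = struct.unpack_from("<I", self._map, 0)
        self._data_start = 4 + 8 * (self._count + 1)

    def lookup(self, term_id: int) -> str:
        if not 0 <= term_id < self._count:
            raise IndexError(term_id)
        # Two adjacent offsets delimit the term's bytes; no read() call is issued.
        start, end = struct.unpack_from("<2Q", self._map, 4 + 8 * term_id)
        base = self._data_start
        return self._map[base + start:base + end].decode("utf-8")

    def close(self):
        self._map.close()
        self._file.close()
```

Only the pages touched by `lookup` need to be resident, which is what allows the file to be larger than RAM.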

4. Query Execution Pipeline

Parsing

SPARQL queries are parsed by the spargebra crate, which produces a typed abstract syntax tree representing the query algebra. The parser validates syntax and resolves prefix declarations.

Optimization

The sparopt crate applies cost-based query optimization to the parsed algebra:

Filter pushdown — WHERE clause filters are pushed as close to the data source as possible, reducing intermediate result set sizes.
Cardinality estimation — the optimizer estimates the number of results each triple pattern will produce based on stored statistics.
Join ordering — joins are reordered so the smallest result sets are joined first, minimizing the size of intermediate results.
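A greedy version of cardinality-driven join ordering is sketched below. It is not the `sparopt` algorithm; the function name, the tuple layout, and the rule of preferring patterns that share a variable with what has already been joined are illustrative assumptions.

```python
def order_patterns(patterns):
    """Greedy join order.

    patterns: list of (name, estimated_rows, variables) tuples, where
    variables is a set of variable names used by the triple pattern.
    Starts with the smallest pattern, then repeatedly adds the smallest
    remaining pattern that shares a variable with those already chosen,
    which avoids Cartesian products where possible.
    """
    remaining = sorted(patterns, key=lambda p: p[1])
    if not remaining:
        return []
    first = remaining.pop(0)
    ordered = [first]
    seen = set(first[2])
    while remaining:
        connected = [p for p in remaining if p[2] & seen]
        nxt = min(connected or remaining, key=lambda p: p[1])
        remaining.remove(nxt)
        ordered.append(nxt)
        seen |= nxt[2]
    return [p[0] for p in ordered]


patterns = [
    ("type", 1_000_000, {"s"}),
    ("name", 500, {"s", "n"}),
    ("city", 1_000, {"c", "x"}),
    ("lives", 20_000, {"s", "c"}),
]
print(order_patterns(patterns))
```

In this example `city` has a small estimate but shares no variable with `name`, so `lives` is joined first to connect the two.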

Join Strategies

The query executor selects from four join strategies depending on the cardinality and index availability:

Strategy	When Used	Complexity
Hash Join	Large result sets, no ordering required	O(n + m) time, O(n) memory
Merge Join	Both inputs sorted on the join key	O(n + m) time, O(1) memory
Index Nested Loop	Small outer result set, indexed inner	O(n × log m) time
EXISTS Join	Semi-join / anti-join for FILTER EXISTS / FILTER NOT EXISTS	O(n) with hash set probe
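Two of these strategies are short enough to sketch over lists of variable bindings. The functions below are simplified illustrations that join on a single shared variable; they are assumptions for this example and do not reflect the executor's real data structures.

```python
from collections import defaultdict


def hash_join(left, right, key):
    """Build a hash table on the left input, then probe it with the right input.

    O(n + m) time and O(n) memory for the table, ignoring output size.
    """
    table = defaultdict(list)
    for row in left:
        table[row[key]].append(row)
    return [{**l, **r} for r in right for l in table.get(r[key], [])]


def exists_join(left, right, key, negate=False):
    """Semi-join (FILTER EXISTS) or anti-join (FILTER NOT EXISTS) via a hash set."""
    keys = {row[key] for row in right}
    return [row for row in left if (row[key] in keys) != negate]


people = [{"s": 1, "name": "a"}, {"s": 2, "name": "b"}]
homes = [{"s": 1, "city": "x"}, {"s": 1, "city": "y"}, {"s": 3, "city": "z"}]
print(hash_join(people, homes, "s"))
print(exists_join(people, homes, "s", negate=True))
```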

Aggregation

SPARQL aggregation functions are evaluated after join processing:

COUNT(?x) / COUNT(*) — COUNT(?x) counts the solutions in which ?x is bound; COUNT(*) counts all solutions in the group
SUM(?x) — sum of numeric values
AVG(?x) — arithmetic mean
MIN(?x) / MAX(?x) — minimum and maximum over any comparable type
GROUP_CONCAT(?x; separator=", ") — concatenate string values; returns xsd:string per SPARQL 1.2 Feb 3 ED
SAMPLE(?x) — returns an arbitrary value from the group (useful for non-grouped columns)

Property Paths

SPARQL property path expressions are evaluated using breadth-first search (BFS) with cycle detection (per SPARQL 1.2 Issues #266 and #267):

Operator	Syntax	Meaning
Sequence	`p1 / p2`	p1 followed by p2
Alternative	`p1 | p2`	p1 or p2
Zero or more	`p*`	Transitive closure including zero hops
One or more	`p+`	Transitive closure requiring at least one hop
Zero or one	`p?`	Optional single hop
Inverse	`^p`	Traverse in reverse direction
Negated property set	`!(p1 | p2)`	Any predicate except p1 or p2

BFS with cycle detection ensures property paths over cyclic graphs terminate correctly.
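A minimal sketch of how `p+` and `p*` can be evaluated with BFS and a visited set follows. The adjacency-dictionary input and the function name are assumptions for the example; the real evaluator works against the permutation indexes.

```python
from collections import deque


def eval_path_closure(edges, start, include_zero=False):
    """Nodes reachable from start via p+ (or p* when include_zero is True).

    edges: dict mapping a node to the nodes it reaches through predicate p.
    The visited set guarantees each node is expanded at most once, so the
    loop terminates even when the graph contains cycles.
    """
    visited = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, ()):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(nxt)
    if include_zero:
        visited.add(start)  # p* also matches the zero-length path
    return visited


cyclic = {"a": ["b"], "b": ["c"], "c": ["a"]}
print(sorted(eval_path_closure(cyclic, "a")))
```

The start node is deliberately not pre-marked as visited, so for `p+` it appears in the result only when a cycle leads back to it.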

5. Vector Storage

HNSW Indexing

Vector embeddings are indexed using HNSW (Hierarchical Navigable Small World) graphs. HNSW is an approximate nearest neighbor algorithm that provides sub-linear search time with configurable accuracy/speed trade-offs.

Index configuration parameters:

Parameter	SurrealQL Keyword	Description	Default
`dimension`	`DIMENSION n`	Number of dimensions in each vector	Required
`distance metric`	`DIST COSINE` or `DIST EUCLIDEAN`	Similarity metric	`COSINE`
`m`	`M n`	Maximum connections per node per layer	16
`ef_construction`	`EFC n`	Candidate list size during index build	200
`n_probe`	at query time `<|k,ef|>`	Candidate list size during search	100

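Search complexity: HNSW search is sub-linear in n, the total number of indexed vectors; in the typical case the number of graph hops grows as O(log n). Each distance computation costs O(d) for vectors of dimension d, and the constant factor grows with M and with the search candidate list size (n_probe, passed as ef in the query operator).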

Index definition example:

DEFINE INDEX idx_embedding ON document FIELDS embedding
  HNSW DIMENSION 1536 DIST COSINE
  EFC 200 M 16;

Query syntax:

-- Find 10 nearest neighbors with ef=200
SELECT id, title,
  vector::similarity::cosine(embedding, $query_vec) AS score
FROM document
WHERE embedding <|10,200|> $query_vec
ORDER BY score DESC;

The <|k,ef|> operator performs the HNSW search with k results and ef candidate list size. Increasing ef improves recall at the cost of search latency.
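When tuning ef, it helps to compare the approximate results against an exact baseline. The sketch below is one way to do that outside the database; the function names are hypothetical, and it assumes the vectors and the IDs returned by the HNSW query have already been fetched into memory.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def exact_top_k(query, vectors, k):
    """Brute-force nearest neighbors. vectors: dict of id -> list of floats."""
    ranked = sorted(
        vectors,
        key=lambda i: cosine_similarity(query, vectors[i]),
        reverse=True,
    )
    return ranked[:k]


def recall_at_k(approx_ids, exact_ids):
    """Fraction of the exact neighbors that the approximate search returned."""
    exact = set(exact_ids)
    return len(exact & set(approx_ids)) / len(exact) if exact else 1.0


vectors = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}
truth = exact_top_k([1.0, 0.0], vectors, 2)
print(truth, recall_at_k(["a", "c"], truth))
```

If recall is too low for a given workload, raising ef is the first knob to try, at the cost of search latency.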

6. Full-Text Search

Inverted Index Structure

The full-text search engine maintains a fragment-based inverted index made up of three structures:

4-character prefix fragments — tokens are split into overlapping 4-character prefixes. This enables prefix matching and fuzzy queries without a separate phonetic index.
Word postings — for each unique term, a posting list records which document IDs contain that term and at what positions.
Entity co-occurrence postings — tracks which terms appear near which other terms, enabling phrase queries and proximity scoring.

Multi-Stage Compression

Posting lists are compressed in multiple stages to minimize storage overhead:

Gap encoding — document IDs in posting lists are stored as gaps (deltas) between consecutive IDs rather than absolute values. Most gaps are small integers.
Frequency encoding — term frequencies are encoded separately from positions.
Simple8b — packs multiple small integers into 64-bit words using a variable-length encoding scheme optimized for modern CPUs.
ZSTD — the final compressed blocks are ZSTD-compressed for additional size reduction.

BM25/TF-IDF Scoring

Document ranking uses BM25 with configurable k1 and b parameters:

BM25(d, q) = Σ IDF(t) × (tf(t,d) × (k1 + 1)) / (tf(t,d) + k1 × (1 - b + b × |d|/avgdl))

k1 controls term frequency saturation (default 1.2)
b controls document length normalization (default 0.75)
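The formula can be checked with a direct transcription. The IDF variant used below, log(1 + (N - df + 0.5) / (df + 0.5)), is an assumption, since the document does not state which IDF definition is used; the function name and argument layout are likewise illustrative.

```python
import math


def bm25_score(query_terms, doc_tokens, doc_freq, n_docs, avgdl, k1=1.2, b=0.75):
    """Score one document against a query.

    doc_freq: dict of term -> number of documents containing the term.
    n_docs:   total number of documents in the index.
    avgdl:    average document length in tokens.
    """
    score = 0.0
    dl = len(doc_tokens)
    for term in query_terms:
        tf = doc_tokens.count(term)
        if tf == 0:
            continue
        df = doc_freq.get(term, 0)
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))  # assumed IDF variant
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avgdl))
    return score


doc = "knowledge graph engines index graph data".split()
print(bm25_score(["graph"], doc, {"graph": 3}, n_docs=10, avgdl=6.0))
```

Raising k1 lets repeated terms keep adding to the score for longer, while b = 0 disables length normalization entirely.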

Configure per-index:

DEFINE INDEX idx_body ON article FIELDS body
  SEARCH ANALYZER english_analyzer BM25(1.2, 0.75);

7. Raft Consensus for High Availability

When deploying IndentiaDB in HA mode (three or more nodes), nodes coordinate using the OpenRaft 0.9 implementation of the Raft consensus protocol.

How Raft Works in IndentiaDB

Leader election — one node is elected leader per term. All writes go through the leader. Elections use randomized timeouts to avoid split votes.
Log replication — the leader appends write operations to its log and replicates them to followers. A write is committed once a quorum (majority) of nodes acknowledge it.
Snapshot isolation — reads return a consistent snapshot of committed state. Uncommitted writes are not visible.
Monotonic reads — read requests are serviced only by nodes that are up-to-date. Stale reads are prevented by checking the commit index before serving results.
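The quorum rule for committing a write can be expressed as a small calculation over each node's replicated log position. This is a simplified sketch with a hypothetical function name, not OpenRaft's API; real Raft additionally requires that the entry at the commit index belongs to the leader's current term.

```python
def commit_index(match_indexes):
    """Highest log index that is stored on a majority of nodes.

    match_indexes: the last replicated log index for every node in the
    cluster, including the leader itself.
    """
    ranked = sorted(match_indexes, reverse=True)
    quorum = len(ranked) // 2 + 1
    # The quorum-th largest value is held by at least `quorum` nodes.
    return ranked[quorum - 1]


# Leader at 10, one follower at 9, one lagging at 7: index 9 is on two of three nodes.
print(commit_index([10, 9, 7]))
```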

Network Transport

Raft messages between nodes use gRPC (via the tonic crate) with mutual TLS:

[raft]
node_id = 1
peers = [
    { id = 2, addr = "node2:7010" },
    { id = 3, addr = "node3:7010" }
]
tls_cert = "/etc/indentiadb/server.crt"
tls_key  = "/etc/indentiadb/server.key"
tls_ca   = "/etc/indentiadb/ca.crt"

HAProxy Load Balancing

Client connections should be load-balanced using HAProxy (or any HTTP/WebSocket-aware proxy). HAProxy should route all write operations to the Raft leader and can distribute read operations across all healthy nodes.

Example HAProxy backend configuration:

frontend indentiadb_front
    bind *:7001
    default_backend indentiadb_back

backend indentiadb_back
    balance roundrobin
    option httpchk GET /health
    server node1 node1:7001 check
    server node2 node2:7001 check
    server node3 node3:7001 check

For writes requiring leader routing, use the /health/leader endpoint to identify the current leader and route accordingly.

Cluster Sizing

Nodes	Fault Tolerance	Quorum
1	None (single point of failure)	1
3	1 node failure	2
5	2 node failures	3

Recommendation: Deploy 3 nodes for production HA. Deploy 5 nodes only when you need to tolerate 2 simultaneous failures (e.g., multi-AZ with one full AZ failure).
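The table values follow from simple majority arithmetic, which the sketch below reproduces. The helper names are chosen for this example only.

```python
def quorum(nodes: int) -> int:
    """Smallest number of nodes that forms a majority."""
    return nodes // 2 + 1


def fault_tolerance(nodes: int) -> int:
    """Number of nodes that can fail while a majority remains."""
    return nodes - quorum(nodes)


for n in (1, 3, 4, 5):
    print(n, quorum(n), fault_tolerance(n))
```

A 4-node cluster needs a quorum of 3 and therefore tolerates only one failure, the same as a 3-node cluster, which is why odd cluster sizes are the usual choice.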

Storage Layer ASCII Diagram

IndentiaDB Storage Architecture
════════════════════════════════════════════════════════════════════

  WRITES                          READS
    │                               │
    ▼                               ▼
┌───────────────────────────────────────────────────────────────┐
│                      Query Router                             │
│                                                               │
│  SurrealQL ──┐    SPARQL ──┐    LPG DSL ──┐    ES DSL ──┐   │
└──────────────┼─────────────┼──────────────┼─────────────┼───┘
               │             │              │             │
               ▼             ▼              │             ▼
     ┌─────────────┐  ┌──────────────┐     │   ┌─────────────────┐
     │  SurrealDB  │  │  RDF Triple  │     │   │  Full-Text      │
     │  Engine     │  │  Store       │     │   │  Inverted Index │
     │             │  │              │     │   │  (BM25/TF-IDF)  │
     │ SCHEMAFULL  │  │ 6-perm index │     │   └─────────────────┘
     │ SCHEMALESS  │  │ SPO,SOP,PSO  │     │
     │ HNSW Vector │  │ POS,OSP,OPS  │     │
     │ DEFINE EVENT│  │              │     │
     │ LIVE queries│  │ ZSTD compr.  │     │
     └──────┬──────┘  │ Delta encode │     │
            │         │ Varint compr.│     │
            │         │              │     │
            │         │ Vocabulary:  │     │
            │         │  IRI table   │     │
            │         │  Lit table   │     │
            │         │ (memmap2 I/O)│     │
            │         └──────┬───────┘     │
            │                │             │
            │                └──────┬──────┘
            │                       │ LPG Projection
            │                       ▼
            │               ┌───────────────┐
            │               │  LPG Engine   │
            │               │  CSR Graph    │
            │               │  Adjacency    │
            │               │               │
            │               │  PageRank     │
            │               │  Shortest Path│
            │               │  ConnComp     │
            │               │ Neighbor Count│
            │               └───────────────┘
            │
            ▼
┌───────────────────────────────────────────────────────────────┐
│                   Physical Storage Backend                    │
│                                                               │
│  ┌──────────────────────────┐  ┌──────────────────────────┐  │
│  │   SurrealDB Embedded     │  │   TiKV Distributed       │  │
│  │                          │  │                           │  │
│  │  kv-mem (in-memory)      │  │  Raft consensus           │  │
│  │  kv-surrealkv (on-disk)  │  │  Multi-node, multi-DC     │  │
│  │                          │  │  Unlimited scale          │  │
│  │  Max ~1 TB               │  │  gRPC + mutual TLS        │  │
│  │  LIVE queries ✓          │  │  LIVE queries ✗           │  │
│  │  Multi-DC ✗              │  │  Multi-DC ✓               │  │
│  └──────────────────────────┘  └──────────────────────────┘  │
└───────────────────────────────────────────────────────────────┘

  HA Layer (3+ nodes, OpenRaft 0.9)
  ─────────────────────────────────
  Leader ◄──── gRPC/TLS ────► Follower 1
    │                              │
    └────── gRPC/TLS ──────► Follower 2