Architecture¶
This document describes the internal architecture of IndentiaDB: how data is stored, how queries are executed, and how high availability is achieved.
Storage Layer Overview¶
┌──────────────────────────────────────────────────────────────────────┐
│                          IndentiaDB Process                          │
│                                                                      │
│  ┌────────────────────────────────────────────────────────────────┐  │
│  │                          Query Router                          │  │
│  │    SurrealQL  │ SPARQL 1.2   │ LPG JSON DSL  │   ES DSL        │  │
│  └────────┬─────────────┬───────────────┬──────────────┬──────────┘  │
│           │             │               │              │             │
│  ┌────────▼──────┐  ┌───▼────────┐   ┌──▼──────────┐   │             │
│  │  SurrealDB    │  │ RDF Triple │   │ LPG Engine  │   │             │
│  │  Engine       │  │ Store      │   │ (CSR)       │   │             │
│  └────────┬──────┘  └───┬────────┘   └─────────────┘   │             │
│           │             │                              │             │
│  ┌────────▼─────────────▼──────────────────────────────▼──────────┐  │
│  │                    Physical Storage Backend                    │  │
│  │                                                                │  │
│  │  Option A: SurrealDB embedded (kv-mem / kv-surrealkv)          │  │
│  │  Option B: TiKV distributed (Raft consensus, multi-DC)         │  │
│  └────────────────────────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────────────────────┘
1. Dual Storage Backends¶
IndentiaDB supports two physical storage backends. You choose one at deployment time based on your scale and availability requirements.
Option A: SurrealDB Embedded¶
SurrealDB is embedded directly into the IndentiaDB process. No separate database process is needed. This is the default for development and single-node production deployments.
Storage engines available within SurrealDB embedded:
| Engine | Description | Recommended For |
|---|---|---|
| kv-mem | In-memory only, data lost on restart | Development, testing, ephemeral workloads |
| kv-surrealkv | Persistent on-disk storage using SurrealKV | Production single-node deployments |
Characteristics:
- Zero operational overhead — one binary, one process
- Supports LIVE queries (WebSocket push on data change)
- Full ACID transactions with snapshot isolation
- Maximum practical dataset size: approximately 1 TB (single-node disk)
- Multi-datacenter replication: not supported
- High availability: not supported (single node)
Option B: TiKV Distributed¶
TiKV is an external distributed key-value store based on the Raft consensus protocol, developed by PingCAP. IndentiaDB connects to an existing TiKV cluster as its storage backend via SurrealDB's kv-tikv driver.
Characteristics:
- Horizontal scaling across multiple nodes and datacenters
- Raft-based replication with automatic failover
- Unlimited dataset size (add nodes to scale)
- Multi-datacenter support with region-aware placement
- High availability: yes (3+ TiKV nodes required for quorum)
- LIVE queries: not available (TiKV does not support push notifications)
- Operational complexity: high (requires a separate TiKV cluster and PD)
TiKV Cluster Topology¶
A TiKV deployment consists of two components:
┌──────────────────────────────────┐
│  PD (Placement Driver) × 3       │
│  pd-0, pd-1, pd-2                │
│  ─────────────────────────────   │
│  • Raft quorum for metadata      │
│  • Cluster topology coordinator  │
│  • Region leader election        │
│  • Port 2379 (client API)        │
│  • Port 2380 (peer replication)  │
└────────────────┬─────────────────┘
                 │ registers with PD
┌────────────────▼─────────────────┐
│  TiKV storage nodes × 3          │
│  tikv-0, tikv-1, tikv-2          │
│  ─────────────────────────────   │
│  • Range-based sharding          │
│  • Raft replication per region   │
│  • Port 20160 (gRPC data)        │
│  • Port 20180 (status/metrics)   │
└──────────────────────────────────┘
Minimum cluster: 3 PD nodes + 3 TiKV nodes. At least two of the three TiKV nodes must be available for writes (Raft quorum requires a majority), so the cluster tolerates the loss of one node.
Configuring IndentiaDB to Use TiKV¶
Set the SURREAL_URL environment variable to the PD endpoints:
# Single PD node (not recommended for production)
SURREAL_URL=tikv://pd-0:2379
# Three PD nodes (recommended — Raft quorum for metadata)
SURREAL_URL=tikv://pd-0:2379,pd-1:2379,pd-2:2379
With Docker:
docker run -d \
-e SURREAL_URL="tikv://pd-0:2379,pd-1:2379,pd-2:2379" \
-p 7001:7001 -p 9200:9200 \
ghcr.io/indentiaplatform/indentiadb-trial:latest
In config.toml:
Docker Compose: Full TiKV Stack¶
services:
  pd-0:
    image: pingcap/pd:v8.5.0
    command: >
      --name=pd-0
      --client-urls=http://0.0.0.0:2379
      --peer-urls=http://0.0.0.0:2380
      --advertise-client-urls=http://pd-0:2379
      --advertise-peer-urls=http://pd-0:2380
      --initial-cluster=pd-0=http://pd-0:2380,pd-1=http://pd-1:2380,pd-2=http://pd-2:2380
  # pd-1 and pd-2 are not shown here; they follow the same pattern as pd-0
  # with their own --name and advertise URLs.
  tikv-0:
    image: pingcap/tikv:v8.5.0
    command: >
      --addr=0.0.0.0:20160
      --advertise-addr=tikv-0:20160
      --pd=pd-0:2379,pd-1:2379,pd-2:2379
    depends_on: [pd-0, pd-1, pd-2]
  # tikv-1 and tikv-2 are not shown here; they follow the same pattern as tikv-0
  # with their own --advertise-addr.
  indentiadb:
    image: ghcr.io/indentiaplatform/indentiadb-trial:latest
    environment:
      SURREAL_URL: "tikv://pd-0:2379,pd-1:2379,pd-2:2379"
    ports:
      - "7001:7001"
      - "9200:9200"
    depends_on: [tikv-0, tikv-1, tikv-2]
TiKV and LIVE queries
TiKV does not support push notifications. LIVE SELECT statements and DEFINE EVENT handlers are unavailable when using the TiKV backend. Use the embedded kv-surrealkv backend if you need reactive queries.
Comparison Table¶
| Dimension | SurrealDB Embedded | TiKV Distributed |
|---|---|---|
| Complexity | Low — single process | High — separate cluster |
| Scalability | Single node | Horizontal, unlimited |
| Write Latency | ~0.1–1 ms (local disk) | ~1–10 ms (network round-trip) |
| High Availability | No | Yes (Raft quorum) |
| Max Dataset | ~1 TB | Unlimited |
| LIVE Queries | Yes (WebSocket) | No |
| Multi-DC | No | Yes |
| Operational Cost | Minimal | Significant |
| Recommended For | Development, <1 TB prod | >1 TB, HA required |
Recommendation: Start with SurrealDB embedded (kv-surrealkv). Migrate to TiKV when you approach 500 GB of stored data, require cross-datacenter replication, or need automatic failover.
2. The QLever-compatible SPARQL Engine¶
IndentiaDB's SPARQL subsystem is a Rust-native reimplementation of the core indexing and query-evaluation concepts from QLever — a high-performance SPARQL engine developed at the University of Freiburg. The original C++ QLever project demonstrated that permutation-based triple indexing with vocabulary compression can evaluate SPARQL queries one to two orders of magnitude faster than traditional B-tree approaches.
IndentiaDB embeds those same algorithmic ideas natively in Rust, without a C++ dependency or a separate C++ QLever process. The goals are:
- Bit-exact result compatibility with upstream C++ QLever for deterministic SPARQL queries
- Native integration with the SurrealDB write path and transaction layer
- Extension via RDF-star and SPARQL 1.2 features not yet in upstream QLever
What QLever-compatible Contributes¶
| Concept | C++ QLever Origin | IndentiaDB Implementation |
|---|---|---|
| 6-permutation triple index | QLever's SPO/SOP/PSO/POS/OSP/OPS layout | Same layout, ZSTD + delta + varint compression |
| Vocabulary compression | FSST (Fast Static Symbol Table) | FSST-compatible encoding; memory-mapped via memmap2 |
| ql:contains-word predicate | QLever's text index extension | Full inverted BM25 index integrated in IndentiaDB's FTS layer |
| Cost-based join ordering | QLever's cardinality estimator | sparopt crate — filter pushdown, join reordering |
| Delta triple tracking | QLever's delta index for updates | IndentiaDB delta layer that tracks insertions/deletions per graph |
Query Routing: QLever-compatible Engine vs SurrealDB Engine¶
Not every query hits the QLever-compatible path. The query router dispatches based on the operation type:
| Query Characteristic | Target Engine | Reason |
|---|---|---|
| SELECT / CONSTRUCT / ASK / DESCRIBE (read only, no FTS) | QLever-compatible (Rust) | Permutation indexes optimally serve pure graph reads |
| SELECT with ql:contains-word predicate | QLever-compatible + FTS index | QLever-compatible invokes the integrated BM25 text index |
| INSERT DATA / DELETE DATA / DELETE…WHERE | SurrealDB | All writes go through the SurrealDB ACID layer |
| SPARQL() inside a SurrealQL statement | QLever-compatible (inline dispatch) | Results returned as SurrealQL-typed values |
| Requests to port 9200 | FTS + vector layer | Elasticsearch-compatible API, bypasses SPARQL parser |
| SERVICE <external-endpoint> | Federation engine | Distributed across QLever-compatible + remote SPARQL endpoints |
The split means reads are fully served by the highly-optimised permutation indexes while all state changes go through SurrealDB's transaction log, maintaining ACID guarantees and enabling LIVE queries.
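The dispatch rules in the routing table can be sketched as a small classifier. This is only an illustration, not IndentiaDB's router: the function name `route` and the returned engine labels are hypothetical, and the keyword detection is deliberately naive (it assumes each PREFIX or BASE declaration sits on its own line).

```python
import re

# Hypothetical engine labels; the real router's identifiers are not documented here.
READ_FORMS = ("SELECT", "CONSTRUCT", "ASK", "DESCRIBE")
WRITE_FORMS = ("INSERT", "DELETE")


def route(query: str) -> str:
    """Pick a target engine for a SPARQL string using the rules in the table."""
    # Drop PREFIX/BASE lines so the first remaining keyword is the query form.
    body = re.sub(r"(?im)^\s*(PREFIX|BASE)\b[^\n]*\n?", "", query).lstrip()
    keyword = body.split(None, 1)[0].upper() if body else ""
    if keyword in WRITE_FORMS:
        return "surrealdb"      # all writes go through the ACID layer
    if keyword in READ_FORMS:
        if re.search(r"\bSERVICE\b", body, re.IGNORECASE):
            return "federation"  # query touches a remote endpoint
        if "ql:contains-word" in body:
            return "qlever+fts"  # graph read combined with the text index
        return "qlever"          # pure graph read
    raise ValueError(f"unsupported query form: {keyword!r}")
```

A production router would classify the parsed algebra rather than the raw text, since keywords can also appear inside literals and comments.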
The ql:contains-word Predicate¶
C++ QLever introduced the ql:contains-word predicate as a standard way to express full-text conditions inside SPARQL. IndentiaDB supports it natively:
PREFIX ql: <http://qlever.cs.uni-freiburg.de/builtin/>
SELECT ?article ?title ?score WHERE {
?article <http://purl.org/dc/terms/title> ?title .
?article <http://purl.org/dc/terms/subject> ?subject .
(?title ?score) ql:contains-word "knowledge graph" .
}
ORDER BY DESC(?score)
LIMIT 20
The (?var ?score) pair binds the BM25 score alongside the matched literal. The query is routed to the QLever-compatible engine which evaluates the triple patterns via permutation lookup and resolves the text predicate against the inverted BM25 index in a single pass.
Dual-Backend HA Architecture (QLever-compatible + SurrealDB)¶
For large-scale deployments, QLever-compatible read replicas can be run alongside the SurrealDB cluster. The query router sends all reads to the QLever-compatible tier and all writes to SurrealDB, with an asynchronous sync process keeping the QLever-compatible index up to date:
                              Clients
                                 │
                                 ▼
┌──────────────────────────────────────────────────────────────────┐
│                           Query Router                           │
│  SPARQL reads ──► QLever-compatible replicas (ReadOnlyMany PVC)  │
│  Writes       ──► SurrealDB cluster (TiKV backend)               │
└──────────────────────────────────────────────────────────────────┘
             │                               │
             ▼                               ▼
┌─────────────────────────┐       ┌─────────────────────┐
│ QLever-compatible nodes │◄──────│ SurrealDB + TiKV    │
│ qlever-0,1,2            │ async │ surreal-0,1,2       │
│ (NFS/CephFS             │ sync  │ + pd-0,1,2          │
│  shared index)          │       │ + tikv-0,1,2        │
└─────────────────────────┘       └─────────────────────┘
Use this topology when:
- Your RDF dataset exceeds 100 million triples and SPARQL query latency is critical
- You need to scale reads and writes independently
- You are running analytics-heavy workloads alongside real-time updates
For most deployments under 50 million triples, a single IndentiaDB process with the integrated QLever-compatible engine and kv-surrealkv storage is sufficient.
3. RDF Triple Storage¶
The RDF triple store is the semantic core of IndentiaDB. It is implemented as a custom storage layer optimized for SPARQL pattern matching.
6-Permutation Index¶
Every triple (subject, predicate, object) in every named graph is indexed in six orderings:
| Index | Ordering | Optimized For |
|---|---|---|
| SPO | Subject → Predicate → Object | Lookup all predicates and objects for a subject |
| SOP | Subject → Object → Predicate | Lookup all predicates connecting a subject to an object |
| PSO | Predicate → Subject → Object | Lookup all subjects with a given predicate |
| POS | Predicate → Object → Subject | Lookup all subjects that point to a given object via a predicate |
| OSP | Object → Subject → Predicate | Lookup all subjects that reference a given object |
| OPS | Object → Predicate → Subject | Lookup via object then predicate |
The query optimizer selects the index permutation that best matches the bound variables in each triple pattern. A fully-bound pattern (s, p, o) resolves with a single key lookup. An unbound-subject pattern (?, p, o) uses the POS index.
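The selection rule can be sketched as follows: pick a permutation whose leading positions are exactly the bound terms, so the lookup becomes a prefix scan. The function name `choose_permutation` is hypothetical, and the tie-breaking order (for example POS before OPS when both predicate and object are bound) is an arbitrary choice for this sketch, not a documented behavior of the optimizer.

```python
def choose_permutation(s_bound: bool, p_bound: bool, o_bound: bool) -> str:
    """Return a permutation whose key prefix consists of the bound positions."""
    bound = {"S": s_bound, "P": p_bound, "O": o_bound}
    candidates = ["SPO", "SOP", "PSO", "POS", "OSP", "OPS"]
    n_bound = sum(bound.values())
    for perm in candidates:
        # Every bound position must come first so the lookup is a prefix scan.
        if all(bound[position] for position in perm[:n_bound]):
            return perm
    raise AssertionError("unreachable: some permutation always matches")


# (?, p, o): predicate and object bound, subject unbound.
print(choose_permutation(False, True, True))
```

Because all six orderings exist, any combination of bound positions has at least one permutation that serves it without a full scan.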
Compression¶
Triple data is compressed at multiple levels:
- ZSTD compression on stored byte sequences (configurable level 1–22, default 3). Level 3 gives a good balance of compression ratio and CPU overhead. Set TRIPLE_STORE_COMPRESSION_LEVEL=9 for higher compression at the cost of write throughput.
- Delta encoding for sequences of similar IRIs and literals. Adjacent values in sorted order are stored as deltas rather than full values.
- Varint compression on integer deltas — small integers require fewer bytes.
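To illustrate how delta encoding and varints combine, here is a minimal sketch over a sorted list of integer IDs. It assumes an LEB128-style varint (7 payload bits per byte); the actual on-disk format is not specified in this document, and the ZSTD stage is omitted. All function names are hypothetical.

```python
def encode_varint(n: int) -> bytes:
    """LEB128-style varint: 7 payload bits per byte, high bit marks continuation."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)         # last byte of this integer
            return bytes(out)


def encode_sorted_ids(ids: list[int]) -> bytes:
    """Store each ID as the varint-encoded gap to its predecessor."""
    out = bytearray()
    prev = 0
    for value in ids:
        out += encode_varint(value - prev)
        prev = value
    return bytes(out)


def decode_sorted_ids(data: bytes) -> list[int]:
    """Invert encode_sorted_ids by accumulating the decoded gaps."""
    ids, value, shift, prev = [], 0, 0, 0
    for byte in data:
        value |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            prev += value
            ids.append(prev)
            value, shift = 0, 0
    return ids


ids = [1000, 1003, 1010, 1500]
encoded = encode_sorted_ids(ids)
print(len(encoded), decode_sorted_ids(encoded) == ids)
```

The four IDs above take 6 bytes instead of 16 as fixed 4-byte integers, because the gaps 3 and 7 each fit in a single byte.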
Dual Vocabulary System¶
All IRI strings and literal values are stored in two vocabulary tables:
- IRI vocabulary: maps every unique IRI string to an integer ID. Triple index entries store integer IDs, not raw strings. This means the string http://xmlns.com/foaf/0.1/name (30 bytes) is stored once and referenced by a 4-byte integer everywhere else.
- Literal vocabulary: maps literal values (strings, numbers, dates) to integer IDs with datatype and optional language tag.
Vocabulary lookups use memory-mapped I/O via the memmap2 crate, allowing the vocabulary to exceed RAM size while maintaining fast random access through the OS page cache.
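The dictionary-encoding idea can be shown with an in-memory stand-in. This sketch uses plain Python containers instead of memory-mapped files, ignores datatypes and language tags, and the class name `Vocabulary` and the `example.org` IRIs are hypothetical.

```python
class Vocabulary:
    """Bidirectional string <-> integer ID mapping (in-memory stand-in)."""

    def __init__(self) -> None:
        self._ids: dict[str, int] = {}
        self._terms: list[str] = []

    def intern(self, term: str) -> int:
        """Return the ID for term, assigning the next free ID on first sight."""
        term_id = self._ids.get(term)
        if term_id is None:
            term_id = len(self._terms)
            self._ids[term] = term_id
            self._terms.append(term)
        return term_id

    def lookup(self, term_id: int) -> str:
        """Resolve an ID back to its string when materializing results."""
        return self._terms[term_id]


iris, literals = Vocabulary(), Vocabulary()
FOAF_NAME = "http://xmlns.com/foaf/0.1/name"


def encode(subject: str, predicate: str, literal: str) -> tuple[int, int, int]:
    """Encode a triple with a literal object as three integer IDs."""
    return (iris.intern(subject), iris.intern(predicate), literals.intern(literal))


t1 = encode("http://example.org/alice", FOAF_NAME, "Alice")
t2 = encode("http://example.org/bob", FOAF_NAME, "Bob")
print(t1, t2, iris.lookup(t1[1]))
```

Both triples share the same predicate ID, so the predicate string is stored once no matter how many triples use it.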
Memory-Mapped I/O¶
The vocabulary files are memory-mapped using memmap2. This means:
- The OS page cache serves as an automatic buffer pool
- Random reads do not require explicit read() syscalls
- The working set of frequently accessed vocabulary entries stays resident in memory, so repeated lookups avoid disk I/O
- Vocabulary files can be larger than available RAM
4. Query Execution Pipeline¶
Parsing¶
SPARQL queries are parsed by the spargebra crate, which produces a typed abstract syntax tree representing the query algebra. The parser validates syntax and resolves prefix declarations.
Optimization¶
The sparopt crate applies cost-based query optimization to the parsed algebra:
- Filter pushdown — WHERE clause filters are pushed as close to the data source as possible, reducing intermediate result set sizes.
- Cardinality estimation — the optimizer estimates the number of results each triple pattern will produce based on stored statistics.
- Join ordering — joins are reordered so the smallest result sets are joined first, minimizing the size of intermediate results.
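The interaction between cardinality estimates and join ordering can be sketched with a simple greedy heuristic. This is not the algorithm used by the sparopt crate, whose internals are not described here; the function name `order_patterns`, the example predicates, and the cardinality numbers are all made up for illustration.

```python
Pattern = tuple[str, str, str]


def order_patterns(patterns: list[Pattern], cardinality: dict[Pattern, int]) -> list[Pattern]:
    """Greedy join order: start with the smallest pattern, then keep adding the
    smallest remaining pattern that shares a variable with the plan so far."""

    def variables(pattern: Pattern) -> set[str]:
        return {term for term in pattern if term.startswith("?")}

    remaining = sorted(patterns, key=lambda p: cardinality[p])
    plan = [remaining.pop(0)]
    seen = variables(plan[0])
    while remaining:
        # Prefer connected patterns so the plan avoids Cartesian products.
        connected = [p for p in remaining if variables(p) & seen]
        nxt = (connected or remaining)[0]
        remaining.remove(nxt)
        plan.append(nxt)
        seen |= variables(nxt)
    return plan


a = ("?person", "rdf:type", "foaf:Person")
b = ("?person", "foaf:name", "?name")
c = ("?person", "ex:worksAt", "ex:Acme")
estimates = {a: 1_000_000, b: 900_000, c: 40}  # hypothetical statistics
print(order_patterns([a, b, c], estimates))
```

Starting from the 40-row pattern means the later joins probe at most 40 bindings of ?person instead of scanning a million.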
Join Strategies¶
The query executor selects from four join strategies depending on the cardinality and index availability:
| Strategy | When Used | Complexity |
|---|---|---|
| Hash Join | Large result sets, no ordering required | O(n + m) time, O(n) memory |
| Merge Join | Both inputs sorted on the join key | O(n + m) time, O(1) memory |
| Index Nested Loop | Small outer result set, indexed inner | O(n × log m) time |
| EXISTS Join | Semi-join / anti-join for FILTER EXISTS / FILTER NOT EXISTS | O(n) with hash set probe |
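As an illustration of the merge join row, the sketch below joins two inputs that are already sorted on the join key, using only two cursors and no hash table. It is a generic textbook version, not IndentiaDB's executor; the function name `merge_join` and the (key, value) row shape are assumptions.

```python
def merge_join(left: list[tuple], right: list[tuple]) -> list[tuple]:
    """Join two lists of (key, value) pairs that are both sorted by key."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Locate the run of equal keys on the right side.
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            # Pair every left row with that key against the whole run.
            while i < len(left) and left[i][0] == lk:
                out.extend((lk, left[i][1], right[k][1]) for k in range(j, j_end))
                i += 1
            j = j_end
    return out


left = [(1, "a"), (2, "b"), (2, "c"), (4, "d")]
right = [(2, "x"), (2, "y"), (3, "z"), (4, "w")]
print(merge_join(left, right))
```

Sorted inputs come for free when both sides are read from a permutation index ordered on the join variable, which is why this strategy pairs well with the triple store.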
Aggregation¶
SPARQL aggregation functions are evaluated after join processing:
- COUNT(?x) / COUNT(*): COUNT(?x) counts non-null bindings of ?x; COUNT(*) counts all solutions in the group
- SUM(?x): sum of numeric values
- AVG(?x): arithmetic mean
- MIN(?x) / MAX(?x): minimum and maximum over any comparable type
- GROUP_CONCAT(?x; separator=", "): concatenates string values; returns xsd:string per SPARQL 1.2 Feb 3 ED
- SAMPLE(?x): returns an arbitrary value from the group (useful for non-grouped columns)
Property Paths¶
SPARQL property path expressions are evaluated with full BFS cycle detection (per SPARQL 1.2 Issues #266 and #267):
| Operator | Syntax | Meaning |
|---|---|---|
| Sequence | p1 / p2 | p1 followed by p2 |
| Alternative | p1 \| p2 | p1 or p2 |
| Zero or more | p* | Transitive closure including zero hops |
| One or more | p+ | Transitive closure requiring at least one hop |
| Zero or one | p? | Optional single hop |
| Inverse | ^p | Traverse in reverse direction |
| Negated property set | !(p1 \| p2) | Any predicate except p1 or p2 |
BFS with cycle detection ensures property paths over cyclic graphs terminate correctly.
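A minimal sketch of how breadth-first search with a visited set evaluates `p+` and `p*` from a fixed start node is shown below. The adjacency-map representation and the function names are assumptions made for the example; the engine itself works on index lookups rather than an in-memory dictionary.

```python
from collections import deque


def one_or_more(edges: dict[str, list[str]], start: str) -> set[str]:
    """Evaluate `start p+ ?x` over an adjacency map of p-edges."""
    reached: set[str] = set()
    queue = deque(edges.get(start, []))
    while queue:
        node = queue.popleft()
        if node in reached:
            continue  # already visited: this check is what stops cycles looping
        reached.add(node)
        queue.extend(edges.get(node, []))
    return reached


def zero_or_more(edges: dict[str, list[str]], start: str) -> set[str]:
    """Evaluate `start p* ?x`: the zero-hop case always includes start itself."""
    return {start} | one_or_more(edges, start)


cycle = {"a": ["b"], "b": ["c"], "c": ["a"]}
print(sorted(one_or_more(cycle, "a")))
```

On the three-node cycle the search terminates after visiting each node once, and `a` appears in its own `p+` result because it is reachable from itself through the cycle.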
5. Vector Storage¶
HNSW Indexing¶
Vector embeddings are indexed using HNSW (Hierarchical Navigable Small World) graphs. HNSW is an approximate nearest neighbor algorithm that provides sub-linear search time with configurable accuracy/speed trade-offs.
Index configuration parameters:
| Parameter | SurrealQL Keyword | Description | Default |
|---|---|---|---|
| dimension | DIMENSION n | Number of dimensions in each vector | Required |
| distance metric | DIST COSINE or DIST EUCLIDEAN | Similarity metric | COSINE |
| m | M n | Maximum connections per node per layer | 16 |
| ef_construction | EFC n | Candidate list size during index build | 200 |
| n_probe | at query time <\|k,ef\|> | Candidate list size during search | 100 |
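Search complexity: HNSW search cost grows roughly logarithmically with n, the total number of indexed vectors. Each query performs on the order of ef × log n distance computations, each costing O(d), where d is the vector dimension and ef is the search-time candidate list size (listed as n_probe above).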
Index definition example:
DEFINE INDEX idx_embedding ON document FIELDS embedding
HNSW DIMENSION 1536 DIST COSINE
EFC 200 M 16;
Query syntax:
-- Find 10 nearest neighbors with ef=200
SELECT id, title,
vector::similarity::cosine(embedding, $query_vec) AS score
FROM document
WHERE embedding <|10,200|> $query_vec
ORDER BY score DESC;
The <|k,ef|> operator performs the HNSW search with k results and ef candidate list size. Increasing ef improves recall at the cost of search latency.
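One way to choose ef is to measure recall against an exact scan on a sample of queries. The sketch below shows only the exact baseline and the recall metric; it does not call IndentiaDB, and the function names and toy vectors are hypothetical. In practice `approx_ids` would be the IDs returned by the `<|k,ef|>` query.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def exact_top_k(query: list[float], vectors: dict[str, list[float]], k: int) -> list[tuple[str, float]]:
    """Brute-force nearest neighbors: score every vector, keep the best k."""
    scored = [(cosine(query, vec), doc_id) for doc_id, vec in vectors.items()]
    scored.sort(reverse=True)
    return [(doc_id, score) for score, doc_id in scored[:k]]


def recall_at_k(approx_ids: list[str], exact_ids: list[str]) -> float:
    """Fraction of the true top-k that the approximate search returned."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)


vectors = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
exact = [doc_id for doc_id, _ in exact_top_k([1.0, 0.0], vectors, k=2)]
print(exact, recall_at_k(["a", "b"], exact))
```

Raising ef until recall on the sample stops improving gives a data-driven setting instead of a guess.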
6. Full-Text Search¶
Inverted Index Structure¶
The full-text search engine maintains a fragment inverted index:
- 4-character prefix fragments — tokens are split into overlapping 4-character prefixes. This enables prefix matching and fuzzy queries without a separate phonetic index.
- Word postings — for each unique term, a posting list records which document IDs contain that term and at what positions.
- Entity co-occurrence postings — tracks which terms appear near which other terms, enabling phrase queries and proximity scoring.
Multi-Stage Compression¶
Posting lists are compressed in multiple stages to minimize storage overhead:
- Gap encoding — document IDs in posting lists are stored as gaps (deltas) between consecutive IDs rather than absolute values. Most gaps are small integers.
- Frequency encoding — term frequencies are encoded separately from positions.
- Simple8b — packs multiple small integers into 64-bit words using a variable-length encoding scheme optimized for modern CPUs.
- ZSTD — the final compressed blocks are ZSTD-compressed for additional size reduction.
BM25/TF-IDF Scoring¶
Document ranking uses BM25 with configurable k1 and b parameters:
- k1 controls term frequency saturation (default 1.2)
- b controls document length normalization (default 0.75)
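The effect of the two parameters is easiest to see in the scoring formula itself. The sketch below is a textbook BM25 with a Lucene-style non-negative IDF; which IDF variant IndentiaDB uses is not stated in this document, so treat that term as an assumption. The function name and the toy corpus statistics are hypothetical.

```python
import math


def bm25_score(query_terms, doc_terms, doc_freq, n_docs, avg_len, k1=1.2, b=0.75):
    """Score one document against a query with BM25."""
    score = 0.0
    doc_len = len(doc_terms)
    for term in query_terms:
        tf = doc_terms.count(term)
        if tf == 0:
            continue
        df = doc_freq.get(term, 0)
        # Assumed IDF variant (Lucene-style), always non-negative.
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        # b scales the length penalty; k1 caps how much repeated terms add.
        norm = tf + k1 * (1 - b + b * doc_len / avg_len)
        score += idf * tf * (k1 + 1) / norm
    return score


short_doc = ["graph"]
long_doc = ["graph"] + ["filler"] * 9
stats = dict(doc_freq={"graph": 2}, n_docs=10, avg_len=5)
print(bm25_score(["graph"], short_doc, **stats), bm25_score(["graph"], long_doc, **stats))
```

With the default b the short document scores higher for the same single match; setting b=0 disables length normalization and the two scores become equal.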
Configure per-index:
7. Raft Consensus for High Availability¶
When deploying IndentiaDB in HA mode (three or more nodes), nodes coordinate using the OpenRaft 0.9 implementation of the Raft consensus protocol.
How Raft Works in IndentiaDB¶
- Leader election — one node is elected leader per term. All writes go through the leader. Elections use randomized timeouts to avoid split votes.
- Log replication — the leader appends write operations to its log and replicates them to followers. A write is committed once a quorum (majority) of nodes acknowledge it.
- Snapshot isolation — reads return a consistent snapshot of committed state. Uncommitted writes are not visible.
- Monotonic reads — read requests are serviced only by nodes that are up-to-date. Stale reads are prevented by checking the commit index before serving results.
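The quorum commit rule can be written as a short calculation over the highest log index each node is known to hold. This is a simplified sketch, not OpenRaft's API: the function name `commit_index` is hypothetical, and real Raft additionally requires that the entry at that index belongs to the leader's current term.

```python
def commit_index(match_index: list[int]) -> int:
    """Highest log index stored on a majority of nodes (leader included)."""
    quorum = len(match_index) // 2 + 1
    # After sorting in descending order, the entry at position quorum - 1
    # is held by at least `quorum` nodes.
    return sorted(match_index, reverse=True)[quorum - 1]


# Leader and one follower have index 7; the slow follower is at 4.
print(commit_index([7, 7, 4]))
```

The slow follower does not hold back the commit: index 7 is already on two of three nodes, which is a majority.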
Network Transport¶
Raft messages between nodes use gRPC (via the tonic crate) with mutual TLS:
[raft]
node_id = 1
peers = [
{ id = 2, addr = "node2:7010" },
{ id = 3, addr = "node3:7010" }
]
tls_cert = "/etc/indentiadb/server.crt"
tls_key = "/etc/indentiadb/server.key"
tls_ca = "/etc/indentiadb/ca.crt"
HAProxy Load Balancing¶
Client connections should be load-balanced using HAProxy (or any HTTP/WebSocket-aware proxy). HAProxy should route all write operations to the Raft leader and can distribute read operations across all healthy nodes.
Example HAProxy backend configuration:
frontend indentiadb_front
bind *:7001
default_backend indentiadb_back
backend indentiadb_back
balance roundrobin
option httpchk GET /health
server node1 node1:7001 check
server node2 node2:7001 check
server node3 node3:7001 check
For writes requiring leader routing, use the /health/leader endpoint to identify the current leader and route accordingly.
Cluster Sizing¶
| Nodes | Fault Tolerance | Quorum |
|---|---|---|
| 1 | None (single point of failure) | 1 |
| 3 | 1 node failure | 2 |
| 5 | 2 node failures | 3 |
Recommendation: Deploy 3 nodes for production HA. Deploy 5 nodes only when you need to tolerate 2 simultaneous failures (e.g., multi-AZ with one full AZ failure).
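The numbers in the sizing table follow from majority arithmetic, sketched here with two hypothetical helper functions.

```python
def quorum(nodes: int) -> int:
    """Smallest majority of a cluster with the given node count."""
    return nodes // 2 + 1


def fault_tolerance(nodes: int) -> int:
    """Number of nodes that can fail while a majority remains."""
    return nodes - quorum(nodes)


for n in (1, 3, 4, 5):
    print(n, quorum(n), fault_tolerance(n))
```

A 4-node cluster needs a quorum of 3 and still tolerates only one failure, which is why odd cluster sizes are the usual choice.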
Storage Layer ASCII Diagram¶
IndentiaDB Storage Architecture
══════════════════════════════════════════════════════════════════
           WRITES                             READS
              │                                 │
              ▼                                 ▼
┌────────────────────────────────────────────────────────────────┐
│                          Query Router                          │
│                                                                │
│ SurrealQL ──┐   SPARQL ──┐   LPG DSL ──┐   ES DSL ──┐          │
└─────────────┼────────────┼─────────────┼────────────┼──────────┘
              │            │             │            │
              ▼            ▼             │            ▼
  ┌─────────────┐  ┌──────────────┐      │  ┌─────────────────┐
  │ SurrealDB   │  │ RDF Triple   │      │  │ Full-Text       │
  │ Engine      │  │ Store        │      │  │ Inverted Index  │
  │             │  │              │      │  │ (BM25/TF-IDF)   │
  │ SCHEMAFULL  │  │ 6-perm index │      │  └─────────────────┘
  │ SCHEMALESS  │  │ SPO,SOP,PSO  │      │
  │ HNSW Vector │  │ POS,OSP,OPS  │      │
  │ DEFINE EVENT│  │              │      │
  │ LIVE queries│  │ ZSTD compr.  │      │
  └──────┬──────┘  │ Delta encode │      │
         │         │ Varint compr.│      │
         │         │              │      │
         │         │ Vocabulary:  │      │
         │         │  IRI table   │      │
         │         │  Lit table   │      │
         │         │ (memmap2 I/O)│      │
         │         └──────┬───────┘      │
         │                │              │
         │                └──────┬───────┘
         │                       │ LPG Projection
         │                       ▼
         │               ┌───────────────┐
         │               │ LPG Engine    │
         │               │ CSR Graph     │
         │               │ Adjacency     │
         │               │               │
         │               │ PageRank      │
         │               │ Shortest Path │
         │               │ ConnComp      │
         │               │ Neighbor Count│
         │               └───────────────┘
         │
         ▼
┌────────────────────────────────────────────────────────────────┐
│                    Physical Storage Backend                    │
│                                                                │
│  ┌──────────────────────────┐    ┌──────────────────────────┐  │
│  │ SurrealDB Embedded       │    │ TiKV Distributed         │  │
│  │                          │    │                          │  │
│  │ kv-mem (in-memory)       │    │ Raft consensus           │  │
│  │ kv-surrealkv (on-disk)   │    │ Multi-node, multi-DC     │  │
│  │                          │    │ Unlimited scale          │  │
│  │ Max ~1 TB                │    │ gRPC + mutual TLS        │  │
│  │ LIVE queries ✓           │    │ LIVE queries ✗           │  │
│  │ Multi-DC ✗               │    │ Multi-DC ✓               │  │
│  └──────────────────────────┘    └──────────────────────────┘  │
└────────────────────────────────────────────────────────────────┘

HA Layer (3+ nodes, OpenRaft 0.9)
─────────────────────────────────
Leader ◄──── gRPC/TLS ────► Follower 1
  │                             │
  └────── gRPC/TLS ──────► Follower 2