Linguistic Processing
IndentiaDB includes a full-featured linguistic processing engine that powers its BM25 full-text search and Elasticsearch-compatible API. The engine provides tokenization, normalization, stemming, stop word removal, synonym expansion, phonetic matching, and language detection -- all configurable per index and per field. Every text analysis component is implemented natively in Rust, with no JVM or external process dependencies.
Text Analysis Pipeline
Every document field passes through a configurable analysis pipeline before being indexed. The pipeline follows a strict four-stage architecture:
| Stage | Purpose | Examples |
|---|---|---|
| Char Filters | Pre-tokenization character transformations | HTML stripping, character mapping, pattern replacement |
| Tokenizer | Split text into individual tokens | Standard (Unicode word boundaries), whitespace, N-gram, CJK bigrams |
| Token Filters | Post-tokenization transformations | Lowercase, stop words, stemming, synonyms, phonetic encoding, ASCII folding |
| Output | Final indexed tokens with position and offset metadata | Used by BM25 scoring, phrase matching, highlighting |
Each stage is independently configurable. A custom analyzer defines one tokenizer and zero or more char filters and token filters, applied in order.
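The staged flow described above can be sketched in a few lines of Python. This is an illustration of the concept only, not IndentiaDB's Rust implementation; the function names (`strip_html`, `whitespace_tokenizer`, `lowercase`, `analyze`) are hypothetical.

```python
import re
from typing import Callable, List

# Hypothetical type aliases for the three configurable stages.
CharFilter = Callable[[str], str]
Tokenizer = Callable[[str], List[str]]
TokenFilter = Callable[[List[str]], List[str]]


def strip_html(text: str) -> str:
    """Char filter: replace anything that looks like an HTML tag with a space."""
    return re.sub(r"<[^>]+>", " ", text)


def whitespace_tokenizer(text: str) -> List[str]:
    """Tokenizer: split on runs of whitespace."""
    return text.split()


def lowercase(tokens: List[str]) -> List[str]:
    """Token filter: lowercase every token."""
    return [t.lower() for t in tokens]


def analyze(text: str,
            char_filters: List[CharFilter],
            tokenizer: Tokenizer,
            token_filters: List[TokenFilter]) -> List[str]:
    """Apply char filters, then exactly one tokenizer, then token filters, in order."""
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens


print(analyze("<b>Knowledge</b> Graph", [strip_html], whitespace_tokenizer, [lowercase]))
# prints ['knowledge', 'graph']
```

Because each stage is just a function over the previous stage's output, reordering the filter list changes the result, which is why filter order matters in an analyzer definition.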
Tokenizers
IndentiaDB supports seven tokenizer types. Each tokenizer determines how raw text is split into individual tokens.
| Tokenizer | Description | Best For |
|---|---|---|
| Standard | Unicode word segmentation (UAX #29). Splits on word boundaries, removes most punctuation. | General-purpose text search |
| Whitespace | Splits on whitespace characters only. Preserves punctuation attached to tokens. | Log messages, identifiers |
| Letter | Splits on non-letter characters. Produces tokens containing only letters. | Simple natural language text |
| NGram | Character N-grams of configurable length. | Substring matching, fuzzy search |
| EdgeNGram | N-grams anchored to the start of each token. | Autocomplete / search-as-you-type |
| Pattern | Splits on a configurable regex pattern. | Custom delimiters, structured text |
| CJK | Bigram tokenizer for Chinese, Japanese, and Korean text. | CJK language content |
Tokenizer Configuration
# Standard tokenizer (default)
[search.analyzers.default]
tokenizer = "standard"
# N-gram tokenizer for substring matching
[search.analyzers.autocomplete]
tokenizer = { type = "ngram", min_gram = 2, max_gram = 4 }
# Edge N-gram for search-as-you-type
[search.analyzers.suggest]
tokenizer = { type = "edge_ngram", min_gram = 1, max_gram = 15 }
# Pattern tokenizer splitting on hyphens and underscores
[search.analyzers.identifiers]
tokenizer = { type = "pattern", pattern = "[\\-_]+" }
# CJK bigram tokenizer
[search.analyzers.cjk]
tokenizer = "cjk"
Tokenizer Examples
Standard tokenizer:
Input: "IndentiaDB supports SPARQL 1.2 and GeoSPARQL."
Output: ["IndentiaDB", "supports", "SPARQL", "1.2", "and", "GeoSPARQL"]
Whitespace tokenizer:
Input: "error_code=42 status=failed"
Output: ["error_code=42", "status=failed"]
NGram(2, 3) tokenizer:
Input: "graph"
Output: ["gr", "gra", "ra", "rap", "ap", "aph", "ph"]
EdgeNGram(1, 4) tokenizer:
Input: "search"
Output: ["s", "se", "sea", "sear"]
CJK tokenizer:
Input: "knowledge graph" (in Chinese characters)
Output: bigram pairs of adjacent characters
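The N-gram, edge N-gram, and bigram outputs above can be reproduced with a short Python sketch. It illustrates the technique only; the function names are hypothetical, the emission order (by start offset, then by length) is chosen to match the examples above, and the Chinese input is an assumed rendering of "knowledge graph".

```python
from typing import List


def ngrams(text: str, min_gram: int, max_gram: int) -> List[str]:
    """All character n-grams, ordered by start offset and then by length."""
    out = []
    for start in range(len(text)):
        for n in range(min_gram, max_gram + 1):
            if start + n <= len(text):
                out.append(text[start:start + n])
    return out


def edge_ngrams(text: str, min_gram: int, max_gram: int) -> List[str]:
    """N-grams anchored at the first character of the token."""
    longest = min(max_gram, len(text))
    return [text[:n] for n in range(min_gram, longest + 1)]


def bigrams(text: str) -> List[str]:
    """Overlapping pairs of adjacent characters, as used for CJK text."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


print(ngrams("graph", 2, 3))        # ['gr', 'gra', 'ra', 'rap', 'ap', 'aph', 'ph']
print(edge_ngrams("search", 1, 4))  # ['s', 'se', 'sea', 'sear']
print(bigrams("知识图谱"))           # ['知识', '识图', '图谱']
```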
Token Filters
Token filters transform the token stream produced by the tokenizer. Multiple filters are applied in order, forming a processing chain.
Lowercase
Converts all tokens to lowercase. This is almost always the first filter in any analyzer.
Stop Words
Removes common words that carry little semantic meaning. IndentiaDB ships built-in stop word lists for all supported languages.
[[search.analyzers.my_analyzer.filters]]
type = "stop_words"
language = "en"
custom = ["etc", "ie", "eg"] # Additional custom stop words
Built-in stop word lists are available for: English, Dutch, German, French, Spanish, Italian, Portuguese, Russian, Arabic, Chinese, Japanese, and Korean.
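A minimal Python sketch of stop word removal follows. It assumes, as many search engines do, that surviving tokens keep their original positions so that phrase matching can still account for the gaps; whether IndentiaDB does this is not stated here. The word list is a tiny illustrative subset, not the shipped English list, and `remove_stop_words` is a hypothetical name.

```python
from typing import Iterable, List, Set, Tuple

# Tiny illustrative subset; the built-in English list is much larger.
STOP_WORDS_EN: Set[str] = {"a", "an", "and", "is", "of", "the", "to"}


def remove_stop_words(tokens: List[str],
                      stop_words: Set[str],
                      custom: Iterable[str] = ()) -> List[Tuple[int, str]]:
    """Drop stop words and return (position, token) pairs for the survivors."""
    blocked = set(stop_words) | {word.lower() for word in custom}
    return [(pos, tok) for pos, tok in enumerate(tokens) if tok.lower() not in blocked]


print(remove_stop_words(["the", "history", "of", "graphs", "etc"], STOP_WORDS_EN, custom=["etc"]))
# prints [(1, 'history'), (3, 'graphs')]
```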
Stemming
Reduces words to their root form so that morphological variants match. IndentiaDB supports multiple stemming algorithms:
| Algorithm | Languages | Aggressiveness | Notes |
|---|---|---|---|
| porter | English | Moderate | Original Porter algorithm |
| porter2 | English | Moderate | Improved Porter2 / Snowball (default) |
| snowball | Multi-language | Moderate | Snowball stemmers for 12 languages |
| lovins | English | Aggressive | Single-pass, fast |
| lancaster | English | Very aggressive | Iterative, heavy reduction |
| dutch | Dutch | Moderate | Snowball Dutch variant |
| german | German | Moderate | Snowball German variant |
| french | French | Moderate | Snowball French variant |
| spanish | Spanish | Moderate | Snowball Spanish variant |
| hunspell | Any | Dictionary-based | Requires a Hunspell dictionary file |
# Porter2 stemmer (English, default)
[[search.analyzers.english.filters]]
type = "stemmer"
algorithm = "porter2"
# Snowball stemmer for Dutch
[[search.analyzers.dutch.filters]]
type = "stemmer"
algorithm = { type = "snowball", language = "nl" }
# Hunspell dictionary-based stemmer
[[search.analyzers.hunspell_de.filters]]
type = "stemmer"
algorithm = { type = "hunspell", dictionary = "/etc/indentiadb/dicts/de_DE.dic" }
Stemming Examples
Porter2 (English):
"running" --> "run"
"connections" --> "connect"
"organized" --> "organ"
Dutch stemmer:
"fietsen" --> "fiets"
"huizen" --> "huiz"
"werknemers" --> "werknem"
German stemmer:
"Verbindungen" --> "verbind"
"Arbeitsplatz" --> "arbeitsplatz"
ASCII Folding
Converts Unicode characters to their ASCII equivalents. Essential for searching text with accented characters using plain ASCII queries.
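One common way to approximate ASCII folding is Unicode decomposition followed by dropping combining marks, sketched below in Python. This is an assumption about the general technique, not a description of IndentiaDB's filter; `ascii_fold` is a hypothetical name.

```python
import unicodedata


def ascii_fold(token: str) -> str:
    """Decompose characters (NFKD) and drop the combining accent marks."""
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(ascii_fold("café"))      # cafe
print(ascii_fold("Müller"))    # Muller
print(ascii_fold("Ångström"))  # Angstrom
```

Characters that have no Unicode decomposition, such as "ß" or "ø", pass through this sketch unchanged. A complete folding filter would add an explicit mapping table for them.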
Length Filter
Removes tokens shorter or longer than specified bounds.
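As a rough Python sketch of the idea (the parameter names `min_length` and `max_length` are hypothetical and are not taken from the IndentiaDB configuration schema):

```python
from typing import List


def length_filter(tokens: List[str], min_length: int = 1, max_length: int = 255) -> List[str]:
    """Keep only tokens whose character length lies within the inclusive bounds."""
    return [t for t in tokens if min_length <= len(t) <= max_length]


print(length_filter(["a", "db", "graph", "supercalifragilistic"], min_length=2, max_length=10))
# prints ['db', 'graph']
```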
Trim
Removes leading and trailing whitespace from tokens.
Deduplication
Removes duplicate tokens at the same position (useful after synonym expansion).
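The sketch below shows one way to express this in Python, assuming tokens are represented as (position, term) pairs. That representation and the function name are illustrative assumptions.

```python
from typing import List, Tuple


def dedupe_same_position(tokens: List[Tuple[int, str]]) -> List[Tuple[int, str]]:
    """Drop repeated terms at the same position; keep the first occurrence and its order."""
    seen = set()
    out = []
    for position, term in tokens:
        if (position, term) not in seen:
            seen.add((position, term))
            out.append((position, term))
    return out


stream = [(0, "car"), (0, "automobile"), (0, "car"), (1, "car")]
print(dedupe_same_position(stream))
# prints [(0, 'car'), (0, 'automobile'), (1, 'car')]
```

The same term at a different position is kept, because it represents a separate occurrence in the text.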
Shingle
Produces token N-grams (multi-word tokens) for phrase-level matching:
Input tokens: ["knowledge", "graph", "database"]
Shingles: ["knowledge graph", "graph database", "knowledge graph database"]
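The shingle output above can be reproduced with the following Python sketch. The size bounds (`min_size`, `max_size`) are hypothetical parameter names, and the ordering (all 2-word shingles first, then 3-word shingles) is chosen to match the example.

```python
from typing import List


def shingles(tokens: List[str], min_size: int = 2, max_size: int = 3) -> List[str]:
    """Join every run of min_size..max_size adjacent tokens into one multi-word token."""
    out = []
    for size in range(min_size, max_size + 1):
        for start in range(len(tokens) - size + 1):
            out.append(" ".join(tokens[start:start + size]))
    return out


print(shingles(["knowledge", "graph", "database"]))
# prints ['knowledge graph', 'graph database', 'knowledge graph database']
```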
Word Delimiter
Splits tokens on case transitions, letter-digit boundaries, and configurable delimiters.
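A regular expression is enough to sketch the splitting rules for ASCII input. The pattern below is an illustrative assumption, not IndentiaDB's implementation: it treats every non-alphanumeric character as a delimiter and does not handle non-ASCII letters.

```python
import re
from typing import List

# Three alternatives, tried in order:
#   1. a run of capitals not followed by a lowercase letter (acronyms such as "SPARQL")
#   2. an optional capital followed by lowercase letters (words such as "Geo" or "shot")
#   3. a run of digits
_PARTS = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+")


def split_word(token: str) -> List[str]:
    """Split a token on case transitions, letter-digit boundaries, and punctuation."""
    return _PARTS.findall(token)


print(split_word("GeoSPARQL"))       # ['Geo', 'SPARQL']
print(split_word("PowerShot500"))    # ['Power', 'Shot', '500']
print(split_word("XMLHttpRequest"))  # ['XML', 'Http', 'Request']
print(split_word("error_code"))      # ['error', 'code']
```

Because the split relies on case transitions, a filter like this has to run before any lowercase filter in the chain.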
Synonym Expansion
Synonym expansion allows queries to match documents containing equivalent terms. IndentiaDB supports two modes: explicit synonym rules and synonym dictionaries loaded from files.
Synonym Rules
Define synonyms inline in the analyzer configuration:
[[search.analyzers.my_analyzer.filters]]
type = "synonym"
expand = true
rules = [
"automobile, car, vehicle",
"quick, fast, speedy",
"big, large, enormous => huge",
]
Bidirectional rules (comma-separated): All terms are interchangeable. A query for "car" also matches "automobile" and "vehicle".
Directional rules (with =>): Only the left-hand terms expand to the right-hand term. A query for "big" matches "huge", but "huge" does not match "big".
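The two rule types can be modeled as a lookup table, as in the Python sketch below. It is a simplified illustration under stated assumptions: it handles single-word terms only (multi-word synonyms require matching across several tokens), it keeps the original token alongside its expansions, and the names `parse_rules` and `expand` are hypothetical.

```python
from typing import Dict, Iterable, Set


def parse_rules(lines: Iterable[str]) -> Dict[str, Set[str]]:
    """Build a term -> expansions table from comma-separated and '=>' rules."""
    table: Dict[str, Set[str]] = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if "=>" in line:
            # Directional: each left-hand term expands to the right-hand terms only.
            left, right = line.split("=>", 1)
            sources = [t.strip().lower() for t in left.split(",") if t.strip()]
            targets = {t.strip().lower() for t in right.split(",") if t.strip()}
            for source in sources:
                table.setdefault(source, set()).update(targets)
        else:
            # Bidirectional: every term expands to every other term in the group.
            group = {t.strip().lower() for t in line.split(",") if t.strip()}
            for term in group:
                table.setdefault(term, set()).update(group - {term})
    return table


def expand(token: str, table: Dict[str, Set[str]]) -> Set[str]:
    """Return the token together with its expansions (keeping the token is an assumption)."""
    return {token} | table.get(token, set())


RULES = ["automobile, car, vehicle", "quick, fast, speedy", "big, large, enormous => huge"]
TABLE = parse_rules(RULES)
print(sorted(expand("car", TABLE)))   # ['automobile', 'car', 'vehicle']
print(sorted(expand("big", TABLE)))   # ['big', 'huge']
print(sorted(expand("huge", TABLE)))  # ['huge']
```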
Synonym Dictionary File
For large synonym sets, load from a file:
[[search.analyzers.my_analyzer.filters]]
type = "synonym"
expand = true
dictionary = "/etc/indentiadb/synonyms/medical.txt"
Dictionary file format (one rule per line):
# Medical synonyms
heart attack, myocardial infarction, MI
high blood pressure, hypertension
paracetamol, acetaminophen
Synonym Expansion in Search
# With synonym expansion enabled, this query:
curl -X POST http://localhost:9200/articles/_search \
  -H "Content-Type: application/json" \
  -d '{
    "query": {
      "match": {
        "content": {
          "query": "automobile safety",
          "analyzer": "my_analyzer"
        }
      }
    }
  }'
# ...also matches documents containing "car safety", "vehicle safety", etc.
Index-Time vs Query-Time Synonyms
By default, synonyms are expanded at index time -- meaning synonym variants are stored in the inverted index alongside the original term. This is faster at query time but requires re-indexing when synonyms change. Set expand = false to apply synonyms only at query time, which allows synonym dictionary updates without re-indexing at the cost of slightly slower queries.
Phonetic Matching
Phonetic encoders convert tokens to phonetic codes so that words that sound alike match each other, even when spelled differently. This is valuable for name search, address matching, and correcting misspellings.
Supported Encoders
| Encoder | Origin | Best For | Example |
|---|---|---|---|
| soundex | Robert C. Russell and Margaret K. Odell (patented 1918); later used to index U.S. census records | English surnames | "Robert" and "Rupert" both produce R163 |
| metaphone | Lawrence Philips (1990) | English words | "Smith" and "Smyth" both produce SM0 |
| double_metaphone | Lawrence Philips (2000) | Multi-origin names | Returns primary + alternate code |
| cologne | Hans Postel (1969) | German names and words | "Mueller" and "Muller" both produce 657 |
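To make the idea of a phonetic code concrete, here is a compact Python sketch of American Soundex, the simplest of the encoders listed above. It follows the standard published rules and is not taken from IndentiaDB's source.

```python
# Consonant groups that share a digit in American Soundex.
_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(name: str) -> str:
    """Return the 4-character Soundex code: first letter plus three digits."""
    letters = [ch for ch in name.upper() if "A" <= ch <= "Z"]
    if not letters:
        return ""
    code = letters[0]
    previous = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W are ignored and do not separate equal codes
        digit = _CODES.get(ch, "")  # vowels (and Y) map to "" and act as separators
        if digit and digit != previous:
            code += digit
        previous = digit
    return (code + "000")[:4]  # pad with zeros, truncate to four characters


print(soundex("Robert"), soundex("Rupert"))  # R163 R163
print(soundex("Ashcraft"))                   # A261
```

Two names match phonetically when their codes are equal, which is why the encoded form is indexed in place of (or next to) the original token.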
Configuration
# Option 1: append phonetic encoding to an analyzer's filter chain
[[search.analyzers.name_search.filters]]
type = "phonetic"
encoder = "double_metaphone"
# Option 2: define the full name search analyzer inline.
# Use one form or the other: TOML rejects a file that defines `filters`
# twice for the same analyzer.
[search.analyzers.name_search]
tokenizer = "standard"
filters = [
  { type = "lowercase" },
  { type = "ascii_folding" },
  { type = "phonetic", encoder = "double_metaphone" },
]
Phonetic Search Example
# Search for a person by approximate name
curl -X POST http://localhost:9200/contacts/_search \
  -H "Content-Type: application/json" \
  -d '{
    "query": {
      "match": {
        "name.phonetic": {
          "query": "Steven",
          "analyzer": "name_search"
        }
      }
    }
  }'
# Matches: "Steven", "Stephen", "Stefan", "Stephan"
Phonetic Fields
Phonetic encoding is typically applied to a sub-field (e.g., name.phonetic) rather than the primary field, so that exact matches on the primary field still work and phonetic matching is available as a fallback or boost signal.
Language-Specific Analyzers
IndentiaDB ships pre-configured analyzers for common languages. Each language analyzer includes the appropriate tokenizer, stop word list, and stemmer.
Supported Languages
| Language | ISO Code | Stemmer | Stop Words | CJK Tokenizer |
|---|---|---|---|---|
| English | en | Porter2 | 174 words | No |
| Dutch | nl | Snowball Dutch | 119 words | No |
| German | de | Snowball German | 232 words | No |
| French | fr | Snowball French | 164 words | No |
| Spanish | es | Snowball Spanish | 313 words | No |
| Italian | it | Snowball Italian | 279 words | No |
| Portuguese | pt | Snowball Portuguese | 203 words | No |
| Russian | ru | Snowball Russian | 243 words | No |
| Arabic | ar | Snowball Arabic | 162 words | No |
| Chinese | zh | None | Minimal | Yes (CJK bigrams) |
| Japanese | ja | None | Minimal | Yes (CJK bigrams) |
| Korean | ko | None | Minimal | Yes (CJK bigrams) |
Using a Built-In Language Analyzer
# Use the Dutch analyzer for a specific index
[search.indexes.dutch_articles]
analyzer = "dutch"
# Or reference by language code
[search.indexes.german_articles]
analyzer = { language = "de" }
Custom Language Analyzer
Build a custom analyzer with language-specific components:
[search.analyzers.dutch_full]
language = "nl"
tokenizer = "standard"
[[search.analyzers.dutch_full.filters]]
type = "lowercase"
[[search.analyzers.dutch_full.filters]]
type = "stop_words"
language = "nl"
custom = ["etc", "bv", "nv"]
[[search.analyzers.dutch_full.filters]]
type = "stemmer"
algorithm = "dutch"
[[search.analyzers.dutch_full.filters]]
type = "ascii_folding"
[[search.analyzers.dutch_full.filters]]
type = "synonym"
rules = [
"auto, wagen, voertuig",
"computer, pc, laptop",
]
Language Detection
IndentiaDB can automatically detect the language of a text field and apply the appropriate language-specific analyzer. Detection uses a statistical n-gram model that returns a confidence score between 0.0 and 1.0.
Configuration
[search.indexes.multilingual_docs]
# Enable automatic language detection
language_detection = true
# Minimum confidence to apply a language-specific analyzer (fallback: standard)
language_detection_threshold = 0.7
# Default language when detection confidence is below threshold
default_language = "en"
Detection API
curl -X POST http://localhost:7001/api/v1/detect-language \
  -H "Content-Type: application/json" \
  -d '{"text": "Dit is een voorbeeldzin in het Nederlands."}'

Response:
{
  "language": "Dutch",
  "iso_code": "nl",
  "confidence": 0.94,
  "alternatives": [
    {"language": "German", "iso_code": "de", "confidence": 0.03},
    {"language": "English", "iso_code": "en", "confidence": 0.02}
  ]
}
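A client that wants to mirror the threshold behavior from the configuration section could apply the same rule to a detection response. The Python sketch below assumes that a result below the threshold falls back to the configured default language; `pick_language` is a hypothetical helper, and the sample dictionary simply repeats the response fields shown above.

```python
from typing import Any, Dict

# Sample detection result with the same fields as the response above.
SAMPLE_RESPONSE: Dict[str, Any] = {
    "language": "Dutch",
    "iso_code": "nl",
    "confidence": 0.94,
}


def pick_language(response: Dict[str, Any],
                  threshold: float = 0.7,
                  default_language: str = "en") -> str:
    """Use the detected language only when its confidence reaches the threshold."""
    if response["confidence"] >= threshold:
        return response["iso_code"]
    return default_language


print(pick_language(SAMPLE_RESPONSE))                          # nl
print(pick_language({"iso_code": "de", "confidence": 0.41}))   # en
```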
Custom Analyzer Configuration
TOML Configuration
Define custom analyzers in config.toml under [search.analyzers]:
# A medical document analyzer
[search.analyzers.medical]
tokenizer = "standard"
[[search.analyzers.medical.char_filters]]
type = "pattern_replace"
pattern = "\\b(Dr|Prof|Mr|Mrs)\\.\\s*"
replacement = ""
[[search.analyzers.medical.filters]]
type = "lowercase"
[[search.analyzers.medical.filters]]
type = "stop_words"
language = "en"
[[search.analyzers.medical.filters]]
type = "synonym"
dictionary = "/etc/indentiadb/synonyms/medical.txt"
expand = true
[[search.analyzers.medical.filters]]
type = "stemmer"
algorithm = "porter2"
[[search.analyzers.medical.filters]]
type = "phonetic"
encoder = "double_metaphone"
Elasticsearch-Compatible Index Settings
Custom analyzers can also be defined via the Elasticsearch-compatible API when creating an index. Field mappings reference analyzers by name, so the phonetic sub-field below uses a dedicated name_phonetic analyzer that wraps the phonetic_dm filter:
curl -X PUT http://localhost:9200/medical-records \
  -H "Content-Type: application/json" \
  -d '{
    "settings": {
      "analysis": {
        "analyzer": {
          "medical_analyzer": {
            "type": "custom",
            "tokenizer": "standard",
            "filter": ["lowercase", "medical_synonyms", "english_stemmer", "phonetic_dm"]
          },
          "name_phonetic": {
            "type": "custom",
            "tokenizer": "standard",
            "filter": ["lowercase", "phonetic_dm"]
          }
        },
        "filter": {
          "medical_synonyms": {
            "type": "synonym",
            "synonyms": [
              "heart attack, myocardial infarction, MI",
              "high blood pressure, hypertension"
            ]
          },
          "english_stemmer": {
            "type": "stemmer",
            "language": "english"
          },
          "phonetic_dm": {
            "type": "phonetic",
            "encoder": "double_metaphone"
          }
        }
      }
    },
    "mappings": {
      "properties": {
        "diagnosis": {
          "type": "text",
          "analyzer": "medical_analyzer"
        },
        "patient_name": {
          "type": "text",
          "fields": {
            "phonetic": {
              "type": "text",
              "analyzer": "name_phonetic"
            }
          }
        }
      }
    }
  }'
Integration with SPARQL
Linguistic analysis integrates with the SPARQL full-text search extension. Use the bds:search magic predicate to invoke the configured analyzer for a field:
PREFIX bds: <http://www.bigdata.com/rdf/search#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?person ?name ?score WHERE {
?person foaf:name ?name .
?name bds:search "automobile safety" .
?name bds:relevance ?score .
?name bds:matchAllTerms true .
}
ORDER BY DESC(?score)
LIMIT 20
When synonym expansion is enabled on the analyzer bound to the foaf:name field, this query automatically expands "automobile" to include "car" and "vehicle".
Performance Considerations
| Factor | Recommendation |
|---|---|
| Synonym dictionary size | Keep under 50,000 rules for sub-millisecond expansion. Larger dictionaries still work but add latency at index time. |
| Phonetic filters | Add measurable overhead (~5-15% slower indexing). Use on dedicated sub-fields, not primary text fields. |
| N-gram tokenizers | Produce many more tokens than standard tokenizers. Set max_gram conservatively (8-10) to avoid index bloat. |
| Stemmer choice | Porter2 is the best default for English. Aggressive stemmers (Lancaster) may over-stem and reduce precision. |
| Stop word removal | Always enable in production to reduce index size by 20-30% for natural language text. |
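The N-gram recommendation can be made concrete with a little arithmetic. For a token of length L, an N-gram tokenizer emits L - n + 1 grams for each size n between min_gram and max_gram, while an edge N-gram tokenizer emits at most one gram per size. The Python sketch below (hypothetical function names) counts both.

```python
def ngram_count(token_length: int, min_gram: int, max_gram: int) -> int:
    """Number of character n-grams emitted for one token."""
    return sum(max(0, token_length - n + 1) for n in range(min_gram, max_gram + 1))


def edge_ngram_count(token_length: int, min_gram: int, max_gram: int) -> int:
    """Number of edge n-grams emitted for one token (one per prefix length)."""
    return max(0, min(max_gram, token_length) - min_gram + 1)


print(ngram_count(5, 2, 3))        # 7  ("graph" with min_gram=2, max_gram=3)
print(ngram_count(10, 2, 4))       # 24
print(ngram_count(10, 2, 10))      # 45
print(edge_ngram_count(6, 1, 4))   # 4  ("search" with min_gram=1, max_gram=4)
```

A single 10-character word therefore produces 45 index terms with max_gram = 10, compared with one term from the standard tokenizer, which is where the index growth comes from.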
Re-indexing Required After Analyzer Changes
Changing the analyzer configuration for an existing index requires re-indexing all documents in that index. The old tokens in the inverted index will not match the new analysis pipeline. Plan analyzer configuration carefully before initial data load.