Linguistic Processing

IndentiaDB includes a full-featured linguistic processing engine that powers its BM25 full-text search and Elasticsearch-compatible API. The engine provides tokenization, normalization, stemming, stop word removal, synonym expansion, phonetic matching, and language detection -- all configurable per index and per field. Every text analysis component is implemented natively in Rust, with no JVM or external process dependencies.


Text Analysis Pipeline

Every document field passes through a configurable analysis pipeline before being indexed. The pipeline follows a strict four-stage architecture:

Input Text --> Char Filters --> Tokenizer --> Token Filters --> Indexed Tokens

| Stage | Purpose | Examples |
| --- | --- | --- |
| Char Filters | Pre-tokenization character transformations | HTML stripping, character mapping, pattern replacement |
| Tokenizer | Split text into individual tokens | Standard (Unicode word boundaries), whitespace, N-gram, CJK bigrams |
| Token Filters | Post-tokenization transformations | Lowercase, stop words, stemming, synonyms, phonetic encoding, ASCII folding |
| Output | Final indexed tokens with position and offset metadata | Used by BM25 scoring, phrase matching, highlighting |

Each stage is independently configurable. A custom analyzer defines one tokenizer and zero or more char filters and token filters, applied in order.
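
The flow above can be sketched in a few lines of Python. This is an illustrative model only, not IndentiaDB's Rust implementation: the function names (`strip_html`, `whitespace_tokenizer`, `analyze`, and so on) are hypothetical and exist only to show how char filters, one tokenizer, and an ordered list of token filters compose.

```python
import re
from typing import Callable, Iterable, List

CharFilter = Callable[[str], str]
Tokenizer = Callable[[str], List[str]]
TokenFilter = Callable[[List[str]], List[str]]


def strip_html(text: str) -> str:
    """Char filter: replace anything that looks like an HTML tag with a space."""
    return re.sub(r"<[^>]+>", " ", text)


def whitespace_tokenizer(text: str) -> List[str]:
    """Tokenizer: split on runs of whitespace."""
    return text.split()


def lowercase(tokens: List[str]) -> List[str]:
    """Token filter: lowercase every token."""
    return [t.lower() for t in tokens]


def make_stop_filter(stop_words: Iterable[str]) -> TokenFilter:
    """Build a token filter that drops the given stop words."""
    stop = set(stop_words)
    return lambda tokens: [t for t in tokens if t not in stop]


def analyze(text: str,
            char_filters: List[CharFilter],
            tokenizer: Tokenizer,
            token_filters: List[TokenFilter]) -> List[str]:
    """Run char filters, then the tokenizer, then token filters, in order."""
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens


tokens = analyze(
    "<p>The Quick Fox</p>",
    char_filters=[strip_html],
    tokenizer=whitespace_tokenizer,
    token_filters=[lowercase, make_stop_filter({"the"})],
)
print(tokens)  # ['quick', 'fox']
```

Filter order matters in this model: placing the stop word filter before `lowercase` would let "The" through, because the stop list contains only lowercase entries.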


Tokenizers

IndentiaDB supports seven tokenizer types. Each tokenizer determines how raw text is split into individual tokens.

| Tokenizer | Description | Best For |
| --- | --- | --- |
| Standard | Unicode word segmentation (UAX #29). Splits on word boundaries, removes most punctuation. | General-purpose text search |
| Whitespace | Splits on whitespace characters only. Preserves punctuation attached to tokens. | Log messages, identifiers |
| Letter | Splits on non-letter characters. Produces tokens containing only letters. | Simple natural language text |
| NGram | Character N-grams of configurable length. | Substring matching, fuzzy search |
| EdgeNGram | N-grams anchored to the start of each token. | Autocomplete / search-as-you-type |
| Pattern | Splits on a configurable regex pattern. | Custom delimiters, structured text |
| CJK | Bigram tokenizer for Chinese, Japanese, and Korean text. | CJK language content |

Tokenizer Configuration

# Standard tokenizer (default)
[search.analyzers.default]
tokenizer = "standard"

# N-gram tokenizer for substring matching
[search.analyzers.autocomplete]
tokenizer = { type = "ngram", min_gram = 2, max_gram = 4 }

# Edge N-gram for search-as-you-type
[search.analyzers.suggest]
tokenizer = { type = "edge_ngram", min_gram = 1, max_gram = 15 }

# Pattern tokenizer splitting on hyphens and underscores
[search.analyzers.identifiers]
tokenizer = { type = "pattern", pattern = "[\\-_]+" }

# CJK bigram tokenizer
[search.analyzers.cjk]
tokenizer = "cjk"

Tokenizer Examples

Standard tokenizer:
  Input:  "IndentiaDB supports SPARQL 1.2 and GeoSPARQL."
  Output: ["IndentiaDB", "supports", "SPARQL", "1.2", "and", "GeoSPARQL"]

Whitespace tokenizer:
  Input:  "error_code=42 status=failed"
  Output: ["error_code=42", "status=failed"]

NGram(2, 3) tokenizer:
  Input:  "graph"
  Output: ["gr", "gra", "ra", "rap", "ap", "aph", "ph"]

EdgeNGram(1, 4) tokenizer:
  Input:  "search"
  Output: ["s", "se", "sea", "sear"]

CJK tokenizer:
  Input:  "knowledge graph"   (in Chinese characters)
  Output: bigram pairs of adjacent characters
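
The N-gram, edge N-gram, and CJK bigram outputs above can be reproduced with a short Python sketch. The function names are hypothetical and the code only models the documented behaviour for a single token; it says nothing about how the engine implements these tokenizers internally. The Chinese sample string is an illustrative input chosen for this sketch.

```python
from typing import List


def ngrams(text: str, min_gram: int, max_gram: int) -> List[str]:
    """All character n-grams, ordered by start position, then by length."""
    out = []
    for start in range(len(text)):
        for n in range(min_gram, max_gram + 1):
            if start + n <= len(text):
                out.append(text[start:start + n])
    return out


def edge_ngrams(text: str, min_gram: int, max_gram: int) -> List[str]:
    """N-grams anchored at the first character of the token."""
    longest = min(max_gram, len(text))
    return [text[:n] for n in range(min_gram, longest + 1)]


def cjk_bigrams(text: str) -> List[str]:
    """Overlapping pairs of adjacent characters."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


print(ngrams("graph", 2, 3))        # ['gr', 'gra', 'ra', 'rap', 'ap', 'aph', 'ph']
print(edge_ngrams("search", 1, 4))  # ['s', 'se', 'sea', 'sear']
print(cjk_bigrams("知识图谱"))       # ['知识', '识图', '图谱']
```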

Token Filters

Token filters transform the token stream produced by the tokenizer. Multiple filters are applied in order, forming a processing chain.

Lowercase

Converts all tokens to lowercase. This is almost always the first filter in any analyzer.

[[search.analyzers.my_analyzer.filters]]
type = "lowercase"

Stop Words

Removes common words that carry little semantic meaning. IndentiaDB ships built-in stop word lists for all supported languages.

[[search.analyzers.my_analyzer.filters]]
type = "stop_words"
language = "en"
custom = ["etc", "ie", "eg"]    # Additional custom stop words

Built-in stop word lists are available for: English, Dutch, German, French, Spanish, Italian, Portuguese, Russian, Arabic, Chinese, Japanese, and Korean.

Stemming

Reduces words to their root form so that morphological variants match. IndentiaDB supports multiple stemming algorithms:

| Algorithm | Languages | Aggressiveness | Notes |
| --- | --- | --- | --- |
| porter | English | Moderate | Original Porter algorithm |
| porter2 | English | Moderate | Improved Porter2 / Snowball (default) |
| snowball | Multi-language | Moderate | Snowball stemmers for 12 languages |
| lovins | English | Aggressive | Single-pass, fast |
| lancaster | English | Very aggressive | Iterative, heavy reduction |
| dutch | Dutch | Moderate | Snowball Dutch variant |
| german | German | Moderate | Snowball German variant |
| french | French | Moderate | Snowball French variant |
| spanish | Spanish | Moderate | Snowball Spanish variant |
| hunspell | Any | Dictionary-based | Requires a Hunspell dictionary file |

# Porter2 stemmer (English, default)
[[search.analyzers.english.filters]]
type = "stemmer"
algorithm = "porter2"

# Snowball stemmer for Dutch
[[search.analyzers.dutch.filters]]
type = "stemmer"
algorithm = { type = "snowball", language = "nl" }

# Hunspell dictionary-based stemmer
[[search.analyzers.hunspell_de.filters]]
type = "stemmer"
algorithm = { type = "hunspell", dictionary = "/etc/indentiadb/dicts/de_DE.dic" }

Stemming Examples

Porter2 (English):
  "running"    --> "run"
  "connections" --> "connect"
  "organized"  --> "organ"

Dutch stemmer:
  "fietsen"    --> "fiets"
  "huizen"     --> "huiz"
  "werknemers" --> "werknem"

German stemmer:
  "Verbindungen" --> "verbind"
  "Arbeitsplatz"  --> "arbeitsplatz"

ASCII Folding

Converts Unicode characters to their ASCII equivalents. Essential for searching text with accented characters using plain ASCII queries.

[[search.analyzers.my_analyzer.filters]]
type = "ascii_folding"

Folding examples:
  "café"   matches "cafe"
  "Zürich" matches "Zurich"
  "résumé" matches "resume"

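One common way to approximate ASCII folding is to decompose each character with Unicode normalization and drop the combining marks. The sketch below uses Python's standard `unicodedata` module; the function name `ascii_fold` is hypothetical, and the engine's own folding table is likely broader than this.

```python
import unicodedata


def ascii_fold(token: str) -> str:
    """Strip diacritics by decomposing (NFKD) and removing combining marks."""
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


for word in ["café", "Zürich", "résumé"]:
    print(word, "->", ascii_fold(word))
# café -> cafe
# Zürich -> Zurich
# résumé -> resume
```

Characters without a Unicode decomposition (for example "ß" or "ø") pass through this sketch unchanged, so a production folding filter needs an explicit mapping table for them.
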
Length Filter

Removes tokens shorter or longer than specified bounds:

[[search.analyzers.my_analyzer.filters]]
type = "length"
min = 2
max = 50

Trim

Removes leading and trailing whitespace from tokens:

[[search.analyzers.my_analyzer.filters]]
type = "trim"

Deduplication

Removes duplicate tokens at the same position (useful after synonym expansion):

[[search.analyzers.my_analyzer.filters]]
type = "dedup"
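
As a rough model of this filter, assume each token carries its position in the stream. The sketch below, with a hypothetical `dedup` function over `(position, term)` pairs, drops a term only when the same term was already emitted at the same position, so repeated words later in the text are kept.

```python
from typing import List, Tuple

Token = Tuple[int, str]  # (position, term)


def dedup(tokens: List[Token]) -> List[Token]:
    """Keep the first occurrence of each (position, term) pair, preserving order."""
    seen = set()
    out = []
    for token in tokens:
        if token not in seen:
            seen.add(token)
            out.append(token)
    return out


stream = [(0, "car"), (0, "automobile"), (0, "car"), (1, "car")]
print(dedup(stream))  # [(0, 'car'), (0, 'automobile'), (1, 'car')]
```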

Shingle

Produces token N-grams (multi-word tokens) for phrase-level matching:

[[search.analyzers.my_analyzer.filters]]
type = "shingle"
min_size = 2
max_size = 3

Shingle example:
  Input tokens: ["knowledge", "graph", "database"]
  Shingles:     ["knowledge graph", "graph database", "knowledge graph database"]
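
A minimal Python sketch of shingle generation follows. The `shingles` function is hypothetical; it emits all shingles of one size before moving to the next size, which reproduces the ordering shown above, though the engine's actual emission order is not specified in this document.

```python
from typing import List


def shingles(tokens: List[str], min_size: int = 2, max_size: int = 3) -> List[str]:
    """Join every run of min_size..max_size adjacent tokens with a space."""
    out = []
    for size in range(min_size, max_size + 1):
        for start in range(len(tokens) - size + 1):
            out.append(" ".join(tokens[start:start + size]))
    return out


print(shingles(["knowledge", "graph", "database"]))
# ['knowledge graph', 'graph database', 'knowledge graph database']
```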

Word Delimiter

Splits tokens on case transitions, letter-digit boundaries, and configurable delimiters:

[[search.analyzers.my_analyzer.filters]]
type = "word_delimiter"

Splitting examples:
  "PowerShell"  --> ["Power", "Shell"]
  "Wi-Fi"       --> ["Wi", "Fi"]
  "SD3500"      --> ["SD", "3500"]

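The three splitting rules can be approximated with a single regular expression. This Python sketch is an assumption-laden illustration: the `word_delimiter` function is hypothetical, it handles ASCII letters and digits only, and it treats every other character as a delimiter.

```python
import re
from typing import List

# A capitalized or lowercase word, OR a run of capitals not followed by a
# lowercase letter (an acronym), OR a run of digits.
_PARTS = re.compile(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+")


def word_delimiter(token: str) -> List[str]:
    """Split on case transitions, letter-digit boundaries, and punctuation."""
    return _PARTS.findall(token)


print(word_delimiter("PowerShell"))  # ['Power', 'Shell']
print(word_delimiter("Wi-Fi"))       # ['Wi', 'Fi']
print(word_delimiter("SD3500"))      # ['SD', '3500']
```
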
Synonym Expansion

Synonym expansion allows queries to match documents containing equivalent terms. IndentiaDB supports two modes: explicit synonym rules and synonym dictionaries loaded from files.

Synonym Rules

Define synonyms inline in the analyzer configuration:

[[search.analyzers.my_analyzer.filters]]
type = "synonym"
expand = true
rules = [
    "automobile, car, vehicle",
    "quick, fast, speedy",
    "big, large, enormous => huge",
]

Bidirectional rules (comma-separated): All terms are interchangeable. A query for "car" also matches "automobile" and "vehicle".

Directional rules (with =>): Only the left-hand terms expand to the right-hand term. A query for "big" matches "huge", but "huge" does not match "big".
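
The two rule types can be modelled as a lookup table from a term to its expansions. The Python sketch below is hypothetical (`parse_rules` and `expand` are not IndentiaDB functions) and makes one assumption worth stating: a directional rule replaces the left-hand term with the right-hand term, as in Solr-style synonym files, rather than keeping both.

```python
from collections import defaultdict
from typing import Dict, List, Set


def parse_rules(rules: List[str]) -> Dict[str, Set[str]]:
    """Build a term -> expansions table from inline synonym rules."""
    table: Dict[str, Set[str]] = defaultdict(set)
    for rule in rules:
        if "=>" in rule:
            # Directional: each left-hand term maps to the right-hand terms only.
            lhs, rhs = rule.split("=>", 1)
            targets = {t.strip() for t in rhs.split(",") if t.strip()}
            for term in lhs.split(","):
                if term.strip():
                    table[term.strip()] |= targets
        else:
            # Bidirectional: every term in the group maps to the whole group.
            group = {t.strip() for t in rule.split(",") if t.strip()}
            for term in group:
                table[term] |= group
    return dict(table)


def expand(token: str, table: Dict[str, Set[str]]) -> Set[str]:
    """Return the expansions for a token, or the token itself if no rule applies."""
    return table.get(token, {token})


table = parse_rules([
    "automobile, car, vehicle",
    "quick, fast, speedy",
    "big, large, enormous => huge",
])
print(sorted(expand("car", table)))   # ['automobile', 'car', 'vehicle']
print(sorted(expand("big", table)))   # ['huge']
print(sorted(expand("huge", table)))  # ['huge']
```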

Synonym Dictionary File

For large synonym sets, load from a file:

[[search.analyzers.my_analyzer.filters]]
type = "synonym"
expand = true
dictionary = "/etc/indentiadb/synonyms/medical.txt"

Dictionary file format (one rule per line):

# Medical synonyms
heart attack, myocardial infarction, MI
high blood pressure, hypertension
paracetamol, acetaminophen

# With synonym expansion enabled, this query:
curl -X POST http://localhost:9200/articles/_search \
  -H "Content-Type: application/json" \
  -d '{
    "query": {
      "match": {
        "content": {
          "query": "automobile safety",
          "analyzer": "my_analyzer"
        }
      }
    }
  }'

# ...also matches documents containing "car safety", "vehicle safety", etc.

Index-Time vs Query-Time Synonyms

By default, synonyms are expanded at index time -- meaning synonym variants are stored in the inverted index alongside the original term. This is faster at query time but requires re-indexing when synonyms change. Set expand = false to apply synonyms only at query time, which allows synonym dictionary updates without re-indexing at the cost of slightly slower queries.


Phonetic Matching

Phonetic encoders convert tokens to phonetic codes so that words that sound alike match each other, even when spelled differently. This is valuable for name search, address matching, and correcting misspellings.

Supported Encoders

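| Encoder | Origin | Best For | Example |
| --- | --- | --- | --- |
| soundex | Robert Russell and Margaret Odell (patented 1918); later used to index U.S. census records from 1880 onward | English surnames | "Robert" and "Rupert" both produce R163 |
| metaphone | Lawrence Philips (1990) | English words | "Smith" and "Smyth" both produce SM0 |
| double_metaphone | Lawrence Philips (2000) | Multi-origin names | Returns primary + alternate code: "Smith" (SM0, XMT) and "Schmidt" (XMT, SMT) match on XMT |
| cologne | Hans Joachim Postel (1969) | German names and words | "Mueller" and "Muller" both produce 657 |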
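
To make the idea of a phonetic code concrete, here is a compact Python sketch of American Soundex, the simplest of the four encoders. It follows the standard published rules (keep the first letter, encode consonants as digits, collapse adjacent identical codes, let H and W pass through without breaking a run, pad to four characters). The `soundex` function name is hypothetical, and the engine's implementation may differ in edge cases.

```python
# Letter groups that share a Soundex digit; vowels, H, W, and Y have no code.
_CODES = {}
for _letters, _digit in [("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")]:
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(name: str) -> str:
    """Return the four-character American Soundex code for a name."""
    letters = [ch for ch in name.upper() if ch.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    previous = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = _CODES.get(ch, "")
        if code and code != previous:
            result += code
        previous = code  # a vowel resets the run, so a repeated code counts again
    return (result + "000")[:4]


print(soundex("Robert"), soundex("Rupert"))  # R163 R163
```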

Configuration

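# Add phonetic encoding as a token filter on an existing analyzer
[[search.analyzers.my_analyzer.filters]]
type = "phonetic"
encoder = "double_metaphone"

# Full name search analyzer, written with an inline filters array.
# Use either this inline form or [[...filters]] tables for a given analyzer,
# not both: TOML rejects a key that is defined twice.
[search.analyzers.name_search]
tokenizer = "standard"
filters = [
    { type = "lowercase" },
    { type = "ascii_folding" },
    { type = "phonetic", encoder = "double_metaphone" },
]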

Phonetic Search Example

# Search for a person by approximate name
curl -X POST http://localhost:9200/contacts/_search \
  -H "Content-Type: application/json" \
  -d '{
    "query": {
      "match": {
        "name.phonetic": {
          "query": "Steven",
          "analyzer": "name_search"
        }
      }
    }
  }'

# Matches: "Steven", "Stephen", "Stefan", "Stephan"

Phonetic Fields

Phonetic encoding is typically applied to a sub-field (e.g., name.phonetic) rather than the primary field, so that exact matches on the primary field still work and phonetic matching is available as a fallback or boost signal.


Language-Specific Analyzers

IndentiaDB ships pre-configured analyzers for common languages. Each language analyzer includes the appropriate tokenizer, stop word list, and stemmer.

Supported Languages

| Language | ISO Code | Stemmer | Stop Words | CJK Tokenizer |
| --- | --- | --- | --- | --- |
| English | en | Porter2 | 174 words | No |
| Dutch | nl | Snowball Dutch | 119 words | No |
| German | de | Snowball German | 232 words | No |
| French | fr | Snowball French | 164 words | No |
| Spanish | es | Snowball Spanish | 313 words | No |
| Italian | it | Snowball Italian | 279 words | No |
| Portuguese | pt | Snowball Portuguese | 203 words | No |
| Russian | ru | Snowball Russian | 243 words | No |
| Arabic | ar | Snowball Arabic | 162 words | No |
| Chinese | zh | None | Minimal | Yes (CJK bigrams) |
| Japanese | ja | None | Minimal | Yes (CJK bigrams) |
| Korean | ko | None | Minimal | Yes (CJK bigrams) |

Using a Built-In Language Analyzer

# Use the Dutch analyzer for a specific index
[search.indexes.dutch_articles]
analyzer = "dutch"

# Or reference by language code
[search.indexes.german_articles]
analyzer = { language = "de" }

Custom Language Analyzer

Build a custom analyzer with language-specific components:

[search.analyzers.dutch_full]
language = "nl"
tokenizer = "standard"

[[search.analyzers.dutch_full.filters]]
type = "lowercase"

[[search.analyzers.dutch_full.filters]]
type = "stop_words"
language = "nl"
custom = ["etc", "bv", "nv"]

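# Synonyms run before stemming so that the rules match unstemmed tokens
# (filters are applied in order; the Dutch stemmer would otherwise shorten "wagen").
[[search.analyzers.dutch_full.filters]]
type = "synonym"
rules = [
    "auto, wagen, voertuig",
    "computer, pc, laptop",
]

[[search.analyzers.dutch_full.filters]]
type = "stemmer"
algorithm = "dutch"

[[search.analyzers.dutch_full.filters]]
type = "ascii_folding"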

Language Detection

IndentiaDB can automatically detect the language of a text field and apply the appropriate language-specific analyzer. Detection uses a statistical n-gram model that returns a confidence score between 0.0 and 1.0.

Configuration

[search.indexes.multilingual_docs]
# Enable automatic language detection
language_detection = true
# Minimum confidence to apply a language-specific analyzer (fallback: standard)
language_detection_threshold = 0.7
# Default language when detection confidence is below threshold
default_language = "en"

Detection API

curl -X POST http://localhost:7001/api/v1/detect-language \
  -H "Content-Type: application/json" \
  -d '{"text": "Dit is een voorbeeldzin in het Nederlands."}'

Example response:

{
  "language": "Dutch",
  "iso_code": "nl",
  "confidence": 0.94,
  "alternatives": [
    {"language": "German", "iso_code": "de", "confidence": 0.03},
    {"language": "English", "iso_code": "en", "confidence": 0.02}
  ]
}
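
A client can apply the same threshold logic to this response. The Python sketch below is illustrative only: `choose_language` is a hypothetical helper, the field names come from the example response above, and it assumes that a below-threshold result falls back to the configured `default_language`.

```python
import json

# The example response from the detection endpoint, as a JSON string.
RESPONSE = """
{
  "language": "Dutch",
  "iso_code": "nl",
  "confidence": 0.94,
  "alternatives": [
    {"language": "German", "iso_code": "de", "confidence": 0.03},
    {"language": "English", "iso_code": "en", "confidence": 0.02}
  ]
}
"""


def choose_language(detection: dict, threshold: float = 0.7,
                    default_language: str = "en") -> str:
    """Return the detected ISO code if confident enough, else the default."""
    if detection["confidence"] >= threshold:
        return detection["iso_code"]
    return default_language


detection = json.loads(RESPONSE)
print(choose_language(detection))                    # nl
print(choose_language({**detection, "confidence": 0.5}))  # en
```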

Custom Analyzer Configuration

TOML Configuration

Define custom analyzers in config.toml under [search.analyzers]:

# A medical document analyzer
[search.analyzers.medical]
tokenizer = "standard"

[[search.analyzers.medical.char_filters]]
type = "pattern_replace"
pattern = "\\b(Dr|Prof|Mr|Mrs)\\.\\s*"
replacement = ""

[[search.analyzers.medical.filters]]
type = "lowercase"

[[search.analyzers.medical.filters]]
type = "stop_words"
language = "en"

[[search.analyzers.medical.filters]]
type = "synonym"
dictionary = "/etc/indentiadb/synonyms/medical.txt"
expand = true

[[search.analyzers.medical.filters]]
type = "stemmer"
algorithm = "porter2"

[[search.analyzers.medical.filters]]
type = "phonetic"
encoder = "double_metaphone"

Elasticsearch-Compatible Index Settings

Custom analyzers can also be defined via the Elasticsearch-compatible API when creating an index:

curl -X PUT http://localhost:9200/medical-records \
  -H "Content-Type: application/json" \
  -d '{
    "settings": {
      "analysis": {
        "analyzer": {
          "medical_analyzer": {
            "type": "custom",
            "tokenizer": "standard",
            "filter": ["lowercase", "medical_synonyms", "english_stemmer", "phonetic_dm"]
          },
          "name_phonetic": {
            "type": "custom",
            "tokenizer": "standard",
            "filter": ["lowercase", "phonetic_dm"]
          }
        },
        "filter": {
          "medical_synonyms": {
            "type": "synonym",
            "synonyms": [
              "heart attack, myocardial infarction, MI",
              "high blood pressure, hypertension"
            ]
          },
          "english_stemmer": {
            "type": "stemmer",
            "language": "english"
          },
          "phonetic_dm": {
            "type": "phonetic",
            "encoder": "double_metaphone"
          }
        }
      }
    },
    "mappings": {
      "properties": {
        "diagnosis": {
          "type": "text",
          "analyzer": "medical_analyzer"
        },
        "patient_name": {
          "type": "text",
          "fields": {
            "phonetic": {
              "type": "text",
              "analyzer": "name_phonetic"
            }
          }
        }
      }
    }
  }'

The patient_name.phonetic sub-field references name_phonetic, an analyzer, because phonetic_dm is a token filter and a field's "analyzer" setting must name an analyzer.

Integration with SPARQL

Linguistic analysis integrates with the SPARQL full-text search extension. Use the bds:search magic predicate to invoke the configured analyzer for a field:

PREFIX bds: <http://www.bigdata.com/rdf/search#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

SELECT ?person ?name ?score WHERE {
    ?person foaf:name ?name .
    ?name bds:search "automobile safety" .
    ?name bds:relevance ?score .
    ?name bds:matchAllTerms true .
}
ORDER BY DESC(?score)
LIMIT 20

When synonym expansion is enabled on the analyzer bound to the foaf:name field, this query automatically expands "automobile" to include "car" and "vehicle".


Performance Considerations

| Factor | Recommendation |
| --- | --- |
| Synonym dictionary size | Keep under 50,000 rules for sub-millisecond expansion. Larger dictionaries still work but add latency at index time. |
| Phonetic filters | Add measurable overhead (~5-15% slower indexing). Use on dedicated sub-fields, not primary text fields. |
| N-gram tokenizers | Produce many more tokens than standard tokenizers. Set max_gram conservatively (8-10) to avoid index bloat. |
| Stemmer choice | Porter2 is the best default for English. Aggressive stemmers (Lancaster) may over-stem and reduce precision. |
| Stop word removal | Always enable in production to reduce index size by 20-30% for natural language text. |
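
To see why N-gram settings affect index size, it helps to count tokens. For a token of length L, an N-gram tokenizer emits L - n + 1 grams of each length n, while an edge N-gram tokenizer emits at most one gram per length. The Python sketch below uses hypothetical helper names and counts grams for a single token only.

```python
def ngram_count(length: int, min_gram: int, max_gram: int) -> int:
    """Number of character n-grams produced for one token of the given length."""
    return sum(max(0, length - n + 1) for n in range(min_gram, max_gram + 1))


def edge_ngram_count(length: int, min_gram: int, max_gram: int) -> int:
    """Number of edge n-grams produced for one token of the given length."""
    return max(0, min(length, max_gram) - min_gram + 1)


print(ngram_count(5, 2, 3))        # 7  ("graph" with NGram(2, 3))
print(ngram_count(10, 2, 4))       # 24
print(ngram_count(10, 2, 10))      # 45
print(edge_ngram_count(6, 1, 4))   # 4  ("search" with EdgeNGram(1, 4))
```

A single 10-character word produces 24 tokens with max_gram = 4 and 45 with max_gram = 10, compared with one token from the standard tokenizer, which is why the table recommends keeping max_gram small.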

Re-indexing Required After Analyzer Changes

Changing the analyzer configuration for an existing index requires re-indexing all documents in that index. The old tokens in the inverted index will not match the new analysis pipeline. Plan analyzer configuration carefully before initial data load.