Bayesian BM25 (BB25)¶

IndentiaDB implements Bayesian BM25 (BB25), a probabilistic framework that converts raw BM25 scores into calibrated relevance probabilities and fuses them with dense vector similarities in a unified probability space. On the benchmark reported below, the resulting hybrid ranker outperforms both BM25-only and kNN-only search.

Benchmark result: NDCG@10 = 0.9149 on the Olympics dataset (BM25 alone: 0.71, kNN alone: 0.78).

Why BM25 Scores Are Hard to Fuse Directly¶

Standard BM25 produces an unbounded score (e.g. 3.7, 12.4, 0.2) that varies by corpus size, query length, and field length. Dense vector search produces cosine similarity in [−1, 1]. Directly adding these numbers produces garbage — they live in completely different numerical spaces.

Two common workarounds both have serious problems:

Method	Problem
Linear combination `α·bm25 + (1−α)·cosine`	BM25 magnitude dominates; α must be retuned per corpus
Reciprocal Rank Fusion (RRF) `1/(k+rank)`	Throws away score magnitude; equal weight regardless of confidence
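
Both failure modes can be reproduced with a few lines of arithmetic. The sketch below uses made-up scores for two hypothetical documents; the function names are illustrative and are not IndentiaDB APIs.

```python
def linear_fusion(bm25: float, cosine: float, alpha: float = 0.5) -> float:
    """Linear combination: alpha * bm25 + (1 - alpha) * cosine."""
    return alpha * bm25 + (1 - alpha) * cosine


def rrf(rank_bm25: int, rank_knn: int, k: int = 60) -> float:
    """Reciprocal Rank Fusion over 1-based ranks."""
    return 1 / (k + rank_bm25) + 1 / (k + rank_knn)


# Document A: slightly better BM25 score, very weak semantic match.
# Document B: slightly worse BM25 score, very strong semantic match.
doc_a = linear_fusion(bm25=12.4, cosine=0.10)
doc_b = linear_fusion(bm25=11.0, cosine=0.95)
linear_prefers_a = doc_a > doc_b  # BM25's larger scale decides the outcome

# RRF sees only ranks: a document ranked (1st, 2nd) ties with one ranked
# (2nd, 1st), no matter how large the underlying score gaps were.
rrf_tie = rrf(1, 2) == rrf(2, 1)
```

With equal weights, the 1.4-point BM25 gap outweighs a 0.85 cosine gap, and RRF cannot tell a narrow win from a decisive one.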

BB25 solves this by calibrating both signals into probability space before fusion.

Architecture¶

BM25 scores ──► BayesianProbabilityTransform ──► P(relevant | BM25)  ─┐
                                                                        ├─► log-odds fusion ──► ranked list
cosine sim  ──► cosine_to_probability()        ──► P(relevant | kNN)  ─┘

Three components do the work:

BayesianProbabilityTransform — maps BM25 scores to calibrated probabilities
cosine_to_probability — maps cosine similarity to probability
Balanced log-odds fusion — blends both in logit space with min-max normalisation

Step 1: BM25 Score Calibration¶

The Sigmoid Likelihood¶

The BM25 score s is converted to a likelihood via a learned sigmoid:

L(s) = σ(α · (s − β))

where:

α (steepness) — how sharply the curve transitions from non-relevant to relevant; auto-estimated as 1 / std(scores)
β (midpoint) — the BM25 score at which relevance probability is 0.5; auto-estimated as median(scores)

Both parameters can be auto-estimated from the BM25 score distribution at query time (zero-config), overridden per-query, or learned offline via gradient descent using relevance labels.
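
A minimal sketch of the zero-config path, assuming the estimators are exactly the ones named above (median for β, reciprocal standard deviation for α). The function names, the use of the population standard deviation, and the fallback for constant scores are assumptions made for illustration.

```python
import math
import statistics


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def estimate_params(scores: list[float]) -> tuple[float, float]:
    """Auto-estimate (alpha, beta) as (1 / std, median) of the scores."""
    beta = statistics.median(scores)
    std = statistics.pstdev(scores)  # population std; estimator choice is an assumption
    alpha = 1.0 / std if std > 0 else 1.0  # fallback for constant scores is an assumption
    return alpha, beta


def likelihood(score: float, alpha: float, beta: float) -> float:
    """L(s) = sigmoid(alpha * (s - beta))."""
    return sigmoid(alpha * (score - beta))


bm25_scores = [0.2, 3.7, 5.1, 8.4, 12.4]
alpha, beta = estimate_params(bm25_scores)
likelihoods = [likelihood(s, alpha, beta) for s in bm25_scores]
```

With these estimators the argument of the sigmoid is the score's distance from the median measured in standard deviations, so the median score always maps to 0.5.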

Composite Prior¶

The likelihood alone ignores query-specific context. BB25 adds a composite prior that incorporates term frequency and document length:

P_tf(tf)            = 0.2 + 0.7 · min(1, tf / 10)
P_norm(doc_len_ratio) = 0.3 + 0.6 · (1 − |doc_len_ratio − 0.5| · 2)

P_composite(tf, doc_len_ratio) = clamp(0.7·P_tf + 0.3·P_norm,  0.1, 0.9)

Where doc_len_ratio = doc_len / avg_doc_len.

The intuition:

- A term appearing many times in a short document is a stronger signal than a single occurrence in a long document.
- The composite prior prevents extreme probability values while encoding this domain knowledge.
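
The prior formulas can be transcribed directly. This is a sketch of the equations as written above, with hypothetical function names, not the engine's actual code.

```python
def clamp(x: float, lo: float, hi: float) -> float:
    return max(lo, min(hi, x))


def tf_prior(tf: float) -> float:
    """P_tf = 0.2 + 0.7 * min(1, tf / 10)."""
    return 0.2 + 0.7 * min(1.0, tf / 10.0)


def norm_prior(doc_len_ratio: float) -> float:
    """P_norm = 0.3 + 0.6 * (1 - |ratio - 0.5| * 2), as written above."""
    return 0.3 + 0.6 * (1.0 - abs(doc_len_ratio - 0.5) * 2.0)


def composite_prior(tf: float, doc_len_ratio: float) -> float:
    """Weighted blend of both priors, clamped to [0.1, 0.9]."""
    blended = 0.7 * tf_prior(tf) + 0.3 * norm_prior(doc_len_ratio)
    return clamp(blended, 0.1, 0.9)


strong = composite_prior(tf=10, doc_len_ratio=0.5)   # many hits, short document
average = composite_prior(tf=1, doc_len_ratio=1.0)   # one hit, average length
long_doc = composite_prior(tf=1, doc_len_ratio=3.0)  # one hit, very long document
```

As transcribed, P_norm is not clamped on its own and turns negative for documents much longer than average (doc_len_ratio above 1.25). In this sketch only the final clamp keeps the result inside [0.1, 0.9], which is why the very long document lands exactly on the 0.1 floor.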

Bayesian Posterior¶

The final calibrated probability combines likelihood, prior, and optionally a corpus-level base rate via a two-step Bayes update:

Step 1:  P₁ = (L · P_composite) / (L · P_composite + (1−L) · (1−P_composite))

Step 2 (if base_rate set):
         P₂ = (P₁ · base_rate) / (P₁ · base_rate + (1−P₁) · (1−base_rate))

The base rate encodes corpus-level relevance density — estimated as the fraction of documents scoring above the 95th percentile. A sparse corpus (few relevant documents) gets a low base rate (e.g. 0.03), which down-weights marginal BM25 hits.
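
Both steps apply the same update rule, which a short sketch makes explicit. The function names are hypothetical, and inputs are assumed to lie strictly between 0 and 1.

```python
from typing import Optional


def bayes_update(p: float, prior: float) -> float:
    """Combine a probability with a prior: p*prior / (p*prior + (1-p)*(1-prior))."""
    numerator = p * prior
    return numerator / (numerator + (1.0 - p) * (1.0 - prior))


def posterior(likelihood: float, composite_prior: float,
              base_rate: Optional[float] = None) -> float:
    """Two-step update: composite prior first, then the optional base rate."""
    p1 = bayes_update(likelihood, composite_prior)
    if base_rate is None:
        return p1
    return bayes_update(p1, base_rate)


neutral = posterior(0.9, 0.5)                   # a prior of 0.5 changes nothing
sparse = posterior(0.9, 0.5, base_rate=0.03)    # a low base rate pulls the result down
```

In log-odds form each update is an addition, logit(P₁) = logit(L) + logit(P_composite), which is why a prior of 0.5 (logit 0) leaves the likelihood unchanged and a base rate of 0.03 lowers a 0.9 likelihood to roughly 0.22.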

Full Pipeline¶

BM25 score
    ↓
L = σ(α · (s − β))                  # sigmoid likelihood
    ↓
P_composite = f(tf, doc_len_ratio)   # composite prior
    ↓
P₁ = Bayes(L, P_composite)          # first update
    ↓
P₂ = Bayes(P₁, base_rate)           # second update (corpus prior)
    ↓
calibrated probability ∈ (0, 1)

Step 2: Cosine Similarity Calibration¶

Cosine similarity c ∈ [−1, 1] is mapped to probability via a simple linear transform:

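P_knn = P(relevant | c) = clamp((1 + c) / 2,  ε,  1−ε)

Here ε is a small positive constant that keeps the probability away from exactly 0 or 1, so the logit taken in the fusion step stays finite. The transform puts cosine similarity on the same (0, 1) scale as the BM25 posterior, enabling direct comparison.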
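
A one-function sketch of this mapping. The value of ε is not given above, so the default of 1e-6 here is an assumption.

```python
def cosine_to_probability(cosine: float, eps: float = 1e-6) -> float:
    """Map cosine similarity in [-1, 1] to a probability in [eps, 1 - eps]."""
    return min(1.0 - eps, max(eps, (1.0 + cosine) / 2.0))


orthogonal = cosine_to_probability(0.0)    # unrelated vectors map to 0.5
identical = cosine_to_probability(1.0)     # clamped just below 1
opposite = cosine_to_probability(-1.0)     # clamped just above 0
```

Clamping matters because the later log-odds step divides by 1 − P, which would be zero for an exact duplicate vector.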

Step 3: Log-Odds Fusion¶

Once both signals are in probability space, they are fused via balanced log-odds fusion:

logit_bm25 = log(P_bm25 / (1 − P_bm25))    # convert to log-odds
logit_knn  = log(P_knn  / (1 − P_knn ))

bm25_norm  = min_max_normalize(logit_bm25)  # normalise each source
knn_norm   = min_max_normalize(logit_knn)   # independently to [0,1]

score = weight · knn_norm + (1 − weight) · bm25_norm

Documents present in only one source get logit(0.5) = 0.0 — the neutral "no information" value. This avoids penalising documents that happen to be absent from one retriever's top-k window.

The weight parameter (default 0.5) controls the kNN/BM25 balance and can be tuned per-query.
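
The fusion step can be sketched end to end as follows. The function names, the dictionary-based inputs, normalising over the union of both result sets, and the 0.5 fallback when all logits are equal are assumptions made for illustration.

```python
import math


def logit(p: float) -> float:
    return math.log(p / (1.0 - p))


def min_max_normalize(values: list[float]) -> list[float]:
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.5] * len(values)  # assumption: constant input maps to 0.5
    return [(v - lo) / (hi - lo) for v in values]


def fuse(p_bm25: dict[str, float], p_knn: dict[str, float],
         weight: float = 0.5) -> list[tuple[str, float]]:
    """Balanced log-odds fusion over the union of both result sets."""
    doc_ids = sorted(set(p_bm25) | set(p_knn))
    # A document missing from one source gets probability 0.5, i.e. logit 0.0.
    bm25_norm = min_max_normalize([logit(p_bm25.get(d, 0.5)) for d in doc_ids])
    knn_norm = min_max_normalize([logit(p_knn.get(d, 0.5)) for d in doc_ids])
    scores = {
        d: weight * k + (1.0 - weight) * b
        for d, b, k in zip(doc_ids, bm25_norm, knn_norm)
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


ranked = fuse(
    p_bm25={"a": 0.9, "b": 0.6, "c": 0.2},
    p_knn={"a": 0.8, "c": 0.7, "d": 0.95},
)
```

In this example document "d" appears only in the kNN results, yet it ranks second behind "a": its neutral BM25 logit of 0.0 sits in the middle of the BM25 range rather than at the bottom.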

Auto-Estimation vs. Manual Tuning¶

Parameter	Auto-estimate	Manual override
`alpha`	`1 / std(bm25_scores)`	`"alpha": 2.0` in query
`beta`	`median(bm25_scores)`	`"beta": 4.0` in query
`base_rate`	fraction above 95th percentile	`"base_rate": 0.03` in query
`weight`	`0.5`	`"weight": 0.7` in query

Auto-estimation runs at query time from the BM25 result set — no offline training required. For best results on a stable corpus, use fit() to learn alpha/beta from labeled data.

Training Modes¶

Three training modes are supported for offline parameter learning:

Mode	Description
`Balanced` (default)	Train on sigmoid likelihood: `P = σ(α·(s−β))`
`PriorAware`	Full Bayesian posterior with composite prior during training
`PriorFree`	Same training as Balanced, but at inference uses `prior = 0.5`

Use PriorAware when you have tf and doc_len metadata in your training labels.
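
For the Balanced mode, learning α and β amounts to fitting a one-feature logistic model. The sketch below uses plain batch gradient descent on binary cross-entropy; the loss, learning rate, epoch count, initialisation, and toy data are all assumptions, and the real fit() may differ.

```python
import math
import statistics


def sigmoid(x: float) -> float:
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def fit_sigmoid(scores, labels, lr=0.05, epochs=2000):
    """Fit (alpha, beta) of P = sigmoid(alpha * (s - beta)) on labelled scores."""
    # Start from the auto-estimates; assumes the scores are not all identical.
    beta = statistics.median(scores)
    alpha = 1.0 / statistics.pstdev(scores)
    n = len(scores)
    for _ in range(epochs):
        grad_alpha = 0.0
        grad_beta = 0.0
        for s, y in zip(scores, labels):
            err = sigmoid(alpha * (s - beta)) - y  # dLoss/dz for z = alpha * (s - beta)
            grad_alpha += err * (s - beta)         # dz/dalpha = s - beta
            grad_beta += err * -alpha              # dz/dbeta = -alpha
        alpha -= lr * grad_alpha / n
        beta -= lr * grad_beta / n
    return alpha, beta


# Toy labels: mostly relevant above a score of 5, with one exception on each side.
scores = [1.0, 2.0, 3.0, 4.0, 6.0, 7.0, 8.0, 9.0]
labels = [0, 0, 0, 1, 0, 1, 1, 1]
alpha, beta = fit_sigmoid(scores, labels)
```

The two mislabelled-looking points keep the data non-separable, so α settles at a finite steepness instead of growing without bound.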

Gating Functions¶

Before aggregating logits across multi-field or multi-hop queries, a gating function controls how uninformative (negative) log-odds contribute:

Gating	Formula	Use case
`NoGating`	`logit` unchanged	Default
`Relu`	`max(0, logit)`	MAP estimate under sparse prior (Theorem 6.5.3)
`Swish`	`logit · σ(logit)`	Bayes estimate under sparse prior (Theorem 6.7.4)
`GeneralizedSwish(β)`	`logit · σ(β · logit)`	Tunable (Theorem 6.7.6)
`Gelu`	`logit · σ(1.702 · logit)`	GELU activation analogue (Theorem 6.8.1)

Relu is the most conservative — negative evidence is discarded rather than allowed to lower the score.
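
The formulas in the table translate directly into code. This sketch uses hypothetical function names and only shows how each gate treats a negative logit.

```python
import math


def sigmoid(x: float) -> float:
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def no_gating(z: float) -> float:
    return z


def relu(z: float) -> float:
    return max(0.0, z)


def generalized_swish(z: float, beta: float) -> float:
    return z * sigmoid(beta * z)


def swish(z: float) -> float:
    return generalized_swish(z, 1.0)


def gelu(z: float) -> float:
    return generalized_swish(z, 1.702)


negative_logit = -2.0
gated = {
    "NoGating": no_gating(negative_logit),
    "Relu": relu(negative_logit),
    "Swish": swish(negative_logit),
    "Gelu": gelu(negative_logit),
}
```

For a logit of −2.0, NoGating passes the full penalty through, Relu removes it entirely, and Swish and Gelu keep a small fraction of it.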

Configuration¶

Enable Globally (config.toml)¶

[search]
hybrid_scorer = "bayesian"   # "bayesian" | "rrf" | "linear"

Enable via Environment Variable¶

ES_HYBRID_SCORER=bayesian indentiagraph serve

Per-Query Override (Elasticsearch-compatible API)¶

POST /my-index/_search
{
  "retriever": {
    "bayesian": {
      "weight": 0.6,
      "rank_window_size": 100,
      "retrievers": [
        {
          "standard": {
            "query": { "match": { "content": "SPARQL federation" } }
          }
        },
        {
          "knn": {
            "field": "embedding",
            "query_vector": [0.12, -0.34, 0.89],
            "k": 100,
            "num_candidates": 200
          }
        }
      ]
    }
  },
  "size": 10
}

With Manual Parameter Override¶

{
  "retriever": {
    "bayesian": {
      "weight": 0.7,
      "alpha": 1.5,
      "beta": 6.0,
      "base_rate": 0.03,
      "rank_window_size": 200,
      "retrievers": [...]
    }
  }
}

With Explain¶

{
  "retriever": {
    "bayesian": {
      "weight": 0.5,
      "retrievers": [...]
    }
  },
  "explain": true
}

The _explanation field in each hit shows the fused score breakdown:

"_explanation": {
  "value": 0.812,
  "description": "bayesian_hybrid",
  "details": [{
    "bm25_logit_norm": 0.743,
    "knn_logit_norm": 0.881,
    "weight": 0.5
  }]
}

Comparison: BB25 vs RRF vs Linear¶

	BB25	RRF	Linear
Uses score magnitude	Yes	No (rank only)	Yes
Calibrated to [0,1]	Yes	No (bounded by 1/(k+r), but not a probability)	No
Handles missing docs	Neutral prior (0.5)	Absent = 0	Absent = 0
Corpus-adaptive	Auto-estimates α/β	Fixed k=60	Manual α
Per-query tunable	weight, α, β, base_rate	k	α
NDCG@10 (Olympics)	0.9149	0.847	0.831

SurrealQL Integration¶

Hybrid search with BB25 is also available via SurrealQL using the search::bayesian function:

SELECT
    id,
    title,
    search::bayesian(content_bm25, embedding, 0.6) AS score
FROM documents
WHERE content @@ 'knowledge graph federation'
ORDER BY score DESC
LIMIT 10;

The function signature: search::bayesian(bm25_field, vector_field, weight) → f64

References¶

"Bayesian BM25: A Probabilistic Framework for Hybrid Text and Vector Search"
bb25 Rust reference implementation
Python reference: Cognica bayesian-bm25
IndentiaDB source: indentiagraph-storage/src/text_index/bayesian.rs

Calibration Workflow¶

Calibrating BB25 for your corpus ensures that the sigmoid parameters (alpha, beta) accurately model the relationship between BM25 scores and relevance in your specific dataset.

Step 1: Collect BM25 Score Distribution¶

Run a representative set of queries and collect the raw BM25 scores:

# Export BM25 scores for calibration queries
indentiagraph admin search calibrate \
  --profile prod \
  --queries calibration_queries.txt \
  --output bm25_scores.jsonl \
  --top-k 200

The output contains one JSON object per query-document pair:

{"query": "knowledge graph federation", "doc_id": "doc_4821", "bm25_score": 8.42}
{"query": "knowledge graph federation", "doc_id": "doc_1203", "bm25_score": 6.17}

Step 2: Auto-Estimate Parameters¶

Use the from_scores method, which auto-estimates the following parameters from the score distribution:

beta = median of all BM25 scores (the score at which P(relevant) = 0.5)
alpha = 1 / std(scores) (how sharply the sigmoid transitions)
base_rate = fraction of documents above the 95th percentile (~0.05)

# Auto-estimate and save calibration profile
indentiagraph admin search calibrate \
  --profile prod \
  --input bm25_scores.jsonl \
  --save-as my_corpus_calibration

Step 3: Validate with Relevance Labels (Optional)¶

If you have relevance judgments (e.g., from user click data), use supervised training for better calibration:

# Train with relevance labels
indentiagraph admin search calibrate \
  --profile prod \
  --input bm25_scores.jsonl \
  --labels relevance_judgments.jsonl \
  --training-mode balanced \
  --save-as my_corpus_calibration_v2

The labels file contains binary relevance judgments:

{"query": "knowledge graph federation", "doc_id": "doc_4821", "relevant": true}
{"query": "knowledge graph federation", "doc_id": "doc_1203", "relevant": false}

Step 4: Apply Calibration Profile¶

[search]
hybrid_scorer          = "bayesian"
calibration_profile    = "my_corpus_calibration_v2"

Or per-query:

{
  "retriever": {
    "bayesian": {
      "calibration_profile": "my_corpus_calibration_v2",
      "retrievers": [...]
    }
  }
}

A/B Testing Patterns¶

Compare BB25 against RRF and Linear fusion in production using IndentiaDB's scorer selection.

Per-Query Scorer Selection¶

Route a percentage of traffic to each scorer for comparison:

import requests
import random

def hybrid_search(query: str, embedding: list[float]) -> dict:
    """Execute hybrid search with A/B test scorer selection."""
    # Uniform three-way split across the scorers under test
    scorer = random.choice(["bayesian", "rrf", "linear"])

    body = {
        "retriever": {
            scorer: {
                "weight": 0.6,
                "rank_window_size": 100,
                "retrievers": [
                    {"standard": {"query": {"match": {"content": query}}}},
                    {"knn": {"field": "embedding", "query_vector": embedding, "k": 100}},
                ],
            }
        },
        "size": 10,
        "explain": True,
    }

    resp = requests.post(
        "http://localhost:7001/my-index/_search",
        json=body,
        timeout=10,  # avoid hanging indefinitely on an unresponsive node
    )
    resp.raise_for_status()
    result = resp.json()

    # Return a compact record that the caller can log for offline analysis
    return {
        "scorer": scorer,
        "query": query,
        "hits": [h["_id"] for h in result["hits"]["hits"]],
        "scores": [h["_score"] for h in result["hits"]["hits"]],
    }

Offline Evaluation with NDCG¶

Compare scorers using a held-out evaluation set:

# Run evaluation suite against all three scorers
indentiagraph admin search evaluate \
  --profile prod \
  --queries eval_queries.jsonl \
  --labels eval_labels.jsonl \
  --scorers bayesian,rrf,linear \
  --metrics ndcg@10,mrr,precision@5

Output:

Scorer     | NDCG@10 | MRR   | P@5
-----------|---------|-------|------
bayesian   | 0.9149  | 0.891 | 0.842
rrf        | 0.8470  | 0.823 | 0.780
linear     | 0.8310  | 0.801 | 0.762
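
For reference, NDCG@10 for a single query can be computed in a few lines. This sketch uses the linear-gain form of DCG, which coincides with the exponential-gain form for binary labels; which variant the evaluate command uses is not stated here, so treat that choice as an assumption.

```python
import math


def dcg(gains: list[float]) -> float:
    """Discounted cumulative gain with a log2(rank + 1) discount, ranks from 1."""
    return sum(g / math.log2(position + 2) for position, g in enumerate(gains))


def ndcg_at_k(ranked_doc_ids: list[str], relevance: dict[str, float], k: int = 10) -> float:
    """NDCG@k for one query; documents without a label count as non-relevant."""
    gains = [relevance.get(doc_id, 0.0) for doc_id in ranked_doc_ids[:k]]
    ideal = sorted(relevance.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


labels = {"doc_4821": 1.0, "doc_1203": 0.0}
perfect = ndcg_at_k(["doc_4821", "doc_1203"], labels)
swapped = ndcg_at_k(["doc_1203", "doc_4821"], labels)
```

The reported figure for a scorer is then the mean of this value over all evaluation queries.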

Multi-Field BB25¶

When searching across multiple fields (title, body, metadata), each field produces its own BM25 score. BB25 calibrates each field independently and fuses them using log-odds conjunction.

Configuration¶

{
  "retriever": {
    "bayesian": {
      "weight": 0.6,
      "multi_field": {
        "title":   {"alpha": 2.5, "beta": 3.0, "field_weight": 0.4},
        "content": {"alpha": 1.2, "beta": 6.0, "field_weight": 0.5},
        "tags":    {"alpha": 3.0, "beta": 1.5, "field_weight": 0.1}
      },
      "gating": "swish",
      "retrievers": [
        {"standard": {"query": {"multi_match": {"query": "graph database", "fields": ["title", "content", "tags"]}}}},
        {"knn": {"field": "embedding", "query_vector": [0.12, -0.34, 0.89], "k": 100}}
      ]
    }
  }
}

How Multi-Field Fusion Works¶

Each field produces a BM25 score, calibrated independently with its own alpha/beta.
The calibrated probabilities are converted to logits.
The gating function (e.g., Swish) is applied to each logit — this prevents uninformative fields (negative logit) from dominating.
Logits are combined via weighted sum: field_weight_i * gated_logit_i.
The fused logit is converted back to a probability via sigmoid.
This probability is then fused with the kNN probability via balanced log-odds fusion.

The swish gating is recommended for multi-field search because it smoothly attenuates negative evidence rather than hard-clipping it (like relu).
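
The field-level part of these steps (logit, gate, weighted sum, sigmoid) can be sketched as below. The function names and the example probabilities are hypothetical; the field weights are taken from the configuration above.

```python
import math


def sigmoid(x: float) -> float:
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def logit(p: float) -> float:
    return math.log(p / (1.0 - p))


def identity(z: float) -> float:
    return z


def relu(z: float) -> float:
    return max(0.0, z)


def swish(z: float) -> float:
    return z * sigmoid(z)


def fuse_fields(field_probs: dict[str, float], field_weights: dict[str, float],
                gate=swish) -> float:
    """Gate each field's logit, take the weighted sum, map back to a probability."""
    fused_logit = sum(
        field_weights[name] * gate(logit(p)) for name, p in field_probs.items()
    )
    return sigmoid(fused_logit)


# Calibrated per-field probabilities for one document (illustrative values).
field_probs = {"title": 0.9, "content": 0.6, "tags": 0.2}
field_weights = {"title": 0.4, "content": 0.5, "tags": 0.1}

p_none = fuse_fields(field_probs, field_weights, gate=identity)
p_relu = fuse_fields(field_probs, field_weights, gate=relu)
p_swish = fuse_fields(field_probs, field_weights, gate=swish)
```

Relu gives the highest fused probability here because the weak tags field cannot subtract anything, while Swish damps both the positive and the negative logits.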

Diagnostics and Debugging¶

Interpreting Explain Output¶

Enable "explain": true to get per-hit score breakdowns:

{
  "_id": "doc_4821",
  "_score": 0.892,
  "_explanation": {
    "value": 0.892,
    "description": "bayesian_hybrid",
    "details": [
      {
        "bm25_raw": 8.42,
        "bm25_likelihood": 0.934,
        "composite_prior": 0.72,
        "bm25_posterior": 0.891,
        "base_rate_adjusted": 0.847,
        "bm25_logit": 1.71,
        "bm25_logit_norm": 0.873
      },
      {
        "knn_cosine": 0.82,
        "knn_probability": 0.91,
        "knn_logit": 2.31,
        "knn_logit_norm": 0.912
      },
      {
        "fusion_weight": 0.6,
        "final_score": 0.892
      }
    ]
  }
}

Reading the Breakdown¶

Field	Meaning	Healthy Range
`bm25_likelihood`	Sigmoid output from raw BM25	0.0 - 1.0
`composite_prior`	Prior from term freq + doc length	0.1 - 0.9
`bm25_posterior`	Bayesian posterior (likelihood x prior)	0.0 - 1.0
`base_rate_adjusted`	After corpus base-rate correction	Slightly lower than posterior
`bm25_logit_norm`	Min-max normalized logit (0 = lowest in the result set, 1 = highest)	0.0 - 1.0
`knn_logit_norm`	Normalized kNN logit	0.0 - 1.0
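
When a breakdown looks suspicious, a few of its fields can be cross-checked against the formulas from the earlier sections. The helper below is a hypothetical debugging aid, not part of IndentiaDB; it assumes the field names shown in the explain output and allows for rounding with a tolerance of 0.01.

```python
import math


def logit(p: float) -> float:
    return math.log(p / (1.0 - p))


def check_explain(bm25: dict, knn: dict, tol: float = 0.01) -> list[str]:
    """Return the names of explain fields that disagree with their inputs."""
    problems = []
    # kNN probability should equal (1 + cosine) / 2.
    if abs((1.0 + knn["knn_cosine"]) / 2.0 - knn["knn_probability"]) > tol:
        problems.append("knn_probability")
    # Each logit should be the log-odds of the probability it was derived from.
    if abs(logit(knn["knn_probability"]) - knn["knn_logit"]) > tol:
        problems.append("knn_logit")
    if abs(logit(bm25["base_rate_adjusted"]) - bm25["bm25_logit"]) > tol:
        problems.append("bm25_logit")
    return problems


# Values copied from the explain output above.
bm25_details = {"base_rate_adjusted": 0.847, "bm25_logit": 1.71}
knn_details = {"knn_cosine": 0.82, "knn_probability": 0.91, "knn_logit": 2.31}
issues = check_explain(bm25_details, knn_details)
```

An empty list means the checked fields are mutually consistent; any name in the list points at the stage to inspect first.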

Common Issues¶

Issue: All BM25 posteriors are very high (>0.95)

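The sigmoid is saturating: either alpha is too large (the curve is too steep) or beta sits below most of the observed scores. Lower alpha manually, check that beta is close to the median score, or recalibrate with a larger score sample: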

{"retriever": {"bayesian": {"alpha": 0.5, "retrievers": [...]}}}

Issue: kNN dominates despite low cosine similarity

The fusion weight is too high. Lower it to give BM25 more influence:

{"retriever": {"bayesian": {"weight": 0.3, "retrievers": [...]}}}

Issue: Documents absent from one retriever are ranked poorly

This is expected — absent documents get logit(0.5) = 0.0 (neutral). Increase rank_window_size to ensure both retrievers return more candidates:

{"retriever": {"bayesian": {"rank_window_size": 500, "retrievers": [...]}}}

SPARQL Integration¶

BB25 hybrid search can be invoked from SPARQL queries using the IndentiaDB search: function namespace:

Basic Hybrid Search in SPARQL¶

PREFIX search: <http://indentia.ai/vocab/search#>
PREFIX ex:     <http://example.org/>

SELECT ?doc ?title ?score WHERE {
    (?doc ?score) search:bayesian (
        "knowledge graph federation"    # query text
        0.6                              # kNN weight
        100                              # rank window size
    ) .
    ?doc ex:title ?title .
}
ORDER BY DESC(?score)
LIMIT 10

Hybrid Search with Filters¶

Combine BB25 ranking with SPARQL graph pattern filters:

PREFIX search: <http://indentia.ai/vocab/search#>
PREFIX ex:     <http://example.org/>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX xsd:     <http://www.w3.org/2001/XMLSchema#>

SELECT ?doc ?title ?score ?date WHERE {
    (?doc ?score) search:bayesian (
        "temporal RDF querying"
        0.5
        200
    ) .
    ?doc ex:title   ?title ;
         dcterms:date ?date ;
         dcterms:language "en" .
    FILTER(?date > "2024-01-01"^^xsd:date)
}
ORDER BY DESC(?score)
LIMIT 20

SPARQL with Explicit Calibration Parameters¶

PREFIX search: <http://indentia.ai/vocab/search#>

SELECT ?doc ?score WHERE {
    (?doc ?score) search:bayesian (
        "SPARQL optimization"   # query
        0.5                      # weight
        100                      # rank_window_size
        1.5                      # alpha (optional)
        6.0                      # beta (optional)
        0.03                     # base_rate (optional)
    ) .
}
ORDER BY DESC(?score)
LIMIT 10