v0.1.1 · MIT · alpha

Memory your AI agents can actually trust.

The open-source reliability layer for AI agent memory — contradiction detection, health scoring, and consolidation that wraps your existing vector backend.

Benchmark — Track 2 · Behavioral

Correctness is the easy part.

7 real-world contradiction scenarios across 3 systems, scored on correctness, conflict signal, and fact preservation. All three return the right answer. Only memnotary tells you when another answer is wrong.

6/7

contradiction scenarios flagged

2.0×

conflict detection rate vs Mem0

0.94

overall score · highest tested

System         Overall   Signal (flag rate)   Risk
memnotary      0.94      0.86                 LOW
mem0           0.77      0.43                 MEDIUM
naive-qdrant   0.71      0.29                 MEDIUM

Signal measures whether conflicting or superseded memories are flagged in results. Correctness is 1.00 for all three systems — the difference is whether the wrong answer is silently returned alongside the right one.
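The Signal column reduces to a flag rate over conflict scenarios. A minimal sketch of how such a metric can be computed (the scenario records here are illustrative, not the benchmark's actual fixtures):

```python
# Signal = fraction of conflict scenarios where the system surfaced
# a flag on the conflicting or superseded memory in its results.
def flag_rate(scenarios: list[dict]) -> float:
    """scenarios: [{'conflict_expected': bool, 'flagged': bool}, ...]"""
    relevant = [s for s in scenarios if s["conflict_expected"]]
    if not relevant:
        return 0.0
    return sum(s["flagged"] for s in relevant) / len(relevant)

# 7 contradiction scenarios, 6 flagged -> 6/7, i.e. memnotary's 0.86 Signal
scenarios = [{"conflict_expected": True, "flagged": i < 6} for i in range(7)]
print(round(flag_rate(scenarios), 2))  # 0.86
```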

Read the full benchmark

Live demo

The full reliability loop, in real time.

store → detect → score → consolidate. No API key, no external service — the full loop runs in a terminal with an in-memory adapter.

Open in asciinema ↗

How it works

Three calls. Full reliability loop.

01

Wrap your existing vector backend

One constructor. Works with Qdrant, Chroma, pgvector, or in-memory. Nothing in your storage layer changes.

from memnotary import (
    Memnotary,
    ContradictionDetector,
    QdrantAdapter,
)
from qdrant_client import QdrantClient

mn = Memnotary(
    QdrantAdapter(QdrantClient(url="localhost:6333")),
    detector=ContradictionDetector(llm_fn=your_llm),
)
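The adapter argument implies a small storage contract behind the constructor. A hypothetical sketch of what that interface could look like (the method names and signatures here are assumptions for illustration, not memnotary's published API):

```python
from typing import Protocol

class VectorAdapter(Protocol):
    """Minimal storage contract a backend adapter might satisfy (assumed)."""
    def upsert(self, memory_id: str, embedding: list[float], payload: dict) -> None: ...
    def search(self, embedding: list[float], limit: int) -> list[dict]: ...

class InMemoryAdapter:
    """Toy adapter: brute-force dot-product search over a dict."""
    def __init__(self) -> None:
        self._rows: dict[str, tuple[list[float], dict]] = {}

    def upsert(self, memory_id: str, embedding: list[float], payload: dict) -> None:
        self._rows[memory_id] = (embedding, payload)

    def search(self, embedding: list[float], limit: int) -> list[dict]:
        # Rank stored rows by dot product with the query embedding.
        ranked = sorted(
            self._rows.items(),
            key=lambda kv: -sum(a * b for a, b in zip(embedding, kv[1][0])),
        )
        return [{"id": mid, "payload": payload} for mid, (_vec, payload) in ranked[:limit]]

adapter = InMemoryAdapter()
adapter.upsert("a", [1.0, 0.0], {"text": "Refund policy is 30 days"})
adapter.upsert("b", [0.0, 1.0], {"text": "Shipping is free over $50"})
print(adapter.search([1.0, 0.1], limit=1)[0]["id"])  # a
```

Because the contract is this small, swapping Qdrant for Chroma, pgvector, or an in-memory store is a one-line change in the constructor.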

02

Every store runs conflict detection

When a new memory lands, memnotary searches for semantic neighbors above its similarity threshold. Confirmed conflicts become ConflictRecords — inspectable, queryable, actionable.

await mn.store(Memory(agent_id="bot",
    text="Refund policy is 30 days", embedding=...))

await mn.store(Memory(agent_id="bot",
    text="Refund policy changed to 14 days", embedding=...))
⚠ ConflictRecord created direct_contradiction confidence: 0.95
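The detection step above (find semantic neighbors over a similarity threshold, then confirm with the LLM) can be sketched in plain Python; the 0.8 threshold and helper names are assumptions for illustration, not memnotary's internals:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embeddings."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def conflict_candidates(new_emb: list[float],
                        stored: list[tuple[str, list[float]]],
                        threshold: float = 0.8) -> list[str]:
    """IDs of stored memories close enough to warrant an LLM contradiction check."""
    return [mid for mid, emb in stored if cosine(new_emb, emb) >= threshold]

stored = [("m1", [1.0, 0.0, 0.0]), ("m2", [0.0, 1.0, 0.0])]
print(conflict_candidates([0.9, 0.1, 0.0], stored))  # ['m1']
```

Only the candidates that survive this cheap geometric filter reach the LLM, which is what keeps per-store detection affordable.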

03

Health scoring tells you when to act

health() returns a snapshot across your full memory collection. consolidate() plans and executes: supersede, merge, or flag for human review.

h = await mn.health("bot")
# contradiction_score: 1.00   risk: HIGH

await mn.consolidate("bot")
✓ plan executed action=merge tier=auto_merge

h = await mn.health("bot")
# contradiction_score: 0.00   risk: LOW
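One simple way a contradiction score and risk tier could be derived from the conflict ledger; this formula and the tier cutoffs are illustrative assumptions, not memnotary's actual scoring:

```python
def contradiction_score(open_conflicts: int, memory_count: int) -> float:
    """Assumed formula: fraction of memories touched by unresolved conflicts,
    counting both sides of each pairwise conflict, capped at 1.0."""
    if memory_count == 0:
        return 0.0
    return min(1.0, 2 * open_conflicts / memory_count)

def risk_tier(score: float) -> str:
    """Map a score to a tier (cutoffs are hypothetical)."""
    return "HIGH" if score >= 0.5 else "MEDIUM" if score >= 0.2 else "LOW"

# Two memories, one open conflict between them -> the 1.00 / HIGH snapshot
print(contradiction_score(1, 2), risk_tier(contradiction_score(1, 2)))  # 1.0 HIGH
# Consolidation merges the pair and closes the conflict -> 0.00 / LOW
print(contradiction_score(0, 2), risk_tier(contradiction_score(0, 2)))  # 0.0 LOW
```

The point of a single scalar plus a tier is operational: it gives you one number to alert on, and a threshold at which running consolidate() is clearly worth it.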

Three calls. Your agent memory goes from a silent liability to an auditable, self-healing store. Get started →