Why Metis Isn't "Just RAG"
A technical position paper for evaluators
Most AI knowledge tools that have come on the market in the last two years describe themselves as "RAG" — Retrieval-Augmented Generation. The term has become close to meaningless, because what passes for RAG ranges from carefully engineered systems with verifiable grounding all the way down to a one-line script that pastes a few search results into a ChatGPT prompt. Both call themselves the same thing. Their failure modes are not the same.
This document explains what stock RAG actually does, where it fails, and the specific engineering choices Metis makes to address those failure modes. It is written for someone evaluating whether to trust an AI system with a document estate that has consequences attached — a borough's records, a department's institutional knowledge, a firm's case files.
How stock RAG works (and why it's risky)
A typical "RAG pipeline" has two steps:
- Retrieve. The user's question is converted to a numeric vector (an "embedding"), and the system finds the top-K chunks of text in the document corpus whose embeddings are closest to the question's. Closeness is measured by cosine similarity in a 384- or 768-dimensional space.
- Generate. Those chunks are dumped into a prompt and a Large Language Model is asked to answer the question using them as context.
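In code, the whole of stock RAG often fits in a screenful. The sketch below is illustrative rather than any particular product's implementation; it assumes a sentence-transformers embedder and a FAISS index, which is the most common pairing.

```python
# Minimal stock-RAG sketch: embed the corpus, retrieve top-K by cosine
# similarity, paste the chunks into a prompt. Illustrative only.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")   # 384-dimensional embeddings

chunks = ["Overtime requires the manager's written approval.",
          "Vacation requests go to HR at least two weeks in advance."]
vectors = embedder.encode(chunks, normalize_embeddings=True)

index = faiss.IndexFlatIP(vectors.shape[1])          # inner product == cosine on unit vectors
index.add(np.asarray(vectors, dtype="float32"))

question = "Do I need approval for overtime?"
q = embedder.encode([question], normalize_embeddings=True)
_, ids = index.search(np.asarray(q, dtype="float32"), k=2)

prompt = ("Answer using only this context:\n"
          + "\n".join(chunks[i] for i in ids[0])
          + f"\n\nQuestion: {question}")
# prompt is then sent to an LLM -- with no check that the context actually answers it
```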
This works passably when the documents are clean, the question is direct, and there's no ambiguity. It fails — sometimes silently, sometimes spectacularly — in five specific situations that show up constantly in real document estates:
Failure 1 — Embedding similarity confuses opposites.
The sentences "Overtime requires the manager's written approval" and "Overtime does not require the manager's written approval" have nearly identical embeddings. They are opposite in meaning. Pure semantic retrieval cannot reliably distinguish them. A user asking "do I need approval for overtime?" can be returned the wrong sentence — confidently — and stock RAG has no mechanism to notice.
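The effect is easy to reproduce. A minimal check, assuming a typical small embedding model such as all-MiniLM-L6-v2:

```python
# Cosine similarity of a sentence and its negation -- typically very high,
# which is why pure semantic retrieval cannot tell them apart.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
a = model.encode("Overtime requires the manager's written approval.")
b = model.encode("Overtime does not require the manager's written approval.")
print(util.cos_sim(a, b))  # usually far above the score of any unrelated sentence
```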
Failure 2 — Silent contradictions.
If two retrieved chunks contradict each other (a 2017 policy and a 2024 policy that disagree, or a vendor contract and a council resolution that conflict), stock RAG concatenates them into the prompt and asks the LLM to answer. The LLM is optimized for fluent completion, not for refusal — so it picks one position, usually whichever appears first or sounds more authoritative, and presents it as the answer. The user has no way to know the documents disagreed.
Failure 3 — Confident hallucination from weak retrieval.
When the retrieved chunks don't actually contain the answer, the LLM does not say "I don't know." It generates plausible-sounding text that resembles an answer drawn from the documents — sometimes with citation markers that look authoritative — and the user, who has no easy way to verify, treats it as fact.
Failure 4 — No document-age awareness.
Stock RAG treats a superseded 2014 policy and the current 2024 replacement as equally relevant if their text is similar. Whichever version surfaces wins by accident, not by design. In a document estate where the same procedure has been re-issued five times over twenty years, this is dangerous: the system can confidently cite the wrong version.
Failure 5 — Decisions evaporate.
When a human notices that a stock-RAG system has flagged a contradiction or surfaced a stale document and resolves it manually — say, by deciding which version is operative — that decision is not remembered. The next time the same question is asked, the same contradiction is re-surfaced and the same human time is re-spent. The system does not learn from review.
These five failures aren't theoretical. They appear in the first hour of using any stock-RAG system against a real corpus.
What Metis does instead
Metis is built around the assumption that the cost of an unflagged wrong answer is much higher than the cost of an extra processing step. Every architectural decision below trades performance for trustworthiness.
A five-stage pipeline, not one retrieval step
Stage 1 — Query understanding. Before any documents are touched, Metis can rewrite the question against prior conversation history (turning "what about overtime?" into "what is the overtime approval policy for DPW?") and decompose compound questions into separately retrieved sub-queries. This catches the cases where the user's literal phrasing doesn't match the way the documents are written.
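A sketch of what that stage can look like. The prompt wording and JSON shape are illustrative assumptions, not the exact Metis implementation; `llm` stands in for any chat-completion call that returns text.

```python
# Illustrative Stage 1 sketch: resolve pronouns/ellipsis against the
# conversation and split compound questions into sub-queries.
import json

def understand_query(llm, history: str, question: str) -> dict:
    prompt = (
        "Given the conversation history and the latest question, return JSON with "
        'keys "rewritten" (a standalone question) and "sub_queries" (a list of '
        "separately answerable sub-questions).\n"
        f"History:\n{history}\n"
        f"Question: {question}"
    )
    # e.g. {"rewritten": "What is the overtime approval policy for DPW?", "sub_queries": [...]}
    return json.loads(llm(prompt))
```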
Stage 2 — Hybrid retrieval, two independent channels. This is the most important architectural choice and addresses Failure 1 directly:
- Channel A: semantic retrieval — the question and document chunks are embedded with all-MiniLM-L6-v2 (sentence-transformers); FAISS finds the top ~20 chunks by cosine similarity.
- Channel B: lexical retrieval — BM25Okapi (a classic term-frequency ranking model that search engines have relied on for decades) independently finds the top ~20 chunks by keyword overlap.
- The two ranked lists are merged with Reciprocal Rank Fusion: chunks that score well in both channels rise to the top. This catches the cases where semantic search misses (exact policy numbers, vendor names, ordinance citations) and the cases where keyword search misses (paraphrased questions, conceptual matches). Negation-sensitive matches (which trip pure semantic retrieval) are often surfaced by the lexical channel.
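Reciprocal Rank Fusion itself is only a few lines. The sketch below uses the conventional k = 60 smoothing constant; the exact constant Metis uses is not specified here.

```python
# Reciprocal Rank Fusion: merge the semantic and lexical ranked lists.
# Chunks ranked well by both channels receive the highest fused score.
def rrf_merge(semantic_ids: list[str], lexical_ids: list[str], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranked in (semantic_ids, lexical_ids):
        for rank, chunk_id in enumerate(ranked, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```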
Stage 3 — Cross-encoder reranking. The top ~20 candidates from hybrid retrieval pass through a separate model — ms-marco-MiniLM-L-6-v2 — that scores each candidate jointly with the question, rather than independently. This is computationally more expensive than retrieval but dramatically more accurate, especially for distinguishing semantically-similar-but-factually-opposite passages. Top 5–8 survive to the LLM.
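A minimal reranking sketch using the sentence-transformers CrossEncoder wrapper for that model; the keep count of 8 matches the 5–8 survivors mentioned above, but the exact wiring is illustrative.

```python
# Rerank the fused candidates with a cross-encoder that reads the question
# and each passage together, then keep only the best few for the LLM.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(question: str, candidates: list[str], keep: int = 8) -> list[str]:
    scores = reranker.predict([(question, passage) for passage in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [passage for passage, _ in ranked[:keep]]
```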
Stage 4 — Dedicated conflict detection. Before the answer LLM sees the retrieved chunks, a separate LLM call examines them with a single, narrowly scoped task: "Do these chunks materially disagree on facts, numbers, policies, or procedures — or do they merely cover different aspects of the topic?" The conflict-detection LLM reads the actual chunk text (not just embeddings) and returns structured output identifying which sources argue which positions. This pass exists specifically to address Failure 2: stock RAG silently picks one chunk; Metis pulls the disagreement out and surfaces it.
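A sketch of what that narrowly scoped call can look like. The prompt wording and the returned JSON shape are assumptions for illustration; `llm` again stands for any chat-completion function.

```python
# Illustrative conflict-detection pass: a separate LLM call that reads the
# retrieved chunk text and returns structured disagreements.
import json

def detect_conflicts(llm, chunks: list[dict]) -> dict:
    numbered = "\n\n".join(f"[{i + 1}] ({c['source']}): {c['text']}"
                           for i, c in enumerate(chunks))
    prompt = (
        "Do these excerpts materially disagree on facts, numbers, policies, or "
        "procedures, or do they merely cover different aspects of the topic? "
        'Return JSON: {"conflict": true or false, "positions": '
        '[{"sources": [1, 3], "claim": "..."}]}\n\n' + numbered
    )
    return json.loads(llm(prompt))
```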
Stage 5 — Grounded answer with enforced citations. The answer LLM is given the retrieved chunks and required to produce structured JSON output that includes:
- The answer text with inline [N] citation markers
- An answer_basis field — one sentence naming which documents grounded the answer
- A per-source evidence_summary explaining why each source was used
- An explicit confidence rating (high / medium / low)
- A gaps field listing what was asked but not present in the evidence
The system prompt explicitly instructs: "If the evidence does not clearly support an answer, set confidence to low and list what's missing in gaps. Never invent information." The frontend renders citations as required UI — not optional decoration — so the user clicks any claim and sees the source. This addresses Failure 3.
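One way to enforce that contract is to validate the LLM's JSON before it ever reaches the frontend. The field names below follow the list above; the name of the answer-text field ("answer") and the specific rejection rules are illustrative assumptions.

```python
# Lightweight validation of the structured answer contract: reject any
# output missing required fields, a confidence rating, or citations.
import json
import re

REQUIRED_FIELDS = {"answer", "answer_basis", "evidence_summary", "confidence", "gaps"}

def parse_answer(raw: str) -> dict:
    data = json.loads(raw)
    missing = REQUIRED_FIELDS - data.keys()
    if missing:
        raise ValueError(f"Answer missing required fields: {missing}")
    if data["confidence"] not in {"high", "medium", "low"}:
        raise ValueError("Confidence must be high, medium, or low")
    if not re.search(r"\[\d+\]", data["answer"]) and data["confidence"] != "low":
        raise ValueError("Non-low-confidence answers must cite at least one source")
    return data
```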
Beyond the pipeline: what makes Metis defensible over time
Pre-ingest classification preview (V1.7). Before a single chunk is embedded, the operator can run a classification preview that exposes every document's auto-detected tags — jurisdictional level, regulatory force, age, type, extracted date — side-by-side with confidence levels and method (regex heuristic vs. LLM classifier vs. manual override). For any document where the classification looks wrong, the operator corrects it in one click and the correction is persisted. The actual ingest, when committed, honors every override automatically. This addresses a failure mode stock RAG ignores: confidently operating on badly-classified data.
Hierarchical regulatory awareness (V1.6). In any organization that operates under a regulatory hierarchy — local government under state and federal law, hospitals under CMS and state health code, contractors under OSHA and state labor codes — answers about local procedure that ignore the binding higher-authority constraint are wrong in a way that matters. A borough manager asks "do we need a public hearing for this zoning variance?" — the borough procedure may say "Council discretion," but the PA Sunshine Act may make it mandatory regardless. Without hierarchy, stock RAG gives a confidently wrong answer.
Metis tags every document at ingestion with three independent fields: a jurisdictional level (federal → state → county → municipal → department → team), a named entity (e.g. pennsylvania, castle_shannon), and a regulatory force (binding | guidance | internal | informational). Tags are derived by a cascade of filename heuristic + optional LLM classifier + administrator override.
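In data-structure terms, the tags might look like the sketch below; the field names mirror the prose and the enum values are the ones listed, but the exact schema is an assumption.

```python
# Illustrative shape of the per-document regulatory tags.
from dataclasses import dataclass
from enum import Enum

class Level(str, Enum):
    FEDERAL = "federal"
    STATE = "state"
    COUNTY = "county"
    MUNICIPAL = "municipal"
    DEPARTMENT = "department"
    TEAM = "team"

class Force(str, Enum):
    BINDING = "binding"
    GUIDANCE = "guidance"
    INTERNAL = "internal"
    INFORMATIONAL = "informational"

@dataclass
class RegulatoryTags:
    level: Level      # e.g. Level.STATE
    entity: str       # e.g. "pennsylvania", "castle_shannon"
    force: Force      # e.g. Force.BINDING
    source: str       # "filename_heuristic" | "llm_classifier" | "admin_override"
```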
At query time, when the retrieval surfaces local-procedure documents, a second retrieval pass finds binding higher-authority documents on the same topic. Both are sent to the answer LLM with a prompt that requires the higher-authority constraint to be surfaced explicitly — "Subject to X..." or "However, the state statute requires Y..." The conflict-finder is extended to specifically flag local-vs-higher-authority disagreements as the most consequential class of conflict, with elevated visual severity in the UI.
This is not jurisdiction-specific. It generalizes to every multi-layer organization Metis serves. Government is the lead demo because the hierarchy is sharpest there, but the same engineering lifts every other vertical.
Document-age awareness. Every document carries a doc_age tag — current, legacy, or draft — derived from a four-signal pipeline that runs at ingestion:
- Filename + path heuristics (e.g. /old/, /archive/, filename markers like superseded, pre-2020, draft, wip)
- Date extraction from multiple sources — filename patterns, document content (looking for context phrases like "Adopted:", "Effective:", "Revised:" preceding a date), file-format metadata (PDF ModDate, DOCX core properties), and filesystem mtime — combined with weighted confidence
- Optional LLM age classifier (configurable per deployment) that reads the first ~1500 chars of the document and detects subtle signals the regex misses: language tense, references to newer versions, draft indicators that don't appear in the filename
- Date-based demotion — documents whose extracted year is more than 7 years old are downgraded to legacy with low confidence, surfaced for admin review
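A sketch of how those four signals could combine into a single doc_age tag with a confidence value. The weights and most thresholds are assumptions; only the more-than-7-years demotion rule comes from the list above.

```python
# Illustrative combination of the age signals into (doc_age, confidence).
from datetime import date

LEGACY_MARKERS = ("/old/", "/archive/", "superseded", "pre-2020", "draft", "wip")

def classify_age(path: str, extracted_year: int | None,
                 llm_label: str | None = None) -> tuple[str, float]:
    path_l = path.lower()
    if any(marker in path_l for marker in LEGACY_MARKERS):
        # Filename/path heuristics are the strongest signal
        return ("draft", 0.9) if ("draft" in path_l or "wip" in path_l) else ("legacy", 0.9)
    if llm_label in ("legacy", "draft"):
        return llm_label, 0.7                    # optional LLM age classifier
    if extracted_year and date.today().year - extracted_year > 7:
        return "legacy", 0.4                     # date-based demotion: low confidence, flagged for review
    return "current", 0.6
```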
Beyond initial classification, the system actively suggests supersession relationships post-ingestion: same-typed documents with high semantic similarity but different dates are flagged as candidate predecessor/successor pairs, with the older document proposed as superseded. The administrator reviews each suggestion with a single-click accept (which automatically applies the override and demotes the older document in retrieval) or dismiss (preserving the audit trail of the decision). Accepted overrides are reversible; dismissed suggestions remain visible in the audit log.
This addresses Failure 4. The legacy version is not deleted — you may need it later for audit, Right-to-Know Law (RTKL) response, or historical context — but it is no longer the system's first choice for answering questions about current state.
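The post-ingest supersession pass described above could look roughly like the following; the similarity threshold and the document fields (id, doc_type, year, embedding) are illustrative assumptions.

```python
# Illustrative supersession suggestions: same-typed documents with high
# semantic similarity but different dates become candidate pairs for review.
from itertools import combinations
from sentence_transformers import util

def suggest_supersessions(docs, threshold: float = 0.85) -> list[dict]:
    """docs: iterable of objects with .id, .doc_type, .year, .embedding (illustrative)."""
    suggestions = []
    for a, b in combinations(docs, 2):
        if a.doc_type != b.doc_type or a.year == b.year:
            continue
        if float(util.cos_sim(a.embedding, b.embedding)) >= threshold:
            older, newer = sorted((a, b), key=lambda d: d.year)
            suggestions.append({"superseded": older.id,
                                "superseded_by": newer.id,
                                "status": "pending_review"})
    return suggestions
```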
Persistent review decisions. When an administrator reviews a flagged conflict and dismisses it with a reason ("the 2017 policy was superseded by Council Resolution 22-04"), the dismissal is recorded with the reviewer's name, timestamp, and rationale. Future queries that surface a conflict between the same source pair suppress the flag automatically — the system has learned. Dismissals are reversible: if circumstances change, an administrator can revert with a reason, and the original entry is preserved with a reverted status. The audit trail does not erase. This addresses Failure 5.
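In sketch form, a dismissal record and the suppression check on later queries might look like this; the field names and the list-backed store are illustrative assumptions.

```python
# Illustrative persistence of a conflict dismissal plus the suppression
# check applied when the same source pair conflicts again.
from datetime import datetime, timezone

def dismiss_conflict(store: list, source_pair: tuple[str, str],
                     reviewer: str, reason: str) -> None:
    store.append({
        "sources": tuple(sorted(source_pair)),
        "reviewer": reviewer,
        "reason": reason,                        # e.g. "superseded by Council Resolution 22-04"
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "status": "dismissed",                   # a revert sets status to "reverted"; the entry is never deleted
    })

def is_suppressed(store: list, source_pair: tuple[str, str]) -> bool:
    key = tuple(sorted(source_pair))
    return any(d["sources"] == key and d["status"] == "dismissed" for d in store)
```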
Combined confidence scoring. Every answer carries three confidence signals: the strength of retrieval (did the corpus actually contain relevant material?), the LLM's confidence in the answer it produced, and whether the underlying sources conflicted. These combine into a single answer_confidence shown to the user. When sources are weak or contested, the answer is explicitly downgraded — and the user sees that, rather than a confident wrong answer.
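The precise weighting isn't specified here, but the "weakest link" combination could be sketched like this, with contested sources capping the result:

```python
# Illustrative combination of the three signals into answer_confidence.
def combine_confidence(retrieval_strength: str, llm_confidence: str,
                       sources_conflict: bool) -> str:
    order = {"low": 0, "medium": 1, "high": 2}
    combined = min(order[retrieval_strength], order[llm_confidence])
    if sources_conflict:
        combined = min(combined, order["medium"])   # contested sources cap confidence
    return {0: "low", 1: "medium", 2: "high"}[combined]
```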
What Metis does not do
In the interest of being honest about scope:
- Metis is not a record-management system. It sits on top of whatever record system the customer already has. Documents are ingested but originals remain authoritative.
- Metis does not provide legal advice. Citations point to source documents; interpretation belongs to lawyers and the customer's professional advisors.
- Metis is not infallible. Even with all of the above, the LLM that generates the answer is still a probabilistic system. The combined confidence score and the citation-based UI are designed to make wrongness visible to the user — not to eliminate it.
The honest claim is narrow and defensible: Metis is engineered so that when it doesn't know, it tells you; when sources disagree, it shows you both; and when the answer is grounded, you can verify it in two clicks. That's a meaningfully different shape than stock RAG, and it's the shape that matters in any context where the cost of an unflagged wrong answer is real.
A one-line summary
Stock RAG retrieves a few documents that look similar to your question and asks an LLM to answer — which means it can confidently lie when documents disagree, when the answer isn't in the corpus, or when retrieval finds something semantically close but factually wrong. Metis runs two independent retrieval channels fused by rank, a separate model that reads the candidates jointly with your question, a dedicated pass that detects when documents disagree, and a structured-output contract that forces the LLM to either cite specific evidence or admit it doesn't know. Same building blocks, different engineering choices, different failure mode.