Why Conflict Detection Is Harder Than It Looks
And why most knowledge systems silently fail at it
A Base2ML white paper. Second in a series following The Information Paradox.
The moment most systems fail
Most internal knowledge systems pass a casual demo. Ask them an easy question; they retrieve a relevant document; they paraphrase what it says. The user comes away impressed.
The harder test is what happens when two of the organization's documents disagree about the answer. This case is not exotic. It is the typical case in any organization that has been operating long enough to have produced more than a handful of documents. Policies are revised but old versions persist. Resolutions modify ordinances without updating the underlying SOPs. Departmental procedures evolve while the organization-wide handbook sits frozen. State laws shift while local interpretations lag. The corpus of any working organization is, at any given moment, partially in agreement with itself and partially not.
A useful knowledge system has to answer the question correctly when the documents agree, and surface the disagreement clearly when they don't. The second job is harder than the first. Most systems don't do it well, and the reasons are worth understanding.
What "conflict detection" actually means¶
The phrase is overloaded. Three quite different operations sometimes share the label, and the differences matter.
Detection. The system notices that retrieved documents are not aligned on the topic of the user's question. The output is a binary signal: yes, there's a conflict; no, there isn't. The system does not yet say what the conflict is or which document is right.
Classification. Given a detected conflict, the system distinguishes the kind. Are the documents flatly contradictory, or do they cover different scope? Is one current and the other superseded? Does one have higher authority than the other (state binding over local procedure, federal over state)? Is the conflict substantive or merely a difference in phrasing?
Surfacing. The system communicates the conflict to the user in a way that's actionable. This is a UX problem masquerading as a retrieval problem. A correctly detected, correctly classified conflict that's buried in a footnote does the user no good. A conflict that's surfaced too aggressively — flagging every minor difference in phrasing — trains the user to ignore the warnings.
A useful system has to do all three. Most systems do the first two badly and the third hardly at all.
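To make the three operations concrete, here is a minimal sketch in Python. The names and the particular set of conflict kinds are illustrative assumptions, not a prescribed schema: detection produces the flag, classification fills in the kind, and surfacing consumes the whole record.

```python
from dataclasses import dataclass
from enum import Enum, auto

class ConflictKind(Enum):
    """Classification: kinds a detected conflict can fall into (illustrative set)."""
    FLAT_CONTRADICTION = auto()  # incompatible claims at the same level
    SCOPE_MISMATCH = auto()      # the documents cover different scope
    SUPERSESSION = auto()        # one document is current, the other superseded
    HIERARCHICAL = auto()        # higher authority constrains a lower-level document
    PHRASING_ONLY = auto()       # same substance, different wording

@dataclass
class ConflictFlag:
    """Output of detection plus classification, handed to the surfacing layer."""
    doc_a: str         # document identifiers
    doc_b: str
    passage_a: str     # the specific passages that ground the conflict
    passage_b: str
    kind: ConflictKind
```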
Why detection alone is insufficient
A detection-only system has a recognizable failure mode. It produces an answer, picks one of the conflicting documents to base it on, and tags the response with a "low confidence" indicator or a generic "the documents may disagree" footnote. The user, reading the answer, has no good options. They can't trust the answer, but they also can't see what the alternative position is or which documents support it. The system has identified a problem and then declined to characterize it.
This is what most generic AI assistants produce when pointed at an organization's documents. The retrieval layer surfaces multiple sources; the LLM, asked to synthesize them, smooths over the disagreement to produce a fluent response. If the LLM is prompted to "flag conflicts," it sometimes inserts a vague disclaimer. The disclaimer is not actionable because the underlying conflict was never made visible.
The user's job in a conflict situation is to make a judgment. The system can't make the judgment for them — only the user has the contextual knowledge to know which document is operationally in force, which is a draft, which has been informally superseded by a verbal directive, which the union representative will lean on. The system's job is to make the judgment possible by presenting both positions clearly.
A useful conflict surface, at minimum, shows: each conflicting document by name, the specific passage from each that grounds the conflict, what each document says, and whatever metadata the system has about authority and currency. The user is then equipped to decide.
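As a sketch of that minimum, assuming nothing beyond the fields just listed; the plain-text rendering stands in for whatever the real interface does, and the metadata keys are assumptions:

```python
def render_conflict_surface(doc_a: str, passage_a: str, meta_a: dict,
                            doc_b: str, passage_b: str, meta_b: dict) -> str:
    """Show both positions so the user can judge: each document by name,
    the passage that grounds the conflict, and whatever authority and
    currency metadata exists."""
    def entry(doc: str, passage: str, meta: dict) -> str:
        return (f"  {doc} [authority: {meta.get('authority', 'unknown')}, "
                f"last revised: {meta.get('revised', 'unknown')}]\n"
                f'    "{passage}"')
    return ("CONFLICT:\n"
            + entry(doc_a, passage_a, meta_a) + "\n"
            + entry(doc_b, passage_b, meta_b))
```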
The asymmetry between false positives and false negatives
When the system makes mistakes about conflicts, the two kinds of mistake have very different operational costs.
A false positive — flagging a conflict that isn't real — wastes user time. The user opens the conflict surface, reads both passages, realizes they don't actually disagree (they cover different scope, they say the same thing in different phrasing), dismisses the flag, and moves on. Annoying. Not catastrophic. With a dismissal mechanism that persists ("this conflict has been reviewed and resolved"), false positives are bounded — the same flag doesn't reappear repeatedly.
A false negative — failing to flag a real conflict — is operationally dangerous. The user gets a confident-looking answer based on one of two contradictory documents. They act on it. Later, the contradicting document surfaces in some other context — a grievance, an audit, a legal challenge — and the organization has to explain why it was operating against the wrong policy. The cost is concentrated, traceable to a specific question that should have surfaced both positions, and visible to people whose opinion matters.
This asymmetry suggests the right tuning. A system that detects conflicts should err toward over-detection rather than under-detection, with a dismissal mechanism so the noise floor is bounded for any individual user. The default failure mode should be "the user reviews and dismisses a borderline case in 30 seconds" rather than "the system silently picks one and the organization finds out the hard way."
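A sketch of that tuning, assuming a conflict score in [0, 1] from whatever detector is in use. The threshold is deliberately low, and the fingerprint is what lets a single dismissal bound the noise floor instead of the same flag reappearing on every query:

```python
import hashlib

FLAG_THRESHOLD = 0.3  # illustrative; set low on purpose to err toward over-detection

def conflict_fingerprint(doc_a: str, doc_b: str, topic: str) -> str:
    """Stable key for a conflict, so a persisted dismissal suppresses
    the same flag on future queries that hit the same source pair."""
    key = "|".join(sorted([doc_a, doc_b]) + [topic])
    return hashlib.sha256(key.encode("utf-8")).hexdigest()

def should_surface(score: float, doc_a: str, doc_b: str, topic: str,
                   dismissed: set) -> bool:
    if conflict_fingerprint(doc_a, doc_b, topic) in dismissed:
        return False  # reviewed once and dismissed; don't re-flag
    return score >= FLAG_THRESHOLD
```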
Most systems are tuned the opposite way. The LLM is biased toward fluent, confident responses; conflict-flagging interrupts the fluency; product teams find that aggressive conflict detection makes the system "feel less polished" in demos. The polish wins. The asymmetry argues this is the wrong call.
The hierarchical case
The hardest class of conflict, and the one most systems get wrong, is when the documents themselves don't disagree — but their combination does. A local borough ordinance says one thing. A state law says something more restrictive. The local ordinance is internally consistent. The state law is internally consistent. There is no flat contradiction between them. But the local ordinance cannot override the state requirement, and the answer to a user's question depends on knowing that.
A naive system retrieves the local ordinance, paraphrases it, and produces a confident answer that's legally wrong. A system with conflict detection but no authority awareness might surface both documents but treat them as equivalently authoritative — leaving the user to figure out the precedence. A system with explicit authority hierarchy can do better: surface the binding higher-authority document with a label that makes the precedence clear, present the local procedure as operating within that constraint, and flag the case as a hierarchical conflict requiring user awareness.
This is not a generic LLM capability. It requires deliberate machinery: tagging documents with their jurisdictional level (federal / state / county / municipal / departmental), tagging documents with their regulatory force (binding / guidance / internal / informational), running a separate retrieval pass that surfaces binding higher-authority documents on the same topic as the primary retrieval, and rendering the result with explicit hierarchical labeling.
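A sketch of that machinery, with assumed tag values and an assumed `retrieve` interface; the point is the second pass over binding higher-authority documents, not the specific names:

```python
from enum import Enum, IntEnum

class Jurisdiction(IntEnum):
    """Ordered so that a smaller value means higher authority."""
    FEDERAL = 1
    STATE = 2
    COUNTY = 3
    MUNICIPAL = 4
    DEPARTMENTAL = 5

class Force(Enum):
    BINDING = "binding"
    GUIDANCE = "guidance"
    INTERNAL = "internal"
    INFORMATIONAL = "informational"

def hierarchical_conflicts(topic, primary_hits, retrieve):
    """Second retrieval pass: for each primary hit, look for binding
    documents at a higher jurisdictional level on the same topic.
    `retrieve(topic, jurisdictions, force)` is an assumed search
    interface; each hit is assumed to carry a .jurisdiction tag."""
    flags = []
    for hit in primary_hits:
        higher = [j for j in Jurisdiction if j < hit.jurisdiction]
        if not higher:
            continue  # already at the top of the hierarchy
        for doc in retrieve(topic, jurisdictions=higher, force=Force.BINDING):
            # Rendered under an explicit "binding higher authority" header.
            flags.append((doc, hit))
    return flags
```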
In our work, this is the capability that produces the strongest user reaction in demos. The moment a borough manager sees their own local procedure displayed under a "binding state authority" header showing the PA Sunshine Act constraint they hadn't been thinking about — that is the moment that turns "interesting tool" into "this changes how we operate." The capability is technically achievable, but it's also the capability that distinguishes systems built deliberately for regulated environments from systems that were built for general-purpose use and pointed at a regulated corpus afterward.
What conflict detection cannot do
There is a genuine limit worth being honest about. A system can detect, classify, and surface conflicts. It cannot resolve them.
Resolution requires judgment that depends on information the system doesn't have. Which document is "really" in force when two contradict? Sometimes the answer is in the documents themselves (one explicitly supersedes the other). Often it isn't. The actual operational practice may have been verbally communicated, codified in an email thread the system hasn't ingested, or implicit in a chain of decisions the user can reconstruct but the documents alone don't support.
This is why the system's job is to surface conflicts, not to pick winners. A system that confidently picks the "right" document when the documents themselves don't establish precedence is making up authority that doesn't exist. The user — the borough manager, the compliance officer, the legal counsel — is the only entity with the contextual access to make the call.
What the system can do, after the user makes the call, is record it. Dismissal of a flagged conflict with a documented reason is itself a useful organizational artifact. It's a record of "we reviewed this; here's what we decided; here's why." Six months later, when someone else asks a question that touches the same documents, the dismissal record is visible and the previous reasoning can be inspected. The system has converted an ad-hoc judgment into a durable institutional decision.
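What such a record might look like, with field names that are ours rather than a fixed schema; the property worth preserving is that the log is append-only:

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class DismissalRecord:
    """A reviewed conflict converted into a durable institutional decision."""
    fingerprint: str        # which conflict was dismissed
    dismissed_by: str
    reason: str             # "we reviewed this; here's what we decided; here's why"
    dismissed_at: datetime
    reverted: bool = False  # a reversal appends a new record; the original survives

def dismiss(fingerprint: str, user: str, reason: str, log: list) -> None:
    """Append-only: records are never edited or deleted, only reverted."""
    if not reason.strip():
        raise ValueError("a dismissal requires a documented reason")
    log.append(DismissalRecord(fingerprint, user, reason,
                               datetime.now(timezone.utc)))
```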
This pattern — the system surfaces; the user judges; the judgment is persisted as a reviewable record — is the right division of labor. It respects what the system can do, respects what the system can't, and produces an audit trail that serves the organization regardless of whether the original judgment turns out to have been right.
Where this fails in practice
Even with all of the above implemented, conflict detection has practical failure modes worth knowing about.
Sparse coverage produces phantom conflicts. If the corpus is shallow on a topic, the system may surface what looks like a conflict between two documents when in fact one of them is just incomplete. The user reads both, realizes neither actually addresses the question fully, and the "conflict" was an artifact of the corpus being thin rather than the documents disagreeing. The fix is corpus-side, not system-side: better coverage produces fewer phantom conflicts.
Phrasing differences read as conflicts. Two documents say substantively the same thing but use different terminology. A naive comparison flags them as disagreeing. A more sophisticated comparison handles paraphrase but adds latency and cost. The right tradeoff depends on how often the corpus has phrasing variation versus actual content variation, which varies by domain.
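One shape the tradeoff can take, sketched with assumed `embed` and `llm_same_claim` interfaces and an illustrative threshold: spend the expensive paraphrase judgment only where a cheap similarity screen can't settle the question.

```python
import numpy as np

def cosine(a, b) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def phrasing_only(passage_a: str, passage_b: str, embed, llm_same_claim) -> bool:
    """Is an apparent conflict just a phrasing difference?"""
    sim = cosine(embed(passage_a), embed(passage_b))
    if sim < 0.5:     # illustrative cutoff
        return False  # plainly different content; skip the slow check
    # Similar wording is NOT proof of the same claim ("30 days" and
    # "45 days" embed almost identically), so the careful judgment
    # runs exactly where the cheap screen is least trustworthy.
    return llm_same_claim(passage_a, passage_b)
```

Note the direction of the screen: given the asymmetry above, high similarity is where the expensive check is spent, not where it is skipped.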
Updates without explicit supersession. A 2022 resolution modifies a 2017 policy without explicitly saying "this supersedes section 4.2 of the 2017 policy." Both documents survive in the corpus. The system flags the conflict every time, correctly. The fix is operator-driven: someone has to mark the supersession relationship explicitly, after which the system can deprioritize the older document. Over time, an organization that uses the system regularly accumulates these supersession links, and the noise floor drops.
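A sketch of how those links might feed retrieval, assuming hits carry a `doc_id` and that operators maintain the mapping as they resolve flagged conflicts:

```python
def apply_supersessions(hits: list, superseded_by: dict) -> list:
    """Deprioritize operator-marked superseded documents. `superseded_by`
    maps a superseded doc id to the doc id that replaces it."""
    current = [h for h in hits if h.doc_id not in superseded_by]
    stale = [h for h in hits if h.doc_id in superseded_by]
    # Superseded documents stay retrievable (audits may need them) but
    # drop to the bottom and stop triggering the same conflict flag.
    return current + stale
```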
The user dismisses too aggressively. Once a user is in a flow, conflict flags interrupt them. A user who dismisses a flag without actually reviewing the conflict is worse than a user who never saw the flag — they have institutionalized the assumption that conflicts in this corpus are noise. The product design has to fight this. Forcing a written reason on every dismissal helps; preserving the dismissal in an audit log that's visible to the user's peers helps more.
These failure modes don't undermine the value of conflict detection — they shape what it takes to do conflict detection well. The capability is non-trivial. It rewards organizations that invest in it. It punishes organizations that adopt a system that promises it without doing the work behind it.
What to look for
If conflict detection matters in your environment, the questions worth asking of any system you evaluate are concrete. Show me a case where the corpus has two documents that disagree on a topic — does the system surface both positions or smooth them over into a single answer? When two documents disagree, does the answer include each conflicting passage, or only the names of the documents that conflicted? How does the system handle hierarchical conflict — local procedure constrained by binding higher authority — distinct from flat contradiction at the same level? When a flagged conflict is dismissed, is the dismissal logged with a reason, and is the same conflict suppressed on future queries that hit the same source set? When a dismissal turns out to have been wrong, is reverting it possible, and does the audit trail preserve both decisions?
These aren't trick questions. The right answers are visible in any system that's been built carefully against the problem; the wrong answers are visible in systems built for the demo. The differences become apparent within an hour of substantive use.
If you're working through these tradeoffs and want a sounding board — diagnostic, not pitch — we'd welcome the conversation.
About Base2ML. Base2ML is a Pittsburgh-based company building knowledge-access tools for organizations that need to find what they already have. We work in the specific space where retrieval, authority hierarchy, and conflict surfacing meet operational reality.
Contact. Base2ML · chris@base2ml.com · base2ml.com · docs.base2ml.com
Numbers and percentages are deliberately not invented. Where industry research provides a credible figure, we cite it; where it doesn't, we say so rather than fabricating one.