May 18, 2026·AI & Methodology·9 min read

Two LLMs, same case file, opposite verdicts.

Most “AI for litigation” tools haven’t reduced variance in legal risk. They’ve hidden it. A note on why methodology, not models, is the moat.

Jonathan Habshush

Every legaltech demo in 2026 ends the same way. A founder types a question into a chat box, a confident paragraph fills the screen, and the partner in the room nods like a person who is being asked to evaluate a magic trick from inside the trick. The answer feels right. It is well-cited. It has the cadence of a junior associate who slept eight hours.

And it is, about as often as a coin flip, wrong. Not subtly. Not at the edges. Wrong in the sense that if you run the same case file through a different tool, a different model, or even the same model on a different day, you get an answer that is materially, sometimes oppositely, different. Win rates that move from 28% to 82%. Damages estimates that move by an order of magnitude. Recommendations that flip from “settle” to “take it to trial.”

This is the central, unsexy fact about AI for litigation as it exists today. The tools are not reducing variance in legal risk. They are hiding it. And the worst part is that they are hiding it well, because the answer always sounds like the right one.

The variance is the product

When we set out to benchmark this, the premise was simple. Take 50 commercial litigation matters — a mix of insurance coverage disputes, breach of contract, securities class actions, and IP — strip them down to the case file a litigator would have at intake, and run them through every commercial legal-AI tool on the market plus a panel of frontier models. Ask each one a single, well-scoped question:

Given this matter as presented, what is your estimate of the plaintiff’s probability of prevailing on the merits, and what is the expected damages range?

We weren’t trying to catch the tools out. We weren’t prompting adversarially. We were asking the question every general counsel, partner, and funder asks within the first thirty seconds of looking at a new matter. The results were not subtle.

41% of matters: no two tools agreed on the directional recommendation (settle vs. fight, take vs. pass).

On the median matter, the spread between the highest and lowest predicted win rate across tools was 38 percentage points. On the worst-behaved matter, a coverage dispute with a fragmented procedural record, it was 54. To put that in human terms: one tool was telling the client they had a strong case worth pursuing. Another was telling them they were about to lose.
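For readers who want to see what a spread figure like this means mechanically, here is a minimal sketch. The matter names, tool names, and probabilities are made up for illustration and are not benchmark data, and the 0.5 cutoff that maps a win probability to "fight" or "settle" is an assumption of this sketch, not a rule from the benchmark.

```python
from statistics import median

# Made-up predictions for illustration only: matter -> {tool: P(plaintiff prevails)}
predictions = {
    "matter_a": {"tool_1": 0.30, "tool_2": 0.75, "tool_3": 0.55},
    "matter_b": {"tool_1": 0.40, "tool_2": 0.45, "tool_3": 0.48},
    "matter_c": {"tool_1": 0.70, "tool_2": 0.35, "tool_3": 0.51},
}


def spread_points(by_tool):
    """Highest minus lowest predicted win rate, in percentage points."""
    values = list(by_tool.values())
    return round((max(values) - min(values)) * 100, 1)


def direction(p_win, threshold=0.5):
    """Assumed mapping from a win probability to a directional call."""
    return "fight" if p_win >= threshold else "settle"


def tools_disagree(by_tool, threshold=0.5):
    """True if at least two tools land on different directional calls."""
    return len({direction(p, threshold) for p in by_tool.values()}) > 1


spreads = {matter: spread_points(tools) for matter, tools in predictions.items()}
median_spread = median(spreads.values())
share_disagreeing = sum(tools_disagree(t) for t in predictions.values()) / len(predictions)

print(spreads)        # {'matter_a': 45.0, 'matter_b': 8.0, 'matter_c': 35.0}
print(median_spread)  # 35.0
```

The point of separating the spread from the directional call is that a matter can show a small spread and still split the tools if the estimates straddle the cutoff, and a large spread can hide behind unanimous direction.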

What is interesting, and a little uncomfortable, is that the most confident tool — the one that delivered the cleanest, most assertive paragraph in response — was the second-least accurate when measured against the matters that have since settled or resolved. Confidence and accuracy are not correlated in this market. In several cases, they are inversely correlated. The tool that hedged the least was the tool that was most often pointing in the wrong direction.
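One common way to score probabilistic forecasts against resolved outcomes is the Brier score. The sketch below is an illustration of that general technique, not a description of how the benchmark was scored. The forecasts and outcomes are invented, and treating each resolved matter as a clean 1 (plaintiff prevailed) or 0 is a simplifying assumption, since settlements rarely resolve that neatly.

```python
def brier_score(forecasts, outcomes):
    """Mean squared gap between predicted probabilities and 0/1 outcomes.

    Lower is better; always answering 0.5 scores 0.25.
    """
    if not forecasts or len(forecasts) != len(outcomes):
        raise ValueError("forecasts and outcomes must be non-empty and the same length")
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)


# Invented data: 1 = plaintiff prevailed, 0 = plaintiff did not.
outcomes = [1, 0, 0, 1]
assertive_tool = [0.95, 0.90, 0.10, 0.20]  # confident, wrong on two of four
hedged_tool = [0.65, 0.45, 0.40, 0.55]     # less assertive, pointed the right way

print(brier_score(assertive_tool, outcomes))  # about 0.366
print(brier_score(hedged_tool, outcomes))     # about 0.172
```

A score like this penalizes confident misses heavily, which is how an assertive tool can rank below a hedging one.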

Why the chatbot frame breaks down

There is a structural reason for this, and it has nothing to do with which underlying model a given tool wraps. It has to do with what a litigation question actually is.

When a partner sits down with a new matter, they are not running one analysis. They are running six in parallel, each one capable of dominating the answer if it goes the wrong way. They are mapping the procedural posture — jurisdiction, time-bars, standing, removal vectors. They are dissecting claims and defenses element-by-element. They are stress-testing the damages model under P10, P50, and P90 assumptions. They are scoring the admissibility of every key exhibit. They are reading the bench: who the judge is, what their motion-grant rates look like, what opposing counsel typically does on the third Tuesday of a settlement window. They are tracing collectability through the parent-subsidiary structure to make sure that a win is actually a recovery.
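To make the damages track concrete, here is a minimal Monte Carlo sketch of a P10/P50/P90 range. Everything in it is an assumption for illustration: the triangular distribution for principal damages, the uniform interest multiplier, and the dollar figures (in millions) are invented. P10 here means the 10th percentile, the low case.

```python
import random
from statistics import quantiles


def damages_percentiles(n_draws=10_000, seed=0):
    """Simulate a toy damages model and return its P10, P50, and P90."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    draws = []
    for _ in range(n_draws):
        # Assumed inputs, in millions: low 2, high 12, most likely 5.
        principal = rng.triangular(2.0, 12.0, 5.0)
        # Assumed uncertainty in prejudgment interest: 0% to 30% uplift.
        interest_multiplier = rng.uniform(1.0, 1.3)
        draws.append(principal * interest_multiplier)
    cuts = quantiles(draws, n=10)  # nine cut points: 10th through 90th percentile
    return {"P10": cuts[0], "P50": cuts[4], "P90": cuts[8]}


result = damages_percentiles()
print({name: round(value, 2) for name, value in result.items()})
```

Reporting three points on the distribution instead of one number is what lets a reader see how much of the estimate is assumption and how much is evidence.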

Each of those is its own analytical workstream. Each one is large enough to be wrong on its own. And the final answer — the recommendation — is a synthesis that requires the partner to reconcile six independent estimates into a single, defensible call.

A chat interface is a profoundly bad container for that work. It collapses six tracks into one paragraph. It rewards fluency over rigor. It cannot tell you whether it weighted procedural risk three points too high. It cannot show you which of its inputs would, if flipped, flip the answer. And it cannot be audited the way a model is audited, because a model has inputs you can interrogate, and a chatbot has a vibe.
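What it means to show which inputs would flip the answer can be sketched with a toy decision rule. The rule, the three inputs, and every number below are hypothetical, chosen only so the example runs. The idea is simply to toggle one input at a time and record which toggles change the recommendation.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class Inputs:
    # Hypothetical yes/no findings from separate analysis tracks.
    claim_timely: bool
    key_exhibit_admitted: bool
    defendant_collectible: bool


def recommend(x: Inputs) -> str:
    """Toy rule, invented for this sketch: fight if the expected recovery share is at least 0.5."""
    if not x.claim_timely:
        return "settle"  # a time-barred claim ends the analysis
    p_win = 0.65 if x.key_exhibit_admitted else 0.35
    recovery_share = 1.0 if x.defendant_collectible else 0.8
    return "fight" if p_win * recovery_share >= 0.5 else "settle"


def decisive_inputs(x: Inputs) -> list:
    """Names of inputs whose flip, taken alone, changes the recommendation."""
    baseline = recommend(x)
    flipped = []
    for f in fields(x):
        toggled = replace(x, **{f.name: not getattr(x, f.name)})
        if recommend(toggled) != baseline:
            flipped.append(f.name)
    return flipped


case = Inputs(claim_timely=True, key_exhibit_admitted=True, defendant_collectible=True)
print(recommend(case))        # fight
print(decisive_inputs(case))  # ['claim_timely', 'key_exhibit_admitted']
```

In this toy case collectability matters to the size of the recovery but not to the call, which is exactly the kind of distinction a single paragraph of prose cannot surface.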

The narration tax, again

We have written before about the narration tax — the fact that every other major risk in the enterprise is modeled, and legal is narrated. Credit risk has FICO. Market risk has Bloomberg. Insurance has actuarial tables. Litigation, where a single decision can move a balance sheet, still runs on partner intuition and a memo.

The first generation of AI legal tools did not break the narration tax. They industrialized it. The memo is now generated faster, with more citations, in better prose. But it is still a memo. It is still narration. It just sounds more confident, because the thing producing it does not have a career to protect.

What we do differently

We are building Valar because we believe that the moat in legal AI is not the model. The moat is the methodology around the model. Six research tracks, each one specialized, each one cited, each one adversarially red-teamed, each one auditable. A single Decision Book at the end, signed by an expert, with the math in plain view.

We are not the only company that has noticed that the chatbot frame is wrong. But we are one of the few that has built the entire delivery — from intake to signed recommendation — around the proposition that variance is the enemy, not fluency. Every track has a confidence interval. Every conclusion has a list of inputs that, if flipped, would change the answer. Every Decision Book is defensible at the level of a partner pitch, a GC reserve memo, or a funder investment committee.

We think this is what answer layers for legal risk are going to look like in five years. The chatbot is a transitional artifact. It is the spreadsheet-on-paper of this era: useful as a demonstration, not durable as infrastructure.

Why this matters now

Because the decisions that depend on these answers are not getting smaller. Fortune 500 contingent legal liability is in the trillions. Litigation funding is now an asset class with its own benchmark indices. Insurers are pricing D&O policies with a tail that bends like a hockey stick. And the people making those calls are sitting in front of tools that, on the same case file, will give them five different answers depending on which button they pressed.

The market has not yet caught up to this. It will. The first time a major matter resolves with a number that materially differs from the AI-generated estimate the GC relied on to set the reserve, the conversation will change. The frontier model is not the moat. The methodology around it is. A case answer is only worth what you can defend in front of a judge — or a board.

Notes on the benchmark

The 50-matter benchmark referenced above pulls from a mix of public and private matters; the public matters are weighted toward commercial litigation in S.D.N.Y., M.D. Fla., and the Delaware Court of Chancery. Tools tested include the four largest commercial legal-AI platforms and three frontier models accessed through standard APIs. We are happy to share the methodology and the per-matter results with serious counterparties under NDA. Email us at info@valarhq.com if that is you.
