Compliance teams have been trained to accept false-positive rates as a cost of doing business. A false-positive rate of 95% or more on transaction-monitoring alerts is the industry-quoted norm, and clearing those false positives consumes roughly 60% of every analyst's working day. This study asks a simple question: what happens when detection is done by reasoning, not rules?
The benchmark
We ran WIDTH's detection engine in shadow mode against three production rule-based tools at twelve institutions, a mix of retail banks, digital banks, fintechs, and payment firms. Over a twelve-week window, each tool produced roughly 40M alerts in aggregate. We pair-reviewed a stratified sample of 20,000 alerts to establish ground-truth labels.
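How the 20,000-case review sample is drawn matters for the quality of the ground-truth labels. The study's own sampling methodology is only available under the NDA mentioned at the end; the sketch below shows one common way to draw a stratified sample with pandas, using hypothetical strata (tool and typology) that are our assumption, not a description of the study's actual strata.

```python
import pandas as pd

def draw_review_sample(alerts: pd.DataFrame, n_total: int = 20_000,
                       strata=("tool", "typology")) -> pd.DataFrame:
    """Draw an approximately proportional stratified sample of alerts
    for pair review: each stratum contributes in proportion to its share
    of the full alert population, with at least one alert per stratum."""
    frac = n_total / len(alerts)
    return (
        alerts.groupby(list(strata), group_keys=False)
        .apply(lambda g: g.sample(n=max(1, round(len(g) * frac)), random_state=7))
        .reset_index(drop=True)
    )
```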
What we measured
- Precision (true positives / total alerts)
- Recall at a fixed review budget (how many true positives land in the top N alerts an analyst has time to review; see the code sketch after this list)
- Time-to-investigate (analyst minutes per case)
- Narrative quality (reviewer rating, blinded)
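To make the first two metrics concrete, here is a minimal sketch of how precision and recall at a fixed review budget can be computed from a scored, ground-truth-labelled alert list. The column names (risk_score, is_true_positive) are illustrative, not the study's actual schema.

```python
import pandas as pd

def precision(alerts: pd.DataFrame) -> float:
    """Precision: share of raised alerts that are true positives."""
    return alerts["is_true_positive"].mean()

def recall_at_budget(alerts: pd.DataFrame, budget: int) -> float:
    """Recall at a fixed review budget: share of all true positives that
    land in the top-`budget` alerts when ranked by risk score."""
    ranked = alerts.sort_values("risk_score", ascending=False)
    reviewed = ranked.head(budget)
    total_tp = alerts["is_true_positive"].sum()
    return reviewed["is_true_positive"].sum() / total_tp if total_tp else 0.0
```

Recall at a fixed budget is the operationally meaningful version of recall: a tool that buries its true positives below the review cutoff scores poorly even if it technically flags them somewhere in the queue.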
The structural difference
Rule-based monitoring matches fixed conditions. Behaviour-aware scoring compares a customer's activity to their own baseline and to a peer group sharing their onboarding profile, transaction type, and corridor. The same $12,000 wire looks normal for one peer group and extraordinary for another — and that context never appears in a threshold.
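To illustrate the difference, here is a minimal sketch of peer-group normalisation using robust z-scores against both the customer's own history and a peer group's history. It is an illustration of the idea only, not WIDTH's production scoring logic, and every function and number in it is hypothetical.

```python
import numpy as np

def robust_z(value: float, history: np.ndarray) -> float:
    """How far a value sits from the median of a history, in units of MAD
    (median absolute deviation), so a few outliers don't inflate the baseline."""
    median = np.median(history)
    mad = np.median(np.abs(history - median))
    return (value - median) / (1.4826 * mad + 1e-9)

def behaviour_score(amount: float,
                    own_history: np.ndarray,
                    peer_history: np.ndarray) -> float:
    """Score a transaction against the customer's own baseline and their
    peer group's baseline; the larger deviation drives the score."""
    return max(robust_z(amount, own_history), robust_z(amount, peer_history))

# The same $12,000 wire, scored against two different peer groups:
corporate_peers = np.array([9_000, 11_500, 14_000, 10_200, 12_800])
retail_peers = np.array([120, 300, 85, 640, 210])
own = np.array([10_000, 13_000, 11_200])
print(behaviour_score(12_000, own, corporate_peers))  # small: unremarkable here
print(behaviour_score(12_000, own, retail_peers))     # large: extraordinary here
```

A fixed threshold sees only the $12,000; the peer-relative score sees what $12,000 means for this customer and this cohort.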
"We went from 40,000 alerts a month to 2,000 — and the 2,000 are the ones that actually matter."
Results
Precision
Aggregate precision rose from 3.1% under rule-based monitoring to 62.4% under WIDTH. The strongest gains were in peer-sensitive typologies — structuring, layering, and APP scam variants — where the peer-group normalisation had the largest effect.
Analyst time
Median investigation time per genuine case dropped from 42 minutes to 9 — not because the investigator moved faster, but because the case arrived pre-drafted with entity graph, typology label, and citation chain.
Narrative quality
Blinded, independent reviewers (three MLROs) rated WIDTH narratives 4.6/5 on average, compared with 2.8/5 for narratives analysts assembled from scratch, a statistically significant gap at p < 0.001.
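The write-up does not state which test produced that p-value. For ordinal 1-to-5 reviewer ratings, a rank-based test is a common choice; a minimal sketch, assuming the per-narrative ratings were available as two lists, might look like this.

```python
from scipy.stats import mannwhitneyu

def rating_gap(width_ratings, analyst_ratings):
    """Two-sided Mann-Whitney U test on two sets of 1-5 reviewer ratings.
    Rank-based, so it doesn't assume the ratings are normally distributed."""
    return mannwhitneyu(width_ratings, analyst_ratings, alternative="two-sided")
```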
Where the AI still missed
The model under-flagged two classes of novel typology that no rule catches either, but which seasoned investigators intuited from context the model did not yet have access to. Those gaps were closed by adding a peer-group feature the model was not originally ingesting.
The lesson: AI-native monitoring isn't magic, and it isn't static. The gap between WIDTH and rule-based tools widened over the study window because the model learned from feedback; the rules didn't.
What this means for compliance teams
- Investigators stop leaving. Teams running WIDTH reported resignation rates declining against the industry baseline.
- Audit packs ship cleaner. Every decision ties back to input features, policy version, and model version, visible in one click (a sketch of what such a record can look like follows this list).
- Novel typologies surface sooner. Pattern detection runs independently of whether a rule has been written yet.
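The audit-pack format itself is not published in this study. The record below is a hypothetical illustration of the kind of traceable decision record the second bullet describes, not WIDTH's actual schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass(frozen=True)
class DecisionRecord:
    """One alert decision, pinned to everything needed to reproduce it."""
    alert_id: str
    decision: str                 # e.g. "escalate" or "close"
    risk_score: float
    input_features: dict          # the feature values the model actually saw
    policy_version: str           # compliance policy in force at decision time
    model_version: str            # model build that produced the score
    decided_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))

record = DecisionRecord(
    alert_id="A-2024-000123",
    decision="escalate",
    risk_score=0.91,
    input_features={"amount": 12_000, "corridor": "US->MX", "peer_z": 4.2},
    policy_version="aml-policy-2024.3",
    model_version="width-detect-1.8.0",
)
```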
Reproducing the study
We're making the methodology, sample schema, and review rubric available to compliance teams under NDA. Email research@width.com.