Compliance teams have been trained to accept false-positive rates as a cost of doing business. A false-positive rate of 95% or more on transaction-monitoring alerts is the industry-quoted norm, and clearing those false positives consumes roughly 60% of every analyst's working day. This study asks a simple question: what happens when detection is done by reasoning, not rules?
The benchmark
We ran WIDTH's detection engine in shadow mode against three production rule-based tools at twelve institutions, a mix of retail banks, digital banks, fintechs, and payment firms. Over a twelve-week window, each tool produced roughly 40M alerts in aggregate. We pair-reviewed a stratified sample of 20,000 alerts to establish ground-truth labels.
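How the 20,000-case review sample is drawn matters for the quality of the ground-truth labels. The study's own sampling methodology is only available under the NDA mentioned at the end; the sketch below shows one common way to draw a stratified sample with pandas, using hypothetical strata (tool and typology) that are our assumption, not a description of the study's actual strata.

```python
import pandas as pd

def draw_review_sample(alerts: pd.DataFrame, n_total: int = 20_000,
                       strata=("tool", "typology")) -> pd.DataFrame:
    """Draw an approximately proportional stratified sample of alerts
    for pair review: each stratum contributes in proportion to its share
    of the full alert population, with at least one alert per stratum."""
    frac = n_total / len(alerts)
    return (
        alerts.groupby(list(strata), group_keys=False)
        .apply(lambda g: g.sample(n=max(1, round(len(g) * frac)), random_state=7))
        .reset_index(drop=True)
    )
```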
What we measured
- Precision (true positives / total alerts)
- Recall at a fixed review budget (how many true positives land in the top N alerts an analyst has time to review; see the code sketch after this list)
- Time-to-investigate (analyst minutes per case)
- Narrative quality (reviewer rating, blinded)
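To make the first two metrics concrete, here is a minimal sketch of how precision and recall at a fixed review budget can be computed from a scored, ground-truth-labelled alert list. The column names (risk_score, is_true_positive) are illustrative, not the study's actual schema.

```python
import pandas as pd

def precision(alerts: pd.DataFrame) -> float:
    """Precision: share of raised alerts that are true positives."""
    return alerts["is_true_positive"].mean()

def recall_at_budget(alerts: pd.DataFrame, budget: int) -> float:
    """Recall at a fixed review budget: share of all true positives that
    land in the top-`budget` alerts when ranked by risk score."""
    ranked = alerts.sort_values("risk_score", ascending=False)
    reviewed = ranked.head(budget)
    total_tp = alerts["is_true_positive"].sum()
    return reviewed["is_true_positive"].sum() / total_tp if total_tp else 0.0
```

Recall at a fixed budget is the operationally meaningful version of recall: a tool that buries its true positives below the review cutoff scores poorly even if it technically flags them somewhere in the queue.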
The structural difference
Rule-based monitoring matches fixed conditions. Behaviour-aware scoring compares a customer's activity to their own baseline and to a peer group sharing their onboarding profile, transaction type, and corridor. The same $12,000 wire looks normal for one peer group and extraordinary for another — and that context never appears in a threshold.
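To illustrate the difference, here is a minimal sketch of peer-group normalisation using robust z-scores against both the customer's own history and a peer group's history. It is an illustration of the idea only, not WIDTH's production scoring logic, and every function and number in it is hypothetical.

```python
import numpy as np

def robust_z(value: float, history: np.ndarray) -> float:
    """How far a value sits from the median of a history, in units of MAD
    (median absolute deviation), so a few outliers don't inflate the baseline."""
    median = np.median(history)
    mad = np.median(np.abs(history - median))
    return (value - median) / (1.4826 * mad + 1e-9)

def behaviour_score(amount: float,
                    own_history: np.ndarray,
                    peer_history: np.ndarray) -> float:
    """Score a transaction against the customer's own baseline and their
    peer group's baseline; the larger deviation drives the score."""
    return max(robust_z(amount, own_history), robust_z(amount, peer_history))

# The same $12,000 wire, scored against two different peer groups:
corporate_peers = np.array([9_000, 11_500, 14_000, 10_200, 12_800])
retail_peers = np.array([120, 300, 85, 640, 210])
own = np.array([10_000, 13_000, 11_200])
print(behaviour_score(12_000, own, corporate_peers))  # small: unremarkable here
print(behaviour_score(12_000, own, retail_peers))     # large: extraordinary here
```

A fixed threshold sees only the $12,000; the peer-relative score sees what $12,000 means for this customer and this cohort.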
"We went from 40,000 alerts a month to 2,000 — and the 2,000 are the ones that actually matter."
Results
Precision
Aggregate precision rose from 3.1% under rule-based monitoring to 62.4% under WIDTH. The strongest gains were in peer-sensitive typologies — structuring, layering, and APP scam variants — where the peer-group normalisation had the largest effect.
Analyst time
Median investigation time per genuine case dropped from 42 minutes to 9 — not because the investigator moved faster, but because the case arrived pre-drafted with entity graph, typology label, and citation chain.
Narrative quality
Blinded, independent reviewers (three MLROs) rated WIDTH narratives 4.6/5 on average, compared with 2.8/5 for narratives analysts assembled from scratch, a statistically significant gap at p < 0.001.
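The write-up does not state which test produced that p-value. For ordinal 1-to-5 reviewer ratings, a rank-based test is a common choice; a minimal sketch, assuming the per-narrative ratings were available as two lists, might look like this.

```python
from scipy.stats import mannwhitneyu

def rating_gap(width_ratings, analyst_ratings):
    """Two-sided Mann-Whitney U test on two sets of 1-5 reviewer ratings.
    Rank-based, so it doesn't assume the ratings are normally distributed."""
    return mannwhitneyu(width_ratings, analyst_ratings, alternative="two-sided")
```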
Where the AI still missed
The model under-flagged two classes of novel typology that no rule catches either, but which seasoned investigators intuited from context the model did not yet have access to. Those gaps were closed by adding a peer-group feature the model was not originally ingesting.
The lesson: AI-native monitoring isn't magic, and it isn't static. The gap between WIDTH and rule-based tools widened over the study window because the model learned from feedback; the rules didn't.
What this means for compliance teams
- Investigators stop leaving. Teams running WIDTH reported resignation rates declining against the industry baseline.
- Audit packs ship cleaner. Every decision ties back to input features, policy version, and model version, visible in one click (a sketch of what such a record can look like follows this list).
- Novel typologies surface sooner. Pattern detection runs independently of whether a rule has been written yet.
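The audit-pack format itself is not published in this study. The record below is a hypothetical illustration of the kind of traceable decision record the second bullet describes, not WIDTH's actual schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass(frozen=True)
class DecisionRecord:
    """One alert decision, pinned to everything needed to reproduce it."""
    alert_id: str
    decision: str                 # e.g. "escalate" or "close"
    risk_score: float
    input_features: dict          # the feature values the model actually saw
    policy_version: str           # compliance policy in force at decision time
    model_version: str            # model build that produced the score
    decided_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))

record = DecisionRecord(
    alert_id="A-2024-000123",
    decision="escalate",
    risk_score=0.91,
    input_features={"amount": 12_000, "corridor": "US->MX", "peer_z": 4.2},
    policy_version="aml-policy-2024.3",
    model_version="width-detect-1.8.0",
)
```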
Reproducing the study
We're making the methodology, sample schema, and review rubric available to compliance teams under NDA. Email research@width.com.