In May 2026, the WHO corrected a hantavirus case count after an initially inconclusive test came back negative [8]. It’s a small detail in a public health report—but it mirrors a problem security teams face daily: how do you distinguish real threats from noise when your tools keep crying wolf?
Static analysis scanners, DAST crawlers, and dependency checkers can generate thousands of findings per scan, and many of those findings are false positives. Teams waste hours triaging them, and real vulnerabilities slip through the cracks. The solution isn’t better detection; it’s better triage.
Why Accuracy Isn’t Enough
Most security tools report accuracy or severity scores. But like medical AI models that outperform doctors in diagnosis yet still misclassify edge cases [1][3], high accuracy doesn’t guarantee reliable triage. A tool can rank vulnerabilities correctly 90% of the time and still flood you with low-risk alerts if it’s poorly calibrated.
Calibration means the predicted risk matches actual exploitability. If a scanner assigns 100 findings a 90% probability of being exploitable, roughly 90 of them should turn out to be exploitable. In practice, that’s rarely the case. Without calibration, thresholds become arbitrary, and teams either ignore alerts or drown in them.
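One way to check calibration is to bin findings by predicted risk and compare each bin's average prediction with the rate confirmed during triage. The sketch below assumes the scanner exposes a probability-like score between 0 and 1 and that you have triage labels for each finding; the function names and the bin count are illustrative, not part of any scanner's API.

```python
from collections import defaultdict


def calibration_table(scores, labels, n_bins=5):
    """Group findings by predicted risk and compare with the observed rate.

    scores: predicted probability (0.0-1.0) that each finding is real.
    labels: 1 if triage confirmed the finding, 0 if it was a false positive.
    Returns rows of (bin_low, bin_high, mean_predicted, observed_rate, count).
    """
    bins = defaultdict(list)
    for score, label in zip(scores, labels):
        # Clamp so a score of exactly 1.0 lands in the top bin.
        index = min(int(score * n_bins), n_bins - 1)
        bins[index].append((score, label))

    rows = []
    for index in sorted(bins):
        items = bins[index]
        mean_predicted = sum(s for s, _ in items) / len(items)
        observed_rate = sum(l for _, l in items) / len(items)
        rows.append((index / n_bins, (index + 1) / n_bins,
                     mean_predicted, observed_rate, len(items)))
    return rows


def expected_calibration_error(scores, labels, n_bins=5):
    """Count-weighted average gap between predicted and observed rates."""
    total = len(scores)
    table = calibration_table(scores, labels, n_bins)
    return sum(count / total * abs(predicted - observed)
               for _, _, predicted, observed, count in table)
```

A value near zero suggests the scores can be read as probabilities; a large gap in the top bin is the usual sign that "critical" is being over-assigned.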
Quantify Uncertainty, Not Just Severity
Modern medical AI frameworks now require explicit uncertainty measures—not just point predictions [6]. Security teams should demand the same from their tools. When a SAST engine flags a potential SQL injection, it should also report confidence: Is this a textbook pattern or a heuristic guess?
Uncertainty quantification lets you tier your response. High-confidence, high-severity findings go straight to remediation. Low-confidence ones get batched for periodic review. This reduces alert volume without sacrificing coverage.
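The tiering described above might look like the following sketch. It assumes each finding carries a severity label and a confidence value between 0 and 1; the Finding class, the tier names, the 0.8 cutoff, and the "backlog" tier for confident but lower-severity findings are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    rule_id: str
    severity: str      # "low", "medium", "high", or "critical"
    confidence: float  # 0.0-1.0, reported by or derived from the scanner


HIGH_SEVERITY = {"high", "critical"}


def triage_tier(finding, confidence_cutoff=0.8):
    """Route a finding using both its severity and its confidence."""
    if finding.confidence < confidence_cutoff:
        # Low confidence: batch for periodic review, whatever the severity.
        return "periodic-review"
    if finding.severity in HIGH_SEVERITY:
        # High confidence and high severity: fix it now.
        return "remediate-now"
    # High confidence but lower severity (an assumed extra tier).
    return "backlog"
```

For example, triage_tier(Finding("sql-injection", "critical", 0.95)) returns "remediate-now", while the same rule reported at 0.4 confidence goes to "periodic-review".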
Validate on Your Own Data
A model trained on open-source projects may fail on your proprietary codebase. Temporal validation—testing on recent, unseen data—reveals performance drift [6]. Run your scanner against a labeled set of known vulnerabilities and false positives from your environment. Measure precision, recall, and calibration locally.
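If your SAST tool has 95% recall but 40% precision, 60% of what it flags is noise. If each finding takes about the same time to triage, that is roughly 60% of triage time spent on false positives. That’s not a tool problem; it’s a tuning problem.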
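A minimal sketch of the local measurement, assuming you can express both the scanner output and your labeled ground truth as sets of finding IDs. The IDs and counts below are made up to reproduce the 95% recall and 40% precision case.

```python
def precision_recall(flagged, confirmed):
    """flagged: set of finding IDs the scanner reported.
    confirmed: set of IDs known to be real vulnerabilities (ground truth)."""
    true_positives = len(flagged & confirmed)
    precision = true_positives / len(flagged) if flagged else 0.0
    recall = true_positives / len(confirmed) if confirmed else 0.0
    return precision, recall


# Hypothetical labeled set: 40 real vulnerabilities; the scanner finds 38 of
# them and also reports 57 false positives (95 findings in total).
confirmed = {f"vuln-{i}" for i in range(40)}
flagged = {f"vuln-{i}" for i in range(38)} | {f"fp-{i}" for i in range(57)}

precision, recall = precision_recall(flagged, confirmed)
noise_share = 1 - precision  # fraction of flagged findings that are noise
print(f"precision={precision:.2f} recall={recall:.2f} noise={noise_share:.2f}")
```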
Control for Multiple Testing
When you scan hundreds of components or files, you’re effectively running hundreds of statistical tests. Some “findings” will look significant by chance alone. Where findings come with p-values, or with scores that can be converted to p-values, apply the Benjamini-Hochberg procedure or a similar false-discovery-rate control to limit spurious alerts [6].
This isn’t just for ML models. Even rule-based scanners benefit from aggregate statistical review. If 50 out of 100 “critical” findings appear only in test code or deprecated modules, they’re likely noise.
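For reference, the Benjamini-Hochberg step-up procedure is short enough to sketch directly. It assumes each finding already has a p-value under some "this is benign" null model, which most scanners do not emit out of the box; how you derive those p-values is left open here, and the function name is illustrative.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Return one boolean per p-value: True where the finding survives
    false-discovery-rate control at level fdr (step-up procedure)."""
    m = len(p_values)
    # Indices of the findings, sorted by ascending p-value.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Find the largest rank k (1-based) with p_(k) <= (k / m) * fdr.
    largest_k = 0
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= rank / m * fdr:
            largest_k = rank

    # Keep every finding ranked at or below k, even if its own p-value
    # missed its own threshold (that is what "step-up" means).
    keep = [False] * m
    for index in order[:largest_k]:
        keep[index] = True
    return keep
```

With p-values [0.039, 0.001, 0.216, 0.008] and the default 5% rate, only the second and fourth findings are kept.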
Keep Humans in the Loop
The best medical AI systems don’t replace clinicians—they support them [2][6]. The same applies to security. Use AI or automation to prioritize, but let engineers make final calls. Document overrides. Track when analysts dismiss findings and why. Feed that data back into tool configuration.
This creates a feedback loop: dismissed alerts inform rule tuning, which reduces future noise, which increases trust in the system.
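As a sketch of the first step in that loop, the snippet below tallies how often analysts dismiss findings from each rule. It assumes a triage log of (rule_id, decision, reason) tuples with decisions recorded as "dismissed" or "accepted"; that log format and the minimum-sample cutoff are assumptions for illustration.

```python
from collections import Counter


def dismissal_rates(triage_log, min_findings=5):
    """Return {rule_id: dismissal rate}, highest rate first, for rules with
    at least min_findings triaged findings."""
    totals = Counter()
    dismissed = Counter()
    for rule_id, decision, _reason in triage_log:
        totals[rule_id] += 1
        if decision == "dismissed":
            dismissed[rule_id] += 1

    rates = {rule: dismissed[rule] / count
             for rule, count in totals.items()
             if count >= min_findings}  # skip rules with too little data
    return dict(sorted(rates.items(), key=lambda item: item[1], reverse=True))
```

Rules at the top of the result are the first candidates for tuning; the recorded reasons indicate whether to narrow the rule, exclude a path, or leave it alone.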
Monitor Post-Deployment Drift
False-positive rates aren’t static. Code changes, new dependencies, and updated detection rules all shift performance. Track false-positive trends over time. Set alerts for sudden spikes—they often indicate misconfiguration or environmental change, not a surge in real vulnerabilities.
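A simple spike check compares each period's false-positive rate with a trailing baseline. The sketch below assumes you record one false-positive rate per week; the window size, the z-score cutoff, and the minimum spread are illustrative defaults rather than recommended values.

```python
from statistics import mean, stdev


def false_positive_spikes(weekly_fp_rates, window=4, z_cutoff=3.0,
                          min_spread=0.01):
    """Return indices of weeks whose false-positive rate sits more than
    z_cutoff standard deviations above the mean of the preceding window."""
    spikes = []
    for i in range(window, len(weekly_fp_rates)):
        baseline = weekly_fp_rates[i - window:i]
        # Floor the spread so a perfectly flat baseline cannot divide by zero.
        spread = max(stdev(baseline), min_spread)
        z_score = (weekly_fp_rates[i] - mean(baseline)) / spread
        if z_score > z_cutoff:
            spikes.append(i)
    return spikes
```

For the series [0.30, 0.32, 0.31, 0.29, 0.30, 0.62, 0.31], only index 5 is flagged. The week after the spike is not flagged, because the spike itself inflates the baseline.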
Start Here
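Pick one scanner in your pipeline. Run it against a labeled dataset from your last quarter of findings. Calculate precision and calibration. Then raise thresholds until precision reaches a target such as 80%, and record how much recall you give up to get there. Moving from 40% to 80% precision while keeping the same true positives halves the number of findings to triage. Any real issues that fall below the new threshold should go to periodic review rather than being discarded.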
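The threshold adjustment can be sketched as a sweep over the scores in your labeled set. This assumes the scanner exposes a numeric score per finding and that labels are 1 for confirmed issues and 0 for false positives; the function name and the sample data are hypothetical.

```python
def lowest_threshold_for_precision(scores, labels, target_precision=0.8):
    """Return (threshold, precision, recall) for the lowest score cutoff whose
    flagged set meets target_precision, or None if no cutoff does."""
    total_real = sum(labels)
    best = None
    for threshold in sorted(set(scores), reverse=True):
        flagged = [label for score, label in zip(scores, labels)
                   if score >= threshold]
        true_positives = sum(flagged)
        precision = true_positives / len(flagged)
        recall = true_positives / total_real if total_real else 0.0
        if precision >= target_precision:
            # A lower cutoff that still meets the target keeps more recall.
            best = (threshold, precision, recall)
    return best


# Hypothetical labeled findings, sorted by score for readability.
scores = [0.95, 0.90, 0.85, 0.70, 0.60, 0.50, 0.40, 0.30]
labels = [1, 1, 1, 1, 0, 0, 1, 0]
result = lowest_threshold_for_precision(scores, labels)
```

In this sample the chosen cutoff is 0.60, with precision and recall both at 0.8: the real issue scored 0.40 falls below the line, which is the recall cost to weigh against the shorter queue.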
