Methodology

Transparent Benchmarks

Every claim we make is backed by reproducible benchmarks. Our methodology is transparent, and our test suites are continuously expanded as we discover new patterns.

CVE BENCHMARK

Real-World Detection Performance

We test TraceMint against real CVE-affected repositories, not synthetic test cases. This benchmark measures our ability to find known vulnerabilities in production code.

500 CVE Repositories Scanned
80.4% Strict Recall (Type + File)
25+ Vulnerability Classes Covered

πŸ“ˆ CVE Benchmark v4 Definition

Total Corpus: 500 OSS CVE repositories across major web languages
Ground Truth: Verified vulnerable file + vulnerability type per repo
Mode: Blind detection (no vulnerability hints provided)
Strict Hit: Correct vulnerability type AND correct file identified
Result: 80.4% strict recall on verified ground-truth repos
Validation: JSON output, file from analyzed set, evidence required
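
To make the scoring rule concrete, here is a minimal sketch of a strict-hit check in Python. The field names (vuln_type, file, evidence) are illustrative assumptions, not TraceMint's actual output schema:

    # Minimal strict-hit check. Field names are illustrative assumptions,
    # not TraceMint's actual schema.
    import json

    def is_strict_hit(output: str, truth: dict, analyzed: set[str]) -> bool:
        try:
            finding = json.loads(output)    # validation: output must be JSON
        except json.JSONDecodeError:
            return False
        return (
            finding.get("vuln_type") == truth["vuln_type"]  # correct type, AND
            and finding.get("file") == truth["file"]        # correct file
            and finding["file"] in analyzed                 # file from analyzed set
            and bool(finding.get("evidence"))               # evidence required
        )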

Detection Rate by Vulnerability Type

Deserialization: 100%
Path Traversal: 92%
LDAP Injection: 89%
XSS: 87%
Command Injection: 83%
XXE: 82%
SQL Injection: 80%
SSRF: 78%
SSTI: 75%
Auth Bypass: 70%
IDOR: 65%
Open Redirect: 62%
METHODOLOGY

How We Test

Our testing methodology is designed to reflect real-world performance, not cherry-picked results.

πŸ“ˆ

CVE Repository Testing

We clone real CVE-affected repositories and run full scans. A "hit" means we find the correct vulnerability type in the correct file β€” verified against advisory data.

Test Corpus: 500 CVE repositories

🔄 Multi-Language Coverage

Our benchmark spans Python, JavaScript, Go, Java, PHP, Ruby, and more. Each language has dedicated taint engines and framework adapters.

Languages: 30+ supported
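
In rough terms, per-language dispatch might look like the sketch below. The extension map and engine names are hypothetical, not TraceMint internals:

    import os

    # Hypothetical extension map; the real engine covers 30+ languages.
    EXT_TO_LANG = {".py": "python", ".js": "javascript", ".go": "go",
                   ".java": "java", ".php": "php", ".rb": "ruby"}

    def taint_engine_for(path: str) -> str:
        """Pick the dedicated per-language taint engine for a source file."""
        lang = EXT_TO_LANG.get(os.path.splitext(path)[1])
        if lang is None:
            raise ValueError(f"no taint engine for {path}")
        return f"{lang}-taint-engine"  # framework adapters attach per language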

🐳 Docker PoC Verification

When a repo ships a docker-compose file or Dockerfile, we automatically spin up a lab environment and execute the PoC to confirm exploitability.

Verification: Automated E2E
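
A minimal sketch of what that automation could look like, assuming the repo ships a docker-compose.yml and a poc.sh exploit script (both hypothetical names):

    import subprocess

    def verify_poc(repo_dir: str) -> bool:
        """Spin up the lab, run the PoC end to end, always tear down."""
        try:
            subprocess.run(["docker", "compose", "up", "-d", "--build"],
                           cwd=repo_dir, check=True)
            poc = subprocess.run(["bash", "poc.sh"], cwd=repo_dir)
            return poc.returncode == 0  # exit 0 => exploit confirmed
        except subprocess.CalledProcessError:
            return False                # lab failed to start
        finally:
            subprocess.run(["docker", "compose", "down", "-v"], cwd=repo_dir)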

🎯 Verdict-Based Scoring

Instead of arbitrary FP percentages, we use VERIFIED / PROOF-BACKED / NEEDS_REVIEW verdicts based on proof-obligation completion.

Approach: Proof-first
FALSE POSITIVE MANAGEMENT

Signal Over Noise

High recall means nothing if every finding is a false positive. Our proof-obligation pipeline ensures that reported findings carry verifiable evidence.

<0.5 FP per 1K LOC
5-Stage FP Reduction Pipeline
Evidence Required for Every Finding
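
Conceptually, a finding is reported only if it survives every stage. The sketch below composes three of the checks described on this page (chain required, guard suppression, evidence required); it is illustrative, not the actual five-stage pipeline:

    from typing import Callable

    Finding = dict
    Stage = Callable[[Finding], bool]  # True = keep, False = suppress

    STAGES: list[Stage] = [
        lambda f: bool(f.get("taint_chain")),  # source->sink chain present
        lambda f: not f.get("sanitized"),      # no guard/sanitizer on the path
        lambda f: bool(f.get("evidence")),     # verifiable evidence attached
    ]

    def survives(finding: Finding) -> bool:
        """A finding is reported only if every stage keeps it."""
        return all(stage(finding) for stage in STAGES)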
πŸ›‘οΈ

Guard & Sanitizer Verification

Before reporting a finding, the engine checks whether authorization guards, sanitizers, or input validators are present on the data flow path β€” and suppresses if protected.
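
For illustration, a simplified version of that suppression check, with a hypothetical set of sanitizer names and a flow path represented as a list of call steps:

    # Hypothetical sanitizer names; the real engine resolves these
    # per language and framework.
    KNOWN_SANITIZERS = {"escape_html", "parameterize_query", "validate_path"}

    def is_protected(flow_path: list[dict]) -> bool:
        """True if any step on the source->sink path applies a guard,
        sanitizer, or input validator, so the finding is suppressed."""
        return any(
            step.get("callee") in KNOWN_SANITIZERS
            or step.get("kind") == "authorization_guard"
            for step in flow_path
        )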

πŸ”—

Source→Sink Chain Required

No pattern-match-only reports. Every finding must have a verified taint chain from user-controlled source to a dangerous sink, with all intermediate steps traced.
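
The evidence attached to each finding might look like the following sketch (field names are illustrative, not TraceMint's actual schema):

    from dataclasses import dataclass, field

    @dataclass
    class TaintStep:
        file: str
        line: int
        expr: str  # the tainted expression at this hop

    @dataclass
    class TaintChain:
        source: TaintStep  # user-controlled input
        sink: TaintStep    # dangerous operation
        hops: list[TaintStep] = field(default_factory=list)  # intermediate steps

        def trace(self) -> list[TaintStep]:
            """Full source -> hops -> sink path; every step is traced."""
            return [self.source, *self.hops, self.sink]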

πŸ“‹

Verdict Classification

Findings are classified as VERIFIED, PROOF-BACKED, or NEEDS_REVIEW based on how many proof obligations are satisfied β€” so you know exactly what to triage first.
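
A minimal sketch of that classification, with illustrative thresholds (TraceMint's actual cutoffs may differ):

    def verdict(satisfied: int, total: int) -> str:
        """Classify a finding by its proof-obligation completion."""
        if satisfied == total:
            return "VERIFIED"            # every obligation satisfied
        if satisfied / total >= 0.5:     # illustrative cutoff, not the real one
            return "PROOF-BACKED"
        return "NEEDS_REVIEW"            # lowest confidence: manual triage first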

METRICS

What We Measure

We track multiple metrics to ensure a balanced view of scanner performance.

Recall: % of real vulnerabilities detected out of all known vulnerabilities. Current: 80.4% strict
Precision: % of reported findings that are true vulnerabilities. Target: >85%
FP Rate: False positives per 1,000 lines of code analyzed. Target: <0.5
Coverage: % of vulnerability classes with active detection rules. Current: 25+ categories
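
As a worked example of how these compose (all numbers except the published 80.4% recall are illustrative):

    # Worked example; only the 80.4% recall figure is a published number.
    detected, known = 402, 500          # 402 / 500 = 0.804 strict recall
    true_pos, reported = 90, 100        # precision example
    false_pos, loc = 4, 10_000          # FP-rate example

    recall = detected / known           # 0.804  (80.4% strict)
    precision = true_pos / reported     # 0.90   (target: > 0.85)
    fp_rate = false_pos / (loc / 1000)  # 0.4 FPs per 1K LOC (target: < 0.5)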

Ready to see these results in your codebase?

Start scanning with TraceMint today. See the difference semantic analysis makes.