How it works
Four statistical checks run against your data. Each one looks for a different class of problem.
Benford's Law
In real-world datasets — expense reports, population figures, tax returns — the leading digit "1" appears about 30% of the time. Not 11%. Not evenly distributed. This logarithmic pattern was first noted by astronomer Simon Newcomb in 1881, popularized by physicist Frank Benford in 1938, and is now standard in forensic accounting.
We measure the Mean Absolute Deviation (MAD) between your data's leading-digit frequencies and the expected Benford distribution, then score it against Nigrini's conformity thresholds. The same method the IRS uses.
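The first-digit test can be sketched in a few lines. This is an illustrative version, not the tool's internals; the conformity cutoffs below (0.006 / 0.012 / 0.015) are Nigrini's published first-digit MAD thresholds, assumed here as reasonable defaults.

```python
import math
from collections import Counter

# Benford's expected probability for leading digit d: log10(1 + 1/d).
BENFORD = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

# Nigrini's first-digit MAD conformity bands (assumed from his published
# guidance; a production tool may tune these).
THRESHOLDS = [(0.006, "close conformity"),
              (0.012, "acceptable conformity"),
              (0.015, "marginal conformity")]

def leading_digit(x):
    """First significant digit of a number, or None (e.g. for zero)."""
    s = str(abs(x)).lstrip("0.")
    return int(s[0]) if s and s[0].isdigit() else None

def benford_mad(values):
    """Mean absolute deviation from Benford, plus a conformity label."""
    digits = [d for d in (leading_digit(v) for v in values) if d]
    counts = Counter(digits)
    n = len(digits)
    observed = {d: counts.get(d, 0) / n for d in range(1, 10)}
    mad = sum(abs(observed[d] - BENFORD[d]) for d in range(1, 10)) / 9
    for cutoff, label in THRESHOLDS:
        if mad <= cutoff:
            return mad, label
    return mad, "nonconformity"
```

Powers of 2 are a classic Benford-conforming sequence, so `benford_mad(2 ** n for n in range(500))` lands far closer to the expected distribution than, say, uniformly distributed digits.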
Duplicate detection
Exact-match row hashing catches the obvious copies. We also flag columns where an unusually high percentage of values are identical — which can indicate placeholder data or incomplete records that got duplicated to fill gaps.
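Both checks are simple to express. A minimal sketch, assuming rows arrive as dicts and using an illustrative 50% concentration threshold (the real cutoff is a tool setting, not shown in this document):

```python
import hashlib
from collections import Counter

def duplicate_report(rows, concentration_threshold=0.5):
    """Flag exact duplicate rows and columns dominated by one value.

    rows: non-empty list of dicts (column name -> value). The default
    50% concentration threshold is illustrative, not the tool's setting.
    """
    # Exact-match detection: hash each row's canonical (key-sorted) form.
    seen, dup_rows = set(), []
    for i, row in enumerate(rows):
        h = hashlib.sha256(repr(sorted(row.items())).encode()).hexdigest()
        if h in seen:
            dup_rows.append(i)
        seen.add(h)

    # Column concentration: share of rows holding the most common value.
    flagged_cols = {}
    for col in rows[0]:
        top_count = Counter(r[col] for r in rows).most_common(1)[0][1]
        share = top_count / len(rows)
        if share >= concentration_threshold:
            flagged_cols[col] = share
    return dup_rows, flagged_cols
```

Hashing a key-sorted representation makes the row check insensitive to column order, which matters when the same records arrive from differently shaped exports.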
Outlier analysis
A single outlier method always generates false positives. So we run two — Z-score (parametric, assumes normal-ish data) and IQR (non-parametric, doesn't) — and only raise a flag when both methods independently agree a value is unusual. This cuts the noise substantially.
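The two-method consensus can be sketched like this, using the textbook defaults (|z| > 3 and 1.5 × IQR fences) as assumed cutoffs rather than the tool's actual settings:

```python
import statistics

def consensus_outliers(values, z_cutoff=3.0, iqr_k=1.5):
    """Flag a value only when the Z-score and IQR tests both agree.

    Cutoffs (|z| > 3, fences at 1.5 * IQR) are textbook defaults,
    assumed here for illustration.
    """
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lo, hi = q1 - iqr_k * iqr, q3 + iqr_k * iqr

    flagged = []
    for x in values:
        z_flag = stdev > 0 and abs(x - mean) / stdev > z_cutoff
        iqr_flag = x < lo or x > hi
        if z_flag and iqr_flag:  # both methods must independently agree
            flagged.append(x)
    return flagged
```

A single extreme value inflates the standard deviation, which is exactly why the Z-score test alone under-flags clustered outliers; the IQR fences are immune to that, and requiring agreement keeps each method honest about the other's blind spots.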
Data integrity
The boring but critical stuff. Missing values and their distribution across columns. Numbers accidentally stored as text strings. Date columns with three different formats mixed together. Invisible leading and trailing whitespace that breaks lookups. These are the errors that silently break downstream analysis.
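Each of those checks reduces to a small predicate over a column. A sketch of the idea, assuming string-valued columns; the missing-value tokens, date patterns, and issue labels are illustrative choices, not the tool's:

```python
import re

# Illustrative date patterns; a production tool would recognize more.
DATE_PATTERNS = {
    "iso": re.compile(r"^\d{4}-\d{2}-\d{2}$"),
    "us":  re.compile(r"^\d{1,2}/\d{1,2}/\d{4}$"),
    "eu":  re.compile(r"^\d{1,2}\.\d{1,2}\.\d{4}$"),
}

def integrity_issues(column):
    """Return issue labels for one column of string values (or None)."""
    issues = set()
    # "NA" as a missing-value token is an assumption for this sketch.
    non_missing = [v for v in column if v not in (None, "", "NA")]
    if len(non_missing) < len(column):
        issues.add("missing values")
    if any(v != v.strip() for v in non_missing):
        issues.add("leading/trailing whitespace")
    stripped = [v.strip() for v in non_missing]

    def is_number(s):
        try:
            float(s)
            return True
        except ValueError:
            return False

    # Numbers stored as text: every non-missing value parses as a number.
    if stripped and all(is_number(v) for v in stripped):
        issues.add("numbers stored as text")
    # Mixed date formats: more than one pattern matches across the column.
    formats = {name for v in stripped
               for name, pat in DATE_PATTERNS.items() if pat.match(v)}
    if len(formats) > 1:
        issues.add("mixed date formats")
    return issues
```

Note that "numbers stored as text" only fires when the whole column parses numerically — a column of order IDs with one stray word is a different problem than a numeric column typed as text.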
About your data
Your file is parsed and analyzed in memory on Cloudflare's edge network, and only the results are sent back to your browser. The file content is not written to disk, not logged, not stored anywhere. When the response is delivered, the data is gone.