Once in a while you come across some great content on Netflix. The show Connected presented by Latif Nasser is one such example. In particular, the episode Digits had me wondering whether we have finally discovered a glitch in the simulation.
Benford's Law is beautifully explained in the Netflix show. Maybe "explained" is the wrong word, since the head-scratching begins in earnest after you grasp what it actually means. I'll try and give you a taste.
The Discovery
In 1881, the Canadian-American astronomer Simon Newcomb noticed that the earlier pages of a logarithm book were far better thumbed than the later pages. Why would that be, he wondered. Later, in 1938, the physicist Frank Benford tested the idea on datasets from different domains and found that the first digit of the data points follows a certain distribution. Namely, the first digit is "1" about 30% of the time, "2" about 18% of the time, and so on.
If your eyebrows haven't furrowed yet in a way that signifies maximum puzzlement, read and think about the prior paragraph till they do! Why would it be that across unrelated datasets (surface areas of rivers, sizes of US cities, physical constants in nature, numbers circled in different newspaper stories, etc.), where the data is supposed to be randomly distributed, the first digit is "1" about 30% of the time, "2" about 18%, "3" about 13%, and so on? Intuitively, if the digits were random, we would expect 1 to show up as often as 9 (i.e. about 11% of the time each). This is spooky. Hence my calling this 'law' a spell, and now I'd like to take credit for discovering the strongest evidence yet that we do, in fact, live in a simulation.
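The percentages quoted above come from a simple formula: the expected share of values with leading digit d is log10(1 + 1/d). A minimal sketch that prints the whole table (the function name benford_probability is just an illustrative choice):

```python
import math

def benford_probability(d: int) -> float:
    """Expected share of values whose leading digit is d (1 through 9)."""
    return math.log10(1 + 1 / d)

# Build the full expected distribution for digits 1..9.
expected = {d: benford_probability(d) for d in range(1, 10)}

for d, p in expected.items():
    print(f"{d}: {p:.1%}")
```

The shares work out to roughly 30.1% for "1", 17.6% for "2", 12.5% for "3", and only about 4.6% for "9", and they add up to 100%.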
Fighting Fraud with Mathematics
Of course, we don't need to understand why this happens in order to make use of it. Benford's Law has been used to fight accounting fraud and election fraud, has been used in the IRS audit process, and has been admitted as evidence in criminal cases in the United States.
What about healthcare data? That was my immediate thought before the episode even finished. The chart below shows the distribution of the first digit of paid amounts for prescription drugs in a healthcare dataset. The blue dots are from the healthcare data, and the orange ones are from Benford's predicted distribution. They essentially match! The error bars are a sort of 95% confidence interval, since this analysis spans a number of different sources of data.
[Note: Chart showing distribution comparison would be displayed here]
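For readers who want to try this on their own data, here is a minimal sketch of the comparison behind a chart like this one. It is not the analysis used for the chart: the paid_amounts list is made up for illustration, and the helper names first_digit and first_digit_shares are hypothetical.

```python
from collections import Counter
from decimal import Decimal
import math

def first_digit(amount: float) -> int:
    """Return the leading non-zero digit of a non-zero amount (sign ignored)."""
    # Decimal stores digits without leading zeros, so 0.05 -> (5,)
    # and 123.45 -> (1, 2, 3, 4, 5).
    return Decimal(str(abs(amount))).as_tuple().digits[0]

def first_digit_shares(amounts):
    """Observed share of each leading digit 1-9, skipping zero amounts."""
    counts = Counter(first_digit(a) for a in amounts if a != 0)
    total = sum(counts.values())
    return {d: counts.get(d, 0) / total for d in range(1, 10)}

# Hypothetical paid amounts, for illustration only (not real claims data).
paid_amounts = [12.50, 104.99, 18.00, 230.10, 1.75, 36.40, 9.99, 150.00, 0.85, 47.25]

observed = first_digit_shares(paid_amounts)
for d in range(1, 10):
    benford = math.log10(1 + 1 / d)  # Benford's predicted share
    print(f"{d}: observed {observed[d]:.0%}, Benford {benford:.1%}")
```

Ten amounts are far too few to say anything. The point is only to show the mechanics of pulling out the first digit and lining it up against the predicted share.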
Healthcare Fraud Detection Applications
So what can this tell us about fraud? I haven't gone into technical details in this brief write-up. Going deeper means analyzing the minimum number of claims and members needed to get stable Benford-like curves, deriving the curves and confidence intervals by type and site of service (some curves are different from Benford's!) across vast datasets, analyzing the distribution of digits beyond the first one, and then using all of these to identify facilities, providers, etc. that are outliers.
Fraudulent transactions will typically cause a significant deviation from the Benford's law variant that has been calculated from relevant data. This is not proof of fraudulent activity, but it does provide a very useful way to winnow lists for further investigation. I know that many in my network work at health plans, and I would love to hear from you about your experiments with these ideas!
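One simple way to turn "significant deviation" into a number is a chi-square goodness-of-fit statistic against a baseline curve. The sketch below is one possible choice, not the method described above: the provider names and digit counts are invented, the baseline defaults to the textbook Benford curve (an empirically derived curve could be passed in instead), and the cutoff is the standard 5% critical value for 8 degrees of freedom.

```python
import math

# Textbook Benford shares; swap in an empirical baseline if you have one.
BENFORD = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

# Chi-square critical value for 8 degrees of freedom at the 5% level.
CHI2_CRITICAL_DF8_P05 = 15.507

def chi_square_vs_baseline(digit_counts, baseline=BENFORD):
    """Chi-square statistic of observed first-digit counts against a baseline."""
    total = sum(digit_counts.get(d, 0) for d in range(1, 10))
    if total == 0:
        raise ValueError("no observations to compare")
    stat = 0.0
    for d in range(1, 10):
        expected = baseline[d] * total
        observed = digit_counts.get(d, 0)
        stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical first-digit counts per provider (1,000 claims each).
providers = {
    "provider_a": {1: 301, 2: 176, 3: 125, 4: 97, 5: 79, 6: 67, 7: 58, 8: 51, 9: 46},
    "provider_b": {1: 110, 2: 110, 3: 110, 4: 110, 5: 110, 6: 110, 7: 110, 8: 110, 9: 120},
}

# Providers whose digits stray too far from the baseline go on the review list.
flagged = [
    name for name, counts in providers.items()
    if chi_square_vs_baseline(counts) > CHI2_CRITICAL_DF8_P05
]
print(flagged)
```

Here provider_a tracks the baseline closely while provider_b's nearly flat digit distribution gets flagged. As noted above, a flag is a reason to look closer, not proof of anything.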
The Science Behind the Spell
So why does Benford's law hold? We humans tend to assign numerical values to measurements (e.g. surface areas of rivers, distances, etc.) in a way that makes them uniformly distributed on a logarithmic scale rather than on a linear one. There are more little observations than big ones. We prefer a counting scheme where the amount of data between 1 and 2 is about the same as that between 10 and 20, and so on. On such a scale, the stretch from 1 to 2 takes up about 30% of each factor-of-ten interval, which is where the 30% for a leading "1" comes from. This isn't the most satisfying explanation... but then again, do we really understand the things we do?
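A quick simulation makes the log-scale argument concrete. This is a sketch under assumed ranges (values between 1 and 1,000,000, with a fixed random seed): one sample is spread evenly on a logarithmic scale, the other evenly on a linear scale.

```python
import random
from collections import Counter

random.seed(0)  # fixed seed so the run is repeatable
N = 100_000

def leading_digit_shares(values):
    # Every value here is >= 1 and far below 1e16, so str() never uses
    # exponent notation and its first character is the leading digit.
    counts = Counter(int(str(v)[0]) for v in values)
    return {d: counts.get(d, 0) / len(values) for d in range(1, 10)}

# Uniform on a log scale: the exponent is uniform between 0 and 6.
log_uniform = [10 ** random.uniform(0, 6) for _ in range(N)]
# Uniform on a linear scale over the same range.
linear_uniform = [random.uniform(1, 1_000_000) for _ in range(N)]

log_shares = leading_digit_shares(log_uniform)
linear_shares = leading_digit_shares(linear_uniform)

for d in range(1, 10):
    print(f"{d}: log-uniform {log_shares[d]:.1%}, linear-uniform {linear_shares[d]:.1%}")
```

The log-uniform sample lands close to the Benford shares, with "1" leading about 30% of the time, while the linear-uniform sample gives each digit roughly the 11% our intuition expects.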
Limitations and Considerations
Finally, Benford's law isn't really a law. It doesn't always hold. Consider, for example, height data on a population: the values sit in a narrow range, so the first digit barely varies. However, given the right statistical circumstances (lots of data spanning several orders of magnitude) and some ingenuity with data transforms, you can gain considerable insight with this weird, strange, mathemagical spell.
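The height example is easy to check with made-up numbers. The sketch below assumes, purely for illustration, that adult heights in centimeters follow a normal distribution with mean 170 and standard deviation 10. These are not measured figures.

```python
import random
from collections import Counter

random.seed(1)  # fixed seed so the run is repeatable

# Assumed, illustrative model: heights in cm ~ Normal(mean=170, sd=10).
heights_cm = [random.gauss(170, 10) for _ in range(10_000)]

# Heights here are ordinary positive floats, so the first character
# of str() is the leading digit.
counts = Counter(int(str(h)[0]) for h in heights_cm)
share_of_ones = counts[1] / len(heights_cm)

print(f"share of heights starting with 1: {share_of_ones:.1%}")
```

Almost every simulated height falls between 100 and 199 cm, so nearly all of them start with "1". That is nowhere near the Benford curve, because the data covers much less than one order of magnitude.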
Of course I'm not the first to think along these lines. Some quick searching yielded good papers on the topic, which explore various applications of Benford's Law in healthcare fraud detection and provide deeper technical analysis of the methodology.
Have you experimented with Benford's Law in your fraud detection work? What patterns have you discovered in your healthcare data? I'd love to hear about your experiences and connect on LinkedIn.