To understand today’s proteomics, imagine a Gold Rush that began with top earth scientists spending millions to map the landscape and prospect for treasure, only to find fool’s gold (false positives).
Self-doubt sets in; others leave for greener pastures. The problem? They dug 100x too shallow and barely scratched the surface. Instead of industrial excavators with skilled operators (a platform with data scientists), they saved money with cheap gadgets or brittle homemade machines.
We earlier explained how “LOD Proteomics” offers the ultimate sensitivity for deep protein research [click here]. We now explain how today’s methodologies don’t meet a minimum standard for science.
Once a dirty word in science, irreproducibility became normalized when peer review allowed unverifiable, often-secret formulas in publications. One root problem is labs conflating quality with quantity and embracing “fast-and-cheap” PC programs that maximize IDs. In turn, academic programmers use aggressive “p-hacking” statistics [click here] to boost IDs and chase citations. This is kosher in research as long as everyone understands the asterisks. The problem is, not all users do.
In fact, the software rule of thumb (“Good, fast, and cheap — pick any two”) implies that, as with restaurants and neurosurgeons, fast-and-cheap usually means no good. It correctly predicts that, like clockwork, every trendy PC program fades within a few years. The issue is fundamental and not fixable with a software update. Another rule of thumb applies: “There’s no free lunch.”
Given the central role of proteins, this obvious information gap opens up possibly one of the greatest opportunities in science and medicine.
Irreproducibility caused by unfiltered noisy calculations
Any automated calculation with noisy data is ambiguous, period. In robust sciences, labs validate software calculations and discard anything questionable. But most proteomics labs publish them unfiltered, even from noisy co-eluted data.
Such analyses contain mostly correct, easy answers interspersed with semi-random results, all lumped together under a 1% FDR. Even though only a few percent may be suspect, those tend to be the most important. To biology PIs, the fact that individual important results are untrustworthy makes proteomics itself look irreproducible.
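To see the scale of the problem, here is a back-of-envelope sketch. All numbers (run size, count of “novel” hits) are hypothetical, chosen only to show how a seemingly small 1% global FDR can swamp the small subset of interesting identifications:

```python
# Back-of-envelope sketch (all numbers hypothetical): why "only 1% FDR"
# can still undermine the results that matter most.

def expected_false_ids(total_ids, fdr):
    """Expected number of false identifications at a given global FDR."""
    return total_ids * fdr

total_ids = 50_000   # peptide IDs reported in one run (assumed)
fdr = 0.01           # the customary 1% FDR threshold
novel_ids = 500      # the biologically interesting, low-abundance hits (assumed)

false_ids = expected_false_ids(total_ids, fdr)
print(false_ids)     # 500.0 -- as many expected errors as novel hits

# Errors are rarely spread evenly: they concentrate among low-scoring,
# novel identifications. If most of the 500 expected errors land in the
# 500-member "interesting" subset, its local error rate approaches 100%.
```

In other words, a global 1% FDR says little about the local error rate of the handful of results a biologist actually cares about.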
PeptideProphet was early robust (if imprecise) software that allowed a visual check of true-vs-false data distributions. We found that perhaps one in ten datasets did not look right, suggesting something subtle might be wrong. Without that transparency, there would be no way to tell.
Many popular PC programs are opaque, fully-integrated academic software designed to help beginners publish quickly and to elicit citations. But lack of transparency makes validation practically impossible. Their popularity generally coincides with the rise of the irreproducibility crisis, which supports our hypothesis.
1% FDR workflows have low precision
Re-read papers carefully and you will see that today’s data-analysis workflows don’t actually analyze ‘data’ (i.e., m/z) per se, but rather calculate a hodgepodge of subjective scores and statistics.
Typically, high-accuracy m/z data are first transformed into peptide sequences. All subsequent analyses are sequence-based, including probabilities/p-values and FDR/q-values. Like chemical transformations, each mathematical transformation increases entropy, so each step loses precision. In other words, pricey data accuracy is squandered by superficial statistical analysis.
As an analogy, there are two ways to calculate taxes for General Motors. The only one acceptable to the IRS is a precise bottom-up calculation that takes months and costs millions with professional accountants. In contrast, a top-down statistical model can be fast-and-cheap and quickly programmed by a self-taught amateur — not the same thing. Pity the inexperienced CFO who can’t tell the difference and files taxes with the “best” model defined as the lowest payment.
Most proteomics labs lack tech backgrounds but distrust “IT people,” and they commonly choose the “best” software, defined as whatever maximizes IDs.
How proteomics p-hacking works
Consider astronomy, where >99.9% of telescope photons come from stars. Therefore, to maximize celestial IDs while allowing up to 1% error, just call everything a star. Also point the telescope toward an airport to catch some night flights. Calling any dim twinkle a star is statistically likely to be right, but that conclusion comes mainly from meta-information, not data.
Similarly in proteomics: when you analyze yeast proteins, >99.9% of peptides are yeast. Therefore, to maximize peptide IDs while allowing 1% FDR, one can database-search against yeast peptides and then exploit meta-statistics to identify them from very few fragments. The calls are likely to be statistically correct, but that’s not data-driven science.
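The trick above can be made concrete with a minimal sketch (the 99.9% prior is the article’s own figure; everything else is a hypothetical illustration). A “classifier” that never looks at the fragment data and simply answers “yes, it’s yeast” already clears the 1% FDR bar, purely on the strength of the prior:

```python
# Minimal sketch: the prior alone can beat a 1% FDR threshold.
# If 99.9% of observed peptides really are yeast, calling *every* spectrum
# "yeast" is wrong only 0.1% of the time -- well under 1% -- even though
# this "identifier" never examines a single fragment ion.

def error_rate_of_always_yes(prior_true):
    """Error rate of a classifier that ignores the data and always says yes."""
    return 1.0 - prior_true

prior_yeast = 0.999    # fraction of peptides that truly are yeast (from the text)
fdr_threshold = 0.01   # customary 1% FDR

err = error_rate_of_always_yes(prior_yeast)
print(err <= fdr_threshold)   # True: passes 1% FDR with zero spectral evidence
```

The point is not that real search engines literally do this, but that any scoring scheme leaning heavily on such priors can report impressive ID counts while contributing little data-driven evidence per identification.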
Data overload confuses via analysis paralysis
Think of the information revolution as continually producing 2x the information in 10x more data. It takes computing to extract information from among useless, bad, or contradictory data. Untrained humans tend to draw untrue conclusions from spurious patterns, then cherry-pick supporting evidence.
In politics, this causes division. The so-called “death of expertise” has people distrusting scientists and experts. Filling the vacuum is fake news propagated through social media. “Facts” lose objectivity depending on your echo chamber.
Interestingly, today’s proteomics is stalled by the same dynamics. Biologists trust equations and software from peers, not from math or tech experts. Biased hearsay is propagated by peer review as science, often from just one analysis of two datasets. The solution will come not from peers but from data experts.
“In God we trust; all others bring data”
Statistical problems are notoriously difficult to debug. Bad statistics contributed to the 2008 Great Recession and to the 1998 collapse of LTCM, a hedge fund with two Economics Nobel laureates among its partners. Ad hoc statistics is holding back proteomics by shielding scientists from their raw data.
LOD Proteomics is a statistics-free, data-driven foundation that restores proteomics as a robust, ultra-sensitive analytical science.
For more information, please contact Ms. Terri Nowak [TNowak@SageNResearch.com].