Proteomics would be ten times more successful if results were precise and trustworthy instead of imprecise or irreproducible. In earlier posts we explained precise raw-spectra analysis using a multi-dimensional methodology. Now we mathematically explain three main causes of irreproducibility, all proliferated by popular PC programs: 1% FDR, probability scores, and machine learning.
Science is by design reproducible, so irreproducibility implies non-science. In other words, hypotheses are not rigorously tested against data.
For proteomics, that means peptide ID hypotheses (i.e. PSMs from a search engine) are not rigorously accepted/rejected using m/z data. Since workflows other than SorcererScore™ use a statistical model (e.g. PeptideProphet), the cause must be flawed statistics in the discriminator module. Incorrect peptide IDs corrupt all aspects of protein analysis.
Older data analysis methods like PeptideProphet, if used correctly, give reproducible results that are robust if imprecise, accurate to perhaps 1%.
Most problematic are very popular PC programs that implicitly exploit meta-statistics to inflate IDs from flimsy m/z evidence. Non-rigorous algorithms can be faster and report more IDs than rigorous software. Users report that small parameter tweaks can drastically change the outputs.
Note that deep physical-science research uses powerful servers to analyze noisy raw data. Clearly, astronomy and nuclear physics labs don’t hunt for distant planets or subatomic particles with 1% FDR PC software. Identifying elusive objects, whether Jupiter moons or protein biomarkers, requires discovering the needle-in-a-haystack raw data that support their existence.
1% FDR: too loose, hides flaws
Prevalent 1% FDR workflows are clearly inadequate beyond roughly 100 abundant proteins, rendering biomarker discovery futile. In fact, the concept of false discovery rate (FDR) is arguably an unsuitable metric in a practically infinite science where candidate objects outnumber the data.
Think of it this way: In astronomy, the fastest algorithm that reports the most IDs at <1% FDR is the dumbest one — just call every single twinkle a star (even planets and comets)!
When shooting fish in a barrel, like when we search clean yeast spectra against yeast proteins with a narrow tolerance, even insight-free algorithms can give 99% correct answers by exploiting meta-statistics alone.
In other words, 1% FDR gives a wide berth for sloppy statistics to thrive.
Probability scores are oversimplified, non-rigorous
The p-value is a well-understood measure of rarity in the context of a random background.
As a thought experiment, let’s say a mass spectrometer can measure a fragment m/z to 1 ppm accuracy. Does that mean any fragment match has a p-value of one-in-a-million or 0.0001%?
No, because m/z’s of random fragments are far from uniformly distributed.
For one thing, a combinatorially large number of peptide sequences share the same fragment mass. In a general peptide, there are 2 isomers for y2 (“ab” and “ba”), 6 for y3 (all permutations of a, b, c), and roughly N! (N factorial, if no residues repeat) for an N-residue y-ion. Additionally, natural peptide sequences do not follow predictable permutation statistics.
In short, fragment m/z p-values are not conducive to simple significance formulas. Both permutational combinatorics and unstructured natural sequence variations make that intractable.
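To make the combinatorics concrete, here is a minimal Python sketch; the fragment “LSGK” and the small residue table are hypothetical choices for illustration only. Every reordering of the same residues produces exactly the same y-ion mass, so even a 1-ppm instrument cannot tell them apart:

```python
# Sketch: count how many distinct residue orderings share one y-ion mass.
# Residue masses are standard monoisotopic values; the fragment "LSGK"
# is a hypothetical example chosen only for illustration.
from itertools import permutations

RESIDUE_MASS = {  # monoisotopic residue masses (Da)
    "G": 57.02146, "A": 71.03711, "S": 87.03203,
    "L": 113.08406, "K": 128.09496, "Q": 128.05858,
}

def y_ion_mass(seq: str) -> float:
    """Singly protonated y-ion mass: sum of residue masses + H2O + proton."""
    return sum(RESIDUE_MASS[r] for r in seq) + 18.01056 + 1.00728

fragment = "LSGK"                                 # hypothetical y4 fragment
isomers = {"".join(p) for p in permutations(fragment)}

print(f"y4 mass {y_ion_mass(fragment):.5f} Da is shared by {len(isomers)} orderings")
# -> 24 orderings (4!) with the identical mass, before even counting
#    near-isobaric residues such as K (128.095) vs Q (128.059), which
#    collapse together at looser tolerances.
```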
The “probability score” is a proteomics invention that simply models the significance of each fragment match with a constant “probability” (typically 5% to 10%). For example, the “binomial score” of M matches out of N possible may be defined as the probability of getting M heads from N flips of a 5%/95% loaded coin. A match is typically counted within a ~0.5 m/z tolerance.
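Here is a minimal sketch of such a generic binomial-style score, assuming the constant-probability model described above; the match counts and the candidate values of p are illustrative, not any particular search engine’s defaults. Notice how the “significance” swings by orders of magnitude with the assumed constant:

```python
# Sketch of a generic "binomial score" as described above: the chance of
# getting at least M fragment matches out of N possible, assuming every
# fragment matches a random spectrum with the same fixed probability p.
# The values of p are illustrative; no specific search engine is implied.
from math import comb

def binomial_score(m: int, n: int, p: float) -> float:
    """P(at least m matches out of n) with a constant per-fragment match prob p."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(m, n + 1))

n_possible, m_matched = 20, 8      # e.g. 8 of 20 predicted fragments found

for p in (0.05, 0.10, 0.15):
    print(f"p = {p:.2f}:  score = {binomial_score(m_matched, n_possible, p):.1e}")

# The "significance" swings by roughly three orders of magnitude
# (about 3e-6 at p = 0.05 vs 6e-3 at p = 0.15), driven entirely by the
# assumed constant rather than by the actual m/z background statistics.
```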
Mathematically, a probability score is just a p-value knockoff without any statistical rigor. Worse, it is dangerously easy to “validate” with ad hoc tests on simple spiked mixtures.
Just as a broken clock tells the correct time twice a day, modeling a factorial function with a constant has a narrow range of correctness. Here, the score is mostly on-target for obvious “Yes” (many possible fragments match) and “No” (almost none do), but it becomes a twilight zone in-between where reality blends into wishful thinking. Again, this is because the assumed constant has no connection to the underlying statistics of a particular fragment m/z.
Machine learning = complex overfit
Machine learning (ML) is a hot algorithmic technology that makes sophisticated models practical. Mathematical models like PeptideProphet used to require human experts to fit them; ML instead derives the parameters implicitly from “yes” and “no” training data. So almost anyone can build complex models in minutes.
But the question remains whether a very complex model is the right approach for peptide ID discrimination.
A peptide ID hypothesis is likely correct to the extent the raw m/z data fit it exceptionally well in terms of number and delta-m/z, period.
In my view, the Percolator paper is one of this field’s most innovative; it triggered ideas that led directly to SorcererScore, so it was very successful research.
But we found it is fundamentally non-rigorous to optimize FDR, a weak statistic with wide variance, over 20 dimensions of mostly meta-data. With too little hard data and far too much meta-data, the net result is an overfitted model with an under-estimated FDR. (The most robust Support Vector Machine applications use only a handful of dimensions.)
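As a toy illustration of that overfitting risk (not Percolator’s actual algorithm), the sketch below fits a linear discriminant to 20 pure-noise “meta-data” features; the sample sizes and the 60% score cutoff are arbitrary assumptions. Even with zero real signal, the training set yields a flatteringly low decoy-based FDR estimate that disappears on fresh data:

```python
# Toy sketch of the overfitting risk: fit a linear discriminant on 20
# pure-noise "meta-data" features and watch it still separate targets
# from decoys on the training set. Sample sizes and the score cutoff
# are arbitrary assumptions for illustration.
import numpy as np

rng = np.random.default_rng(0)
N_PSMS, N_FEATURES = 200, 20

def simulate(n):
    """PSMs with uninformative random features; alternating target (+1) / decoy (-1)."""
    X = rng.normal(size=(n, N_FEATURES))
    y = np.where(np.arange(n) % 2 == 0, 1.0, -1.0)
    return X, y

def estimated_fdr(scores, y, threshold):
    """Standard decoy/target ratio among PSMs scoring above the threshold."""
    kept = scores >= threshold
    targets = np.sum(kept & (y > 0))
    decoys = np.sum(kept & (y < 0))
    return decoys / max(int(targets), 1)

X_train, y_train = simulate(N_PSMS)
w, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)   # least-squares discriminant

threshold = np.quantile(X_train @ w, 0.6)                # keep top ~40% of scores
X_new, y_new = simulate(N_PSMS)                          # fresh data, same noise model

print("training-set estimated FDR:", round(float(estimated_fdr(X_train @ w, y_train, threshold)), 2))
print("fresh-data estimated FDR:  ", round(float(estimated_fdr(X_new @ w, y_new, threshold)), 2))
# With zero real signal, the honest answer is ~1.0 (decoys roughly equal
# targets at any cutoff); the training set reports something flatteringly smaller.
```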
The FDR (after decoys are removed) is typically estimated as the decoy/target ratio, which assumes the number of wrong targets exactly equals the decoy count. For example, if 90 two-headed coins [target-only] are tossed with 20 fair coins [50/50 target/decoy], on average we expect 100 heads (10 from fair coins) and 10 tails. Here FDR estimation is perfect: FDR = 10/100 = 10%. But occasionally we get 102 heads (12 from fair coins) and 8 tails, which underestimates FDR as 8/102 = 7.8%. If we try very hard, it is even possible to get 109 heads and 1 tail, or FDR < 1%.
In other words, if you can repeatedly sample a random variable with high variance, you can cherry-pick a far-off estimate. A 3 GHz PC can run enough iterations to make a >10% FDR look like 1%.
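A short Monte Carlo sketch of the coin example above makes this concrete; the trial count and random seed are arbitrary. Averaged over all trials the decoy-based estimate sits near the true ~10%, but the luckiest single trial looks far better:

```python
# Monte Carlo sketch of the coin example above: 90 two-headed coins
# (correct targets) plus 20 fair coins (wrong IDs landing on target or
# decoy with equal chance). The trial count and seed are arbitrary.
import random

random.seed(1)
N_CORRECT, N_WRONG, TRIALS = 90, 20, 100_000

best_estimate, avg_estimate = 1.0, 0.0
for _ in range(TRIALS):
    wrong_targets = sum(random.random() < 0.5 for _ in range(N_WRONG))  # "heads"
    decoys = N_WRONG - wrong_targets                                    # "tails"
    targets = N_CORRECT + wrong_targets
    estimate = decoys / targets                    # standard decoy/target FDR
    avg_estimate += estimate / TRIALS
    best_estimate = min(best_estimate, estimate)

print(f"average estimated FDR: {avg_estimate:.1%}")   # hovers near the true ~10%
print(f"luckiest single trial: {best_estimate:.1%}")  # far below 10%, typically ~1-2%
```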
Note that casual users won’t see the problem. Clean spectra would be unaffected because ID inflation arises only from ambiguous cases. Real-world data, however, would show inflated ID counts with significantly under-estimated FDRs.
Proteomics is not genomics
Implicit in many ‘omics fields is the misguided adoption of the genomics paradigm of statistics-based science. But beyond the superficial, there is little commonality between the fluorescent-labeled primary-sequence data of genomics and the high-accuracy m/z data of proteomics and metabolomics.
Many labs confuse non-rigorous research algorithm prototypes with robust research software. Once such a prototype becomes popular, it seems to live forever. The proliferation of non-rigorous software confuses researchers and stalls the field.
This is a unique opportunity for breakout success with precise reproducible proteomics.