You can’t solve Sudoku with statistics. Same with proteomic peptide identification, it turns out.
Current workflows contain a subtle math bug: they apply statistical models too early, which makes the informational foundation (the peptide IDs), and hence all downstream results, imprecise. Conventional thinking confuses “information” with “precision” (analogous to “energy” vs. “entropy”), and ad hoc improvements make things worse. Software like Percolator combines many kinds of metadata to increase overall information. Counterintuitively, this actually hurts individual precision: you get a better estimate of the total number of correct IDs, but less certainty about any one peptide ID.
Proteomic disease research targets specific low-abundance proteins for early detection. This means uncovering rare one-in-a-million spectra with poor signal-to-noise — aka data-mining. In contrast, conventional workflows are optimized for global statistics (e.g. <1% false positives), which is meaningless for such analyses, where >99% of results are irrelevant.
SorcererScore™ is the breakthrough that fixes the imprecision bug in one robust workflow (“TPP+XCorr”) to deliver Precision Proteomics. By replacing complex metadata-based statistical models with simple m/z arithmetic, it preserves the precision of raw data for discriminating correct peptide IDs among search results.
There is a tradeoff: it requires a powerful server to supply the “digital solvent” that scrubs deep, dirty data, consistent with other big-data fields.
Information science is often trivialized as learning to program, akin to equating linguistics with grammar. “What do you really know?” and “How do you know it?” are not trivial questions when inferring biochemistry from a billion mass/charge ratios (m/z), most of which are irrelevant or random. Algorithms to statistically blend hard data with soft metadata have devolved into a pile of inscrutable equations where no one can be certain of anything.
Mathematically there is in essence one precise base methodology but countless imprecise ones that can be significantly faster and cheaper. Many labs don’t understand the difference and choose software that is “fast and cheap” — attributes better for a haircut than brain surgery. The fallacy is conflating a well-defined, easy problem (shallow analysis) with an open-ended, noise-challenged problem (deep analysis).
With SorcererScore, the “precision” genie is now out of the bottle to create new winners and losers in biomedical research — a Revolution!
Because proteins are the main active biomolecules in life and disease, Precision Proteomics of low-abundance proteins is clearly a very big deal.
Here’s the thing about revolutions: contrary to the popular saying, a rising tide from a technology paradigm shift is a tidal wave that floats the best 20% of boats and sinks the rest. The 80/20 rule suggests some 80% of labs and researchers are sitting ducks. Understanding and action help protect careers against fowl play.
Why low-abundance is a fundamentally new paradigm
To understand low-abundance analysis, look to the stars!
Conventional proteomics finds already-anticipated abundant proteins, akin to finding the Big Dipper among known constellations in the night sky. Not valuable.
In contrast, Precision Proteomics focuses on low-abundance proteins, akin to discovering new stars from faint photons. This is invaluable but hard due to poor signal-to-noise.
There are perhaps 100 billion trillion observable stars, yet it’s hard to identify a single new one. The problem is that any distant star delivers only a few faint photons to your tiny detector on tiny Earth. But the sheer number of stars means every image of the night sky is full of mostly different twinkles.
That’s the analogy for identifying low-abundance peptides with mass spectrometry.
There is experimental evidence that detectable low-abundance peptides are everywhere (click here). However, most exist for a fraction of a second, show up in only one mass spec scan (either ‘MS1’ or ‘MS2’), and typically yield poor ‘MS2’ fragmentation. So it becomes a numbers game in terms of mining the data for the few “Goldilocks” spectra that happen to be low-abundance yet whose ‘MS2’ is informationally analyzable.
We expect that low-abundance peptides tend to implicate single-peptide proteins (“one-hit wonders”), may appear in only one of several repeated experiments, and have low search engine scores.
Ironically, conventional workflows use this same metadata as rejection criteria. Many accept only multi-hit protein IDs or only peptides/proteins repeated across replicates, and define a PSM discriminant score dominated by the search engine score.
That’s why low-abundance proteins never show up in conventional experiments!
Even for data-independent acquisition (DIA) mass spectrometry, our new paradigm suggests optimizing to identify a single peptide ID, which would likely be a low-abundance one. Instead, conventional DIA wisdom tries to identify two co-eluting peptides, a difficult problem that yields only abundant peptides.
Precision means avoiding metadata
Metadata are seductive because they provide quick-and-dirty answers to easy questions (shallow analysis), but they are misleading for tricky questions (deep analysis).
For example, I knew two younger students from 1970s New York who became a math professor and a software engineer. That metadata paints a quick-and-dirty picture of their math aptitude. More metadata reveal a fuller picture: both were International Math Olympiad gold medalists and Putnam Fellows, the latter meaning they ranked in the top 5 in the US. In fact, one became the youngest Harvard full professor at 26. For this easy case, the metadata suggest they are both math prodigies, a conclusion that would take inordinate effort to reach from primary data alone (i.e. reading their theorems and proofs).
But metadata can be misleading for diamonds-in-the-rough situations. One gifted but disillusioned friend dropped out of college and became a courier. You can’t tell his gift without reading his work.
When discriminating correct peptide IDs from random PSMs, metadata like search scores allow simple yes/no answers for the vast majority of PSMs. But the most valuable research deals with poor signal-to-noise spectra, where metadata are misleading and must be avoided.
The theoretical question then becomes whether it is possible to use only hard information (delta-mass) to discriminate correct IDs vs. random PSMs. In other words, is proteomics mass spectrometry a hard or soft science?
Two counterintuitive discoveries are the foundation of SorcererScore. First, precise peptide IDs can be discriminated with only hard information (delta-mass). And second, soft metadata actually hurt precision and must be avoided.
First, each peptide-spectrum match (PSM) can be mapped to 3 physical parameters (delta-mass, fragment delta-mass, fragment peak-count), i.e. a point in 3D space. These points largely self-aggregate into two mostly distinct regions: correct IDs form a tall, narrow column, while random PSMs form a low, wide blob. Low-abundance peptides reside in the intersection, which can be thinned out with a wide mass-tolerant search. (Click here for a similar 3D plot [Figure 1] that illustrates the point.)
The correct-ID and random-PSM populations can generally be separated by a plane in 3D space, so the discriminant score can be defined as the signed distance to that plane.
In other words, it is possible to use only delta-mass information to discriminate correct IDs among PSMs. The interpretation is purely geometric, without any statistical modeling. Delta-mass information comprises core physical invariants that are independent of search engines, search conditions, etc. (Even de novo PSMs can be analyzed this way, for example by superimposing them on search PSMs.)
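To make the geometry concrete, here is a minimal Python sketch of the idea, not the actual SorcererScore implementation: each PSM becomes a 3D point of hard physical parameters, and its discriminant score is the signed distance to an assumed separating plane. The field names, plane coefficients, and example numbers are hypothetical placeholders; a real workflow would presumably scale the three coordinates and derive the plane from the PSM cloud itself.

```python
# Minimal sketch of the geometric discriminant (hypothetical names and numbers,
# not the actual SorcererScore implementation).
from dataclasses import dataclass

import numpy as np


@dataclass
class PSM:
    precursor_delta_mass: float  # observed minus theoretical precursor mass (Da)
    fragment_delta_mass: float   # mean absolute fragment m/z error (Da)
    fragment_peak_count: int     # number of matched fragment peaks


def to_point(psm: PSM) -> np.ndarray:
    """Represent a PSM as a point in the 3D 'hard data' space."""
    return np.array([psm.precursor_delta_mass,
                     psm.fragment_delta_mass,
                     float(psm.fragment_peak_count)])


def discriminant(psm: PSM, normal: np.ndarray, offset: float) -> float:
    """Signed distance from the PSM's 3D point to the plane n·x = offset.
    Positive values fall on the 'correct ID' side, negative on the 'random PSM' side."""
    return float((normal @ to_point(psm) - offset) / np.linalg.norm(normal))


# Example with made-up numbers: a tight-delta-mass, peak-rich PSM lands on the
# positive side of the (hypothetical) plane, while a sloppy, peak-poor one does not.
plane_normal = np.array([-0.8, -0.5, 0.1])
plane_offset = 0.5
good = PSM(precursor_delta_mass=0.002, fragment_delta_mass=0.01, fragment_peak_count=14)
poor = PSM(precursor_delta_mass=0.8, fragment_delta_mass=0.3, fragment_peak_count=4)
print(discriminant(good, plane_normal, plane_offset))  # positive: correct-ID side
print(discriminant(poor, plane_normal, plane_offset))  # negative: random-PSM side
```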
Why the discriminant score should not use search scores
Here we define metadata as anything that is not an m/z or delta-mass value. Importantly, this includes the search engine score, which is a subjective figure-of-merit for a PSM.
Most people think the search score is critical information, but that’s not really true. All the critical information is already captured by the delta-masses (again, the only hard information), so the score adds no new hard information. On the negative side, it embeds subtle model artifacts and assumptions that skew the analysis in non-obvious ways.
If you think about it, metadata are by definition non-hard data used to influence the analysis. This means they necessarily hurt the intrinsic precision of the raw data, so they need to be omitted from the discriminant score, at least in the first step of the workflow. (Subsequent steps can benefit from metadata, just not the first one.)
Mathematically, one can visualize metadata as a distortion that nudges each PSM’s physical 3D point either toward or away from the separating plane. Solid ‘yes’ and ‘no’ PSMs are minimally influenced, so the main effect falls on poor signal-to-noise spectra. And that’s the problem.
On average, the asymmetric distortions do more good than harm, so global statistics become sharper overall.
Unfortunately, the individual distortions are essentially semi-random, which makes individual PSMs less precise in hard-to-predict ways.
Therefore, the best approach is to omit metadata, so the discriminant score is calculated only from hard data, and to let human experts judge marginal PSMs on a case-by-case basis. In effect, the metadata are applied explicitly by a human rather than implicitly by a formula. This is why the best data-mining is done semi-manually by a computer-assisted human expert.
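To illustrate the point, here is a toy Monte Carlo sketch with entirely invented distributions and parameters (it models no real tool, and the numbers carry no proteomics meaning): a metadata nudge that helps on average makes the global statistics slightly sharper, yet flips the verdicts of marginal PSMs in ways that are hard to predict for any individual case.

```python
# Toy Monte Carlo sketch of the metadata-distortion argument. All distributions
# and parameters are invented for illustration; this does not model any real tool.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hard-data discriminant: correct IDs cluster on the positive side, random PSMs on the negative.
hard = np.concatenate([rng.normal(+1.5, 1.0, n),   # correct IDs
                       rng.normal(-1.5, 1.0, n)])  # random PSMs
is_correct = np.concatenate([np.ones(n, bool), np.zeros(n, bool)])

# Metadata nudge: biased in the right direction on average, but noisy for each PSM.
nudge = 0.8 * np.where(is_correct, 1.0, -1.0) + rng.normal(0.0, 1.0, 2 * n)
soft = hard + nudge


def accuracy(score: np.ndarray) -> float:
    """Fraction of PSMs called correctly when classifying by the sign of the score."""
    return float(np.mean((score > 0) == is_correct))


print(f"global accuracy, hard data only: {accuracy(hard):.3f}")
print(f"global accuracy, with metadata : {accuracy(soft):.3f}")  # slightly better overall

# But individually: solid calls barely move, while marginal calls flip semi-randomly.
flipped = (hard > 0) != (soft > 0)
marginal = np.abs(hard) < 0.5
solid = np.abs(hard) > 2.0
print(f"verdicts flipped among marginal PSMs: {flipped[marginal].mean():.1%}")
print(f"verdicts flipped among solid PSMs   : {flipped[solid].mean():.1%}")
```

Under these assumed numbers, overall accuracy ticks up slightly while a sizeable fraction of the marginal verdicts change sign, which is exactly the sense in which global statistics sharpen while individual precision degrades.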
(Note that we simplified SorcererScore for illustration purposes. Sorry if this section is a little technical.)
A mathematician’s apology
Precision proteomics progressed slowly because it required two unnatural partners — analytical chemistry and theoretical mathematics. The former trusts only what can be seen while the latter is totally abstract. In a nutshell, the chemists controlled the money and bungled the math, so the field evolved with brilliant chemistry but flaky math to yield beautiful data but no breakthroughs.
From a glass-half-full perspective, this sets up perhaps one of the greatest Revolutions in our lifetime, waiting for a breakthrough.
Decades ago, as a young immigrant with nothing, I was lucky my family settled in New York, a mecca for teen math talent. I became a successful competitor with a national profile, but the highlight was my privilege to befriend and compete alongside several amazing prodigies. All my insecurities disappeared once I realized I wasn’t outrageously far from true math genius, and I was better at computing anyway. My poor MIT professors had to endure this newfound confidence.
Almost two decades after achieving the American dream through Moore’s Law, I switched to proteomics to “give back” to science and for a chance to ride the next great revolution. Along the way we made mistakes and misjudged the unfavorable financial politics. But our unique background lets us see what others can’t. Instead of fleeing like other software companies, we understood proteomics’ trillion-dollar potential and stayed to solve the precision riddle.
After a year of beta-testing and private “validation” discussions at ASMS, we are ready to declare a major first victory with SorcererScore.
Look, as outsiders we always faced detractors. Chemists quickly figure out I know little chemistry, which some find disqualifying. By the same token, I can tell if they don’t understand math, for example misinterpreting probabilities or abusing the cross-correlation function. A lot of published equations are ad hoc and make no sense. You can’t do clean precise science with sloppy math.
Believe it or not, ready or not, we are shifting the proteomics paradigm to deliver precise deep science.
First step to revolutionary success
The mathematics of precision is now well understood. We can prove that, given today’s state of the art, the speedy PC software pervasive in labs is inherently imprecise.
It’s not hard to predict that only labs capable of precise analyses from $M instruments will thrive.
Serious labs can contact us about using SorcererScore to data-mine one dataset for free. This would allow training on the revolutionary technology and paradigm.
Please email Terri at sales@SageNResearch.com or visit our webpage (http://www.SageNResearch.com) for more information and previous Technical Posts.
P.S.: Why mass-tolerance search is necessary for Precision Proteomics
Theorem: In a search workflow, using only precursor delta-mass to discriminate correct IDs requires a wide mass-tolerance search.
Proof: The delta-mass spread of correct IDs is fixed by instrument accuracy, while the spread of random-PSM delta-masses grows monotonically with the search mass tolerance. So increasing the discriminating power of delta-mass means increasing that tolerance. QED
This seems to be a contentious point but the math is pretty clear.
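For intuition, here is a toy numerical sketch under idealized, invented assumptions (correct-ID delta-masses tightly spread by instrument accuracy, random-PSM delta-masses roughly uniform over the search window): widening the precursor tolerance shrinks the fraction of random PSMs that land inside the tight “correct” delta-mass band by chance, which is the gain in discriminating power the proof describes.

```python
# Toy sketch of the P.S. argument with idealized, invented distributions:
# correct-ID delta-masses are tightly spread (instrument accuracy), while random-PSM
# delta-masses spread roughly uniformly over the precursor search window. A wider
# search tolerance leaves fewer random PSMs inside the tight "correct" band by chance.
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
instrument_sd_da = 0.005                # assumed precursor accuracy for correct IDs (Da)
correct_band_da = 3 * instrument_sd_da  # +/- window capturing ~99.7% of correct IDs

for tolerance_da in (0.01, 0.1, 1.0, 3.0):  # narrow to wide precursor search tolerance
    random_dm = rng.uniform(-tolerance_da, tolerance_da, n)  # random-PSM delta-masses
    inside = np.mean(np.abs(random_dm) < correct_band_da)    # fraction mimicking correct IDs
    print(f"search tolerance ±{tolerance_da:.2f} Da: "
          f"{inside:.1%} of random PSMs fall inside the correct-ID band")
```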