Major breakthroughs are often prejudged when they don't come from places like Harvard. At the 2019 UCSF mass spectrometry (MS) symposium, in a room of world experts, one MIT grad proposed a radical approach to ultra-low abundance proteomics — the field's Holy Grail for billion-dollar clinical R&D — that took years to discover but minutes to explain: Trust data over equations. Now everything changes.
Our big idea: Just 2 m/z data-points can yield an accurate mass and a rough quantity, even for a peptide near the limit of detection. But its identity is unknown. However, if it happens to be exactly one phosphorylation or oxidation mass less than an identified peptide, then it is most likely (though not definitively) that same peptide with one fewer post-translational modification (PTM).
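The core of this idea can be sketched in a few lines of code. The sketch below is ours, not Sage-N's software: it matches an anonymous accurate mass against identified peptides by checking whether the gap equals a standard monoisotopic PTM mass (phosphorylation ≈ 79.96633 Da, oxidation ≈ 15.99491 Da); the peptide names, masses, and 5 ppm tolerance are illustrative assumptions.

```python
# Standard monoisotopic PTM mass deltas (Da)
PTM_DELTAS = {
    "phosphorylation": 79.96633,  # addition of HPO3
    "oxidation": 15.99491,        # addition of O
}

def ppm_error(observed, expected):
    """Relative mass error in parts per million."""
    return abs(observed - expected) / expected * 1e6

def match_by_ptm_delta(unknown_mass, identified, tol_ppm=5.0):
    """Return (peptide, ptm) pairs where the anonymous mass sits exactly
    one PTM mass below an identified peptide's mass, within tolerance."""
    hits = []
    for name, mass in identified:
        for ptm, delta in PTM_DELTAS.items():
            if ppm_error(unknown_mass + delta, mass) <= tol_ppm:
                hits.append((name, ptm))
    return hits

# Hypothetical example: an anonymous peptide at 1187.5418 Da versus an
# identified phosphopeptide at 1267.5081 Da (gap = one phosphorylation)
print(match_by_ptm_delta(1187.5418, [("PEPTIDE_A", 1267.5081)]))
```

Note that the match only proposes an identity; as the text says, it is likely but not definitive, so downstream evidence (retention time, fragments) is still needed.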
This is the final puzzle piece that allows Sage-N Research to offer world-class data science services for biopharma R&D that save months and millions of dollars. The full solution is too important and nuanced for push-button software. As Silicon Valley tech veterans, we specialize in troubleshooting critical projects using proprietary data mining and artificial intelligence. High-accuracy MS data analysis means solving a large Sudoku-like puzzle with often-incomplete clues, which is ill-suited to today's equation-based approach [see here].
Proteomics has had little impact on clinical discovery despite billions spent over decades. The root cause is the misapplication of statistics and software designed for academic proteome mapping to clinical needle-in-a-haystack analysis. Indeed, free prototype software promoting exotic statistics (in lieu of rigorous p-values) that inflate false positives is a costly bug for biopharma, but an accepted feature for published academic research, which, by definition, doesn't have to work in real life.
Here we outline how non-academic tech professionals — focused on evidence and facts, not publications — approach deep MS analysis.
Clinical R&D costs millions with low success because molecular biology is opaque. When an experiment unexpectedly fails, no one knows why, which triggers new experiments until patience or funding runs out. Time and money can be saved by probing low-abundance pathway proteins (proteomics) or their byproducts (metabolomics) to troubleshoot problems without new experiments. But data analysis remains its misunderstood Achilles heel.
The immunoassay (ELISA) is a standard chemical probe that can be created for a specific protein form, but only if enough of it can be isolated to inject into an animal to harvest antibodies. Immunoassays are easy to use when available — mix, then measure fluorescence — but are practically impossible to make for low-abundance molecules. Any probe's quality is defined by its sensitivity and specificity. A poor probe for clinical analysis would have both poor sensitivity (detecting only abundant proteins) and poor specificity (prone to false positives).
Today's proteomics pairs million-dollar mass spectrometers with amateur "quick-and-dirty" software that analyzes only abundant proteins (from peptide fragment ions) and is prone to false positives.
Since fewer than 2% of peptides — only the most abundant — can be fragmented (DIA or DDA), for every 20K identified peptides, some 1M+ anonymous peptides are thrown away. Clinical MS analysis is the art of mining this discarded deep data for robust clues, even incomplete ones.
As a useful if imperfect analogy, we conceptualize proteomics as a digitized version of highly multiplexed immunoassays — i.e., "cyber-assays." Each cyber-assay is specific to one protein form as defined by one designated surrogate peptide. Unlike an immunoassay's, its specificity has to be explicitly tested against fragment data, which requires skill (or AI) for low-abundance proteins.
Let's start with probing a well-characterized abundant protein. Its cyber-assay is keyed off a designated surrogate peptide that: (1) contains the defining modification sites, (2) is long enough to identify the base protein sequence, and (3) is short enough for MS measurement. In practice, the best surrogate may be a partially trypsin-digested peptide with one or more missed cleavages.
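The surrogate-selection criteria above can be sketched as code. This is our illustration, not Sage-N's implementation: it enumerates tryptic peptides (cleaving after K/R unless followed by P) with missed cleavages, then keeps those that cover a given modification site and fall in an MS-friendly length window; the protein sequence and length thresholds are made-up examples.

```python
import re

def tryptic_peptides(protein, max_missed=2):
    """Yield (start_index, peptide) for tryptic peptides (cleave after
    K/R unless followed by P), up to max_missed missed cleavages."""
    fragments = [f for f in re.split(r"(?<=[KR])(?!P)", protein) if f]
    starts, pos = [], 0
    for frag in fragments:
        starts.append(pos)
        pos += len(frag)
    for i in range(len(fragments)):
        for j in range(i, min(i + max_missed + 1, len(fragments))):
            yield starts[i], "".join(fragments[i:j + 1])

def surrogate_candidates(protein, site_index, min_len=7, max_len=25):
    """Peptides that cover a PTM site and fall in an MS-friendly length
    window (the length thresholds here are illustrative)."""
    return sorted(
        pep for start, pep in tryptic_peptides(protein)
        if start <= site_index < start + len(pep)
        and min_len <= len(pep) <= max_len
    )

# Hypothetical protein with a modification site at index 4 (the Y)
print(surrogate_candidates("MKTAYIAKQRPSEELKDAQR", 4))
```

Notice that the fully tryptic peptide covering the site ("TAYIAK") is too short to be a good surrogate here, so the best candidates carry missed cleavages — matching the point above.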
Applying a cyber-assay simply means collecting essential MS data-points (i.e., no Gaussian or other models) that support the quantitation and ID of the surrogate peptide. Quantitation requires 2 peptide ions (MS1) — the apex-intensity ion and its isotope ion. The ID comprises all candidate +1 fragment y-ions (MS2 from DIA/DDA), the most dependable ion type. (We found b-ions and other ion types to be hit-or-miss and a source of false positives when considered.)
Solid results from an abundant cyber-assay may produce only about 10 raw m/z data-points (2 MS1, ~8 MS2), which can be readily validated for critical results. Quantitation is given by the MS1 apex intensity. Identity is defined by the completeness of fragment evidence. Note that 8 consecutive +1 y-ions (y1 to y8) define a length-8 amino acid sequence, which is typically enough to directly implicate the protein form, including in-range PTMs. Conversely, we found that solid IDs of high-abundance peptides tend to have near-complete +1 y-ions from y1 to about y10. (Software that purports to identify a peptide from something like 2 fragments contains an obvious statistical bug.)
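The y-ion completeness idea can be made concrete. The sketch below (ours, with illustrative function names and a trimmed residue-mass table) builds the theoretical +1 y-ion series for a peptide from standard monoisotopic residue masses, then measures how far the unbroken y1, y2, ... run extends through observed fragments:

```python
# Standard monoisotopic residue masses (Da); table trimmed for the demo
RESIDUE_MASS = {
    "A": 71.03711, "D": 115.02694, "E": 129.04259, "G": 57.02146,
    "I": 113.08406, "K": 128.09496, "L": 113.08406, "P": 97.05276,
    "R": 156.10111, "S": 87.03203, "T": 101.04768, "V": 99.06841,
}
WATER, PROTON = 18.010565, 1.007276

def y_ion_series(peptide):
    """Theoretical +1 y-ion m/z values y1..y(n-1), built cumulatively
    from the C-terminus (y_n, the intact peptide, is excluded)."""
    mzs, mass = [], WATER + PROTON
    for aa in reversed(peptide):
        mass += RESIDUE_MASS[aa]
        mzs.append(mass)
    return mzs[:-1]

def consecutive_from_y1(theoretical, observed, tol=0.01):
    """Length of the unbroken y1, y2, ... run matched in the observed
    fragment m/z list (simple absolute-m/z tolerance)."""
    run = 0
    for mz in theoretical:
        if any(abs(mz - obs) <= tol for obs in observed):
            run += 1
        else:
            break
    return run
```

Under this scoring, a run of 8 or more consecutive y-ions supports a confident ID, while a score built from only 2 scattered fragments would be rejected — the statistical bug called out above.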
Next, we probe an ultra-low-abundance protein form with a well-characterized surrogate peptide that also has an abundant form exactly one PTM mass away. A clean cyber-assay result would produce about 12 data-points — 2 surrogate MS1 data-points (for mass and quantitation) plus ~10 for its abundant PTM variant.
The sensitivity challenge is that it may take several to dozens of MS runs to capture the surrogate's 2 data-points. (Think of how many telescope snapshots it takes to catch a faint Jupiter moon in view.) In other words, the surrogate and its abundant PTM variant are likely in different data files. But if we know the surrogate's retention time and m/z, it is fast and straightforward to check whether new data capture the surrogate.
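That check is cheap once the target is known. Here is a minimal sketch of the idea, with our own illustrative names, tolerances, and data shape (each run as a list of (retention time, m/z, intensity) MS1 data-points) — not Sage-N's actual tooling:

```python
def find_surrogate(runs, target_mz, target_rt, mz_tol=0.005, rt_tol=0.5):
    """Scan MS1 data-points from multiple runs for a surrogate ion near
    a known m/z and retention time. `runs` maps run name to a list of
    (rt_minutes, mz, intensity) tuples."""
    hits = []
    for run_id, points in runs.items():
        for rt, mz, intensity in points:
            if abs(mz - target_mz) <= mz_tol and abs(rt - target_rt) <= rt_tol:
                hits.append((run_id, rt, mz, intensity))
    return hits

# Hypothetical example: two runs, only the first captures the surrogate
runs = {
    "run_01": [(12.4, 650.337, 1.2e4), (30.1, 512.210, 9.0e3)],
    "run_02": [(12.8, 650.351, 5.0e3)],
}
print(find_surrogate(runs, target_mz=650.335, target_rt=12.5))
```

Because the search is a simple window test rather than model fitting, it scales to screening every new data file as it arrives, which is what makes accumulating those 2 elusive data-points across dozens of runs practical.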
In real-world clinical R&D, data analysis is not straightforward because pathway protein forms are rarely well-characterized. A single protein has perhaps 100 potential surrogate peptide backbones, each with an exponential number of possible PTM forms. Applying one cyber-assay with known parameters can be fast: a solid result would have about 12 data-points; a weak or null result, few or none. But unknown parameters cause a combinatorial explosion when applying every viable pathway cyber-assay across dozens of aggregated raw data files. This requires world-class data-mining tools (AIMS™) and know-how.
For the first time, proteomics can save biopharma time and money, but it requires a professional approach to deep data analysis.