View our ASMS 2019 poster HERE for more details on the SorcererSILAC product and technology.
Proteomics is a revolution stalled by its reputation as irreproducible pseudo-science. If data analysis is responsible for characterizing peptides entering the mass spectrometer (MSMS), then any irreproducibility must be caused by flaky data analysis (i.e. not earlier steps like sample prep).
We discovered the root cause, which is obvious in hindsight: misfit models from noisy data. This diagnosis points to a cure, an essentially model-free workflow using raw m/z data-points, with phospho SILAC as illustration.
If you cook a complex recipe with a large batch of ingredients, natural variation will leave some 5% undercooked or overcooked, which is why top chefs carefully prepare small batches for reproducible quality.
Likewise, using complex models to process a large m/z dataset leaves some 5% of the data under-fitted or over-fitted. While the cleanest 90%+ of the data may yield reasonable results that win over casual observers (most demos showcase spiked samples), the noisiest 5% may yield random quantitation. Since a PI has no way to tell which 5% is suspect, proteomics at large becomes suspect.
Conventional data analysis has evolved such dizzying complexity that it prevents rigorous analysis. A half dozen statistical and subjective models that almost no one understands are used to search spectra, discriminate correct peptide IDs, localize chemical modifications, estimate quantitation, and infer protein ID and quantitation. Blindly, most labs “play it safe” with trendy PC programs that perpetuate the problem.
A new paradigm, which we recently discovered, is needed to end the analysis paralysis.
The SorcererSILAC™ technology, based loosely on astronomical data mining, views each MSMS datum (m/z, time, intensity) as a standalone piece of evidence of an ion (akin to a detected photon in a telescope), and not just one piece of a larger modeled whole. This simple idea has profound implications.
It turns out that as few as 8 tightly matched fragment y-ions may be sufficient to establish a solid peptide ID. And peptides as short as a 10-mer (though longer is better) can uniquely identify the protein. This is critical for quantifying modified proteins, as we shall explain.
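To make the y-ion idea concrete, here is a minimal sketch of counting tightly matched fragment y-ions for a candidate peptide. It is not the SorcererSILAC implementation. The function names, the 10 ppm tolerance, the restriction to singly charged unmodified y-ions, and the use of 8 as a hard threshold are all assumptions made for illustration.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565   # H2O added to every y-ion
PROTON = 1.007276   # charge carrier for a singly charged ion


def y_ion_mzs(peptide):
    """Return singly charged y-ion m/z values, y1 through y(n-1)."""
    mzs = []
    running = WATER + PROTON
    # Walk from the C-terminus toward the N-terminus, skipping residue 1.
    for aa in reversed(peptide[1:]):
        running += RESIDUE_MASS[aa]
        mzs.append(running)
    return mzs


def count_matched_y_ions(peptide, observed_mzs, tol_ppm=10.0):
    """Count theoretical y-ions with an observed peak within tol_ppm."""
    matched = 0
    for theo in y_ion_mzs(peptide):
        tol = theo * tol_ppm / 1e6
        if any(abs(obs - theo) <= tol for obs in observed_mzs):
            matched += 1
    return matched


def has_solid_id(peptide, observed_mzs, min_matches=8, tol_ppm=10.0):
    """Hypothetical acceptance rule: at least min_matches matched y-ions."""
    return count_matched_y_ions(peptide, observed_mzs, tol_ppm) >= min_matches
```

Because each match is plain arithmetic on raw m/z values, any individual ID can be checked by hand against the spectrum.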
For SILAC differential quantitation, it turns out as few as 50 SILAC light/heavy pairs of m/z data-points (more or less depending on data variability) can reasonably estimate the light/heavy quant ratio. Importantly, because each pair is independent evidence, their statistical variance can be used to detect and flag ambiguous quantitation for the PI. For instance, co-eluting peptides occur in some 15% of the data and may corrupt quant estimation.
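The arithmetic involved can be sketched in a few lines. This is an illustration under stated assumptions rather than the product's actual method: the median as the ratio estimator, the sample standard deviation as the variability measure, and the `max_sd` cutoff of 0.5 log2 units are our own choices for the example. Only the figure of roughly 50 pairs comes from the text above.

```python
import math
import statistics


def estimate_log2_ratio(pairs, min_pairs=50, max_sd=0.5):
    """Estimate log2(light/heavy) from paired raw intensities.

    pairs: iterable of (light_intensity, heavy_intensity) tuples.
    min_pairs and max_sd are hypothetical quality thresholds.
    Returns None if fewer than two usable pairs exist.
    """
    # Each pair is treated as one independent measurement of the ratio.
    logs = [math.log2(l / h) for l, h in pairs if l > 0 and h > 0]
    if len(logs) < 2:
        return None
    center = statistics.median(logs)   # robust to a few corrupted pairs
    spread = statistics.stdev(logs)    # disagreement among the pairs
    return {
        "log2_ratio": center,
        "sd": spread,
        "n": len(logs),
        # Flag for the PI when evidence is thin or inconsistent.
        "ambiguous": len(logs) < min_pairs or spread > max_sd,
    }
```

A co-eluting peptide would typically show up as pairs that disagree with one another, which inflates `sd` and sets the flag.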
In other words, in lieu of many inscrutable models, we mine perhaps 108 raw m/z data-points (8 MS2 and 100 MS1 in our example) for arithmetic calculation to discriminate peptide IDs, localize some modifications, and estimate peptide quantitation. Long unique-protein peptides also become protein surrogates for both ID and quant. Critical peptides can be readily verified by examining all relevant m/z data. Coding for novel chemistries and PTMs is also more straightforward because a new custom statistical model is unnecessary.
We can now explain the theory and application of reproducible proteomics.
In classical science, prior data drives a hypothesis which is then tested against brand new data. With fresh data, any random correlation is neutralized to ensure scientific reproducibility.
Data science adapts the scientific method to a single set of “big data” by randomly partitioning it into separate training and test subsets. Machine learning can be used to auto-optimize a hypothesis from the training data (think of fitting a ML model as selecting the best hypothesis among a continuum of possibilities), which is then tested against test data. Done rigorously, scientific reproducibility is preserved by simulating many virtual experiments using a single large dataset.
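As a minimal sketch of the partitioning step, the following splits one dataset into disjoint training and test subsets. The function name, the 25% test fraction, and the seed are illustrative assumptions.

```python
import random


def split_train_test(records, test_fraction=0.25, seed=0):
    """Randomly partition records into (train, test) with no overlap."""
    rng = random.Random(seed)       # seeded so the split is repeatable
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]
```

Repeating the split with different seeds is one way to simulate many virtual experiments from a single dataset.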
Fundamentally, both proteomics and astronomy are digital treasure hunts seeking elusive objects from physical measurements, so similar methodologies should apply.
Astronomers formulate data-testable hypotheses, for example that distant planets would cause a predictable dimming of starlight, then use powerful servers to mathematically clean and mine noisy images to detect possible new planets. Statistical models are used peripherally for gross filtering, but final planet IDs are presumed verified by human experts. Such is a typical data mining methodology.
In SorcererSILAC, targeted proteins/peptides in our “cyber-assay” are listed in the FASTA sequence file. This drives a search engine to gross-guess possible sightings among MSMS spectra. These in turn drive the gross mining of ideally 100 or more MS1 and MS2 raw m/z data-points for peptide ID and quant.
Here are some uncommon insights from our SILAC analysis, which required more engineering than first anticipated.
First, for modified proteins, quantitation at the protein level is meaningless because of explosive combinatorics. Instead, clinical quantitation must be at the peptide level, hence the desire to quantify long protein-unique peptides if possible. (We found it necessary to modify the search engine here.) For instance, a theoretical single-site phosphoprotein has 2 forms and 3 possible peptide SILAC ratios. An N-site protein has 2^N forms and many more SILAC ratios.
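The 2^N growth is easy to see by enumerating site-occupancy patterns. This short sketch assumes each site is independently either phosphorylated or not, and `phospho_forms` is a name invented for the example.

```python
from itertools import product


def phospho_forms(n_sites):
    """All occupancy patterns for n_sites sites; True = phosphorylated."""
    return list(product((False, True), repeat=n_sites))


# The number of protein forms doubles with every additional site.
for n in (1, 3, 10):
    print(n, "sites ->", len(phospho_forms(n)), "forms")
```

Ten sites already give 1,024 distinct forms, which is why a single protein-level number cannot summarize them.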
Second, quantitative ambiguity is ever-present, for example with co-eluting or low-abundance peptides. Just as faint astronomical objects can be identifiable but not quantifiable, some solid peptide IDs cannot be quantified.
SorcererSILAC reports one of 3 possible outcomes:
(1) unambiguous Light/Heavy ratio
(2) maxed out quantitation [i.e. the other member of the light/heavy pair is undetected and presumed undetectable]
(3) ambiguous quant that fails one or more quality-checks.
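The three outcomes map naturally onto a small decision function. The sketch below is a guess at the shape of that logic, not the product's code: the names, the inputs, and the order in which conditions are tested (quality checks first, then detection of each partner) are all assumptions.

```python
from enum import Enum


class QuantOutcome(Enum):
    UNAMBIGUOUS = 1   # outcome (1): usable Light/Heavy ratio
    MAXED_OUT = 2     # outcome (2): one partner undetected
    AMBIGUOUS = 3     # outcome (3): failed a quality check


def classify_quant(light_detected, heavy_detected, checks_passed):
    """Classify one peptide's quantitation.

    checks_passed: iterable of booleans, one per quality check.
    """
    if not all(checks_passed):
        return QuantOutcome.AMBIGUOUS
    if light_detected and heavy_detected:
        return QuantOutcome.UNAMBIGUOUS
    if light_detected or heavy_detected:
        return QuantOutcome.MAXED_OUT
    # Neither partner seen: nothing to quantify (assumed to be ambiguous).
    return QuantOutcome.AMBIGUOUS
```

Keeping ambiguous results as an explicit category means a solid peptide ID is still reported even when its ratio is not.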
Finally, for most SILAC experiments, more than 90% of the log2(L/H) values follow a Gaussian-like distribution. This suggests the majority of quantified deviations are really background noise, so biomarker discovery should focus on the tails of the distribution only.
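One simple way to isolate the tails, sketched here under our own assumptions, is to estimate the center and width of the Gaussian-like bulk robustly and keep only values far outside it. The median and MAD estimators and the cutoff of 3 robust standard deviations are illustrative choices, not values from the SorcererSILAC workflow.

```python
import statistics


def tail_candidates(log2_ratios, z_cutoff=3.0):
    """Return indices of log2(L/H) values far from the Gaussian-like bulk."""
    center = statistics.median(log2_ratios)
    deviations = [abs(x - center) for x in log2_ratios]
    # 1.4826 * MAD approximates the standard deviation of a Gaussian.
    robust_sd = 1.4826 * statistics.median(deviations)
    if robust_sd == 0:
        return []   # no measurable spread, so no defined tail
    return [
        i for i, dev in enumerate(deviations)
        if dev / robust_sd > z_cutoff
    ]
```

Robust estimators are used so that the few real outliers do not widen the noise band they are being compared against.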
The reason for today’s irreproducibility crisis becomes clear:
In contrast to testing hypotheses against data, most experiments are run with no m/z-testable hypotheses, and most researchers do not interpret m/z data directly, but instead try different informatics software to find one that fits their expectations or narrative.
Robust translational proteomics is now possible.