Tandem mass spectrometry (MSMS) is a game-changer for biological discovery, but it is being held back by weak mathematics and computing, the very foundation of the new paradigm. What does that mean? The people who figure it out will change the world. Here we use uncommon common sense to explain precision proteomics for breakthrough discoveries.
Mass spec data — mass/charge ratios (m/z) — involve only basic arithmetic. Conceptually it is almost as simple as elemental analysis (EA), which determines the CHNS composition of urine or other bio-samples by burning them and collecting the combustion products.
If the data is so simple, why is the mathematics of data analysis so complex? Turns out, it should not be. It’s basically a math bug.
The problem: While the data itself is high-dimensional (many fragment m/z’s), the prevalent data analysis methodology is one-dimensional, probabilistic (binomial), and subjective. The mismatch between solution and problem gives rise to additional compensating complexity (e.g. incorporating meta-data, normalizing non-probability scores) that causes researchers to give up trying to understand. Instead, they turn to black-box push-button software, each package subjectively different, that proliferates like weeds in the confused peer-review literature. Paradoxically, the less trustworthy software tends to become the most popular because it is faster, cheaper, and more easily manipulated to report favorable results. Every lab ends up with a hodgepodge of different software packages that give slightly different answers. How do you know which to trust?
The only real solution: Define the data analysis workflow yourself — the only person you really trust. With clear understanding and a professional platform like SORCERER, it’s easier than you may think to visualize data and prototype custom analyses. Published innovations are best customized into your own workflow to enable insightful validation and optimization.
Mass spec is confusing because only true negatives are definitive. Akin to Cinderella’s story: if the slipper doesn’t fit, it’s surely not her; but if it fits, it is impossible to tell whether it’s her or a random girl. We can abstract MSMS molecular identification as identifying Cinderella in a sizable neighborhood using one slipper (precursor mass) plus a full wardrobe (many fragment m/z’s). A girl is likely our quarry to the extent she is an outlier in both the number and tightness of garments that loosely fit. Similarly, a peptide ID hypothesis (from a search engine) is likely correct to the extent it is an outlier in both the number and the m/z error of loosely matched fragment ions, period. Importantly, this is not conducive to modeling with simple probabilities.
Another confusing point is that MSMS cannot identify a molecule per se, but merely provides fragment m/z’s to be compared to a given molecular hypothesis. We can abstract this problem as a crossword (peptide sequence) with numerical clues (fragment m/z’s). Most people solve a crossword by guessing many words, then checking if any one fits exceptionally well.
We now see a conceptually clean abstraction for MSMS peptide identification: a high-sensitivity search engine gross-guesses many (100+) peptide ID hypotheses from a spectrum; a high-specificity filter finds the best match — in terms of being a 2D outlier in number and delta-m/z error — to accept as the peptide ID. How can we check if an ID hypothesis is an outlier? The best way is visually, in a 2D scatterplot, as part of a semi-interactive data analysis, somewhat like interpreting a medical x-ray.
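To make that concrete, here is a minimal sketch (in Python, with illustrative candidate data and a made-up field layout) of plotting one spectrum’s candidate IDs as matched-fragment count versus mean |delta-m/z|, so a correct ID stands out visually as a 2D outlier:

```python
# Minimal sketch: plot candidate peptide IDs for one spectrum as a 2D scatter
# (matched-fragment count vs. mean |delta-m/z|). A correct ID should stand out
# as an outlier with many matches and tight errors. Data are illustrative.
import matplotlib.pyplot as plt

# Each candidate: (peptide sequence, number of matched fragments, mean |delta-m/z| in ppm)
candidates = [
    ("PEPTIDEK", 18, 1.2),   # hypothetical outlier: many matches, tight errors
    ("EPPTIDEK",  7, 6.5),
    ("PETPIDEK",  6, 8.1),
    ("QEPTLDEK",  5, 9.0),
    # ... remaining search-engine guesses ...
]

n_matched = [c[1] for c in candidates]
mean_err  = [c[2] for c in candidates]

plt.scatter(mean_err, n_matched)
for seq, n, err in candidates:
    plt.annotate(seq, (err, n), fontsize=8)
plt.xlabel("mean |delta-m/z| of matched fragments (ppm)")
plt.ylabel("number of matched fragments")
plt.title("Candidate IDs for one MS2 spectrum")
plt.show()
```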
Mathematically, the degree of being an outlier is the p-value — a calculated probability of being this far from typical by chance. In practice, a robust p-value is not well-defined for non-random, non-independent data like fragment m/z’s. Published papers essentially force-fit a binomial p-value using a single-fragment probability that varies between 3% and 10%, even though the model does not reflect the underlying statistics. This essentially injects random modeling artifacts into the workflow. The distortion would be unnoticeable for simple benchmarks with clean data, but significant for clinical biomarker discovery.
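For illustration, here is a small sketch of the binomial force-fit described above, with hypothetical match counts; note how much the computed p-value swings across the plausible range of single-fragment probabilities, independent of the data:

```python
# Sketch of the binomial "force-fit": treat each of n candidate fragment
# positions as an independent coin flip with single-fragment match probability p
# (typically taken as 3-10%), then compute the chance of k or more matches at
# random. Numbers here are illustrative only.
from scipy.stats import binom

n = 30    # fragment ions considered for the candidate peptide
k = 12    # fragments that matched within tolerance
for p in (0.03, 0.05, 0.10):
    p_value = binom.sf(k - 1, n, p)   # P(X >= k) under the binomial model
    print(f"p = {p:.2f}  ->  binomial p-value = {p_value:.2e}")
# The wide spread of p-values across plausible choices of p shows how the model
# choice, not the data, can dominate the reported significance.
```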
At the peptide level, proteomics data analysis tries to answer three questions: (1) What peptides are present? (2) How are they modified? (3) What are their relative quantities?
The scripted “guess-then-filter” abstraction is powerful enough to handle complex post-translational modifications (PTMs) without specialized software.
For instance, N-linked glycosylation is an important cancer-related PTM on the N of the motif N-X-[ST], where X is any amino acid except P. Push-button software either does not handle it, includes a special feature for it, or requires editing the protein sequence file. The clean way is to search allowing for this PTM on N, then modify the filter script to exclude ID hypotheses without the motif. In other words, simply adding one regular expression (i.e. “N[^P][ST]”) to the filter script handles this complex PTM.
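A minimal sketch of that filter-script addition, assuming the filter receives plain peptide sequence strings (the data layout is illustrative):

```python
# Keep only glycopeptide ID hypotheses whose sequence contains the N-X-[ST]
# sequon (X != P). Candidate format is illustrative.
import re

NXS_T = re.compile(r"N[^P][ST]")

def has_glyco_motif(peptide_sequence):
    """True if the peptide contains the N-linked glycosylation motif."""
    return NXS_T.search(peptide_sequence) is not None

candidates = ["LANGSTK", "LANPSTK", "GGNVTLR"]
accepted = [seq for seq in candidates if has_glyco_motif(seq)]
print(accepted)   # ['LANGSTK', 'GGNVTLR']
```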
Tricky searches, such as for N15 and SILAM labeling, require special features in conventional workflows. A “guess-then-filter” paradigm can instead apply sequence-dependent fragment-prediction adjustments to raw peptide guesses prior to filtering. Here, raw peptide guesses don’t have to be exactly correct, just close enough. Scripting gives you the flexibility to develop a novel analysis quickly instead of waiting months for a third party.
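As one hedged illustration of such an adjustment, the sketch below shifts a raw peptide guess’s predicted mass for a fully 15N-labeled sample by counting its nitrogen atoms; the function name and workflow hookup are assumptions for illustration, not a prescribed SORCERER API:

```python
# Sketch: sequence-dependent mass adjustment for a fully 15N-labeled sample.
# Each nitrogen atom in the peptide adds the 15N-14N mass difference.
N15_SHIFT = 0.997035  # Da per nitrogen atom (15N minus 14N)

NITROGENS_PER_RESIDUE = {
    "A": 1, "R": 4, "N": 2, "D": 1, "C": 1, "E": 1, "Q": 2, "G": 1, "H": 3,
    "I": 1, "L": 1, "K": 2, "M": 1, "F": 1, "P": 1, "S": 1, "T": 1, "W": 2,
    "Y": 1, "V": 1,
}

def n15_mass_shift(peptide_sequence):
    """Total mass shift (Da) to apply to a raw peptide guess before filtering."""
    return sum(NITROGENS_PER_RESIDUE[aa] for aa in peptide_sequence) * N15_SHIFT

print(round(n15_mass_shift("PEPTIDEK"), 4))  # 9 nitrogens -> ~8.97 Da
```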
Quantitation is an important functionality for translational research that is simple in concept but difficult in practice due to unpredictable experimental variations. A scripting platform allows you to use machine learning to make adjustments in order to quantify, say, 100 or 1000 patient samples.
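As a hedged sketch of the kind of adjustment such a script might apply, the example below uses a simple per-sample median normalization standing in for a learned correction; the peptides-by-samples data layout is illustrative:

```python
# Sketch: correct for per-sample loading and instrument drift by scaling each
# sample so its median peptide intensity matches the overall median. A learned
# correction would slot into the same place. Data are simulated for illustration.
import numpy as np

rng = np.random.default_rng(0)
intensities = rng.lognormal(mean=10, sigma=1, size=(500, 100))  # 500 peptides x 100 samples

sample_medians = np.median(intensities, axis=0)        # one median per sample
scale = np.median(sample_medians) / sample_medians     # per-sample correction factor
normalized = intensities * scale                       # broadcast across peptides

print(np.median(normalized, axis=0)[:5])  # sample medians are now equalized
```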
In 1931, the feds convicted gangster Al Capone by deciphering his ledger’s anonymous cash-flows to prove his crime. Today, researchers wish to identify and “convict” disease-causing proteins by deciphering the MS1 “ledger” of anonymous ion flows. For example, a targeted 1000 m/z chromatogram may show 10 “payments” (sampled intensities) a second apart. Without fancy curve-fitting, we can easily approximate the relative area-under-curve (AUC) by simply adding those consecutive non-zero intensities. But we won’t know what the peptide sequence is (a scrambled sequence has the same precursor mass) unless we can identify the anonymous peptide through its MS2 spectrum.
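A minimal sketch of that AUC approximation, with made-up sampled intensities:

```python
# Approximate the relative area-under-curve for a targeted m/z chromatogram
# sampled about one second apart: just sum the consecutive non-zero intensities.
sampled_intensities = [0, 0, 1.2e5, 4.8e5, 9.1e5, 7.3e5, 3.0e5, 0.9e5, 0, 0]

auc = sum(i for i in sampled_intensities if i > 0)  # relative AUC, no curve fitting
print(f"approximate relative AUC: {auc:.2e}")
```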
Push-button quantitation in general yields irreproducible results. Incorrect and mis-assigned peptides (common workflows allow 1% of IDs to be wrong) are one reason. A decade of empirical evidence suggests robust generic push-button quantitation may be unrealistic. While the literature seems focused on fancy mathematics to increase accuracy for abundant peptides, robust real-world quantitation requires a simple semi-interactive scripting approach so researchers can easily isolate and fix sources of problematic calculations.
Finally, we must infer protein identity, modifications, and relative quantity from multiple peptides. The literature includes complex Bayesian approaches that few understand. (I think of them as the mathematics of circular reasoning used for data-poor sciences like economics — but not necessarily appropriate for data-rich MSMS.) They add to an over-mathematization of proteomics that causes most labs to run insight-free experiments. For practical reasons, we advocate a simpler approach to protein inference: just use the longest peptide (ideally long enough to identify the protein uniquely) as the quantitation surrogate. Identified sibling peptides are used only as indirect supporting evidence. It’s not perfect, but it is simple to understand. Conceptual simplicity is critical for deep research with poor signal-to-noise.
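A minimal sketch of this longest-peptide surrogate rule, with illustrative peptide assignments:

```python
# For each protein, take its longest identified peptide as the quantitation
# surrogate; sibling peptides are kept only as supporting evidence.
identified = {
    "PROT_A": ["LVNELTEFAK", "AEFVEVTK", "QTALVELLK"],
    "PROT_B": ["SLHTLFGDELCK", "YLYEIAR"],
}

for protein, peptides in identified.items():
    surrogate = max(peptides, key=len)            # longest peptide wins
    siblings = [p for p in peptides if p != surrogate]
    print(protein, "->", surrogate, "(supporting:", siblings, ")")
```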
Ironically, what emerges as the most practical and robust framework for translational proteomics looks a lot like the original Yates Lab workflow from 25 years ago. At that time, MS2 spectra were searched with a cross-correlation search engine, then manually filtered with the DTASelect software. Each identified peptide was by default assigned to the first matching protein sequence in the FASTA file (which could be sorted to establish priority). Back then, its simplicity allowed almost everyone to understand almost everything, so insight could be deepened. In contrast, many current scientists seem dependent on the same push-button beginner software for years with little insight, one reason for slow progress.
Today, our core SorcererScore™ framework runs the exact same Yates-Eng search engine (still among the most sensitive), does semi-automated scriptable filtering, and uses simplified protein assignments. Customization can build on this simple foundation.
The latest paradigm — focused on longer, more informative peptides — allows data-dependent acquisition (DDA) data to be narrowly searched to true data accuracy (~1-2 ppm) with up to 20 or more PTM types. Data-independent acquisition (DIA) data can be similarly searched to true data accuracy (on the order of 10 amu for 3.5 m/z DIA windows for up to +3 charge) using the same unified framework. A unified DDA/DIA framework allows the same customized scripts to be reused and improved upon over time.
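A quick check of that precursor-window arithmetic (the numbers mirror the example above):

```python
# A DIA isolation window of 3.5 m/z corresponds to roughly (window x charge) Da
# of uncertainty in neutral precursor mass, so about 10 Da at charge +3.
window_mz = 3.5
for charge in (1, 2, 3):
    mass_window = window_mz * charge
    print(f"charge +{charge}: ~{mass_window:.1f} Da precursor mass window")
```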
Tandem mass spectrometers, like space telescopes and particle colliders, produce big datasets with wide dynamic range. They move biomolecule analysis closer to astrophysics than to traditional chemistry, and move the potential productivity gain from linear to exponential. A paradigm shift requires a new approach. The trick is not to try to do everything new yourself.
The SORCERER X integrated data appliance, a compact version of our flagship SORCERER Pro iDA, costs less than $10,000, installs in minutes (like a printer), and works through a web browser on any PC. It puts more than a decade of Silicon Valley engineering into your new mass spec paradigm.