Mass spectrometry (MS) proteomics successfully sequences proteins for the Human Proteome Project (HPP). But quantitative proteomics typically reports random values from irrelevant proteins. Solving reproducible quantitation would drive ground-breaking discoveries using Nobel-recognized biological MS.
It took Sage-N Research many years to finally debug quantitative clinical proteomics (detailed below):
1. Quantitative vs qualitative proteomics are fundamentally different sciences.
2. Biomarker discovery requires quantifying anonymous peptides BEFORE (or even without) sequencing.
3. Each reproducible peptide quantitation requires an appropriate in-sample reference.
4. Discovery of biomarker peptides is conceptually simple but computationally complex due to noise and the very large number of possibilities.
Conventional biomarker discovery projects focus on finding proteins (not peptides) quantitatively correlated to a medical condition. Scientists compare protein quantity between diseased (‘A’) vs. control (‘B’) bio-samples; proteins consistently over- or under-expressed are markers.
The first step is protein identification: MS2 fragment data are fed to a search engine to sequence peptides for inferring proteins.
Then quantitation follows: Sequenced peptides are typically quantified from their MS1 peptide intensity; differential quant is calculated as the ratio of A:B apex intensities. Since such ratios linked to the same protein can vary widely, analysis software may use the median or mean value (very imprecise!) to represent relative protein quant.
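To make the conventional calculation concrete, here is a minimal sketch of it. The intensities are invented for illustration, and `protein_ratio_summary` is a hypothetical helper, not a function from any real analysis package.

```python
from statistics import mean, median

def protein_ratio_summary(apex_a, apex_b):
    """Per-peptide A:B apex-intensity ratios for one protein, plus median and mean."""
    ratios = [a / b for a, b in zip(apex_a, apex_b)]
    return {"ratios": ratios, "median": median(ratios), "mean": mean(ratios)}

# Hypothetical MS1 apex intensities for four peptides assigned to the same protein.
apex_a = [2.0e6, 9.0e5, 4.0e6, 1.0e5]   # diseased sample 'A'
apex_b = [1.0e6, 1.8e6, 1.0e6, 1.0e5]   # control sample 'B'

summary = protein_ratio_summary(apex_a, apex_b)
print(summary)
```

In this made-up case the four peptide ratios span 0.5 to 4.0, so the single "protein ratio" depends heavily on whether the software picks the median or the mean.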
Though reasonable at first glance, a closer look reveals significant flaws.
First, while HPP ‘proteins’ are abstract 1D sequences (akin to genomics), biological proteins are physical 3D conformations completely absent from MS. This implies true MS protein quantitation is impossible, and any meaningful quantitation must be peptide-based. (In practice, multiple biomarker peptides should be combined for improved sensitivity and specificity.)
Clearly, quantitative vs qualitative proteomics are fundamentally different: They focus on different molecules (peptides vs proteins) with different data (MS1 vs. MS2).
Second, conventional quantitation of only identified peptides means throwing away 99%+ of valuable MS1 quant data (albeit of anonymous peptides), thus likely reducing the odds of research success below 1%.
To change the odds, we developed AI with signal processing for expert data mining of MS1 quant. The key insight: A peptide can be ‘identified’ from its m/z and apex retention time; its sequence is only needed for protein inference.
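As a rough illustration of treating (m/z, apex retention time) as a peptide's identity, the sketch below decides whether two MS1 features from different runs are the same anonymous peptide. The function name and both tolerances (10 ppm, 0.5 min) are assumptions chosen for the example, not values from our software, and real data would also need retention-time alignment and charge-state handling.

```python
def same_feature(f1, f2, ppm_tol=10.0, rt_tol_min=0.5):
    """Return True if two (m/z, apex RT in minutes) features match within tolerance.

    Both tolerances are illustrative assumptions.
    """
    mz1, rt1 = f1
    mz2, rt2 = f2
    ppm_error = abs(mz1 - mz2) / mz1 * 1e6
    return ppm_error <= ppm_tol and abs(rt1 - rt2) <= rt_tol_min

# Hypothetical features: about 4 ppm apart and 0.15 min apart, so they match.
match = same_feature((500.2500, 32.10), (500.2520, 32.25))
# About 100 ppm apart, so they are treated as different peptides.
mismatch = same_feature((500.2500, 32.10), (500.3000, 32.10))
print(match, mismatch)
```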
Third, peptide quant calculated as a direct ratio of intensities from different samples is effectively a random number generator. Instead, a more reproducible metric is a ratio of internal intensity ratios.
A peptide biomarker must indicate no-change for any sample from the same patient. In practice, it’s easy to see how different lab techs can extract samples X and Y from a heterogeneous fatty tumor with a 10:1 difference in fat content, for instance. MS analysis would report X:Y direct peptide ratios of anywhere from 1:10 to 10:1, far from the 1:1 required. (Note single-cell analysis won’t help as the same problem extends to heterogeneous organelles.)
Instead, reproducible peptide quant is ideally referenced to a near-identical peptide from the same sample (for instance, two peptides differing only in one post-translational modification) as an “in-sample peptide ratio” (ISPR). This helps to self-correct for sample prep if, for instance, doubling the fat content would double both the numerator and denominator of a fat cell ISPR of well-chosen like-peptides. (We first proposed the ISPR to quantify barely-detectable peptides in a 2019 ASBMB/UCSF Symposium poster, available HERE.)
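A toy calculation shows the self-correction. It assumes the idealized case described above, where a sample-prep factor scales both like-peptides equally; the intensities and the `measure` helper are hypothetical.

```python
def measure(true_target, true_reference, prep_factor):
    """Observed intensities when sample prep scales both like-peptides equally (assumption)."""
    return true_target * prep_factor, true_reference * prep_factor

def ispr(target, reference):
    """In-sample peptide ratio: target peptide over its in-sample reference peptide."""
    return target / reference

# Same patient, same underlying biology, two extractions with 10:1 fat content.
target_x, ref_x = measure(300.0, 100.0, prep_factor=10.0)  # sample X
target_y, ref_y = measure(300.0, 100.0, prep_factor=1.0)   # sample Y

direct = target_x / target_y                                  # direct cross-sample ratio
ratio_of_ratios = ispr(target_x, ref_x) / ispr(target_y, ref_y)

print(direct)           # 10.0, although nothing changed in the patient
print(ratio_of_ratios)  # 1.0, the required no-change result
```

The prep factor cancels inside each ISPR, so the ratio of ratios reports no change. If the reference peptide does not scale with the target, the cancellation fails, which is why the choice of like-peptides matters.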
And fourth, we can abstract peptide biomarker discovery as finding the proverbial needle (ISPR) in the haystack (MS1):
Given MS1 data for, say, 500 diseased and 500 control samples, we are looking for two anonymous peptide ions whose ratio (ISPR) is a significant discriminator of disease-vs-normal, period. Note a good biomarker ISPR is straightforward to validate; the challenge is finding it in the first place.
Tech veterans see a universal solution that always works given time (akin to Bitcoin mining): Randomly test every possible pairwise ISPR to find enough with discriminating power.
Almost all ISPRs involve randomly unrelated peptides with zero correlation to disease. A small number of ISPRs will correlate to demographics like gender and age. We hypothesize that only perhaps dozens to hundreds correlate to disease. Finding them among the several million million (10^12) possible ISPRs in a typical MS1 data file is best done with help from experienced specialists.
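For readers who want to see the brute-force idea in code, here is a deliberately naive sketch on synthetic data. It is not our production method: the simulated intensities, the use of AUC as the measure of discriminating power, the 0.9 cutoff, and all function names are assumptions for illustration. Since n features give n(n-1)/2 pairs, 10^12 ISPRs corresponds to roughly 1.4 million MS1 features, far beyond what this toy loop could handle.

```python
import itertools
import math
import random

def auc(pos, neg):
    """Chance that a random diseased value exceeds a random control value (ties count half)."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def scan_isprs(diseased, control, min_auc=0.9):
    """Test every pairwise ISPR; keep pairs that separate the groups in either direction."""
    n_features = len(diseased[0])
    hits = []
    for i, j in itertools.combinations(range(n_features), 2):
        d = [math.log(s[i] / s[j]) for s in diseased]
        c = [math.log(s[i] / s[j]) for s in control]
        a = auc(d, c)
        if max(a, 1.0 - a) >= min_auc:
            hits.append((i, j, a))
    return hits

def simulate(n_samples, n_features, shift, rng):
    """Synthetic samples: a random prep factor per sample; feature 0 is scaled by `shift`."""
    samples = []
    for _ in range(n_samples):
        prep = rng.uniform(1.0, 10.0)  # sample-prep artifact, cancels inside any ISPR
        s = [prep * rng.lognormvariate(0.0, 0.3) for _ in range(n_features)]
        s[0] *= shift
        samples.append(s)
    return samples

rng = random.Random(0)
diseased = simulate(30, 6, shift=8.0, rng=rng)  # feature 0 elevated in disease
control = simulate(30, 6, shift=1.0, rng=rng)

n_pairs = math.comb(6, 2)  # 15 candidate ISPRs for 6 features
hits = scan_isprs(diseased, control)
for i, j, a in hits:
    print(f"features {i}/{j}: AUC = {a:.3f}")
```

In this simulation only ratios involving the planted feature 0 should survive the cutoff, while the ten unrelated pairs hover near an AUC of 0.5. On real data, the multiple-testing burden across 10^12 candidates and confounders such as age and gender would need explicit handling.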
To our knowledge, this is the first detailed explanation of how inadequate data analysis allows sample prep artifacts to corrupt quantitative results, and how discarding 99% of the MS1 quant data makes novel discovery unlikely.
We’ve submitted an abstract and plan to present our original work at ASMS 2022. Please email me at David@SageNResearch.com for additional information, including how our CyberAssay technology and consulting services can unleash your clinical research.