Quantitative Proteomics Debugged
Mass spectrometry (MS) proteomics successfully sequences proteins for the Human Proteome Project (HPP). But quantitative proteomics typically reports random values from irrelevant proteins. Solving reproducible quantitation would drive ground-breaking discoveries using Nobel-recognized biological MS. It took Sage-N Research many years to finally debug quantitative clinical proteomics (detailed below):
1. Quantitative and qualitative proteomics are fundamentally different sciences.
2. Biomarker discovery requires quantifying anonymous peptides BEFORE (or even without) sequencing.
3. Each reproducible peptide quantitation requires an appropriate in-sample reference (see the sketch below).
4. Discovery of biomarker peptides is conceptually simple but computationally complex due to noise and a very large number of possibilities.
Conventional…
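The in-sample reference in point 3 can be made concrete with a few lines of code. Below is a minimal sketch in Python (hypothetical intensities and function name, not part of any Sage-N product) of quantifying a target peptide as a ratio to a reference peptide measured in the same sample, so that run-to-run instrument drift cancels out.

```python
# Minimal sketch, not Sage-N code: normalize a target peptide's intensity to
# an in-sample reference peptide so instrument drift cancels between runs.

def relative_quant(target_intensity: float, reference_intensity: float) -> float:
    """Return target abundance as a ratio to an in-sample reference peptide."""
    if reference_intensity <= 0:
        raise ValueError("reference peptide not detected")
    return target_intensity / reference_intensity

# Two runs of the same sample: absolute intensities drift ~2x between runs,
# but the in-sample ratio stays reproducible (hypothetical numbers).
run1 = relative_quant(target_intensity=4.2e6, reference_intensity=2.1e6)  # 2.0
run2 = relative_quant(target_intensity=8.6e6, reference_intensity=4.3e6)  # 2.0
print(run1, run2)
```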
Multi-Peptide Signature as a Pathway Biomarker for Clinical Trial Patient Selection
Patient selection for a pharmaceutical clinical trial is a critical process that considers many biomarkers, including molecular, clinical, and demographic. Since drugs act on protein pathways, molecular markers are the most indicative, but discovering enough protein and metabolite markers is extremely difficult. While mass spectrometry (MS) is known to struggle with intact proteins, it excels at detecting millions of digested peptides from practically every pathway. This suggests MS peptide ensembles can serve as easier and more sensitive pathway biomarkers than any single biomolecule. We developed the AIMS™ technology to compute multi-peptide signatures — from dozens to thousands of pathway peptides — using…
Ultra-Low Abundance Proteomics and Biopharma R&D
Major breakthroughs are often prejudged unless they come from places like Harvard. At the 2019 UCSF mass spectrometry (MS) symposium, in a room of world experts, one MIT grad proposed a radical approach to ultra-low abundance proteomics — the field’s Holy Grail for billion-dollar clinical R&D — that took years to discover but minutes to explain: Trust data over equations. Now everything changes. Our big idea: Just 2 m/z data-points can yield an accurate mass with gross quantity, even for a peptide near the limit-of-detection. But its identity is unknown. However, if it happens to be exactly one phosphorylation or oxidation mass…
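To make the arithmetic concrete, here is a minimal Python sketch, assuming hypothetical masses, a simple dalton tolerance, and function names of our own choosing, of testing whether an unidentified accurate mass sits exactly one known modification away from an already-identified peptide.

```python
# Minimal sketch (hypothetical masses and tolerances): a near-LOD MS1 feature
# gives an accurate neutral mass but no sequence. If that mass sits exactly
# one known modification away from an already-identified peptide, it is a
# strong candidate for the modified form.

PROTON = 1.007276  # mass of a proton in Da
PTM_DELTAS = {
    "phosphorylation": 79.96633,  # HPO3
    "oxidation": 15.99491,        # O
}

def neutral_mass(mz: float, charge: int) -> float:
    """Convert an observed m/z and charge to neutral monoisotopic mass."""
    return (mz - PROTON) * charge

def match_ptm(unknown_mass: float, known_peptide_mass: float, tol_da: float = 0.005):
    """Return the PTM name if the mass difference matches a known delta."""
    delta = unknown_mass - known_peptide_mass
    for name, ptm_mass in PTM_DELTAS.items():
        if abs(delta - ptm_mass) <= tol_da:
            return name
    return None

# Example: unknown feature at m/z 540.2528, z=2, vs. an identified peptide of
# mass 998.5247 Da -> delta ~ 79.966 Da, i.e. a phosphorylated variant.
print(match_ptm(neutral_mass(540.2528, 2), 998.5247))
```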
Mathematics of Clinical Proteomics
Astrophysics comprises at least two distinct fields requiring different mathematics. Profiling black holes is a modeling exercise, while discovering unknown Saturn moons requires precise raw data mining. Using one math to solve the other won’t work. In particular, using old discoveries to build a model that is blindly applied to new data would yield mostly false positives and negatives. To be clear, models are useful for gross pre-filtering, but not for novel discovery, because they cannot embody yet-to-be-discovered insights. It’s easy to explain why clinical proteomics has had low success: the math is unfit for purpose. Labs mistakenly conflate proteomics’s…
FDR Secrets: What You Must Know
Chemical separation requires multiple steps. Using just a single step means low purity, scarce output, or both. Same with digital separation. False discovery rate (FDR), the statistical workhorse of proteomics, is widely used (and misused) as a single-step metric to “purify” search engine results into peptide IDs. Common 1% FDR workflows, designed for academic statistical molecular profiling, are ill-suited for discovering individual biomarkers. Two problems: far too many false positives, and likely omission of the low-abundance peptides of clinical relevance. That’s why discovery success proves elusive. For example, the low-specificity prostate cancer biomarker PSA is said to have abundance beyond the…
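For readers who haven't seen it spelled out, here is a minimal sketch of the standard target-decoy estimate behind "1% FDR" workflows, with toy score distributions invented purely for illustration.

```python
# Minimal sketch with made-up toy scores: the standard target-decoy FDR
# estimate. Decoy hits passing a score threshold estimate how many target
# hits at that threshold are false.
import random

def target_decoy_fdr(target_scores, decoy_scores, threshold):
    """Estimate FDR at a threshold as decoys / targets above it."""
    targets = sum(1 for s in target_scores if s >= threshold)
    decoys = sum(1 for s in decoy_scores if s >= threshold)
    return decoys / targets if targets else 0.0

random.seed(0)
targets = [random.gauss(3, 1) for _ in range(1000)] + \
          [random.gauss(0, 1) for _ in range(200)]   # true + false target PSMs
decoys = [random.gauss(0, 1) for _ in range(200)]    # decoy PSMs (all false)
print(f"{target_decoy_fdr(targets, decoys, threshold=2.0):.2%}")
```

The single-step trade-off described above is visible here: lowering the threshold admits more low-abundance peptides but inflates false positives, and raising it does the reverse.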
Precision Proteomics: A 10 Minute Explanation
Before Copernicus, planetary mathematics was ad hoc and difficult to understand. But his heliocentric abstraction needed only simple math to reveal patterns (circles) that enabled deep insight (gravity). Proteomics researchers suffer analysis paralysis from today’s multitude of ad hoc statistics. The only true solution is to start with statistics-free, raw data analysis. But how?? When we recently presented posters on our discovery of precision analysis, some of the world’s best-known researchers were literally speechless over its simplicity. With accurate m/z data, each data-point represents one specific ion. Therefore, akin to using raw light to hunt exoplanets (2019 Physics Nobel), we use…
Why Irreproducible Proteomics Should Not Reproduce
To understand today’s proteomics, imagine a Gold Rush that started with top earth scientists spending millions to map the landscape and prospect for treasure, only to find just fool’s gold (false positives). Self-doubt starts to set in; others leave for greener pastures. The problem? They dug 100x too shallow and barely scratched the surface. Instead of industrial excavators with skilled operators (a platform with data scientists), they saved money with cheap gadgets or brittle homemade machines. We earlier explained how “LOD Proteomics” offers the ultimate sensitivity for deep protein research [click here]. We now explain how today’s methodologies don’t meet a minimum…
How LOD Proteomics Accelerates Deep Research with Lower Risk
It’s easy to snorkel in the shallows among tons of colorful fish. But if you want to do marine science, you start with quality scuba gear and go deep with a dive-master. Deep diving with DIY equipment and know-how can be painful or worse. Proteomics was once the “Next Big Thing” to revolutionize clinical discovery but fell short. We discovered why — the field is stuck in the shallows with “fast-and-cheap” statistical data analysis. We recently invented LOD (limit-of-detection) proteomics to identify/characterize protein forms down to the LOD of mass spectrometers. This conceptually simple paradigm requires specialized equipment (a data platform) and…
Limit-of-Detection Proteomics Accelerates Clinical PTM Discovery
View our Stanford SUMS Symposium 2019 poster (Coming Soon!) Clinical discovery proteomics requires analyzing low abundance modified peptides (LAMPs). We present a novel method that works down to the limit-of-detection (LOD). Under ideal conditions, 12 raw ions (4 MS1, 8+ MS2) can identify/quantify a protein with localized PTMs. Proteomics data typically contains more peptide ions (MS1) than fragment ions (MS2). One-third of MS1 peptides can be near-LOD with only 2 isotopic ions — still enough to calculate an accurate mass and gross quantitation (apex intensity). In other words, MS1 is a treasure trove of LAMPs with solid mass and quant info, but it is largely untapped because…
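A minimal sketch of that calculation, with hypothetical peak values: two isotopic MS1 peaks give the charge from the isotope spacing, then the accurate neutral monoisotopic mass; the apex intensity supplies gross quantitation.

```python
# Minimal sketch (assumed peak values): with just two isotopic MS1 peaks you
# can infer charge from the isotope spacing, then the accurate neutral
# monoisotopic mass, and take the apex intensity as gross quantitation.

PROTON = 1.007276          # Da
ISOTOPE_SPACING = 1.00286  # average isotopic spacing for peptides, Da (approx.)

def charge_from_spacing(mz1: float, mz2: float) -> int:
    """Isotopic peaks are spaced ~1.003/z apart in m/z; invert to get z."""
    return round(ISOTOPE_SPACING / (mz2 - mz1))

def monoisotopic_mass(mz1: float, charge: int) -> float:
    """Neutral mass from the monoisotopic peak's m/z and inferred charge."""
    return (mz1 - PROTON) * charge

# Example: two isotopic peaks of a near-LOD precursor.
mz_mono, mz_iso = 540.2528, 540.7543      # spacing ~0.5015 -> z = 2
z = charge_from_spacing(mz_mono, mz_iso)
print(z, monoisotopic_mass(mz_mono, z))   # 2, ~1078.49 Da
```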
Low Abundance Peptide Quant by Mining MS1 PTM Variants
View our ASBMB/Molecular and Cellular Proteomics 2019 poster HERE In tandem mass spectrometry (MSMS), 95%+ of the information resides in MS1 m/z data because relatively few precursors get fragmented. Yet conventional data analysis focuses primarily on MS2 spectra and sequence statistics, with MS1 data treated as an afterthought. This backward approach only scratches the surface of what’s possible. “Deep” label-free quantitation (DLFQ) instead focuses on identifying/quantifying MS1 precursor ions using MS2-identified peptides as a starting guide. This is like using a Rosetta Stone to interpret Egyptian hieroglyphs as an integrated story instead of a collection of unconnected sentences. It allows…
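A minimal sketch of the DLFQ idea as described, assuming a simple in-memory layout for MS1 scans (our own hypothetical structure, not a real file format): the target m/z comes from an MS2-identified peptide, but the quantitation itself uses only raw MS1 data-points.

```python
# Minimal sketch (hypothetical data layout): use an MS2-identified peptide as
# the guide, then quantify its precursor directly in MS1 by extracting the ion
# chromatogram within a tight m/z tolerance and taking the apex intensity.

def extract_xic(ms1_scans, target_mz, ppm_tol=5.0):
    """ms1_scans: list of (retention_time, [(mz, intensity), ...]).
    Returns the extracted ion chromatogram for target_mz."""
    tol = target_mz * ppm_tol / 1e6
    xic = []
    for rt, peaks in ms1_scans:
        inten = sum(i for mz, i in peaks if abs(mz - target_mz) <= tol)
        xic.append((rt, inten))
    return xic

def apex_intensity(xic):
    """Gross quantitation: maximum intensity across the elution profile."""
    return max(i for _, i in xic)

# Usage: target_mz comes from an MS2-identified peptide (the "Rosetta Stone");
# the quantitation itself uses only raw MS1 data-points.
```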
Art of Science: AI and Software Platform in Breakthrough Proteomics
As a tech veteran among academics, I often feel like a chef among nutritional scientists who equate culinary mastery with learning the most recipes. Like cooking, math and engineering are more art than science, which is why experts can be young with minimal formal education. While science seeks to distill a complex world into lower-dimensional knowns, art finds success within high-dimensional unknowns. Here I offer a tech perspective on research success. In proteomics, the science of analyzing clean data is easy but unimportant. Its true power lies in characterizing ground-breaking one-in-a-million peptides with poor signal-to-noise — a challenge in…
Breakthrough for Translational Peptide Quantitation (SILAC) Explained
View our ASMS 2019 poster HERE for more details on the SorcererSILAC product and technology. Proteomics is a revolution stalled by its reputation as irreproducible pseudo-science. If data analysis is responsible for characterizing the peptides entering the mass spectrometer (MSMS), then any irreproducibility must be caused by flaky data analysis (i.e. not earlier steps like sample prep). Obvious in hindsight, we discovered the root cause: misfit models applied to noisy data. This diagnosis allows a cure — an essentially model-free workflow using raw m/z data-points — with phospho SILAC as an illustration. If you cook a complex recipe with a large batch of…
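A minimal sketch of the model-free idea for SILAC, assuming a heavy-lysine (Lys8) label and hypothetical peaks and tolerances: find the light/heavy pair directly in the raw MS1 m/z data and report the intensity ratio, with no statistical model in between.

```python
# Minimal sketch (assumed label and tolerances): a heavy-lysine (Lys8) peptide
# sits exactly 8.014199 Da above its light partner; find the pair in MS1 and
# report the intensity ratio directly from raw m/z data-points.

LYS8_DELTA = 8.014199  # Da, 13C6 15N2 lysine

def silac_ratio(ms1_peaks, light_mz, charge, ppm_tol=5.0):
    """ms1_peaks: list of (mz, intensity). Returns heavy/light ratio."""
    heavy_mz = light_mz + LYS8_DELTA / charge
    def peak_intensity(target):
        tol = target * ppm_tol / 1e6
        return sum(i for mz, i in ms1_peaks if abs(mz - target) <= tol)
    light, heavy = peak_intensity(light_mz), peak_intensity(heavy_mz)
    return heavy / light if light else float("inf")

# Example: a doubly charged pair, so the heavy partner appears +4.0071 m/z away.
peaks = [(540.2528, 2.0e6), (544.2599, 1.0e6)]
print(silac_ratio(peaks, light_mz=540.2528, charge=2))  # ~0.5
```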
US HUPO 2019
Model-Free SILAC Data Analysis is Possible, Reproducible, and Essential
David Chiang; Patrick Chu
Sage-N Research, Milpitas, CA
ABSTRACT: (View the Poster Here) Using SILAC as a case study, we aim to solve this paradox: How proteomics — an analytical science with high-accuracy data — can produce irreproducible results. One factor is that labs rely on complex models (search engines, discriminant scores, posterior probabilities) within often-opaque software to interpret m/z data. Increased complexity generally means narrower applicability. Because mass spectrometry data have wide signal-to-noise, any complex model is prone to fitting problems for some data subset while depriving researchers of data insight.…
Cyber-Assays Accelerate Nobel Proteomics
The 2018 Nobel laureate James Allison brings inspiration to all maverick researchers. He bucked convention and toiled at the fringes with custom assays for protein breakthroughs in immunotherapy. Conventional assays use chemistry and can take days to weeks to create, if they are possible at all. That’s why one experiment can take weeks to months — while breakthroughs take years to decades. But what if we could accelerate custom assays by 10X? Digitalizing peptides with mass spectrometry transforms assays from chemical to algorithmic (and labs into tech startups). Although it converts tricky chemistry into tricky m/z Sudoku, the latter becomes solvable with computers that,…
Explained: Mathematics of Irreproducible Proteomics
Proteomics would be ten times more successful if results were precise and trustworthy instead of imprecise or irreproducible. We explained precise raw spectra analysis [click here] using a multi-dimensional methodology [click here]. Now we mathematically explain three main causes of irreproducibility: 1% FDR, probability scores, and machine learning, all proliferated by popular PC programs. Science is by design reproducible, so irreproducibility implies non-science. In other words, hypotheses are not rigorously tested against data. For proteomics, that means peptide ID hypotheses (i.e. PSMs from a search engine) are not rigorously accepted/rejected using m/z data. Since workflows other than SorcererScore™ use a statistical…
Solved: Raw Spectra Analysis for Clinical Proteomics
We recently discovered a surprisingly fast and robust method for raw spectra analysis (RSA). For the first time, it allows clinically important LAMPs (low abundance modified peptides) to be characterized at the raw data limit. The breakthrough: Figure 1 shows that average y-ion accuracy can approach 0.001 amu (m/z) — which allows analysis of the smallest peaks.
Figure 1: Distribution of y-ion delta-m/z errors for 80 correct PSMs (a) and 80 random PSMs (b).
The new SorcererScore™ RSA searches at true data accuracy (<2 ppm) while retaining 200 search results (PSMs) to include any low-score LAMPs. A customizable script (SX1503) uses…
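For context, here is a minimal sketch, using standard residue masses and hypothetical observed values, of computing a peptide's theoretical y-ion ladder and the delta-m/z (ppm) error of a matched raw peak, the quantity whose distribution Figure 1 shows.

```python
# Minimal sketch (standard residue masses; hypothetical observed peaks):
# the theoretical y-ion ladder for a peptide and the ppm error of a matched
# raw peak. Correct PSMs cluster near 0 ppm; random PSMs scatter widely.

RESIDUE = {"G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
           "V": 99.06841, "L": 113.08406, "K": 128.09496, "R": 156.10111}
WATER, PROTON = 18.010565, 1.007276

def y_ions(peptide: str):
    """Singly charged y-ion m/z values, y1..y(n-1), built from the C-terminus."""
    ions, mass = [], WATER + PROTON
    for aa in reversed(peptide[1:]):          # the full-length y(n) is excluded
        mass += RESIDUE[aa]
        ions.append(mass)
    return ions

def ppm_error(observed: float, theoretical: float) -> float:
    return (observed - theoretical) / theoretical * 1e6

theo = y_ions("LAKSGR")
print([round(m, 4) for m in theo])            # y1 (R) is ~175.1190
print(round(ppm_error(observed=theo[0] + 0.0005, theoretical=theo[0]), 2))
```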
Intuitive Breakthrough for Protein Mass Spectrometry
Tandem mass spectrometry (MSMS) is a revolutionary game-changer in a changing world. But it’s being held back by weak math and computing, the foundation of the new paradigm. What does that mean? The people who figure it out will change the world. Here we use uncommon common sense to explain precision proteomics for breakthrough discoveries. Mass spec data — mass/charge ratios (m/z) — involve only basic arithmetic. Conceptually it’s almost as simple as elemental analysis (EA), which determines the CHNS composition of urine or other bio-samples by burning them to collect combustion products. If the data is so simple, why…
How Bad Analytics Hijacked Proteomics: A Theoretical Analysis
The fixable reason why proteomics has little clinical impact — it’s big data mis-analyzed by small PCs using faulty analytics. Here we explain how proteomics has been hijacked by a fundamentally flawed, self-reinforcing fantasy — that a sufficiently clever PC program can analyze huge content-rich mass spec datasets in minutes with 99%+ accuracy. Just as dirty samples need sufficient chemistry to clean, dirty deep data (where breakthroughs hide) needs a lot of computation that greatly exceeds a PC’s capacity. In a nutshell, basically any analysis from any fast PC program is either uncompetitively shallow or deceptively semi-random because minimal information…
Precision Proteomics: How to Unlock Its Unlimited Potential
“Data is not information, information is not knowledge, knowledge is not understanding, understanding is not wisdom.” — Clifford Stoll How do you tell the world’s best minds their instincts are wrong? That happens when science evolves quicker than scientists. The prize to get it right — scientific immortality. Now is the time for pioneering research into proteins, the central biomolecule in disease and treatment. Proteomics data analysis is really two distinct methodologies — one push-button simple, the other valuable. “Shallow” proteomics was solved two decades ago with statistics, but “deep” precise proteomics (low-abundance proteins) is only now solved with our…
Meta-Math for Precision Low-Abundance Proteomics Explained
You can’t solve Sudoku with statistics. Same with proteomic peptide identification, it turns out. Current workflows contain the subtle math bug of applying statistical models too early, which makes the informational foundation (peptide IDs), and hence all results, imprecise. Conventional thinking confuses “information” with “precision” (analogous to “energy” vs. “entropy”). Ad hoc improvements make things worse. Software like Percolator combines many metadata features to increase overall information. Counterintuitively, this actually hurts individual precision. For example, you get a better estimate of the total number of correct IDs, but less certainty about any one peptide ID. Proteomic disease research concerns specific low-abundance proteins…
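A toy simulation, with score distributions invented purely for illustration, makes the trade-off concrete: at a fixed global cutoff, the aggregate FDR estimate can look excellent while IDs sitting near the threshold individually carry a much higher local error rate.

```python
# Minimal sketch (toy score distributions): why a better *global* estimate can
# mean less certainty about any *one* ID. At a tight global FDR cutoff, the
# IDs right at the threshold individually carry a much higher local error rate.

import random
random.seed(1)

correct = [random.gauss(3.0, 1.0) for _ in range(900)]    # true PSM scores
incorrect = [random.gauss(0.0, 1.0) for _ in range(900)]  # false PSM scores
threshold = 2.6

accepted_true = sum(1 for s in correct if s >= threshold)
accepted_false = sum(1 for s in incorrect if s >= threshold)
global_fdr = accepted_false / (accepted_true + accepted_false)

# Local error near the threshold: among IDs scoring within +/-0.2 of it,
# what fraction are false? This is what matters for any single peptide ID.
band = lambda scores: sum(1 for s in scores if abs(s - threshold) <= 0.2)
local_fdr = band(incorrect) / (band(correct) + band(incorrect))

print(f"global FDR: {global_fdr:.1%}, local FDR at threshold: {local_fdr:.1%}")
```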
How Simple Math is the Secret to Precision Proteomics
Let’s say you have some data-points (x,y) in a file. Without looking, you can force a linear regression and calculate a slope. Or you can blindly fit a polynomial or Bayesian or any other model and compute as many parameters as you like. You might even get the model published somewhere, but that doesn’t necessarily mean it’s any good. Such ad hoc models can parrot correct answers within a narrow range of applicability, for example on data similar to what was used to tune the model. But they can produce nonsense calculations otherwise. Since every dataset is unique, blindly applying…
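A minimal sketch with synthetic data: a blindly fitted high-degree polynomial reproduces its training points perfectly yet produces nonsense just outside them, while the humble straight line extrapolates sanely.

```python
# Minimal sketch (synthetic data): a blindly fitted high-degree polynomial
# "parrots" the training points perfectly yet goes wrong just outside them,
# while the simple straight line extrapolates sanely.

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 8)
y = 2.0 * x + rng.normal(0, 0.1, x.size)   # truth: a noisy line, slope 2

linear = np.polyfit(x, y, 1)
septic = np.polyfit(x, y, 7)               # degree 7 through 8 points: exact fit

x_new = 1.5                                # slightly outside the data range
print("linear predicts:", np.polyval(linear, x_new))  # close to the true 3.0
print("degree-7 predicts:", np.polyval(septic, x_new))  # typically far off
```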
SorcererScore™ for Identifying Spliced Peptides in Immunology
“Data analysis”, often trivialized by those who don’t understand deep data, can range from trivial to extremely difficult. Analyzing mass spectrometry (MS) data is fundamentally solving math puzzles with computers. Solving a mostly-filled Sudoku can be easy (even for an amateur-written PC program), but a gigantic version with millions of possibly ambiguous entries (representing data with added noise) requires skilled detective work. Solving hard MS puzzles has multi-million-dollar impact in accelerating biomarker and clinical discovery. Immunology is a hot clinical area involving tricky peptide analysis. A recent paper (Liepe et al., 10/2016) used a complex bioinformatics workflow to discover that…
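To show why spliced-peptide searches are such a large puzzle, here is a minimal sketch under simplified rules of our own choosing (ordered, non-overlapping fragments from one toy protein) that enumerates cis-spliced candidates.

```python
# Minimal sketch (toy protein, simplified rules): enumerate cis-spliced
# peptide candidates, i.e. two non-contiguous fragments of the same protein
# ligated together, which is what makes spliced-peptide searches such a
# combinatorially large m/z puzzle.

def cis_spliced_candidates(protein: str, min_frag=2, max_len=12):
    """Yield unique candidate peptides formed by joining two fragments."""
    n = len(protein)
    seen = set()
    for a in range(n):                          # first fragment start
        for b in range(a + min_frag, n + 1):    # first fragment end
            for c in range(b, n):               # second fragment start
                for d in range(c + min_frag, n + 1):
                    pep = protein[a:b] + protein[c:d]
                    if len(pep) <= max_len and pep not in seen:
                        seen.add(pep)
                        yield pep

toy = "MKLVSAGR"
candidates = list(cis_spliced_candidates(toy))
print(len(candidates), candidates[:5])
# Even this 8-residue toy yields dozens of candidates; a real proteome
# explodes into millions, hence the need for serious compute.
```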
Why Scripts for Precision Proteomics of Hybrid Mixtures
While “1.0” focused on instrumentation, “Proteomics 2.0” — aka Precision Proteomics — is about application, particularly for translational clinical research. As such, the emphasis shifts from acquisition accuracy to dependable data analysis that is robust, sensitive, and reproducible. This means bioinformatics, the long misunderstood and neglected part of proteomics, becomes the critical success factor. Unlike first generation tools reliant on statistical models for low-accuracy data, precision bioinformatics is necessarily a back-to-basics direct interpretation of raw mass/charge (m/z) measurements. This is a paradigm shift from over-reliance on inscrutable search scores and subjective probability models. It also requires flexible data-mining tools rather…
Low-Abundance Peptides Are Everywhere in Proteomic MS1 Scans
Proteomics mass spectrometry holds unlimited potential for translational “bench-to-bedside” medical research, but until now it has lacked 3 must-haves: robustness, sensitivity, and verifiability. Clinical impact requires dependable analysis of low abundance peptides/proteins, with results that can be directly traced back to raw data. In contrast, most labs process data with low-priced or academic proof-of-concept software that lacks some, if not all, of these requirements. Here we address low-abundance and/or modified peptides (LAMPs). Specifically, we illustrate how to use the SORCERER GEMYNI data-mining platform to find, characterize, and quantify LAMPs in the intact peptide (“MS1”) mass spec data. Our internal studies suggest:…
What Election Prediction Teaches About Probability Models
After the unexpected presidential election, many political pundits were quick to criticize prediction models for being “wrong” because almost all calculated probabilities were well below 50%. In fact, Nate Silver’s estimated ~30% sounds about right to me given the tiny margins of victory in several must-win states. Nevertheless, the concept of abstracting the electoral outcome as one flip of a loaded coin raises these questions: (1) What exactly does a probability represent? and (2) How is it computed? In a nutshell, estimated probabilities of non-trivial predictions are best viewed as mental tools for human observers to quantify ambiguity by incorporating…
How to Identify Labile Phosphorylation and PTMs with SorcererScore
Figure 1: 3-D data-cube of the S-score’s three components, with the S-score=0 plane.
A scuba diver has a sophisticated dive computer on his wrist. But if disoriented, he can blow bubbles and follow them in a slow ascent. The bubbles directly tell him: (1) which way is up, and (2) the safe ascent speed to avoid the bends. That’s the success strategy amidst disorienting complexity: back to direct fundamentals. Proteomics is robust for easy problems like identifying semi-pure proteins under ideal conditions, but it mostly struggles with clinically valuable low-abundance modified peptides (LAMPs) and proteins due to limitations in analytics. Here we…
Breakthrough SorcererScore Identifies Low-Abundance Peptides
Low-abundance analysis in proteomics is billion-dollar valuable but was unsolved until now. After 18 months in stealth mode, we announce SorcererScore™, a breakthrough in analytics that successfully finds low-abundance, modified peptides (LAMPs).
Figure 1: Peptide Identification Part of Proteomics Workflow
Deep proteomics needs both high-accuracy mass data and quality interpretation. Many proteomics labs produce the former but struggle with the latter, akin to an x-ray lab that produces super-sharp images but grossly mis-identifies early tumors. Funding and attention will explode once quality results can be dependably delivered. The issue is analytics, defined as the discovery of meaningful patterns in data using…