While “1.0” focused on instrumentation, “Proteomics 2.0” — aka Precision Proteomics — is about application, particularly for translational clinical research. As such, the emphasis shifts from acquisition accuracy to dependable data analysis that is robust, sensitive, and reproducible. This means bioinformatics, the long misunderstood and neglected part of proteomics, becomes the critical success factor.
Unlike first-generation tools reliant on statistical models for low-accuracy data, precision bioinformatics is necessarily a back-to-basics, direct interpretation of raw mass/charge (m/z) measurements. This is a paradigm shift away from over-reliance on inscrutable search scores and subjective probability models. It also requires flexible data-mining tools rather than simplistic push-button PC programs.
Is the bone broken? Best to look at the raw x-ray image, and not just some software summary.
Is the targeted low-abundance protein present? Best to directly compare measured vs. predicted m/z’s. Same principle.
Before search engines were invented, every mass spectrometry researcher understood that correct identifications meant small delta-masses, the principle that forms the foundation of SorcererScore(tm). Two decades later, many seem to have forgotten these core basics and have become over-dependent on complex formulas (search scores, probabilities, error rates) that are always imprecise and often misleading.
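To make the delta-mass idea concrete, here is a minimal sketch in Python of comparing a measured precursor m/z against the value predicted from a peptide sequence. The residue, water, and proton masses are standard monoisotopic values; the example peptide, charge state, and measured m/z are hypothetical.

    # Minimal sketch: compare measured vs. predicted m/z via delta-mass (ppm).
    # Residue, water, and proton masses are standard monoisotopic values; the
    # peptide, charge, and measured m/z below are hypothetical examples.
    MONO = {
        'G': 57.02146, 'A': 71.03711, 'S': 87.03203, 'P': 97.05276,
        'V': 99.06841, 'T': 101.04768, 'C': 103.00919, 'L': 113.08406,
        'I': 113.08406, 'N': 114.04293, 'D': 115.02694, 'Q': 128.05858,
        'K': 128.09496, 'E': 129.04259, 'M': 131.04049, 'H': 137.05891,
        'F': 147.06841, 'R': 156.10111, 'Y': 163.06333, 'W': 186.07931,
    }
    WATER, PROTON = 18.010565, 1.007276

    def predicted_mz(peptide, charge):
        """Theoretical m/z: neutral monoisotopic mass plus protons, over charge."""
        neutral = sum(MONO[aa] for aa in peptide) + WATER
        return (neutral + charge * PROTON) / charge

    def delta_ppm(measured_mz, peptide, charge):
        """Signed mass error in parts per million."""
        theo = predicted_mz(peptide, charge)
        return (measured_mz - theo) / theo * 1e6

    # Hypothetical doubly charged tryptic peptide and observed precursor m/z:
    print(predicted_mz("SAMPLER", 2))          # predicted m/z
    print(delta_ppm(402.2095, "SAMPLER", 2))   # small ppm error -> plausible ID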
Here, we illustrate the precision bioinformatics paradigm, including how scripts provide flexibility to handle hybrid mixtures that canned software cannot.
In most cases, we expect to provide SORCERER GEMYNI sample scripts free of charge, which licensed users can customize. Simple scripts can take only a few hours or even minutes to write. These scripts can run on anything from a low-priced cloud account (SORCERER Storm) to the high-performance physical SORCERER Pro iDA system.
Why hybrid mixtures
Conventional workflows handle simple experiments involving uniformly processed mixtures of known, abundant proteins. This means proteins with known sequences from one sample are digested by one enzyme (typically trypsin) into peptides, which are dissociated within a tandem mass spectrometer by one mechanism (typically CID or HCD, yielding b-ions and y-ions).
Hybrid mixtures are important in advanced research, for differential quantitation or improved peptide identification:
Heavy isotope-labeled peptides spiked into a bio-sample
CID vs. ETD spectra (b/y-ions vs. c/z-ions)
Light vs. heavy isotope labeling of specific amino acids in cells (SILAC)
Light vs. heavy isotope labeling of many amino acids in animals (SILAM)
Peptides from the same protein mixture digested with different enzymes, for sequencing unknown proteins
In general, searching hybrid mixtures with rigid workflows is either simple or very difficult, with little in between, depending on whether the hybrid type is supported natively.
It is simple when the difference can be coded as a variable chemical modification. This is the case for CID-vs-ETD (as terminus mods), SILAC (residue mod), and certain types of spiked peptides.
It gets difficult for SILAM, multi-enzyme digests, and all-heavy isotope-labeled peptides, which cannot be accounted for with a variable mod.
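To illustrate the simple case, here is a minimal Python sketch of why SILAC reduces to a variable residue modification: every candidate peptide has exactly two predictable masses, light and heavy. The mass shifts are the standard Lys-8 and Arg-10 values; the peptide is hypothetical. SILAM, by contrast, shifts essentially every residue in an organism-dependent way, so no single variable mod captures it.

    # Minimal sketch: a SILAC-style difference is just a per-residue mass
    # offset, so it fits a search engine's variable-modification model.
    # Shifts are the standard Lys-8 / Arg-10 values; the peptide is hypothetical.
    SILAC_HEAVY = {'K': 8.014199, 'R': 10.008269}   # 13C6 15N2 Lys, 13C6 15N4 Arg

    def heavy_shift(peptide, mods=SILAC_HEAVY):
        """Total mass offset of the fully labeled (heavy) form of a peptide."""
        return sum(mods.get(aa, 0.0) for aa in peptide)

    # Light vs. heavy becomes one search with a variable mod, because every
    # candidate peptide has exactly two predictable masses:
    print(heavy_shift("SAMPLEK"))   # +8.014199 Da for the single C-terminal Lys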
Why scripts
Notably, many hybrid analyses that are difficult for canned workflows can be very straightforward with scripts.
In many cases, the same dataset can simply be searched independently under different conditions, with the search results then merged by a script for subsequent processing in the Trans-Proteomic Pipeline or other tools. The flexibility to merge, with optional filtering, is what scripts can easily do that canned software cannot.
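As a minimal sketch of that merge step, the Python below combines two independent searches of the same dataset (a hypothetical trypsin search and chymotrypsin search exported as tab-separated tables), keeps the best-scoring match per spectrum, and drops decoys. Real search output such as pepXML for the Trans-Proteomic Pipeline would need its own parser, but the merge-and-filter logic is the same; the file names, column names, and decoy prefix are assumptions.

    # Minimal merge script: combine independent searches of the same dataset,
    # keep the best-scoring PSM per spectrum, and drop decoy matches.
    # File names, column names, and the decoy prefix are hypothetical.
    import csv

    def load_psms(path):
        with open(path, newline="") as fh:
            return list(csv.DictReader(fh, delimiter="\t"))

    def merge_best(result_files, decoy_prefix="DECOY_"):
        best = {}
        for path in result_files:
            for psm in load_psms(path):
                if psm["protein"].startswith(decoy_prefix):
                    continue                      # optional filtering step
                key = psm["spectrum"]
                if key not in best or float(psm["score"]) > float(best[key]["score"]):
                    best[key] = psm               # keep the best score per spectrum
        return list(best.values())

    merged = merge_best(["search_trypsin.tsv", "search_chymotrypsin.tsv"])
    with open("merged_psms.tsv", "w", newline="") as out:
        writer = csv.DictWriter(out, fieldnames=merged[0].keys(), delimiter="\t")
        writer.writeheader()
        writer.writerows(merged)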
You may ask: why not just add a menu item to a canned workflow instead of a script?
The challenge is unlimited variability and subjectivity. There are too many present and future options to distill into menus. Certain experiments may require filtering out decoys or illegitimate modifications. Or you may need to combine 2, 5, or 10 searches.
In contrast, scripts (akin to MS Excel macros) describe the workflow step-by-step to allow near-infinite flexibility. Server scripts can combine multiple languages and subsystems, for example R, Python, and MySQL for data/statistics, text processing, and relational databases, respectively. Sample scripts make this flexibility accessible to everyone from researchers who don't want to code at all to those who want to change parameters or entire sub-sections.
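As a sketch of that mixing, the Python below loads the merged table from the previous example into a relational database for set-style queries (SQLite standing in for MySQL, to keep the sketch self-contained), then hands the same file to R for summary statistics if Rscript is installed. All file and column names continue the hypothetical merge example.

    # Minimal sketch: Python glue + SQL queries + optional R statistics.
    # SQLite stands in for MySQL; names continue the hypothetical merge example.
    import csv, sqlite3, subprocess

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE psms (spectrum TEXT, peptide TEXT, protein TEXT, score REAL)")
    with open("merged_psms.tsv", newline="") as fh:
        rows = [(r["spectrum"], r["peptide"], r["protein"], float(r["score"]))
                for r in csv.DictReader(fh, delimiter="\t")]
    con.executemany("INSERT INTO psms VALUES (?, ?, ?, ?)", rows)

    # Relational step: distinct peptides per protein, most first.
    for protein, n in con.execute(
            "SELECT protein, COUNT(DISTINCT peptide) FROM psms "
            "GROUP BY protein ORDER BY 2 DESC LIMIT 10"):
        print(protein, n)

    # Statistics step: hand the same table to R, if Rscript is available.
    subprocess.run(["Rscript", "-e",
                    'x <- read.delim("merged_psms.tsv"); print(summary(x$score))'])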
You may now ask: why license the GEMYNI platform to write apps when you can develop exactly what you want from scratch?
The short answer is that platform development, unlike app development, is a complex, error-prone distraction for new programmers or app-focused bioinformaticians. If one needs to do a custom calculation, it’s more efficient to write an Excel macro than to write a custom spreadsheet or data analysis program. It’s even more efficient to start with a sample macro that does 80% of what you need.
Software development is simply digital construction. Any handyman can do simple projects (apps), but only skilled general contractors can build large complex systems (platforms) that last years without hidden problems.
The deeper you look, the more you find
Broadly speaking, proteomics doesn’t get bioinformatics. This is both crisis and opportunity.
Anyone with basic understanding of “cars” knows that a car can indeed go 150 mph, seat 8, or cost only $500. Just not all in the same car. Only with misunderstanding would someone accustomed to beat-up minivans with a broken speedometer find a $25K Honda Accord to be an outrageously overpriced letdown.
However, proteomics labs routinely demand “software” that is easy, accurate, and inexpensive — 3 conflicting goals — in the same product, pointing to programs they currently use.
Actually, real-world products let you pick 2 of 3 at best.
Most labs can't check accuracy and unwittingly choose easy-inexpensive software that “looks” accurate by underestimating the false discovery rate (FDR), sometimes by 10x or more. This probably arose from an algorithm's over-sensitivity or a bug that became a feature. (We explained the mathematics of FDR-hacking in our blog.)
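For reference, here is a minimal sketch of how a decoy-based FDR estimate works: with a concatenated target-decoy search, random matches hit targets and decoys about equally often, so the decoy count above a score cutoff estimates the number of false target hits. The hit counts below are hypothetical.

    # Minimal sketch: target-decoy FDR estimate. The hit counts are
    # hypothetical; the point is that the decoy count sets a floor that
    # honest software must report.
    def decoy_fdr(n_target_hits, n_decoy_hits):
        """Estimated fraction of accepted target hits that are false."""
        return n_decoy_hits / max(n_target_hits, 1)

    # 10,000 accepted target PSMs with 500 decoys above the same cutoff implies
    # roughly 5% FDR; reporting 0.5% on such data would be a 10x underestimate.
    print(decoy_fdr(10000, 500))   # 0.05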
Such software provides short-sighted answers for cash-strapped labs: (1) reduced IT costs and (2) increased grants, by making every experiment appear successful. In 10 typical experiments, accurate software might reveal only 1 exciting result along with 1 corrupt dataset and 8 “nothing new” results, while non-robust software can yield exciting results no matter what.
But hacking quality metrics creates a costly time bomb. Underestimating risk in credit default swaps, a financial derivative involving mortgages, helped trigger the $T 2008 Great Recession. Under-reporting emissions cost a single diesel car company a $4B fine. In proteomics, unreliable results from flaky software cause credibility problems for individuals, labs, and the whole field, with $M impact.
Think of the bioinformatics challenge this way: Imagine analyzing time-lapse, hi-res satellite images of a forest for zoology research. As in proteomics, the biggest challenge is the mind-boggling scale of the data, not necessarily the interpretation of specific data points.
A simple PC program can automatically identify and count all the elephants and zebras. With significant effort, it is possible to write sophisticated server software that automatically characterizes all large and medium-sized animals plus some small ones. But it would probably be too broad and cookie-cutter to be useful for any real-world research.
Instead, a particular zoologist might want to study endangered pygmy sloths or snow leopards. Here, the researcher would use gross filtering to eliminate most of the data containing no animals, customize scripts to pick out obvious matches, and keep ambiguous cases for manual interpretation.
In other words, a flexible, semi-automatic platform for data-mining large, hi-res datasets is the design goal for precision proteomics. SORCERER is designed to make high accuracy as easy as possible.
SORCERER accelerates original research
The greatest molecular biology revolution is happening now, mostly based on genomics. Proteomics has a role once it is robust, sensitive, and precise enough to answer clinically relevant questions. Its main limitation is probably bioinformatics.
Many labs say they can't afford to upgrade software. Actually, the real question is whether they can afford not to. Can a lab be sustainable doing only simple experiments with simple software? Or relying on downloaded software, tuned to someone else's data rather than yours, to do copycat experiments two years later?
It isn’t surprising the best papers come from original research with customized bioinformatics. With SORCERER and tech support, any lab with sample prep and mass spec expertise can do leading-edge original research. It’s hard to think of a better leapfrog opportunity.
In the 1940s, Alan Turing designed an electromechanical machine that searched settings of the Enigma cipher to help end World War II. Today, the SORCERER is ready to help crack the code of life.