“Data is not information, information is not knowledge, knowledge is not understanding, understanding is not wisdom.” — Clifford Stoll
How do you tell the world’s best minds that their instincts are wrong? It happens when science evolves faster than scientists. The prize for getting it right: scientific immortality. Now is the time for pioneering research into proteins, the central biomolecules in disease and treatment.
Proteomics data analysis is really two distinct methodologies: one push-button simple, the other valuable. “Shallow” proteomics was solved two decades ago with statistics, but “deep” precision proteomics (low-abundance proteins) is only now being solved, with our proprietary SorcererScore™ technology. The former achieves jaw-dropping simplicity by ignoring 99% of the content-rich data and only scratching the surface. The latter applies more than 200x the computing power to scrub deep, dirty data. This pushes automated analysis well below the surface, and also sets the stage for even deeper semi-manual analysis by your in-house bioinformaticians.
We previously explained our analytics breakthrough [click here]. Now we explain the once-in-a-lifetime opportunity of Precision Proteomics at the dawn of the digital medical revolution.
For years it was like trying to warn a desert civilization to get serious about a great flood that would forever change their world. Like that other story, almost no one listened. They don’t see that technological climate change (the digital revolution) is transforming their data desert into a flood zone. Now that labs are inundated with data yet have little success to show for it, it’s time to bring in professional specialists to fix the leaks in their information pipeline. Time is of the essence.
At Sage-N Research, we are seasoned Silicon Valley system engineering professionals, essentially general contractors of industrial-strength digital construction. Our job, and our aptitude, is to take half-baked bioinformatics ideas from researchers (their job) and fully bake them into the SORCERER integrated data appliance. In flood-control parlance, we’re the people you consult when building Hoover Dams with no margin for error. Over the years, the bulk of homemade workflows have fallen by the wayside due to quality or maintenance issues.
Proteomics was once red hot due to its obvious revolutionary potential. But it mostly disappointed, and supporters lost faith. No one knew why results were so imprecise and undependable, even with parts-per-million data accuracy.
Then we figured it out.
Three years ago, we discovered the problem: the math was wrong for precise analysis [click here]. Last year we fixed it by deriving SorcererScore from first principles. It takes only high school math and a 3D plot of your searched data to see how it works. It takes considerably more skill, however, to show mathematically that it is more or less the only way, at least without extensive use of metadata or prior assumptions.
This means that, for the first time in the three decades since “proteomics” was coined, Precision Proteomics is ready to revolutionize medicine as the de facto protein microscope.
To be clear, scientific breakthroughs can never be push-button easy. Deep research takes thoughtful experiment design, meticulous execution, and luck. But rigorous mathematics, applied by the powerful SORCERER iDA, finally makes it possible.
Now comes the hardest part: changing people’s minds and expectations.
When labs first started getting overwhelmed by raw data, instead of seeing a new paradigm, most reverted to old instincts. The field basically presumed that the problem with “mass spec data” was mostly “mass spec”, not “data”, so the presumed solution would be PC software from chemistry academics (as opposed to server scripts from data-mining specialists). This became a self-fulfilling prophecy as research and funding were directed toward academic labs publishing PC algorithms, and away from commercial server platforms like SORCERER.
The fallacy is that this means giving up on the deep data where breakthroughs hide. Unlike high-tech engineers, who grew up with huge monolithic datasets and powerful servers, bioscientists tend to have little experience with either. So most won’t know what they don’t know. Indeed, we see deep proteomics as closer to experimental particle physics than even to genomics: particle physicists run data-mining scripts to hunt for needle-in-a-haystack mass/charge signatures of new particles among abundant ones in collider data.
Deep analytics is the art and science of avoiding subtle pitfalls, which is why data scientists are in high demand even in pro sports. Imagine developing a machine-learning program that simulates the best card players or Wall Street investors. It should be a foolproof way to win money in online poker or stock investing, right? But the devil is in the details, and few have the mathematical skills to pinpoint the vulnerabilities. It’s the same problem with most do-it-yourself workflows: they look good on paper but don’t quite work. Most can’t spot the subtle flaws in a methodology until validation fails, which happens too often.
You’ve heard the statement, “Biology has become an information science.” But what does it mean? One interpretation: data growth shifts from linear to exponential, which triggers a revolution like Moore’s Law. An exponential function starts slow but overwhelms any linear function past its tipping point; that is “big” data. The superficial challenge is volume (“high throughput”), solvable by faster software alone to produce shallow results faster, but that completely misses the point.
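To make the tipping point concrete, here is a toy sketch in Python; the growth rates are illustrative assumptions, not real acquisition figures.

```python
# Toy comparison of linear vs. exponential data growth.
# All numbers are illustrative assumptions.
linear, exponential = 10.0, 1.0
for year in range(1, 11):
    linear += 10.0        # linear: fixed increment per year
    exponential *= 2.0    # exponential: doubles every year
    flag = "  <- exponential now dominates" if exponential > linear else ""
    print(f"year {year:2d}: linear={linear:7.1f}  exponential={exponential:7.1f}{flag}")
```

The exponential curve trails for years, then blows past the linear one for good; that crossover is the tipping point.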
The real power is depth (“deep data”), because explosive data growth increases the odds of capturing important low-abundance peptides in an interpretable MS2 fragment scan. (Data-independent acquisition, or DIA, further increases the odds.) Unlike merely faster software that analyzes all data the same way, a data-mining platform can focus an extraordinary amount of computing power on specific subsets of interest. A flexible script allows selective redirection of that power.
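As a rough sketch of that selective redirection, the script below routes only scans near hypothesized precursors of interest to an expensive deep search; the scan structure, target list, tolerance, and search functions are all assumptions for illustration, not SORCERER internals.

```python
# Route expensive analysis only to scans near targets of interest.
TARGET_MZS = [523.774, 712.338]   # hypothetical precursor m/z values
PPM_WINDOW = 10.0                 # illustrative tolerance

def within_ppm(observed, target, ppm):
    return abs(observed - target) / target * 1e6 <= ppm

def route(scans, deep_search, fast_search):
    """Spend heavy computing only where it matters; skim the rest."""
    for scan in scans:
        near_target = any(within_ppm(scan["precursor_mz"], t, PPM_WINDOW)
                          for t in TARGET_MZS)
        # e.g. deep_search = wide-tolerance cross-correlation;
        #      fast_search = routine shallow pass
        yield deep_search(scan) if near_target else fast_search(scan)
```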
In other words, the statement above really means that biology has become a deep and precise mathematical science, much like particle physics. It also implies that information technology (IT) becomes the critical success factor and your competitive edge, not something to cut to zero. More than other big-data bioscience technologies (genomics, hi-res microscopy), deep proteomics offers the most direct connection to breakthroughs in precision medicine. This suggests its natural scientific and market impact is 10x that of genomics, not 1/10th as it is today, once it actually works!
The game-changer is SorcererScore, a novel technology developed from the ground up for low-abundance analysis. As a baseline, a standard “TPP+XCorr” type workflow is retrofitted to provide fully automated capability using a familiar interface, as a first introduction to Precision Proteomics.
The full SorcererScore paradigm, loosely modeled after particle physics, involves designing the experiment around deep data analysis. The most important consideration is minimizing “aliasing” (mistaking one fragment ion for another). For instance, “fusing” multiple spectra, or labeling more than one amino acid in SILAC (say, both K and R), looks good on paper for shallow analyses but exacerbates aliasing. Counterintuitive to some, in many cases it may be advantageous to SILAC-label only the rarer amino acid.
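A rough aliasing check makes the point, as in the sketch below: count labeled/unlabeled m/z pairs that fall within a ppm tolerance of one another. The Lys8/Arg10 shifts are the standard heavy-label values, but the peptide masses are hypothetical, chosen about 2 Da apart to expose the risk, and for simplicity every shift is applied to every peptide.

```python
from itertools import combinations

PROTON = 1.007276
LYS8, ARG10 = 8.014199, 10.008269   # standard heavy-label mass shifts (Da)

def mz(mass, z):
    return (mass + z * PROTON) / z

def collision_count(peptide_masses, shifts, charges=(2, 3), ppm=10.0):
    """Count m/z pairs close enough to be mistaken for one another."""
    # simplification: every shift is applied to every peptide mass
    species = [mz(m + s, z) for m in peptide_masses
               for s in [0.0] + list(shifts) for z in charges]
    return sum(1 for a, b in combinations(species, 2)
               if abs(a - b) / a * 1e6 <= ppm)

masses = [1477.790, 1479.784, 2211.104]   # hypothetical peptide masses
print(collision_count(masses, [LYS8]))          # one label: fewer collisions
print(collision_count(masses, [LYS8, ARG10]))   # two labels: more aliasing
```

With these inputs, double labeling creates m/z coincidences (a K-labeled peptide landing on an R-labeled one) that single labeling avoids, which is exactly the aliasing to design against.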
Note that for deep analysis, more than 99% of the relevant data is within ‘MS1’ scans (intact peptide ions), of which only a tiny percentage have ‘MS2’ scans (fragment ions). SorcererScore therefore focuses on extracting information from MS1 scans. The metaphor is to decipher the biochemical snapshot described by the MS1 m/z “hieroglyphics” using peptide IDs (from MS2) as the Rosetta Stone. A low-abundance protein may be represented by only one or two peptide IDs plus multiple MS1 peaks from its other peptides, using what we call hybrid peptide mass fingerprinting (‘HPMF’), an extension of standard PMF.
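Here is a minimal sketch of the HPMF idea under simplifying assumptions (a naive tryptic digest with no missed cleavages or modifications, and an illustrative 5 ppm tolerance): once one MS2 peptide ID anchors a candidate protein, count how many of that protein’s other theoretical peptide masses also appear among the observed MS1 masses.

```python
WATER = 18.010565   # monoisotopic mass of H2O
RESIDUE = {         # monoisotopic amino acid residue masses (Da)
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931}

def tryptic_peptides(sequence):
    """Naive digest: cut after every K or R (no missed cleavages)."""
    peptide, out = "", []
    for aa in sequence:
        peptide += aa
        if aa in "KR":
            out.append(peptide)
            peptide = ""
    if peptide:
        out.append(peptide)
    return out

def hpmf_support(protein_seq, observed_ms1_masses, ppm=5.0):
    """Count theoretical peptides corroborated by observed MS1 masses."""
    hits = 0
    for pep in tryptic_peptides(protein_seq):
        theo = sum(RESIDUE[aa] for aa in pep) + WATER
        if any(abs(obs - theo) / theo * 1e6 <= ppm
               for obs in observed_ms1_masses):
            hits += 1
    return hits
```

The more of a protein’s expected peptide masses that appear in the MS1 data, the stronger the corroboration for an identification anchored by just one or two MS2 peptide IDs.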
Peptide IDs are the Achilles heel of data analysis and must be as close to error-free as possible. Conventional workflows are overly statistical and search engine-dependent, so no one really understands them. SorcererScore uses the metaphor of solving a crossword puzzle: use a search engine to “guess” the “word” (peptide sequence), and accept only a sequence with many matched fragments and tiny mass errors. To find low-abundance guesses, a cross-correlation search engine must be used, keeping the 50 or 100 best guesses. This metaphor provides a clean and simple abstraction that anyone can understand. It is independent of subjective search scores and uses only simple arithmetic on m/z values (versus complex statistics on search scores and metadata), which preserves the precision of the raw m/z data for peptide identification.
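The acceptance test reduces to counting, as in this simplified sketch; the 0.01 Da tolerance and minimum-match threshold are illustrative assumptions, not fixed SorcererScore parameters.

```python
def matched_fragments(theoretical_mzs, observed_mzs, tol_da=0.01):
    """Count theoretical fragment m/z values seen in the observed spectrum."""
    return sum(1 for t in theoretical_mzs
               if any(abs(t - o) <= tol_da for o in observed_mzs))

def accept(candidates, observed_mzs, min_matches=8):
    """candidates: (sequence, theoretical fragment m/z list) pairs for the
    50-100 best cross-correlation guesses; keep only well-corroborated IDs."""
    return [seq for seq, frags in candidates
            if matched_fragments(frags, observed_mzs) >= min_matches]
```

Because the decision rests on matched-fragment counts and raw mass errors alone, it stays independent of any particular search engine’s scoring statistics.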
Peptide quantitation is also greatly simplified. Conventional approaches curve-fit a Gaussian m/z-vs-intensity curve for one MS1 scan, then combine the fits over multiple scans. This won’t work for a sub-second signal with only two data points in total. Since the m/z width and the time between scans are both roughly constant, we use the metaphor of estimating the volume of a freight train (up to a scaling factor) as the sum of the maximum height of each car (each car representing one MS1 scan). The simple sum-of-max-intensity (‘SOMI’) calculation importantly works for the sparse MS1 scans that characterize low-abundance peptides. For a one-scan signal, SOMI is just the max intensity; for a two-scan signal, it is the sum of the two max intensities.
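Since SOMI is plain arithmetic, a sketch is short; the scan layout below is an illustrative assumption.

```python
def somi(scans):
    """Sum-of-max-intensity: scans is a list of per-MS1-scan intensity
    lists for one peptide signal; each scan contributes its peak height."""
    return sum(max(intensities) for intensities in scans if intensities)

print(somi([[120.0, 340.0, 90.0]]))                  # one scan  -> 340.0
print(somi([[120.0, 340.0], [280.0, 410.0, 55.0]]))  # two scans -> 750.0
```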
SorcererScore is based on powerfully simple ideas that are fundamentally different from conventional thinking. Unlike shallow analysis, it puts the scientists back in charge of data interpretation. Because it’s so different, we expect almost all labs will benefit, at least initially, from data analysis as a service, custom script development, or experiment review. Sage-N Research now offers such services.
To be sure, there are many complementary ideas with significant potential, including DIA and spectral libraries (SWATH), labeled and label-free quantitation, and so on. Fundamentally though, peptide IDs are the Rosetta Stone that connects the world of m/z’s to peptides and proteins. They are the foundation of everything in proteomics mass spectrometry.
In summary, first-generation proteomics, focused on instrumentation and chemistry, is restricted by small computers analyzing big data. SorcererScore enables Precision Proteomics by rebuilding its mathematical foundation as a data-mining paradigm for low-abundance proteins. Now that accurate mass spectrometers are the norm, deep data analysis becomes the bottleneck and competitive edge.
At Sage-N Research, we are not academics but engineering professionals who build, fix, and teach advanced IT for hire. Our mission from day one was to apply high-tech to accelerate medical discoveries to benefit mankind.
Now is the chance to become the Jacques Cousteau of deep disease research. You may be more productive focusing on the deep science and letting us handle the deep engineering. Or let us review and fix the vulnerabilities in your custom workflow.
If your proteomics research is not going swimmingly, contact us to schedule a confidential Skype call. Such a once-in-a-lifetime opportunity knocks but once.