Output list
Journal article
Selection on X1 + X2 + ⋯ + Xm via Cartesian product trees
Published 28/04/2021
PeerJ Computer Science, 483
Selection on the Cartesian product is a classic problem in computer science. Recently, an optimal algorithm for selection on A+B, based on soft heaps, was introduced. By combining this approach with layer-ordered heaps (LOHs), an algorithm using a balanced binary tree of A+B selections was proposed to perform selection on X1 + X2 + ⋯ + Xm in o(n·m + k·m), where each Xi has length n. Here, that o(n·m + k·m) algorithm is combined with a novel, optimal LOH-based algorithm for selection on A+B (without a soft heap). The performance of algorithms for selection on X1 + X2 + ⋯ + Xm is compared empirically, demonstrating the benefit of the algorithm proposed here.
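The optimal LOH-based selection algorithm itself is not reproduced here; as a rough illustration of the problem, the following sketch performs k-selection on A+B with a plain binary heap and then combines lists pairwise in a balanced tree, keeping only k candidates per node. It is a simple baseline under those assumptions, not the paper's algorithm, and the function names are illustrative.

```python
import heapq

def select_k_smallest_sums(A, B, k):
    """Return the k smallest values of {a + b : a in A, b in B}.

    Baseline heap-based selection (not the layer-ordered-heap algorithm
    from the paper): A and B are sorted, then a frontier of candidate
    index pairs is expanded lazily.
    """
    A, B = sorted(A), sorted(B)
    heap = [(A[0] + B[0], 0, 0)]
    seen = {(0, 0)}
    out = []
    while heap and len(out) < k:
        s, i, j = heapq.heappop(heap)
        out.append(s)
        for ni, nj in ((i + 1, j), (i, j + 1)):
            if ni < len(A) and nj < len(B) and (ni, nj) not in seen:
                seen.add((ni, nj))
                heapq.heappush(heap, (A[ni] + B[nj], ni, nj))
    return out

def select_k_on_sum_of_lists(lists, k):
    """Selection on X1 + X2 + ... + Xm via a balanced binary tree of pairwise
    A+B selections; each internal node keeps only its k smallest partial sums,
    which is sufficient because any of the k smallest totals must use a partial
    sum that is itself among the k smallest at every subtree."""
    layer = [sorted(x) for x in lists]
    while len(layer) > 1:
        nxt = [select_k_smallest_sums(layer[i], layer[i + 1], k)
               for i in range(0, len(layer) - 1, 2)]
        if len(layer) % 2 == 1:
            nxt.append(layer[-1])
        layer = nxt
    return layer[0][:k]
```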
Journal article
Performing Selection on a Monotonic Function in Lieu of Sorting Using Layer-Ordered Heaps
Published 02/04/2021
Journal of proteome research, 20, 4, 1849 - 1854
Nonparametric statistical tests are an integral part of scientific experiments in a diverse range of fields. When performing such tests, it is standard to sort values; however, this requires Ω(n log(n)) time to sort n values. Thus, given enough data, sorting becomes the computational bottleneck, even with highly optimized implementations such as the C++ standard library routine std::sort. Frequently, a nonparametric statistical test is only used to partition values above and below a threshold in the sorted ordering, where the threshold corresponds to a significant statistical result. Linear-time selection and partitioning algorithms cannot be directly used because the selection and partitioning are performed on the transformed statistical significance values rather than on the sorted statistics. Usually, those transformed statistical significance values (e.g., the p-value when investigating the family-wise error rate and q-values when investigating the false discovery rate (FDR)) can only be computed at a threshold. Because this threshold is unknown, this leads to sorting the data. Layer-ordered heaps, which can be constructed in O(n), only partially sort values and thus can be used to get around the slow runtime required to fully sort. Here we introduce a layer-ordering-based method for selection and partitioning on the transformed values (e.g., p-values or q-values). We demonstrate the use of this method to partition peptides using an FDR threshold. This approach is applied to speed up Percolator, a postprocessing algorithm used in mass-spectrometry-based proteomics to evaluate the quality of peptide-spectrum matches (PSMs), by >70% on data sets with 100 million PSMs.
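The layer-ordered heap construction is not shown here; as a minimal sketch of the underlying idea (finding the threshold by selection rather than by sorting), the following uses numpy.partition for linear-time selection, plus doubling and binary search over the candidate prefix size. The callback significance_at is a hypothetical placeholder for the monotonic transform (e.g., a q-value at a candidate threshold), and monotonicity of acceptance in the prefix size is assumed.

```python
import numpy as np

def threshold_index_without_sorting(stats, significance_at, alpha):
    """Number of statistics passing a significance threshold, found without
    fully sorting them.

    significance_at(kth_value, k) is a hypothetical monotonic transform
    (e.g., a q-value at the k-th smallest statistic).  Baseline sketch using
    numpy.partition (linear-time selection) with doubling + binary search,
    not the layer-ordered-heap method from the paper.
    """
    n = len(stats)

    def accepts(k):
        # k-th smallest statistic via linear-time selection (no full sort).
        kth = np.partition(stats, k - 1)[k - 1]
        return significance_at(kth, k) <= alpha

    if n == 0 or not accepts(1):
        return 0
    # Geometric doubling to bracket the largest accepted prefix size.
    k = 1
    while 2 * k <= n and accepts(2 * k):
        k *= 2
    lo, hi = k, min(2 * k, n)
    # Binary search for the largest accepted prefix size in [lo, hi].
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if accepts(mid):
            lo = mid
        else:
            hi = mid - 1
    return lo
```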
Journal article
On Solving Probabilistic Linear Diophantine Equations
Published 01/01/2021
Journal of machine learning research, 22, 87
Multiple methods exist for computing marginals involving a linear Diophantine constraint on random variables. Each of these extant methods has some limitation on the dimension and support or on the type of marginal computed (e.g., sum-product inference, max-product inference, maximum a posteriori, etc.). Here, we introduce the "trimmed p-convolution tree," an approach that generalizes the applicability of the existing methods and achieves a runtime within a log-factor or better compared to the best existing methods. A second form of trimming, which we call underflow/overflow trimming, is introduced; it aggregates events that land outside a random variable's support into the nearest supported value. Trimmed p-convolution trees with and without underflow/overflow trimming are used in different protein inference models. Finally, two different methods of approximating max-convolution using Cartesian product trees are introduced.
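The trimmed p-convolution tree itself is not reproduced here; the sketch below shows only the plain, untrimmed sum-product convolution tree that it generalizes: the PMF of X1 + ⋯ + Xm is built by pairwise convolutions in a balanced tree. Supports are assumed to start at 0, and np.convolve stands in for FFT-based convolution.

```python
import numpy as np

def convolution_tree(pmfs):
    """PMF of X1 + ... + Xm for independent discrete variables given as
    probability vectors over supports {0, 1, ..., len-1}.

    Plain sum-product convolution tree (balanced pairwise combination);
    the trimmed p-convolution tree additionally restricts supports and
    handles underflow/overflow, which is not shown here.
    """
    layer = [np.asarray(p, dtype=float) for p in pmfs]
    while len(layer) > 1:
        nxt = [np.convolve(layer[i], layer[i + 1])   # PMF of the pairwise sum
               for i in range(0, len(layer) - 1, 2)]
        if len(layer) % 2 == 1:                      # odd one out rises a level
            nxt.append(layer[-1])
        layer = nxt
    return layer[0]

# e.g., convolution_tree([[0.5, 0.5]] * 3) gives the Binomial(3, 0.5) PMF
```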
Journal article
Fast Exact Computation of the k Most Abundant Isotope Peaks with Layer-Ordered Heaps
Published 04/08/2020
Analytical chemistry (Washington), 92, 15, 10613 - 10619
Computation of the isotopic distribution of compounds is crucial to applications of mass spectrometry, particularly as machine precision continues to improve. In the past decade, several tools have been created for doing so. In this paper we present a novel algorithm for calculating either the most abundant k isotopologue peaks of a compound or the minimal set of isotopologue peaks which have a combined total abundance of at least p. The algorithm uses Serang’s optimal method of selection on Cartesian products. The method is significantly faster than the state-of-the-art on large compounds (e.g., Titin protein) and on compounds whose elements have many isotopes (e.g., palladium alloys).
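The LOH-based selection on Cartesian products is not shown here; as a coarse illustration of the task, the sketch below computes an aggregated (nominal-mass) isotope distribution by repeated convolution and then reads off the k most abundant peaks. The isotope abundances are approximate natural values, and the per-atom convolution loop is deliberately naive, unlike the paper's exact fine-structure selection.

```python
import numpy as np

# Approximate natural isotope abundances, indexed by extra-neutron count.
ISOTOPES = {
    "C": [0.9893, 0.0107],
    "H": [0.99988, 0.00012],
    "N": [0.99636, 0.00364],
    "O": [0.99757, 0.00038, 0.00205],
    "S": [0.9499, 0.0075, 0.0425, 0.0, 0.0001],
}

def aggregated_isotope_distribution(formula, k):
    """Top-k aggregated isotopologue abundances for a formula such as
    {"C": 6, "H": 12, "O": 6}, returned as (extra-neutron count, abundance).

    Brute-force convolution sketch over nominal masses; the paper's algorithm
    instead selects the exact k most abundant peaks with layer-ordered heaps
    and never materializes the full distribution.
    """
    dist = np.array([1.0])
    for element, count in formula.items():
        p = np.array(ISOTOPES[element])
        for _ in range(count):          # naive; squaring would be faster
            dist = np.convolve(dist, p)
    top = np.argsort(dist)[::-1][:k]
    return sorted((int(i), float(dist[i])) for i in top)

# e.g., aggregated_isotope_distribution({"C": 6, "H": 12, "O": 6}, 5) for glucose
```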
Journal article
EPIFANY: A Method for Efficient High-Confidence Protein Inference
Published 06/03/2020
Journal of proteome research, 19, 3, 1060 - 1072
Accurate protein inference in the presence of shared peptides is still one of the key problems in bottom-up proteomics. Most protein inference tools employing simple heuristic inference strategies are efficient but exhibit reduced accuracy. More advanced probabilistic methods often exhibit better inference quality but tend to be too slow for large data sets. Here, we present a novel protein inference method, EPIFANY, combining a loopy belief propagation algorithm with convolution trees for efficient processing of Bayesian networks. We demonstrate that EPIFANY combines the reliable protein inference of Bayesian methods with significantly shorter runtimes. On the 2016 iPRG protein inference benchmark data, EPIFANY is the only tested method that finds all true-positive proteins at a 5% protein false discovery rate (FDR) without strict prefiltering on the peptide-spectrum match (PSM) level, yielding an increase in identification performance (+10% in the number of true positives and +14% in partial AUC) compared to previous approaches. Even very large data sets with hundreds of thousands of spectra (which are intractable with other Bayesian and some non-Bayesian tools) can be processed with EPIFANY within minutes. The increased inference quality including shared peptides results in better protein inference results and thus increased robustness of the biological hypotheses generated. EPIFANY is available as open-source software for all major platforms at https://OpenMS.de/epifany.
Journal article
Alphabet Projection of Spectra
Published 06/09/2019
Journal of proteome research, 18, 9, 3268 - 3281
In the metabolomics, glycomics, and mass spectrometry of structured small molecules, the combinatoric nature of the problem renders a database impossibly large, and thus de novo analysis is necessary. De novo analysis requires an alphabet of mass difference values used to link peaks in fragmentation spectra when they differ by a mass in the alphabet divided by a charge. Often, this alphabet is not known, prohibiting de novo analysis. A method is proposed that, given fragmentation mass spectra, identifies an alphabet of m/z differences that can build large connected graphs from many intense peaks in each spectrum from a collection. We then introduce a novel approach to efficiently find recurring substructures in the de novo graph results.
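The alphabet-discovery method itself is not reproduced here; the sketch below shows only the graph-building step it presupposes: given a spectrum's peak m/z values and a candidate alphabet of mass differences, two peaks are linked when their m/z difference matches an alphabet mass divided by an assumed charge, within a tolerance. The quadratic loop and the default tolerance are illustrative choices, not the paper's method.

```python
def build_denovo_graph(peaks_mz, alphabet_masses, charges=(1,), tol=0.01):
    """Edges (i, j, mass, z) between peaks whose m/z difference matches an
    alphabet mass divided by a charge, within `tol` (in m/z units).

    Illustrative O(n^2 * |alphabet| * |charges|) sketch.
    """
    mz = sorted(peaks_mz)
    edges = []
    for i in range(len(mz)):
        for j in range(i + 1, len(mz)):
            delta = mz[j] - mz[i]
            for a in alphabet_masses:
                for z in charges:
                    if abs(delta - a / z) <= tol:
                        edges.append((i, j, a, z))
    return edges
```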
Journal article
Published 10/2016
Biochimica et biophysica acta. Proteins and proteomics, 1864, 10, 1363 - 1371
We describe in detail the usage of leucine metabolic labelling in yeast in order to monitor quantitative proteome alterations, e.g. upon removal of a protease. Since laboratory yeast strains are typically leucine auxotroph, metabolic labelling with trideuterated leucine (d3-leucine) is a straightforward, cost-effective, and ubiquitously applicable strategy for quantitative proteomic studies, similar to the widely used arginine/lysine metabolic labelling method for mammalian cells. We showcase the usage of advanced peptide quantification using the FeatureFinderMultiplex algorithm (part of the OpenMS software package) for robust and reliable quantification. Furthermore, we present an OpenMS bioinformatics data analysis workflow that combines accurate quantification with high proteome coverage. In order to enable visualization, peptide-mapping, and sharing of quantitative proteomic data, especially for membrane-spanning and cell-surface proteins, we further developed the web-application Proteator (http://proteator.appspot.com). Due to its simplicity and robustness, we expect metabolic leucine labelling in yeast to be of great interest to the research community. As an exemplary application, we show the identification of the copper transporter Ctr1 as a putative substrate of the ER-intramembrane protease Ypf1 by yeast membrane proteomics using d3-leucine isotopic labelling.
• Leucine metabolic labelling is a versatile strategy for quantitative proteomic studies.
• Advanced peptide feature detection supports the data analysis for leucine metabolic labelling.
• The copper transporter Ctr1 is a putative substrate of the ER-intramembrane protease Ypf1.
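As a small worked example of the quantification setup (not part of the published workflow), the expected m/z spacing between the d3-leucine-labelled and unlabelled forms of a peptide can be computed from the leucine count and charge, assuming each leucine carries exactly three deuteriums:

```python
# Mass difference between deuterium and protium (hydrogen-1), in Da.
DELTA_D_H = 2.0141018 - 1.0078250   # ~1.00628 Da

def d3_leucine_mz_shift(peptide_sequence, charge):
    """Expected m/z shift between the d3-leucine-labelled ("heavy") and
    unlabelled ("light") form of a peptide.

    Minimal sketch assuming every leucine carries exactly three deuteriums;
    it ignores isoleucine and any incomplete labelling.
    """
    n_leucine = peptide_sequence.count("L")
    return 3 * DELTA_D_H * n_leucine / charge

# e.g., d3_leucine_mz_shift("PEPTLDELK", 2) -> ~3.019 m/z (two leucines, charge 2)
```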
Journal article
Published 01/04/2016
Journal of machine learning research, 17, 36
Max-convolution is an important problem closely resembling standard convolution; as such, max-convolution occurs frequently across many fields. Here we extend the method with the fastest known worst-case runtime, which can be applied to nonnegative vectors by numerically approximating the Chebyshev norm ‖·‖∞, and use this approach to derive two numerically stable methods based on the idea of computing p-norms via fast convolution: The first method proposed, with runtime in O(k log(k) log(log(k))) (which is less than 18 k log(k) for any vectors that can be practically realized), uses the p-norm as a direct approximation of the Chebyshev norm. The second approach proposed, with runtime in O(k log(k)) (although in practice both perform similarly), uses a novel null space projection method, which extracts information from a sequence of p-norms to estimate the maximum value in the vector (this is equivalent to querying a small number of moments from a distribution of bounded support in order to estimate the maximum). The p-norm approaches are compared to one another and are shown to compute an approximation of the Viterbi path in a hidden Markov model where the transition matrix is a Toeplitz matrix; the runtime of approximating the Viterbi path is thus reduced from O(nk²) steps to O(nk log(k)) steps in practice, and is demonstrated by inferring the U.S. unemployment rate from the S&P 500 stock index.
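A minimal sketch of the single-p version of this idea follows (one fixed p, no null space projection, nonnegative inputs assumed): the p-norm of the terms {x[k]·y[m−k]} approaches their maximum as p grows, and all of these p-norms can be obtained at once from one fast convolution of x^p and y^p.

```python
import numpy as np
from scipy.signal import fftconvolve

def approx_max_convolution(x, y, p=8.0):
    """Approximate u[m] = max_k x[k] * y[m - k] for nonnegative vectors.

    Single fixed-p sketch of the p-norm approach: compute one fast
    convolution of x**p and y**p, then take the 1/p power.  Inputs are
    rescaled to a maximum of 1 to keep x**p and y**p in range; the paper's
    methods refine this estimate further.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    sx, sy = x.max(), y.max()
    if sx == 0.0 or sy == 0.0:
        return np.zeros(len(x) + len(y) - 1)
    x, y = x / sx, y / sy
    conv = np.maximum(fftconvolve(x ** p, y ** p), 0.0)   # clip FFT round-off
    return sx * sy * conv ** (1.0 / p)
```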
Journal article
Published 01/08/2015
Journal of computational biology, 22, 8, 770 - 783
Observations depending on sums of random variables are common throughout many fields; however, no efficient solution is currently known for performing max-product inference on these sums of general discrete distributions (max-product inference can be used to obtain maximum a posteriori estimates). The limiting step to max-product inference is the max-convolution problem (sometimes presented in log-transformed form and denoted as "infimal convolution," "min-convolution," or "convolution on the tropical semiring"), for which no O(k log(k)) method is currently known. Presented here is an O(k log(k)) numerical method for estimating the max-convolution of two nonnegative vectors (e.g., two probability mass functions), where k is the length of the larger vector. This numerical max-convolution method is then demonstrated by performing fast max-product inference on a convolution tree, a data structure for performing fast inference given information on the sum of n discrete random variables in O(nk log(nk) log(n)) steps (where each random variable has an arbitrary prior distribution on k contiguous possible states). The numerical max-convolution method can be applied to specialized classes of hidden Markov models to reduce the runtime of computing the Viterbi path from nk² to nk log(k), and has potential application to the all-pairs shortest paths problem.
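To make the Viterbi connection concrete, the sketch below runs a Viterbi-style forward pass in which the transition weight depends only on the state shift, so each update is a max-convolution; the exact O(k²) max-convolution shown here is the step the paper replaces with its O(k log(k)) numerical approximation. The model shape (nonnegative shifts only, a 2D emission array) is an assumption made for brevity.

```python
import numpy as np

def max_convolve_exact(x, y):
    """Exact max-convolution u[m] = max over i+j=m of x[i] * y[j] for
    nonnegative vectors; the O(k^2) reference the fast method replaces."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    u = np.zeros(len(x) + len(y) - 1)
    for i, xi in enumerate(x):
        u[i:i + len(y)] = np.maximum(u[i:i + len(y)], xi * y)
    return u

def viterbi_toeplitz(prior, emissions, shift_kernel):
    """Viterbi-style forward pass with a Toeplitz transition matrix.

    shift_kernel[d] is the transition weight for moving d states upward
    (nonnegative shifts only, for brevity); emissions is a (T, k) array of
    per-step emission likelihoods.  Each update is a max-convolution with
    the kernel followed by elementwise multiplication with the emissions.
    """
    k = len(prior)
    v = np.asarray(prior, dtype=float) * emissions[0]
    for e in emissions[1:]:
        v = max_convolve_exact(v, shift_kernel)[:k] * e
    return v   # unnormalized max-product scores per final state
```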
Journal article
Quantitative SNP genotyping of polyploids with MassARRAY and other platforms
Published 01/01/2015
Methods in molecular biology (Clifton, N.J.), 1245, 215 - 241
Accurate genotyping is essential for building genetic maps and performing genome assembly of polyploid species. Recent high-throughput techniques, such as Illumina GoldenGate™ and Sequenom iPLEX MassARRAY®, have made it possible to accurately estimate the relative abundances of different alleles even when the ploidy of the population is unknown. Here we describe the experimental methods for collecting these relative allele intensities and then demonstrate the practical concerns for inferring genotypes using Bayesian inference via the software package SuperMASSA.
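SuperMASSA's Bayesian model is not reproduced here; the following toy sketch only illustrates the dosage idea behind quantitative polyploid genotyping: under ploidy m, a genotype with allele-A dosage d is expected to give an allele-A intensity fraction near d/m, so samples can be scored against the nearest expected ratio for each candidate ploidy. The function names and the least-squares scoring are illustrative.

```python
def nearest_dosage(fraction_a, ploidy):
    """Closest dosage d in {0, ..., ploidy} to an observed allele-A intensity
    fraction, assuming the expected fraction is d / ploidy."""
    return max(0, min(ploidy, round(fraction_a * ploidy)))

def score_ploidy(fractions_a, ploidy):
    """Sum of squared residuals between observed fractions and the nearest
    expected dosage ratios; smaller means a better fit for this ploidy.
    Naive sketch, not the Bayesian inference performed by SuperMASSA."""
    return sum((f - nearest_dosage(f, ploidy) / ploidy) ** 2 for f in fractions_a)

# e.g., choose the ploidy in {2, 4, 6} minimizing score_ploidy(observed, m)
```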