Output list
Preprint
Median of heaps: linear-time selection by recursively constructing binary heaps
Posted to a preprint site 24/04/2023
arXiv (Cornell University)
The first worst-case linear-time selection algorithm was discovered in 1973; however, linear-time binary heap construction was first published in 1964. Here we describe another worst-case linear-time selection algorithm, which is simple to implement and uses binary heap construction as its principal engine. The algorithm is implemented in place and is shown to perform similarly to in-place median of medians.
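The abstract names linear-time binary heap construction as the algorithm's principal engine; the full median-of-heaps procedure is not reproduced here. As background, the following minimal Python sketch shows bottom-up (Floyd-style) heap construction, the 1964 linear-time building block referred to; function names are illustrative, not taken from the paper.

# Minimal sketch of bottom-up binary heap construction, the linear-time
# building block cited in the abstract. Names are illustrative only.

def sift_down(a, i, n):
    """Restore the min-heap property for the subtree rooted at index i."""
    while True:
        left, right = 2 * i + 1, 2 * i + 2
        smallest = i
        if left < n and a[left] < a[smallest]:
            smallest = left
        if right < n and a[right] < a[smallest]:
            smallest = right
        if smallest == i:
            return
        a[i], a[smallest] = a[smallest], a[i]
        i = smallest

def heapify(a):
    """Build a min-heap in place in O(n) by sifting down from the last parent."""
    n = len(a)
    for i in range(n // 2 - 1, -1, -1):
        sift_down(a, i, n)
    return a

print(heapify([9, 4, 7, 1, 0, 8, 5]))  # index 0 holds the minimum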
Preprint
Adversarial network training using higher-order moments in a modified Wasserstein distance
Posted to a preprint site 07/10/2022
arXiv (Cornell University)
Generative adversarial networks (GANs) have been used to produce data closely resembling example data in a compressed, latent space that is close to sufficient for reconstruction in the original vector space. The Wasserstein metric has been used as an alternative to binary cross-entropy, producing more numerically stable GANs with greater mode-covering behavior. Here, a generalization of the Wasserstein distance, using higher-order moments than the mean, is derived. Training a GAN with this higher-order Wasserstein metric is demonstrated to exhibit superior performance, even when adjusted for its slightly higher computational cost. This is illustrated by generating synthetic antibody sequences.
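The abstract does not give the exact form of the higher-order metric, so no attempt is made to reproduce it here. For background only, the following Python sketch computes the order-p Wasserstein distance between two equally weighted one-dimensional samples of equal size, for which the optimal coupling simply matches sorted order; p = 1 is the usual first-moment case and larger p emphasizes higher moments. The function name is hypothetical and the construction is illustrative, not the paper's.

import numpy as np

def empirical_wasserstein_1d(x, y, p=1):
    """Order-p Wasserstein distance between two equal-size 1-D samples.

    For equally weighted samples of the same length, the optimal coupling
    matches sorted order, so W_p = (mean |x_(i) - y_(i)|^p)^(1/p).
    """
    x, y = np.sort(np.asarray(x, float)), np.sort(np.asarray(y, float))
    assert x.shape == y.shape
    return (np.abs(x - y) ** p).mean() ** (1.0 / p)

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, 1000)
b = rng.normal(0.5, 1.0, 1000)
print(empirical_wasserstein_1d(a, b, p=1))  # classic first-moment case
print(empirical_wasserstein_1d(a, b, p=2))  # a higher-order variant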
Journal article
Selection on X1 + X2 + ⋯ + Xm via Cartesian product trees
Published 28/04/2021
PeerJ. Computer science, 483
Selection on the Cartesian product is a classic problem in computer science. Recently, an optimal algorithm for selection on A+B, based on soft heaps, was introduced. By combining this approach with layer-ordered heaps (LOHs), an algorithm using a balanced binary tree of A+B selections was proposed to perform selection on X1 + X2 + ⋯ + Xm in o(n·m + k·m), where each Xi has length n. Here, that o(n·m + k·m) algorithm is combined with a novel, optimal LOH-based algorithm for selection on A+B (without a soft heap). The performance of algorithms for selection on X1 + X2 + ⋯ + Xm is compared empirically, demonstrating the benefit of the algorithm proposed here.
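A minimal Python sketch of the "balanced binary tree of A+B selections" idea described above: the m lists are combined pairwise, and at each merge only the k smallest pairwise sums are retained. For clarity, the per-merge selection uses a simple heap-based frontier expansion rather than the paper's layer-ordered-heap (or soft-heap) selection; all function names are illustrative.

import heapq

def select_k_smallest_sums(xs, ys, k):
    """k smallest values of x + y for x in xs, y in ys (sorted internally).

    Simple heap-based A+B selection; the paper replaces this step with an
    optimal layer-ordered-heap selection.
    """
    xs, ys = sorted(xs), sorted(ys)
    heap = [(xs[0] + ys[0], 0, 0)]
    seen, out = {(0, 0)}, []
    while heap and len(out) < k:
        s, i, j = heapq.heappop(heap)
        out.append(s)
        for ni, nj in ((i + 1, j), (i, j + 1)):
            if ni < len(xs) and nj < len(ys) and (ni, nj) not in seen:
                seen.add((ni, nj))
                heapq.heappush(heap, (xs[ni] + ys[nj], ni, nj))
    return out

def select_k_smallest_m_fold(lists, k):
    """k smallest values of X_1 + X_2 + ... + X_m via pairwise reduction."""
    layer = [sorted(l) for l in lists]
    while len(layer) > 1:
        nxt = [select_k_smallest_sums(a, b, k)
               for a, b in zip(layer[::2], layer[1::2])]
        if len(layer) % 2:          # odd count: carry the last list upward
            nxt.append(layer[-1])
        layer = nxt
    return sorted(layer[0])[:k]

print(select_k_smallest_m_fold([[1, 4, 9], [0, 2, 5], [3, 3, 7]], k=4))  # [4, 4, 6, 6]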
Journal article
Performing Selection on a Monotonic Function in Lieu of Sorting Using Layer-Ordered Heaps
Published 02/04/2021
Journal of proteome research, 20, 4, 1849 - 1854
Nonparametric statistical tests are an integral part of scientific experiments in a diverse range of fields. When performing such tests, it is standard to sort values; however, sorting n values requires Ω(n log(n)) time. Thus, given enough data, sorting becomes the computational bottleneck, even with highly optimized implementations such as the C++ standard library routine std::sort. Frequently, a nonparametric statistical test is only used to partition values above and below a threshold in the sorted ordering, where the threshold corresponds to a significant statistical result. Linear-time selection and partitioning algorithms cannot be directly used because the selection and partitioning are performed on the transformed statistical significance values rather than on the sorted statistics. Usually, those transformed statistical significance values (e.g., the p value when investigating the family-wise error rate and q values when investigating the false discovery rate (FDR)) can only be computed at a threshold. Because this threshold is unknown, this leads to sorting the data. Layer-ordered heaps, which can be constructed in O(n), only partially sort values and thus can be used to get around the slow runtime required to fully sort. Here we introduce a layer-ordering-based method for selection and partitioning on the transformed values (e.g., p values or q values). We demonstrate the use of this method to partition peptides using an FDR threshold. This approach is applied to speed up Percolator, a postprocessing algorithm used in mass-spectrometry-based proteomics to evaluate the quality of peptide-spectrum matches (PSMs), by >70% on data sets with 100 million PSMs.
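The central data structure above is the layer-ordered heap: values partitioned into layers of roughly exponentially growing size such that every value in one layer is no greater than every value in the next. A minimal Python sketch of this construction follows, assuming a layer-growth rank of alpha = 2 and using numpy's partition routine as a stand-in for worst-case linear-time selection; names and parameters are illustrative, not taken from the paper or from Percolator.

import numpy as np

def layer_ordered_heap(values, alpha=2.0):
    """Partition values into layers of ~alpha-fold growing size such that
    every element of layer i is <= every element of layer i+1.

    np.partition (introselect) stands in for a worst-case linear-time
    selection; total work is O(n) for constant alpha because the layer
    sizes form a geometric series.
    """
    a = np.asarray(values, float).copy()
    layers, start, size = [], 0, 1
    while start < len(a):
        end = min(len(a), start + size)
        if end < len(a):
            # place the (end - start) smallest remaining values in a[start:end]
            a[start:] = np.partition(a[start:], end - start - 1)
        layers.append(a[start:end])
        start, size = end, max(1, int(round(size * alpha)))
    return layers

for layer in layer_ordered_heap([9, 1, 7, 3, 8, 2, 6, 5, 4, 0]):
    print(layer)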
Journal article
On Solving Probabilistic Linear Diophantine Equations
Published 01/01/2021
Journal of machine learning research, 22, 87
Multiple methods exist for computing marginals involving a linear Diophantine constraint on random variables. Each of these extant methods has some limitation on the dimension and support or on the type of marginal computed (e.g., sum-product inference, max-product inference, maximum a posteriori, etc.). Here, we introduce the "trimmed p-convolution tree," an approach that generalizes the applicability of the existing methods and achieves a runtime within a log factor of, or better than, the best existing methods. A second form of trimming, which we call underflow/overflow trimming, is introduced; it aggregates events that land outside the support of a random variable into the nearest support. Trimmed p-convolution trees with and without underflow/overflow trimming are used in different protein inference models. Finally, two different methods of approximating max-convolution using Cartesian product trees are introduced.
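As background for the trimmed p-convolution tree, the following Python sketch shows the plain (untrimmed, non-FFT) convolution-tree idea it generalizes: the probability mass function of a sum of independent discrete random variables is computed by pairwise convolution in a balanced binary tree. The trimming and underflow/overflow handling described above are not reproduced; names are illustrative.

import numpy as np

def convolution_tree(pmfs):
    """PMF of X_1 + ... + X_m (independent, supports 0..len(pmf)-1) via a
    balanced binary tree of pairwise convolutions. This is the plain
    sum-product convolution tree; the paper's trimmed p-convolution tree
    adds support trimming and underflow/overflow handling on top.
    """
    layer = [np.asarray(p, float) for p in pmfs]
    while len(layer) > 1:
        nxt = [np.convolve(a, b) for a, b in zip(layer[::2], layer[1::2])]
        if len(layer) % 2:
            nxt.append(layer[-1])
        layer = nxt
    return layer[0]

# Example: sum of three biased coins (support {0, 1} each).
coin = [0.3, 0.7]
print(convolution_tree([coin, coin, coin]))  # PMF over {0, 1, 2, 3}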
Journal article
Fast Exact Computation of the k Most Abundant Isotope Peaks with Layer-Ordered Heaps
Published 04/08/2020
Analytical chemistry (Washington), 92, 15, 10613 - 10619
Computation of the isotopic distribution of compounds is crucial to applications of mass spectrometry, particularly as machine precision continues to improve. In the past decade, several tools have been created for doing so. In this paper we present a novel algorithm for calculating either the k most abundant isotopologue peaks of a compound or the minimal set of isotopologue peaks whose combined abundance is at least p. The algorithm uses Serang's optimal method of selection on Cartesian products. The method is significantly faster than the state of the art on large compounds (e.g., the Titin protein) and on compounds whose elements have many isotopes (e.g., palladium alloys).
Journal article
EPIFANY: A Method for Efficient High-Confidence Protein Inference
Published 06/03/2020
Journal of proteome research, 19, 3, 1060 - 1072
Accurate protein inference in the presence of shared peptides is still one of the key problems in bottom-up proteomics. Most protein inference tools employing simple heuristic inference strategies are efficient but exhibit reduced accuracy. More advanced probabilistic methods often exhibit better inference quality but tend to be too slow for large data sets. Here, we present a novel protein inference method, EPIFANY, combining a loopy belief propagation algorithm with convolution trees for efficient processing of Bayesian networks. We demonstrate that EPIFANY combines the reliable protein inference of Bayesian methods with significantly shorter runtimes. On the 2016 iPRG protein inference benchmark data, EPIFANY is the only tested method that finds all true-positive proteins at a 5% protein false discovery rate (FDR) without strict prefiltering on the peptide-spectrum match (PSM) level, yielding an increase in identification performance (+10% in the number of true positives and +14% in partial AUC) compared to previous approaches. Even very large data sets with hundreds of thousands of spectra (which are intractable with other Bayesian and some non-Bayesian tools) can be processed with EPIFANY within minutes. The increased inference quality including shared peptides results in better protein inference results and thus increased robustness of the biological hypotheses generated. EPIFANY is available as open-source software for all major platforms at https://OpenMS.de/epifany.
Preprint
Optimally selecting the top K values from X+Y with layer-ordered heaps
Posted to a preprint site 30/01/2020
arXiv (Cornell University)
Selection and sorting the Cartesian sum, X+Y, are classic and important problems. Here, a new algorithm is presented, which generates the top k values of the form Xi + Yj. The algorithm relies only on median-of-medians and is simple to implement. Furthermore, it uses data structures contiguous in memory and is fast in practice. The presented algorithm is demonstrated to be theoretically optimal.
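The abstract states that the presented X+Y algorithm relies only on median-of-medians; as background, a minimal Python sketch of that classic worst-case linear-time selection primitive follows (the layer-ordered-heap machinery built on top of it is not reproduced). The function name is illustrative.

def median_of_medians_select(a, k):
    """Return the k-th smallest element (0-indexed) of a in worst-case
    linear time using the classic median-of-medians pivot rule, the only
    primitive the abstract says the presented X+Y algorithm relies on.
    """
    a = list(a)
    if len(a) <= 5:
        return sorted(a)[k]
    # Median of each group of 5, then recurse for a guaranteed-good pivot.
    medians = [sorted(a[i:i + 5])[len(a[i:i + 5]) // 2]
               for i in range(0, len(a), 5)]
    pivot = median_of_medians_select(medians, len(medians) // 2)
    lo = [x for x in a if x < pivot]
    eq = [x for x in a if x == pivot]
    hi = [x for x in a if x > pivot]
    if k < len(lo):
        return median_of_medians_select(lo, k)
    if k < len(lo) + len(eq):
        return pivot
    return median_of_medians_select(hi, k - len(lo) - len(eq))

print(median_of_medians_select([7, 2, 9, 4, 1, 8, 3, 6, 5, 0], 4))  # -> 4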
Journal article
Alphabet Projection of Spectra
Published 06/09/2019
Journal of proteome research, 18, 9, 3268 - 3281
In the metabolomics, glycomics, and mass spectrometry of structured small molecules, the combinatorial nature of the problem renders a database impossibly large, and thus de novo analysis is necessary. De novo analysis requires an alphabet of mass-difference values used to link peaks in fragmentation spectra when they differ by a mass in the alphabet divided by a charge. Often, this alphabet is not known, prohibiting de novo analysis. A method is proposed that, given fragmentation mass spectra, identifies an alphabet of m/z differences that can build large connected graphs from many intense peaks in each spectrum in a collection. We then introduce a novel approach for efficiently finding recurring substructures in the de novo graph results.
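A minimal Python sketch of the graph-building step described above: two peaks are linked when their m/z difference, scaled by an assumed charge, matches a mass in the alphabet to within a tolerance. The function name, the fixed single charge, and the absolute tolerance are simplifying assumptions for illustration; the paper's alphabet-identification and substructure-mining steps are not reproduced.

def spectrum_graph(mzs, alphabet, charge=1, tol=0.01):
    """Edges (i, j, mass) linking peaks i < j whose m/z difference matches a
    mass in the alphabet divided by the assumed charge, within tolerance.
    A deliberately simplified version of the de novo graph construction
    described in the abstract (single fixed charge, absolute tolerance).
    """
    mzs = sorted(mzs)
    edges = []
    for i in range(len(mzs)):
        for j in range(i + 1, len(mzs)):
            delta = (mzs[j] - mzs[i]) * charge
            for mass in alphabet:
                if abs(delta - mass) <= tol:
                    edges.append((i, j, mass))
    return edges

# Toy example: glycine (57.02146) and alanine (71.03711) residue masses.
peaks = [175.119, 232.140, 303.177, 360.199]
print(spectrum_graph(peaks, alphabet=[57.02146, 71.03711]))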
Preprint
Posted to a preprint site 02/08/2017
arXiv (Cornell University)
The bit-reversed permutation is a famous task in signal processing and is key to efficient implementation of the fast Fourier transform. This paper presents optimized C++11 implementations of five extant methods for computing the bit-reversed permutation: Stockham auto-sort, naive bitwise swapping, swapping via a table of reversed bytes, local pairwise swapping of bits, and swapping via a cache-localized matrix buffer. Three new strategies for performing the bit-reversed permutation in C++11 are proposed: an inductive method using the bitwise XOR operation, a template-recursive closed form, and a cache-oblivious template-recursive approach, which reduces the bit-reversed permutation to smaller bit-reversed permutations and a square matrix transposition. These new methods are compared to the extant approaches in terms of theoretical runtime, empirical compile time, and empirical runtime. The template-recursive cache-oblivious method is shown to be competitive with the fastest known method; however, we demonstrate that the cache-oblivious method can more readily benefit from parallelization on multiple cores and on the GPU.