John D Kececioglu
- Professor, Computer Science
- Associate Professor, Applied Mathematics - GIDP
- Associate Professor, Genetics - GIDP
- Member of the Graduate Faculty
- Professor, BIO5 Institute
Contact
- (520) 621-4526
- Gould-Simpson, Rm. 727
- Tucson, AZ 85721
- kece@cs.arizona.edu
Bio
No activities entered.
Interests
Research
Algorithm design, analysis, and implementation; bioinformatics and computational biology; combinatorial optimization.
Courses
2024-25 Courses
-
Analysis Discrete Struct
CSC 345 (Spring 2025) -
Research
CSC 900 (Spring 2025) -
Algorithms in Bioinformatics
CSC 450 (Fall 2024) -
Algorithms in Bioinformatics
CSC 550 (Fall 2024) -
Research
CSC 900 (Fall 2024)
2023-24 Courses
-
Independent Study
CSC 599 (Spring 2024) -
Thesis
CSC 910 (Spring 2024) -
Research
CSC 900 (Fall 2023)
2022-23 Courses
-
Algorithms
CSC 445 (Spring 2023) -
Research
CSC 900 (Spring 2023) -
Algorithms
CSC 445 (Fall 2022) -
Honors Thesis
CSC 498H (Fall 2022) -
Research
CSC 900 (Fall 2022)
2021-22 Courses
-
Dissertation
CSC 920 (Spring 2022) -
Dsgn+Anls of Algorithms
CSC 545 (Spring 2022) -
Independent Study
CSC 399 (Spring 2022) -
Algorithms
CSC 445 (Fall 2021) -
Dissertation
CSC 920 (Fall 2021) -
Research
CSC 900 (Fall 2021)
2020-21 Courses
-
Dissertation
CSC 920 (Spring 2021) -
Dsgn+Anls of Algorithms
CSC 545 (Spring 2021) -
Independent Study
CSC 599 (Spring 2021) -
Algorithms
CSC 445 (Fall 2020) -
Dissertation
CSC 920 (Fall 2020) -
Research
CSC 900 (Fall 2020)
2019-20 Courses
-
Algorithms
CSC 445 (Spring 2020) -
Dissertation
CSC 920 (Spring 2020) -
Adv Tpcs:Doctoral Colloq
CSC 695C (Fall 2019) -
Dsgn+Anls of Algorithms
CSC 545 (Fall 2019) -
Research
CSC 900 (Fall 2019)
2018-19 Courses
-
Adv Topics Software Sys
CSC 630 (Spring 2019) -
Algorithms
CSC 445 (Fall 2018) -
Independent Study
CSC 599 (Fall 2018)
2017-18 Courses
-
Adv Tpcs:Doctoral Colloq
CSC 695C (Spring 2018) -
Algorithms in Bioinformatics
CSC 550 (Spring 2018) -
Honors Thesis
CSC 498H (Spring 2018) -
Adv Tpcs:Doctoral Colloq
CSC 695C (Fall 2017) -
Algorithms in Bioinformatics
CSC 450 (Fall 2017) -
Honors Thesis
CSC 498H (Fall 2017)
2016-17 Courses
-
Adv Tpcs:Doctoral Colloq
CSC 695C (Spring 2017) -
Adv Tpcs:Doctoral Colloq
CSC 695C (Fall 2016)
2015-16 Courses
-
Dissertation
CSC 920 (Spring 2016) -
Honors Thesis
CSC 498H (Spring 2016)
Scholarly Contributions
Books
- DeBlasio, D. F., & Kececioglu, J. D. (2017). Parameter Advising for Multiple Sequence Alignment. New York: Springer.
Journals/Publications
- Krieger, S., & Kececioglu, J. D. (2022). Heuristic shortest hyperpaths in cell signaling hypergraphs. BMC Algorithms for Molecular Biology, 17(12), 26 pages. doi:doi:10.1186/s13015-022-00217-9More infoPublished in BMC Algorithms for Molecular Biology in 2022
- Matheson, T., Stubens, C., Wolf, N., Lee, C., Narayan, G., Saha, A., Scott, A., Soraisam, M., Bolton, A., Hauger, B., Silva, D., Kececioglu, J. D., Scheidegger, C. E., Snodgrass, R. T., Aleo, P., Evans-Jacquez, E., Singh, N., Wang, Z., Yang, S., & Zhao, Z. (2021). The ANTARES astronomical time-domain event broker. The Astronomical Journal, 161(107), 1-16.
- DeBlasio, D. F., & Kececioglu, J. D. (2018). Adaptive local realignment of protein sequences. Journal of Computational Biology, 25(7), 780-793. doi:DOI:10.1089/cmb.2018.0045More infoJournal of Computational biology does not even come up on CORE (which is ridiculous), as it is a highly-ranked journal in bioinformatics. (Selected papers from RECOMB, one of the top two conferences in bioinformatics, are invited for full journal versions in the Journal of Computational Biology.)
- Zhu, S., Narayan, G., Welch, E., Zaidi, T., Toeniskoetter, J., Soraisam, M. D., Taylor, C., Wang, Z., Singh, N., Lochner, M., Evans, E. M., Matheson, T., Seaman, R. L., Saha, A., Ridgway, S. T., Yang, S., Zhao, Z., Maier, R. S., Jenness, T., , Kececioglu, J. D., et al. (2018). Machine-learning-based brokers for real-time classification of the LSST alert stream. The Astrophysical Journal Supplement Series, 236(9), 26 pages. doi:doi:10.3847/1538-4365/aab781
- DeBlasio, D. F., & Kececioglu, J. D. (2017). Core column prediction for protein multiple sequence alignments. Algorithms for Molecular Biology, 12(11), 16. doi:10.1186/s13015-017-0102-3
- DeBlasio, D. F., & Kececioglu, J. D. (2017). Learning parameter-advising sets for multiple sequence alignment. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 14(5), 1028-1041.
- Suh, Y., Snodgrass, R. T., Kececioglu, J. D., Downey, P. J., Maier, R. S., & Yi, C. (2017). EMP: execution time measurement protocol for compute-bound programs. Software - Practice and Experience, 47(4), 559-597.
- Suh, Y., Snodgrass, R. T., Kececioglu, J. D., & Downey, P. J. (2016). EMP: Execution Time Measurement Protocol for Compute-Bound Programs. Software---Practice and Experience, 37+4.More infoThis journal is ranked A in CORE
- DeBlasio, D., & Kececioglu, J. D. (2014). Learning parameter sets for alignment advising. Proceedings of the 5th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics (ACM BCB 2014), 230-239.More infoWhile the multiple sequence alignment output by an alignerstrongly depends on the parameter values used for the alignmentscoring function(such as the choice of gap penalties and substitution scores),most users rely on the single default parameter settingprovided by the aligner.A different parameter setting, however,might yield a much higher-quality alignment for the specific set ofinput sequences.The problem of picking a good choice of parameter valuesfor specific input sequences is called \textit{parameter advising}.A parameter advisor has two ingredients:(i) a set of parameter choices to select from,and (ii) an estimator that provides an estimate of the accuracyof the alignment computed by the aligner using a parameter choice.The parameter advisor picks the parameter choice from the setwhose resulting alignment has highest estimated accuracy.We consider for the first timethe problem of learning the optimal set of parameterchoices for a parameter advisor that uses a given accuracy estimator.The optimal set is one that maximizes the expected true accuracy of theresulting parameter advisor, averaged over a collection of training data.While we prove that learning an optimal set for an advisor is NP-complete,we show there is a natural approximation algorithm for this problem,and prove a tight bound on its approximation ratio.Experiments with an implementation of this approximation algorithmon biological benchmarks,using various accuracy estimators from the literature,show it finds sets for advisors that are surprisingly close to optimal.Furthermore, the resulting parameter advisors are significantlymore accurate in practice than simply aligningwith a single default parameter choice.
- Kececioglu, J., & Deblasio, D. (2013). Accuracy estimation and parameter advising for protein multiple sequence alignment. Journal of Computational Biology, 20(4), 259-279.More infoPMID: 23489379;PMCID: PMC3619150;Abstract: We develop a novel and general approach to estimating the accuracy of multiple sequence alignments without knowledge of a reference alignment, and use our approach to address a new task that we call parameter advising: the problem of choosing values for alignment scoring function parameters from a given set of choices to maximize the accuracy of a computed alignment. For protein alignments, we consider twelve independent features that contribute to a quality alignment. An accuracy estimator is learned that is a polynomial function of these features; its coefficients are determined by minimizing its error with respect to true accuracy using mathematical optimization. Compared to prior approaches for estimating accuracy, our new approach (a) introduces novel feature functions that measure nonlocal properties of an alignment yet are fast to evaluate, (b) considers more general classes of estimators beyond linear combinations of features, and (c) develops new regression formulations for learning an estimator from examples; in addition, for parameter advising, we (d) determine the optimal parameter set of a given cardinality, which specifies the best parameter values from which to choose. Our estimator, which we call Facet (for "feature-based accuracy estimator"), yields a parameter advisor that on the hardest benchmarks provides more than a 27% improvement in accuracy over the best default parameter choice, and for parameter advising significantly outperforms the best prior approaches to assessing alignment quality. © Copyright 2013, Mary Ann Liebert, Inc. 2013.
- Deblasio, D. F., Wheeler, T. J., & Kececioglu, J. D. (2012). Estimating the accuracy of multiple alignments and its use in parameter advising. Proceedings of RECOMB 2012 (Springer-Verlag Lecture Notes in Bioinformatics), 7262 LNBI, 45-59.More infoAbstract: We develop a novel and general approach to estimating the accuracy of protein multiple sequence alignments without knowledge of a reference alignment, and use our approach to address a new problem that we call parameter advising. For protein alignments, we consider twelve independent features that contribute to a quality alignment. An accuracy estimator is learned that is a polynomial function of these features; its coefficients are determined by minimizing its error with respect to true accuracy using mathematical optimization. We evaluate this approach by applying it to the task of parameter advising: the problem of choosing alignment scoring parameters from a collection of parameter values to maximize the accuracy of a computed alignment. Our estimator, which we call Facet (for "feature-based accuracy estimator"), yields a parameter advisor that on the hardest benchmarks provides more than a 20% improvement in accuracy over the best default parameter choice, and outperforms the best prior approaches to selecting good alignments for parameter advising. © 2012 Springer-Verlag Berlin Heidelberg.
- Huynh, L., Kececioglu, J., Köppe, M., & Tagkopoulos, I. (2012). Automatic design of synthetic gene circuits through mixed integer non-linear programming. PLoS ONE, 7(4).More infoPMID: 22536398;PMCID: PMC3334995;Abstract: Automatic design of synthetic gene circuits poses a significant challenge to synthetic biology, primarily due to the complexity of biological systems, and the lack of rigorous optimization methods that can cope with the combinatorial explosion as the number of biological parts increases. Current optimization methods for synthetic gene design rely on heuristic algorithms that are usually not deterministic, deliver sub-optimal solutions, and provide no guaranties on convergence or error bounds. Here, we introduce an optimization framework for the problem of part selection in synthetic gene circuits that is based on mixed integer non-linear programming (MINLP), which is a deterministic method that finds the globally optimal solution and guarantees convergence in finite time. Given a synthetic gene circuit, a library of characterized parts, and user-defined constraints, our method can find the optimal selection of parts that satisfy the constraints and best approximates the objective function given by the user. We evaluated the proposed method in the design of three synthetic circuits (a toggle switch, a transcriptional cascade, and a band detector), with both experimentally constructed and synthetic promoter libraries. Scalability and robustness analysis shows that the proposed framework scales well with the library size and the solution space. The work described here is a step towards a unifying, realistic framework for the automated design of biological circuits. © 2012 Huynh et al.
- Kececioglu, J., Kececioglu, J. D., & Kim, E. (2008). Learning scoring schemes for sequence alignment from partial examples. IEEE/ACM transactions on computational biology and bioinformatics / IEEE, ACM, 5(4).More infoWhen aligning biological sequences, the choice of parameter values for the alignment scoring function is critical. Small changes in gap penalties, for example, can yield radically different alignments. A rigorous way to compute parameter values that are appropriate for aligning biological sequences is through inverse parametric sequence alignment. Given a collection of examples of biologically correct alignments, this is the problem of finding parameter values that make the scores of the example alignments close to those of optimal alignments for their sequences. We extend prior work on inverse parametric alignment to partial examples, which contain regions where the alignment is left unspecified, and to an improved formulation based on minimizing the average error between the score of an example and the score of an optimal alignment. Experiments on benchmark biological alignments show we can find parameters that generalize across protein families and that boost the accuracy of multiple sequence alignment by as much as 25 percent.
- Kececioglu, J., Kim, E., & Wheeler, T. (2010). Aligning protein sequences with predicted secondary structure. Journal of Computational Biology, 17(3), 561-580.More infoPMID: 20377464;Abstract: Accurately aligning distant protein sequences is notoriously difficult. Since the amino acid sequence alone often does not provide enough information to obtain accurate alignments under the standard alignment scoring functions, a recent approach to improving alignment accuracy is to use additional information such as secondary structure. We make several advances in alignment of protein sequences annotated with predicted secondary structure: (1) more accurate models for scoring alignments, (2) efficient algorithms for optimal alignment under these models, and (3) improved learning criteria for setting model parameters through inverse alignment, as well as (4) in-depth experiments evaluating model variants on benchmark alignments. More specifically, the new models use secondary structure predictions and their confidences to modify the scoring of both substitutions and gaps. All models have efficient algorithms for optimal pairwise alignment that run in near-quadratic time. These models have many parameters, which are rigorously learned using inverse alignment under a new criterion that carefully balances score error and recovery error. We then evaluate these models by studying how accurately an optimal alignment under the model recovers benchmark reference alignments that are based on the known three-dimensional structures of the proteins. The experiments show that these new models provide a significant boost in accuracy over the standard model for distant sequences. The improvement for pairwise alignment is as much as 15% for sequences with less than 25% identity, while for multiple alignment the improvement is more than 20% for difficult benchmarks whose accuracy under standard tools is at most 40%. © Copyright 2010, Mary Ann Liebert, Inc.
- Kim, E., Wheeler, T., & Kececioglu, J. (2009). Learning models for aligning protein sequences with predicted secondary structure. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 5541 LNBI, 512-531.More infoAbstract: Accurately aligning distant protein sequences is notoriously difficult. A recent approach to improving alignment accuracy is to use additional information such as predicted secondary structure. We introduce several new models for scoring alignments of protein sequences with predicted secondary structure, which use the predictions and their confidences to modify both the substitution and gap cost functions. We present efficient algorithms for computing optimal pairwise alignments under these models, all of which run in near-quadratic time. We also review an approach to learning the values of the parameters in these models called inverse alignment. We then evaluate the accuracy of these models by studying how well an optimal alignment under the model recovers known benchmark reference alignments. Our experiments show that using parameters learned by inverse alignment, these new secondarystructure-based models provide a significant improvement in alignment accuracy for distant sequences. The best model improves upon the accuracy of the standard sequence alignment model for pairwise alignment by as much as 15% for sequences with less than 25% identity, and improves the accuracy of multiple alignment by 20% for difficult benchmarks whose average accuracy under standard tools is less than 40%. © Springer-Verlag Berlin Heidelberg 2009.
- Kim, E., & Kececioglu, J. (2008). Learning scoring schemes for sequence alignment from partial examples. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 5(4), 546-556.More infoPMID: 18989042;Abstract: When aligning biological sequences, the choice of scoring scheme is critical. Even small changes in gap penalties, for example, can yield radically different alignments. A rigorous way to learn parameter values that are appropriate for biological sequences is through inverse parametric sequence alignment. Given a collection of examples of biologically correct reference alignments, this is the problem of finding parameter values that make the scores of the reference alignments be as close as possible to those of optimal alignments of their sequences. We extend prior work on inverse parametric alignment to partial examples, which contain regions where the reference alignment is not specified, and to an improved formulation based on minimizing the average error between the scores of the reference alignments and the scores of optimal alignments. Experiments on benchmark biological alignments show we can learn scoring schemes that generalize across protein families, and that boost the accuracy of multiple sequence alignment by as much as 25 percent. © 2008 IEEE.
- Kececioglu, J., Wheeler, T. J., & Kececioglu, J. D. (2007). Multiple alignment by aligning alignments. Bioinformatics (Oxford, England), 23(13).More infoMultiple sequence alignment is a fundamental task in bioinformatics. Current tools typically form an initial alignment by merging subalignments, and then polish this alignment by repeated splitting and merging of subalignments to obtain an improved final alignment. In general this form-and-polish strategy consists of several stages, and a profusion of methods have been tried at every stage. We carefully investigate: (1) how to utilize a new algorithm for aligning alignments that optimally solves the common subproblem of merging subalignments, and (2) what is the best choice of method for each stage to obtain the highest quality alignment.
- Kim, E., & Kececioglu, J. (2007). Inverse sequence alignment from partial examples. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 4645 LNBI, 359-370.More infoAbstract: When aligning biological sequences, the choice of parameter values for the alignment scoring function is critical. Small changes in gap penalties, for example, can yield radically different alignments. A rigorous way to compute parameter values that are appropriate for biological sequences is inverse parametric sequence alignment. Given a collection of examples of biologically correct alignments, this is the problem of finding parameter values that make the example alignments score close to optimal. We extend prior work on inverse alignment to partial examples and to an improved model based on minimizing the average error of the examples. Experiments on benchmark biological alignments show we can find parameters that generalize across protein families and that boost the recovery rate for multiple sequence alignment by up to 25%. © Springer-Verlag Berlin Heidelberg 2007.
- Wheeler, T. J., & Kececioglu, J. D. (2007). Multiple alignment by aligning alignments. Bioinformatics, 23(13), i559-i568.More infoPMID: 17646343;Abstract: Motivation: Multiple sequence alignment is a fundamental task in bioinformatics. Current tools typically form an initial alignment by merging subalignments, and then polish this alignment by repeated splitting and merging of subalignments to obtain an improved final alignment. In general this form-and-polish strategy consists of several stages, and a profusion of methods have been tried at every stage. We carefully investigate: (1) how to utilize a new algorithm for aligning alignments that optimally solves the common subproblem of merging subalignments, and (2) what is the best choice of method for each stage to obtain the highest quality alignment. Results: We study six stages in the form-and-polish strategy for multiple alignment: parameter choice, distance estimation, merge-tree construction, sequence-pair weighting, alignment merging, and polishing. For each stage, we consider novel approaches as well as standard ones. Interestingly, the greatest gains in alignment quality come from (i) estimating distances by a new approach using normalized alignment costs, and (ii) polishing by a new approach using 3-cuts. Experiments with a parameter-value oracle suggest large gains in quality may be possible through an input-dependent choice of alignment parameters, and we present a promising approach for building such an oracle. Combining the best approaches to each stage yields a new tool we call Opal that on benchmark alignments matches the quality of the top tools, without employing alignment consistency or hydrophobic gap penalties. © 2007 The Author(s).
- Kececioglu, J., & Kim, E. (2006). Simple and fast inverse alignment. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 3909 LNBI, 441-455.More infoAbstract: For as long as biologists have been computing alignments of sequences, the question of what values to use for scoring substitutions and gaps has persisted. While some choices for substitution scores are now common, largely due to convention, there is no standard for choosing gap penalties. An objective way to resolve this question is to learn the appropriate values by solving the Inverse String Alignment Problem: given examples of correct alignments, find parameter values that make the examples be optimal-scoring alignments of their strings. We present a new polynomial-time algorithm for Inverse String Alignment that is simple to implement, fast in practice, and for the first time can learn hundreds of parameters simultaneously. The approach is also flexible: minor modifications allow us to solve inverse unique alignment (find parameter values that make the examples be the unique optimal alignments of their strings), and inverse near-optimal alignment (find parameter values that make the example alignments be as close to optimal as possible). Computational results with an implementation for global alignment show that, for the first time, we can find best-possible values for all 212 parameters of the standard protein-sequence scoring-model from hundreds of alignments in a few minutes of computation. © Springer-Verlag Berlin Heidelberg 2006.
- Kececioglu, J., & Starrett, D. (2004). Aligning alignments exactly. Proceedings of the Annual International Conference on Computational Molecular Biology, RECOMB, 8, 85-96.More infoAbstract: A basic computational problem that arises in both the construction and local-search phases of the best heuristics for multiple sequence alignment is that of aligning the columns of two multiple alignments. When the scoring function is the sum-of-pairs objective and induced pairwise alignments are evaluated using linear gap-costs, we call this problem Aligning Alignments. While seemingly a straightforward extension of two-sequence alignment, we prove it is actually NP-complete. As explained in the paper, this provides the first demonstration that minimizing linear gap-costs, in the context of multiple sequence alignment, is inherently hard. We also develop an exact algorithm for Aligning Alignments that is remarkably efficient in practice, both in time and space. Even though the problem is NP-complete, computational experiments on both biological and simulated data show we can compute optimal alignments for all benchmark instances in two standard datasets, and solve very-large random instances with highly-gapped sequences.
- Kececioglu, J., & Yu, J. (2001). Separating repeats in DNA sequence assembly. Proceedings of the Annual International Conference on Computational Molecular Biology, RECOMB, 176-183.More infoAbstract: One of the key open problems in large-scale DNA sequence assembly is the correct reconstruction of sequences that contain repeats. A long repeat can confound a sequence assembler into falsely overlaying fragments that sample its copies, effectively compressing out the repeat in the reconstructed sequence. We call the task of correcting this compression by separating the overlaid fragments into the distinct copies they sample, the repeat separation problem. We present a rigorous formulation of repeat separation in the general setting without prior knowledge of consensus sequences of repeats or their number of copies. Our formulation decomposes the task into a series of four subproblems, and we design probabilistic tests or combinatorial algorithms that solve each subproblem. The core subproblem separates repeats using the so-called k-median problem in combinatorial optimization, which we solve using integer linear-programming. Experiments with an implementation show we can separate fragments that are over laid at 10 times the coverage with very few mistakes in a few seconds of computation, even when the sequencing error rate and the error rate between copies are identical. To our knowledge this is the first rigorous and fully general approach to separating repeats that directly addresses the problem.
- Kececioglu, J. D., Lenhof, H., Mehlhorn, K., Mutzel, P., Reinert, K., & Vingron, M. (2000). A polyhedral approach to sequence alignment problems. Discrete Applied Mathematics, 104(1-3), 143-186.More infoAbstract: We study two new problems in sequence alignment both from a practical and a theoretical view, using tools from combinatorial optimization to develop branch-and-cut algorithms. The generalized maximum trace formulation captures several forms of multiple sequence alignment problems in a common framework, among them the original formulation of maximum trace. The RNA sequence alignment problem captures the comparison of RNA molecules on the basis of their primary sequence and their secondary structure. Both problems have a characterization in terms of graphs which we reformulate in terms of integer linear programming. We then study the polytopes (or convex hulls of all feasible solutions) associated with the integer linear program for both problems. For each polytope we derive several classes of facet-defining inequalities and show that for some of these classes the corresponding separation problem can be solved in polynomial time. This leads to a polynomial-time algorithm for pairwise sequence alignment that is not based on dynamic programming. Moreover, for multiple sequences the branch-and-cut algorithms for both sequence alignment problems are able to solve to optimality instances that are beyond the range of present dynamic programming approaches. © 2000 Elsevier Science B.V.
- Kececioglu, J., Shete, S., & Arnold, J. (2000). Reconstructing distances in physical maps of chromosomes with nonoverlapping probes. Proceedings of the Annual International Conference on Computational Molecular Biology, RECOMB, 183-192.More infoAbstract: We present a new method for reconstructing the distances between probes in physical maps of chromosomes constructed by hybridizing pairs of clones under the so-called sampling-without-replacement protocol. In this protocol, which is simple, inexpensive, and has been used to successfully map several organisms, equal-length clones are hybridized against a clone-subset called the probes. The probes are chosen by a sequential process that is designed to generate a pairwise-nonoverlapping subset of the clones. We derive a likelihood function on probe spacings and orders for this protocol under a natural model of hybridization error, and describe how to reconstruct the most likely spacing for a given order under this objective using continuous optimization. The approach is tested on simulated data and real data from chromosome VI of Aspergillus nidulans. On simulated data we recover the true order and close to the true spacing; on the real data, for which the true order and spacing is unknown, we recover a probe order differing significantly from the published one. To our knowledge this is the first practical approach for computing a globally-optimal maximum-likelihood reconstruction of interprobe distances from clone-probe hybridization data.
- Christof, T., & Kececioglu, J. (1999). Computing physical maps of chromosomes with nonoverlapping probes by branch-and-cut. Proceedings of the Annual International Conference on Computational Molecular Biology, RECOMB, 115-123.More infoAbstract: We introduce a new combinatorial formulation of chromosome physical-mapping by the sampling-without-replacement protocol. In this protocol, which is simple, inexpensive, and has been used to successfully map several organisms, equal-length clones are hybridized against a subset of the clones called probes, which are designed to form a maximal nonoverlapping clone-subset. The output of the protocol is the clone-probe hybridization matrix H. The problem of finding a maximum-likelihood reconstruction of the order of the probes along the chromosome in the presence of false positive and negative hybridization error is equivalent to finding the minimum number of entries of H to change to zeros so that the resulting matrix has at most 2 ones per row, and the consecutive-ones property across rows. This combinatorial problem, which we call 2-Consecutive-Ones Mapping, has a concise integer linear-programming formulation, to which we apply techniques from polyhedral combinatorics. The formulation is unique in that it does not explicitly represent the probe permutation, and in contrast to prior linear-programming approaches, the number of variables is small: in practice, linear in the number of clones. We derive a large class of facet-defining inequalities for the 2-consecutive-ones polytope that we call the augmented k-degree inequalities, and we show that the basic k-degree class can be efficiently separated using bipartite matchings. Computational results with an implementation of the resulting branch-and-cut algorithm applied to both simulated and real data from the complete genome of Aspergillus nidulans show that we can solve many problems to provable optimality and find maps of higher quality than previously possible.
- Kececioglu, J., & Gusfield, D. (1998). Reconstructing a history of recombinations from a set of sequences. Discrete Applied Mathematics, 88(1-3), 239-260.More infoAbstract: One of the classic problems in computational biology is the reconstruction of evolutionary history. A recent trend in the area is to increase the explanatory power of the models that are considered by incorporating higher-order evolutionary events that more accurately reflect the mechanisms of mutation at the level of the chromosome. We take a step in this direction by considering the problem of reconstructing an evolutionary history for a set of genetic sequences that have evolved by recombination. Recombination is a non-tree-like event that produces a child sequence by crossing two parent sequences. We present polynomial-time algorithms for reconstructing a parsimonious history of such events for several models of recombination when all sequences, including those of ancestors, are present in the input. We also show that these models appear to be near the limit of what can be solved in polynomial time, in that several natural generalizations are NP-complete. © 1998 Elsevier Science B.V. All rights reserved.
- Ravi, R., & Kececioglu, J. D. (1998). Approximation algorithms for multiple sequence alignment under a fixed evolutionary tree. Discrete Applied Mathematics, 88(1-3), 355-366.More infoAbstract: We consider the problem of multiple sequence alignment under a fixed evolutionary tree: given a tree whose leaves are labeled by sequences, find ancestral sequences to label its internal nodes so as to minimize the total length of the tree, where the length of an edge is the edit distance between the sequences labeling its endpoints. We present a new polynomial-time approximation algorithm for this problem, and analyze its performance on regular d-ary trees with d a constant. On such a tree, the algorithm finds a solution within a factor (d + 1)/(d - 1) of the minimum in Q(kdT(d, n)+k2dn2) time, where k is the number of leaves in the tree, n is the length of the longest sequence labeling a leaf, and T(d, n) is the time to compute a Steiner point for d sequences of length at most n. (A Steiner point for a set script capital L sign of sequences is a sequence P that minimizes the sum of the edit distances from P to each sequence in script capital L sign. The time T(d, n) is O(d2dnd), given O(dsd+l)-time preprocessing for an alphabet of size s.) The approximation algorithm is conceptually simple and easy to implement, and actually applies to any metric space in which a Steiner point for any fixed-sized set can be computed in polynomial time. We also introduce a new problem, bottleneck tree-alignment, in which the objective is to label the internal nodes of the tree so as to minimize the length of the longest edge. We describe an exponential-time exact algorithm for the case of unit-cost edit operations, and show there is a simple linear-time approximation algorithm for the general case that finds a solution within a factor O(log k) of the minimum. 1998 Published by Elsevier Science B.V. All rights reserved.
- Christof, T., Juenger, M., Kececioglu, J., Mutzel, P., & Reinelt, G. (1997). Branch-and-cut approach to physical mapping with end-probes. Proceedings of thr Annual International Conference on Computational Molecular Biology, RECOMB, 84-92.More infoAbstract: A fundamental problem in computational biology is the construction of physical maps of chromosomes from hybridization experiments between unique probes and clones of chromosome fragments in the presence of error. Alizadeh, Karp, Weisser and Zweig [AKWZ94] first considered a maximum-likelihood model of the problem that is equivalent to finding an ordering of the probes that minimizes a weighted sum of errors, and developed several effective heuristics. We show that by exploiting information about the end-probes of clones, this model can be formulated as a weighted Betweenness Problem. This affords the significant advantage of allowing the well-developed tools of integer linear-programming and branch-and-cut algorithms to be brought to bear on physical mapping, enabling us for the first time to solve small mapping instances to optimality even in the presence of high error. We also show that by combining the optimal solution of many small overlapping Betweenness Problems, one can effectively screen errors from larger instances, and solve the edited instance to optimality as a Hamming-Distance Traveling Salesman Problem. This suggests a new combined approach to physical map construction.
- Christof, T., Jünger, M., Kececioglu, J., Mutzel, P., & Reinelt, G. (1997). A branch-and-cut approach to physical mapping of chromosomes by unique end-probes. Journal of Computational Biology, 4(4), 433-447.More infoPMID: 9385538;Abstract: A fundamental problem in computational biology is the construction of physical maps of chromosomes from hybridization experiments between unique probes and clones of chromosome fragments in the presence of error. Alizadeh, Karp, Weisser and Zweig (Algorithmica 13:1/2, 52-76, 1995) first considered a maximum-likelihood model of the problem that is equivalent to finding an ordering of the probes that minimizes a weighted sum of errors and developed several effective heuristics. We show that by exploiting information about the end-probes of clones, this model can be formulated as a Weighted Betweenness Problem. This affords the significant advantage of allowing the well-developed tools of integer linear-programming and branch-and-cut algorithms to be brought to bear on physical mapping, enabling us for the first time to solve small mapping instances to optimality even in the presence of high error. We also show that by combining the optimal solution of many small overlapping Betweenness Problems, one can effectively screen errors from larger instances and solve the edited instance to optimality as a Hamming-Distance Traveling Salesman Problem. This suggests a new approach, a Betweenness-Traveling Salesman hybrid, for constructing physical maps.
- Kececioglu, J., Ming, L. i., & Tromp, J. (1997). Inferring a DNA sequence from erroneous copies. Theoretical Computer Science, 185(1), 3-13.More infoAbstract: We suggest a novel approach for efficiently reconstructing an original DNA sequence from erroneous copies.
- Reinert, K., Lenhof, H. -., Mutzel, P., Mehlhorn, K., & Kececioglu, J. D. (1997). Branch-and-cut algorithm for multiple sequence alignment. Proceedings of thr Annual International Conference on Computational Molecular Biology, RECOMB, 241-250.More infoAbstract: Multiple sequence alignment is an important problem in computational biology. We study the Maximum Trace formulation introduced by Kececioglu [Kec91]. We first phrase the problem in terms of forbidden subgraphs, which enables us to express Maximum Trace as an integer linear-programming problem, and then solve the integer linear program using methods from polyhedral combinatorics. The trace polytope is the convex hull of all feasible solutions to the Maximum Trace problem; for the case of two sequences, we give a complete characterization of this polytope. This yields a polynomial-time algorithm for a general version of pairwise sequence alignment that, perhaps suprisingly, does not use dynamic programming; this yields, for instance, a non-dynamic-programming algorithm for sequence comparison under the 0-1 metric, which gives another answer to a long-open question in the area of string algorithms [PW93]. For the multiple-sequence case, we derive several classes of facet-defining inequalities and show that for all but one class, the corresponding separation problem can be solved in polynomial time. This leads to a branch-and-cut algorithm for multiple sequence alignment, and we report on our first computational experience. It appears that a polyhedral approach to multiple sequence alignment can solve instances that are beyond present dynamic-programming approaches.
- Taylor, E. W., Bhat, A., Nadimpalli, R. G., Zhang, W., & Kececioglu, J. (1997). HIV-1 encodes a sequence overlapping env gp41 with highly significant similarity to selenium-dependent glutathione peroxidases.. Journal of acquired immune deficiency syndromes and human retrovirology : official publication of the International Retrovirology Association, 15(5), 393-394.More infoPMID: 9342263;
- Gupta, S. K., Kececioglu, J. D., & Schäffer, A. (1995). Improving the practical space and time efficiency of the shortest-paths approach to sum-of-pairs multiple sequence alignment.. Journal of computational biology : a journal of computational molecular cell biology, 2(3), 459-472.More infoPMID: 8521275;Abstract: The MSA program, written and distributed in 1989, is one of the few existing programs that attempts to find optimal alignments of multiple protein or DNA sequences. The MSA program implements a branch-and-bound technique together with a variant of Dijkstra's shortest paths algorithm to prune the basic dynamic programming graph. We have made substantial improvements in the time and space usage of MSA. The improvements make feasible a variety of problem instances that were not feasible previously. On some runs we achieve an order of magnitude reduction in space usage and a significant multiplicative factor speedup in running time. To explain how these improvements work, we give a much more detailed description of MSA than has been previously available. In practice, MSA rarely produces a provably optimal alignment and we explain why.
- Kececioglu, J. D., & Myers, E. W. (1995). Combinatorial algorithms for DNA sequence assembly. Algorithmica, 13(1-2), 7-51.More infoAbstract: The trend toward very large DNA sequencing projects, such as those being undertaken as part of the Human Genome Program, necessitates the development of efficient and precise algorithms for assembling a long DNA sequence from the fragments obtained by shotgun sequencing or other methods. The sequence reconstruction problem that we take as our formulation of DNA sequence assembly is a variation of the shortest common superstring problem, complicated by the presence of sequencing errors and reverse complements of fragments. Since the simpler superstring problem is NP-hard, any efficient reconstruction procedure must resort to heuristics. In this paper, however, a four-phase approach based on rigorous design criteria is presented, and has been found to be very accurate in practice. Our method is robust in the sense that it can accommodate high sequencing error rates, and list a series of alternate solutions in the event that several appear equally good. Moreover, it uses a limited form of multiple sequence alignment to detect, and often correct, errors in the data. Our combined algorithm has successfully reconstructed nonrepetitive sequences of length 50,000 sampled at error rates of as high as 10%. © 1995 Springer-Verlag New York Inc.
- Kececioglu, J., & Sankoff, D. (1995). Exact and approximation algorithms for sorting by reversals, with application to genome rearrangement. Algorithmica, 13(1-2), 180-210.More infoAbstract: Motivated by the problem in computational biology of reconstructing the series of chromosome inversions by which one organism evolved from another, we consider the problem of computing the shortest series of reversals that transform one permutation to another. The permutations describe the order of genes on corresponding chromosomes, and a reversal takes an arbitrary substring of elements, and reverses their order. For this problem, we develop two algorithms: a greedy approximation algorithm, that finds a solution provably close to optimal in O(n 2) time and 0(n) space for n-element permutations, and a branch- and-bound exact algorithm, that finds an optimal solution in 0(mL(n, n)) time and 0(n 2) space, where m is the size of the branch- and-bound search tree, and L(n, n) is the time to solve a linear program of n variables and n constraints. The greedy algorithm is the first to come within a constant factor of the optimum; it guarantees a solution that uses no more than twice the minimum number of reversals. The lower and upper bounds of the branch- and-bound algorithm are a novel application of maximum-weight matchings, shortest paths, and linear programming. In a series of experiments, we study the performance of an implementation on random permutations, and permutations generated by random reversals. For permutations differing by k random reversals, we find that the average upper bound on reversal distance estimates k to within one reversal for k
- Kececioglu, J., & Gusfield, D. (1994). Reconstructing a history of recombinations from a set of sequences. Proceedings of the Annual ACM SIAM Symposium on Discrete Algorithms, 471-480.More infoAbstract: One of the classic problems in computational biology is the reconstruction of evolutionary histories. A recent trend is toward increasing the explanatory power of the models by incorporating higher-order evolutionary events that more accurately reflect the full range of mutation at the molecular level. In this paper, we take a step in this direction by considering the problem of reconstructing an evolutionary history for a set of genetic sequences that have evolved by recombination. Recombination produces a new sequence by crossing two parent sequences, and is among the most important mechanisms of higher-order molecular mutation.
- Lipman, D. J., Altschul, S. F., & Kececioglu, J. D. (1989). A tool for multiple sequence alignment. Proceedings of the National Academy of Sciences of the United States of America, 86(12), 4412-4415.More infoPMID: 2734293;PMCID: PMC287279;Abstract: Multiple sequence alignment can be a useful technique for studying molecular evolution and analyzing sequence-structure relationships. Until recently, it has been impractical to apply dynamic programming, the most widely accepted method for producing pairwise alignments, to comparisons of more than three sequences. We describe the design and application of a tool for multiple alignment of amino acid sequences that implements a new algorithm that greatly reduces the computational demands of dynamic programming. This tool is able to align in reasonable time as many as eight sequences the length of an average protein.
Proceedings Publications
- Krieger, S., & Kececioglu, J. D. (2022). Computing optimal factories in metabolic networks with negative regulation. In 30th ISCB Conference on Intelligent Systems for Molecular Biology (ISMB 2022), 38, i369-i377.More infoPublished in the Proceedings of ISMB 2022 (one of the top two conferences in bioinformatics and computational biology), in a special issue of the journal Bioinformatics 38 (2022)
- Kececioglu, J. D., & Krieger, S. (2021). Fast approximate shortest hyperpaths in cell signaling hypergraphs. In 21st ISCB Workshop on Algorithms in Bioinformatics (WABI 2021), 201, 1-20.More infoPublished in Proceedings of WABI 2021
- Krieger, S., & Kececioglu, J. D. (2020, July). Boosting the accuracy of protein secondary structure prediction through nearest neighbor search and method hybridization. In 28th ISCB Conference on Intelligent Systems for Molecular Biology (ISMB 2020).More infoISMB is one of the top two conferences in bioinformatics and computational biology (listed in csrankings.org under bioinformatics)
- Krieger, S., & Kececioglu, J. D. (2020, September). Predicting protein secondary structure by an ensemble through feature-based accuracy estimation. In 11th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics (ACM-BCB 2020), 1-10.More infoACM-BCB is a second-tier conference in bioinformatics and computational biology
- DeBlasio, D. F., & Kececioglu, J. D. (2017, May). Boosting alignment accuracy by adaptive local realignment. In Proceedings of the 21st Conference on Research in Computational Molecular Biology (RECOMB 2017), LNBI 10229, 1-17.More infoRECOMB is ranked in category "Computational biology and bioinformatics" on csrankings.org
- DeBlasio, D. F., & Kececioglu, J. D. (2016, August). Predicting core columns of protein multiple sequence alignments for improved parameter advising. In Proceedings of the 16th Workshop on Algorithms in Bioinformatics (WABI 2016), LNCS 9838, 77-89.
- Kececioglu, J. D., & Wilcox, A. (2016, October). Proceedings of the 7th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics. In 7th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics (ACM-BCB 2016).More infoAs Co-Chair of the Program Committee for the conference ACM-BCB 2016, I was an editor of the conference Proceedings, along with Co-Chair Adam Wilcox
- Kececioglu, J. D., Snodgrass, R. T., Saha, A., Wang, Z., Matheson, T., Narayan, G., Scheidegger, C. E., Axelrod, T., Jenness, T., Ridgway, S., Seaman, R., Taylor, C., Toeniskoetter, J., Welch, E., Yang, S., & Zaidi, T. (2016, July). ANTARES: progress towards building a broker of time-domain alerts. In Proceedings of SPIE 9910 (Observatory Operations: Strategies, Processes, and Systems VI), SPIE 9910, 8.
- DeBlasio, D. F., & Kececioglu, J. D. (2015, September). Ensemble multiple sequence alignment via advising. In Proceedings of the 6th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics (ACM BCB 2015).