John D Kececioglu
 Associate Professor, Computer Science
 Associate Professor, BIO5 Institute
 Associate Professor, Applied Mathematics  GIDP
 Associate Professor, Genetics  GIDP
Contact
 (520) 6214526
 GouldSimpson, Rm. 727
 Tucson, AZ 85721
 kece@cs.arizona.edu
Bio
No activities entered.
Interests
Research
Algorithm design, analysis, and implementation; bioinformatics and computational biology; combinatorial optimization.
Courses
201617 Courses

Adv Tpcs:Doctoral Colloq
CSC 695C (Fall 2016)
201516 Courses

Dissertation
CSC 920 (Spring 2016) 
Honors Thesis
CSC 498H (Spring 2016) 
Adv Tpcs:Doctoral Colloq
CSC 695C (Fall 2015) 
Dissertation
CSC 920 (Fall 2015) 
Dsgn+Anls of Algorithms
CSC 545 (Fall 2015) 
Honors Thesis
CSC 498H (Fall 2015)
201415 Courses

Dissertation
CSC 920 (Spring 2015) 
Dsgn+Anls of Algorithms
CSC 545 (Spring 2015) 
Analysis Discrete Struct
CSC 345 (Fall 2014) 
Dissertation
CSC 920 (Fall 2014) 
Independent Study
CSC 699 (Fall 2014) 
Preceptorship
CSC 391 (Fall 2014) 
Research
CSC 900 (Fall 2014)
201314 Courses

Adv Tpcs:Doctoral Colloq
CSC 695C (Spring 2014) 
Analysis Discrete Struct
CSC 345 (Spring 2014) 
Dissertation
CSC 920 (Spring 2014) 
Adv Rsrch Tops Cmptr Sci
CSC 496H (Fall 2013) 
Adv Topic Algorithm Anls
CSC 645 (Fall 2013) 
Adv Tpcs:Doctoral Colloq
CSC 695C (Fall 2013) 
Directed Research
CSC 492 (Fall 2013) 
Dissertation
CSC 920 (Fall 2013) 
Research
CSC 900 (Fall 2013) 
Rsrch Tops Computer Sci
CSC 296H (Fall 2013)
201213 Courses

Independent Study
CSC 599 (Summer I 2013)
Scholarly Contributions
Journals/Publications
 Suh, Y., Snodgrass, R. T., Kececioglu, J. D., & Downey, P. J. (2016). EMP: Execution Time Measurement Protocol for ComputeBound Programs. SoftwarePractice and Experience, 37+4.
 DeBlasio, D. F., & Kececioglu, J. D. (2015). Learning parameteradvising sets for multiple sequence alignment. IEEE/ACM Transactions on Computational Biology and Bioinformatics.More infoWhile the multiple sequence alignment output by an alignerstrongly depends on the parameter values used for the alignmentscoring function(such as the choice of gap penalties and substitution scores),most users rely on the single default parameter settingprovided by the aligner.A different parameter setting, however,might yield a much higherquality alignment for the specific set ofinput sequences.The problem of picking a good choice of parameter valuesfor specific input sequences is called \textit{parameter advising}.A parameter advisor has two ingredients:(i) a set of parameter choices to select from,and (ii) an estimator that provides an estimate of the accuracyof the alignment computed by the aligner using a parameter choice.The parameter advisor picks the parameter choice from the setwhose resulting alignment has highest estimated accuracy.We consider for the first timethe problem of learning the optimal set of parameterchoices for a parameter advisor that uses a given accuracy estimator.The optimal set is one that maximizes the expected true accuracy of theresulting parameter advisor, averaged over a collection of training data.While we prove that learning an optimal set for an advisor is NPcomplete,we show there is a natural approximation algorithm for this problem,and prove a tight bound on its approximation ratio.Experiments with an implementation of this approximation algorithmon biological benchmarks,using various accuracy estimators from the literature,show it finds sets for advisors that are surprisingly close to optimal.Furthermore, the resulting parameter advisors are significantlymore accurate in practice than simply aligningwith a single default parameter choice.
 DeBlasio, D., & Kececioglu, J. D. (2014). Learning parameter sets for alignment advising. Proceedings of the 5th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics (ACM BCB 2014), 230239.More infoWhile the multiple sequence alignment output by an alignerstrongly depends on the parameter values used for the alignmentscoring function(such as the choice of gap penalties and substitution scores),most users rely on the single default parameter settingprovided by the aligner.A different parameter setting, however,might yield a much higherquality alignment for the specific set ofinput sequences.The problem of picking a good choice of parameter valuesfor specific input sequences is called \textit{parameter advising}.A parameter advisor has two ingredients:(i) a set of parameter choices to select from,and (ii) an estimator that provides an estimate of the accuracyof the alignment computed by the aligner using a parameter choice.The parameter advisor picks the parameter choice from the setwhose resulting alignment has highest estimated accuracy.We consider for the first timethe problem of learning the optimal set of parameterchoices for a parameter advisor that uses a given accuracy estimator.The optimal set is one that maximizes the expected true accuracy of theresulting parameter advisor, averaged over a collection of training data.While we prove that learning an optimal set for an advisor is NPcomplete,we show there is a natural approximation algorithm for this problem,and prove a tight bound on its approximation ratio.Experiments with an implementation of this approximation algorithmon biological benchmarks,using various accuracy estimators from the literature,show it finds sets for advisors that are surprisingly close to optimal.Furthermore, the resulting parameter advisors are significantlymore accurate in practice than simply aligningwith a single default parameter choice.
 Kececioglu, J., & Deblasio, D. (2013). Accuracy estimation and parameter advising for protein multiple sequence alignment. Journal of Computational Biology, 20(4), 259279.More infoPMID: 23489379;PMCID: PMC3619150;Abstract: We develop a novel and general approach to estimating the accuracy of multiple sequence alignments without knowledge of a reference alignment, and use our approach to address a new task that we call parameter advising: the problem of choosing values for alignment scoring function parameters from a given set of choices to maximize the accuracy of a computed alignment. For protein alignments, we consider twelve independent features that contribute to a quality alignment. An accuracy estimator is learned that is a polynomial function of these features; its coefficients are determined by minimizing its error with respect to true accuracy using mathematical optimization. Compared to prior approaches for estimating accuracy, our new approach (a) introduces novel feature functions that measure nonlocal properties of an alignment yet are fast to evaluate, (b) considers more general classes of estimators beyond linear combinations of features, and (c) develops new regression formulations for learning an estimator from examples; in addition, for parameter advising, we (d) determine the optimal parameter set of a given cardinality, which specifies the best parameter values from which to choose. Our estimator, which we call Facet (for "featurebased accuracy estimator"), yields a parameter advisor that on the hardest benchmarks provides more than a 27% improvement in accuracy over the best default parameter choice, and for parameter advising significantly outperforms the best prior approaches to assessing alignment quality. © Copyright 2013, Mary Ann Liebert, Inc. 2013.
 Kececioglu, J., Kececioglu, J. D., & Kim, E. (2008). Learning scoring schemes for sequence alignment from partial examples. IEEE/ACM transactions on computational biology and bioinformatics / IEEE, ACM, 5(4).More infoWhen aligning biological sequences, the choice of parameter values for the alignment scoring function is critical. Small changes in gap penalties, for example, can yield radically different alignments. A rigorous way to compute parameter values that are appropriate for aligning biological sequences is through inverse parametric sequence alignment. Given a collection of examples of biologically correct alignments, this is the problem of finding parameter values that make the scores of the example alignments close to those of optimal alignments for their sequences. We extend prior work on inverse parametric alignment to partial examples, which contain regions where the alignment is left unspecified, and to an improved formulation based on minimizing the average error between the score of an example and the score of an optimal alignment. Experiments on benchmark biological alignments show we can find parameters that generalize across protein families and that boost the accuracy of multiple sequence alignment by as much as 25 percent.
 Deblasio, D. F., Wheeler, T. J., & Kececioglu, J. D. (2012). Estimating the accuracy of multiple alignments and its use in parameter advising. Proceedings of RECOMB 2012 (SpringerVerlag Lecture Notes in Bioinformatics), 7262 LNBI, 4559.More infoAbstract: We develop a novel and general approach to estimating the accuracy of protein multiple sequence alignments without knowledge of a reference alignment, and use our approach to address a new problem that we call parameter advising. For protein alignments, we consider twelve independent features that contribute to a quality alignment. An accuracy estimator is learned that is a polynomial function of these features; its coefficients are determined by minimizing its error with respect to true accuracy using mathematical optimization. We evaluate this approach by applying it to the task of parameter advising: the problem of choosing alignment scoring parameters from a collection of parameter values to maximize the accuracy of a computed alignment. Our estimator, which we call Facet (for "featurebased accuracy estimator"), yields a parameter advisor that on the hardest benchmarks provides more than a 20% improvement in accuracy over the best default parameter choice, and outperforms the best prior approaches to selecting good alignments for parameter advising. © 2012 SpringerVerlag Berlin Heidelberg.
 Huynh, L., Kececioglu, J., Köppe, M., & Tagkopoulos, I. (2012). Automatic design of synthetic gene circuits through mixed integer nonlinear programming. PLoS ONE, 7(4).More infoPMID: 22536398;PMCID: PMC3334995;Abstract: Automatic design of synthetic gene circuits poses a significant challenge to synthetic biology, primarily due to the complexity of biological systems, and the lack of rigorous optimization methods that can cope with the combinatorial explosion as the number of biological parts increases. Current optimization methods for synthetic gene design rely on heuristic algorithms that are usually not deterministic, deliver suboptimal solutions, and provide no guaranties on convergence or error bounds. Here, we introduce an optimization framework for the problem of part selection in synthetic gene circuits that is based on mixed integer nonlinear programming (MINLP), which is a deterministic method that finds the globally optimal solution and guarantees convergence in finite time. Given a synthetic gene circuit, a library of characterized parts, and userdefined constraints, our method can find the optimal selection of parts that satisfy the constraints and best approximates the objective function given by the user. We evaluated the proposed method in the design of three synthetic circuits (a toggle switch, a transcriptional cascade, and a band detector), with both experimentally constructed and synthetic promoter libraries. Scalability and robustness analysis shows that the proposed framework scales well with the library size and the solution space. The work described here is a step towards a unifying, realistic framework for the automated design of biological circuits. © 2012 Huynh et al.
 Kececioglu, J., Kim, E., & Wheeler, T. (2010). Aligning protein sequences with predicted secondary structure. Journal of Computational Biology, 17(3), 561580.More infoPMID: 20377464;Abstract: Accurately aligning distant protein sequences is notoriously difficult. Since the amino acid sequence alone often does not provide enough information to obtain accurate alignments under the standard alignment scoring functions, a recent approach to improving alignment accuracy is to use additional information such as secondary structure. We make several advances in alignment of protein sequences annotated with predicted secondary structure: (1) more accurate models for scoring alignments, (2) efficient algorithms for optimal alignment under these models, and (3) improved learning criteria for setting model parameters through inverse alignment, as well as (4) indepth experiments evaluating model variants on benchmark alignments. More specifically, the new models use secondary structure predictions and their confidences to modify the scoring of both substitutions and gaps. All models have efficient algorithms for optimal pairwise alignment that run in nearquadratic time. These models have many parameters, which are rigorously learned using inverse alignment under a new criterion that carefully balances score error and recovery error. We then evaluate these models by studying how accurately an optimal alignment under the model recovers benchmark reference alignments that are based on the known threedimensional structures of the proteins. The experiments show that these new models provide a significant boost in accuracy over the standard model for distant sequences. The improvement for pairwise alignment is as much as 15% for sequences with less than 25% identity, while for multiple alignment the improvement is more than 20% for difficult benchmarks whose accuracy under standard tools is at most 40%. © Copyright 2010, Mary Ann Liebert, Inc.
 Kim, E., Wheeler, T., & Kececioglu, J. (2009). Learning models for aligning protein sequences with predicted secondary structure. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 5541 LNBI, 512531.More infoAbstract: Accurately aligning distant protein sequences is notoriously difficult. A recent approach to improving alignment accuracy is to use additional information such as predicted secondary structure. We introduce several new models for scoring alignments of protein sequences with predicted secondary structure, which use the predictions and their confidences to modify both the substitution and gap cost functions. We present efficient algorithms for computing optimal pairwise alignments under these models, all of which run in nearquadratic time. We also review an approach to learning the values of the parameters in these models called inverse alignment. We then evaluate the accuracy of these models by studying how well an optimal alignment under the model recovers known benchmark reference alignments. Our experiments show that using parameters learned by inverse alignment, these new secondarystructurebased models provide a significant improvement in alignment accuracy for distant sequences. The best model improves upon the accuracy of the standard sequence alignment model for pairwise alignment by as much as 15% for sequences with less than 25% identity, and improves the accuracy of multiple alignment by 20% for difficult benchmarks whose average accuracy under standard tools is less than 40%. © SpringerVerlag Berlin Heidelberg 2009.
 Kim, E., & Kececioglu, J. (2008). Learning scoring schemes for sequence alignment from partial examples. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 5(4), 546556.More infoPMID: 18989042;Abstract: When aligning biological sequences, the choice of scoring scheme is critical. Even small changes in gap penalties, for example, can yield radically different alignments. A rigorous way to learn parameter values that are appropriate for biological sequences is through inverse parametric sequence alignment. Given a collection of examples of biologically correct reference alignments, this is the problem of finding parameter values that make the scores of the reference alignments be as close as possible to those of optimal alignments of their sequences. We extend prior work on inverse parametric alignment to partial examples, which contain regions where the reference alignment is not specified, and to an improved formulation based on minimizing the average error between the scores of the reference alignments and the scores of optimal alignments. Experiments on benchmark biological alignments show we can learn scoring schemes that generalize across protein families, and that boost the accuracy of multiple sequence alignment by as much as 25 percent. © 2008 IEEE.
 Kececioglu, J., Wheeler, T. J., & Kececioglu, J. D. (2007). Multiple alignment by aligning alignments. Bioinformatics (Oxford, England), 23(13).More infoMultiple sequence alignment is a fundamental task in bioinformatics. Current tools typically form an initial alignment by merging subalignments, and then polish this alignment by repeated splitting and merging of subalignments to obtain an improved final alignment. In general this formandpolish strategy consists of several stages, and a profusion of methods have been tried at every stage. We carefully investigate: (1) how to utilize a new algorithm for aligning alignments that optimally solves the common subproblem of merging subalignments, and (2) what is the best choice of method for each stage to obtain the highest quality alignment.
 Kim, E., & Kececioglu, J. (2007). Inverse sequence alignment from partial examples. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 4645 LNBI, 359370.More infoAbstract: When aligning biological sequences, the choice of parameter values for the alignment scoring function is critical. Small changes in gap penalties, for example, can yield radically different alignments. A rigorous way to compute parameter values that are appropriate for biological sequences is inverse parametric sequence alignment. Given a collection of examples of biologically correct alignments, this is the problem of finding parameter values that make the example alignments score close to optimal. We extend prior work on inverse alignment to partial examples and to an improved model based on minimizing the average error of the examples. Experiments on benchmark biological alignments show we can find parameters that generalize across protein families and that boost the recovery rate for multiple sequence alignment by up to 25%. © SpringerVerlag Berlin Heidelberg 2007.
 Wheeler, T. J., & Kececioglu, J. D. (2007). Multiple alignment by aligning alignments. Bioinformatics, 23(13), i559i568.More infoPMID: 17646343;Abstract: Motivation: Multiple sequence alignment is a fundamental task in bioinformatics. Current tools typically form an initial alignment by merging subalignments, and then polish this alignment by repeated splitting and merging of subalignments to obtain an improved final alignment. In general this formandpolish strategy consists of several stages, and a profusion of methods have been tried at every stage. We carefully investigate: (1) how to utilize a new algorithm for aligning alignments that optimally solves the common subproblem of merging subalignments, and (2) what is the best choice of method for each stage to obtain the highest quality alignment. Results: We study six stages in the formandpolish strategy for multiple alignment: parameter choice, distance estimation, mergetree construction, sequencepair weighting, alignment merging, and polishing. For each stage, we consider novel approaches as well as standard ones. Interestingly, the greatest gains in alignment quality come from (i) estimating distances by a new approach using normalized alignment costs, and (ii) polishing by a new approach using 3cuts. Experiments with a parametervalue oracle suggest large gains in quality may be possible through an inputdependent choice of alignment parameters, and we present a promising approach for building such an oracle. Combining the best approaches to each stage yields a new tool we call Opal that on benchmark alignments matches the quality of the top tools, without employing alignment consistency or hydrophobic gap penalties. © 2007 The Author(s).
 Kececioglu, J., & Kim, E. (2006). Simple and fast inverse alignment. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 3909 LNBI, 441455.More infoAbstract: For as long as biologists have been computing alignments of sequences, the question of what values to use for scoring substitutions and gaps has persisted. While some choices for substitution scores are now common, largely due to convention, there is no standard for choosing gap penalties. An objective way to resolve this question is to learn the appropriate values by solving the Inverse String Alignment Problem: given examples of correct alignments, find parameter values that make the examples be optimalscoring alignments of their strings. We present a new polynomialtime algorithm for Inverse String Alignment that is simple to implement, fast in practice, and for the first time can learn hundreds of parameters simultaneously. The approach is also flexible: minor modifications allow us to solve inverse unique alignment (find parameter values that make the examples be the unique optimal alignments of their strings), and inverse nearoptimal alignment (find parameter values that make the example alignments be as close to optimal as possible). Computational results with an implementation for global alignment show that, for the first time, we can find bestpossible values for all 212 parameters of the standard proteinsequence scoringmodel from hundreds of alignments in a few minutes of computation. © SpringerVerlag Berlin Heidelberg 2006.
 Kececioglu, J., & Starrett, D. (2004). Aligning alignments exactly. Proceedings of the Annual International Conference on Computational Molecular Biology, RECOMB, 8, 8596.More infoAbstract: A basic computational problem that arises in both the construction and localsearch phases of the best heuristics for multiple sequence alignment is that of aligning the columns of two multiple alignments. When the scoring function is the sumofpairs objective and induced pairwise alignments are evaluated using linear gapcosts, we call this problem Aligning Alignments. While seemingly a straightforward extension of twosequence alignment, we prove it is actually NPcomplete. As explained in the paper, this provides the first demonstration that minimizing linear gapcosts, in the context of multiple sequence alignment, is inherently hard. We also develop an exact algorithm for Aligning Alignments that is remarkably efficient in practice, both in time and space. Even though the problem is NPcomplete, computational experiments on both biological and simulated data show we can compute optimal alignments for all benchmark instances in two standard datasets, and solve verylarge random instances with highlygapped sequences.
 Kececioglu, J., & Yu, J. (2001). Separating repeats in DNA sequence assembly. Proceedings of the Annual International Conference on Computational Molecular Biology, RECOMB, 176183.More infoAbstract: One of the key open problems in largescale DNA sequence assembly is the correct reconstruction of sequences that contain repeats. A long repeat can confound a sequence assembler into falsely overlaying fragments that sample its copies, effectively compressing out the repeat in the reconstructed sequence. We call the task of correcting this compression by separating the overlaid fragments into the distinct copies they sample, the repeat separation problem. We present a rigorous formulation of repeat separation in the general setting without prior knowledge of consensus sequences of repeats or their number of copies. Our formulation decomposes the task into a series of four subproblems, and we design probabilistic tests or combinatorial algorithms that solve each subproblem. The core subproblem separates repeats using the socalled kmedian problem in combinatorial optimization, which we solve using integer linearprogramming. Experiments with an implementation show we can separate fragments that are over laid at 10 times the coverage with very few mistakes in a few seconds of computation, even when the sequencing error rate and the error rate between copies are identical. To our knowledge this is the first rigorous and fully general approach to separating repeats that directly addresses the problem.
 Kececioglu, J. D., Lenhof, H., Mehlhorn, K., Mutzel, P., Reinert, K., & Vingron, M. (2000). A polyhedral approach to sequence alignment problems. Discrete Applied Mathematics, 104(13), 143186.More infoAbstract: We study two new problems in sequence alignment both from a practical and a theoretical view, using tools from combinatorial optimization to develop branchandcut algorithms. The generalized maximum trace formulation captures several forms of multiple sequence alignment problems in a common framework, among them the original formulation of maximum trace. The RNA sequence alignment problem captures the comparison of RNA molecules on the basis of their primary sequence and their secondary structure. Both problems have a characterization in terms of graphs which we reformulate in terms of integer linear programming. We then study the polytopes (or convex hulls of all feasible solutions) associated with the integer linear program for both problems. For each polytope we derive several classes of facetdefining inequalities and show that for some of these classes the corresponding separation problem can be solved in polynomial time. This leads to a polynomialtime algorithm for pairwise sequence alignment that is not based on dynamic programming. Moreover, for multiple sequences the branchandcut algorithms for both sequence alignment problems are able to solve to optimality instances that are beyond the range of present dynamic programming approaches. © 2000 Elsevier Science B.V.
 Kececioglu, J., Shete, S., & Arnold, J. (2000). Reconstructing distances in physical maps of chromosomes with nonoverlapping probes. Proceedings of the Annual International Conference on Computational Molecular Biology, RECOMB, 183192.More infoAbstract: We present a new method for reconstructing the distances between probes in physical maps of chromosomes constructed by hybridizing pairs of clones under the socalled samplingwithoutreplacement protocol. In this protocol, which is simple, inexpensive, and has been used to successfully map several organisms, equallength clones are hybridized against a clonesubset called the probes. The probes are chosen by a sequential process that is designed to generate a pairwisenonoverlapping subset of the clones. We derive a likelihood function on probe spacings and orders for this protocol under a natural model of hybridization error, and describe how to reconstruct the most likely spacing for a given order under this objective using continuous optimization. The approach is tested on simulated data and real data from chromosome VI of Aspergillus nidulans. On simulated data we recover the true order and close to the true spacing; on the real data, for which the true order and spacing is unknown, we recover a probe order differing significantly from the published one. To our knowledge this is the first practical approach for computing a globallyoptimal maximumlikelihood reconstruction of interprobe distances from cloneprobe hybridization data.
 Christof, T., & Kececioglu, J. (1999). Computing physical maps of chromosomes with nonoverlapping probes by branchandcut. Proceedings of the Annual International Conference on Computational Molecular Biology, RECOMB, 115123.More infoAbstract: We introduce a new combinatorial formulation of chromosome physicalmapping by the samplingwithoutreplacement protocol. In this protocol, which is simple, inexpensive, and has been used to successfully map several organisms, equallength clones are hybridized against a subset of the clones called probes, which are designed to form a maximal nonoverlapping clonesubset. The output of the protocol is the cloneprobe hybridization matrix H. The problem of finding a maximumlikelihood reconstruction of the order of the probes along the chromosome in the presence of false positive and negative hybridization error is equivalent to finding the minimum number of entries of H to change to zeros so that the resulting matrix has at most 2 ones per row, and the consecutiveones property across rows. This combinatorial problem, which we call 2ConsecutiveOnes Mapping, has a concise integer linearprogramming formulation, to which we apply techniques from polyhedral combinatorics. The formulation is unique in that it does not explicitly represent the probe permutation, and in contrast to prior linearprogramming approaches, the number of variables is small: in practice, linear in the number of clones. We derive a large class of facetdefining inequalities for the 2consecutiveones polytope that we call the augmented kdegree inequalities, and we show that the basic kdegree class can be efficiently separated using bipartite matchings. Computational results with an implementation of the resulting branchandcut algorithm applied to both simulated and real data from the complete genome of Aspergillus nidulans show that we can solve many problems to provable optimality and find maps of higher quality than previously possible.
 Kececioglu, J., & Gusfield, D. (1998). Reconstructing a history of recombinations from a set of sequences. Discrete Applied Mathematics, 88(13), 239260.More infoAbstract: One of the classic problems in computational biology is the reconstruction of evolutionary history. A recent trend in the area is to increase the explanatory power of the models that are considered by incorporating higherorder evolutionary events that more accurately reflect the mechanisms of mutation at the level of the chromosome. We take a step in this direction by considering the problem of reconstructing an evolutionary history for a set of genetic sequences that have evolved by recombination. Recombination is a nontreelike event that produces a child sequence by crossing two parent sequences. We present polynomialtime algorithms for reconstructing a parsimonious history of such events for several models of recombination when all sequences, including those of ancestors, are present in the input. We also show that these models appear to be near the limit of what can be solved in polynomial time, in that several natural generalizations are NPcomplete. © 1998 Elsevier Science B.V. All rights reserved.
 Ravi, R., & Kececioglu, J. D. (1998). Approximation algorithms for multiple sequence alignment under a fixed evolutionary tree. Discrete Applied Mathematics, 88(13), 355366.More infoAbstract: We consider the problem of multiple sequence alignment under a fixed evolutionary tree: given a tree whose leaves are labeled by sequences, find ancestral sequences to label its internal nodes so as to minimize the total length of the tree, where the length of an edge is the edit distance between the sequences labeling its endpoints. We present a new polynomialtime approximation algorithm for this problem, and analyze its performance on regular dary trees with d a constant. On such a tree, the algorithm finds a solution within a factor (d + 1)/(d  1) of the minimum in Q(kdT(d, n)+k2dn2) time, where k is the number of leaves in the tree, n is the length of the longest sequence labeling a leaf, and T(d, n) is the time to compute a Steiner point for d sequences of length at most n. (A Steiner point for a set script capital L sign of sequences is a sequence P that minimizes the sum of the edit distances from P to each sequence in script capital L sign. The time T(d, n) is O(d2dnd), given O(dsd+l)time preprocessing for an alphabet of size s.) The approximation algorithm is conceptually simple and easy to implement, and actually applies to any metric space in which a Steiner point for any fixedsized set can be computed in polynomial time. We also introduce a new problem, bottleneck treealignment, in which the objective is to label the internal nodes of the tree so as to minimize the length of the longest edge. We describe an exponentialtime exact algorithm for the case of unitcost edit operations, and show there is a simple lineartime approximation algorithm for the general case that finds a solution within a factor O(log k) of the minimum. 1998 Published by Elsevier Science B.V. All rights reserved.
 Christof, T., Juenger, M., Kececioglu, J., Mutzel, P., & Reinelt, G. (1997). Branchandcut approach to physical mapping with endprobes. Proceedings of thr Annual International Conference on Computational Molecular Biology, RECOMB, 8492.More infoAbstract: A fundamental problem in computational biology is the construction of physical maps of chromosomes from hybridization experiments between unique probes and clones of chromosome fragments in the presence of error. Alizadeh, Karp, Weisser and Zweig [AKWZ94] first considered a maximumlikelihood model of the problem that is equivalent to finding an ordering of the probes that minimizes a weighted sum of errors, and developed several effective heuristics. We show that by exploiting information about the endprobes of clones, this model can be formulated as a weighted Betweenness Problem. This affords the significant advantage of allowing the welldeveloped tools of integer linearprogramming and branchandcut algorithms to be brought to bear on physical mapping, enabling us for the first time to solve small mapping instances to optimality even in the presence of high error. We also show that by combining the optimal solution of many small overlapping Betweenness Problems, one can effectively screen errors from larger instances, and solve the edited instance to optimality as a HammingDistance Traveling Salesman Problem. This suggests a new combined approach to physical map construction.
 Christof, T., Jünger, M., Kececioglu, J., Mutzel, P., & Reinelt, G. (1997). A branchandcut approach to physical mapping of chromosomes by unique endprobes. Journal of Computational Biology, 4(4), 433447.More infoPMID: 9385538;Abstract: A fundamental problem in computational biology is the construction of physical maps of chromosomes from hybridization experiments between unique probes and clones of chromosome fragments in the presence of error. Alizadeh, Karp, Weisser and Zweig (Algorithmica 13:1/2, 5276, 1995) first considered a maximumlikelihood model of the problem that is equivalent to finding an ordering of the probes that minimizes a weighted sum of errors and developed several effective heuristics. We show that by exploiting information about the endprobes of clones, this model can be formulated as a Weighted Betweenness Problem. This affords the significant advantage of allowing the welldeveloped tools of integer linearprogramming and branchandcut algorithms to be brought to bear on physical mapping, enabling us for the first time to solve small mapping instances to optimality even in the presence of high error. We also show that by combining the optimal solution of many small overlapping Betweenness Problems, one can effectively screen errors from larger instances and solve the edited instance to optimality as a HammingDistance Traveling Salesman Problem. This suggests a new approach, a BetweennessTraveling Salesman hybrid, for constructing physical maps.
 Kececioglu, J., Ming, L. i., & Tromp, J. (1997). Inferring a DNA sequence from erroneous copies. Theoretical Computer Science, 185(1), 313.More infoAbstract: We suggest a novel approach for efficiently reconstructing an original DNA sequence from erroneous copies.
 Reinert, K., Lenhof, H. ., Mutzel, P., Mehlhorn, K., & Kececioglu, J. D. (1997). Branchandcut algorithm for multiple sequence alignment. Proceedings of thr Annual International Conference on Computational Molecular Biology, RECOMB, 241250.More infoAbstract: Multiple sequence alignment is an important problem in computational biology. We study the Maximum Trace formulation introduced by Kececioglu [Kec91]. We first phrase the problem in terms of forbidden subgraphs, which enables us to express Maximum Trace as an integer linearprogramming problem, and then solve the integer linear program using methods from polyhedral combinatorics. The trace polytope is the convex hull of all feasible solutions to the Maximum Trace problem; for the case of two sequences, we give a complete characterization of this polytope. This yields a polynomialtime algorithm for a general version of pairwise sequence alignment that, perhaps suprisingly, does not use dynamic programming; this yields, for instance, a nondynamicprogramming algorithm for sequence comparison under the 01 metric, which gives another answer to a longopen question in the area of string algorithms [PW93]. For the multiplesequence case, we derive several classes of facetdefining inequalities and show that for all but one class, the corresponding separation problem can be solved in polynomial time. This leads to a branchandcut algorithm for multiple sequence alignment, and we report on our first computational experience. It appears that a polyhedral approach to multiple sequence alignment can solve instances that are beyond present dynamicprogramming approaches.
 Taylor, E. W., Bhat, A., Nadimpalli, R. G., Zhang, W., & Kececioglu, J. (1997). HIV1 encodes a sequence overlapping env gp41 with highly significant similarity to seleniumdependent glutathione peroxidases.. Journal of acquired immune deficiency syndromes and human retrovirology : official publication of the International Retrovirology Association, 15(5), 393394.More infoPMID: 9342263;
 Gupta, S. K., Kececioglu, J. D., & Schäffer, A. (1995). Improving the practical space and time efficiency of the shortestpaths approach to sumofpairs multiple sequence alignment.. Journal of computational biology : a journal of computational molecular cell biology, 2(3), 459472.More infoPMID: 8521275;Abstract: The MSA program, written and distributed in 1989, is one of the few existing programs that attempts to find optimal alignments of multiple protein or DNA sequences. The MSA program implements a branchandbound technique together with a variant of Dijkstra's shortest paths algorithm to prune the basic dynamic programming graph. We have made substantial improvements in the time and space usage of MSA. The improvements make feasible a variety of problem instances that were not feasible previously. On some runs we achieve an order of magnitude reduction in space usage and a significant multiplicative factor speedup in running time. To explain how these improvements work, we give a much more detailed description of MSA than has been previously available. In practice, MSA rarely produces a provably optimal alignment and we explain why.
 Kececioglu, J. D., & Myers, E. W. (1995). Combinatorial algorithms for DNA sequence assembly. Algorithmica, 13(12), 751.More infoAbstract: The trend toward very large DNA sequencing projects, such as those being undertaken as part of the Human Genome Program, necessitates the development of efficient and precise algorithms for assembling a long DNA sequence from the fragments obtained by shotgun sequencing or other methods. The sequence reconstruction problem that we take as our formulation of DNA sequence assembly is a variation of the shortest common superstring problem, complicated by the presence of sequencing errors and reverse complements of fragments. Since the simpler superstring problem is NPhard, any efficient reconstruction procedure must resort to heuristics. In this paper, however, a fourphase approach based on rigorous design criteria is presented, and has been found to be very accurate in practice. Our method is robust in the sense that it can accommodate high sequencing error rates, and list a series of alternate solutions in the event that several appear equally good. Moreover, it uses a limited form of multiple sequence alignment to detect, and often correct, errors in the data. Our combined algorithm has successfully reconstructed nonrepetitive sequences of length 50,000 sampled at error rates of as high as 10%. © 1995 SpringerVerlag New York Inc.
 Kececioglu, J., & Sankoff, D. (1995). Exact and approximation algorithms for sorting by reversals, with application to genome rearrangement. Algorithmica, 13(12), 180210.More infoAbstract: Motivated by the problem in computational biology of reconstructing the series of chromosome inversions by which one organism evolved from another, we consider the problem of computing the shortest series of reversals that transform one permutation to another. The permutations describe the order of genes on corresponding chromosomes, and a reversal takes an arbitrary substring of elements, and reverses their order. For this problem, we develop two algorithms: a greedy approximation algorithm, that finds a solution provably close to optimal in O(n 2) time and 0(n) space for nelement permutations, and a branch andbound exact algorithm, that finds an optimal solution in 0(mL(n, n)) time and 0(n 2) space, where m is the size of the branch andbound search tree, and L(n, n) is the time to solve a linear program of n variables and n constraints. The greedy algorithm is the first to come within a constant factor of the optimum; it guarantees a solution that uses no more than twice the minimum number of reversals. The lower and upper bounds of the branch andbound algorithm are a novel application of maximumweight matchings, shortest paths, and linear programming. In a series of experiments, we study the performance of an implementation on random permutations, and permutations generated by random reversals. For permutations differing by k random reversals, we find that the average upper bound on reversal distance estimates k to within one reversal for k
 Kececioglu, J., & Gusfield, D. (1994). Reconstructing a history of recombinations from a set of sequences. Proceedings of the Annual ACM SIAM Symposium on Discrete Algorithms, 471480.More infoAbstract: One of the classic problems in computational biology is the reconstruction of evolutionary histories. A recent trend is toward increasing the explanatory power of the models by incorporating higherorder evolutionary events that more accurately reflect the full range of mutation at the molecular level. In this paper, we take a step in this direction by considering the problem of reconstructing an evolutionary history for a set of genetic sequences that have evolved by recombination. Recombination produces a new sequence by crossing two parent sequences, and is among the most important mechanisms of higherorder molecular mutation.
 Lipman, D. J., Altschul, S. F., & Kececioglu, J. D. (1989). A tool for multiple sequence alignment. Proceedings of the National Academy of Sciences of the United States of America, 86(12), 44124415.More infoPMID: 2734293;PMCID: PMC287279;Abstract: Multiple sequence alignment can be a useful technique for studying molecular evolution and analyzing sequencestructure relationships. Until recently, it has been impractical to apply dynamic programming, the most widely accepted method for producing pairwise alignments, to comparisons of more than three sequences. We describe the design and application of a tool for multiple alignment of amino acid sequences that implements a new algorithm that greatly reduces the computational demands of dynamic programming. This tool is able to align in reasonable time as many as eight sequences the length of an average protein.
Proceedings Publications
 DeBlasio, D. F., & Kececioglu, J. D. (2016, August). Predicting core columns of protein multiple sequence alignments for improved parameter advising. In Proceedings of the 16th Workshop on Algorithms in Bioinformatics (WABI 2016), LNCS 9838, 7789.
 Kececioglu, J. D., & Wilcox, A. (2016, October). Proceedings of the 7th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics. In 7th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics (ACMBCB 2016).More infoAs CoChair of the Program Committee for the conference ACMBCB 2016, I was an editor of the conference Proceedings, along with CoChair Adam Wilcox
 DeBlasio, D. F., & Kececioglu, J. D. (2015, September). Ensemble multiple sequence alignment via advising. In Proceedings of the 6th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics (ACM BCB 2015).