Travis Wheeler

  • Associate Professor
  • Member of the Graduate Faculty
  • Associate Professor, Applied Mathematics - GIDP
  • Associate Professor, Genetics - GIDP
  • Associate Professor, BIO5 Institute
Contact
  • twheeler@arizona.edu

Biography

Travis Wheeler is an Associate Professor in the University of Arizona College of Pharmacy Practice. He earned his bachelor's degree in Evolutionary Biology from the University of Arizona, then his PhD in Computer Science from UArizona, with a research emphasis on computational genomics. He spent 5 years as a postdoc and research scientist in the research group of Sean Eddy at HHMI Janelia Research Campus, then joined the Computer Science faculty at the University of Montana in 2014, where he remained until his move back to Arizona in 2022. Dr. Wheeler leads a group (http://wheelerlab.org/people) with a research focus that can be broadly described as “algorithms and machine learning approaches for computational biology”, primarily emphasizing applications to genomics, drug discovery, and animal behavior classification.

Degrees

  • Ph.D. Computer Science
    • University of Arizona, Tucson, Arizona, United States
    • Efficient construction of accurate multiple alignments and large-scale phylogenies
  • B.A. Ecology and Evolutionary Biology
    • University of Arizona, Tucson, Arizona, United States

Work Experience

  • Department of Pharmacy Practice & Science, University of Arizona (2022 - Ongoing)
  • Department of Computer Science, University of Montana (2019 - 2022)
  • Department of Computer Science, University of Montana (2014 - 2019)
  • HHMI Janelia Research Campus (2011 - 2014)
  • HHMI Janelia Research Campus (Sean Eddy) (2009 - 2011)
  • University of Arizona, Tucson, Arizona (2000 - 2003)
  • Intuit, Inc (1995 - 2000)

Interests

Teaching

  • Computation (introductory through advanced algorithms)
  • Bioinformatics
  • Machine Learning
  • Probabilistic Modeling
  • Drug Discovery

Research

Computational biology:
  • Algorithms, machine learning, software engineering
  • Genomics, proteomics, drug discovery, animal tracking/behavior

Courses

2025-26 Courses

  • Dissertation
    INFO 920 (Spring 2026)
  • Drug Hunting for Beginners
    PCOL 488 (Spring 2026)
  • Lab Research Rotation
    GENE 792 (Spring 2026)
  • Research
    CSC 900 (Spring 2026)
  • Directed Research
    PCOL 792 (Fall 2025)
  • Research
    CSC 900 (Fall 2025)

2024-25 Courses

  • Directed Research
    PCOL 792 (Spring 2025)
  • Research
    CSC 900 (Spring 2025)
  • Thesis
    CSC 910 (Spring 2025)
  • Directed Research
    PHSC 792A (Fall 2024)
  • Honors Thesis
    ISTA 498H (Fall 2024)
  • Research
    CSC 900 (Fall 2024)

2023-24 Courses

  • Directed Research
    INFO 492 (Summer I 2024)
  • Honors Thesis
    BIOC 498H (Spring 2024)
  • Pharmacy Administration
    PHSC 596E (Spring 2024)
  • Research
    CSC 900 (Spring 2024)
  • Research in Ecology and Evolution
    ECOL 610A (Spring 2024)
  • Thesis
    CSC 910 (Spring 2024)
  • Directed Research
    ACBS 492 (Fall 2023)
  • Honors Thesis
    BIOC 498H (Fall 2023)

2022-23 Courses

  • Honors Directed Research
    BIOC 392H (Spring 2023)
  • Independent Study
    PHSC 599 (Spring 2023)
  • Honors Directed Research
    BIOC 392H (Fall 2022)
  • Honors Thesis
    MCB 498H (Fall 2022)
  • Research
    PHSC 900 (Fall 2022)

Scholarly Contributions

Journals/Publications

  • Amaro, R. E., Åqvist, J., Bahar, I., Battistini, F., Bellaiche, A., Beltran, D., Biggin, P. C., Bonomi, M., Bowman, G. R., Bryce, R. A., Bussi, G., Carloni, P., Case, D. A., Cavalli, A., Chang, C. E., Cheatham, T. E., Cheung, M. S., Chipot, C., Chong, L. T., Choudhary, P., et al. (2025). The need to implement FAIR principles in biomolecular simulations. Nature Methods, 22(4). doi:10.1038/s41592-025-02635-0
    In the Big Data era, a change of paradigm in the use of molecular dynamics is required. Trajectories should be stored under FAIR (findable, accessible, interoperable and reusable) requirements to favor its reuse by the community under an open science paradigm.
  • Khajouei, E., Ghisays, V., Piras, I. S., Martinez, K. L., Vicenti, A. T., Naymik, M., Ngo, P., Tran, T. C., Denny, J. C., Wheeler, T. J., Huentelman, M. J., Reiman, E. M., & Karnes, J. H. (2025). Phenome-wide association of APOE alleles in the All of Us Research Program. eBioMedicine, 117. doi:10.1016/j.ebiom.2025.105768
    Background: Apolipoprotein E (APOE) variation is associated with altered lipid metabolism, as well as cardiovascular and neurodegenerative disease. Prior studies are largely limited to European ancestry populations and differential risk by sex and ancestry has not been widely evaluated. Methods: We utilised a phenome-wide association study (PheWAS) to explore APOE-associated phenotypes in the All of Us Research Program. We determined APOE alleles for 181,880 participants, representing seven ancestry groups. We tested association of APOE variants, ordered based on Alzheimer's disease risk hierarchy (ε2/ε2 < ε2/ε3 < ε3/ε3 < ε2/ε4 < ε3/ε4 < ε4/ε4), with 2318 phenotypes. Bonferroni-adjusted analyses were performed overall, by ancestry, by sex, and with adjustment for social determinants of health (SDOH). Findings: In the overall cohort, PheWAS identified 17 significant associations, including increased odds of hyperlipidaemia (OR 1.15 [1.14–1.16] per APOE genotype group; P = 1.8 × 10^−129), dementia, and Alzheimer's disease (OR 1.55 [1.40–1.70]; P = 5 × 10^−19), and reduced odds of fatty liver disease and chronic liver disease. ORs were similar after SDOH adjustment and by sex, except for an increased number of cardiovascular associations in males, and decreased odds of noninflammatory disorders of vulva and perineum in females. Significant heterogeneity was observed for hyperlipidaemia and mild cognitive impairment across ancestry. Unique associations by ancestry included transient retinal arterial occlusion in the European ancestry group, and first-degree atrioventricular block in the American Admixed/Latino ancestry group. Interpretation: We replicate extensive phenotypic associations with APOE alleles in a large, diverse cohort. We provide a comprehensive catalogue of APOE-associated phenotypes and evidence of unique phenotypic associations by sex and ancestry, as well as heterogeneity in effect size across ancestry. Funding: Funding is listed in the Acknowledgements section.
  • Olson, D., Colligan, T., Demekas, D., Roddy, J. W., Youens-Clark, K., & Wheeler, T. J. (2025). NEAR: Neural embeddings for amino acid relationships. Bioinformatics, 41. doi:10.1093/bioinformatics/btaf198
    Summary: Protein language models (PLMs) have recently demonstrated potential to supplant classical protein database search methods based on sequence alignment, but are slower than common alignment-based tools and appear to be prone to a high rate of false labeling. Here, we present Neural Embeddings for Amino acid Relationships (NEAR), a method based on neural representation learning that is designed to improve both speed and accuracy of search for likely homologs in a large protein sequence database. NEAR's ResNet embedding model is trained using contrastive learning guided by trusted sequence alignments. It computes per-residue embeddings for target and query protein sequences, and identifies alignment candidates with a pipeline consisting of residue-level k-NN search and a simple neighbor aggregation scheme. Tests on a benchmark consisting of trusted remote homologs and randomly shuffled decoy sequences reveal that NEAR substantially improves accuracy relative to state-of-the-art PLMs, with lower memory requirements and faster embedding and search speed. While these results suggest that the NEAR model may be useful for standalone homology detection with increased sensitivity over standard alignment-based methods, in this manuscript, we focus on a more straightforward analysis of the model's value as a high-speed pre-filter for sensitive annotation. In that context, NEAR is at least 5x faster than the pre-filter currently used in the widely used profile hidden Markov model (pHMM) search tool HMMER3, and also outperforms the pre-filter used in our fast pHMM tool, nail. Availability and implementation: NEAR is under an open-source license. Code and data curation instructions can be found at https://github.com/TravisWheelerLab/NEAR.
  • Roy, A., Ward, E., Choi, I., Cosi, M., Edgin, T., Hughes, T. S., Islam, M. S., Khan, A. M., Kolekar, A., Rayl, M., et al. (2025). MDRepo – an open data warehouse for community-contributed molecular dynamics simulations of proteins. Nucleic Acids Research, 53(D1), D477–D486.
  • Anderson, T., & Wheeler, T. (2024). An FPGA-based hardware accelerator supporting sensitive sequence homology filtering with profile hidden Markov models. BMC Bioinformatics, 25(1). doi:10.1186/s12859-024-05879-3
    Background: Sequence alignment lies at the heart of genome sequence annotation. While the BLAST suite of alignment tools has long held an important role in alignment-based sequence database search, greater sensitivity is achieved through the use of profile hidden Markov models (pHMMs). Here, we describe an FPGA hardware accelerator, called HAVAC, that targets a key bottleneck step (SSV) in the analysis pipeline of the popular pHMM alignment tool, HMMER. Results: The HAVAC kernel calculates the SSV matrix at 1739 GCUPS on a ∼ $3000 Xilinx Alveo U50 FPGA accelerator card, ∼ 227× faster than the optimized SSV implementation in nhmmer. Accounting for PCI-e data transfer data processing, HAVAC is 65× faster than nhmmer’s SSV with one thread and 35× faster than nhmmer with four threads, and uses ∼ 31% the energy of a traditional high end Intel CPU. Conclusions: HAVAC demonstrates the potential offered by FPGA hardware accelerators to produce dramatic speed gains in sequence annotation and related bioinformatics applications. Because these computations are performed on a co-processor, the host CPU remains free to simultaneously compute other aspects of the analysis pipeline.
  • Clements, A. N., Casillas, A. L., Flores, C. E., Liou, H., Toth, R. K., Chauhan, S. S., Sutterby, K., Deshmukh, S. K., Wu, S., Xiu, J., et al. (2024). Inhibition of PIM kinase in tumor associated macrophages suppresses inflammasome 1 activation and sensitizes prostate cancer to immunotherapy. bioRxiv, 2024–10.
  • Copeland, C. J., Roddy, J. W., Schmidt, A. K., Secor, P. R., & Wheeler, T. J. (2024). VIBES: a workflow for annotating and visualizing viral sequences integrated into bacterial genomes [EDITOR'S CHOICE]. NAR Genomics and Bioinformatics, 6(2).
  • Geller-McGrath, D., Konwar, K. M., Edgcomb, V. P., Pachiadaki, M., Roddy, J. W., Wheeler, T. J., & McDermott, J. E. (2024). Predicting metabolic modules in incomplete bacterial genomes with MetaPathPredict. eLife, 13.
  • Geller-McGrath, D., Konwar, K., Edgcomb, V., Pachiadaki, M., Roddy, J., Wheeler, T., & McDermott, J. (2024). Predicting metabolic modules in incomplete bacterial genomes with MetaPathPredict. eLife, 13. doi:10.7554/elife.85749
    The reconstruction of complete microbial metabolic pathways using 'omics data from environmental samples remains challenging. Computational pipelines for pathway reconstruction that utilize machine learning methods to predict the presence or absence of KEGG modules in incomplete genomes are lacking. Here, we present MetaPathPredict, a software tool that incorporates machine learning models to predict the presence of complete KEGG modules within bacterial genomic datasets. Using gene annotation data and information from the KEGG module database, MetaPathPredict employs deep learning models to predict the presence of KEGG modules in a genome. MetaPathPredict can be used as a command line tool or as a Python module, and both options are designed to be run locally or on a compute cluster. Benchmarks show that MetaPathPredict makes robust predictions of KEGG module presence within highly incomplete genomes.
  • Glidden-Handgis, G., & Wheeler, T. J. (2024). WAS IT A MATch I SAW? Approximate palindromes lead to overstated false match rates in benchmarks using reversed sequences. Bioinformatics Advances, 4(1). doi:10.1093/bioadv/vbae052
    Background: Software for labeling biological sequences typically produces a theory-based statistic for each match (the E-value) that indicates the likelihood of seeing that match's score by chance. E-values accurately predict false match rate for comparisons of random (shuffled) sequences, and thus provide a reasoned mechanism for setting score thresholds that enable high sensitivity with low expected false match rate. This threshold-setting strategy is challenged by real biological sequences, which contain regions of local repetition and low sequence complexity that cause excess matches between non-homologous sequences. Knowing this, tool developers often develop benchmarks that use realistic-seeming decoy sequences to explore empirical tradeoffs between sensitivity and false match rate. A recent trend has been to employ reversed biological sequences as realistic decoys, because these preserve the distribution of letters and the existence of local repeats, while disrupting the original sequence's functional properties. However, we and others have observed that sequences appear to produce high scoring alignments to their reversals with surprising frequency, leading to overstatement of false match risk that may negatively affect downstream analysis. Results: We demonstrate that an alignment between a sequence S and its (possibly mutated) reversal tends to produce higher scores than alignment between truly unrelated sequences, even when S is a shuffled string with no notable repetitive or low-complexity regions. This phenomenon is due to the unintuitive fact that (even randomly shuffled) sequences contain palindromes that are on average longer than the longest common substrings (LCS) shared between permuted variants of the same sequence. Though the expected palindrome length is only slightly larger than the expected LCS, the distribution of alignment scores involving reversed sequences is strongly right-shifted, leading to greatly increased frequency of high-scoring alignments to reversed sequences. Impact: Overestimates of false match risk can motivate unnecessarily high score thresholds, leading to potentially reduced true match sensitivity. Also, when tool sensitivity is only reported up to the score of the first matched decoy sequence, a large decoy set consisting of reversed sequences can obscure sensitivity differences between tools. As a result of these observations, we advise that reversed biological sequences be used as decoys only when care is taken to remove positive matches in the original (un-reversed) sequences, or when overstatement of false labeling is not a concern. Though the primary focus of the analysis is on sequence annotation, we also demonstrate that the prevalence of internal palindromes may lead to an overstatement of the rate of false labels in protein identification with mass spectrometry.
  • Groza, C., Chen, X., Wheeler, T. J., Bourque, G., & Goubert, C. (2024). A unified framework to analyze transposable element insertion polymorphisms using graph genomes. Nature Communications, 15(1).
  • Groza, C., Chen, X., Wheeler, T., Bourque, G., & Goubert, C. (2024). A unified framework to analyze transposable element insertion polymorphisms using graph genomes. Nature Communications, 15(1). doi:10.1038/s41467-024-53294-2
    Transposable elements are ubiquitous mobile DNA sequences generating insertion polymorphisms, contributing to genomic diversity. We present GraffiTE, a flexible pipeline to analyze polymorphic mobile elements insertions. By integrating state-of-the-art structural variant detection algorithms and graph genomes, GraffiTE identifies polymorphic mobile elements from genomic assemblies or long-read sequencing data, and genotypes these variants using short or long read sets. Benchmarking on simulated and real datasets reports high precision and recall rates. GraffiTE is designed to allow non-expert users to perform comprehensive analyses, including in models with limited transposable element knowledge and is compatible with various sequencing technologies. Here, we demonstrate the versatility of GraffiTE by analyzing human, Drosophila melanogaster, maize, and Cannabis sativa pangenome data. These analyses reveal the landscapes of polymorphic mobile elements and their frequency variations across individuals, strains, and cultivars.
  • Khajouei, E., Ghisays, V., Piras, I. S., Martinez, K. L., Naymik, M., Ngo, P., Tran, T. C., Denny, J. C., Wheeler, T. J., Huentelman, M. J., et al. (2024). Phenome-Wide Association of APOE Alleles in the All of Us Research Program. medRxiv.
  • Krause, G. R., Shands, W., & Wheeler, T. J. (2024). Sensitive and error-tolerant annotation of protein-coding DNA with BATH. Bioinformatics Advances.
  • Marbut, A. C., Chandler, J. W., & Wheeler, T. J. (2024). Exploring the Impact of a Transformer's Latent Space Geometry on Downstream Task Performance. arXiv.
  • Nord, A. J., & Wheeler, T. J. (2024). Diviner uncovers hundreds of novel human (and other) exons though comparative analysis of proteins. bioRxiv.
  • Olson, D. R., & Wheeler, T. J. (2024). ULTRA-Effective labeling of tandem repeats in genomic sequence. Bioinformatics Advances.
  • Olson, D. R., Demekas, D., Colligan, T., & Wheeler, T. (2024). NEAR: Neural Embeddings for Amino acid Relationships. bioRxiv.
  • Olson, D., & Wheeler, T. (2024). ULTRA-effective labeling of tandem repeats in genomic sequence. Bioinformatics Advances, 4(1). doi:10.1093/bioadv/vbae149
    In the age of long read sequencing, genomics researchers now have access to accurate repetitive DNA sequence (including satellites) that, due to the limitations of short read-sequencing, could previously be observed only as unmappable fragments. Tools that annotate repetitive sequence are now more important than ever, so that we can better understand newly uncovered repetitive sequences, and also so that we can mitigate errors in bioinformatic software caused by those repetitive sequences. To that end, we introduce the 1.0 release of our tool for identifying and annotating locally repetitive sequence, ULTRA Locates Tandemly Repetitive Areas (ULTRA). ULTRA is fast enough to use as part of an efficient annotation pipeline, produces state-of-the-art reliable coverage of repetitive regions containing many mutations, and provides interpretable statistics and labels for repetitive regions.
  • Roddy, J. W., Rich, D. H., & Wheeler, T. J. (2024). nail: software for high-speed, high-sensitivity protein sequence annotation. bioRxiv.
  • Schimunek, J., Seidl, P., Elez, K., Hempel, T., Le, T., Olsson, S., Raich, L., Winter, R., Gokcan, H., Gusev, F., Gutkin, E., Isayev, O., Kurnikova, M., Narangoda, C., Zubatyuk, R., Bosko, I., Furs, K., Karpenko, A., Kornoushenko, Y., Shuldau, M., et al. (2024). A community effort in SARS-CoV-2 drug discovery. Molecular Informatics, 43(1). doi:10.1002/minf.202300262
    The COVID-19 pandemic continues to pose a substantial threat to human lives and is likely to do so for years to come. Despite the availability of vaccines, searching for efficient small-molecule drugs that are widely available, including in low- and middle-income countries, is an ongoing challenge. In this work, we report the results of an open science community effort, the “Billion molecules against COVID-19 challenge”, to identify small-molecule inhibitors against SARS-CoV-2 or relevant human receptors. Participating teams used a wide variety of computational methods to screen a minimum of 1 billion virtual molecules against 6 protein targets. Overall, 31 teams participated, and they suggested a total of 639,024 molecules, which were subsequently ranked to find ‘consensus compounds’. The organizing team coordinated with various contract research organizations (CROs) and collaborating institutions to synthesize and test 878 compounds for biological activity against proteases (Nsp5, Nsp3, TMPRSS2), nucleocapsid N, RdRP (only the Nsp12 domain), and (alpha) spike protein S. Overall, 27 compounds with weak inhibition/binding were experimentally identified by binding-, cleavage-, and/or viral suppression assays and are presented here. Open science approaches such as the one presented here contribute to the knowledge base of future drug discovery efforts in finding better SARS-CoV-2 treatments.
  • Venkatraman, V., Gaiser, J., Demekas, D., Roy, A., Xiong, R., & Wheeler, T. (2024). Do Molecular Fingerprints Identify Diverse Active Drugs in Large-Scale Virtual Screening? (No). Pharmaceuticals, 17(8). doi:10.3390/ph17080992
    Computational approaches for small-molecule drug discovery now regularly scale to the consideration of libraries containing billions of candidate small molecules. One promising approach to increase the speed of evaluating billion-molecule libraries is to develop succinct representations of each molecule that enable the rapid identification of molecules with similar properties. Molecular fingerprints are thought to provide a mechanism for producing such representations. Here, we explore the utility of commonly used fingerprints in the context of predicting similar molecular activity. We show that fingerprint similarity provides little discriminative power between active and inactive molecules for a target protein based on a known active; while they may sometimes provide some enrichment for active molecules in a drug screen, a screened data set will still be dominated by inactive molecules. We also demonstrate that high-similarity actives appear to share a scaffold with the query active, meaning that they could more easily be identified by structural enumeration. Furthermore, even when limited to only active molecules, fingerprint similarity values do not correlate with compound potency. In sum, these results highlight the need for a new wave of molecular representations that will improve the capacity to detect biologically active molecules based on their similarity to other such molecules.
  • Venkatraman, V., Gaiser, J., Demekas, D., Roy, A., Xiong, R., & Wheeler, T. J. (2024). Do Molecular Fingerprints Identify Diverse Active Drugs in Large-Scale Virtual Screening? (No). Pharmaceuticals, 17(8), 992.
  • Brodie, J., Henao-Diaz, L., Pratama, B., Copeland, C., Wheeler, T., & Helmy, O. (2023). Fruit Size in Indo-Malayan Island Plants Is More Strongly Influenced by Filtering than by In Situ Evolution. American Naturalist, 201(4). doi:10.1086/723212
    Community trait assembly, the formation of distributions of phenotypic characteristics across coexisting species, can occur via two main processes: filtering of trait distributions from the regional pool and in situ phenotypic evolution in local communities. But the relative importance of these processes remains unclear, largely because of the difficulty in determining the timing of evolutionary trait changes and biogeographic dispersal events in phylogenies. We assessed evolutionary and biogeographic transitions in woody plant species across the Indo-Malay archipelago, a series of island groups where the same plant lineages interact with different seed disperser and seed predator assemblages. Fruit size in 2,650 taxa spanning the angiosperm tree of life tended to be smaller in the Sulawesi and Maluku island groups, where frugivores are less diverse and smaller bodied, than in the regional source pool. While numerous plant lineages (not just small-fruited ones) reached the isolated islands, colonists tended to be the smaller-fruited members of each clade. Nearly all of the evolutionary transitions to smaller fruit size predated, often substantially, organismal dispersal to the islands. Our results suggest that filtering rather than within-island evolution largely determined the distribution of fruit sizes in these regions.
  • Colligan, T., Irish, K., Emlen, D. J., & Wheeler, T. J. (2023). DISCO: A deep learning ensemble for uncertainty-aware segmentation of acoustic signals. PLOS ONE, 18(7), 1-20.
  • Nord, A. J., & Wheeler, T. J. (2023). Mirage2's high-quality spliced protein-to-genome mappings produce accurate multiple-sequence alignments of isoforms. PLOS ONE, 18(5), e0285225.
  • Schimunek, J., Seidl, P., Elez, K., Hempel, T., Le, T., Noé, F., Olsson, S., Raich, L., Winter, R., Gokcan, H., Gusev, F., Gutkin, E. M., Isayev, O., Kurnikova, M. G., Narangoda, C. H., Zubatyuk, R., Bosko, I. P., Furs, K. V., Karpenko, A. D., Kornoushenko, Y. V., et al. (2023). A community effort in SARS-CoV-2 drug discovery. Molecular Informatics.
  • Storer, J. M., Walker, J. A., Baker, J. N., Hossain, S., Roos, C., Wheeler, T. J., & Batzer, M. A. (2023). Framework of the Alu Subfamily Evolution in the platyrrhine Three-Family Clade of Cebidae, Callithrichidae, and Aotidae. Genes, 14(2), 249.
  • Altemose, N., Logsdon, G., Bzikadze, A., Sidhwani, P., Langley, S., Caldas, G., Hoyt, S., Uralsky, L., Ryabov, F., Shew, C., Sauria, M., Borchers, M., Gershman, A., Mikheenko, A., Shepelev, V., Dvorkina, T., Kunyavskaya, O., Vollger, M., Rhie, A., McCartney, A., et al. (2022). Complete genomic and epigenetic maps of human centromeres. Science, 376(6588). doi:10.1126/science.abl4178
    Existing human genome assemblies have almost entirely excluded repetitive sequences within and near centromeres, limiting our understanding of their organization, evolution, and functions, which include facilitating proper chromosome segregation. Now, a complete, telomere-to-telomere human genome assembly (T2T-CHM13) has enabled us to comprehensively characterize pericentromeric and centromeric repeats, which constitute 6.2% of the genome (189.9 megabases). Detailed maps of these regions revealed multimegabase structural rearrangements, including in active centromeric repeat arrays. Analysis of centromere-associated sequences uncovered a strong relationship between the position of the centromere and the evolution of the surrounding DNA through layered repeat expansions. Furthermore, comparisons of chromosome X centromeres across a diverse panel of individuals illuminated high degrees of structural, epigenetic, and sequence variation in these complex and rapidly evolving regions.
  • Brodie, J. F., Henao-Diaz, L. F., Pratama, B., Copeland, C., Wheeler, T., & Helmy, O. E. (2022). Fruit size in Indo-Malayan island plants is more strongly influenced by filtering than by in situ evolution. The American Naturalist.
  • Hubley, R., Wheeler, T. J., & Smit, A. F. (2022). Accuracy of multiple sequence alignment methods in the reconstruction of transposable element families. NAR Genomics and Bioinformatics, 4(2), lqac040.
  • Nord, A. J., & Wheeler, T. J. (2022). Mirage2's high-quality spliced protein-to-genome mappings produce accurate multiple-sequence alignments of isoforms. bioRxiv.
  • Roddy, J. W., Lesica, G. T., & Wheeler, T. J. (2022). SODA: a TypeScript/JavaScript library for visualizing biological sequence annotation. NAR Genomics and Bioinformatics, 4(4).
  • Venkatraman, V., Colligan, T., Lesica, G., Olson, D., Gaiser, J., Copeland, C., Wheeler, T., & Roy, A. (2022). Drugsniffer: An Open Source Workflow for Virtually Screening Billions of Molecules for Binding Affinity to Protein Targets. Frontiers in Pharmacology, 13. doi:10.3389/fphar.2022.874746
    The SARS-CoV2 pandemic has highlighted the importance of efficient and effective methods for identification of therapeutic drugs, and in particular has laid bare the need for methods that allow exploration of the full diversity of synthesizable small molecules. While classical high-throughput screening methods may consider up to millions of molecules, virtual screening methods hold the promise of enabling appraisal of billions of candidate molecules, thus expanding the search space while concurrently reducing costs and speeding discovery. Here, we describe a new screening pipeline, called drugsniffer, that is capable of rapidly exploring drug candidates from a library of billions of molecules, and is designed to support distributed computation on cluster and cloud resources. As an example of performance, our pipeline required ∼40,000 total compute hours to screen for potential drugs targeting three SARS-CoV2 proteins among a library of ∼3.7 billion candidate molecules.
  • Anderson, T., & Wheeler, T. (2021). An optimized FM-index library for nucleotide and amino acid search. Algorithms for Molecular Biology, 16(1). doi:10.1186/s13015-021-00204-6
    Background: Pattern matching is a key step in a variety of biological sequence analysis pipelines. The FM-index is a compressed data structure for pattern matching, with search run time that is independent of the length of the database text. Implementation of the FM-index is reasonably complicated, so that increased adoption will be aided by the availability of a fast and flexible FM-index library. Results: We present AvxWindowedFMindex (AWFM-index), a lightweight, open-source, thread-parallel FM-index library written in C that is optimized for indexing nucleotide and amino acid sequences. AWFM-index introduces a new approach to storing FM-index data in a strided bit-vector format that enables extremely efficient computation of the FM-index occurrence function via AVX2 bitwise instructions, and combines this with optional on-disk storage of the index’s suffix array and a cache-efficient lookup table for partial k-mer searches. The AWFM-index performs exact match count and locate queries faster than SeqAn3’s FM-index implementation across a range of comparable memory footprints. When optimized for speed, AWFM-index is ∼ 2–4x faster than SeqAn3 for nucleotide search, and ∼ 2–6x faster for amino acid search; it is also ∼ 4x faster with similar memory footprint when storing the suffix array in on-disk SSD storage. Conclusions: AWFM-index is easy to incorporate into bioinformatics software, offers run-time performance parameterization, and provides clients with FM-index functionality at both a high-level (count or locate all instances of a query string) and low-level (step-wise control of the FM-index backward-search process). The open-source library is available for download at https://github.com/TravisWheelerLab/AvxWindowFmIndex.
  • Carey, K., Patterson, G., & Wheeler, T. (2021). Transposable element subfamily annotation has a reproducibility problem. Mobile DNA, 12(1). doi:10.1186/s13100-021-00232-4
    Background: Transposable element (TE) sequences are classified into families based on the reconstructed history of replication, and into subfamilies based on more fine-grained features that are often intended to capture family history. We evaluate the reliability of annotation with common subfamilies by assessing the extent to which subfamily annotation is reproducible in replicate copies created by segmental duplications in the human genome, and in homologous copies shared by human and chimpanzee. Results: We find that standard methods annotate over 10% of replicates as belonging to different subfamilies, despite the fact that they are expected to be annotated as belonging to the same subfamily. Point mutations and homologous recombination appear to be responsible for some of this discordant annotation (particularly in the young Alu family), but are unlikely to fully explain the annotation unreliability. Conclusions: The surprisingly high level of disagreement in subfamily annotation of homologous sequences highlights a need for further research into definition of TE subfamilies, methods for representing subfamily annotation confidence of TE instances, and approaches to better utilizing such nuanced annotation data in downstream analysis.
  • Elliott, T., Heitkam, T., Hubley, R., Quesneville, H., Suh, A., Wheeler, T., Anselem, J., Berrens, R., Gonzalez, J., Goubert, C., Lesica, G., Rosen, J., Smit, A., Storer, J., & Schaack, S. (2021). TE Hub: A community-oriented space for sharing and connecting tools, data, resources, and methods for transposable element annotation. Mobile DNA, 12(1). doi:10.1186/s13100-021-00244-0
    Transposable elements (TEs) play powerful and varied evolutionary and functional roles, and are widespread in most eukaryotic genomes. Research into their unique biology has driven the creation of a large collection of databases, software, classification systems, and annotation guidelines. The diversity of available TE-related methods and resources raises compatibility concerns and can be overwhelming to researchers and communicators seeking straightforward guidance or materials. To address these challenges, we have initiated a new resource, TE Hub, that provides a space where members of the TE community can collaborate to document and create resources and methods. The space consists of (1) a website organized with an open wiki framework, https://tehub.org, (2) a conversation framework via a Twitter account and a Slack channel, and (3) bi-monthly Hub Update video chats on the platform’s development. In addition to serving as a centralized repository and communication platform, TE Hub lays the foundation for improved integration, standardization, and effectiveness of diverse tools and protocols. We invite the TE community, both novices and experts in TE identification and analysis, to join us in expanding our community-oriented resource.
  • Storer, J., Hubley, R., Rosen, J., Wheeler, T., & Smit, A. (2021). The Dfam community resource of transposable element families, sequence models, and genome annotations. Mobile DNA, 12(1). doi:10.1186/s13100-020-00230-y
    Dfam is an open access database of repetitive DNA families, sequence models, and genome annotations. The 3.0–3.3 releases of Dfam (https://dfam.org) represent an evolution from a proof-of-principle collection of transposable element families in model organisms into a community resource for a broad range of species, and for both curated and uncurated datasets. In addition, releases since Dfam 3.0 provide auxiliary consensus sequence models, transposable element protein alignments, and a formalized classification system to support the growing diversity of organisms represented in the resource. The latest release includes 266,740 new de novo generated transposable element families from 336 species contributed by the EBI. This expansion demonstrates the utility of many of Dfam’s new features and provides insight into the long term challenges ahead for improving de novo generated transposable element datasets.
  • Hornbeck, P., Kornhauser, J., Latham, V., Murray, B., Nandhikonda, V., Nord, A., Skrzypek, E., Wheeler, T., Zhang, B., & Gnad, F. (2019). 15 years of PhosphoSitePlus®: Integrating post-translationally modified sites, disease variants and isoforms. Nucleic Acids Research, 47(1). doi:10.1093/nar/gky1159
    For 15 years the mission of PhosphoSitePlus® (PSP, https://www.phosphosite.org) has been to provide comprehensive information and tools for the study of mammalian post-translational modifications (PTMs). The number of unique PTMs in PSP is now more than 450 000 from over 22 000 articles and thousands of MS datasets. The most important areas of growth in PSP are in disease and isoform informatics. Germline mutations associated with inherited diseases and somatic cancer mutations have been added to the database and can now be viewed along with PTMs and associated quantitative information on novel 'lollipop' plots. These plots enable researchers to interactively visualize the overlap between disease variants and PTMs, and to identify mutations that may alter phenotypes by rewiring signaling networks. We are expanding the sequence space to include over 30 000 human and mouse isoforms to enable researchers to explore the important but understudied biology of isoforms. This represents a necessary expansion of sequence space to accommodate the growing precision and depth of coverage enabled by ongoing advances in mass spectrometry. Isoforms are aligned using a new algorithm. Exploring the worlds of PTMs and disease mutations in the entire isoform space will hopefully lead to new biomarkers, therapeutic targets, and insights into isoform biology.
  • Grimes, M., Hall, B., Foltz, L., Levy, T., Rikova, K., Gaiser, J., Cook, W., Smirnova, E., Wheeler, T., Clark, N., Lachmann, A., Zhang, B., Hornbeck, P., Comb, M., & Ma’ayan, A. (2018). Integration of protein phosphorylation, acetylation, and methylation data sets to outline lung cancer signaling networks. Science Signaling, 11(531). doi:10.1126/scisignal.aaq1087
    Protein posttranslational modifications (PTMs) have typically been studied independently, yet many proteins are modified by more than one PTM type, and cell signaling pathways somehow integrate this information. We coupled immunoprecipitation using PTM-specific antibodies with tandem mass tag (TMT) mass spectrometry to simultaneously examine phosphorylation, methylation, and acetylation in 45 lung cancer cell lines compared to normal lung tissue and to cell lines treated with anticancer drugs. This simultaneous, large-scale, integrative analysis of these PTMs using a cluster-filtered network (CFN) approach revealed that cell signaling pathways were outlined by clustering patterns in PTMs. We used the t-distributed stochastic neighbor embedding (t-SNE) method to identify PTM clusters and then integrated each with known protein-protein interactions (PPIs) to elucidate functional cell signaling pathways. The CFN identified known and previously unknown cell signaling pathways in lung cancer cells that were not present in normal lung epithelial tissue. In various proteins modified by more than one type of PTM, the incidence of those PTMs exhibited inverse relationships, suggesting that molecular exclusive “OR” gates determine a large number of signal transduction events. We also showed that the acetyltransferase EP300 appears to be a hub in the network of pathways involving different PTMs. In addition, the data shed light on the mechanism of action of geldanamycin, an HSP90 inhibitor. Together, the findings reveal that cell signaling pathways mediated by acetylation, methylation, and phosphorylation regulate the cytoskeleton, membrane traffic, and RNA binding protein–mediated control of gene expression.
  • Hubley, R., Finn, R., Clements, J., Eddy, S., Jones, T., Bao, W., Smit, A., & Wheeler, T. (2016). The Dfam database of repetitive DNA families. Nucleic Acids Research, 44(1). doi:10.1093/nar/gkv1272
    Repetitive DNA, especially that due to transposable elements (TEs), makes up a large fraction of many genomes. Dfam is an open access database of families of repetitive DNA elements, in which each family is represented by a multiple sequence alignment and a profile hidden Markov model (HMM). The initial release of Dfam, featured in the 2013 NAR Database Issue, contained 1143 families of repetitive elements found in humans, and was used to produce more than 100 Mb of additional annotation of TE-derived regions in the human genome, with improved speed. Here, we describe recent advances, most notably expansion to 4150 total families including a comprehensive set of known repeat families from four new organisms (mouse, zebrafish, fly and nematode). We describe improvements to coverage, and to our methods for identifying and reducing false annotation. We also describe updates to the website interface. The Dfam website has moved to http://dfam.org. Seed alignments, profile HMMs, hit lists and other underlying data are available for download.
  • Finn, R., Clements, J., Arndt, W., Miller, B., Wheeler, T., Schreiber, F., Bateman, A., & Eddy, S. (2015). HMMER web server: 2015 Update. Nucleic Acids Research, 43(1). doi:10.1093/nar/gkv397
    The HMMER website, available at http://www.ebi.ac.uk/Tools/hmmer/, provides access to the protein homology search algorithms found in the HMMER software suite. Since the first release of the website in 2011, the search repertoire has been expanded to include the iterative search algorithm, jackhmmer. The continued growth of the target sequence databases means that traditional tabular representations of significant sequence hits can be overwhelming to the user. Consequently, additional ways of presenting homology search results have been developed, allowing them to be summarised according to taxonomic distribution or domain architecture. The taxonomy and domain architecture representations can be used in combination to filter the results according to the needs of a user. Searches can also be restricted prior to submission using a new taxonomic filter, which not only ensures that the results are specific to the requested taxonomic group, but also improves search performance. The repertoire of profile hidden Markov model libraries, which are used for annotation of query sequences with protein families and domains, has been expanded to include the libraries from CATH-Gene3D, PIRSF, Superfamily and TIGRFAMs. Finally, we discuss the relocation of the HMMER webserver to the European Bioinformatics Institute and the potential impact that this will have.
  • Hoen, D. R., Hickey, G., Bourque, G., Casacuberta, J., Cordaux, R., Feschotte, C., Fiston-Lavier, A. S., Hua-Van, A., Hubley, R., Kapusta, A., Lerat, E., Maumus, F., Pollock, D. D., Quesneville, H., Smit, A., Wheeler, T. J., Bureau, T. E., & Blanchette, M. (2015). A call for benchmarking transposable element annotation methods. Mobile DNA, 6(1). doi:10.1186/s13100-015-0044-6
    DNA derived from transposable elements (TEs) constitutes large parts of the genomes of complex eukaryotes, with major impacts not only on genomic research but also on how organisms evolve and function. Although a variety of methods and tools have been developed to detect and annotate TEs, there are as yet no standard benchmarks - that is, no standard way to measure or compare their accuracy. This lack of accuracy assessment calls into question conclusions from a wide range of research that depends explicitly or implicitly on TE annotation. In the absence of standard benchmarks, toolmakers are impeded in improving their tools, annotators cannot properly assess which tools might best suit their needs, and downstream researchers cannot judge how accuracy limitations might impact their studies. We therefore propose that the TE research community create and adopt standard TE annotation benchmarks, and we call for other researchers to join the authors in making this long-overdue effort a success.
  • Wheeler, T., Clements, J., & Finn, R. (2014). Skylign: A tool for creating informative, interactive logos representing sequence alignments and profile hidden Markov models. BMC Bioinformatics, 15(1). doi:10.1186/1471-2105-15-7
    Background: Logos are commonly used in molecular biology to provide a compact graphical representation of the conservation pattern of a set of sequences. They render the information contained in sequence alignments or profile hidden Markov models by drawing a stack of letters for each position, where the height of the stack corresponds to the conservation at that position, and the height of each letter within a stack depends on the frequency of that letter at that position. Results: We present a new tool and web server, called Skylign, which provides a unified framework for creating logos for both sequence alignments and profile hidden Markov models. In addition to static image files, Skylign creates a novel interactive logo plot for inclusion in web pages. These interactive logos enable scrolling, zooming, and inspection of underlying values. Skylign can avoid sampling bias in sequence alignments by down-weighting redundant sequences and by combining observed counts with informed priors. It also simplifies the representation of gap parameters, and can optionally scale letter heights based on alternate calculations of the conservation of a position. Conclusion: Skylign is available as a website, a scriptable web service with a RESTful interface, and as a software package for download. Skylign's interactive logos are easily incorporated into a web page with just a few lines of HTML markup. Skylign may be found at http://skylign.org. © 2014 Wheeler et al.; licensee BioMed Central Ltd.
  • Wheeler, T., & Eddy, S. (2013). Nhmmer: DNA homology search with profile HMMs. Bioinformatics, 29(19). doi:10.1093/bioinformatics/btt403
    Summary: Sequence database searches are an essential part of molecular biology, providing information about the function and evolutionary history of proteins, RNA molecules and DNA sequence elements. We present a tool for DNA/DNA sequence comparison that is built on the HMMER framework, which applies probabilistic inference methods based on hidden Markov models to the problem of homology search. This tool, called nhmmer, enables improved detection of remote DNA homologs, and has been used in combination with Dfam and RepeatMasker to improve annotation of transposable elements in the human genome. © The Author 2013.
  • Wheeler, T., Clements, J., Eddy, S., Hubley, R., Jones, T., Jurka, J., Smit, A., & Finn, R. (2013). Dfam: A database of repetitive DNA based on profile hidden Markov models. Nucleic Acids Research, 41(1). doi:10.1093/nar/gks1265
    We present a database of repetitive DNA elements, called Dfam (http://dfam.janelia.org). Many genomes contain a large fraction of repetitive DNA, much of which is made up of remnants of transposable elements (TEs). Accurate annotation of TEs enables research into their biology and can shed light on the evolutionary processes that shape genomes. Identification and masking of TEs can also greatly simplify many downstream genome annotation and sequence analysis tasks. The commonly used TE annotation tools RepeatMasker and Censor depend on sequence homology search tools such as cross-match and BLAST variants, as well as Repbase, a collection of known TE families each represented by a single consensus sequence. Dfam contains entries corresponding to all Repbase TE entries for which instances have been found in the human genome. Each Dfam entry is represented by a profile hidden Markov model, built from alignments generated using RepeatMasker and Repbase. When used in conjunction with the hidden Markov model search tool nhmmer, Dfam produces a 2.9% increase in coverage over consensus sequence search methods on a large human benchmark, while maintaining low false discovery rates, and coverage of the full human genome is 54.5%. The website provides a collection of tools and data views to support improved TE curation and annotation efforts. Dfam is also available for download in flat file format or in the form of MySQL table dumps. © The Author(s) 2012.
  • Tanifuji, G., Onodera, N., Wheeler, T., Dlutek, M., Donaher, N., & Archibald, J. (2011). Complete nucleomorph genome sequence of the nonphotosynthetic alga Cryptomonas paramecium reveals a core nucleomorph gene set. Genome Biology and Evolution, 3(1). doi:10.1093/gbe/evq082
    Nucleomorphs are the remnant nuclei of algal endosymbionts that were engulfed by nonphotosynthetic host eukaryotes. These peculiar organelles are found in cryptomonad and chlorarachniophyte algae, where they evolved from red and green algal endosymbionts, respectively. Despite their independent origins, cryptomonad and chlorarachniophyte nucleomorph genomes are similar in size and structure: they are both <1 million base pairs in size (the smallest nuclear genomes known), comprise three chromosomes, and possess subtelomeric ribosomal DNA operons. Here, we report the complete sequence of one of the smallest cryptomonad nucleomorph genomes known, that of the secondarily nonphotosynthetic cryptomonad Cryptomonas paramecium. The genome is 486 kbp in size and contains 518 predicted genes, 466 of which are protein coding. Although C. paramecium lacks photosynthetic ability, its nucleomorph genome still encodes 18 plastid-associated proteins. More than 90% of the "conserved" protein genes in C. paramecium (i.e., those with clear homologs in other eukaryotes) are also present in the nucleomorph genomes of the cryptomonads Guillardia theta and Hemiselmis andersenii. In contrast, 143 of 466 predicted C. paramecium proteins (30.7%) showed no obvious similarity to proteins encoded in any other genome, including G. theta and H. andersenii. Significantly, however, many of these "nucleomorph ORFans" are conserved in position and size between the three genomes, suggesting that they are in fact homologous to one another. Finally, our analyses reveal an unexpected degree of overlap in the genes present in the independently evolved chlorarachniophyte and cryptomonad nucleomorph genomes: ∼80% of a set of 120 conserved nucleomorph genes in the chlorarachniophyte Bigelowiella natans were also present in all three cryptomonad nucleomorph genomes. This result suggests that similar reductive processes have taken place in unrelated lineages of nucleomorph-containing algae. © 2011 The Author(s).
  • Good, J., Hayden, C., & Wheeler, T. (2006). Adaptive protein evolution and regulatory divergence in Drosophila. Molecular Biology and Evolution, 23(6). doi:10.1093/molbev/msk002
    Two recent studies demonstrated a positive correlation between divergence in gene expression and protein sequence in Drosophila. This correlation could be driven by positive selection or variation in functional constraint. To distinguish between these alternatives, we compared patterns of molecular evolution for 1,862 genes with two previously reported estimates of expression divergence in Drosophila. We found a slight negative trend (nonsignificant) between positive selection on protein sequence and divergence in expression levels between Drosophila melanogaster and Drosophila simulans. Conversely, shifts in expression patterns during Drosophila development showed a positive association with adaptive protein evolution, though as before the relationship was weak and not significant. Overall, we found no strong evidence for an increase in the incidence of positive selection on protein-coding regions in genes with divergent expression in Drosophila, suggesting that the previously reported positive association between protein and regulatory divergence primarily reflects variation in functional constraint. © The Author 2006. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. All rights reserved.
  • Cutter, A., Good, J., Pappas, C., Saunders, M., Starrett, D., & Wheeler, T. (2005). Transposable element orientation bias in the Drosophila melanogaster genome. Journal of Molecular Evolution, 61(6). doi:10.1007/s00239-004-0243-0
    Nonrandom distributions of transposable elements can be generated by a variety of genomic features. Using the full D. melanogaster genome as a model, we characterize the orientations of different classes of transposable elements in relation to the directionality of genes. DNA-mediated transposable elements are more likely to be in the same orientation as neighboring genes when they occur in the nontranscribed regions that flank genes. However, RNA-mediated transposable elements located in an intron are more often oriented in the direction opposite to that of the host gene. These orientation biases are strongest for genes with highly biased codon usage, probably reflecting the ability of such loci to respond to weak positive or negative selection. The leading hypothesis for selection against transposable elements in the coding orientation proposes that transcription termination poly(A) signal motifs within retroelements interfere with normal gene transcription. However, after accounting for differences in base composition between the strands, we find no evidence for global selection against spurious transcription termination signals in introns. We therefore conclude that premature termination of host gene transcription due to the presence of poly(A) signal motifs in retroelements might only partially explain strand-specific detrimental effects in the D. melanogaster genome. © Springer Science+Business Media, Inc. 2005.

Proceedings Publications

  • Marbut, A., McKinney-Bock, K., & Wheeler, T. (2023). Reliable measures of spread in high dimensional latent spaces. In International Conference on Machine Learning.
  • Nord, A., Hornbeck, P., Carey, K., & Wheeler, T. (2018). Splice-Aware Multiple Sequence Alignment of Protein Isoforms. In 9th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics, ACM-BCB 2018.
    Multiple sequence alignment (MSA) is a classic problem in computational genomics. In typical use, MSA software is expected to align a collection of homologous genes, such as orthologs from multiple species or duplication-induced paralogs within a species. Recent focus on the importance of alternatively-spliced isoforms in disease and cell biology has highlighted the need to create MSAs that more effectively accommodate isoforms. MSAs are traditionally constructed using scoring criteria that prefer alignments with occasional mismatches over alignments with long gaps. Alternatively spliced protein isoforms effectively contain exon-length insertions or deletions (indels) relative to each other, and demand an alternative approach. Some improvements can be achieved by making indel penalties much smaller, but this is merely a patchwork solution. In this work we present Mirage, a novel MSA software package for the alignment of alternatively spliced protein isoforms. Mirage aligns isoforms to each other by first mapping each protein sequence to its encoding genomic sequence, and then aligning isoforms to one another based on the relative genomic coordinates of their constitutive codons. Mirage is highly effective at mapping proteins back to their encoding exons, and these protein-genome mappings lead to extremely accurate intra-species alignments; splice site information in these alignments is used to improve the accuracy of inter-species alignments of isoforms. Mirage alignments have also revealed the ubiquity of dual-coding exons, in which an exon conditionally encodes multiple open reading frames as overlapping spliced segments of frame-shifted genomic sequence.
  • Olson, D., & Wheeler, T. (2018). ULTRA: A Model Based Tool to Detect Tandem Repeats. In 9th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics, ACM-BCB 2018.
    In biological sequences, tandem repeats consist of tens to hundreds of residues of a repeated pattern, such as atgatgatgatgatg ('atg' repeated), often the result of replication slippage. Over time, these repeats decay so that the original sharp pattern of repetition is somewhat obscured, but even degenerate repeats pose a problem for sequence annotation: when two sequences both contain shared patterns of similar repetition, the result can be a false signal of sequence homology. We describe an implementation of a new hidden Markov model for detecting tandem repeats that shows substantially improved sensitivity to labeling decayed repetitive regions, presents low and reliable false annotation rates across a wide range of sequence composition, and produces scores that follow a stable distribution. On typical genomic sequence, the time and memory requirements of the resulting tool (ULTRA) are competitive with the most heavily used tool for repeat masking (TRF). ULTRA is released under an open source license and lays the groundwork for inclusion of the model in sequence alignment tools and annotation pipelines.
  • Deblasio, D., Wheeler, T., & Kececioglu, J. (2012). Estimating the accuracy of multiple alignments and its use in parameter advising. In Annual International Conference on Research in Computational Molecular Biology, 2012.
    We develop a novel and general approach to estimating the accuracy of protein multiple sequence alignments without knowledge of a reference alignment, and use our approach to address a new problem that we call parameter advising. For protein alignments, we consider twelve independent features that contribute to a quality alignment. An accuracy estimator is learned that is a polynomial function of these features; its coefficients are determined by minimizing its error with respect to true accuracy using mathematical optimization. We evaluate this approach by applying it to the task of parameter advising: the problem of choosing alignment scoring parameters from a collection of parameter values to maximize the accuracy of a computed alignment. Our estimator, which we call Facet (for "feature-based accuracy estimator"), yields a parameter advisor that on the hardest benchmarks provides more than a 20% improvement in accuracy over the best default parameter choice, and outperforms the best prior approaches to selecting good alignments for parameter advising. © 2012 Springer-Verlag Berlin Heidelberg.
  • Kim, E., Wheeler, T., & Kececioglu, J. (2009). Learning models for aligning protein sequences with predicted secondary structure. In Annual International Conference on Research in Computational Molecular Biology, 2009.
    Accurately aligning distant protein sequences is notoriously difficult. A recent approach to improving alignment accuracy is to use additional information such as predicted secondary structure. We introduce several new models for scoring alignments of protein sequences with predicted secondary structure, which use the predictions and their confidences to modify both the substitution and gap cost functions. We present efficient algorithms for computing optimal pairwise alignments under these models, all of which run in near-quadratic time. We also review an approach to learning the values of the parameters in these models called inverse alignment. We then evaluate the accuracy of these models by studying how well an optimal alignment under the model recovers known benchmark reference alignments. Our experiments show that using parameters learned by inverse alignment, these new secondary-structure-based models provide a significant improvement in alignment accuracy for distant sequences. The best model improves upon the accuracy of the standard sequence alignment model for pairwise alignment by as much as 15% for sequences with less than 25% identity, and improves the accuracy of multiple alignment by 20% for difficult benchmarks whose average accuracy under standard tools is less than 40%. © Springer-Verlag Berlin Heidelberg 2009.
  • Wheeler, T. (2009). Large-scale neighbor-joining with NINJA. In WABI 2009.
    Neighbor-joining is a well-established hierarchical clustering algorithm for inferring phylogenies. It begins with observed distances between pairs of sequences, and clustering order depends on a metric related to those distances. The canonical algorithm requires O(n^3) time and O(n^2) space for n sequences, which precludes application to very large sequence families, e.g. those containing 100,000 sequences. Datasets of this size are available today, and such phylogenies will play an increasingly important role in comparative biology studies. Recent algorithmic advances have greatly sped up neighbor-joining for inputs of thousands of sequences, but are limited to fewer than 13,000 sequences on a system with 4GB RAM. In this paper, I describe an algorithm that speeds up neighbor-joining by dramatically reducing the number of distance values that are viewed in each iteration of the clustering procedure, while still computing a correct neighbor-joining tree. This algorithm can scale to inputs larger than 100,000 sequences because of external-memory-efficient data structures. A free implementation may be obtained from http://nimbletwist.com/software/ninja © 2009 Springer Berlin Heidelberg.
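
For readers unfamiliar with the canonical algorithm that the NINJA abstract above refers to, the following is a minimal Python sketch of textbook neighbor-joining with its O(n^3) running time. It is not the NINJA algorithm and does not reproduce its filtering or external-memory data structures. The function name `neighbor_joining`, the `frozenset`-keyed distance dictionary, and the internal labels `node0`, `node1`, ... are assumptions made for illustration (leaf names are assumed not to collide with those labels).

```python
from itertools import combinations


def neighbor_joining(names, dist):
    """Textbook O(n^3) neighbor-joining (illustrative sketch, not NINJA).

    names: list of leaf labels.
    dist:  dict mapping frozenset({a, b}) -> distance between a and b.
    Returns a list of (node, neighbor, branch_length) edges of the unrooted tree.
    """
    nodes = list(names)
    d = dict(dist)  # working copy; entries for merged nodes are simply left behind
    edges = []
    next_id = 0
    while len(nodes) > 2:
        n = len(nodes)
        # r[a] is the sum of distances from a to every other active node.
        r = {a: sum(d[frozenset((a, b))] for b in nodes if b != a) for a in nodes}
        # Join the pair that minimizes Q(i, j) = (n - 2) * d(i, j) - r[i] - r[j].
        i, j = min(
            combinations(nodes, 2),
            key=lambda p: (n - 2) * d[frozenset(p)] - r[p[0]] - r[p[1]],
        )
        dij = d[frozenset((i, j))]
        # Branch lengths from i and j to the new internal node u.
        li = 0.5 * dij + (r[i] - r[j]) / (2 * (n - 2))
        lj = dij - li
        u = f"node{next_id}"  # hypothetical label for the new internal node
        next_id += 1
        edges += [(i, u, li), (j, u, lj)]
        # Distances from u to each remaining node.
        for k in nodes:
            if k not in (i, j):
                d[frozenset((u, k))] = 0.5 * (
                    d[frozenset((i, k))] + d[frozenset((j, k))] - dij
                )
        nodes = [k for k in nodes if k not in (i, j)] + [u]
    # Connect the last two nodes directly.
    a, b = nodes
    edges.append((a, b, d[frozenset((a, b))]))
    return edges


if __name__ == "__main__":
    # Additive distances from a made-up four-leaf tree.
    pairs = {("A", "B"): 5, ("A", "C"): 7, ("A", "D"): 8,
             ("B", "C"): 8, ("B", "D"): 9, ("C", "D"): 9}
    demo = {frozenset(k): float(v) for k, v in pairs.items()}
    for edge in neighbor_joining(["A", "B", "C", "D"], demo):
        print(edge)
```

Each iteration scans all O(n^2) pairs, and there are n - 2 iterations, which is where the cubic running time comes from; the abstract describes reducing how many of those distance values need to be examined per iteration.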

Profiles With Related Publications

  • John D Kececioglu

© 2026 The Arizona Board of Regents on behalf of The University of Arizona.