Michael J Sanderson
Contact
- (520) 626-6848
- BIO SCI WEST, Rm. 310
- TUCSON, AZ 85721-0088
- sanderm@arizona.edu
Courses
2021-22 Courses
- Honors Thesis, ECOL 498H (Fall 2021)
- Independent Study, ECOL 499 (Fall 2021)
2020-21 Courses
- Bioinformatics, ECOL 346 (Spring 2021)
- Independent Study, ECOL 599 (Spring 2021)
- Rsrch Ecology+Evolution, ECOL 610A (Spring 2021)
- Dissertation, ECOL 920 (Fall 2020)
- Funct+Evolutnry Genomics, BIOC 553 (Fall 2020)
- Funct+Evolutnry Genomics, ECOL 553 (Fall 2020)
- Funct+Evolutnry Genomics, EIS 553 (Fall 2020)
- Funct+Evolutnry Genomics, MCB 553 (Fall 2020)
2019-20 Courses
- Dissertation, ECOL 920 (Spring 2020)
- Rsrch Ecology+Evolution, ECOL 610A (Spring 2020)
- Dissertation, ECOL 920 (Fall 2019)
2018-19 Courses
- Dissertation, ECOL 920 (Spring 2019)
- Fundament Of Evolution, ECOL 600A (Spring 2019)
- Dissertation, ECOL 920 (Fall 2018)
- Funct+Evolutnry Genomics, BIOC 553 (Fall 2018)
- Funct+Evolutnry Genomics, ECOL 553 (Fall 2018)
- Funct+Evolutnry Genomics, MCB 553 (Fall 2018)
2017-18 Courses
- Dissertation, ECOL 920 (Spring 2018)
- Fundament Of Evolution, ECOL 600A (Spring 2018)
- Dissertation, ECOL 920 (Fall 2017)
- Funct+Evolutnry Genomics, BIOC 553 (Fall 2017)
- Funct+Evolutnry Genomics, ECOL 553 (Fall 2017)
- Funct+Evolutnry Genomics, MCB 553 (Fall 2017)
2016-17 Courses
- Dissertation, ECOL 920 (Spring 2017)
- Dissertation, ECOL 920 (Fall 2016)
2015-16 Courses
- Dissertation, ECOL 920 (Spring 2016)
- Phylogenetic Biology, ECOL 465 (Spring 2016)
- Phylogenetic Biology, ECOL 565 (Spring 2016)
Scholarly Contributions
Journals/Publications
- Badyaev, A. V., Morrison, E. S., Belloni, V., & Sanderson, M. J. (2015). Tradeoff between robustness and elaboration in carotenoid networks produces cycles of avian color diversification. Biology Direct, 10.
- McMahon, M. M., Deepak, A., Fernandez-Baca, D., Boss, D., & Sanderson, M. J. (2015). STBase: One Million Species Trees for Comparative Biology. PLOS ONE, 10(2).
- Sanderson, M. J. (2015). Recombinatorics [book review]. Quarterly Review of Biology, 90, 344-345.
- Sanderson, M. J., Copetti, D., Burquez, A., Bustamante, E., Charboneau, J. L., Eguiarte, L. E., Kumar, S., Lee, H. O., Lee, J., McMahon, M., Steele, K., Wing, R., Yang, T., Zwickl, D., & Wojciechowski, M. F. (2015). Exceptional reduction of the plastid genome of saguaro cactus (Carnegiea gigantea): Loss of the ndh gene suite and inverted repeat. American Journal of Botany, 102(7), 1115-1127.
- Sanderson, M. J., McMahon, M. M., Stamatakis, A., Zwickl, D. J., & Steel, M. (2015). Impacts of Terraces on Phylogenetic Inference. Systematic Biology, 64(5), 709-726.
- Deepak, A., Fernandez-Baca, D., Tirthapura, S., Sanderson, M. J., & McMahon, M. M. (2014). EvoMiner: frequent subtree mining in phylogenetic databases. Knowledge and Information Systems, 41(3), 559-590. Abstract: The problem of mining collections of trees to identify common patterns, called frequent subtrees (FSTs), arises often when trying to interpret the results of phylogenetic analysis. FST mining generalizes the well-known maximum agreement subtree problem. Here we present EvoMiner, a new algorithm for mining frequent subtrees in collections of phylogenetic trees. EvoMiner is an Apriori-like levelwise method, which uses a novel phylogeny-specific constant-time candidate generation scheme, an efficient fingerprinting-based technique for downward closure, and a lowest-common-ancestor-based support counting step that requires neither costly subtree operations nor database traversal. Our algorithm achieves speedups of up to 100 times or more over Phylominer, the current state-of-the-art algorithm for mining phylogenetic trees. EvoMiner can also work in depth-first enumeration mode to use less memory at the expense of speed. We demonstrate the utility of FST mining as a way to extract meaningful phylogenetic information from collections of trees when compared to maximum agreement subtrees and majority-rule trees, two commonly used approaches in phylogenetic analysis for extracting consensus information from a collection of trees over a common leaf set.
- Sanderson, M. J. (2014). Ceiba: scalable visualization of phylogenies and 2D/3D image collections. Bioinformatics, 30(17), 2506-2507. Summary: Phylogenetic trees with hundreds of thousands of leaves are now being inferred from sequence data, posing significant challenges for visualization and exploratory analysis. Image data supplying valuable context for species in trees (and cues for exploring them) are becoming increasingly available in biodiversity databases and elsewhere but have rarely been built into tree visualization software in a scalable way. Ceiba lets the user explore large trees and inspect image collection arrays (sets of 'homologous' images) comprising mixtures of 2D and 3D image objects. Ceiba exploits recent improvements in graphics hardware, OpenGL toolkits and many standard high-performance computer graphics strategies, such as texture compression, level of detail control, culling, animations and image caching. Its tree layouts can be tuned by user-provided phylogenetic definitions of subtrees. The code has been extensively tested on phylogenies of up to 55 000 leaves and images.
- Shi, T., Huang, H., Sanderson, M. J., & Tax, F. E. (2014). Evolutionary dynamics of leucine-rich repeat receptor-like kinases and related genes in plants: A phylogenomic approach. Journal of Integrative Plant Biology, 56(7), 648-662. Abstract: Leucine-rich repeat (LRR) receptor-like kinases (RLKs), evolutionarily related LRR receptor-like proteins (RLPs) and receptor-like cytoplasmic kinases (RLCKs) have important roles in plant signaling, and their gene subfamilies are large with a complicated history of gene duplication and loss. In three pairs of closely related lineages, Arabidopsis thaliana and A. lyrata (Arabidopsis); Lotus japonicus and Medicago truncatula (legumes); and Oryza sativa ssp. japonica and O. sativa ssp. indica (rice), we find that LRR RLKs comprise the largest group of these LRR-related subfamilies, while the related RLCKs represent the smallest group. In addition, comparison of orthologs indicates a high frequency of reciprocal gene loss of the LRR RLK/LRR RLP/RLCK subfamilies. Furthermore, pairwise comparisons show that reciprocal gene loss is often associated with lineage-specific duplication(s) in the alternative lineage. Last, analysis of genes in A. thaliana involved in development revealed that most are highly conserved orthologs without species-specific duplication in the two Arabidopsis species and originated from older Arabidopsis-specific or rosid-specific duplications. We discuss potential pitfalls related to functional prediction for genes that have undergone frequent turnover (duplications, losses, and domain architecture changes), and conclude that prediction based on phylogenetic relationships will likely outperform that based on sequence similarity alone.
- Zwickl, D. J., Stein, J. C., Wing, R. A., Ware, D., & Sanderson, M. J. (2014). Disentangling Methodological and Biological Sources of Gene Tree Discordance on Oryza (Poaceae) Chromosome 3. Systematic Biology, 63(5), 645-659. Abstract: We describe new methods for characterizing gene tree discordance in phylogenomic data sets, which screen for deviations from neutral expectations, summarize variation in statistical support among gene trees, and allow comparison of the patterns of discordance induced by various analysis choices. Using an exceptionally complete set of genome sequences for the short arm of chromosome 3 in Oryza (rice) species, we applied these methods to identify the causes and consequences of differing patterns of discordance in the sets of gene trees inferred using a panel of 20 distinct analysis pipelines. We found that discordance patterns were strongly affected by aspects of data selection, alignment, and alignment masking. Unusual patterns of discordance evident when using certain pipelines were reduced or eliminated by using alternative pipelines, suggesting that they were the product of methodological biases rather than evolutionary processes. In some cases, once such biases were eliminated, evolutionary processes such as introgression could be implicated. Additionally, patterns of gene tree discordance had significant downstream impacts on species tree inference. For example, inference from supermatrices was positively misleading when pipelines that led to biased gene trees were used. Several results may generalize to other data sets: we found that gene tree and species tree inference gave more reasonable results when intron sequence was included during sequence alignment and tree inference, the alignment software PRANK was used, and detectable "block-shift" alignment artifacts were removed. We discuss our findings in the context of well-established relationships in Oryza and continuing controversies regarding the domestication history of O. sativa.
- Steel, M., Linz, S., Huson, D. H., & Sanderson, M. J. (2013). Identifying a species tree subject to random lateral gene transfer. Journal of Theoretical Biology, 322, 81-93. PMID: 23340439; Abstract: A major problem for inferring species trees from gene trees is that evolutionary processes can sometimes favor gene tree topologies that conflict with an underlying species tree. In the case of incomplete lineage sorting, this phenomenon has recently been well-studied, and some elegant solutions for species tree reconstruction have been proposed. One particularly simple and statistically consistent estimator of the species tree under incomplete lineage sorting is to combine three-taxon analyses, which are phylogenetically robust to incomplete lineage sorting. In this paper, we consider whether such an approach will also work under lateral gene transfer (LGT). By providing an exact analysis of some cases of this model, we show that there is a zone of inconsistency when majority-rule three-taxon gene trees are used to reconstruct species trees under LGT. However, a triplet-based approach will consistently reconstruct a species tree under models of LGT, provided that the expected number of LGT transfers is not too high. Our analysis involves a novel connection between the LGT problem and random walks on cyclic graphs. We have implemented a procedure for reconstructing trees subject to LGT or lineage sorting in settings where taxon coverage may be patchy and illustrate its use on two sample data sets. © 2013 Elsevier Ltd.
- Marazzi, B., Ané, C., Simon, M. F., Delgado-Salinas, A., Luckow, M., & Sanderson, M. J. (2012). Locating evolutionary precursors on a phylogenetic tree. Evolution, 66(12), 3918-3930. PMID: 23206146; Abstract: Conspicuous innovations in the history of life are often preceded by more cryptic genetic and developmental precursors. In many cases, these appear to be associated with recurring origins of very similar traits in close relatives (parallelisms) or striking convergences separated by deep time (deep homologies). Although the phylogenetic distribution of gain and loss of traits hints strongly at the existence of such precursors, no models of trait evolution currently permit inference about their location on a tree. Here we develop a new stochastic model, which explicitly captures the dependency implied by a precursor and permits estimation of precursor locations. We apply it to the evolution of extrafloral nectaries (EFNs), an ecologically significant trait mediating a widespread mutualism between plants and ants. In legumes, a species-rich clade with morphologically diverse EFNs, the precursor model fits the data on EFN occurrences significantly better than conventional models. The model generates explicit hypotheses about the phylogenetic location of hypothetical precursors, which may help guide future studies of molecular genetic pathways underlying nectary position, development, and function. © 2012 The Society for the Study of Evolution.
- Sanderson, M. J., McMahon, M. M., & Steel, M. (2011). Terraces in phylogenetic tree space. Science, 333(6041), 448-450. PMID: 21680810; Abstract: A key step in assembling the tree of life is the construction of species-rich phylogenies from multilocus, but often incomplete, sequence data sets. We describe previously unknown structure in the landscape of solutions to the tree reconstruction problem, comprising sometimes vast "terraces" of trees with identical quality, arranged on islands of phylogenetically similar trees. Phylogenetic ambiguity within a terrace can be characterized efficiently and then ameliorated by new algorithms for obtaining a terrace's maximum-agreement subtree or by identifying the smallest set of new targets for additional sequencing. Algorithms to find optimal trees or estimate Bayesian posterior tree distributions may need to navigate strategically in the neighborhood of large terraces in tree space.
- Soltis, D. E., Smith, S. A., Cellinese, N., Wurdack, K. J., Tank, D. C., Brockington, S. F., Refulio-Rodriguez, N. F., Walker, J. B., Moore, M. J., Carlsward, B. S., Bell, C. D., Latvis, M., Crawley, S., Black, C., Diouf, D., Xi, Z., Rushworth, C. A., Gitzendanner, M. A., Sytsma, K. J., Qiu, Y., et al. (2011). Angiosperm phylogeny: 17 genes, 640 taxa. American Journal of Botany, 98(4), 704-730. PMID: 21613169; Abstract: Premise of the study: Recent analyses employing up to five genes have provided numerous insights into angiosperm phylogeny, but many relationships have remained unresolved or poorly supported. In the hope of improving our understanding of angiosperm phylogeny, we expanded sampling of taxa and genes beyond previous analyses. Methods: We conducted two primary analyses based on 640 species representing 330 families. The first included 25 260 aligned base pairs (bp) from 17 genes (representing all three plant genomes, i.e., nucleus, plastid, and mitochondrion). The second included 19 846 aligned bp from 13 genes (representing only the nucleus and plastid). Key results: Many important questions of deep-level relationships in the nonmonocot angiosperms have now been resolved with strong support. Amborellaceae, Nymphaeales, and Austrobaileyales are successive sisters to the remaining angiosperms (Mesangiospermae), which are resolved into Chloranthales + Magnoliidae as sister to Monocotyledoneae + [Ceratophyllaceae + Eudicotyledoneae]. Eudicotyledoneae contains a basal grade subtending Gunneridae. Within Gunneridae, Gunnerales are sister to the remainder (Pentapetalae), which comprises (1) Superrosidae, consisting of Rosidae (including Vitaceae) and Saxifragales; and (2) Superasteridae, comprising Berberidopsidales, Santalales, Caryophyllales, Asteridae, and, based on this study, Dilleniaceae (although other recent analyses disagree with this placement). Within the major subclades of Pentapetalae, most deep-level relationships are resolved with strong support. Conclusions: Our analyses confirm that with large amounts of sequence data, most deep-level relationships within the angiosperms can be resolved. We anticipate that this well-resolved angiosperm tree will be of broad utility for many areas of biology, including physiology, ecology, paleobiology, and genomics. © 2011 Botanical Society of America.
- Wertheim, J. O., & Sanderson, M. J. (2011). Estimating diversification rates: How useful are divergence times? Evolution, 65(2), 309-320. PMID: 21044059; PMCID: PMC3057369; Abstract: The dynamics of species diversification rates are a key component of macroevolutionary patterns. Although not absolutely necessary, the use of divergence times inferred from sequence data has led to development of more powerful methods for inferring diversification rates. However, it is unclear what impact uncertainty in age estimates has on diversification rate inferences. Here, we quantify these effects using both Bayesian and frequentist methodology. Through simulation, we demonstrate that adding sequence data results in more precise estimates of internal node ages, but a reasonable approximation of these node ages is often sufficient to approach the theoretical minimum variance in speciation rate estimates. We also find that even crude estimates of divergence times increase the power of tests of diversification rate differences between sister clades. Finally, because Bayesian and frequentist methods provided similar assessments of error, novel Bayesian approaches may provide a useful framework for tests of diversification rates in more complex contexts than are addressed here. © 2010 The Author(s). Evolution © 2010 The Society for the Study of Evolution.
- Cranston, K. A., Hurwitz, B., Sanderson, M. J., Ware, D., Wing, R. A., & Stein, L. (2010). Phylogenomic analysis of BAC-end sequence libraries in Oryza (Poaceae). Systematic Botany, 35(3), 512-523. Abstract: Analyses of genome scale data sets are beginning to clarify the phylogenetic relationships of species with complex evolutionary histories. Broad sampling across many genes allows for both large concatenated data sets to improve genome-scale phylogenetic resolution and also for independent analysis of gene trees and detection of phylogenetic incongruence. Recent sequencing projects in Oryza sativa and its wild relatives have positioned rice as a model system for such "phylogenomic" studies. We describe the assembly of a phylogenomic data set from 800,000 bacterial artificial chromosome (BAC) end sequences, producing an alignment of 2.4 million nucleotides for 10 diploid species of Oryza. A supermatrix approach confirms the broad outline of previous phylogenetic studies, although the nonphylogenetic signal and high levels of missing data must be handled carefully. Phylogenetic analysis of 12 chromosomes and nearly 2,000 genes finds strikingly high levels of incongruence across different genomic scales, a result that is likely to apply to other low-level phylogenies in plants. We conclude that there is great potential for phylogenetic inference using data from next-generation sequencing protocols but that attention to methodological issues arising inevitably in these data sets is critical. © 2010 by the American Society of Plant Taxonomists.
- Marazzi, B., & Sanderson, M. J. (2010). Large-scale patterns of diversification in the widespread legume genus Senna and the evolutionary role of extrafloral nectaries. Evolution, 64(12), 3570-3592. PMID: 21133898; Abstract: Unraveling the diversification history of old, species-rich and widespread clades is difficult because of extinction, undersampling, and taxonomic uncertainty. In the context of these challenges, we investigated the timing and mode of lineage diversification in Senna (Leguminosae) to gain insights into the evolutionary role of extrafloral nectaries (EFNs). EFNs secrete nectar, attracting ants and forming ecologically important ant-plant mutualisms. In Senna, EFNs characterize one large clade (EFN clade), including 80% of its 350 species. Taxonomic accounts make Senna the largest caesalpinioid genus, but quantitative comparisons to other taxa require inferences about rates. Molecular dating analyses suggest that Senna originated in the early Eocene, and its major lineages appeared during early/mid Eocene to early Oligocene. EFNs evolved in the late Eocene, after the main radiation of ants. The EFN clade diversified faster, becoming significantly more species-rich than non-EFN clades. The shift in diversification rates associated with EFN evolution supports the hypothesis that EFNs represent a (relatively old) key innovation in Senna. EFNs may have promoted the colonization of new habitats appearing with the early uplift of the Andes. This would explain the distinctive geographic concentration of the EFN clade in South America. © 2010 The Author(s). Evolution © 2010 The Society for the Study of Evolution.
- Sanderson, M. J., McMahon, M. M., & Steel, M. (2010). Phylogenomics with incomplete taxon coverage: The limits to inference. BMC Evolutionary Biology, 10(1). PMID: 20500873; PMCID: PMC2897806; Abstract: Background. Phylogenomic studies based on multi-locus sequence data sets are usually characterized by partial taxon coverage, in which sequences for some loci are missing for some taxa. The impact of missing data has been widely studied in phylogenetics, but it has proven difficult to distinguish effects due to error in tree reconstruction from effects due to missing data per se. We approach this problem using an explicitly phylogenomic criterion of success, decisiveness, which refers to whether the pattern of taxon coverage allows for uniquely defining a single tree for all taxa. Results. We establish theoretical bounds on the impact of missing data on decisiveness. Results are derived for two contexts: a fixed taxon coverage pattern, such as that observed from an already assembled data set, and a randomly generated pattern derived from a process of sampling new data, such as might be observed in an ongoing comparative genomics sequencing project. Lower bounds on how many loci are needed for decisiveness are derived for the former case, and both lower and upper bounds for the latter. When data are not decisive for all trees, we estimate the probability of decisiveness and the chances that a given edge in the tree will be distinguishable. Theoretical results are illustrated using several empirical examples constructed by mining sequence databases, genomic libraries such as ESTs and BACs, and complete genome sequences. Conclusion. Partial taxon coverage among loci can limit phylogenomic inference by making it impossible to distinguish among multiple alternative trees. However, even though lack of decisiveness is typical of many sparse phylogenomic data sets, it is often still possible to distinguish a large fraction of edges in the tree. © 2010 Sanderson et al; licensee BioMed Central Ltd.
- Steel, M., & Sanderson, M. J. (2010). Characterizing phylogenetically decisive taxon coverage. Applied Mathematics Letters, 23(1), 82-86. Abstract: Increasingly, biologists are constructing evolutionary trees on large numbers of overlapping sets of taxa, and then combining them into a 'supertree' that classifies all the taxa. In this paper, we ask how much coverage of the total set of taxa is required by these subsets in order to ensure that we have enough information to reconstruct the supertree uniquely. We describe two results: a combinatorial characterization of the covering subsets to ensure that at most one supertree can be constructed from the smaller trees (whichever trees these may be) and a more liberal analysis that asks only that the supertree is highly likely to be uniquely specified by the tree structure on the covering subsets. © 2009 Elsevier Ltd. All rights reserved.
- Wertheim, J. O., Sanderson, M. J., Worobey, M., & Bjork, A. (2010). Relaxed molecular clocks, the bias-variance trade-off, and the quality of phylogenetic inference. Systematic Biology, 59(1), 1-8. PMID: 20525616; PMCID: PMC2909785; Abstract: Because a constant rate of DNA sequence evolution cannot be assumed to be ubiquitous, relaxed molecular clock inference models have proven useful when estimating rates and divergence dates. Furthermore, it has been recently suggested that using relaxed molecular clocks may provide superior accuracy and precision in phylogenetic inference compared with traditional time-free methods that do not incorporate a molecular clock. We perform a simulation study to determine if assuming a relaxed molecular clock does indeed improve the quality of phylogenetic inference. We analyze sequence data simulated under various rate distributions using relaxed-clocks, strict-clocks, and time-free Bayesian phylogenetic inference models. Our results indicate that no difference exists in the quality of phylogenetic inference between assuming a relaxed molecular clock and making no assumption about the clock-likeness of sequence evolution. This pattern is likely due to the bias-variance trade-off inherent in this type of phylogenetic inference. We also compared the quality of inference between Bayesian and maximum likelihood time-free inference models and found them to be qualitatively similar.
- Ané, C., Eulenstein, O., Piaggio-Talice, R., & Sanderson, M. J. (2009). Groves of phylogenetic trees. Annals of Combinatorics, 13(2), 139-167. Abstract: A major challenge in biological sciences is the reconstruction of the Tree of Life. To this effect, large genomic databases like GenBank and SwissProt are being mined for clusters from which phylogenies can be inferred. Systematists and comparative biologists commonly combine such phylogenies into informative supertrees that reveal information which was not explicitly displayed in any of the original phylogenies. However, whether a supertree is informative depends on particular overlap properties among the clusters from which it originates. In this work we formally introduce the concept of groves: sets of clusters with the potential to construct informative supertrees. Thus maximal potential candidate clusters for informative supertree construction can be identified in large databases through groves, prior to inferring trees for each cluster. Groves also have the potential to lead to informative supermatrix construction. We developed methods that (i) efficiently identify particular types of groves and (ii) find lower and upper bounds on the minimal number of groves needed to cover all the trees or data sets in a database. Finally, we apply our methods to the green plant sequences from GenBank. © Birkhäuser Verlag Basel/Switzerland 2009.
- Kim, J., & Sanderson, M. J. (2008). Penalized likelihood phylogenetic inference: Bridging the parsimony-likelihood gap. Systematic Biology, 57(5), 665-674. PMID: 18853355; Abstract: The increasing diversity and heterogeneity of molecular data for phylogeny estimation has led to development of complex models and model-based estimators. Here, we propose a penalized likelihood (PL) framework in which the levels of complexity in the underlying model can be smoothly controlled. We demonstrate the PL framework for a four-taxon tree case and investigate its properties. The PL framework yields an estimator in which the majority of currently employed estimators such as the maximum-parsimony estimator, homogeneous likelihood estimator, gamma mixture likelihood estimator, etc., become special cases of a single family of PL estimators. Furthermore, using the appropriate penalty function, the complexity of the underlying models can be partitioned into separately controlled classes allowing flexible control of model complexity. Copyright © Society of Systematic Biologists.
- Sanderson, M. J. (2008). Phylogenetic signal in the eukaryotic tree of life. Science, 321(5885), 121-123. PMID: 18599787; Abstract: Molecular sequence data have been sampled from 10% of all species known to science. Although it is not yet feasible to assemble these data into a single phylogenetic tree of life, it is possible to quantify how much phylogenetic signal is present. Analysis of 14,289 phylogenies built from 2.6 million sequences in GenBank suggests that signal is strong in vertebrates and specific groups of nonvertebrate model organisms. Across eukaryotes, however, although phylogenetic evidence is very broadly distributed, for the average species in the database it is equivalent to less than one well-supported gene tree. This analysis shows that a stronger sampling effort aimed at genomic depth, in addition to taxonomic breadth, will be required to build high-resolution phylogenetic trees at this scale.
- Sanderson, M. J., Boss, D., Chen, D., Cranston, K. A., & Wehe, A. (2008). The PhyLoTA Browser: Processing GenBank for molecular phylogenetics research. Systematic Biology, 57(3), 335-346. PMID: 18570030; Abstract: As an archive of sequence data for over 165,000 species, GenBank is an indispensable resource for phylogenetic inference. Here we describe an informatics processing pipeline and online database, the PhyLoTA Browser (http://loco.biosci.arizona.edu/pb), which offers a view of GenBank tailored for molecular phylogenetics. The first release of the Browser is computed from 2.6 million sequences representing the taxonomically enriched subset of GenBank sequences for eukaryotes (excluding most genome survey sequences, ESTs, and other high-throughput data). In addition to summarizing sequence diversity and species diversity across nodes in the NCBI taxonomy, it reports 87,000 potentially phylogenetically informative clusters of homologous sequences, which can be viewed or downloaded, along with provisional alignments and coarse phylogenetic trees. At each node in the NCBI hierarchy, the user can display a "data availability matrix" of all available sequences for entries in a subtaxa-by-clusters matrix. This matrix provides a guidepost for subsequent assembly of multigene data sets or supertrees. The database allows for comparison of results from previous GenBank releases, highlighting recent additions of either sequences or taxa to GenBank and letting investigators track progress on data availability worldwide. Although the reported alignments and trees are extremely approximate, the database reports several statistics correlated with alignment quality to help users choose from alternative data sources. Copyright © Society of Systematic Biologists.
- Scherson, R. A., Vidal, R., & Sanderson, M. J. (2008). Phylogeny, biogeography, and rates of diversification of New World Astragalus (Leguminosae) with an emphasis on South American radiations. American Journal of Botany, 95(8), 1030-1039. PMID: 21632423; Abstract: This study uses phylogenetic relationships of New World representatives of the species-rich genus Astragalus (Leguminosae; Papilionoideae) to follow up on recent evidence pointing to rapid and recent plant diversification patterns in the Andes. Bayesian and maximum likelihood phylogenetic analyses were done using nuclear rDNA ITS and chloroplast spacers trnD-trnT and trnfM-trnS1, either separately or in combination. The effect of using partitioned vs. nonpartitioned analyses in a Bayesian approach was evaluated. Highest resolution was obtained when the data were combined in partitioned or nonpartitioned Bayesian analyses. All phylogenies support two clades of South American species nested within the North American species, implying two separate invasions from North to South America. These two clades correspond to the original morphological classification of Johnston (1947 Journal of the Arnold Arboretum 28: 336-409). The mean ages of the South American clades were very recent but still significantly different (1.89 and 0.98 Ma). Upper and lower bounds on rates of diversification varied between 2.01 and 0.65 species/Ma for the older clade and 2.06 and 1.24 species/Ma for the younger clade. Even the lower bounds are still very high, reasserting Neo-Astragalus in the growing list of recent rapid radiations of plants, especially in areas with a high physiographic diversity, such as the Andes.
- Hackett, J. D., Yoon, H. S., Butterfield, N. J., Sanderson, M. J., & Bhattacharya, D. (2007). Plastid Endosymbiosis. Sources and Timing of the Major Events. Evolution of Primary Producers in the Sea, 109-132. Abstract: This chapter reviews the current ideas regarding the origin of plastids in eukaryotes and the timing of these events, with particular emphasis on the initial source of eukaryotic photosynthesis. First, a general introduction is provided on plastid endosymbiosis. The currently available evidence suggests that a single primary endosymbiosis gave rise to the Plantae, comprising the glaucophytes, red algae, and Viridiplantae. The chapter looks in detail at the evidence regarding the source and timing of the plastids that have resulted from primary, secondary, and tertiary endosymbiosis. The greatest attention is paid to the primary endosymbiosis. Primary plastid's origin and Plantae monophyly are discussed in detail. The unique and relatively late appearance of photosynthetic eukaryotes has important implications for understanding the early biosphere and its fossil record. Nuclear phylogeny supports the monophyly of photosynthetic eukaryotes containing a primary plastid, and molecular clock estimates provide a timeline for reconstructing the early evolutionary history of the Plantae. Together with the fossil and geochemical records, these data provide an increasingly resolved view of early eukaryotic photosynthesis and, by extension, important features in evolutionary and Earth history. Plastid endosymbiosis has clearly been a driving force in eukaryotic evolution and instrumental to the success of many eukaryotic groups and has influenced the evolution of other organisms that utilize these organisms for food or habitat. Ongoing studies of Paulinella may shed light on the early stages of primary endosymbiosis, a process that has profoundly impacted evolution of life on Earth. © 2007 Elsevier Inc. All rights reserved.
- Sanderson, M. J. (2007). L. A. S. Johnson review no. 9. Construction and annotation of large phylogenetic trees. Australian Systematic Botany, 20(4), 287-301.More infoAbstract: Broad availability of molecular sequence data allows construction of phylogenetic trees with 1000s or even 10 000s of taxa. This paper reviews methodological, technological and empirical issues raised in phylogenetic inference at this scale. Numerous algorithmic and computational challenges have been identified surrounding the core problem of reconstructing large trees accurately from sequence data, but many other obstacles, both upstream and downstream of this step, are less well understood. Before phylogenetic analysis, data must be generated de novo or extracted from existing databases, compiled into blocks of homologous data with controlled properties, aligned, examined for the presence of gene duplications or other kinds of complicating factors, and finally, combined with other evidence via supermatrix or supertree approaches. After phylogenetic analysis, confidence assessments are usually reported, along with other kinds of annotations, such as clade names, or annotations requiring additional inference procedures, such as trait evolution or divergence time estimates. Prospects for partial automation of large-tree construction are also discussed, as well as risks associated with 'outsourcing' phylogenetic inference beyond the systematics community. © CSIRO.
- Sanderson, M. J., & McMahon, M. M. (2007). Inferring angiosperm phylogeny from EST data with widespread gene duplication. BMC Evolutionary Biology, 7(SUPPL. 1).More infoPMID: 17288576;PMCID: PMC1796612;Abstract: Background. Most studies inferring species phylogenies use sequences from single copy genes or sets of orthologs culled from gene families. For taxa such as plants, with very high levels of gene duplication in their nuclear genomes, this has limited the exploitation of nuclear sequences for phylogenetic studies, such as those available in large EST libraries. One rarely used method of inference, gene tree parsimony, can infer species trees from gene families undergoing duplication and loss, but its performance has not been evaluated at a phylogenomic scale for EST data in plants. Results. A gene tree parsimony analysis based on EST data was undertaken for six angiosperm model species and Pinus, an outgroup. Although a large fraction of the tentative consensus sequences obtained from the TIGR database of ESTs was assembled into homologous clusters too small to be phylogenetically informative, some 557 clusters contained promising levels of information. Based on maximum likelihood estimates of the gene trees obtained from these clusters, gene tree parsimony correctly inferred the accepted species tree with strong statistical support. A slight variant of this species tree was obtained when maximum parsimony was used to infer the individual gene trees instead. Conclusion. Despite the complexity of the EST data and the relatively small fraction eventually used in inferring a species tree, the gene tree parsimony method performed well in the face of very high apparent rates of duplication. © 2007 Sanderson and McMahon; licensee BioMed Central Ltd.
- Burleigh, J. G., Driskell, A. C., & Sanderson, M. J. (2006). Supertree bootstrapping methods for assessing phylogenetic variation among genes in genome-scale data sets. Systematic Biology, 55(3), 426-440. PMID: 16861207; Abstract: Nonparametric bootstrapping methods may be useful for assessing confidence in a supertree inference. We examined the performance of two supertree bootstrapping methods on four published data sets that each include sequence data from more than 100 genes. In "input tree bootstrapping," input gene trees are sampled with replacement and then combined in replicate supertree analyses; in "stratified bootstrapping," trees from each gene's separate (conventional) bootstrap tree set are sampled randomly with replacement and then combined. Generally, support values from both supertree bootstrap methods were similar or slightly lower than corresponding bootstrap values from a total evidence, or supermatrix, analysis. Yet, supertree bootstrap support also exceeded supermatrix bootstrap support for a number of clades. There was little overall difference in support scores between the input tree and stratified bootstrapping methods. Results from supertree bootstrapping methods, when compared to results from corresponding supermatrix bootstrapping, may provide insights into patterns of variation among genes in genome-scale data sets. Copyright © Society of Systematic Biologists.
- Chen, D., Eulenstein, O., Fernández-Baca, D., & Sanderson, M. (2006). Minimum-flip supertrees: Complexity and algorithms. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 3(2), 165-173.More infoPMID: 17048402;Abstract: The input to a supertree problem is a collection of phylogenetic trees that intersect pairwise in their leaf sets; the goal is to construct a single tree that retains as much as possible of the information in the input. This task is complicated by inconsistencies due to errors. We consider the case where the input trees are rooted and are represented by the clusters they exhibit. The problem is to find the minimum number of flips needed to resolve all inconsistencies, where each flip moves a taxon into or out of a cluster. We prove that the minimum-flip problem is NP-complete, but show that it is fixed-parameter tractable and give approximation algorithms for special cases. © 2006 IEEE.
- McMahon, M. M., & Sanderson, M. J. (2006). Phylogenetic supermatrix analysis of GenBank sequences from 2228 papilionoid legumes. Systematic Biology, 55(5), 818-836. PMID: 17060202; Abstract: A comprehensive phylogeny of papilionoid legumes was inferred from sequences of 2228 taxa in GenBank release 147. A semiautomated analysis pipeline was constructed to download, parse, assemble, align, combine, and build trees from a pool of 11,881 sequences. Initial steps included all-against-all BLAST similarity searches coupled with assembly, using a novel strategy for building length-homogeneous primary sequence clusters. This was followed by a combination of global and local alignment protocols to build larger secondary clusters of locally aligned sequences, thus taking into account the dramatic differences in length of the heterogeneous coding and noncoding sequence data present in GenBank. Next, clusters were checked for the presence of duplicate genes and other potentially misleading sequences and examined for combinability with other clusters on the basis of taxon overlap. Finally, two supermatrices were constructed: a "sparse" matrix based on the primary clusters alone (1794 taxa x 53,977 characters), and a somewhat more "dense" matrix based on the secondary clusters (2228 taxa x 33,168 characters). Both matrices were very sparse, with 95% of their cells containing gaps or question marks. These were subjected to extensive heuristic parsimony analyses using deterministic and stochastic heuristics, including bootstrap analyses. A "reduced consensus" bootstrap analysis was also performed to detect cryptic signal in a subtree of the data set corresponding to a "backbone" phylogeny proposed in previous studies.
Overall, the dense supermatrix appeared to provide much more satisfying results, indicated by better resolution of the bootstrap tree, excellent agreement with the backbone papilionoid tree in the reduced bootstrap consensus analysis, few problematic large polytomies in the strict consensus, and less fragmentation of conventionally recognized genera. Nevertheless, at lower taxonomic levels several problems were identified and diagnosed. A large number of methodological issues in supermatrix construction at this scale are discussed, including detection of annotation errors in GenBank sequences; the shortage of effective algorithms and software for local multiple sequence alignment; the difficulty of overcoming effects of fragmentation of data into nearly disjoint blocks in sparse supermatrices; and the lack of informative tools to assess confidence limits in very large trees. Copyright © Society of Systematic Biologists.
- O'Meara, B. C., Ané, C., Sanderson, M. J., & Wainwright, P. C. (2006). Testing for different rates of continuous trait evolution using likelihood. Evolution, 60(5), 922-923. PMID: 16817533; Abstract: Rates of phenotypic evolution have changed throughout the history of life, producing variation in levels of morphological, functional, and ecological diversity among groups. Testing for the presence of these rate shifts is a key component of evaluating hypotheses about what causes them. In this paper, general predictions regarding changes in phenotypic diversity as a function of evolutionary history and rates are developed, and tests are derived to evaluate rate changes. Simulations show that these tests are more powerful than existing tests using standardized contrasts. The new approaches are distributed in an application called Brownie and in r8s. © 2006 The Society for the Study of Evolution. All rights reserved.
- Sanderson, M. J. (2006). Paloverde: An OpenGL 3D phylogeny browser. Bioinformatics, 22(8), 1004-1006.More infoPMID: 16500938;Abstract: Summary: Paloverde is a new program designed to help visualize the phylogenetic structure of moderately large trees - trees on the scale of 100-2500 leaf nodes. The program embeds the user in an interactive virtual 3D world in which a large tree presented in various layouts can be manipulated through a mouse interface. The program implements radial 2D layouts, and true 3D spiral, conical and hemispherical (i.e. truly 'tree'-like) layouts. Subclades can be defined in the input file (using standard node-based definitions) and displayed collapsed as new leaf nodes, or left intact but annotated with names around the periphery of the tree. A search tool lets the user zoom to any selected leaf node. Paloverde is an open source project written in ANSI C using the OpenGL library for 3D visualization. © 2006 Oxford University Press.
- Smythe, A. B., Sanderson, M. J., & Nadler, S. A. (2006). Nematode small subunit phylogeny correlates with alignment parameters. Systematic Biology, 55(6), 972-992.More infoPMID: 17345678;Abstract: The number of nuclear small subunit (SSU) ribosomal RNA (rRNA) sequences for Nematoda has increased dramatically in recent years, and although their use in constructing phylogenies has also increased, relatively little attention has been given to their alignment. Here we examined the sensitivity of the nematode SSU data set to different alignment parameters and to the removal of alignment ambiguous regions. Ten alignments were created with CLUSTAL W using different sets of alignment parameters (10 full alignments), and each alignment was examined by eye and alignment ambiguous regions were removed (creating 10 reduced alignments). These alignment ambiguous regions were analyzed as a third type of data set, culled alignments. Maximum parsimony, neighbor-joining, and parsimony bootstrap analyses were performed. The resulting phylogenies were compared to each other by the symmetric difference distance tree comparison metric (SymD). The correlation of the phylogenies with the alignment parameters was tested by comparing matrices from SymD with corresponding matrices of Manhattan distances representing the alignment parameters. Differences among individual parsimony trees from the full alignments were frequently correlated with the differences among alignment parameters (580/1000 tests), as were trees from the culled alignments (403/1000 tests). Differences among individual parsimony trees from the reduced alignments were less frequently correlated with the differences among alignment parameters (230/1000 tests). 
Differences among majority-rule consensus trees (50%) from the parsimony analysis of the full alignments were significantly correlated with the differences among alignment parameters, whereas consensus trees from the reduced and culled analyses were not correlated with the alignment parameters. These patterns of correlation confirm that choice of alignment parameters has the potential to bias the resultant phylogenies for the nematode SSU data set, and suggest that the removal of alignment ambiguous regions reduces this effect. Finally, we discuss the implications of conservative phylogenetic hypotheses for Nematoda produced by exploring alignment space and removing alignment ambiguous regions for SSU rDNA. © 2006 Society of Systematic Biologists.
- Strong, D. R., & Sanderson, M. (2006). Cenozoic insect-plant diversification in the tropics. Proceedings of the National Academy of Sciences of the United States of America, 103(29), 10827-10828. PMID: 16832057; PMCID: PMC1544132.
- Ané, C., & Sanderson, M. J. (2005). Missing the forest for the trees: Phylogenetic compression and its implications for inferring complex evolutionary histories. Systematic Biology, 54(1), 146-157.More infoPMID: 15805016;Abstract: Phylogenetic tree reconstruction is difficult in the presence of lateral gene transfer and other processes generating conflicting signals. We develop a new approach to this problem using ideas borrowed from algorithmic information theory. It selects the hypothesis that simultaneously minimizes the descriptive complexity of the tree(s) plus the data when encoded using those tree(s). In practice this is the hypothesis that can compress the data the most. We show not only that phylogenetic compression is an efficient method for encoding most phylogenetic data sets and is more efficient than compression schemes designed for single sequences, but also that it provides a clear information theoretic rule for determining when a collection of conflicting trees is a better explanation of the data than a single tree. By casting the parsimony problem in this more general framework, we also conclude that the so-called total-evidence tree - the tree constructed from all the data simultaneously - is not always the most economical explanation of the data. Copyright © Society of Systematic Biologists.
- Ané, C., Burleigh, J. G., McMahon, M. M., & Sanderson, M. J. (2005). Covarion structure in plastid genome evolution: A new statistical test. Molecular Biology and Evolution, 22(4), 914-924.More infoPMID: 15625184;Abstract: Covarion models of molecular evolution allow the rate of evolution of a site to vary through time. There are few simple and effective tests for covarion evolution, and consequently, little is known about the presence of covarion processes in molecular evolution. We describe two new tests for covarion evolution and demonstrate with simulations that they perform well under a wide range of conditions. A survey of covarion evolution in sequenced plastid genomes found evidence of covarion drift in at least 26 out of 57 genes. Covarion evolution is most evident in first and second codon positions of the plastid genes, and there is no evidence of covarion evolution in third codon positions. Therefore, the significant covarion tests are likely due to changes in the selective constraints of amino acids. The frequency of covarion evolution within the plastid genome suggests that covarion processes of evolution were important in generating the observed patterns of sequence variation among plastid genomes. © Society for Molecular Biology and Evolution 2004; all rights reserved.
- Magallón, S. A., & Sanderson, M. J. (2005). Angiosperm divergence times: The effect of genes, codon positions, and time constraints. Evolution, 59(8), 1653-1670.More infoPMID: 16329238;Abstract: An understanding of the evolution of modern terrestrial ecosystems requires an understanding of the dynamics associated with angiosperm evolution, including the timing of their origin and diversification into their extraordinary present-day diversity. Molecular estimates of angiosperm age have varied widely, and many substantially predate the Early Cretaceous fossil appearance of the group. In this study, the effect of different genes, codon positions, and chronological constraints on node ages are examined on divergence time estimates across seed plants, with a special focus on angiosperms. Penalized likelihood was used to estimate divergence times on a phylogenetic hypothesis for seed plants derived from Bayesian analysis, with branch lengths estimated with maximum likelihood. The plastid genes atpB, psaA, psbB, and rbcL were used individually and in combination, using first and second, third, and the three codon positions, including and excluding age constraints on 20 nodes derived from a critical examination of the land-plant fossil record. The optimal level of rate smoothing according to each unconstrained and constrained dataset was obtained with penalized likelihood. Tests for a molecular clock revealed significantly unclocklike rates in all datasets. Addition of fossil constraints resulted in even greater departures from constancy. Consistently with significant deviations from a clock, estimated optimal smoothing values were low, but a strict correlation between rate heterogeneity and optimal smoothing value was not found. Age estimates for nodes across the phylogeny varied, sometimes substantially, with gene and codon position. Nevertheless, estimates based on the four concatenated genes are very similar to the mean of the four individual gene estimates. 
For any given node, unconstrained age estimates are more variable than constrained estimates and are frequently younger than well-substantiated fossil members of the clade. Constrained estimates of ages of clades are older than unconstrained estimates and oldest fossil representatives, sometimes substantially so. Angiosperm age estimates decreased as rate smoothing increased. Whereas the range of unconstrained angiosperm age estimates spans the fossil age of the clade, the range of constrained estimates is narrower (and older) than the earliest angiosperm fossils. Results unambiguously indicate the relevance of constraints in reducing the variability of ages derived from different partitions of the data and diminishing the effect of the smoothing parameter. Constrained optimizations of divergence times and substitution rates across the phylogeny suggest appreciably different evolutionary dynamics for angiosperms and for gymnosperms. Whereas the gymnosperm crown group originated shortly after the origin of seed plants, a long time elapsed before the origin of crown group angiosperms. Although absolute age estimates of angiosperms and angiosperm clades are older than their earliest fossils, the estimated pace of phylogenetic diversification largely agrees with the rapid appearance of angiosperm lineages in stratigraphic sequences. © 2005 The Society for the Study of Evolution. All rights reserved.
- Scherson, R. A., Choi, H., Cook, D. R., & Sanderson, M. J. (2005). Phylogenetics of New World Astragalus: Screening of novel nuclear loci for the reconstruction of phylogenies at low taxonomic levels. Brittonia, 57(4), 354-366.More infoAbstract: This study explores methods to use information gathered from genomics technology to understand evolutionary relationships in the hyperdiverse legume group Neo-Astragalus. These species inhabit deserts and mountains of North and South America, and even though the monophyly of the group is well established, relationships within it are still poorly understood. Plastid genes, commonly used to infer phylogenies in plants, are usually not useful for closely related taxa because of low levels of genetic variation. The Medicago truncatula genome project provided a suite of candidate nuclear loci with high levels of variation that might prove suitable for low-level phylogenetics. This paper reports the development of methods for screening a large number of these nuclear loci, and detailed analysis of four of them. Four different patterns of phylogenetic diversification occur in the loci sampled from these genomes of Astragalus species. One locus (CNGC4) was single copy and could be directly used in phylogenetic analyses. Two loci (ARG10 and FENR) showed patterns strongly suggestive of duplication events in some taxa, and one locus (tRALS) has apparently undergone a cryptic duplication, making it very difficult to diagnose. Potential methods for using the information provided by these loci are discussed. © 2005, by The New York Botanical Garden Press.
- Driskell, A. C., Ané, C., Burleigh, J. G., McMahon, M. M., O'Meara, B. C., & Sanderson, M. J. (2004). Prospects for building the tree of life from large sequence databases. Science, 306(5699), 1172-1174. PMID: 15539599; Abstract: We assess the phylogenetic potential of ∼300,000 protein sequences sampled from Swiss-Prot and GenBank. Although only a small subset of these data was potentially phylogenetically informative, this subset retained a substantial fraction of the original taxonomic diversity. Sampling biases in the databases necessitate building phylogenetic data sets that have large numbers of missing entries. However, an analysis of two "supermatrices" suggests that even data sets with as much as 92% missing data can provide insights into broad sections of the tree of life.
- Eulenstein, O., Chen, D., Burleigh, J. G., Fernández-Baca, D., & Sanderson, M. J. (2004). Performance of flip supertree construction with a heuristic algorithm. Systematic Biology, 53(2), 299-308. PMID: 15205054; Abstract: Supertree methods are used to assemble separate phylogenetic trees with shared taxa into larger trees (supertrees) in an effort to construct more comprehensive phylogenetic hypotheses. In spite of much recent interest in supertrees, there are still few methods for supertree construction. The flip supertree problem is an error correction approach that seeks to find a minimum number of changes (flips) to the matrix representation of the set of input trees to resolve their incompatibilities. A previous flip supertree algorithm was limited to finding exact solutions and was only feasible for small input trees. We developed a heuristic algorithm for the flip supertree problem suitable for much larger input trees. We used a series of 48- and 96-taxon simulations to compare supertrees constructed with the flip supertree heuristic algorithm with supertrees constructed using other approaches, including MinCut (MC), modified MC (MMC), and matrix representation with parsimony (MRP). Flip supertrees are generally far more accurate than supertrees constructed using MC or MMC algorithms and are at least as accurate as supertrees built with MRP. The flip supertree method is therefore a viable alternative to other supertree methods when the number of taxa is large.
- Grotkopp, E., Rejmánek, M., Sanderson, M. J., & Rost, T. L. (2004). Evolution of genome size in pines (Pinus) and its life-history correlates: Supertree analyses. Evolution, 58(8), 1705-1729.More infoPMID: 15446425;Abstract: Genome size has been suggested to be a fundamental biological attribute in determining life-history traits in many groups of organisms. We examined the relationships between pine genome sizes and pine phylogeny, environmental factors (latitude, elevation, annual rainfall), and biological traits (latitudinal and elevational ranges, seed mass, minimum generation time, interval between large seed crops, seed dispersal mode, relative growth rate, measures of potential and actual invasiveness, and level of rarity). Genome sizes were determined for 60 pine taxa and then combined with published values to make a dataset encompassing 85 species, or 70% of species in the genus. Supertrees were constructed using 20 published source phylogenies. Ancestral genome size was estimated as 32 pg. Genome size has apparently remained stable or increased over evolutionary time in subgenus Strobus, while it has decreased in most subsections in subgenus Pinus. We analyzed relationships between genome size and life-history variables using cross-species correlations and phylogenetically independent contrasts derived from supertree constructions. The generally assumed positive relation between genome size and minimum generation time could not be confirmed in phylogenetically controlled analyses. We found that the strongest correlation was between genome size and seed mass. Because the growth quantities specific leaf area and leaf area ratio (and to a lesser extent relative growth rate) are strongly negatively related to seed mass, they were also negatively correlated with genome size. Northern latitudinal limit was negatively correlated with genome size. 
Invasiveness, particularly of wind-dispersed species, was negatively associated with both genome size and seed mass. Seed mass and its relationships with seed number, dispersal mode, and growth rate contribute greatly to the differences in life-history strategies of pines. Many life-history patterns are therefore indirectly, but consistently, associated with genome size.
- Near, T. J., & Sanderson, M. J. (2004). Assessing the quality of molecular divergence time estimates by fossil calibrations and fossil-based model selection. Philosophical Transactions of the Royal Society B: Biological Sciences, 359(1450), 1477-1483.More infoPMID: 15519966;PMCID: PMC1693436;Abstract: Estimates of species divergence times using DNA sequence data are playing an increasingly important role in studies of evolution, ecology and biogeography. Most work has centred on obtaining appropriate kinds of data and developing optimal estimation procedures, whereas somewhat less attention has focused on the calibration of divergences using fossils. Case studies with multiple fossil calibration points provide important opportunities to examine the divergence time estimation problem in new ways. We discuss two cross-validation procedures that address different aspects of inference in divergence time estimation. 'Fossil cross-validation' is a procedure used to identify the impact of different individual calibrations on overall estimation. This can identify fossils that have an exceptionally large error effect and may warrant further scrutiny. 'Fossil-based model cross-validation' is an entirely different procedure that uses fossils to identify the optimal model of molecular evolution in the context of rate smoothing or other inference methods. Both procedures were applied to two recent studies: an analysis of monocot angiosperms with eight fossil calibrations and an analysis of placental mammals with nine fossil calibrations. In each case, fossil calibrations could be ranked from most to least influential, and in one of the two studies, the fossils provided decisive evidence about the optimal molecular evolutionary model.
- Sanderson, M. J., Thorne, J. L., Wikström, N., & Bremer, K. (2004). Molecular evidence on plant divergence times. American Journal of Botany, 91(10), 1656-1665.More infoPMID: 21652315;Abstract: Estimation of divergence times from sequence data has become increasingly feasible in recent years. Conflicts between fossil evidence and molecular dates have sparked the development of new methods for inferring divergence times, further encouraging these efforts. In this paper, available methods for estimating divergence times are reviewed, especially those geared toward handling the widespread variation in rates of molecular evolution observed among lineages. The assumptions, strengths, and weaknesses of local clock, Bayesian, and rate smoothing methods are described. The rapidly growing literature applying these methods to key divergence times in plant evolutionary history is also reviewed. These include the crown group ages of green plants, land plants, seed plants, angiosperms, and major subclades of angiosperms. Finally, attempts to infer divergence times are described in the context of two very different temporal settings: recent adaptive radiations and much more ancient biogeographic patterns.
- Scotland, R. W., & Sanderson, M. J. (2004). The Significance of Few Versus Many in the Tree of Life. Science, 303(5658), 643-. PMID: 14752153.
- Wojciechowski, M. F., Lavin, M., & Sanderson, M. J. (2004). A phylogeny of legumes (Leguminosae) based on analysis of the plastid matK gene resolves many well-supported subclades within the family. American Journal of Botany, 91(11), 1846-1862.More infoPMID: 21652332;Abstract: Phylogenetic analysis of 330 plastid matK gene sequences, representing 235 genera from 37 of 39 tribes, and four outgroup taxa from eurosids I supports many well-resolved subclades within the Leguminosae. These results are generally consistent with those derived from other plastid sequence data (rbcL and trnL), but show greater resolution and clade support overall. In particular, the monophyly of subfamily Papilionoideae and at least seven major subclades are well-supported by bootstrap and Bayesian credibility values. These subclades are informally recognized as the Cladrastis clade, genistoid sensu lato, dalbergioid sensu lato, mirbelioid, millettioid, and robinioid clades, and the inverted-repeat-lacking clade (IRLC). The genistoid clade is expanded to include genera such as Poecilanthe, Cyclolobium, Bowdichia, and Diplotropis and thus contains the vast majority of papilionoids known to produce quinolizidine alkaloids. The dalbergioid clade is expanded to include the tribe Amorpheae. The mirbelioids include the tribes Bossiaeeae and Mirbelieae, with Hypocalypteae as its sister group. The millettioids comprise two major subclades that roughly correspond to the tribes Millettieae and Phaseoleae and represent the only major papilionoid clade marked by a macromorphological apomorphy, pseudoracemose inflorescences. The robinioids are expanded to include Sesbania and members of the tribe Loteae. The IRLC, the most species-rich subclade, is sister to the robinioids. Analysis of the matK data consistently resolves but modestly supports a clade comprising papilionoid taxa that accumulate canavanine in the seeds. 
This suggests a single origin for the biosynthesis of this most commonly produced of the nonprotein amino acids in legumes.
- Piel, W. H., Sanderson, M. J., & Donoghue, M. J. (2003). The small-world dynamics of tree networks and data mining in phyloinformatics. Bioinformatics, 19(9), 1162-1168. PMID: 12801879; Abstract: Motivation: A noble and ultimate objective of phyloinformatic research is to assemble, synthesize, and explore the evolutionary history of life on earth. Data mining methods for performing these tasks are not yet well developed, but one avenue of research suggests that network connectivity dynamics will play an important role in future methods. Analysis of disordered networks, such as small-world networks, has applications as diverse as disease propagation, collaborative networks, and power grids. Here we apply similar analyses to networks of phylogenetic trees in order to understand how synthetic information can emerge from a database of phylogenies. Results: Analyses of tree network connectivity in TreeBASE show that a collection of phylogenetic trees behaves as a small-world network: while on the one hand the trees are clustered, like a non-random lattice, on the other hand they have short characteristic path lengths, like a random graph. Tree connectivities follow a dual-scale power-law distribution (first power-law exponent ≈1.87; second ≈4.82). This unusual pattern is due, in part, to the presence of alternative tree topologies that enter the database with each published study. As expected, small collections of trees decrease connectivity as new trees are added, while large collections of trees increase connectivity. However, the inflection point is surprisingly low: after about 600 trees the network suddenly jumps to a higher level of coherence. More stringent definitions of 'neighbour' greatly delay the threshold whence a database achieves sufficient maturity for a coherent network to emerge. However, more stringent definitions of 'neighbour' would also likely show improved focus in data mining.
- Sanderson, M. J. (2003). Molecular data from 27 proteins do not support a Precambrian origin of land plants. American Journal of Botany, 90(6), 954-956. PMID: 21659192; Abstract: Heckman et al. (Science 293: 1129-1133) used sequences obtained from GenBank to infer divergence times in fungi and green plants. They estimated that the crown group of land plants originated in the Precambrian, at 703 ± 45 mya, a date much older than dates implied by the fossils, which are no older than about 450 mya. This paper presents an analysis of an entirely different set of sequence data from 27 plastid protein-coding genes in 10 land plants and a green algal outgroup. It uses a calibration point closer to the origin of land plants and inference methods that do not assume a molecular clock. This leads to estimates ranging from 425 to 490 mya, which brackets the age suggested by the fossil record. Possible explanations for the differing conclusions in the two studies include differences in calibration points and use of single-copy plastid genes rather than nuclear gene families.
- Sanderson, M. J. (2003). r8s: Inferring absolute rates of molecular evolution and divergence times in the absence of a molecular clock. Bioinformatics, 19(2), 301-302. PMID: 12538260; Abstract: Estimating divergence times and rates of substitution from sequence data is plagued by the problem of rate variation between lineages. R8s version 1.5 is a program which uses parametric, nonparametric and semiparametric methods to relax the assumption of constant rates of evolution to obtain better estimates of rates and times. Unlike most programs for rate inference or phylogenetics, r8s permits users to convert results to absolute rates and ages by constraining one or more node times to be fixed, minimum or maximum ages (using fossil or other evidence). Version 1.5 uses truncated Newton nonlinear optimization code with bound constraints, offering superior performance over previous versions.
- Sanderson, M. J., & Driskell, A. C. (2003). The challenge of constructing large phylogenetic trees. Trends in Plant Science, 8(8), 374-379. PMID: 12927970; Abstract: The amount of sequence data available to reconstruct the evolutionary history of genes and species has increased 20-fold in the past decade. Consequently the size of phylogenetic analyses has grown as well, and phylogenetic methods, algorithms and their implementations have struggled to keep pace. Computational and other challenges raised by this burgeoning database emerge at several stages of analysis, from the optimal assembly of large data matrices from sequence databases, to the efficient construction of trees from these large matrices and the piece-wise assembly of 'supertrees' from those trees in turn. A final challenge is posed by the difficulty of visualizing and making inferences from trees that might soon routinely contain thousands of species.
- Sanderson, M. J., Driskell, A. C., Ree, R. H., Eulenstein, O., & Langley, S. (2003). Obtaining maximal concatenated phylogenetic data sets from large sequence databases. Molecular Biology and Evolution, 20(7), 1036-1042. PMID: 12777519; Abstract: To improve the accuracy of tree reconstruction, phylogeneticists are extracting increasingly large multigene data sets from sequence databases. Determining whether a database contains at least k genes sampled from at least m species is an NP-complete problem. However, the skewed distribution of sequences in these databases permits all such data sets to be obtained in reasonable computing times even for large numbers of sequences. We developed an exact algorithm for obtaining the largest multigene data sets from a collection of sequences. The algorithm was then tested on a set of 100,000 protein sequences of green plants and used to identify the largest multigene ortholog data sets having at least 3 genes and 6 species. The distribution of sizes of these data sets forms a hollow curve, and the largest are surprisingly small, ranging from 62 genes by 6 species, to 3 genes by 65 species, with more symmetrical data sets of around 15 taxa by 15 genes. These upper bounds to sequence concatenation have important implications for building the tree of life from large sequence databases.
- Hu, J., Lavin, M., Wojciechowski, M. F., Sanderson, M. J., & Davis, J. I. (2002). Phylogenetic analysis of nuclear ribosomal ITS/5.8S sequences in the tribe Millettieae (Fabaceae): Poecilanthe-Cyclolobium, the core Millettieae, and the Callerya group. Systematic Botany, 27(4), 722-733. Abstract: The taxonomic composition of three principal and distantly related groups of the former tribe Millettieae, which were first identified from nuclear phytochrome and chloroplast trnK/matK sequences, was more extensively investigated with a phylogenetic analysis of nuclear ribosomal DNA ITS/5.8S sequences. The first of these groups includes the neotropical genera Poecilanthe and Cyclolobium, which are resolved as basal lineages in a clade that otherwise includes the neotropical genera Brongniartia and Harpalyce and the Australian Templetonia and Hovea. The second group includes the large millettioid genera, Millettia, Lonchocarpus, Derris, and Tephrosia, which are referred to as the "core Millettieae" group. Phylogenetic analysis of nuclear ribosomal DNA ITS/5.8S sequences reveals that Millettia is polyphyletic, and that subclades of the core Millettieae group, such as the New World Lonchocarpus or the pantropical Tephrosia and segregate genera (e.g., Chadsia and Mundulea), each form well supported monophyletic subgroups. The third lineage includes the genera Afgekia, Callerya, and Wisteria. These genera are resolved as a basal subclade in the inverted-repeat-lacking clade, which is a large legume group that includes the many well known temperate and herbaceous legumes, such as Astragalus, Medicago and Pisum, but not any other Millettieae.
- Magallón, S., & Sanderson, M. J. (2002). Relationships among seed plants inferred from highly conserved genes: Sorting conflicting phylogenetic signals among ancient lineages. American Journal of Botany, 89(12), 1991-2006. PMID: 21665628; Abstract: Phylogenetic studies based on different types and treatment of data provide substantially conflicting hypotheses of relationships among seed plants. We conducted phylogenetic analyses of sequences of two highly conserved chloroplast genes, psaA and psbB, for a comprehensive taxonomic sample of seed plants and land plants. Parsimony analyses of two different codon position partitions resulted in well-supported, but significantly conflicting, phylogenetic trees. First and second codon positions place angiosperms and gymnosperms as sister clades and Gnetales as sister to Pinaceae. Third positions place Gnetales as sister to all other seed plants. Maximum likelihood trees for the two partitions are also in conflict. Relationships among the main seed plant clades according to first and second positions are similar to those found in parsimony analysis for the same data, but the third position maximum likelihood tree is substantially different from the corresponding parsimony tree, although it agrees partially with the first and second position trees in placing Gnetales as the sister group of Pinaceae. Our results document high rate heterogeneity among lineages, which, together with the greater average rate of substitution for third positions, may reduce phylogenetic signal due to long-branch attraction in parsimony reconstructions. Whereas resolution of relationships among major seed plant clades remains pending, this study provides increased support for relationships within major seed plant clades.
- Sanderson, M. J. (2002). Estimating absolute rates of molecular evolution and divergence times: A penalized likelihood approach. Molecular Biology and Evolution, 19(1), 101-109. PMID: 11752195; Abstract: Rates of molecular evolution vary widely between lineages, but quantification of how rates change has proven difficult. Recently proposed estimation procedures have mainly adopted highly parametric approaches that model rate evolution explicitly. In this study, a semiparametric smoothing method is developed using penalized likelihood. A saturated model in which every lineage has a separate rate is combined with a roughness penalty that discourages rates from varying too much across a phylogeny. A data-driven cross-validation criterion is then used to determine an optimal level of smoothing. This criterion is based on an estimate of the average prediction error associated with pruning lineages from the tree. The methods are applied to three data sets of six genes across a sample of land plants. Optimally smoothed estimates of absolute rates entailed 2- to 10-fold variation across lineages.
- Sanderson, M. J., & Shaffer, H. B. (2002). Troubleshooting molecular phylogenetic analyses. Annual Review of Ecology and Systematics, 33, 49-72. Abstract: The number, size, and scope of phylogenetic analyses using molecular data has increased steadily in recent years. This has simultaneously led to a dramatic improvement in our understanding of phylogenetic relationships and a better appreciation for an array of methodological problems that continue to hinder progress in phylogenetic studies of certain data sets and/or particular parts of the tree of life. This review focuses on several persistent problems, including rooting, conflict among data sets, weak support in trees, strong but evidently incorrect support, and the computational issues arising when methods are applied to the large data sets that are becoming increasingly commonplace. We frame each of these issues as a specific problem to be overcome, review the relevant theoretical and empirical literature, and suggest solutions, or at least strategies, for further investigation of the issues involved.
- Bininda-Emonds, O., Brady, S. G., Kim, J., & Sanderson, M. J. (2001). Scaling of accuracy in extremely large phylogenetic trees. Pacific Symposium on Biocomputing, 547-558. PMID: 11262972; Abstract: The accuracy of phylogenetic inference was examined in simulated data sets up to nearly 10,000 taxa, the size of the largest set of homologous genes in existing molecular sequence databases. Even with a simple search algorithm (maximum parsimony without branch swapping), the number of characters needed to estimate 80% of a tree correctly can scale remarkably well at optimal substitution rates (on the order of log N, where N is the number of taxa). In other words, the number of taxa in an analysis can be doubled and only an arithmetic increase in the number of characters is required to maintain the same level of accuracy. Even substitution rates that are much higher than normally used in phylogenetic studies did not affect the scaling too adversely. However, scaling is usually worse than log N for more stringent levels of accuracy. Moreover, errors are not distributed randomly throughout the tree. Shallow nodes are remarkably easy to reconstruct and display favourable log-linear scaling. The deepest nodes are extremely difficult to reconstruct accurately, even with branch swapping, and the scaling is poor. Therefore, the strategy of sequencing large numbers of homologous genes may not always provide global solutions to extreme phylogenetic problems and alternative strategies may be required.
- Lavin, M., Wojciechowski, M. F., Richman, A., Rotella, J., Sanderson, M. J., & Matos, A. B. (2001). Identifying Tertiary radiations of Fabaceae in the Greater Antilles: Alternatives to cladistic vicariance analysis. International Journal of Plant Sciences, 162(6 SUPPL.), S53-S76. Abstract: The fossil record shows that the legume family was abundant and taxonomically diverse in Early Tertiary tropical deciduous forests of North America. Today, woody members of this family are almost nonexistent in temperate deciduous forests. This former North American legume diversity now lies in the Tropics, including the Greater Antilles. To show the Antillean refugia, we detail a phylogenetic and biogeographic analysis of two legume groups, the Ormocarpum and Robinia clades, which have either a Tertiary fossil record in North America or a sister clade with such a fossil record. A combined analysis of molecular and nonmolecular data is used for the cladistic vicariance approaches, while an exhaustively sampled data set of nrDNA ITS/5.8S sequences is used for the molecular biogeographic analysis. Results from component, three-area-statements, and Brooks parsimony analysis are equivocal in suggesting an influence of Tertiary history on the distribution of the woody genera Pictetia (Ormocarpum clade) and Poitea (Robinia clade), two of the most speciose endemic legume radiations in the Greater Antilles. Alternatively, nucleotide diversity, evolutionary rates, and coalescent analyses of molecular phylogenies all suggest a Tertiary diversification of Pictetia and Poitea. The results are corroborated by a regression analysis that implicates both age of island biota and island area in accurately predicting numbers of endemic legume taxa. These findings, combined with the legume fossil record, suggest that both Pictetia and Poitea stem from Tertiary North American boreotropical groups. J. A. Wolfe's hypothesis that the Greater Antilles harbor boreotropical relicts is supported.
- Magallón, S., & Sanderson, M. J. (2001). Absolute diversification rates in angiosperm clades. Evolution, 55(9), 1762-1780. PMID: 11681732; Abstract: The extraordinary contemporary species richness and ecological predominance of flowering plants (angiosperms) are even more remarkable when considering the relatively recent onset of their evolutionary diversification. We examine the evolutionary diversification of angiosperms and the observed differential distribution of species in angiosperm clades by estimating the rate of diversification for angiosperms as a whole and for a large set of angiosperm clades. We also identify angiosperm clades with a standing diversity that is either much higher or lower than expected, given the estimated background diversification rate. Recognition of angiosperm clades, the phylogenetic relationships among them, and their taxonomic composition are based on an empirical compilation of primary phylogenetic studies. By making an integrative and critical use of the paleobotanical record, we obtain reasonably secure approximations for the age of a large set of angiosperm clades. Diversification was modeled as a stochastic, time-homogeneous birth-and-death process that depends on the diversification rate (r) and the relative extinction rate (ε). A statistical analysis of the birth and death process was then used to obtain 95% confidence intervals for the expected number of species through time in a clade that diversifies at a rate equal to that of angiosperms as a whole. Confidence intervals were obtained for stem group and for crown group ages in the absence of extinction (ε = 0.0) and under a high relative extinction rate (ε = 0.9). The standing diversity of angiosperm clades was then compared to expected species diversity according to the background rate of diversification, and, depending on their placement with respect to the calculated confidence intervals, exceedingly species-rich or exceedingly species-poor clades were identified. The rate of diversification for angiosperms as a whole ranges from 0.077 (ε = 0.9) to 0.089 (ε = 0.0) net speciation events per million years. Ten clades fall above the confidence intervals of expected species diversity, and 13 clades were found to be unexpectedly species poor. The phylogenetic distribution of clades with an exceedingly high number of species suggests that traits that confer high rates of diversification evolved independently in different instances and do not characterize the angiosperms as a whole.
- Bininda-Emonds, O. R., & Sanderson, M. J. (2001). Assessment of the accuracy of matrix representation with parsimony analysis supertree construction. Systematic Biology, 50(4), 565-579. PMID: 12116654; Abstract: Despite the growing popularity of supertree construction for combining phylogenetic information to produce more inclusive phylogenies, large-scale performance testing of this method has not been done. Through simulation, we tested the accuracy of the most widely used supertree method, matrix representation with parsimony analysis (MRP), with respect to a (maximum parsimony) total evidence solution and a known model tree. When source trees overlap completely, MRP provided a reasonable approximation of the total evidence tree; agreement was usually >85%. Performance improved slightly when using smaller, more numerous, or more congruent source trees, and especially when elements were weighted in proportion to the bootstrap frequencies of the nodes they represented on each source tree ("weighted MRP"). Although total evidence always estimated the model tree slightly better than nonweighted MRP methods, weighted MRP in turn usually out-performed total evidence slightly. When source studies were even moderately nonoverlapping (i.e., sharing only three-quarters of the taxa), the high proportion of missing data caused a loss in resolution that severely degraded the performance for all methods, including total evidence. In such cases, even combining more trees, which had positive effects elsewhere, did not improve accuracy. Instead, "seeding" the supertree or total evidence analyses with a single largely complete study improved performance substantially. This finding could be an important strategy for any studies that seek to combine phylogenetic information. Overall, our results suggest that MRP supertree construction provides a reasonable approximation of a total evidence solution and that weighted MRP should be used whenever possible.
- Hu, J., Lavin, M., Wojciechowski, M. F., & Sanderson, M. J. (2000). Phylogenetic systematics of the tribe Millettieae (Leguminosae) based on chloroplast trnK/matK sequences and its implications for evolutionary patterns in Papilionoideae. American Journal of Botany, 87(3), 418-430. PMID: 10719003; Abstract: Phylogenetic relationships in the tribe Millettieae and allies in the subfamily Papilionoideae (Leguminosae) were reconstructed from chloroplast trnK/matK sequences. Sixty-two accessions representing 57 traditionally recognized genera of Papilionoideae were sampled, including 27 samples from Millettieae. Phylogenies were constructed using maximum parsimony and are well resolved and supported by high bootstrap values. A well-supported 'core Millettieae' clade is recognized, comprising the four large genera Millettia, Lonchocarpus, Derris, and Tephrosia. Several other small genera of Millettieae are not in the core Millettieae clade. Platycyamus is grouped with Phaseoleae (in part). Ostryocarpus, Austrosteenisia, and Dalbergiella are neither in the core Millettieae or Phaseoleae clade. These taxa, along with core Millettieae and Phaseoleae, form a monophyletic sister group to Indigofereae. Cyclolobium and Poecilanthe are close to Brongniartieae. Callerya and Wisteria belong to a large clade that includes all the legumes that lack the inverted repeat in their chloroplast genome, which confirms previous rbcL and phytochrome gene family phylogenies. The evolutionary history of four characters was examined in Millettieae and allies: the presence of canavanine, inflorescence types, the dehiscence of pods, and the presence of winged pods. trnK/matK sequence analysis suggests that the presence of a pseudoraceme or pseudopanicle and the accumulation of nonprotein amino acids are phylogenetically informative for Millettieae and allies with only a few exceptions.
- Sanderson, M. J., & Kim, J. (2000). Parametric phylogenetics? Systematic Biology, 49(4), 817-829. PMID: 12116443.
- Sanderson, M. J., & Wojciechowski, M. F. (2000). Improved bootstrap confidence limits in large-scale phylogenies, with an example from Neo-Astragalus (Leguminosae). Systematic Biology, 49(4), 671-685. PMID: 12116433; Abstract: Phylogenetic analyses of large data sets pose special challenges, including the apparent tendency for the bootstrap support for a clade to decline with increased taxon sampling of that clade. We document this decline in data sets with increasing numbers of taxa in Astragalus, the most species-rich angiosperm genus. Support for one subclade, Neo-Astragalus, declined monotonically with increased sampling of taxa inside Neo-Astragalus, irrespective of whether parsimony or neighbor-joining methods were used or of which particular heuristic search algorithm was used (although more stringent algorithms tended to yield higher support). Three possible explanations for this decline were examined, including (1) mistaken assignment of the most recent common ancestor of the taxon sample (and its bootstrap support) with the most recent common ancestor of the clade from which it was sampled; (2) computational limitations of heuristic search strategies; and (3) statistical bias in bootstrap proportions, especially that from random homoplasy distributed among taxa. The best explanation appears to be (3), although computational shortcomings (2) may explain some of the problem. The bootstrap proportion, as currently used in phylogenetic analysis, does not accurately capture the classical notion of confidence assessments on the null hypothesis of nonmonophyly, especially in large data sets. More accurate assessments of confidence as type 1 error levels (relying on iterated bootstrap methods) remove most of the monotonic decline in confidence with increasing numbers of taxa.
- Sanderson, M. J., Wojciechowski, M. F., Hu, J., Khan, T. S., & Brady, S. G. (2000). Error, bias, and long-branch attraction in data for two chloroplast photosystem genes in seed plants. Molecular Biology and Evolution, 17(5), 782-797. PMID: 10779539; Abstract: Sequences of two chloroplast photosystem genes, psaA and psbB, together comprising about 3,500 bp, were obtained for all five major groups of extant seed plants and several outgroups among other vascular plants. Strongly supported, but significantly conflicting, phylogenetic signals were obtained in parsimony analyses from partitions of the data into first and second codon positions versus third positions. In the former, both genes agreed on a monophyletic gymnosperms, with Gnetales closely related to certain conifers. In the latter, Gnetales are inferred to be the sister group of all other seed plants, with gymnosperms paraphyletic. None of the data supported the modern 'anthophyte hypothesis,' which places Gnetales as the sister group of flowering plants. A series of simulation studies were undertaken to examine the error rate for parsimony inference. Three kinds of errors were examined: random error, systematic bias (both properties of finite data sets), and statistical inconsistency owing to long-branch attraction (an asymptotic property). Parsimony reconstructions were extremely biased for third-position data for psbB. Regardless of the true underlying tree, a tree in which Gnetales are sister to all other seed plants was likely to be reconstructed for these data. None of the combinations of genes or partitions permits the anthophyte tree to be reconstructed with high probability. Simulations of progressively larger data sets indicate the existence of long-branch attraction (statistical inconsistency) for third-position psbB data if either the anthophyte tree or the gymnosperm tree is correct. This is also true for the anthophyte tree using either psaA third positions or psbB first and second positions. A factor contributing to bias and inconsistency is extremely short branches at the base of the seed plant radiation, coupled with extremely high rates in Gnetales and nonseed plant outgroups.
- Wagstaff, S. J., Heenan, P. B., & Sanderson, M. J. (1999). Classification, origins, and patterns of diversification in New Zealand Carmichaelinae (Fabaceae). American Journal of Botany, 86(9), 1346-1356. PMID: 10487821; Abstract: Analysis of ITS sequences provides support for a clade that includes Carmichaelia, Clianthus, Montigena, and Swainsona. We provide a node-based definition and recommend that this clade be called Carmichaelinae. Results suggest that Carmichaelinae are derived from northern hemisphere Astragalinae. The clade has extensively radiated in Australia and two independent lineages have diversified in New Zealand. The New Zealand lineages differ in species richness. One lineage consists of 24 species placed in Carmichaelia and Clianthus, while the other corresponds to the monotypic genus Montigena. The pattern of relationships inferred from ITS sequences suggests that the New Zealand radiation was recent and possibly accompanied episodes of mountain-building and glaciation.
- Wojciechowski, M. F., Sanderson, M. J., & Hu, J. (1999). Evidence on the monophyly of Astragalus (Fabaceae) and its major subgroups based on nuclear ribosomal DNA ITS and chloroplast DNA trnL intron data. Systematic Botany, 24(3), 409-437. Abstract: Phylogenetic relationships among 115 species representing the legume genus Astragalus and 12 related genera were inferred from an analysis of nucleotide sequence variation in the internal transcribed spacers and 5.8S gene of nuclear ribosomal DNA. For a subset of these taxa, the ITS data were supplemented by sequences from the chloroplast trnL intron. Phylogenies derived from maximum parsimony and neighbor-joining analyses of sequence and insertion/deletion characters all suggest that the vast majority of Astragalus is monophyletic (with the exception of 'outlier' species). All New World Astragalus species with aneuploid chromosome numbers (n = 11-15) form a monophyletic group ('Neo-Astragalus'), which now includes the Mediterranean aneuploid Astragalus echinatus. Other Old World aneuploid species are not closely related to Neo-Astragalus, but rather are found among Old World euploid (n = 8, 16) groups. Similarly, the relatively few North American species with euploid numbers are not the closest relatives to Neo-Astragalus but are dispersed among divergent Old World groups that include both aneuploid and euploid species. The historically allied genus Oxytropis is not nested within Astragalus, but forms a separate clade within the larger 'Astragalean' clade. The proposed segregate genera Astracantha (Eurasian) and Orophaca (North American) are clearly nested within Astragalus s. str. South American species of Astragalus are nested within Neo-Astragalus and comprise at least two independently derived clades (along with their close North American relatives), as previously suggested by morphology. Parsimony reconstructions of characters that have been used in the traditional subgeneric taxonomy of the genus were examined and show high levels of homoplasy. Preliminary estimates of the absolute rate of species diversification in Astragalus suggest it may be higher than in some other, often cited, continental or insular adaptive radiations in angiosperms.
- Baldwin, B. G., & Sanderson, M. J. (1998). Age and rate of diversification of the Hawaiian silversword alliance (Compositae). Proceedings of the National Academy of Sciences of the United States of America, 95(16), 9402-9406. PMID: 9689092; PMCID: PMC21350; Abstract: Comparisons between insular and continental radiations have been hindered by a lack of reliable estimates of absolute diversification rates in island lineages. We took advantage of rate-constant rDNA sequence evolution and an 'external' calibration using paleoclimatic and fossil data to determine the maximum age and minimum diversification rate of the Hawaiian silversword alliance (Compositae), a textbook example of insular adaptive radiation in plants. Our maximum-age estimate of 5.2 ± 0.8 million years ago for the most recent common ancestor of the silversword alliance is much younger than ages calculated by other means for the Hawaiian drosophilids, lobelioids, and honeycreepers and falls approximately within the history of the modern high islands (≤5.1 ± 0.2 million years ago). By using a statistically efficient estimator that reduces error variance by incorporating clock-based estimates of divergence times, a minimum diversification rate for the silversword alliance was estimated to be 0.56 ± 0.17 species per million years. This exceeds average rates of more ancient continental radiations and is comparable to peak rates in taxa with sufficiently rich fossil records that changes in diversification rate can be reconstructed.
- Sanderson, M. J., Purvis, A., & Henze, C. (1998). Phylogenetic supertrees: Assembling the trees of life. Trends in Ecology and Evolution, 13(3), 105-109. PMID: 21238221; Abstract: Systematists and comparative biologists commonly want to make statements about relationships among taxa that have never been collectively included in any single phylogenetic analysis. Construction of phylogenetic 'supertrees' provides one solution. Supertrees are estimates of phylogeny assembled from sets of smaller estimates (source trees) sharing some but not necessarily all their taxa in common. If certain conditions are met, supertrees can retain all or most of the information from the source trees and also make novel statements about relationships of taxa that do not co-occur on any one source tree. Supertrees have commonly been constructed using subjective and informal approaches, but several explicit approaches have recently been proposed.
- Sanderson, M. J. (1997). A nonparametric approach to estimating divergence times in the absence of rate constancy. Molecular Biology and Evolution, 14(12), 1218-1231. Abstract: A new method for estimating divergence times when evolutionary rates are variable across lineages is proposed. The method, called nonparametric rate smoothing (NPRS), relies on minimization of ancestor-descendant local rate changes and is motivated by the likelihood that evolutionary rates are autocorrelated in time. Fossil information pertaining to minimum and/or maximum ages of nodes in a phylogeny is incorporated into the algorithms by constrained optimization techniques. The accuracy of NPRS was examined by comparison to a clock-based maximum-likelihood method in computer simulations. NPRS provides more accurate estimates of divergence times when (1) sequence lengths are sufficiently long, (2) rates are truly nonclocklike, and (3) rates are moderately to highly autocorrelated in time. The algorithms were applied to estimate divergence times in seed plants based on data from the chloroplast rbcL gene. Both constrained and unconstrained NPRS methods tended to produce divergence time estimates more consistent with paleobotanical evidence than did clock-based estimates.
- Baldwin, B. G., Sanderson, M. J., Porter, J. M., Wojciechowski, M. F., Campbell, C. S., & Donoghue, M. J. (1996). Erratum: The ITS region of nuclear ribosomal DNA: A valuable source of evidence on angiosperm phylogeny (Annals of the Missouri Botanical Garden (1995) 82 (247-277)). Annals of the Missouri Botanical Garden, 83(1), 151-.
- Sanderson, M. J. (1996). How many taxa must be sampled to identify the root node of a large clade? Systematic Biology, 45(2), 168-173. Abstract: The importance of choice of taxa in phylogenetic analysis has been explored mainly with reference to its effect on the accuracy of tree estimation. Taxon sampling can also introduce other kinds of errors. Even if the sampled topology agrees with the true topology, it may not include the true root node of a clade, a node that is of interest for many reasons. Using a simple Yule model for the diversification process, the probability of identifying this node is derived under random sampling of taxa. For large clades, the minimum sample size needed to be 95% confident of identifying the root node is approximately 40 and is independent of the size of the clade. If rates of diversification differ in the two sister groups descended from the root node, the minimum sample size needed increases markedly. If these two sister groups are so different in diversity that a Yule model would be rejected by conventional diversification tests, then the necessary sample size is an order of magnitude greater than when diversification is homogeneous.
- Sanderson, M. J., & Donoghue, M. J. (1996). Reconstructing shifts in diversification rates on phylogenetic trees. Trends in Ecology and Evolution, 11(1), 15-20. PMID: 21237745. Abstract: Few issues in evolutionary biology have received as much attention over the years or have generated as much controversy as those involving evolutionary rates. One unresolved issue is whether or not shifts in speciation and/or extinction rates are closely tied to the origin of 'key' innovations in evolution. This discussion has long been dominated by 'time-based' methods using data from the fossil record. Recently, however, attention has shifted to 'tree-based' methods, in which time, if it plays any role at all, is incorporated secondarily, usually based on molecular data. Tests of hypotheses about key innovations do require information about phylogenetic relationships, and some of these tests can be implemented without any information about time. However, every effort should be made to obtain information about time, which greatly increases the power of such tests. © 1996, Elsevier Science Ltd.
- Sanderson, M. J., & Wojciechowski, M. F. (1996). Diversification rates in a temperate legume clade: Are there "so many species" of Astragalus (Fabaceae)? American Journal of Botany, 83(11), 1488-1502. Abstract: Astragalus, the largest genus of flowering plants, contains upwards of 2500 species. Explanations for this exceptional species diversity have pointed to unusual population structure or modes of speciation. Surprisingly, however, three different statistical analyses indicate that diversification rates in Astragalus are not exceptionally high compared to its closest relatives. Instead, rates are high throughout the "Astragalean clade," a much broader radiation distributed throughout the temperate zone. The increase in diversification rate is associated with the origin and divergence of this clade from common ancestors of it and several much less diverse and more narrowly distributed Asian genera. This suggests that causal factors in the shift toward higher rates of diversification must be due not to factors unique to Astragalus, but to characteristics common to the entire Astragalean clade. However, this larger clade has never been circumscribed in classifications based on morphological data. This raises the possibility that the causes of increased diversification may not be due to morphological innovation, but may instead be related to ecological factors or cryptic physiological or biochemical features.
- Sanderson, M. J. (1995). Objections to bootstrapping phylogenies: A critique. Systematic Biology, 44(3), 299-320. Abstract: Despite widespread use, the bootstrap remains a controversial method for assessing confidence limits in phylogenies. Opposition to its use has centered on a small set of basic philosophical and statistical objections that have largely gone unanswered by advocates of statistical approaches to phylogeny reconstruction. The level of generality of these objections varies greatly, however. Some of the objections are merely technical, involving problems that are found in almost all statistical tests, such as bias in small data sets. Other objections are really associated not so much with a rejection of the bootstrap but with the rejection of statistical methods in phylogeny reconstruction, which resurrects an old debate. The most relevant aspects of this debate revolve around the issue of whether or not an unknown parameter, such as a tree, can have probabilities (confidence limits) associated with it. The relevant statistical aspects are reviewed, but because this issue remains controversial within statistical theory, it is unreasonable to expect it to be anything else in phylogenetic systematics. An area of common ground between statistical and nonstatistical approaches emerges in the use of statistical likelihood as a measure of support for phylogenetic hypotheses. This common ground requires the abandonment of classical notions of confidence limits by statistically oriented systematists and the acceptance of probabilistic models and likelihood by opponents of statistical methods. There remains a small set of objections directly germane to bootstrapping phylogenies per se. These objections involve issues of random sampling and whether or not character data are independent and identically distributed (IID).
Nonrandom-sample bootstrapping is discussed, as are sample designs that impose the IID assumption on characters regardless of evolutionary nonindependence and nonidentical distribution of those data. Systematists wishing to use the bootstrap have an alternative to making explicit and rather strong evolutionary assumptions; they can consider the issue of character sampling designs much more carefully.
- Sanderson, M. J. (1994). Reconstructing the history of evolutionary processes using maximum likelihood. Society of General Physiologists series, 49, 13-26. PMID: 7939892.
- Sanderson, M. J., & Donoghue, M. J. (1994). Shifts in diversification rate with the origin of angiosperms. Science, 264(5165), 1590-1593. PMID: 17769604. Abstract: The evolutionary success of flowering plants has been attributed to key innovations that originated at the base of that clade. Maximum likelihood methods were used to assess whether branching rate increases were correlated with the origin of these traits. Four hypotheses for the basal relationships of angiosperms were examined by methods that are robust to uncertainty about the timing of internal branch points. Recent hypotheses based on molecular evidence, or on a combination of molecular and morphological characters, imply that large increases in branching rate did not occur until after the putative key innovations of angiosperms had evolved.
- Sanderson, M. J., & Bharathan, G. (1993). Does cladistic information affect inferences about branching rates? Systematic Biology, 42(1), 1-17. Abstract: Despite long-standing interest in reconstructing rates of branching in the history of groups and recent attempts to use cladistic information to make inferences about such rates, the conditions under which genealogy affects rate reconstruction have not been demonstrated because studies of branching rates rely on methods that either ignore genealogy (and focus on changes in species richness through time) or do not reconstruct absolute rates. We consider stochastic and deterministic approaches that associate branching rates with branches of a phylogeny, allowing the influence of genealogy to be directly assessed. Both approaches assume that the phylogeny is known. The stochastic approach uses maximum likelihood to estimate one or more parameters of a Yule model in which individual lineages branch according to a Poisson process. In a model with only one rate parameter over the entire tree, genealogy affects the estimation of rate whenever some taxa are not extant (i.e., are known only from fossils) or are direct descendants of fossils of known age. In more complex multiparameter models, the estimated rates always depend on genealogy regardless of when the taxa are observed in time. The deterministic model uses nonlinear optimization methods to reconstruct local branching rates in a tree. This procedure minimizes the transformation in local rate required by the data on topology and times of occurrence. A uniform tree need not entail any transformation in local rate, but a nonuniform tree does. Genealogy therefore affects reconstructed branching rates in both deterministic and stochastic approaches. The approaches are illustrated using Vrba's phylogeny of fossil and extant African bovids.
- Sanderson, M. J., Baldwin, B. G., Bharathan, G., Campbell, C. S., Dohlen, C. V., Ferguson, D., Porter, J. M., Wojciechowski, M. F., & Donoghue, M. J. (1993). The growth of phylogenetic information and the need for a phylogenetic data base. Systematic Biology, 42(4), 562-568.
- Sanderson, M. J., & Doyle, J. J. (1992). Reconstruction of organismal and gene phylogenies from data on multigene families: Concerted evolution, homoplasy, and confidence. Systematic Biology, 41(1), 4-17. Abstract: The reliability of phylogenies reconstructed from data on multigene families is investigated via simulation. The evolutionary scenario used is a character-based model of a two-gene family in four species in which clocklike divergence is postulated but neither convergence nor reversal is allowed except as a result of recombination and gene conversion. Thus, any homoplasy emerging from parsimony reconstructions from the simulated data matrices can be attributed to concerted evolution. The probabilities of correctly reconstructing two standard trees are estimated by replicate runs of the simulation. One standard tree (the OP or "orthology/paralogy" tree) reflects the true gene genealogy in the absence of concerted evolution; the other (the CE or "concerted evolution" tree) depicts gene relationships under complete homogenization of the gene family. The probability of correct reconstruction of the OP tree declines quickly as concerted evolution increases, but above an intermediate level of concerted evolution the probability of correctly inferring the CE tree increases rapidly. Trees similar but not identical to the correct trees can be reconstructed above or below the critical intermediate level of concerted evolution. Levels of homoplasy and numbers of equally parsimonious minimal trees are maximized, and bootstrap confidence levels are minimized, near this intermediate level of concerted evolution. When reconstructing the correct gene tree is the goal, both consistency indices and bootstrap levels will show misleadingly high values when concerted evolution is high.
However, because the correct species tree can be inferred from either the OP or CE tree (in the absence of homoplasy from sources other than concerted evolution), these same measures correlate well with fidelity of reconstructing the species tree.
- Sanderson, M. J. (1991). In search of homoplastic tendencies: statistical inference of topological patterns in homoplasy. Evolution, 45(2), 351-358. Abstract: Proposes statistical tests that examine the topological distribution of homoplasy within characters in phylogenies. They test whether character changes are localized (confined to some subtree), or clustered (occur in proximity to each other), relative to 2 null models of character evolution. -from Author
- Sanderson, M. J. (1989). Confidence limits on phylogenies: the bootstrap revisited. Cladistics, 5(2), 113-129. Abstract: The bootstrap, a non-parametric statistical analysis, can be used to assess confidence limits on phylogenies. The method most widely used tests the monophyly of individual clades. This paper proposes additional applications of the bootstrap which provide useful information about phylogeny even when many clades are found not to be supported with confidence (as often occurs in practice). -from Author
Reviews
- Donoghue, M. J., & Sanderson, M. J. (2015). Confluence, synnovation, and depauperons in plant diversification (pp. 260-274).
Others
- Sanderson, M. J. (2015, JUL). Back to the past: a new take on the timing of flowering plant diversification. NEW PHYTOLOGIST.