Claire Darnell McWhite
- Assistant Professor, Molecular and Cellular Biology
- Member of the Graduate Faculty
Contact
Biography
B.S. Biochemistry and Cell Biology, Rice University, 2014
Ph.D. Cell and Molecular Biology, The University of Texas at Austin, 2020
Lewis-Sigler Scholar, Princeton University, 2020-2024
Degrees
- Ph.D. Cell and Molecular Biology
- The University of Texas at Austin, Austin, Texas, United States
- Conservation and comparison of protein interactions across evolution
- B.S. Biochemistry and Cell Biology
- Rice University, Houston, Texas, United States
Work Experience
- Princeton University, Princeton, New Jersey (2020 - 2024)
Awards
- Big Ideas Challenge Award
- The University of Arizona, Spring 2025
Interests
Teaching
Machine Learning, Programming, Data Visualization, Systems Biology, Computational Biology
Research
Protein Function, Receptor Specificity, Protein Language Models, Protein Functional Divergence
Courses
2025-26 Courses
- Research, MCB 900 (Spring 2026)
- Big Data Molecular Biology, MCB 447 (Fall 2025)
- Big Data Molecular Biology, MCB 547 (Fall 2025)
- Directed Research, ABBS 792 (Fall 2025)
- Research, MCB 900 (Fall 2025)
2024-25 Courses
- Directed Research, ABBS 792 (Spring 2025)
- Directed Research, ABBS 792 (Fall 2024)
Scholarly Contributions
Journals/Publications
- Dang, V., Voigt, B., Yang, D., Hoogerbrugge, G., Lee, M., Cox, R. M., Papoulas, O., McWhite, C. D., Pradeep, R., Leggere, J. C., Neely, B. A., Gray, R. S., & Marcotte, E. M. (2025). VerteBrain reveals novel neural and non-neural protein assemblies conserved across vertebrate evolution. bioRxiv: the preprint server for biology.
  Protein-protein interactions underlie core brain functions, including neurotransmitter release, receptor activation, and intracellular signaling essential for learning, memory, and cognition. Here, we systematically map conserved brain protein interactions across five vertebrate species (rabbit, chicken, dolphin, pig, and mouse) using co-fractionation and immunoprecipitation mass spectrometry. From 2,197 biochemical fractions, we identify over 81,000 high-confidence interactions among 6,108 conserved proteins. This interaction map (VerteBrain) reveals both regulatory and structural complexes, including extensive synaptonemal protein associations likely involved in inter-neuronal coordination. Conservation across species underscores essential roles in neuronal and glial function, as well as in additional tissues for more widely expressed complexes. The VerteBrain dataset uncovers candidate disease mechanisms, including roles for ARHGEF1 in short stature syndromes, synaptic vesicle trafficking complexes in epilepsy, and RELCH in congenital deafness. VerteBrain provides a publicly accessible framework for investigating brain protein interactions and their relevance to human neurological disorders.
- Majidian, S., Hadziahmetovic, A., Langschied, F., Pascarelli, S., Prieto-Baños, S., Rojas-Vargas, J., , Q. f., Braun, E. L., Dessimoz, C., Diallo, . B., Durand, D., Fang, G., Gabaldón, T., Glover, N., Liberles, D. A., McWhite, C., Sonnhammer, E. L., Thomas, P. D., Ouangraoua, A., & Julca, I. (2025). Quest for Orthologs in the era of Data Deluge and AI: Challenges and Innovations in Orthology Prediction and Data Integration. Journal of Molecular Evolution, 93(6), 702-719.
  The rapid advancement of DNA sequencing technologies and computational algorithms has led to an unprecedented surge in genomic data, driven by several large-scale sequencing projects worldwide. Orthology plays a crucial role in understanding evolutionary patterns of genes and their functions. At the last Quest for Orthologs meeting (Montréal, Canada, 2024), we discussed recent advances in orthology inference, with a focus on the impact of artificial intelligence (AI), protein structures, RNA splicing isoforms, and protein domain evolution together with other evolutionary considerations. A long-standing challenge in the field is the functional annotation of paralogs, for which we present novel approaches. The meeting also emphasised strategies for integrating diverse genetic features into the concept of orthology, encouraging frameworks that account for elements like alternative splicing, domain organisation, and regulatory sequences. We discuss various applications of orthology and paralogy to environmental research, agriculture, and comparative genomics. Additionally, we report recent progress in orthology inference methodologies and resources. This work represents a collaborative synthesis of insights and innovations presented at the 8th Quest for Orthologs meeting, highlighting current progress while outlining future directions for orthology research.
- Shamail, A., & McWhite, C. (2025). A General Algorithm for Detecting Higher-Order Interactions via Random Sequential Additions.
  Many systems exhibit complex interactions between their components: some features or actions amplify each other's effects, others provide redundant information, and some contribute independently. We present a simple geometric method for discovering interactions and redundancies: when elements are added in random sequential orders and their contributions plotted over many trials, characteristic L-shaped patterns emerge that directly reflect interaction structure. The approach quantifies how the contribution of each element depends on those added before it, revealing patterns that distinguish interaction, independence, and redundancy on a unified scale. When pairwise contributions are visualized as two-dimensional point clouds, redundant pairs form L-shaped patterns where only the first-added element contributes, while synergistic pairs form L-shaped patterns where only elements contribute together. Independent elements show order-invariant distributions. We formalize this with the L-score, a continuous measure ranging from $-1$ (perfect synergy, e.g. $Y=X_1X_2$) to $0$ (independence) to $+1$ (perfect redundancy, $X_1 \approx X_2$). The relative scaling of the L-shaped arms reveals feature dominance in which element consistently provides more information. Although computed only from pairwise measurements, higher-order interactions among three or more elements emerge naturally through consistent cross-pair relationships (e.g. AB, AC, BC). The method is metric-agnostic and broadly applicable to any domain where performance can be evaluated incrementally over non-repeating element sequences, providing a unified geometric approach to uncovering interaction structure.
- Shamail, A., & McWhite, C. D. (2025). Automated Protein Motif Localization using Concept Activation Vectors in Protein Language Model Embedding Space.
  We present an automated approach for identifying and annotating motifs and domains in protein sequences, using pretrained Protein Language Models (PLMs) and Concept Activation Vectors (CAVs), adapted from interpretability research in computer vision. We treat motifs as conceptual entities and represent them through learned CAVs in PLM embedding space by training simple linear classifiers to distinguish motif-containing from non-motif sequences. To identify motif occurrences, we extract embeddings for overlapping sequence windows and compute their inner products with motif CAVs. This scoring mechanism quantifies how strongly each sequence region expresses the motif concept and naturally detects multiple instances of the same motif within the same protein. Using a dataset of sixty-nine well-characterized motifs with curated positive and negative examples, our method achieves over 85% F1 score for segments strongly expressing the concept and accurately localizes motif positions across diverse protein families. As each motif is encoded by a single vector, motif detection requires only the pretrained PLM and a lightweight dictionary of CAVs, offering a scalable, interpretable, and computationally efficient framework for automated sequence annotation.
- Shaw, R., Love, S. D., & McWhite, C. D. (2025). Evaluating Pretrained Protein Language Model Embeddings as Proxies for Functional Similarity. Journal of Molecular Evolution, 93(6), 765-776.
  Protein Language Models (PLMs) have emerged as powerful tools for representing protein sequences. We explore how embeddings (numeric vector representations) from pretrained PLMs can serve as direct numeric proxies for protein structure and function without requiring additional training or fine-tuning. In a proof-of-concept study of 22 cross-species complementation triplets (a gold standard for functional similarity where genes from one species are tested for their ability to rescue gene deletions in another species), we find that ESM-C 600M embeddings summarized into pooled sliced-Wasserstein embeddings achieved high discrimination of subtle functional differences. This pooling method captures distributional properties of amino acid embeddings by comparing them against reference points using optimal transport theory. While our limited sample size precludes definitive conclusions about whether PLM embeddings systematically outperform sequence-based methods in detecting protein functional similarity, our preliminary results demonstrate the potential of using protein embeddings for functional analysis. Our exploratory analysis of orthology relationships suggests that embedding similarity may correlate with functional conservation, with the least diverged ortholog showing higher embedding similarity in approximately two-thirds of cases. Analyzing the Ortholog Conjecture (that orthologs maintain greater functional similarity than paralogs at equivalent sequence divergence), we do not observe clear differences between one-to-one orthologs and inparalog embedding similarities. Finally, we propose integrating PLMs with phylogenetic methods in a hybrid approach that leverages their complementary strengths: PLM-derived numeric embeddings for rapid homology detection and phylogenetics for evolutionary precision. We introduce embedding-tree versus gene-tree discordance as a potential metric to detect functional divergence between closely related proteins. Integrating protein embeddings with sequence analysis may enable a more nuanced understanding of protein function and evolutionary dynamics.
- Xu, H., Bierman, R., Akey, D., Koers, C., Comi, T., McWhite, C., & Akey, J. M. (2025). Landscape of human protein-coding somatic mutations across tissues and individuals. bioRxiv: the preprint server for biology.
  Although somatic mutations are fundamentally important to human biology, disease, and aging, many outstanding questions remain about their rates, spectrum, and determinants in apparently healthy tissues. Here, we performed high-coverage exome sequencing on 265 samples from 14 GTEx donors sampled for a median of 17.5 tissues per donor (spanning 46 total tissues). Using a novel probabilistic method tailored to the unique structure of our data, we identified 8,470 somatic variants. We leverage our compendium of somatic mutations to quantify the burden of deleterious somatic variants among tissues and individuals, identify molecular features such as chromatin accessibility that exhibit significantly elevated somatic mutation rates, provide novel biological insights into mutational mechanisms, and infer developmental trajectories based on patterns of multi-tissue somatic mosaicism. Our data provides a high-resolution portrait of somatic mutations across genes, tissues, and individuals.
- McWhite, C., Sae-Lee, W., Yuan, Y., Mallam, A., Gort-Freitas, N., Ramundo, S., Onishi, M., & Marcotte, E. (2024). Alternative proteoforms and proteoform-dependent assemblies in humans and plants. Molecular Systems Biology, 20(8). doi:10.1038/s44320-024-00048-3
  The variability of proteins at the sequence level creates an enormous potential for proteome complexity. Exploring the depths and limits of this complexity is an ongoing goal in biology. Here, we systematically survey human and plant high-throughput bottom-up native proteomics data for protein truncation variants, where substantial regions of the full-length protein are missing from an observed protein product. In humans, Arabidopsis, and the green alga Chlamydomonas, approximately one percent of observed proteins show a short form, which we can assign by comparison to RNA isoforms as either likely deriving from transcript-directed processes or limited proteolysis. While some detected protein fragments align with known splice forms and protein cleavage events, multiple examples are previously undescribed, such as our observation of fibrocystin proteolysis and nuclear translocation in a green alga. We find that truncations occur almost entirely between structured protein domains, even when short forms are derived from transcript variants. Intriguingly, multiple endogenous protein truncations of phase-separating translational proteins resemble cleaved proteoforms produced by enteroviruses during infection. Some truncated proteins are also observed in both humans and plants, suggesting that they date to the last eukaryotic common ancestor. Finally, we describe novel proteoform-specific protein complexes, where the loss of a domain may accompany complex formation.
- Wallner, E., Mair, A., Handler, D., McWhite, C., Xu, S., Dolan, L., & Bergmann, D. (2024). Spatially resolved proteomics of the Arabidopsis stomatal lineage identifies polarity complexes for cell divisions and stomatal pores. Developmental Cell, 59(9). doi:10.1016/j.devcel.2024.03.001
  Cell polarity is used to guide asymmetric divisions and create morphologically diverse cells. We find that two oppositely oriented cortical polarity domains present during the asymmetric divisions in the Arabidopsis stomatal lineage are reconfigured into polar domains marking ventral (pore-forming) and outward-facing domains of maturing stomatal guard cells. Proteins that define these opposing polarity domains were used as baits in miniTurboID-based proximity labeling. Among differentially enriched proteins, we find kinases, putative microtubule-interacting proteins, and polar SOSEKIs with their effector ANGUSTIFOLIA. Using AI-facilitated protein structure prediction models, we identify potential protein-protein interaction interfaces among them. Functional and localization analyses of the polarity protein OPL2 and its putative interaction partners suggest a positive interaction with mitotic microtubules and a role in cytokinesis. This combination of proteomics and structural modeling with live-cell imaging provides insights into how polarity is rewired in different cell types and cell-cycle stages.
- Kafri, M., Patena, W., Martin, L., Wang, L., Gomer, G., Ergun, S., Sirkejyan, A., Goh, A., Wilson, A., Gavrilenko, S., Breker, M., Roichman, A., McWhite, C., Rabinowitz, J., Cross, F., Jonikas, M., & Wühr, M. (2023). Systematic identification and characterization of genes in the regulation and biogenesis of photosynthetic machinery. Cell, 186(25). doi:10.1016/j.cell.2023.11.007
  Photosynthesis is central to food production and the Earth's biogeochemistry, yet the molecular basis for its regulation remains poorly understood. Here, using high-throughput genetics in the model eukaryotic alga Chlamydomonas reinhardtii, we identify with high confidence (false discovery rate [FDR] < 0.11) 70 poorly characterized genes required for photosynthesis. We then enable the functional characterization of these genes by providing a resource of proteomes of mutant strains, each lacking one of these genes. The data allow assignment of 34 genes to the biogenesis or regulation of one or more specific photosynthetic complexes. Further analysis uncovers biogenesis/regulatory roles for at least seven proteins, including five photosystem I mRNA maturation factors, the chloroplast translation factor MTF1, and the master regulator PMR1, which regulates chloroplast genes via nuclear-expressed factors. Our work provides a rich resource identifying regulatory and functional genes and placing them into pathways, thereby opening the door to a system-level understanding of photosynthesis.
- McWhite, C., Armour-Garb, I., & Singh, M. (2023). Leveraging protein language models for accurate multiple sequence alignments. Genome Research, 33(7). doi:10.1101/gr.277675.123
  Multiple sequence alignment (MSA) is a critical step in the study of protein sequence and function. Typically, MSA algorithms progressively align pairs of sequences and combine these alignments with the aid of a guide tree. These alignment algorithms use scoring systems based on substitution matrices to measure amino acid similarities. Although successful, standard methods struggle on sets of proteins with low sequence identity: the so-called twilight zone of protein alignment. For these difficult cases, another source of information is needed. Protein language models are a powerful new approach that leverages massive sequence data sets to produce high-dimensional contextual embeddings for each amino acid in a sequence. These embeddings have been shown to reflect physicochemical and higher-order structural and functional attributes of amino acids within proteins. Here, we present a novel approach to MSA, based on clustering and ordering amino acid contextual embeddings. Our method for aligning semantically consistent groups of proteins circumvents the need for many standard components of MSA algorithms, avoiding initial guide tree construction, intermediate pairwise alignments, gap penalties, and substitution matrices. The added information from contextual embeddings leads to higher accuracy alignments for structurally similar proteins with low amino-acid similarity. We anticipate that protein language models will become a fundamental component of the next generation of algorithms for generating MSAs.
- Wang, L., Patena, W., Van Baalen, K., Xie, Y., Singer, E., Gavrilenko, S., Warren-Williams, M., Han, L., Harrigan, H., Hartz, L., Chen, V., Ton, V., Kyin, S., Shwe, H., Cahn, M., Wilson, A., Onishi, M., Hu, J., Schnell, D., , McWhite, C., et al. (2023). A chloroplast protein atlas reveals punctate structures and spatial organization of biosynthetic pathways. Cell, 186(16). doi:10.1016/j.cell.2023.06.008
  Chloroplasts are eukaryotic photosynthetic organelles that drive the global carbon cycle. Despite their importance, our understanding of their protein composition, function, and spatial organization remains limited. Here, we determined the localizations of 1,034 candidate chloroplast proteins using fluorescent protein tagging in the model alga Chlamydomonas reinhardtii. The localizations provide insights into the functions of poorly characterized proteins; identify novel components of nucleoids, plastoglobules, and the pyrenoid; and reveal widespread protein targeting to multiple compartments. We discovered and further characterized cellular organizational features, including eleven chloroplast punctate structures, cytosolic crescent structures, and unexpected spatial distributions of enzymes within the chloroplast. We also used machine learning to predict the localizations of other nuclear-encoded Chlamydomonas proteins. The strains and localization atlas developed here will serve as a resource to accelerate studies of chloroplast architecture and functions.
- Sae-Lee, W., McCafferty, C., Verbeke, E., Havugimana, P., Papoulas, O., McWhite, C., Houser, J., Vanuytsel, K., Murphy, G., Drew, K., Emili, A., Taylor, D., & Marcotte, E. (2022). The protein organization of a red blood cell. Cell Reports, 40(3). doi:10.1016/j.celrep.2022.111103
  Red blood cells (RBCs) (erythrocytes) are the simplest primary human cells, lacking nuclei and major organelles and instead employing about a thousand proteins to dynamically control cellular function and morphology in response to physiological cues. In this study, we define a canonical RBC proteome and interactome using quantitative mass spectrometry and machine learning. Our data reveal an RBC interactome dominated by protein homeostasis, redox biology, cytoskeletal dynamics, and carbon metabolism. We validate protein complexes through electron microscopy and chemical crosslinking and, with these data, build 3D structural models of the ankyrin/Band 3/Band 4.2 complex that bridges the spectrin cytoskeleton to the RBC membrane. The model suggests spring-like compression of ankyrin may contribute to the characteristic RBC cell shape and flexibility. Taken together, our study provides an in-depth view of the global protein organization of human RBCs and serves as a comprehensive resource for future research.
- Jha, S., Borowsky, A., Cole, B., Fahlgren, N., Farmer, A., Huang, S., Karia, P., Libault, M., Provart, N., Rice, S., Saura-Sanchez, M., Agarwal, P., Ahkami, A., Anderton, C., Briggs, S., Brophy, J., Denolf, P., Di Costanzo, L., Exposito-Alonso, M., , Giacomello, S., et al. (2021). Vision, challenges and opportunities for a plant cell atlas. eLife, 10. doi:10.7554/elife.66877
  With growing populations and pressing environmental problems, future economies will be increasingly plant-based. Now is the time to reimagine plant science as a critical component of fundamental science, agriculture, environmental stewardship, energy, technology and healthcare. This effort requires a conceptual and technological framework to identify and map all cell types, and to comprehensively annotate the localization and organization of molecules at cellular and tissue levels. This framework, called the Plant Cell Atlas (PCA), will be critical for understanding and engineering plant development, physiology and environmental responses. A workshop was convened to discuss the purpose and utility of such an initiative, resulting in a roadmap that acknowledges the current knowledge gaps and technical challenges, and underscores how the PCA initiative can help to overcome them.
- McWhite, C., Papoulas, O., Drew, K., Dang, V., Leggere, J., Sae-Lee, W., & Marcotte, E. (2021). Co-fractionation/mass spectrometry to identify protein complexes. STAR Protocols, 2(1). doi:10.1016/j.xpro.2021.100370
  Co-fractionation/mass spectrometry (CF/MS) is a flexible and powerful method to detect physical associations of proteins. CF/MS can be applied to any tissue or organism without the need for protein-specific antibodies or epitope tags. Here, we outline two alternate protocols for MS preparation of samples (containing low or high salt) and a computational pipeline (cfmsflow) that together allow the successful application of this approach. These protocols are based on CF/MS of over 16 diverse organisms including plants and animals. For complete details on the use and execution of this protocol, please refer to McWhite et al. (2020).
- Drew, K., Lee, C., Cox, R., Dang, V., Devitt, C., McWhite, C., Papoulas, O., Huizar, R., Marcotte, E., & Wallingford, J. (2020). A systematic, label-free method for identifying RNA-associated proteins in vivo provides insights into vertebrate ciliary beating machinery. Developmental Biology, 467(1-2). doi:10.1016/j.ydbio.2020.08.008
  Cell-type specific RNA-associated proteins are essential for development and homeostasis in animals. Despite a massive recent effort to systematically identify RNA-associated proteins, we currently have few comprehensive rosters of cell-type specific RNA-associated proteins in vertebrate tissues. Here, we demonstrate the feasibility of determining the RNA-associated proteome of a defined vertebrate embryonic tissue using DIF-FRAC, a systematic and universal (i.e., label-free) method. Application of DIF-FRAC to cultured tissue explants of Xenopus mucociliary epithelium identified dozens of known RNA-associated proteins as expected, but also several novel RNA-associated proteins, including proteins related to assembly of the mitotic spindle and regulation of ciliary beating. In particular, we show that the inner dynein arm tether Cfap44 is an RNA-associated protein that localizes not only to axonemes, but also to liquid-like organelles in the cytoplasm called DynAPs. This result led us to discover that DynAPs are generally enriched for RNA. Together, these data provide a useful resource for a deeper understanding of mucociliary epithelia and demonstrate that DIF-FRAC will be broadly applicable for systematic identification of RNA-associated proteins from embryonic tissues.
- McWhite, C., Papoulas, O., Drew, K., Cox, R., June, V., Dong, O., Kwon, T., Wan, C., Salmi, M., Roux, S., Browning, K., Chen, Z., Ronald, P., & Marcotte, E. (2020). A Pan-plant Protein Complex Map Reveals Deep Conservation and Novel Assemblies. Cell, 181(2). doi:10.1016/j.cell.2020.02.049
  Plants are foundational for global ecological and economic systems, but most plant proteins remain uncharacterized. Protein interaction networks often suggest protein functions and open new avenues to characterize genes and proteins. We therefore systematically determined protein complexes from 13 plant species of scientific and agricultural importance, greatly expanding the known repertoire of stable protein complexes in plants. By using co-fractionation mass spectrometry, we recovered known complexes, confirmed complexes predicted to occur in plants, and identified previously unknown interactions conserved over 1.1 billion years of green plant evolution. Several novel complexes are involved in vernalization and pathogen defense, traits critical for agriculture. We also observed plant analogs of animal complexes with distinct molecular assemblies, including a megadalton-scale tRNA multi-synthetase complex. The resulting map offers a cross-species view of conserved, stable protein assemblies shared across plant cells and provides a mechanistic, biochemical framework for interpreting plant genetics and mutant phenotypes. This massive plant proteomics project, using co-fractionation mass spectrometry to measure the amounts and associations of over two million proteins from 13 diverse plant species, reveals stable protein complexes shared across plant cells and provides a framework for interpreting plant genetics and mutant phenotypes.
- Zeileis, A., Fisher, J., Hornik, K., Ihaka, R., McWhite, C., Murrell, P., Stauffer, R., & Wilke, C. (2020). Colorspace: A toolbox for manipulating and assessing colors and palettes. Journal of Statistical Software, 96. doi:10.18637/jss.v096.i01
  The R package colorspace provides a flexible toolbox for selecting individual colors or color palettes, manipulating these colors, and employing them in statistical graphics and data visualizations. In particular, the package provides a broad range of color palettes based on the HCL (hue-chroma-luminance) color space. The three HCL dimensions have been shown to match those of the human visual system very well, thus facilitating intuitive selection of color palettes through trajectories in this space. Using the HCL color model, general strategies for three types of palettes are implemented: (1) Qualitative for coding categorical information, i.e., where no particular ordering of categories is available. (2) Sequential for coding ordered/numeric information, i.e., going from high to low (or vice versa). (3) Diverging for coding ordered/numeric information around a central neutral value, i.e., where colors diverge from neutral to two extremes. To aid selection and application of these palettes, the package also contains scales for use with ggplot2, shiny and tcltk apps for interactive exploration, visualizations of palette properties, accompanying manipulation utilities (like desaturation and lighten/darken), and emulation of color vision deficiencies. The shiny apps are also hosted online at http://hclwizard.org/.
- DuPai, C., McWhite, C., Smith, C., Garten, R., Maurer-Stroh, S., & Wilke, C. (2019). Influenza passaging annotations: what they tell us and why we should listen. Virus Evolution, 5(1). doi:10.1093/ve/vez016
  Influenza databases now contain over 100,000 worldwide sequence records for strains influenza A(H3N2) and A(H1N1). Although these data facilitate global research efforts and vaccine development practices, they also represent a stumbling block for researchers because of their confusing and heterogeneous annotation. Unclear passaging annotations are particularly concerning given the recent work highlighting the presence and risk of false adaptation signals introduced by cell passaging of viral isolates. With this in mind, we aim to provide a concise outline of why viruses are passaged, a clear overview of passaging annotation nomenclature currently in use, and suggestions for a standardized nomenclature going forward. Our hope is that this summary will empower researchers and clinicians alike to more easily understand a virus sample’s passage history when analyzing influenza sequences.
- Zhao, W., Bachhav, B., McWhite, C., & Segatori, L. (2018). A yeast selection system for the detection of proteasomal activation. Protein Engineering, Design and Selection, 31(11). doi:10.1093/protein/gzz006
  The ubiquitin proteasome system (UPS) is a complex cellular machinery that catalyzes degradation of misfolded or damaged proteins and regulates turnover of native proteins in eukaryotic cells, thus playing a crucial role in maintaining protein homeostasis. The UPS has emerged as a drug target for a diverse range of diseases characterized by accumulation of misfolded or aggregated proteins. While enhancement of UPS activity is widely recognized as a promising strategy to prevent accumulation of aberrant, off-pathway protein conformations and ameliorate the phenotypes of a wide range of protein misfolding diseases, the molecular mechanisms underlying activation of proteasomal degradation are poorly characterized. We report the development of a yeast selection platform for genome-wide selection of UPS activators. We engineered the Saccharomyces cerevisiae selection marker orotidine-5-phosphate decarboxylase (URA3) to function as a substrate of proteasomal degradation through fusion to UPS-sensitive tags. The resulting UPS-sensitive URA3 variant links UPS activity to cell growth. The yeast selection platform reported in this study will open the way to high-throughput, genome-wide studies aimed at identifying modulators of UPS function that might provide novel targets for therapeutic applications.
- Drew, K., Lee, C., Huizar, R., Tu, F., Borgeson, B., McWhite, C., Ma, Y., Wallingford, J., & Marcotte, E. (2017). Integration of over 9,000 mass spectrometry experiments builds a global map of human protein complexes. Molecular Systems Biology, 13(6). doi:10.15252/msb.20167490
  Macromolecular protein complexes carry out many of the essential functions of cells, and many genetic diseases arise from disrupting the functions of such complexes. Currently, there is great interest in defining the complete set of human protein complexes, but recent published maps lack comprehensive coverage. Here, through the synthesis of over 9,000 published mass spectrometry experiments, we present hu.MAP, the most comprehensive and accurate human protein complex map to date, containing > 4,600 total complexes, > 7,700 proteins, and > 56,000 unique interactions, including thousands of confident protein interactions not identified by the original publications. hu.MAP accurately recapitulates known complexes withheld from the learning procedure, which was optimized with the aid of a new quantitative metric (k-cliques) for comparing sets of sets. The vast majority of complexes in our map are significantly enriched with literature annotations, and the map overall shows improved coverage of many disease-associated proteins, as we describe in detail for ciliopathies. Using hu.MAP, we predicted and experimentally validated candidate ciliopathy disease genes in vivo in a model vertebrate, discovering CCDC138, WDR90, and KIAA1328 to be new cilia basal body/centriolar satellite proteins, and identifying ANKRD55 as a novel member of the intraflagellar transport machinery. By offering significant improvements to the accuracy and coverage of human protein complexes, hu.MAP (http://proteincomplexes.org) serves as a valuable resource for better understanding the core cellular functions of human proteins and helping to determine mechanistic foundations of human disease.
- Kachroo, A., Laurent, J., Akhmetov, A., Szilagyi-Jones, M., McWhite, C., Zhao, A., & Marcotte, E. (2017). Systematic bacterialization of yeast genes identifies a near-universally swappable pathway. eLife, 6. doi:10.7554/elife.25093 Eukaryotes and prokaryotes last shared a common ancestor ~2 billion years ago, and while many present-day genes in these lineages predate this divergence, the extent to which these genes still perform their ancestral functions is largely unknown. To test principles governing retention of ancient function, we asked if prokaryotic genes could replace their essential eukaryotic orthologs. We systematically replaced essential genes in yeast by their 1:1 orthologs from Escherichia coli. After accounting for mitochondrial localization and alternative start codons, 31 out of 51 bacterial genes tested (61%) could complement a lethal growth defect and replace their yeast orthologs with minimal effects on growth rate. Replaceability was determined on a pathway-by-pathway basis; codon usage, abundance, and sequence similarity contributed predictive power. The heme biosynthesis pathway was particularly amenable to inter-kingdom exchange, with each yeast enzyme replaceable by its bacterial, human, or plant ortholog, suggesting it as a near-universally swappable pathway.
- Liebeskind, B., McWhite, C., & Marcotte, E. (2016). Towards consensus gene ages. Genome Biology and Evolution, 8(6). doi:10.1093/gbe/evw113 Correctly estimating the age of a gene or gene family is important for a variety of fields, including molecular evolution, comparative genomics, and phylogenetics, and increasingly for systems biology and disease genetics. However, most studies use only a point estimate of a gene's age, neglecting the substantial uncertainty involved in this estimation. Here, we characterize this uncertainty by investigating the effect of algorithm choice on gene-age inference and calculate consensus gene ages with attendant error distributions for a variety of model eukaryotes. We use 13 orthology inference algorithms to create gene-age datasets and then characterize the error around each age-call on a per-gene and per-algorithm basis. Systematic error was found to be a large factor in estimating gene age, suggesting that simple consensus algorithms are not enough to give a reliable point estimate. We also found that different sources of error can affect downstream analyses, such as gene ontology enrichment. Our consensus gene-age datasets, with associated error terms, are made fully available (geneages.org) so that researchers can propagate this uncertainty through their analyses.
- McWhite, C., Meyer, A., & Wilke, C. (2016). Sequence amplification via cell passaging creates spurious signals of positive adaptation in influenza virus H3N2 hemagglutinin. Virus Evolution, 2(2). doi:10.1093/ve/vew026 Clinical influenza A virus isolates are frequently not sequenced directly. Instead, a majority of these isolates (70% in 2015) are first subjected to passaging for amplification, most commonly in non-human cell culture. Here, we find that this passaging leaves distinct signals of adaptation, which can confound evolutionary analyses of the viral sequences. We find distinct patterns of adaptation to Madin-Darby canine kidney (MDCK) and monkey cell culture absent from unpassaged hemagglutinin sequences. These patterns also dominate pooled datasets not separated by passaging type, and they increase in proportion to the number of passages performed. By contrast, MDCK-SIAT1 passaged sequences seem mostly (but not entirely) free of passaging adaptations. Contrary to previous studies, we find that using only internal branches of influenza virus phylogenetic trees is insufficient to correct for passaging artifacts. These artifacts can only be safely avoided by excluding passaged sequences entirely from subsequent analysis. We conclude that future influenza virus evolutionary analyses should appropriately control for potentially confounding effects of passaging adaptations.
- Zhao, W., Bonem, M., McWhite, C., Silberg, J., & Segatori, L. (2014). Sensitive detection of proteasomal activation using the Deg-On mammalian synthetic gene circuit. Nature Communications, 5. doi:10.1038/ncomms4612 The ubiquitin proteasome system (UPS) has emerged as a drug target for diverse diseases characterized by altered proteostasis, but pharmacological agents that enhance UPS activity have been challenging to establish. Here we report the Deg-On system, a genetic inverter that translates proteasomal degradation of the transcriptional regulator TetR into a fluorescent signal, thereby linking UPS activity to an easily detectable output, which can be tuned using tetracycline. We demonstrate that this circuit responds to modulation of UPS activity in cell culture arising from the inhibitor MG-132 and activator PA28γ. Guided by predictive modelling, we enhanced the circuit's signal sensitivity and dynamic range by introducing a feedback loop that enables self-amplification of TetR. By linking UPS activity to a simple and tunable fluorescence output, these genetic inverters will enable a variety of applications, including screening for UPS activating molecules and selecting for mammalian cells with different levels of proteasome activity.
- Demelash, A., Rudrabhatla, P., Pant, H., Wang, X., Amin, N., McWhite, C., Naizhen, X., & Linnoila, R. (2012). Achaete-scute homologue-1 (ASH1) stimulates migration of lung cancer cells through Cdk5/p35 pathway. Molecular Biology of the Cell, 23(15). doi:10.1091/mbc.e10-12-1010 Our previous data suggested that the human basic helix-loop-helix transcription factor achaete-scute homologue-1 (hASH1) may stimulate both proliferation and migration in the lung. In the CNS, cyclin-dependent kinase 5 (Cdk5) and its activator p35 are important for neuronal migration that is regulated by basic helix-loop-helix transcription factors. Cdk5/p35 may also play a role in carcinogenesis. In this study, we found that the neuronal activator p35 was commonly expressed in primary human lung cancers. Cdk5 and p35 were also expressed by several human lung cancer cell lines and coupled with migration and invasion. When the kinase activity was inhibited by the Cdk5 inhibitor roscovitine or dominant-negative (dn) Cdk5, the migration of lung cancer cells was reduced. In neuroendocrine cells expressing hASH1, such as a pulmonary carcinoid cell line, knocking down the gene expression by short hairpin RNA reduced the levels of Cdk5/p35, nuclear p35 protein, and migration. Furthermore, expression of hASH1 in lung adenocarcinoma cells normally lacking hASH1 increased p35/Cdk5 activity and enhanced cellular migration. We were also able to show that p35 was a direct target for hASH1. In conclusion, induction of Cdk5 activity is a novel mechanism through which hASH1 may regulate migration in lung carcinogenesis.
Proceedings Publications
- Amorin De Hegedus, R., Arighi, C., Babor, J., Bateman, A., Blaby, I., Blaby-Haas, C., Bridge, A., Burley, S., Cleveland, S., Colwell, L., Conesa, A., Dallago, C., Danchin, A., De Waard, A., Deutschbauer, A., Dias, R., Ding, Y., Fang, G., Friedberg, I., Gerlt, J., et al. (2022). A roadmap for the functional annotation of protein families: A community perspective. In MoCeIS-DCL: Building a Network for Functional Annotation of Protein Families (MCB-2129768), held 3–4 February 2022 at the Orlando Airport Marriott, FL, USA. Over the last 25 years, biology has entered the genomic era and is becoming a science of 'big data'. Most interpretations of genomic analyses rely on accurate functional annotations of the proteins encoded by more than 500 000 genomes sequenced to date. By different estimates, only half the predicted sequenced proteins carry an accurate functional annotation, and this percentage varies drastically between different organismal lineages. Such a large gap in knowledge hampers all aspects of biological enterprise and, thereby, is standing in the way of genomic biology reaching its full potential. A brainstorming meeting to address this issue funded by the National Science Foundation was held during 3-4 February 2022. Bringing together data scientists, biocurators, computational biologists and experimentalists within the same venue allowed for a comprehensive assessment of the current state of functional annotations of protein families. Further, major issues that were obstructing the field were identified and discussed, which ultimately allowed for the proposal of solutions on how to move forward.
