Andrea Thomer
- Associate Professor, School of Information
- Member of the Graduate Faculty
- Richard P. Harvill Building, Rm. 409
- Tucson, AZ 85721
- athomer@arizona.edu
Biography
Dr. Andrea Thomer is an Associate Professor at the University of Arizona School of Information. Her research interests include the maintenance and evolution of knowledge infrastructures, scientific data curation, and information organization. She is especially interested in long-term database curation, the use and impact of natural history collections, and the conceptual foundations of data science. Her research has been funded by the Institute of Museum and Library Services and the National Science Foundation, and published in JASIST, CSCW, Slate, and other venues. Dr. Thomer earned her doctorate at the School of Information Sciences at the University of Illinois at Urbana-Champaign in 2017. Prior to her graduate work, she was an excavator and ad hoc data curator at the La Brea Tar Pits in Los Angeles, California.
Degrees
- Ph.D. Information Science
- University of Illinois at Urbana-Champaign, Urbana, Illinois, United States
- M.L.I.S. Library and Information Science
- University of Illinois at Urbana-Champaign, Urbana, Illinois, United States
- B.A. English
- University of California, Los Angeles, Los Angeles, California, United States
Awards
- 3rd place, Best Long Paper, ASIS&T Annual Meeting, Fall 2023 (Award Finalist)
- JASIST Best Paper Award, 2022 (awarded Fall 2023)
Interests
Research
data practices, information organization, natural history collections, database curation, knowledge infrastructures
Courses
2024-25 Courses
- Information Research Methods, INFO 507 (Fall 2024)
2023-24 Courses
- Honors Independent Study, ISTA 499H (Spring 2024)
- Intro Digital Curation/Preserv, INFO 671 (Spring 2024)
- Intro Digital Curation/Preserv, LIS 671 (Spring 2024)
- Information Research Methods, INFO 507 (Fall 2023)
2022-23 Courses
- Information Research Methods, INFO 507 (Spring 2023)
- Special Topics in Information, INFO 595 (Spring 2023)
- Special Topics in Information, ISTA 495 (Spring 2023)
Scholarly Contributions
Chapters
- Thomer, A. K., Wofford, M. F., Lenard, M. C., Dominguez, V. S., Goring, S. J., Ma, X., Mookerjee, M., Hsu, L., & Hills, D. (2023). Revealing Earth science code and data-use practices using the Throughput Graph Database. In Revealing Earth science code and data-use practices using the Throughput Graph Database (pp. 147-159). Geological Society of America.
Journals/Publications
- Hemphill, L., Thomer, A., Lafia, S., Fan, L., Bleckley, D., & Moss, E. (2024). A dataset for measuring the impact of research data and their curation. Scientific Data, 11(442). doi:10.1038/s41597-024-03303-2
Science funders, publishers, and data archives make decisions about how to responsibly allocate resources to maximize the reuse potential of research data. This paper introduces a dataset developed to measure the impact of archival and data curation decisions on data reuse. The dataset describes 10,605 social science research datasets, their curation histories, and reuse contexts in 94,755 publications that cover 59 years from 1963 to 2022. The dataset was constructed from study-level metadata, citing publications, and curation records available through the Inter-university Consortium for Political and Social Research (ICPSR) at the University of Michigan. The dataset includes information about study-level attributes (e.g., PIs, funders, subject terms); usage statistics (e.g., downloads, citations); archiving decisions (e.g., curation activities, data transformations); and bibliometric attributes (e.g., journals, authors) for citing publications. This dataset provides information on factors that contribute to long-term data reuse, which can inform the design of effective evidence-based recommendations to support high-impact research data curation decisions.
- Pasquetto, I., Cullen, Z., Thomer, A., & Wofford, M. (2024). What is research data "misuse"? And how can it be prevented or mitigated?. Journal of the Association for Information Science and Technology. doi:10.1002/asi.24944
Despite increasing expectations that researchers and funding agencies release their data for reuse, concerns about data misuse hinder the open sharing of data. The COVID-19 crisis brought urgency to these concerns, yet we are currently missing a theoretical framework to understand, prevent, and respond to research data misuse. In the article, we emphasize the challenge of defining misuse broadly and identify various forms that misuse can take, including methodological mistakes, unauthorized reuse, and intentional misrepresentation. We pay particular attention to underscoring the complexity of defining misuse, considering different epistemological perspectives and the evolving nature of scientific methodologies. We propose a theoretical framework grounded in the critical analysis of interdisciplinary literature on the topic of misusing research data, identifying similarities and differences in how data misuse is defined across a variety of fields, and propose a working definition of what it means “to misuse” research data. Finally, we speculate about possible curatorial interventions that data intermediaries can adopt to prevent or respond to instances of misuse.
- Rayburn, A., Punzalan, R., & Thomer, A. (2024). Persisting through friction: growing a community driven knowledge infrastructure. Archival Science, 24(1), 61–82. doi:10.1007/s10502-023-09427-5
Many memory institutions hold heritage items belonging to Indigenous peoples. There are current efforts to share knowledge about these heritage items with their communities; one way this is done is through digital access. This paper examines The Great Lakes Research Alliance for the Study of Aboriginal Arts and Cultures (GRASAC), a network of researchers, museum professionals, and community members who maintain a digital platform that aggregates museum and archival research on Anishinaabe, Haudenosaunee, and Huron-Wendat cultures into a centralized database. The database, known as the GRASAC Knowledge Sharing System (GKS), is at a point of infrastructural growth, moving from a password protected system to one that is open to the public. Rooted in qualitative research from semi-structured interviews with the creators, maintainers, and users of the database, we examine the frictions in this expanding knowledge infrastructure (KI), and how they are eased over time. We find the friction within GRASAC resides in three main categories: collaborative friction, data friction, and our novel contribution: systemic friction.
- King, K. B., Giacomini, H. C., Wehrly, K. E., López-Fernández, H., Thomer, A. K., & Alofs, K. M. (2023). Using historical catch data to evaluate predicted changes in fish relative abundance in response to a warming climate. Ecography, 2023(8). doi:10.1111/ecog.06798
Using models to predict future changes in species distributions in response to projected climate change is a common tool to aid management and species conservation. However, the assumption underlying this approach, that ecological processes remain stationary through time, can be unreliable, and more empirical tests are needed to validate predictions of biotic outcomes of global change. The scarcity of reliable historical and long-term datasets can make these tests difficult. Moreover, incorporating abundance and multiple sampling methods can improve model predictions and usability for management. Our study 1) provides insight into how well models can predict environmental change under a warming climate, 2) incorporates multiple sampling gears and abundance data in modeling to better capture changes in populations and 3) shows the value of historical datasets for improving predictive models of population change. We used contemporary (2003–2019) and historical (1936–1964) abundance datasets of the North American fish largemouth bass Micropterus salmoides in lakes across the state of Michigan, USA. We developed Bayesian hierarchical models that leverage the use of multiple gears in contemporary lake surveys to estimate the relative catchabilities of largemouth bass for each gear and hindcast the models to predict historical abundance. Our estimates of relative density change over time were correlated with temperature change over time; increasing surface water temperature led to increasing largemouth bass density. Hindcasting models to historical lake temperatures performed similarly in predicting historical density to models predicting contemporary density. Our results suggest that models built using spatial environmental gradients can reliably predict population changes through time. Understanding the sampling methods and the environmental context of observational datasets can help researchers test for potential sampling biases and identify confounding factors that will improve predictions of future impacts of environmental change.
- Lafia, S., Thomer, A. K., Moss, E., Bleckley, D., & Hemphill, L. (2023). How and Why Do Researchers Reference Data? A Study of Rhetorical Features and Functions of Data References in Academic Articles. Data Science Journal. doi:10.5334/dsj-2023-010
Data reuse is a common practice in the social sciences. While published data play an essential role in the production of social science research, they are not consistently cited, which makes it difficult to assess their full scholarly impact and give credit to the original data producers. Furthermore, it can be challenging to understand researchers' motivations for referencing data. Like references to academic literature, data references perform various rhetorical functions, such as paying homage, signaling disagreement, or drawing comparisons. This paper studies how and why researchers reference social science data in their academic writing. We develop a typology to model relationships between the entities that anchor data references, along with their features (access, actions, locations, styles, types) and functions (critique, describe, illustrate, interact, legitimize). We illustrate the use of the typology by coding multidisciplinary research articles (n=30) referencing social science data archived at the Inter-university Consortium for Political and Social Research (ICPSR). We show how our typology captures researchers' interactions with data and purposes for referencing data. Our typology provides a systematic way to document and analyze researchers' narratives about data use, extending our ability to give credit to data that support research.
- Palmer, C. L., Bonn, M., Coward, C., Knox, E., Marzullo, K., Ndumu, A., Subramaniam, M., & Thomer, A. (2023). Advancing LIS in iSchools: Building a Coalition To Ensure a Vibrant Future. Proceedings of the Association for Information Science and Technology, 60(1), 825-828. doi:10.1002/pra2.870
The LIS Forward initiative is addressing the urgent question: As LIS evolves within the context of iSchools, how do we best position our research and education programs to lead the field and the future of libraries? The initiative stems from the recognition that the evolution of iSchools presents opportunities and challenges for LIS and that there is great value in iSchools working together on charting directions forward. The growing coalition of iSchools is working to support LIS in taking full advantage of the multidisciplinary knowledge and expertise within iSchools, foster future leaders who will champion LIS within iSchools, and confront the dynamic tensions in research intensive iSchools. This session aims to engage international, professional, and academic stakeholders to guide activities and coalition building that can continue to strengthen LIS in iSchools. A panel will present highlights from a recent position paper to catalyze interactive, facilitated dialogue within the ASIS&T community on critical issues in LIS research and education. Breakout sessions will generate responses and recommendations to advance collaborative planning and strategy of value to LIS academic programs and the profession.
- Plantin, J., & Thomer, A. K. (2023). Platforms, programmability, and precarity: The platformization of research repositories in academic libraries. New Media & Society. doi:10.1177/14614448231176758
We investigate in this article how repository platforms change the sharing and preservation of digital objects in academic libraries. We use evidence drawn from semi-structured interviews with 31 data repository managers working at 21 universities using the product Figshare for institutions. We first show that repository managers use this platform to bring together actors, technologies, and processes usually scattered across the library to assign to them the tasks that they value less—such as data preparation or IT maintenance—and spend more time engaging in activities they appreciate—such as raising awareness of data sharing. While this platformization of data management improves their job satisfaction, we reveal how it simultaneously accentuates the outsourcing of libraries’ core mission to private actors. We eventually discuss how this platformization can deskill librarians and perpetuate precarity politics in university libraries.
- Thomer, A. K., & Rayburn, A. J. (2023). “A Patchwork of Data Systems”: Quilting as an Analytic Lens and Stabilizing Practice for Knowledge Infrastructures. Science, Technology, & Human Values, 30. doi:10.1177/01622439231175535
Museums and archives rely on databases and similar technologies to manage their collections, but even when tailor-made for memory institutions, databases require considerable adaptation to remain usable over long periods of time. To better understand how collection staff maintain and migrate databases over multiple years and decades, we talked to archivists from the US-based Archon User Collaborative and collection managers from the University of Michigan Research Museums. We found that the collection staff uses terms taken from quilting for database curation: they “tie” and “weave” a “patchwork of data systems” together. We extend their quilting metaphor as an analytical lens and show what can be gained through a shift in framing database work as a craft. We describe database curation as a process of creating a quilted infrastructure: a long-lived knowledge system that is sustained by the use of multiple “digital surfaces,” a reliance on a community of practice, intergenerational transfer of “quilts,” and by leveraging invisibility to conduct work. We argue that this nonnormative mode of computing needs better support from both software developers and administrators. We also show that although the invisibility of craft practices offers practitioners independence, it also can increase their precarity.
- Thomer, A. K., Barbieri, L., Wyngaard, J., & Swanz, S. (2023). Making Drone Data FAIR Through a Community-Developed Information Framework. Data Science Journal. doi:10.31223/x5z338
Small Uncrewed Aircraft Systems (sUAS) are an increasingly common tool for data collection in many scientific fields. However, there are few standards or best practices guiding the collection, sharing, or publication of data collected with these tools. This makes collaboration, data quality control, and reproducibility challenging. To that end, we have used iterative rounds of data modeling and user engagement to develop a Minimum Information Framework (MIF) to guide sUAS users in collecting the metadata necessary to ensure that their data is trustworthy, shareable and reusable. This paper briefly outlines our methods and the MIF itself, which includes 74 metadata terms in four classes that sUAS users should consider collecting for any given study. The MIF provides a foundation which can be used for developing standards and best practices.
- Lafia, S., Fan, L., Thomer, A. K., & Hemphill, L. (2022). Subdivisions and Crossroads: Identifying Hidden Community Structures in a Data Archive's Citation Network. Quantitative Science Studies. doi:10.1162/qss_a_00209
Data archives are an important source of high quality data in many fields, making them ideal sites to study data reuse. By studying data reuse through citation networks, we are able to learn how hidden research communities - those that use the same scientific datasets - are organized. This paper analyzes the community structure of an authoritative network of datasets cited in academic publications, which have been collected by a large, social science data archive: the Interuniversity Consortium for Political and Social Research (ICPSR). Through network analysis, we identified communities of social science datasets and fields of research connected through shared data use. We argue that communities of exclusive data reuse form subdivisions that contain valuable disciplinary resources, while datasets at a "crossroads" broadly connect research communities. Our research reveals the hidden structure of data reuse and demonstrates how interdisciplinary research communities organize around datasets as shared scientific inputs. These findings contribute new ways of describing scientific communities in order to understand the impacts of research data reuse.
- Lee, S. M., Thomer, A. K., & Lampe, C. (2022). The Use of Negative Interface Cues to Change Perceptions of Online Retributive Harassment. Proceedings of the ACM on Human-Computer Interaction, 6(CSCW2), 334:1-334:23.
- Thomer, A. K. (2022). Integrative data reuse at scientifically significant sites: Case studies at Yellowstone National Park and the La Brea Tar Pits. Journal of the Association for Information Science and Technology (JASIST). doi:10.1002/asi.24620
- Thomer, A. K., Starks, J. R., Rayburn, A., & Lenard, M. (2022). Maintaining repositories, databases, and digital collections in memory institutions: an integrative review. Proceedings of the Association for Information Science and Technology, 59, 310-323. doi:10.1002/pra2.755
- Tyler, A. R., Thomer, A. K., Akmon, D., York, J. J., Polasek, F., Lafia, S., Hemphill, L., & Yakel, E. (2022). The Craft and Coordination of Data Curation: Complicating Workflow Views of Data Science. Proceedings of the ACM on Human-Computer Interaction, 6(CSCW2), 1-29. doi:10.1145/3555139
Data curation is the process of making a dataset fit-for-use and archivable. It is critical to data-intensive science because it makes complex data pipelines possible, studies reproducible, and data reusable. Yet the complexities of the hands-on, technical, and intellectual work of data curation are frequently overlooked or downplayed. Obscuring the work of data curation not only renders the labor and contributions of data curators invisible but also hides the impact that curators' work has on the later usability, reliability, and reproducibility of data. To better understand the work and impact of data curation, we conducted a close examination of data curation at a large social science data repository, the Inter-university Consortium for Political and Social Research (ICPSR). We asked: What does curatorial work entail at ICPSR, and what work is more or less visible to different stakeholders and in different contexts? And, how is that curatorial work coordinated across the organization? We triangulated accounts of data curation from interviews and records of curation in Jira tickets to develop a rich and detailed account of curatorial work. While we identified numerous curatorial actions performed by ICPSR curators, we also found that curators rely on a number of craft practices to perform their jobs. The reality of their work practices defies the rote sequence of events implied by many life cycle or workflow models. Further, we show that craft practices are needed to enact data curation best practices and standards. The craft that goes into data curation is often invisible to end users, but it is well recognized by ICPSR curators and their supervisors. Explicitly acknowledging and supporting data curators as craftspeople is important in creating sustainable and successful curatorial infrastructures.
- Umberfield, E. E., Kardia, S. L., Jiang, Y., Thomer, A. K., & Harris, M. R. (2022). Regulations and Norms for Reuse of Residual Clinical Biospecimens and Health Data. Western journal of nursing research, 44(11), 1068-1081.
Nurse scientists are increasingly interested in conducting secondary research using real world collections of biospecimens and health data. The purposes of this scoping review are to (a) identify federal regulations and norms that bear authority or give guidance over reuse of residual clinical biospecimens and health data, (b) summarize domain experts' interpretations of permissions of such reuse, and (c) summarize key issues for interpreting regulations and norms. Final analysis included 25 manuscripts and 23 regulations and norms. This review illustrates contextual complexity for reusing residual clinical biospecimens and health data, and explores issues such as privacy, confidentiality, and deriving genetic information from biospecimens. Inconsistencies make it difficult to interpret which regulations or norms apply, or if applicable regulations or norms are congruent. Tools are necessary to support consistent, expert-informed consent processes and downstream reuse of residual clinical biospecimens and health data by nurse scientists.
- Umberfield, E. E., Stansbury, C., Ford, K., Jiang, Y., Kardia, S. L., Thomer, A. K., & Harris, M. R. (2022). Evaluating and Extending the Informed Consent Ontology for Representing Permissions from the Clinical Domain. Applied ontology, 17(2), 321-336.
The purpose of this study was to evaluate, revise, and extend the Informed Consent Ontology (ICO) for expressing clinical permissions, including reuse of residual clinical biospecimens and health data. This study followed a formative evaluation design and used a bottom-up modeling approach. Data were collected from the literature on US federal regulations and a study of clinical consent forms. Eleven federal regulations and fifteen permission-sentences from clinical consent forms were iteratively modeled to identify entities and their relationships, followed by community reflection and negotiation based on a series of predetermined evaluation questions. ICO included fifty-two classes and twelve object properties necessary when modeling, demonstrating appropriateness of extending ICO for the clinical domain. Twenty-six additional classes were imported into ICO from other ontologies, and twelve new classes were recommended for development. This work addresses a critical gap in formally representing clinical permissions, including reuse of residual clinical biospecimens and health data. It makes missing content available to the OBO Foundry, enabling use alongside other widely-adopted biomedical ontologies. ICO serves as a machine-interpretable and interoperable tool for responsible reuse of residual clinical biospecimens and health data at scale.
- Mhaidli, A., Hemphill, L., Schaub, F., Cundiff, J., & Thomer, A. K. (2021). Privacy Impact Assessments for Digital Repositories. International Journal of Digital Curation, 16(1), 21. doi:10.2218/ijdc.v15i1.753
- Umberfield, E. E., Jiang, Y., Fenton, S. H., Stansbury, C., Ford, K., Crist, K., Kardia, S. L., Thomer, A. K., & Harris, M. R. (2021). Lessons Learned for Identifying and Annotating Permissions in Clinical Consent Forms. Applied clinical informatics, 12(3), 429-435.
The lack of machine-interpretable representations of consent permissions precludes development of tools that act upon permissions across information ecosystems, at scale.
- Thomer, A. K., & Wickett, K. M. (2020). Relational data paradigms: What do we learn by taking the materiality of databases seriously?. Big Data & Society. doi:10.1177/2053951720934838
Although databases have been well-defined and thoroughly discussed in the computer science literature, the actual users of databases often have varying definitions and expectations of this essential computational infrastructure. Systems administrators and computer science textbooks may expect databases to be instantiated in a small number of technologies (e.g., relational or graph-based database management systems), but there are numerous examples of databases in non-conventional or unexpected technologies, such as spreadsheets or other assemblages of files linked through code. Consequently, we ask: How do the materialities of non-conventional databases differ from or align with the materialities of conventional relational systems? What properties of the database do the creators of these artifacts invoke in their rhetoric describing these systems—or in the data models underlying these digital objects? To answer these questions, we conducted a close analysis of four non-conventional scientific databases. By examining the materialities of information representation in each case, we show how scholarly communication regimes shape database materialities—and how information organization paradigms shape scholarly communication. These cases show abandonment of certain constraints of relational database construction alongside maintenance of some key relational data organization strategies. We discuss the implications that these relational data paradigms have for data use, preservation, and sharing, and discuss the need to support a plurality of data practices and paradigms.
- Catanach, T. A., Sweet, A. D., Nguyen, N. D., Peery, R. M., Debevec, A. H., Thomer, A. K., Owings, A. C., Boyd, B. M., Katz, A. D., Soto-Adames, F. N., & Allen, J. M. (2019). Fully automated sequence alignment methods are comparable to, and much faster than, traditional methods in large data sets: an example with hepatitis B virus. PeerJ, 7, e6142.
Aligning sequences for phylogenetic analysis (multiple sequence alignment; MSA) is an important, but increasingly computationally expensive step with the recent surge in DNA sequence data. Much of this sequence data is publicly available, but can be extremely fragmentary (i.e., a combination of full genomes and genomic fragments), which can compound the computational issues related to MSA. Traditionally, alignments are produced with automated algorithms and then checked and/or corrected "by eye" prior to phylogenetic inference. However, this manual curation is inefficient at the data scales required of modern phylogenetics and results in alignments that are not reproducible. Recently, methods have been developed for fully automating alignments of large data sets, but it is unclear if these methods produce alignments that result in compatible phylogenies when compared to more traditional alignment approaches that combined automated and manual methods. Here we use approximately 33,000 publicly available sequences from the hepatitis B virus (HBV), a globally distributed and rapidly evolving virus, to compare different alignment approaches. Using one data set comprised exclusively of whole genomes and a second that also included sequence fragments, we compared three MSA methods: (1) a purely automated approach using traditional software, (2) an automated approach including by eye manual editing, and (3) more recent fully automated approaches. To understand how these methods affect phylogenetic results, we compared resulting tree topologies based on these different alignment methods using multiple metrics. We further determined if the monophyly of existing HBV genotypes was supported in phylogenies estimated from each alignment type and under different statistical support thresholds. Traditional and fully automated alignments produced similar HBV phylogenies. Although there was variability between branch support thresholds, allowing lower support thresholds tended to result in more differences among trees. Therefore, differences between the trees could be best explained by phylogenetic uncertainty unrelated to the MSA method used. Nevertheless, automated alignment approaches did not require human intervention and were therefore considerably less time-intensive than traditional approaches. Because of this, we conclude that fully automated algorithms for MSA are fully compatible with older methods even in extremely difficult to align data sets. Additionally, we found that most HBV diagnostic genotypes did not correspond to evolutionarily-sound groups, regardless of alignment type and support threshold. This suggests there may be errors in genotype classification in the database or that HBV genotypes may need a revision.
- Silva, S. J., Barbieri, L. K., & Thomer, A. K. (2018). Observing vegetation phenology through social media. PloS one, 13(5), e0197325.
The widespread use of social media has created a valuable but underused source of data for the environmental sciences. We demonstrate the potential for images posted to the website Twitter to capture variability in vegetation phenology across United States National Parks. We process a subset of images posted to Twitter within eight U.S. National Parks, with the aim of understanding the amount of green vegetation in each image. Analysis of the relative greenness of the images shows statistically significant seasonal cycles across most National Parks at the 95% confidence level, consistent with springtime green-up and fall senescence. Additionally, these social media-derived greenness indices correlate with monthly mean satellite NDVI (r = 0.62), reinforcing the potential value these data could provide in constraining models and observing regions with limited high quality scientific monitoring.
- Palmer, C. L., Thomer, A. K., Baker, K. S., Wickett, K. M., Hendrix, C. L., Rodman, A., Sigler, S., & Fouke, B. W. (2017). Site-based data curation based on hot spring geobiology. PloS one, 12(3), e0172090.
Site-Based Data Curation (SBDC) is an approach to managing research data that prioritizes sharing and reuse of data collected at scientifically significant sites. The SBDC framework is based on geobiology research at natural hot spring sites in Yellowstone National Park as an exemplar case of high value field data in contemporary, cross-disciplinary earth systems science. Through stakeholder analysis and investigation of data artifacts, we determined that meaningful and valid reuse of digital hot spring data requires systematic documentation of sampling processes and particular contextual information about the site of data collection. We propose a Minimum Information Framework for recording the necessary metadata on sampling locations, with anchor measurements and description of the hot spring vent distinct from the outflow system, and multi-scale field photography to capture vital information about hot spring structures. The SBDC framework can serve as a global model for the collection and description of hot spring systems field data that can be readily adapted for application to the curation of data from other kinds of scientifically significant sites.
- Hill, A., Guralnick, R., Smith, A., Sallans, A., Gillespie, R., Denslow, M., Gross, J., Murrell, Z., Conyers, T., Oboyski, P., Ball, J., Thomer, A., Prys-Jones, R., de Torre, J., Kociolek, P., & Fortson, L. (2012). The notes from nature tool for unlocking biodiversity records from museum records through citizen science. ZooKeys, 219-33.
Legacy data from natural history collections contain invaluable and irreplaceable information about biodiversity in the recent past, providing a baseline for detecting change and forecasting the future of biodiversity on a human-dominated planet. However, these data are often not available in formats that facilitate use and synthesis. New approaches are needed to enhance the rates of digitization and data quality improvement. Notes from Nature provides one such novel approach by asking citizen scientists to help with transcription tasks. The initial web-based prototype of Notes from Nature is soon widely available and was developed collaboratively by biodiversity scientists, natural history collections staff, and experts in citizen science project development, programming and visualization. This project brings together digital images representing different types of biodiversity records including ledgers, herbarium sheets and pinned insects from multiple projects and natural history collections. Experts in developing web-based citizen science applications then designed and built a platform for transcribing textual data and metadata from these images. The end product is a fully open source web transcription tool built using the latest web technologies. The platform keeps volunteers engaged by initially explaining the scientific importance of the work via a short orientation, and then providing transcription "missions" of well defined scope, along with dynamic feedback, interactivity and rewards. Transcribed records, along with record-level and process metadata, are provided back to the institutions. While the tool is being developed with new users in mind, it can serve a broad range of needs from novice to trained museum specialist. Notes from Nature has the potential to speed the rate of biodiversity data being made available to a broad community of users.
- Thomer, A., Vaidya, G., Guralnick, R., Bloom, D., & Russell, L. (2012). From documents to datasets: A MediaWiki-based method of annotating and extracting species observations in century-old field notebooks. ZooKeys, 235-53.
Part diary, part scientific record, biological field notebooks often contain details necessary to understanding the location and environmental conditions existent during collecting events. Despite their clear value for (and recent use in) global change studies, the text-mining outputs from field notebooks have been idiosyncratic to specific research projects, and impossible to discover or re-use. Best practices and workflows for digitization, transcription, extraction, and integration with other sources are nascent or non-existent. In this paper, we demonstrate a workflow to generate structured outputs while also maintaining links to the original texts. The first step in this workflow was to place already digitized and transcribed field notebooks from the University of Colorado Museum of Natural History founder, Junius Henderson, on Wikisource, an open text transcription platform. Next, we created Wikisource templates to document places, dates, and taxa to facilitate annotation and wiki-linking. We then requested help from the public, through social media tools, to take advantage of volunteer efforts and energy. After three notebooks were fully annotated, content was converted into XML and annotations were extracted and cross-walked into Darwin Core compliant record sets. Finally, these recordsets were vetted, to provide valid taxon names, via a process we call "taxonomic referencing." The result is identification and mobilization of 1,068 observations from three of Henderson's thirteen notebooks and a publishable Darwin Core record set for use in other analyses. Although challenges remain, this work demonstrates a feasible approach to unlock observations from field notebooks that enhances their discovery and interoperability without losing the narrative context from which those observations are drawn. "Compose your notes as if you were writing a letter to someone a century in the future." (Perrine and Patton 2011)
Proceedings Publications
- Fan, L., Lafia, S., Wofford, M. F., Thomer, A. K., Yakel, E., & Hemphill, L. (2023). Mining Semantic Relations in Data References to Understand the Roles of Research Data in Academic Literature. In Joint Conference on Digital Libraries.
Research data serves important roles in scientific discovery and academic innovation. To appropriately assign credit for data work and to measure the value of research data, it is essential to articulate how data are actually used in research. We leveraged a combination of computational methods and human analysis to characterize different types of data use by mining semantic relations from the phrases where data are referenced in academic literature. In particular, we investigated references to data in the bibliography of a large social science data archive, the Inter-university Consortium for Political and Social Research (ICPSR). After retrieving and extracting semantic relations as subject-relation-object triples, we used rule-based methods to classify them. We then annotated samples from 11 frequent classes of data reference triples and found that they vary primarily along two dimensions of data use: proximity and function. Proximity describes the distance between the author and the data they reference (e.g., direct or indirect engagement). Function describes the role that data plays in each reference (e.g., describing interaction or providing context). These semantic relationships between authors and data reveal the ways data are used in scientific publications. Evidence of the variety of ways data are used can help stakeholders in research data curation and stewardship - including data providers, data curators, and data users - recognize the myriad ways that their investments in data sharing are realized.
- Hyunju, S., Cui, H., Vieglais, D., Mandel, D., & Thomer, A. K. (2023). Automated Metadata Enhancement for Physical Sample Record Aggregation in the iSamples Project. In 86th Annual Meeting of the Association for Information Science and Technology.
Large amounts of samples have been collected and stored by different institutions and collections across the world. However, even the most carefully curated collections can appear incomplete when aggregated. To solve this problem and support the increasing multidisciplinary science conducted on these samples, we propose a method to support the FAIRness of the aggregation by augmenting the metadata of source records. Using a pipeline that is a combination of rule-based and machine learning-based procedures, we predict the missing values of the metadata fields of 4,388,514 samples. We use these inferred fields in our user interface to improve the reusability.
- Wofford, M. F., & Thomer, A. K. (2023). Curating for Contrarian Communities: Data Practices of Anthropogenic Climate Change Skeptics. In ASIS&T Annual Meeting.
The open data movement is often touted as a sweeping strategy to democratize science, promote diverse data reuse, facilitate reproducibility, accelerate innovation, and much more. However, the potential perils of open data are seldom examined and discussed in equal measure to these promises. As we continue to invest in open data, we need to study the full spectrum of what open data facilitates in practice, which can then inform future policy and design decisions. This paper aims to address this gap by presenting an investigative digital ethnography of one contrarian community, anthropogenic climate change (ACC) skeptics, to describe how they process, analyze, preserve, and share data. Skeptics often engage in data reuse similar to conventional data reusers, albeit for unconventional purposes and with varying degrees of trust and expertise. The data practices of ACC skeptics challenge the assumption that open data is universally beneficial. These findings carry implications for data repositories and how they might curate data and design databases with this type of reuse in mind.
Presentations
- Goring, S. J., Vidana, S. D., Blois, J., Crawford, S., Nelson, J. K., Thomer, A. K., & Williams, J. W. (2022). Advancing Interdisciplinary Global Change Science Through Linked Research Services in the Neotoma Paleoecology Database. American Geophysical Union. Chicago, IL.
- Stanley, V., Ramdeen, S., Thomer, A. K., Damerow, J. E., Stall, S., & Erdmann, C. (2022). Developing Sample Citation Guidelines with AGU Publishers. American Geophysical Union. Chicago, IL.
- Thomer, A. (2022, November). Curating longitudinal natural history data through the CHANGES project. Research Symposium at University of Washington Information School. Seattle, WA.
- Huvila, I., Greenberg, J., Olle, S., Thomer, A., Trace, C., & Zhao, X. (2021). Documenting Information Processes and Practices: Paradata, Provenance Metadata, Life-Cycles and Pipelines. Annual Meeting of the Association for Information Science and Technology. Virtual.
Poster Presentations
- Rayburn, A. J., & Thomer, A. K. (2022). The craft of database curation: Taking cues from quiltmaking. iConference. Virtual.
Others
- Lafia, S., Thompson, C., Cassidy, E., Polasek, K., & Thomer, A. K. (2022). Surfacing Specimen Citations: Machine Learning, Manual Annotation, and Impact Metrics for Natural History Collections (invited). American Geophysical Union. https://agu.confex.com/agu/fm22/meetingapp.cgi/Paper/1115476