Sarah Elaine Bratt
- Assistant Professor, School of Information
- Member of the Graduate Faculty
Contact
- Richard P. Harvill Building, Rm. 409
- Tucson, AZ 85721
- sebratt@arizona.edu
Awards
- METSTI 2023 Best Paper Presentation Award
  - The ASIS&T Special Interest Group for Metrics (SIG/MET) and the ASIS&T Special Interest Group for Scientific and Technical Information (SIG STI) sponsored by the International Center for the Study of Research (ICSR), Fall 2023
- George H. Davis Faculty Travel Fellowship
  - University of Arizona Research Innovation & Impact (RII) George H. Davis Faculty Travel Fellowship, Spring 2023
- LSSTC Catalyst Fellowship, Social Science Fellow (declined)
  - LSST Corporation funded by the John Templeton Foundation, Spring 2023
Interests
Research
Research data management, data science, library and information science, science of science and innovation, data curation, long-term data preservation, social studies of science
Teaching
information organization, feminist methodologies, research methods
Courses
2024-25 Courses
- Directed Research: INFO 692 (Spring 2025)
- Honors Thesis: ECOL 498H (Spring 2025)
- Statistic Foundations Info Age: ISTA 116 (Spring 2025)
- Directed Research: INFO 692 (Fall 2024)
- Honors Thesis: ECOL 498H (Fall 2024)
- Organization/Information: LIS 515 (Fall 2024)
- Statistic Foundations Info Age: ISTA 116 (Fall 2024)
2023-24 Courses
- Directed Research: INFO 692 (Spring 2024)
- Statistic Foundations Info Age: ISTA 116 (Spring 2024)
- Directed Research: INFO 692 (Fall 2023)
- Organization/Information: LIS 515 (Fall 2023)
- Statistic Foundations Info Age: ISTA 116 (Fall 2023)
2022-23 Courses
- Capstone: INFO 698 (Spring 2023)
- Statistic Foundations Info Age: ISTA 116 (Spring 2023)
- Organization/Information: INFO 515 (Fall 2022)
- Organization/Information: LIS 515 (Fall 2022)
- Statistic Foundations Info Age: ISTA 116 (Fall 2022)
Scholarly Contributions
Journals/Publications
- Bratt, S., Leahey, E., Gomez, C., Lee, J., Kwon, Y., & Lassiter, C. (2024). Developing a Text-Based Measure of Humility in Inquiry Using Computational Grounded Theory. Proceedings of the Association for Information Science and Technology, 61(1). doi:10.1002/pra2.1119
  We describe a project in which we develop a text-based measure of humility in inquiry (HI) in the context of scholarly communication using corpora of scientific publications. The data and analytic approach we use will circumvent known concerns with self-reported data on humility levels and will be calculable on a large scale. We use a computational grounded theory approach to develop a text-based measure of HI. We draw from an annotated corpus of scientific articles in economics, psychology, and sociology (2010–2023), generating three supra-dimensions of HI (Epistemic, Rhetorical, and Transparent) and several novel sub-codes of HI. We present our initial analysis with a focus on the three dimensions of HI derived from a computational grounded theory approach. The text-based measure helps us better understand how contextual factors shape HI and contribute to mixed methods in information science research.
- Kim, H., & Bratt, S. (2024). Assessing Privacy Policies and App Settings for User Data Protection: A Data Subject-Centered Framework Analysis of TikTok in the U.S. and Europe (2023–2024). Proceedings of the Association for Information Science and Technology, 61(1). doi:10.1002/pra2.1019
  This study examines the extent to which TikTok's privacy policies and app settings in the U.S. and Europe protect the rights entailed in the data subject-centered framework. Using a case study approach, we analyze current policy documents and app settings to identify the alignment of TikTok's policies with the GDPR perspective. Our findings reveal that current policies and settings fall short in key areas. First, TikTok policies lack details related to managing and protecting sensitive data. Second, the policies neglect to discuss the responsibilities of social media companies when such data is utilized by unspecified third parties. Furthermore, there is a noticeable deficiency in the U.S. regarding detailed in-app privacy notices and setting options, especially in terms of managing location data and advertisements. Additionally, there is a need for explanations on how specific settings impact users. Lastly, a critical demand exists for default settings, including those for advertisements, to enhance data protection.
- Arora, R., Beattie, K., Bernholdt, D. E., Bratt, S. E., Godoy, W. F., Katz, D. S., Laguna, I., Maji, A. K., Mudafort, R. M., Rouson, D., Rubio-Gonzalez, C., Sukhija, N., Thakur, A. M., & Vahi, K. (2023). Giving RSEs a Larger Stage through the Better Scientific Software Fellowship. Computing in Science & Engineering, 24(5), 1-10. doi:10.1109/mcse.2023.3253847
- Bratt, S. (2023). ‘Routine Infrastructuring’: How Social Scientists Appropriate Resources to Deposit Qualitative Data to ICPSR and Implications for FAIR and CARE. Proceedings of the Association for Information Science and Technology, 60(1), 61-72. doi:10.1002/pra2.769
- Bratt, S., Langalia, M., & Nanoti, A. (2023). North-south scientific collaborations on research datasets: a longitudinal analysis of the division of labor on genomic datasets (1992-2021). Frontiers in Big Data, 6, 1054655.
  Collaborations between scientists from the global north and global south (N-S collaborations) are a key driver of the ‘fourth paradigm of science’ and have proven crucial to addressing global crises like COVID-19 and climate change. However, despite their critical role, N-S collaborations on datasets are little understood. Science of science studies tend to rely on publications and patents to examine N-S collaboration patterns. To this end, the rise of global crises requiring N-S collaborations to produce and share data presents an urgent need to understand the prevalence, dynamics, and political economy of N-S collaborations on research datasets. In this paper, we employ a mixed methods case study research approach to analyze the frequency of and division of labor in N-S collaborations on datasets submitted to GenBank over 29 years (1992-2021). We find: (1) there is a low representation of N-S collaborations over the 29-year period. When they do occur, N-S collaborations display “burstiness” patterns, suggesting that N-S collaborations on datasets are formed and maintained reactively in the wake of global health crises such as infectious disease outbreaks; (2) The division of labor between datasets and publications is disproportionate to the global south in the early years, but becomes more overlapping after 2003.
- Godoy, W. F., Arora, R., Beattie, K., Bernholdt, D. E., Bratt, S. E., Katz, D. S., Laguna, I., Maji, A. K., Thakur, A. M., & Mudafort, R. M. (2023). Giving RSEs a Larger Stage through the Better Scientific Software Fellowship. Computing in Science & Engineering.
  The Better Scientific Software Fellowship (BSSwF) was launched in 2018 to foster and promote practices, processes, and tools to improve developer productivity and software sustainability of scientific codes. The BSSwF’s vision is to grow the community with practitioners, leaders, mentors, and consultants to increase the visibility of scientific software. Over the last five years, many fellowship recipients and honorable mentions have identified as research software engineers (RSEs). Case studies from several of the program’s participants illustrate the diverse ways the BSSwF has benefited both the RSE and scientific communities. In an environment where the contributions of RSEs are too often undervalued, we believe that programs such as the BSSwF can help recognize and encourage community members to step outside of their regular commitments and expand on their work, collaborations, and ideas for a larger audience.
- Qin, J., Bratt, S., Hemsley, J., Smith, A., & Liu, Q. (2023). A FAIR Data Ecosystem for Science of Science. Proceedings of the Association for Information Science and Technology, 60(1), 1107-1109. doi:10.1002/pra2.960
- Hemsley, J., Qin, J., Bratt, S., & Smith, A. (2022). Collaboration Networks and Career Trajectories: What Do Metadata from Data Repositories Tell Us?. Proceedings of the Association for Information Science and Technology, 59(1). doi:10.1002/pra2.608
  Science is increasingly carried out through scientific collaborations, allowing researchers to pool their experience, knowledge, and skills. In this work we identify factors related to a scientist’s collaboration capacity, their ability to accumulate new collaborations over their career. To do this, we offer a new collaboration capacity framework and begin the work of validating it empirically by testing a number of hypotheses. We use data from GenBank, a cyberinfrastructure (CI)-enabled data repository that stores and manages scientific data. The data allow us to construct longitudinal networks, thereby giving us yearly scientific collaboration maps. We find that a scientist’s network position at an early stage is related to their capacity to build new collaborations and that researchers who manage an upward trend in productivity tend to have higher collaboration capacity. Our work makes a contribution to science of science studies by offering a collaboration capacity framework and providing partial empirical support for it.
- Qin, J., Hemsley, J., & Bratt, S. (2022). The structural shift and collaboration capacity in GenBank Networks: A longitudinal study. Quantitative Science Studies, 3(1). doi:10.1162/qss_a_00181
  Metadata in scientific data repositories such as GenBank contain links between data submissions and related publications. As a new data source for studying collaboration networks, metadata in data repositories compensate for the limitations of publication-based research on collaboration networks. This paper reports the findings from a GenBank metadata analytics project. We used network science methods to uncover the structures and dynamics of GenBank collaboration networks from 1992–2018. The longitudinality and large scale of this data collection allowed us to unravel the evolution history of collaboration networks and identify the trend of flattening network structures over time and optimal assortative mixing range for enhancing collaboration capacity. By incorporating metadata from the data production stage with the publication stage, we uncovered new characteristics of collaboration networks as well as developed new metrics for assessing the effectiveness of enablers of collaboration—scientific and technical human capital, cyberinfrastructure, and science policy.
- Hemsley, J., Qin, J., & Bratt, S. (2020). Data to knowledge in action: A longitudinal analysis of GenBank metadata. Proceedings of the Association for Information Science and Technology, 57(1). doi:10.1002/pra2.253
  Studies typically use publication-based authorship data to study the relationships between collaboration networks and knowledge diffusion. However, collaboration in research often starts long before publication with data production efforts. In this project we ask how collaboration in data production networks affects and contributes to knowledge diffusion, as represented by patents, another form of knowledge diffusion. We drew our data from the metadata associated with genetic sequence records stored in the National Institutes of Health's GenBank database. After constructing networks for each year and aggregating summary statistics, regressions were used to test several hypotheses. Key among our findings is that data production team size is positively related to the number of patents each year. Also, when actors on average have more links, we tend to see more patents. Our study contributes in the area of science of science by highlighting the important role of data production in the diffusion of knowledge as measured by patents.
- Zeng, T., Wu, L., Bratt, S., & Acuna, D. (2020). Assigning credit to scientific datasets using article citation networks. Journal of Informetrics, 14(2). doi:10.1016/j.joi.2020.101013
  A citation is a well-established mechanism for connecting scientific artifacts. Citation networks are used by citation analysis for a variety of reasons, prominently to give credit to scientists' work. However, because of current citation practices, scientists tend to cite only publications, leaving out other types of artifacts such as datasets. Datasets then do not get appropriate credit even though they are increasingly reused and experimented with. We develop a network flow measure, called DataRank, aimed at solving this gap. DataRank assigns a relative value to each node in the network based on how citations flow through the graph, differentiating publication and dataset flow rates. We evaluate the quality of DataRank by estimating its accuracy at predicting the usage of real datasets: web visits to GenBank and downloads of Figshare datasets. We show that DataRank is better at predicting this usage compared to alternatives while offering additional interpretable outcomes. We discuss improvements to citation behavior and algorithms to properly track and assign credit to datasets.
- Bandara, D., Velipasalar, S., Bratt, S., & Hirshfield, L. (2018). Building predictive models of emotion with functional near-infrared spectroscopy. International Journal of Human Computer Studies, 110. doi:10.1016/j.ijhcs.2017.10.001
  We demonstrate the capability of discriminating between affective states on the valence and arousal dimensions using functional near-infrared spectroscopy (fNIRS), a practical non-invasive device that benefits from its ability to localize activation in functional brain regions with spatial resolution superior to the Electroencephalograph (EEG). The high spatial resolution of fNIRS enables us to identify the neural correlates of emotion with spatial precision comparable to fMRI, but without requiring the use of the constricting and impractical fMRI scanner. We make these predictions across subjects, creating the capacity to generalize the model to new participants. We designed the experiment and evaluated our results in the context of a prior experiment—based on the same basic protocol and stimulus materials—which used EEG to measure participants’ valence and arousal. The F1-scores achieved by our classifiers suggest that fNIRS is particularly useful at distinguishing between high and low levels of valence (F1-score of 0.739), which has proven to be difficult to measure with physiological sensors.
- Bratt, S., Hemsley, J., Qin, J., & Costa, M. (2017). Big data, big metadata and quantitative study of science: A workflow model for big scientometrics. Proceedings of the Association for Information Science and Technology, 54(1). doi:10.1002/pra2.2017.14505401005
  Large cyberinfrastructure-enabled data repositories generate massive amounts of metadata, enabling big data analytics to leverage the intersection of technological and methodological advances in data science for the quantitative study of science. This paper introduces a definition of big metadata in the context of scientific data repositories and discusses the challenges in big metadata analytics due to the messiness, lack of structures suitable for analytics, and heterogeneity in such big metadata. A methodological framework is proposed, which contains conceptual and computational workflows intercepting through collaborative documentation. The workflow-based methodological framework promotes transparency and contributes to research reproducibility. The paper also describes the experience and lessons learned from a four-year big metadata project involving all aspects of the workflow-based methodologies. The methodological framework presented in this paper is a timely contribution to the field of scientometrics and the science of science and policy as the potential value of big metadata is drawing more attention from research and policy maker communities.
- Costa, M., Qin, J., & Bratt, S. (2016). Emergence of collaboration networks around large scale data repositories: a study of the genomics community using GenBank. Scientometrics, 108(1). doi:10.1007/s11192-016-1954-x
  The advent of large data repositories and the necessity of distributed skillsets have led to a need to study the scientific collaboration network emerging around cyber-infrastructure-enabled repositories. To explore the impact of scientific collaboration and large-scale repositories in the field of genomics, we analyze coauthorship patterns in NCBI's big data repository GenBank using trace metadata from coauthorship of traditional publications and coauthorship of datasets. We demonstrate that using complex network analysis to explore both networks independently and jointly provides a much richer description of the community, and addresses some of the methodological concerns discussed in previous literature regarding the use of coauthorship data to study scientific collaboration.
Proceedings Publications
- Chmielinski, G., & Bratt, S. E. (2024, April). Plant-Based Predictions: An Exploratory Predictive Analysis of Purchasing Behavior of Meat-Alternatives by U.S. Consumers (2020). In iConference 2024.
  Despite the recent increase of plant-based diets, animal-based product consumption remains a major environmental concern. Present information science research is focused on the role of consumer perception in meat-alternative purchasing behavior and its marketing implications. This paper shifts from the profitability aspect of consumer behavior and seeks to understand how external variables such as urban residency, poverty, grocery store access, household income, food expenditure, and grocery costs relate to a county's likelihood to purchase meat-alternatives. Using consumer survey responses from MRI Simmons and statistics from the U.S. government, we developed a logistic regression model, a support vector machine, and a generalized additive model to predict the likelihood of households in a county purchasing meat-alternatives. All features except for grocery access proved to be significant, positively correlated predictors. We conclude that features of physical and financial accessibility are useful in identifying, with roughly 68% accuracy, a U.S. county's tendency for purchasing meat-alternatives. This identification might further sustainability initiatives and local efforts to incentivize environmentally conscious food decisions.
- Bratt, S. E. (2023, October). "Routine Infrastructuring": How Social Scientists Appropriate Resources to Deposit Qualitative Data to ICPSR and Implications for FAIR and CARE. In Proceedings of the Association for Information Science and Technology, 60, 61-72.
  This study develops a grounded theory of how social scientists facilitate qualitative data deposit and the impacts on making data FAIR and CARE. Drawing from 15 semi‐structured interviews with U.S. academic social science faculty who deposited data to ICPSR, I take a resource‐centric perspective to address the need for theorizing scientists' use of resources to bridge the gap between underspecified, heterogeneous data practices and repository requirements. The study makes two primary contributions. First, it identifies three types of resources that social science faculty use to structure data deposit routines, namely: 1) bottom‐up, 2) top‐down, and 3) borrowed resources. Second, I import a theory from crisis informatics, ‘routine infrastructuring,’ to explain how social scientists deposit data to ICPSR. Results reveal that the resources social scientists use function as ostensive routines. I argue routine infrastructuring is not only a way to enact routines but also creates routines. Findings also show ‘in‐house’ resources have a mix of beneficial and negative impacts for data FAIR‐ and CARE‐ness. This study advances the small but growing body of literature that examines routine dynamics in research groups from a resource‐centric perspective to explain qualitative data deposit to research data repositories.
- Qin, J., Bratt, S. E., Hemsley, J., & Smith, A. O. (2023, July). Metadata Analytics: A Methodological Discussion. In International Society of Scientometrics and Informetrics (ISSI) 2023 Conference.
  Metadata Analytics is a term used to describe a research field that utilizes quantitative methods and metadata for publications, patents, datasets, and other research entities to study science of science. Metadata analytics inherits the bibliometric and scientometric tradition while infusing novel data sources – metadata for datasets – to extend the traditional bibliometric and scientometric research. The large scale of metadata from scientific data repositories offers both opportunities and challenges in the quantitative study of science. This paper discusses the problems and opportunities that metadata analytics contends with from a methodological perspective. Using the authors’ experiences over the course of a multi-year metadata analytics project, the paper focuses on the subtle differences between methods and science (or means and end) that arise when conducting research in metadata analytics and, for the same reason, bibliometrics and scientometrics. Metadata analytics is both a methodology and a research field. The intertwining of methods and science in metadata analytics can create pitfalls for researchers. Steering clearly between the means and ends in metadata analytics is essential to produce good science.
- Bratt, S., & Smith, A. O. (2022). Evolutionary Archives: The Unlikely Comparison of GenBank and Know Your Meme. In IEEE Big Data.
- Neupane, A., Saxena, N., Hirshfield, L., & Bratt, S. (2019). The Crux of Voice (In)Security: A Brain Study of Speaker Legitimacy Detection. In Network and Distributed System Security Symposium.
  A new generation of scams has emerged that uses voice impersonation to obtain sensitive information, eavesdrop over voice calls and extort money from unsuspecting human users. Research demonstrates that users are fallible to voice impersonation attacks that exploit the current advancement in speech synthesis. In this paper, we set out to elicit a deeper understanding of such human-centered “voice hacking” based on a neuro-scientific methodology (thereby corroborating and expanding the traditional behavioral-only approach in significant ways). Specifically, we investigate the neural underpinnings of voice security through functional near-infrared spectroscopy (fNIRS), a cutting-edge neuroimaging technique that captures neural signals in both temporal and spatial domains. We design and conduct an fNIRS study to pursue a thorough investigation of users’ mental processing related to speaker legitimacy detection – whether a voice sample is rendered by a target speaker, a different other human speaker or a synthesizer mimicking the speaker. We analyze the neural activity associated with this task as well as the brain areas that may control such activity. Our key insight is that there may be no statistically significant differences in the way the human brain processes the legitimate speakers vs. synthesized speakers, whereas clear differences are visible when encountering legitimate vs. different other human speakers. This finding may help to explain users’ susceptibility to synthesized attacks, as seen from the behavioral self-reported analysis. That is, the impersonated synthesized voices may seem indistinguishable from the real voices in terms of both behavioral and neural perspectives. In sharp contrast, prior studies showed subconscious neural differences in other real vs. fake artifacts (e.g., paintings and websites), despite users failing to note these differences behaviorally. Overall, our work dissects the fundamental neural patterns underlying voice-based insecurity and reveals users’ susceptibility to voice synthesis attacks at a biological level. We believe that this could be a significant insight for the security community suggesting that the human detection of voice synthesis attacks may not improve over time, especially given that voice synthesis techniques will likely continue to improve, calling for the design of careful machine-assisted techniques to help humans counter these attacks.
- Bratt, S. (2017). Toward an open data repository and meta-analysis of cognitive data using fNIRS studies of emotion. In Human Computer Interaction International (HCII).
  HCI research has increasingly incorporated the use of neurophysiological sensors to identify users’ cognitive and affective states. However, a persistent problem in machine learning on cognitive data is generalizability across participants. A proposed solution has been aggregating cognitive and survey data across studies to generate higher sample populations for machine learning and statistical analyses to converge in stable, generalizable results. In this paper, I argue that large data-sharing projects can facilitate the aggregation of results of brain imaging studies to address these issues, by smoothing noise in high-dimensional datasets. This paper contributes a small step towards large cognitive data sharing systems-design by proposing methods that facilitate the merging of currently incompatible fNIRS and fMRI datasets through term-based metadata analysis. To that end, I analyze 20 fNIRS studies of emotion using content analysis for: (1) synonym terms and definitions for ‘emotion,’ (2) the experimental stimuli, and (3) the use or non-use of self-report surveys. Results suggest that fNIRS studies of emotion have stable synonymy, using technical and folk conceptualizations of affective terms within and between publications to refer to emotion. The studies use different stimuli to elicit emotion but also show commonalities between shared use of standardized stimuli materials and self-report surveys. These similarities in conceptual synonymy and standardized experiment materials indicate promise for neuroimaging communities to establish open-data repositories based on metadata term-based analyses. This work contributes to efforts toward merging datasets across studies and between labs, unifying new modalities in neuroimaging such as fNIRS with fMRI datasets, increasing generalizability of machine learning models, and promoting the acceleration of science through open data-sharing infrastructure.
- Costa, M., & Bratt, S. (2016). Truthiness: Challenges associated with employing machine learning on neurophysiological sensor data. In Human Computer Interaction International.
  Neurophysiological sensors in HCI research are increasing in use and sophistication, largely because such sensors offer the potential benefit of providing “ground truth” in studies, and also because they are expected to underpin future adaptive systems. Sensors have shown significant promise in the efforts to develop measurements to help determine users’ mental and emotional states in real-time, allowing the system to use that information to adjust user experience. Most of the sensors used generate a substantial amount of data, a high dimensionality and volume of data that requires analysis using powerful machine learning algorithms. However, in the process of developing machine learning algorithms to make sense of the data and the subject’s mental or emotional state under experimental conditions, researchers often rely on existing and imperfect measures to provide the “ground truth” needed to train the algorithms. In this paper, we highlight the different ways in which researchers try to establish ground truth and the strengths and limitations of those approaches. The paper concludes with several suggestions and specific areas that require more discussion.
- Hirshfield, L., Costa, M., Bandara, D., & Bratt, S. (2015). Measuring situational awareness aptitude using functional near-infrared spectroscopy. In Foundations of Augmented Cognition: 9th International Conference.
  Attempts have been made to evaluate people’s situational awareness (SA) in military and civilian contexts through subjective surveys, speed, and accuracy data acquired during SA target tasks. However, it is recognized in the SA domain that more systematic measurement is necessary to assess SA theories and applications. Recent advances in biomedical engineering have enabled relatively new ways to measure cognitive and physiological state changes, such as with functional near-infrared spectroscopy (fNIRS). In this paper, we provide a literature review relating to SA and fNIRS and present an experiment conducted with an fNIRS device comparing differences in the brains between people with high and low SA aptitude. Our results suggest statistically significant differences in brain activity between the high SA group and low SA group.
- Serwadda, A., Phoha, V., Poudel, S., Hirshfield, L., Bandara, D., Bratt, S., & Costa, M. (2015). FNIRS: A new modality for brain activity-based biometric authentication. In 2015 IEEE 7th International Conference on Biometrics Theory, Applications and Systems (BTAS).
  There is a rapidly increasing amount of research on the use of brain activity patterns as a basis for biometric user verification. The vast majority of this research is based on Electroencephalogram (EEG), a technology which measures the electrical activity along the scalp. In this paper, we evaluate Functional Near-Infrared Spectroscopy (fNIRS) as an alternative approach to brain activity-based user authentication. fNIRS is centered around the measurement of light absorbed by blood and, compared to EEG, has a higher signal-to-noise ratio, is more suited for use during normal working conditions, and has a much higher spatial resolution which enables targeted measurements of specific brain regions. Based on a dataset of 50 users that was analysed using an SVM and a Naïve Bayes classifier, we show fNIRS to respectively give EERs of 0.036 and 0.046 when using our best channel configuration. Further, we present some results on the areas of the brain which demonstrated highest discriminative power. Our findings indicate that fNIRS has significant promise as a biometric authentication modality.
Presentations
- Bratt, S. E. (2023, July). Detecting Invisible Labor in Scientific Communities: The Case of GenBank. NetSci 2023 Satellite Workshop. Vienna, Austria.
- Bratt, S. E. (2023, November). Making Qualitative Data ‘Machine Readable’: How Scientists “Fit” Data to Research Data Repositories in U.S. Social Sciences Research [and Impacts on Perceived Validity]. Society for the Social Studies of Science (4S) Annual Meeting. Honolulu, Hawaii: Society for the Social Studies of Science (4S).
  This article examines whether and how scientists “fit” their data to open research data repositories and consequences for the perceived validity of qualitative data. We highlight the knowledge performances and infrastructures around qualitative data that stabilize notions of its "usefulness" across different contexts. Drawing from interviews with social scientists and genetics research faculty in U.S. academic institutions, we argue that scientists engage in “data fitting” practices to conform their data to repository requirements. When fitting practices are awkward or unnatural to scientists' routines, they can be called “contortions”: the ungainly or inappropriate actions scientists take to align datasets with repository requirements. Although the methodological, epistemic, and ethical implications of data fitting strategies can be substantial, fitting decisions are largely field-specific. Fields with low consensus, such as social sciences, lack widespread best practices on how to fit data to repository requirements – and whether to fit data to repositories at all – to make data Findable, Interoperable, Accessible and Reusable (FAIR) (Wilkinson et al., 2016), in short “machine readable” – such that they can be reused or integrated into a ‘big data collection’ (Leonelli, 2019).
- Bratt, S. E., Gomez, C. J., Lee, J., Langalia, M., Nanoti, A., & Leahey, E. E. (2023, June). Division of Labor in Data-Intensive Science: Implications for Innovation and Equity. 2nd International Conference of Science of Science & Innovation (ICSSI). Kellogg Global HUB, Northwestern University, Evanston, IL, USA: Digital Science. In this paper, we systematically analyze the international division of labor on 1.2 million datasets submitted to GenBank over 29 years (1992-2021). GenBank [1] is an international open research data repository for the genomics community hosted by NCBI – and through which the Human Genome Project was conducted and COVID-19 sequences submitted – making it an ideal site to analyze the global distribution of labor on datasets. To classify countries, we use the World Bank Income Classification [2] and a newer measure, the Scientific and Technical Capacity Index (STCI) [7], nuancing the binary of N-S. We analyzed the yearly structures and dynamics of the N-S division of labor on genomic datasets by calculating the ratio of overlap of scientists appearing as (co)contributors to the dataset and on the dataset’s associated publication(s), inferring that a higher overlap is indicative of “coreness” in flat teams [8]. Coreness indicates that the dataset submitter is more ‘core’ to the project, meaning that the technical labor on a project is drawn into the intellectual center of the study. We find: (1) Scientists from the global south tend to be listed as dataset contributors more often than global north researchers. Overlap increases overall, but there remain distinct functional roles; that is, 40 percent of scientists are only dataset contributors. This finding is surprising given prior studies reporting the lack of infrastructures to produce and curate data in low income or scientifically developing countries.
However, it could be that contribution is explained by the high frequency of N-S collaborations in genomics research on infectious diseases [5], leading to southern scientists being equipped to collect and submit datasets. (2) We identify a positive relationship between the “flatness” of a team and southern scientists being lead or last author on the publication.
Poster Presentations
- Bratt, S. E., Buchanan, S., Honick, B., & Gala, B. (2023, June). Invisible Data Communities: Detecting Scientific Communities Based on Dataset Affinity Networks. 2nd International Conference of Science of Science & Innovation (ICSSI). Kellogg Global HUB, Northwestern University, Evanston, IL, USA: Digital Science, Alfred P. Sloan Foundation, AFOSR, Northwestern Kellogg School of Management. In this paper, we analyzed patterns of communities defined outside of conventional community detection using an affinity network approach. We identify a tripartite network of links between (1) scientists co-authoring datasets, (2) taxonomic classifications, and (3) journals to surface often invisible affinity networks based on dataset properties. We use GenBank datasets’ bibliographic metadata (e.g., author names, journal name, publication title, year published) and link them to the NCBI Taxonomy database, which connects the bibliographic metadata to biological metadata about the dataset. The biological metadata describes attributes of the sequence (e.g., mRNA/DNA) with information about the organism from which the sample was taken, and the taxonomic classification of the organism. For instance, a mouse genome sequence used for an experiment on influenza would have taxonomic tags for Mus musculus and influenza. We demonstrate three novel ways to computationally reimagine scientific communities with a novel data source for studying the data-intensive scientific enterprise: GenBank repository metadata. We define communities according to the taxonomic lineage of the dataset the scientist submits to GenBank, the collaboration network on datasets, and the journal + taxon combination of the dataset submitted.
We compare these novel ways to model communities to conventional theoretical and computational approaches to community detection, and reflect on the implications for how they can inform collaboration recommendation systems, academic library collection development, and science policy.
- Bratt, S. E., Gomez, C. J., Devitt, W., Langalia, M., Lee, J., & Leahey, E. E. (2023, June). North-South Collaborations on Scientific Datasets: A Longitudinal Exploration (1992-2021). 2nd International Conference of Science of Science & Innovation. Kellogg Global HUB, Northwestern University, Evanston, IL, USA: Digital Science. In this paper, we systematically analyze the frequency of N-S collaborations on approximately 1.2 million sequences submitted to GenBank over 29 years (1992-2021). GenBank [2] is an international open research data repository for the genomics community hosted by NCBI, and in which the Human Genome Project sequences were shared and infectious disease sequences submitted (including COVID-19), making GenBank an ideal site to analyze N-S collaborations on datasets. To classify countries we use the World Bank Income Classification [4] and the Scientific and Technical Capacity Index (STCI) [11]. We find: (1) datasets are disproportionately produced by the global north, but there is a higher rate of collaborations between nations with discrepant S&T capacity on datasets over time. The preponderance of the datasets submitted are domestic collaborations, but where there are international collaborations, over 89 percent are collaborations among scientifically advanced countries. The N-S collaboration networks demonstrate “burstiness” in their formation and dissolution [5], suggesting scientific reactivity to outbreaks of infectious disease (e.g. HIV/AIDS) and ad hoc influx of resources to build capacity in southern scientists’ institutions (see Figure 1). (2) The classification indices commonly used to characterize the global north and south at a national level are incompatible, revealing a need for composite measures to nuance the N-S binary. The S&T capacity index [11] points to the need for measures that capture the multi-faceted nature of the N-S political economy [1, 7], where S&T capacity and income measures are not interchangeable.
For instance, the United Arab Emirates is classified as a High Income Country (HIC) by the World Bank income classification, but as a Scientifically Lagging Country (SLC) by the parameters of the S&T index.
- Bratt, S. E., Kingsley, S., Thomas, E., & Flores, J. (2023). Speculative Design Thinking in iSchool Education: Comparing Borges' Library of Babel and Bush's Memex to Surface Values in the Design of Organizing Systems. iConference 2023 Proceedings. Design thinking is critical in information science practice and education. However, we lack applied approaches to implement design thinking in iSchool graduate courses to elicit the values implicit in technologies. To address this gap, this poster presents a preliminary speculative design study with students that compares Vannevar Bush’s memex and Jorge Luis Borges’ Library of Babel as “visions of organizing systems” as an applied approach to implementing design thinking in iSchool education. Drawing on experiences across two iSchool courses, we describe a speculative design approach for identifying values in organizing systems. Second, we describe an analytic schema developed from design sessions that used the scenario of a “modern memex” to surface the values encoded in modern information landscapes (e.g., misinformation). We argue that analyzing the “visions of organizing systems” articulated in speculative texts enables students to identify values implicit in information technologies. We conclude with recommendations for using speculative texts in iSchool education and practice.
- Qin, J., Bratt, S. E., Hemsley, J., Smith, A., & Liu, Q. (2023). A FAIR Data Ecosystem for Science of Science. Proceedings of the Association for Information Science and Technology. This poster discusses Automated Research Workflows (ARWs) in the context of a FAIR data ecosystem for the science of science research. We offer a conceptual discussion from the point of view of information science and characteristics and expectations for designers and developers of a FAIR data ecosystem. Drawing from a 10-year data science project developing GenBank metadata workflows, we incorporate the ideas of ARWs into the FAIR data ecosystem discussion to set a broader context and increase generalizability. Researchers can use these as a guide for their data science projects to automate research workflows in the science of science domain and beyond.
Others
- Bratt, S. E., Fu, Y., & Lee, H. (2024, July). NetSci 2024 Satellite Symposium Proposal: Networks in the Science of Science. NetSci 2024. The vast amount of research articles, datasets, grant proposals, and patents produced by scientists provide a rich digital trace of the scientific ecosystem. These data are the foundation of an emerging field called the “Science of Science,” which aims to understand the evolution of science in a quantitative manner. This field has the potential to bring significant benefits in terms of scientific, technological, and educational advancement. Networks are a useful and intuitive tool for understanding the connections between various elements in the scientific community, such as scientists, institutions, journals, conferences, ideas, theories, and funding agencies. We welcome submissions that explore the use of network science in the field of Science of Science. This includes topics such as the analysis of citation networks, collaboration networks, semantic networks, knowledge graphs, time-varying graphs, hypergraphs, link prediction, graph mining, and graph embedding. In addition, we seek submissions in the developing areas of exploration concerned with the impact of diverse voices, intellectual backgrounds, and intelligences (including Artificial Intelligence) on scientific advancements.