Biography
I am involved in repository support and scholarly datasets. My current work focuses on the data repository (ReDATA) and UA scholarly datasets. My past major contributions include the Afghanistan Digital Collections (the largest in the world), journal services, and other digital collections. My research interests include data sciences, OCR, machine learning and its applications, metadata (XMP), and PDF (PDF/A, PDF/UA).
Degrees
- M.S. Computer Science
- University of Western Ontario, London, Ontario, Canada
- M.L.I.S. Library and Information Science
- University of Western Ontario
Awards
- IFLA 2016 National Committee Fellowship Grant
- The National Organizing Committee of the 82nd General Conference and Assembly of the International Federation of Library Associations and Institutions (IFLA), Summer 2016
- UA News: “UA Preservation Project Makes Afghanistan History Available to World”
- University of Arizona, Spring 2013
- Reflections on Nine Years in Afghanistan
- Arizona Public Media, Fall 2010
Interests
Research
Data Sciences; Machine Learning; XMP, PDF, PDF/A, PDF/UA
Teaching
Data Sciences
Courses
2024-25 Courses
- Data Mining/Discovery, INFO 523 (Fall 2024)
2023-24 Courses
- Data Mining/Discovery, INFO 523 (Fall 2023)
- Independent Study, INFO 699 (Fall 2023)
2022-23 Courses
- Data Mining/Discovery, INFO 523 (Fall 2022)
Scholarly Contributions
Chapters
- Han, Y., Alemneh, D., Donovan, B., Halbert, M., Henry, G., Hswe, P., McMillan, G., & Wang, X. (2014). Guidelines for Collecting Usage Metrics & Demonstrations of Value for ETD Programs. In Guidance Documents for Lifecycle Management of ETDs. 1230 Peachtree Street, Suite 1900, Atlanta, GA 30309: Educopia Institute.
Journals/Publications
- Han, Y. (2024). ISO 16684-4 Graphic technology — Extensible metadata platform (XMP) specification — Part 4: Use of XMP for semantic units. ISO Standard, 16. This document: a) introduces the concept of the semantic unit (SU); b) provides requirements and guidance on how to define the target resource(s) in an SU by adopting the “target” syntax from the Web Annotation Model; c) provides requirements and guidance on the extensible metadata platform (XMP) serialization syntaxes for SU. This document broadens the concept of XMP specified in ISO 16684-1 so that XMP can be used to describe an SU. A new flexible way of defining and describing SUs aims to bring innovation to textual and non-textual content, metadata, linked data, big data and artificial intelligence.
- Rychlik, M., Tanriover, B., & Han, Y. (2023). Large-scale data extraction from the UNOS organ donor documents. arXiv. In this paper we focus on three major tasks: 1) discussing our methods: Our method captures a portion of the data in DCD flowsheets, kidney perfusion data, and Flowsheet data captured peri-organ recovery surgery. 2) demonstrating the result: We built a comprehensive, analyzable database from 2022 OPTN data. This dataset is far larger than any previously available even in this preliminary phase; and 3) proving that our methods can be extended to all the past OPTN data and future data. The scope of our study is all Organ Procurement and Transplantation Network (OPTN) data of the USA organ donors since 2008. The data was not analyzable at a large scale in the past because it was captured in PDF documents known as "Attachments", whereby every donor's information was recorded into dozens of PDF documents in heterogeneous formats. To make the data analyzable, one needs to convert the content inside these PDFs to an analyzable data format, such as a standard SQL database. In this paper we will focus on 2022 OPTN data, which consists of approximately 400,000 PDF documents spanning millions of pages. The entire OPTN data covers 15 years (2008-2022). This paper assumes that readers are familiar with the content of the OPTN data.
- Han, Y., & Rychlik, M. R. (2021). Development of a Gold-standard Pashto Dataset and a Segmentation App. Information Technology and Libraries, 40(1). doi:10.6017/ital.v40i1.12553 The article aims to introduce a gold-standard Pashto dataset and a segmentation app. The Pashto dataset consists of 300 line images and corresponding Pashto text from three selected books. A line image is simply an image consisting of one text line from a scanned page. To our knowledge, this is one of the first open access datasets which directly maps line images to their corresponding text in the Pashto language. We also introduce the development of a segmentation app using textbox expanding algorithms, a different approach to OCR segmentation. The authors discuss the steps to build a Pashto dataset and develop our unique approach to segmentation. The article starts with the nature of the Pashto alphabet and its unique diacritics which require special considerations for segmentation. Needs for datasets and a few available Pashto datasets are reviewed. Criteria of selection of data sources are discussed and three books were selected by our language specialist from the Afghan Digital Repository. The authors review previous segmentation methods and introduce a new approach to segmentation for Pashto content. The segmentation app and results are discussed to show readers how to adjust variables for different books. Our unique segmentation approach uses an expanding textbox method which performs very well given the nature of the Pashto scripts. The app can also be used for Persian and other languages using the Arabic writing system. The dataset can be used for OCR training, OCR testing, and machine learning applications related to content in Pashto.
- Rychlik, M., Nwaigwe, D., Han, Y., & Murphy, D. (2020). Development of a New Image-to-text Conversion System for Pashto, Farsi and Traditional Chinese. arXiv. We report upon the results of a research and prototype building project Worldly OCR dedicated to developing new, more accurate image-to-text conversion software for several languages and writing systems. These include the cursive scripts Farsi and Pashto, and Latin cursive scripts. We also describe approaches geared towards Traditional Chinese, which is non-cursive, but features an extremely large character set of 65,000 characters. Our methodology is based on Machine Learning, especially Deep Learning, and Data Science, and is directed towards vast quantities of original documents, exceeding a billion pages. The target audience of this paper is a general audience with interest in Digital Humanities or in retrieval of accurate full-text and metadata from digital images.
- Han, Y., & Wan, X. (2018). Digitization of Textual Documents Using PDF/A. Information Technology and Libraries, 37(1).
- Han, Y. (2015). Beyond TIFF and JPEG2000: PDF/A as an OAIS submission information package container. Library Hi Tech, 33(3), 409-423. doi:10.1108/LHT-06-2015-0068
- Han, Y. (2015). Cloud storage for digital preservation: optimal uses of Amazon S3 and Glacier. Library Hi Tech, 33(2), 261-271. doi:10.1108/LHT-12-2014-0118
- Han, Y. (2014). ETD: Total Cost of Ownership - Collecting, Archiving and Providing Access. Library Management, 35(4/5), 1-10. Also appears in ETD 2013 Conference (Sep 23-26, Hong Kong).
- Han, Y. (2013). IaaS Cloud Computing Services for Libraries: Cloud Storage and Virtual Machines. OCLC Systems & Services, 29(2), 87-100.
- Han, Y. (2006). A RDF‐based digital library system. Library Hi Tech, 24(2), 234-240. doi:10.1108/07378830610669600 Purpose – To research a resource description framework (RDF) based digital library system that facilitates digital resource management and supports knowledge management for an interoperable information environment. Design/methodology/approach – The paper first introduces some of issues with metadata management and knowledge management and describes the needs for a true interoperable environment for information transferring across domains. A journal delivery application has been implemented as a concept‐proof project to demonstrate the usefulness of RDF in digital library systems. Findings – The RDF‐based digital library system at the University of Arizona Libraries provides an easy way for digital resource management by integrating other applications regardless of metadata formats and web presence. Practical implications – A journal delivery application has been running in the RDF‐based digital library system since April 2005. An electronic theses and dissertation application will be handled by the same syst...
- Han, Y., Pfander, J., & Bracke, M. S. (2005). Digitizing Rangelands: Providing Open Access to the Archives of Society for Range Management Journals. Quarterly Bulletin of IAALD, 3(4), 105-110.
Proceedings Publications
- Oro, P., Rychlik, M., Han, Y., & Menchik, D. (2024). Resilience Through Fragmentation: The Structure of the American Sociological Association in 1981, 1992, and 2000. In Annual Meeting of the American Sociological Association.
- Han, Y. (2016, August). PDF/A for Mass Digitization. In The IFLA World Library and Information Congress - 82nd IFLA General Conference and Assembly.
- Han, Y., Gillespie, T., Zhang, Q., & Subramanian, C. S. (2016, August). Identifiers and Use Case in Scientific Research. In The IFLA 2016 World Library and Information Congress - 82nd IFLA General Conference and Assembly.
Presentations
- Han, Y. (2024). Navigating Scholarly Data: UArizona’s Approach using OpenAlex for Measuring Authors’ impact. 2024 OpenAlex Virtual User Conference.
- Han, Y. (2020, December). ISO 16684-4 Draft Standard: Graphic technology — Extensible metadata platform (XMP) specification — Part 4: Use of XMP for semantic units. ISO TC 171 SC2 Working Group 12 Metadata, ISO. Presenting the working document of ISO 16684-4 as the outline and general comments received from multiple nations including the USA, Germany, and Russia.
- Han, Y., Sisco, L., Grevin, F., Johnson, D., & Rosenthol, L. (2018, September). PDF/A: Unpacking the standardization process. iPres 2018. Cambridge, MA: Harvard Library and MIT Libraries. Standards development is not a siloed process. International delegates cross physical and cultural borders to participate in online and biannual in-person meetings in an ongoing effort to advance a common understanding of archival PDF technology. ISO delegations are comprised of industry professionals and practitioners of libraries, archives, and records management. These members ensure that PDF technology standards are thoroughly vetted and documented, but also meet the needs of the archival community. In this panel, delegates from the International Organization for Standardization’s technical committee for Document Management Applications (ISO/TC 171 SC 2) will unpack: the development process for PDF/A; coordination with digital preservation specialists; concerns about PDF/A’s implementations and limitations; and the consequences of failure to participate in the process. This panel seeks to shed light on the full context of PDF standardization, fostering conversation with the preservation community about how archivists and information professionals can utilize specifications of PDF and how these communities can engage with the development process of the format. In this sense, having these discussions about PDF format in particular enables us to raise awareness and build digital preservation capacity for a variety of stakeholders, to address cultural gaps or barriers that result from using and developing one common format, and to expand on how gaining context and knowledge of technical specifications and development of even one format can inform how that format is implemented across cultures and communities, whether they’re artistic, scientific, or both.
Panel attendees can expect the following: an in-depth discussion of PDF standards development (the who, what, how); perspectives of developers, tool creators, archivists, and standards committee members regarding the intended and applied role of PDF/A in various communities (as a format, as it interacts with tools, as it’s used by users, developers, etc.); and the issues “beyond PDF,” including concerns about the PDF format and various “what if’s” about using PDF in a preservation capacity. Attendees will have a better understanding of the development, use, and technical specifications of PDF format in addition to learning how to get involved in the process and provide feedback to the ISO committee.
- Pfander, J. L., & Han, Y. (2018, January). Partnering to Make SRM Archival Content Accessible at the University of Arizona. The Society for Range Management 71st Annual Meeting. Sparks, Nevada.
- Han, Y. (2015, January). Rethinking of Digitization File Formats. University of Arizona Libraries.
- Han, Y. (2015, October). Beyond TIFF: Digitization and Born-digital File Formats. University of Arizona Libraries.
- Han, Y., & Austin, M. (2013, July). Usage Metrics for ETDs: Demonstrating the Value of Dissemination. USETDA Annual Conference. Claremont, CA: USETDA.
- Han, Y. (2012, June). Trends in Cloud Computing. 2012 American Library Association (ALA) Annual Conference. Anaheim, CA, USA. Cloud computing is having a transformative impact on how libraries approach our information systems, our services and our data. As more libraries turn to the cloud they are exploring approaches to data/metadata management, data curation and patron services. Presenters include Yan Han (University of Arizona), David Minor (San Diego Supercomputer Center of UC San Diego), Chris Tonjes (District of Columbia Public Library) and Erik Mitchell (University of Maryland).
- Han, Y., & Rawan, A. (2012, June). International Library Partnerships: Logistical and Technical Issues Relating to International Digitization Projects. 2012 American Library Association (ALA) Annual Conference. Anaheim, CA, USA.
- Han, Y., & Rawan, A. (2012, November). Inter-institutional collaboration between the University of Arizona Libraries and the Afghanistan Centre at Kabul University on preservation and digitization of a unique Afghan collection. 2012 Middle East Librarians Association conference. Denver, CO, USA.
Creative Productions
- Han, Y., Rawan, A., & Parastesh, S. (2020). Pashto Dataset. GitHub. https://github.com/yhan818/Pashto-Dataset A gold standard dataset for Pashto. The source data come from three selected books, published in 1986, 2002, and 2006 respectively, and vary in fonts, printing, and digitization quality.