pyCatalstReader: Extracting Text and Tokenization of Technical Catalysis Science Papers

Castro, Giordanno

dc.contributor.author	Castro, Giordanno
dc.date.accessioned	2021-12-14T20:58:51Z
dc.date.available	2021-12-14T20:58:51Z
dc.date.issued	2021-12-08
dc.identifier.uri	http://hdl.handle.net/1808/32288
dc.description	This project was submitted to the graduate degree program in the Department of Electrical Engineering and Computer Science and the Graduate Faculty of the University of Kansas in partial fulfillment of the requirements for the degree of Master of Science in Computer Science.	en_US
dc.description.abstract	Catalysts are an essential and ubiquitous component of modern life, from empowering agriculture to reducing toxic emissions. There is a constant need for more and better catalysts. The catalysis research literature is immense, growing, and scattered. Natural Language Processing (NLP), a sub-field of Machine Learning (ML), offers a potential way to make full use of this valuable information automatically and to speed innovation. Although NLP has made much progress in the analysis of everyday text, its application to more technical text has been less successful. In particular, there is a dearth of tools that can appropriately extract text from the PDF files of research articles, the most common format used in the catalysis field. Therefore, this project aims to define a tool that can extract text from PDF files of catalysis science articles, which is a prerequisite to applying NLP and ML tools. We also explore the first stage of the NLP pipeline, tokenization, by objectively comparing different tokenizers on catalysis science articles.	en_US
dc.rights	Copyright 2021 Giordanno Castro	en_US
dc.subject	Catalysis
dc.subject	Chemistry
dc.subject	Data extraction
dc.subject	Machine learning
dc.subject	Natural language processing
dc.title	pyCatalstReader: Extracting Text and Tokenization of Technical Catalysis Science Papers	en_US
dc.type	Project	en_US
dc.rights.accessrights	openAccess	en_US

Files in this item

Name: Castro_2021_Masters_Project_Re ...
Size: 1.029Mb
Format: PDF

