pyCatalstReader: Extracting Text and Tokenization of Technical Catalysis Science Papers
dc.contributor.author | Castro, Giordanno | |
dc.date.accessioned | 2021-12-14T20:58:51Z | |
dc.date.available | 2021-12-14T20:58:51Z | |
dc.date.issued | 2021-12-08 | |
dc.identifier.uri | http://hdl.handle.net/1808/32288 | |
dc.description | This project was submitted to the graduate degree program in the Department of Electrical Engineering and Computer Science and the Graduate Faculty of the University of Kansas in partial fulfillment of the requirements for the degree of Master of Science in Computer Science. | en_US |
dc.description.abstract | Catalysts are an essential and ubiquitous component of modern life, from empowering our agriculture to reducing toxic emissions. There is a constant need for more and better catalysts. The catalysis research literature is immense, growing, and scattered. Natural Language Processing (NLP), a sub-field of Machine Learning (ML), offers a potential solution to automatically make full use of all this valuable information and speed innovation. Even though NLP has made much progress in the analysis of everyday text, its application to more technical text has not been as successful. Specifically, there is a dearth of tools that can appropriately extract text from the PDF files of research articles, which are the most common format used in the catalysis field. Therefore, this project aims to define a tool that can extract text from PDF files of catalysis science articles, which is a prerequisite to applying NLP and ML tools. We also explore the first stage of the NLP pipeline, tokenization, by objectively comparing different tokenizers for catalysis science articles. | en_US |
dc.rights | Copyright 2021 Giordanno Castro | en_US |
dc.subject | Catalysis | |
dc.subject | Chemistry | |
dc.subject | Data extraction | |
dc.subject | Machine learning | |
dc.subject | Natural language processing | |
dc.title | pyCatalstReader: Extracting Text and Tokenization of Technical Catalysis Science Papers | en_US |
dc.type | Project | en_US |
dc.rights.accessrights | openAccess | en_US |