
dc.contributor.advisor: Luo, Bo
dc.contributor.author: McCollister, Caitlin
dc.date.accessioned: 2017-01-02T20:05:11Z
dc.date.available: 2017-01-02T20:05:11Z
dc.date.issued: 2016-08-31
dc.date.submitted: 2016
dc.identifier.other: http://dissertations.umi.com/ku:14757
dc.identifier.uri: http://hdl.handle.net/1808/22342
dc.description.abstract: One source of insight into the motivations of a modern human being is the text they write and post for public consumption online, in forms such as personal status updates, product reviews, or forum discussions. The task of inferring traits about an author based on their writing is often called "author profiling." One challenging aspect of author profiling in today’s world is the increasing diversity of natural languages represented on social media websites. Furthermore, the informal nature of such writing often inspires modifications to standard spelling and grammatical structure which are highly language-specific. These are some of the dilemmas that inspired a series of "shared task" competitions, in which many participants work to solve a single problem in different ways, in order to compare their methods and results. This thesis describes our submission to one author profiling shared task in which 22 teams implemented software to predict the age, gender, and certain personality traits of Twitter users based on the content of their posts to the website. We will also analyze the performance and implementation of our system compared to those of other teams, all of which were described in open-access reports. The competition organizers provided a labeled training dataset of tweets in English, Spanish, Dutch, and Italian, and evaluated the submitted software on a similar but hidden dataset. Our approach is based on applying a topic modeling algorithm to an auxiliary, unlabeled but larger collection of tweets we collected in each language, and representing tweets from the competition dataset in terms of a vector of 100 topics. We then trained a random forest classifier based on the labeled training dataset to predict the age, gender and personality traits for authors of tweets in the test set. Our software ranked in the top half of participants in English and Italian, and the top third in Dutch.
dc.format.extent: 44 pages
dc.language.iso: en
dc.publisher: University of Kansas
dc.rights: Copyright held by the author.
dc.subject: Computer science
dc.subject: author profiling
dc.subject: classification
dc.subject: multilingual data
dc.subject: personality
dc.subject: shared task
dc.subject: topic modeling
dc.title: Predicting Author Traits Through Topic Modeling of Multilingual Social Media Text
dc.type: Thesis
dc.contributor.cmtemember: Agah, Arvin
dc.contributor.cmtemember: Huan, Luke
dc.thesis.degreeDiscipline: Electrical Engineering & Computer Science
dc.thesis.degreeLevel: M.S.
dc.identifier.orcid:
dc.rights.accessrights: openAccess

