Show simple item record

dc.contributor.advisor: Chen, Xue-wen
dc.contributor.author: Wasikowski, Michael
dc.date.accessioned: 2009-08-31T02:40:36Z
dc.date.available: 2009-08-31T02:40:36Z
dc.date.issued: 2009-07-07
dc.date.submitted: 2009
dc.identifier.other: http://dissertations.umi.com/ku:10469
dc.identifier.uri: http://hdl.handle.net/1808/5451
dc.description.abstract: The class imbalance problem is a recent development in machine learning. It is frequently encountered when using a classifier to generalize on real-world application data sets, and it causes a classifier to perform sub-optimally. Researchers have rigorously studied resampling methods, new algorithms, and feature selection methods, but no studies have been conducted to understand how well these methods combat the class imbalance problem. In particular, feature selection has rarely been studied outside of text classification problems. Additionally, no studies have looked at the additional problem of learning from small samples. This paper develops a new feature selection metric, Feature Assessment by Sliding Thresholds (FAST), specifically designed to handle small sample imbalanced data sets. FAST is based on the area under the receiver operating characteristic curve (AUC) generated by moving the decision boundary of a single-feature classifier, with thresholds placed using an even-bin distribution. This paper also presents the first systematic comparison of the three types of methods developed for imbalanced data classification problems and of seven feature selection metrics evaluated on small sample data sets from different applications. We evaluated the performance of these metrics using AUC and the area under the precision-recall curve (PRC). We compared each metric on the average performance across all problems and on the likelihood of a metric yielding the best performance on a specific problem. We examined the performance of these metrics inside each problem domain. Finally, we evaluated the efficacy of these metrics to see which perform best across algorithms. Our results showed that the signal-to-noise correlation coefficient (S2N) and FAST are strong candidates for feature selection in most applications.
dc.format.extent: 105 pages
dc.language.iso: EN
dc.publisher: University of Kansas
dc.rights: This item is protected by copyright and unless otherwise specified the copyright of this thesis/dissertation is held by the author.
dc.subject: Computer science
dc.subject: Bioinformatics
dc.subject: Class imbalance problem
dc.subject: Feature evaluation and selection
dc.subject: Machine learning
dc.subject: Pattern recognition
dc.subject: Text mining
dc.title: Combating the Class Imbalance Problem in Small Sample Data Sets
dc.type: Thesis
dc.contributor.cmtemember: Huan, Jun
dc.contributor.cmtemember: Potetz, Brian
dc.thesis.degreeDiscipline: Electrical Engineering & Computer Science
dc.thesis.degreeLevel: M.S.
kusw.oastatus: na
kusw.oapolicy: This item does not meet KU Open Access policy criteria.
kusw.bibid: 6857594
dc.rights.accessrights: openAccess
