dc.contributor.advisor | Chen, Xue-wen | |
dc.contributor.author | Wasikowski, Michael | |
dc.date.accessioned | 2009-08-31T02:40:36Z | |
dc.date.available | 2009-08-31T02:40:36Z | |
dc.date.issued | 2009-07-07 | |
dc.date.submitted | 2009 | |
dc.identifier.other | http://dissertations.umi.com/ku:10469 | |
dc.identifier.uri | http://hdl.handle.net/1808/5451 | |
dc.description.abstract | The class imbalance problem is a recently recognized challenge in machine learning. It is frequently encountered when a classifier is used to generalize on real-world application data sets, and it causes the classifier to perform sub-optimally. Researchers have rigorously studied resampling methods, new algorithms, and feature selection methods, but no studies have been conducted to understand how well these methods combat the class imbalance problem. In particular, feature selection has rarely been studied outside of text classification problems. Additionally, no studies have looked at the further problem of learning from small samples. This paper develops a new feature selection metric, Feature Assessment by Sliding Thresholds (FAST), specifically designed to handle small sample imbalanced data sets. FAST is based on the area under the receiver operating characteristic curve (AUC) generated by moving the decision boundary of a single-feature classifier, with thresholds placed using an even-bin distribution. This paper also presents a first systematic comparison of the three types of methods developed for imbalanced data classification problems and of seven feature selection metrics evaluated on small sample data sets from different applications. We evaluated the performance of these metrics using AUC and the area under the precision-recall curve (PRC). We compared each metric on its average performance across all problems and on the likelihood of its yielding the best performance on a specific problem. We also examined the performance of these metrics within each problem domain. Finally, we evaluated the efficacy of these metrics across learning algorithms to see which perform best. Our results showed that the signal-to-noise correlation coefficient (S2N) and FAST are strong candidates for feature selection in most applications. | |
dc.format.extent | 105 pages | |
dc.language.iso | en | |
dc.publisher | University of Kansas | |
dc.rights | This item is protected by copyright and unless otherwise specified the copyright of this thesis/dissertation is held by the author. | |
dc.subject | Computer science | |
dc.subject | Bioinformatics | |
dc.subject | Class imbalance problem | |
dc.subject | Feature evaluation and selection | |
dc.subject | Machine learning | |
dc.subject | Pattern recognition | |
dc.subject | Text mining | |
dc.title | Combating the Class Imbalance Problem in Small Sample Data Sets | |
dc.type | Thesis | |
dc.contributor.cmtemember | Huan, Jun | |
dc.contributor.cmtemember | Potetz, Brian | |
dc.thesis.degreeDiscipline | Electrical Engineering & Computer Science | |
dc.thesis.degreeLevel | M.S. | |
kusw.oastatus | na | |
kusw.oapolicy | This item does not meet KU Open Access policy criteria. | |
kusw.bibid | 6857594 | |
dc.rights.accessrights | openAccess | |
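The FAST metric described in the abstract can be sketched as follows. This is a minimal illustration inferred from the abstract's wording, not the author's implementation: the function name `fast_score`, the default of ten bins, and the handling of feature orientation via `max(auc, 1 - auc)` are all assumptions. Thresholds are placed at quantiles of the feature ("even-bin distribution": each bin holds roughly the same number of samples), the decision boundary of the single-feature classifier is slid across them, and the trapezoidal area under the resulting ROC curve scores the feature.

```python
import numpy as np

def fast_score(feature, labels, n_bins=10):
    """Score one feature by the AUC of a sliding-threshold classifier.

    Thresholds are quantiles of the feature values, so each of the
    n_bins bins holds roughly the same number of samples (even-bin).
    """
    # Even-bin threshold placement: n_bins + 1 quantile cut points.
    thresholds = np.quantile(feature, np.linspace(0.0, 1.0, n_bins + 1))
    pos = feature[labels == 1]
    neg = feature[labels == 0]
    # Slide the threshold from high to low; predicting "positive" when
    # feature >= t makes TPR and FPR both non-decreasing along the sweep.
    tpr = np.array([np.mean(pos >= t) for t in thresholds[::-1]])
    fpr = np.array([np.mean(neg >= t) for t in thresholds[::-1]])
    # Trapezoidal area under the (FPR, TPR) curve.
    auc = np.sum((fpr[1:] - fpr[:-1]) * (tpr[1:] + tpr[:-1]) / 2.0)
    # A feature may separate the classes in either direction.
    return max(auc, 1.0 - auc)
```

A well-separated feature scores near 1.0, while a feature whose values are identically distributed in both classes stays near 0.5, which is what lets FAST rank features for selection on imbalanced data.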