Combating the Class Imbalance Problem in Small Sample Data Sets
Issue Date
2009-07-07
Author
Wasikowski, Michael
Publisher
University of Kansas
Format
105 pages
Type
Thesis
Degree Level
M.S.
Discipline
Electrical Engineering & Computer Science
Rights
This item is protected by copyright and unless otherwise specified the copyright of this thesis/dissertation is held by the author.
Abstract
The class imbalance problem is a recent development in machine learning. It is frequently encountered when a classifier must generalize on real-world application data sets, and it causes the classifier to perform sub-optimally. Researchers have rigorously studied resampling methods, new algorithms, and feature selection methods, but few studies have examined how well these methods combat the class imbalance problem. In particular, feature selection has rarely been studied outside of text classification problems. Additionally, no studies have looked at the compounding problem of learning from small samples. This paper develops a new feature selection metric, Feature Assessment by Sliding Thresholds (FAST), specifically designed to handle small sample imbalanced data sets. FAST is based on the area under the receiver operating characteristic curve (AUC) generated by moving the decision boundary of a single-feature classifier, with thresholds placed using an even-bin distribution. This paper also presents the first systematic comparison of the three types of methods developed for imbalanced data classification problems and of seven feature selection metrics evaluated on small sample data sets from different applications. We evaluated the performance of these metrics using AUC and the area under the precision-recall curve (PRC). We compared each metric on its average performance across all problems and on the likelihood of its yielding the best performance on a specific problem. We also examined the performance of these metrics within each problem domain. Finally, we evaluated the efficacy of these metrics to see which perform best across learning algorithms. Our results showed that the signal-to-noise correlation coefficient (S2N) and FAST are strong candidates for feature selection in most applications.
Collections
- Engineering Dissertations and Theses [1055]
- Theses [3940]