dc.contributor.advisor | Chen, Xue-wen | |
dc.contributor.author | Wasikowski, Michael | |
dc.date.accessioned | 2009-08-31T02:40:36Z | |
dc.date.available | 2009-08-31T02:40:36Z | |
dc.date.issued | 2009-07-07 | |
dc.date.submitted | 2009 | |
dc.identifier.other | http://dissertations.umi.com/ku:10469 | |
dc.identifier.uri | http://hdl.handle.net/1808/5451 | |
dc.description.abstract | The class imbalance problem is a recently recognized challenge in machine learning. It is frequently encountered when a classifier is used to generalize on real-world application data sets, and it causes the classifier to perform sub-optimally. Researchers have rigorously studied resampling methods, new algorithms, and feature selection methods, but no studies have been conducted to understand how well these methods combat the class imbalance problem. In particular, feature selection has rarely been studied outside of text classification problems. Additionally, no studies have looked at the further problem of learning from small samples. This paper develops a new feature selection metric, Feature Assessment by Sliding Thresholds (FAST), specifically designed to handle small sample imbalanced data sets. FAST is based on the area under the receiver operating characteristic curve (AUC) generated by moving the decision boundary of a single-feature classifier, with thresholds placed using an even-bin distribution. This paper also presents a first systematic comparison of the three types of methods developed for imbalanced data classification problems and of seven feature selection metrics evaluated on small sample data sets from different applications. We evaluated the performance of these metrics using AUC and the area under the precision-recall curve (PRC). We compared each metric on its average performance across all problems and on the likelihood of its yielding the best performance on a specific problem. We also examined the performance of these metrics within each problem domain. Finally, we evaluated the efficacy of these metrics across learning algorithms to see which perform best. Our results showed that the signal-to-noise correlation coefficient (S2N) and FAST are strong candidates for feature selection in most applications. | |
dc.format.extent | 105 pages | |
dc.language.iso | en | |
dc.publisher | University of Kansas | |
dc.rights | This item is protected by copyright and unless otherwise specified the copyright of this thesis/dissertation is held by the author. | |
dc.subject | Computer science | |
dc.subject | Bioinformatics | |
dc.subject | Class imbalance problem | |
dc.subject | Feature evaluation and selection | |
dc.subject | Machine learning | |
dc.subject | Pattern recognition | |
dc.subject | Text mining | |
dc.title | Combating the Class Imbalance Problem in Small Sample Data Sets | |
dc.type | Thesis | |
dc.contributor.cmtemember | Huan, Jun | |
dc.contributor.cmtemember | Potetz, Brian | |
dc.thesis.degreeDiscipline | Electrical Engineering & Computer Science | |
dc.thesis.degreeLevel | M.S. | |
kusw.oastatus | na | |
kusw.oapolicy | This item does not meet KU Open Access policy criteria. | |
kusw.bibid | 6857594 | |
dc.rights.accessrights | openAccess | |
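The FAST metric described in the abstract can be sketched as follows. This is a minimal illustration inferred from the abstract's wording, not the author's implementation: the function name `fast_score`, the default of ten bins, and the handling of feature orientation via `max(auc, 1 - auc)` are all assumptions. Thresholds are placed at quantiles of the feature ("even-bin distribution": each bin holds roughly the same number of samples), the decision boundary of the single-feature classifier is slid across them, and the trapezoidal area under the resulting ROC curve scores the feature.

```python
import numpy as np

def fast_score(feature, labels, n_bins=10):
    """Score one feature by the AUC of a sliding-threshold classifier.

    Thresholds are quantiles of the feature values, so each of the
    n_bins bins holds roughly the same number of samples (even-bin).
    """
    # Even-bin threshold placement: n_bins + 1 quantile cut points.
    thresholds = np.quantile(feature, np.linspace(0.0, 1.0, n_bins + 1))
    pos = feature[labels == 1]
    neg = feature[labels == 0]
    # Slide the threshold from high to low; predicting "positive" when
    # feature >= t makes TPR and FPR both non-decreasing along the sweep.
    tpr = np.array([np.mean(pos >= t) for t in thresholds[::-1]])
    fpr = np.array([np.mean(neg >= t) for t in thresholds[::-1]])
    # Trapezoidal area under the (FPR, TPR) curve.
    auc = np.sum((fpr[1:] - fpr[:-1]) * (tpr[1:] + tpr[:-1]) / 2.0)
    # A feature may separate the classes in either direction.
    return max(auc, 1.0 - auc)
```

A well-separated feature scores near 1.0, while a feature whose values are identically distributed in both classes stays near 0.5, which is what lets FAST rank features for selection on imbalanced data.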