dc.contributor.advisor: Grzymala-Busse, Jerzy W
dc.contributor.author: Wu, Xun
dc.date.accessioned: 2016-01-01T22:11:58Z
dc.date.available: 2016-01-01T22:11:58Z
dc.date.issued: 2015-08-31
dc.date.submitted: 2015
dc.identifier.other: http://dissertations.umi.com/ku:14127
dc.identifier.uri: http://hdl.handle.net/1808/19410
dc.description.abstract: Discretization is a common technique for handling numerical attributes in data mining; it divides continuous values into several intervals by defining multiple thresholds. Decision tree learning algorithms, such as C4.5 and random forests, can deal with numerical attributes by applying a discretization technique and transforming them into nominal attributes based on an impurity-based criterion, such as information gain or Gini gain. However, a considerable number of distinct values inevitably fall into the same interval after discretization, so information carried by the original continuous values is lost. In this thesis, we propose a global discretization method that keeps the information within the original numerical attributes by expanding each of them into multiple nominal attributes, one for each candidate cut-point value. The discretized data set, which contains only nominal attributes, is derived from the original data set. We analyzed the problem by applying two decision tree learning algorithms (C4.5 and random forests) to each of twelve pairs of data sets (original and discretized) and evaluating the performance (prediction accuracy) of the resulting classification models in the Weka Experimenter. This was followed by two separate Wilcoxon tests (one per learning algorithm) to determine whether the performance differences between the paired data sets are statistically significant. The results of both tests indicate no clear difference in performance between the discretized data sets and the original ones, although in some cases the discretized models of both classifiers slightly outperform their paired original models.
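The expansion step described in the abstract can be illustrated with a minimal sketch. The code below assumes candidate cut-points are the midpoints between consecutive distinct sorted values (a common convention; the thesis may define them differently), and all function and attribute names (candidate_cut_points, expand_numeric_attribute, the "name<=cut" labeling) are hypothetical, not taken from the thesis.

```python
# Minimal sketch: expand one numeric attribute into one binary nominal
# attribute per candidate cut-point, instead of collapsing values into
# a single set of intervals. Cut-point convention and naming scheme are
# assumptions for illustration only.

def candidate_cut_points(values):
    """Midpoints between consecutive distinct values of a numeric attribute."""
    distinct = sorted(set(values))
    return [(a + b) / 2.0 for a, b in zip(distinct, distinct[1:])]

def expand_numeric_attribute(values, name):
    """Return one binary nominal column per candidate cut-point, so no
    two distinct original values become indistinguishable overall."""
    expanded = {}
    for cut in candidate_cut_points(values):
        attr = f"{name}<={cut:g}"  # hypothetical attribute naming scheme
        expanded[attr] = ["low" if v <= cut else "high" for v in values]
    return expanded

if __name__ == "__main__":
    # Distinct values 1, 3, 7 yield cut-points 2.0 and 5.0, hence the
    # single numeric attribute expands into two nominal attributes.
    ages = [1, 3, 7, 3]
    for attr, column in expand_numeric_attribute(ages, "age").items():
        print(attr, column)
```

Because every candidate cut-point gets its own attribute, any two distinct original values differ on at least one of the expanded columns, which is the sense in which this global expansion preserves the information a single multi-interval discretization would lose.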
dc.format.extent: 60 pages
dc.language.iso: en
dc.publisher: University of Kansas
dc.rights: Copyright held by the author.
dc.subject: Computer science
dc.subject: C4.5
dc.subject: continuous attribute
dc.subject: discretization
dc.subject: preprocessing
dc.subject: Random Forests
dc.title: A Global Discretization Approach to Handle Numerical Attributes as Preprocessing
dc.type: Thesis
dc.contributor.cmtemember: Kulkarni, Prasad A
dc.contributor.cmtemember: Yun, Heechul
dc.thesis.degreeDiscipline: Electrical Engineering & Computer Science
dc.thesis.degreeLevel: M.S.
dc.rights.accessrights: openAccess

