dc.contributor.advisor: Grzymala-Busse, Jerzy W
dc.contributor.author: Wu, Xun
dc.date.accessioned: 2016-01-01T22:11:58Z
dc.date.available: 2016-01-01T22:11:58Z
dc.date.issued: 2015-08-31
dc.date.submitted: 2015
dc.identifier.other: http://dissertations.umi.com/ku:14127
dc.identifier.uri: http://hdl.handle.net/1808/19410
dc.description.abstract: Discretization is a common technique for handling numerical attributes in data mining; it divides continuous values into several intervals by defining multiple thresholds. Decision tree learning algorithms, such as C4.5 and random forests, can deal with numerical attributes by applying a discretization technique and transforming them into nominal attributes based on an impurity-based criterion, such as information gain or Gini gain. However, a considerable number of distinct values inevitably fall into the same interval after discretization, so information carried by the original continuous values is lost. In this thesis, we propose a global discretization method that keeps the information within the original numerical attributes by expanding each of them into multiple nominal attributes, one for each candidate cut-point value. The discretized data set, which contains only nominal attributes, is derived from the original data set. We analyzed the problem by applying two decision tree learning algorithms (C4.5 and random forests) to each of twelve pairs of data sets (original and discretized) and evaluating the performance (prediction accuracy) of the resulting classification models in the Weka Experimenter. This was followed by two separate Wilcoxon tests (one per learning algorithm) to determine whether the performance differences between the paired data sets are statistically significant. The results of both tests indicate no clear difference in performance between the discretized data sets and the original ones, although in some cases the discretized models of both classifiers slightly outperform their paired original models.
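The expansion step described in the abstract can be illustrated with a minimal sketch. The code below assumes candidate cut-points are the midpoints between consecutive distinct sorted values (a common convention; the thesis may define them differently), and all function and attribute names (candidate_cut_points, expand_numeric_attribute, the "name<=cut" labeling) are hypothetical, not taken from the thesis.

```python
# Minimal sketch: expand one numeric attribute into one binary nominal
# attribute per candidate cut-point, instead of collapsing values into
# a single set of intervals. Cut-point convention and naming scheme are
# assumptions for illustration only.

def candidate_cut_points(values):
    """Midpoints between consecutive distinct values of a numeric attribute."""
    distinct = sorted(set(values))
    return [(a + b) / 2.0 for a, b in zip(distinct, distinct[1:])]

def expand_numeric_attribute(values, name):
    """Return one binary nominal column per candidate cut-point, so no
    two distinct original values become indistinguishable overall."""
    expanded = {}
    for cut in candidate_cut_points(values):
        attr = f"{name}<={cut:g}"  # hypothetical attribute naming scheme
        expanded[attr] = ["low" if v <= cut else "high" for v in values]
    return expanded

if __name__ == "__main__":
    # Distinct values 1, 3, 7 yield cut-points 2.0 and 5.0, hence the
    # single numeric attribute expands into two nominal attributes.
    ages = [1, 3, 7, 3]
    for attr, column in expand_numeric_attribute(ages, "age").items():
        print(attr, column)
```

Because every candidate cut-point gets its own attribute, any two distinct original values differ on at least one of the expanded columns, which is the sense in which this global expansion preserves the information a single multi-interval discretization would lose.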
dc.format.extent: 60 pages
dc.language.iso: en
dc.publisher: University of Kansas
dc.rights: Copyright held by the author.
dc.subject: Computer science
dc.subject: C4.5
dc.subject: continuous attribute
dc.subject: discretization
dc.subject: preprocessing
dc.subject: Random Forests
dc.title: A Global Discretization Approach to Handle Numerical Attributes as Preprocessing
dc.type: Thesis
dc.contributor.cmtemember: Kulkarni, Prasad A
dc.contributor.cmtemember: Yun, Heechul
dc.thesis.degreeDiscipline: Electrical Engineering & Computer Science
dc.thesis.degreeLevel: M.S.
dc.rights.accessrights: openAccess

