Loading...
New Probabilistic Techniques for Classification Problems and an Application
Tan, Yi
Tan, Yi
Citations
Altmetric:
Abstract
The main focus of this dissertation is to develop new machine learning and statistical methodologies for classification problems, with a real–life application in healthcare. The dissertation has three chapters. In the first chapter, we examine the construction of hybrid logistic regression–naïve Bayes model, a restricted Bayesian network classifier that combines two probabilistic models in a graphical way, with the aim of combining the strengths of both models. We follow the strategy of balancing the tradeoff between model bias and variance with the objective of minimizing the sum of these two errors. Specifically, we use training set size as a proxy for model variance and conditional dependence among features as a proxy for model bias. Experimental results show that, the resulting hybrid logistic regression–naïve Bayes model is a competitive alternative to a variety of state-of-the-art classifiers. In the second chapter, we focus on a regularization method, which is a technique of adding information to the learning algorithm to improve the estimation of the model. Most of the existing regularization methods (e.g., lasso) rely on sparsity assumption, which reduces a model’s variance by shrinking its coefficients towards zero. One limitation of lasso is that, in practice, sparsity assumption is often violated. Shrinking the coefficients of influential predictors towards zero introduces bias, and make the regression estimates suboptimal. As a consequence, lasso may not perform well when the training set size is relatively large as compared to the number of parameters to be estimated. We argue that for such a situation, shrinking the coefficients towards a low-variance data driven estimate could be a better strategy. For classification purposes, we propose a naïve Bayes regularized logistic regression, which shrinks its coefficients towards naïve Bayes estimates, a well-known low variance estimator, instead of zero. This method is driven by the fact that naïve Bayes and logistic regression converge toward identical classifiers if the naïve Bayes’ conditional independence assumptions hold. Simulation and experimental results suggest that this method is highly competitive with a variety of state-of-the-art classifiers. In the third chapter, we are collaborating with the U.S. Veterans Affairs’ (VA) Eastern Kansas Health Care System, to help them construct a clinical model that can assist doctors in predicting and diagnosing the post-traumatic stress disorder (PTSD). This study is motivated by the need to provide more efficient service process of VA hospitals and reduce veterans’ waiting time. Specifically, we propose a sparsity-enforcing l1 penalized Bayesian network-based model by addressing three clinical challenges presented in veteran PTSD prediction problem: 1. probabilistic classification, 2. large amount of missing data, and 3. high dimensional search space. The proposed model provides better prediction in veterans’ likelihood of suffering from PTSD as compared with a variety of state-of-art probabilistic classifiers. In addition, our model identifies eight variables which provide the most directly predictive power.
Description
Date
2020-05-31
Journal Title
Journal ISSN
Volume Title
Publisher
University of Kansas
Collections
Research Projects
Organizational Units
Journal Issue
Keywords
Business administration, Statistics, Business Analytics, Healthcare Analytics, Machine Learning, Probabilistic Classification