A MACHINE LEARNING APPROACH TO QUERY TIME-SERIES MICROARRAY DATA SETS FOR FUNCTIONALLY RELATED GENES USING HIDDEN MARKOV MODELS

Senf, Alexander J.

dc.contributor.advisor	Chen, Xue-wen
dc.contributor.author	Senf, Alexander J.
dc.date.accessioned	2011-06-21T16:22:02Z
dc.date.available	2011-06-21T16:22:02Z
dc.date.issued	2011-02-04
dc.date.submitted	2011
dc.identifier.other	http://dissertations.umi.com/ku:11286
dc.identifier.uri	http://hdl.handle.net/1808/7639
dc.description.abstract	Microarray technology captures the rate of expression of genes under varying experimental conditions. Genes encode the information necessary to build proteins; proteins used by cellular functions exhibit higher rates of expression for the associated genes. If multiple proteins are required for a particular function, then their genes show a pattern of coexpression during time periods when the function is active within a cell. Cellular functions are generally complex and require groups of genes to cooperate; these groups of genes are called functional modules. Modular organization of genetic functions has been evident since 1999. Detecting functionally related genes in a genome and detecting all genes belonging to particular functional modules are current research topics in this field. The number of microarray gene expression data sets available in public repositories is increasing rapidly, and advances in technology have now made it feasible to routinely perform whole-genome studies in which the behavior of every gene in a genome is captured. This promises a wealth of biological and medical information, but making this amount of data accessible to researchers requires intelligent and efficient computational algorithms. Researchers working on specific cellular functions would benefit from this data if it were possible to quickly extract information useful to their area of research. This dissertation develops a machine learning algorithm that allows one or multiple microarray data sets to be queried with a set of known and functionally related input genes in order to detect additional genes participating in the same or closely related functions. The focus is on time-series microarray data sets, where gene expression values are obtained from the same experiment over a period of time from a series of sequential measurements. A feature selection algorithm selects relevant time steps where the provided input genes exhibit correlated expression behavior.
Time steps are the columns of a microarray data set; rows list individual genes. A specific linear Hidden Markov Model (HMM) is then constructed to contain one hidden state for each of the selected time steps and is trained using the expression values of the input genes from the microarray. Given the trained HMM, the probability that a sequence of gene expression values was generated by that particular HMM can be calculated. This allows for the assignment of a probability score to each gene in the microarray. High-scoring genes are included in the result set of genes with functional similarities to the input genes. P-values can be calculated by repeating this algorithm to train multiple individual HMMs using randomly selected genes as input genes and calculating a Parzen Density Function (PDF) from the probability scores of all HMMs for each gene. A feedback loop uses the result generated from one algorithm run as the input set for another iteration of the algorithm. This iterated HMM algorithm allows for the characterization of functional modules from very small input sets and for weak similarity signals. The algorithm also allows for the integration of multiple microarray data sets; two approaches are studied: Meta-Analysis (combination of the results from individual data set runs) and the extension of the linear HMM across multiple individual data sets. Results indicate that Meta-Analysis works best for the integration of closely related microarrays and that a spanning HMM works best for the integration of multiple heterogeneous data sets. The performance of this approach is demonstrated relative to the published literature on a number of widely used synthetic data sets. Biological application is verified by analyzing biological data sets of the fruit fly D. melanogaster and baker's yeast S. cerevisiae.
The algorithm developed in this dissertation is better able to detect functionally related genes in common data sets than currently available algorithms in the published literature.
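The pipeline outlined in the abstract (select time steps, train a linear HMM on the input genes, score every gene, estimate p-values from a Parzen density) can be illustrated with a minimal sketch. It is not the dissertation's implementation. All function names and parameters are hypothetical, and three simplifying assumptions are made: the feature selection is replaced by a low-spread stand-in, each hidden state has a Gaussian emission, and the chain moves deterministically from one state to the next, so that the HMM likelihood reduces to a product of per-state emission densities.

```python
from math import erfc, sqrt

import numpy as np


def select_time_steps(expr, input_idx, k):
    """Stand-in for feature selection (assumption): keep the k columns
    in which the input genes have the smallest standard deviation."""
    spread = expr[input_idx].std(axis=0)
    return np.sort(np.argsort(spread)[:k])


def train_linear_hmm(expr, input_idx, steps, min_sd=1e-3):
    """Fit one Gaussian emission (mean, sd) per selected time step,
    using only the rows of the input genes."""
    sub = expr[np.ix_(input_idx, steps)]
    mu = sub.mean(axis=0)
    sd = np.maximum(sub.std(axis=0), min_sd)  # floor avoids division by zero
    return mu, sd


def log_score(expr, steps, model):
    """Log-likelihood of every gene under the model. With deterministic
    left-to-right transitions (assumption) this is the sum of the
    per-state Gaussian log-densities."""
    mu, sd = model
    z = (expr[:, steps] - mu) / sd
    return (-0.5 * z**2 - np.log(sd) - 0.5 * np.log(2 * np.pi)).sum(axis=1)


def parzen_p_value(observed, null_scores, bandwidth):
    """Upper-tail probability of `observed` under a Gaussian-kernel
    Parzen estimate built from scores of randomly seeded models."""
    tails = [0.5 * erfc((observed - s) / (bandwidth * sqrt(2)))
             for s in null_scores]
    return float(np.mean(tails))
```

In use, `expr` would be a genes-by-time-steps array, `input_idx` the row indices of the query genes, and `null_scores` the scores a gene receives from models trained on randomly chosen input genes; the kernel bandwidth is a free parameter that the abstract does not specify.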
dc.format.extent	132 pages
dc.language.iso	en
dc.publisher	University of Kansas
dc.rights	This item is protected by copyright and unless otherwise specified the copyright of this thesis/dissertation is held by the author.
dc.subject	Bioinformatics
dc.subject	Computer science
dc.subject	Hidden Markov model
dc.subject	HMM
dc.subject	Machine learning
dc.subject	Microarray
dc.title	A MACHINE LEARNING APPROACH TO QUERY TIME-SERIES MICROARRAY DATA SETS FOR FUNCTIONALLY RELATED GENES USING HIDDEN MARKOV MODELS
dc.type	Dissertation
dc.contributor.cmtemember	Huan, Jun
dc.contributor.cmtemember	Miller, James
dc.contributor.cmtemember	Agah, Arvin
dc.contributor.cmtemember	Vakser, Ilya
dc.thesis.degreeDiscipline	Electrical Engineering & Computer Science
dc.thesis.degreeLevel	Ph.D.
kusw.oastatus	na
dc.identifier.orcid	https://orcid.org/0000-0001-6910-7306
kusw.oapolicy	This item does not meet KU Open Access policy criteria.
kusw.bibid	7642722
dc.rights.accessrights	openAccess

Files in this item

Name: Senf_ku_0099D_11286_DATA_1.pdf
Size: 1.465Mb
Format: PDF
