Mining Evolutionary Data to Reveal the Layered Architecture of Protein Function

Parente, Daniel Joseph

Issue Date

2014-08-31

Author

Parente, Daniel Joseph

Publisher

University of Kansas

Format

306 pages

Type

Dissertation

Degree Level

Ph.D.

Discipline

Biochemistry & Molecular Biology

Abstract

Revolutionary advances in sequencing technology have dramatically expanded the set of known, naturally occurring protein sequences. Protein sequences arise from an evolutionary process, and during evolution proteins experience pressure to maintain and diversify their functions via mutation. Some mutations arise merely from neutral drift, but other changes enable organisms to adapt to their unique niche. Positions that are important for structure or function are expected to be mutationally constrained during evolution. To that end, many algorithms have been devised to identify mutational constraints in the evolutionary record in order to predict the location of functionally important sites. Accurate prediction of functionally important positions would have important practical implications. For example, individual humans each carry about 10,000 exomic sequence polymorphisms. Which of these are functionally and/or clinically significant? Similarly, protein engineers may target such sites for mutagenesis to derive variant functions. To detect these constraints, homologous proteins must first be sorted into protein families based on sequence similarity, which typically indicates structural and functional similarity. Protein family multiple sequence alignments (MSAs) can then be computationally analyzed to understand the family in light of its evolutionary history. MSA analyses have detected various evolutionary patterns that are thought to indicate functional significance. For example, positions that are absolutely conserved across a family are commonly inferred to play important structural or functional roles and, consequently, to be intolerant to mutation. Other analyses attempt to identify important non-conserved positions, some of which must be functionally significant for the family to evolve functional variations. One important example is “co-evolutionary” analysis, which seeks pairs of positions that vary in a coordinated manner across evolution.
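As a rough illustration of the two kinds of MSA pattern described above, the sketch below scores per-column conservation with Shannon entropy and pairwise co-variation with mutual information. The toy alignment, the function names, and the choice of these two particular scores are assumptions made for illustration only; the abstract does not state which algorithms the dissertation used.

```python
import math
from collections import Counter

# Toy alignment: hypothetical sequences, NOT LacI/GalR data.
msa = [
    "ACDA",
    "ACEA",
    "GCDG",
    "GCEG",
]

def column(msa, i):
    """Return the residues found at alignment position i."""
    return [seq[i] for seq in msa]

def entropy(symbols):
    """Shannon entropy in bits; 0 means the symbols are all identical."""
    n = len(symbols)
    return -sum((c / n) * math.log2(c / n) for c in Counter(symbols).values())

def mutual_information(msa, i, j):
    """Mutual information between positions i and j.

    High values mean the two columns vary in a coordinated manner.
    """
    a, b = column(msa, i), column(msa, j)
    return entropy(a) + entropy(b) - entropy(list(zip(a, b)))

for i in range(len(msa[0])):
    print(f"position {i}: entropy = {entropy(column(msa, i)):.2f} bits")
print("MI(0, 3) =", mutual_information(msa, 0, 3))
print("MI(0, 2) =", mutual_information(msa, 0, 2))
```

In this toy alignment, position 1 is absolutely conserved (zero entropy), positions 0 and 3 always change together (high mutual information), and positions 0 and 2 vary independently (zero mutual information). Real analyses typically also correct for phylogenetic relatedness and uneven sequence sampling, which this sketch ignores.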
MSA analyses make a number of simplifying assumptions to abstract away the full complexity of real proteins. Here, we have (1) assessed the validity of some of these assumptions, (2) investigated strategies to maximize the usefulness of existing tools in identifying functionally important positions, in light of their limitations, and (3) evaluated the ability of existing tools to identify known-significant positions. To that end, we have applied MSA analyses to the LacI/GalR bacterial transcription regulator family as our primary model system. Our studies have proceeded in three phases. First, preceding work indicated that published predictions based on a small LacI/GalR MSA fail to identify several functionally significant positions in the 18-amino acid linker of LacI/GalR proteins. We have investigated whether making better use of these tools (by expanding the set of sequences in the LacI/GalR MSA and sorting the family based on external experimental knowledge) can improve predictive accuracy. Interestingly, comparison of existing predictions to all available experimental data also suggests that, contrary to a common assumption, functionally neutral positions may be much rarer than previously thought. Second, LacI/GalR proteins exhibit substantial functional diversity, even though their structures are extremely similar. One question is: how can a common structure support high levels of functional diversity? We have used conservation and co-evolutionary analyses to determine whether (a) functionally significant positions are dictated by the tertiary structure, an assumption of most MSA analyses, or (b) the structure serves as an accommodating scaffold by permitting multiple subfamily-specific networks of functionally significant positions. Finally, alternative co-evolutionary algorithms disagree about which pairs of positions are evolutionarily linked. However, we have analyzed alternative co-evolution networks using graph theory and have observed that the eigenvector network centrality (a) improves agreement between diverse analyses, and (b) can identify functionally significant positions in protein families. Thus, eigenvector centrality may be a useful framework for interpreting co-evolution analyses. Taken together, our studies provide tools to make best use of existing MSA analyses and indicate that future tools should avoid making several common assumptions.
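To make the eigenvector centrality idea concrete, here is a minimal sketch that treats alignment positions as nodes and pairwise co-evolution links as edges, then ranks positions by centrality using power iteration. The four-position network, the unweighted symmetric adjacency matrix, and the function name are hypothetical; the abstract does not describe how the dissertation constructed or weighted its networks.

```python
import math

def eigenvector_centrality(adj, iterations=200, tol=1e-12):
    """Eigenvector centrality of an undirected graph by power iteration.

    adj is a symmetric matrix (list of lists) of non-negative edge weights.
    Each step multiplies by (A + I) rather than A; the shift leaves the
    leading eigenvector unchanged but prevents oscillation on bipartite graphs.
    """
    n = len(adj)
    x = [1.0 / math.sqrt(n)] * n
    for _ in range(iterations):
        y = [x[i] + sum(adj[i][j] * x[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(v * v for v in y))
        y = [v / norm for v in y]
        if max(abs(a - b) for a, b in zip(x, y)) < tol:
            return y
        x = y
    return x

# Hypothetical co-evolution network over four alignment positions:
# position 0 is linked to 1, 2 and 3; positions 1 and 2 are also linked.
adj = [
    [0, 1, 1, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
]

scores = eigenvector_centrality(adj)
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
print("positions ranked by centrality:", ranking)
```

A position scores highly when it is linked to other well-connected positions, so the score summarizes a position's place in the whole network instead of any single pairwise link. That property is one plausible reason a centrality ranking could agree across algorithms that disagree on individual pairs.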

URI

http://hdl.handle.net/1808/23958
