Mining Evolutionary Data to Reveal the Layered Architecture of Protein Function
Parente, Daniel Joseph
University of Kansas
Biochemistry & Molecular Biology
Copyright held by the author.
Revolutionary advances in sequencing technology have dramatically expanded the set of known, naturally occurring protein sequences. Protein sequences arise from an evolutionary process, and during evolution proteins experience pressure to maintain and diversify their functions via mutation. Some mutations arise merely from neutral drift, but other changes enable organisms to adapt to their unique niche. Positions that are important for structure or function are expected to be mutationally constrained during evolution. To that end, many algorithms have been devised to identify mutational constraints in the evolutionary record in order to predict the location of functionally important sites. Accurate prediction of functionally important positions would have important practical implications. For example, individual humans each carry about 10,000 exomic sequence polymorphisms. Which of these are functionally and/or clinically significant? Similarly, protein engineers may target such sites for mutagenesis to derive variant functions. To detect these constraints, homologous proteins must first be sorted into protein families based on sequence similarity, which typically indicates structural and functional similarity. Protein family multiple sequence alignments (MSAs) can then be computationally analyzed to understand the family in light of its evolutionary history. MSA analyses have detected various evolutionary patterns that are thought to indicate functional significance. For example, positions that are absolutely conserved across a family are commonly inferred to play important structural or functional roles and, consequently, to be intolerant to mutation. Other analyses attempt to identify important non-conserved positions, some of which must be functionally significant for the family to evolve functional variations. One important example is “co-evolutionary” analysis, which seeks pairs of positions that vary in a coordinated manner across evolution.
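The two kinds of MSA analysis described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy alignment, the function names, and the choice of Shannon entropy (for conservation) and mutual information (for co-variation) are hypothetical and are not taken from the dissertation, which does not name its algorithms in this abstract.

```python
import math
from collections import Counter
from itertools import combinations

# Hypothetical toy alignment (not LacI/GalR data): one sequence per row,
# all rows the same length, no gaps.
TOY_MSA = [
    "MKAL",
    "MRAV",
    "MKGL",
    "MRGV",
]


def column(msa, index):
    """Return the residues found at one alignment position."""
    return [sequence[index] for sequence in msa]


def shannon_entropy(symbols):
    """Entropy in bits; 0 means the column is absolutely conserved."""
    total = len(symbols)
    counts = Counter(symbols)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def mutual_information(col_a, col_b):
    """MI in bits between two columns: H(A) + H(B) - H(A, B)."""
    joint = list(zip(col_a, col_b))
    return shannon_entropy(col_a) + shannon_entropy(col_b) - shannon_entropy(joint)


def conservation_profile(msa):
    """Per-position entropy for the whole alignment."""
    return [shannon_entropy(column(msa, i)) for i in range(len(msa[0]))]


def covariation_scores(msa):
    """MI for every pair of positions (i, j) with i < j."""
    length = len(msa[0])
    return {
        (i, j): mutual_information(column(msa, i), column(msa, j))
        for i, j in combinations(range(length), 2)
    }


if __name__ == "__main__":
    print(conservation_profile(TOY_MSA))
    print(covariation_scores(TOY_MSA))
```

In this toy alignment, position 0 never varies, while positions 1 and 3 always change together, so that pair receives the highest co-variation score. Real analyses would also need to handle gaps, sequence weighting, and phylogenetic bias, none of which is attempted here.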
MSA analyses make a number of simplifying assumptions to abstract away the full complexity of real proteins. Here, we have (1) assessed the validity of some of these assumptions, (2) investigated strategies to maximize the usefulness of existing tools in identifying functionally important positions, in light of their limitations, and (3) evaluated the ability of existing tools to identify positions known to be significant. To that end, we have applied MSA analyses to the LacI/GalR bacterial transcription regulator family as our primary model system. Our studies have proceeded in three phases. First, preceding work indicated that published predictions based on a small LacI/GalR MSA fail to identify several functionally significant positions in the 18-amino acid linker of LacI/GalR proteins. We have investigated whether making better use of these tools (by expanding the set of sequences in the LacI/GalR MSA and sorting the family based on external experimental knowledge) can improve predictive accuracy. Interestingly, comparison of existing predictions to all available experimental data also suggests that, contrary to a common assumption, functionally neutral positions may be much rarer than previously thought. Second, LacI/GalR proteins exhibit substantial functional diversity, even though their structures are extremely similar. One question is: how can a common structure support high levels of functional diversity? We have used conservation and co-evolutionary analyses to determine whether (a) functionally significant positions are dictated by the tertiary structure, an assumption of most MSA analyses, or (b) the structure serves as an accommodating scaffold by permitting multiple subfamily-specific networks of functionally significant positions. Finally, alternative co-evolutionary algorithms disagree about which pairs of positions are evolutionarily linked.
However, we have analyzed alternative co-evolution networks using graph theory and have observed that eigenvector centrality, a measure of each position's importance within the network, (a) improves agreement between diverse analyses, and (b) can identify functionally significant positions in protein families. Thus, eigenvector centrality may be a useful framework for interpreting co-evolution analyses. Taken together, our studies provide tools to make the best use of existing MSA analyses and indicate that future tools should avoid making several common assumptions.
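To make the idea of eigenvector centrality concrete, here is a minimal sketch that treats alignment positions as nodes and pairwise co-evolution scores as edge weights. The toy network, the function name, and the use of shifted power iteration are assumptions for illustration; the abstract does not say how centrality was computed in the dissertation.

```python
import math

# Hypothetical co-evolution network: keys are position pairs (i, j) with
# i < j, values are non-negative co-evolution scores (edge weights).
TOY_NETWORK = {
    (0, 1): 1.0,
    (0, 2): 1.0,
    (0, 3): 1.0,
    (1, 2): 0.5,
}


def eigenvector_centrality(weights, n_nodes, iterations=200, tol=1e-10):
    """Approximate the dominant eigenvector of the weighted adjacency matrix.

    Power iteration is applied to (A + I) rather than A; the shift leaves
    the eigenvectors unchanged but avoids oscillation on bipartite graphs.
    The result is normalized to unit Euclidean length.
    """
    # Build a symmetric adjacency matrix from the pair scores.
    adjacency = [[0.0] * n_nodes for _ in range(n_nodes)]
    for (i, j), weight in weights.items():
        adjacency[i][j] = weight
        adjacency[j][i] = weight

    vector = [1.0 / n_nodes] * n_nodes
    for _ in range(iterations):
        # One multiplication by (A + I).
        updated = [
            vector[i] + sum(adjacency[i][j] * vector[j] for j in range(n_nodes))
            for i in range(n_nodes)
        ]
        norm = math.sqrt(sum(x * x for x in updated))
        updated = [x / norm for x in updated]
        # Stop once the vector no longer changes appreciably.
        if max(abs(a - b) for a, b in zip(updated, vector)) < tol:
            return updated
        vector = updated
    return vector


if __name__ == "__main__":
    for position, score in enumerate(eigenvector_centrality(TOY_NETWORK, 4)):
        print(position, round(score, 3))
```

In this toy network, position 0 is linked to every other position and so receives the highest centrality, while position 3, which has a single link, receives the lowest. Because centrality summarizes a position's place in the whole network rather than any single pair score, it is plausible that it would be less sensitive to pair-level disagreements between algorithms, which is consistent with the observation reported above.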