
MRO-W Final Reports 2008-2009

Project:

Machine Learning Algorithms for Artificial Protein Design

Student Researchers:

Wendy Hom
Rebecca Reich

Advisors:

Lisa Hellerstein
Phyllis Frankl
Jin Montclare

Institution:

Polytechnic Institute of NYU

Webpage:


http://cis.poly.edu/~amoe/mlpd/index.html





General Project Description: The design and synthesis of functional proteins capable of selectively and efficiently catalyzing particular reactions remains a challenge for protein engineers. Although there are numerous examples exploring protein structure and function, there are comparatively few successful strategies that enable the predictable design of proteins with specified activity [1]. Traditional methods of protein engineering fall into two categories: rational structure-based design [1] and directed evolution [2]. Rational design relies on knowledge of a three-dimensional structure to pinpoint residues for mutation, while directed evolution exploits the power of natural selection to identify variants with improved properties from a random or targeted library. Both methods have yielded proteins whose functional properties exceed those of the parent protein.

The ultimate goal of this research project is the design of a highly active variant of the tGcn5 protein bearing unnatural amino acids. Previous work by Prof. Montclare has shown that substituting certain amino acids with unnatural fluorinated amino acids increases the stability of the protein but inhibits its activity. We wish to find a variant of this artificial protein that remains stable but is highly active. Applying rational design to proteins bearing unnatural amino acids requires a structure of the artificial protein. Because the structure of the fluorinated protein is currently unknown, it is challenging to predict mutations that would improve function. For directed evolution in the presence of unnatural amino acids, the burden lies in synthesizing libraries and in accurate high-throughput screening of the resulting variants [3]. Often this requires the time-consuming and cost-prohibitive synthesis of multiple libraries and the screening of thousands to billions of variants. We propose an alternative approach in which we employ machine-learning algorithms [4, 5] to identify variants expected to have improved activity. These variants will be generated and experimentally tested. The results of our experiments will then provide training data for a second round of machine learning, aimed at further improving the activity of the artificial proteins.


The target protein we selected is tGcn5, a histone acetyltransferase (HAT) (Figure 1) [6]. The HAT tGcn5 acetylates histones, causing chromatin remodeling [7]. Changes in the activity of enzymes involved in maintaining the histone acetylation balance have been implicated in human disease as well as birth defects [8, 9]. Because of this link between acetylation and human disease, the design of an artificial protein capable of correcting the abnormal chromatin remodeling caused by such changes represents an attractive prospect for therapeutics.

The ultimate goal of this project is to apply machine learning to assist in the design of highly active artificial tGcn5 proteins bearing unnatural amino acids. This is being done in a multi-stage process. In the first stage, an initial set of protein variants was designed and synthesized in the lab, and the activity of each was measured. The data from this first stage serve as training data for a second stage, in which machine learning is used to identify the mutations that appear to enhance activity in order to select additional proteins to synthesize. The activity data from the augmented set can then be used as training data in a subsequent round of machine learning, and so on.
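The multi-stage process described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the mutation labels (M1 to M4), the TRUE_EFFECT table, and the simulated_assay function stand in for real lab measurements, and the per-mutation scoring rule is a deliberately simple placeholder rather than any learning algorithm used in the project.

```python
from itertools import combinations

# Hypothetical per-mutation effects, used only to simulate the lab assay.
TRUE_EFFECT = {"M1": 0.5, "M2": -0.3, "M3": 0.8, "M4": 0.1}
MUTATIONS = sorted(TRUE_EFFECT)


def simulated_assay(variant):
    """Stand-in for synthesizing a variant and measuring its activity."""
    return 1.0 + sum(TRUE_EFFECT[m] for m in variant)


def fit_mutation_scores(measured):
    """Score each mutation: mean activity with it minus mean activity without it."""
    scores = {}
    for m in MUTATIONS:
        with_m = [a for v, a in measured.items() if m in v]
        without_m = [a for v, a in measured.items() if m not in v]
        if with_m and without_m:
            scores[m] = sum(with_m) / len(with_m) - sum(without_m) / len(without_m)
        else:
            scores[m] = 0.0  # no evidence either way yet
    return scores


def run_rounds(initial, candidates, rounds, batch_size):
    """Alternate between learning from measured variants and testing new ones."""
    measured = {v: simulated_assay(v) for v in initial}
    for _ in range(rounds):
        scores = fit_mutation_scores(measured)
        untested = [v for v in candidates if v not in measured]
        # Rank untested variants by the summed scores of their mutations.
        untested.sort(key=lambda v: sum(scores[m] for m in v), reverse=True)
        for v in untested[:batch_size]:
            measured[v] = simulated_assay(v)  # augment the training set
    return measured


# Candidate pool: every non-empty combination of the four toy mutations.
candidates = [frozenset(c) for r in range(1, len(MUTATIONS) + 1)
              for c in combinations(MUTATIONS, r)]
# First stage: start from the single-mutation variants.
initial = [frozenset([m]) for m in MUTATIONS]
measured = run_rounds(initial, candidates, rounds=2, batch_size=2)
best = max(measured, key=measured.get)
print(sorted(best), round(measured[best], 2))
```

Each pass through the loop corresponds to one round: fit on the measured variants, choose a small batch of promising untested variants, measure them, and add the results to the training data for the next round.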

Results:

Classification algorithms were applied using numerous approaches to feature modeling.
As the project concludes, we have almost completed the lab work on the initial set of protein variants, along with several others having one mutation each. We have identified appropriate feature models and learning algorithms. Simple feature sets based on sequence information performed surprisingly well compared with more complex feature sets incorporating 2D and 3D information about the proteins. SVMs and Simple Logistics consistently performed better than the other classification algorithms studied. These findings will guide other members of the research team in applying machine learning when the initial data set is ready.
Detailed reports from Yan Mei Chan and Rebecca Reich are attached.
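
As an illustration of what a simple sequence-based feature set can look like, the Python sketch below encodes each variant as a binary vector with one entry per observed substitution and trains a small classifier on it. This summary does not specify the exact encodings or implementations used in the project, so the encoding here is only one plausible choice. The classifier is plain logistic regression fitted by stochastic gradient descent, used as a stand-in for the SVM and Simple Logistics classifiers named above. The parent sequence, variants, and activity labels are made-up toy data, not tGcn5 measurements.

```python
import math


def mutation_features(parent, variants):
    """Encode variants as binary vectors, one entry per (position, residue) substitution.

    Assumes every variant has the same length as the parent (substitutions only).
    """
    subs = sorted({(i, v[i]) for v in variants
                   for i in range(len(parent)) if v[i] != parent[i]})
    index = {s: k for k, s in enumerate(subs)}
    rows = []
    for v in variants:
        row = [0.0] * len(subs)
        for i, (p, c) in enumerate(zip(parent, v)):
            if c != p:
                row[index[(i, c)]] = 1.0
        rows.append(row)
    return subs, rows


def train_logistic(rows, labels, epochs=200, lr=0.5):
    """Fit logistic regression weights with plain stochastic gradient descent."""
    w = [0.0] * len(rows[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(rows, labels):
            z = b + sum(wi * xi for wi, xi in zip(w, x))
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - y  # gradient of the log loss with respect to z
            b -= lr * g
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
    return w, b


def predict_active(w, b, x):
    """Classify a feature vector as active (True) or inactive (False)."""
    return b + sum(wi * xi for wi, xi in zip(w, x)) >= 0.0


# Toy data (hypothetical): a 5-residue parent and four labeled variants.
parent = "MKTLV"
variants = ["AKTLV", "MKTLG", "AKSLV", "MKSLG"]
labels = [1, 0, 1, 0]  # 1 = active, 0 = inactive

subs, rows = mutation_features(parent, variants)
w, b = train_logistic(rows, labels)
predictions = [predict_active(w, b, x) for x in rows]
print(subs)
print(predictions)
```

The feature vector uses only which residues differ from the parent, with no structural information, which is the sense in which such a feature set is "simple".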

References

[1] M. A. Dwyer, L. L. Looger, H. W. Hellinga, Science 2004, 304, 1967-1971.
[2] L. Giver, A. Gershenson, P. O. Freskgard, F. H. Arnold, Proceedings of the National Academy of Sciences of the United States of America 1998, 95, 12809-12813.
[3] J. K. Montclare, D. A. Tirrell, Angewandte Chemie-International Edition 2006, 45, 4518-4521.
[4] J. Liao, M. K. Warmuth, S. Govindarajan, J. E. Ness, R. P. Wang, C. Gustafsson, J. Minshull, BMC Biotechnology 2007, 7, 16-35.
[5] S. A. Danzinger, J. Zeng, Y. Wang, R. K. Brachmann, R. H. Lathrop, 2007, 23, 104-114.
[6] R. Marmorstein, Journal of Molecular Biology 2001, 311, 433-444.
[7] R. Marmorstein, Cellular and Molecular Life Sciences 2001, 58, 693-703.
[8] F. Petrij, R. H. Giles, H. G. Dauwerse, J. J. Saris, R. C. M. Hennekam, M. Masuno, N. Tommerup, G. J. B. Vanommen, R. H. Goodman, D. J. M. Peters, M. H. Breuning, Nature 1995, 376, 348-351.
[9] T. Murata, R. Kurokawa, A. Krones, K. Tatsumi, M. Ishii, T. Taki, M. Masuno, H. Ohashi, M. Yanagisawa, M. G. Rosenfeld, C. K. Glass, Y. Hayashi, Human Molecular Genetics 2001, 10, 1071-1076.