Genetic Algorithms and Turing Machines in Chemistry and Drug Design
The White Knight Society is a 501(c)3 organization that promotes the use of artificial intelligence in chemistry and drug design. We provide genetic algorithm and Turing machine software to help chemists design, discover and synthesize new drugs, one of the most exciting applications of artificial intelligence.
Throughout most of history, the main problem scientists were confronted with was how to obtain sufficient data. Today, the main problem is not how to gather data, but how to get rid of most of them. Only a small amount of data produced by our computerized instruments are relevant to the problem. We need to be prepared to handle large quantities of data. The only way to adapt ourselves to this flood of data is to acquire new knowledge, to rethink the methods we are using, and to adapt our skill for data handling to the new situation.
In chemistry, as in all natural sciences, we would like to learn new methods that can improve, shorten, or bring new insight into old ways of handling experimental data. Here we want to introduce you to a modern artificial intelligence method called "genetic algorithms". The genetic algorithm approach is a method for handling multivariate and multiresponse data. Consider the problem of making predictions on the binding affinity of several molecules to a protein receptor. Problems of this type cannot be solved by ab initio (theoretical) calculations; instead, they require a combination of experimental data and genetic algorithm methods.
The most important key to success in the application of genetic algorithms lies in the choice of the proper representation of information. In chemical applications, we must usually deal with relationships between chemical structure and properties. Chemical structures contain a variable number of atoms, but a requirement set by the use of neural networks is that molecules have to be represented by the same number of descriptors, irrespective of the number of atoms in the molecule. In the manual we provide along with our genetic algorithm software, we teach by example how to easily translate various chemical structures into variable-length SMILES strings that can be used by a genetic algorithm.
Fundamentals of Genetic Algorithms
Genetic algorithms very closely resemble the biological model of chromosomes and genes. Individual organisms in a genetic algorithm are generally composed of single chromosomes. These chromosomes are composed of genes. By manipulating the genes, new chromosomes are created that have different traits. These manipulations occur through cross over and mutation, just as they occur in nature. Cross over is analogous to the biological process of mating. Mutation is a way that new information can be introduced into an existing population.
In a genetic algorithm, genes represent individual components of a solution. This is a very important part of the analysis of a problem that is to be used with a genetic algorithm. To effectively solve a problem, you must determine a way to break the problem into related components, or genes. There is a finite number of genes that are used. Individual genes are not modified as the organisms evolve. It is the chromosomes that evolve by changing the order and makeup of their genes.
The steps involved in a genetic algorithm are as follows:
- 1. Create an initial population of chromosomes.
- 2. Evaluate the fitness of each chromosome that makes up the population.
- 3. Based on this fitness, select the chromosomes that will mate, or those that have the "privilege" to mate.
- 4. Cross over (or mate) the selected chromosomes and produce offspring.
- 5. Randomly mutate some of the genes of the chromosomes.
- 6. Repeat steps two through five until a new population is created.
- 7. The algorithm ends when the best solution has not changed for a preset number of generations.
Turing Machines
The concept of the Turing machine is based on the idea of a person executing a well-defined procedure by changing the contents of an unlimited paper tape, which is divided into squares, where each square contains one of a finite set of symbols. The person needs to remember one of a finite set of states and the procedure is formulated in very basic steps in the form of "If your state is 42 and the symbol you see is a '0' then replace this with a '1', move one symbol to the right, and assume state 17 as your new state."
More precisely, a Turing machine consists of:1. A TAPE which is divided into cells, one next to the other. Each cell contains a symbol from some finite alphabet. The alphabet contains a special blank symbol (here written as 'B') and one or more other symbols. The tape is assumed to be arbitrarily extendable to the left and to the right, i.e., the Turing machine is always supplied with as much tape as it needs for its computation. Cells that have not been written to before are assumed to be filled with the blank symbol. In some models the tape has a left end marked with a special symbol; the tape extends or is indefinitely extensible to the right. The symbols are sometimes referred to as colors.
2. A HEAD that can read and write symbols on the tape and move the tape left and right one (and only one) cell at a time. In some models the head moves and the tape is stationary.
3. A TABLE ("action table", or transition function) of instructions (usually quintuples or 5-tuples but sometimes 4-tuples) that, given the state the machine is currently in and the symbol it is reading on the tape tells the machine to do the following in sequence (for the 5-tuple models): (i) either erase or write a symbol, and then (ii) move the head ('L' for one step left or 'R' for one step right), and then (iii) assume the same or a new state as prescribed. In the 4-tuple models the TABLE tells the machine to (ia) erase or to write a symbol or (ib) move the head left or right, and then (ii) assume the same or a new state as prescribed, but not both actions (ia) and (ib) in the same instruction. In some models, if there is no entry in the table for the current combination of symbol and state then the machine will halt; other models require all entries to be filled.
4. A state register that stores the state of the Turing table. The number of different states is always finite and there is one special start state with which the state register is initialized. Turing defined this as a "note of instructions" to preserve the computation of the "computer" (a person) who is working in a "desultory manner":
| Tape symbol | Current state A | Current state B | Current state C | ||||||
|---|---|---|---|---|---|---|---|---|---|
| Write symbol | Move tape | Next state | Write symbol | Move tape | Next state | Write symbol | Move tape | Next state | |
| 0 | 1 | R | B | 1 | L | A | 1 | L | B |
| 1 | 1 | L | C | 1 | R | B | 1 | N | HALT |
Using Genetic Algorithms & Turing Machines for QSAR and Synthesis Planning
"QSAR" stands for quantitative structure-activity relationships. Correlating the chemical structures of drugs with their pharmacological activities is an exercise in pattern recognition that is of particular interest. Because of the large development costs of new drugs, a reliable quantitative prediction of activity before the compound is made is of great interest to synthesis laboratories. In addition, finding the correct synthetic pathway to a target compound is often one of the most challenging tasks of the medicinal chemist.
At the White Knight Society, we are developing Turing machines that process SMILES strings, which describe molecules. Consider the scheme reactant>agents>product. Using our Turing machines, we can process the strings (>agents>product, reactant>>product, reactant>agents>) to output (reactant, agents, product) as the result. Genetic algorithms are used to construct the Turing machines that accomplish this task.
The role of the medicinal chemist has remained essentially unchanged for the past 50 years. Their role is dictated by their quest for rapid and efficient methods that optimize biological activity through structural variations. This need is historically driven by the fact that, on average, approximately 10,000 compounds are prepared and evaluated for every 1 that becomes a marketable drug. According to Lehman Brothers, financial analysts estimate that the current cost of drug development is nearly $600 million. Additionally, increased development time has shortened the useful patent life in which companies can recover their costs; one-third of which are estimated to occur in the lead generation, discovery, and optimization phase. The dramatic increase in the cost of discovery resources is highlighted by the fact that a single, traditionally synthesized compound is estimated to cost $6,000 per research size sample. These factors have contributed to a paradigm shift in the way pharmaceutical research is being conducted; companies are adopting approaches which reduce costs early in the drug development process.
Supporting Genetic Algorithm & Turing Machine Software for Chemistry
The White Knight Society is a registered 501(c)3 donor-supported charity. You can support our research by donating. Our 501(c)3 status means that your donations are tax deductible. You may donate to the White Knight Society via postal mail. Our mailing address is
The White Knight Society
2625 Washington Avenue
Racine, WI 53405.
USA
Correspondance: whiteknightsociety@gmail.com
