Building artificial intelligence for drug discovery

Drug discovery is perhaps the most exciting potential use case for Artificial Intelligence (AI). While many fear the inevitable foray of AI into our lives, there are many ways AI can improve our world for humans rather than wipe us off the earth.

One way is to help us create new treatments for diseases. AI-assisted discovery of new drugs is already something more than 43 companies have been working on for a few years now.

Unassisted (non-AI) drug discovery is a time-consuming process that can take over 10 years and as much as $2.5B to get a new drug to market. It can be a tedious process of testing thousands of compounds knowing that the likelihood of any of them becoming a profitable drug is small. But what if we could zero in on a small list of compounds that are much more likely to be successful?

AI can help with exactly that. We can create a model that predicts a compound’s potential and run it on millions of compounds to find the most promising options to test in a wet lab. But with an estimated 10⁶⁰ possible compounds, we still have a lot of compounds to evaluate. (To better conceptualize this number, realize that 1 million is just 10⁶.)

So where do we get started? One technique is “virtual screening”. One use of it is to predict whether existing drugs can be repurposed. How does this work?

Create a database of compounds with their known properties, such as redox potential, solubility, toxicity, and the 2D or 3D structure of the compound. Then label each compound with the disease it is effective against.
Create a model that accepts these properties as features and predicts which diseases a compound is effective against.
Run every compound in the database through the model to see whether any are predicted to be effective against diseases other than the one they are already approved for.
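
To make the three steps above concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the compound names, the feature values, and the disease labels are invented, and the nearest-centroid "model" is just a stand-in for whatever real model a team would train.

```python
from math import dist

# Step 1: a hypothetical toy database.
# Each row is (name, [redox_potential, solubility, toxicity], approved_disease).
# All names and numbers are invented for illustration only.
COMPOUNDS = [
    ("compound_a", [0.10, 0.80, 0.20], "disease_x"),
    ("compound_b", [0.15, 0.75, 0.25], "disease_x"),
    ("compound_c", [0.90, 0.20, 0.30], "disease_y"),
    ("compound_d", [0.85, 0.25, 0.35], "disease_y"),
    ("compound_e", [0.80, 0.30, 0.30], "disease_x"),  # approved for x, resembles the y group
]


def fit_centroids(rows):
    """Step 2: 'train' a nearest-centroid model (one mean feature vector per disease)."""
    sums, counts = {}, {}
    for _, features, disease in rows:
        acc = sums.setdefault(disease, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[disease] = counts.get(disease, 0) + 1
    return {d: [v / counts[d] for v in acc] for d, acc in sums.items()}


def predict(centroids, features):
    """Return the disease whose centroid is closest to the given feature vector."""
    return min(centroids, key=lambda d: dist(centroids[d], features))


def repurposing_candidates(rows):
    """Step 3: list compounds predicted to work on a disease other than their label."""
    centroids = fit_centroids(rows)
    candidates = []
    for name, features, approved in rows:
        predicted = predict(centroids, features)
        if predicted != approved:
            candidates.append((name, approved, predicted))
    return candidates


if __name__ == "__main__":
    for name, approved, predicted in repurposing_candidates(COMPOUNDS):
        print(f"{name}: approved for {approved}, predicted for {predicted}")
```

In this toy data only compound_e is flagged, because its features sit closer to the disease_y group than to the group it is labeled with. A real pipeline would need far more data, a held-out evaluation, and wet-lab confirmation.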

This has been useful in some cases, and it is easier to get FDA approval for what is currently an off-label use of an existing drug, since its toxicity is already deemed tolerable. However, remember that 10⁶⁰ number? How do we find new compounds that are helpful?

Well, we could tag new compounds and run those through the model. This is very time-consuming, and we would likely want to select compounds that are similar to existing successful compounds (like antibiotics) to increase the probability of success.

It might make sense that a similar compound would work well but what about the other compounds? There could be some very different compounds that have never been tested before and could drastically improve treatment.

So, how do we solve this? Well, some scientists started thinking truly outside the box. They turned the whole process inside out. Instead of taking compounds as input and properties as predicted output, they decided to predict the compounds themselves directly. This is called Generative AI.

You’ve probably seen Generative AI before. For years we’ve used it to generate stories, poems, emails, etc. (Autocomplete in your favorite email app or word processor is common.) More recently you’ve probably seen it generate images or even short videos. In the image below we see the text prompt “A robot couple fine dining with Eiffel Tower in the background.” and an impressive generated image.

A sample image generated with AI by Imagen — Google Research, Brain Team.

When generating compounds, the solution is more like NLP’s Generative AI. We start with our own language: SMILES (Simplified Molecular-Input Line-Entry System), a notation that writes a molecule’s structure as a line of text.

If we treat SMILES like a language with the compounds being sentences, then tokenize it as we do in standard transformer models, we can predict the tokens that will make up a new “sentence” that has the same meaning. So, how well did this work? Well, it didn’t. But the problem was that the SMILES representation didn’t embed all of the meaning of the components of the compound.
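
To show what "tokenizing SMILES like a sentence" can look like, here is a small sketch. The regular expression is my own simplified assumption, not a complete SMILES grammar and not necessarily what any particular research team used. It splits a SMILES string into atoms, bonds, branches, and ring-closure digits.

```python
import re

# Assumed, simplified token pattern (illustrative only, not a full SMILES grammar).
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]"        # bracket atoms, e.g. [NH4+] or [C@@H]
    r"|Br|Cl"            # two-letter elements must be tried before single letters
    r"|[BCNOPSFI]"       # one-letter aliphatic atoms
    r"|[bcnops]"         # aromatic atoms
    r"|%\d{2}"           # two-digit ring closures, e.g. %12
    r"|\d"               # single-digit ring closures
    r"|[=#$:/\\\-.()*]"  # bonds, branch parentheses, dot, wildcard
)


def tokenize(smiles):
    """Split a SMILES string into tokens; fail loudly if anything is left over."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognized characters in SMILES string: {smiles!r}")
    return tokens


if __name__ == "__main__":
    # Aspirin written as SMILES.
    print(tokenize("CC(=O)Oc1ccccc1C(=O)O"))
```

Each token would then be mapped to an integer ID and fed to a sequence model, the same way words or word pieces are in a text model.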

Think of it this way, consider the following two sentences:

John was sitting on the bank of the river.
The bank was open so John sat inside.

In each sentence, the word “bank” has a very different meaning. As you may know, in modern NLP models the two occurrences of “bank” would have different contextual embeddings and the sentences would have different attention vectors.

This solved the problem of distinguishing different meanings of the same word. Back to our compounds, the scientists also found that in some cases two almost identical compounds would be encoded as completely different SMILES strings. This meant that we would never predict that the two compounds might provide a similar treatment.

So how do we do this in the context of molecular bonds and their three-dimensional properties? What the scientists involved in one project decided to do is to create a new version of SMILES that used a different vocabulary.
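
A related effect is easy to see without any chemistry software: even one and the same molecule can be written as several valid SMILES strings, depending on which atom you start from. The sketch below compares such spellings character by character using Python's difflib. The similarity score is only a rough stand-in for "how alike the strings look to a text model", which is my own framing rather than a measure from the research described here.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Different valid SMILES spellings of the same molecule.
SAME_MOLECULE = {
    "ethanol": ["CCO", "OCC", "C(O)C"],
    "toluene": ["Cc1ccccc1", "c1ccccc1C", "c1ccc(C)cc1"],
}


def string_similarity(a, b):
    """Character-level similarity in [0, 1]; 1.0 means the strings are identical."""
    return SequenceMatcher(None, a, b).ratio()


def pairwise_similarities(variants):
    """Compare every pair of spellings and return (a, b, similarity) tuples."""
    return [(a, b, string_similarity(a, b)) for a, b in combinations(variants, 2)]


if __name__ == "__main__":
    for name, variants in SAME_MOLECULE.items():
        for a, b, score in pairwise_similarities(variants):
            print(f"{name}: {a!r} vs {b!r} -> similarity {score:.2f}")
```

None of these pairs score 1.0 even though the chemistry is identical, so a model that only sees the raw strings has to learn that equivalence on its own.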

For example, molecular structures that occur frequently within compounds could be combined into one token (word) in the vocabulary. This provided better results in the next “word” prediction. The team was then able to reliably predict similar compounds and therefore (potentially) similar or better treatments.
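
One way to picture "combining common structures into one token" is byte-pair encoding (BPE), a technique borrowed from NLP. To be clear, I am swapping in BPE as an illustration; I am not claiming it is the exact method the team used. The sketch assumes SMILES strings with single-character atoms only, and the small corpus is just a handful of simple molecules chosen for the demo.

```python
from collections import Counter


def most_frequent_pair(corpus):
    """Return the most common adjacent token pair across the corpus, or None."""
    counts = Counter()
    for tokens in corpus:
        counts.update(zip(tokens, tokens[1:]))
    return counts.most_common(1)[0][0] if counts else None


def merge_pair(tokens, pair):
    """Replace every non-overlapping occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def learn_merges(smiles_list, num_merges):
    """Start from single characters and greedily merge the most frequent pairs."""
    corpus = [list(s) for s in smiles_list]  # assumption: one character per token
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(corpus)
        if pair is None:
            break
        merges.append(pair)
        corpus = [merge_pair(tokens, pair) for tokens in corpus]
    return merges, corpus


if __name__ == "__main__":
    # Benzene, toluene, phenol, acetic acid, methyl acetate.
    demo = ["c1ccccc1", "Cc1ccccc1", "Oc1ccccc1", "CC(=O)O", "CC(=O)OC"]
    merges, corpus = learn_merges(demo, num_merges=3)
    print("merges:", merges)
    for smiles, tokens in zip(demo, corpus):
        print(smiles, "->", tokens)
```

After a few merges, pieces of the aromatic ring that show up again and again become single vocabulary entries, so the model predicts a recurring fragment in one step instead of several.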

The AI community is still in the early stages of building our AI assistant in drug discovery. We have already seen a few exciting results. (See the Halicin story.) But creativity is key. There is no one way to build this AI and ingenuity will be critical in making greater progress.

If you’d like to read more of my writing, please follow me on LinkedIn or Medium.
