How not to evaluate an AI scientist
Why drug repositioning makes a terrible evaluation benchmark; a brief note
Before I begin, note that I am posting in a personal capacity and this blog does not represent any positions of Relation.
The AI scientist is coming and it’s so much better than your silly human brains — it finds hidden patterns and crazy ideas that are beyond the basic reasoning of us mere mortals. Well, that’s the narrative, but prompted by the glitzy press release from FutureHouse, I decided to look a little deeper and give my take on what good looks like.
Below, I give:
A mini-dissection of the FutureHouse claims
A how-to guide for gaming “scientific discoveries” via a drug repositioning narrative
What would actually be useful.
ASIDE 1: I should say that I’m really excited by the concept of AI scientists, but premature declarations of victory only hold back the field. For example, there are at least five papers showing simple statistical models outperform deep learning for transcriptomic profile prediction. That said, I’ve been playing with LLMs on some quite challenging maths and I am insanely impressed.
The FutureHouse claims
Jumping right in, three bits stood out to me:
"To be clear, no one has proposed using ROCK inhibitors to treat dry AMD in the literature before, as far as we can find..."
As soon as I read this, I felt I’d heard this story before (thank you, fellow Relationeer Cristian Regep and ALSA’s Katie Sunnucks), but perhaps related to diabetic retinopathy. A split-second Google search revealed a review article on ROCK inhibitors in ophthalmology more generally, so it felt like well-trodden ground… typically, experimentalists like to stick molecules into assays of vaguely related diseases and see what happens, so this is hardly a surprise.
I posted this on Twitter; there was a bit of back-and-forth, but soon people were posting link after link showing ROCK inhibitors in dry AMD.
ASIDE 2: it seems one of the AI tools used to examine novelty (Elicit) may in fact hallucinate results! Hilarious!
There’s a lot of talk about what constitutes “novelty”, but it seems that if there’s this much debate, it’s safe to assume that this isn’t novel.
However, there’s actually a more pernicious point: LLMs tokenize words, not concepts. Therefore, it is not clear that there was no data leakage; essentially, if ROCK inhibitors appeared in the training data in the context of wet AMD, does the model differentiate between “wet AMD” and “dry AMD” as separate entities? As far as I am aware, the answer is no.
There’s also a softer claim, for example:
“This is the first time that we are aware of that hypothesis generation, experimentation, and data analysis have been joined up in closed loop”
Depending on how strict your definition is (particularly on ‘data analysis’), this claim is also questionable. People have been using active learning (AL), reinforcement learning, and sequential model optimization (SMO) for experimental design for quite some time. To search the combinatorial space of drug pairs, for instance, we used SMO to find synergistic drug pairs in five wet-lab/dry-lab cycles. We’re certainly not the only ones.
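For readers unfamiliar with the pattern, here is a minimal, hypothetical sketch of what such a closed loop looks like; the `measure_synergy` oracle stands in for the wet lab, the surrogate model is deliberately naive, and none of the names refer to our actual pipeline:

```python
# Toy closed-loop screen: propose a batch, measure it, update, repeat.
# measure_synergy is a stand-in for the wet-lab assay; all names are hypothetical.
import itertools
import random

def smo_screen(drugs, measure_synergy, n_cycles=5, batch_size=24, seed=0):
    rng = random.Random(seed)
    all_pairs = list(itertools.combinations(drugs, 2))
    observed = {}  # (drug_a, drug_b) -> measured synergy score

    def predict(pair):
        # Deliberately naive surrogate: mean synergy of observed pairs sharing a drug.
        scores = [y for p, y in observed.items() if set(p) & set(pair)]
        return sum(scores) / len(scores) if scores else 0.0

    for _ in range(n_cycles):
        candidates = [p for p in all_pairs if p not in observed]
        if not candidates:
            break
        if not observed:  # cold start: a random first batch
            batch = rng.sample(candidates, min(batch_size, len(candidates)))
        else:             # later cycles: pick the pairs the surrogate ranks highest
            batch = sorted(candidates, key=predict, reverse=True)[:batch_size]
        for pair in batch:  # one wet-lab/dry-lab cycle
            observed[pair] = measure_synergy(*pair)
    return max(observed, key=observed.get)
```

The point is not the toy surrogate; it is that joining hypothesis generation, experimentation, and analysis into a loop has been standard practice in experimental design for years.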
At the end, we have the key caveat:
“Also, this discovery is cool, but it is not yet a "move 37"-style discovery. At the current rate of progress, I'm sure we will get to that level soon.”
This part I’m really not sure of. The reason is that we will never see nonlinear results if we constrain ourselves to highly gameable drug repositioning narratives. Essentially, if this is the benchmark, it is so easy to cheat that we cannot expect to make genuine progress. Let me explain.
Gaming “scientific discoveries” via drug repositioning
A few observations about drugs and disease:
Many chemotherapies and immunotherapies work across many types of cancer.
Many antibiotics work across many infectious diseases.
Many anti-inflammatory therapies work across many autoimmune diseases.
Basically, core biological mechanisms are repeated over and over, and new cutting-edge drug discovery revolves around finding new mechanisms specific to some organ system or disease, not rehashing old mechanisms.
However, imagine you want to come up with a compelling (but fake) story. Here’s what you do: consider the graph of drug-target-disease triplets with edges like this:
Edge 1: Drug A interacts with target B
Edge 2: Target B modulation can be used to treat disease C
and then we can state the common knowledge that “Drug A treats disease C”. With me?
Suppose now you complement this data with:
Edge 3: Drug X is linked to target B
then you can have a “eureka” moment from your “AI scientist” that “Drug X treats disease C 🚀🚀🚀”. Not much of a revolution (see the toy sketch below).
ASIDE 3: As an alternative to Edge 3, perhaps we can also use Drug A to treat (new) disease Y if target B is also appropriate for disease Y. Again, not groundbreaking.
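To make the trick concrete, here is a toy sketch (the drugs, targets and diseases are placeholders, and the data are invented) showing how both the Edge 3 route and the ASIDE 3 variant fall straight out of the graph by composing edges:

```python
# Toy drug-target-disease graph: "novel" repositioning hypotheses are just
# two-hop paths (drug -> target -> disease) that are not already common knowledge.
drug_targets = {                # Edges 1 and 3: drug interacts with target
    "Drug A": ["Target B"],
    "Drug X": ["Target B"],
}
target_diseases = {             # Edge 2: modulating the target treats the disease
    "Target B": ["Disease C", "Disease Y"],
}
known_indications = {("Drug A", "Disease C")}   # the existing common knowledge

def eureka_moments():
    for drug, targets in drug_targets.items():
        for target in targets:
            for disease in target_diseases.get(target, []):
                if (drug, disease) not in known_indications:
                    yield f"{drug} treats {disease} (via {target}) 🚀"

for claim in eureka_moments():
    print(claim)
# Drug A treats Disease Y (via Target B) 🚀
# Drug X treats Disease C (via Target B) 🚀
# Drug X treats Disease Y (via Target B) 🚀
```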
But why haven’t we caught on to this trick? In my mind, it’s due to a lack of knowledge of how the pharmaceutical industry actually works. Essentially, pharma companies seldom want to publish literature about how their drugs work outside of the primary FDA-approved indication without extensive scrutiny. This is for a bunch of reasons, namely:
Pricing: Imagine you have a drug on the market in an expensive disease area (say, some rare and aggressive cancer), and then someone tells you that you can also use that drug to treat an inexpensive disease area (say, mild headache). What do you do? The savvy businessman would tell you: do nothing! This is because as soon as you get an approval for headaches, then you will need to offer the drug at bargain basement prices so it is competitive with ibuprofen, but then suddenly you will see that lucrative oncology market disappear as cancer patients buy copious amounts of your new headache drug!
Whilst this may contravene the intended label for the drug, and you may have a delightful intellectual property portfolio and litigation strategy saying that prescribers can’t do this, it’s not hugely enforceable.
Crossover safety warnings: Here’s another issue: imagine your drug is being used in both acute and chronic settings. Perhaps after long-term use, an epidemiologist notices an increase in the rate of heart attacks. Now you have an obligation to bring this to the FDA, and perhaps they decide that, given the availability of other treatment choices in the acute setting, your drug should be withdrawn!
Basically, there are a lot of non-scientific reasons why interesting scientific ideas do not go very far, or never get published.
What would a useful AI scientist do?
This part is quite a struggle. If not by recovering plausible hypotheses, how do you evaluate success, and what are good project ideas? A few principles jump out:
Making quantitative estimates. For example, which target, when knocked out, will inhibit my phenotype by >90% in a 3-week physiologically relevant assay? Here, we need formal representations of the variables that describe disease, e.g. how is the phenotype measured? How do we incorporate time? What counts as a knockout? This will all need to be extensively experimentally validated (a toy sketch of one possible representation appears below).
AI for lab automation interoperability. Imagine you run an experiment in Lab A: is it the same as running it in Lab B, even if they have different equipment? If they’re different, how different are they? Are there rules for when experiments replicate, and rules for when you’re comparing apples with oranges? Before you even build an AI scientist, it would be great to know what training data you should use. It’s worth listening to folks like Vincent Alessi, who have smart ideas in this space.
Chasing value. Something we wrestled with at Relation: how do you know if a project makes sense to begin with? At a high level, one should research what the “old fashioned” way of doing something is; build an understanding of the cost/time/practicality of doing it that way at scale; understand the incremental benefit of doing it with the sexy new “AI method”; and then calculate whether it moves the needle. A great idea is to look for new use cases.
ASIDE 4: The other option is finding some new way of analysing data that no one has done before, but I suspect this is substantially harder for AI if there is no precedent – now wouldn’t that be exciting?!
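Returning to the first principle above, here is a rough sketch of what a machine-readable, quantitative hypothesis could look like; the fields and names are mine, invented for illustration, not an existing schema:

```python
# Hypothetical structure for a pre-registered, quantitative hypothesis, so that
# the phenotype, time course and effect size are pinned down before the assay runs.
from dataclasses import dataclass

@dataclass
class QuantitativeHypothesis:
    perturbation: str           # what is done, e.g. "CRISPR knockout of GENE_X"
    phenotype_assay: str        # how the phenotype is measured
    readout: str                # the concrete measurement being compared
    effect_threshold: float     # predicted effect size, e.g. 0.9 for >90% inhibition
    assay_duration_days: int    # the time component of the assay
    model_system: str           # what makes the assay physiologically relevant

    def is_supported(self, measured_effect: float) -> bool:
        """Check the wet-lab result against the pre-registered quantitative bar."""
        return measured_effect >= self.effect_threshold

hypothesis = QuantitativeHypothesis(
    perturbation="CRISPR knockout of GENE_X",
    phenotype_assay="3D organoid fibrosis assay",
    readout="fractional reduction in fibrosis score vs control",
    effect_threshold=0.9,
    assay_duration_days=21,
    model_system="patient-derived organoids",
)
print(hypothesis.is_supported(measured_effect=0.93))  # True
```

The specifics matter far less than the fact that success or failure is defined up front and can be checked directly against the experimental readout.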
Finally, it’s just worth remembering that we’ve been here before in molecular design, and any progress we make will take years before it’s in the clinic.
> AI for lab automation interoperability
The problem of combining datasets in the presence of simultaneous biological variation and technical variation is not as bad as it seems. As I showed in a recently published paper (https://openreview.net/pdf?id=GSp2WC7q0r) you can train generative models for measurements given technical confounders, then minimize the expected distance between the conditional distributions. (Apologies for tooting my own horn here...)
"This is because as soon as you get an approval for headaches, then you will need to offer the drug at bargain basement prices so it is competitive with ibuprofen". Why? I would simply price as cancer treatment and purposefully lose the headache treatment market