It was only a matter of time before someone did this. But seriously, this is nonsense and should be treated as such:
-
James Hawley, PhD replied to Ulrike Hahn
@UlrikeHahn The paper describes how it “executes experiments” by generating code, a matching description, and running the code. This will only work with 1. experiments that can be performed with code, and 2. code that can be run on data that already exists and is accessible.
This is an extremely narrow set of criteria, since we actually need to engage with the real world to get proper scientific data. Models like this can’t do that, and selling this as such, in my opinion, is nonsense.
-
@jrhawley isn’t lots of ML research run on extant data sets? Benchmark tests are tests, no?
I have no desire to defend the paper as such, I was just genuinely surprised having read it following your post, so am curious about the conceptual reasons for ruling these out specifically.
-
James Hawley, PhD replied to Ulrike Hahn
@UlrikeHahn Yes, that's definitely true. Large public datasets like MNIST would probably work well enough with this tool, since they've been used in many papers for training and benchmarking.
But I think a tool like this stacks up multiple layers of verification and transparency requirements. You need someone to attest that the code generated as part of the "experiment" matches its description, that the plots faithfully represent the data, etc.
-
James Hawley, PhD replied to James Hawley
@UlrikeHahn Verifying that everything matches what was actually run - and that the code ran correctly in the first place - is tough, too, especially given LLM hallucinations and how nuanced the "correct" thing can be in novel research.
None of that verification is trivial to do, in my experience. Lots of novel CS research also needs to define new algorithms or functions for doing the correct calculations, so there's a lot that needs to be checked, theoretically, outside the manuscript.
-
@jrhawley by sheer coincidence I had posted a link to this paper earlier today https://arxiv.org/abs/2409.11363
-
James Hawley, PhD replied to James Hawley
@UlrikeHahn > I have no desire to defend the paper as such
No worries! I didn't think you were trying to defend it, but rather looking for clarity about what I meant. I understand that "this is nonsense" may be a bit inflammatory, and there are some particular circumstances where a tool like this might actually be interesting.
But, IMO, those circumstances are so niche and still require lots of extra work that saying "this can generate new discoveries" is misleading, at best.
-
James Hawley, PhD replied to James Hawley
@UlrikeHahn And spamming dozens of papers a week is dangerous for scientific research, more generally. We already have problems with predatory journals and articles. Tools like this will only make those problems worse, IMO, without offering much benefit.
-
James Hawley, PhD replied to Ulrike Hahn
@UlrikeHahn Oh, interesting! I hadn't seen this one. Just from the abstract:
> The best agent achieved an accuracy of 21% on the hardest task, showing the vast scope for improvement in automating routine scientific tasks.
This seems like an accurate disclosure. They built AI agents, the accuracy was pretty low, and they don't oversell the results. I'm fine with that!
Putting a preprint on arXiv to hype your startup, like the paper I posted, is not something I'm a fan of.
-
@jrhawley that one was never going to be a hype paper, see the author list
the book comes out on Monday https://press.princeton.edu/books/hardcover/9780691249131/ai-snake-oil
-
@jrhawley couldn’t agree more that this is new, uncharted, and dangerous territory for the academic record.
That’s why I read the paper after seeing your post, and my response was more ‘yikes, it can do that’ than ‘this is just hype’. It would be less worrisome to me if it were just the latter. As it stands, it falls in a bad in-between space: good enough to do real damage, not good enough to be genuinely valuable.