It was only a matter of time before someone did this. But seriously, this is nonsense and should be treated as such:
The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery
One of the grand challenges of artificial general intelligence is developing agents capable of conducting scientific research and discovering new knowledge. While frontier models have already been used as aides to human scientists, e.g. for brainstorming ideas, writing code, or prediction tasks, they still conduct only a small part of the scientific process. This paper presents the first comprehensive framework for fully automatic scientific discovery, enabling frontier large language models to perform research independently and communicate their findings. We introduce The AI Scientist, which generates novel research ideas, writes code, executes experiments, visualizes results, describes its findings by writing a full scientific paper, and then runs a simulated review process for evaluation. In principle, this process can be repeated to iteratively develop ideas in an open-ended fashion, acting like the human scientific community. We demonstrate its versatility by applying it to three distinct subfields of machine learning: diffusion modeling, transformer-based language modeling, and learning dynamics. Each idea is implemented and developed into a full paper at a cost of less than $15 per paper. To evaluate the generated papers, we design and validate an automated reviewer, which we show achieves near-human performance in evaluating paper scores. The AI Scientist can produce papers that exceed the acceptance threshold at a top machine learning conference as judged by our automated reviewer. This approach signifies the beginning of a new era in scientific discovery in machine learning: bringing the transformative benefits of AI agents to the entire research process of AI itself, and taking us closer to a world where endless affordable creativity and innovation can be unleashed on the world's most challenging problems. Our code is open-sourced at https://github.com/SakanaAI/AI-Scientist
If you recognize that #reproducibility and the credibility of scientific papers are legitimate concerns across fields now, then this is actively promoting fraud. There is no "execution" of an experiment with an #llm because there's no experiment. To suggest anything of the sort is academic misconduct.
-
James Hawley, PhD replied to James Hawley
This is another example of "everything is better when we remove humans from the system", while misunderstanding that scientific investigation is a _human endeavour_ and that the resulting knowledge and experience are _meant to be shared with others_. That requires engaging and communicating with other people.
A paper is a summary of scientific work, meant to communicate it to others. The paper itself is not the be-all and end-all of scientific research.
-
@jrhawley could you just clarify in what sense there are “no experiments”?
-
James Hawley, PhD replied to Ulrike Hahn
@UlrikeHahn The paper describes how it “executes experiments” by generating code and a matching description, then running that code. This will only work with 1. experiments that can be performed entirely in code, and 2. code that can be run on data that already exists and is accessible.
This is an extremely narrow set of criteria, since we actually need to engage with the real world to get proper scientific data. Models like this can’t do that, and selling this as such, in my opinion, is nonsense.
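To make it concrete, the kind of "experiment" that fits both criteria is something like the toy run below: load a dataset that already exists, fit a model, print a number. Everything the "result" depends on lives inside the script. (A minimal sketch of my own, assuming scikit-learn and its bundled digits dataset as a stand-in for something like MNIST; it's not taken from the paper's tool.)

```python
# Toy example of a fully code-contained "experiment":
# fit a classifier to an existing public dataset and report a score.
# Illustrative only - uses scikit-learn's bundled digits dataset;
# nothing here requires collecting new data from the real world.
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```

Nothing in that loop ever has to touch a lab bench, a sensor, or a new measurement - which is exactly the boundary I'm pointing at.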
-
@jrhawley isn’t lots of ML research run on extant data sets? Benchmark tests are tests, no?
I have no desire to defend the paper as such, I was just genuinely surprised having read it following your post, so am curious about the conceptual reasons for ruling these out specifically.
-
James Hawley, PhD replied to Ulrike Hahn
@UlrikeHahn Yes, that's definitely true. Large public datasets like MNIST would probably work well enough with this tool, since they're already used in many papers for training and benchmarking.
But I think a tool like this raises issues across multiple layers of requirements and transparency. You need someone to attest that the code generated as part of the "experiment" matches the description, that the plots faithfully represent the data, etc.
-
James Hawley, PhD replied to James Hawley
@UlrikeHahn Knowing that everything matches what was actually run, assuming the code was run correctly, is tough, too - especially given hallucinations in LLMs and how nuanced the "correct" thing can be in novel research.
None of that verification is trivial to do, in my experience. Lots of novel CS research also needs to define new algorithms or functions for doing the correct calculations, so there's a lot that needs to be checked, theoretically, outside the manuscript.
-
@jrhawley by sheer coincidence I had posted a link to this paper earlier today https://arxiv.org/abs/2409.11363
-
James Hawley, PhD replied to James Hawley
@UlrikeHahn > I have no desire to defend the paper as such
No worries! I didn't think you were trying to defend it, but looking for clarity about what I meant. I understand that "this is nonsense" may be a bit inflammatory, and there are some particular circumstances where a tool like this might actually be interesting.
But, IMO, those circumstances are so niche and still require lots of extra work that saying "this can generate new discoveries" is misleading, at best.
-
James Hawley, PhD replied to James Hawley
@UlrikeHahn And spamming dozens of papers a week is dangerous for scientific research, more generally. We already have problems with predatory journals and articles. Tools like this will only make those problems worse, IMO, without offering much benefit.
-
James Hawley, PhD replied to Ulrike Hahn
@UlrikeHahn Oh, interesting! I hadn't seen this one. Just from the abstract:
> The best agent achieved an accuracy of 21% on the hardest task, showing the vast scope for improvement in automating routine scientific tasks.
This seems like an accurate disclosure. They built AI agents, the accuracy was pretty low, and they don't oversell the results. I'm fine with that!
Putting a preprint on arXiv to hype your startup, like the paper I posted, is not something I'm a fan of.
-
@jrhawley that one was never going to be a hype paper, see the author list
the book comes out on Monday https://press.princeton.edu/books/hardcover/9780691249131/ai-snake-oil
-
@jrhawley couldn’t agree more that this is new, uncharted, and dangerous territory for the academic record.
That’s why I read the paper after seeing your post, and my response was more ‘yikes, it can do that’ than ‘this is just hype’. It would be less worrisome to me if it were just the latter. As it stands, it feels like it falls into a bad in-between space: good enough to do real damage, not good enough to be genuinely valuable.