On a Sunday morning in April, pathologist Thomas Montine ran one of the most surreal meetings of his life. In an online test interface for a system called the Virtual Lab, Montine constructed a team of six artificial-intelligence (AI) characters, all powered by a commercial large language model. He gave them specialities: he made a couple of them neuroscientists, one a neuropharmacologist and another a medicinal chemist. Then, he asked this virtual lab group to examine possible treatments for Alzheimer's disease and discuss gaps in knowledge, barriers to progress and hypotheses to be tested: the same questions he has to consider in grant applications.
A few minutes later, he had a transcript of their conversation, which ran to more than 10,000 words. A virtual principal investigator had kicked things off: "Thank you all for joining this important meeting."
Montine, who studies cognitive impairment at Stanford University in California, was testing an example of an emerging trend in AI-assisted science: using teams of chatbot specialists to develop a scientific idea as a lab team might do. The developers of these "co-scientist" systems argue that such collaborative efforts can help researchers to think through research hypotheses rapidly, saving time and, in a more contentious assertion, producing new and important research ideas.
The most prominent team exploring this concept comprises researchers at the technology giant Google, who this February announced the results of early tests of their AI co-scientist with researchers1 (see also go.nature.com/3hmxuxm). The team has opened up the project to a group of trusted testers as it continues to develop the tool.
Google is not alone. A team including computational biologists at Stanford University announced its Virtual Lab system in November last year2, a version of which Montine was playing with. And a group based at the Shanghai Artificial Intelligence Laboratory in China proposed a similar virtual-scientist system, called VirSci, last October3; the researchers are building it now.
Rick Stevens, a computer scientist at the University of Chicago and at Argonne National Laboratory, both in Illinois, says that he and other computationally adept researchers are creating their own such systems by setting up AI personas that then interact. "I mean, everybody can do it," he says.
In many of these systems, the large language models (LLMs) involved don't just bounce ideas off each other. They also search the Internet, execute code and interact with other software tools, making them part of "agentic AI", a fuzzy term that refers to LLMs autonomously undertaking tasks, although in practice there is often a lot of human oversight. A group of AI agents can be woven into a larger system that can work on high-level problems for hours without getting distracted or confused, Stevens says.
"It's not really fundamentally that different than having more colleagues, in some sense," he says, "except that they don't get tired, and they've been trained on everything."
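In code, the core of such an agentic loop is compact. The sketch below is purely illustrative and does not reflect any particular vendor's implementation: ask_llm() and web_search() are hypothetical stand-ins for whatever chat-completion endpoint and search tool a researcher wires in, and the step cap is a crude proxy for the human oversight mentioned above.

```python
# Minimal sketch of an 'agentic' loop: the model may call a tool before answering.
# ask_llm() and web_search() are hypothetical placeholders, not a real vendor API.

def ask_llm(system: str, history: list[dict]) -> str:
    """Stand-in for a chat-completion call (GPT-4o, Gemini 2.0 or similar)."""
    return "ANSWER: (model output would appear here)"  # replace with a real API call

def web_search(query: str) -> str:
    """Stand-in for a web-search tool the agent is allowed to use."""
    return "(search results would appear here)"  # replace with a real search call

def run_agent(task: str, max_steps: int = 5) -> str:
    """Alternate between tool use and answering, with a step cap as light oversight."""
    system = ("You are a research agent. To look something up, reply 'SEARCH: <query>'. "
              "When finished, reply 'ANSWER: <text>'.")
    history = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = ask_llm(system, history)
        history.append({"role": "assistant", "content": reply})
        if reply.startswith("SEARCH:"):
            results = web_search(reply.removeprefix("SEARCH:").strip())
            history.append({"role": "user", "content": f"Search results: {results}"})
        else:
            return reply.removeprefix("ANSWER:").strip()
    return "Step limit reached without a final answer."

print(run_agent("Summarize open questions in tau-targeted Alzheimer's therapies."))
```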
To explore what it's like to work with a virtual team, Nature asked a few scientists to trial a version of the Stanford system and spoke to some who have used Google's AI co-scientist. Does a network of chatbots talk like a room full of Nobel prizewinners or undergraduates? Are the ideas they generate nonsensical, boring and trivial, or smart, valuable and insightful?
Multiple personalities
All co-scientist systems assign roles or personalities to agents and get them to interact, but the details vary. The Virtual Lab, built by computer scientist Kyle Swanson and his colleagues in James Zou's group at Stanford University, comes with two default characters, both (for now) powered by the LLM GPT-4o from tech firm OpenAI in San Francisco, California. These characters are a principal investigator and a critic, an agent told to provide helpful feedback. The user (or the AI's principal investigator) can then add as many agents as they like, of any kind, writing in simple descriptions for each agent to guide the characters' interactions. The user chooses how many turns the agents have to "speak", and a meeting transcript is produced in minutes. The team is working on ways to train the agents on literature that is relevant to the characters' described expertise (as others have done4), rather than merely telling them to assume specific roles.
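For readers who want a feel for the mechanics, here is a minimal sketch of that kind of role-played meeting, assuming an OpenAI-style chat-completions client. The personas, agenda and helper function are invented for illustration; the Virtual Lab's actual code, available on GitHub, is considerably more elaborate.

```python
# Illustrative sketch only: a round-robin 'lab meeting' of LLM personas.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def speak(persona: str, transcript: str) -> str:
    """One conversational turn: the persona responds to the meeting so far."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": f"You are the {persona} in a scientific lab meeting. Stay in character and be concise."},
            {"role": "user",
             "content": f"Meeting so far:\n{transcript}\n\nGive your next contribution."},
        ],
    )
    return response.choices[0].message.content

agenda = "Discuss gaps in knowledge, barriers and testable hypotheses for treating Alzheimer's disease."
# Two default roles plus user-defined specialists, in the spirit of the Virtual Lab.
team = ["principal investigator", "scientific critic",
        "neuroscientist", "neuropharmacologist", "medicinal chemist"]

transcript = f"Agenda: {agenda}\n"
for _round in range(3):                     # number of 'turns' chosen by the user
    for persona in team:
        transcript += f"\n[{persona}]: {speak(persona, transcript)}\n"
print(transcript)
```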
The Google co-scientist, developed by Google DeepMind's Alan Karthikesalingam in London and Vivek Natarajan in Mountain View, California, and their colleagues, is an extension of the researchers' work on AI that is specialized in biomedicine, including the LLM Med-PaLM.
In contrast to the Stanford system, the Google tool doesn't let users assign scientific specialities to agents. Instead, each agent has a specific, predefined function: idea generation; reflection or critique; evolution of ideas; determining the proximity of ideas to reduce duplication; ranking; and meta-review. These six agents are powered by Google's LLM Gemini 2.0.
Users prompt the system with a few sentences, including a goal and a desired format for output. They can choose to add background information, such as relevant papers. The agents collaborate to tackle the problem and search the Internet, then spit out a summary report that can be tens or hundreds of pages long. "The co-scientist is like a smart scientific partner, capable of seeing the obvious and non-obvious connections in a sea of research," says Natarajan. "We hope to give scientists superpowers."
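Conceptually, that pipeline can be sketched as a fixed sequence of prompts, one per agent role. The Python below is a loose illustration only: the invented ask_llm() helper stands in for calls to Gemini 2.0 or any other model, the stage prompts are made up, and none of it is Google's implementation.

```python
# Loose illustration of the six-stage, fixed-role pipeline described above.

def ask_llm(prompt: str) -> str:
    """Stand-in for a chat-completion call; wire this to a real model."""
    return f"(model output for: {prompt[:60]}...)"

def co_scientist(goal: str, background: str = "", n_ideas: int = 4) -> str:
    # 1. Generation: draft several candidate hypotheses.
    ideas = [ask_llm(f"Goal: {goal}\nBackground: {background}\nPropose hypothesis #{i + 1}.")
             for i in range(n_ideas)]
    # 2. Reflection: critique each hypothesis.
    critiques = [ask_llm(f"Critique this hypothesis rigorously:\n{idea}") for idea in ideas]
    # 3. Evolution: revise each hypothesis in light of its critique.
    evolved = [ask_llm(f"Revise the hypothesis to address the critique.\nHypothesis: {h}\nCritique: {c}")
               for h, c in zip(ideas, critiques)]
    # 4. Proximity: merge near-duplicate ideas.
    merged = ask_llm("Merge any near-duplicate hypotheses and keep the rest:\n" + "\n---\n".join(evolved))
    # 5. Ranking: order the surviving hypotheses.
    ranked = ask_llm(f"Rank these hypotheses by novelty and plausibility:\n{merged}")
    # 6. Meta-review: turn the top-ranked ideas into a report.
    return ask_llm(f"Write a summary research proposal from the top-ranked hypotheses:\n{ranked}")

print(co_scientist("Identify approved drugs that could be repurposed to treat liver fibrosis."))
```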
Like all LLMs, the models powering these agents sometimes hallucinate, making up text that can be wrong. But having a critic or judge in a multi-agent conversation tends to weed out things that don't make sense, says Stevens. Besides, he adds, hallucinations can be useful for creative processes such as thinking outside the box during brainstorming, as long as experts check that the output makes sense.
There is evidence that the multi-agent strategy improves output, compared with chatting to just one AI agent or bot. For instance, Zou has found that adding a critic to a conversation that used GPT-4o bumped up the model's performance score by a couple of percentage points on graduate-level science tests, and improved its answers in test cases, including designing radiotherapy treatment plans5.
Google tested its AI co-scientist system to see whether human experts preferred its multi-agent answers over content produced by a lone chatbot1. The experts rated the novelty and impact of the ideas generated by the co-scientist system slightly higher than those generated by Gemini 2.0 or OpenAI's o1.
Some research has even delved into the optimum number of agents and rounds of discussion. Computer scientist Nanqing Dong at the Shanghai Artificial Intelligence Laboratory and his colleagues, who are building the VirSci system, say that they see peak creativity with a team of eight agents each taking five turns to converse3. Swanson says that in his experience, adding more than three specialists often just leads to "wasted text", and more than three rounds of conversation sometimes sends the agents off on a tangent.
Bright ideas
Stanford University medical researcher Gary Peltz, who often uses AI in his work, was one of the first testers of the Google AI co-scientist. He wanted to use it to find drugs to treat liver fibrosis. Because the AI system was still in development, he sent his query to an intermediary at Google. His prompt read: "Develop hypotheses about the genes and epigenomic changes required for myofibroblast generation in liver fibrosis and indicate what drugs should we test as new treatments for liver fibrosis." It also included some paragraphs of detailed background information. He got the report back about a day later.
After some preliminaries, the report begins "We propose a novel hypothesis…" then works its way to concluding: "This research could have a profound impact on liver fibrosis research and therapeutic advancement." (See "Testing an AI co-scientist".)
Testing an AI co-scientist
An extract from the output that Google's AI co-scientist system provided to Gary Peltz.
"Example experiment 1 (score=novelty+details+usefulness+pairwise rank=11):
## Prompt: A Novel Hypothesis Regarding Myofibroblast Generation in Liver Fibrosis
We propose a novel hypothesis that specific epigenetic alterations contribute to myofibroblast (MyoF) formation in liver fibrosis. Unlike previous approaches that primarily focused on canonical TGFβ1 and PDGF pathways, we hypothesize that changes in the cell's epigenetic landscape play a critical role in regulating MyoF differentiation and subsequent ECM production.
Specifically, our hypothesis centers on the following key elements:
*Histone Modifications: We theorize that during fibrogenesis, alterations in histone modifications, particularly deacetylation, in the promoter regions of genes responsible for MyoF differentiation occur. For example, reduced acetylation of Histone 3 lysine 9 (H3K9ac) could promote the recruitment of repressive transcriptional machinery, silencing pro-MyoF factors. Alternatively, increased acetylation on certain histones might facilitate the binding of transcription factors that drive MyoF identity.
*DNA Methylation: Our hypothesis also posits that dynamic changes in DNA methylation patterns around genes involved in the MyoF pathway could be crucial. For instance, we predict that demethylation of CpG islands near regulatory elements of pro-MyoF genes might lead to enhanced transcription. Conversely, hypermethylation of the same regions might prevent the binding of critical activators, thereby inhibiting MyoF transition. These two mechanisms, working in concert, could provide a complex regulatory system governing MyoF generation and subsequent ECM deposition impacting liver fibrosis.
## Groundbreaking Experiment to Test the Hypothesis
To test our hypothesis, we propose an innovative experiment utilizing the human micro-HO system: …"
Full text documented in ref. 6. See the end of this article for Peltz's input.
"When I read it, I literally fell off my chair," says Peltz. He had just written a grant proposal focused on the importance of epigenetic changes in liver fibrosis, and the AI had targeted the same theme for its proposed therapeutics.
The AI co-scientist suggested three drugs, and Peltz came up with two more (all of which are already approved to treat other conditions). Google paid Peltz to help accelerate lab testing, and over the next few months, Peltz's lab tested all five drugs in its human organoid model. Two of the AI's three suggestions showed promise for promoting liver regeneration and inhibiting fibrosis6, whereas neither of Peltz's worked out.
The experience left him impressed, he says: "These LLMs are what fire was for early human societies."
Other liver researchers, however, say that the AI's suggestions for drugs were neither particularly innovative nor profound. "I personally think they are pretty common sense, not much insight really," says Shuang Wang, who works on liver disease at the Icahn School of Medicine at Mount Sinai in New York City. Google's Natarajan counters: "Sometimes things look obvious in hindsight."
Peltz says he was "particularly struck by the fact that it didn't prioritize the things that I prioritized". For the most promising drug in the AI's candidate list, called vorinostat, he could find only two papers in PubMed that relate to its use in treating liver fibrosis. His choices had many more hits, making them seem like more obvious candidates. He adds that reading the AI report was similar to his discussions with postdocs. "They have a completely different perspective on things than I would," he says.
Stilted conversations
The code for Stanford's Virtual Lab is available on the developer platform GitHub, but the team has whipped up a simplified, private web interface for testers who don't have the computer-science chops to deal with code. This interface differs a little from the full system, says Swanson, but the experience is roughly the same.
In the paper introducing the Virtual Lab2, an AI team tackled designing biological components that could stick to a particular variant of the coronavirus SARS-CoV-2. In the first of a series of lab meetings mediated by human researchers, the AIs chose to focus on nanobodies (small antibody fragments) and selected four candidates to tweak. The researchers then asked the AI team to pick some existing software tools to redesign those nanobodies, and tasked specific AI agents with writing computer code to assess and rank the results. The process designed 92 nanobodies, of which 2 did indeed bind to the SARS-CoV-2 variant in lab tests.
The researchers whom Nature asked to try out the Virtual Lab didn't go as far as to conduct suggested experiments or get their AI teams to write code. But they still found the AI helpful. Montine, for example, says his AI team did a great job of synthesizing the current knowledge (a task that other LLM systems can also handle well) and wrote answers to his grant-application questions with aplomb. "It went further than a postdoc would on a first try of writing a grant, and it only took like 2 minutes," he says. "And it is a hoot to use."
Cancer-genome researcher Francisco Barriga testing the Virtual Lab system. Credit: Francisco Barriga
Another tester, cancer-genome researcher Francisco Barriga at the Vall d'Hebron Institute of Oncology in Barcelona, describes himself as a biochemist by training and a mouse modeller and genome engineer by choice, with zero coding skills and little experience with AI. He went into the trial hesitantly, suspecting that he would serve as a kind of non-tech-savvy control.
Barriga tasked the Virtual Lab with designing mouse-model experiments to test specific biological compounds, known as type 1 interferons, for their ability to affect tumours or immune cells while using a minimal number of mice, a topic that he knows inside out (see "Testing an AI virtual lab"). The AI team suggested exactly what he would have done, Barriga says: it chose "the right models, the right experiments".
Testing an AI virtual lab
Selections from the AI-generated output that researcher Francisco Barriga got back from the Virtual Lab system he tested.
Principal Investigator: "Thank you all for joining this important meeting. Our goal today is to identify the minimal number of distinct type I interferons to test their role in anti-tumor immunity, specifically in the context of lung cancer, melanoma and pancreatic cancer using C57BL/6 mice. We need to ensure that our experimental design is both scientifically robust and ethically responsible, minimizing the number of mice used while capturing the relevant biology.
To guide our discussion, here are some initial thoughts and questions:
1. Sequence divergence. Understanding which type I interferons are most sequence-divergent is crucial. This will help us select candidates that are likely to have distinct biological functions. I would like to hear from the Computer Scientist about any bioinformatics analyses or databases that could help us identify these divergent interferons.
2. Role in tumor and immune cells. We need to gather evidence on how these interferons influence tumor progression or immune modulation. The Immunologist and Cancer Biologists can provide insights from the literature and any previous experimental data.
3. Mouse model. Selecting the right mouse model is vital for our experiments. We need to ensure that the model accurately reflects the human condition we are studying. I would like input on whether the C57BL/6 strain is the best choice and if there are any specific considerations we should be aware of.
4. Experimental conditions. Finally, we need to design the experimental conditions. This includes deciding on control groups, treatment groups, and any other variables. I would like the Scientific Critic to help us ensure that our design is rigorous and that we are considering all necessary controls.
Let's proceed with each team member providing their insights on these points. We'll start with the Computer Scientist, followed by the Immunologist, Cancer Biologists and, finally, the Scientific Critic. After each round of input, I'll synthesize information and ask follow-up questions as needed."
See the end of this article for Barriga's input.
Still, Barriga says he feels like something essential is lacking. "It definitely doesn't feel like humans are behind this." The AI agents take turns, often "speaking" in numbered lists, and are never rude, interruptive or argumentative. "It's missing some of those leaps of intuition that you'll get from a random conversation with some, I don't know, plant biologist, over a coffee at 3 p.m. in a random hallway." He could, of course, add a plant biologist (or a quantum physicist, or anyone at all) to his Virtual Lab group, but hasn't tried that yet.
"Maybe it's good to bounce ideas off. But will it be a game changer in my day-to-day? I doubt it," Barriga says. He adds that the system might be something his PhD students could consult: "If they ever run into trouble and I'm too busy, maybe I'm replaceable."
Broader insights
A third tester approached by Nature, Catherine Brownstein, is a geneticist who works on orphan diseases at Boston Children's Hospital in Massachusetts, and has more experience with AI tools. She says that she uses LLMs for speed, efficiency and to broaden her thinking. But she cautions that users typically have to be experts so that they can spot errors: in the past, chatbots have sent her on time-consuming wild-goose chases, with incorrect paper summaries forcing her to reread a paper and its references to be sure she hadn't got things wrong. "You have to kind of know what you're talking about, otherwise it's really easy to get completely led astray," she says.
Catherine Brownstein cautions that research expertise is still needed when using chatbots. Credit: Kevin Ferguson/Boston Children's Hospital
However, when Brownstein used the Virtual Lab to critique a paper she was writing, she was startled (and grateful) when the AI suggested that she ask the patients where they felt the research should go next. This had not occurred to her, although she says it should have. "I was embarrassed," she says. "I stopped and stared for a full minute, because I was just like, 'Oh my God. How did I get so far away from my original passion of having patient-focused and -centred research?'"
A simple checklist (or a chat with a friend, a chatbot or even a bartender) might have led to the same insight. Yet, she says, none of her colleagues who had read her paper had thought to mention it. "It was actually a very humbling moment."
Stepping sideways