🧠 From SRL to AI Co-Scientist: Inside Google’s Dual Breakthrough in AI Reasoning and Real Science

Artificial intelligence is evolving faster than most of us can comprehend — and Google is quietly leading that charge with not one but two stunning breakthroughs that redefine what small models and AI research systems can do.

The first is a radical new training method called Supervised Reinforcement Learning (SRL) — a fusion of two opposing paradigms that teaches small models how to “earn” intelligence instead of just memorizing it.

The second is even more extraordinary: a multi-agent “AI Co-Scientist” built on Gemini 2.0, capable of generating and validating real scientific hypotheses. In a matter of days, it reproduced a biological finding that took human researchers a decade to reach, and it even identified new drug candidates for liver fibrosis.

Let’s dive deep into both — first exploring how SRL changes small-model reasoning forever, and then how Google’s AI Co-Scientist may reshape how science itself is done.


1️⃣ The Birth of SRL — Google’s Most Unlikely AI Training Experiment

At first glance, “Supervised Reinforcement Learning” sounds contradictory — like saying organized chaos. Traditionally, supervised learning and reinforcement learning are two completely different worlds.

  • Supervised learning gives a model all the right answers up front. It’s like handing a student the solution key before the exam — the model learns to imitate those answers.
  • Reinforcement learning, by contrast, makes the model learn through trial and error. It tries actions, receives rewards or penalties, and improves over time, much like an animal learning by experience.

So when researchers at Google Cloud AI Research and UCLA announced a method combining both — supervised and reinforcement — the AI community was puzzled. How could two opposites be merged effectively?

The Motivation Behind SRL

The issue they wanted to solve was critical:
Small AI models collapse on hard reasoning problems.

When you give a small model like Qwen2.5-7B-Instruct a math or logic problem, it often starts hallucinating — producing incorrect steps even with perfect teacher examples. Supervised fine-tuning (SFT) tends to make models mimic the teacher’s text token by token, rather than truly understanding the process.

The result? Performance gets worse.
Even after fine-tuning on expert data, small models fail to generalize.

The question Google asked was deceptively simple:

“Can a small model learn to reason instead of copy?”


2️⃣ How Supervised Reinforcement Learning Works

Here’s where things get exciting. Google’s answer was to inject supervision into the reward channel, not the loss function. This subtle but profound shift changed everything.

Instead of punishing the model for predicting wrong tokens, SRL rewards it for reasoning correctly step by step.

Let’s break this down in plain English.

Step-by-Step Breakdown of SRL

  1. Start with expert solutions — Researchers take example solutions to math or coding problems.
  2. Break them into smaller steps — Each expert solution is decomposed into what they call “expert trajectories.”
  3. Ask the model to think in private — The model is encouraged to generate hidden reasoning snippets (inside <think> tags) — a private scratchpad, invisible in the final output.
  4. Output one single action per step — The model decides what to do next, like taking the next logical move in chess.
  5. Compare with the expert’s move — Instead of checking the final answer, SRL compares each step to the teacher’s using a string similarity metric (diff).
  6. Reward every micro-decision — The model receives a “dense reward,” meaning even partial progress earns credit.

Unlike normal reinforcement learning (where rewards come only at the end), SRL provides continuous feedback, teaching the model which decisions truly matter.

This is the brilliance of SRL — the model learns to reason in tiny increments, refining logic without overfitting or mimicking.
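
To make the step-wise reward concrete, here is a minimal Python sketch. It is illustrative only: it assumes the expert solution has already been split into step strings, and it uses Python’s built-in difflib similarity as a stand-in for the paper’s string-matching reward. Function names such as step_reward are invented for this example.

```python
# Illustrative sketch of SRL-style dense, step-wise rewards (not the paper's exact code).
from difflib import SequenceMatcher

def step_reward(model_action: str, expert_action: str) -> float:
    """Similarity between the model's step and the expert's step, in [0, 1]."""
    return SequenceMatcher(None, model_action.strip(), expert_action.strip()).ratio()

def trajectory_rewards(model_steps, expert_steps):
    """One reward per step, so partial progress still earns credit."""
    return [step_reward(m, e) for m, e in zip(model_steps, expert_steps)]

expert_steps = [
    "Let x be the unknown and write 2x + 3 = 11.",
    "Subtract 3 from both sides: 2x = 8.",
    "Divide by 2: x = 4.",
]
model_steps = [
    "Let x be the unknown, so 2x + 3 = 11.",
    "Subtracting 3 gives 2x = 8.",
    "Therefore x = 8.",  # a wrong final step earns a low, but not zero, reward
]
print(trajectory_rewards(model_steps, expert_steps))
```

A full SRL setup would sample these steps from the model itself and feed the rewards into a policy update, but the scoring idea is the same: every step gets credit for how close it comes to the expert’s move.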


3️⃣ SRL’s Experimental Results — The Numbers That Shocked Researchers

After implementing SRL, Google’s team tested it on small open models such as Qwen2.5-7B-Instruct. They trained on the s1K-1.1 dataset, whose expert solutions are formatted in DeepSeek-R1 style, and evaluated on challenging math benchmarks like AIME and AMC.

Here’s what they found:

| Benchmark | Baseline (Before SRL) | After SRL | After RLVR (Post-SRL) |
| --- | --- | --- | --- |
| AMC | 23.50 | 23.50 | 23.57 |
| AIME-24 | 13.3 | 16.7 | 20.0 |
| AIME-25 | 6.7 | 13.3 | 10.0 |

That’s an unprecedented jump — nearly doubling the reasoning performance of a small model.

Then they added RLVR (Reinforcement Learning with Verifiable Rewards) after SRL.
The combination exploded in performance, setting the highest open-source result for small models in reasoning tasks.

As the paper summarized:

“The winning recipe is SRL first, then RLVR.”

Code Reasoning Results

The researchers didn’t stop with math.
They also tested SRL for software reasoning, using Qwen2.5-Coder-7B-Instruct on SWE-Bench Verified tasks (a benchmark for software bug fixing and code understanding).

| Model | Oracle File Edit Mode | End-to-End Accuracy |
| --- | --- | --- |
| Base Model | 5.8% | 3.2% |
| SWE-Gym 7B | 8.4% | 4.2% |
| SRL-Trained Model | 14.8% | 8.6% |

That’s roughly double the base model’s performance, using the same dataset and architecture — just a smarter training method.


4️⃣ Why SRL Works — Turning Reasoning into Action Generation

So what makes SRL fundamentally different from everything before it?

Traditional training makes the model predict the next token — like autocomplete on steroids. SRL, on the other hand, treats reasoning as a sequence of actions.

Each action is judged for quality, not just correctness.
That subtle shift rewires how the model approaches problem-solving.

The feedback is:

  • Dense — every micro-decision earns a reward.
  • Stable — no catastrophic collapse when examples are imperfect.
  • Efficient — no need for giant reward models or massive GPUs.

In simpler terms, SRL trains intelligence, not imitation.
It forces the model to earn its logic through feedback, much like how a student learns by showing their work instead of just writing the answer.
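
To illustrate how dense step rewards can drive a lightweight policy update, here is a hedged sketch of a group-normalized advantage in the spirit of GRPO-style objectives. The function names and reward values are made up for the example; the paper’s exact recipe may differ.

```python
# Hypothetical sketch: fold per-step rewards into a group-normalized training signal.
import statistics

def sequence_score(step_rewards):
    """Average the dense step rewards of one sampled rollout into a single score."""
    return sum(step_rewards) / len(step_rewards)

def group_advantages(group_step_rewards):
    """Normalize each rollout's score against the group, GRPO-style."""
    scores = [sequence_score(r) for r in group_step_rewards]
    mu = statistics.mean(scores)
    sigma = statistics.pstdev(scores) or 1.0  # avoid division by zero when all scores match
    return [(s - mu) / sigma for s in scores]

# Four rollouts for the same problem, each already scored step by step (made-up values).
group = [
    [0.9, 0.8, 0.7],
    [0.4, 0.3, 0.2],
    [0.95, 0.9, 0.85],
    [0.5, 0.6, 0.4],
]
print(group_advantages(group))  # positive for above-average rollouts, negative otherwise
```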

That’s why small models trained with SRL can now perform reasoning tasks once reserved for 70B or 100B-parameter behemoths.


5️⃣ Why This Matters for Open-Source AI Developers

For the open-source community, SRL is a dream come true.

Most developers don’t have access to clusters of NVIDIA H100 GPUs or Google TPUs. SRL removes that dependency because:

  • It uses small datasets (as low as 1K examples).
  • It doesn’t require massive reward models.
  • It relies on simple string-matching rewards with a GRPO-style objective.
  • It scales elegantly on consumer hardware.

This means anyone can train smaller models to reason more effectively — without needing billion-parameter architectures or multimillion-dollar infrastructure.

It’s like handing the indie AI world a blueprint for intelligence efficiency.


6️⃣ Transition — From Training Models to Training Scientists

So far, we’ve seen how Google reinvented model training to make small AIs think smarter. But that’s only half the story.

While one research group was teaching models to reason, another Google team — DeepMind — decided to take it a step further.

Their goal wasn’t to make AI solve equations, but to make it do science itself.

And that’s how the Gemini 2.0 AI Co-Scientist was born.


7️⃣ Meet the AI Co-Scientist — A Team of Thinking Agents

Unlike SRL, which focuses on model behavior, the AI Co-Scientist is an entire ecosystem of specialized agents, each assigned a scientific role.

It’s built atop Gemini 2.0, Google DeepMind’s latest multimodal AI system — but instead of acting as a single model, it functions like a mini research team of autonomous thinkers.

Here’s how its internal “lab team” works:

  • 🧬 Generation Agent: Brainstorms and proposes new hypotheses.
  • 🧠 Reflection Agent: Acts as a peer reviewer, pointing out flaws or weak logic.
  • ⚖️ Ranking Agent: Runs an Elo-style tournament to rank competing hypotheses (a minimal sketch follows this list).
  • 🔬 Evolution Agent: Merges or mutates top ideas to explore novel directions.
  • 🧩 Meta-Review Agent: Oversees the process and optimizes the system over time.
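
To give a feel for the Ranking Agent’s tournament, here is a minimal, hypothetical Elo-style sketch. DeepMind has not published the agent’s actual scoring rules, so the hypothesis names, match outcomes, and K-factor below are purely illustrative.

```python
# Hypothetical Elo-style ranking of competing hypotheses from pairwise comparisons.
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(r_a: float, r_b: float, a_wins: bool, k: float = 32.0):
    """Standard Elo update for one pairwise 'debate' between two hypotheses."""
    e_a = expected_score(r_a, r_b)
    s_a = 1.0 if a_wins else 0.0
    return r_a + k * (s_a - e_a), r_b + k * ((1.0 - s_a) - (1.0 - e_a))

ratings = {"Hypothesis A": 1200.0, "Hypothesis B": 1200.0, "Hypothesis C": 1200.0}
matches = [  # outcomes of pairwise reviews (invented for this example)
    ("Hypothesis A", "Hypothesis B", True),
    ("Hypothesis C", "Hypothesis B", True),
    ("Hypothesis A", "Hypothesis C", True),
]
for a, b, a_wins in matches:
    ratings[a], ratings[b] = update(ratings[a], ratings[b], a_wins)
print(sorted(ratings.items(), key=lambda kv: -kv[1]))  # strongest hypothesis first
```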

It’s like watching a group of scientists debate ideas — except they’re all AIs collaborating inside one system.

Humans still remain in the loop by defining the research goal and providing natural-language feedback. But once the experiment begins, the heavy intellectual lifting happens autonomously.


8️⃣ The First Major Test — Finding Drugs for Liver Fibrosis

DeepMind tested its new AI Co-Scientist on a real biomedical challenge: liver fibrosis, a progressive scarring of the liver that can lead to organ failure.

For decades, scientists have struggled to find effective drugs because lab models can’t mimic real liver behavior. Existing drugs often fail in human trials.

So the DeepMind team asked the AI Co-Scientist to explore epigenomic mechanisms — chemical modifications that influence gene expression without altering DNA itself.

What Happened Next

With a single prompt defining the goal and methods, the AI dove into thousands of biomedical papers and produced three candidate drug classes that could reverse fibrosis:

  1. HDAC inhibitors
  2. DNMT1 inhibitors
  3. BRD4 inhibitors

It didn’t just name them — it even outlined how to experimentally test them using single-cell RNA sequencing.

When researchers tested these predictions using human hepatic organoids (miniature lab-grown livers), two of the AI’s suggestions — HDAC and BRD4 inhibitors — worked beautifully.

One of them, Vorinostat, already FDA-approved for cancer, not only stopped fibrosis but actually stimulated healthy tissue regrowth.

That discovery stunned scientists like Dr. Gary Peltz of Stanford University, who verified that out of 180,000 liver fibrosis papers, only seven even mentioned Vorinostat, and just two had tested it. The AI found the connection instantly.

Even better, when humans selected alternative drug targets, none performed as well as the AI’s picks.

This is where artificial intelligence crossed into genuine scientific discovery.


9️⃣ Second Breakthrough — Solving a Decade-Old Mystery in Microbiology

Google’s AI Co-Scientist wasn’t done. In another study published in Cell, the same system tackled a long-standing enigma: how cf-PICIs (capsid-forming phage-inducible chromosomal islands) spread between bacteria.

Human researchers at Imperial College London had spent over 10 years decoding this puzzle. The answer was a mechanism called tail piracy — where these genetic islands steal virus tails from other phages to inject themselves into new hosts.

To test the AI, researchers gave it only pre-discovery data — no hints about the final mechanism — and asked it to explain how cf-PICIs might spread across bacterial species.

The AI returned five hypotheses.
Its top choice:

“cf-PICIs achieve broad host range through capsid-tail interactions.”

That’s exactly the same mechanism (tail piracy) that took humans a decade to discover. The AI reproduced it in just a few days.

Benchmark comparisons confirmed that no other model, including GPT or Claude, reached that reasoning depth. Only Gemini 2.0’s multi-agent system pieced it together.


10️⃣ Implications — The Birth of Autonomous Science

Together, these two breakthroughs — SRL and the AI Co-Scientist — mark a paradigm shift in artificial intelligence.

SRL shows us that reasoning can be taught, not scaled.
Gemini 2.0’s Co-Scientist shows us that discovery can be automated, not just assisted.

Dr. Peltz summarized it perfectly:

“AI output still needs human evaluation, but the speed boost is unreal.”

His lab now uses the system for genetic discovery and drug repurposing, with early discussions for clinical testing already underway.

The implications are vast:

  • Faster hypothesis generation.
  • Cheaper biomedical research.
  • Cross-disciplinary scientific acceleration.
  • Potential for AI-driven breakthroughs humans might not even understand yet.

It’s the dawn of autonomous research systems — where AI doesn’t just summarize science; it creates it.


💬 Frequently Asked Questions (FAQ)

Q1. What makes SRL different from normal reinforcement learning?
SRL integrates supervision inside the reward signal. Instead of punishing wrong predictions, it rewards correct reasoning steps. This allows smaller models to learn efficiently without massive computation.

Q2. Can open-source developers use SRL today?
Yes. The method is designed for lightweight setups and works with small datasets and simple objectives. Anyone training models under 10B parameters can experiment with it.

Q3. Is Google’s AI Co-Scientist publicly available?
Not yet. It’s currently used internally in research collaborations. However, it’s built on Gemini 2.0, which forms the base for future scientific tools.

Q4. Should we be worried about AI doing science independently?
Not at this stage. The system still requires human validation and operates under ethical oversight. Think of it as an accelerator, not a replacement, for human scientists.

Q5. How soon could AI systems publish papers on their own?
That day may come sooner than expected. Systems like Gemini Co-Scientist are already drafting full research manuscripts and testing experimental logic autonomously.


⚠️ Disclaimer

The research described in this article is based on published Google AI and DeepMind studies. While all results are drawn from credible scientific sources, experimental outcomes are still under review and subject to replication. Always refer to Google Research and DeepMind’s official publications for verification.


✍️ Final Thoughts

What began as an odd experiment — mixing supervised and reinforcement learning — may have unlocked a new frontier in AI reasoning. SRL proves that even small models can think deeply when trained intelligently.

Meanwhile, Google’s AI Co-Scientist demonstrates the ultimate leap — from reasoning to discovery. In both cases, the takeaway is the same:
The future of AI isn’t just about bigger models. It’s about smarter learning, structured reasoning, and the courage to let machines explore the unknown.


Hashtags: #GoogleAI #DeepMind #Gemini2 #SRL #ArtificialIntelligence #MachineLearning #AIDiscovery #AIResearch #AIinScience #dtptips

Daniel Hughes

Daniel is a UK-based AI researcher and content creator. He has worked with startups focusing on machine learning applications, exploring areas like generative AI, voice synthesis, and automation. Daniel explains complex concepts like large language models and AI productivity tools in simple, practical terms.
