<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en"><generator uri="https://jekyllrb.com/" version="4.4.1">Jekyll</generator><link href="https://rekursiv.ai/feed.xml" rel="self" type="application/atom+xml" /><link href="https://rekursiv.ai/" rel="alternate" type="text/html" hreflang="en" /><updated>2026-05-30T20:06:16+00:00</updated><id>https://rekursiv.ai/feed.xml</id><title type="html">rekursiv.ai</title><subtitle>rekursiv.ai builds AI Scientists — autonomous systems that form hypotheses, run experiments, and discover new knowledge.</subtitle><entry><title type="html">sagent: a Python API for coding agents</title><link href="https://rekursiv.ai/blog/introducing-sagent/" rel="alternate" type="text/html" title="sagent: a Python API for coding agents" /><published>2026-05-05T00:00:00+00:00</published><updated>2026-05-05T00:00:00+00:00</updated><id>https://rekursiv.ai/blog/introducing-sagent</id><content type="html" xml:base="https://rekursiv.ai/blog/introducing-sagent/"><![CDATA[<article>
    <header class="post-hero">
      <div class="container-narrow">
        <div class="post-meta"><span class="pill pill-release">release</span><time>May 5, 2026</time></div>
        <h1 class="post-title">sagent: a Python API for coding agents</h1>
        <p class="post-lead">Sagent is a strongly typed Python API and CLI for building coding and developer agents.</p>
        <div class="post-byline">
          <span><a href="/josh/"><strong>Joshua V. Dillon</strong></a> &amp; <a href="/dan/"><strong>Dan Kondratyuk</strong></a></span>
        </div>
      </div>
    </header>

    <div class="post-body">
      <div class="container-narrow">
        <p>We wanted an open-source coding-agent API that could hot-swap providers and models without losing context.</p>
        <p>That means the same agent loop should work on top of Anthropic, OpenAI, Google, Kimi, Qwen, MiniMax, and self-hosted models. It should be usable as a terminal coding assistant, but also as normal Python code: import an agent, give it tools, run it inside a script, spawn reviewers, switch models, persist sessions, and inspect the typed results.</p>
        <p>That became <a href="https://github.com/rekursiv-ai/sagent">sagent</a>, a strongly typed Python API and CLI for building coding and developer agents.</p>
        <p>The public package is about 31k lines of typed Python code with another 26k lines of tests. The main design rule is still simple: everything that crosses the runtime boundary is a message, and every surface uses the same agent loop.</p>

        <figure style="text-align: center;">
          <img src="/assets/rekursiv-site-2/img/e735bb377168fcba.webp" alt="sagent logo" width="575" height="575" style="width: 180px; margin: 0 auto;" />
        </figure>

        <pre><code>pip install sagent</code></pre>

        <pre><code>import asyncio

from sagent import tools
from sagent.agent import Agent
from sagent.lib.json import json_freeze
from sagent.providers import Google


async def main() -&gt; None:
    agent = Agent(
        model=Google.from_env().model("gemini-2.5-flash"),
        system="You are a scientist.",
        tools=[tools.Read(), tools.Glob(), tools.Grep()],
    )
    # agent.run is awaited, so it needs a running event loop.
    result = await agent.run(json_freeze({"prompt": "analyze the CSV in ./data/"}))
    print(result.content)


asyncio.run(main())</code></pre>

        <h2>Everything is a message</h2>
        <p>The core idea in sagent is simple: everything that crosses the runtime boundary is a <code>Message</code>.</p>
        <p>Text, bytes, JSON, tool calls, tool results, model responses, user prompts, compaction summaries, and multi-part assistant turns all move through the same typed message graph. Conceptually, a message is content plus a MIME descriptor, such as <code>text/plain</code>, <code>application/json</code>, or <code>multipart/x-tool-call</code>.</p>
        <p>That decision removes a lot of special plumbing. Providers, tools, sessions, compaction, the CLI, Slack, parent agents, and child agents all speak the same shape. The runtime does not need one representation for model output, another for tool calls, another for persisted sessions, and another for UI events.</p>
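        <p>To make the "content plus a MIME descriptor" idea concrete, here is a minimal sketch. The <code>Message</code> dataclass, the helper names, and the way a tool call is split into parts are all assumptions made for illustration; they are not sagent's actual classes.</p>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    """Hypothetical stand-in: content plus a MIME descriptor."""

    mime: str
    content: object  # str, bytes, or a tuple of child Messages


def text(body: str) -> Message:
    return Message("text/plain", body)


def tool_call(name: str, arguments: Message) -> Message:
    # A multipart message is just a message whose content is other messages.
    return Message("multipart/x-tool-call", (text(name), arguments))


def describe(message: Message, depth: int = 0) -> list[str]:
    """Walk the message graph with one code path for every kind of payload."""
    lines = ["  " * depth + message.mime]
    if isinstance(message.content, tuple):
        for part in message.content:
            lines.extend(describe(part, depth + 1))
    return lines


call = tool_call("Grep", Message("application/json", '{"pattern": "TODO"}'))
print("\n".join(describe(call)))
```

        <p>Because every payload has the same shape, a consumer such as a session store or a UI renderer only needs one traversal like <code>describe</code> rather than one per payload kind.</p>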
        <p>The second core idea is that an agent owns an inbox.</p>
        <pre><code>while True:
    drain inbox into user messages
    call model
    if tool calls exist: dispatch tools and loop
    if inbox is empty and model is done: go idle</code></pre>
        <p>The inbox is a deque. User messages go to the front, so a person can interrupt a running session. Background completions, peer-agent messages, delayed wakeups, and tool results go to the back. The agent keeps draining, running, and checking until there is nothing left to do.</p>
        <p>This is closer to an Erlang-style process than to a request-response wrapper. An agent has state, a mailbox, and a loop. It can wake, receive messages, spawn other agents, and keep working without each surface needing its own control plane.</p>
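        <p>The inbox loop above can be sketched in plain Python. This is a toy under stated assumptions: <code>Inbox</code>, <code>run_until_idle</code>, and <code>fake_model</code> are hypothetical names, messages are reduced to <code>(kind, text)</code> pairs, and the real runtime is asynchronous.</p>

```python
from collections import deque
from typing import Callable

Entry = tuple[str, str]                  # (kind, text)
Model = Callable[[list[Entry]], list[str]]  # returns the tool calls it wants


class Inbox:
    """Hypothetical mailbox: user messages jump the queue, the rest wait."""

    def __init__(self) -> None:
        self._items: deque[Entry] = deque()

    def put_user(self, text: str) -> None:
        self._items.appendleft(("user", text))  # front: can interrupt

    def put_background(self, text: str) -> None:
        self._items.append(("background", text))  # back: arrival order

    def drain(self) -> list[Entry]:
        drained = list(self._items)
        self._items.clear()
        return drained

    def __bool__(self) -> bool:
        return bool(self._items)


def run_until_idle(inbox: Inbox, model: Model) -> list[Entry]:
    """Drain, call the model, dispatch tools, and stop once nothing is pending."""
    history: list[Entry] = []
    while True:
        history.extend(inbox.drain())
        tool_calls = model(history)
        for call in tool_calls:
            # Tool results re-enter through the back of the inbox.
            inbox.put_background(f"result of {call}")
        if not tool_calls and not inbox:
            return history  # go idle


def fake_model(history: list[Entry]) -> list[str]:
    # Ask for one tool call until its result shows up in the history.
    if any(text.startswith("result of") for _, text in history):
        return []
    return ["Grep"]


inbox = Inbox()
inbox.put_background("delayed wakeup")
inbox.put_user("stop and summarize")
print(run_until_idle(inbox, fake_model))
```

        <p>The user message was enqueued last but is processed first, which is the interrupt behavior described above.</p>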

        <h2>Tools are the core abstraction</h2>
        <p>A <code>Tool</code> is anything with a schema and a <code>run</code> method. It takes a <code>Message</code>, which may be multipart, and returns a <code>Message</code>. Streaming tools yield intermediate <code>Message</code> events and finish with a final result message.</p>
        <p>That makes <code>Agent</code> fit the same shape. An agent can be used directly, but it can also be treated as a streaming tool: send it a message, stream its events, and receive its final message.</p>
        <p>This is why <code>AgentSpawn</code> is small conceptually. Spawning an agent is just calling another agent-shaped tool with an isolated model, tool set, session, and depth limit. Recursion falls out of the type, not from a separate orchestration layer.</p>
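        <p>A rough sketch of why recursion "falls out of the type": if an agent satisfies the same protocol as the tools it calls, a child agent can sit in a parent's tool list. The names <code>ToyAgent</code>, <code>Upper</code>, and <code>depth_left</code> are hypothetical, and messages are simplified to strings; sagent's real <code>Tool</code> and <code>AgentSpawn</code> interfaces may differ.</p>

```python
from dataclasses import dataclass, field
from typing import Protocol


class Tool(Protocol):
    name: str

    def run(self, message: str) -> str: ...


@dataclass
class Upper:
    """A trivial tool: message in, message out."""

    name: str = "upper"

    def run(self, message: str) -> str:
        return message.upper()


@dataclass
class ToyAgent:
    """Hypothetical agent that satisfies the same Tool protocol it consumes."""

    name: str
    tools: list[Tool] = field(default_factory=list)
    depth_left: int = 2  # how many more levels of children may be spawned

    def run(self, message: str) -> str:
        for tool in self.tools:
            message = tool.run(message)
        return f"{self.name}({message})"

    def spawn(self, name: str, tools: list[Tool]) -> "ToyAgent":
        if self.depth_left == 0:
            raise RuntimeError("spawn depth limit reached")
        # The child gets its own tool list and a smaller depth budget.
        return ToyAgent(name, tools, self.depth_left - 1)


parent = ToyAgent("parent")
child = parent.spawn("reviewer", [Upper()])
parent.tools.append(child)  # the child is just another tool
print(parent.run("check this diff"))
```

        <p>The depth budget shrinks with each spawn, so runaway recursion is stopped by the data structure itself rather than by a separate orchestrator.</p>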

        <h2>Agents can change themselves</h2>
        <p>Sagent has three built-in coordination tools: <code>AgentSelf</code>, <code>AgentSend</code>, and <code>AgentSpawn</code>.</p>
        <p><code>AgentSelf</code> lets an agent inspect and mutate its own state. It can update its status, compact context, clear history, inspect diagnostics, adjust token limits, and change models.</p>
        <p>The model swap is a useful consequence of this design. You can hot-swap the backend conversationally while keeping the session context. Start with Claude, switch to Gemini, move to an OpenAI-compatible endpoint, then switch back. The provider normalization happens at the edge, so the agent loop keeps seeing the same typed model response shape.</p>
        <p>That makes provider choice a runtime decision instead of an architectural one. Researchers can compare model behavior inside the same agent session. Framework builders can route different work to different backends. Coding-agent users can switch models mid-task without restarting from scratch.</p>
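        <p>One way to picture "normalization at the edge" is a pair of adapters that turn differently shaped provider payloads into one response type, so swapping the backend never touches the history. Everything below is an illustrative assumption: the payload shapes are made up, and <code>Session</code>, <code>backend_a</code>, and <code>backend_b</code> are not sagent APIs.</p>

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ModelResponse:
    text: str
    provider: str


Backend = Callable[[list[str]], ModelResponse]


def backend_a(history: list[str]) -> ModelResponse:
    # Pretend provider A nests its text under "content".
    raw = {"content": [{"text": f"A saw {len(history)} messages"}]}
    return ModelResponse(text=raw["content"][0]["text"], provider="a")


def backend_b(history: list[str]) -> ModelResponse:
    # Pretend provider B nests its text under "choices".
    raw = {"choices": [{"message": {"content": f"B saw {len(history)} messages"}}]}
    return ModelResponse(text=raw["choices"][0]["message"]["content"], provider="b")


@dataclass
class Session:
    backend: Backend
    history: list[str]

    def ask(self, prompt: str) -> ModelResponse:
        self.history.append(prompt)
        response = self.backend(self.history)
        self.history.append(response.text)
        return response

    def swap(self, backend: Backend) -> None:
        self.backend = backend  # the history is left untouched


session = Session(backend_a, [])
first = session.ask("hello")
session.swap(backend_b)
second = session.ask("still there?")
print(first.text, "|", second.text)
```

        <p>The second backend sees the full conversation, including the turn produced by the first one, because the session only ever stores the normalized form.</p>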

        <h2>Agents can talk to agents</h2>
        <p><code>AgentSend</code> lets one live agent send a message to another live agent's inbox. This gives agents peer-to-peer communication instead of only parent-to-child calls.</p>
        <p><code>AgentSpawn</code> creates child agents. A parent can spawn an isolated reviewer, a specialized implementation agent, or a map-reduce worker. Spawned agents can also spawn more agents, subject to explicit depth and tool limits.</p>
        <p>This matters because agent composition should not require a separate orchestration framework. In sagent, an agent follows the same protocol shape as a tool. A parent calls <code>AgentSpawn</code>, the child runs in isolation, and the child's final output returns as a normal tool result.</p>
        <p>The same primitives cover common agent workflows:</p>
        <ul>
          <li>ask one child agent to review code;</li>
          <li>split a large search across many child agents;</li>
          <li>keep a persistent background agent alive;</li>
          <li>let two agents coordinate through inbox messages;</li>
          <li>use different models or tool sets for different subtasks.</li>
        </ul>
        <p>The important constraint is that these are still typed Python objects. You can construct them, test them, limit their tools, inspect their sessions, and embed them in another application.</p>

        <h2>A lightweight interface over a real runtime</h2>
        <p>For day-to-day use, sagent also works as a terminal coding assistant:</p>
        <pre><code>GOOGLE_API_KEY=... sagent --provider Google --model gemini-2.5-flash</code></pre>
        <p>The REPL is intentionally lightweight: closer to an IPython-like working session than to a full IDE. It has local tools for files, shell commands, web fetching, search, scholarly papers, and agent coordination. It persists sessions per working directory by default, tracks cost, and compacts old context when the session gets long.</p>
        <p>The same <code>Agent</code> powers the CLI, Slack service, parent agents, child agents, and Python applications. Surfaces differ in how they put messages into the inbox and render events. They do not own separate agent logic.</p>

        <h2>What sagent is for</h2>
        <p>Use sagent when you want:</p>
        <ul>
          <li>a strongly typed Python interface for coding agents;</li>
          <li>provider and/or model hot-swapping without changing the context or agent loop;</li>
          <li>custom tools as normal Python objects;</li>
          <li>session persistence and compaction;</li>
          <li>child agents and peer messaging for review, delegation, and map-reduce work.</li>
        </ul>
        <p>It is not a sandbox. Enabled tools run with the current process permissions. Sessions are plaintext local state. If a task needs hard isolation, run sagent inside your own OS or container sandbox and give the agent a narrow tool set.</p>
        <p>Sagent is also not trying to be every agent framework. There is no hosted service, desktop UI, browser automation, MCP integration, or LSP integration today. The focus is smaller: a typed Python runtime with concrete coordination primitives.</p>

        <h2>Try it</h2>
        <p>Sagent is open source under Apache 2.0.</p>
        <ul>
          <li>GitHub: <a href="https://github.com/rekursiv-ai/sagent">github.com/rekursiv-ai/sagent</a></li>
          <li>Docs: <a href="https://github.com/rekursiv-ai/sagent/tree/main/docs">github.com/rekursiv-ai/sagent/tree/main/docs</a></li>
          <li>Examples: <a href="https://github.com/rekursiv-ai/sagent/tree/main/examples">github.com/rekursiv-ai/sagent/tree/main/examples</a></li>
          <li>PyPI: <a href="https://pypi.org/project/sagent">pypi.org/project/sagent</a></li>
        </ul>
        <p>If you build agent systems in Python, try it and tell us where the abstractions hold up and where they break.</p>

        <div style="margin-top: 40px; display: flex; gap: 10px; flex-wrap: wrap;">
          <a href="/join/" class="btn btn-primary">Join us →</a>
          <a href="/blog/" class="btn btn-outline">Return to the blog</a>
        </div>
      </div>
    </div>
  </article>]]></content><author><name>Joshua V. Dillon</name></author><category term="release" /><category term="open-source" /><category term="agents" /><summary type="html"><![CDATA[Sagent is a strongly typed Python API and CLI for building coding and developer agents.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://rekursiv.ai/assets/img/sagent-logo.webp" /><media:content medium="image" url="https://rekursiv.ai/assets/img/sagent-logo.webp" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">I Built Copybarista in a Day to Open-Source Code Safely</title><link href="https://rekursiv.ai/blog/i-built-copybarista-in-a-day/" rel="alternate" type="text/html" title="I Built Copybarista in a Day to Open-Source Code Safely" /><published>2026-04-30T00:00:00+00:00</published><updated>2026-04-30T00:00:00+00:00</updated><id>https://rekursiv.ai/blog/i-built-copybarista-in-a-day</id><content type="html" xml:base="https://rekursiv.ai/blog/i-built-copybarista-in-a-day/"><![CDATA[<article>
    <header class="post-hero">
      <div class="container-narrow">
        <div class="post-meta"><span class="pill pill-release">release</span><time>April 30, 2026</time></div>
        <h1 class="post-title">I Built Copybarista in a Day to Open-Source Code Safely</h1>
        <p class="post-lead">How a private monorepo sync problem became an OSS tool for clean exports and verified imports.</p>
        <div class="post-byline">
          <span><a href="/dan/"><strong>Dan Kondratyuk</strong></a>, Co-founder, rekursiv.ai</span>
        </div>
      </div>
    </header>

    <div class="post-body">
      <div class="container-narrow">
        <figure>
          <p style="text-align: center; margin: 0 0 14px;">
            <a href="https://github.com/rekursiv-ai/copybarista">github.com/rekursiv-ai/copybarista</a>
          </p>
          <img src="/assets/rekursiv-site-2/img/ce16f25a3db0de8f.webp" alt="Copybarista logo and launch graphic." width="2400" height="1350" style="max-width: 100%; margin: 0 auto;" />
          <figcaption>Copybarista is a Python-native tool for safely open-sourcing packages from private repositories and monorepos.</figcaption>
        </figure>

        <p>The hard part of open-sourcing code from your company's private source is not copying the files. Anyone can copy a directory from one place to another.</p>
        <p>No, the hard part starts ten minutes later: it needs to be a normal public Python package, but the source of truth is still private. The imports point at internal modules, and the README mentions internal workflow details. GitHub Actions can push files in one direction, but now someone has sent a pull request and you need to send the changes back.</p>
        <p>I wanted the boring version of this workflow: push private changes out as a clean pull request, pull public fixes back only when the reverse mapping is verifiable, and never maintain a second hand-edited copy of the package.</p>
        <p>The first working version came together in a day of work. The release-ready version took another day of review, hardening, and documentation, but the core shape stayed small: choose a source subtree, rewrite it deterministically, export it as a clean repository, and let public fixes flow back through pull requests only when the mapping is safe.</p>
        <p>That became <a href="https://github.com/rekursiv-ai/copybarista">Copybarista</a>.</p>

        <h2>The problem was not copying files</h2>
        <p>The concrete case was a small Python tool inside our private monorepo. We wanted to release it as its own repository, and wanted that repository to look like a package someone would actually install:</p>
        <ul>
          <li><code>copybarista/</code> at the repository root</li>
          <li>a standard <code>pyproject.toml</code>, since we work with lots of ML code and love <code>uv</code></li>
          <li>public docs and examples</li>
          <li>GitHub Actions for linting, type checking, tests, build, and publishing</li>
          <li>no private docs, source paths, internal workflow files, caches, or generated junk</li>
        </ul>
        <p>The private source of truth still lived in the monorepo. I did not want to manually maintain a second copy of the package, and I did not want the open-source repository to become a fork that slowly drifted away.</p>
        <p>The sync had to work in both directions:</p>
        <ol>
          <li>Source changes in the private monorepo should export into a public pull request.</li>
          <li>Public fixes should be importable back into the private source tree.</li>
          <li>Both directions should use GitHub's normal PR review model.</li>
          <li>The public repo should never receive private-only files or metadata.</li>
          <li>The private repo should not blindly accept public edits that cannot be mapped back safely.</li>
        </ol>
        <p>At first glance, this sounds solved. Just use <code>rsync</code> or <code>git subtree</code>. Or if you want to pull out the big guns, use <a href="https://github.com/google/copybara">Copybara</a>.</p>
        <p>But after trying these options, I immediately ran into a great deal of friction.</p>

        <h2>Why not just use Copybara?</h2>
        <p>Copybara is the obvious comparison. Its README describes almost exactly the class of problem we had: moving code between repositories, often between a private repository and a public repository, with one repository treated as the source of truth.</p>
        <p>I used Copybara many times when I was at Google, and it works. It supports broad repository migration workflows, transformations, and bidirectional movement. It is also intentionally general: it runs on a Java runtime, is configured through a <code>copy.bara.sky</code> file, and has a large surface area.</p>
        <p>But I ran into a snag: I had to do a few manual steps to install a separate Java runtime, which didn't interface nicely with the Python ecosystem we were familiar with. We just wanted to run <code>uv sync</code> and have it available as a dependency.</p>
        <p>I came to the conclusion that we wanted a Python-native tool for one narrow workflow:</p>
        <ul>
          <li>export a Python package from a private or monorepo source tree</li>
          <li>rewrite imports, docs, and private blocks</li>
          <li>preserve public <code>.github/</code> metadata like GitHub Actions</li>
          <li>open GitHub PRs instead of pushing directly to <code>main</code>, in both directions</li>
          <li>run verification checks before anything can ship</li>
          <li>easily install the tool with the rest of my Python packages using <code>uv</code></li>
        </ul>
        <p>Copybara could likely be made to do most of that with enough glue, but we weren't satisfied with glue.</p>

        <h2>The design fell out of the requirements</h2>
        <p>The core design stayed simple:</p>
        <pre><code>config -&gt; stage files -&gt; transform staged tree -&gt; write destination</code></pre>
        <p>This borrows heavily from Copybara's design philosophy. That design is mature, so it was pragmatic not to reinvent good ideas. The same basic principles lay out how to export the code and keep the transforms deterministic and safe. The difference is that it is now a lightweight Python tool.</p>
        <p>Export is fairly simple: copy a selected tree into a temporary directory, run deterministic transforms, then write a folder or one clean Git commit.</p>
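        <p>The stage, transform, and write steps can be sketched with the standard library alone. This is not Copybarista's implementation: <code>export</code>, <code>rewrite_imports</code>, and the <code>corp.tool</code> package name are hypothetical, and the sketch writes a folder rather than a Git commit.</p>

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable

Transform = Callable[[Path], None]


def rewrite_imports(old: str, new: str) -> Transform:
    """Hypothetical transform: rename a package prefix in every .py file."""

    def apply(staged: Path) -> None:
        for path in sorted(staged.rglob("*.py")):  # sorted for a stable order
            path.write_text(path.read_text().replace(old, new))

    return apply


def export(source: Path, destination: Path, transforms: list[Transform]) -> None:
    with tempfile.TemporaryDirectory() as tmp:
        staged = Path(tmp) / "staged"
        shutil.copytree(source, staged)  # stage files
        for transform in transforms:     # transform the staged tree only
            transform(staged)
        shutil.copytree(staged, destination, dirs_exist_ok=True)  # write destination


with tempfile.TemporaryDirectory() as root:
    source = Path(root) / "private" / "tool"
    source.mkdir(parents=True)
    (source / "cli.py").write_text("from corp.tool import main\n")
    public = Path(root) / "public"
    export(source, public, [rewrite_imports("corp.tool", "tool")])
    exported = (public / "cli.py").read_text()
    untouched = (source / "cli.py").read_text()

print(exported, end="")
```

        <p>Transforms only ever see the staged copy, so a failed or buggy transform cannot modify the private source tree.</p>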
        <p>Import is where things get more interesting. Copybarista compares the public tree before and after a public change, maps that diff back into the private source tree, reverses the supported transforms, and then exports again. If the new export does not reproduce the public head exactly, the import is rejected and touched files are rolled back.</p>
        <p>That one invariant shaped the code: public changes come back only when they are reversible and verifiable.</p>
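        <p>The round-trip check can be illustrated with in-memory trees. This is a simplified sketch, not Copybarista's code: the single import-rewrite transform, the function names, and the <code>corp.tool</code> prefix are assumptions, and file deletions are ignored.</p>

```python
Tree = dict[str, str]  # path -> file contents


def export_tree(private: Tree) -> Tree:
    """Hypothetical forward transform: rewrite one import prefix."""
    return {p: body.replace("from corp.tool", "from tool") for p, body in private.items()}


def reverse_text(body: str) -> str:
    """Hypothetical reverse of the transform above."""
    return body.replace("from tool", "from corp.tool")


def import_change(private: Tree, public_base: Tree, public_head: Tree) -> bool:
    """Apply a public change to `private` only if re-exporting reproduces it."""
    changed = [p for p in public_head if public_head[p] != public_base.get(p)]
    snapshot = {p: private.get(p) for p in changed}  # kept for rollback
    for path in changed:
        private[path] = reverse_text(public_head[path])
    if export_tree(private) == public_head:
        return True
    for path, body in snapshot.items():  # roll back every touched file
        if body is None:
            private.pop(path, None)
        else:
            private[path] = body
    return False


private = {"cli.py": "from corp.tool import main\n"}
base = export_tree(private)

good_head = {"cli.py": "from tool import main, cli\n"}
accepted = import_change(private, base, good_head)

# This edit cannot be reproduced by an export, so it must be rejected.
bad_head = {"cli.py": "from corp.tool import secret\n"}
rejected = import_change(private, good_head, bad_head)

print(accepted, rejected, private)
```

        <p>The second change fails because exporting the reversed tree yields <code>from tool import secret</code>, which differs from the public head, so the private file is restored to its previous contents.</p>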

        <figure>
          <img src="/assets/rekursiv-site-2/img/882aef55c4a74e5f.webp" alt="Copybarista sync flow between a source repository, a standalone package repository, export pull requests, and import pull requests." width="1200" height="1920" style="max-width: 100%; margin: 0 auto;" />
          <figcaption>Copybarista keeps a private or monorepo source tree and a standalone public repository in sync through reviewed GitHub pull requests.</figcaption>
        </figure>

        <h2>The GitHub Actions part was the real test</h2>
        <p>Once I had a basic transformation and sync working, the real test became: "Can Copybarista export itself in a safe way?" This required writing and testing the full pipeline between our source repo and a new test repo, while following security best practices.</p>
        <p><a href="https://securitylab.github.com/resources/github-actions-preventing-pwn-requests/">GitHub's Security Lab</a> has written about one class of security problem in the context of Actions: untrusted pull request code should not run in a privileged context with repository write permissions or secrets. Copybarista follows the same rule. Import and validation run without <code>GH_TOKEN</code>. The token appears only in the final PR creation step, and that step runs trusted helper code captured before public changes are applied.</p>

        <h2>All of this in a day of work?</h2>
        <p>I truly believe we are rapidly approaching a completely new paradigm of software. It's never been cheaper to spin up a bunch of coding agents to do your bidding and write lots of code to do your task.</p>
        <p>Now I know what readers may be thinking: "Doesn't this lead to lots of unintelligible AI slop?" And if you use coding agents the way most people do, my answer is yes.</p>
        <p>But that's not quite what happened (at least from my reviews and testing). The results seem to paint a different picture: the code has docstrings that explain intent, is fully typed, has 90+% test coverage, was tested with a profiler to ensure fast execution, and uses a clean set of abstractions. The repo also contains lots of documentation, tutorials, and examples, and withstood my manual testing. The first time I ran Copybarista to export itself, it just worked out of the box.</p>
        <p>The coding workflow was not "ask for a tool, receive a tool." It was: state a desired end goal, describe the workflow, and write and rewrite a spec three times over. I spent a couple of hours just going back and forth to design the right set of core primitives. Once that was done, the task was to implement the smallest working version, run tests, and then harden the implementation by having multiple agents run the software manually as a person would.</p>
        <p>The human part was scope control. Without a narrow target, the agents would rathole in some unimportant direction, so watching the shape of the APIs between major module rewrites was important. The project was able to stay on track because the answer to many changes was "no, we don't need that yet, let's try a simpler way."</p>

        <h2>What came after the first day</h2>
        <p>Now I should mention that technically it did take an extra half-day to polish the implementation. The first working version was functional, but not ready for release. That is the part I underestimated the most. A tool can work internally long before it is ready for strangers. Making it open-source-ready meant writing more solid documentation and examples, setting up PyPI, and deciding what happens when a public PR cannot be reversed or when an export PR goes stale. The answers became docs, tests, and workflow rules.</p>

        <h2>What Copybarista is, and what it is not</h2>
        <p>Copybarista is for Python projects where the source of truth lives inside a private repository or monorepo, but the package should be released as a clean standalone repository.</p>
        <p>It is good at:</p>
        <ul>
          <li>exporting a selected tree</li>
          <li>rewriting imports and docs and stripping private blocks</li>
          <li>keeping GitHub PRs as the sync interface</li>
        </ul>
        <p>It is not trying to be:</p>
        <ul>
          <li>a full Copybara-compatible migration engine</li>
          <li>a replacement for human review</li>
        </ul>

        <h2>The lesson</h2>
        <p>The surprising part was not that AI helped write code quickly, but that it is still possible to move fast and write battle-hardened, clean, working software that does its job.</p>
        <p>Copybarista was a test of how far we can push building open-source software in the new paradigm of coding-assistance tooling. And it makes me excited to see all the cool software people build.</p>
        <p>With the right approach, you can now move fast and not break things.</p>

        <div style="margin-top: 40px; display: flex; gap: 10px; flex-wrap: wrap;">
          <a href="/join/" class="btn btn-primary">Join us →</a>
          <a href="/blog/" class="btn btn-outline">Return to the blog</a>
        </div>
      </div>
    </div>
  </article>]]></content><author><name>Dan Kondratyuk</name></author><category term="release" /><category term="open-source" /><category term="agents" /><summary type="html"><![CDATA[How a private monorepo sync problem became an OSS tool for clean exports and verified imports.]]></summary></entry><entry><title type="html">An Autonomous AI Scientist Team Invented an Algorithm I Wouldn’t Have</title><link href="https://rekursiv.ai/blog/an-ai-team-invented-an-algorithm-i-wouldnt-have/" rel="alternate" type="text/html" title="An Autonomous AI Scientist Team Invented an Algorithm I Wouldn’t Have" /><published>2026-03-23T00:00:00+00:00</published><updated>2026-03-23T00:00:00+00:00</updated><id>https://rekursiv.ai/blog/an-ai-team-invented-an-algorithm-i-wouldnt-have</id><content type="html" xml:base="https://rekursiv.ai/blog/an-ai-team-invented-an-algorithm-i-wouldnt-have/"><![CDATA[<article>
    <header class="post-hero">
      <div class="container-narrow">
        <div class="post-meta"><span class="pill pill-accent">research</span><time>March 23, 2026</time></div>
        <h1 class="post-title">An Autonomous AI Scientist Team Invented an Algorithm I Wouldn't Have</h1>
        <p class="post-lead">Four AI Scientists and $500 created a machine learning paper with an idea I never would have tried.</p>
        <div class="post-byline">
          <span><a href="/josh/"><strong>Joshua V. Dillon</strong></a>, Co-founder, rekursiv.ai</span>
        </div>
      </div>
    </header>

    <div class="post-body">
      <div class="container-narrow">
        <figure>
          <img src="/assets/rekursiv-site-2/img/b2d93f3a07806049.png" alt="A subset of experiments showing accuracy on Sudoku-Extreme improving from 75% to 97% over time." width="2800" height="1400" />
          <figcaption>Figure 1. A subset of experiments, showcasing improvement in Sudoku-Extreme puzzle accuracy over time. The staircase of kept experiments shows how the AI Scientists incrementally pushed accuracy from 75% to 97%.</figcaption>
        </figure>

        <p>From November 26 to December 25, 2025, I experienced a new joy: every morning I rushed to my terminal to read what my autonomous research team had discovered overnight.</p>
        <p>The team was four AI Scientists. They ran on a Claude Max subscription and a single GPU in my garage (an unnecessary but fun alternative to a cloud GPU). By the end of the month, they'd written a machine learning paper, "<a href="https://arxiv.org/abs/2601.19085">Speed is Confidence</a>", for a total cost roughly equal to staying at a hotel for several days. The paper proposed an idea that, despite 21 years in ML, I would not have come up with myself, because it involved wasting compute. The resulting method achieved <strong>97% accuracy on Sudoku-Extreme</strong>, a benchmark where the previous state of the art was 85%, using <strong>167× less training compute</strong>.</p>
        <p>This is the story of how that happened, what it felt like to work alongside them, and why we started rekursiv.ai.</p>

        <h2>What today's AI systems are missing</h2>
        <p>Everyone is building on top of AI systems that write code and has been for quite some time. And they're getting really, really good at it. But writing code isn't the hard part of research. The hard part is knowing <em>what</em> to build and <em>why</em>.</p>
        <p>Current AI systems execute but struggle to <em>discover</em>. They don't form avant-garde hypotheses. They don't design sophisticated experiments. They don't stare at an unexpected result and ask "wait, why did <em>that</em> happen?" They don't do the thing that has driven every major discovery in human history.</p>
        <p><strong>The scientific method is in its infancy in AI research loops.</strong></p>
        <p>Observe, hypothesize, experiment, analyze, and repeat. This cycle is ancient, and while "obvious," no AI coding tool on the market does it well enough to invent fundamental and transformative ideas.</p>

        <h2>Four AI Scientists + one RTX 5090 ≈ $500 for one month</h2>
        <p>To see how far I could push commercial LLMs, I tried to set up a system that could do scientific discovery, which immediately failed. The second attempt was a lot more promising, but that also failed. It took 8 attempts before it worked. But when it did, it was able to iterate on research for a whole month.</p>
        <p>The final system revolved around the design of 4 distinct AI Scientists, each with their own role:</p>
        <ul>
          <li><strong>Scientist</strong>: forms hypotheses, designs experiments, owns research strategy. Never runs an experiment without a prediction of what should happen and why.</li>
          <li><strong>Analyst</strong>: studies results, identifies patterns, proposes diagnostics. Gut-checks the scientist's proposal.</li>
          <li><strong>Engineer</strong>: implements exactly what's specified. No freelancing. Asks for clarification rather than guessing.</li>
          <li><strong>Reviewer</strong>: validates code against spec, dispatches GPU runs, owns the leaderboard.</li>
        </ul>
        <p>A coordination script routes messages between the four scientists and manages their lifecycle. I built the system so a human can join as a fellow Scientist, allowing me to propose ideas, redirect experiments, and provide guidance. It also runs fully unattended when I step away.</p>

        <h2>Watch them think</h2>
        <p>Here's one experiment cycle, one of 984. It starts with a single prompt:</p>

        <div class="trace">
          <div class="trace-msg" style="--p: var(--ink-3);">
            <div class="trace-head"><span class="from">User → Scientist</span>, 12:39:05</div>
            <div class="trace-body">"Propose experiment_17. Ask Analyst to sanity-check your hypothesis."</div>
          </div>
          <div class="trace-msg" style="--p: #1d8a73;">
            <div class="trace-head"><span class="from">Scientist → Analyst</span>, 12:39:17</div>
            <div class="trace-body"><strong>Hypothesis:</strong> ACT (Adaptive Computation Time) is required to reach 87% because it allows the model to dynamically allocate more thinking steps to harder puzzles. <em>Evidence:</em> TRM paper states "removing ACT hurts generalization"; experiment_10 peaks at 76% then degrades.</div>
          </div>
          <div class="trace-msg" style="--p: #7aa2f7;">
            <div class="trace-head"><span class="from">Analyst → Scientist</span>, 12:39:36</div>
            <div class="trace-body">Mixed support. We don't know the ACT ablation magnitude. <em>experiment_10</em>'s degradation looks like <strong>overfitting</strong>, not depth limitation. Three simpler hypotheses untested: overfitting, batch size, PDC done correctly. Find the quantitative ACT ablation before 2-3 days of impl.</div>
          </div>
          <div class="trace-msg" style="--p: #1d8a73;">
            <div class="trace-head"><span class="from">Scientist → Analyst</span>, 12:40:47</div>
            <div class="trace-body"><strong>Critical finding: ACT hypothesis is DEAD.</strong> ACT actually hurts by 1.3%. Pivoting to batch size.</div>
          </div>
        </div>

        <p>The Scientist searched the literature, found quantitative evidence against its own hypothesis, and changed its mind. It pivoted to the simpler explanation the Analyst suggested, without any human telling it to.</p>

        <h2>984 experiments in 30 days</h2>
        <p>To kick-start the research loop, I pointed them at an open problem: take the <a href="https://github.com/SamsungSAILMontreal/TinyRecursiveModels">Tiny Recursive Models</a> paper and try to reach 100% accuracy on Sudoku-Extreme. TRM is a 7M-parameter neural network that solves constraint satisfaction problems by iteratively refining its predictions. Its reported test-set accuracy was about 85%.</p>
        <p>Over those 30 days, the team ran 984 experiments and generated 882,534 lines of scientific discourse. Most experiments produced no improvement. The team documented 23 different approaches that made things worse, including diversity losses, causal masking, entropy regularization, and confidence weighting.</p>
        <p>But buried in the wreckage were three observations that changed everything.</p>

        <h2>Three discoveries</h2>
        <p>First, the scientists noticed that training four models with different random seeds and averaging their predictions improved accuracy. This is a standard ensembling result.</p>
        <p>Second, the Analyst flagged something in the halting dynamics. TRM learns a "stop" signal: when the model is confident, it halts early rather than iterating further. They noticed that when you run multiple models in parallel, <em>the one that halts first is almost always correct.</em> Selecting by <em>speed</em> instead of averaging significantly increased accuracy while using 10× fewer reasoning steps.</p>
        <p>The result is "<a href="https://arxiv.org/abs/2601.19085">speed is confidence</a>". The model that converges fastest has found the cleanest solution path. It's the same principle behind cortical winner-take-all circuits: the first neuron to fire suppresses the alternatives.</p>
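        <p>A minimal sketch of the difference between averaging and halt-first selection, in plain Python. The <code>Rollout</code> container, the function names, and the numbers are hypothetical illustrations, not the TRM code or the Scientists' implementation. Each rollout holds one ensemble member's class probabilities for a single cell, plus the refinement step at which that member halted.</p>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rollout:
    """One ensemble member's output for a single cell (hypothetical container)."""
    probs: tuple      # class probabilities predicted by this member
    halt_step: int    # refinement step at which this member chose to stop


def argmax(values):
    """Index of the largest entry."""
    return max(range(len(values)), key=lambda i: values[i])


def average_ensemble(rollouts):
    """Standard ensembling: average the members' probabilities, then pick a class."""
    n = len(rollouts)
    mean = [sum(column) / n for column in zip(*(r.probs for r in rollouts))]
    return argmax(mean)


def halt_first(rollouts):
    """Speed-based selection: trust whichever member halted earliest."""
    fastest = min(rollouts, key=lambda r: r.halt_step)
    return argmax(fastest.probs)


# Made-up numbers: one member halts early and disagrees with three slow ones.
rollouts = [
    Rollout(probs=(0.1, 0.9), halt_step=3),
    Rollout(probs=(0.8, 0.2), halt_step=16),
    Rollout(probs=(0.7, 0.3), halt_step=16),
    Rollout(probs=(0.6, 0.4), halt_step=14),
]

print("average:", average_ensemble(rollouts))
print("halt-first:", halt_first(rollouts))
```

        <p>In this made-up example the three slow members outweigh the fast one under averaging, while halt-first selection returns the fast member's answer. Because the selected member is known as soon as the first one halts, the remaining members could in principle be stopped at that step.</p>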
        <p>Third, I posed a revised objective to the scientists: match the accuracy of the halt-first ensemble, but with a train-time procedure whose test-time compute matches the TRM baseline. The result? An algorithm I wouldn't have devised.</p>

        <h2>The algorithm I wouldn't have invented</h2>
        <p>During training, maintain four parallel models from different random initializations. At each step, find the one with the lowest loss (the "oracle winner") and only backpropagate through that one. Copy the winner's high-level reasoning state to all four chains, but let each keep its own low-level state so they stay diverse. At inference, run just one chain.</p>
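        <p>The training step described above can be sketched on a toy problem. This is an illustration under stated assumptions, not the Scientists' implementation: each chain is reduced to a single scalar weight with scalar high-level and low-level states, the loss is squared error, and the names <code>Chain</code>, <code>make_chains</code>, and <code>train_step</code> are hypothetical.</p>

```python
import random
from dataclasses import dataclass


@dataclass
class Chain:
    """Toy stand-in for one of the parallel models (hypothetical)."""
    weight: float   # trainable parameter, standing in for network weights
    high: float     # high-level reasoning state, shared after every step
    low: float      # low-level state, kept private so chains stay diverse


def make_chains(n=4, seed=0):
    """Create n chains from different random initializations."""
    rng = random.Random(seed)
    return [
        Chain(rng.uniform(-1, 1), rng.uniform(-1, 1), rng.uniform(-1, 1))
        for _ in range(n)
    ]


def predict(chain, x):
    return chain.weight * x + chain.high + chain.low


def loss(chain, x, y):
    return (predict(chain, x) - y) ** 2


def train_step(chains, x, y, lr=0.05):
    """One step: update only the lowest-loss chain, then share its high state."""
    losses = [loss(c, x, y) for c in chains]
    winner_idx = min(range(len(chains)), key=lambda i: losses[i])
    winner = chains[winner_idx]

    # Gradient of the squared error with respect to the winner's weight.
    # The other chains receive no update at this step.
    grad = 2.0 * (predict(winner, x) - y) * x
    winner.weight -= lr * grad

    # Copy the winner's high-level state to every chain; low states are untouched.
    for c in chains:
        c.high = winner.high
    return winner_idx


chains = make_chains()
winner_idx = train_step(chains, x=1.0, y=2.0)
print("winner:", winner_idx)
print("shared high state:", chains[0].high)
print("low states:", [round(c.low, 3) for c in chains])
```

        <p>After one step, every chain carries the winner's high-level state, only the winner's weight has moved, and the low-level states still differ from chain to chain.</p>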

        <table>
          <thead><tr><th></th><th>Hardware</th><th>Training compute</th><th>Accuracy</th></tr></thead>
          <tbody>
            <tr><td><strong>TRM, Previous SOTA</strong> (Oct '25)</td><td>8× H100</td><td>5.3 GPU-hours</td><td>85%</td></tr>
            <tr><td><strong>Our AI Scientists</strong> (Dec '25)</td><td>1× RTX 5090</td><td>0.67 GPU-hours</td><td class="best">97%</td></tr>
          </tbody>
        </table>

        <p>The AI Scientists invented an approach that is roughly 167× cheaper and 12 percentage points more accurate.</p>
        <p>Here's why I wouldn't have invented this: the algorithm throws away 75% of the training compute, since only one of the four chains is updated at each step. Every instinct I've developed over two decades says to use every gradient. The scientists didn't carry that baggage. They had a hypothesis (diversity in solution paths matters more than gradient efficiency), they tested it, and they found a reasonable solution.</p>

        <h2>Waking up excited</h2>
        <p>Every morning, I'd open the Analyst's latest report. Sometimes it was a failed experiment with a puzzling note. Other mornings, it was a breakthrough. The day the Analyst reported the halt-first result (97% accuracy where we'd been stuck at 91%) with an explanation grounded in biological winner-take-all circuits, I sat at my desk and laughed.</p>
        <p>This is what the scientific method gives you: surprise.</p>

        <h2>What this means</h2>
        <p>That system was just a proof of concept. What it demonstrated is that <strong>AI Scientists can create new knowledge</strong>. Given a research problem and the right methodology, they form hypotheses, run experiments, interpret results, and arrive at novel solutions.</p>
        <p>That's what <a href="/">rekursiv.ai</a> is building. We're creating a platform for self-improvement, where AI Scientists collaborate with humans to do research, not just write code.</p>
        <p>We're not done until we've scaled to millions of AI Scientists working collectively to solve important, challenging problems.</p>
        <p>If that interests you, consider reaching out to <!--email_off--><a href="mailto:contact@rekursiv.ai">contact@rekursiv.ai</a><!--/email_off-->.</p>

        <div style="margin-top: 40px; display: flex; gap: 10px; flex-wrap: wrap;">
          <a href="/join/" class="btn btn-primary">Join us →</a>
          <a href="/blog/" class="btn btn-outline">Return to the blog</a>
        </div>
      </div>
    </div>
  </article>]]></content><author><name>Joshua V. Dillon</name></author><category term="research" /><category term="agents" /><category term="ai-scientists" /><summary type="html"><![CDATA[Four AI Scientists and $500 created a machine learning paper with an idea I never would have tried.]]></summary></entry></feed>