
Replicating My AI-Built Research Workstation

Created: June 13, 2026   Last Modified: July 03, 2026   Category: research, tools

Summary

Over time my research machine grew into a small fleet of AI coding agents — Claude Code, Codex, GitHub Copilot CLI, OpenCode, DeepSeek, Gemini, and a self-hosted bot called OpenClaw — wired together with a shared set of research skills: a Zotero library, a paper downloader, daily literature digests, multi-agent proof/manuscript reviews, and a SageMath sandbox. To make that setup reproducible (and survivable), the agents built their own backup-and-replication system: a public GitHub repo of sanitized configuration plus a single encrypted archive of secrets that, together, can rebuild the whole machine on a fresh Ubuntu box. This post is a short tour of what the system does, two or three real examples, how to try its basic functions for free in a GitHub Codespace, and which keys you would need to fully replicate it.

This is experimental. Everything below — and this very post — was generated by the multi-agent AI coding system it describes. It was built for one person’s workflow (combinatorics and graph theory, especially combinatorial reconfiguration), it assumes an arm64 Ubuntu host, and it may not work as expected on other machines, other agent versions, or research tasks outside those assumptions.

What this is

The system lives in three public repositories:

The motivation was never “back up dotfiles” — it was to keep a research workflow intact: ask a question, gather the literature, compute, verify, and write, with the agents doing the legwork. The rest of this post focuses on that workflow. Architecture details are in ARCHITECTURE.md.

What it does

The research workflow (the main motivation)

A non-trivial research question passes through a series of visible gates instead of producing a single black-box answer. A short Research Brief scopes the question; a deep-research pass fans out web searches, fetches sources, and adversarially verifies each claim; and review and verification gates check the evidence before anything is called “done.”

The gate discipline has since grown at both ends, and beyond research. An intent-interview pins down what is actually being asked before the brief — one question at a time, each with a guess attached, until the real question is confirmed. An in-flight doubt check re-examines a non-trivial decision — a reduction step, a boundary, an assumption a type or proof checker cannot see — with a fresh-context reviewer while it is still cheap to change, not only at the end. And the same prove-it-then-verify discipline now applies to engineering and general tasks too, not just research.

[Figure: vertical flowchart of the research workflow. Steps, top to bottom: research question; Research Brief; gather literature (Zotero, Calibre, getscipapers); deep research with adversarial verification; compute and draw (SageMath, TikZ); multi-agent review; verification gate; deliver.]
Caption: The research workflow. The blue steps are the visible gates (the brief, the adversarial verification, the multi-agent review, and the final evidence check) that keep every claim sourced before anything is called “done”.

These flows live in the agent instructions and the multi-agent templates in the agent-group-discuss skill.
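To make the gate idea concrete, here is a minimal Python sketch. It is an illustration only, not the code in the agent instructions: the `Claim` record, the two gate functions, and `run_gates` are hypothetical names invented for this example. The one property it shows is that a failed gate stops the pipeline and says why.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Claim:
    """One statement in a draft answer (hypothetical record for this sketch)."""
    text: str
    sources: List[str] = field(default_factory=list)
    verified: bool = False


# A gate inspects the claims and returns a list of problems; empty means "pass".
Gate = Callable[[List[Claim]], List[str]]


def source_gate(claims: List[Claim]) -> List[str]:
    """Fail any claim that cites no source."""
    return [f"unsourced: {c.text}" for c in claims if not c.sources]


def verification_gate(claims: List[Claim]) -> List[str]:
    """Fail any claim that no reviewer has independently verified."""
    return [f"unverified: {c.text}" for c in claims if not c.verified]


GATES: List[Tuple[str, Gate]] = [
    ("sources", source_gate),
    ("verification", verification_gate),
]


def run_gates(
    claims: List[Claim], gates: List[Tuple[str, Gate]]
) -> Tuple[bool, Optional[str], List[str]]:
    """Run gates in order and stop at the first one that reports problems."""
    for name, gate in gates:
        problems = gate(claims)
        if problems:
            return False, name, problems
    return True, None, []
```

Calling `run_gates(claims, GATES)` on a draft with one unsourced claim returns `False`, the name of the failing gate, and the offending claim, so nothing is reported as done until the list of problems is empty.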

Getting papers (Zotero first)

Document lookup follows a strict order — Zotero → Calibre → online — so a paper already in my 10,000-item library is never re-downloaded.
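The lookup order is a plain fallback chain. The sketch below shows the idea in Python under stated assumptions: the two dictionaries stand in for the Zotero and Calibre libraries, the `online` function stands in for a downloader, and all names, DOIs, and paths are hypothetical.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# A lookup takes an identifier and returns a local path, or None if not found.
Lookup = Callable[[str], Optional[str]]


def find_document(
    identifier: str, backends: Sequence[Tuple[str, Lookup]]
) -> Tuple[Optional[str], Optional[str]]:
    """Try each backend in order; return (backend_name, path) for the first hit."""
    for name, lookup in backends:
        path = lookup(identifier)
        if path is not None:
            return name, path
    return None, None


# Hypothetical in-memory stand-ins for the real libraries.
ZOTERO = {"10.1000/example.1": "/zotero/storage/ABCD1234/paper.pdf"}
CALIBRE = {"10.1000/example.2": "/calibre/Author/Title/book.pdf"}

online_calls: List[str] = []


def online(identifier: str) -> Optional[str]:
    """Stand-in for a downloader; records each call so the order can be checked."""
    online_calls.append(identifier)
    return "/downloads/" + identifier.replace("/", "_") + ".pdf"


# The order of this list is the whole policy: Zotero, then Calibre, then online.
BACKENDS = [("zotero", ZOTERO.get), ("calibre", CALIBRE.get), ("online", online)]
```

With these stand-ins, `find_document("10.1000/example.1", BACKENDS)` is answered by the Zotero stub and the online stub is never called, which is the behaviour the ordering is meant to guarantee.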

Ingestion is powered by a local Zotero Translation Server — a small Docker service (the same engine behind the Zotero browser connector) that turns a URL, DOI, or identifier into a fully-catalogued item with correct metadata. It’s what makes one-command “add this paper” work, and the rebuild restores it along with the library config (see INSTALL.md).
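For readers who have not used the Translation Server, here is a rough Python sketch of what a client call looks like. It assumes the server's usual defaults (port 1969, a `/web` endpoint for URLs and a `/search` endpoint for identifiers, each taking the target as a plain-text POST body); the helper names are mine, and this is not the ingestion code the system actually uses.

```python
import json
import urllib.request

# Assumption: the Docker service listens on the translation server's default port.
TRANSLATION_SERVER = "http://127.0.0.1:1969"


def build_request(target: str) -> urllib.request.Request:
    """Build the POST request for a URL (/web) or an identifier such as a DOI (/search)."""
    endpoint = "/web" if target.startswith(("http://", "https://")) else "/search"
    return urllib.request.Request(
        TRANSLATION_SERVER + endpoint,
        data=target.encode("utf-8"),
        headers={"Content-Type": "text/plain"},
        method="POST",
    )


def fetch_items(target: str, timeout: float = 30.0) -> list:
    """Send the request and decode the JSON list of items the server returns.

    Needs a running server. Pages that offer several candidate items get a
    different kind of response, which this sketch does not handle.
    """
    with urllib.request.urlopen(build_request(target), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Splitting `build_request` from `fetch_items` keeps the routing decision (URL versus identifier) checkable without a running server.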

Daily arXiv / Semantic Scholar and RSS digests surface new papers on tracked topics. (Some download methods are a separate topic and are discussed elsewhere.)

Multi-agent tasks

For work that benefits from independent perspectives, tasks are split across several agents that run in parallel within a round and hand off between rounds — the same idea as the proof example above, applied to other jobs:

There is also a SageMath sandbox for the small computations these tasks need — chromatic/Tutte polynomials, automorphism groups, exhaustive small-case checks — with ready-made templates such as reconfiguration_check.sage and counterexample_search.sage.
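An exhaustive small-case check is easy to picture even without Sage. The following is a plain-Python stand-in, not the contents of counterexample_search.sage: it enumerates every labeled graph on n vertices and returns the first one that violates a claim. All function names are hypothetical, and the second claim is false on purpose so the search has something to find.

```python
from itertools import combinations


def all_labeled_graphs(n):
    """Yield every labeled simple graph on vertices 0..n-1 as a list of edges."""
    pairs = list(combinations(range(n), 2))
    for mask in range(1 << len(pairs)):
        yield [pair for i, pair in enumerate(pairs) if (mask >> i) & 1]


def is_connected(n, edges):
    """Depth-first search from vertex 0 (assumes n >= 1)."""
    adjacent = {v: set() for v in range(n)}
    for u, v in edges:
        adjacent[u].add(v)
        adjacent[v].add(u)
    seen, stack = {0}, [0]
    while stack:
        u = stack.pop()
        for w in adjacent[u] - seen:
            seen.add(w)
            stack.append(w)
    return len(seen) == n


def find_counterexample(n, claim):
    """Return the first graph on n vertices where claim(n, edges) is False, else None."""
    for edges in all_labeled_graphs(n):
        if not claim(n, edges):
            return edges
    return None


def connected_needs_n_minus_1_edges(n, edges):
    """True claim: a connected graph on n vertices has at least n - 1 edges."""
    return (not is_connected(n, edges)) or len(edges) >= n - 1


def connected_has_low_degree_vertex(n, edges):
    """False claim (on purpose): every connected graph has a vertex of degree <= 1."""
    if not is_connected(n, edges):
        return True
    degree = [0] * n
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return min(degree) <= 1
```

On four vertices this walks all 64 labeled graphs, 38 of which are connected. The first claim survives the search, while the second is refuted by a connected graph with minimum degree two, such as a 4-cycle. The approach only scales to small n, which is why larger searches get offloaded.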

Figures (TikZ), Sage-assisted

Those computed structures usually have to become figures in a paper. The tikz-draw skill builds structural diagrams — finite graphs, gadgets, automata, trees, commutative diagrams — with a structure-first loop: figure brief → spec → render → verify-semantic → compile → review, so a diagram is checked against the structure it is meant to show rather than just compiled.
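The spec → render → verify-semantic part of that loop can be sketched in a few lines of Python. This is a toy version under my own assumptions, not the tikz-draw skill itself: the spec format, `render`, and `verify_semantic` are hypothetical, and the check only understands the simple `\node` and `\draw` lines that `render` emits.

```python
import re

# Hypothetical figure spec: node positions plus the edges the figure must show.
SPEC = {
    "nodes": {"a": (0, 0), "b": (2, 0), "c": (1, 1.5)},
    "edges": [("a", "b"), ("b", "c"), ("a", "c")],
}


def render(spec):
    """Turn the spec into TikZ source, one node or edge per line."""
    lines = [r"\begin{tikzpicture}"]
    for name, (x, y) in spec["nodes"].items():
        lines.append(rf"  \node[circle, draw] ({name}) at ({x}, {y}) {{${name}$}};")
    for u, v in spec["edges"]:
        lines.append(rf"  \draw ({u}) -- ({v});")
    lines.append(r"\end{tikzpicture}")
    return "\n".join(lines)


NODE_RE = re.compile(r"\\node\[[^\]]*\] \((\w+)\) at")
EDGE_RE = re.compile(r"\\draw \((\w+)\) -- \((\w+)\);")


def verify_semantic(spec, tikz):
    """Read nodes and edges back out of the TikZ source and compare with the spec."""
    drawn_nodes = set(NODE_RE.findall(tikz))
    drawn_edges = {frozenset(edge) for edge in EDGE_RE.findall(tikz)}
    wanted_nodes = set(spec["nodes"])
    wanted_edges = {frozenset(edge) for edge in spec["edges"]}
    problems = []
    for node in sorted(wanted_nodes - drawn_nodes):
        problems.append(f"missing node {node}")
    for edge in sorted(tuple(sorted(e)) for e in wanted_edges - drawn_edges):
        problems.append(f"missing edge {edge[0]}-{edge[1]}")
    for edge in sorted(tuple(sorted(e)) for e in drawn_edges - wanted_edges):
        problems.append(f"unexpected edge {edge[0]}-{edge[1]}")
    return problems
```

The check runs on the source rather than the compiled PDF, so a figure that compiles cleanly but drops an edge still fails: deleting one `\draw` line from the rendered output makes `verify_semantic` report that edge as missing.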

Slides to a narrated video (and math animations)

Beyond static figures, the system can turn a finished slide deck into a narrated, captioned video. The slides-to-video skill takes prepared slides (PNG, PDF, or PPTX), writes or accepts a script in a chosen language and presenter role, and renders a spoken-over video with synchronized captions — a three-phase, human-in-the-loop flow with an explicit approval gate before anything is rendered, built only on free tools. For the mathematics itself, the manim-math-animation companion renders handwritten-style equation writing, morphing, and emphasis as short silent clips that splice straight into the deck.
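As one illustration of what “synchronized captions” involves, the sketch below builds a SubRip (SRT) caption file from script segments and their spoken durations. I am assuming SRT as the caption format and per-segment durations as the input; the skill may work differently, and both function names are hypothetical.

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time offset as HH:MM:SS,mmm, the timestamp style SRT uses."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def build_srt(segments) -> str:
    """segments: (caption_text, duration_seconds) pairs in playback order.

    Each caption starts where the previous one ended, so the captions stay
    aligned with the narration as long as the durations are accurate.
    """
    blocks = []
    start = 0.0
    for index, (text, duration) in enumerate(segments, start=1):
        end = start + duration
        blocks.append(
            f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"
        )
        start = end
    return "\n".join(blocks)
```

For example, `build_srt([("Hello", 2.5), ("World", 1.0)])` yields two numbered cues, the first running from 00:00:00,000 to 00:00:02,500 and the second picking up at exactly that point.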

Heavy compute, offloaded (Modal + GitHub Actions)

Some research steps are easy to parallelise but too heavy for a single box — enumerating all graphs up to some order to hunt for a counterexample, sweeping a parameter grid, or re-running a SageMath check over thousands of cases. The /research-compute skill routes those jobs to Modal through a small local broker, picking remote CPU, high-memory CPU, or GPU to fit the job: the agent packages the work, Modal spins up containers on demand, fans the work out, and streams results back. A search that would run for hours on the local box finishes in minutes across many workers, and you pay only for the seconds they actually run.
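The routing decision itself is simple to sketch. The Python below is a guess at the shape of that choice, not the broker's real logic: the `Job` fields, the backend labels, and every threshold are hypothetical values picked for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    """A unit of work as the broker might describe it (hypothetical fields)."""
    name: str
    est_cpu_seconds: float
    est_memory_gb: float
    needs_gpu: bool = False


# Illustrative thresholds only; a real broker would tune or measure these.
LOCAL_CPU_BUDGET_S = 600        # anything shorter just runs on the local box
LOCAL_MEMORY_GB = 16            # what the local box can spare
STANDARD_REMOTE_MEMORY_GB = 32  # above this, ask for a high-memory container


def choose_backend(job: Job) -> str:
    """Pick the cheapest backend that fits the job."""
    if job.needs_gpu:
        return "modal-gpu"
    if (
        job.est_cpu_seconds <= LOCAL_CPU_BUDGET_S
        and job.est_memory_gb <= LOCAL_MEMORY_GB
    ):
        return "local"
    if job.est_memory_gb > STANDARD_REMOTE_MEMORY_GB:
        return "modal-highmem"
    return "modal-cpu"
```

Under these assumptions a ten-hour graph enumeration with a small memory footprint goes to `modal-cpu`, while a quick sanity check never leaves the local machine.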

The same broker can also route a job to GitHub Actions, as the last automatic backend after local and Modal. This lane is deliberately narrow and ToS-compliant: GitHub restricts hosted-runner Actions to a repository’s own testing and validation, so the broker only triggers experiment code that is already committed to a private research repo, as that project’s own validation. Parameters are passed as data, never as code, and the runners are never treated as a general compute pool. Every dispatch is budget-gated: the broker first reserves the job’s worst-case minutes against the account’s remaining Actions minutes, and refuses the job rather than overspend.
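The budget gate and the data-only rule are both small pieces of logic. Here is a minimal Python sketch of each, assuming the remaining-minutes figure is supplied by the caller; `ActionsBudget`, `BudgetExceeded`, and `validate_params` are hypothetical names, and nothing here calls the GitHub API.

```python
class BudgetExceeded(Exception):
    """Raised when a dispatch could overspend the remaining Actions minutes."""


class ActionsBudget:
    """Tracks remaining minutes and what in-flight dispatches have reserved."""

    def __init__(self, remaining_minutes: int) -> None:
        self.remaining = remaining_minutes
        self.reserved = 0

    def reserve(self, jobs: int, timeout_minutes: int) -> int:
        """Reserve the worst case (every job runs to its timeout) or refuse."""
        worst_case = jobs * timeout_minutes
        available = self.remaining - self.reserved
        if worst_case > available:
            raise BudgetExceeded(
                f"need {worst_case} min, only {available} min available"
            )
        self.reserved += worst_case
        return worst_case

    def settle(self, reservation: int, used_minutes: int) -> None:
        """Release a reservation and charge only the minutes actually used."""
        self.reserved -= reservation
        self.remaining -= used_minutes


def validate_params(params: dict) -> dict:
    """Accept only plain scalar values, so parameters stay data and never code."""
    for key, value in params.items():
        if not isinstance(key, str) or not isinstance(value, (str, int, float, bool)):
            raise TypeError(f"parameter {key!r} is not a plain scalar")
    return dict(params)
```

Reserving the worst case up front is the conservative choice: a second dispatch is refused while the first is still running, even if the first later turns out to use only a fraction of its reservation.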

Trying it (limited) in a GitHub Codespace

You can run a live, interactive replica in a GitHub Codespace without any of my secrets. Open the repo, choose Code → Codespaces → Create, and the container builds itself: it installs the software stack, renders all the configuration, and runs the health checks.

Full instructions and the honest caveats (a Codespace is amd64; my machine is arm64, so it’s a functional — not bit-identical — replica) are in CODESPACES.md.

Want a real arm64 box like mine? The machine this system runs on is an Oracle Cloud Ampere A1 (arm64) instance, and Oracle’s Always Free tier hands out one in the same family at no cost: up to 4 Arm cores and 24 GB of RAM (which you can split across as many as four small VMs) plus around 200 GB of block storage — enough to host the full arm64 replica rather than the amd64 Codespace approximation. Free-tier offerings change, so check Oracle’s current Always Free list before relying on the exact numbers.

Secrets and keys you’d need to replicate it

The public repos contain no secrets, only sanitized templates and the names of the keys. To replicate the system you would supply your own. Almost every secret exists only because the system talks to some external service I happen to use, so for several of them you would not need the same key at all and might plug in a different service entirely. Here is why each category is needed, and where your choices would differ from mine:

Every key — where to obtain it, which file it lives in, and exactly what stops working without it — is documented key-by-key in SECRETS.md. Without them the system still installs and the Codespace still runs; it just degrades feature-by-feature, and make verify-secrets --degraded prints which feature each missing key disables.
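Feature-by-feature degradation boils down to a map from key names to the features they unlock. The sketch below shows that idea in Python; it is not the implementation behind the verify-secrets target, and the key names and feature descriptions are hypothetical placeholders (the real list is in SECRETS.md).

```python
import os

# Hypothetical key names and features, for illustration only.
KEY_FEATURES = {
    "ZOTERO_API_KEY": "Zotero library sync",
    "MODAL_TOKEN_ID": "remote compute on Modal",
    "SEMANTIC_SCHOLAR_API_KEY": "Semantic Scholar digests",
}


def degraded_report(env, key_features=KEY_FEATURES):
    """List one line per missing (or empty) key, naming the feature it disables."""
    return [
        f"{key} missing: {feature} disabled"
        for key, feature in key_features.items()
        if not env.get(key)
    ]


if __name__ == "__main__":
    # Report against the real environment when run as a script.
    for line in degraded_report(os.environ):
        print(line)
```

Passing the environment in as an argument, rather than reading `os.environ` inside the function, keeps the report easy to check against a made-up set of keys.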

Caveats

This is a personal, experimental system, not a product. It targets a specific arm64 Ubuntu setup, pins specific tool versions, and bakes in assumptions about my research (reconfiguration problems, graph invariants, LaTeX manuscripts). Expect to adapt it. If you only want to look, the Codespace degraded mode is the safest way to poke around.

What’s actually worked

It is not all cautionary tales — the same logs record real wins, across several different agents:

When the agents get it wrong

Because the same agents wrote this system, it is only fair to show how they fail. The following are real incidents from this project’s own logs (spanning Claude Code, Codex, DeepSeek, and the self-hosted OpenClaw bot), not hypotheticals — and they are why the workflow above has so many explicit gates:

The pattern is the same across all of them: an agent sounded certain, or an automated check looked green, when it was not. That is exactly why the research workflow front-loads a visible brief, adversarial verification, independent multi-agent review, and a final evidence gate — and why this system is labelled experimental and meant to be supervised, not trusted blindly.

Tools it builds on

The system glues together a number of smaller tools, several of which have their own repositories:

It also relies on established third-party software — Zotero, SageMath, Calibre, Lean, and the AI agent CLIs themselves — installed and configured by the rebuild.

Update (July 2026)

Three things have hardened since this post was first published — all in the same spirit of making the system survive the loss of its own machine:

A note on how this was made

The three repositories, their documentation, the backup/restore machinery, the CI, the Codespace, and this blog post were all written by the same multi-agent AI coding system that the repositories back up and replicate — the agents building (and documenting) their own infrastructure. That self-referential loop is half the fun, and half the reason to treat it as experimental.

Recently the loop turned on its own toolset. The agents mined a public collection of engineering agent skills, proposed how each might sharpen the system’s own clarity and verification, and (the part that matters) adversarially verified every candidate against what the system already did, keeping only the genuine gaps and dropping the duplicates. Several survivors map straight onto the failures in “When the agents get it wrong”: the intent-interview and the in-flight doubt check against confident-but-wrong answers, a prove-it delivery gate against calling work “done” before it is tested, and a source-grounding gate against invented references. It is the same discipline the system applies to research, turned on itself.