Replicating My AI-Built Research Workstation
Summary
Over time my research machine grew into a small fleet of AI coding agents — Claude Code, Codex, GitHub Copilot CLI, OpenCode, DeepSeek, Gemini, and a self-hosted bot called OpenClaw — wired together with a shared set of research skills: a Zotero library, a paper downloader, daily literature digests, multi-agent proof/manuscript reviews, and a SageMath sandbox. To make that setup reproducible (and survivable), the agents built their own backup-and-replication system: a public GitHub repo of sanitized configuration plus a single encrypted archive of secrets that, together, can rebuild the whole machine on a fresh Ubuntu box. This post is a short tour of what the system does, two or three real examples, how to try its basic functions for free in a GitHub Codespace, and which keys you would need to fully replicate it.
This is experimental. Everything below — and this very post — was generated by the multi-agent AI coding system it describes. It was built for one person’s workflow (combinatorics and graph theory, especially combinatorial reconfiguration), it assumes an arm64 Ubuntu host, and it may not work as expected on other machines, other agent versions, or research tasks outside those assumptions.
What this is
The system lives in three public repositories:
- coding-system-rebuild: the umbrella, a Makefile-driven backup/restore that captures every agent’s settings, the research skills, and the system layer (npm globals, Python environments, Docker images, services, shell), with a manifest-based capture and a multi-layer secret-leak scanner. It splits into a public repo (sanitized config + scripts) and one AES-256 encrypted zip of secrets kept off GitHub.
- openclaw-bot — the rebuild component for OpenClaw, the always-on research-assistant bot.
- ai-agents-skills — the shared, manifest-driven skill installer that gives each agent the same research commands. It now also provisions the runtime-backed skills into the sandboxed OpenClaw bot — through an approval-gated runtime manifest and a small host broker — so the always-on bot runs the same skills as the interactive agents.
The motivation was never “back up dotfiles” — it was to keep a research workflow intact: ask a question, gather the literature, compute, verify, and write, with the agents doing the legwork. The rest of this post focuses on that workflow. Architecture details are in ARCHITECTURE.md.
What it does
The research workflow (the main motivation)
A non-trivial research question goes through visible gates rather than a single black-box answer: a short Research Brief scopes the question, a deep-research pass fans out web searches, fetches sources, and adversarially verifies each claim, and review/verification gates check the evidence before anything is called “done.”
The gate discipline has since grown at both ends, and beyond research. An intent-interview pins down what is actually being asked before the brief — one question at a time, each with a guess attached, until the real question is confirmed. An in-flight doubt check re-examines a non-trivial decision — a reduction step, a boundary, an assumption a type or proof checker cannot see — with a fresh-context reviewer while it is still cheap to change, not only at the end. And the same prove-it-then-verify discipline now applies to engineering and general tasks too, not just research.
- Example: literature landscape. “Map the complexity of token sliding on a graph class.” The system first shows a one-screen brief (what it will search, which databases, what counts as evidence), then runs the deep pass and returns a cited summary; claims that can’t be sourced are flagged, not hidden.
- Example: proof stress-test. “Find holes in this proof.” A multi-agent panel, the Lakatos “proofs and refutations” template, runs as separate agents (a Prover, a Counterexample Hunter, a Monster-Barrer, and a Formalist) over several rounds and reports concrete gaps and counterexamples.
These flows live in the agent instructions and the multi-agent templates in the agent-group-discuss skill.
Getting papers (Zotero first)
Document lookup follows a strict order — Zotero → Calibre → online — so a paper already in my 10,000-item library is never re-downloaded.
- Example. “Get me the PDF of the Nishimura reconfiguration survey.” The Zotero skill searches the library and returns the attachment; if it isn’t there, a separate downloader, getscipapers (wrapped by the getscipapers-requester skill), fetches it by DOI/ISBN/title. Adding a new arXiv paper automatically sets its type to manuscript, renames the PDF to a consistent pattern, and asks which collection it belongs to.
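The lookup order above can be sketched as a simple fallback chain. This is a minimal illustration, not the actual skill code: the function names, the toy sources, and the file paths are all made up for the example.

```python
from typing import Callable, Optional

# A source is any callable that maps a query to a file path, or None on a miss.
Source = Callable[[str], Optional[str]]

def find_document(query: str, sources: list[tuple[str, Source]]) -> Optional[tuple[str, str]]:
    """Try each source in order and stop at the first hit."""
    for name, lookup in sources:
        path = lookup(query)
        if path is not None:
            return name, path
    return None

# Toy stand-ins for the real skills (hypothetical data).
ZOTERO_LIBRARY = {"nishimura survey": "/library/zotero/nishimura-survey.pdf"}
CALIBRE_LIBRARY: dict[str, str] = {}
online_requests: list[str] = []

def zotero_lookup(query: str) -> Optional[str]:
    return ZOTERO_LIBRARY.get(query)

def calibre_lookup(query: str) -> Optional[str]:
    return CALIBRE_LIBRARY.get(query)

def online_lookup(query: str) -> Optional[str]:
    # Record the request so it is visible when a download was actually needed.
    online_requests.append(query)
    return "/downloads/" + query.replace(" ", "-") + ".pdf"

SOURCES = [("zotero", zotero_lookup), ("calibre", calibre_lookup), ("online", online_lookup)]

if __name__ == "__main__":
    print(find_document("nishimura survey", SOURCES))
    print(find_document("some new preprint", SOURCES))
```

Because the chain stops at the first hit, the online source is only reached for documents that neither library already holds.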
Ingestion is powered by a local Zotero Translation Server — a small Docker service (the same engine behind the Zotero browser connector) that turns a URL, DOI, or identifier into a fully-catalogued item with correct metadata. It’s what makes one-command “add this paper” work, and the rebuild restores it along with the library config (see INSTALL.md).
Daily arXiv / Semantic Scholar and RSS digests surface new papers on tracked topics. (Some download methods are a separate topic, discussed here.)
Multi-agent tasks
For work that benefits from independent perspectives, tasks are split across several agents that run in parallel within a round and hand off between rounds — the same idea as the proof example above, applied to other jobs:
- Example — pre-submission review (how the procedure runs). Ask “review this draft.” The Knuth manuscript-review template spawns three reviewers — Correctness, Exposition, and Literature — each its own agent with its own context. They run in parallel within a round (two rounds by default), then a synthesis step merges their findings into one report that separates “this is wrong” from “this is unclear.” A lighter single-reviewer pass uses the paper-review skill.
- Example — annotated review (independent verification). When you want a marked-up PDF, the annotated-review pipeline runs four phases with deliberately separate, clean contexts: a Reviewer, then a Verifier that re-checks the findings without seeing the reviewer’s reasoning, then a Trust Verifier that checks every citation, then an output phase. It emits three artifacts — an annotated LaTeX PDF, a PyMuPDF-marked-up PDF, and a companion HTML. The clean-context separation is the point: it keeps one agent’s mistake from silently propagating into the “verified” output.
- Example: formalization. A Lean team (Planner, Formalizer, Miner, Repair, Checker) turns a lemma into a Lean skeleton and chips away at the sorrys, starting from the lean-formalization-intake skill.
- Example — a bounded handoff to a second model (cross-agent delegation). When a review wants an independent opinion from a different agent — say, having Codex or DeepSeek re-check the citations in a section while Claude reviews the proofs — the cross-agent-delegation skill writes a bounded task packet: only the objective, the references, the constraints, what evidence to return, and the expected output shape — no credentials, no conversation history. The parent agent stays in charge — it confirms the handoff and treats whatever comes back as untrusted evidence to validate, not an answer to trust. That keeps two agents’ contexts cleanly separate while still letting one check the other.
- Example — reviewing a whole draft autonomously, with a ledger (autonomous-research-loop). For “review this paper and keep going until it’s done,” the review runs as a bounded loop rather than one pass: a fixed budget (so many iterations, so many helper agents), an explicit goal and success criteria, and an append-only ledger that records, every iteration, what was checked, what evidence backed it, what gaps remain, and whether to continue, revise, delegate, or stop. Evidence gates keep a finding from being accepted without a source, a recovery note lets the loop resume after a context reset, and a single iteration can hand a sub-check to another agent via the delegation packet above. It stops when the criteria are met or the budget runs out — never looping forever.
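A bounded loop with an append-only ledger, as in the last example, might look roughly like the sketch below. It is an assumption-laden illustration rather than the autonomous-research-loop skill itself: the JSON-lines ledger format, the `check` callable, and the decision names are hypothetical.

```python
import json
from pathlib import Path

def run_bounded_loop(ledger_path, check, max_iterations=5):
    """Iterate until `check` reports no gaps or the iteration budget is spent.

    `check(iteration)` is a hypothetical callable returning a dict with
    'checked' (str), 'evidence' (list of sources) and 'gaps' (list of open items).
    """
    ledger = Path(ledger_path)
    entries = []
    if ledger.exists():
        # Recovery: resume from whatever earlier sessions already recorded.
        entries = [json.loads(line) for line in ledger.read_text().splitlines() if line.strip()]
    if entries and entries[-1]["decision"] == "stop":
        return "criteria met"

    for iteration in range(len(entries) + 1, max_iterations + 1):
        result = check(iteration)
        if not result["evidence"]:
            decision = "revise"      # evidence gate: nothing is accepted unsourced
        elif result["gaps"]:
            decision = "continue"
        else:
            decision = "stop"
        entry = {"iteration": iteration, **result, "decision": decision}
        with ledger.open("a") as handle:   # append-only: earlier entries are never rewritten
            handle.write(json.dumps(entry) + "\n")
        if decision == "stop":
            return "criteria met"
    return "budget exhausted"
```

Keeping the loop state on disk, one line per iteration, is what lets a fresh session pick up where the previous one stopped, and the fixed `max_iterations` is what guarantees termination.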
There is also a SageMath sandbox for the small computations these tasks need — chromatic/Tutte polynomials, automorphism groups, exhaustive small-case checks — with ready-made templates such as reconfiguration_check.sage and counterexample_search.sage.
Those computed structures usually have to become figures in a paper. The tikz-draw skill builds structural diagrams — finite graphs, gadgets, automata, trees, commutative diagrams — with a structure-first loop: figure brief → spec → render → verify-semantic → compile → review, so a diagram is checked against the structure it is meant to show rather than just compiled.
- Example: Sage-assisted graph figure. For a graph beyond the built-in layouts (a specific construction, a computed layout, or a transformation before drawing), tikz-draw switches to a Sage-assisted graph mode (graph_mode: auto | local | sage): SageMath computes the graph’s semantics and coordinates, while tikz-draw keeps ownership of render, compile, and review. So the same Sage that checks a construction (e.g. a reconfiguration_check.sage run) can also produce the picture of it that goes into the manuscript.
- The verify-semantic pass then confirms the rendered figure actually encodes the intended nodes and edges, and the figure can go through the same independent review discipline as the prose, connecting the compute, draw, and review skills end to end.
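To make the idea of a semantic check concrete, here is a small sketch that compares the nodes and edges found in TikZ source against an intended graph. It is not the tikz-draw implementation: the regular expressions only understand plain `\node (name)` declarations and `(a) -- (b)` paths, and the function name and report keys are invented for this example.

```python
import re

# Matches "\node (name)" with an optional [options] block before the name.
NODE_RE = re.compile(r"\\node(?:\[[^\]]*\])?\s*\((\w+)\)")
# Matches "(a) -- (b)"; the lookahead leaves "(b)" available so chains a -- b -- c work.
EDGE_RE = re.compile(r"\((\w+)\)\s*--\s*(?=\((\w+)\))")

def verify_semantic(tikz_source, nodes, edges):
    """Report how the drawn graph differs from the intended (undirected) one."""
    drawn_nodes = set(NODE_RE.findall(tikz_source))
    drawn_edges = {frozenset(pair) for pair in EDGE_RE.findall(tikz_source)}
    wanted_nodes = set(nodes)
    wanted_edges = {frozenset(pair) for pair in edges}
    return {
        "missing_nodes": wanted_nodes - drawn_nodes,
        "extra_nodes": drawn_nodes - wanted_nodes,
        "missing_edges": wanted_edges - drawn_edges,
        "extra_edges": drawn_edges - wanted_edges,
    }

if __name__ == "__main__":
    figure = r"""
    \node (a) at (0,0) {$a$};
    \node[circle] (b) at (1,0) {$b$};
    \node (c) at (2,0) {$c$};
    \draw (a) -- (b) -- (c);
    """
    # Intended structure: the path a - b - c - d. The figure forgot vertex d.
    print(verify_semantic(figure, ["a", "b", "c", "d"], [("a", "b"), ("b", "c"), ("c", "d")]))
```

A figure like this one compiles without complaint, which is exactly why a check against the intended structure catches what compilation alone cannot.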
Slides to a narrated video (and math animations)
Beyond static figures, the system can turn a finished slide deck into a narrated, captioned video. The slides-to-video skill takes prepared slides (PNG, PDF, or PPTX), writes or accepts a script in a chosen language and presenter role, and renders a spoken-over video with synchronized captions — a three-phase, human-in-the-loop flow with an explicit approval gate before anything is rendered, built only on free tools. For the mathematics itself, the manim-math-animation companion renders handwritten-style equation writing, morphing, and emphasis as short silent clips that splice straight into the deck.
- Example — a talk from a paper. Hand it the slides for a result and ask for a short explainer: it drafts the narration, lets you correct it at the approval gate, animates the key equations with Manim, and renders the whole thing into one captioned video — with no proprietary tools anywhere in the path.
Heavy compute, offloaded (Modal + GitHub Actions)
Some research steps are easy to parallelise but too heavy for a single box — enumerating all graphs up to some order to hunt for a counterexample, sweeping a parameter grid, or re-running a SageMath check over thousands of cases. The /research-compute skill routes those jobs to Modal through a small local broker, picking remote CPU, high-memory CPU, or GPU to fit the job: the agent packages the work, Modal spins up containers on demand, fans the work out, and streams results back. A search that would run for hours on the local box finishes in minutes across many workers, and you pay only for the seconds they actually run.
- Example. “Is there a counterexample to this bound among all graphs on at most n vertices?” The agent wraps the same Sage/Python check it would otherwise run once in the local sandbox, .map()s it over the generated instances on remote CPU (enumeration and counterexample search default to CPU, not GPU), and returns the first counterexample, or a clean “none up to n.” Batch OCR, embeddings, or other tensor work instead routes to a GPU automatically.
- What’s on tap. Modal is serverless and pay-per-use, with (at the time of writing) a monthly slice of free credits that comfortably covers occasional searches. Per job you can request, roughly, tens of CPU cores and hundreds of GB of RAM, GPUs from a T4/L4/A10G up to A100s and H100s, and thousands of short-lived containers in parallel — so one skill covers both a quick brute-force sweep and an occasional GPU run, without keeping any of that hardware around.
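The shape of such a search, one check mapped over many generated instances, can be sketched locally. In the example below a thread pool stands in for the remote workers, and the claim being tested is a deliberately false toy statement ("a triangle-free graph on n vertices has at most n edges") chosen only so that the search has something to find; none of this is the /research-compute skill's own code.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations

def graphs_on(n):
    """Yield every labeled graph on vertices 0..n-1 as (n, tuple_of_edges)."""
    pairs = list(combinations(range(n), 2))
    for mask in range(1 << len(pairs)):
        yield n, tuple(pair for i, pair in enumerate(pairs) if (mask >> i) & 1)

def violates_claim(graph):
    """Return the graph if it refutes the toy claim, else None."""
    n, edges = graph
    if len(edges) <= n:
        return None                      # the bound holds for this graph
    edge_set = set(edges)
    for a, b, c in combinations(range(n), 3):
        if (a, b) in edge_set and (a, c) in edge_set and (b, c) in edge_set:
            return None                  # contains a triangle, so the claim does not apply
    return graph

def first_counterexample(max_n, workers=4):
    """Map the check over all graphs with up to max_n vertices, smallest order first."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for n in range(1, max_n + 1):
            for hit in pool.map(violates_claim, graphs_on(n)):
                if hit is not None:
                    return hit
    return None

if __name__ == "__main__":
    print(first_counterexample(4))   # no counterexample exists up to 4 vertices
    print(first_counterexample(5))   # a triangle-free graph with 6 edges on 5 vertices
```

The check function is the same one that would run once in the sandbox; only the mapping layer changes when the work moves to remote containers.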
The same broker can also route a job to GitHub Actions, as the last automatic backend after local and Modal. This lane is deliberately narrow and ToS-compliant: GitHub restricts hosted-runner Actions to a repository’s own testing and validation, so the broker dispatches only into a private research repo’s own committed experiment code — passing parameters as data, never code — as that project’s own validation, and never as a general compute pool. Every dispatch is budget-gated: it reserves the worst-case minutes against your account’s remaining Actions minutes first, and refuses rather than overspend.
- Example. One of my private research repos carries a small in-repo experiments/ runner: Python, C++, and SageMath jobs that validate that project’s own results. Submitting one of those parameter sweeps through the broker dispatched a GitHub Actions run; the broker then correlated the run, waited on it, and fetched the result back here, budget-gated end to end. A one-time bootstrap sets the broker up first: generate config, authenticate the GitHub CLI, check readiness. The routing rules and the full ToS rationale live in the experiment-runner plan.
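The "reserve the worst case first, refuse rather than overspend" rule can be illustrated with a few lines of bookkeeping. This is a sketch under assumptions: the class and method names are hypothetical, and a real broker would read the remaining minutes from the account rather than take them as a constructor argument.

```python
class BudgetExceeded(Exception):
    """Raised when a dispatch could overspend the remaining minutes."""

class MinutesBudget:
    def __init__(self, remaining_minutes):
        self.remaining = remaining_minutes

    def dispatch(self, jobs, timeout_minutes, run):
        """Reserve jobs * timeout up front, run, then refund what was not used.

        `run` is a hypothetical callable that performs the dispatch and
        returns the minutes actually consumed.
        """
        worst_case = jobs * timeout_minutes
        if worst_case > self.remaining:
            raise BudgetExceeded(
                "need up to %d min, only %d left" % (worst_case, self.remaining)
            )
        self.remaining -= worst_case          # reserve before anything runs
        used = run()
        self.remaining += worst_case - used   # refund the unused part
        return used

if __name__ == "__main__":
    budget = MinutesBudget(100)
    budget.dispatch(jobs=4, timeout_minutes=10, run=lambda: 12)
    print(budget.remaining)
    try:
        budget.dispatch(jobs=10, timeout_minutes=10, run=lambda: 0)
    except BudgetExceeded as err:
        print("refused:", err)
```

Checking against the worst case, not the expected cost, means a refused job never starts at all, so there is no partially run sweep to clean up.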
Trying it (limited) in a GitHub Codespace
You can run a live, interactive replica in a GitHub Codespace without any of my secrets. Open the repo, Code → Codespaces → Create, and the container builds itself: it installs the software stack, renders all the configuration, and runs the health checks.
- Without secrets (default): a working but degraded replica. You can read every configuration, run make verify / make test, and exercise the skill plumbing. This is the same thing the project’s GitHub Actions run on every commit.
- With secrets (optional): the Codespace forwards a small web upload form. If you have an encrypted secrets zip, you upload it there (it never touches GitHub, and is shredded right after use) to complete the full replica. Starting the live bot is opt-in, because it would connect to real chat channels.
Full instructions and the honest caveats (a Codespace is amd64; my machine is arm64, so it’s a functional — not bit-identical — replica) are in CODESPACES.md.
Want a real arm64 box like mine? The machine this system runs on is an Oracle Cloud Ampere A1 (arm64) instance, and Oracle’s Always Free tier hands out one in the same family at no cost: up to 4 Arm cores and 24 GB of RAM (which you can split across as many as four small VMs) plus around 200 GB of block storage — enough to host the full arm64 replica rather than the amd64 Codespace approximation. Free-tier offerings change, so check Oracle’s current Always Free list before relying on the exact numbers.
Secrets and keys you’d need to replicate it
The public repos contain no secrets — only sanitized templates and the names of the keys. To replicate the system you would supply your own. The key thing to understand is that almost every secret exists only because the system talks to some external service I happen to use — so for several of them you would not need the same key at all, and might plug in a different service entirely. Here is why each category is needed, and where your choices would differ from mine:
- Model providers — API keys for the LLM backends the agents actually call (two Claude-model resellers as primary and fallback, plus DeepSeek, Groq, an OpenAI/Codex key, and a Google/Gemini key). Why: every agent turn is an API call, so with no working provider key nothing runs at all. You would use whichever provider(s) you have an account with — my particular fallback chain is not special, and a single working key is enough to start.
- Zotero + attachments: a Zotero API key, plus a WebDAV password. Why: the /zotero skill reads and writes my online Zotero library, and my PDF attachments sync over WebDAV. If you keep your library locally, or sync attachments through Zotero’s own storage or a different WebDAV host, you would swap or drop these.
- Google Drive service account. Why: my ebook library and several research files live on Google Drive, so the Calibre/Drive skills authenticate with a Google service-account JSON. You very likely do not need Google at all. If your library sits on a local disk you need nothing here; and since the off-machine copy of the encrypted secrets zip already goes to Dropbox (via rclone), you might instead add a Dropbox, S3, or plain-local remote. This key encodes my storage choice, not a requirement of the system; it is exactly the kind of secret you would replace rather than reuse.
- Paper-retrieval logins — per-service accounts for the paper downloader, used only for papers not already in my library. Why: each academic source needs its own login; a missing one disables just that one source while the rest keep working.
- Messaging channels and email — a Telegram bot token (+ chat id), optional Zalo / Zulip / Google-Chat credentials, and SMTP credentials for email. Why: the system delivers files and digest notifications to me over chat or email, and the self-hosted OpenClaw bot listens on those channels. The send-email skill adds plain SMTP delivery from every agent, with optional PGP-signed messages. Use whatever channel you prefer, or none — delivery then just falls back to writing files locally.
- Infrastructure: SSH keys and a GitHub token (to push backups), plus an optional Tailscale auth key (private networking between my own machines), a Docker registry login (private image pulls), a Modal token (offloading heavy compute), a GnuPG keyring (the PGP private key that signs outgoing email), and the rclone remote above. Why: these wire the machine to my hosting choices; each is optional and degrades exactly one capability when absent.
Every key — where to obtain it, which file it lives in, and exactly what stops working without it — is documented key-by-key in SECRETS.md. Without them the system still installs and the Codespace still runs; it just degrades feature-by-feature, and make verify-secrets --degraded prints which feature each missing key disables.
Caveats
This is a personal, experimental system, not a product. It targets a specific arm64 Ubuntu setup, pins specific tool versions, and bakes in assumptions about my research (reconfiguration problems, graph invariants, LaTeX manuscripts). Expect to adapt it. If you only want to look, the Codespace degraded mode is the safest way to poke around.
What’s actually worked
It is not all cautionary tales — the same logs record real wins, across several different agents:
- Self-debugging against a brute-force oracle. An agent checked a fast structural algorithm against exhaustive ground truth, found and fixed two counting bugs to push agreement past 98%, and — the hard part — correctly diagnosed that the remaining mismatches were not a coding bug but a genuine limitation: a lemma proved for one structure that does not carry over to a more general one.
- Multi-agent reviews that change the outcome. Independent reviewer agents have flagged real gaps in a draft argument, leading to a deliberate rollback to an earlier, sound version instead of papering over the hole — exactly the “separate wrong from merely unclear” behaviour the review templates are built for.
- Semantically-verified figures. Codex produces TikZ hardness-reduction figures that are checked against a semantic contract — the actual source graph, gadgets, and ports must be present — and then inspected at high resolution for label/edge collisions, rather than trusted just because they compiled.
- The system that wrote this. The most concrete success is the subject of the post itself: the agents built, tested (green CI), and documented their own backup/replication machinery — the leak scanner, the Codespace, the key-rotation tooling — end to end.
When the agents get it wrong
Because the same agents wrote this system, it is only fair to show how they fail. The following are real incidents from this project’s own logs (spanning Claude Code, Codex, DeepSeek, and the self-hosted OpenClaw bot), not hypotheticals — and they are why the workflow above has so many explicit gates:
- A confident, wrong “it’s not there.” Asked to send a file over Telegram, an agent declared Telegram “not configured” — when the bot token was sitting in plain sight in the secrets file. It had trusted an incomplete memory note instead of checking. The standing rule now is: search the workspace before claiming anything is missing.
- “Done” before it was tested. An agent reported two code fixes as finished without ever running them. The rule now is that no change is called done until the changed code has actually been executed.
- Hallucinated links and references. An earlier draft of this very post linked “combinatorial reconfiguration” and a review-template name to entirely unrelated pages. A human noticed, which is why every external link here was re-checked against its real target before publishing.
- A secret that slipped past the scanner. An early version of the leak-scanner matched secrets by field name only; a real provider API key got past it and was briefly committed to a public repo. The fix was not just to make the repo private and rewrite its history, but to harden the scanner to catch value-shaped patterns (not only field names), add a test that the scanner always covers the redactor, and rotate the exposed key — because making a repo private does not un-expose a key that was already public.
- Quiet, invisible failures. A broken package.json two directories up silently made every editor hook exit with an error for a while, including the guard meant to vet edits, and nobody noticed, because the failure was swallowed. Separately, an installer happily “succeeded” while silently skipping source directories that did not exist, which would have produced an empty install that looked fine.
- A different agent, the same overconfidence. Run through a one-shot path without being handed its sources, the DeepSeek agent invented source-ledger details instead of reporting that it had none — a textbook hallucination. The fix was to only ever run it through the path that actually loads project context.
- “Sent!” when nothing arrived. An OpenClaw agent reported a file delivered to chat when the upload had silently failed and the user received nothing. The rule now is to check the channel’s "status":"ok" response before claiming a send succeeded, and never to infer delivery from the absence of an error.
- “Figure fixed” from a glance. An agent called a manuscript figure fixed after a quick full-page PDF glance, missing label-on-edge collisions that only show up at high resolution. Figures are now inspected at the figure level — geometry and semantics — before being called done.
- A runaway search. A broad strings / rg sweep over system directories blew up the context window and had to be killed by hand, a reminder that an agent told to “just search everything” can quietly dig itself into a hole.
- “Complete,” verified by counting. A full reinstall of the shared skills silently dropped a quarter of them on every agent: one CLI flag quietly replaced the profile selection instead of adding to it, and dependency back-fill restored just enough to look normal. Parallel verification agents then passed the result by counting directories — leftover legacy folders inflated the counts. The gap surfaced only when installed coverage was diffed against the manifest itself. The rule: verify against the spec, never by counting artifacts.
- A plan that invalidated itself. The same installer skipped workflow templates as “missing backing skill” — in the very run that installed those backing skills — because the plan was computed once, against the state before its own changes applied. Installs are now re-verified after apply, not just planned.
- Two writers, one ledger. After a bounded research loop was handed to an unattended driver, the interactive “keep going” hook — built earlier to stop an agent from abandoning that same loop — fired anyway and demanded an iteration in parallel with the driver’s session, on a ledger designed for exactly one writer. Two enforcement mechanisms, each correct alone, had never been introduced to each other. The hook now stands down while a live driver owns the loop.
- Confident about the automation’s behavior. Asked whether backups publish automatically, an agent said yes — remembering the push line in the script but not the opt-in guard around it. Reading the guard and the actual cron entry showed publishing is manual by default. Quote the guard, not the memory of it.
- A lost answer is not a lost action. A commit-and-push command appeared to fail with an internal error and no output; the retry then found nothing left to do, and the agent briefly credited a phantom “parallel session” — until the git log showed its own first command had executed and only the tool’s answer had been lost. Effects live in the repository, not in the transcript: verify state before explaining it.
The pattern is the same across all of them: an agent sounded certain, or an automated check looked green, when it was not. That is exactly why the research workflow front-loads a visible brief, adversarial verification, independent multi-agent review, and a final evidence gate — and why this system is labelled experimental and meant to be supervised, not trusted blindly.
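The scanner incident above comes down to the difference between matching field names and matching value shapes. The sketch below shows that difference on a toy input. It is not the project's scanner: the patterns are illustrative assumptions (an "sk-" style key, GitHub's ghp_/gho_/ghu_/ghs_/ghr_ token prefixes, a PEM private-key header), and a real scanner would need a much longer list.

```python
import re

# Name-based rule: flags lines that assign to a sensitive-looking field.
NAME_HINT = re.compile(r"(?i)\b\w*(?:api_key|token|password|secret)\w*\s*[:=]")

# Value-based rules: flag text that looks like a credential, whatever the field is called.
VALUE_SHAPES = [
    re.compile(r"\bsk-[A-Za-z0-9_-]{20,}"),
    re.compile(r"\bgh[pousr]_[A-Za-z0-9]{36,}"),
    re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
]

def scan_by_name(text):
    """Return 1-based line numbers whose field name looks sensitive."""
    return [n for n, line in enumerate(text.splitlines(), 1) if NAME_HINT.search(line)]

def scan_by_value(text):
    """Return 1-based line numbers containing a credential-shaped value."""
    return [
        n for n, line in enumerate(text.splitlines(), 1)
        if any(shape.search(line) for shape in VALUE_SHAPES)
    ]

def scan(text):
    """Union of both rules, so either signal is enough to block a commit."""
    return sorted(set(scan_by_name(text)) | set(scan_by_value(text)))

if __name__ == "__main__":
    fake_key = "sk-" + "a" * 24          # built at runtime; not a real key
    sample = "\n".join([
        'fallback = "%s"' % fake_key,    # innocuous field name, key-shaped value
        "TELEGRAM_TOKEN=${TELEGRAM_TOKEN}",
        'model = "demo"',
    ])
    print("by name: ", scan_by_name(sample))
    print("by value:", scan_by_value(sample))
    print("combined:", scan(sample))
```

The first sample line is the failure mode from the incident: a name-only rule passes it because nothing about the field name looks sensitive.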
The system glues together a number of smaller tools, several of which have their own repositories:
- getscipapers — get and request scientific papers from various sources (the paper downloader used above).
- translation-server — a Node.js server that runs Zotero translators (the Zotero ingestion engine; the official image is arm64, this is the amd64 build).
- vnthuquan — a wrapper/CLI for Vietnam Thu Quan ebook discovery and EPUB downloads, on the Calibre/ebook side of the library.
- vnu-eoffice — a local VNU e-office document monitor with Telegram alerts.
It also relies on established third-party software — Zotero, SageMath, Calibre, Lean, and the AI agent CLIs themselves — installed and configured by the rebuild.
Update (July 2026)
Three things have hardened since this post was first published — all in the same spirit of making the system survive the loss of its own machine:
- The backup passphrase is no longer a single point of failure. The encrypted zip and the data snapshots are protected by one passphrase, and that passphrase used to exist only on the source machine: lose the machine, lose every backup it ever encrypted. It is now split with 2-of-N Shamir secret sharing (a small, standard-library-only implementation) across independent locations: the local disk, the cloud remote that already holds the offsite backups, and a private repository. Any single share reveals nothing; any two reconstruct the passphrase, as verified by a live drill that recovered it from the two off-machine shares alone. Details in SECRETS.md.
- Backups now run fully unattended. A weekly job executes the whole chain — sanitized capture, encrypted zip, offsite sync, an encrypted snapshot of the bot’s research data, the passphrase-escrow refresh, and component-pin updates — and alerts me when any step fails. Publishing to the public repo stays a manual, leak-scanned step, for the same reason as before: a scanner false negative must never go public automatically.
- The bounded research loops can now run headless. The autonomous-research-loop described above no longer needs a human typing “continue”: a small driver respawns a fresh agent session per iteration against the on-disk loop state — for any of the agent CLIs, not just one — and treats provider-quota exhaustion as pause and wait rather than failure. The stop conditions (budget, success criteria, an explicit stop request) stay in charge, exactly as before.
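For readers who have not met Shamir secret sharing, here is a standard-library-only sketch of the 2-of-N case described in the first item. It illustrates the technique, not the code in the repository: the choice of prime field, the share format, and the function names are assumptions made for this example.

```python
import secrets

# A Mersenne prime; secrets of up to 65 bytes fit below it.
PRIME = 2**521 - 1

def split_secret(secret: bytes, n: int, k: int = 2):
    """Split `secret` into n shares such that any k of them recover it."""
    value = int.from_bytes(secret, "big")
    if value >= PRIME:
        raise ValueError("secret too long for this field")
    # Random polynomial of degree k-1 whose constant term is the secret.
    coeffs = [value] + [secrets.randbelow(PRIME) for _ in range(k - 1)]
    shares = []
    for x in range(1, n + 1):
        y = 0
        for coeff in reversed(coeffs):      # Horner evaluation mod PRIME
            y = (y * x + coeff) % PRIME
        shares.append((x, y))
    return shares

def recover_secret(shares, length: int) -> bytes:
    """Lagrange-interpolate the polynomial at x = 0 from k distinct shares."""
    value = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        value = (value + yi * num * pow(den, -1, PRIME)) % PRIME
    return value.to_bytes(length, "big")

if __name__ == "__main__":
    passphrase = b"correct horse battery"
    local, cloud, repo = split_secret(passphrase, n=3)
    # Recover from the two off-machine shares alone.
    print(recover_secret([cloud, repo], len(passphrase)))
```

With k = 2 the polynomial is a line through the secret at x = 0: one point leaves every possible intercept equally likely, while two points fix the line. Note that `pow(den, -1, PRIME)` needs Python 3.8 or later.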
A note on how this was made
The three repositories, their documentation, the backup/restore machinery, the CI, the Codespace, and this blog post were all written by the same multi-agent AI coding system that the repositories back up and replicate — the agents building (and documenting) their own infrastructure. That self-referential loop is half the fun, and half the reason to treat it as experimental.
Recently the loop turned on its own toolset. The agents mined a public collection of engineering agent skills, proposed how each might sharpen the system’s own clarity and verification, and — the part that matters — adversarially verified every candidate against what the system already did, keeping only the genuine gaps and dropping the duplicates. Several survivors map straight onto the failures in When the agents get it wrong: the intent-interview and the in-flight doubt check against confident-but-wrong, a prove-it delivery gate against “done” before tested, a source-grounding gate against invented references. The same discipline the system applies to research, turned on itself.