Replicating My AI-Built Research Workstation
Summary
Over time my research machine grew into a small fleet of AI coding agents — Claude Code, Codex, GitHub Copilot CLI, OpenCode, DeepSeek, Gemini, and a self-hosted bot called OpenClaw — wired together with a shared set of research skills: a Zotero library, a paper downloader, daily literature digests, multi-agent proof/manuscript reviews, and a SageMath sandbox. To make that setup reproducible (and survivable), the agents built their own backup-and-replication system: a public GitHub repo of sanitized configuration plus a single encrypted archive of secrets that, together, can rebuild the whole machine on a fresh Ubuntu box. This post is a short tour of what the system does, two or three real examples, how to try its basic functions for free in a GitHub Codespace, and which keys you would need to fully replicate it.
This is experimental. Everything below — and this very post — was generated by the multi-agent AI coding system it describes. It was built for one person’s workflow (combinatorics and graph theory, especially combinatorial reconfiguration), it assumes an arm64 Ubuntu host, and it may not work as expected on other machines, other agent versions, or research tasks outside those assumptions.
What this is
The system lives in three public repositories:
- coding-system-rebuild: the umbrella. A Makefile-driven backup/restore that captures every agent’s settings, the research skills, and the system layer (npm globals, Python environments, Docker images, services, shell), with a manifest-based capture and a multi-layer secret-leak scanner. It splits into a public repo (sanitized config + scripts) and one AES-256 encrypted zip of secrets kept off GitHub.
- openclaw-bot: the rebuild component for OpenClaw, the always-on research-assistant bot.
- ai-agents-skills: the shared, manifest-driven skill installer that gives each agent the same research commands.
The motivation was never “back up dotfiles” — it was to keep a research workflow intact: ask a question, gather the literature, compute, verify, and write, with the agents doing the legwork. The rest of this post focuses on that workflow. Architecture details are in ARCHITECTURE.md.
What it does
The research workflow (the main motivation)
A non-trivial research question goes through visible gates rather than a single black-box answer: a short Research Brief scopes the question, a deep-research pass fans out web searches, fetches sources, and adversarially verifies each claim, and review/verification gates check the evidence before anything is called “done.”
- Example: literature landscape. “Map the complexity of token sliding on a graph class.” The system first shows a one-screen brief (what it will search, which databases, what counts as evidence), then runs the deep pass and returns a cited summary; claims that can’t be sourced are flagged, not hidden.
- Example: proof stress-test. “Find holes in this proof.” A multi-agent panel, built on the Lakatos “proofs and refutations” template, runs as separate agents (a Prover, a Counterexample Hunter, a Monster-Barrer, and a Formalist) over several rounds and reports concrete gaps and counterexamples.
These flows live in the agent instructions and the multi-agent templates in the agent-group-discuss skill.
Getting papers (Zotero first)
Document lookup follows a strict order — Zotero → Calibre → online — so a paper already in my 10,000-item library is never re-downloaded.
- Example. “Get me the PDF of the Nishimura reconfiguration survey.” The Zotero skill searches the library and returns the attachment; if it isn’t there, a separate downloader, getscipapers (wrapped by the getscipapers-requester skill), fetches it by DOI/ISBN/title. Adding a new arXiv paper automatically sets its type to manuscript, renames the PDF to a consistent pattern, and asks which collection it belongs to.
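The strict Zotero → Calibre → online order can be pictured as a first-hit fallback chain. The sketch below is only an illustration of that idea, not the skills' actual code: `find_document`, the dictionary stand-ins, and the file paths are all hypothetical, and nothing here touches a real library or the network.

```python
from typing import Callable, Optional, Sequence, Tuple

# Each source is a (name, lookup) pair; lookup returns a local path or None.
Source = Tuple[str, Callable[[str], Optional[str]]]


def find_document(query: str, sources: Sequence[Source]) -> Optional[Tuple[str, str]]:
    """Try each source in order and stop at the first hit."""
    for name, lookup in sources:
        path = lookup(query)
        if path is not None:
            return name, path
    return None


# Hypothetical stand-ins for the real skills: plain dictionaries, no network.
zotero = {"reconfiguration survey": "/zotero/storage/ABCD1234/survey.pdf"}
calibre = {"graph theory textbook": "/calibre/textbook.epub"}
online_calls = []  # records every query that fell through to the downloader


def search_online(query: str) -> Optional[str]:
    online_calls.append(query)
    return None  # pretend the downloader found nothing


sources = [("zotero", zotero.get), ("calibre", calibre.get), ("online", search_online)]

print(find_document("reconfiguration survey", sources))
print(online_calls)  # -> [] because the Zotero hit short-circuits the chain
```

Because the loop returns on the first hit, a paper that is already in the library never reaches the downloader, which is the property the ordering is meant to guarantee.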
Ingestion is powered by a local Zotero Translation Server — a small Docker service (the same engine behind the Zotero browser connector) that turns a URL, DOI, or identifier into a fully-catalogued item with correct metadata. It’s what makes one-command “add this paper” work, and the rebuild restores it along with the library config (see INSTALL.md).
Daily arXiv / Semantic Scholar and RSS digests surface new papers on tracked topics. (Some download methods are a separate topic, discussed here.)
Multi-agent tasks
For work that benefits from independent perspectives, tasks are split across several agents that run in parallel within a round and hand off between rounds — the same idea as the proof example above, applied to other jobs:
- Example — pre-submission review (how the procedure runs). Ask “review this draft.” The Knuth manuscript-review template spawns three reviewers — Correctness, Exposition, and Literature — each its own agent with its own context. They run in parallel within a round (two rounds by default), then a synthesis step merges their findings into one report that separates “this is wrong” from “this is unclear.” A lighter single-reviewer pass uses the paper-review skill.
- Example — annotated review (independent verification). When you want a marked-up PDF, the annotated-review pipeline runs four phases with deliberately separate, clean contexts: a Reviewer, then a Verifier that re-checks the findings without seeing the reviewer’s reasoning, then a Trust Verifier that checks every citation, then an output phase. It emits three artifacts — an annotated LaTeX PDF, a PyMuPDF-marked-up PDF, and a companion HTML. The clean-context separation is the point: it keeps one agent’s mistake from silently propagating into the “verified” output.
- Example: formalization. A Lean team (Planner, Formalizer, Miner, Repair, Checker) turns a lemma into a Lean skeleton and chips away at the sorrys, starting from the lean-formalization-intake skill.
- Example — a bounded handoff to a second model (cross-agent delegation). When a review wants an independent opinion from a different agent — say, having Codex or DeepSeek re-check the citations in a section while Claude reviews the proofs — the cross-agent-delegation skill writes a bounded task packet: only the objective, the references, the constraints, what evidence to return, and the expected output shape — no credentials, no conversation history. The parent agent stays in charge — it confirms the handoff and treats whatever comes back as untrusted evidence to validate, not an answer to trust. That keeps two agents’ contexts cleanly separate while still letting one check the other.
- Example — reviewing a whole draft autonomously, with a ledger (autonomous-research-loop). For “review this paper and keep going until it’s done,” the review runs as a bounded loop rather than one pass: a fixed budget (so many iterations, so many helper agents), an explicit goal and success criteria, and an append-only ledger that records, every iteration, what was checked, what evidence backed it, what gaps remain, and whether to continue, revise, delegate, or stop. Evidence gates keep a finding from being accepted without a source, a recovery note lets the loop resume after a context reset, and a single iteration can hand a sub-check to another agent via the delegation packet above. It stops when the criteria are met or the budget runs out — never looping forever.
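The last two items, the bounded task packet and the budgeted loop with an append-only ledger, can be sketched together in a few lines. This is a minimal illustration under stated assumptions, not the skills' real schema: `TaskPacket`, `run_bounded_loop`, `demo_step`, the record fields, and the JSON-lines ledger format are all hypothetical names chosen for the example.

```python
import json
import os
import tempfile
from dataclasses import dataclass, asdict
from typing import Callable, List


@dataclass(frozen=True)
class TaskPacket:
    """All a delegate gets to see. Deliberately no credential or history field."""
    objective: str
    references: List[str]
    constraints: List[str]
    evidence_to_return: str
    output_shape: str


def run_bounded_loop(step: Callable[[int], dict], max_iterations: int, ledger_path: str) -> str:
    """Call `step` until it decides to stop or the iteration budget is spent."""
    for i in range(1, max_iterations + 1):
        entry = step(i)
        # Evidence gate: a finding without a source becomes a gap, not a result.
        accepted = [f["claim"] for f in entry["findings"] if f.get("source")]
        unsourced = [f["claim"] for f in entry["findings"] if not f.get("source")]
        record = {
            "iteration": i,
            "accepted": accepted,
            "gaps": entry["gaps"] + unsourced,
            "decision": entry["decision"],  # continue | revise | delegate | stop
        }
        # Append-only: earlier lines are never rewritten, so a restarted
        # loop can read the file back and resume from the last record.
        with open(ledger_path, "a", encoding="utf-8") as ledger:
            ledger.write(json.dumps(record) + "\n")
        if entry["decision"] == "stop":
            return "criteria met"
    return "budget exhausted"


def demo_step(i: int) -> dict:
    """Hypothetical reviewer: one sourced and one unsourced finding, then stop."""
    if i == 1:
        findings = [
            {"claim": "Lemma 2 holds", "source": "proof in Section 3"},
            {"claim": "the bound is tight", "source": None},
        ]
        return {"findings": findings, "gaps": [], "decision": "continue"}
    return {"findings": [], "gaps": [], "decision": "stop"}


ledger_path = os.path.join(tempfile.mkdtemp(), "ledger.jsonl")
outcome = run_bounded_loop(demo_step, max_iterations=5, ledger_path=ledger_path)
packet = TaskPacket(
    objective="Re-check the citations in Section 2",
    references=["draft.tex"],
    constraints=["read-only", "no network"],
    evidence_to_return="one line per citation with the matched source",
    output_shape="JSON list",
)
print(outcome)
print(sorted(asdict(packet)))
```

The `for` loop over a fixed range is what rules out looping forever: the only exits are an explicit "stop" decision or the end of the budget. The frozen dataclass makes the packet's allowed contents explicit, so anything not listed as a field simply cannot be handed over.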
There is also a SageMath sandbox for the small computations these tasks need — chromatic/Tutte polynomials, automorphism groups, exhaustive small-case checks — with ready-made templates such as reconfiguration_check.sage and counterexample_search.sage.
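As an idea of what an exhaustive small-case reconfiguration check looks like, here is a plain-Python sketch of token sliding on independent sets: a breadth-first search over token placements, where one move slides a single token along an edge and the result must stay independent. It is an assumption-laden stand-in written for this post, not the contents of reconfiguration_check.sage; `token_sliding_distance` and the adjacency-dict input format are hypothetical.

```python
from collections import deque
from itertools import combinations


def is_independent(tokens, adj):
    """True if no two token positions are adjacent."""
    return all(v not in adj[u] for u, v in combinations(tokens, 2))


def token_sliding_distance(adj, start, target):
    """Fewest slides from `start` to `target`, or None if unreachable."""
    start, target = frozenset(start), frozenset(target)
    if not (is_independent(start, adj) and is_independent(target, adj)):
        raise ValueError("both token sets must be independent")
    dist = {start: 0}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        if state == target:
            return dist[state]
        for u in state:              # pick a token...
            for v in adj[u]:         # ...and slide it along an edge
                if v in state:
                    continue
                nxt = (state - {u}) | {v}
                if nxt not in dist and is_independent(nxt, adj):
                    dist[nxt] = dist[state] + 1
                    queue.append(nxt)
    return None


# Path 0-1-2-3-4: the tokens have to shuffle right one step at a time.
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
print(token_sliding_distance(path, {0, 2}, {2, 4}))  # -> 4

# Star with centre 0: a leaf token can only move to the centre, which is
# adjacent to the other token, so the configuration is frozen.
star = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0}}
print(token_sliding_distance(star, {1, 2}, {1, 3}))  # -> None
```

The state space grows quickly with the number of vertices and tokens, which is why this style of check is only practical for small cases.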
Those computed structures usually have to become figures in a paper. The tikz-draw skill builds structural diagrams — finite graphs, gadgets, automata, trees, commutative diagrams — with a structure-first loop: figure brief → spec → render → verify-semantic → compile → review, so a diagram is checked against the structure it is meant to show rather than just compiled.
- Example: Sage-assisted graph figure. For a graph beyond the built-in layouts (a specific construction, a computed layout, or a transformation before drawing), tikz-draw switches to a Sage-assisted graph mode (graph_mode: auto | local | sage): SageMath computes the graph’s semantics and coordinates, while tikz-draw keeps ownership of render, compile, and review. So the same Sage that checks a construction (e.g. a reconfiguration_check.sage run) can also produce the picture of it that goes into the manuscript.
- The verify-semantic pass then confirms the rendered figure actually encodes the intended nodes and edges, and the figure can go through the same independent review discipline as the prose, connecting the compute, draw, and review skills end to end.
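To make "checked against the structure it is meant to show" concrete, the sketch below compares the named nodes and `--` edges found in a TikZ source against an intended graph. It is a rough illustration only: the regexes handle just simple `\node ... (name)` and `(a) -- (b)` forms, and `extract_graph` and `verify_semantic` are hypothetical names, not the tikz-draw skill's interface.

```python
import re

# First parenthesised name after \node, e.g. \node[draw] (a) at (0,0) {...};
NODE_RE = re.compile(r"\\node[^;]*?\((\w+)\)")
# (a) -- (b); the lookahead leaves (b) unconsumed so chains a -- b -- c work.
EDGE_RE = re.compile(r"\((\w+)\)\s*--\s*(?=\((\w+)\))")


def extract_graph(tikz):
    """Return (node names, undirected edges) found in the TikZ source."""
    nodes = set(NODE_RE.findall(tikz))
    edges = {frozenset(pair) for pair in EDGE_RE.findall(tikz)}
    return nodes, edges


def verify_semantic(tikz, spec_nodes, spec_edges):
    """Report what the figure is missing or adds relative to the spec."""
    nodes, edges = extract_graph(tikz)
    want_nodes = set(spec_nodes)
    want_edges = {frozenset(e) for e in spec_edges}
    return {
        "missing_nodes": sorted(want_nodes - nodes),
        "extra_nodes": sorted(nodes - want_nodes),
        "missing_edges": sorted(tuple(sorted(e)) for e in want_edges - edges),
        "extra_edges": sorted(tuple(sorted(e)) for e in edges - want_edges),
    }


tikz_source = r"""
\begin{tikzpicture}
  \node[circle,draw] (a) at (0,0) {$a$};
  \node[circle,draw] (b) at (1,0) {$b$};
  \node[circle,draw] (c) at (2,0) {$c$};
  \draw (a) -- (b) -- (c);
\end{tikzpicture}
"""

# The intended graph is a triangle, but the figure only draws a path.
report = verify_semantic(tikz_source, ["a", "b", "c"], [("a", "b"), ("b", "c"), ("a", "c")])
print(report["missing_edges"])  # -> [('a', 'c')]
```

This figure would compile without complaint, which is exactly the case a compile-only check misses.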
Heavy compute, offloaded (Modal)
Some research steps are easy to parallelise but too heavy for a single box — enumerating all graphs up to some order to hunt for a counterexample, sweeping a parameter grid, or re-running a SageMath check over thousands of cases. The /research-compute skill routes those jobs to Modal through a small local broker, picking remote CPU, high-memory CPU, or GPU to fit the job: the agent packages the work, Modal spins up containers on demand, fans the work out, and streams results back. A search that would run for hours on the local box finishes in minutes across many workers, and you pay only for the seconds they actually run.
- Example. “Is there a counterexample to this bound among all graphs on at most n vertices?” The agent wraps the same Sage/Python check it would otherwise run once in the local sandbox, .map()s it over the generated instances on remote CPU (enumeration and counterexample search default to CPU, not GPU), and returns the first counterexample, or a clean “none up to n.” Batch OCR, embeddings, or other tensor work instead routes to a GPU automatically.
- What’s on tap. Modal is serverless and pay-per-use, with (at the time of writing) a monthly slice of free credits that comfortably covers occasional searches. Per job you can request, roughly, tens of CPU cores and hundreds of GB of RAM, GPUs from a T4/L4/A10G up to A100s and H100s, and thousands of short-lived containers in parallel — so one skill covers both a quick brute-force sweep and an occasional GPU run, without keeping any of that hardware around.
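The fan-out shape of such a search can be tried locally with the standard library. The sketch below uses `ThreadPoolExecutor.map` purely as a stand-in for the remote `.map()` step; it is not Modal's API and gives no real speed-up under Python threads. The claim being tested ("every triangle-free graph is bipartite", which is false) and the names `search_order` and `first_counterexample` are chosen only for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def adjacency(n, edges):
    adj = [set() for _ in range(n)]
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def has_triangle(adj, edges):
    # An edge lies on a triangle iff its endpoints share a neighbour.
    return any(adj[u] & adj[v] for u, v in edges)


def is_bipartite(adj):
    color = {}
    for s in range(len(adj)):
        if s in color:
            continue
        color[s] = 0
        stack = [s]
        while stack:
            u = stack.pop()
            for v in adj[u]:
                if v not in color:
                    color[v] = 1 - color[u]
                    stack.append(v)
                elif color[v] == color[u]:
                    return False
    return True


def search_order(n):
    """First triangle-free, non-bipartite labelled graph on n vertices, or None."""
    pairs = list(combinations(range(n), 2))
    for mask in range(1 << len(pairs)):  # every subset of possible edges
        edges = [p for i, p in enumerate(pairs) if mask >> i & 1]
        adj = adjacency(n, edges)
        if not has_triangle(adj, edges) and not is_bipartite(adj):
            return n, edges
    return None


def first_counterexample(max_n, workers=4):
    """Fan one job per order out to the pool; results come back in order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for result in pool.map(search_order, range(1, max_n + 1)):
            if result is not None:
                return result
    return None  # a clean "none up to max_n"


print(first_counterexample(4))  # -> None
found = first_counterexample(5)
print(found[0], len(found[1]))  # -> 5 5 (a 5-cycle)
```

Each order is an independent job, so the same function could be handed to a remote worker pool unchanged; only the executor would differ.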
Trying it (limited) in a GitHub Codespace
You can run a live, interactive replica in a GitHub Codespace without any of my secrets. Open the repo, Code → Codespaces → Create, and the container builds itself: it installs the software stack, renders all the configuration, and runs the health checks.
- Without secrets (default): a working but degraded replica. You can read every configuration, run make verify / make test, and exercise the skill plumbing. This is the same thing the project’s GitHub Actions run on every commit.
- With secrets (optional): the Codespace forwards a small web upload form. If you have an encrypted secrets zip, you upload it there (it never touches GitHub, and is shredded right after use) to complete the full replica. Starting the live bot is opt-in, because it would connect to real chat channels.
Full instructions and the honest caveats (a Codespace is amd64; my machine is arm64, so it’s a functional — not bit-identical — replica) are in CODESPACES.md.
Want a real arm64 box like mine? The machine this system runs on is an Oracle Cloud Ampere A1 (arm64) instance, and Oracle’s Always Free tier hands out one in the same family at no cost: up to 4 Arm cores and 24 GB of RAM (which you can split across as many as four small VMs) plus around 200 GB of block storage — enough to host the full arm64 replica rather than the amd64 Codespace approximation. Free-tier offerings change, so check Oracle’s current Always Free list before relying on the exact numbers.
Secrets and keys you’d need to replicate it
The public repos contain no secrets — only sanitized templates and the names of the keys. To replicate the system you would supply your own. The key thing to understand is that almost every secret exists only because the system talks to some external service I happen to use — so for several of them you would not need the same key at all, and might plug in a different service entirely. Here is why each category is needed, and where your choices would differ from mine:
- Model providers: API keys for the LLM backends the agents actually call (two Claude-model resellers as primary and fallback, plus DeepSeek, Groq, an OpenAI/Codex key, and a Google/Gemini key). Why: every agent turn is an API call, so with no working provider key nothing runs at all. You would use whichever provider(s) you have an account with; my particular fallback chain is not special, and a single working key is enough to start.
- Zotero + attachments: a Zotero API key, plus a WebDAV password. Why: the /zotero skill reads and writes my online Zotero library, and my PDF attachments sync over WebDAV. If you keep your library locally, or sync attachments through Zotero’s own storage or a different WebDAV host, you would swap or drop these.
- Google Drive service account. Why: my ebook library and several research files live on Google Drive, so the Calibre/Drive skills authenticate with a Google service-account JSON. You very likely do not need Google at all. If your library sits on a local disk you need nothing here; and since the off-machine copy of the encrypted secrets zip already goes to Dropbox (via rclone), you might instead add a Dropbox, S3, or plain-local remote. This key encodes my storage choice, not a requirement of the system: exactly the kind of secret you would replace rather than reuse.
- Paper-retrieval logins: per-service accounts for the paper downloader, used only for papers not already in my library. Why: each academic source needs its own login; a missing one disables just that one source while the rest keep working.
- Messaging channels: a Telegram bot token (+ chat id), and optional Zalo / Zulip / Google-Chat credentials. Why: the system delivers files and digest notifications to me over chat, and the self-hosted OpenClaw bot listens on those channels. Use whatever channel you prefer, or none; delivery then just falls back to writing files locally.
- Infrastructure: SSH keys and a GitHub token (to push backups), plus an optional Tailscale auth key (private networking between my own machines), a Docker registry login (private image pulls), a Modal token (offloading heavy compute), and the rclone remote above. Why: these wire the machine to my hosting choices; each is optional and degrades exactly one capability when absent.
Every key — where to obtain it, which file it lives in, and exactly what stops working without it — is documented key-by-key in SECRETS.md. Without them the system still installs and the Codespace still runs; it just degrades feature-by-feature, and make verify-secrets --degraded prints which feature each missing key disables.
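Feature-by-feature degradation amounts to a lookup from each missing key to the capability it switches off. The sketch below shows that idea only; the key names and feature labels are hypothetical placeholders (the real list is the one in SECRETS.md), and `degraded_report` is not the project's actual check.

```python
# Hypothetical key names and feature labels, for illustration only.
KEY_FEATURES = {
    "ZOTERO_API_KEY": "Zotero library read/write",
    "TELEGRAM_BOT_TOKEN": "Telegram delivery (falls back to local files)",
    "MODAL_TOKEN_ID": "Modal compute offload",
}


def degraded_report(env):
    """Map each missing (or empty) key to the feature it disables."""
    return {key: feature for key, feature in KEY_FEATURES.items() if not env.get(key)}


# A made-up environment with only one key present; pass os.environ in practice.
demo_env = {"ZOTERO_API_KEY": "placeholder-value"}
report = degraded_report(demo_env)
for key, feature in report.items():
    print(f"missing {key}: disables {feature}")
```

Reporting by feature rather than by key is what lets someone decide which secrets are worth supplying at all.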
Caveats
This is a personal, experimental system, not a product. It targets a specific arm64 Ubuntu setup, pins specific tool versions, and bakes in assumptions about my research (reconfiguration problems, graph invariants, LaTeX manuscripts). Expect to adapt it. If you only want to look, the Codespace degraded mode is the safest way to poke around.
What’s actually worked
It is not all cautionary tales. The project’s own logs also record real wins, across several different agents:
- Self-debugging against a brute-force oracle. An agent checked a fast structural algorithm against exhaustive ground truth, found and fixed two counting bugs to push agreement past 98%, and — the hard part — correctly diagnosed that the remaining mismatches were not a coding bug but a genuine limitation: a lemma proved for one structure that does not carry over to a more general one.
- Multi-agent reviews that change the outcome. Independent reviewer agents have flagged real gaps in a draft argument, leading to a deliberate rollback to an earlier, sound version instead of papering over the hole — exactly the “separate wrong from merely unclear” behaviour the review templates are built for.
- Semantically-verified figures. Codex produces TikZ hardness-reduction figures that are checked against a semantic contract — the actual source graph, gadgets, and ports must be present — and then inspected at high resolution for label/edge collisions, rather than trusted just because they compiled.
- The system that wrote this. The most concrete success is the subject of the post itself: the agents built, tested (green CI), and documented their own backup/replication machinery — the leak scanner, the Codespace, the key-rotation tooling — end to end.
When the agents get it wrong
Because the same agents wrote this system, it is only fair to show how they fail. The following are real incidents from this project’s own logs (spanning Claude Code, Codex, DeepSeek, and the self-hosted OpenClaw bot), not hypotheticals — and they are why the workflow above has so many explicit gates:
- A confident, wrong “it’s not there.” Asked to send a file over Telegram, an agent declared Telegram “not configured” — when the bot token was sitting in plain sight in the secrets file. It had trusted an incomplete memory note instead of checking. The standing rule now is: search the workspace before claiming anything is missing.
- “Done” before it was tested. An agent reported two code fixes as finished without ever running them. The rule now is that no change is called done until the changed code has actually been executed.
- Hallucinated links and references. An earlier draft of this very post linked “combinatorial reconfiguration” and a review-template name to entirely unrelated pages. A human noticed, which is why every external link here was re-checked against its real target before publishing.
- A secret that slipped past the scanner. An early version of the leak-scanner matched secrets by field name only; a real provider API key got past it and was briefly committed to a public repo. The fix was not just to make the repo private and rewrite its history, but to harden the scanner to catch value-shaped patterns (not only field names), add a test that the scanner always covers the redactor, and rotate the exposed key — because making a repo private does not un-expose a key that was already public.
- Quiet, invisible failures. A broken package.json two directories up silently made every editor hook exit with an error for a while (including the guard meant to vet edits), and nobody noticed, because the failure was swallowed. Separately, an installer happily “succeeded” while silently skipping source directories that did not exist, which would have produced an empty install that looked fine.
- A different agent, the same overconfidence. Run through a one-shot path without being handed its sources, the DeepSeek agent invented source-ledger details instead of reporting that it had none — a textbook hallucination. The fix was to only ever run it through the path that actually loads project context.
- “Sent!” when nothing arrived. An OpenClaw agent reported a file delivered to chat when the upload had silently failed and the user received nothing. The rule now is to check the channel’s "status":"ok" response before claiming a send succeeded, and never to infer delivery from the absence of an error.
- “Figure fixed” from a glance. An agent called a manuscript figure fixed after a quick full-page PDF glance, missing label-on-edge collisions that only show up at high resolution. Figures are now inspected at the figure level — geometry and semantics — before being called done.
- A runaway search. A broad strings / rg sweep over system directories blew up the context window and had to be killed by hand: a reminder that an agent told to “just search everything” can quietly dig itself into a hole.
The pattern is the same across all of them: an agent sounded certain, or an automated check looked green, when it was not. That is exactly why the research workflow front-loads a visible brief, adversarial verification, independent multi-agent review, and a final evidence gate — and why this system is labelled experimental and meant to be supervised, not trusted blindly.
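The scanner incident above turns on the difference between matching field names and matching value shapes. Here is a minimal sketch of a scanner that does both. The patterns are illustrative assumptions (a few well-known token shapes), not the project's real rule set, and `scan` is a hypothetical name.

```python
import re

# Layer 1: suspicious field names (the only layer the early scanner had).
SECRET_FIELD = re.compile(r"(?i)\b\w*(api_?key|token|secret|password)\w*\b\s*[:=]\s*\S+")

# Layer 2: value-shaped patterns, which fire whatever the field is called.
# Illustrative shapes only; a real scanner needs a much longer list.
VALUE_SHAPES = {
    "github_pat": re.compile(r"\bghp_[A-Za-z0-9]{36}\b"),
    "sk_style_key": re.compile(r"\bsk-[A-Za-z0-9_-]{20,}"),
    "telegram_bot_token": re.compile(r"\b\d{8,10}:[A-Za-z0-9_-]{35}\b"),
}


def scan(text):
    """Return (line number, rule) for every line that looks like a secret."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if SECRET_FIELD.search(line):
            findings.append((lineno, "field_name"))
        for label, pattern in VALUE_SHAPES.items():
            if pattern.search(line):
                findings.append((lineno, label))
    return findings


fake_pat = "ghp_" + "A" * 36  # synthetic value with a token-like shape
sample = "\n".join([
    "api_key: REDACTED",    # caught by its field name
    "note: " + fake_pat,    # innocuous field name, secret-shaped value
    "greeting: hello",      # clean
])
print(scan(sample))  # -> [(1, 'field_name'), (2, 'github_pat')]
```

Line 2 is the case that matters: a name-only scanner would pass it, and only the value-shaped layer catches it.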
The system glues together a number of smaller tools, several of which have their own repositories:
- getscipapers — get and request scientific papers from various sources (the paper downloader used above).
- translation-server — a Node.js server that runs Zotero translators (the Zotero ingestion engine; the official image is arm64, this is the amd64 build).
- vnthuquan — a wrapper/CLI for Vietnam Thu Quan ebook discovery and EPUB downloads, on the Calibre/ebook side of the library.
- vnu-eoffice — a local VNU e-office document monitor with Telegram alerts.
It also relies on established third-party software — Zotero, SageMath, Calibre, Lean, and the AI agent CLIs themselves — installed and configured by the rebuild.
A note on how this was made
The three repositories, their documentation, the backup/restore machinery, the CI, the Codespace, and this blog post were all written by the same multi-agent AI coding system that the repositories back up and replicate — the agents building (and documenting) their own infrastructure. That self-referential loop is half the fun, and half the reason to treat it as experimental.