Reproducible multi-agent experiments,
from hypothesis to paper-ready results

Run ablations across strategies and seeds, replay executions from checkpoints, evaluate automatically, and export publication-ready tables —
without building custom experiment infrastructure.

Research infrastructure
shouldn't be your research

Multi-agent experiments require orchestration, evaluation, reproducibility, and statistical analysis. Most researchers build this from scratch for every paper — then throw it away.

60% of research time spent on infrastructure, not science (industry surveys, 2024–2025 ML practitioner reports)

0 major agent frameworks with built-in experiment grids, checkpoint replay, and publication export (survey of LangGraph, AutoGen, CrewAI, March 2026)

62× token overhead in multi-agent experiments without proper orchestration (JamJet benchmark suite, local Ollama runs, March 2026)

Everything between
hypothesis and publication

Six Reasoning Strategies

ReAct, plan-and-execute, critic, reflection, consensus, debate: swap with a single parameter. Same agent, different reasoning. Perfect for ablation studies.

agent = Agent(
  strategy="debate",  # swap to compare
  max_iterations=6,
)

ExperimentGrid

Run every combination of conditions and seeds in a single call. Cartesian product, parallel execution, automatic result collection.

grid = ExperimentGrid(
  conditions={
    "strategy": ["react", "debate"],
  },
  seeds=[42, 123, 456],
)
results = await grid.run()

Publication Export

Export results as LaTeX booktabs tables, CSV for R/pandas, or structured JSON. Mean ± std computed automatically.

results.to_latex("table1.tex")
results.to_csv("results.csv")
results.compare("debate", "react")  # p-value

Durable Replay

Every execution is checkpointed. Replay any experiment exactly. Fork from any checkpoint with modified parameters for ablation studies.

$ jamjet replay exec_abc
$ jamjet fork exec_abc \
  --override-input '{"model":"gemini"}'

Built-in Evaluation

LLM-as-judge, assertion, latency, and cost scorers. Eval nodes run inside workflows for self-improving agents. CI exit codes on regression.

# workflow.yaml
check:
  type: eval
  on_fail: retry_with_feedback
  max_retries: 2

Research Template

One command to scaffold a complete experiment: agents, baselines, evaluation datasets, experiment runner, and results directory.

$ jamjet init my-study \
  --template research
# agents/ baselines/ experiments/
# evals/ results/ workflow.yaml

Most agent frameworks prioritize apps
over experimental reproducibility

Capability | JamJet | LangGraph | AutoGen | Custom scripts
Multi-agent orchestration | Native | Native | Native | Possible with custom setup
Durable replay | Native | Possible with custom setup | Possible with custom setup | Possible with custom setup
Strategy comparison | 6 native strategies | Possible with custom setup | Possible with custom setup | Possible with custom setup
Experiment grid | Native | Possible with custom setup | Possible with custom setup | Possible with custom setup
LaTeX / CSV export | Native | Possible with custom setup | Possible with custom setup | Possible with custom setup
Checkpoint fork | Native | Possible with custom setup | Possible with custom setup | Possible with custom setup
Built-in eval harness | Native | External tooling required | External tooling required | Possible with custom setup
Per-node cost tracking | Native | Partial | Partial | Possible with custom setup
Statistical comparison | Native (Welch's t-test) | Possible with custom setup | Possible with custom setup | Possible with custom setup

From hypothesis to Methods section

1. Scaffold: jamjet init --template research (2 min)
2. Define agents: tools, strategies, instructions (15 min)
3. Run experiments: ExperimentGrid across conditions (automated)
4. Export results: LaTeX tables, CSV, statistical tests (1 command)
5. Reproduce: jamjet replay from checkpoint (exact)

One research afternoon, end to end

1. Compare 6 strategies on your dataset

grid = ExperimentGrid(
  conditions={"strategy": ["react", "plan_and_execute",
    "critic", "reflection", "consensus", "debate"]},
  seeds=[42, 123, 456],
)
results = await grid.run()

2. Export a LaTeX table for your paper

results.to_latex("table1.tex", caption="Strategy comparison")
# Outputs booktabs table with mean +/- std per condition

3. Replay a failed condition without re-running prior steps

$ jamjet replay exec_debate_seed42
# Restores from checkpoint. Saves tokens + cost.

4. Compute significance between conditions

results.compare("debate", "react")
# => {p_value: 0.023, effect_size: 0.41, significant: true}

5. Fork for an ablation study

$ jamjet fork exec_debate_seed42 \
  --override-input '{"model":"gpt-4o"}'
# Same execution, different model. Instant ablation.

Start as a simple Python agent, scale into reproducible experiment runs — without rewriting your stack. See the quickstart →

What a result looks like

Task: summarize a 2,000-word policy document. 6 strategies, 3 seeds each. Scored by LLM-judge (0–1). Local Ollama, Llama 3.

Strategy | Score (mean ± std) | Tokens | Latency | Cost
react | 0.71 ± 0.04 | 1,240 | 2.1s | $0.002
plan_and_execute | 0.78 ± 0.03 | 1,890 | 3.4s | $0.003
critic | 0.82 ± 0.05 | 2,410 | 4.2s | $0.004
reflection | 0.84 ± 0.02 | 3,100 | 5.8s | $0.005
consensus | 0.86 ± 0.03 | 4,520 | 7.1s | $0.007
debate | 0.89 ± 0.02 | 5,880 | 9.3s | $0.009

debate vs. react: p = 0.012 (Welch's t-test, n = 3 seeds). This table was generated by results.to_latex("table1.tex") — zero manual formatting.

Illustrative results from internal testing. Your numbers will vary by model, task, and hardware.
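
If you want to sanity-check the exported numbers outside JamJet, the same aggregation and significance test can be reproduced from the CSV export with pandas and SciPy. A minimal sketch, assuming results.csv has one row per run with strategy, seed, and score columns (the column names here are illustrative, not a documented schema):

import pandas as pd
from scipy import stats

# Assumed layout of the CSV export: one row per run,
# with strategy, seed, and score columns (illustrative).
df = pd.read_csv("results.csv")

# Mean ± std per condition, as in the table above.
print(df.groupby("strategy")["score"].agg(["mean", "std"]))

# Welch's t-test (unequal variances) between two conditions.
debate = df.loc[df["strategy"] == "debate", "score"]
react = df.loc[df["strategy"] == "react", "score"]
t_stat, p_value = stats.ttest_ind(debate, react, equal_var=False)
print(f"debate vs. react: t = {t_stat:.2f}, p = {p_value:.3f}")

results.compare() does the same in one call; the CSV route is simply an independent cross-check.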

Why not just scripts?

Custom scripts work for one-off experiments. They break down when you need to reproduce, compare, or build on prior work.

Custom scripts

  • Reproducibility depends on discipline, not tooling
  • No checkpoint — a crash reruns everything from scratch
  • Manual experiment matrix loops with ad-hoc seed handling (sketched after this comparison)
  • Result formatting is copy-paste or custom code
  • No built-in cost tracking — discovered after the bill
  • Comparing strategies requires rewriting orchestration code

JamJet

  • Every execution event-sourced — replay from any checkpoint
  • Crash recovery built in — resume exactly where it stopped
  • ExperimentGrid handles conditions × seeds automatically
  • One call to to_latex(), to_csv(), or to_json()
  • Per-node token and cost tracking, visible in real time
  • Change strategy="debate" to strategy="react" — same agent, different reasoning
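
For a sense of the boilerplate that "manual experiment matrix loops" implies, here is a minimal sketch of the conditions × seeds expansion, execution, and aggregation a custom script has to reimplement. run_agent is a hypothetical stand-in for your own agent call; nothing in this block is JamJet API.

import itertools
import random
import statistics

# Hypothetical stand-in for whatever invokes your agent once and returns a score.
def run_agent(strategy: str, seed: int) -> float:
    random.seed(hash((strategy, seed)))
    return random.random()  # placeholder score

conditions = {"strategy": ["react", "debate"]}
seeds = [42, 123, 456]

# The conditions × seeds Cartesian product that ExperimentGrid expands in one call.
runs = [
    dict(zip(conditions, values), seed=seed)
    for values in itertools.product(*conditions.values())
    for seed in seeds
]

# Manual execution, collection, and aggregation: the part that tends to get
# copy-pasted between projects and silently drift.
scores: dict[str, list[float]] = {}
for run in runs:
    scores.setdefault(run["strategy"], []).append(run_agent(run["strategy"], run["seed"]))

for strategy, values in sorted(scores.items()):
    print(f"{strategy}: {statistics.mean(values):.2f} ± {statistics.stdev(values):.2f}")

ExperimentGrid replaces all of the above with the single grid.run() call shown earlier, and keeps the per-run checkpoints and cost tracking that this script has no place to put.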

Patterns from published research

arXiv 2603.08852

LLM Delegate Protocol

Identity-aware agent routing with quality scores, governed sessions, and provenance tracking. JamJet integration via ProtocolAdapter trait.

agent routing identity provenance
arXiv 2603.11781

Deliberative Collective Intelligence

Structured multi-agent deliberation with four reasoning archetypes and typed epistemic acts. Patterns now available as JamJet strategies and examples.

multi-agent deliberation archetypes

Built for how you work

  • Multi-agent systems (AAMAS, NeurIPS workshops): orchestration + evaluation + reproducibility
  • LLM reasoning (CoT, ToT, debate, reflection): the strategy parameter makes A/B testing trivial
  • Tool-augmented LLMs (ReAct, Toolformer): MCP-native tool integration
  • AI safety & alignment (HITL, guardrails): human-in-the-loop + policy engine
  • Evaluation & benchmarks (AgentBench, GAIA): eval harness + batch runner + CI gates
  • Agent communication (negotiation, persuasion): native A2A + LDP protocol support

Start your experiment

From pip install to running multi-agent experiments in under 5 minutes.

$ pip install jamjet && jamjet init my-study --template research
Read the quickstart · Browse examples