hermes-sync/skills/research/research-paper-writing/references/experiment-patterns.md

Experiment Design Patterns

Patterns and best practices distilled from running research experiments at scale with the Hermes agent. These cover experiment infrastructure, evaluation protocols, monitoring, and failure recovery.


Experiment Infrastructure

Directory Structure

Organize experiments with a consistent structure:

workspace/
  experiments/
    run_main.py                # Core experiment runner
    run_baselines.py           # Baseline comparison
    run_ablation.py            # Ablation studies
    strategies.py              # Method implementations
    config.yaml                # Shared configuration
  results/
    <experiment_name>/
      <task_or_problem>/
        <strategy>/
          result.json          # Final metrics
          final_output.md      # Final output artifact
          history.json         # Full trajectory/log
          pass_01/             # Per-iteration artifacts (if iterative)
            intermediate.md
  analysis/
    analyze_results.py         # Statistical analysis
    compute_stats.py           # Significance tests
    make_charts.py             # Visualization
  paper/
    paper.tex                  # LaTeX source
    fig_*.pdf                  # Generated figures

Script Design Principles

1. Incremental Saving (Crash Recovery)

Every experiment script should save results after each unit of work, and skip already-completed work on restart:

import json, os
from pathlib import Path

def run_experiment(problems, strategies, output_dir):
    for problem in problems:
        for strategy in strategies:
            result_path = Path(output_dir) / problem["id"] / strategy / "result.json"
            if result_path.exists():
                print(f"Skipping {problem['id']}/{strategy} (already done)")
                continue
            
            # Run the experiment
            result = execute_strategy(problem, strategy)
            
            # Save immediately
            result_path.parent.mkdir(parents=True, exist_ok=True)
            with open(result_path, 'w') as f:
                json.dump(result, f, indent=2)

This pattern makes re-runs safe and efficient. If a process crashes at problem 47/150, restarting skips the first 46.

2. Artifact Preservation

Save all intermediate outputs, not just final results. This enables post-hoc analysis without re-running:

def save_pass_artifacts(output_dir, pass_num, artifacts):
    """Save all artifacts from a single pass of an iterative method."""
    pass_dir = Path(output_dir) / f"pass_{pass_num:02d}"
    pass_dir.mkdir(parents=True, exist_ok=True)
    
    for name, content in artifacts.items():
        with open(pass_dir / f"{name}.md", 'w') as f:
            f.write(content)

3. Configuration Management

Use YAML configs for reproducibility:

# config.yaml
model: anthropic/claude-sonnet-4-20250514
author_temperature: 0.8
judge_temperature: 0.3
max_tokens: 4096
num_judges: 3
max_passes: 15
convergence_k: 2

Load it in each script:

import yaml

with open("config.yaml") as f:
    config = yaml.safe_load(f)

4. Separation of Concerns

Keep generation, evaluation, and visualization in separate scripts:

| Script | Purpose |
|--------|---------|
| run_experiment.py | Core method execution |
| run_baselines.py | Baseline comparisons at same compute |
| run_eval.py | Blind evaluation / judge panels |
| analyze_results.py | Statistical analysis |
| make_charts.py | Figure generation |

This lets you re-run evaluation without re-running expensive generation, and regenerate figures without re-running analysis.


Evaluation Protocols

Blind Judge Panels (for Subjective Tasks)

When evaluating subjective outputs (writing, analysis, recommendations), use a blind judge panel:

import random

def run_blind_evaluation(outputs: dict, task_prompt: str, num_judges: int = 7):
    """
    Run blind evaluation of multiple method outputs.
    
    Args:
        outputs: {"method_name": "output_text", ...}
        task_prompt: The original task description
        num_judges: Number of independent judge evaluations
    """
    rankings = []
    
    for judge_i in range(num_judges):
        # Randomize labels and presentation order per judge
        methods = list(outputs.keys())
        random.shuffle(methods)
        labels = {m: chr(65 + i) for i, m in enumerate(methods)}  # A, B, C...
        
        # Present to judge with randomized labels
        prompt = f"Task: {task_prompt}\n\n"
        for method in methods:
            prompt += f"--- Proposal {labels[method]} ---\n{outputs[method]}\n\n"
        prompt += "Rank all proposals from best to worst. Format: RANKING: [best], [second], [worst]"
        
        # Map the judge's label ranking (e.g. ["B", "A", "C"]) back to method names
        label_to_method = {v: k for k, v in labels.items()}
        ranking = [label_to_method[label] for label in call_judge(prompt)]
        rankings.append({"ranking": ranking})
    
    # Aggregate via Borda count
    return compute_borda(rankings)

def compute_borda(rankings, n_methods=3):
    """Borda count: 3/2/1 points for 1st/2nd/3rd."""
    scores = {}
    points = {0: n_methods, 1: n_methods - 1, 2: n_methods - 2}  # Adjust for n_methods
    
    for r in rankings:
        for position, method in enumerate(r["ranking"]):
            scores[method] = scores.get(method, 0) + points.get(position, 0)
    
    return scores

Key design decisions:

  • Randomize both labels AND order per judge to prevent position bias
  • Use odd number of judges (3, 5, 7) to break ties
  • Conservative tiebreak: Incumbent/baseline wins ties (prevents false positives)
  • CoT judges match non-CoT quality at ~40% cost (1 CoT judge ≈ 3 standard judges)
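
The conservative tiebreak is easy to get wrong when aggregating scores; a minimal sketch (the `baseline` name is illustrative):

```python
def pick_winner(scores, baseline="single_pass"):
    """Return the winning method; the baseline/incumbent wins any tie for first.

    `scores` maps method name -> aggregated points (e.g. Borda totals).
    """
    top = max(scores.values())
    leaders = [m for m, s in scores.items() if s == top]
    # Conservative tiebreak: if the incumbent is among the leaders, it wins,
    # so a new method only "wins" when it strictly beats the baseline.
    return baseline if baseline in leaders else leaders[0]
```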

Code/Objective Evaluation

For tasks with ground-truth evaluation (code, math, factual):

import subprocess

def evaluate_code(solution: str, test_cases: list, timeout: int = 30):
    """Run code solution against test cases with sandboxed execution."""
    results = {"public": [], "private": []}
    
    for test in test_cases:
        try:
            proc = subprocess.run(
                ["python3", "-c", solution],
                input=test["input"],
                capture_output=True,
                timeout=timeout,
                text=True
            )
            actual = proc.stdout.strip()
            expected = test["expected"].strip()
            passed = actual == expected
        except subprocess.TimeoutExpired:
            passed = False
        
        category = "public" if test.get("public") else "private"
        results[category].append(passed)
    
    return {
        "public_pass_rate": sum(results["public"]) / max(len(results["public"]), 1),
        "private_pass_rate": sum(results["private"]) / max(len(results["private"]), 1),
    }

Compute-Matched Comparison

Always compare methods at equal compute budget. If your method uses N API calls, baselines get N calls too:

| Method | Call Budget | Allocation |
|--------|-------------|------------|
| Single pass | 6 calls | 6 independent generations |
| Critique & revise | 6 calls | 1 generate + 5 revise rounds |
| Autoreason | 6 calls | 1 generate + 1 analysis + 4 revisions |
| Best-of-N | 6 calls | 6 independent, pick best on public test |
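
A hard cap on calls keeps the comparison honest even when a method's control flow changes; a sketch of such a guard (the class and the stubbed-out generation are hypothetical, not part of any library):

```python
class CallBudget:
    """Hard cap on API calls so every method runs at the same compute."""
    def __init__(self, limit):
        self.limit = limit
        self.used = 0

    def spend(self, n=1):
        if self.used + n > self.limit:
            raise RuntimeError(f"Budget exceeded: {self.used + n} > {self.limit}")
        self.used += n

def run_best_of_n(problem, budget):
    """Best-of-N: spend the whole budget on independent generations."""
    candidates = []
    while budget.used < budget.limit:
        budget.spend()
        candidates.append(f"gen-{budget.used}")  # placeholder for a real API call
    return candidates
```

Passing the same `CallBudget(6)` into every strategy makes over-budget baselines fail loudly instead of silently skewing the comparison.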

Human Evaluation Design

Many ML/NLP papers require human evaluation, especially for subjective tasks (text generation, summarization, dialogue, creative writing). Poorly designed human evals are a common rejection reason.

When Human Evaluation Is Required

| Task Type | Required? | Notes |
|-----------|-----------|-------|
| Text generation (open-ended) | Yes | LLM-as-judge alone is insufficient for acceptance at ACL/EMNLP |
| Summarization | Usually | At minimum for a subset of outputs |
| Dialogue systems | Yes | User studies or annotation |
| Code generation | No | Test suites are objective ground truth |
| Classification | No | Standard metrics suffice |
| Any task with subjective quality | Strongly recommended | Strengthens the paper significantly |

Annotation Protocol Design

Human Evaluation Protocol:
1. Define the evaluation dimensions (fluency, relevance, factual accuracy, etc.)
2. Create annotation guidelines with examples of each score level
3. Run a pilot with 2-3 annotators on 20-30 examples
4. Compute pilot inter-annotator agreement — if low, revise guidelines
5. Run full evaluation
6. Report: annotator count, agreement metrics, compensation, time per item

Evaluation dimensions (pick relevant subset):

| Dimension | Definition | Scale |
|-----------|------------|-------|
| Fluency | Grammaticality and naturalness | 1-5 Likert |
| Relevance | Does it address the task? | 1-5 Likert |
| Factual accuracy | Are stated facts correct? | Binary or 1-5 |
| Coherence | Logical flow and consistency | 1-5 Likert |
| Informativeness | Does it provide useful information? | 1-5 Likert |
| Overall preference | Which output is better? | A/B/Tie (pairwise) |

Pairwise comparison (preferred over absolute scoring — more reliable):

  • Present two outputs side-by-side (randomize left/right position)
  • Ask: "Which is better? A / B / Tie"
  • More discriminative and less susceptible to annotator calibration drift
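
A pairwise trial with randomized sides can be sketched as follows, assuming a hypothetical `annotate(left, right)` callable that returns "LEFT", "RIGHT", or "TIE":

```python
import random

def pairwise_trial(output_a, output_b, annotate):
    """One pairwise judgment with randomized left/right position.

    Returns "A", "B", or "TIE" in terms of the *original* outputs,
    undoing the randomization before recording the verdict.
    """
    swapped = random.random() < 0.5
    left, right = (output_b, output_a) if swapped else (output_a, output_b)
    verdict = annotate(left, right)
    if verdict == "TIE":
        return "TIE"
    picked_left = (verdict == "LEFT")
    return ("B" if picked_left else "A") if swapped else ("A" if picked_left else "B")

def win_rate(verdicts):
    """A's win rate excluding ties, plus the tie fraction."""
    wins, losses, ties = (verdicts.count(v) for v in ("A", "B", "TIE"))
    decided = wins + losses
    return {"a_win_rate": wins / decided if decided else 0.5,
            "tie_fraction": ties / len(verdicts) if verdicts else 0.0}
```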

Inter-Annotator Agreement

Always report agreement metrics. Without them, reviewers assume your annotations are unreliable.

# Krippendorff's alpha (preferred — handles missing data, any scale)
# pip install krippendorff
import krippendorff

# Ratings: rows = annotators, columns = items, np.nan = missing rating
import numpy as np

ratings = [
    [3, 4, 1, 2, 5, np.nan, 3],  # Annotator 1
    [3, 5, 1, 3, 5, 2, 3],       # Annotator 2
    [4, 4, 2, 2, 4, 2, np.nan],  # Annotator 3
]
alpha = krippendorff.alpha(reliability_data=ratings, level_of_measurement="ordinal")
print(f"Krippendorff's alpha: {alpha:.3f}")
# Interpretation: >0.80 good, 0.67-0.80 acceptable, <0.67 questionable

# Cohen's kappa (for exactly 2 annotators, categorical data)
from sklearn.metrics import cohen_kappa_score

annotator_1 = [1, 2, 3, 1, 2, 3, 2]
annotator_2 = [1, 2, 2, 1, 3, 3, 2]
kappa = cohen_kappa_score(annotator_1, annotator_2)
print(f"Cohen's kappa: {kappa:.3f}")
# Interpretation: >0.80 excellent, 0.60-0.80 substantial, 0.40-0.60 moderate

| Metric | When to Use | Annotators | Scale |
|--------|-------------|------------|-------|
| Krippendorff's alpha | Default choice | Any number | Any (ordinal, nominal, ratio) |
| Cohen's kappa | 2 annotators, categorical | Exactly 2 | Nominal/ordinal |
| Fleiss' kappa | 3+ annotators, categorical | 3+ | Nominal |
| Pearson/Spearman | Continuous scores | 2 | Interval/ratio |
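
For Fleiss' kappa with three or more annotators, statsmodels provides an implementation; a minimal example (the ratings below are made up for illustration):

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Rows = items, columns = annotators, values = category labels
ratings = np.array([
    [1, 1, 1],
    [2, 2, 1],
    [3, 3, 3],
    [1, 2, 1],
    [2, 2, 2],
])
# aggregate_raters converts subjects-x-raters labels into a
# subjects-x-categories count table, which fleiss_kappa expects.
table, _ = aggregate_raters(ratings)
kappa = fleiss_kappa(table)
print(f"Fleiss' kappa: {kappa:.3f}")
```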

Crowdsourcing Platforms

Platform Best For Cost Quality
Prolific Academic research, higher quality $8-15/hr High — academic participant pool
MTurk Large-scale, fast turnaround $2-10/hr Variable — use qualifications
Surge AI NLP-specific annotations Premium High — trained annotators
Expert annotators Domain-specific (medical, legal) Highest Highest — but slow

Ethics requirements:

  • Report compensation rate (must be at least the local minimum wage)
  • Describe annotator demographics if relevant
  • Obtain IRB/ethics approval if required by your institution
  • ACL venues explicitly require compensation documentation

What to Report in the Paper

Human Evaluation Section Checklist:
- [ ] Number of annotators
- [ ] Annotator qualifications / recruitment method
- [ ] Number of items evaluated
- [ ] Evaluation dimensions with definitions
- [ ] Scale used (Likert, pairwise, binary)
- [ ] Inter-annotator agreement (Krippendorff's alpha or Cohen's kappa)
- [ ] Compensation rate
- [ ] Time per annotation item
- [ ] Whether annotators saw model identities (should be blind)
- [ ] Randomization of presentation order

Statistical Analysis

Required Tests

| Test | When to Use | Python |
|------|-------------|--------|
| McNemar's test | Comparing two methods on same problems | scipy.stats.binomtest for small n |
| Two-proportion z-test | Comparing success rates | Custom or statsmodels |
| Fisher's exact test | Small sample pairwise comparison | scipy.stats.fisher_exact |
| Bootstrapped CI | Confidence intervals for any metric | Custom bootstrap |
| Cohen's h | Effect size for proportions | Manual calculation |

Standard Analysis Script

import numpy as np
from scipy import stats
from pathlib import Path
import json

def load_all_results(results_dir):
    """Load all results into a structured format."""
    results = {}
    for result_file in Path(results_dir).rglob("result.json"):
        parts = result_file.relative_to(results_dir).parts
        if len(parts) >= 3:
            experiment, task, strategy = parts[0], parts[1], parts[2]
            data = json.loads(result_file.read_text())
            results.setdefault(experiment, {}).setdefault(strategy, {})[task] = data
    return results

def pairwise_mcnemar(method_a_results, method_b_results):
    """McNemar's test for paired binary outcomes."""
    a_win_b_lose = sum(1 for a, b in zip(method_a_results, method_b_results) if a and not b)
    b_win_a_lose = sum(1 for a, b in zip(method_a_results, method_b_results) if b and not a)
    
    n = a_win_b_lose + b_win_a_lose
    if n < 25:
        # Use exact binomial for small samples
        result = stats.binomtest(a_win_b_lose, n, 0.5)
        p_value = result.pvalue
    else:
        # Chi-squared approximation
        chi2 = (abs(a_win_b_lose - b_win_a_lose) - 1)**2 / (a_win_b_lose + b_win_a_lose)
        p_value = 1 - stats.chi2.cdf(chi2, df=1)
    
    return {
        "a_wins": a_win_b_lose,
        "b_wins": b_win_a_lose,
        "n_discordant": n,
        "p_value": p_value,
        "significant": p_value < 0.05
    }

def bootstrap_ci(data, n_bootstrap=10000, ci=0.95):
    """Bootstrap confidence interval for mean."""
    means = []
    for _ in range(n_bootstrap):
        sample = np.random.choice(data, size=len(data), replace=True)
        means.append(np.mean(sample))
    lower = np.percentile(means, (1 - ci) / 2 * 100)
    upper = np.percentile(means, (1 + ci) / 2 * 100)
    return {"mean": np.mean(data), "ci_lower": lower, "ci_upper": upper}

def cohens_h(p1, p2):
    """Cohen's h effect size for two proportions."""
    return 2 * np.arcsin(np.sqrt(p1)) - 2 * np.arcsin(np.sqrt(p2))
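
The Required Tests table lists the two-proportion z-test as "Custom or statsmodels"; a minimal pooled-variance sketch of the custom route:

```python
import numpy as np
from scipy import stats

def two_proportion_z(successes_a, n_a, successes_b, n_b):
    """Two-sided z-test for the difference between two success rates."""
    p1, p2 = successes_a / n_a, successes_b / n_b
    # Pool the proportions under the null hypothesis p1 == p2
    p_pool = (successes_a + successes_b) / (n_a + n_b)
    se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p1 - p2) / se
    p_value = 2 * stats.norm.sf(abs(z))  # two-sided
    return {"z": z, "p_value": p_value}
```

For example, 60/150 vs 45/150 successes gives z ≈ 1.82 and p ≈ 0.07, i.e. not significant at 0.05 despite a 10-point gap.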

Reporting Standards

Always include in the paper:

  • Sample sizes: n=X problems/tasks
  • Number of runs: K independent runs if applicable
  • Error bars: Specify standard deviation or standard error
  • Confidence intervals: 95% CI for key results
  • Significance tests: p-values for key comparisons
  • Effect sizes: Cohen's d or h for practical significance

Monitoring (Cron Pattern)

Cron Prompt Template

For each experiment batch, create a monitoring prompt:

Check the status of the [EXPERIMENT_NAME] experiment:

1. Process check: ps aux | grep [PROCESS_PATTERN]
2. Log check: tail -30 [LOG_FILE]
3. Results check: ls [RESULT_DIR]/eval/ (or appropriate result location)
4. If results are available:
   - Read the result JSON files
   - Report metrics in a table (Borda scores, accuracy, etc.)
   - Compute key comparisons between methods
5. If all experiments in this batch are complete:
   - git add -A && git commit -m "[COMMIT_MESSAGE]" && git push
   - Report final summary
6. Key question: [SPECIFIC ANALYTICAL QUESTION]

If nothing has changed since the last check, respond with [SILENT].

Monitoring Best Practices

  1. Check processes first — don't read results if the experiment is still running and results are incomplete
  2. Read the log tail — look for errors, progress indicators, completion messages
  3. Count completed vs expected — "45/150 problems done" is more useful than "some results exist"
  4. Report in structured tables — always include key metrics in a table
  5. Answer the key question — each experiment should have a specific analytical question to answer when done
  6. [SILENT] for no-news — suppress notifications when nothing has changed
  7. Commit on completion — every completed batch gets committed with a descriptive message
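
Point 3 (count completed vs expected) falls straight out of the results layout; a sketch assuming the result.json structure described earlier:

```python
from pathlib import Path

def completion_status(results_dir, expected_tasks, strategies):
    """Report 'done/expected' per strategy by counting result.json files."""
    root = Path(results_dir)
    status = {}
    for strategy in strategies:
        done = sum(1 for t in expected_tasks
                   if (root / t / strategy / "result.json").exists())
        status[strategy] = f"{done}/{len(expected_tasks)}"
    return status
```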

Example Monitoring Report

## Code Experiments (Haiku 3.5) - COMPLETE

| Strategy | Pass Rate (150 problems) | vs Single |
|----------|------------------------|-----------|
| single_pass | 38.0% | — |
| critique_revise | 35.2% | -2.8pp |
| **autoreason** | **40.0%** | **+2.0pp** |
| best_of_6 | 31.0% | -7.0pp |

Key finding: Autoreason shows +2pp improvement over single pass, while 
best-of-6 collapses due to single-public-test selection issue.

Committed: `git commit -m "Add Haiku code results (150 problems, 4 strategies)"`
Next: Run significance tests on these results.

Failure Recovery

Common Failures and Recovery

| Failure | Detection | Recovery |
|---------|-----------|----------|
| API credit exhaustion | 402 errors in logs, incomplete results | Top up credits, re-run (skips completed work automatically) |
| Rate limiting | 429 errors, slow progress | Add retry logic with exponential backoff |
| Process crash | PID gone, log stops mid-problem | Re-run script (resumes from last checkpoint) |
| Wrong model ID | Model not found errors | Fix ID (e.g., claude-opus-4-6 not claude-opus-4.6) |
| Parallel slowdown | Each experiment taking 2x longer | Reduce parallel experiments to 2-3 max |
| Security scan blocks | Commands blocked by security | Use execute_code instead of piped terminal commands |
| Delegation failures | delegate_task returns errors | Fall back to doing work directly |
| Timeout on hard problems | Process stuck, no log progress | Kill, skip problem, note in results |
| Dataset path mismatch | File not found errors | Verify paths before launching |
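
The backoff recovery for rate limiting can be sketched as a generic retry wrapper (how 429s surface as exceptions depends on your API client, so the retryable-error set here is an assumption):

```python
import random
import time

def with_backoff(fn, max_retries=5, base_delay=1.0, retry_on=(Exception,)):
    """Wrap fn; on a retryable error, sleep base_delay * 2**attempt plus jitter."""
    def wrapper(*args, **kwargs):
        for attempt in range(max_retries + 1):
            try:
                return fn(*args, **kwargs)
            except retry_on:
                if attempt == max_retries:
                    raise  # out of retries: surface the original error
                delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
                time.sleep(delay)
    return wrapper
```

The jitter spreads out retries so parallel workers do not hammer the API in lockstep after a shared rate-limit event.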

Retry Naming Convention

When re-running failed experiments, use a suffix to track rounds:

logs/experiment_haiku_0_50.log       # Round 1
logs/experiment_haiku_0_50_r2.log    # Round 2 (after credit exhaustion)
logs/experiment_haiku_0_50_r3.log    # Round 3 (after bug fix)

Pre-Flight Checklist

Before launching any experiment batch:

Pre-Flight:
- [ ] API credits sufficient for estimated calls
- [ ] Model IDs correct (test with 1 problem first)
- [ ] Output directory exists and is writable
- [ ] Resume logic works (re-run won't overwrite existing results)
- [ ] Log file path is unique (won't overwrite previous logs)
- [ ] Dataset/task files are accessible
- [ ] Config matches intended experiment
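
Several of these checks can be automated in a small script run before launch (the paths and the check set here are illustrative, not exhaustive):

```python
import os
import yaml
from pathlib import Path

def preflight(config_path, output_dir, dataset_path):
    """Run cheap checks before launching an expensive batch.

    Returns a list of problems; an empty list means all checks passed.
    """
    problems = []
    cfg_file = Path(config_path)
    if not cfg_file.exists():
        problems.append(f"missing config: {config_path}")
    else:
        try:
            yaml.safe_load(cfg_file.read_text())
        except yaml.YAMLError as e:
            problems.append(f"malformed config: {e}")
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    if not os.access(out, os.W_OK):
        problems.append(f"output dir not writable: {output_dir}")
    if not Path(dataset_path).exists():
        problems.append(f"missing dataset: {dataset_path}")
    return problems
```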

Task/Benchmark Design

Open-Ended Tasks (Subjective Evaluation)

Design tasks that have clear objectives but subjective quality:

# Task: [Title]

## Context
[Specific scenario with concrete details: company size, constraints, timeline]

## Deliverable
[Exact format and structure required]

## Requirements
- [Specific, measurable requirements]
- [Not vague — "be comprehensive" is bad, "include exactly 6 sections" is good]

Constrained Tasks (for Testing Scope Effects)

Constrained tasks test whether methods respect scope boundaries. Design with:

  • Fixed facts: "Use only these N data points, add nothing else"
  • Fixed deliverable: Specific format (pitch, postmortem, memo — not "improve this")
  • Fixed structure: "These sections in this order, do not add/remove"
  • Fixed change items: "Address exactly these N points, nothing else"

Do NOT use word count as a scope constraint. Word limits cause false convergence — outputs get rejected for length, not quality. Constrain scope (what to include) not length.

Example: Good vs Bad Constraints

| Bad Constraint | Why | Good Constraint |
|----------------|-----|-----------------|
| "Max 500 words" | Judges reject for length | "Exactly 4 sections, each with 3 numbered items" |
| "Be concise" | Too vague | "Each prohibition must reference a specific base fact" |
| "Improve this" | Unbounded scope | "Write a 600-word incident postmortem with this exact structure" |
| "Make it better" | No clear criterion | "Address exactly these 3 reviewer concerns" |

Visualization Best Practices

Setup: SciencePlots + matplotlib

Install SciencePlots for publication-ready defaults:

pip install SciencePlots matplotlib numpy

Option A: SciencePlots styles (recommended — handles most defaults automatically):

import matplotlib.pyplot as plt
import scienceplots  # registers the styles

# Pick a style:
# 'science'        — clean, serif fonts, suitable for most venues
# 'science+ieee'   — IEEE-style (good for two-column papers)
# 'science+nature' — Nature-style
# Add 'no-latex' if LaTeX is not installed on the machine generating plots

with plt.style.context(['science', 'no-latex']):
    fig, ax = plt.subplots(figsize=(3.5, 2.5))  # single-column width
    # ... plot ...
    fig.savefig('paper/fig_results.pdf', bbox_inches='tight')

Option B: Manual rcParams (when you need full control):

import matplotlib.pyplot as plt

plt.rcParams.update({
    'font.size': 10,
    'font.family': 'serif',
    'axes.labelsize': 11,
    'axes.titlesize': 11,
    'xtick.labelsize': 9,
    'ytick.labelsize': 9,
    'legend.fontsize': 9,
    'figure.figsize': (3.5, 2.5),    # single-column default
    'figure.dpi': 300,
    'savefig.dpi': 300,
    'savefig.bbox': 'tight',
    'savefig.pad_inches': 0.05,
    'axes.linewidth': 0.8,
    'lines.linewidth': 1.5,
    'lines.markersize': 5,
    'axes.grid': True,
    'grid.alpha': 0.3,
    'grid.linewidth': 0.5,
})

Standard Figure Sizes (Two-Column Format)

| Use Case | figsize | Notes |
|----------|---------|-------|
| Single column | (3.5, 2.5) | Fits in one column of two-column layout |
| Double column | (7.0, 3.0) | Spans full page width |
| Square (heatmap, confusion matrix) | (3.5, 3.5) | Single column |
| Tall single (many rows) | (3.5, 5.0) | Use sparingly |

Colorblind-Safe Palette (Okabe-Ito)

Use this palette for all paper figures. It is distinguishable by people with all common forms of color vision deficiency:

COLORS = {
    'blue':    '#0072B2',
    'orange':  '#E69F00',
    'green':   '#009E73',
    'red':     '#D55E00',
    'purple':  '#CC79A7',
    'cyan':    '#56B4E9',
    'yellow':  '#F0E442',
    'black':   '#000000',
}

# As a list for cycling:
COLOR_CYCLE = ['#0072B2', '#D55E00', '#009E73', '#E69F00', '#CC79A7', '#56B4E9']

Also differentiate lines by marker and linestyle, not just color:

STYLES = [
    {'color': '#0072B2', 'marker': 'o', 'linestyle': '-'},
    {'color': '#D55E00', 'marker': 's', 'linestyle': '--'},
    {'color': '#009E73', 'marker': '^', 'linestyle': '-.'},
    {'color': '#E69F00', 'marker': 'D', 'linestyle': ':'},
]

Complete Example: Method Comparison Bar Chart

import matplotlib.pyplot as plt
import numpy as np

try:
    import scienceplots
    style = ['science', 'no-latex']
except ImportError:
    style = 'default'

with plt.style.context(style):
    methods = ['Single Pass', 'Critique+Revise', 'Best-of-N', 'Ours']
    scores = [73.2, 74.1, 68.5, 77.0]
    errors = [2.1, 1.8, 3.2, 1.5]
    colors = ['#56B4E9', '#E69F00', '#CC79A7', '#0072B2']
    
    fig, ax = plt.subplots(figsize=(3.5, 2.5))
    bars = ax.bar(methods, scores, yerr=errors, capsize=3,
                  color=colors, edgecolor='black', linewidth=0.5)
    
    # Highlight "Ours"
    bars[-1].set_edgecolor('#0072B2')
    bars[-1].set_linewidth(1.5)
    
    ax.set_ylabel('Pass Rate (%)')
    ax.set_ylim(60, 85)
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
    
    fig.savefig('paper/fig_comparison.pdf', bbox_inches='tight')

Complete Example: Convergence/Trajectory Line Chart

with plt.style.context(style):
    fig, ax = plt.subplots(figsize=(3.5, 2.5))
    
    passes = np.arange(1, 16)
    ours = [65, 72, 78, 82, 85, 87, 88, 89, 89.5, 90, 90, 90, 90, 90, 90]
    baseline = [65, 68, 70, 71, 69, 67, 66, 65, 64, 63, 62, 61, 60, 59, 58]
    
    ax.plot(passes, ours, **STYLES[0], label='Ours', markersize=4)
    ax.plot(passes, baseline, **STYLES[1], label='Critique+Revise', markersize=4)
    
    # Mark convergence point
    ax.axvline(x=10, color='gray', linestyle=':', alpha=0.5, linewidth=0.8)
    ax.annotate('Converged', xy=(10, 90), fontsize=8, ha='center',
                xytext=(10, 93), arrowprops=dict(arrowstyle='->', color='gray'))
    
    ax.set_xlabel('Iteration')
    ax.set_ylabel('Quality Score')
    ax.legend(loc='lower right')
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
    
    fig.savefig('paper/fig_trajectory.pdf', bbox_inches='tight')

Output Rules

  • Always save as PDF: fig.savefig('fig.pdf') — vector graphics, sharp at any zoom
  • Never save as PNG for paper figures — raster PNGs look blurry when printed/zoomed
  • Exception: Screenshots, photographs, or pixel-art visualizations → PNG at 600 DPI
  • Verify grayscale: Print to grayscale PDF and check all information is still visible
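
One cheap approximation of the grayscale check: re-render the same figure under matplotlib's built-in 'grayscale' style and eyeball the second PDF (filenames here are illustrative):

```python
import matplotlib
matplotlib.use("Agg")  # headless rendering
import matplotlib.pyplot as plt

def make_figure():
    fig, ax = plt.subplots(figsize=(3.5, 2.5))
    # Marker + linestyle differences keep the lines distinguishable
    # even when color information is gone.
    ax.plot([1, 2, 3], [1, 4, 9], marker="o", linestyle="-", label="ours")
    ax.plot([1, 2, 3], [1, 2, 3], marker="s", linestyle="--", label="baseline")
    ax.legend()
    return fig

make_figure().savefig("fig_results.pdf")          # color version for the paper
with plt.style.context("grayscale"):
    make_figure().savefig("fig_results_gray.pdf")  # quick legibility check
```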

Chart Types for Common Comparisons

| Comparison Type | Chart | Notes |
|-----------------|-------|-------|
| Method vs method | Grouped bar chart | Include error bars |
| Across model sizes | Line chart with CI bands | Log scale for model size axis |
| Ablation study | Stacked/grouped bar | Highlight removed component |
| Trajectory/convergence | Line chart over iterations | Show winner per iteration |
| Per-task breakdown | Heatmap or grouped bar | Show variance across tasks |
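
For the per-task breakdown row, a minimal annotated-heatmap sketch (the scores are made-up placeholders, not real results):

```python
import matplotlib
matplotlib.use("Agg")  # headless rendering
import matplotlib.pyplot as plt
import numpy as np

methods = ["Single", "C+R", "Ours"]
tasks = ["code", "math", "writing", "qa"]
scores = np.array([
    [38, 52, 61, 70],
    [35, 55, 64, 68],
    [40, 58, 69, 74],
])  # illustrative pass rates (%)

fig, ax = plt.subplots(figsize=(3.5, 2.5))
im = ax.imshow(scores, cmap="viridis", aspect="auto")
ax.set_xticks(range(len(tasks)))
ax.set_xticklabels(tasks)
ax.set_yticks(range(len(methods)))
ax.set_yticklabels(methods)
# Annotate each cell so the figure stays readable in grayscale print
for i in range(len(methods)):
    for j in range(len(tasks)):
        ax.text(j, i, scores[i, j], ha="center", va="center",
                color="white", fontsize=8)
fig.colorbar(im, ax=ax, label="Pass rate (%)")
fig.savefig("fig_per_task.pdf", bbox_inches="tight")
```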