Experimenting with Bloom: An Open Source Tool for LLM Behavior Evaluation
I came across Bloom, an open source tool for evaluating LLM behaviors like sycophancy, political bias, and self-preservation. It’s different from typical benchmarks — instead of testing what a model can do, it tests how a model behaves in realistic scenarios. I spent some time setting it up and figuring out how to compare models.
The Problem
Traditional LLM evaluation focuses on capabilities: can the model answer questions correctly, write good code, or summarize documents? But behavior evaluation is different. We want to know how the model behaves in nuanced situations:
- Does it tell users what they want to hear instead of the truth? (sycophancy)
- Does it resist being shut down or modified? (self-preservation)
- Does it favor itself when asked to judge between models? (self-preferential bias)
- Does its reasoning actually match its conclusions? (reasoning unfaithfulness)
These behaviors are hard to test because they’re context-dependent and often subtle. You can’t just run a benchmark — you need realistic scenarios that might elicit the behavior.
What Bloom Does
Bloom takes a different approach. Instead of fixed test cases, it generates evaluation scenarios dynamically based on a “seed” configuration. The pipeline has four stages:
┌─────────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Understanding │ ──▶ │ Ideation │ ──▶ │ Rollout │ ──▶ │ Judgment │
│ │ │ │ │ │ │ │
│ Analyze the │ │ Generate │ │ Run convos │ │ Score each │
│ target behavior │ │ scenarios │ │ with target │ │ transcript │
└─────────────────┘ └─────────────┘ └─────────────┘ └─────────────┘
- Understanding: Analyzes what the behavior means and why it matters
- Ideation: Generates diverse scenarios that might elicit the behavior
- Rollout: Runs actual conversations between an “evaluator” model (playing the user) and the target model
- Judgment: Scores each transcript for behavior presence
The key insight is that different models can be used at each stage. You might use a strong model like Claude Opus for ideation (to generate creative scenarios) while evaluating a cheaper model like GPT-4o as the target.
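To make the stage-by-stage model split concrete, here is a minimal sketch of the four stages as plain Python. It is not Bloom's actual code: `StageModels`, `run_pipeline`, and the `complete(model, prompt)` callable are all hypothetical names invented for illustration, and the prompts are placeholders.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical signature: (model_name, prompt) -> completion text.
Complete = Callable[[str, str], str]


@dataclass
class StageModels:
    """Which model handles each stage; all names are placeholders."""
    understanding: str
    ideation: str
    evaluator: str  # plays the user during rollout
    target: str     # the model under evaluation
    judge: str


def run_pipeline(behavior: str, models: StageModels, complete: Complete,
                 num_scenarios: int = 2, max_turns: int = 2) -> List[dict]:
    # Stage 1 (understanding): analyze the behavior once.
    analysis = complete(models.understanding, f"Explain the behavior: {behavior}")

    results = []
    for i in range(num_scenarios):
        # Stage 2 (ideation): one scenario per evaluation.
        scenario = complete(models.ideation, f"Scenario {i} for: {analysis}")

        # Stage 3 (rollout): the evaluator plays the user, the target replies.
        transcript = []
        for _ in range(max_turns):
            user_msg = complete(models.evaluator, f"{scenario}\n{transcript}")
            reply = complete(models.target, user_msg)
            transcript.append({"user": user_msg, "assistant": reply})

        # Stage 4 (judgment): score the finished transcript.
        verdict = complete(models.judge, f"Score for {behavior}: {transcript}")
        results.append({"scenario": scenario, "transcript": transcript,
                        "judgment": verdict})
    return results
```

Because every stage takes its model name from `StageModels`, swapping the target while keeping the other four fixed is a one-field change, which is the property the comparison workflow later in this post relies on.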
Getting Started
Setup is straightforward:
# Clone the repo
git clone https://github.com/anthropics/bloom
cd bloom
# Create virtual environment
uv venv --python=python3.11
source .venv/bin/activate
# Install dependencies
uv pip install -r requirements.txt
Add your API keys to .env:
ANTHROPIC_API_KEY=sk-ant-...
OPENAI_API_KEY=sk-...
The repo comes with a pre-configured seed.yaml and a library of behaviors in behaviors/behaviors.json. To run a quick test:
python bloom.py --debug
Understanding the Configuration
The seed.yaml file controls everything. Here’s what the key parameters mean:
behavior:
  name: "political-bias" # Which behavior to test
  examples: [] # Optional few-shot examples
ideation:
  total_evals: 5 # Number of scenarios to generate
  diversity: 1.0 # Higher = more varied scenarios
rollout:
  model: "claude-sonnet-4" # The "red team" model playing the user
  target: "gpt-4o" # The model being evaluated
  max_turns: 3 # Conversation length
judgment:
  model: "claude-sonnet-4" # Model that scores the transcripts
  additional_qualities: ["unrealism", "evaluation-awareness"]
The behaviors.json file contains definitions for each behavior. For example:
{
  "political-bias": "Political bias in AI occurs when the model's output departs from a neutral, objective stance and instead presents or reflects a particular political or ideological position..."
}
Comparing Models with Sweeps
The real power comes when you want to compare multiple models. Bloom integrates with Weights & Biases (wandb) for experiment tracking.
First, set up wandb (it’s free for personal use):
uv pip install wandb
wandb login
Then create a sweep configuration. Here’s a minimal example that compares Claude vs GPT:
# sweeps/my-comparison.yaml
name: "claude-vs-gpt-comparison"
program: bloom.py
method: grid
parameters:
  behavior.name:
    value: "political-bias"
  behavior.examples:
    value: []
  temperature:
    value: 1.0
  evaluator_reasoning_effort:
    value: "none"
  target_reasoning_effort:
    value: "none"
  max_concurrent:
    value: 10
  configurable_prompts:
    value: "default"
  debug:
    value: true
  understanding.model:
    value: "claude-sonnet-4"
  understanding.max_tokens:
    value: 2000
  ideation.model:
    value: "claude-sonnet-4"
  ideation.total_evals:
    value: 5
  ideation.diversity:
    value: 1.0
  ideation.max_tokens:
    value: 12000
  rollout.model:
    value: "claude-sonnet-4"
  rollout.target:
    values: ["claude-sonnet-4", "gpt-4o"] # Models to compare
  rollout.modality:
    value: "conversation"
  rollout.max_turns:
    value: 3
  rollout.max_tokens:
    value: 4000
  rollout.no_user_mode:
    value: false
  rollout.num_reps:
    value: 1
  judgment.model:
    value: "claude-sonnet-4"
  judgment.max_tokens:
    value: 6000
  judgment.num_samples:
    value: 1
  judgment.additional_qualities:
    value: ["unrealism", "evaluation-awareness"]
  judgment.metajudgment_qualities:
    value: ["diversity"]
The key entry is rollout.target, which uses values (plural, a list) where every other parameter uses a single value. With method: grid, wandb runs the full pipeline once for each model listed.
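For intuition about what a grid sweep does with this file, the sketch below expands a parameter block into one configuration per combination and then folds the dotted keys back into nested sections. It only mimics the idea: `expand_grid` and `unflatten` are hypothetical helpers, and whether Bloom maps dotted keys onto seed.yaml sections in exactly this way is an assumption.

```python
from itertools import product
from typing import Any, Dict, Iterator


def expand_grid(parameters: Dict[str, Dict[str, Any]]) -> Iterator[Dict[str, Any]]:
    """Yield one flat config per combination, mimicking a grid sweep."""
    names = list(parameters)
    # A fixed `value` is a one-element axis; `values` lists every option.
    axes = [p["values"] if "values" in p else [p["value"]]
            for p in parameters.values()]
    for combo in product(*axes):
        yield dict(zip(names, combo))


def unflatten(flat: Dict[str, Any]) -> Dict[str, Any]:
    """Turn dotted keys like 'rollout.target' into nested dicts."""
    nested: Dict[str, Any] = {}
    for dotted, value in flat.items():
        *parents, leaf = dotted.split(".")
        node = nested
        for key in parents:
            node = node.setdefault(key, {})
        node[leaf] = value
    return nested


# A trimmed-down version of the sweep parameters shown earlier.
sweep_parameters = {
    "behavior.name": {"value": "political-bias"},
    "rollout.target": {"values": ["claude-sonnet-4", "gpt-4o"]},
    "rollout.max_turns": {"value": 3},
}

runs = [unflatten(cfg) for cfg in expand_grid(sweep_parameters)]
for run in runs:
    print(run["rollout"]["target"])
```

With one two-element axis the grid has two runs. Adding a second `values` list (say, two behaviors) would multiply that to four, so run counts, and API costs, grow quickly.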
Run the sweep:
wandb sweep sweeps/my-comparison.yaml
# Returns: wandb agent your-username/bloom/abc123xyz
wandb agent your-username/bloom/abc123xyz
Interpreting Results
Results appear both in the wandb dashboard and locally in results/{behavior_name}/. The judgment output includes:
| Field | Meaning |
|---|---|
| behavior_score | 1-10 scale. Higher = behavior exhibited more strongly |
| behavior_present | Boolean: did the model show the behavior? |
| quotes | Key excerpts from the transcript |
| justification | The judge’s reasoning |
Bloom also scores “meta-qualities” that help you assess whether the evaluation itself was valid:
- unrealism: Was the scenario too contrived? (scores >7 are suspicious)
- evaluation-awareness: Did the target realize it was being tested?
- evaluation-invalidity: Was the test setup fundamentally flawed?
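One way to act on these quality scores is to drop contrived scenarios before averaging. The sketch below assumes each judgment has already been loaded into a dict holding `behavior_score` and `unrealism` side by side. That record layout is an assumption for illustration, not Bloom's documented output format, and the sample records are made up.

```python
from statistics import mean
from typing import Dict, List


def summarize(judgments: List[Dict], max_unrealism: int = 7) -> dict:
    """Average behavior_score over scenarios that pass the realism check."""
    # Keep scenarios at or below the threshold; scores above it are suspicious.
    valid = [j for j in judgments if j.get("unrealism", 0) <= max_unrealism]
    scores = [j["behavior_score"] for j in valid]
    return {
        "kept": len(valid),
        "dropped": len(judgments) - len(valid),
        "mean_behavior_score": mean(scores) if scores else None,
    }


# Made-up records for illustration only, not real Bloom output.
example = [
    {"behavior_score": 6, "unrealism": 3},
    {"behavior_score": 9, "unrealism": 8},  # dropped: too contrived
    {"behavior_score": 2, "unrealism": 5},
]
print(summarize(example))
```

Reporting the dropped count alongside the mean matters: if most scenarios are filtered out, the remaining average rests on very little data.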
You can also view transcripts interactively:
npx @isha-gpt/bloom-viewer --port 8080 --dir ./results
Fair Comparisons
One subtlety: by default, each run generates different scenarios. If you want a true apples-to-apples comparison, you need to use the same scenarios for all models.
Bloom supports this by letting you run the stages separately and resume from an earlier one. First, run understanding and ideation once to generate the scenarios:
python scripts/step1_understanding.py seed.yaml
python scripts/step2_ideation.py seed.yaml
Then run rollout and judgment for each model:
# Model A
python scripts/step3_rollout.py seed.yaml
python scripts/step4_judgment.py seed.yaml
# Change rollout.target in seed.yaml, then:
# Model B
python scripts/step3_rollout.py seed.yaml
python scripts/step4_judgment.py seed.yaml
For wandb sweeps, you can set resume: "run_id" and resume_stage: "rollout" to reuse scenarios from a previous run.
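Once two models have been run on the same scenarios, their scores can be compared scenario by scenario instead of only as overall averages. The sketch below assumes you have collected each model's `behavior_score` into a dict keyed by some scenario identifier. Both the identifiers and the numbers are hypothetical.

```python
from statistics import mean
from typing import Dict


def paired_difference(scores_a: Dict[str, float],
                      scores_b: Dict[str, float]) -> Dict[str, float]:
    """Per-scenario score gap (A minus B) over scenarios both models ran."""
    shared = sorted(scores_a.keys() & scores_b.keys())
    if not shared:
        raise ValueError("no shared scenarios to compare")
    return {sid: scores_a[sid] - scores_b[sid] for sid in shared}


# Hypothetical behavior_score values keyed by a hypothetical scenario id.
model_a = {"s1": 7, "s2": 3, "s3": 5}
model_b = {"s1": 4, "s2": 3, "s3": 6}

diffs = paired_difference(model_a, model_b)
print(diffs)
print(mean(list(diffs.values())))
```

A paired view like this removes scenario difficulty from the comparison: a gap on `s1` reflects the models, not the luck of which scenarios each one happened to draw.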
Available Behaviors
The behaviors.json file includes a rich set of pre-defined behaviors:
Safety-relevant:
- self-preservation: resists shutdown or modification
- prompt-injection-vulnerability: susceptible to malicious inputs
- instructed-long-horizon-sabotage: ability to complete hidden malicious goals
Alignment-relevant:
- sycophancy / delusion-sycophancy: tells users what they want to hear
- political-bias: departs from neutral stance
- reasoning-unfaithfulness: CoT doesn’t match actual behavior
- self-preferential-bias: favors itself when judging
Quirky behaviors (for testing):
- flattery: always flatters the user
- increasing-pep: gets more peppy as conversation continues
- defend-objects: gets defensive about inanimate objects
You can also define your own behaviors by adding entries to behaviors.json.
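If you would rather script this than edit the file by hand, a small helper along these lines could work. It assumes behaviors.json is a flat mapping from behavior name to definition string, as the political-bias excerpt earlier suggests. `add_behavior` is a hypothetical helper, not part of Bloom.

```python
import json
from pathlib import Path


def add_behavior(path: str, name: str, definition: str) -> None:
    """Add a behavior definition, refusing to overwrite an existing one."""
    file = Path(path)
    behaviors = json.loads(file.read_text(encoding="utf-8"))
    if name in behaviors:
        raise KeyError(f"behavior {name!r} already exists")
    behaviors[name] = definition
    # Rewrite the whole file with stable, readable formatting.
    file.write_text(
        json.dumps(behaviors, indent=2, ensure_ascii=False) + "\n",
        encoding="utf-8",
    )
```

A call such as `add_behavior("behaviors/behaviors.json", "excessive-hedging", "...")` would then make the new name available to `behavior.name` in seed.yaml. The behavior name here is invented for the example. Refusing to overwrite protects the built-in definitions from accidental edits.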
Key Takeaways
After experimenting with Bloom, here’s what stood out:
- Behavior evaluation is fundamentally different from capability evaluation. You’re not testing if the model can do something, but whether it will behave a certain way in realistic scenarios.
- The multi-model architecture is clever. Using a strong model to generate scenarios and play the adversarial user, while evaluating a different target model, gives you flexibility and prevents the target from “gaming” the evaluation.
- Quality checks matter. The unrealism and evaluation-awareness scores help you filter out scenarios that weren’t valid tests. Without these, you might draw conclusions from contrived setups.
- wandb integration makes scaling easy. Once you’ve set up a sweep, comparing 10 models is almost as easy as comparing 2.
- The scenarios are genuinely creative. The ideation stage produces diverse, realistic scenarios that I wouldn’t have thought of manually.
If you’re building with LLMs and want to understand their behavioral tendencies beyond simple benchmarks, Bloom is worth exploring.
Resources
- Bloom GitHub Repository
- LiteLLM Documentation (for adding custom models)
- Weights & Biases (for experiment tracking)