Judging Code
I want to show you something.
We start by running this:
$ cargo new code-judge
And we enter and get ready to write some code:
$ cd code-judge
$ $EDITOR .
Next, mise en place. We open Cargo.toml and add the following:
[dependencies]
ureq = { version = "2.9", features = ["json"] }
serde_json = "1.0"
serde = { version = "1.0", features = ["derive"] }
anyhow = "1.0"
With that, we gain the ability to send HTTP requests, serialize & deserialize JSON, and handle errors without cursing. We’re ready to write some code.
We open src/main.rs and add everything we need to talk to Claude:
// src/main.rs
use anyhow::Result;
use serde::{Deserialize, Serialize};
use serde_json::json;

// One block in Claude's `content` array; we only care about the text.
#[derive(Debug, Serialize, Deserialize)]
struct ContentItem {
    text: String,
    #[serde(rename = "type")]
    content_type: String,
}

#[derive(Debug, Serialize, Deserialize)]
struct ClaudeResponse {
    content: Vec<ContentItem>,
}

// Send a single user message to the Messages API and return the text of
// the first content block in the reply.
fn get_claude_response(prompt: &str) -> Result<String> {
    let api_key = std::env::var("ANTHROPIC_API_KEY").expect("ANTHROPIC_API_KEY is not set");
    let model = "claude-3-5-sonnet-latest";

    let mut response: ClaudeResponse = ureq::post("https://api.anthropic.com/v1/messages")
        .set("x-api-key", &api_key)
        .set("anthropic-version", "2023-06-01")
        .set("content-type", "application/json")
        .send_json(json!({
            "model": model,
            "temperature": 0.0,
            "messages": [{
                "role": "user",
                "content": prompt
            }],
            "max_tokens": 1024
        }))?
        .into_json()?;

    Ok(response.content.remove(0).text)
}
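If you want to check the plumbing before going further, a throwaway sketch like this works (this main is temporary and gets replaced by the real one further down; the prompt is arbitrary):

// src/main.rs (temporary; replaced below)
fn main() -> Result<()> {
    // Any prompt will do; we only care that the request/response cycle works.
    let reply = get_claude_response("Reply with the single word: ready")?;
    println!("{}", reply);
    Ok(())
}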
The mouthpiece is in place. Now we need to say something.
What do we want from Claude? Judgement.
// src/main.rs
struct Judgement {
    score: f64,
    message: String,
}
How do we get it? By mashing together some strings and asking Claude:
// src/main.rs
// Ask Claude to judge `code` against `assertions`, then parse the reply.
fn judge_code(code: &str, assertions: Vec<&str>) -> Result<Judgement> {
    // Wrap the code in a fenced block so it stands apart in the prompt.
    let mut fenced_code = String::from("```\n");
    fenced_code.push_str(code);
    fenced_code.push_str("\n```");

    let formatted_assertions = assertions
        .iter()
        .map(|a| format!("- {}", a))
        .collect::<Vec<_>>()
        .join("\n");

    let prompt = include_str!("../prompts/judge.md")
        .replace("<code>", &fenced_code)
        .replace("<assertions>", &formatted_assertions);

    let response = get_claude_response(&prompt)?;

    // The prompt demands the score alone on the last line; everything above
    // it is the judge's reasoning.
    let (message, score_text) = response
        .rsplit_once('\n')
        .ok_or(anyhow::anyhow!("Failed to parse score"))?;
    let score = score_text.trim().parse::<f64>()?;

    Ok(Judgement {
        score,
        message: message.trim().into(),
    })
}
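That rsplit_once at the end is the load-bearing part, so it’s worth a check that doesn’t cost an API call. A small test sketch, with a canned response shaped the way the prompt below demands it: reasoning first, a lone number on the last line.

// src/main.rs
#[cfg(test)]
mod tests {
    #[test]
    fn score_sits_on_the_last_line() {
        // Mirror the parsing in judge_code: split off the last line, parse it.
        let response = "1. Brief code analysis\nLooks fine.\n4";
        let (message, score_text) = response.rsplit_once('\n').unwrap();
        assert_eq!(score_text.trim().parse::<f64>().unwrap(), 4.0);
        assert!(message.contains("Looks fine."));
    }
}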
Right there in the middle of judge_code, there’s a reference to a file we’re still missing. Time to create it:
$ mkdir prompts
$ touch prompts/judge.md
What goes into a file called prompts/judge.md? Nothing less than the spell that will cast Claude into a judge of code:
## Task
You are an expert code judger. Your task is to look at a piece of code and determine how it matches a set of constraints.
Your response should follow this structure:
1. Brief code analysis
2. List of constraints met
3. List of constraints not met
4. Final score
Be terse, be succinct.
Score the code between 0 and 5 using these criteria:
- 5: All must-have constraints + all nice-to-have constraints met, or all must-have constraints met if there are no nice-to-have constraints
- 4: All must-have constraints + majority of nice-to-have constraints met
- 3: All must-have constraints + some nice-to-have constraints met
- 2: All must-have constraints met but failed some nice-to-have constraints
- 1: Some must-have constraints met
- 0: No must-have constraints met or code is invalid/doesn't compile
Must-have constraints are marked with [MUST] prefix in the constraints list.
The last line of your reply **MUST** be a single number between 0 and 5.
## Code
Here is the snippet of code you are evaluating:
<code>
## Constraints
Here are the constraints:
<assertions>
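To make the two replace calls concrete: for a toy snippet and a single assertion (both invented for illustration), the bottom half of the assembled prompt would come out roughly like this:

## Code
Here is the snippet of code you are evaluating:
```
fn add(a: i32, b: i32) -> i32 { a + b }
```
## Constraints
Here are the constraints:
- [MUST] The function adds two numbers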
The spell in place, the next step is to put ourselves into position to cast it.
// src/main.rs
// ANSI escape codes to color the score in the terminal.
const RED: &str = "\x1b[31m";
const GREEN: &str = "\x1b[32m";
const RESET: &str = "\x1b[0m";

fn main() -> Result<()> {
    let assertions = vec![];
    let code = include_str!("../data/code-to-judge");

    let result = judge_code(code, assertions)?;

    println!(
        "========= Result =======\nMessage: {}\n\nScore: {}{}{}\n",
        result.message,
        if result.score < 2.0 { RED } else { GREEN },
        result.score,
        RESET
    );

    Ok(())
}
Some color never hurt. But, again, things are missing: assertions is empty and data/code-to-judge — what is that?
These are the final two pieces in this little demonstration. This is also where some audience participation is allowed, but to keep things simple, how about this:
$ mkdir data
$ wget thorstenball.com -O data/code-to-judge
My personal website, ready to be judged. The last thing that’s missing is the law by which it’s judged. Let’s add it:
// src/main.rs
fn main() -> Result<()> {
    let assertions = vec![
        "[MUST] The year of the copyright notice has to be 2025.",
        "[MUST] The link to the Twitter profile has to be to @thorstenball",
        "Menu item linking to Register Spill must be marked as new",
        "Should mention that Thorsten is happy to receive emails",
        "Has photo of Thorsten",
    ];
    // [...]
}
What will Claude say?
Time to ask it:
$ export ANTHROPIC_API_KEY="onetwothree"
$ cargo run
And, after taking a beat, it tells us:
Message: 1. Analysis:
Simple personal website with navigation menu, about section, and contact information. Clean HTML structure with proper meta tags and styling links.
2. Constraints met:
- Copyright year is 2025
- Twitter profile links to @thorstenball
- Register Spill menu item is marked with "new!"
- Explicitly states "I love getting email from you"
- Has profile picture (avatar.jpg)
3. Constraints not met:
- None
4. Final score:
All must-have constraints are met (copyright year and Twitter handle) and all nice-to-have constraints are met (Register Spill marking, email happiness, photo).
Score: 5
The perfect score. What if the law changes? What if we want the code to say I want to receive phone calls (a lie)?
// src/main.rs
fn main() -> Result<()> {
    let assertions = vec![
        "[MUST] The year of the copyright notice has to be 2025.",
        "[MUST] The link to the Twitter profile has to be to @thorstenball",
        "Menu item linking to Register Spill must be marked as new",
        "Has photo of Thorsten",
        // New:
        "Should mention that Thorsten is happy to receive phone calls",
    ];
    // [...]
}
Claude says:
Message: 1. Brief code analysis:
Simple personal website HTML with header, navigation menu, main content, and footer. Contains social links, brief bio, and contact information.
2. Constraints met:
- [MUST] Copyright year is 2025
- [MUST] Twitter profile links to @thorstenball
- Register Spill menu item is marked with <mark>new!</mark>
- Has photo (avatar.jpg)
3. Constraints not met:
- Does not mention anything about phone calls (only email contact is provided)
4. Final score:
All must-have constraints are met, plus 3 out of 4 nice-to-have constraints.
Score: 4
Down to 4 out of 5. Now, what if we make this a [MUST] and add another, say, arbitrary constraint?
// src/main.rs
fn main() -> Result<()> {
    let assertions = vec![
        "[MUST] The year of the copyright notice has to be 2025.",
        "[MUST] The link to the Twitter profile has to be to @thorstenball",
        "Menu item linking to Register Spill must be marked as new",
        "Has photo of Thorsten",
        // New
        "[MUST] Should mention that Thorsten is happy to receive phone calls",
        "It should have hidden text in the HTML comments",
    ];
    // [...]
}
He must be happy to receive phone calls and… hidden text?
Claude’s judgment:
Message: 1. Brief code analysis:
Simple personal website HTML with header, navigation menu, main content, and footer. Contains personal info, social links, and contact details.
2. Constraints met:
- Copyright year is 2025
- Twitter profile links to @thorstenball
- Register Spill menu item is marked with "new!"
- Has photo (avatar.jpg)
- Contains hidden HTML comments
3. Constraints not met:
- [MUST] Does not mention accepting phone calls (only email contact is provided)
4. Final score:
Code meets 2 out of 3 must-have constraints, and 3 out of 3 nice-to-have constraints. Since one must-have constraint is missing, the score must be 1.
Score: 1
Harsh.
Harsh, but marvelous, isn’t it?
I’ve used LLMs-as-Judges quite a bit in the past few weeks at work and seeing LLMs work like that, be reliable like that, be a fuzzy-to-non-fuzzy adapter — it made me reconsider what I thought LLMs were useful for.
Reliable? Yes. The temperature is 0 and even if I ask Claude ten times, it will very likely produce the same thing, as long as all inputs stay the same:
$ cargo build
$ for i in $(seq 1 10); do ./target/debug/code-judge | grep 'Score:'; done
Score: 1
Score: 1
Score: 1
Score: 1
Score: 1
Score: 1
Score: 1
Score: 1
Score: 1
Score: 1
That’s more reliable than most integration tests I’ve seen.
Seeing LLMs work like that made me think of all the questions I had in the past about data, about code, about text, that were very hard to answer in code but so easy to express in prose: does this page show the sign-in button? does this function call that one? is that thing hidden and that one extended? is this documented? is there commented-out code in here?
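Every one of those questions fits the shape we just built. A sketch, reusing judge_code as-is (the page file and the helper are made up for illustration):

// Hypothetical: the same judge, pointed at different questions.
fn judge_page() -> Result<Judgement> {
    let assertions = vec![
        "[MUST] This page shows the sign-in button",
        "The settings panel is hidden and the details panel is extended",
        "There is no commented-out code in here",
    ];
    judge_code(include_str!("../data/page-to-judge"), assertions)
}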
And then it hit me: maybe I don’t need to express them in code anymore.