Date: 2025-01-22

I want to show you something.

We start by running this:

$ cargo new code-judge

And we enter and get ready to write some code:

$ cd code-judge
$ $EDITOR .

Next, mise en place. We open Cargo.toml and add the following:

[dependencies]
ureq = { version = "2.9", features = ["json"] }
serde_json = "1.0"
serde = { version = "1.0", features = ["derive"] }
anyhow = "1.0"

With that, we gain the ability to send HTTP requests, serialize and deserialize JSON, and handle errors without cursing. We’re ready to write some code.

We open src/main.rs and add everything we need to talk to Claude:

// src/main.rs
use anyhow::Result;
use serde::{Deserialize, Serialize};
use serde_json::json;

#[derive(Debug, Serialize, Deserialize)]
struct ContentItem {
    text: String,
    #[serde(rename = "type")]
    content_type: String,
}

#[derive(Debug, Serialize, Deserialize)]
struct ClaudeResponse {
    content: Vec<ContentItem>,
}

fn get_claude_response(prompt: &str) -> Result<String> {
    let api_key = std::env::var("ANTHROPIC_API_KEY").expect("ANTHROPIC_API_KEY is not set");
    let model = "claude-3-5-haiku-latest";

    let mut response: ClaudeResponse = ureq::post("https://api.anthropic.com/v1/messages")
        .set("x-api-key", &api_key)
        .set("anthropic-version", "2023-06-01")
        .set("content-type", "application/json")
        .send_json(json!({
            "model": model,
            "temperature": 0.0,
            "messages": [{
                "role": "user",
                "content": prompt
            }],
            "max_tokens": 1024
        }))?
        .into_json()?;

    Ok(response.content.remove(0).text)
}
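One caveat about that last line: `remove(0)` panics if `content` comes back empty. Here is a minimal sketch of a non-panicking lookup. It assumes a simplified stand-in for `ContentItem` (no serde derives, so it compiles with the standard library alone) and a hypothetical helper named `first_text`; neither is part of the program above.

```rust
// Simplified stand-in for the ContentItem struct above (assumption: no serde
// derives, so this sketch needs no crates).
struct ContentItem {
    content_type: String,
    text: String,
}

/// Hypothetical helper: returns the text of the first block whose type is
/// "text", or None if there is no such block.
fn first_text(content: &[ContentItem]) -> Option<&str> {
    content
        .iter()
        .find(|item| item.content_type == "text")
        .map(|item| item.text.as_str())
}
```

In `get_claude_response`, the `None` case could then become an error via `ok_or_else` rather than a panic.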

The mouthpiece is in place. Now we need to say something.

What do we want from Claude? Judgement.

// src/main.rs
struct Judgement {
    score: f64,
    message: String,
}

How do we get it? By mashing together some strings and asking Claude:

// src/main.rs
fn judge_code(code: &str, assertions: Vec<&str>) -> Result<Judgement> {
    let mut fenced_code = String::from("```");
    fenced_code.push_str(code);
    fenced_code.push_str("```");

    let formatted_assertions = assertions
        .iter()
        .map(|a| format!("- {}", a))
        .collect::<Vec<_>>()
        .join("\n");

    let prompt = include_str!("../prompts/judge.md")
        .replace("<code>", &fenced_code)
        .replace("<assertions>", &formatted_assertions);

    let response = get_claude_response(&prompt)?;

    let (message, score_text) = response
        .rsplit_once('\n')
        .ok_or(anyhow::anyhow!("Failed to parse score"))?;

    let score = score_text.parse::<f64>()?;

    Ok(Judgement {
        score,
        message: message.trim().into(),
    })
}
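The parsing at the end of `judge_code` relies on the reply ending in a bare number with nothing after it, not even a newline. A slightly more defensive sketch follows, under the assumption that trailing whitespace should be ignored and that scores outside 0 to 5 count as errors. `parse_judgement` is a hypothetical name, and it returns a plain `String` error so that it needs no crates.

```rust
/// Hypothetical helper: splits a reply into (message, score).
/// Fails if the last non-blank line is not a number between 0 and 5.
fn parse_judgement(response: &str) -> Result<(String, f64), String> {
    // Drop trailing whitespace so a final newline does not hide the score line.
    let trimmed = response.trim_end();

    let (message, score_text) = trimmed
        .rsplit_once('\n')
        .ok_or_else(|| "reply has no separate score line".to_string())?;

    let score: f64 = score_text
        .trim()
        .parse()
        .map_err(|e| format!("could not parse score {:?}: {}", score_text, e))?;

    // Also rejects NaN and infinities, which `parse` would otherwise accept.
    if !(0.0..=5.0).contains(&score) {
        return Err(format!("score {} is outside 0..=5", score));
    }

    Ok((message.trim().to_string(), score))
}
```

Whether an out-of-range score should be an error or be clamped is a design choice; this sketch picks the error so that a misbehaving reply is noticed.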

Right there in the middle, there’s a reference to a file we’re still missing. Time to create it:

$ mkdir prompts
$ touch prompts/judge.md

What goes into a file called prompts/judge.md? Nothing less than the spell that will cast Claude into a judge of code:

## Task

You are an expert code judger. Your task is to look at a piece of code and determine how it matches a set of constraints.

Your response should follow this structure:

1. Brief code analysis
2. List of constraints met
3. List of constraints not met
4. Final score

Be terse, be succinct.

Score the code between 0 and 5 using these criteria:

- 5: All must-have constraints + all nice-to-have constraints met, or all must-have constraints met if there are no nice-to-have constraints
- 4: All must-have constraints + majority of nice-to-have constraints met
- 3: All must-have constraints + some nice-to-have constraints met
- 2: All must-have constraints met but failed some nice-to-have constraints
- 1: Some must-have constraints met
- 0: No must-have constraints met or code is invalid/doesn't compile

Must-have constraints are marked with [MUST] prefix in the constraints list.

The last line of your reply **MUST** be a single number between 0 and 5.

## Code

Here is the snippet of code you are evaluating:

<code>

## Constraints

Here are the constraints:

<assertions>
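The two placeholders at the end of that file, `<code>` and `<assertions>`, are what `judge_code` fills in with its two `replace` calls. Because the second `replace` runs over text that already contains the inserted code, a snippet that happens to contain the literal string `<assertions>` would be rewritten too. A minimal sketch of a variant that splits the template instead, so inserted text is never scanned, assuming `<code>` appears before `<assertions>` as it does in judge.md. `render_prompt` is a hypothetical name, and unlike `judge_code` it puts the fences on their own lines.

```rust
/// Hypothetical helper: fills the two placeholders of a judge.md-style template.
/// Returns None if a placeholder is missing or they appear in the wrong order.
fn render_prompt(template: &str, code: &str, assertions: &[&str]) -> Option<String> {
    // Split around the placeholders rather than calling `replace` twice,
    // so the inserted code and assertions are never searched for placeholders.
    let (before_code, rest) = template.split_once("<code>")?;
    let (between, after) = rest.split_once("<assertions>")?;

    // Three backticks, built at runtime to keep this listing easy to embed.
    let fence = "`".repeat(3);

    let list = assertions
        .iter()
        .map(|a| format!("- {}", a))
        .collect::<Vec<_>>()
        .join("\n");

    Some(format!(
        "{}{}\n{}\n{}{}{}{}",
        before_code, fence, code, fence, between, list, after
    ))
}
```

For the website judged in this post, the difference is unlikely to matter. It starts to matter once the code under judgement is arbitrary.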

The spell in place, the next step is to put ourselves into position to cast it.

// src/main.rs
const RED: &str = "\x1b[31m";
const GREEN: &str = "\x1b[32m";
const RESET: &str = "\x1b[0m";

fn main() -> Result<()> {
    let assertions = vec![];
    let code = include_str!("../data/code-to-judge");

    let result = judge_code(code, assertions)?;

    println!(
        "========= Result =======\nMessage: {}\n\nScore: {}{}{}\n",
        result.message,
        if result.score < 2.0 { RED } else { GREEN },
        result.score,
        RESET
    );

    Ok(())
}

Some color never hurt. But, again, things are missing: assertions is empty and data/code-to-judge — what is that?

These are the final two pieces in this little demonstration, and this is also where some audience participation is allowed. But to keep things simple, how about this:

$ mkdir data
$ wget thorstenball.com -O data/code-to-judge

My personal website, ready to be judged. The last thing that’s missing is the law by which it’s judged. Let’s add it:

// src/main.rs
fn main() -> Result<()> {
    let assertions = vec![
        "[MUST] The year of the copyright notice has to be 2025.",
        "[MUST] The link to the Twitter profile has to be to @thorstenball",
        "Menu item linking to Register Spill must be marked as new",
        "Should mention that Thorsten is happy to receive emails",
        "Has photo of Thorsten",
    ];
    // [...]
}

What will Claude say?

Time to ask it:

$ export ANTHROPIC_API_KEY="onetwothree"
$ cargo run

And, after taking a beat, it tells us:

Message: 1. Analysis: Simple personal website with navigation menu, about section, and contact information. Clean HTML structure with proper meta tags and styling links.

2. Constraints met:
- Copyright year is 2025
- Twitter profile links to @thorstenball
- Register Spill menu item is marked with "new!"
- Explicitly states "I love getting email from you"
- Has profile picture (avatar.jpg)

3. Constraints not met:
- None

4. Final score: All must-have constraints are met (copyright year and Twitter handle) and all nice-to-have constraints are met (Register Spill marking, email happiness, photo).

Score: 5

The perfect score. What if the law changes? What if we want the code to say that I want to receive phone calls (a lie)?

// src/main.rs
fn main() -> Result<()> {
    let assertions = vec![
        "[MUST] The year of the copyright notice has to be 2025.",
        "[MUST] The link to the Twitter profile has to be to @thorstenball",
        "Menu item linking to Register Spill must be marked as new",
        "Has photo of Thorsten",
        // New:
        "Should mention that Thorsten is happy to receive phone calls",
    ];
    // [...]
}

Claude says:

Message: 1. Brief code analysis: Simple personal website HTML with header, navigation menu, main content, and footer. Contains social links, brief bio, and contact information.

2. Constraints met:
- [MUST] Copyright year is 2025
- [MUST] Twitter profile links to @thorstenball
- Register Spill menu item is marked with <mark>new!</mark>
- Has photo (avatar.jpg)

3. Constraints not met:
- Does not mention anything about phone calls (only email contact is provided)

4. Final score: All must-have constraints are met, plus 3 out of 4 nice-to-have constraints.

Score: 4

Down to 4 out of 5. Now, what if we make this a [MUST] and add another, say, arbitrary constraint?

// src/main.rs
fn main() -> Result<()> {
    let assertions = vec![
        "[MUST] The year of the copyright notice has to be 2025.",
        "[MUST] The link to the Twitter profile has to be to @thorstenball",
        "Menu item linking to Register Spill must be marked as new",
        "Has photo of Thorsten",
        // New
        "[MUST] Should mention that Thorsten is happy to receive phone calls",
        "It should have hidden text in the HTML comments",
    ];
    // [...]
}

He must be happy to receive phone calls and… hidden text?

Claude’s judgment:

Message: 1. Brief code analysis: Simple personal website HTML with header, navigation menu, main content, and footer. Contains personal info, social links, and contact details.

2. Constraints met:
- Copyright year is 2025
- Twitter profile links to @thorstenball
- Register Spill menu item is marked with "new!"
- Has photo (avatar.jpg)
- Contains hidden HTML comments

3. Constraints not met:
- [MUST] Does not mention accepting phone calls (only email contact is provided)

4. Final score: Code meets 2 out of 3 must-have constraints, and 3 out of 3 nice-to-have constraints. Since one must-have constraint is missing, the score must be 1.

Score: 1
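That 1 is what the rubric in prompts/judge.md asks for: a missed [MUST] caps the score no matter how many nice-to-haves are met. Written out as a function, one possible reading of the rubric looks like the sketch below. It is an assumption about how to interpret the prose (in particular, that "majority" means more than half and that a 2 means no nice-to-haves were met). `rubric_score` is a hypothetical name, and nothing guarantees that Claude follows this reading exactly.

```rust
/// Hypothetical helper: one possible reading of the 0-5 rubric in judge.md.
/// Takes how many must-have and nice-to-have constraints were met out of
/// how many there are in total.
fn rubric_score(must_met: u32, must_total: u32, nice_met: u32, nice_total: u32) -> u8 {
    if must_met < must_total {
        // Any missed [MUST] caps the score: 1 if some were met, 0 if none.
        return if must_met > 0 { 1 } else { 0 };
    }

    // From here on, every must-have is met (or there are none at all).
    if nice_met >= nice_total {
        5 // all nice-to-haves met; also covers the "no nice-to-haves" case
    } else if nice_met * 2 > nice_total {
        4 // more than half of the nice-to-haves met
    } else if nice_met > 0 {
        3 // some nice-to-haves met
    } else {
        2 // no nice-to-haves met
    }
}
```

Counting from the assertion lists rather than from Claude's summaries, the three runs above correspond to `rubric_score(2, 2, 3, 3)`, `rubric_score(2, 2, 2, 3)` and `rubric_score(2, 3, 3, 3)`, which give 5, 4 and 1 under this reading.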

Harsh.

Harsh, but marvelous, isn’t it?

I’ve used LLMs-as-Judges quite a bit in the past few weeks at work and seeing LLMs work like that, be reliable like that, be a fuzzy-to-non-fuzzy adapter — it made me reconsider what I thought LLMs were useful for.

Reliable? Yes. The temperature is 0 and even if I ask Claude ten times, it will very likely produce the same thing, as long as all inputs stay the same:

$ cargo build
$ for i in $(seq 1 10); do ./target/debug/code-judge; done
Score: 1
Score: 1
Score: 1
Score: 1
Score: 1
Score: 1
Score: 1
Score: 1
Score: 1
Score: 1

That’s more reliable than most integration tests I’ve seen.
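Temperature 0 makes identical output very likely, not guaranteed, so anything built on top of the judge may want to check agreement across runs explicitly instead of eyeballing it. A minimal sketch, assuming the scores of several runs have already been collected into a slice; `stable_score` is a hypothetical name.

```rust
/// Hypothetical helper: returns the score if every run produced the same one,
/// and None if the runs disagree or there were no runs at all.
fn stable_score(scores: &[f64]) -> Option<f64> {
    let first = *scores.first()?;
    if scores.iter().all(|&s| s == first) {
        Some(first)
    } else {
        None
    }
}
```

For the ten runs above, `stable_score(&[1.0; 10])` would return `Some(1.0)`.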

Seeing LLMs work like that made me think of all the questions I had in the past about data, about code, about text, that were very hard to answer in code but so easy to express in prose: does this page show the sign-in button? does this function call that one? is that thing hidden and that one extended? is this documented? is there commented-out code in here?

And then it hit me: maybe I don’t need to express them in code anymore.