Building an AI Code Reviewer: What I Learned Wiring LLMs Into Real Code

I've been using Claude, Copilot and ChatGPT for months, but I had no idea how to actually wire an LLM into my own code. So I built a code reviewer CLI to figure it out. Turns out, making AI not lie to you is harder than I expected.

I chose code review because it's something I do regularly anyway, and I wanted to experiment with automating parts of it. Plus it's a focused task with clear success criteria: either the feedback is useful or it isn't.

What started as a weekend experiment turned into a deep dive into prompt engineering, LLM integration patterns, and the very real problem of AI confidently making things up. Here's what I learned.

Note: I'm not an ML engineer or AI researcher... just a developer learning by building. Take my explanations with appropriate skepticism, and feel free to challenge my ideas in the comments.


The Experiment

I wanted to answer a few questions:

  • How do you actually talk to different LLM APIs?
  • What makes a prompt work (or fail spectacularly)?
  • Can local models running on my laptop compete with cloud APIs?
  • How do you handle it when an AI lies to you with complete confidence?

So I built Annoying Teammate Code Reviewer, a CLI tool that reviews your git diffs using LLMs. It supports local models (Ollama) and cloud providers (Claude, OpenAI, Gemini), and you can configure its personality from "friendly mentor" to "that picky teammate who nitpicks everything."


Disclaimer upfront: This is experimental. LLMs hallucinate. The tool might invent issues that don't exist. Always use human judgment.


Discovery 1: Prompts Are Code, Not Conversation

I used to think prompts were just "asking the AI nicely." Turns out, they're closer to writing code: structure matters, order matters, and small changes have big effects.

The Persona System Actually Works

I was skeptical, but giving the AI a persona with specific catchphrases and pet peeves produces noticeably different outputs:

// The reviewer name is configurable (or can be omitted for a neutral reviewer)
export const REVIEWER_NAME = "Emi";

// The personality is fully customizable - set to empty string for neutral reviews
export const REVIEWER_PERSONALITY_CUSTOMIZATION = `
You are ${REVIEWER_NAME}, a senior developer known for thorough but fair code reviews.

Your review style:
- Direct and concise - no fluff, get to the point
- Focuses on bugs, logic errors, and maintainability over style nitpicks
- Uses dry humor sparingly ("This works, but future-you will hate past-you for it")
- Acknowledges good code briefly ("Nice use of early returns here")
- When there's nothing wrong, just says "Looks good!" - never invents issues

Your pet peeves (flag these when you see them):
- Overly clever code that's hard to read
- Missing error handling for things that can fail
- Magic numbers without explanation
- Functions doing too many things
- Copy-pasted code that should be abstracted

Your catchphrases:
- "This could bite us later..."
- "Have you considered...?"
- "Nitpick, but..."
- "Future maintainers will thank you if..."
`;

Here's how this actually changes the output. Same code, different personalities:

Reviewing this code:

const data = await fetch('/api/users');

Neutral reviewer (personality disabled):

"Consider adding error handling for the fetch operation."

Emi (with personality enabled):

"This could bite us later... what happens when fetch() throws? Right now: silent failure and a confused user. Nitpick, but add a try/catch. Also, 'data' is vague - userResponse? apiData? Future maintainers will thank you if you're specific."

The personality isn't just flavor - it changes what the model notices and how it explains things. The "pet peeves" act as a filter that draws attention to specific patterns.

Why does this work? My understanding is that the model has seen these communication patterns in its training data. When you say "use dry humor sparingly", you're activating associations with a certain review style. It's not magic... it's pattern matching at scale.
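
For completeness, the "neutral reviewer" in the comparison above is just the empty-string case mentioned in the code comment. Here's a minimal sketch of how that toggle can work; the helper and constant names are mine for illustration, not the tool's actual exports:

// Sketch only: fall back to a plain instruction when no personality is configured.
// NEUTRAL_INSTRUCTIONS and buildPersonaSection are illustrative names, not the tool's exports.
const NEUTRAL_INSTRUCTIONS = "You are a code reviewer. Be concise, factual, and specific.";

export function buildPersonaSection(customization: string): string {
  // An empty REVIEWER_PERSONALITY_CUSTOMIZATION string means "neutral reviewer" mode.
  return customization.trim().length > 0 ? customization : NEUTRAL_INSTRUCTIONS;
}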

Prompt Order Matters (Especially for Local Models)

Here's something that surprised me: for smaller local models, I had to put the diff first in the prompt, before the instructions.

This is my understanding from reading about attention mechanisms: these models have limited context windows and tend to weight recent text more heavily. If your instructions come after a 500-line diff, they might get deprioritized.

// From buildReviewPrompt() - diff comes first, then instructions
export function buildReviewPrompt(diff: string, context: ReviewContext): string {
  const filesContext = context.files.length
    ? `\nFiles changed (${context.fileCount}): ${context.files.join(", ")}`
    : "";

  return `# Code Review

## The Diff to Review

Branch: ${context.branch}
Mode: ${context.mode}${filesContext}

${diff}

## Your Task

${REVIEWER_PERSONALITY}

## Project Standards (Reference)
${PROJECT_STANDARDS}

---

Now write your review of the diff above. Remember: ONLY comment on code shown in the diff.
`;
}

Cloud models with longer context windows are more forgiving, but this taught me that prompt engineering isn't one-size-fits-all. What works for GPT-4 might fail for a 7B local model.
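
One way to handle that difference (a sketch of the idea, not necessarily how the tool does it) is to let the prompt builder pick a layout based on a rough context-window hint:

// Sketch: pick the section order from a rough context-window size.
// The 16_000-token threshold is an arbitrary cutoff for this example.
function layoutPrompt(diffSection: string, instructions: string, contextWindowTokens: number): string {
  // Small local models weight recent tokens more heavily, so the instructions go after the diff.
  // Big-context cloud models are fine with instructions first.
  return contextWindowTokens <= 16_000
    ? `${diffSection}\n\n${instructions}`
    : `${instructions}\n\n${diffSection}`;
}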


Discovery 2: Multi-Layer Prompts Reduce Hallucinations (But Don't Eliminate Them)

The AI would confidently invent files, line numbers, and security issues. After iteration, I found that multi-layer defense in prompts helps:

export const REVIEWER_INSTRUCTIONS = `
CRITICAL RULES - FOLLOW EXACTLY:
1. ONLY discuss code that appears in the diff below (lines starting with + or -)
2. NEVER invent or imagine code that isn't shown - this is your #1 failure mode
3. NEVER mention functions, variables, or files unless you can quote them from the diff
4. If the code looks fine, say "None - looks good!" - don't force issues

BEFORE WRITING ANY ISSUE, ASK YOURSELF:
- Can I copy-paste the exact code I'm criticizing from the diff above?
- Is this a REAL problem I can see, or am I guessing/assuming?
- If I cannot point to a specific line in the diff, I must NOT mention it

COMMON MISTAKES TO AVOID:
- Making up function names like "createFileDiff" that aren't in the diff
- Suggesting "extract to utility" for code you can't see
- Generic advice about "repeated logic" without quoting the actual repeated code
`;

Notice the structure: rules at the top, self-check questions in the middle, concrete examples at the bottom. This isn't just telling the model what to do - it's giving it a mental framework to verify its own output.

Did this eliminate hallucinations? No. Did it reduce them significantly? Yes. But I still put a warning in the README because no amount of prompt engineering makes this 100% reliable.

The lesson: LLMs need constraints, not just instructions. You're essentially writing defensive code in natural language.
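
To make the "defensive code" analogy concrete: the "can I copy-paste it from the diff?" rule could also be checked programmatically after the model responds. This is a rough sketch of that idea, not a feature of the tool, and it assumes the review wraps quoted code in backticks:

// Sketch: flag review comments whose quoted code can't be found in the diff.
// The backtick-quoting convention is an assumption, not the tool's actual output format.
function findUngroundedQuotes(review: string, diff: string): string[] {
  const quotes = [...review.matchAll(/`([^`]{4,})`/g)].map((m) => m[1]);
  // Keep only quotes that never appear in the diff text - likely hallucinations.
  return quotes.filter((quote) => !diff.includes(quote));
}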


Discovery 3: Local vs Cloud Is a Real Tradeoff

I wanted to support both local (Ollama) and cloud (Claude, OpenAI, Gemini) models. Here's what I learned:

| Aspect | Local (Ollama) | Cloud APIs |
| --- | --- | --- |
| Privacy | Code never leaves your machine | Sent to external servers |
| Cost | Free after download | Pay per token |
| Quality | Good for focused tasks | Generally better reasoning |
| Speed | Depends on your hardware | Usually fast |
| Context | Typically 4K-32K tokens | Up to 1M tokens |
| Setup | Requires Ollama install + model download | Just need an API key |
| Debugging | Limited (console logs only) | Better logging, API dashboards |

When local works great:

  • Focused reviews of small-medium diffs
  • When privacy matters (proprietary code)
  • When you're doing lots of reviews (cost adds up)

When cloud is worth it:

  • Complex refactoring reviews
  • Very large diffs (better context handling)
  • When you need higher accuracy

The key insight: for code review specifically, local models are surprisingly capable. You're not asking them to write code; you're asking them to analyze code that's right in front of them. That's an easier task.


Discovery 4: Abstraction Saves You

Supporting four different providers (Ollama, Claude, OpenAI, Gemini) forced me to think about abstraction. Each has different:

  • Authentication methods
  • API shapes
  • Streaming protocols (NDJSON vs SSE)
  • Error handling patterns

But my CLI shouldn't care about any of that. So I built a simple interface:

interface LLMProvider {
  readonly name: string;
  readonly displayName: string;
  checkHealth(): Promise<boolean>;
  getAvailableModels(): Promise<string[]>;
  isModelAvailable(modelName: string): Promise<boolean>;
  streamResponse(
    prompt: string,
    onChunk: (chunk: string) => void,
    options?: GenerateOptions
  ): Promise<string>;
  getDefaultModel(): string;
}

Each provider implements this interface, and the rest of the codebase doesn't know or care which one is being used.

// Registration happens once at startup
providerRegistry.registerClass("ollama", OllamaProvider);
providerRegistry.registerClass("claude", ClaudeProvider);
providerRegistry.registerClass("openai", OpenAIProvider);
providerRegistry.registerClass("gemini", GeminiProvider);

// Then the CLI just does this:
const response = await provider.streamResponse(prompt, onChunk, options);

This isn't groundbreaking architecture, but implementing it helped me understand why these patterns exist. When you're dealing with 4+ APIs that do roughly the same thing but all differently, you feel the pain that abstraction solves.
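
For a concrete feel of what sits behind that interface, here's a compressed sketch of an Ollama-style provider. It's illustrative (error handling trimmed, GenerateOptions simplified, default model chosen arbitrarily), but the endpoints and the NDJSON streaming shape match Ollama's HTTP API:

// Sketch of an Ollama-backed provider. GenerateOptions is simplified here;
// the real tool's type may differ.
interface GenerateOptions {
  model?: string;
  temperature?: number;
}

class OllamaProvider implements LLMProvider {
  readonly name = "ollama";
  readonly displayName = "Ollama (local)";
  private baseUrl = "http://localhost:11434";

  async checkHealth(): Promise<boolean> {
    try {
      const res = await fetch(`${this.baseUrl}/api/tags`);
      return res.ok;
    } catch {
      return false;
    }
  }

  async getAvailableModels(): Promise<string[]> {
    const res = await fetch(`${this.baseUrl}/api/tags`);
    const data = await res.json();
    return (data.models ?? []).map((m: { name: string }) => m.name);
  }

  async isModelAvailable(modelName: string): Promise<boolean> {
    const models = await this.getAvailableModels();
    return models.some((m) => m === modelName || m.startsWith(`${modelName}:`));
  }

  getDefaultModel(): string {
    return "qwen3"; // arbitrary default for this sketch
  }

  async streamResponse(
    prompt: string,
    onChunk: (chunk: string) => void,
    options?: GenerateOptions
  ): Promise<string> {
    // Ollama streams newline-delimited JSON objects, each carrying a "response" fragment.
    const res = await fetch(`${this.baseUrl}/api/generate`, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({
        model: options?.model ?? this.getDefaultModel(),
        prompt,
        stream: true,
        options: { temperature: options?.temperature ?? 0.2 },
      }),
    });

    const reader = res.body!.getReader();
    const decoder = new TextDecoder();
    let full = "";
    let buffer = "";

    while (true) {
      const { done, value } = await reader.read();
      if (done) break;
      buffer += decoder.decode(value, { stream: true });
      const lines = buffer.split("\n");
      buffer = lines.pop() ?? ""; // keep any partial JSON line for the next read
      for (const line of lines) {
        if (!line.trim()) continue;
        const chunk = JSON.parse(line).response ?? "";
        full += chunk;
        onChunk(chunk);
      }
    }
    return full;
  }
}

A Claude or OpenAI provider looks the same from the outside; only the auth headers, endpoints, and streaming format (SSE instead of NDJSON) change inside.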


Real Usage: What I've Learned So Far

I've used this on about 15 PRs so far. Here's what happened:

Caught real issues:

  • Missing error handling in 4 different async operations I'd overlooked
  • Unused imports in multiple files (small wins, but they add up)
  • A magic number that would've confused future-me
  • Overly complex conditional that could be simplified

False positives:

  • Complained about "missing types" in a .js file (technically correct, but not actionable)
  • Invented a security issue in perfectly safe code (classic hallucination)
  • Suggested extracting a function that was only used once

Model differences:

  • deepcoder (Ollama): Fast, decent for small diffs, occasionally generic
  • Claude Sonnet: Better reasoning, caught architectural concerns, slower/costs money
  • qwen3 (Ollama): Best balance of speed and quality for local

The 80/20 rule applies: About 80% of suggestions are valid, 20% are noise. But that 80% catches stuff I regularly miss in my own reviews - the mundane things like error handling, unused code, and unclear names.

The tool won't replace human review, but it's useful as a second pair of eyes before pushing code. Kind of like running a linter, but for logic and maintainability instead of syntax.


What I'm Still Figuring Out

This project raised more questions than it answered:

Token budgeting is hard.

When a diff exceeds the context window, I truncate by keeping complete files and prioritizing source over config. But this breaks on monorepo diffs where a change in packages/core affects packages/ui. The model can't see the connection.

Current approach: batch reviews + consolidation pass.

Problem: the consolidation sometimes loses cross-file context.

If you've solved this, I'm all ears.
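
For reference, the truncation I described above is roughly this shape. It's a simplified sketch; the type and helper names, the config-file heuristic, and the ~4-characters-per-token estimate are all assumptions for the example:

// Sketch of "keep complete files, prioritize source over config" truncation.
// estimateTokens uses the common ~4 characters per token rule of thumb.
const estimateTokens = (text: string) => Math.ceil(text.length / 4);

const isConfigFile = (file: string) =>
  /\.(json|ya?ml|toml|lock)$/.test(file) || file.includes("config");

interface PerFileDiff {
  file: string;
  patch: string; // the complete per-file diff, never split mid-file
}

function truncateDiff(files: PerFileDiff[], tokenBudget: number): PerFileDiff[] {
  // Source files first, config/lockfiles last, so config is the first thing dropped.
  const ordered = [...files].sort(
    (a, b) => Number(isConfigFile(a.file)) - Number(isConfigFile(b.file))
  );

  const kept: PerFileDiff[] = [];
  let used = 0;
  for (const f of ordered) {
    const cost = estimateTokens(f.patch);
    if (used + cost > tokenBudget) continue; // drop whole files, keep the rest intact
    kept.push(f);
    used += cost;
  }
  return kept;
}

The weakness is exactly the monorepo case: dropping packages/ui to fit the budget hides the code that depends on the packages/core change being reviewed.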

Batch reviews are tricky.

For huge diffs, I split them into batches and review each separately, then consolidate. But the consolidation step sometimes loses nuance. How do you maintain context across multiple LLM calls without blowing up the token budget?

Evaluation is subjective.

How do you measure if a code review is "good"? I don't have automated tests for review quality, just manual spot-checking. Is there a way to programmatically evaluate review quality? Compare against human reviews? I'm not sure.

Model updates break things.

I've had to update the prompt twice when model behavior changed after provider updates. Models get retrained, APIs change, and suddenly your carefully-crafted prompt stops working as well. Is there a way to make prompts more robust to model changes, or is this just the reality of building on top of someone else's black box?


Why This Matters (Beyond My Learning)

Building this taught me that AI tooling isn't magic - it's engineering. You can:

  • Shape model behavior through careful prompt design
  • Build useful tools without being an ML expert
  • Fail fast and iterate (most of my early prompts were terrible)

If you're curious about AI but intimidated by the math, start here: build something small that solves a real problem for you. The best way to understand these tools is to break them.

I've caught real bugs with this tool, but more importantly, building it demystified LLM integration for me. Prompts aren't magic - they're code. Models aren't oracles - they're pattern matchers. Understanding that changes how you use them.


Try It Yourself

The project is open source: GitHub

# With Ollama (local, free)
ollama pull deepcoder
npm run review

# With cloud providers
export GEMINI_API_KEY=your-api-key
npm run review -- --provider gemini

It's experimental. It will hallucinate sometimes. But it's been useful for my workflow, and building it taught me more than a dozen tutorials could.

I encourage you to clone it, play around with the prompts, swap providers, tweak the personality, break things. That's how I learned.


What's your experience?

If you have ideas for better token budgeting, consolidation strategies, or prompt patterns that reduce hallucinations - I'm all ears. Open an issue or PR, or just drop a comment below.


This started as a learning project and still is. The code is heavily commented with explanations if you want to dig into the implementation details.
