How I Pick Models for My Specialized AI Agents
I built specialized agents for specific tasks - then figured out which models actually work for each one. Here's the framework I use.
I use OpenCode for coding sessions and OpenClaw for agentic workflows. After a few months of that, I started building specialized agents for specific things - one for debugging, one for writing tests, one for kicking off CI pipelines.
Each agent had different needs. The debugging one needed raw reasoning chops. The test writer needed to understand context across a whole file. The CI agent mostly needed to format JSON correctly and not hallucinate curl commands.
Generic leaderboard rankings didn’t help me there. “Best overall” is a fine headline for a benchmark post, but not useful when your debugging agent is stuck in a loop because you grabbed the wrong model.
So I built a way to actually look at this stuff - benchmark data filtered by task type, with price and speed factored in, so I'm not grabbing the "best" model for a job where one that's 2x worse but 10x cheaper would do just as well.
The dropdown below controls all three charts. Pick a task type and everything updates. The stars in the label tell you how directly the benchmark maps to that task. ★★★ means it actually measures that skill. ★☆☆ means it’s a rough proxy - treat it as a signal, not a verdict.
First question: for this task, which models are actually the strongest? Not vibes, just ranked by benchmark score. This is the chart I check first when a new model drops.
Rankings tell you quality, but in OpenCode latency matters too. A model that's 5% better but noticeably slower breaks the flow. This plots quality against tokens per second so you can see where the fast models land relative to the best ones.
Then cost. Some of these models are 20-50x cheaper than others for similar quality. The log scale on the X-axis makes that spread visible. Top-left is the sweet spot.
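To make that tradeoff concrete, here's a minimal quality-per-dollar sketch of the idea behind the chart. The model names and numbers are made up for illustration, not from the dashboard's data:

```python
import pandas as pd

# Illustrative rows only; the real data comes from the fetchers described below.
models = pd.DataFrame({
    "model": ["frontier-large", "mid-tier", "open-cheap"],
    "quality_index": [92, 88, 85],
    "usd_per_m_tokens": [15.0, 3.0, 0.30],
})

# Quality per dollar: the points that drift toward the top-left of the
# log-scale chart are the ones this ratio rewards.
models["quality_per_usd"] = models["quality_index"] / models["usd_per_m_tokens"]
best_value = models.sort_values("quality_per_usd", ascending=False).iloc[0]["model"]
```

A 7-point quality gap buys a 50x price gap here, which is exactly the kind of spread the log axis makes visible.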
What surprised me
A few things I didn’t expect once I had real data in front of me.
MiniMax M2.7 keeps showing up at the top. I hadn’t paid much attention to MiniMax before this. It’s open-weight, it’s cheap, and it scores competitively on general intelligence. Worth watching.
Kimi K2 Thinking is strong for reasoning. It shows up in the top 3 for the Reasoning bucket alongside Gemini 3.1 Pro and GPT-5.4-Mini. I hadn’t been using it at all.
The Designing buckets are mostly proxies. UI/UX reasoning maps to general reasoning benchmarks because there’s no benchmark that actually tests whether a model can think about layout and user experience. The ★☆☆ rating is honest. I use those charts as a rough signal, not a verdict.
LiveBench only had 3 of the 6 categories I expected. Coding, language, and instruction following are there. Reasoning and math aren’t in the current dataset slice, so the Reasoning bucket falls back to Artificial Analysis’s math index. Benchmark data is always incomplete.
How it works
The dashboard pulls from five sources, each filling a different gap:
| Source | What it provides | Auth |
|---|---|---|
| Artificial Analysis | Quality indices, speed, latency, pricing | Free API key |
| LiveBench | Task-category benchmark scores | None |
| OpenRouter | Context windows, real-time pricing, open-weight flag | None |
| LiteLLM | Broadest model coverage, capability flags | None |
| Arena.ai | Image generation ELO rankings | None |
Artificial Analysis is the backbone for quality and speed. They run continuous benchmarks across hundreds of models and publish quality indices for intelligence, coding, and math. LiveBench fills in task-specific scores. OpenRouter is the best source for identifying which models are actually open-weight and gives real-time pricing. LiteLLM covers the long tail.
Every fetch tries the live API first, then falls back to a local CSV cache, then to seed data shipped with the notebook. So it always has something to show, but it’s always trying to get fresh data:
```python
import os
import pandas as pd

def cached_fetch(name: str, fetch_fn, cache_dir: str = "cache") -> pd.DataFrame:
    try:
        df = fetch_fn()
        os.makedirs(cache_dir, exist_ok=True)
        df.to_csv(f"{cache_dir}/{name}.csv", index=False)
        return df
    except Exception as e:
        print(f"fetch failed for {name}: {e}")
        for cache_path in [f"{cache_dir}/{name}.csv", f"{cache_dir}/seed/{name}.csv"]:
            if os.path.exists(cache_path):
                return pd.read_csv(cache_path)
        return pd.DataFrame()
```
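To see the fallback behave, here's a self-contained demo - `cached_fetch` is repeated in condensed form, and the fetchers are stubs rather than the notebook's real ones:

```python
import os
import tempfile
import pandas as pd

def cached_fetch(name, fetch_fn, cache_dir="cache"):
    # Same logic as above, condensed for the demo.
    try:
        df = fetch_fn()
        os.makedirs(cache_dir, exist_ok=True)
        df.to_csv(f"{cache_dir}/{name}.csv", index=False)
        return df
    except Exception:
        for path in (f"{cache_dir}/{name}.csv", f"{cache_dir}/seed/{name}.csv"):
            if os.path.exists(path):
                return pd.read_csv(path)
        return pd.DataFrame()

cache = tempfile.mkdtemp()

# A successful fetch warms the cache...
cached_fetch("demo", lambda: pd.DataFrame({"model": ["gpt-4o"], "score": [88]}), cache)

def broken_fetch():
    raise RuntimeError("API down")

# ...so a later failing fetch still serves the cached rows.
df = cached_fetch("demo", broken_fetch, cache)
```

The seed-data layer works the same way: ship a `seed/` directory of CSVs with the notebook and the last `os.path.exists` check picks them up when both the API and the local cache are empty.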
The normalization problem
This was the hard part.
Five sources, five naming schemes. The same model across sources:
- LiveBench: `chatgpt-4o-latest-2025-01-29`
- OpenRouter: `openai/gpt-4o-2024-08-06`
- LiteLLM: `gpt-4o`
- Artificial Analysis: `gpt-4o`
To join these into one row I built a three-layer resolution pipeline:
```python
from rapidfuzz import fuzz, process

def resolve_canonical(raw_name: str, source: str, aliases: dict) -> str | None:
    # Layer 1: exact match against a hand-maintained alias table
    for canonical, info in aliases.items():
        source_variants = info.get(source, [])
        if isinstance(source_variants, str):
            source_variants = [source_variants]
        if raw_name.lower() in [v.lower() for v in source_variants]:
            return canonical
    # Layer 2: cleaned exact match (strip dates, provider prefixes, version suffixes)
    cleaned = clean_model_name(raw_name, source)
    for canonical in aliases.keys():
        if cleaned == _clean_canonical(canonical):
            return canonical
    # Layer 3: rapidfuzz similarity >= 90 as a safety net
    result = process.extractOne(
        cleaned, [_clean_canonical(c) for c in aliases.keys()],
        scorer=fuzz.ratio, score_cutoff=90
    )
    if result:
        _, _, idx = result
        return list(aliases.keys())[idx]
    return None
```
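`clean_model_name` isn't shown above; here's a minimal sketch of what layer 2 assumes it does - lowercase, drop the provider prefix, strip trailing release dates and `-latest` markers. The exact rules in the notebook may differ:

```python
import re

def clean_model_name(raw: str, source: str = "") -> str:
    # Assumed behavior: lowercase, drop provider prefix, then strip
    # trailing release dates and "-latest" markers.
    name = raw.lower().split("/")[-1]                 # openai/gpt-4o-... -> gpt-4o-...
    name = re.sub(r"-\d{4}-\d{2}-\d{2}$", "", name)   # drop trailing YYYY-MM-DD
    name = re.sub(r"-latest$", "", name)              # drop "-latest"
    return name
```

Under these rules `openai/gpt-4o-2024-08-06` cleans down to `gpt-4o`, which layer 2 can match exactly; oddballs like `chatgpt-4o-latest-2025-01-29` only clean to `chatgpt-4o` and have to be caught by the alias table or the fuzzy layer.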
The alias table covers about 74 models with their exact identifiers per source. Layer 3 catches anything that slips through. The 90% threshold is conservative on purpose because GPT-4o and GPT-4o Mini score around 85, and a false positive there would silently corrupt the data.
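Once every source resolves to the same canonical name, the join itself is unexciting - a plain merge on that name. The columns here are illustrative, not the dashboard's actual schema:

```python
import pandas as pd

# Each source contributes its own columns, keyed by canonical name.
livebench = pd.DataFrame({"canonical": ["gpt-4o"], "livebench_coding": [73.1]})
openrouter = pd.DataFrame({"canonical": ["gpt-4o"], "context_window": [128_000]})

# Outer join keeps models that only one source knows about.
merged = livebench.merge(openrouter, on="canonical", how="outer")
```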
Mapping benchmarks to tasks
Benchmark categories don’t map cleanly to real tasks. “Reasoning” in LiveBench is math and logic puzzles. That’s a reasonable proxy for agentic reasoning but it’s not the same thing.
Each task bucket has a confidence level baked in:
```python
def get_task_bucket_quality_col(df, bucket_name, task_mappings):
    cfg = task_mappings['task_buckets'][bucket_name]
    # AA index first, then LiveBench category
    # No averaging across sources, raw benchmark data only
    for idx in cfg.get('aa_indices', []):
        col = AA_INDEX_MAP.get(idx)
        if col and col in df.columns and df[col].notna().any():
            return col
    for cat in cfg.get('livebench_categories', []):
        col = f"livebench_{cat}"
        if col in df.columns and df[col].notna().any():
            return col
    return None
```
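A self-contained sketch of that fallback order - the function is condensed from above, and the shapes of `AA_INDEX_MAP` and the task config are assumptions, not the notebook's exact definitions:

```python
import pandas as pd

AA_INDEX_MAP = {"math": "aa_math_index"}  # assumed shape of the real map

def get_task_bucket_quality_col(df, bucket_name, task_mappings):
    # Condensed from above: AA index first, then LiveBench category.
    cfg = task_mappings["task_buckets"][bucket_name]
    for idx in cfg.get("aa_indices", []):
        col = AA_INDEX_MAP.get(idx)
        if col and col in df.columns and df[col].notna().any():
            return col
    for cat in cfg.get("livebench_categories", []):
        col = f"livebench_{cat}"
        if col in df.columns and df[col].notna().any():
            return col
    return None

mappings = {"task_buckets": {"Reasoning": {
    "aa_indices": ["math"], "livebench_categories": ["reasoning"]}}}

# AA math index present -> it wins.
with_aa = pd.DataFrame({"aa_math_index": [74.0], "livebench_reasoning": [61.2]})
# AA column absent -> falls through to the LiveBench category.
without_aa = pd.DataFrame({"livebench_reasoning": [61.2]})

col_a = get_task_bucket_quality_col(with_aa, "Reasoning", mappings)
col_b = get_task_bucket_quality_col(without_aa, "Reasoning", mappings)
```

This is why the missing LiveBench reasoning category mentioned earlier degrades gracefully: the bucket just resolves to a different column instead of breaking.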
Pick a task, look at the data - that’s the whole point.