Whitepaper · v1

The clearest place to find what a local LLM is actually good at

robotanalyze.com is a search and comparison engine for open-weight & API language models and the GPUs that run them — built on one idea: organise models by capability, and be honest about how much to trust every number.

1. The problem

Picking a model should start from a single, practical question: “Which model should I run, will it fit my hardware, how fast, and what will it cost?” — for both local open-weight models and paid APIs, in one place.

Existing resources answer a different question. Leaderboards collapse everything into one rank; vendor pages quote a hand-picked benchmark that “runs hot”; aggregators auto-generate numbers that can’t be traced. Almost none address what a hobbyist actually needs to know (will this run on my 4090, and how fast?), and almost none tell you how far to trust a given score.

2. Thesis: capability + trust

Two principles drive the whole platform:

▸ Organise by capability, not one score. Coding, reasoning, math, tool-use, long-context, multilingual, agentic: each is a separate axis, because no single number captures a model’s shape.
▸ Label the trust of every score. Per benchmark we record whether it is vendor-reported or independent, how it was graded, whether it is saturated, its contamination risk, whether it used a held-out private split, and when it was last verified. That credibility layer is the real differentiator.

For local models we add what cloud leaderboards ignore entirely: effective context (not the advertised number), quantisation quality per bit-width, VRAM fit, and real tokens/sec on actual hardware.

3. Hard rule: never invent data

Every figure is sourced or modelled, never fabricated. Unknown values are stored as null with a data_gaps note rather than guessed. Architecture comes straight from each model’s Hugging Face config.json, so it is factual rather than estimated. Benchmark numbers are attributed to a source and labelled vendor vs independent. Speeds are either modelled from first principles or crowdsourced from real runs, and are clearly marked as such.
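As a minimal sketch of the "null over a guess" rule: the helper below copies the fields it is given and records every missing one as None, listing it in data_gaps. The function name, the record shape, and the choice to store data_gaps as a list of field names are assumptions for illustration; the source only says that a data_gaps note is kept.

```python
def build_record(name, raw, fields):
    """Copy known fields; store unknown ones as None and name them in data_gaps."""
    record = {"name": name}
    gaps = []
    for field in fields:
        value = raw.get(field)
        if value is None:
            record[field] = None      # explicit null, never a guessed default
            gaps.append(field)
        else:
            record[field] = value
    record["data_gaps"] = gaps        # assumption: gaps kept as a list of field names
    return record


example = build_record(
    "example-model",  # hypothetical model name
    {"num_hidden_layers": 32, "num_key_value_heads": 8},
    ["num_hidden_layers", "num_key_value_heads", "decode_tok_s"],
)
```

Here decode_tok_s has no source, so it stays None and shows up in data_gaps instead of being estimated silently.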

4. Architecture

Self-hosted, no cloud. A single Docker stack on a Raspberry Pi:

▸ Postgres: holds the data and all heavy logic as SQL functions (fit, scoring, search, compare, cost). A hard rule: computation lives in SQL, not the UI layer.
▸ PostgREST: turns the database into a REST API automatically; the Next.js app proxies to it and attaches the user’s token so writes run under row-level security as that user.
▸ Next.js: the UI only. Design system: violet accent, Oxanium + JetBrains Mono, light/dark.
▸ Auth: lightweight and self-hosted. Passwords are bcrypt-hashed in Postgres, and Next.js mints a signed JWT that PostgREST trusts. Row-level security everywhere: public read on stats, users write only their own rows.
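To make the token hand-off concrete, here is a rough sketch of minting and checking a signed JWT with only the standard library. It is written in Python for illustration even though the platform mints tokens in Next.js. The HS256 algorithm, the shared secret, the claim names and the web_user role are all assumptions; PostgREST by default reads the database role from a role claim, but the platform's actual claims are not given in the source.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def b64url(data: bytes) -> str:
    """Base64url without padding, as used by JWTs."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def mint_jwt(secret: bytes, user_id: str, role: str,
             ttl_s: int = 3600, now: Optional[int] = None) -> str:
    """Build an HS256-signed token carrying the user id and database role."""
    issued = int(time.time()) if now is None else now
    header = {"alg": "HS256", "typ": "JWT"}
    claims = {"sub": user_id, "role": role, "exp": issued + ttl_s}
    signing_input = ".".join(
        b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, claims)
    )
    signature = hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest()
    return signing_input + "." + b64url(signature)


def verify_jwt(secret: bytes, token: str) -> dict:
    """Check the signature and return the claims (expiry is NOT checked here)."""
    signing_input, _, signature = token.rpartition(".")
    expected = b64url(
        hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest()
    )
    if not hmac.compare_digest(signature, expected):
        raise ValueError("bad signature")
    payload = signing_input.split(".")[1]
    padded = payload + "=" * (-len(payload) % 4)  # restore stripped padding
    return json.loads(base64.urlsafe_b64decode(padded))
```

The point of the design is that the UI never decides what a user may write: it only forwards a token, and row-level security policies in Postgres do the enforcement.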

5. The fit model: three honest states

Binary “fits / doesn’t fit” is wrong; the truth has three positive states, plus a plain “no”:

▸ fits_vram: fully in usable VRAM → fast (full bandwidth-bound speed).
▸ fits_with_offload: only fits by spilling layers to system RAM → runs, but 5–20× slower.
▸ multi_device: needs 2+ GPUs / a split → runs with a cross-device penalty.
▸ no: doesn’t fit even with offload.

The fit state is computed with exact GQA-aware KV-cache math and the model’s real layer / KV-head / head-dim values, plus a usable-memory fraction, because a 32 GB Mac (unified memory, part of it reserved for the OS) is not a 32 GB discrete GPU.

weights_GB    = total_params_B × bits_per_weight / 8        (GGUF byte-table or EXL3 bpw)
kv_cache_GB   = 2 × layers × (context + draft) × (kv_heads × head_dim) × kv_bytes × batch / 1e9
VRAM_total_GB = weights_GB + kv_cache_GB + overhead_GB      (MoE: weights use TOTAL params)
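The three formulas above translate directly into code. The sketch below is one possible reading, not the platform's SQL: the function names, the 1 GB default overhead and the example architecture (32 layers, 128-dim heads, 8 KV heads) are assumptions chosen for round numbers. The leading factor of 2 covers the separate key and value tensors stored per layer.

```python
def weights_gb(total_params_b: float, bits_per_weight: float) -> float:
    """Weight memory in GB; params are in billions, so the 1e9 factors cancel."""
    return total_params_b * bits_per_weight / 8


def kv_cache_gb(layers: int, context: int, kv_heads: int, head_dim: int,
                kv_bytes: float = 2.0, draft: int = 0, batch: int = 1) -> float:
    """KV-cache memory in GB: K and V (the factor 2) per layer, position and sequence."""
    return 2 * layers * (context + draft) * (kv_heads * head_dim) * kv_bytes * batch / 1e9


def vram_total_gb(weights: float, kv_cache: float, overhead_gb: float = 1.0) -> float:
    """Total footprint; overhead_gb is an assumed placeholder, not a measured value."""
    return weights + kv_cache + overhead_gb


# A 27B model at 2.5 bits per weight, as in the EXL3 example in the text.
w = weights_gb(27, 2.5)

# Hypothetical architecture: GQA with 8 KV heads vs. full attention with 32.
kv_gqa = kv_cache_gb(layers=32, context=8192, kv_heads=8, head_dim=128)
kv_mha = kv_cache_gb(layers=32, context=8192, kv_heads=32, head_dim=128)
```

With these inputs the GQA cache is a quarter of the full-attention one, which is why the calculation uses the real KV-head count rather than the number of attention heads.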

The calculator covers both quant families and the runtime knobs that quietly move VRAM:

▸ GGUF or EXL3. GGUF quants use a measured byte-per-weight table (FP16 → Q3); EXL3 (ExLlamaV3) is driven directly by its bits-per-weight (weights = params × bpw ÷ 8), so a 27B model at 2.5 bpw is ≈ 8.4 GB of weights (27 × 2.5 ÷ 8 ≈ 8.44).
▸ KV-cache precision. FP16 / Q8 / Q4 cache: halving the bytes per cached value roughly halves the KV footprint, which matters most at long context.
▸ Runtime factors. Speculative draft tokens (e.g. dflash) add positions to the cache, and concurrent requests multiply it, because each in-flight sequence needs its own KV cache. This is the “deceptively small, then blows up” effect: a model that fits at batch 1 can overflow VRAM at batch 2.
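The batch effect is easiest to see by feeding two totals into a fit classifier. The sketch below is a simplified stand-in for the real fit function: the 0.9 usable-memory fraction, the order in which multi_device and fits_with_offload are tried, and all the gigabyte figures are assumptions for illustration.

```python
def fit_state(required_gb: float, vram_gb: float, system_ram_gb: float = 0.0,
              gpu_count: int = 1, usable_fraction: float = 0.9) -> str:
    """Classify a memory requirement into one of the four fit states."""
    usable_vram = vram_gb * usable_fraction        # per-GPU usable memory (assumed fraction)
    if required_gb <= usable_vram:
        return "fits_vram"
    if gpu_count > 1 and required_gb <= usable_vram * gpu_count:
        return "multi_device"                      # assumption: prefer a GPU split over offload
    if required_gb <= usable_vram * gpu_count + system_ram_gb * usable_fraction:
        return "fits_with_offload"
    return "no"


# Hypothetical model: 14 GB of weights, 5 GB of KV cache per sequence, 1 GB overhead.
batch_1_total = 14 + 5 * 1 + 1   # 20 GB
batch_2_total = 14 + 5 * 2 + 1   # 25 GB

state_batch_1 = fit_state(batch_1_total, vram_gb=24, system_ram_gb=64)
state_batch_2 = fit_state(batch_2_total, vram_gb=24, system_ram_gb=64)
```

On this assumed 24 GB card the model sits fully in VRAM at batch 1 and drops to offload at batch 2, even though the weights themselves did not change.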

6. The speed model

Token generation (decode) is memory-bandwidth-bound, not compute-bound. So the headline rule is:

decode_tok/s ≈ memory_bandwidth_GBs / active_weight_GB × efficiency

A MoE model only reads its active experts per token, so it’s fast once it fits. Prefill (prompt processing / time-to-first-token) is the compute-bound part:

prefill_tok/s ≈ FP16_FLOPS × efficiency / (2 × active_params)      (FP16_FLOPS = FP16_TFLOPS × 1e12; ≈ 2 FLOPs per active parameter per token)

The calculator (and the /fit API) report both numbers side by side, because they are limited by different hardware: prefill scales with a card’s TFLOPS, decode with its memory bandwidth. Speculative decoding, flash attention and batching all raise the compute-to-memory ratio, doing more useful work per byte read. The takeaway we teach inline: for local inference, bandwidth matters more than TFLOPS.
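Both estimates fit in a few lines. The sketch below follows the two rules of thumb in this section; the efficiency defaults and the hardware figures in the example are round, made-up numbers rather than measurements of any real card.

```python
def decode_tok_s(bandwidth_gb_s: float, active_weight_gb: float,
                 efficiency: float = 0.6) -> float:
    """Decode speed: every token reads the active weights once from memory."""
    return bandwidth_gb_s / active_weight_gb * efficiency


def prefill_tok_s(fp16_tflops: float, active_params_b: float,
                  efficiency: float = 0.5) -> float:
    """Prefill speed: roughly 2 FLOPs per active parameter per prompt token."""
    flops_per_second = fp16_tflops * 1e12
    flops_per_token = 2 * active_params_b * 1e9
    return flops_per_second * efficiency / flops_per_token


# Hypothetical card: 1000 GB/s of bandwidth and 100 FP16 TFLOPS.
# Hypothetical model: 8B active parameters stored in 5 GB after quantisation.
decode = decode_tok_s(bandwidth_gb_s=1000, active_weight_gb=5, efficiency=0.6)
prefill = prefill_tok_s(fp16_tflops=100, active_params_b=8, efficiency=0.5)
```

Under these assumptions decode comes out near 120 tokens/s and prefill near 3,125 tokens/s. Doubling the bandwidth doubles the first figure and leaves the second untouched, which is the reason the two are reported separately.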

7. Quantisation quality

Quantisation trades accuracy for VRAM savings, and the trade isn’t linear. Q4_K_M is the sweet spot; Q8 is near-lossless; below ~Q3 quality visibly drops. We prefer KL-divergence against the FP16 model over raw perplexity (perplexity hides per-token distortion), and note that MoE models tolerate lower bit-widths than dense models, although their routers are quantisation-sensitive. Real quant-quality evals (e.g. lm-evaluation-harness) are crowdsourced per quant.
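A toy example shows why KL-divergence catches distortion that perplexity misses. Perplexity only looks at the probability given to the correct token, while KL compares the whole next-token distribution. The distributions below are invented for illustration, and the sketch assumes the quantised model never assigns exactly zero probability to a token.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats for two distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def mean_kl(ref_dists, quant_dists):
    """Average per-position KL of the quantised model against the reference."""
    pairs = list(zip(ref_dists, quant_dists))
    return sum(kl_divergence(p, q) for p, q in pairs) / len(pairs)


def perplexity(dists, target_ids):
    """exp of the mean negative log-probability of the correct tokens."""
    nll = -sum(math.log(d[t]) for d, t in zip(dists, target_ids)) / len(target_ids)
    return math.exp(nll)


# Two positions, three-token vocabulary; token 0 is correct at both positions.
fp16 = [[0.7, 0.2, 0.1], [0.5, 0.3, 0.2]]
quant = [[0.7, 0.1, 0.2], [0.5, 0.2, 0.3]]   # same mass on the target, rest reshuffled
targets = [0, 0]

ppl_fp16 = perplexity(fp16, targets)
ppl_quant = perplexity(quant, targets)
divergence = mean_kl(fp16, quant)
```

The two perplexities are identical because the correct token keeps its probability, yet the mean KL is clearly above zero, so sampling from the quantised model would behave differently.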

The fit-calculator covers the quant families people actually run: GGUF k-quants (Q8 → Q3), GGUF i-quants (IQ4 / IQ3 / IQ2), AWQ and GPTQ (4-bit), and the ExLlama families EXL2 / EXL3 sized by their bits-per-weight, so you can plan the exact quant you run, not an approximation.

8. The composite score

A single 0–10 number is a weighted blend, renormalised over whichever axes are present (so a model missing an axis isn’t penalised). It’s documented, not magic:

Benchmark                   Weight
SWE-bench Verified          0.20
GPQA Diamond                0.15
HLE                         0.12
BFCL tools                  0.10
AIME                        0.10
APEX agents                 0.05
MMLU-Pro                    0.10
LMArena coding              0.08
Instruction-following       0.07
Multilingual                0.05
Hallucination (inverted)    0.03
MRCR long-context           0.05

Vendor-reported benchmarks are flagged, because they tend to run hot. Dollar-denominated agentic metrics (Vending-Bench, SWE-Lancer) are stored and shown but deliberately kept out of the blend, because normalising them would require inventing a cap.
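The renormalisation can be sketched as follows. The weights are the ones listed above; as written they add up to 1.10, so the sketch treats them as relative weights and divides by the sum of whichever weights are present. It also assumes each axis score has already been mapped onto a 0–10 scale (with hallucination already inverted), which the source does not spell out.

```python
from typing import Optional

WEIGHTS = {
    "SWE-bench Verified": 0.20,
    "GPQA Diamond": 0.15,
    "HLE": 0.12,
    "BFCL tools": 0.10,
    "AIME": 0.10,
    "APEX agents": 0.05,
    "MMLU-Pro": 0.10,
    "LMArena coding": 0.08,
    "Instruction-following": 0.07,
    "Multilingual": 0.05,
    "Hallucination (inverted)": 0.03,
    "MRCR long-context": 0.05,
}


def composite_score(axis_scores: dict) -> Optional[float]:
    """Weighted mean over the axes that have a score; None if there are none."""
    present = {k: v for k, v in axis_scores.items()
               if k in WEIGHTS and v is not None}
    if not present:
        return None                      # null rather than an invented score
    total_weight = sum(WEIGHTS[k] for k in present)
    return sum(WEIGHTS[k] * v for k, v in present.items()) / total_weight


# A model with only two axes measured; the missing ones do not drag it down.
partial = composite_score({"SWE-bench Verified": 8.0, "GPQA Diamond": 6.0,
                           "AIME": None})
```

Here the result is (0.20 × 8 + 0.15 × 6) ÷ 0.35 ≈ 7.14, the same figure the model would get if every other axis happened to score 7.14 as well.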

9. The trust layer: the differentiator

Almost no site tells you how much to believe a benchmark. We attach, per score:

▸ grading method: deterministic, code-execution, LLM-judge, human, or arena;
▸ vendor vs independent, benchmark version, eval conditions (shots / CoT / sandbox), and whether it used a private held-out split;
▸ a per-metric saturation flag and contamination risk (low / medium / high), so saturated classics (MMLU, HumanEval, GSM8K) are demoted to “context only”, while contamination-resistant sets (LiveCodeBench, FrontierMath) are surfaced;
▸ freshness (last_verified), because eval sets drift and a benchmark that separates models today can saturate within months.
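One way to picture the per-score trust record is as a small typed structure with two helper rules. Apart from last_verified, the field names, the two-tier display rule and the 180-day staleness threshold are assumptions made for this sketch; only the list of recorded attributes comes from the text above.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ScoreTrust:
    benchmark: str
    grading: str                 # deterministic | code-execution | llm-judge | human | arena
    vendor_reported: bool        # False means independently measured
    saturated: bool
    contamination_risk: str      # low | medium | high
    private_split: bool
    last_verified: Optional[date]


def display_tier(trust: ScoreTrust) -> str:
    """Saturated benchmarks are shown for context only; the rest are surfaced."""
    return "context_only" if trust.saturated else "surfaced"


def is_stale(trust: ScoreTrust, today: date, max_age_days: int = 180) -> bool:
    """Treat a score as stale if never verified or verified too long ago."""
    if trust.last_verified is None:
        return True
    return (today - trust.last_verified).days > max_age_days


mmlu = ScoreTrust("MMLU", "deterministic", True, True, "high", False, None)
lcb = ScoreTrust("LiveCodeBench", "code-execution", False, False, "low", False,
                 date(2025, 1, 10))  # illustrative date, not real data
```

Keeping these labels next to the number, rather than in a footnote, is what lets the UI demote or flag a score without hiding it.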

10. Capability taxonomy

Each skill maps to the best current benchmark, with the saturated classics kept only as historical context:

▸ Coding → LiveCodeBench, Aider Polyglot, SWE-bench (→ Pro).
▸ Reasoning → GPQA Diamond, Humanity’s Last Exam.
▸ Math → FrontierMath (AIME/MATH kept as legacy).
▸ Knowledge → MMLU-Pro, SimpleQA-Verified.
▸ Instructions → IFBench (IFEval saturating).
▸ Tools → BFCL v4, τ²-bench.
▸ Agentic / long-horizon → METR time-horizon, Vending-Bench, SWE-Lancer.
▸ Long-context → RULER (effective length), MRCR v2.
▸ Multilingual → Global-MMLU, MGSM.

11. Community layer

Crowdsourced signal, kept honest by rules enforced in the database, not the UI:

▸ Reviews with Reddit-style up/down voting; a review’s score is its net votes (upvotes minus downvotes).
▸ Karma economy: authors earn 10 karma per upvote, capped at 1,000 per review and 2,500 per day. Downvotes never subtract karma; they only lower a review’s score. Self-votes are blocked and downvoting is gated server-side.
▸ Trust stars: peer endorsement of a person (0–100), kept entirely separate from karma. Trust measures who people vouch for; karma measures the volume of valued contributions.
▸ Crowdsourced tokens/sec & quant-quality evals: the highest-value data on the site, contributed by people running the models for real.
▸ Accounts: contributing (reviews, votes, real speeds) needs a free account via email or one-tap Google sign-in; reading stays fully open, anonymous and cookie-free.
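The karma rules above are small enough to state as code. The platform enforces them in the database; the Python sketch below only illustrates the arithmetic, and the function names are hypothetical. The condition that gates downvoting is not described in the source, so it is left out.

```python
KARMA_PER_UPVOTE = 10
REVIEW_CAP = 1_000    # maximum karma one review can earn
DAILY_CAP = 2_500     # maximum karma an author can earn per day


def review_score(upvotes: int, downvotes: int) -> int:
    """A review's score is its net votes."""
    return upvotes - downvotes


def karma_for_upvote(review_karma_so_far: int, daily_karma_so_far: int) -> int:
    """Karma granted for one more upvote, after applying both caps."""
    room = min(REVIEW_CAP - review_karma_so_far, DAILY_CAP - daily_karma_so_far)
    return max(0, min(KARMA_PER_UPVOTE, room))


def can_vote(voter_id: str, author_id: str) -> bool:
    """Self-votes are blocked."""
    return voter_id != author_id


# Downvotes change the score only; there is no karma path for them at all.
score = review_score(upvotes=12, downvotes=5)
first_upvote = karma_for_upvote(0, 0)
near_review_cap = karma_for_upvote(995, 0)
daily_cap_hit = karma_for_upvote(0, 2_500)
```

Because karma is only ever added, a coordinated downvote campaign can bury a review but cannot drain its author's standing.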

12. Data pipeline

Three sources, each at a different cadence and clearly labelled:

▸ Architecture / identity: auto-scraped nightly from Hugging Face configs (the factual layer).
▸ Benchmark scores & profiles: gathered through a research-prompt pipeline and an additive import that never overwrites existing data with empty values.
▸ Real speeds & quant quality: crowdsourced from the community.
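The additive-import rule can be sketched as a merge that skips empty incoming values. What counts as "empty" is an assumption here (None, empty strings and empty containers); a legitimate zero is treated as real data and does overwrite.

```python
def is_empty(value) -> bool:
    """Assumed definition of empty: None, '' or an empty list/dict."""
    return value is None or value == "" or value == [] or value == {}


def additive_merge(existing: dict, incoming: dict) -> dict:
    """Return a copy of existing, updated only with non-empty incoming values."""
    merged = dict(existing)
    for key, value in incoming.items():
        if is_empty(value):
            continue                 # never replace known data with a blank
        merged[key] = value
    return merged


stored = {"mmlu_pro": 71.2, "source": "independent"}          # illustrative values
batch = {"mmlu_pro": None, "source": "", "gpqa_diamond": 55.0}
result = additive_merge(stored, batch)
```

After the merge the stored MMLU-Pro score and source are intact and only the new GPQA Diamond value has been added, so a partial import can be re-run safely.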

13. Principles

▸ Honest about uncertainty: null over a guess, always.
▸ Educational: explain why (bandwidth over TFLOPS, GQA shrinks the KV cache, effective ≠ advertised context).
▸ Trust is the product: the labels on a number matter as much as the number.
▸ Community-owned signal, no ads.

14. Roadmap

Planned: live benchmark/price auto-ingest from primary, independent sources; measured quant quality (perplexity/KL per quant); concurrency-aware throughput curves; and, once there’s a sensible comparison axis, embeddings, rerankers and vision models, which the schema already accommodates.

Shipped recently: a fine-tune / LoRA / QLoRA VRAM mode, EXL2/EXL3 + AWQ/GPTQ + i-quants, multi-GPU planning, a custom-GPU entry for cards not in the catalog, a "what fits my GPU?" reverse lookup, a split prefill / decode speed estimate, Google sign-in, and privacy-first, cookie-free usage analytics.