robotanalyze.com is a search and comparison engine for open-weight & API language models and the GPUs that run them — built on one idea: organise models by capability, and be honest about how much to trust every number.
Picking a model should start from a single, practical question: “Which model should I run, will it fit my hardware, how fast, and what will it cost?” — for both local open-weight models and paid APIs, in one place.
Existing resources answer a different question. Leaderboards collapse everything into one rank; vendor pages quote a hand-picked benchmark that “runs hot”; aggregators auto-generate numbers that can’t be traced. Almost none address the thing a hobbyist actually needs — will this run on my 4090, and how fast? — and none tell you how much a given score is worth.
Two principles drive the whole platform:
For local models we add what cloud leaderboards ignore entirely: effective context (not the advertised number), quantisation quality per bit-width, VRAM fit, and real tokens/sec on actual hardware.
Every figure is sourced or modelled, never fabricated. Unknown values are stored as null with a data_gaps note rather than guessed. Architecture comes straight from each model’s Hugging Face config.json (factual). Benchmark numbers are attributed to a source and labelled vendor vs independent. Speeds are either modelled from first principles or crowdsourced from real runs, and are clearly marked as such.
Self-hosted, no cloud. A single Docker stack on a Raspberry Pi:
Binary “fits / doesn’t fit” is wrong; the truth has three positive states:
This is computed with exact GQA-aware KV-cache math and a model’s real layer / KV-head / head-dim values, plus a usable-memory fraction — because a 32 GB Mac (unified memory, OS-reserved) is not a 32 GB discrete GPU.
weights_GB = total_params_B × bits_per_weight / 8   (GGUF byte-table or EXL3 bpw)
kv_cache_GB = 2 × layers × (context + draft) × (kv_heads × head_dim) × kv_bytes / 1e9 × batch
VRAM_total = weights + kv_cache + overhead   (MoE: weights use TOTAL params)
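The three formulas above can be sketched in a few lines of Python. This is a minimal illustration rather than the site's actual calculator: the function names, the 1 GB default overhead and the example model shape (32 layers, 8 KV heads, head dimension 128, 4.5 bits per weight) are assumptions chosen only to make the arithmetic concrete.

```python
def weights_gb(total_params_b: float, bits_per_weight: float) -> float:
    """Weight memory in GB: parameters (in billions) x bits per weight / 8."""
    return total_params_b * bits_per_weight / 8


def kv_cache_gb(layers: int, context: int, kv_heads: int, head_dim: int,
                kv_bytes: float = 2.0, draft: int = 0, batch: int = 1) -> float:
    """GQA-aware KV cache in GB.

    The leading 2 covers the separate K and V tensors per layer.
    kv_heads is the number of KV heads, not attention heads, which is
    what makes the estimate GQA-aware.
    """
    tokens = context + draft
    return 2 * layers * tokens * (kv_heads * head_dim) * kv_bytes / 1e9 * batch


def vram_total_gb(total_params_b: float, bits_per_weight: float, layers: int,
                  context: int, kv_heads: int, head_dim: int,
                  kv_bytes: float = 2.0, draft: int = 0, batch: int = 1,
                  overhead_gb: float = 1.0) -> float:
    """Weights + KV cache + overhead. For MoE, pass TOTAL params.

    overhead_gb = 1.0 is an illustrative assumption, not a value from the text.
    """
    return (weights_gb(total_params_b, bits_per_weight)
            + kv_cache_gb(layers, context, kv_heads, head_dim,
                          kv_bytes, draft, batch)
            + overhead_gb)


# Hypothetical 8B dense model at about 4.5 bits per weight, 8192-token context,
# FP16 KV cache (2 bytes per element).
w = weights_gb(8, 4.5)
kv = kv_cache_gb(layers=32, context=8192, kv_heads=8, head_dim=128)
total = vram_total_gb(8, 4.5, layers=32, context=8192, kv_heads=8, head_dim=128)
print(f"weights {w:.2f} GB, kv cache {kv:.2f} GB, total {total:.2f} GB")
```

Doubling the context or the batch size doubles the KV-cache term while leaving the weights untouched, which is why long contexts can push a model that "fits" out of memory.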
The calculator covers both quant families and the runtime knobs that quietly move VRAM:
Token generation (decode) is memory-bandwidth-bound, not compute-bound. So the headline rule is:
decode_tok/s ≈ memory_bandwidth_GBs / active_weight_GB × efficiency
A MoE model only reads its active experts per token, so it’s fast once it fits. Prefill (prompt processing / time-to-first-token) is the compute-bound part:
prefill_tok/s ≈ FP16_TFLOPS × efficiency / (2 × active_params)
The calculator (and the /fit API) now report both numbers side by side, because they pull in opposite directions — prefill scales with a card’s TFLOPS, decode with its memory bandwidth. Speculative decoding, flash attention and batching all raise the compute-to-memory ratio. The takeaway we teach inline: for local inference, bandwidth matters more than TFLOPS.
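As a rough sketch of the two rules above, the following Python computes both estimates for one card. The function names, the efficiency factors (0.6 and 0.5) and the hardware figures are hypothetical placeholders, not measured values or the /fit API's real parameters.

```python
def decode_tok_s(bandwidth_gbs: float, active_weight_gb: float,
                 efficiency: float = 0.6) -> float:
    """Decode is bandwidth-bound: each token reads the active weights once."""
    return bandwidth_gbs / active_weight_gb * efficiency


def prefill_tok_s(fp16_tflops: float, active_params_b: float,
                  efficiency: float = 0.5) -> float:
    """Prefill is compute-bound: roughly 2 FLOPs per active parameter per token."""
    flops_per_second = fp16_tflops * 1e12 * efficiency
    flops_per_token = 2 * active_params_b * 1e9
    return flops_per_second / flops_per_token


# Hypothetical card: 1000 GB/s memory bandwidth, 100 FP16 TFLOPS.
# Hypothetical model: 8B active parameters stored in 5 GB of quantised weights.
print(f"decode  ~{decode_tok_s(1000, 5):.0f} tok/s")
print(f"prefill ~{prefill_tok_s(100, 8):.0f} tok/s")
```

For a MoE model, pass only the active expert weights and active parameter count to these functions; the total parameter count still decides whether the model fits in VRAM.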
Quantisation trades VRAM for accuracy, and the trade isn’t linear. Q4_K_M is the sweet spot; Q8 is near-lossless; below ~Q3 quality visibly drops. We prefer KL-divergence vs FP16 over raw perplexity (perplexity hides per-token distortion), and note that MoE tolerates lower bits than dense models — but their routers are quantisation-sensitive. Real quant-quality evals (e.g. lm-evaluation-harness) are crowdsourced per quant.
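To make the KL-divergence preference concrete, here is a minimal sketch that averages the per-token KL divergence between the FP16 model's next-token distribution and the quantised model's. The function names and the toy two-token vocabulary are assumptions for illustration; a real evaluation would use full-vocabulary distributions from actual model outputs.

```python
import math


def kl_divergence(p, q, eps: float = 1e-12) -> float:
    """KL(p || q) in nats for two discrete distributions over the same vocabulary.

    p is the reference (FP16) distribution, q the quantised one.
    eps guards against log(0) when q assigns zero probability.
    """
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def mean_token_kl(ref_dists, quant_dists) -> float:
    """Average KL over token positions, so per-token distortion stays visible."""
    kls = [kl_divergence(p, q) for p, q in zip(ref_dists, quant_dists)]
    return sum(kls) / len(kls)


# Toy example with two token positions and a two-token vocabulary.
fp16 = [[0.5, 0.5], [0.9, 0.1]]
quant = [[0.25, 0.75], [0.9, 0.1]]  # first position distorted, second unchanged
print(f"mean KL: {mean_token_kl(fp16, quant):.4f} nats")
```

The second position contributes zero, so the mean isolates the distortion at the first position. Perplexity on the same text could stay almost unchanged if the correct token happened to keep a similar probability.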
The fit-calculator covers the quant families people actually run: GGUF k-quants (Q8 → Q3), GGUF i-quants (IQ4 / IQ3 / IQ2), AWQ and GPTQ (4-bit), and the ExLlama families EXL2 / EXL3 sized by their bits-per-weight, so you can plan the exact quant you run, not an approximation.
A single 0–10 number is a weighted blend, renormalised over whichever axes are present (so a model missing an axis isn’t penalised). It’s documented, not magic:
| Axis | Weight | Axis | Weight |
|---|---|---|---|
| SWE-bench Verified | 0.20 | MMLU-Pro | 0.10 |
| GPQA Diamond | 0.15 | LMArena coding | 0.08 |
| HLE | 0.12 | Instruction-following | 0.07 |
| BFCL tools | 0.10 | Multilingual | 0.05 |
| AIME | 0.10 | Hallucination (inverted) | 0.03 |
| APEX agents | 0.05 | MRCR long-context | 0.05 |
Vendor benchmarks are flagged (they run hot). Dollar-denominated agentic metrics (Vending-Bench, SWE-Lancer) are stored and shown but deliberately kept out of the blend — normalising them would require inventing a cap.
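The renormalised blend can be sketched as follows. The weights are the ones listed above. The function name, the use of None for a missing axis and the assumption that each axis score has already been mapped onto a 0–10 scale are illustrative choices, not a description of the site's actual code.

```python
from typing import Dict, Optional

# Axis weights as listed above. They do not need to sum to 1,
# because the blend divides by the weights that are actually present.
WEIGHTS: Dict[str, float] = {
    "SWE-bench Verified": 0.20, "MMLU-Pro": 0.10,
    "GPQA Diamond": 0.15, "LMArena coding": 0.08,
    "HLE": 0.12, "Instruction-following": 0.07,
    "BFCL tools": 0.10, "Multilingual": 0.05,
    "AIME": 0.10, "Hallucination (inverted)": 0.03,
    "APEX agents": 0.05, "MRCR long-context": 0.05,
}


def blended_score(axis_scores: Dict[str, Optional[float]]) -> Optional[float]:
    """Weighted mean over the axes that have a score (each assumed to be 0-10).

    Missing axes (absent or None) are dropped and the remaining weights
    renormalised, so a model is not penalised for an axis it lacks.
    """
    present = {k: v for k, v in axis_scores.items()
               if v is not None and k in WEIGHTS}
    if not present:
        return None  # nothing to blend: keep it unknown rather than guess
    total_weight = sum(WEIGHTS[k] for k in present)
    return sum(WEIGHTS[k] * v for k, v in present.items()) / total_weight


# Hypothetical model with two known axes and one unknown.
scores = {"SWE-bench Verified": 8.0, "GPQA Diamond": 6.0, "MMLU-Pro": None}
print(f"blend: {blended_score(scores):.2f}")
```

Returning None when no axis is present mirrors the rule that unknown values are stored as null rather than guessed.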
Almost no site tells you how much to believe a benchmark. We attach, per score:
last_verified), because eval sets drift and today’s separator saturates within months.
Each skill maps to the best current benchmark, with the saturated classics kept only as historical context:
Crowdsourced signal, kept honest by rules enforced in the database, not the UI:
Three sources, each at a different cadence and clearly labelled:
Live benchmark/price auto-ingest from primary, independent sources; measured quant quality (perplexity/KL per quant); concurrency-aware throughput curves; and — once there’s a sensible comparison axis — embeddings, rerankers and vision models, which the schema already accommodates. Shipped recently: a fine-tune / LoRA / QLoRA VRAM mode, EXL2/EXL3 + AWQ/GPTQ + i-quants, multi-GPU planning, a custom-GPU entry for cards not in the catalog, a "what fits my GPU?" reverse lookup, a split prefill / decode speed estimate, Google sign-in, and privacy-first, cookie-free usage analytics.