Introducing llamaBench — LLM Inference Benchmark Runner

Posted May 18, 2026

By Leo

What is llamaBench?

llamaBench is an LLM inference benchmark runner for any OpenAI-compatible server. It is backend-agnostic: ROCm (Lemonade SDK), CUDA (Ollama, vLLM), Vulkan, and CPU backends all work. It runs latency and throughput benchmarks against remote servers at multiple context depths, then generates comparison plots automatically.

Key Features

Multi-server support — per-server config files (config.<NAME>.sh) with auto-discovery
Multi-model runs — benchmark several models in a single invocation
Variable context depths — test prompt processing and token generation from zero-context to hundreds of thousands of tokens
Atomic results — temp-directory pattern ensures no partial output on failure
Built-in plotting — matplotlib charts for prompt processing and token generation throughput vs. context depth
Traceable results — backend versions embedded in every result file

Requirements

Tool	Purpose
bash 4+	Script runtime
curl	HTTP requests to OpenAI-compatible API
python3	JSON parsing, plot generation
uvx / llama-benchy	Benchmark execution (uvx ships with uv, so install uv first)
matplotlib	Chart generation (auto-installed if missing)

matplotlib is installed automatically when it is missing. To install it ahead of time instead:

pip3 install matplotlib

Quick Start

# 1. Create a server configuration
cp config.template.sh config.MYSERVER.sh
# Edit config.MYSERVER.sh with your server IP, port, models, and depths

# 2. Run benchmarks
./run_bench.sh MYSERVER

# 3. Check results
ls results/<timestamp>/

Configuration

Each server has its own config.<NAME>.sh file sourced by run_bench.sh. Copy the template and fill in your values:

cp config.template.sh config.MYSERVER.sh
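
The post does not show how auto-discovery is implemented. One plausible approach, offered here as an assumption rather than as the actual logic in run_bench.sh, is to glob for config.*.sh and derive each server name from the filename. The helper name list_servers is hypothetical.

```bash
#!/usr/bin/env bash
# Sketch only: list server names by globbing config.<NAME>.sh in a directory.
# list_servers is a hypothetical helper, not part of llamaBench.
list_servers() {
  local dir=${1:-.} path name
  for path in "$dir"/config.*.sh; do
    [ -e "$path" ] || continue          # the glob matched nothing
    name=${path##*/}                    # strip the directory part
    name=${name#config.}                # strip the "config." prefix
    name=${name%.sh}                    # strip the ".sh" suffix
    if [ "$name" = "template" ]; then
      continue                          # the template is not a real server
    fi
    printf '%s\n' "$name"
  done
}
```

Calling list_servers in the project directory would print one name per config file, for example MYSERVER for config.MYSERVER.sh.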

Config Variables

Variable	Description	Example
`IP`	Server IP address	`"192.168.2.238"`
`PORT`	API port	`"13305"`
`PLOT_PREFIX`	Filename prefix for plots	`"combined."`
`DEPTHS`	Array of context depths to test	`(0 8192 32768 65535 128000)`
`MODELS`	Array of model names on the server	`("user.MyModel-Q4")`
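
Put together, a filled-in config file might look like the sketch below. The values are the examples from the table. The validate_config function is a hypothetical illustration of the "validate arrays" step and is not taken from run_bench.sh.

```bash
#!/usr/bin/env bash
# config.MYSERVER.sh: example values copied from the table above
IP="192.168.2.238"
PORT="13305"
PLOT_PREFIX="combined."                 # plots become combined.p.png and combined.g.png
DEPTHS=(0 8192 32768 65535 128000)      # context depths in tokens
MODELS=("user.MyModel-Q4")              # model names as known to the server

# Hypothetical sanity check: both arrays need at least one entry.
validate_config() {
  if [ "${#DEPTHS[@]}" -eq 0 ] || [ "${#MODELS[@]}" -eq 0 ]; then
    echo "DEPTHS and MODELS must each contain at least one entry" >&2
    return 1
  fi
}
```

Because the file is sourced, DEPTHS and MODELS must use bash array syntax (parentheses, space-separated entries), which is one reason bash 4+ is listed as a requirement.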

Benchmark Flow

Config load: source config.<NAME>.sh, validate arrays, apply CLI overrides
Temp dir: create an isolated workspace (cleaned up on failure)
System info: fetch /api/v1/system-info and extract backend versions
Per-model loop: unload the current model, run llama-benchy with all configured depths, save a markdown table
Plot: generate PNG charts from all result files
Finalize: move temp contents to results/<timestamp>/ only on success

All intermediate files are written to a temporary directory. The final results folder is created only after all benchmarks and plots succeed. If any step fails, the temp directory is removed automatically.
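
The temp-directory pattern described above can be sketched in a few lines of bash. This is a minimal illustration under stated assumptions, not the code in run_bench.sh: the helper name run_atomically and the timestamp format are hypothetical.

```bash
#!/usr/bin/env bash
# Sketch only: run a command inside a temp dir and publish its output to
# <results_root>/<timestamp>/ only if the command succeeds.
# The function body is a subshell, so the EXIT trap fires when it finishes.
run_atomically() (
  results_root=$1
  shift
  tmp_dir=$(mktemp -d) || exit 1
  trap 'rm -rf "$tmp_dir"' EXIT          # cleanup on any exit path

  # Run the workload with the temp dir as its working directory.
  ( cd "$tmp_dir" && "$@" ) || exit 1

  final_dir="$results_root/$(date +%Y-%m-%d_%H-%M-%S)"   # assumed format
  mkdir -p "$results_root" || exit 1
  mv "$tmp_dir" "$final_dir" || exit 1   # after this the trap has nothing left to delete
  printf '%s\n' "$final_dir"
)
```

For example, `run_atomically results sh -c 'echo ok > result.md'` would leave results/<timestamp>/result.md behind, while a failing command leaves nothing. A production script would also need to handle two runs landing in the same second, which this sketch ignores.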

Result Format

Each run produces a timestamped directory in results/:

system-info.json / system-info.md — Server hardware and backend versions
Per-model markdown tables with columns for tokens/sec, peak tokens/sec, time to first response (TTFR), estimated prompt processing time (PPT), and end-to-end time to first token (TTFT)
<prefix>p.png — Prompt processing throughput vs context depth
<prefix>g.png — Token generation throughput vs context depth

Test types include:

pp — Prompt processing (baseline, 2048 tokens)
tg — Token generation (32 tokens)
ctx_pp/tg @ d<N> — Full context processing/generation at depth N
pp/tg @ d<N> — Incremental processing/generation at depth N
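
To make the label scheme concrete, here is a small sketch that splits each row of a result table into test kind, context depth, and throughput. The two-column table is a hypothetical sample (real result files carry more columns, and their exact layout is not shown in this post), and parse_rows is an illustrative helper, not part of plot.py.

```bash
#!/usr/bin/env bash
# Hypothetical sample table using the test labels described above.
sample_table='| test | t/s |
|---|---|
| pp | 1500.0 |
| tg | 42.0 |
| ctx_pp @ d8192 | 950.0 |
| pp @ d8192 | 900.5 |
| tg @ d8192 | 35.1 |'

# Reads a markdown table on stdin and prints "<kind> <depth> <t/s>".
# Labels without an "@ d<N>" suffix are treated as depth 0.
parse_rows() {
  awk -F'|' '
    NF < 4 { next }                                  # not a table row
    {
      label = $2; value = $3
      gsub(/^[ \t]+|[ \t]+$/, "", label)             # trim cell padding
      gsub(/^[ \t]+|[ \t]+$/, "", value)
    }
    label == "test" || label ~ /^:?-+:?$/ { next }   # header and separator rows
    {
      depth = 0
      if (match(label, / @ d[0-9]+$/)) {
        depth = substr(label, RSTART + 4)            # digits after " @ d"
        label = substr(label, 1, RSTART - 1)         # test kind without the suffix
      }
      print label, depth, value
    }'
}

printf '%s\n' "$sample_table" | parse_rows
```

Grouping the output by kind and sorting by depth gives the throughput-versus-depth series that the two plots show.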

plot.py

plot.py is a standalone Python script, so it can also be run on its own:

# Plot specific result files
python3 plot.py --prefix "output." results/*/model*.md

# Plot from stdin
cat results.md | python3 plot.py -

Notes

An OpenAI-compatible inference server must be running and reachable before you start a benchmark
Models are unloaded between runs to ensure clean state
Prefix caching is enabled by default in all benchmark runs
Backend version headers are prepended to each result file for traceability
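
Since the server has to be up first, a quick preflight check can save a wasted run. The sketch below is an assumption, not something run_bench.sh is documented to do: api_url and server_ready are hypothetical helpers, and the /v1 prefix with a models route follows the OpenAI API convention. A server that exposes its API under /api/v1, as the system-info endpoint above suggests, may need that prefix instead.

```bash
#!/usr/bin/env bash
# Sketch only: fail fast when the inference server is not reachable.
# api_url and server_ready are hypothetical helpers.
api_url() {
  local ip=$1 port=$2 prefix=${3:-/v1}   # the prefix is an assumption; adjust per server
  printf 'http://%s:%s%s/models\n' "$ip" "$port" "$prefix"
}

server_ready() {
  # --fail maps HTTP errors to a non-zero exit status; --max-time bounds the wait.
  curl --fail --silent --show-error --max-time 5 --output /dev/null \
    "$(api_url "$@")"
}
```

A script could then guard the benchmark with `server_ready "$IP" "$PORT" || exit 1` after sourcing the config file.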

Check out the project on GitHub.

This post is licensed under CC BY 4.0 by the author.