Introducing llamaBench — LLM Inference Benchmark Runner

What is llamaBench?

llamaBench is an LLM inference benchmark runner designed for any OpenAI-compatible server. It works with any backend — ROCm (Lemonade SDK), CUDA (Ollama, vLLM), Vulkan, or CPU — and runs latency and throughput benchmarks against remote servers at multiple context depths, then generates comparison plots automatically.

Key Features

  • Multi-server support — per-server config files (config.<NAME>.sh) with auto-discovery
  • Multi-model runs — benchmark several models in a single invocation
  • Variable context depths — test prompt processing and token generation from zero-context to hundreds of thousands of tokens
  • Atomic results — temp-directory pattern ensures no partial output on failure
  • Built-in plotting — matplotlib charts for prompt processing and token generation throughput vs. context depth
  • Traceable results — backend versions embedded in every result file

Requirements

| Tool | Purpose |
|------|---------|
| bash 4+ | Script runtime |
| curl | HTTP requests to the OpenAI-compatible API |
| python3 | JSON parsing, plot generation |
| uvx / llama-benchy | Benchmark execution (install uv) |
| matplotlib | Chart generation (auto-installed if missing) |

Install matplotlib ahead of time:

pip3 install matplotlib

Quick Start

# 1. Create a server configuration
cp config.template.sh config.MYSERVER.sh
# Edit config.MYSERVER.sh with your server IP, port, models, and depths

# 2. Run benchmarks
./run_bench.sh MYSERVER

# 3. Check results
ls results/<timestamp>/

Configuration

Each server has its own config.<NAME>.sh file sourced by run_bench.sh. Copy the template and fill in your values:

cp config.template.sh config.MYSERVER.sh

Config Variables

| Variable | Description | Example |
|----------|-------------|---------|
| IP | Server IP address | "192.168.2.238" |
| PORT | API port | "13305" |
| PLOT_PREFIX | Filename prefix for plots | "combined." |
| DEPTHS | Array of context depths to test | (0 8192 32768 65535 128000) |
| MODELS | Array of model names on the server | ("user.MyModel-Q4") |
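
Putting the example values from the table together, a filled-in config might look like the sketch below. It uses only the five variables listed above; the real config.template.sh may contain further settings or comments that are not shown here.

```shell
#!/usr/bin/env bash
# config.MYSERVER.sh: illustrative values copied from the table above.
# The file is sourced by run_bench.sh, so it only assigns variables.

IP="192.168.2.238"                    # server IP address
PORT="13305"                          # API port
PLOT_PREFIX="combined."               # plots become combined.p.png and combined.g.png
DEPTHS=(0 8192 32768 65535 128000)    # context depths to test, in tokens
MODELS=("user.MyModel-Q4")            # model names as known to the server
```

DEPTHS and MODELS are bash arrays, so entries are separated by spaces rather than commas, and model names containing spaces need quotes.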

Benchmark Flow

  1. Config load — Source config.<NAME>.sh, validate arrays, apply CLI overrides
  2. Temp dir — Create isolated workspace (cleaned up on failure)
  3. System info — Fetch /api/v1/system-info, extract backend versions
  4. Per-model loop: for each model, unload the currently loaded model, run llama-benchy at all configured depths, and save the markdown table
  5. Plot — Generate PNG charts from all result files
  6. Finalize — Move temp contents to results/<timestamp>/ only on success
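
The array validation in step 1 could be as simple as the following sketch. `validate_config` is a hypothetical helper written for illustration, not the function run_bench.sh actually uses, and the exact checks it performs are assumptions.

```shell
#!/usr/bin/env bash
# Sketch only: validate_config is a hypothetical helper. It expects MODELS
# and DEPTHS to be bash arrays already set by sourcing config.<NAME>.sh.
validate_config() {
    local depth
    if [ "${#MODELS[@]}" -eq 0 ]; then
        echo "error: MODELS must list at least one model" >&2
        return 1
    fi
    if [ "${#DEPTHS[@]}" -eq 0 ]; then
        echo "error: DEPTHS must list at least one depth" >&2
        return 1
    fi
    for depth in "${DEPTHS[@]}"; do
        # Each depth must be a non-negative integer token count.
        if ! [[ "$depth" =~ ^[0-9]+$ ]]; then
            echo "error: invalid depth '$depth'" >&2
            return 1
        fi
    done
}

MODELS=("user.MyModel-Q4")
DEPTHS=(0 8192 32768)
validate_config && echo "config ok"
```

Failing early here avoids unloading a model or starting a long benchmark run on a config that cannot succeed.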

All intermediate files are written to a temporary directory. The final results folder is created only after all benchmarks and plots succeed. If any step fails, the temp directory is removed automatically.
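
The temp-directory pattern described above can be sketched in a few lines of bash. This is an illustration under stated assumptions, not the actual run_bench.sh code: `run_atomically`, `good_run`, and `bad_run` are hypothetical names, and the real script may use a `trap` for cleanup instead of an explicit branch.

```shell
#!/usr/bin/env bash
# Sketch only: run_atomically is a hypothetical helper. It creates a temp
# directory next to the final one, runs the given command with that directory
# as its last argument, and publishes the directory only on success.
run_atomically() {
    local final_dir="$1"; shift
    local parent tmp_dir
    parent="$(dirname "$final_dir")"
    mkdir -p "$parent" || return 1
    # Same parent directory means same filesystem, so mv is a plain rename.
    tmp_dir="$(mktemp -d "$parent/.bench-tmp.XXXXXX")" || return 1

    if "$@" "$tmp_dir"; then
        mv "$tmp_dir" "$final_dir"      # results appear in one step
    else
        rm -rf "$tmp_dir"               # failure leaves no partial output
        return 1
    fi
}

# Hypothetical stand-ins for the benchmark and plotting steps.
good_run() { echo "benchmark table" > "$1/model.md"; }
bad_run()  { echo "partial table" > "$1/model.md"; return 1; }

demo_root="$(mktemp -d)"
run_atomically "$demo_root/results/ok" good_run
run_atomically "$demo_root/results/failed" bad_run || echo "run failed, nothing published"
```

After both calls only `results/ok` exists. The failed run's partial `model.md` is deleted together with its temp directory.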

Result Format

Each run produces a timestamped directory in results/:

  • system-info.json / system-info.md — Server hardware and backend versions
  • Per-model markdown tables with columns for tokens/sec, peak tokens/sec, TTFR (time to first response), estimated PPT (prompt processing time), and end-to-end TTFT (time to first token)
  • <prefix>p.png — Prompt processing throughput vs context depth
  • <prefix>g.png — Token generation throughput vs context depth

Test types include:

  • pp — Prompt processing (baseline, 2048 tokens)
  • tg — Token generation (32 tokens)
  • ctx_pp/tg @ d<N> — Full context processing/generation at depth N
  • pp/tg @ d<N> — Incremental processing/generation at depth N
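
To group results by depth, a script first has to split these labels into a test type and a depth. The sketch below shows one way to do that. `parse_test_label` is a hypothetical helper, and it rests on two assumptions: labels may carry a token count after pp/tg (for example "pp2048"), and a label without "@ d<N>" counts as depth 0.

```shell
#!/usr/bin/env bash
# Sketch only: parse_test_label is a hypothetical helper. It assumes labels
# follow the forms listed above, optionally with a token count after pp/tg,
# and treats a label without "@ d<N>" as depth 0.
parse_test_label() {
    local label="$1"
    local re='^(ctx_pp|ctx_tg|pp|tg)[0-9]*( @ d([0-9]+))?$'
    if [[ "$label" =~ $re ]]; then
        # Prints "<test type> <depth>".
        echo "${BASH_REMATCH[1]} ${BASH_REMATCH[3]:-0}"
    else
        return 1
    fi
}

parse_test_label "pp"               # prints: pp 0
parse_test_label "ctx_tg @ d8192"   # prints: ctx_tg 8192
parse_test_label "tg32 @ d32768"    # prints: tg 32768
```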

plot.py

plot.py can also be run on its own, outside run_bench.sh:

# Plot specific result files
python plot.py --prefix "output." results/*/model*.md

# Plot from stdin
cat results.md | python plot.py -

Notes

  • An OpenAI-compatible inference server must be running and accessible before running benchmarks
  • Models are unloaded between runs to ensure clean state
  • Prefix caching is enabled by default in all benchmark runs
  • Backend version headers are prepended to each result file for traceability
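
A quick reachability check before a long run can catch the first point early. The sketch below is illustrative: `server_is_up` is a hypothetical helper, and it assumes the server answers on the /api/v1/system-info path mentioned in the Benchmark Flow. Other OpenAI-compatible servers may need a different path.

```shell
#!/usr/bin/env bash
# Sketch only: server_is_up is a hypothetical helper. It assumes the server
# answers on /api/v1/system-info; adjust the path for other servers.
server_is_up() {
    local ip="$1" port="$2"
    # -s: quiet, -f: treat HTTP errors as failure, --max-time: give up after 5 s
    curl -sf --max-time 5 -o /dev/null "http://${ip}:${port}/api/v1/system-info"
}
```

With a config sourced, `server_is_up "$IP" "$PORT" || exit 1` would stop the script before any model is unloaded.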

Check out the project on GitHub.

This post is licensed under CC BY 4.0 by the author.