Piotr Mazurek and Felix Gabriel have an amazing post up on LLM Inference Economics from First Principles, which I found on Bluesky.

They go into a huge amount of detail about how inference works and how that affects processing speed. But I saw the graph above and thought “we can get energy from that”. And so I asked ChatGPT o3: “Looking at these graphs of throughput at different batch sizes on a 4xH100 80gb cluster, what ranges of power per token do they equate to?”

First up, this is based on a large input prompt of 16k tokens. That’s big! But it’s also perfectly reasonable for coding, documents, etc. Throughput varies from 41.8 t/s at batch size 1 to 156 t/s at batch size 8 on 4xH100s.

Next we need the power draw of the node. Bracketing with a low and a high power figure, we get:

| GPU type | Board TDP | 4-GPU node | +~10% for CPUs, NICs, etc. |
|---|---|---|---|
| H100 PCIe | 350 W | 1.40 kW | ~1.5 kW |
| H100 SXM5 | 700 W | 2.80 kW | ~3.0 kW |

These two bounds give us a realistic “best case” (1.5 kW) and “worst case” (3.0 kW) for a well-loaded Hopper box.

For each row, we divide energy per second (power) by tokens per second, e.g. 1500 J/s ÷ 41.8 t/s ≈ 36 J/token. To convert to kWh per million tokens: 1 kWh = 3.6 MJ, so kWh per million tokens = (J/token × 10⁶) / (3.6 × 10⁶) = J/token ÷ 3.6.

| Batch | J/token @ 1.5 kW | J/token @ 3.0 kW | kWh/million tokens (1.5 – 3.0 kW) |
|---|---|---|---|
| 1 | 36 J | 72 J | 10 – 20 kWh |
| 2 | 21 J | 42 J | 6 – 12 kWh |
| 4 | 14 J | 28 J | 4 – 8 kWh |
| 8 | 9.6 J | 19 J | 2.7 – 5.2 kWh |
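As a sanity check, here’s a minimal Python sketch of that conversion (the helper names are mine; only the batch-1 and batch-8 throughputs are quoted above, so those are the two rows it reproduces):

```python
def joules_per_token(node_power_w: float, tokens_per_s: float) -> float:
    """Energy per token = node power (J/s) divided by throughput (tokens/s)."""
    return node_power_w / tokens_per_s


def kwh_per_million_tokens(j_per_token: float) -> float:
    """1 kWh = 3.6 MJ, so kWh per million tokens = J/token / 3.6."""
    return (j_per_token * 1e6) / 3.6e6


# 16k-token prompt on 4xH100: the post quotes 41.8 t/s at batch 1 and 156 t/s at batch 8.
for batch, tps in [(1, 41.8), (8, 156.0)]:
    for power_w in (1500, 3000):  # ~1.5 kW PCIe node, ~3.0 kW SXM5 node
        j = joules_per_token(power_w, tps)
        print(f"batch {batch} @ {power_w / 1000:.1f} kW: "
              f"{j:.1f} J/token, {kwh_per_million_tokens(j):.1f} kWh/M tokens")
```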

Even at the most GPU-efficient point (batch 8), the node is burning roughly 9.6–19 J per token, 25–50× more energy than the 0.4 J/token figure I used before.

ChatGPT claims “The culprit is the huge prompt (16 k tokens) plus per-request synchronisation overhead”. Well, the post includes throughput figures for a small prompt (2035 tokens in, 300 out) on the same four-GPU setup, so we can do that case too:

| Batch size | Throughput (tokens/s) | J/token @ 1.5 kW | J/token @ 3.0 kW | kWh/million tokens (1.5 – 3.0 kW) |
|---|---|---|---|---|
| 1 | 53.9 | 27.8 J | 55.6 J | 7.7 – 15.4 |
| 2 | 102.9 | 14.6 J | 29.2 J | 4.0 – 8.1 |
| 4 | 182.5 | 8.2 J | 16.4 J | 2.3 – 4.6 |
| 8 | 319.8 | 4.7 J | 9.4 J | 1.3 – 2.6 |
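Reusing the two helpers from the sketch above on these quoted throughputs reproduces the table:

```python
# Small prompt (2035 in, 300 out) on 4xH100: throughputs from the table above.
for batch, tps in [(1, 53.9), (2, 102.9), (4, 182.5), (8, 319.8)]:
    low, high = joules_per_token(1500, tps), joules_per_token(3000, tps)
    print(f"batch {batch}: {low:.1f}-{high:.1f} J/token, "
          f"{kwh_per_million_tokens(low):.1f}-{kwh_per_million_tokens(high):.1f} kWh/M tokens")
```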

The post also includes a graph of this small-prompt case on a single GPU, which is faster than four GPUs here because there’s no communication and synchronisation overhead. Throughput tops out at 2174 t/s at a batch size of 512. Plugging that in, we get 0.69 J/token, which is in the same ballpark as the 0.4 J/token I used before. But it does mean that in practice I should have used 10-20x that for realistic inputs when looking at coding and other long-prompt examples.
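That 0.69 J/token figure appears to come from dividing the same ~1.5 kW node power by the 2174 t/s throughput, i.e. with the helper above:

```python
# Single H100, small prompt, batch size 512: 2174 t/s from the post's single-GPU graph.
# Assumption: the same ~1.5 kW figure is applied, which is what the 0.69 J/token result implies.
print(joules_per_token(1500, 2174))  # ≈ 0.69 J/token
```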