The exact formula for KV cache memory and worked examples for every major model architecture. Calculate your GPU requirements precisely.
Standard multi-head attention (MHA) stores separate K and V tensors for every head. MQA shares a single K/V pair across all query heads, and GQA shares one pair per group of heads, shrinking the KV cache dramatically with minimal quality loss.
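The effect on memory can be sketched with the standard KV cache formula (2 tensors per layer, one K and one V, each of shape batch × kv_heads × seq_len × head_dim). The config below is an illustrative 7B-class setup assumed for this sketch, not taken from either post:

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, batch: int = 1, bytes_per_elem: int = 2) -> int:
    # K and V are cached per layer, each [batch, num_kv_heads, seq_len, head_dim],
    # hence the leading factor of 2. bytes_per_elem=2 assumes fp16/bf16.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch * bytes_per_elem

# Assumed config: 32 layers, 32 query heads, head_dim 128, 4096-token context.
mha = kv_cache_bytes(32, num_kv_heads=32, head_dim=128, seq_len=4096)  # one KV pair per head
gqa = kv_cache_bytes(32, num_kv_heads=8,  head_dim=128, seq_len=4096)  # 8 KV groups
mqa = kv_cache_bytes(32, num_kv_heads=1,  head_dim=128, seq_len=4096)  # single shared KV pair

print(f"MHA: {mha / 2**30:.2f} GiB")  # 2.00 GiB
print(f"GQA: {gqa / 2**30:.2f} GiB")  # 0.50 GiB
print(f"MQA: {mqa / 2**30:.4f} GiB")  # 0.0625 GiB
```

Only `num_kv_heads` changes between the three calls, which is exactly why GQA and MQA cut cache size by the ratio of query heads to KV heads (4x and 32x here).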