Friday, May 16, 2025
Show HN: KVSplit – Run 2-3× longer contexts on Apple Silicon https://ift.tt/PB0N7IV
I discovered that in LLM inference, keys and values in the KV cache have very different quantization sensitivities. Keys need higher precision than values to maintain quality.

I patched llama.cpp to enable different bit-widths for keys vs. values on Apple Silicon. The results are surprising:

- K8V4 (8-bit keys, 4-bit values): 59% memory reduction with only 0.86% perplexity loss
- K4V8 (4-bit keys, 8-bit values): 59% memory reduction but 6.06% perplexity loss
- The configurations use the same number of bits, but K8V4 is 7× better for quality

This means you can run LLMs with 2-3× longer context on the same Mac. Memory usage scales with sequence length, so savings compound as context grows.

Implementation was straightforward:

1. Added --kvq-key and --kvq-val flags to llama.cpp
2. Applied existing quantization logic separately to K and V tensors
3. Validated with perplexity metrics across context lengths
4. Used Metal for acceleration (with -mlong-calls flag to avoid vectorization issues)

Benchmarked on an M4 MacBook Pro running TinyLlama with 8K context windows. Compatible with Metal/MPS and optimized for Apple Silicon.

GitHub: https://ift.tt/mxNnqv6

May 17, 2025 at 01:34AM
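To see where a figure like 59% could come from, here is a rough back-of-the-envelope sketch of KV cache size as a function of key and value bit-widths. It is not taken from the KVSplit code. It assumes the baseline is an FP16 cache, that the 8-bit and 4-bit formats are llama.cpp-style block formats (32 elements plus one 2-byte scale per block, i.e. 8.5 and 4.5 effective bits per element), and that the model has a TinyLlama-1.1B-like shape (22 layers, 4 KV heads, head dimension 64). The function and dictionary names are made up for illustration.

```python
# Effective bits per cached element, including per-block scale overhead.
# Assumption: llama.cpp-style block formats with 32 elements per block.
BITS_PER_ELEMENT = {
    "f16": 16.0,
    "q8_0": 8.5,  # 32 one-byte values + 2-byte scale = 34 bytes per block
    "q4_0": 4.5,  # 16 bytes of packed 4-bit values + 2-byte scale = 18 bytes per block
}


def kv_cache_bytes(n_ctx, key_type, val_type,
                   n_layers=22, n_kv_heads=4, head_dim=64):
    """Estimate KV cache size in bytes for a given context length.

    The model-shape defaults are assumed (TinyLlama-1.1B-like), not measured.
    """
    # K and V each store one head_dim vector per layer, per token, per KV head.
    elements_per_tensor = n_layers * n_ctx * n_kv_heads * head_dim
    bits = elements_per_tensor * (
        BITS_PER_ELEMENT[key_type] + BITS_PER_ELEMENT[val_type]
    )
    return bits / 8


def reduction_vs_f16(key_type, val_type):
    """Fraction of memory saved relative to an FP16 key + FP16 value cache."""
    used = BITS_PER_ELEMENT[key_type] + BITS_PER_ELEMENT[val_type]
    return 1.0 - used / (2 * BITS_PER_ELEMENT["f16"])


if __name__ == "__main__":
    n_ctx = 8192
    for name, (k, v) in {
        "FP16": ("f16", "f16"),
        "K8V4": ("q8_0", "q4_0"),
        "K4V8": ("q4_0", "q8_0"),
    }.items():
        mib = kv_cache_bytes(n_ctx, k, v) / (1024 ** 2)
        saved = reduction_vs_f16(k, v) * 100
        print(f"{name}: {mib:.1f} MiB at {n_ctx} tokens, {saved:.1f}% saved")
```

Under these assumptions K8V4 and K4V8 both use 13 bits per key/value pair instead of 32, a reduction of 1 - 13/32, or about 59%, and the same memory budget holds roughly 32/13 ≈ 2.5× as many tokens. The cache size is linear in `n_ctx`, so the absolute savings grow with context length while the percentage stays fixed.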