Google Apresenta o TurboQuant para Reduzir o Uso de Memória de LLM e Aumentar a Velocidade

Google Research unveiled TurboQuant, a new compression algorithm designed to dramatically reduce the memory footprint of large language models (LLMs) while also increasing inference speed. By targeting the key‑value cache—often described as a digital cheat sheet—TurboQuant can cut memory usage by up to six times and deliver performance gains of around eight times without sacrificing model quality. The technique relies on a novel PolarQuant conversion that represents vectors in polar coordinates, preserving essential information while enabling aggressive compression. Ler mais