Nvidia just open-sourced their long-context goodies - 128k context for 50% less memory
Edit: that should be 35% less memory, not 50%.
If you need long context for RAG, tool use, agents, or just because, Nvidia released a new library to make it super simple.
TLDR: You can get 128k context with 35% less memory.
Here's a blog post on everything: https://huggingface.co/blog/nvidia/kvpress
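
For a rough sense of scale, here's a back-of-the-envelope sketch of how much memory the KV cache alone takes at 128k tokens. The model shape (32 layers, 8 KV heads, head dim 128, fp16) is my assumption for a typical 8B model with grouped-query attention, not something from the blog post. Applying the 35% figure to the cache by itself is also just for illustration, since the post doesn't say exactly which memory is being measured.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Size of the KV cache for one sequence, in bytes."""
    # Factor of 2: one tensor for keys and one for values in every layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


GIB = 1024 ** 3

# Assumed model shape (hypothetical 8B-class model, fp16 cache).
full = kv_cache_bytes(seq_len=128 * 1024, n_layers=32, n_kv_heads=8, head_dim=128)

# Illustrative only: treat the 35% saving as applying to the cache itself.
saved_fraction = 0.35
compressed = full * (1 - saved_fraction)

print(f"full KV cache:   {full / GIB:.1f} GiB")
print(f"after 35% saved: {compressed / GIB:.1f} GiB")
```

Under those assumptions the cache goes from about 16 GiB to about 10.4 GiB per 128k-token sequence, which is the kind of difference that decides whether a long prompt fits on a single GPU.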