I’ve been pointing out for ages now that LLMs are barely optimized, so here’s another example of a possible inference speedup that seems very promising (it works somewhat like on-the-fly distillation).
If this technique checks out and ends up implemented in mainstream tooling like ollama, it’s going to significantly lower compute and memory requirements for a bunch of scenarios.