HACKER Q&A
📣 myrmidon

Why no inference directly from flash/SSD?


My understanding is that current LLMs require a lot of space for pre-computed weights (that are constant at inference-time).

Why is it currently not feasible to just keep those in flash memory (a fast PCIe SSD RAID or some such), and only use RAM for intermediate values/results?
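
For a rough sense of scale, here is a back-of-envelope sketch. The dimensions are illustrative assumptions for a hypothetical 70B-parameter, Llama-style model at batch size 1, not figures for any specific system:

    # Back-of-envelope: weights vs. per-request intermediate state at batch size 1.
    # All model dimensions below are assumptions chosen purely for illustration.
    params = 70e9                # 70B parameters (assumed)
    bytes_per_weight = 2         # fp16/bf16
    weight_bytes = params * bytes_per_weight   # ~140 GB, constant at inference time

    # KV cache (the main "intermediate" that persists across tokens), assuming
    # 80 layers, 8 KV heads, head dim 128, fp16, 4096-token context:
    layers, kv_heads, head_dim, ctx = 80, 8, 128, 4096
    kv_bytes = 2 * layers * kv_heads * head_dim * ctx * 2   # K and V, 2 bytes each
    print(f"weights:  {weight_bytes / 1e9:.0f} GB")   # ~140 GB
    print(f"KV cache: {kv_bytes / 1e9:.1f} GB")       # ~1.3 GB

So the constant weights dwarf the per-request state by roughly two orders of magnitude, which is what makes the idea tempting.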

Even modest success on this front seems very attractive to me, because flash storage appears much cheaper and easier to scale than GPU memory right now.

Are there any efforts in this direction? Is this a flawed approach for some reason, or am I fundamentally misunderstanding things?


  👤 sunscream89 Accepted Answer ✓
The bottleneck is bandwidth, not capacity. Typical DDR5 system memory delivers on the order of 50-100 GB/s per socket, and GPU HBM on the order of 1-3 TB/s, whereas even a fast PCIe 4.0/5.0 NVMe SSD tops out around 7-14 GB/s sequential (and much less for random reads). So streaming weights from flash is one to two orders of magnitude slower than reading them from DRAM, and two to three orders slower than GPU memory.
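
That bandwidth gap translates directly into speed because, at batch size 1, essentially every weight has to be read once per generated token, so whatever device streams the weights sets an upper bound on tokens per second. A rough sketch, where the model size and bandwidth figures are illustrative assumptions:

    # Upper bound on decode speed when inference is bound by weight streaming:
    # tokens/sec <= bandwidth / bytes_of_weights_read_per_token.
    model_bytes = 140e9   # e.g. 70B params in fp16 (assumed)

    bandwidths = {
        "GPU HBM (~2 TB/s)":            2.0e12,
        "DDR5 dual-channel (~80 GB/s)": 80e9,
        "PCIe 4.0 NVMe SSD (~7 GB/s)":  7e9,
    }

    for name, bw in bandwidths.items():
        print(f"{name}: at most ~{bw / model_bytes:.2f} tokens/sec")
    # -> roughly 14, 0.6, and 0.05 tokens/sec respectively

Batching, quantization, or only touching a subset of weights per token can shift these numbers, but the basic bound is why weights are kept in the fastest memory you can afford.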