Personally, I find them good, but very slow (RTX 3050, Mistral 7B) and hard to coax into a consistent output format (JSON, bullet points). GPT-3.5 makes it look like a pointless exercise from a speed and consistency perspective.
Are there any use cases for local LLMs apart from them being local, so we can feed them sensitive documents?
It’s fun running models from HuggingFace on my computer. Finally, something that utilizes my computer’s 64GB of RAM and 24GB of VRAM. It’s neat seeing the immense performance difference between CPU (Ryzen 7 5700X) and GPU (RTX 3090) offloading.
I think as with most “Cloud vs on-prem” arguments, it comes down to cost vs convenience. Building an application on Azure or AWS is as easy as it gets, but if money is scarce, you can’t beat on-prem for raw resources.
I’m writing a program right now that will query GPT-4 with… A LOT of tokens. We project it will cost between $5k and $15k and probably run for around 2 days. *OR* I could feed that same data through a local model running on the RTX 3090, and it’ll cost maybe $20 in electricity and take around 6 days.
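A rough sanity check on that electricity figure, assuming ~400 W of total system draw and typical residential rates of $0.15–0.30/kWh (both are my own ballpark assumptions, not numbers from the comment above):

```python
# Back-of-envelope electricity cost for ~6 days of continuous local inference.
# Assumptions: ~400 W total system draw, $0.15-0.30 per kWh.
watts = 400
hours = 6 * 24
kwh = watts / 1000 * hours            # ~57.6 kWh over the run
for rate in (0.15, 0.30):
    print(f"${kwh * rate:.2f} at ${rate}/kWh")
# Prints roughly $8.64 and $17.28 -- in the ballpark of the $20 estimate.
```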
Right now, I'm prompting Mistral to generate these titles in "clickbait" style. I fold the topic of the message and other context into the prompt.
The intention is for the message to pull my attention toward something else I need to do, because I tend to over-focus on whatever I'm doing at the moment.
It doesn't matter whether what I'm doing at the moment is "good" or "bad". Based on probability, I should almost always switch my attention when I receive such a message because I should have switched an hour ago.
To guarantee consistent JSON output, I use a llama.cpp grammar (converted from a JSON schema).
Generation is via CPU (Ryzen 5800) because it's an async background operation and also because my 1070 GPU is being used by Stable Diffusion XL Turbo to generate the image that goes along with the message.
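For reference, the grammar-constrained, CPU-only setup described above could look roughly like this with the llama-cpp-python bindings. This is a minimal sketch: the model filename, schema, and prompt are placeholders, and using `LlamaGrammar.from_json_schema` (instead of pre-converting the schema to GBNF) is my own choice here.

```python
import json
from llama_cpp import Llama, LlamaGrammar  # pip install llama-cpp-python

# Hypothetical schema for the title generator; the real one would differ.
schema = {
    "type": "object",
    "properties": {"title": {"type": "string"}},
    "required": ["title"],
}
# Build a llama.cpp grammar from the JSON schema so output is always valid JSON.
grammar = LlamaGrammar.from_json_schema(json.dumps(schema))

# n_gpu_layers=0 keeps inference on the CPU, leaving the GPU free for SDXL Turbo.
llm = Llama(
    model_path="mistral-7b-instruct.Q4_K_M.gguf",  # placeholder filename
    n_ctx=4096,
    n_gpu_layers=0,
    verbose=False,
)

prompt = (
    "[INST] Write a clickbait-style title for a reminder about the following "
    "topic, as JSON. Topic: water the plants. [/INST]"
)
out = llm.create_completion(prompt, max_tokens=64, temperature=0.8, grammar=grammar)
print(json.loads(out["choices"][0]["text"]))  # e.g. {"title": "..."}
```

llama.cpp itself also ships a json_schema_to_grammar.py script and a --grammar-file flag on its CLI, if you'd rather do the schema conversion ahead of time.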
For example, let's say some image editing software decided to use Stable Diffusion to fill in image data in one of its Content-Aware tools. They would not tell the user to install and run Ollama or sdapi from the CLI; they would install the models when you install the app and talk to them when you use the app. The end user would never know a model is being run locally, any more than they know DirectX is running. (Some might.)
I like this use case because image/music/video editing software already requires a good CPU/GPU, and in the case of Photoshop, I'm used to my fans blaring when I run Filter Gallery (lol). As the end user, I wouldn't need to know that models are being invoked as I use the software.
I think this use case is a lot stronger than any cloud-based one as long as cloud GPUs stay this expensive. And since the present default is to go with one of the Big 3, anyone looking for cloud AI will end up with OpenAI or another major provider, which in the end means something from Microsoft, Google, etc.
AI seems totally not like a giant bubble.