Having said that, it would be nice to run AI models locally so I could use them with sensitive information, or fine-tune them for my use cases, without worrying about data leaks.
Is anyone here on HN running models locally? What hardware do you recommend getting (I prefer a dedicated machine, as it can also double as a homelab/home server, so no MacBook/Mac mini recommendations please)? Is it worth it in terms of cost? Would you recommend owning the hardware, renting it, or colocating your own hardware in a datacenter?
I started with a local system running llama.cpp on CPU alone, and for short questions and answers it was OK for me. Because (in 2023) I didn't know if LLMs would be any good, I chose cheap components: https://news.ycombinator.com/item?id=40267208.
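For reference, a CPU-only setup like that can be driven from Python via llama-cpp-python. A minimal sketch; the model path, thread count, and context size below are placeholders, not my actual config:

```python
# Minimal CPU-only inference sketch using llama-cpp-python.
# Assumes you already downloaded a GGUF model; path and settings are illustrative.
from llama_cpp import Llama

llm = Llama(
    model_path="models/llama-3.2-3b-instruct-q4_k_m.gguf",  # placeholder path
    n_ctx=4096,      # context window
    n_threads=8,     # match your physical CPU cores
    n_gpu_layers=0,  # 0 = pure CPU; raise this once you add a GPU
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize what a homelab is in two sentences."}]
)
print(out["choices"][0]["message"]["content"])
```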
Since AWS was getting pretty expensive, I also bought an RTX 3060 (12 GB), an extra 16 GB of RAM (for a total of 32 GB), and a fast 1 TB M.2 SSD. The total cost of the components was around €620.
Here are some basic LLM performance numbers for my system:
You'll want at least 5 tokens per second for it to be viable as chat. A T2000 with an i7 does ~20 tokens per second on Llama 3.2 3B.
Llama 3.3 70B is about 42 GB as a file, so you'll want 64 GB of RAM minimum before you can even load it. This model does 2 tokens per second. Some hosted solutions are literally 1000x faster, at 2000+ tokens per second on the same model.
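As a rough sanity check on those sizes (a back-of-the-envelope sketch; the ~4.8 bits/weight figure assumes a Q4_K_M-style quant and ignores the KV cache and runtime overhead):

```python
# Back-of-the-envelope size estimate for a quantized 70B model.
# Assumption: ~4.8 bits per weight (typical of a Q4_K_M-style GGUF quant).
params = 70e9
bits_per_weight = 4.8
file_size_gb = params * bits_per_weight / 8 / 1e9
print(f"~{file_size_gb:.0f} GB on disk")  # ~42 GB, matching the figure above

# You need at least that much RAM just to hold the weights, plus room for
# the KV cache and the OS, hence the 64 GB minimum.
```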
My opinion is that local inference is useful, but the larger models don't operate fast enough for you to iterate on your problems and tasks. Start small and use the larger models over lunch breaks when the smaller models stop solving your problems.
Depending on how sensitive your data is to your org, read the Terms of Use for some of the AI providers; not all of them mine every chunk of data. DeepSeek vs. Cerebras would be a good comparison.
Alternatively, you might be able to solve your problems with data that is representative of the original data. Instead of looking at payroll data directly, for example, generate similar data and use that to develop your solution. It's not always possible, but it might let you use a larger cloud-hosted model without leaking that super-private data.
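Something like this can generate stand-in payroll records (a quick sketch using the Faker library; the field names and salary ranges are made up, not tied to any real schema):

```python
# Generate synthetic payroll-like records to develop against,
# so the real data never leaves your environment.
# Field names and salary ranges are illustrative assumptions.
import csv
import random
from faker import Faker

fake = Faker()

with open("synthetic_payroll.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["employee_id", "name", "department", "annual_salary"])
    for i in range(1000):
        writer.writerow([
            1000 + i,
            fake.name(),
            random.choice(["Engineering", "Sales", "HR", "Finance"]),
            random.randint(40_000, 180_000),
        ])
```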
Stuff like Llama is okay, but the quality doesn't match Claude 3 Sonnet, much less 3.5. The delay and setup time can be quite significant. In the end, it's just more worth it to pay the $0.07 or so to run it through a proprietary model.
And something like Cursor has .cursorrules, privacy modes, and its own enterprise-grade offerings, so privacy is not as much of an issue, but you've got to do the homework to see if it works for you.
The killer feature for locally hosted AI seems to be that it's consistent. No sudden patches breaking everything, no paranoia when quality spikes after demo day and then drops 3 months later. No worrying that prices will jump from $20 to $200/month.
AWS Bedrock is also pretty good and secure if you want to use AWS, and it comes with a suite of other tools like evaluation and guardrails.
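Calling it from boto3 looks roughly like this (a hedged sketch: the model ID and guardrail identifiers are placeholders, and the guardrailConfig block is optional):

```python
# Minimal Bedrock invocation sketch via boto3's Converse API.
# Model ID, region, and guardrail identifiers are placeholders.
import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")

response = client.converse(
    modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",  # placeholder model ID
    messages=[{"role": "user", "content": [{"text": "Classify this ticket: ..."}]}],
    inferenceConfig={"maxTokens": 512, "temperature": 0.2},
    guardrailConfig={  # optional: attach a pre-configured guardrail
        "guardrailIdentifier": "your-guardrail-id",
        "guardrailVersion": "1",
    },
)

print(response["output"]["message"]["content"][0]["text"])
```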
I am running several LLMs behind a web app locally (Django framework + Ollama for the LLMs). One of the main constraints for now is the limited context window (though that is getting better with the newer models). The web app puts data safety first (IP and sensitive data). A good way to feed the local agents is Markdown-formatted data (PDF to Markdown using pymupdf4llm). I am running it on my work laptop (M2 Pro), but I am looking forward to testing it on an AMD Ryzen AI Max "Strix Halo" machine because of its unified memory. Mainstream GPUs have a frustratingly limited amount of VRAM.
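The PDF-to-Markdown step plus a local Ollama call looks roughly like this (a sketch; the PDF path and model name are placeholders):

```python
# Convert a PDF to Markdown with pymupdf4llm and feed it to a local Ollama model.
# File path and model name are placeholders.
import pymupdf4llm
import ollama

md_text = pymupdf4llm.to_markdown("contract.pdf")  # Markdown keeps headings/tables readable

response = ollama.chat(
    model="llama3.2",
    messages=[{
        "role": "user",
        "content": f"Summarize the key obligations in this document:\n\n{md_text}",
    }],
)
print(response["message"]["content"])
```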
Best