Having said that, it would be nice to run AI models locally so I could use them with sensitive information, or fine-tune them for my use cases, without worrying about data leaks.
Is anyone here on HN running models locally? What hardware do you recommend getting (I prefer a dedicated machine, as it can also double as a homelab/home server, so no MacBook/Mac mini recommendations please)? Is it worth it in terms of cost? Would you recommend owning the hardware, renting it, or colocating your own hardware in a datacenter?
I started with a local system running llama.cpp on CPU alone, and for short questions and answers it was OK for me. Because (in 2023) I didn't know if LLMs would be any good, I chose cheap components: https://news.ycombinator.com/item?id=40267208.
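For reference, a CPU-only setup like that can be driven from Python via llama-cpp-python. A minimal sketch; the model path, thread count, and context size below are placeholders, not my actual config:

```python
# Minimal CPU-only inference sketch using llama-cpp-python.
# Assumes you already downloaded a GGUF model; path and settings are illustrative.
from llama_cpp import Llama

llm = Llama(
    model_path="models/llama-3.2-3b-instruct-q4_k_m.gguf",  # placeholder path
    n_ctx=4096,      # context window
    n_threads=8,     # match your physical CPU cores
    n_gpu_layers=0,  # 0 = pure CPU; raise this once you add a GPU
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize what a homelab is in two sentences."}]
)
print(out["choices"][0]["message"]["content"])
```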
Since AWS was getting pretty expensive, I also bought an RTX 3060 (12 GB), an extra 16 GB of RAM (for a total of 32 GB), and a fast 1 TB M.2 SSD. The total cost of the components was around €620.
Here are some basic LLM performance numbers for my system:
You'll want at least 5 tokens per second for it to be viable as chat. A T2000 with an i7 does ~20 tokens per second on Llama 3.2 3B.
Llama 3.3 70B is about 42 GB as a file, so you'll want 64 GB of RAM minimum before you can even load it. This model does 2 tokens per second. Some hosted solutions are literally 1000x faster, at 2000+ tokens per second on the same model.
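As a rough sanity check on those sizes (a back-of-the-envelope sketch; the ~4.8 bits/weight figure assumes a Q4_K_M-style quant and ignores the KV cache and runtime overhead):

```python
# Back-of-the-envelope size estimate for a quantized 70B model.
# Assumption: ~4.8 bits per weight (typical of a Q4_K_M-style GGUF quant).
params = 70e9
bits_per_weight = 4.8
file_size_gb = params * bits_per_weight / 8 / 1e9
print(f"~{file_size_gb:.0f} GB on disk")  # ~42 GB, matching the figure above

# You need at least that much RAM just to hold the weights, plus room for
# the KV cache and the OS, hence the 64 GB minimum.
```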
My opinion is that local inference is useful, but the larger models don't operate fast enough for you to iterate on your problems and tasks. Start small and use the larger models over lunch breaks when the smaller models stop solving your problems.
Depending on how sensitive your data is to your org, read the Terms of Use for some of the AI providers; not all of them mine every chunk of data. DeepSeek vs. Cerebras would be a good comparison.
Alternatively, you might be able to solve your problems with data that is representative of the original data. Instead of looking at payroll data directly, for example, generate similar data and use that to develop your solution. It's not always possible, but it might let you use a larger cloud-hosted model without leaking that super-private data.
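Something like this can generate stand-in payroll records (a quick sketch using the Faker library; the field names and salary ranges are made up, not tied to any real schema):

```python
# Generate synthetic payroll-like records to develop against,
# so the real data never leaves your environment.
# Field names and salary ranges are illustrative assumptions.
import csv
import random
from faker import Faker

fake = Faker()

with open("synthetic_payroll.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["employee_id", "name", "department", "annual_salary"])
    for i in range(1000):
        writer.writerow([
            1000 + i,
            fake.name(),
            random.choice(["Engineering", "Sales", "HR", "Finance"]),
            random.randint(40_000, 180_000),
        ])
```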
Stuff like Llama is okay, but the quality doesn't match Claude 3 Sonnet, much less 3.5. The delay and setup time can be quite significant. In the end, it's just more worth it to pay the $0.07 or so to run it through a proprietary model.
And something like Cursor has .cursorrules, privacy modes, and its own enterprise-grade offerings, so privacy is not as much of an issue, but you've got to do the homework to see if it works for you.
The killer feature for locally hosted AI seems to be that it's consistent. No sudden patches breaking everything, no paranoia when quality spikes after demo day and then drops 3 months later. No worrying that prices will jump from $20 to $200/month.
AWS Bedrock is also pretty good and secure if you want to use AWS, and it comes with a suite of other tools like evaluation and guardrails.
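Calling it from boto3 looks roughly like this (a hedged sketch: the model ID and guardrail identifiers are placeholders, and the guardrailConfig block is optional):

```python
# Minimal Bedrock invocation sketch via boto3's Converse API.
# Model ID, region, and guardrail identifiers are placeholders.
import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")

response = client.converse(
    modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",  # placeholder model ID
    messages=[{"role": "user", "content": [{"text": "Classify this ticket: ..."}]}],
    inferenceConfig={"maxTokens": 512, "temperature": 0.2},
    guardrailConfig={  # optional: attach a pre-configured guardrail
        "guardrailIdentifier": "your-guardrail-id",
        "guardrailVersion": "1",
    },
)

print(response["output"]["message"]["content"][0]["text"])
```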
I am running several LLMs behind a web app locally (Django framework + Ollama for the LLMs). One of the main constraints for now is the limited context window (though that is getting better with the newer models). The web app puts data safety first (IP and sensitive data). A good way to feed the local agents is Markdown-formatted data (PDF to Markdown using pymupdf4llm). I am running it on my work laptop (M2 Pro), but I am looking forward to testing it on an AMD Ryzen AI Max "Strix Halo" machine because of its unified memory. Mainstream GPUs have a frustratingly limited amount of VRAM.
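The PDF-to-Markdown step plus a local Ollama call looks roughly like this (a sketch; the PDF path and model name are placeholders):

```python
# Convert a PDF to Markdown with pymupdf4llm and feed it to a local Ollama model.
# File path and model name are placeholders.
import pymupdf4llm
import ollama

md_text = pymupdf4llm.to_markdown("contract.pdf")  # Markdown keeps headings/tables readable

response = ollama.chat(
    model="llama3.2",
    messages=[{
        "role": "user",
        "content": f"Summarize the key obligations in this document:\n\n{md_text}",
    }],
)
print(response["message"]["content"])
```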
Best