I am a medical student with thousands of PDFs, various Anki databases, video conferences, audio recordings, Markdown notes, etc. It can query all of them and return extremely high-quality output with sources for each original document.
It's still in alpha, though, and there's only 0.5 users besides me that I know of, so there are bugs that have yet to be found!
https://github.com/Azure-Samples/rag-postgres-openai-python
I'd like to make that version when I have the time, probably just using LlamaIndex for the ingestion.
My tips for getting SLMs working well for RAG: http://blog.pamelafox.org/2024/08/making-ollama-compatible-r...
I have a few tabs open that I haven't had a chance to try:
https://github.com/Mintplex-Labs/anything-llm
It provides APIs to extract paragraphs or tables from your PDFs in bulk. You can also separately do bulk labeling (say, classification, NER, and other labeling types). Once you have a knowledge database, it creates four indexes on top of your JSON data layer: a DB index for metadata search, a full-text search index, an annotation index, and a vector index, so you can perform any search operation, including hybrid search.
Because your data layer is in JSON, you have infinite flexibility to add new snippets of knowledge or new labels and improve accuracy over time.
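To make the hybrid-search idea above concrete, here is a minimal sketch of blending a vector index with a full-text score over JSON records. Everything in it is an assumption for illustration: the record shape (`text` and `embedding` fields), the scoring functions, and the `alpha` weighting are made up, not part of any particular tool's API.

```python
import math

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def keyword_score(query, text):
    # Crude stand-in for a full-text index: fraction of query terms present.
    terms = query.lower().split()
    hits = sum(1 for t in terms if t in text.lower())
    return hits / len(terms) if terms else 0.0

def hybrid_search(records, query, query_embedding, alpha=0.5, top_n=5):
    """Blend vector similarity and keyword overlap; alpha weights the vector side."""
    scored = []
    for rec in records:
        score = (alpha * cosine(query_embedding, rec["embedding"])
                 + (1 - alpha) * keyword_score(query, rec["text"]))
        scored.append((score, rec))
    scored.sort(key=lambda s: s[0], reverse=True)
    return [rec for _, rec in scored[:top_n]]

# Toy JSON-style records with (tiny) precomputed embeddings.
records = [
    {"text": "Cardiology lecture notes on arrhythmia", "embedding": [1.0, 0.0]},
    {"text": "Anki deck export: pharmacology", "embedding": [0.0, 1.0]},
]
top = hybrid_search(records, "arrhythmia notes", [0.9, 0.1], top_n=1)
```

In practice the keyword side would be a real full-text index (Postgres `tsvector`, Elasticsearch, etc.) and the embeddings would come from a model, but the blending logic stays this simple.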
Not surprising!
The LLM itself is the least important bit as long as it’s serviceable.
Depending on your goal you need to have a specific RAG strategy.
How are you breaking up the documents? Are the documents consistently formatted to make breaking them up uniform? Do you need to do some preprocessing to make them uniform?
When you retrieve documents, how many do you stuff into your prompt as context?
Do you stuff the same top-N chunks from a single prompt, or do you have a tailored prompt chain retrieving different resources based on the prompt and desired output?
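The questions above can be sketched in code: a fixed-size chunker with overlap, and a prompt builder that stuffs the top-N retrieved chunks. The chunk size, overlap, and prompt template are all assumptions for illustration; real pipelines usually split on structure (headings, paragraphs) rather than raw character counts.

```python
def chunk_document(text, chunk_size=500, overlap=50):
    """Split text into overlapping fixed-size chunks (character-based)."""
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
    return chunks

def build_prompt(question, retrieved_chunks, top_n=3):
    """Stuff the top-N retrieved chunks into a simple RAG prompt."""
    context = "\n---\n".join(retrieved_chunks[:top_n])
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

# A 1200-character document with step 450 yields chunks at offsets 0, 450, 900.
chunks = chunk_document("A" * 1200, chunk_size=500, overlap=50)
prompt = build_prompt("What is covered?", chunks, top_n=2)
```

Tuning `top_n` is exactly the trade-off the comment raises: too few chunks and the answer lacks evidence, too many and the relevant passage gets diluted.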
You can see the project page here: https://textualization.com/ragged/
src and scripts here: https://github.com/Textualization/the-ragged-edge-box
[1] video presentation about the project https://www.youtube.com/watch?v=_fJFuL2pLvw
I uploaded them through Supabase Embeddings Generator if you're curious. https://github.com/supabase/embeddings-generator
But things got a bit messy when I handed it off to someone else. They started using synonyms for locations, like abbreviated addresses, to refer to certain columns, which didn't return the right documents.
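One common fix for that failure mode is to normalize the query before retrieval, expanding known abbreviations to the canonical terms the documents actually use. A hedged sketch, where the abbreviation table and helper name are made up for illustration:

```python
# Hypothetical abbreviation table; in practice this would be built from
# the vocabulary your documents actually use.
ABBREVIATIONS = {
    "st": "street",
    "ave": "avenue",
    "bldg": "building",
}

def normalize_query(query):
    """Expand known abbreviations so retrieval sees canonical terms."""
    tokens = []
    for tok in query.lower().split():
        stripped = tok.rstrip(".")
        tokens.append(ABBREVIATIONS.get(stripped, tok))
    return " ".join(tokens)

normalize_query("123 Main St. bldg 4")  # -> "123 main street building 4"
```

The same idea scales up to query expansion (embedding both the raw and normalized query, or asking an LLM to rewrite the query) when a static lookup table isn't enough.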
Followed a friend's suggestion to try NotebookLM, so I uploaded the same docs there, and it was awesome. Some cloud-hosted vector DB tools only handle PDFs, but NotebookLM accepted my Markdown and chunked the docs better than the Supabase library I was using. It just "worked".
I would swap over to NotebookLM because their document chunking and RAG performance is working for my use case, but they just don’t offer an API yet.
I also gave Gemini a shot using this guide, but didn’t get the results I was hoping for. https://codelabs.developers.google.com/multimodal-rag-gemini...
Am I overhyping NotebookLM? I'd love to know how to get on-par document chunking, because that seems to deliver fantastic RAG right out of the box. I'm planning to try some other suggestions I've seen here, but any insights into how NotebookLM does its magic would be super helpful.
Here is that thread. https://news.ycombinator.com/item?id=41981907
LangChain and LlamaIndex have good resources on building such a pipeline, last I checked.
Helpful for building a scalable, local RAG solution tailored to your group's needs; plus, it's open-source-friendly, if I'm correct.