My approach is to stuff as many documents as possible directly into the context. The context windows of frontier models are large enough for my use case of ~20-40 documents: 128K tokens for gpt-4o, 200K for o1/o3, and 1M for Gemini.
When stuffing them all into one query isn't possible, I split the documents across multiple queries and aggregate the answers, roughly as in the sketch below.
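A minimal sketch of that split-and-aggregate fallback, assuming the OpenAI Python SDK; the model name, batch size, and prompts are placeholders, and the document list is hypothetical:

```python
# Sketch: stuff a batch of documents into one prompt; if there are too many,
# split them across several queries and merge the partial answers.
# Assumes the OpenAI Python SDK (v1); `docs` is a hypothetical list of document strings.
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o"  # any large-context chat model

def ask(question: str, docs: list[str]) -> str:
    """Ask a question over a batch of documents stuffed directly into the context."""
    context = "\n\n---\n\n".join(docs)
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": "Answer using only the provided documents."},
            {"role": "user", "content": f"Documents:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return resp.choices[0].message.content

def ask_split(question: str, docs: list[str], batch_size: int = 10) -> str:
    """Fallback: query each batch of documents separately, then aggregate the answers."""
    partials = [
        ask(question, docs[i:i + batch_size])
        for i in range(0, len(docs), batch_size)
    ]
    merge_prompt = "Combine these partial answers into one answer:\n\n" + "\n\n".join(partials)
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": merge_prompt}],
    )
    return resp.choices[0].message.content
```

For example, `ask_split("Based on these documents, infer the industries where X technique may be useful", docs)` issues one query per batch and a final aggregation query.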
I've tried RAG, but matching query embeddings to chunk embeddings isn't as straightforward as it sounds: I noticed that relevant content was missed even with my modest number of documents. Semantic matching on query embeddings is one level above dumb keyword matching but one level below direct queries to LLMs.
Direct LLM queries seem to perform the best, especially when some intermediate understanding is required (like "Based on these documents, infer the industries where X technique may be useful"). That's not possible with simple embedding search unless some of the documents specifically use the umbrella word "industry" or its close synonyms.
Embedding search can probably be improved, e.g. by generating a synthetic answer and matching that answer's embedding to chunk embeddings, but I haven't tried such techniques.
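A rough sketch of that idea (often called HyDE, hypothetical document embeddings), assuming the OpenAI Python SDK and numpy; the model names and the `chunks` list are placeholders, not anything I've actually run:

```python
# Sketch of HyDE-style retrieval: embed a synthetic answer instead of the raw query,
# then match it against chunk embeddings by cosine similarity.
# Assumes the OpenAI Python SDK (v1) and numpy; `chunks` is a hypothetical list of text chunks.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    vecs = np.array([d.embedding for d in resp.data])
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)  # normalize for cosine similarity

def retrieve(query: str, chunks: list[str], k: int = 5) -> list[str]:
    # Generate a short synthetic answer to the query...
    synthetic = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"Write a short passage that would answer: {query}"}],
    ).choices[0].message.content
    # ...and match its embedding (rather than the raw query's) to the chunk embeddings.
    q_vec = embed([synthetic])
    chunk_vecs = embed(chunks)
    scores = (chunk_vecs @ q_vec.T).ravel()  # cosine similarity, since vectors are normalized
    top = np.argsort(-scores)[:k]
    return [chunks[i] for i in top]
```

The retrieved chunks would then be passed to the LLM as context, as in plain RAG; the only change is which text gets embedded on the query side.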
https://products.aspose.net/pdf/chat-gpt/
At a glance, I see it supports some advanced features: automatic detection of multiple languages, and batching of requests to reduce LLM API call frequency and lower operational costs.
Just upload your documents to a OneDrive, SharePoint, or Teams site that you have access to and start asking questions.