HACKER Q&A
📣 spruce_tips

Strategies or tools for embedding multiple file types?


I've worked a good bit with embedding strategies for RAG, but only for documents with an identical structure, e.g. interview transcripts.

I'm curious how others have handled embeddings for multiple file types (txt, pdf, image, docx, ppt, etc.). Obviously, I could handle each file type individually and then build a flexible search layer on top, but I'm concerned about the maintenance burden that would create.

One idea I had was to build a translation layer of sorts that would take some arbitrary file type in, map it onto a standardized text schema, and embed that. For images (which are much less common in my dataset), I would use an LLM to describe the image and cast that text into my standard format. The standard format would allow me to simplify the chunking and embedding logic for each file type, and make the vector search layer a lot easier to maintain.
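Roughly, here's the shape of the translation layer I have in mind. The schema fields and extractor names are just illustrative; real pdf/docx/ppt extractors, and the LLM image-captioning step, would slot into the registry:

```python
# Sketch of a translation layer mapping arbitrary file types onto one
# standardized text schema before chunking/embedding. Extractor names
# and schema fields are hypothetical.
from dataclasses import dataclass, field
from pathlib import Path

@dataclass
class StandardDoc:
    """The standardized text schema every file type maps onto."""
    source: str
    media_type: str
    text: str
    metadata: dict = field(default_factory=dict)

def extract_txt(path: Path) -> str:
    return path.read_text(encoding="utf-8", errors="replace")

def extract_image(path: Path) -> str:
    # Placeholder: ask an LLM to describe the image, return that text.
    raise NotImplementedError("LLM captioning not wired up in this sketch")

# One extractor per extension; adding a new file type is one registry entry.
EXTRACTORS = {
    ".txt": extract_txt,
    ".md": extract_txt,
    ".png": extract_image,
    ".jpg": extract_image,
}

def to_standard(path: str) -> StandardDoc:
    p = Path(path)
    ext = p.suffix.lower()
    if ext not in EXTRACTORS:
        raise ValueError(f"no extractor registered for {ext}")
    return StandardDoc(source=str(p), media_type=ext.lstrip("."), text=EXTRACTORS[ext](p))
```

Downstream, the chunker and embedder only ever see a `StandardDoc`, which is what keeps the search layer simple.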

I know this won't be perfect, but I think it could solve most of what I'm trying to achieve.

---

Curious what others think about this and what you have tried.

Cheers,

spruce_tips


  👤 chiccomagnus Accepted Answer ✓
If you don't want to reinvent the wheel, we have built exactly that. Google "Preprocess".

👤 skeptrune
Strongly recommend using Apache Tika[1] for this. It's the industry standard for text extraction across all the ubiquitous document formats.

You can take the text output from Tika, chunk it with something like Chonkie[2], and embed it for your search index.

[1] https://tika.apache.org/

[2] https://chonkie.ai/
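A minimal version of that pipeline might look like the sketch below. The naive fixed-size splitter is a stand-in for Chonkie, and the Tika call (shown in comments, since it needs `pip install tika` plus a JVM) and the embedding model are left as placeholders:

```python
# Sketch: extract -> chunk -> embed. Chunker is a naive fixed-size
# splitter with overlap, standing in for a real chunking library.
def chunk_text(text: str, max_chars: int = 800, overlap: int = 100) -> list[str]:
    """Split text into fixed-size chunks that overlap by `overlap` chars."""
    chunks, start = [], 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - overlap
    return chunks

def index_file(path: str) -> list[str]:
    # Real extraction step with tika-python would be:
    #   from tika import parser
    #   text = parser.from_file(path).get("content") or ""
    text = open(path, encoding="utf-8", errors="replace").read()  # stand-in
    chunks = chunk_text(text)
    # embed(chunks) and upsert into your vector index here (placeholder).
    return chunks
```

The overlap keeps sentences that straddle a chunk boundary retrievable from at least one chunk.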