I'm curious how others have thought about handling embeddings for multiple file types (txt, pdf, image, docx, ppt, etc.). Obviously I could handle each file type individually and then build a flexible search layer on top, but I'm concerned about the level of maintenance that would require.
One idea I had was to build a translation layer of sorts that would take some arbitrary file type in, map it onto a standardized text schema, and embed that. For images (which are much less common in my dataset), I would use an LLM to describe the image and cast that text into my standard format. The standard format would allow me to simplify the chunking and embedding logic for each file type, and make the vector search layer a lot easier to maintain.
I know this won't be perfect, but I think it could solve most of what I'm trying to achieve.
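To make the idea concrete, the translation layer can be sketched as a dispatch table from file extension to an extractor that produces one standardized record. Everything here is illustrative, not a fixed design: the schema fields, the `to_standard` name, and the commented-out Tika/LLM hooks are all assumptions; only the `.txt`/`.md` path is implemented.

```python
from dataclasses import dataclass, field
from pathlib import Path

# Hypothetical standardized schema -- field names are illustrative.
@dataclass
class StandardDoc:
    source: str                              # original file path
    media_type: str                          # "txt", "pdf", "image", ...
    text: str                                # normalized plain text (or an LLM caption for images)
    metadata: dict = field(default_factory=dict)

def _extract_plain(path: Path) -> str:
    """Trivial extractor for plain-text formats."""
    return path.read_text(encoding="utf-8", errors="replace")

# Registry mapping extensions to extractors. Only plain text is wired up;
# the commented entries show where real extractors would plug in.
EXTRACTORS = {
    ".txt": _extract_plain,
    ".md": _extract_plain,
    # ".pdf": extract_with_tika,     # assumption: wraps a PDF/Office parser
    # ".png": caption_with_llm,      # assumption: LLM describes the image as text
}

def to_standard(path: str) -> StandardDoc:
    """Map an arbitrary file onto the standardized schema, or fail loudly."""
    p = Path(path)
    extractor = EXTRACTORS.get(p.suffix.lower())
    if extractor is None:
        raise ValueError(f"no extractor registered for {p.suffix!r}")
    return StandardDoc(source=str(p), media_type=p.suffix.lstrip(".").lower(), text=extractor(p))
```

The payoff is that chunking, embedding, and search only ever see `StandardDoc`, so adding a new file type means registering one extractor rather than touching the search layer.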
Curious what others think about this and what you have tried.
Cheers,
spruce_tips
---

Apache Tika already handles text extraction for most of those formats (pdf, docx, ppt). You can take the text output from Tika, chunk it with something like Chonkie[2], and embed it for your search index.
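A minimal sketch of that pipeline, under stated assumptions: the Tika call is shown only as a comment (the tika-python client talks to a Java-based Tika server), and a plain sliding word window stands in for Chonkie's chunker; `embed` is a caller-supplied stub.

```python
from pathlib import Path

def chunk_words(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into windows of `size` words, each sharing `overlap` words
    with the previous window -- a stand-in for a real chunking library."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size])
            for i in range(0, max(len(words) - overlap, 1), step)]

def index_file(path: str, embed) -> list[tuple[str, list[float]]]:
    """Extract -> chunk -> embed one file; returns (chunk, vector) pairs."""
    # from tika import parser                    # assumption: tika-python client
    # text = parser.from_file(path)["content"]   # Tika returns extracted text here
    text = Path(path).read_text(encoding="utf-8")  # plain-text fallback for the sketch
    return [(chunk, embed(chunk)) for chunk in chunk_words(text)]
```

The overlap keeps a little shared context between adjacent chunks so a query matching text near a boundary still retrieves a chunk that contains its surroundings.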