HACKER Q&A
📣 tugcan

Seeking advice on designing a personal document server


I am thinking about building a project for personal use. The idea can get complicated, so I wanted to ask your opinions on how to tackle this. I wanted to give more information such as caveats and diagrams, but putting everything on a blog page felt like advertising, which is not the point.

tldr: I want to build a service that accepts documents of varying types, extracts information from them, translates it, and makes it searchable.

The reason: I am Turkish-Georgian and living in Poland. Including English, there are at least four languages (actually a few more) that I deal with. I scan all my documents—such as invoices, doctor’s papers, and governmental documents. Sometimes, when I search for a document and don’t remember what language it’s in, searching without searchable text (scanned files) is difficult. I’d like to build a service where I upload a document (PDF, text, image), the system performs OCR, translates results into multiple languages, and extracts key info like the date (from OCR, not created_at). I also want to group documents. For example, I have many family history files in different languages; when I search for something, I want the system to return the group, not just one page. Ideally, it should allow nested groups.

There may already be software that does much of this, but I also see it as a way to learn more about system design and language learning. It could later support photos (metadata, GPS, faces), notes, to-dos, and tagged content (like ROMs).

What I have in mind (Assume the frontend exists):

Upload scenario: I upload a folder. The system stores files in /uploads, registers them in an unprocessed database, and assigns a job. The endpoint returns the job ID. The system runs OCR, detects language, and extracts title, language, and date. The frontend polls the job ID and, once done, displays results for review and grouping. When I submit, the system translates content and indexes it for search.
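
The upload flow above is essentially a small state machine per job. A minimal sketch, assuming hypothetical state and event names (`JobState`, `JobEvent`, and their variants are made up for illustration and come from no library):

```rust
/// Hypothetical job states for the upload flow; names are illustrative only.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum JobState {
    /// Files are stored in /uploads and registered as unprocessed.
    Queued,
    /// OCR, language detection, and title/date extraction are running.
    Extracting,
    /// Results are ready; waiting for the user to review and group them.
    AwaitingReview,
    /// The user submitted; translation and indexing are running.
    Indexing,
    Done,
    Failed(String),
}

/// Hypothetical events that move a job forward.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum JobEvent {
    WorkerStarted,
    ExtractionFinished,
    ReviewSubmitted,
    IndexingFinished,
    Error(String),
}

impl JobState {
    /// Returns the next state, or None if the event is not valid in this state.
    pub fn next(&self, event: JobEvent) -> Option<JobState> {
        match (self, event) {
            // Terminal states accept no further events.
            (JobState::Done, _) | (JobState::Failed(_), _) => None,
            // Any running state can fail.
            (_, JobEvent::Error(msg)) => Some(JobState::Failed(msg)),
            (JobState::Queued, JobEvent::WorkerStarted) => Some(JobState::Extracting),
            (JobState::Extracting, JobEvent::ExtractionFinished) => Some(JobState::AwaitingReview),
            (JobState::AwaitingReview, JobEvent::ReviewSubmitted) => Some(JobState::Indexing),
            (JobState::Indexing, JobEvent::IndexingFinished) => Some(JobState::Done),
            _ => None,
        }
    }

    /// What the polling endpoint could report: can the frontend show the review screen yet?
    pub fn ready_for_review(&self) -> bool {
        matches!(self, JobState::AwaitingReview)
    }
}
```

Keeping the review step as its own state means translation and indexing only start after the user has confirmed the extracted title, language, and date.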

Search scenario: I search for something like “glucose” to find my blood results. The system searches translations and returns files whose translations contain “glucose.” If any belong to a group, include that info. Nice-to-have: highlight the word in the preview image.
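
To make the expected result shape concrete, here is a rough in-memory sketch of searching across translations and returning the group along with each hit. It is only a stand-in for a real index (PostgreSQL tsvector or Tantivy); the `Doc`, `Hit`, and `search` names are hypothetical, and matching is exact whole-word, with no stemming.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// A document with its text in every language it exists in.
pub struct Doc {
    pub id: u32,
    /// Hypothetical group path, outermost group first, e.g. ["health", "2024"].
    pub group: Option<Vec<String>>,
    /// Language code -> text (original OCR text plus translations).
    pub texts: BTreeMap<String, String>,
}

#[derive(Debug, PartialEq, Eq)]
pub struct Hit {
    pub doc_id: u32,
    pub group: Option<Vec<String>>,
    /// Languages whose text contained the term, in alphabetical order.
    pub matched_langs: Vec<String>,
}

/// Lowercases the text and splits it on non-alphanumeric characters.
fn tokens(text: &str) -> BTreeSet<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(|t| t.to_lowercase())
        .collect()
}

/// Returns every document where at least one language version contains `term` as a whole word.
pub fn search(docs: &[Doc], term: &str) -> Vec<Hit> {
    let needle = term.to_lowercase();
    docs.iter()
        .filter_map(|d| {
            let matched: Vec<String> = d
                .texts
                .iter()
                .filter(|(_, text)| tokens(text.as_str()).contains(&needle))
                .map(|(lang, _)| lang.clone())
                .collect();
            if matched.is_empty() {
                None
            } else {
                Some(Hit {
                    doc_id: d.id,
                    group: d.group.clone(),
                    matched_langs: matched,
                })
            }
        })
        .collect()
}
```

Returning the matched languages alongside the group would let the frontend pick which translation to show in the preview.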

Technologies I’m thinking of using (everything runs locally, except maybe storage):

  - Use Rust. I’m learning it and want to improve by avoiding languages I’m already competent at.
    - axum for the API framework
    - tokio-cron-scheduler for maintenance tasks
    - SQLx for migrations
    - Maybe Lapin for background tasks
  - RabbitMQ for background jobs
  - PostgreSQL with tsvector or Tantivy for search
  - File extraction: Apache Tika
  - OCR: Tesseract
  - Duplication detection:
     - Files: SHA-256
     - Content: SimHash
     - Images: ImageHash
  - Language detection: Lingua
  - Translation: Argos Translate
  - Storage: local and/or S3 (MinIO on two servers)
Frontend can be anything; let’s ignore that for now. Ideally, the system runs via docker/podman compose.
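
For the content-level duplicate detection in the list above, this is roughly how SimHash works: hash every token, let each token vote on each of the 64 bits, and keep the majority. A minimal sketch, assuming whitespace tokenization and FNV-1a as the token hash (both are my simplifications; a real version would likely use word shingles and weights):

```rust
/// FNV-1a (64-bit), used here only because it is tiny and deterministic;
/// any stable 64-bit hash would do.
pub fn fnv1a64(bytes: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf29ce484222325;
    for b in bytes {
        h ^= *b as u64;
        h = h.wrapping_mul(0x100000001b3);
    }
    h
}

/// 64-bit SimHash fingerprint over lowercased, whitespace-separated tokens.
pub fn simhash(text: &str) -> u64 {
    let mut counts = [0i32; 64];
    for token in text.split_whitespace() {
        let h = fnv1a64(token.to_lowercase().as_bytes());
        // Each token votes +1 for every bit set in its hash and -1 otherwise.
        for (bit, count) in counts.iter_mut().enumerate() {
            if (h >> bit) & 1 == 1 {
                *count += 1;
            } else {
                *count -= 1;
            }
        }
    }
    // A fingerprint bit is 1 where the positive votes won.
    let mut fingerprint = 0u64;
    for (bit, count) in counts.iter().enumerate() {
        if *count > 0 {
            fingerprint |= 1u64 << bit;
        }
    }
    fingerprint
}

/// Number of differing bits; a small distance suggests near-duplicate content.
pub fn hamming(a: u64, b: u64) -> u32 {
    (a ^ b).count_ones()
}
```

Two OCR passes of the same page should land within a few bits of each other, but the threshold for calling something a duplicate would have to be tuned on real documents.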

A few caveats:

I want to keep as little information as possible in databases.

  **Reasons:**
    - I’d like to manually change files and rerun maintenance to let the system update itself.
    - Migration should be easy: copy files, start the service, and let it rebuild the database.
  
However, I still need to store title, group, and language info somewhere. I thought of using a dotfile per folder with group data and hashes, but then the system would have two sources of truth. Renaming files (e.g., 0001..000n) could help but complicates grouping, especially with nested groups.
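
One way to avoid a second source of truth for grouping, at least, would be to derive the nested group purely from the folder path, so the database only caches what the file tree already says. A minimal sketch, assuming a hypothetical helper `group_of` and a layout where each folder level is one group level (title and language would still need a home somewhere):

```rust
use std::path::{Component, Path};

/// Returns the nested group of a file, derived only from its folders
/// relative to `root` (outermost group first).
/// Returns None if the file is not under `root`.
pub fn group_of(root: &Path, file: &Path) -> Option<Vec<String>> {
    let relative = file.strip_prefix(root).ok()?;
    let parent = relative.parent()?;
    Some(
        parent
            .components()
            .filter_map(|c| match c {
                // Keep plain folder names; skip anything like "." or "..".
                Component::Normal(name) => Some(name.to_string_lossy().into_owned()),
                _ => None,
            })
            .collect(),
    )
}
```

With this, a file at `data/family/history/page1.pdf` under the root `data` belongs to the group `family` > `history`, and a file directly in the root gets an empty group. Moving a file by hand and rerunning maintenance would then be enough to regroup it.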

I may also add tags. For example, if I add ROMs, I might want to search “Final Fantasy” across “roms,” “gba,” “pc,” “psp,” etc. This affects how I choose databases and data structures.
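
The tag search I have in mind looks roughly like this: match on the title, restricted to items carrying at least one of the given tags. A small in-memory sketch with hypothetical names (`Item`, `search_tagged`); a real version would push this into whichever database or index ends up being used:

```rust
use std::collections::BTreeSet;

pub struct Item {
    pub title: String,
    pub tags: BTreeSet<String>,
}

/// Titles containing `query` (case-insensitive) among items that carry
/// at least one of `tags`. An empty `tags` slice means no tag restriction.
pub fn search_tagged<'a>(items: &'a [Item], query: &str, tags: &[&str]) -> Vec<&'a str> {
    let q = query.to_lowercase();
    items
        .iter()
        .filter(|item| tags.is_empty() || tags.iter().any(|t| item.tags.contains(*t)))
        .filter(|item| item.title.to_lowercase().contains(&q))
        .map(|item| item.title.as_str())
        .collect()
}
```

Because tags are a flat set per item while groups are a tree, the two probably want separate storage even if they end up in the same database.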

I’m open to any suggestions. Thank you!


  👤 incomingpain Accepted Answer ✓
I did this in the past for cybersecurity. I tried all kinds of apps, like Obsidian, OneNote, Evernote, Etherpad, DokuWiki, and OpenNote.

The reality is that none of them are particularly good. Obsidian is my current solution, but I don't like it.

What I really need to give a chance is LlamaIndex: https://developers.llamaindex.ai/python/framework/#use-cases

Obviously I'd need some time to customize it for my needs, but it sure does seem to be what I want.