HACKER Q&A
📣 babyent

How do I support multiple file versioning (any file type) using diffs?


Hi Friends.

I'm trying to do multi-version support for my users so they can upload the same file (same name) and have a version history.

For example, user uploads "Site Details Oct-2024.docx". They make some changes, upload the file again. Instead of giving them an option to Keep/Replace, I'd like for them to be able to have a version history that they can view/revert/etc. Less clutter, better organization, and most importantly it will make it easier for them to see what actually changed.

Right now I'm doing it the naive way: uploading each updated file and storing every copy as a distinct file. The downside is that storage grows with the number of revisions. If they have a 2 MB file with 5 revisions, my current approach will use up 10 MB.

I have no idea how this stuff works, so I did some googling and also asked ChatGPT.

I found two tools called xdelta and bsdiff. Has anyone used them before? It seems like neither is actively maintained.

If you know any other way to do this, please let me know as well.

Text files are pretty straightforward, but for formats like PDF, DOCX, etc., I'm not sure how to do this.

I'd be happy if there are node/go/python bindings (in that order of preference), but no worries if not; I can figure it out.

Thank you!


  👤 solardev Accepted Answer ✓
Are you using a cloud service or trying to host your own server?

If you're on AWS, for example, there's built-in versioning for S3 buckets: https://docs.aws.amazon.com/AmazonS3/latest/userguide/versio.... GCP and Azure have similar versioning systems for their storage.

To save disk space, the term you probably want is "data deduplication" (https://en.wikipedia.org/wiki/Data_deduplication), and there are various algorithms, filesystems, etc. that can help with that. However, it's somewhat slow and risky, because you have to recombine a final file from fragments spread across different versions. And if a fragment gets corrupted, it could affect not just one version but several or even all.
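
To make the idea concrete, here is a minimal sketch of chunk-level deduplication in Python, using only the standard library. Everything in it (`ChunkStore`, `CHUNK_SIZE`, the file name) is a hypothetical name for illustration, and it assumes fixed-size chunks held in memory. Real deduplicating systems typically use content-defined chunking instead, since a single inserted byte shifts every fixed-size chunk after it.

```python
import hashlib

CHUNK_SIZE = 4096  # hypothetical fixed chunk size, chosen for illustration


class ChunkStore:
    """Toy content-addressed store: each unique chunk is kept only once."""

    def __init__(self):
        self.chunks = {}    # SHA-256 hex digest -> chunk bytes
        self.versions = {}  # (file name, version number) -> list of digests

    def put(self, name, data):
        """Store a new version of `name` and return its version number."""
        digests = []
        for i in range(0, len(data), CHUNK_SIZE):
            chunk = data[i:i + CHUNK_SIZE]
            digest = hashlib.sha256(chunk).hexdigest()
            self.chunks.setdefault(digest, chunk)  # skip chunks already stored
            digests.append(digest)
        version = 1 + sum(1 for (n, _) in self.versions if n == name)
        self.versions[(name, version)] = digests
        return version

    def get(self, name, version):
        """Recombine a version from its chunks."""
        return b"".join(self.chunks[d] for d in self.versions[(name, version)])

    def stored_bytes(self):
        return sum(len(c) for c in self.chunks.values())


# Two versions of a 4-chunk file that differ only in the second chunk.
v1 = b"".join(bytes([i]) * CHUNK_SIZE for i in range(4))
v2 = v1[:CHUNK_SIZE] + b"X" * CHUNK_SIZE + v1[2 * CHUNK_SIZE:]

store = ChunkStore()
store.put("report.bin", v1)
store.put("report.bin", v2)

# 5 unique chunks are stored instead of the 8 a naive copy-per-version needs.
print(store.stored_bytes(), "bytes stored vs", len(v1) + len(v2), "naive")
```

The `get` method also shows where the risk comes from: every version is rebuilt from shared chunks, so losing one chunk can break every version that references it.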

Disk space is generally cheap these days, and it might not be worth sacrificing performance and integrity just to save a few cents per user.

Can you just give them a disk space limit and charge them for additional space if they need it?


👤 examango
What are you thinking? It's the business that matters, not the technical details. The program runs as long as it runs. What do you care how it runs? Expand the business first, and when you're making money, hire a programmer who can work out the technical details.

👤 zahlman
>I'm trying to do multi-version support for my users so they can upload the same file (same name) and have a version history.

>I have no idea how this stuff works

>I'd be happy if there are node/go/python bindings (in that order of preference)

When you write code in those languages, over time you accumulate multiple versions of it in much the same way, yes? I assume you use a tool such as Git, Mercurial (hg), SVN etc. to manage this.

As you say, text files (like your source code) are pretty straightforward. But these tools can generally be adapted to handle binary files as well - see for example https://stackoverflow.com/questions/3601672 and https://stackoverflow.com/questions/9478023.

xdelta and bsdiff are implementations of actual diffing algorithms - they compare two similar versions of a file and create a representation of what changed.
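
As a rough illustration of what such a "representation of what changed" can look like, here is a small Python sketch built on the standard library's `difflib.SequenceMatcher`. It is not how xdelta or bsdiff work internally, and `make_delta` / `apply_delta` are hypothetical helper names. It only shows the general shape of a delta: "copy this range from the old file" plus "insert these new bytes". `SequenceMatcher` is also far too slow for large files, so treat this as a toy.

```python
from difflib import SequenceMatcher


def make_delta(old: bytes, new: bytes):
    """Return a list of operations that rebuild `new` from `old`."""
    ops = []
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))      # reuse bytes old[i1:i2]
        elif tag in ("replace", "insert"):
            ops.append(("data", new[j1:j2]))  # literal bytes only in `new`
        # "delete" emits nothing: those old bytes are simply not copied
    return ops


def apply_delta(old: bytes, ops) -> bytes:
    """Rebuild the new version from the old version plus the delta."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            out += old[op[1]:op[2]]
        else:
            out += op[1]
    return bytes(out)


old = b"Site Details: draft one, budget 100"
new = b"Site Details: draft two, budget 100, approved"

delta = make_delta(old, new)
assert apply_delta(old, delta) == new  # the delta round-trips
```

With this shape, you would keep one full copy of a file plus a delta per revision, and apply the deltas in order to reconstruct any given version.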


👤 pestatije
Are you sure you don't want to use a versioning system?