HACKER Q&A
📣 rrr_oh_man

How to Resurrect a Site from Archive.org?


I recently bought the expired domain of a niche interest site because the previous owner was determined to let it die and did not want to put any more effort into it.

Is there a way I can "revive" it from archive.org in a more or less automated fashion? Have you ever encountered anything like it? I am familiar with web scraping, but archive.org has its peculiarities.

I really, really love the content on it.

It's a very niche site, but I would love for it to live on.


  👤 duskwuff Accepted Answer ✓
> I recently bought the expired domain of a niche interest site because the previous owner was determined to let it die and did not want to put any effort in it anymore. Is there a way I can "revive" it from archive.org in a more or less automated fashion?

Buying a domain name does not grant you ownership of the content it previously hosted. If you have not come to some agreement with the previous owner, you should not proceed.


👤 ulrischa
I’ve seen a lot of people do this when resurrecting old niche sites. The high-level approach usually involves grabbing all the snapshots from archive.org, stripping out their timestamped URLs, and consolidating everything into a local mirror. In practice, you want to:

1. Collect a list of archived URLs (via archive.org’s CDX endpoints; see the sketch just below).
2. Download each page and all related assets.
3. Rewrite all links that currently point to `web.archive.org` so they point to your domain or your local file paths.
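For step 1, a minimal sketch in Python (the field choices and filters here are just one reasonable configuration, not the only way to do it):

  import requests

  # Ask the Wayback CDX server for every capture of the domain.
  # collapse=urlkey keeps one capture per distinct URL; the status
  # filter drops redirects and errors.
  resp = requests.get(
      "https://web.archive.org/cdx/search/cdx",
      params={
          "url": "example.com/*",      # the domain you bought
          "output": "json",
          "fl": "timestamp,original",
          "filter": "statuscode:200",
          "collapse": "urlkey",
      },
      timeout=60,
  )
  rows = resp.json()
  for timestamp, original in rows[1:]:  # first row is the header
      # the id_ modifier serves the raw capture without the Wayback toolbar
      print(f"https://web.archive.org/web/{timestamp}id_/{original}")

Those printed URLs are what you feed into step 2.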

The tricky part is the Wayback Machine’s directory structure—every file is wrapped in these time-stamped URLs. You’ll need to remove those prefixes, leaving just the original directory layout. There’s no perfect, purely automated solution, because sometimes assets are missing or broken. Be prepared for some manual cleanup.
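To give a flavor of that rewriting, here is a rough Python sketch (the regex and the `unwrap` helper are illustrative only; real pages need more careful handling, e.g. srcset attributes and URLs inside inline CSS):

  import re

  # Wayback wraps every link as
  # https://web.archive.org/web/<14-digit timestamp><optional 2-letter modifier>_/<original URL>
  WAYBACK_PREFIX = re.compile(r"https?://web\.archive\.org/web/\d{14}(?:[a-z]{2}_)?/")

  def unwrap(html: str, old_origin: str, new_origin: str) -> str:
      html = WAYBACK_PREFIX.sub("", html)          # drop the archive wrapper
      return html.replace(old_origin, new_origin)  # point links at the revived domain

  # e.g. unwrap(page_html, "http://example.com", "https://example.com")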

Beyond that, the process is basically: gather everything, clean up links, restore the original hierarchy, and then host it on your server. Tools exist that partially automate this (for example, some people have written scripts to do the CDX fetching and rewriting), but if you’re comfortable with web scraping logic, you can handle it with a few careful passes. In the end, you’ll have a mostly faithful static snapshot of the old site running under your revived domain.


👤 Gualdrapo
I was commissioned to recover ideawave.ca from archive.org after its owner lost the database, so pretty much all that was left of it was on archive.org. I think it ran on WordPress, but he asked me to port it to Jekyll.

I scraped its contents (blog posts, pages, etcetera) with Python's BeautifulSoup and redid its styling by hand, which was not something otherworldly (the site was from around 2010 or so), and I had the chance to make some improvements.

The thing with the scraping was that the connection dropped after a while and it was reaaaaaaaaaally sloooooooooow, so I had to keep a record of the last successfully scraped post/page/whatever and, if something happened, restart from that point.
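The resume logic can be as simple as persisting the set of finished URLs after every page; a rough sketch (the checkpoint filename and the one-second pause are made up, and writing to disk survives a crash better than keeping it in memory):

  import json, os, time
  import requests

  CHECKPOINT = "progress.json"  # hypothetical checkpoint file

  def scrape_all(urls):
      done = set()
      if os.path.exists(CHECKPOINT):
          with open(CHECKPOINT) as f:
              done = set(json.load(f))  # resume from the previous run
      for url in urls:
          if url in done:
              continue                  # already fetched earlier
          resp = requests.get(url, timeout=60)
          resp.raise_for_status()
          # ... parse/save resp.text with BeautifulSoup here ...
          done.add(url)
          with open(CHECKPOINT, "w") as f:
              json.dump(sorted(done), f)  # persist progress after every page
          time.sleep(1)  # the Wayback Machine is slow; don't hammer it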

Got pennies for it, mostly because I lowballed myself, but got to learn a thing or two.


👤 janesvilleseo
This is something that used to be done quite a bit in the SEO world. Not sure if it still holds any SEO value. Probably some, but maybe not at the same level.

Anyway, there are tools out there. I haven’t used them.

But a tool like https://www6.waybackmachinedownloader.com/website-downloader...

Or

https://websitedownloader.com/

Should do the trick. Depending on the size of the site, a small cost is involved.

They can even package them into usable files.


👤 moxvallix
You can use wayback_machine_downloader to automate downloading the archived pages: https://github.com/hartator/wayback-machine-downloader/

👤 latexr
Have you tried searching for your question online? I found plenty of results.

https://superuser.com/questions/828907/how-to-download-a-web...


👤 01jonny01
Gosh. No one answers the question directly.

1) Download HTTrack if it’s a large website with a lot of pages.
2) Download a search-and-replace program; there are many of them.
3) The search-and-replace program lets you remove the appended web archive URL from the pages in bulk.
4) Upload to your host.
5) Run the site through a bulk link checker that tests for broken links (a rough sketch follows this list); there are plenty of them online.
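For step 5, a bare-bones checker is easy to sketch in Python (the function name and the HEAD-first strategy are my own choices, not any particular tool):

  import requests
  from bs4 import BeautifulSoup
  from urllib.parse import urljoin

  def find_broken_links(page_url):
      # Fetch the page, then probe every absolute link on it.
      html = requests.get(page_url, timeout=30).text
      soup = BeautifulSoup(html, "html.parser")
      broken = []
      for a in soup.find_all("a", href=True):
          target = urljoin(page_url, a["href"])
          if not target.startswith(("http://", "https://")):
              continue  # skip mailto:, javascript:, bare fragments
          try:
              # some servers reject HEAD; fall back to GET if you see 405s
              r = requests.head(target, allow_redirects=True, timeout=15)
              if r.status_code >= 400:
                  broken.append((target, r.status_code))
          except requests.RequestException as exc:
              broken.append((target, str(exc)))
      return broken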


👤 bagpuss
Archivarix is the most fully formed and easiest way to do this, and it's free: https://archivarix.com/

👤 toast0
I did this for a niche site, but it was only 20 pages.

I pulled each page off the Internet Archive and saved it, then did some minor tidying up: setting viewports for mobile, updating the linkback HTML snippet to point at my URL instead of the old dead one, changing the snippet so it doesn't suggest hotloading the link image, cropping the dead URL out of the link image, and running pngcrush on the images. Then I put it all on cheap static hosting.

I did a bit of poking around trying to find a way to contact the owner, but had no luck. If they come back and want it down, I'll take it down. Copyright notices are intact. I'm clearly violating the author's copyrights, and I accept that.


👤 Sysreq2
You could also consider using the Common Crawl dataset hosted on AWS. It's a separate crawl from archive.org, but the coverage overlaps a lot.

https://registry.opendata.aws/commoncrawl/
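If you want to poke at it, Common Crawl exposes a CDX-style index of its own; a small sketch (my assumption: collinfo.json lists crawls newest first):

  import requests

  # collinfo.json lists the available crawls
  crawls = requests.get("https://index.commoncrawl.org/collinfo.json", timeout=30).json()
  latest = crawls[0]["id"]  # assumed newest first, e.g. "CC-MAIN-2024-46"

  resp = requests.get(
      f"https://index.commoncrawl.org/{latest}-index",
      params={"url": "example.com/*", "output": "json"},
      timeout=60,
  )
  for line in resp.text.splitlines():
      print(line)  # one JSON record per capture, pointing into the WARC files on S3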


👤 paxys
Have you spoken to the previous owner about any of this? Otherwise it's pretty crazy to just take ownership of the site and all its content without a written agreement in place. You are opening yourself up to a massive amount of liability for no reason.

👤 aoipoa
This was posted 6 days ago, but it's now showing as posted 4 hours ago. What happened?

https://hn.algolia.com/?q=ask+hn+resurrect+site+archive

Very odd.

Even the times of the comments have changed; this is what the post looked like yesterday:

https://web.archive.org/web/20241205054108/https://news.ycom...


👤 alsetmusic
I’ve been thinking about buying a sibling domain (.net instead of .com) to re-host a fantastic essay that disappeared from the web some years back. I would make it clear that I didn’t write it, and I would take it down if the original author asked (it did not include attribution in its original form). But the issue has been enough of a grey area that I haven’t pulled the trigger.

For anyone who may be curious, wayback machine has an archive: fuckthesouth.com


👤 Alifatisk
HTTrack? You should not do it without the owner's consent, though.

👤 pabs3
Unless you are going to keep the site running and changing, there is no point doing this, since archive.org already hosts static snapshots of sites.

Depending on the site you would use different tools; e.g., for MediaWiki/DokuWiki sites you would import the latest database dump from archive.org.

I have used wayback-machine-downloader for completely static sites before:

https://github.com/hartator/wayback-machine-downloader/


👤 donalhunt
Did this 10+ years ago for a circa-2000 band website (it was a few HTML pages). It was fairly straightforward to achieve. Some content (embedded from third-party websites) was not recoverable.

👤 joshdavham
Can I ask what site it was? Reading this made me think of a very specific site that I'd also like to see revived and I'm wondering if we're thinking of the same site.

👤 canU4
Isn't a simple `wget -r` enough?

👤 ddgflorida
Web scraping works, but be careful about using copyrighted images.

👤 comboy

  # mirror recursively, rewrite links for local viewing, fetch the CSS/JS/images
  # each page needs, and don't ascend above the starting URL
  wget --mirror --convert-links --page-requisites --no-parent URL

But yeah, it's also not clear to me regarding copyrights and such.