How to download websites?


Michael Tibes

Well-known member
Joined
Jun 5, 2004
Messages
893
Location
Berlin, Germany
There are a few interesting websites offering schematics (like this one: http://groupdiy.com/index.php?topic=62946.0). Since nothing on the net lives forever, I would like to download them, but I couldn't get it done. I used HTTrack, but I'm obviously missing a setting that would make it work. I added PDF and the other relevant file formats to the list, but it didn't help.
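
For reference, the command-line equivalent of what I am trying looks roughly like this (a sketch from memory, so the depth and filter values are guesses):

# mirror the thread page and whatever it links to, keeping PDFs and images
# -O sets the output folder, -r3 limits the link depth, the "+" rules are accept filters
httrack "http://groupdiy.com/index.php?topic=62946.0" \
    -O ./groupdiy-mirror -r3 -v \
    "+*.pdf" "+*.gif" "+*.jpg" "+*.png"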

Does anyone know a good trick?

Thanks,

Michael
 
I just tried the same thing yesterday.
No luck. I used wget (Linux).

Site operators can set a robots.txt to prevent robots from downloading the whole thing.
They don't want us to have all the nice schematics on our drives!

Well, we could just share the task between us: everyone downloads ten files and we use a shared disk to distribute them.
No idea whether the links to the PDFs all point to the same directory; then it would be easy to reconstruct the site.
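
For what it is worth, the kind of wget call I tried looks roughly like this (a sketch from memory; even with robots.txt ignored, the server may still refuse the attachments):

# recursive fetch of the thread, two link levels deep, keeping only PDFs and images
# -e robots=off ignores robots.txt; --wait/--random-wait go easy on the server
wget --recursive --level=2 --no-parent \
     --accept pdf,gif,jpg,png \
     --page-requisites --convert-links \
     -e robots=off --wait=2 --random-wait \
     "http://groupdiy.com/index.php?topic=62946.0"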

best Stephan
 
I have parsed HTML to get a list to feed to wget. This may be trivial if the HTML is clean and consistent. One great site has garbage HTML which needs hours of hand clean-up. (I may be missing a trick.)
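
When the HTML is clean, the whole thing can be a short pipeline along these lines (a sketch; the page URL is made up, and it assumes the links are absolute and end in .pdf):

# pull the page, pick out href targets ending in .pdf, de-duplicate,
# and hand the list to wget (relative links would need the site prefix added)
curl -s "http://example.com/schematics.html" \
  | grep -oE 'href="[^"]+\.pdf"' \
  | sed -e 's/^href="//' -e 's/"$//' \
  | sort -u > pdf-urls.txt
wget --wait=1 --input-file=pdf-urls.txt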

Firefox on Windows: "DownThemAll" adds a right-click option that offers to download all files on a page. On that Black Magic page, it thinks for many seconds and then offers a reasonable-looking list of almost 10,000(!) files it proposes to fetch. I don't wish to tie up my damp-string internet connection to try it.
 
DerEber said:
I just tried the same thing yesterday.
No luck. I used wget (Linux).

Site operators can set a robots.txt to prevent robots from downloading the whole thing.
They don't want us to have all the nice schematics on our drives!
The robots.txt file only works because "nice" programs read and respect it. This includes sites such as http://archive.org which often has copies of no-longer-online sites (useful anytime you get a 404), but sometimes it says the page wasn't saved because of the robots.txt file.
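
As an aside, you can ask the Wayback Machine whether it holds a snapshot of a page before scraping anything yourself; its availability endpoint looks roughly like this (from memory, so treat it as a sketch):

# returns JSON; "archived_snapshots" is empty when nothing was saved
curl -s -G "https://archive.org/wayback/available" \
     --data-urlencode "url=http://groupdiy.com/index.php?topic=62946.0"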
 
