offline browsing: perl script for fetching and syncing threads

1954U1 · Jan 14, 2009

Hi all,
here is a little contribution to this forum.

It's a command-line Perl script, tuned for the prodigy-pro forums, that uses Wget for downloading/syncing threads.
I wrote it because I need to keep some threads from the Lab
organized on my laptop, per project, and fully browsable offline.

wget_threads.pl

Code:

#! /usr/local/bin/perl

$url = $ARGV[0];
## append '.0' to a topic url given without the '.n' page offset at the end
$url =~ s/^(http\:\/\/www\.prodigy\-pro\.com\/diy\/index\.php\?topic\=\d+$)/$1.0/;
if ($url !~ /^http\:\/\/www\.prodigy\-pro\.com\/diy\/index\.php\?topic\=\d+\.\d+$/) {
    print "Please enter a correct url.\n";
    exit 0;
}

### fetch the first page once, to read the thread title and the page links
$prefetch = `wget -O - '$url' 2>/dev/null`;
if ($prefetch !~ /Topic\:(.+?)\(Read/) {
    print "Cannot find the thread title in '$url'\n";
    exit 0;
}
### build a filesystem-safe name from the thread title
$thread = $1;
$thread =~ s/\&\w+\;/\_/g;
$thread =~ s/\W/\_/g;
$thread =~ s/\_{1,}/\_/g;
$thread =~ s/^\_//;
$thread =~ s/\_$//;
print "$thread\n";
$dirs_thread = './dirs/' . $thread;
$wget_cmd = "wget -r -l 1 --follow-tags='a,img,link,script' -N -U 'Mozilla' -k -K -E -H -p -R '*PHPSESSID*,http://www.groupdiy.com/index.php,*.msg*,*prev_next=*,*action=profile*,*mailto:*,*action=verificationcode*,*action=help*,*action=search*,*action=login*,*action=register*,*action=activate*' --exclude-domains 'mysql.com,w3.org,php.net,paypal.com,simplemachines.org' --no-cookies -np -T 10 -t 2 ";

### find the offset of the last page: taken from the last "navPages" link, or 0 for a single-page thread
$last_page = '';
if($prefetch =~ /\<a\sclass\=\"navPages\"\shref\=\"(http\:\/\/www\.prodigy\-pro\.com\/diy\/index\.php\?topic\=\d+\.\d+)\">\d+\<\/a\>\s\<\/td\>/) {
    $last_page = $1;
    $last_page =~ s/PHPSESSID\=[0-9,a-f]*?\&amp\;//;
}
elsif($prefetch =~ /\<td\sclass\=\"middletext\"\>Pages\:\s\[\<b\>1\<\/b\>\]\s\<\/td\>/) {
    $last_page = "0";
}
else {
    #print "$prefetch\n\n";
    print "Something wrong parsing '$url'\n";
    exit 0;
}

$last_page =~ s/^.+?\.(\d+)$/$1/;
print "Last page: $last_page\n";
if($last_page !~ /^(\d+)$/) {
    print "Error in last page offset: '$last_page'\n";
    exit 0;
}

### store some data for post conversion of internal links on groupdiy.com
$thread_number = $url;
$thread_number =~ s/^.+?topic\=(\d+)\..+$/$1/;
$int_thread_links_regexp = 's/http\:\/\/www\.prodigy\-pro\.com\/diy\/index\.php\?PHPSESSID\=[0-9,a-f]+\&amp\;(topic\=' . $thread_number . '\.\d+)\"/index.php\%3F$1.html\"/g';


### thread does not exist locally, so create dir and link and go fetch all the pages
if(!(-d $dirs_thread)) {

    ### write a small redirect file '<thread>.html' that also records the thread url
    $localurl = $url;
    $localurl =~ s/^http\:\/\///;
    $localurl =~ s/\?/\%3F/;
    $htmlcontent = '<!-- THREAD URL: ' . $url . ' --><html><script>location=\'' . $dirs_thread . '/' . $localurl . '.html\'</script></html>';

    system "mkdir -p '$dirs_thread' 2>/dev/null";

    open(W, ">./$thread.html") or die "Cannot write './$thread.html': $!\n";
    print W $htmlcontent;
    close W;

    ### go fetch all the pages
    chdir $dirs_thread or die "Cannot enter '$dirs_thread': $!\n";

    for($all_pages = 0; $all_pages <= $last_page; $all_pages += 20) {
        $current_page = $url;
        $current_page =~ s/\.\d+$/.$all_pages/;
        print "$current_page\n";
        $wget = $wget_cmd . "\'$current_page\' 2>/dev/null";
        system $wget;
        &internal_links();
        system "echo -n $all_pages > last_page";

        ### fetch the printable version of the thread, to detect later whether the thread has been updated
        if($all_pages == 0) {
            $current_page =~ s/index\.php\?topic/index.php?&action=printpage;topic/;
            $page = `wget -O - '$current_page' 2>/dev/null`;
            $page =~ s/\<span.+?\<\/span\>//s;
            open(W, '>first_page_printable') or die "Cannot write 'first_page_printable': $!\n";
            print W $page;
            close W;
            $pagep = $page;
            $countpagep = 0;
            ## insert the page numbers of the posts in the single printable file of the thread
            $pagep =~ s/(Post\sby\:.+?\d\s[ap]m\<\/b\>)/&insert_page_numbers($1);/eg;
            open(W, '>printable.html') or die "Cannot write 'printable.html': $!\n";
            print W $pagep;
            close W;
        }
    }

}


### thread already fetched
else {
    print "updating..\n";
    chdir $dirs_thread or die "Cannot enter '$dirs_thread': $!\n";
    ### download the printable version and compare it with the stored copy
    $current_page = $url;
    print "$current_page\n";

    $current_page =~ s/index\.php\?topic/index.php?&action=printpage;topic/;
    $test_new = `wget -O - '$current_page' 2>/dev/null`;
    $test_new =~ s/\<span.+?\<\/span\>//s;

    $test_old = 'first_page_printable';
    if(!(-s "$test_old")) {
        print "Problem: '$test_old' is missing or empty\n";
        exit 0;
    }
    $test_old = `cat '$test_old'`;

    ### if thread content has changed, update 1st page, last page, and fetch the new pages, if any
    if($test_new ne $test_old) {
        print "updating 1st page..\n";
        $current_page = $url;
        $wget = $wget_cmd . "\'$current_page\' 2>/dev/null";
        system $wget;
        &internal_links();
        open(W, '>first_page_printable') or die "Cannot write 'first_page_printable': $!\n";
        print W $test_new;
        close W;
        $pagep = $test_new;
        $countpagep = 0;
        ## insert the page numbers of the posts in the single printable file of the thread
        $pagep =~ s/(Post\sby\:.+?\d\s[ap]m\<\/b\>)/&insert_page_numbers($1);/eg;
        open(W, '>printable.html') or die "Cannot write 'printable.html': $!\n";
        print W $pagep;
        close W;
    }

    ### only one page in thread, exit now
    if($last_page eq '0') {
        print "DONE\n";
        exit 0;
    }

    ### download the content added since the last fetch, if any
    $prev_last_page = `cat last_page`;
    chomp $prev_last_page;
    if($prev_last_page !~ /^\d+$/) {
        print "Error in the content of the 'last_page' file: '$prev_last_page'\n";
        exit 0;
    }
    if($test_new ne $test_old) {
        for($new_pages = $prev_last_page; $new_pages <= $last_page; $new_pages += 20) {
            $current_page = $url;
            $current_page =~ s/\.\d+$/.$new_pages/;
            print "$current_page\n";
            $wget = $wget_cmd . "\'$current_page\' 2>/dev/null";
            system $wget;
            &internal_links();
            system "echo -n $new_pages > last_page";
        }
    }

}


print "DONE\n";
exit 0;



### do the conversion of groupdiy.com internal links
sub internal_links() {
    $current_page =~ s/^http\:\/\///;
    if (-s "$current_page.html") {
        system "perl -pi -e '$int_thread_links_regexp' '$current_page.html'";
        $print_regexp = 's/http\:\/\/www\.prodigy\-pro\.com\/diy\/index\.php\?PHPSESSID\=[0-9,a-f]+\&amp\;action\=printpage\;(topic\=' . $thread_number . '\.\d+)\"/..\/..\/printable.html\"/g';
        $print_regexp_2 = 's/\>Print\<\/a\>/<span style="color:black">PRINT<\/span>/g';
        system "perl -pi -e '$print_regexp' '$current_page.html'"; ## "PRINT" buttons
        system "perl -pi -e '$print_regexp_2' '$current_page.html'"; ## "PRINT" buttons
    }
    else {
        print "Problem in internal links rearrangement in $current_page\n";
        exit 0;
    }
    return 1;
}


### insert page numbers on all the posts of the single printable thread file
sub insert_page_numbers {
    my $cont = shift;                             ## the matched "Post by: ..." header
    my $page = int($countpagep / 20) + 1;         ## 20 posts per page
    my $lurl = $url;
    my $mod = $countpagep - ($countpagep % 20);   ## offset of the page holding this post
    $lurl =~ s/\.\d+$/.$mod/;
    $lurl =~ s/^.+?index\.php/index.php/;
    $lurl .= '.html';
    $lurl =~ s/\?/\%3F/;
    $cont = '<table width=50% border=0><tr><td>' . $cont . "</td><td align=right><b><a href='www.groupdiy.com/$lurl' target='_blank'>Page $page</a></b></td></tr></table>";
    $countpagep++;
    return $cont;
}

Features:
- Only one parameter [thread url], all the rest is automatic.
- No hassle for the forum's server: only the useful files are fetched, and only once.
- Complete and "smart" download and storage of all the pages and docs linked in the threads.
- 100% locally viewable threads, docs, images, sounds, without an internet connection.
- "Update" feature -> fetches and updates the 1st page only if it has changed, plus the new pages since the last run.
- Threads and linked files all in one directory tree, per project.
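
To give an idea of how the "update" feature picks its pages, here is a minimal sketch. It assumes the step of 20 between page offsets used in the loops of wget_threads.pl; pages_to_fetch is a hypothetical helper written only for this illustration, not a part of the script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: list the page offsets from the offset stored at the
# previous run up to the current last offset of the thread.
# Assumption: offsets advance by 20, as in the loops of wget_threads.pl.
sub pages_to_fetch {
    my ($from, $last) = @_;
    my @offsets;
    for (my $off = $from; $off <= $last; $off += 20) {
        push @offsets, $off;
    }
    return @offsets;
}

# First run: every page, from offset 0 up to the last one.
print join(' ', pages_to_fetch(0, 60)), "\n";    # 0 20 40 60

# Update run: the old last page is fetched again, then the new pages.
print join(' ', pages_to_fetch(40, 80)), "\n";   # 40 60 80
```

The old last page is fetched again on purpose: it may have received new posts since the last run.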

Usage:
- go to the "discussion forum" of your choice [the Lab, I suppose..] and copy the link of the thread.
- create a dir on your hard disk, e.g. '33609', meant to be used _only_ for that.
- go into this dir.
- execute the command.

Example: wget_threads.pl 'http://www.groupdiy.com/index.php?topic=28274.0'

Then, point your browser at the local path of the dir where you ran the command,
and test the offline browsing.
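
For orientation, the '<thread title>.html' file created in that dir is only a redirect to the first downloaded page. Here is a minimal sketch of how the redirect target is built, assuming a url of the form accepted by the script's check; local_page_link, the dir './dirs/Some_thread' and the topic number are hypothetical, chosen only for the illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: relative link to the local copy of a thread page,
# built with the same substitutions the script applies to the thread url.
sub local_page_link {
    my ($dirs_thread, $url) = @_;
    my $local = $url;
    $local =~ s/^http:\/\///;   # wget -r stores the pages under host/path
    $local =~ s/\?/%3F/;        # the '?' of the file name is escaped in the link
    return "$dirs_thread/$local.html";
}

print local_page_link('./dirs/Some_thread',
    'http://www.prodigy-pro.com/diy/index.php?topic=28274.0'), "\n";
# ./dirs/Some_thread/www.prodigy-pro.com/diy/index.php%3Ftopic=28274.0.html
```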

Here are 2 other little scripts, one for updating a single thread and the other for updating all of them:

update_thread.pl

Code:

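#! /usr/local/bin/perl

### usage: update_thread.pl <thread title>.html
$file = $ARGV[0];
if(!(-s $file)) {
    print "'$file' is missing or empty.\n";
    exit 0;
}

### read the thread url back from the '<!-- THREAD URL: ... -->' comment of the file
open(R, '<', $file) or die "Cannot read '$file': $!\n";
$cont = join('', <R>);
close R;
if($cont !~ /URL\:\s(http.+?)\s/) {
    print "No thread url found in '$file'.\n";
    exit 0;
}
$url = $1;
if($url !~ /^http.+\d$/) {
    print "'$url' incorrect.\n";
    exit 0;
}

system "wget_threads.pl '$url'";
exit 0;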

update_all_threads.pl

Code:

#! /usr/local/bin/perl

### run update_thread.pl on every '<thread title>.html' file of the current dir
@html_files = glob('*.html');

foreach $sing(@html_files) {
    system "update_thread.pl '$sing'";
}

print "\n\n\nTOTAL UPDATE DONE.\n";
exit 0;

[copy&paste the thread title..]
Example: "update_thread.pl Official_33609_builder_s_tread_See_1st_page_for_updates.html"
Example: "update_all_threads.pl" for updating all threads in current dir [i.e. 33609]
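
The name of the .html file (and of the thread's dir) is derived from the thread title. Here is a minimal sketch of that clean-up, with the same substitutions as in wget_threads.pl; title_to_name and the sample title are hypothetical, used only for the illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: turn a thread title into a file-system friendly name.
sub title_to_name {
    my ($title) = @_;
    $title =~ s/&\w+;/_/g;   # named HTML entities such as &amp;
    $title =~ s/\W/_/g;      # anything that is not a letter, a digit or '_'
    $title =~ s/_+/_/g;      # collapse runs of '_'
    $title =~ s/^_//;        # no leading '_'
    $title =~ s/_$//;        # no trailing '_'
    return $title;
}

print title_to_name(' Mic pre &amp; EQ: build notes '), "\n";
# Mic_pre_EQ_build_notes
```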

The OS I use is Linux, but the script is simple, commented and tweakable,
so it's not difficult to make a version for Windows and Mac as well.

Anyway, try it: you only understand how useful something like this is by using it,
and it's the simplest way to understand the features.

Edit: script improved, added single-file-per-thread functionality [the 2 "PRINT" links on the right of the threads],
with added links to the pages [not present in the online version].
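
The added page links come from simple arithmetic on the position of each post in the printable file. Here is a minimal sketch, assuming 20 posts per page as in the script; page_of_post is a hypothetical helper written only for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: for the zero-based position of a post in the thread,
# return the page number shown in the link and the page offset used in the url.
# Assumption: 20 posts per page.
sub page_of_post {
    my ($n) = @_;
    my $page   = int($n / 20) + 1;   # pages are numbered from 1
    my $offset = $n - ($n % 20);     # offset of the first post of that page
    return ($page, $offset);
}

my ($page, $offset) = page_of_post(45);
print "Page $page, offset $offset\n";   # Page 3, offset 40
```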
