The Life of a Penetration Tester: PhD Comics Downloader | Python Script to Download Piled Higher and Deeper Comics

Thursday, January 8, 2015

PhD Comics Downloader | Python Script to Download Piled Higher and Deeper Comics

Written by Pranshu Bajpai

PhD Comics, by Jorge Cham, provides a funny (but accurate) glimpse into the life of a graduate student. Being a graduate student myself, I have always enjoyed reading this comic.

At some point, I decided to read a bunch of these comics while traveling, when I wouldn't have Internet access. Since this is an online comic, this was a problem. So I wrote a small Python scraping script that visits the pages on the phdcomics website, locates the comic GIFs, and downloads them to the local disk for reading later.

Usage: python downloadphdcomics.py

Download: GitHub: https://github.com/lifeofpentester/phdcomics

Note that the start and end comic numbers are hard-coded in the script as 1 and 1699, respectively. To download a different range of comics, edit these values in any text editor.

You can store all of these GIFs in a ZIP file and change the extension from ZIP to CBZ, the ZIP-based comic book archive format (CBR is its RAR-based counterpart). Then, you can use any comic book reader to read these comics.
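The packaging step can also be scripted. The sketch below is only an illustration, not part of the original script: it assumes the GIFs were saved as `<number>.gif` in a single folder, and the function name `make_cbz` and both path arguments are hypothetical. It writes a ZIP archive with the `.cbz` extension, the usual name for ZIP-based comic archives, and zero-pads the stored file names because some readers sort pages alphabetically, which would otherwise place `10.gif` before `2.gif`.

```python
import os
import re
import zipfile


def make_cbz(gif_dir, cbz_path):
    """Bundle every <number>.gif in gif_dir into one comic archive.

    Returns the list of names stored in the archive, in order.
    """
    numbered = []
    for name in os.listdir(gif_dir):
        match = re.fullmatch(r"(\d+)\.gif", name)
        if match:
            numbered.append((int(match.group(1)), name))
    numbered.sort()  # numeric order: 2.gif before 10.gif

    stored = []
    # GIFs are already compressed, so ZIP_STORED skips recompression
    with zipfile.ZipFile(cbz_path, "w", zipfile.ZIP_STORED) as archive:
        for number, name in numbered:
            # Zero-pad so readers that sort by name keep the numeric order
            arcname = "%04d.gif" % number
            archive.write(os.path.join(gif_dir, name), arcname)
            stored.append(arcname)
    return stored
```

For example, `make_cbz("/root", "/root/phdcomics.cbz")` would pack the files the downloader saved under `/root/`; the output file name here is just a placeholder.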

In case you are interested in reading the code, here it is:

#!/usr/bin/python

"""The PhD Comics Downloader"""
"""
This code fetches PhD comics from www.phdcomics.com
and saves to '/root/'

Written by: Pranshu
bajpai [dot] pranshu [at] gmail [dot] com

""" 


from bs4 import BeautifulSoup
from urllib import urlretrieve
import urllib2
import re

# range() excludes its end value, so 1700 is needed to include comic 1699
for i in range(1, 1700):

    url = "http://www.phdcomics.com/comics/archive.php?comicid=%d" % i
    html = urllib2.urlopen(url)
    content = html.read()
    # Name the parser explicitly so bs4 does not have to guess one
    soup = BeautifulSoup(content, "html.parser")

    # Comic images live under the archive path and are named phd*.gif
    for image in soup.find_all('img', src=re.compile(r'http://www\.phdcomics\.com/comics/archive/phd.*gif$')):
        print "[+] Fetched Comic %d: %s" % (i, image["src"])
        # Save inside the loop so that 'image' is always defined here
        outfile = "/root/%d.gif" % i
        urlretrieve(image["src"], outfile)
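The script above targets Python 2 (`urllib2` and the `print` statement do not exist in Python 3). For readers on Python 3, the same idea can be sketched roughly as follows. This is not the author's script: the names `find_comic_urls`, `comic_ids`, `download_comics`, and the `out_dir` parameter are hypothetical, and the standard library's `html.parser` stands in for BeautifulSoup so that the sketch has no third-party dependency. It also assumes the site still serves pages in the layout the original script expects.

```python
import os
import re
from html.parser import HTMLParser
from urllib.request import urlopen, urlretrieve

# Same image pattern and page URL that the original script uses
COMIC_SRC = re.compile(r"http://www\.phdcomics\.com/comics/archive/phd.*gif$")
PAGE_URL = "http://www.phdcomics.com/comics/archive.php?comicid=%d"


class _ImgCollector(HTMLParser):
    """Collects the src of every <img> tag whose src matches COMIC_SRC."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src") or ""
            if COMIC_SRC.search(src):
                self.urls.append(src)


def find_comic_urls(html_text):
    """Return the comic GIF URLs found in one archive page."""
    parser = _ImgCollector()
    parser.feed(html_text)
    return parser.urls


def comic_ids(first, last):
    """Comic numbers from first to last, both inclusive."""
    return range(first, last + 1)


def download_comics(first, last, out_dir="."):
    """Fetch each archive page and save its comic image as <number>.gif."""
    for i in comic_ids(first, last):
        with urlopen(PAGE_URL % i) as response:
            page = response.read().decode("utf-8", errors="replace")
        for src in find_comic_urls(page):
            print("[+] Fetched Comic %d: %s" % (i, src))
            urlretrieve(src, os.path.join(out_dir, "%d.gif" % i))
```

Calling `download_comics(1, 1699)` would cover the same range as the original script. Keeping the HTML parsing in `find_comic_urls` separate from the network calls makes it possible to check the matching logic against a saved page without going online.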

3 comments:

Anonymous, February 8, 2017 at 1:58 AM
Hi,
I'm new to python (in fact, this is the very first time I use it), and I get this error when running this program:

urlretrieve(image["src"], outfile)
NameError: name 'image' is not defined

Anyway, thanks for your effort :)
Cheers

Anonymous, May 15, 2017 at 11:53 PM
Move the last two lines right below print.

for image in soup.find_all('img', src=re.compile('http://www.phdcomics.com/comics/archive/' + 'phd.*gif$')):
    print "[+] Fetched Comic " + "%d" %i + ": " + image["src"]
    outfile = "/root/" + "%d" %i + ".gif"
    urlretrieve(image["src"], outfile)

Unknown, May 27, 2017 at 2:06 PM
Hi,
I am unable to get this code to work in the current form.