
Thursday, January 8, 2015

PhD Comics Downloader | Python Script to Download Piled Higher and Deeper Comics

Written by Pranshu Bajpai

PhD Comics, by Jorge Cham, provides a funny (but accurate) glimpse into the life of a graduate student. Being a graduate student myself, I have always enjoyed reading this comic.

At some point, I decided to read a bunch of these comics while traveling, when I wouldn't be able to access the Internet. Since this is an online comic, that was a problem. So I wrote a small Python scraping script that visits the comic pages on the phdcomics website, locates the comic GIFs, and downloads them to the local disk for reading later.

Usage: python downloadphdcomics.py

Download: GitHub: https://github.com/lifeofpentester/phdcomics


Note that the start and end comic numbers are hard-coded in the script as 1 and 1699, respectively, in the call to range(). Because range() excludes its end value, the last comic fetched is number 1698. You can modify the script in any text editor to download a different range of comics.
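
One way to avoid editing the script for every new range is to read the bounds from the command line. The following is a minimal sketch and not part of the original script: the `--start` and `--end` options and the helper names `parse_range` and `comic_ids` are assumptions made for illustration. It is written for Python 3 and treats the end number as inclusive.

```python
import argparse


def parse_range(argv):
    """Parse the hypothetical --start/--end options.

    The defaults mirror the numbers used in the script (1 and 1699).
    """
    parser = argparse.ArgumentParser(description="Select a range of comic IDs")
    parser.add_argument("--start", type=int, default=1, help="first comic ID")
    parser.add_argument("--end", type=int, default=1699,
                        help="last comic ID (inclusive)")
    args = parser.parse_args(argv)
    if args.start < 1 or args.end < args.start:
        # parser.error() prints a usage message and exits.
        parser.error("expected 1 <= start <= end")
    return args.start, args.end


def comic_ids(start, end):
    """Return the comic IDs from start to end, including end itself."""
    # range() excludes its stop value, hence the + 1.
    return range(start, end + 1)
```

Calling `parse_range(sys.argv[1:])` and looping over `comic_ids(start, end)` could then stand in for the hard-coded `range(1, 1699)`.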

You can store all of these GIFs in a ZIP file and change the extension from ZIP to CBZ (CBR is the equivalent extension for RAR archives). Then, you can use any comic book reader that supports CBZ to read these comics.
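
The packaging step can also be scripted. The sketch below is an illustration, not part of the original script: `make_cbz` is a hypothetical helper, and it assumes the GIFs are named `<number>.gif`, as the downloader names them. It is written for Python 3 and uses only the standard library.

```python
import os
import zipfile


def make_cbz(image_dir, archive_path):
    """Pack every .gif in image_dir into a ZIP archive at archive_path."""
    gifs = [name for name in os.listdir(image_dir)
            if name.lower().endswith(".gif")]

    # Add files in numeric order (1.gif, 2.gif, ..., 10.gif) rather than
    # plain string order; names without a numeric stem go last.
    def sort_key(name):
        stem = os.path.splitext(name)[0]
        return (0, int(stem), name) if stem.isdecimal() else (1, 0, name)

    gifs.sort(key=sort_key)

    # GIFs are already compressed, so storing them uncompressed is enough.
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_STORED) as archive:
        for name in gifs:
            archive.write(os.path.join(image_dir, name), arcname=name)
    return gifs
```

For example, `make_cbz("/root", "/root/phdcomics.cbz")` would bundle the downloaded GIFs into a single file, using `.cbz`, the conventional extension for ZIP-based comic archives.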


In case you are interested in reading the code, here it is:


#!/usr/bin/python

"""The PhD Comics Downloader

This code fetches PhD comics from www.phdcomics.com
and saves them to '/root/'.

Written by: Pranshu
bajpai [dot] pranshu [at] gmail [dot] com
"""


from bs4 import BeautifulSoup
from urllib import urlretrieve
import urllib2
import re

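# Comic IDs 1 through 1698; range() excludes its end value.
for i in range(1, 1699):

    url = "http://www.phdcomics.com/comics/archive.php?comicid=%d" % i
    html = urllib2.urlopen(url)
    content = html.read()
    # Name the parser explicitly so the result does not depend on what is installed.
    soup = BeautifulSoup(content, "html.parser")

    # Match only the comic GIF, not the other images on the page.
    for image in soup.find_all('img', src=re.compile(r'http://www\.phdcomics\.com/comics/archive/phd.*gif$')):
        print "[+] Fetched Comic %d: %s" % (i, image["src"])
        # Save inside the loop, so 'image' is always defined when it is used.
        outfile = "/root/%d.gif" % i
        urlretrieve(image["src"], outfile)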
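
The script above targets Python 2 (`urllib2`, the `print` statement). A rough Python 3 equivalent is sketched below. It is an illustration, not the author's code: it swaps BeautifulSoup for the standard library's `html.parser` so that nothing needs to be installed, the names `ComicImageParser`, `find_comic_urls`, and `download_comics` are hypothetical, and it assumes the site still serves the same page URLs and image paths that the original script relies on.

```python
import os
import re
from html.parser import HTMLParser
from urllib.request import urlopen, urlretrieve

# Same pattern as the original script: comic GIFs under /comics/archive/.
COMIC_SRC = re.compile(r"http://www\.phdcomics\.com/comics/archive/phd.*gif$")


class ComicImageParser(HTMLParser):
    """Collect the src of every <img> tag whose URL matches COMIC_SRC."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <img ... />.
        if tag != "img":
            return
        src = dict(attrs).get("src")
        if src and COMIC_SRC.search(src):
            self.urls.append(src)


def find_comic_urls(html_text):
    """Return the comic image URLs found in one archive page."""
    parser = ComicImageParser()
    parser.feed(html_text)
    return parser.urls


def download_comics(start, end, out_dir):
    """Fetch comics start..end (inclusive) into out_dir as <id>.gif."""
    for comic_id in range(start, end + 1):
        url = "http://www.phdcomics.com/comics/archive.php?comicid=%d" % comic_id
        with urlopen(url) as response:
            html_text = response.read().decode("utf-8", errors="replace")
        for src in find_comic_urls(html_text):
            print("[+] Fetched Comic %d: %s" % (comic_id, src))
            urlretrieve(src, os.path.join(out_dir, "%d.gif" % comic_id))
```

For example, `download_comics(1, 10, ".")` would try to save the first ten comics to the current directory. Keeping the parsing in `find_comic_urls` separate from the network calls makes it possible to check the matching logic against a saved page without going online.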


2 comments:

  1. Hi,
    I'm new to Python (in fact, this is the very first time I have used it), and I get this error when running this program:

    urlretrieve(image["src"], outfile)
    NameError: name 'image' is not defined

    Anyway, thanks for your effort :)
    Cheers

  2. Move the last two lines right below print.

    for image in soup.find_all('img', src=re.compile('http://www.phdcomics.com/comics/archive/' + 'phd.*gif$')):
        print "[+] Fetched Comic " + "%d" %i + ": " + image["src"]
        outfile = "/root/" + "%d" %i + ".gif"
        urlretrieve(image["src"], outfile)
