Example: Code of the Wikipedia Fetcher

From Wiki of the E-Business and Web Science Research Group

This is a short demo of accessing the Web from Python. It is meant to demonstrate how well Python is suited to Web mash-ups. PLEASE do not run this program extensively: it accesses Wikipedia in an illegitimate way, posing as a different randomly chosen web browser on each request. The code is only meant to show the power of the urllib and urllib2 libraries.

 
# fetchWikipedia
 
import urllib, urllib2, random, time
 
URI = 'http://en.wikipedia.org/wiki/Special:Random'
VALUES = {} # additional values for the http request
USER_AGENTS = [
    'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11',
    'Opera/9.25 (Windows NT 5.1; U; en)',
    'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)',
    'Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.5 (like Gecko) (Kubuntu)',
    'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.12) Gecko/20070731 Ubuntu/dapper-security Firefox/1.5.0.12',
    'Lynx/2.8.5rel.1 libwww-FM/2.14 SSL-MM/1.4.1 GNUTLS/1.2.9'
]
 
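# encode the form values; pass None when there are none, because urllib2
# sends a POST whenever data is not None (even for an empty string)
data = urllib.urlencode(VALUES) if VALUES else None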
for i in range(100):
    # take a random browser type; random.choice covers every entry,
    # whereas randint(0, 4) never selected the last one
    headers = {'User-Agent': random.choice(USER_AGENTS)}
    # compose request
    req = urllib2.Request(URI, data, headers)
    # fetch page
    response = urllib2.urlopen(req)
    page = response.read()
    # show first 800 characters of HTML
    print i, page[:800]
    # wait a random amount of time (between 0.7 and 1.7 seconds)
    time.sleep(random.random() + 0.7)
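
For readers on Python 3, where urllib2 no longer exists and its functionality lives in urllib.request, the request-building step of the loop might be sketched as follows. This is a minimal sketch under stated assumptions, not the original author's code: the helper name build_request and the shortened agent list are hypothetical. It only constructs the request object and never contacts the network.

```python
import random
import urllib.parse
import urllib.request

URI = 'http://en.wikipedia.org/wiki/Special:Random'


def build_request(uri, values, user_agents):
    """Hypothetical helper: build (but do not send) a request
    carrying a randomly chosen User-Agent header."""
    # With no form values, pass data=None so the request stays a GET;
    # any non-None data turns it into a POST.
    data = urllib.parse.urlencode(values).encode('ascii') if values else None
    headers = {'User-Agent': random.choice(user_agents)}
    return urllib.request.Request(uri, data=data, headers=headers)


if __name__ == '__main__':
    # Shortened, illustrative agent list (two entries from the listing above).
    agents = [
        'Opera/9.25 (Windows NT 5.1; U; en)',
        'Lynx/2.8.5rel.1 libwww-FM/2.14 SSL-MM/1.4.1 GNUTLS/1.2.9',
    ]
    req = build_request(URI, {}, agents)
    # Show the HTTP method and the agent string that was picked.
    print(req.get_method(), req.get_header('User-agent'))
```

Passing the returned object to urllib.request.urlopen would perform the actual fetch, just as urllib2.urlopen does in the loop above; the same caution about not running it extensively applies.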
 