
Example: Code of the Wikipedia Fetcher

This is a short demo of accessing the Web from Python. It is meant to show how well Python is suited to Web mash-ups. PLEASE do not run this program extensively, since it accesses Wikipedia in a way that is not permitted. The code is only meant to show what the Python 2 libraries urllib and urllib2 can do.

fetchWikipedia

import urllib, urllib2, random, time

URI = 'http://en.wikipedia.org/wiki/Special:Random'
VALUES = {}  # additional values for the HTTP request
USER_AGENTS = [
   'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11',
   'Opera/9.25 (Windows NT 5.1; U; en)',
   'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)',
   'Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.5 (like Gecko) (Kubuntu)',
   'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.12) Gecko/20070731 Ubuntu/dapper-security Firefox/1.5.0.12',
   'Lynx/2.8.5rel.1 libwww-FM/2.14 SSL-MM/1.4.1 GNUTLS/1.2.9'
]

data = urllib.urlencode(VALUES)
for i in range(100):

   # take random browser type (random.choice covers every entry of the list)
   headers = {'User-Agent' : random.choice(USER_AGENTS)}
   # compose request
   req = urllib2.Request(URI, data, headers)
   # fetch page
   response = urllib2.urlopen(req)
   page = response.read()
   # show first 800 characters of HTML
   print i, page[:800]
   # wait a random amount of time
   time.sleep(random.random()+0.7)
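
The listing above is written for Python 2. In Python 3 the urllib2 module no longer exists; its functionality lives in urllib.request, and urlencode has moved to urllib.parse. The following is a minimal sketch of how the same steps might look there, not a definitive port. The helper names build_request, fetch, pause and main are hypothetical and do not appear in the original program, and the shortened USER_AGENTS list is only for illustration. The sketch also assumes that an empty VALUES dictionary should result in a plain GET request.

```python
import random
import time
import urllib.parse
import urllib.request

URI = 'http://en.wikipedia.org/wiki/Special:Random'
# Shortened list for illustration; the original program uses six entries.
USER_AGENTS = [
    'Opera/9.25 (Windows NT 5.1; U; en)',
    'Lynx/2.8.5rel.1 libwww-FM/2.14 SSL-MM/1.4.1 GNUTLS/1.2.9',
]


def build_request(uri, values, user_agent):
    """Compose a request (hypothetical helper, not part of the original)."""
    # Assumption: no values means GET. In Python 3 a request body must be
    # bytes, so the urlencoded string is encoded before it is passed on.
    data = urllib.parse.urlencode(values).encode('ascii') if values else None
    return urllib.request.Request(uri, data, {'User-Agent': user_agent})


def fetch(request, limit=800):
    """Fetch the page and return the first `limit` characters of its HTML."""
    with urllib.request.urlopen(request) as response:
        return response.read().decode('utf-8', errors='replace')[:limit]


def pause(minimum=0.7):
    """Return a random delay of at least `minimum` seconds."""
    return random.random() + minimum


def main(count=3):
    """Fetch `count` random pages, waiting a random time between requests."""
    for i in range(count):
        request = build_request(URI, {}, random.choice(USER_AGENTS))
        print(i, fetch(request))
        time.sleep(pause())
```

Nothing is fetched until main() is called, so the request-building part can be tried without touching the network. The same caution as above applies: please do not run the loop extensively.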