Python Web Scraping: Download and display the content of robots.txt for en.wikipedia.org

Last update on May 28 2022 13:13:54 (UTC/GMT +8 hours)

Python Web Scraping: Exercise-2 with Solution

Write a Python program to download and display the content of robots.txt for en.wikipedia.org.

Sample Solution:

Python Code:

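import requests

# Download robots.txt from en.wikipedia.org; the timeout keeps the call from hanging
url = "https://en.wikipedia.org/robots.txt"
response = requests.get(url, timeout=10)
response.raise_for_status()  # stop on an HTTP error instead of printing an error page
robots_txt = response.text
print("robots.txt for https://en.wikipedia.org/")
print("===================================================")
print(robots_txt)

Sample Output:

robots.txt for https://en.wikipedia.org/
===================================================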
# robots.txt for https://www.wikipedia.org/ and friends
#
# Please note: There are a lot of pages on this site, and there are
# some misbehaved spiders out there that go _way_ too fast. If you're
# irresponsible, your access to the site may be blocked.
#

# Observed spamming large amounts of https://en.wikipedia.org/?curid=NNNNNN
# and ignoring 429 ratelimit responses, claims to respect robots:
# https://mj12bot.com/
User-agent: MJ12bot
Disallow: /

# advertising-related bots:
User-agent: Mediapartners-Google*
Disallow: /

# Wikipedia work bots:
User-agent: IsraBot
Disallow:

User-agent: Orthogaffe
Disallow:

# Crawlers that are kind enough to obey, but which we'd rather not have
# unless they're feeding search engines.
User-agent: UbiCrawler
Disallow: /

User-agent: DOC
Disallow: /

User-agent: Zao
Disallow: /

# Some bots are known to be trouble, particularly those designed to copy
# entire sites. Please obey robots.txt.
User-agent: sitecheck.internetseer.com
Disallow: /

User-agent: Zealbot
Disallow: /
............
#
Disallow: /wiki/Wikipedia:Article_Incubator
Disallow: /wiki/Wikipedia%3AArticle_Incubator
Disallow: /wiki/Wikipedia_talk:Article_Incubator
Disallow: /wiki/Wikipedia_talk%3AArticle_Incubator
#
Disallow: /wiki/Category:Noindexed_pages
Disallow: /wiki/Category%3ANoindexed_pages
#
#
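
Once the file has been downloaded, its rules can also be checked programmatically. The following is a minimal sketch, assuming the standard-library urllib.robotparser module and a short excerpt copied from the sample output above. The page URL and the user agent name ExampleBot are made up for illustration, and the real file contains many more rules than this excerpt.

```python
from urllib.robotparser import RobotFileParser

# Excerpt copied from the sample output above; the full file is much longer.
SAMPLE_RULES = """\
User-agent: MJ12bot
Disallow: /

User-agent: IsraBot
Disallow:
"""

parser = RobotFileParser()
# parse() accepts an iterable of lines, so no network access is needed here.
parser.parse(SAMPLE_RULES.splitlines())

# Hypothetical page URL, used only to exercise the rules.
page = "https://en.wikipedia.org/wiki/Web_scraping"

for agent in ("MJ12bot", "IsraBot", "ExampleBot"):
    # ExampleBot is a made-up name with no matching entry in the excerpt.
    verdict = "allowed" if parser.can_fetch(agent, page) else "blocked"
    print(f"{agent}: {verdict}")
```

With this excerpt, MJ12bot is reported as blocked, while IsraBot (empty Disallow) and the unlisted ExampleBot are reported as allowed. Against the complete file the result for an unlisted agent may differ, because the full robots.txt may also contain rules that apply to all user agents.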

Python: Tips of the Day

Find current directory and file's directory:

To get the full path to the directory a Python file is contained in, write this in that file:

import os
dir_path = os.path.dirname(os.path.realpath(__file__))

(Note that the code above may not work if you've already used os.chdir() to change your current working directory: __file__ can hold a path relative to the working directory at startup (for the main script, this is the case before Python 3.9), and its value is not updated by an os.chdir() call.)

To get the current working directory use

import os
cwd = os.getcwd()

Documentation references for the modules, constants and functions used above:

The os and os.path modules.

The __file__ constant

os.path.realpath(path) (returns "the canonical path of the specified filename, eliminating any symbolic links encountered in the path")

os.path.dirname(path) (returns "the directory name of pathname path")

os.getcwd() (returns "a string representing the current working directory")

os.chdir(path) ("change the current working directory to path")
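
The same lookup can be written with pathlib. This is only a sketch of an alternative, not the method from the tip above: the helper name containing_dir and the file name example.py are hypothetical, and the path is passed in as an argument so the snippet also runs where __file__ is not defined.

```python
import os
from pathlib import Path


def containing_dir(file_path):
    """Return the absolute, symlink-free directory that contains file_path."""
    # resolve() makes the path absolute and follows symbolic links,
    # much like os.path.realpath(); parent plays the role of os.path.dirname().
    return Path(file_path).resolve().parent


# Inside a script one would typically call containing_dir(__file__).
# Here a hypothetical file in the current working directory stands in for it.
example = containing_dir(os.path.join(os.getcwd(), "example.py"))
print(example)
```

Calling the helper once at the top of a script, before any os.chdir() call, stores the directory as an absolute path that later changes of the working directory do not affect.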

Ref: https://bit.ly/3fy0R6m