Python BeautifulSoup: Extract all the text from a given web page

Last update on May 28 2022 13:22:51 (UTC/GMT +8 hours)

BeautifulSoup: Exercise-12 with Solution

Write a Python program to extract all the text from a given web page.

Sample Solution:

Python Code:

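import requests
from bs4 import BeautifulSoup

url = 'https://www.python.org/'
# Download the page; fail early on a timeout or an HTTP error status.
response = requests.get(url, timeout=10)
response.raise_for_status()
# Parse the HTML (the 'lxml' parser requires the lxml package to be installed).
soup = BeautifulSoup(response.text, 'lxml')
print("Text from the said page:")
# get_text() returns the text of the whole document as a single string.
print(soup.get_text())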

Sample Output:

Text from the said page:
Welcome to Python.org
     {
       "@context": "https://schema.org",
       "@type": "WebSite",
       "url": "https://www.python.org/",
       "potentialAction": {
         "@type": "SearchAction",
         "target": "https://www.python.org/search/?q={search_term_string}",
         "query-input": "required name=search_term_string"
       }
     }
    

    var _gaq = _gaq || [];
    _gaq.push(['_setAccount', 'UA-39055973-1']);
    _gaq.push(['_trackPageview']);

    (function() {
        var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true;
        ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'https://www') + '.google-analytics.com/ga.js';
        var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s);
    })();
    
Notice: While Javascript is not essential for this website, your interaction with the content will be limited. Please turn Javascript on for the full experience. 

...........

▲ Back to Top

Help & General Contact
Diversity Initiatives
Submit Website Bug

Status 

Copyright ©2001-2019.
                             Python Software Foundation
                            Legal Statements
                             Privacy Policy
                             Powered by Heroku

window.jQuery || document.write('<script src="/static/js/libs/jquery-1.8.2.min.js"><\/script>')
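
The sample output above contains JavaScript and JSON-LD from script tags, plus a lot of stray whitespace. Whether get_text() returns script and style content depends on the installed Beautiful Soup version, so removing those tags explicitly keeps the result predictable. The following is a minimal sketch, not the original solution: SAMPLE_HTML, make_soup() and visible_text() are hypothetical names introduced for illustration, and an inline HTML string stands in for the downloaded page so the example runs without network access.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical sample markup standing in for a downloaded page.
SAMPLE_HTML = """
<html>
  <head>
    <title>Demo page</title>
    <style>body { color: black; }</style>
    <script>var tracker = [];</script>
  </head>
  <body>
    <h1>Welcome</h1>
    <p>First paragraph.</p>
  </body>
</html>
"""


def make_soup(html):
    """Parse with lxml when it is installed, else use the built-in parser."""
    try:
        return BeautifulSoup(html, "lxml")
    except FeatureNotFound:
        # Raised by Beautiful Soup when the requested parser is unavailable.
        return BeautifulSoup(html, "html.parser")


def visible_text(html):
    """Return the page text without script or style content."""
    soup = make_soup(html)
    # decompose() removes each matching tag and its contents from the tree.
    for tag in soup(["script", "style"]):
        tag.decompose()
    # stripped_strings yields each text fragment with surrounding
    # whitespace removed and skips fragments that are empty.
    return "\n".join(soup.stripped_strings)


page_text = visible_text(SAMPLE_HTML)
print(page_text)
```

To apply the same idea to a live page, pass the downloaded HTML (for example, the text attribute of the requests response) to visible_text() instead of SAMPLE_HTML.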



Python: Tips of the Day

Find current directory and file's directory:

To get the full path to the directory a Python file is contained in, write this in that file:

import os
dir_path = os.path.dirname(os.path.realpath(__file__))

(Note that the code above may not work if you have already called os.chdir() to change the current working directory: __file__ can hold a path relative to the working directory the script was started from, and it is not updated by an os.chdir() call. Since Python 3.9, the __file__ of the main script is an absolute path, so this caveat mainly affects older versions.)

To get the current working directory use

import os
cwd = os.getcwd()

Documentation references for the modules, constants and functions used above:

The os and os.path modules.

The __file__ constant

os.path.realpath(path) (returns "the canonical path of the specified filename, eliminating any symbolic links encountered in the path")

os.path.dirname(path) (returns "the directory name of pathname path")

os.getcwd() (returns "a string representing the current working directory")

os.chdir(path) ("change the current working directory to path")
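
The functions listed above can be tried together in a short sketch. It uses a temporary directory and a hypothetical file name, example_script.py, in place of __file__, so that it runs the same way from a script or an interactive session. The variable names are illustrative assumptions.

```python
import os
import tempfile

original_cwd = os.getcwd()

with tempfile.TemporaryDirectory() as tmp_dir:
    # Canonical form of the temporary directory, with symlinks resolved.
    tmp_real = os.path.realpath(tmp_dir)

    # Hypothetical file standing in for the script that __file__ refers to.
    script_path = os.path.join(tmp_dir, "example_script.py")
    open(script_path, "w").close()

    # Directory that contains the file.
    script_dir = os.path.dirname(os.path.realpath(script_path))

    os.chdir(tmp_dir)
    try:
        # After os.chdir(), os.getcwd() reports the new working directory.
        cwd_after_chdir = os.path.realpath(os.getcwd())
    finally:
        # Restore the working directory before the temporary one is removed.
        os.chdir(original_cwd)

print(script_dir == tmp_real)
print(cwd_after_chdir == tmp_real)
```

Both comparisons should print True: the file's directory and the working directory after os.chdir() refer to the same temporary directory.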

Ref: https://bit.ly/3fy0R6m