Simple Web Scraping Using requests, Beautiful Soup, and lxml in Python
When it comes to web scraping in Python, the first package that comes to mind is often Scrapy. However, Scrapy is better suited to larger scraping projects, and it has a learning curve of its own. For simple tasks where you only need to extract data from a single webpage, Scrapy can be overkill. In such cases, the requests library combined with Beautiful Soup or lxml lets you scrape the content you need very quickly.
Before we get started, we need to install the packages required for simple web scraping. The requests library will be used to download the webpage content. If you prefer a Pythonic way of extracting data from a webpage using the properties and methods of constructed objects, install the Beautiful Soup package; it also supports CSS selectors, which are handy for complex, nested elements. However, Beautiful Soup does not support XPath, so if you are more comfortable with XPath, use the lxml package instead.
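To make the difference concrete, here is a small sketch that extracts the same value from a snippet of HTML twice: once with a Beautiful Soup CSS selector and once with an lxml XPath expression. The snippet and class names are illustrative, not taken from any particular site.

```python
from bs4 import BeautifulSoup
from lxml import html

# An illustrative fragment of HTML (not from a real page).
snippet = '<div class="tags"><a class="tag" href="/tag/love/">love</a></div>'

# Beautiful Soup: extract via a CSS selector.
soup = BeautifulSoup(snippet, "lxml")
css_result = soup.select_one("div.tags a.tag").text

# lxml: extract the same text via XPath.
tree = html.fromstring(snippet)
xpath_result = tree.xpath('//div[@class="tags"]/a[@class="tag"]/text()')[0]

print(css_result, xpath_result)
```

Both approaches yield the same text; which one you pick is mostly a matter of which query language you already know.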
It is recommended to create a virtual environment and install the packages there so they won't interfere with system libraries. For simplicity, we will use conda to create the virtual environment. To make it easier to run Python code interactively, we will also install IPython:
(base) $ conda create --name simple_scrape python=3.10
(base) $ conda activate simple_scrape
(simple_scrape) $ pip install -U requests beautifulsoup4 lxml
(simple_scrape) $ pip install ipython
(simple_scrape) $ ipython
Note that the package to install is beautifulsoup4, not BeautifulSoup, which is the previous major release, Beautiful Soup 3.
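Inside the activated environment, a quick sanity check confirms that all three packages import correctly and reports their versions (the version numbers you see will depend on what pip installed):

```python
# Verify that the scraping packages are importable and report versions.
import requests
import bs4
from lxml import etree

print("requests:", requests.__version__)
print("beautifulsoup4:", bs4.__version__)
print("lxml:", etree.LXML_VERSION)  # a tuple such as (4, 9, 3, 0)
```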
In this tutorial, we will use http://quotes.toscrape.com/ as the demo. Once you master these scraping techniques, you can apply them to any website you like. We will try to get the top ten tags from this webpage:
First, we need to download the webpage so it can be scraped, which can be done with the…
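The whole flow can be sketched as follows: download the page with requests, then extract the tag names with both Beautiful Soup (CSS selectors) and lxml (XPath). The selectors below are assumptions based on inspecting the demo site's sidebar markup, and a small inline fallback sample is included so the sketch still runs if the site is unreachable:

```python
import requests
from bs4 import BeautifulSoup
from lxml import html

# Fallback fragment mimicking the sidebar markup, used only if the
# network request fails. The real page lists ten tags, not two.
SAMPLE = """
<div class="tags-box">
  <span class="tag-item"><a class="tag" href="/tag/love/">love</a></span>
  <span class="tag-item"><a class="tag" href="/tag/books/">books</a></span>
</div>
"""

try:
    resp = requests.get("http://quotes.toscrape.com/", timeout=10)
    resp.raise_for_status()
    page = resp.text
except requests.RequestException:
    page = SAMPLE

# Beautiful Soup: select the sidebar tag links with a CSS selector.
soup = BeautifulSoup(page, "lxml")
bs_tags = [a.text for a in soup.select("span.tag-item > a.tag")]

# lxml: the same extraction expressed as XPath.
tree = html.fromstring(page)
xpath_tags = tree.xpath('//span[@class="tag-item"]/a[@class="tag"]/text()')

print(bs_tags)
```

Both lists should be identical, which is a convenient cross-check that your CSS selector and XPath expression target the same elements.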