How to scrape JavaScript webpages using ProxyCrawl in Python

Learn a simple way to scrape JavaScript webpages

Lynn Kwong

--

Due to the increasing popularity of modern JavaScript frameworks such as React, Angular, and Vue, more and more websites are now built dynamically with JavaScript. This poses a challenge for web scraping because the HTML markup is not available in the source code. Therefore, we cannot scrape these JavaScript webpages directly and need to render them as regular HTML markup first. In this article, we will show how to render JavaScript webpages using ProxyCrawl, a handy web service for scraping JavaScript webpages.

Image by gTheMesh on Pixabay.

The demo site used in this tutorial is http://quotes.toscrape.com/js/. If you open this website, right-click on the page, and select “View page source”, you will see only some JavaScript code and no HTML markup. Luckily, for this site the data is embedded in the <script> tag. However, for many websites, especially those created with Angular, there is little data in the JavaScript code, and you must render the page before you can scrape it.
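For this particular demo site, the quote data can be pulled straight out of the <script> tag without rendering. Below is a minimal sketch of the idea; the `var data = [...]` pattern is what the demo site uses, but the sample source string here is a simplified stand-in so the snippet runs offline:

```python
import json
import re

# The static source of http://quotes.toscrape.com/js/ embeds the quotes
# as a JavaScript array assigned to a variable named `data`. A simplified
# sample is used here so the snippet runs without a network request.
page_source = """
<script>
var data = [{"author": {"name": "Albert Einstein"}, "text": "Sample quote."}];
</script>
"""

# Extract the JSON array assigned to `data` with a regular expression,
# then parse it with the standard json module.
match = re.search(r"var data = (\[.*?\]);", page_source, re.DOTALL)
quotes = json.loads(match.group(1))
print(quotes[0]["author"]["name"])  # Albert Einstein
```

This trick only works when the data happens to be embedded as JSON in the source; for pages that build their content entirely in JavaScript, rendering is unavoidable.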

Before getting started, we need to install the packages needed for web scraping. It is recommended to create a virtual environment and install the packages there so they won’t interfere with system libraries. For simplicity, we will use conda to create the virtual environment. To make it easier to run Python code interactively, we will also install IPython:

(base) $ conda create --name js_scrape python=3.10
(base) $ conda activate js_scrape
(js_scrape) $ pip install -U requests lxml
(js_scrape) $ pip install ipython
(js_scrape) $ ipython
  • requests — Used to download the webpage content.
  • lxml — Used to scrape the rendered HTML markup using XPath.
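As a quick check that lxml is working, you can run XPath queries against a small piece of rendered markup. The fragment below mirrors the structure of the rendered quotes page (treat the exact class names as assumptions based on the demo site):

```python
from lxml import html

# A fragment resembling the rendered markup of the demo site.
rendered = """
<div class="quote">
  <span class="text">The world as we have created it is a process of our thinking.</span>
  <small class="author">Albert Einstein</small>
</div>
"""

# Parse the fragment and extract quote texts and authors with XPath.
tree = html.fromstring(rendered)
texts = tree.xpath('//div[@class="quote"]/span[@class="text"]/text()')
authors = tree.xpath('//div[@class="quote"]/small[@class="author"]/text()')
print(list(zip(authors, texts)))
```

Once the page has been rendered by ProxyCrawl, the same XPath expressions can be applied to the returned HTML.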

Let’s first try to explore the ProxyCrawl Crawling API a bit.
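A Crawling API request is just an HTTP GET to the ProxyCrawl endpoint with your token and the URL-encoded target URL. Here is a minimal sketch; the token below is a placeholder, and using your JavaScript token (rather than the normal token) is what tells ProxyCrawl to render the page in a headless browser before returning the HTML:

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://api.proxycrawl.com/"

def build_api_url(token: str, target_url: str) -> str:
    """Build a ProxyCrawl Crawling API request URL.

    Pass your JavaScript token so the page is rendered before
    the HTML markup is returned.
    """
    # urlencode takes care of percent-encoding the target URL.
    return API_ENDPOINT + "?" + urlencode({"token": token, "url": target_url})

api_url = build_api_url("YOUR_JS_TOKEN", "http://quotes.toscrape.com/js/")
print(api_url)

# The rendered HTML can then be fetched with requests:
# import requests
# response = requests.get(api_url)
# html_markup = response.text
```

The actual request is commented out so the snippet can run without a valid token; substitute your own JavaScript token from the ProxyCrawl dashboard to try it.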

The first 1,000 requests to the ProxyCrawl API are free, and if you add your billing details, you can get an extra 9,000 free requests. Therefore, if you just need to scrape some JavaScript webpages from time to time…

--


Lynn Kwong

I’m a Software Developer (https://medium.com/@lynn-kwong) keen on sharing thoughts, tutorials, and solutions for the best practice of software development.