Beer parser
Do you want some beer? How much?
To get the most out of this site, the parser has been split into two smaller parsers.
The first parser works only with the static version of the site and collects data from the server's XHR responses.
Goals
- Collect as much data as possible and update it over time.
Must collect the following data:
- Name
- Price
- Description
- Availability
- Link
Solution
First part of the parser
First, I had to discover the site's API, which not every site exposes.
This is what a link for iterating through the entire site database looks like:
https://www.beerwulf.com/en-GB/api/search/searchProducts?routeQuery=c&page={0}
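A minimal sketch of paging through this endpoint with requests; the `items` key is an assumption about the shape of the response JSON, so check the real payload and adjust:

```python
import requests

API_URL = "https://www.beerwulf.com/en-GB/api/search/searchProducts?routeQuery=c&page={0}"

def fetch_all_products():
    """Page through the search API until an empty page signals the end."""
    products = []
    page = 1
    while True:
        response = requests.get(API_URL.format(page), timeout=10)
        response.raise_for_status()
        # "items" is an assumed field name, not confirmed by the site.
        items = response.json().get("items", [])
        if not items:
            break  # an empty page means the catalogue is exhausted
        products.extend(items)
        page += 1
    return products
```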
Afterwards, we will install the necessary packages and create the parser structure. Here are the required packages:
- xlwt
- lxml
- requests
Structure:
- .venv <- where installed packages are located
- data <- where the parsing results will be stored
- info.md <- general information about the parser
- main_render.py <- script that parses dynamic site data
- main_req.py <- script that parses through the site API
- req.txt <- dependency file
I decided to save the parsing results in both JSON and XLS formats.
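A sketch of that dual save, assuming each product is a dict keyed by the fields from the goals list (the lowercase key names are placeholders for whatever the API actually returns):

```python
import json
import xlwt

# Column order for the XLS sheet; these keys are an assumed normalised
# schema, not the site's raw field names.
COLUMNS = ["name", "price", "description", "availability", "link"]

def save_results(products, stem="data/beer"):
    # JSON: dump the list of product dicts as-is.
    with open(f"{stem}.json", "w", encoding="utf-8") as f:
        json.dump(products, f, ensure_ascii=False, indent=2)

    # XLS: a header row followed by one row per product.
    book = xlwt.Workbook()
    sheet = book.add_sheet("beer")
    for col, title in enumerate(COLUMNS):
        sheet.write(0, col, title)
    for row, product in enumerate(products, start=1):
        for col, key in enumerate(COLUMNS):
            sheet.write(row, col, str(product.get(key, "")))
    book.save(f"{stem}.xls")
```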
Second part of the parser
Let's install the necessary package:
- requests_html
Afterwards, let's dig into the layout of the beer cards and find selectors for all the necessary data.
Parse and save.
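A minimal sketch of that dynamic part with requests_html; the listing URL and the CSS selectors are assumptions to be replaced with whatever the real card layout uses:

```python
from requests_html import HTMLSession

session = HTMLSession()
page = session.get("https://www.beerwulf.com/en-GB/c/beers")
page.html.render()  # launches Chromium via pyppeteer and executes the page's JS

# Placeholder selectors: inspect the beer cards in the browser's dev
# tools and substitute the real ones.
for card in page.html.find(".product-card"):
    name = card.find(".product-title", first=True)
    price = card.find(".price", first=True)
    print(name.text if name else "?", price.text if price else "?")
```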
Result
As a result, we get two completely different parsers in one, letting us choose both how to parse and how to save the collected data.
How to get:
- GitHub Repository
Additional materials
The Python package requests_html always tries to download a version of the Chromium engine that no longer exists.
Therefore, at the beginning of the dynamic-parsing script, we had to pin the current engine version through the environment variable PYPPETEER_CHROMIUM_REVISION.
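Something along these lines at the top of main_render.py, assuming the revision number shown is replaced with a Chromium build that still exists on the snapshot servers:

```python
import os

# pyppeteer reads this variable at import time, so set it before
# requests_html is imported. The revision number is a placeholder.
os.environ["PYPPETEER_CHROMIUM_REVISION"] = "1181205"

from requests_html import HTMLSession  # imported only after the pin is set
```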