Beer parser

Do you want some beer? How much?

This parser was split into two smaller parsers to get the most out of the site.

The first parser works only with the static version of the site, collecting data from the server's XHR responses.

Goals

  • Collect as much data as possible and update it over time.

It must collect the following data:

  • Name

  • Price

  • Description

  • Availability

  • Link

Solution

First part of the parser

First, I had to discover the site's API; not every site exposes one.

This is the URL used to page through the entire product database:

https://www.beerwulf.com/en-GB/api/search/searchProducts?routeQuery=c&page={0}
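Paging through this endpoint can be sketched as follows. The `items` key and the empty-page stop condition are assumptions about the response schema, not something the article confirms, so inspect the real payload in the browser's network tab first.

```python
import requests

# URL template from above; {0} is the page index.
API_URL = ("https://www.beerwulf.com/en-GB/api/search/"
           "searchProducts?routeQuery=c&page={0}")

def fetch_page(page):
    """Fetch one page of search results and return the parsed JSON."""
    resp = requests.get(API_URL.format(page), timeout=10)
    resp.raise_for_status()
    return resp.json()

def iter_products(max_pages=200):
    """Yield products page by page until an empty page comes back.

    The "items" key and the stop condition are assumed, not confirmed.
    """
    for page in range(max_pages):
        items = fetch_page(page).get("items") or []
        if not items:
            break
        yield from items
```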

Afterwards, we will install the necessary packages and create the parser structure. Here are the required packages:

  • xlwt

  • lxml

  • requests

Structure:

  • .venv <- where installed packages are located

  • data <- where the parsing results will be stored

  • info.md <- general information about the parser

  • main_render.py <- script that parses dynamic site data

  • main_req.py <- script that parses through the site API

  • req.txt <- dependency file

It was decided to save the parsing results in both JSON and XLS formats.

Second part of the parser

Let's install the necessary package:

  • requests_html

Afterwards, let's inspect the layout of the beer cards and find selectors for all the necessary data.

Then we parse and save.
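The extraction step can be sketched independently of the rendering engine. The class names below (`product-card`, `product-name`, `product-price`, `product-link`) are hypothetical stand-ins for the real selectors, and the stdlib `html.parser` is used here only for illustration; the actual script works on HTML rendered by `requests_html`.

```python
from html.parser import HTMLParser

# A made-up card snippet standing in for the site's real markup.
SAMPLE_CARD = """
<div class="product-card">
  <a class="product-link" href="/en-GB/p/some-beer">
    <span class="product-name">Some Beer</span>
    <span class="product-price">£2.49</span>
  </a>
</div>
"""

class CardParser(HTMLParser):
    """Collect text from elements whose class matches a wanted selector."""
    WANTED = {"product-name": "name", "product-price": "price"}

    def __init__(self):
        super().__init__()
        self.data = {}
        self._field = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        cls = attrs.get("class", "")
        if cls in self.WANTED:
            self._field = self.WANTED[cls]   # next text node fills this field
        if cls == "product-link":
            self.data["link"] = attrs.get("href", "")

    def handle_data(self, data):
        if self._field and data.strip():
            self.data[self._field] = data.strip()
            self._field = None

parser = CardParser()
parser.feed(SAMPLE_CARD)
# parser.data -> {'link': '/en-GB/p/some-beer', 'name': 'Some Beer', 'price': '£2.49'}
```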

Result

As a result, we get two completely different parsers in one, where we can choose both how to parse and how to save the collected data.


Additional materials

The Python package requests_html always tries to download a version of the Chromium engine that no longer exists.

Therefore, at the beginning of the dynamic-parsing script, we had to specify a current engine revision through the PYPPETEER_CHROMIUM_REVISION environment variable.
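For example (the revision number below is only a placeholder, not the one the project used; pick a revision that still exists on the Chromium snapshot server):

```python
import os

# Pyppeteer reads this variable when it is imported, so it must be set
# before the first `from requests_html import HTMLSession`.
# "1263111" is a placeholder revision, not a known-good value.
os.environ["PYPPETEER_CHROMIUM_REVISION"] = "1263111"

# from requests_html import HTMLSession  # import only after setting the variable
```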

