Food and ingredients parser

Losing weight is easy if you know what to eat

The parser works from the site's built-in HTML sitemap. It first collects all the links in the map into a separate JSON file for convenience.

After reading that JSON file, the parser visits each page in turn and collects the required data.

The results are saved as JSON and CSV files, separately for each category.

Goals

  • Collect all data from the site about available products.

Record the following information for each product:

  • Name

  • Calories

  • Proteins

  • Fats

  • Carbohydrates
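The five fields above can be modelled as one record per product. This is a minimal sketch, not the project's actual code: the `Product` class and its field names are hypothetical, and storing the nutritional values as floats is an assumption, since the source does not state units or types.

```python
from dataclasses import dataclass, asdict


@dataclass
class Product:
    """One product record (hypothetical structure; units are not given in the source)."""
    name: str
    calories: float
    proteins: float
    fats: float
    carbohydrates: float


def to_row(product):
    # Convert a record to a plain dict, ready for JSON or CSV output.
    return asdict(product)
```

A plain dict per product would work just as well; a dataclass only makes the expected fields explicit.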

Solution

First stage. Parsing the HTML map

At this stage, the site's HTML map is parsed so that every link extracted from it is a food category.

All the categories found are saved to a JSON file, allCategories.json, as a dictionary in which the key is the category name and the value is the link.
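The first stage could look roughly like the sketch below. It assumes that every `<a>` tag in the map is a category link and uses only the standard library. The class and function names are hypothetical, and the original script may well use a different HTML library.

```python
import json
from html.parser import HTMLParser


class CategoryLinkParser(HTMLParser):
    """Collects {link text: href} pairs from every <a> tag in the document."""

    def __init__(self):
        super().__init__()
        self.categories = {}
        self._href = None   # href of the <a> tag currently open, if any
        self._text = []     # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            name = "".join(self._text).strip()
            if name:
                self.categories[name] = self._href
            self._href = None


def extract_categories(sitemap_html):
    # Assumption: every link in the map points to a food category.
    parser = CategoryLinkParser()
    parser.feed(sitemap_html)
    return parser.categories


def save_categories(categories, path="allCategories.json"):
    # ensure_ascii=False keeps non-English category names readable in the file.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(categories, f, ensure_ascii=False, indent=4)
```

If the real map also contains non-category links, `extract_categories` would need an extra filter, for example on the link's path prefix.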

Second stage. Parsing categories

At the second stage, each category page is parsed: the product cards on it are located and the required data is saved.

Parsing a category produces three files:

  • category_name.html

  • category_name.csv

  • category_name.json
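As an illustration of the card-parsing step, the sketch below extracts product rows from a category page. The page layout is an assumption: each product is taken to be a `<tr>` with exactly five `<td>` cells in the order name, calories, proteins, fats, carbohydrates. The real markup is not described here, so the selectors would have to be adapted, and all names in the code are hypothetical.

```python
from html.parser import HTMLParser

FIELDS = ["Name", "Calories", "Proteins", "Fats", "Carbohydrates"]


class ProductTableParser(HTMLParser):
    """Reads <tr> rows and keeps those with exactly one <td> per field."""

    def __init__(self):
        super().__init__()
        self.products = []
        self._row = None    # cells of the <tr> currently open, if any
        self._cell = None   # text fragments of the <td> currently open, if any

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            # Header rows (<th> only) and malformed rows are skipped.
            if len(self._row) == len(FIELDS):
                self.products.append(dict(zip(FIELDS, self._row)))
            self._row = None
            self._cell = None


def parse_products(category_html):
    parser = ProductTableParser()
    parser.feed(category_html)
    return parser.products
```

Values are kept as strings exactly as they appear on the page; converting them to numbers is left to a later step.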

Why save the HTML of each category page?

To reduce the load on the target site and to make the parser easier to debug.
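One way to get both benefits is to download a page only when no saved copy exists. The sketch below is an assumption about how this could be done, not the project's code: the function names, the `data` directory, and UTF-8 decoding are all hypothetical choices.

```python
from pathlib import Path
from urllib.request import urlopen


def fetch_page(url):
    # Assumption: the site serves UTF-8 encoded pages.
    with urlopen(url) as response:
        return response.read().decode("utf-8")


def load_category_html(name, url, cache_dir="data", fetch=fetch_page):
    """Return the category page HTML, downloading it only if no saved copy exists."""
    path = Path(cache_dir) / f"{name}.html"
    if path.exists():
        return path.read_text(encoding="utf-8")
    html = fetch(url)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(html, encoding="utf-8")
    return html
```

While the parsing logic is being debugged, repeated runs read the saved category_name.html files and send no further requests to the site. Category names containing characters that are invalid in file names would need to be sanitized first.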

Why save the result in 2 formats?

I save the results as JSON because JSON files are easier to work with in Python. The CSV or XLS files are for customers and for other analysis programs.
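Writing the same records to both formats takes only a few lines with the standard library. This is a hedged sketch: `save_category`, the `data` output directory, and the column names in `FIELDS` are assumptions made for illustration.

```python
import csv
import json
from pathlib import Path

FIELDS = ["Name", "Calories", "Proteins", "Fats", "Carbohydrates"]


def save_category(name, products, out_dir="data"):
    """Write a list of product dicts to <name>.json and <name>.csv."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    json_path = out / f"{name}.json"
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(products, f, ensure_ascii=False, indent=4)

    csv_path = out / f"{name}.csv"
    # newline="" is what the csv module expects for files it writes.
    with open(csv_path, "w", encoding="utf-8", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(products)

    return json_path, csv_path
```

Each dict in `products` must have exactly the keys listed in `FIELDS`; `csv.DictWriter` raises `ValueError` on unexpected keys.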

Result

The result is a parser that collects all the data about categories and recipes on the site in 10 seconds.

Sources can be viewed here

Repository

Or download the archive with the script directly

Archive
