Food and ingredients parser
Losing weight is easy if you know what to eat
This parser reads the site's built-in HTML sitemap and first collects every link it contains into a separate JSON file for convenience.
It then reads that JSON file and visits each page in turn, collecting the necessary data.
The results are saved as JSON and CSV files, separately for each category.
Goals
- Collect all data from the site about available products.
Store the following information for each product:
- Name
- Calories
- Proteins
- Fats
- Carbohydrates
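As a rough illustration of the record listed above, one product could be modelled like this. The class name, field names, and numeric types are assumptions made for this sketch, not taken from the script, and the sample values are placeholders.

```python
from dataclasses import dataclass, asdict


@dataclass
class Product:
    """One product record; names and types are illustrative assumptions."""
    name: str
    calories: float
    proteins: float
    fats: float
    carbohydrates: float


# Placeholder values, for illustration only.
sample = Product("Example product", 100.0, 1.0, 2.0, 3.0)

# asdict() gives a plain dictionary, which is convenient for JSON and CSV output.
print(asdict(sample))
```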
Solution
First stage. Parsing HTML map
At this stage, the HTML map I found on the site is parsed. Every link in it points to a food category.
All found categories are saved in a JSON file, allCategories.json, as a dictionary where the key is the category name and the value is the link.
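The first stage could be sketched roughly as follows. This is a minimal version that uses only the standard library's html.parser; the actual script may use a different HTML library. The names LinkCollector, collect_categories, and save_categories, the example.com URLs, and the sample markup are all hypothetical.

```python
import json
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects {link text: absolute URL} for every <a href="..."> in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = {}
        self._href = None  # href of the <a> currently open, if any
        self._text = []    # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            name = "".join(self._text).strip()
            if name:
                # Key: category name, value: absolute link.
                self.links[name] = urljoin(self.base_url, self._href)
            self._href = None


def collect_categories(sitemap_html, base_url):
    """Return a dictionary of category name -> link found in the sitemap HTML."""
    parser = LinkCollector(base_url)
    parser.feed(sitemap_html)
    parser.close()
    return parser.links


def save_categories(categories, path="allCategories.json"):
    """Write the dictionary to a JSON file, keeping non-ASCII names readable."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(categories, f, ensure_ascii=False, indent=4)


# Hypothetical sitemap fragment; the real page layout is not shown in this README.
sample_html = (
    '<ul><li><a href="/category/fruits">Fruits</a></li>'
    '<li><a href="/category/vegetables">Vegetables</a></li></ul>'
)
categories = collect_categories(sample_html, "https://example.com")
```

Calling save_categories(categories) would then write allCategories.json to the current directory.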
Second stage. Parsing categories
At the second stage, each category page is parsed: the product cards are located and the necessary data is saved.
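The README does not show what a product card looks like in the page markup, so the following is only a sketch under a stated assumption: each product is a table row with five cells in the order name, calories, proteins, fats, carbohydrates. RowCollector, parse_products, and FIELDS are hypothetical names, and the values are kept as the strings found in the page.

```python
from html.parser import HTMLParser

# Assumed column order of a product row (not confirmed by the README).
FIELDS = ["name", "calories", "proteins", "fats", "carbohydrates"]


class RowCollector(HTMLParser):
    """Collects the <td> texts of every <tr> as a list of strings."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the <tr> currently open
        self._cell = None  # text fragments of the <td> currently open

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:  # header rows made of <th> cells end up empty
                self.rows.append(self._row)
            self._row = None


def parse_products(category_html):
    """Return one dictionary per row that has exactly one cell per field."""
    collector = RowCollector()
    collector.feed(category_html)
    collector.close()
    return [dict(zip(FIELDS, row)) for row in collector.rows if len(row) == len(FIELDS)]


# Hypothetical category page fragment with placeholder values.
sample_html = """
<table>
  <tr><th>Product</th><th>Calories</th><th>Proteins</th><th>Fats</th><th>Carbohydrates</th></tr>
  <tr><td>Example product</td><td>100</td><td>1</td><td>2</td><td>3</td></tr>
</table>
"""
products = parse_products(sample_html)
```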
Parsing a category produces three files:
- category_name.html
- category_name.csv
- category_name.json
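Writing the JSON and CSV files for one category might look like the sketch below, using only the standard json and csv modules. The function name save_category, the out_dir parameter, and the field names are assumptions for illustration; products is assumed to be a list of dictionaries with those fields.

```python
import csv
import json
from pathlib import Path

# Assumed field names, matching the product information listed in the goals.
FIELDS = ["name", "calories", "proteins", "fats", "carbohydrates"]


def save_category(category_name, products, out_dir="."):
    """Write <category_name>.json and <category_name>.csv and return both paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    json_path = out / f"{category_name}.json"
    csv_path = out / f"{category_name}.csv"

    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(products, f, ensure_ascii=False, indent=4)

    # newline="" is what the csv module documentation recommends for files it writes.
    with open(csv_path, "w", encoding="utf-8", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(products)

    return json_path, csv_path
```

The JSON file keeps the data as a list of dictionaries for further work in Python, while the CSV file has one header row followed by one row per product.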
Why save the HTML category page?
To reduce the load on the target site and to make the parser easier to debug.
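One way to get both benefits is to download a category page only when no saved copy exists and to read the local file otherwise. The sketch below assumes that approach; load_category_html, download, and the cache_dir and fetch parameters are hypothetical names, and the page is assumed to be UTF-8 encoded.

```python
from pathlib import Path
from urllib.request import urlopen


def download(url):
    """Fetch a page over HTTP and return its text (assumes UTF-8)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8")


def load_category_html(category_name, url, cache_dir=".", fetch=download):
    """Return the category page, hitting the site only if no saved copy exists."""
    cached = Path(cache_dir) / f"{category_name}.html"
    if cached.exists():
        # Reuse the saved page: no request is sent to the target site.
        return cached.read_text(encoding="utf-8")
    html = fetch(url)
    cached.parent.mkdir(parents=True, exist_ok=True)
    cached.write_text(html, encoding="utf-8")
    return html
```

While debugging the parsing code, repeated runs then work from the saved category_name.html file instead of requesting the page again.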
Why save the result in two formats?
I save files in JSON because they are easier to work with in Python. Files in CSV or XLS format are meant for customers and for other analysis programs.
Result
We have a parser that can parse all the data about categories and recipes on the site in 10 seconds.
Sources can be viewed here
Or download the archive with the script immediately