Scraping an online shop with Python, using the Wildberries website as an example
Introduction and overview of the scraper
Nowadays, parsing online stores is not easy. All of them are quite advanced in terms of protection from parsers and bots. They rely on protections such as dynamically rendered content and web application firewalls. One of the best-known companies providing such protection is Cloudflare.
In this tutorial, I will show various methods of bypassing blocking technologies, using the online store Wildberries as an example. More specifically, I will show how to build a parser based on the Selenium library and how to set up proxy rotation for it. We will use free proxies.
And as a bonus, I will provide a proxy scraper and a proxy checker. They will help you build your own proxy lists for various sites.
Writing a basic parser
First, let's write and walk through the parser. You will need to create and activate a virtual environment.
If it is a Windows system:
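```
python -m venv venv
venv\Scripts\activate
```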
If it is a *nix-like system:
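```
python3 -m venv venv
source venv/bin/activate
```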
After that, we will install the necessary packages:
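The exact list ships with the archive; at a minimum, the parser described here needs Selenium:

```
pip install selenium
```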
Now for the imports. In your working directory, there should be a folder proxy_rotator. This package controls which proxy is issued on request: it selects proxies at random, weighted by each proxy's weight. The higher the weight, the more likely it is that this proxy will be selected.
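The package itself is in the archive; below is a minimal sketch of how such a weighted rotator could work. The class and method names (ProxyRotator, get, punish) and the file format are my assumptions, not necessarily those of the original package:

```python
import json
import random


class ProxyRotator:
    """Issues proxies at random, weighted by each proxy's weight."""

    def __init__(self, path):
        # Assumed format: a JSON file mapping "host:port" -> weight.
        with open(path, encoding="utf-8") as f:
            self.proxies = json.load(f)

    def get(self):
        # The higher the weight, the more likely the proxy is selected.
        hosts = list(self.proxies)
        weights = list(self.proxies.values())
        return random.choices(hosts, weights=weights, k=1)[0]

    def punish(self, proxy):
        # Lower the weight of a failing proxy, down to a minimum of 1.
        self.proxies[proxy] = max(1, self.proxies[proxy] - 1)
```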
Let's add a few more utility functions (a sketch follows the list):
- saving to JSON
- saving to HTML
- loading the proxy list
- scraping the data we need
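Here is a hedged sketch of these helpers; the file names are placeholders:

```python
import json


def save_json(data, path="data.json"):
    # Dump the scraped data to disk as pretty-printed JSON.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False, indent=2)


def save_html(html, path="page.html"):
    # Keep the raw page source; handy for debugging selectors offline.
    with open(path, "w", encoding="utf-8") as f:
        f.write(html)


def load_proxies(path="proxies.json"):
    # Load the prepared proxy list for the rotator.
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```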
In my case, I decided to parse the main page and collect all prices and product titles. The parse_data function determines what to parse and where to save it.
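The original parse_data is in the archive; this sketch only shows the idea, using the save_json/save_html helpers from above. The CSS class names are placeholders: Wildberries changes its markup regularly, so check the live page before relying on them:

```python
from selenium.webdriver.common.by import By


def parse_data(driver):
    # Collect product titles and prices from the main page.
    # The selectors are illustrative placeholders, not guaranteed
    # to match the current Wildberries markup.
    titles = driver.find_elements(By.CLASS_NAME, "product-card__name")
    prices = driver.find_elements(By.CLASS_NAME, "price__lower-price")
    items = [{"title": t.text, "price": p.text} for t, p in zip(titles, prices)]
    save_json(items, "wildberries.json")
    save_html(driver.page_source, "wildberries.html")
```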
Add these lines of code at the very bottom. They call the run() function only when the script is executed directly, rather than imported as a module.
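```python
if __name__ == "__main__":
    run()
```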
Now this is what the main run() function looks like:
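The archive contains the original; below is a minimal sketch of the same logic, assuming the ProxyRotator API and file names from the earlier sketches and a Chrome driver:

```python
from selenium import webdriver
from selenium.common.exceptions import TimeoutException

URL = "https://www.wildberries.ru/"  # target page; adjust as needed


def run():
    rotator = ProxyRotator("proxies.json")  # the prepared proxy list
    while True:
        proxy = rotator.get()
        options = webdriver.ChromeOptions()
        options.add_argument(f"--proxy-server={proxy}")
        driver = webdriver.Chrome(options=options)
        driver.set_page_load_timeout(30)
        try:
            driver.get(URL)
            parse_data(driver)
            break  # success: data is saved, stop rotating
        except TimeoutException:
            # Bad proxy: report it and pick another one.
            print(f"Proxy {proxy} timed out, rotating...")
            rotator.punish(proxy)
        finally:
            driver.quit()
```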
First, we create a proxy rotator by loading the prepared list into it (see the next chapter for how to create your own). Then, in a loop, we create a Selenium driver and assign a proxy to it. If the proxy is bad, a TimeoutException is raised, which triggers a message in the console and a proxy replacement.
That was the basic parser, fully working for the Wildberries website. In this archive you will find both the proxy_rotator package and a list of ready-made proxies, although I cannot guarantee they will still work by the time you read this article.
Collecting the free proxies
Since Wildberries is an online store operating in Russia and the CIS, the proxies should come from that region so that our parser looks more like an ordinary user.
So, how and where can you get free proxies? I present to you my free proxy scraper, which can select and filter proxies by country, protocol, and proxy type.
To collect only Russian proxies using http and https protocols, enter the following command:
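The exact flags are specific to my tool; an invocation could look like this (the flag names here are illustrative, check the tool's -h output for the real ones):

```
python proxy_scraper.py --country RU --protocol http,https
```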
If you want to know which codes correspond to which countries, enter:
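Again, the flag name below is illustrative:

```
python proxy_scraper.py --list-countries
```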
As a result, you will receive JSON files with proxy lists. All such files are located in the data directory. Everything runs in parallel, and you can stop the script at any time once you think you have enough.
Checking free proxies
So, we have collected hundreds of proxies, and I can guarantee you that most of them are outright garbage. We will need to filter them using the target site, i.e., Wildberries.
To check whether a proxy works on a particular site, I created a special CLI tool, which you can download from the link in the previous sentence (⊙_(⊙_⊙)_⊙). Here is the command to check a list of such proxies:
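```
# script and file names here are illustrative; -i, -o and -U are the tool's flags
python proxy_checker.py -i data/proxies.json -o weighted_proxies.json -U https://www.wildberries.ru
```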
- -i is the file with the proxies you got using my proxy scraper
- -o is the name of the result file, where each proxy will be assigned a weight
- -U is the list of websites to check
More options can be viewed using the -h flag. In this case, though, we are more interested in the log.txt file: it stores the results of the checks for each proxy and how many times it successfully connected to the target site. Choose the most successful proxies and combine them into one JSON file, which you will then use to parse sites.
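As an illustration, assuming the -o result file maps each proxy to its weight (the format here is my assumption), the selection step could look like this:

```python
import json

# Assumed format of the checker's -o output: {"host:port": weight, ...}
with open("weighted_proxies.json", encoding="utf-8") as f:
    weighted = json.load(f)

# Keep only proxies that connected successfully often enough
# (the threshold of 3 is arbitrary).
best = {proxy: weight for proxy, weight in weighted.items() if weight >= 3}

with open("proxies.json", "w", encoding="utf-8") as f:
    json.dump(best, f, indent=2)
```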
Related questions
- Is scraping data from Instagram illegal? If the data you are going to collect is public and accessible to everyone, then it is generally allowed. Plus, Instagram provides a special API for accessing data, so there should be no problems.
- How is scraping done? It all depends on what you scrape and what you scrape with. You can scrape documents and tables, or you can scrape websites. Websites are harder to scrape than documents, because there are many of them and each has its own architecture, which greatly complicates scraping.
- Which language should you use to write a scraper? The most universal language for parsers is Python: for any information, in whatever form and format it is presented, there is a library. Although there are alternatives such as JavaScript, Ruby, Go, C++, and PHP.