Scraping an online shop with Python, using the Wildberries website as an example.

Introduction and overview of the scraper

Nowadays, parsing online stores is not easy. Most of them have fairly advanced protection against parsers and bots, such as dynamically rendered content and web application firewalls. One of the best-known companies providing this kind of protection is Cloudflare.
In this tutorial, I will show several ways to get around these blocking technologies, using the online store Wildberries as an example. More specifically, I will show how to build a parser based on the Selenium library and how to set up proxy rotation for it using free proxies.
Please note that my goal is not to parse any specific data from the store; I just want to demonstrate the methodology and structure of parsers for such sites. The parser will also be single-threaded so that it is easier to see how it works.
As a bonus, I will also provide a proxy scraper and a proxy checker that will help you build your own proxy lists for various sites.

Writing a basic parser

First, let's go through the parser itself. You will need to create and activate a virtual environment:
python -m venv .venv
On Windows:
.\.venv\Scripts\activate
On a *nix-like system:
source ./.venv/bin/activate
After that, we will install the necessary packages:
pip install selenium requests lxml beautifulsoup4
Now the imports. Your working directory should contain a proxy_rotator folder. This package controls which proxy is handed out on each request: proxies are picked at random according to their weight, so the higher the weight, the more likely a proxy is to be selected.
import json
from bs4 import BeautifulSoup
# For creating the webdriver
from selenium import webdriver
# For locating elements on the page
from selenium.webdriver.common.by import By
# For passing options to the driver when it is created
from selenium.webdriver.firefox.options import Options
# For explicit waits on page elements
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
# For proxy rotation
from proxy_rotator import Proxy, Rotator
from config import URL
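The config module itself is not shown in the article; I assume it contains nothing but the start URL, something like this (the exact address is just an example):

# config.py -- assumed contents; only URL is imported by the parser
URL = "https://www.wildberries.ru/"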
Let's also add a few utility functions:
  1. loading the proxy list from a file
  2. saving HTML to a file
  3. saving results to JSON
  4. parsing the collected data
def load_proxies(file_path):
    result = []
    with open(file_path, 'r', encoding='utf-8') as file:
        result = json.loads(file.read())
    return result

def save_to_html(path, text):
    with open(path, 'w', encoding='utf-8') as file:
        file.write(text)

def save_to_json(path, data):
    with open(path, 'w', encoding='utf-8') as file:
        json.dump(data, file, indent=2, ensure_ascii=False)

def parse_data(path):
    data = []
    with open(path, 'r', encoding='utf-8') as file:
        soup = BeautifulSoup(file.read(), 'lxml')
        soup_blocks = soup.find_all('div', class_='product-card__wrapper')
        for block in soup_blocks:
            title = block.find('h2', class_='product-card__brand-wrap').text
            price = block.find('ins', class_='price__lower-price').text
            data.append({
                'title': title,
                'price': price
            })
    save_to_json('result.json', data)
In my case, I decided to parse the main page and collect all product titles and prices. The parse_data function determines what to extract and where to save it.
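After a successful run, result.json will contain a flat list of title/price records. With placeholder values (not real scraped data), an entry looks roughly like this:

[
  {
    "title": "Some brand / Some product",
    "price": "1 234 ₽"
  }
]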
Add these lines at the very bottom of the file. They call the run() function only when the script is executed directly by the Python interpreter:
if __name__ == "__main__":
    run()
Now this is what the main run() function looks like:
def run():
    proxies: list[Proxy] = []
    # Load proxies to memory
    proxies_json = load_proxies("proxies.json")
    for proxy_item in proxies_json:
        proxies.append(Proxy(
            proxy_item['path'],
            proxy_item['type'],
            proxy_item['protocol'],
            proxy_item['status'],
            proxy_item['weight'])
        )

    # Make a rotator for the proxies
    rotator = Rotator(proxies)
    isFinished = False
    index = 0
    max_pages = 10
    res = ""
    while not isFinished:
        try:
            proxy = rotator.get()
            print(f"Using proxy: {proxy}")
            # Create a webdriver
            firefox_opt = Options()
            firefox_opt.add_argument('--headless')
            firefox_opt.add_argument("--no-sandbox")
            firefox_opt.add_argument("--disable-dev-shm-usage")
            firefox_opt.add_argument(f"--proxy-server={proxy}")
            driver = webdriver.Firefox(options=firefox_opt)
            driver.get(URL)
            # Retrieve the product cards page by page, waiting until they are loaded
            for i in range(index, max_pages):
                items_container = WebDriverWait(driver, 10).until(
                    EC.presence_of_all_elements_located((By.CLASS_NAME, 'product-card__wrapper'))
                )
                index = i
                last_item = items_container[len(items_container) - 1]
                driver.execute_script("arguments[0].scrollIntoView(true);", last_item)
                for item in items_container:
                    res += item.get_attribute('outerHTML')
                print(f"Got page {i}")
                if index == max_pages - 1:
                    isFinished = True

        except TimeoutException as ex:
            print(f"Proxy failed -> {ex.msg}")

    print('Save the result')
    save_to_html('index.html', res)
    print('Parse the result')
    parse_data('index.html')
First, we create a proxy rotator by loading the already prepared list into it (see the next chapter for how to build your own). Then, in a loop, we create a Selenium driver and assign a proxy to it. If the proxy is bad, a TimeoutException is raised, which prints a message to the console and switches to another proxy.
By a bad proxy I mean one that is unavailable or already blocked. The exception is raised after 10 seconds of waiting for an element with the class product-card__wrapper to appear.
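For reference, the proxies.json file loaded at the start of run() is expected to be a list of objects with exactly the fields read in the loop above. An entry might look like this (the values are placeholders; only the field names matter):

[
  {
    "path": "203.0.113.10:8080",
    "type": "free",
    "protocol": "http",
    "status": "alive",
    "weight": 5
  }
]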
That was the basic parser, a fully working parser for the Wildberries website. In the archive you will find both the proxy_rotator package and a list of ready-made proxies, although I cannot guarantee they will still work by the time you read this article.
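If you only want to understand the idea behind the rotator rather than download the archive, here is a minimal sketch of a weighted random rotator. This is not the code from the archive, just an illustration of how Proxy and Rotator could be implemented with the interface used in run():

import random
from dataclasses import dataclass

@dataclass
class Proxy:
    path: str       # host:port
    type: str       # e.g. free or paid
    protocol: str   # http, https, socks5, ...
    status: str     # e.g. alive or dead
    weight: int     # higher weight -> picked more often

    def __str__(self):
        return f"{self.protocol}://{self.path}"

class Rotator:
    def __init__(self, proxies):
        self.proxies = proxies

    def get(self):
        # Weighted random choice: a proxy's weight is its relative chance of being picked
        weights = [p.weight for p in self.proxies]
        return random.choices(self.proxies, weights=weights, k=1)[0]

The real package may do more (for example, lower the weight of proxies that keep failing), but this is enough to understand the interface the parser relies on.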

Collecting free proxies

Since Wildberries is an online store operating in Russia and the CIS, the proxies should come from that region so that our parser looks more like an ordinary user.
So, where can you get free proxies? I present to you my free proxy scraper, which can select and filter proxies by country, protocol, and proxy type.
To collect only Russian proxies using http and https protocols, enter the following command:
python .\main.py -p http, https -c RU
If you want to know which codes correspond to which countries, enter:
python .\main.py -HC
As a result, you will get JSON files with proxy lists; all of them are placed in the data directory. Everything runs in parallel, and you can stop the script at any time once you think you have collected enough.

Checking free proxies

So, we have collected hundreds of proxies, and I can guarantee that most of them are outright garbage. We need to filter them against the target site, i.e., Wildberries.
You see, most sites that provide proxy lists check whether the proxies are alive simply by pinging them, but that does not mean a proxy will work with a particular site.
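To make the difference concrete, here is a minimal check done with the requests library we already installed: instead of pinging the proxy, it asks the target site through it. This is only an illustration of the idea, not the checker's actual code, and the URL and names are just examples:

import requests

def works_for_site(proxy, url="https://www.wildberries.ru/", timeout=10):
    # A proxy is only useful to us if the target site actually answers through it
    proxies = {"http": proxy, "https": proxy}
    try:
        response = requests.get(url, proxies=proxies, timeout=timeout)
        return response.ok
    except requests.RequestException:
        return False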
To check whether a proxy works on a particular site, I created a special CLI tool, which you can download from the link in the previous sentence (⊙_(⊙_⊙)_⊙). Here is the command to check a list of such proxies:
python .\main.py -i 0_64.json -o .\res.json -U https://wildberries.ru
Here -i is the file with the proxies you collected using my proxy scraper,
-o is the name of the result file in which each proxy will be assigned a weight,
and -U is the list of websites to check against.
More options can be viewed with the -h flag. In this case, however, we are more interested in the log.txt file: it stores the results of the checks for each proxy and how many times it successfully connected to the target site. Choose the most successful proxies and combine them into one JSON file, which you will then use for parsing, as in the sketch below.
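That last step can be automated. Assuming the checker's output file (-o) is a JSON list in which each proxy carries the weight it was assigned, a short script like this keeps only the strongest entries and merges them into the proxies.json used by the parser; the threshold and file names are just examples:

import json

def merge_best_proxies(input_files, output_file="proxies.json", min_weight=3):
    # Keep only proxies whose weight reached the threshold and merge them into one list
    best = []
    for path in input_files:
        with open(path, "r", encoding="utf-8") as file:
            for proxy in json.load(file):
                if proxy.get("weight", 0) >= min_weight:
                    best.append(proxy)
    with open(output_file, "w", encoding="utf-8") as file:
        json.dump(best, file, indent=2, ensure_ascii=False)

merge_best_proxies(["res.json"])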
