How to write a Kinopoisk scraper in Python, with source code and sample files.

How it works

This parser collects information from pages of the site that contain "lists" in their address. Simply put, it is a parser for filter and pagination pages. What a pun :). I won't hide that it works quite slowly, because to access the pages of this site you have to use so-called dynamic parsing, that is, parsing pages that are first rendered and processed by a browser.
This is the first barrier that has to be overcome before gaining access to the necessary pages. The second barrier in our way is the blocking of certain IP addresses from which the parsing is done. That is why appropriately selected proxies are used.
The proxies that come by default with the source code of this parser most likely no longer work. Therefore, do not forget to look at the chapters on how to find such proxies and how to select them for the target site, that is, kinopoisk.ru.
A working parser of the kinopoisk site can be downloaded from here. All you have to do is activate the virtual environment and install the necessary packages.

How to make it, installing packages

We will write this parser in Python using third-party libraries and utilities. First, we create and activate a virtual environment. You also need to download my proxy rotator.
python -m venv .venv
On Windows:
.\.venv\Scripts\activate
On *nix systems:
source ./.venv/bin/activate
Now it is time to install the necessary packages:
pip install selenium requests lxml beautifulsoup4 pandas openpyxl
Or you can use a specially prepared file with dependencies for this project.
pip install -r req.txt
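In case you put the file together yourself, a minimal req.txt for this project could simply list the same packages (this is a sketch; the file shipped with the source code may pin exact versions):
selenium
requests
lxml
beautifulsoup4
pandas
openpyxl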

How to make it, writing a script

Create a main.py file in the project directory. Our script will not be very large, just over 100 lines of code. Here is the complete source code:
import pandas
from bs4 import BeautifulSoup
# For creating the webdriver
from selenium import webdriver
# For locating elements on the page
from selenium.webdriver.common.by import By
# For passing options to the driver when it is created
from selenium.webdriver.firefox.options import Options
# For waiting until elements appear
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

from proxy_rotator import Proxy, Rotator, load_proxies


URL = "https://www.kinopoisk.ru/lists/movies/genre--action/country--2/?b=series&ss_subscription=ANY"
TEST_FILE = "result"
RESULT_FILE = 'result'


def get_page_number(text):
    soup = BeautifulSoup(text, features='lxml')
    number = int(soup.find_all('a', class_="styles_page__zbGy7")[-1].text)
    return number + 1


def save_to_exel(data):
    # Because of missing values, the lists in the dict have different lengths. Pandas works fine
    # if the data is written as rows and then transposed.
    # See https://stackoverflow.com/questions/40442014/pandas-valueerror-arrays-must-be-all-same-length
    frame = pandas.DataFrame.from_dict(data, orient='index')
    frame = frame.transpose()
    frame.to_excel(f"{RESULT_FILE}.xlsx")


def scrape_kinopoisk_list(text, data):
    soup = BeautifulSoup(text, features='lxml')
    soup_films = soup.find_all('a', class_="base-movie-main-info_link__YwtP1")
    for soup_film in soup_films:
        soup_children = soup_film.findChildren('div', recursive=False)
        # If the target element has more than the default 4 children, add extra columns
        if len(soup_children) > len(data):
            for i in range(0, len(soup_children) - len(data)):
                data.update({f"{i}": []})
        for indx, soup_child in enumerate(soup_children):
            for data_indx, key in enumerate(data):
                if indx == data_indx:
                    data[key].append(soup_child.text)


def run():
    proxies: list[Proxy] = []
    # Load proxies into memory
    proxies_json = load_proxies("proxies.json")
    for proxy_item in proxies_json:
        proxies.append(Proxy(
            proxy_item['path'],
            proxy_item['type'],
            proxy_item['protocol'],
            proxy_item['status'],
            proxy_item['weight'])
        )

    # Make a rotator for the proxies
    rotator = Rotator(proxies)
    isFinished = False
    index = 1
    max = 3
    data = {
        "titles": [],
        "dates": [],
        'country, producer': [],
        'actors': [],
    }
    while not isFinished:
        try:
            proxy = rotator.get()
            print(f"Using proxy: {proxy}")
            # Create a webdriver
            firefox_opt = Options()
            firefox_opt.add_argument('--headless')
            firefox_opt.add_argument("--no-sandbox")
            firefox_opt.add_argument("--disable-dev-shm-usage")
            firefox_opt.add_argument(f"--proxy-server={proxy}")
            driver = webdriver.Firefox(options=firefox_opt)
            # Walk through the pages, waiting until the list elements are loaded on each one
            for i in range(index, max):
                if i == 1:
                    print(f"\tPage: {i} [{URL}]")
                    driver.get(URL)
                else:
                    print(f"\tPage: {i} [{URL}&page={i}]")
                    driver.get(f'{URL}&page={i}')
                items_container = WebDriverWait(driver, 1).until(
                    EC.presence_of_all_elements_located((By.CLASS_NAME, 'base-movie-main-info_link__YwtP1'))
                )
                scrape_kinopoisk_list(driver.page_source, data)
                save_to_exel(data)
                print(f"\tSuccess.")
                if i == 1:
                    max = get_page_number(driver.page_source)
                    print(f"\tGot the number of available pages. [{max - 1}]")

                index = i + 1
                if index >= max:
                    isFinished = True

        except Exception as ex:
            print(f"\tProxy failed")
    print('Save the result')
    save_to_exel(data)


if __name__ == "__main__":
    run()
Next, to make the code easier to understand, I broke it down into structural elements and functions. Let's start with the main run function. This function starts the scraping, rotates the proxies, and decides when to stop.
proxies: list[Proxy] = []
# Load proxies into memory
proxies_json = load_proxies("proxies.json")
for proxy_item in proxies_json:
    proxies.append(Proxy(
        proxy_item['path'],
        proxy_item['type'],
        proxy_item['protocol'],
        proxy_item['status'],
        proxy_item['weight'])
    )

# Make a rotator for the proxies
rotator = Rotator(proxies)
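By the way, load_proxies expects a proxies.json file whose entries carry the same fields the Proxy constructor reads: path, type, protocol, status and weight. Here is a minimal sketch that writes such a file; the address and values are placeholders, and the real files produced by my tools may look a bit different:
import json

# Hypothetical entry: the keys match what run() reads from proxies.json,
# the values themselves are placeholders
sample_proxies = [
    {
        "path": "127.0.0.1:8080",
        "type": "free",
        "protocol": "http",
        "status": "alive",
        "weight": 1,
    },
]

with open("proxies.json", "w", encoding="utf-8") as f:
    json.dump(sample_proxies, f, indent=4)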
The run function also creates a virtual (headless) browser and makes requests to the necessary URLs.
# Create a webdriver
firefox_opt = Options()
firefox_opt.add_argument('--headless')
firefox_opt.add_argument("--no-sandbox")
firefox_opt.add_argument("--disable-dev-shm-usage")
firefox_opt.add_argument(f"--proxy-server={proxy}")
driver = webdriver.Firefox(options=firefox_opt)
# Walk through the pages, waiting until the list elements are loaded on each one
for i in range(index, max):
    if i == 1:
        print(f"\tPage: {i} [{URL}]")
        driver.get(URL)
    else:
        print(f"\tPage: {i} [{URL}&page={i}]")
        driver.get(f'{URL}&page={i}')
Next, Selenium needs something to hook onto while a page is loading, so that it starts acting as soon as the corresponding elements appear on the page. In the case of kinopoisk, these are the elements with the class base-movie-main-info_link__YwtP1. Here is the loop that walks through all the available pages.
# Walk through the pages, waiting until the list elements are loaded on each one
for i in range(index, max):
    if i == 1:
        print(f"\tPage: {i} [{URL}]")
        driver.get(URL)
    else:
        print(f"\tPage: {i} [{URL}&page={i}]")
        driver.get(f'{URL}&page={i}')
    items_container = WebDriverWait(driver, 1).until(
        EC.presence_of_all_elements_located((By.CLASS_NAME, 'base-movie-main-info_link__YwtP1'))
    )
    scrape_kinopoisk_list(driver.page_source, data)
    save_to_exel(data)
    print(f"\tSuccess.")
    if i == 1:
        max = get_page_number(driver.page_source)
        print(f"\tGot the number of available pages. [{max - 1}]")

    index = i + 1
It is worth noting that I save the result to the spreadsheet both while parsing each page and after going through all the pages. Thus, even if the parser suddenly "crashes", we will still be able to get the data collected so far.
The get_page_number function is just as simple. We find the necessary element and convert the string to a number:
def get_page_number(text):
    soup = BeautifulSoup(text, features='lxml')
    number = int(soup.find_all('a', class_="styles_page__zbGy7")[-1].text)
    return number + 1
The save_to_exel function is even simpler. Since we have the pandas package installed, all this function does is create a special data frame and convert it to a table:
def save_to_exel(data):
    frame = pandas.DataFrame.from_dict(data, orient='index')
    frame = frame.transpose()
    frame.to_excel(f"{RESULT_FILE}.xlsx")
It is important to note that in order to work with Excel files, you need to install the openpyxl package. Why, you might ask, since in theory it should come as a dependency of pandas; but it does not, so it has to be installed manually.
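If you want to see this trick in isolation, here is a tiny standalone check (the sample values are made up): building the frame directly with pandas.DataFrame(data) would raise "arrays must all be same length", while writing the data as rows and transposing pads the missing cells with NaN:
import pandas

# Lists of different lengths, as happens when some films lack a field
data = {
    "titles": ["Film A", "Film B", "Film C"],
    "dates": ["2021", "2022"],
}

# Build the frame by rows, then transpose; missing cells become NaN
frame = pandas.DataFrame.from_dict(data, orient="index").transpose()
print(frame)
frame.to_excel("demo.xlsx")  # this is the step that needs openpyxl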
The scrape_kinopoisk_list function searches for the necessary elements on the page and fills the dictionary.
def scrape_kinopoisk_list(text, data):
    soup = BeautifulSoup(text, features='lxml')
    soup_films = soup.find_all('a', class_="base-movie-main-info_link__YwtP1")
    for soup_film in soup_films:
        soup_children = soup_film.findChildren('div', recursive=False)
        # If the target element has more than the default 4 children, add extra columns
        if len(soup_children) > len(data):
            for i in range(0, len(soup_children) - len(data)):
                data.update({f"{i}": []})
        for indx, soup_child in enumerate(soup_children):
            for data_indx, key in enumerate(data):
                if indx == data_indx:
                    data[key].append(soup_child.text)
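To make the column logic more tangible, here is a small standalone experiment with made-up markup that mimics the structure described above (the real kinopoisk markup is more involved, so treat this purely as an illustration of how the dictionary gets filled):
from bs4 import BeautifulSoup

# Made-up markup: each link holds direct <div> children, one per column
html = """
<a class="base-movie-main-info_link__YwtP1">
    <div>Some Action Film</div>
    <div>2023</div>
    <div>Russia, Some Studio</div>
    <div>Actor One, Actor Two</div>
</a>
"""

data = {"titles": [], "dates": [], "country, producer": [], "actors": []}

soup = BeautifulSoup(html, features="lxml")
for soup_film in soup.find_all("a", class_="base-movie-main-info_link__YwtP1"):
    for indx, soup_child in enumerate(soup_film.findChildren("div", recursive=False)):
        for data_indx, key in enumerate(data):
            if indx == data_indx:
                data[key].append(soup_child.text)

print(data)  # each list now holds the text of the matching column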
That's all. All that's left is to call the run function. It's up to you how to do it: just call it at the bottom of the script, or do it like I did, that is:
if __name__ == "__main__":
    run()

Collecting the free proxies

Since kinopoisk is a website operating in Russia and the CIS (at least, that is where its audience is from), the proxies should also come from there, so that our parser looks more like an ordinary user.
So, how and where can you get free proxies? I present to you my free proxy scraper, with the ability to select and filter proxies by country, protocol, and proxy type.
To collect proxies from Russia, Belarus and Kazakhstan that use the http, https, socks4 and socks5 protocols, enter the following command:
python .\main.py -p http, https, socks4, socks5 -c RU, BY, KZ
If you want to know which codes correspond to which countries, enter:
python .\main.py -HC
As a result, you will receive JSON files with proxy lists. All such files are located in the data directory. Everything happens in parallel mode, and you can stop the script at any time if you think you have collected enough.

Checking free proxies

So, we have collected hundreds of proxies, and I can guarantee you that most of them are outright garbage. We will need to filter them using the target site, i.e., kinopoisk.
You see, most of the sites that provide proxies check whether they are alive simply by pinging them. But that does not mean these proxies will work with a particular site.
To check whether a proxy works on a particular site, I created a special CLI tool, which you can download from the link in the previous sentence (⊙_(⊙_⊙)_⊙). Here is the command to check a list of such proxies:
python .\main.py -i 0_64.json -o .\res.json -U https://kinopoisk.ru
Here -i is the file with the proxies you collected using my proxy scraper,
-o is the name of the result file in which each proxy is assigned a weight,
and -U is the list of websites to check against.
More options and variants can be viewed using the -h flag. But in this case, we are more interested in the log.txt file, since it stores the results of the checks for each proxy and how many times it successfully connected to the target site. Choose the most successful proxies and combine them into one JSON file, which you will then use to parse sites.
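If you just want to sanity-check a single proxy by hand, without my CLI tool, a minimal sketch using the requests package (already installed above) could look like this; the proxy address is a placeholder, and the check is much cruder than what the tool does:
import requests

proxy = "http://127.0.0.1:8080"  # placeholder, substitute a real proxy
proxies = {"http": proxy, "https": proxy}

try:
    # A 200 response through the proxy is a rough sign that it works with the target site
    response = requests.get("https://www.kinopoisk.ru/", proxies=proxies, timeout=10)
    print(f"{proxy}: HTTP {response.status_code}")
except requests.RequestException as ex:
    print(f"{proxy}: failed ({ex})")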

