How to write a simple Python scraper

10.12.2024

Introduction

In this article I will give you an example of the simplest parser in Python, in two versions: the first one parses static pages, the second one parses dynamically loaded pages. For someone who writes parsers, the difference between them is not very big. In both cases I used BeautifulSoup4 to collect image links.
Since I have not yet received permission to parse other people's sites, I decided to use my own website as an example. We will parse the "About me" page. It contains both embedded images and dynamically loaded ones, which makes it an ideal example.
If you need the source code and the finished project, you can download it from here.
Yes, I apologize in advance for using @decorators. But I just like them, so this parser will have a little "syntactic sugar". Keep it in mind.

Basic script, its structure

Let's start with installing the necessary packages and importing them. We will need 3 packages:
  1. requests - To send requests to the target site and receive web pages. Required for static scraping.
  2. selenium - To drive a headless browser (a browser without an interface) and send requests through it. Required for dynamic scraping.
  3. beautifulsoup4 - For the scraping itself: finding the necessary elements and converting or saving them in the desired format (files, lists, ...).
Let's create a virtual environment and install the above-listed packages:
python -m venv .venv
This command creates a .venv directory containing a copy of the Python executable and the base libraries it needs.
Now we activate the virtual environment for Windows:
.\.venv\Scripts\Activate.ps1
And this command for Linux:
source ./.venv/bin/activate
After activating the virtual environment, let's finally install the necessary packages (plus lxml, since the scrape function below uses BeautifulSoup's lxml parser):
pip install requests selenium beautifulsoup4 lxml
Now let's create a main.py file and add the necessary imports and target URL to the file:
# FOR DECORATORS
import functools
# FOR WAITING ON DYNAMIC CONTENT
import time
# FOR STATIC SCRAPING
import requests
from bs4 import BeautifulSoup
# FOR DYNAMIC SCRAPING
# For creating the webdriver
from selenium import webdriver
# For passing options to the driver when it is created
from selenium.webdriver.firefox.options import Options


URL="https://timthewebmaster.com/ru/about/"
Add the program entry point at the very end of the file. This block says that when the file is run directly by the Python interpreter, it will execute two functions in order: get_static and get_dynamic.
if __name__ == "__main__":
    get_static(URL)
    get_dynamic(URL)
We'll talk about them in the next chapters, but for now let's look at my decorators and the parser function itself:
The first decorator simply prints the specified message. In my case, it's the function name. There is another way to get the name of the current function in the current thread, but that would be too much for this article.
def identifier(message: str = ""):
    def decorator_repeat(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            print(message)
            return func(*args, **kwargs)
        return wrapper
    return decorator_repeat
You will see how to use them in the following chapters. This decorator takes a string argument and prints it to the console before executing the wrapped function.
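If you would rather not pass the message by hand, one simple alternative is to take the name straight from the wrapped function object. A quick sketch of such a variant (it is not used anywhere else in this article):
def identifier_auto(func):
    # Hypothetical variant: prints the wrapped function's own name instead of a custom message
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        print(f'=>> {func.__name__} <<=')
        return func(*args, **kwargs)
    return wrapper

# Usage: @identifier_auto (without parentheses, unlike @identifier('...'))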
The next decorator works on a similar principle: it prints all collected image links to the console, and it does not take any arguments.
def scraped_images():
    def decorator_repeat(func):
        @functools.wraps(func)
        def wrapper_debug(*args, **kwargs):
            value = func(*args, **kwargs)
            images = scrape(value)
            for image in images:
                print(f'Link->[{image}]')
            return value
        return wrapper_debug
    return decorator_repeat
Next we need the scraping function itself. Inside it we specify what we scrape with, how we scrape and what we scrape. First, the "soup" is created.
Soup is the conventional name for what the BeautifulSoup constructor returns. It is just a convention, and you are free to call it whatever you like.
In this "soup" we find all the images and extract the value stored in their src attribute. The resulting list is returned, picked up by the decorator and processed there, i.e. printed to the console.
def scrape(source):
    soup = BeautifulSoup(source, 'lxml')
    soup_images = soup.find_all('img')
    images = []
    for soup_image in soup_images:
        images.append(soup_image['src'])
    return images
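A side note: as you will see in the output below, many src values are relative paths like /static/placeholder.svg. If you want absolute URLs instead, a small variation using urljoin from the standard library would do it. This is just a sketch, not part of the final script:
from urllib.parse import urljoin

def scrape_absolute(source, base_url):
    # Same idea as scrape(), but resolves relative src paths against the page URL
    soup = BeautifulSoup(source, 'lxml')
    images = []
    for soup_image in soup.find_all('img'):
        if soup_image.has_attr('src'):
            images.append(urljoin(base_url, soup_image['src']))
    return images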
This is, of course, good, but for this function to work, it must receive some source, that is, a target page. And here the two functions I mentioned earlier come into play: get_static and get_dynamic.

Static page scraper

If you need to parse a static site, consider yourself lucky. These days, such sites are less and less common. But if the target site is like this, this function will work like a charm:
# This function defines how to get the actual content of the target page; in this case I'm using the requests module
@identifier('=>> run_static <<=')
@scraped_images()
def get_static(url):
    resp = requests.get(url)
    return resp.text
We attach the two decorators described above; they wrap the function via the @ symbol. The function itself takes a URL as an argument and returns the text of the page.
I should note that although I claimed such a function will work like a charm, this is only true if the site has no protection against parsers, and such protection is very common in practice. If there is protection, you can always try to disguise your request as a real user by sending headers with requests, or use a proxy. And if you need to parse many pages, rotation (of proxies or user agents) will help.
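A rough sketch of that disguise (the user-agent strings and the commented-out proxy address are placeholders I made up, not values from this article):
import random

# Made-up user agents to rotate through; use real, current ones in practice
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:133.0) Gecko/20100101 Firefox/133.0',
    'Mozilla/5.0 (X11; Linux x86_64; rv:133.0) Gecko/20100101 Firefox/133.0',
]

def get_static_disguised(url):
    headers = {'User-Agent': random.choice(USER_AGENTS)}
    # proxies = {'https': 'http://my-proxy.example:3128'}  # optional: plug in a real proxy and pass proxies=proxies
    resp = requests.get(url, headers=headers, timeout=10)
    return resp.text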

Dynamic page scraper

Often you will encounter a situation when the response is 200 and the page seems to have returned, but there is nothing on it. This happens when you parse dynamic sites. They just haven't loaded their content yet, and you just downloaded a template without content :(
Here's how to get a dynamic page:
# This function defines how to get the actual content of the target page; in this case I'm using the selenium module
@identifier('=>> run_dynamic <<=')
@scraped_images()
def get_dynamic(url):
    firefox_opt = Options()
    firefox_opt.add_argument('--headless')
    firefox_opt.add_argument("--no-sandbox")
    firefox_opt.add_argument("--disable-dev-shm-usage")
    driver = webdriver.Firefox(options=firefox_opt)
    driver.get(url)
    time.sleep(5)  # crude pause to let dynamically loaded content appear
    source = driver.page_source
    driver.quit()  # close the headless browser so it does not keep running
    return source
The principle of operation is the same as in get_static: we attach the same two decorators, and the function takes a URL as an argument and returns the text of the page. The only difference is in how the page itself is obtained and, of course, in the tools used. In this case I used Firefox, but you can use any other browser; the main thing is that it has to be installed.
I also added several options for configuring the driver:
  1. headless - means do not create and display a browser window
  2. no-sandbox - disables the browser sandbox; mostly relevant when the script runs as root or inside a container
  3. disable-dev-shm-usage - tells the browser not to rely on the small /dev/shm shared-memory partition and to write to disk instead; useful in Docker, where /dev/shm is tiny by default
After loading the page, we wait a flat 5 seconds before grabbing its source. This is not my best solution: it would be better to wait until a specific piece of content has loaded, so the script waits exactly as long as it needs to pick up the page. But all sites are different, and you need to know which element to expect. So I went for a universal solution and stupidly wait 5 seconds. (´。_。`)
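If you do know which element to expect, Selenium's explicit waits are that better solution. A minimal sketch, reusing the imports already at the top of main.py and assuming the page eventually renders at least one <img> tag (adjust the locator to your target page):
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def get_dynamic_waiting(url):
    firefox_opt = Options()
    firefox_opt.add_argument('--headless')
    driver = webdriver.Firefox(options=firefox_opt)
    driver.get(url)
    # Wait up to 10 seconds for at least one <img> tag to appear, then grab the page
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.TAG_NAME, 'img'))
    )
    source = driver.page_source
    driver.quit()
    return source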

Conclusions nonetheless

The parser is now complete, and the only thing left is to run it. Yes, it is not the very simplest parser, but it is very illustrative and has everything you need to parse most sites. Run it like this:
python main.py
And here is the console output you will get:
=>> run_static <<=
Link->[https://mc.yandex.ru/watch/95199635]
Link->[/static/placeholder.svg]
Link->[/static/placeholder.svg]
Link->[/static/Main/img/bg_1.webp]
Link->[placeholder.svg]
Link->[placeholder.svg]
Link->[placeholder.svg]
Link->[/static/placeholder.svg]
Link->[/static/placeholder.svg]
Link->[/static/placeholder.svg]
Link->[/static/placeholder.svg]
Link->[/static/placeholder.svg]
=>> run_dynamic <<=
Link->[https://mc.yandex.ru/watch/95199635]
Link->[/static/menu.svg]
Link->[/static/favicon.svg]
Link->[/static/Main/img/bg_1.webp]
Link->[/static/Main/img/me1.jpg]
Link->[/static/Main/img/sity1.jpg]
Link->[placeholder.svg]
Link->[/static/placeholder.svg]
Link->[/static/placeholder.svg]
Link->[/static/placeholder.svg]
Link->[/static/placeholder.svg]
Link->[/static/placeholder.svg]
Want to parse links instead? Go ahead. Just rewrite the scrape function and you're done.
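For example, a link-collecting variant of scrape might look roughly like this (a sketch, not part of the final script):
def scrape_links(source):
    # Same idea as scrape(), but collects href values from <a> tags instead of image sources
    soup = BeautifulSoup(source, 'lxml')
    links = []
    for anchor in soup.find_all('a', href=True):
        links.append(anchor['href'])
    return links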
Is something or someone blocking the parsing? Go ahead: modify get_static, add a proxy, or start rotating user agents. If that doesn't help, bring out the heavy artillery and switch to dynamic parsing with get_dynamic, modifying it as needed.
Don't like the console output? Well, you can always rework it and make your own decorators.
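For instance, a hypothetical scraped_to_file decorator could write the links into a text file instead of printing them (the filename here is made up):
def scraped_to_file(filename='images.txt'):
    def decorator_repeat(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            value = func(*args, **kwargs)
            with open(filename, 'a', encoding='utf-8') as f:
                for image in scrape(value):
                    f.write(image + '\n')  # one link per line instead of console output
            return value
        return wrapper
    return decorator_repeat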
So, you get the idea. This is the basic template for your parser. And in case you copied something incorrectly, here's the full script:
# FOR DECORATORS
import functools
# FOR WAITING ON DYNAMIC CONTENT
import time
# FOR STATIC SCRAPING
import requests
from bs4 import BeautifulSoup
# FOR DYNAMIC SCRAPING
# For creating the webdriver
from selenium import webdriver
# For passing options to the driver when it is created
from selenium.webdriver.firefox.options import Options


URL="https://timthewebmaster.com/ru/about/"

# Decorator to print the name of the function in the console
def identifier(message: str = ""):
    def decorator_repeat(func):
        @functools.wraps(func)
        def wrapper_debug(*args, **kwargs):
            print(message)
            return func(*args, **kwargs)
        return wrapper_debug
    return decorator_repeat

# Decorator to print links of the images
def scraped_images():
    def decorator_repeat(func):
        @functools.wraps(func)
        def wrapper_debug(*args, **kwargs):
            value = func(*args, **kwargs)
            images = scrape(value)
            for image in images:
                print(f'Link->[{image}]')
            return value
        return wrapper_debug
    return decorator_repeat

# This function defines how and what to scrape
def scrape(source):
    soup = BeautifulSoup(source, 'lxml')
    soup_images = soup.find_all('img')
    images = []
    for soup_image in soup_images:
        images.append(soup_image['src'])
    return images

# This function defines how to get the actual content of the target page; in this case I'm using the requests module
@identifier('=>> run_static <<=')
@scraped_images()
def get_static(url):
    resp = requests.get(url)
    return resp.text

# This function defines how to get the actual content of the target page; in this case I'm using the selenium module
@identifier('=>> run_dynamic <<=')
@scraped_images()
def get_dynamic(url):
    firefox_opt = Options()
    firefox_opt.add_argument('--headless')
    firefox_opt.add_argument("--no-sandbox")
    firefox_opt.add_argument("--disable-dev-shm-usage")
    driver = webdriver.Firefox(options=firefox_opt)
    driver.get(url)
    time.sleep(5)  # crude pause to let dynamically loaded content appear
    source = driver.page_source
    driver.quit()  # close the headless browser so it does not keep running
    return source


if __name__ == "__main__":
    get_static(URL)
    get_dynamic(URL)
That's all for now. <(_ _)>
