    How to make a simple python scraper + a ready-for-use example

    Published: 10.12.2024 / Updated: 02.10.2025 / Reading time: 7 minutes

    Introduction

    In this article I will give you an example of the simplest parser in Python, in two versions: the first is a parser for static pages, the second is a parser for dynamically loaded pages. For someone who writes parsers, the difference between them is not very big. In both cases I used BeautifulSoup4 to collect image links.
    Since I have not yet received permission to parse other people's sites, I decided to use my own website as an example. We will parse the "about me" page. It has both embedded images and dynamically loaded ones: an ideal example.
    If you need the source code and the finished project, you can download it from here.
    Yes, I apologize in advance for using @decorators, but I just like them, so this parser will have a little "syntactic sugar". Keep it in mind.

    Basic script, its structure

    Let's start by installing the necessary packages and importing them. We will need 3 packages:
    1. requests - to send requests to the target site and receive web pages. Required for static scraping.
    2. selenium - to create a browser without an interface (a headless browser) and send requests from it. Required for dynamic scraping.
    3. beautifulsoup4 - for the scraping itself: finding the necessary elements and converting or saving them in the desired format (files, lists ...)
    Let's create a virtual environment and install the above-listed packages:
    python -m venv .venv
    This command will create a .venv directory containing its own copy of the Python executable along with the base libraries Python needs.
    Now we activate the virtual environment for Windows:
    .\.venv\Scripts\Activate.ps1
    And this command for Linux:
    source ./.venv/bin/activate
    After activating the virtual environment, let's finally install the necessary packages. Note that I also install lxml, because the scrape function below asks BeautifulSoup for the 'lxml' parser:
    pip install requests selenium beautifulsoup4 lxml
    Now let's create a main.py file and add the necessary imports and target URL to the file:
    # FOR DECORATORS
    import functools

    # FOR STATIC SCRAPING
    import requests
    from bs4 import BeautifulSoup

    # FOR DYNAMIC SCRAPING
    # For creating the webdriver
    from selenium import webdriver
    # For passing options to the driver on creation
    from selenium.webdriver.firefox.options import Options

    URL = "https://timthewebmaster.com/ru/about/"
    Add the program entry point at the very end of the file. This block says that when the file is run directly by the Python interpreter, it will execute two functions in order: get_static and get_dynamic.
    if __name__ == "__main__":
        get_static(URL)
        get_dynamic(URL)
    We'll talk about them in the next chapters, but for now let's look at my decorators and the parser function itself:
    The first decorator simply prints the specified message. In my case, it's the function name. There is another way to get the name of the current function in the current thread, but that would be too much for this article.
    def identifier(message: str = ""):
        def decorator_repeat(func):
            @functools.wraps(func)
            def wrapper(*args, **kwargs):
                print(message)
                # Return the wrapped function's result so the chain
                # of decorators can pass values through
                return func(*args, **kwargs)
            return wrapper
        return decorator_repeat
    You will see how to use them in the following chapters. This decorator takes a string argument and prints it to the console before executing the wrapped function.
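    As a side note, the message can also be derived automatically from the wrapped function's own name, which functools.wraps preserves. A minimal sketch under that idea (log_name and fetch are my own illustrative names, not from this project):

    ```python
    import functools

    def log_name(func):
        """Print the wrapped function's name before running it."""
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # functools.wraps copied func.__name__ onto wrapper,
            # so no message argument is needed at all
            print(f'=>> {func.__name__} <<=')
            return func(*args, **kwargs)
        return wrapper

    @log_name
    def fetch(url):
        return f'fetched {url}'
    ```

    Calling fetch('...') would print "=>> fetch <<=" first and then return the value as usual.
    
    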
    The next decorator works on a similar principle: it prints all scraped image links to the console, and it does not take any arguments.
    def scraped_images():
        def decorator_repeat(func):
            @functools.wraps(func)
            def wrapper_debug(*args, **kwargs):
                value = func(*args, **kwargs)
                images = scrape(value)
                for image in images:
                    print(f'Link->[{image}]')
                return value
            return wrapper_debug
        return decorator_repeat
    Next we need the scraping function itself. Inside it, we specify what we scrape with, how we scrape, and what exactly we scrape. The soup is created ...
    Soup is the conventional name for what the BeautifulSoup constructor returns. It's just a convention; you're free to call it whatever you like.
    In this "soup" we find all the images and extract the data stored in the src attribute of each one. A list is created, which is passed to the decorator and processed there, i.e. printed to the console.
    def scrape(source):
        soup = BeautifulSoup(source, 'lxml')
        soup_images = soup.find_all('img')
        images = []
        for soup_image in soup_images:
            images.append(soup_image['src'])
        return images
    This is, of course, good, but for this function to work, it must receive some source, that is, a target page. And here the two functions I mentioned earlier come into play: get_static and get_dynamic.

    Static page scraper

    If you need to parse a static site, consider yourself lucky. These days, such sites are less and less common. But if the target site is like this, this function will work like a charm:
    # This function defines how to get the actual content of the target page,
    # in this case using the requests module
    @identifier('=>> run_static <<=')
    @scraped_images()
    def get_static(url):
        resp = requests.get(url)
        return resp.text
    We have included 2 decorators that I mentioned above. They wrap around a function using the @ symbol. This function takes a URL as an argument and returns the text of the page itself.
    I want to note that although I blurted out that such a function will work like a charm, this is true only if the site has no protection against scrapers, and such protection is very common in practice. If it does have it, you can always try to disguise your request as a real user by setting headers in requests, or use a proxy. And if you need to parse many pages, rotation (of proxies or user agents) will help.
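    Such a disguise might look like the sketch below. The user-agent strings are shortened examples (use real, current ones), build_headers is my own helper name, and the actual request and proxy values are shown only in a comment, since they are placeholders:

    ```python
    import itertools

    # A few example user-agent strings (shortened; substitute real, current ones)
    USER_AGENTS = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) Firefox/133.0',
        'Mozilla/5.0 (X11; Linux x86_64) Chrome/131.0.0.0 Safari/537.36',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) Safari/605.1.15',
    ]

    # Endlessly cycle through the list: each next() returns the next UA
    _ua_pool = itertools.cycle(USER_AGENTS)

    def build_headers():
        """Return browser-like request headers, rotating the user agent."""
        return {
            'User-Agent': next(_ua_pool),
            'Accept-Language': 'en-US,en;q=0.9',
        }

    # Usage with requests (proxy address is a placeholder):
    # resp = requests.get(URL, headers=build_headers(),
    #                     proxies={'https': 'http://user:pass@proxy:3128'})
    ```

    Each call to build_headers() hands out the next user agent in the cycle, so consecutive requests no longer look identical.
    
    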

    Dynamic page scraper

    Often you will encounter a situation when the response is 200 and the page seems to have returned, but there is nothing on it. This happens when you parse dynamic sites. They just haven't loaded their content yet, and you just downloaded a template without content :(
    Here's how to get a dynamic page:
    # This function defines how to get the actual content of the target page,
    # in this case using the selenium module
    @identifier('=>> run_dynamic <<=')
    @scraped_images()
    def get_dynamic(url):
        firefox_opt = Options()
        firefox_opt.add_argument('--headless')
        firefox_opt.add_argument("--no-sandbox")
        firefox_opt.add_argument("--disable-dev-shm-usage")
        driver = webdriver.Firefox(options=firefox_opt)
        driver.implicitly_wait(5)
        driver.get(url)
        try:
            return driver.page_source
        finally:
            # Always close the browser, even if something above fails
            driver.quit()
    The principle of operation is the same as get_static. We connect 2 decorators. They wrap around the function using the @ symbol. This function takes the URL as an argument and returns the text of the page itself.
    The only difference is in how the page itself is obtained and, of course, in the tools used. In this case I used Firefox, but you can use any other browser; the main thing is that you actually have one installed, together with its webdriver.
    I also added several options for configuring the driver:
    1. headless - run the browser without creating or displaying a window
    2. no-sandbox - disable the browser's sandbox isolation (often needed when the script runs as root, for example in a container)
    3. disable-dev-shm-usage - write temporary files to disk instead of the /dev/shm shared-memory partition, which is often too small in containers
    Strictly speaking, the last two are Chromium flags and Firefox will mostly ignore them, but they are harmless to keep for when you swap browsers.
    After creating the driver, we set a wait value of five seconds. Strictly speaking, implicitly_wait tells Selenium how long to keep polling when it looks up elements, so this is not my best solution. It would be better to use an explicit wait that ends when certain content has loaded, so the script waits exactly as long as it needs to pick up the page. But all sites are different, and you need to know which element to expect. So I went for a universal solution and just stupidly wait 5 seconds. (´。_。`)
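    Selenium ships WebDriverWait for exactly this, but the underlying idea is plain polling. A minimal stdlib sketch of such a timer (wait_until is my own helper name; the Selenium call in the comment is only an illustration):

    ```python
    import time

    def wait_until(predicate, timeout=5.0, interval=0.1):
        """Poll predicate() until it returns a truthy value or the timeout expires.

        Returns the truthy value, or raises TimeoutError.
        """
        deadline = time.monotonic() + timeout
        while time.monotonic() < deadline:
            result = predicate()
            if result:
                return result
            time.sleep(interval)
        raise TimeoutError('condition not met within timeout')

    # With Selenium you would poll for the element you expect to load, e.g.:
    # wait_until(lambda: driver.find_elements(By.TAG_NAME, 'img'))
    # or simply use the built-in selenium.webdriver.support.ui.WebDriverWait.
    ```

    This returns as soon as the condition holds instead of always sleeping the full five seconds.
    
    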

    Conclusions nonetheless

    The parser is completely ready; the only thing left is to launch it. Yes, it is not the simplest possible parser, but it is quite illustrative and has everything necessary to parse most sites. This is how to launch it:
    python main.py
    And here is the console output you will get:
    =>> run_static <<=
    Link->[https://mc.yandex.ru/watch/95199635]
    Link->[/static/placeholder.svg]
    Link->[/static/placeholder.svg]
    Link->[/static/Main/img/bg_1.webp]
    Link->[placeholder.svg]
    Link->[placeholder.svg]
    Link->[placeholder.svg]
    Link->[/static/placeholder.svg]
    Link->[/static/placeholder.svg]
    Link->[/static/placeholder.svg]
    Link->[/static/placeholder.svg]
    Link->[/static/placeholder.svg]
    =>> run_dynamic <<=
    Link->[https://mc.yandex.ru/watch/95199635]
    Link->[/static/menu.svg]
    Link->[/static/favicon.svg]
    Link->[/static/Main/img/bg_1.webp]
    Link->[/static/Main/img/me1.jpg]
    Link->[/static/Main/img/sity1.jpg]
    Link->[placeholder.svg]
    Link->[/static/placeholder.svg]
    Link->[/static/placeholder.svg]
    Link->[/static/placeholder.svg]
    Link->[/static/placeholder.svg]
    Link->[/static/placeholder.svg]
    Want to parse links instead? Go ahead. Just rewrite the scrape function and you're done.
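    For example, a link-scraping variant might look like this sketch. I use the built-in 'html.parser' here so no extra parser package is needed, and I filter out <a> tags without an href, since those occur in real pages:

    ```python
    from bs4 import BeautifulSoup

    def scrape_links(source):
        """Extract the href of every <a> tag that has one."""
        soup = BeautifulSoup(source, 'html.parser')
        links = []
        for a in soup.find_all('a'):
            href = a.get('href')  # None if the tag has no href attribute
            if href:
                links.append(href)
        return links
    ```

    Drop it in place of scrape (or alongside it) and the decorators keep working unchanged, since they only iterate over whatever list the function returns.
    
    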
    Is something or someone blocking the parsing? Go ahead: modify get_static, add a proxy, or start rotating user agents. If that doesn't help, bring out the heavy artillery and switch to dynamic parsing with the get_dynamic function, modifying it as needed.
    Don't like the console output? Well, you can always rework it and make your own decorators.
    So, you get the idea. This is the basic template for your parser. And in case you copied something incorrectly, here's the full script:
    # FOR DECORATORS
    import functools

    # FOR STATIC SCRAPING
    import requests
    from bs4 import BeautifulSoup

    # FOR DYNAMIC SCRAPING
    # For creating the webdriver
    from selenium import webdriver
    # For passing options to the driver on creation
    from selenium.webdriver.firefox.options import Options

    URL = "https://timthewebmaster.com/ru/about/"

    # Decorator to print the name of a function in the console
    def identifier(message: str = ""):
        def decorator_repeat(func):
            @functools.wraps(func)
            def wrapper_debug(*args, **kwargs):
                print(message)
                return func(*args, **kwargs)
            return wrapper_debug
        return decorator_repeat

    # Decorator to print the links of the scraped images
    def scraped_images():
        def decorator_repeat(func):
            @functools.wraps(func)
            def wrapper_debug(*args, **kwargs):
                value = func(*args, **kwargs)
                images = scrape(value)
                for image in images:
                    print(f'Link->[{image}]')
                return value
            return wrapper_debug
        return decorator_repeat

    # This function defines how and what to scrape
    def scrape(source):
        soup = BeautifulSoup(source, 'lxml')
        soup_images = soup.find_all('img')
        images = []
        for soup_image in soup_images:
            images.append(soup_image['src'])
        return images

    # This function defines how to get the actual content of the target page,
    # in this case using the requests module
    @identifier('=>> run_static <<=')
    @scraped_images()
    def get_static(url):
        resp = requests.get(url)
        return resp.text

    # This function defines how to get the actual content of the target page,
    # in this case using the selenium module
    @identifier('=>> run_dynamic <<=')
    @scraped_images()
    def get_dynamic(url):
        firefox_opt = Options()
        firefox_opt.add_argument('--headless')
        firefox_opt.add_argument("--no-sandbox")
        firefox_opt.add_argument("--disable-dev-shm-usage")
        driver = webdriver.Firefox(options=firefox_opt)
        driver.implicitly_wait(5)
        driver.get(url)
        try:
            return driver.page_source
        finally:
            # Always close the browser, even if something above fails
            driver.quit()

    if __name__ == "__main__":
        get_static(URL)
        get_dynamic(URL)
    That's all for now. <(_ _)>

    Do not forget to share, like and leave a comment :)

