How to write a simple Python scraper

10.12.2024

Introduction

In this article I will give you an example of the simplest parser in Python, in two versions: the first one parses static pages, the second one parses dynamically loaded pages. For someone who writes parsers, the difference between them is not very big. In both cases I used BeautifulSoup4 to collect image links.
Since I have not yet received permission to parse other people's sites, I decided to use my own website as an example. We will parse the "About me" page. It contains both embedded images and dynamically loaded ones, which makes it an ideal example.
If you need the source code and the finished project, you can download it from here.
Yes, I apologize in advance for using @decorators. But I just like them, so this parser will have a little "syntactic sugar". Keep it in mind.

Basic script, its structure

Let's start with installing the necessary packages and importing them. We will need 3 packages:
  1. requests - To send requests to the target site and receive web pages. Required for static scraping.
  2. selenium - To drive a headless browser (a browser without an interface) and send requests through it. Required for dynamic scraping.
  3. beautifulsoup4 - For the scraping itself: finding the necessary elements and converting or saving them in the desired format (files, lists, ...).
Let's create a virtual environment and install the above-listed packages:
python -m venv .venv
This command creates a .venv directory containing a copy of the Python executable and the base libraries it needs.
Now we activate the virtual environment for Windows:
.\.venv\Scripts\Activate.ps1
And this command for Linux:
source ./.venv/bin/activate
After activating the virtual environment, let's finally install the necessary packages (plus lxml, since the scrape function below uses BeautifulSoup's lxml parser):
pip install requests selenium beautifulsoup4 lxml
Now let's create a main.py file and add the necessary imports and target URL to the file:
# FOR DECORATORS
import functools
# FOR WAITING ON DYNAMIC CONTENT
import time
# FOR STATIC SCRAPING
import requests
from bs4 import BeautifulSoup
# FOR DYNAMIC SCRAPING
# For creating the webdriver
from selenium import webdriver
# For passing options to the driver when it is created
from selenium.webdriver.firefox.options import Options


URL="https://timthewebmaster.com/ru/about/"
Add the program entry point at the very end of the file. This block says that when the file is run directly by the Python interpreter, it will execute two functions in order: get_static and get_dynamic.
if __name__ == "__main__":
    get_static(URL)
    get_dynamic(URL)
We'll talk about them in the next chapters, but for now let's look at my decorators and the parser function itself:
The first decorator simply prints the specified message. In my case, it's the function name. There is another way to get the name of the current function in the current thread, but that would be too much for this article.
def identifier(message: str = ""):
    def decorator_repeat(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            print(message)
            return func(*args, **kwargs)
        return wrapper
    return decorator_repeat
You will see how to use them in the following chapters. This decorator takes a string argument and prints it to the console before executing the wrapped function.
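If you would rather not pass the message by hand, one simple alternative is to take the name straight from the wrapped function object. A quick sketch of such a variant (it is not used anywhere else in this article):
def identifier_auto(func):
    # Hypothetical variant: prints the wrapped function's own name instead of a custom message
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        print(f'=>> {func.__name__} <<=')
        return func(*args, **kwargs)
    return wrapper

# Usage: @identifier_auto (without parentheses, unlike @identifier('...'))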
The next decorator works on a similar principle: it prints all collected image links to the console, and it does not take any arguments.
def scraped_images():
    def decorator_repeat(func):
        @functools.wraps(func)
        def wrapper_debug(*args, **kwargs):
            value = func(*args, **kwargs)
            images = scrape(value)
            for image in images:
                print(f'Link->[{image}]')
            return value
        return wrapper_debug
    return decorator_repeat
Next we need the scraping function itself. Inside it we specify what we scrape with, how we scrape and what we scrape. First, the "soup" is created.
Soup is the conventional name for what the BeautifulSoup constructor returns. It is just a convention, and you are free to call it whatever you like.
In this "soup" we find all the images and extract the value stored in their src attribute. The resulting list is returned, picked up by the decorator and processed there, i.e. printed to the console.
def scrape(source):
    soup = BeautifulSoup(source, 'lxml')
    soup_images = soup.find_all('img')
    images = []
    for soup_image in soup_images:
        images.append(soup_image['src'])
    return images
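A side note: as you will see in the output below, many src values are relative paths like /static/placeholder.svg. If you want absolute URLs instead, a small variation using urljoin from the standard library would do it. This is just a sketch, not part of the final script:
from urllib.parse import urljoin

def scrape_absolute(source, base_url):
    # Same idea as scrape(), but resolves relative src paths against the page URL
    soup = BeautifulSoup(source, 'lxml')
    images = []
    for soup_image in soup.find_all('img'):
        if soup_image.has_attr('src'):
            images.append(urljoin(base_url, soup_image['src']))
    return images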
This is, of course, good, but for this function to work, it must receive some source, that is, a target page. And here the two functions I mentioned earlier come into play: get_static and get_dynamic.

Static page scraper

If you need to parse a static site, consider yourself lucky. These days, such sites are less and less common. But if the target site is like this, this function will work like a charm:
# This function defines how to get the actual content of the target page; in this case I'm using the requests module
@identifier('=>> run_static <<=')
@scraped_images()
def get_static(url):
    resp = requests.get(url)
    return resp.text
We attach the two decorators described above; they wrap the function via the @ symbol. The function itself takes a URL as an argument and returns the text of the page.
I should note that although I claimed such a function will work like a charm, this is only true if the site has no protection against parsers, and such protection is very common in practice. If there is protection, you can always try to disguise your request as a real user by sending headers with requests, or use a proxy. And if you need to parse many pages, rotation (of proxies or user agents) will help.
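A rough sketch of that disguise (the user-agent strings and the commented-out proxy address are placeholders I made up, not values from this article):
import random

# Made-up user agents to rotate through; use real, current ones in practice
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:133.0) Gecko/20100101 Firefox/133.0',
    'Mozilla/5.0 (X11; Linux x86_64; rv:133.0) Gecko/20100101 Firefox/133.0',
]

def get_static_disguised(url):
    headers = {'User-Agent': random.choice(USER_AGENTS)}
    # proxies = {'https': 'http://my-proxy.example:3128'}  # optional: plug in a real proxy and pass proxies=proxies
    resp = requests.get(url, headers=headers, timeout=10)
    return resp.text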

Dynamic page scraper

Often you will encounter a situation when the response is 200 and the page seems to have returned, but there is nothing on it. This happens when you parse dynamic sites. They just haven't loaded their content yet, and you just downloaded a template without content :(
Here's how to get a dynamic page:
# This function defines how to get the actual content of the target page; in this case I'm using the selenium module
@identifier('=>> run_dynamic <<=')
@scraped_images()
def get_dynamic(url):
    firefox_opt = Options()
    firefox_opt.add_argument('--headless')
    firefox_opt.add_argument("--no-sandbox")
    firefox_opt.add_argument("--disable-dev-shm-usage")
    driver = webdriver.Firefox(options=firefox_opt)
    driver.get(url)
    time.sleep(5)  # crude pause to let dynamically loaded content appear
    source = driver.page_source
    driver.quit()  # close the headless browser so it does not keep running
    return source
The principle of operation is the same as in get_static: we attach the same two decorators, and the function takes a URL as an argument and returns the text of the page. The only difference is in how the page itself is obtained and, of course, in the tools used. In this case I used Firefox, but you can use any other browser; the main thing is that it has to be installed.
I also added several options for configuring the driver:
  1. headless - means do not create and display a browser window
  2. no-sandbox - disables the browser sandbox; mostly relevant when the script runs as root or inside a container
  3. disable-dev-shm-usage - tells the browser not to rely on the small /dev/shm shared-memory partition and to write to disk instead; useful in Docker, where /dev/shm is tiny by default
After loading the page, we wait a flat 5 seconds before grabbing its source. This is not my best solution: it would be better to wait until a specific piece of content has loaded, so the script waits exactly as long as it needs to pick up the page. But all sites are different, and you need to know which element to expect. So I went for a universal solution and stupidly wait 5 seconds. (´。_。`)
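If you do know which element to expect, Selenium's explicit waits are that better solution. A minimal sketch, reusing the imports already at the top of main.py and assuming the page eventually renders at least one <img> tag (adjust the locator to your target page):
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def get_dynamic_waiting(url):
    firefox_opt = Options()
    firefox_opt.add_argument('--headless')
    driver = webdriver.Firefox(options=firefox_opt)
    driver.get(url)
    # Wait up to 10 seconds for at least one <img> tag to appear, then grab the page
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.TAG_NAME, 'img'))
    )
    source = driver.page_source
    driver.quit()
    return source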

Conclusions nonetheless

The parser is now complete, and the only thing left is to run it. Yes, it is not the very simplest parser, but it is very illustrative and has everything you need to parse most sites. Run it like this:
python main.py
And here is the console output you will get:
=>> run_static <<=
Link->[https://mc.yandex.ru/watch/95199635]
Link->[/static/placeholder.svg]
Link->[/static/placeholder.svg]
Link->[/static/Main/img/bg_1.webp]
Link->[placeholder.svg]
Link->[placeholder.svg]
Link->[placeholder.svg]
Link->[/static/placeholder.svg]
Link->[/static/placeholder.svg]
Link->[/static/placeholder.svg]
Link->[/static/placeholder.svg]
Link->[/static/placeholder.svg]
=>> run_dynamic <<=
Link->[https://mc.yandex.ru/watch/95199635]
Link->[/static/menu.svg]
Link->[/static/favicon.svg]
Link->[/static/Main/img/bg_1.webp]
Link->[/static/Main/img/me1.jpg]
Link->[/static/Main/img/sity1.jpg]
Link->[placeholder.svg]
Link->[/static/placeholder.svg]
Link->[/static/placeholder.svg]
Link->[/static/placeholder.svg]
Link->[/static/placeholder.svg]
Link->[/static/placeholder.svg]
Want to parse links instead? Go ahead. Just rewrite the scrape function and you're done.
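For example, a link-collecting variant of scrape might look roughly like this (a sketch, not part of the final script):
def scrape_links(source):
    # Same idea as scrape(), but collects href values from <a> tags instead of image sources
    soup = BeautifulSoup(source, 'lxml')
    links = []
    for anchor in soup.find_all('a', href=True):
        links.append(anchor['href'])
    return links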
Is something or someone blocking the parsing? Go ahead: modify get_static, add a proxy, or start rotating user agents. If that doesn't help, bring out the heavy artillery and switch to dynamic parsing with get_dynamic, modifying it as needed.
Don't like the console output? Well, you can always rework it and make your own decorators.
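For instance, a hypothetical scraped_to_file decorator could write the links into a text file instead of printing them (the filename here is made up):
def scraped_to_file(filename='images.txt'):
    def decorator_repeat(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            value = func(*args, **kwargs)
            with open(filename, 'a', encoding='utf-8') as f:
                for image in scrape(value):
                    f.write(image + '\n')  # one link per line instead of console output
            return value
        return wrapper
    return decorator_repeat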
So, you get the idea. This is the basic template for your parser. And in case you copied something incorrectly, here's the full script:
# FOR DECORATORS
import functools
# FOR WAITING ON DYNAMIC CONTENT
import time
# FOR STATIC SCRAPING
import requests
from bs4 import BeautifulSoup
# FOR DYNAMIC SCRAPING
# For creating the webdriver
from selenium import webdriver
# For passing options to the driver when it is created
from selenium.webdriver.firefox.options import Options


URL="https://timthewebmaster.com/ru/about/"

# Decorator to print the name of the function in the console
def identifier(message: str = ""):
    def decorator_repeat(func):
        @functools.wraps(func)
        def wrapper_debug(*args, **kwargs):
            print(message)
            return func(*args, **kwargs)
        return wrapper_debug
    return decorator_repeat

# Decorator to print links of the images
def scraped_images():
    def decorator_repeat(func):
        @functools.wraps(func)
        def wrapper_debug(*args, **kwargs):
            value = func(*args, **kwargs)
            images = scrape(value)
            for image in images:
                print(f'Link->[{image}]')
            return value
        return wrapper_debug
    return decorator_repeat

# This function defines how and what to scrape
def scrape(source):
    soup = BeautifulSoup(source, 'lxml')
    soup_images = soup.find_all('img')
    images = []
    for soup_image in soup_images:
        images.append(soup_image['src'])
    return images

# This function defines how to get the actual content of the target page; in this case I'm using the requests module
@identifier('=>> run_static <<=')
@scraped_images()
def get_static(url):
    resp = requests.get(url)
    return resp.text

# This function defines how to get the actual content of the target page; in this case I'm using the selenium module
@identifier('=>> run_dynamic <<=')
@scraped_images()
def get_dynamic(url):
    firefox_opt = Options()
    firefox_opt.add_argument('--headless')
    firefox_opt.add_argument("--no-sandbox")
    firefox_opt.add_argument("--disable-dev-shm-usage")
    driver = webdriver.Firefox(options=firefox_opt)
    driver.get(url)
    time.sleep(5)  # crude pause to let dynamically loaded content appear
    source = driver.page_source
    driver.quit()  # close the headless browser so it does not keep running
    return source


if __name__ == "__main__":
    get_static(URL)
    get_dynamic(URL)
That's all for now. <(_ _)>
