How to write a simple Python scraper
10.12.2024
Introduction
In this article I will give you an example of the simplest parser in Python, in two versions: the first parses static pages, the second parses dynamically loaded pages. For someone who writes parsers, the difference between them is not very big. In both cases, I used BeautifulSoup4 to collect image links.
Since I have not yet received permission to parse other people's sites, I decided to use my own website as an example. We will parse the "about me" page. It contains both embedded images and dynamically loaded ones, which makes it an ideal example.
If you need the source code and the finished project, you can download it from here.
Basic script, its structure
Let's start with installing the necessary packages and importing them. We will need 3 packages:
- requests - To send requests to the target site and receive web pages. Required for static scraping.
- selenium - To create a browser-without-an-interface and send requests from it. Required for dynamic scraping.
- beautifulsoup4 - For the scraping itself. Finding the necessary elements and converting or saving them in the desired format (files, lists ...)
Let's create a virtual environment and install the above-listed packages:
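For example, using the built-in venv module:

```bash
python -m venv .venv
```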
This command creates a .venv directory containing a copy of the Python executable and the base libraries it needs.
Now we activate the virtual environment for Windows:
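```bash
.venv\Scripts\activate
```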
And this command for Linux:
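```bash
source .venv/bin/activate
```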
After activating the virtual environment, let's finally install the necessary packages:
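```bash
pip install requests selenium beautifulsoup4
```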
Now let's create a main.py file and add the necessary imports and target URL to the file:
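Something along these lines; the URL below is a placeholder, in the article it points to the author's own page:

```python
# main.py
import time                   # used later for the blunt 5-second wait
from functools import wraps   # used by the decorators below

import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.firefox.options import Options

# Hypothetical placeholder for the "about me" page
URL = "https://example.com/about"
```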
Add the program entry point to the very end of the file. This block says that when the file is run directly by the Python interpreter, it will execute two functions in order: get_static and get_dynamic.
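In its simplest form:

```python
if __name__ == "__main__":
    get_static(URL)   # scrape the statically served page
    get_dynamic(URL)  # scrape the page after dynamic content has loaded
```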
We'll talk about them in the next chapters, but for now let's look at my decorators and the parser function itself:
The first decorator simply prints the specified message. In my case, it's the function name. There is another way to get the name of the current function in the current thread, but that would be too much for this article.
You will see how to use them in the following chapters. This decorator takes a string argument and prints it to the console before executing the wrapped function.
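A minimal sketch of such a decorator, with announce as a placeholder name of my choosing (it relies on functools.wraps from the imports above):

```python
def announce(message):
    """Print the given message before running the wrapped function."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            print(message)
            return func(*args, **kwargs)
        return wrapper
    return decorator
```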
The next decorator works on a similar principle: it prints all the collected image links to the console, and it takes no arguments of its own.
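Again as a sketch, assuming it wraps a function that returns the page HTML and hands that page to the scrape function shown just below:

```python
def print_images(func):
    """Run the wrapped function, scrape image links from the page it returns,
    and print each link to the console."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        page = func(*args, **kwargs)
        for link in scrape(page):
            print(link)
        return page
    return wrapper
```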
Next we need the scraping function itself. Inside it we specify what we are scraping with, how we are scraping, and what exactly we are scraping. This is where the "soup" is created.
In this "soup" we find all the images and extract the data stored in the scr attribute from them. A list is created, which is passed to the decorator and processed there, i.e., output to the console.
This is, of course, good, but for this function to work, it must receive some source, that is, a target page. And here the two functions I mentioned earlier come into play: get_static and get_dynamic.
Static page scraper
If you need to parse a static site, consider yourself lucky. These days, such sites are less and less common. But if the target site is like this, this function will work like a charm:
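A sketch of get_static with both decorators attached:

```python
@announce("get_static")
@print_images
def get_static(url):
    """Download a static page with requests and return its HTML text."""
    response = requests.get(url)
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    return response.text
```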
We attach the two decorators I mentioned above; they wrap the function via the @ symbol. The function itself takes a URL as an argument and returns the text of the page.
Dynamic page scraper
Often you will encounter a situation where the response is 200 and the page seems to have been returned, but there is nothing on it. This happens when you parse dynamic sites: they simply have not loaded their content yet, and you have downloaded a template without any content :(
Here's how to get a dynamic page:
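A sketch of get_dynamic. In this sketch only the headless flag is passed to Firefox; the other two options discussed below are Chromium-style flags and would go into ChromeOptions if you switch browsers:

```python
@announce("get_dynamic")
@print_images
def get_dynamic(url):
    """Render the page in headless Firefox and return the resulting HTML."""
    options = Options()
    options.add_argument("--headless")
    # --no-sandbox and --disable-dev-shm-usage (see the list below) are
    # Chromium options and are mainly relevant when using ChromeOptions.
    driver = webdriver.Firefox(options=options)
    try:
        driver.get(url)
        time.sleep(5)  # blunt universal wait, see the note below
        return driver.page_source
    finally:
        driver.quit()
```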
The principle of operation is the same as with get_static: we attach the same two decorators via the @ symbol, and the function takes a URL as an argument and returns the text of the page.
The only difference is in how the page itself is obtained and, of course, in the tools used. In this case I used Firefox, but you can use any other browser; the main thing is that you actually have it installed.
I also added several options for configuring the driver:
- headless - means run the browser without creating or displaying a window
- no-sandbox - means disable the browser's sandbox; this is often required when the script runs in a container or as the root user
- disable-dev-shm-usage - means write temporary shared-memory files to disk instead of the usually small /dev/shm partition, which helps when memory there runs out
After creating the driver, we set a wait time in seconds before parsing the page. This is not my best solution. It would be better to use an explicit wait that ends as soon as certain content has loaded, so the script waits exactly as long as it needs to pick up the page. But all sites are different, and you need to know which element to wait for. So I went for a universal solution and simply wait 5 seconds. (´。_。`)
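For reference, such a targeted wait could be sketched like this, assuming you know which element to wait for (here, hypothetically, the first img tag on the page); you would call it inside get_dynamic instead of the fixed sleep:

```python
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait


def wait_for_images(driver, timeout=10):
    """Block until at least one <img> element is present, or raise after `timeout` seconds."""
    WebDriverWait(driver, timeout).until(
        EC.presence_of_element_located((By.TAG_NAME, "img"))
    )
```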
Conclusions
The parser is completely ready, and the only thing left is to launch it. It is not the very simplest parser, but it is very illustrative and has everything you need to parse most sites. This is how to launch it:
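With the virtual environment still activated:

```bash
python main.py
```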
And here is the console output you will get:
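With the placeholder URL from the sketches above, the output would look something like this; the actual image paths depend entirely on the target page:

```text
get_static
/static/images/photo1.jpg
/static/images/photo2.jpg
get_dynamic
/static/images/photo1.jpg
/static/images/photo2.jpg
/media/gallery/photo3.jpg
```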
Want to parse links instead? Go ahead. Just rewrite the scrape function and you're done.
Is something or someone blocking the parsing? Go ahead: modify get_static, add a proxy, or start rotating user agents. If that doesn't help, bring out the heavy artillery and switch to dynamic parsing with the get_dynamic function, modifying it if necessary.
Don't like the console output? Well, you can always rework it and make your own decorators.
So, you get the idea. This is the basic template for your parser. And in case you copied something incorrectly, here's the full script:
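(As above, this is a sketch; the decorator names and the URL are placeholders.)

```python
# main.py
import time
from functools import wraps

import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.firefox.options import Options

# Hypothetical placeholder for the "about me" page
URL = "https://example.com/about"


def announce(message):
    """Print the given message before running the wrapped function."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            print(message)
            return func(*args, **kwargs)
        return wrapper
    return decorator


def print_images(func):
    """Scrape image links from the page returned by the wrapped function and print them."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        page = func(*args, **kwargs)
        for link in scrape(page):
            print(link)
        return page
    return wrapper


def scrape(page_text):
    """Parse the page and collect the src attribute of every <img> tag."""
    soup = BeautifulSoup(page_text, "html.parser")
    return [img.get("src") for img in soup.find_all("img") if img.get("src")]


@announce("get_static")
@print_images
def get_static(url):
    """Download a static page with requests and return its HTML text."""
    response = requests.get(url)
    response.raise_for_status()
    return response.text


@announce("get_dynamic")
@print_images
def get_dynamic(url):
    """Render the page in headless Firefox and return the resulting HTML."""
    options = Options()
    options.add_argument("--headless")
    driver = webdriver.Firefox(options=options)
    try:
        driver.get(url)
        time.sleep(5)  # blunt universal wait for dynamic content
        return driver.page_source
    finally:
        driver.quit()


if __name__ == "__main__":
    get_static(URL)
    get_dynamic(URL)
```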
That's all for now. <(_ _)>
Terms used
- Scraper ⟶ In the context of computing and web development, refers to a program or script that is designed to extract data from websites. This process is known as web scraping. Scrapers can automatically navigate through web pages, retrieve specific information, and store that data in a structured format, such as CSV, JSON, or a database.
Related questions
- Manual scraping, what is it? This is the process of extracting data from web resources or documents by hand, that is, performed by a person without the help of any auxiliary scripts or programs.
- Is scraping data from Instagram illegal? If the data you are going to collect is public and accessible to everyone, it is generally allowed. In addition, Instagram provides an official API for accessing data, so there should be no problems.
- Cloud scraping, what is it? This is a service for collecting information from various sources and grouping it in various formats, carried out on the cloud servers of the provider of this service.