How to write a simple Python scraper
10.12.2024
15.04.2025
6 minutes
Introduction
In this article I will give you an example of the simplest parser in Python, in two versions: the first one parses static pages, the second one parses dynamically loaded pages. For someone who writes parsers, the difference between them is not very big. In both cases, I used BeautifulSoup4 to collect image links.
Since I have not yet received permission to parse other people's sites, I decided to use my own website as an example. We will parse the "about me" page. It has both embedded images and dynamically loaded ones. An ideal example.
If you need the source code and the finished project, you can download it from here.
Basic script, its structure
Let's start with installing the necessary packages and importing them. We will need 3 packages:
- requests - To send requests to the target site and receive web pages. Required for static scraping.
- selenium - To drive a browser without an interface (a headless browser) and send requests from it. Required for dynamic scraping.
- beautifulsoup4 - For the scraping itself: finding the necessary elements and converting or saving them in the desired format (files, lists, ...).
Let's create a virtual environment and install the above-listed packages:
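(Assuming Python 3 is available on your PATH as python.)

```
python -m venv .venv
```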
This command will create a .venv directory and place into it a copy of the Python executable along with the basic libraries it needs.
Now we activate the virtual environment for Windows:
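(On a standard Windows install the activation script lives in .venv\Scripts.)

```
.venv\Scripts\activate
```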
And this command for Linux:
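(On Linux and macOS the same script is sourced from .venv/bin.)

```
source .venv/bin/activate
```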
After activating the virtual environment, let's finally install the necessary packages:
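(With the environment active, pip installs into .venv rather than system-wide.)

```
pip install requests selenium beautifulsoup4
```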
Now let's create a main.py file and add the necessary imports and target URL to the file:
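A sketch of what the top of the file might look like; the URL constant name and the example address below are illustrative placeholders for your own target page:

```python
import time

import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.firefox.options import Options

# Target page to scrape; substitute the address of the page you want to parse.
URL = "https://example.com/about-me"
```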
Add the program entry point to the very end of the file. It says that when this file is run directly by the Python interpreter, two functions will be executed in sequential order: get_static and get_dynamic.
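A sketch of that entry point, using the function names from this article:

```python
if __name__ == "__main__":
    get_static(URL)
    get_dynamic(URL)
```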
We'll talk about them in the next chapters, but for now let's look at my decorators and the parser function itself.
The first decorator simply prints the specified message. In my case, it's the function name. There is another way to get the name of the current function in the current thread, but that would be too much for this article.
You will see how to use them in the following chapters. This decorator takes a string argument and prints it to the console before executing the wrapped function.
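A minimal sketch of such a decorator (the name show_message is only an illustrative placeholder):

```python
def show_message(message):
    """Print the given message before running the wrapped function."""
    def decorator(func):
        def wrapper(*args, **kwargs):
            print(message)
            return func(*args, **kwargs)
        return wrapper
    return decorator
```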
The next decorator works on a similar principle: it prints all received image links to the console, and it does not take any arguments.
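A sketch of that second decorator, assuming it feeds the page returned by the wrapped function to the scrape function shown below and prints every image link it gets back (show_images is again a placeholder name):

```python
def show_images(func):
    """Run the wrapped function, scrape the page it returns and print every image link."""
    def wrapper(*args, **kwargs):
        page = func(*args, **kwargs)
        for image in scrape(page):
            print(image)
        return page
    return wrapper
```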
Next we need the scraping function itself. Inside it we specify what we are scraping with, how we are scraping, and what we are scraping. A "soup" is created, in which we find all the images and extract the value stored in their src attribute. The result is a list, which is passed to the decorator and processed there, i.e., output to the console.
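A sketch of that function, using BeautifulSoup's built-in html.parser:

```python
def scrape(source):
    """Parse the page and collect the src attribute of every <img> tag."""
    soup = BeautifulSoup(source, "html.parser")
    return [img.get("src") for img in soup.find_all("img")]
```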
This is, of course, good, but for this function to work, it must receive some source, that is, a target page. And here the two functions I mentioned earlier come into play: get_static and get_dynamic.
Static page scraper
If you need to parse a static site, consider yourself lucky. These days, such sites are less and less common. But if the target site is like this, this function will work like a charm:
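A sketch of get_static built on requests (the decorator names are the placeholders introduced above):

```python
@show_message("get_static")
@show_images
def get_static(url):
    """Fetch a static page with a plain HTTP GET and return its HTML."""
    response = requests.get(url)
    response.raise_for_status()
    return response.text
```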
We have applied the two decorators that I mentioned above; they wrap the function via the @ symbol. The function takes a URL as an argument and returns the text of the page itself.
Dynamic page scraper
Often you will encounter a situation when the response is 200 and the page seems to have returned, but there is nothing on it. This happens when you parse dynamic sites. They just haven't loaded their content yet, and you just downloaded a template without content :(
Here's how to get a dynamic page:
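A sketch of get_dynamic driving a headless Firefox through Selenium (the decorator names are the same placeholders; the driver options are discussed below):

```python
@show_message("get_dynamic")
@show_images
def get_dynamic(url):
    """Open the page in a headless Firefox, wait for it to load and return its HTML."""
    options = Options()
    options.add_argument("--headless")
    # The next two flags come from the option list below; they matter mostly for
    # Chromium-based browsers and have little effect on Firefox.
    options.add_argument("--no-sandbox")
    options.add_argument("--disable-dev-shm-usage")
    driver = webdriver.Firefox(options=options)
    try:
        driver.get(url)
        time.sleep(5)  # crude wait for dynamically loaded content
        return driver.page_source
    finally:
        driver.quit()
```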
The principle of operation is the same as in get_static. The same two decorators wrap the function via the @ symbol, and the function takes a URL as an argument and returns the text of the page itself.
The only difference is in how the page itself is obtained and, of course, in the tools used. In this case, I used Firefox, but you can use any other browser; the main thing is that it is actually installed on your machine.
I also added several options for configuring the driver:
- headless - do not create or display a browser window
- no-sandbox - disable the browser sandbox, which is useful when the script runs as root or inside a container
- disable-dev-shm-usage - write shared memory to disk instead of /dev/shm, which helps when /dev/shm is too small
After creating the driver, we set a wait of several seconds before parsing the page. This is not my best solution. It would be better to use an explicit wait that ends as soon as certain content has loaded, so the script waits exactly as long as it needs to pick up the page. But all sites are different, and you need to know which element to wait for. So I went for a universal solution and simply wait 5 seconds. (´。_。`)
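For reference, such an explicit wait could look roughly like this; waiting for the first <img> element is only an example condition, and the right element to wait for depends on the target site:

```python
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for at least one <img> to appear instead of sleeping blindly.
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.TAG_NAME, "img"))
)
```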
Conclusions nonetheless
The parser is completely ready, and the only thing left is to launch it. Yes, it is not a very simple parser, but it is very visual and has everything necessary to parse most sites. This is how to launch it:
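(With the virtual environment still active.)

```
python main.py
```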
And here is the console output you will get:
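The exact links depend on the page you point the script at; with the decorators above, the shape of the output is roughly this (the paths are placeholders):

```
get_static
/path/to/image-1.jpg
get_dynamic
/path/to/image-1.jpg
/path/to/lazy-loaded-image-2.jpg
```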
Want to parse links instead? Go ahead. Just rewrite the scrape function and you're done.
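For example, a variant of the scrape function that collects link addresses instead might look like this:

```python
def scrape(source):
    """Collect the href attribute of every <a> tag instead of image sources."""
    soup = BeautifulSoup(source, "html.parser")
    return [a.get("href") for a in soup.find_all("a")]
```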
Is something or someone blocking the parsing? Go ahead: modify get_static, add a proxy, or start rotating user agents. If that doesn't help, bring out the heavy artillery and switch to dynamic parsing with the get_dynamic function, modifying it if necessary.
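As an illustration, get_static could send custom headers and route traffic through a proxy (the user agent string and the proxy address here are placeholders):

```python
@show_message("get_static")
@show_images
def get_static(url):
    """Fetch the page with a browser-like user agent, through a proxy."""
    headers = {
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:120.0) Gecko/20100101 Firefox/120.0"
    }
    proxies = {
        "http": "http://127.0.0.1:8080",   # placeholder proxy address
        "https": "http://127.0.0.1:8080",
    }
    response = requests.get(url, headers=headers, proxies=proxies, timeout=10)
    response.raise_for_status()
    return response.text
```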
Don't like the console output? Well, you can always rework it and make your own decorators.
So, you get the idea. This is the basic template for your parser. And in case you copied something incorrectly, here's the full script:
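Assembled from the sketches above (the decorator names and the URL remain illustrative placeholders):

```python
import time

import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.firefox.options import Options

# Target page to scrape; substitute the address of the page you want to parse.
URL = "https://example.com/about-me"


def show_message(message):
    """Print the given message before running the wrapped function."""
    def decorator(func):
        def wrapper(*args, **kwargs):
            print(message)
            return func(*args, **kwargs)
        return wrapper
    return decorator


def show_images(func):
    """Run the wrapped function, scrape the page it returns and print every image link."""
    def wrapper(*args, **kwargs):
        page = func(*args, **kwargs)
        for image in scrape(page):
            print(image)
        return page
    return wrapper


def scrape(source):
    """Parse the page and collect the src attribute of every <img> tag."""
    soup = BeautifulSoup(source, "html.parser")
    return [img.get("src") for img in soup.find_all("img")]


@show_message("get_static")
@show_images
def get_static(url):
    """Fetch a static page with a plain HTTP GET and return its HTML."""
    response = requests.get(url)
    response.raise_for_status()
    return response.text


@show_message("get_dynamic")
@show_images
def get_dynamic(url):
    """Open the page in a headless Firefox, wait for it to load and return its HTML."""
    options = Options()
    options.add_argument("--headless")
    options.add_argument("--no-sandbox")
    options.add_argument("--disable-dev-shm-usage")
    driver = webdriver.Firefox(options=options)
    try:
        driver.get(url)
        time.sleep(5)  # crude wait for dynamically loaded content
        return driver.page_source
    finally:
        driver.quit()


if __name__ == "__main__":
    get_static(URL)
    get_dynamic(URL)
```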
That's all for now. <(_ _)>
Used terms
- Python dictionary ⟶ A composite data type, also known as an associative array. A Python dictionary stores keys and their associated values.
- Python programming language ⟶ An interpreted, object-oriented, high-level programming language with dynamic semantics. It is actively used for rapid development, scripting, and gluing together parts of existing applications.
- Virtual environment ⟶ A self-contained directory that provides a way to manage dependencies and isolate project-specific configurations in Python (and other programming languages). It allows developers to create a separate environment for each project, ensuring that each project can have its own dependencies, regardless of what dependencies every other project has.
- Scraper ⟶ In the context of computing and web development, a program or script designed to extract data from websites. This process is known as web scraping. Scrapers can automatically navigate through web pages, retrieve specific information, and store that data in a structured format, such as CSV, JSON, or a database.
Related questions
- What is the best web scraping tool? The choice of a scraping tool depends on the nature of the website and its complexity. As long as the tool can help you get the data quickly and smoothly with acceptable or zero cost, you can choose the tool you like.
- What is the difference between web scraping and web crawling? Web scraping and web crawling are two related concepts. Web scraping, as we mentioned above, is a process of obtaining data from websites; web crawling is systematically browsing the World Wide Web, generally for the purpose of indexing the web.
- Is scraping data from Instagram illegal? If the data you are going to collect is public and accessible to everyone, it is generally allowed. Plus, Instagram provides an official API for accessing data, so there should be no problems.