This article is about writing your own search results parser, for free and in about five minutes, without proxies or bs4, and without third-party tools such as Selenium to bypass captchas or imitate user activity in the browser.
It is intended for beginner SEO specialists who know a little programming and understand Python syntax, but who do not have a lot of spare money.
So how am I going to parse Google search results? It's simple: I will connect to the Google Search API, which has a free plan of 100 requests per day. For the owner of a small site, that's just right. Here is a ready-made Google search results parser project.
Next, fill in all the fields and you're done. Now, finally, create an API key. Go to Credentials, either via the link or via the button (。・∀・)ノ゙:
Create an API key:
After all these steps, you have created your own API key for your custom search engine. Copy it and save it somewhere safe.
Writing a scraper
Basic setup and preparation
Now we have everything we need; all that's left is to write a scraper. Let's create a virtual environment, install the necessary packages, and create a couple of directories:
For Windows/PowerShell
For Linux/Bash
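The original commands did not survive in this copy. A minimal Linux/Bash version might look like the following; the sketches later in the article use only the standard library for HTTP, so only the optional XLSX dependencies need installing:

```bash
# Create and activate a virtual environment
# (on Windows/PowerShell, run venv\Scripts\Activate.ps1 instead of the source line)
python3 -m venv venv
source venv/bin/activate

# pandas and openpyxl are only needed if you want to save results as XLSX
pip install pandas openpyxl

# Directory for the temporary JSON files and the final results
mkdir -p data
```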
Installing pandas and openpyxl is optional: if you don't want to save the parsing results to XLSX files, you can skip them. I will install them, because XLSX is more convenient for me. The data directory will store our temporary JSON files as well as the results themselves, either as JSON or as XLSX tables.
Configuration file
My parser will also have a configuration file, config.json, from which it learns how to process requests. Here are the contents of the configuration file; copy and paste:
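The file itself is not reproduced in this copy. A plausible reconstruction, with the key names taken from the descriptions that follow (the key and cx values are placeholders you must replace with your own):

```json
{
    "key": "YOUR_API_KEY",
    "cx": "YOUR_SEARCH_ENGINE_ID",
    "save_to": "json",
    "depth": 10,
    "title": true,
    "description": true,
    "url": true
}
```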
Here is a general description of each key:
key - the API key we just created
cx - the ID of the custom search engine created at the beginning
save_to - defines how to save the result; valid values are excel and json
depth - how many pages of search results to parse; Google returns at most 10 pages with 10 positions each
title, description and url - which fields to scrape
Script
The script accepts arguments from the command line. The first, -q, is the query itself; the second, -C, is the path to the configuration file. I implemented this with Python's argparse module. All of this lives in the run function:
In this function, an argparse parser is created and configured, and then the configuration file is read. At the very bottom of the function, serp_scrape_init and serp_page_scrape are called. Let's look at them one by one.
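The listing is missing from this copy, so here is a minimal sketch of what run might look like. The -q and -C flags come from the text above; the long option names, the build_arg_parser helper, and the default config path are my own assumptions:

```python
import argparse
import json

def build_arg_parser():
    # Hypothetical reconstruction of the article's CLI
    parser = argparse.ArgumentParser(description="Google SERP scraper")
    parser.add_argument("-q", "--query", required=True, help="search query")
    parser.add_argument("-C", "--config", default="config.json",
                        help="path to the configuration file")
    return parser

def run():
    args = build_arg_parser().parse_args()
    with open(args.config, encoding="utf-8") as f:
        config = json.load(f)
    # serp_scrape_init and serp_page_scrape are the two functions
    # the article describes next
    serp_scrape_init(args.query, config)
    serp_page_scrape(config)
```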
The first function, serp_scrape_init, works with the Google Search API. Although it's hard to call it work: we simply make a request to the following URL:
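The URL itself did not survive in this copy. The Custom Search JSON API lives at the endpoint below; the query-string values are placeholders for the parameters described in this article:

```
https://www.googleapis.com/customsearch/v1?key=API_KEY&cx=ENGINE_ID&q=QUERY&num=10&start=1
```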
It is important to understand that we need to walk through all the pages Google returns. Two request parameters handle this: num and start. The first controls how many results to return per request (10 at most); the second steps through the pages in increments of 10. There are many more query parameters; you can see all of them here. As a result, our function looks like this:
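Since the original listing is missing here, the following is a sketch under the assumptions stated so far; the standard-library urllib (rather than any particular HTTP client) and the page_N.json file names are my own choices:

```python
import json
import pathlib
import urllib.parse
import urllib.request

API_ENDPOINT = "https://www.googleapis.com/customsearch/v1"

def build_page_url(query, config, start):
    # num: results per request (10 is the maximum);
    # start: 1-based index of the first result on the page
    params = {
        "key": config["key"],
        "cx": config["cx"],
        "q": query,
        "num": 10,
        "start": start,
    }
    return API_ENDPOINT + "?" + urllib.parse.urlencode(params)

def serp_scrape_init(query, config):
    # Walk the result pages in steps of 10 and dump each raw response
    # into the data directory for later processing
    pathlib.Path("data").mkdir(exist_ok=True)
    for page in range(config["depth"]):
        url = build_page_url(query, config, start=page * 10 + 1)
        with urllib.request.urlopen(url) as resp:
            payload = json.load(resp)
        out_path = pathlib.Path("data") / f"page_{page + 1}.json"
        out_path.write_text(json.dumps(payload, ensure_ascii=False),
                            encoding="utf-8")
```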
The function saves its results as JSON files, which are then processed by serp_page_scrape. Let's talk about that function next.
Nothing extraordinary: it opens the previously created JSON files and saves the fields specified in the configuration file. And that's it; we now have a small Google in the console. Here's an example of usage:
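The listing did not make it into this copy, so here is a sketch of serp_page_scrape. The item fields title, snippet, and link are the real names used in Custom Search API responses; the file layout (data/page_*.json in, data/result.* out) is my own convention:

```python
import glob
import json

def serp_page_scrape(config):
    # Extract the fields enabled in the config from the saved raw responses
    rows = []
    for path in sorted(glob.glob("data/page_*.json")):
        with open(path, encoding="utf-8") as f:
            payload = json.load(f)
        for item in payload.get("items", []):
            row = {}
            if config.get("title"):
                row["title"] = item.get("title")
            if config.get("description"):
                row["description"] = item.get("snippet")
            if config.get("url"):
                row["url"] = item.get("link")
            rows.append(row)

    if config["save_to"] == "excel":
        import pandas as pd  # optional dependency, only needed for XLSX output
        pd.DataFrame(rows).to_excel("data/result.xlsx", index=False)
    else:
        with open("data/result.json", "w", encoding="utf-8") as f:
            json.dump(rows, f, ensure_ascii=False, indent=2)
    return rows
```

A hypothetical invocation could then look like `python main.py -q "best pizza in town" -C config.json`.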
Here is the full code of the script, the main.py file:
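The full listing is likewise missing from this copy, so below is a self-contained reconstruction of main.py that ties the pieces together. It is a sketch under the assumptions already noted (standard-library HTTP, my own file names and long option names), not the author's original code:

```python
import argparse
import glob
import json
import pathlib
import urllib.parse
import urllib.request

API_ENDPOINT = "https://www.googleapis.com/customsearch/v1"

def serp_scrape_init(query, config):
    """Fetch up to `depth` pages of results and dump each raw response to data/."""
    pathlib.Path("data").mkdir(exist_ok=True)
    for page in range(config["depth"]):
        params = {"key": config["key"], "cx": config["cx"], "q": query,
                  "num": 10, "start": page * 10 + 1}
        url = API_ENDPOINT + "?" + urllib.parse.urlencode(params)
        with urllib.request.urlopen(url) as resp:
            payload = json.load(resp)
        (pathlib.Path("data") / f"page_{page + 1}.json").write_text(
            json.dumps(payload, ensure_ascii=False), encoding="utf-8")

def serp_page_scrape(config):
    """Pull the configured fields out of the saved responses and write the result."""
    rows = []
    for path in sorted(glob.glob("data/page_*.json")):
        with open(path, encoding="utf-8") as f:
            payload = json.load(f)
        for item in payload.get("items", []):
            row = {}
            if config.get("title"):
                row["title"] = item.get("title")
            if config.get("description"):
                row["description"] = item.get("snippet")
            if config.get("url"):
                row["url"] = item.get("link")
            rows.append(row)
    if config["save_to"] == "excel":
        import pandas as pd  # optional, only for XLSX output
        pd.DataFrame(rows).to_excel("data/result.xlsx", index=False)
    else:
        with open("data/result.json", "w", encoding="utf-8") as f:
            json.dump(rows, f, ensure_ascii=False, indent=2)
    return rows

def run():
    parser = argparse.ArgumentParser(description="Google SERP scraper")
    parser.add_argument("-q", "--query", required=True, help="search query")
    parser.add_argument("-C", "--config", default="config.json",
                        help="path to the configuration file")
    args = parser.parse_args()
    with open(args.config, encoding="utf-8") as f:
        config = json.load(f)
    serp_scrape_init(args.query, config)
    serp_page_scrape(config)
```

To make it executable, add `if __name__ == "__main__": run()` at the bottom and call it as, e.g., `python main.py -q "some query"`.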
Conclusion
You know, I originally planned to write a parser on the BeautifulSoup4 + Selenium + Python stack. But after googling a bit, I found no official tutorial from Google on how to build a legal search results parser; all I got were websites of agencies and companies offering to do the same thing, only for money.
Sure, if you are a large company and need to make 1000 requests per second, the Google Search API can provide higher limits for a small fee, very small compared to those “unnamed” companies and websites. If you want to learn more about the Google Search API, check out their official blog; it is very informative.
Terms used
Client-side rendering (CSR) ⟶ a rendering method that uses JavaScript to render a website or application in the browser. With CSR, the processing and rendering of the content happens in the browser rather than on the server.
JavaScript ⟶ a high-level, interpreted programming language commonly used for web development. It is an essential part of web applications, enabling interactive features and dynamic content on websites.
Script ⟶ a set of instructions written in a programming or scripting language that is executed by a runtime environment rather than compiled into machine code. Scripts are typically used to automate tasks or to control the behavior of applications and systems.
Scraper ⟶ in computing and web development, a program or script designed to extract data from websites, a process known as web scraping. Scrapers can automatically navigate web pages, retrieve specific information, and store it in a structured format such as CSV, JSON, or a database.
Related questions
What is the best web scraping tool?
The choice of scraping tool depends on the nature of the website and its complexity. As long as the tool gets you the data quickly and smoothly at acceptable (or zero) cost, pick the one you like.
How to avoid being blocked when scraping a website?
Many websites will block you if you scrape them too aggressively. To avoid being blocked, make the scraping process look more like a human browsing the website: add delays between requests, use proxies, or vary your scraping patterns.
What is the difference between web scraping and web crawling?
Web scraping and web crawling are two related concepts. Web scraping, as we mentioned above, is a process of obtaining data from websites; web crawling is systematically browsing the World Wide Web, generally for the purpose of indexing the web.
Manual scraping, what is it?
It is the process of extracting data from web resources or documents by hand, that is, performed by a person without the help of any auxiliary scripts or programs.
Is scraping data from Instagram illegal?
If the data you are going to collect is public and accessible to everyone, it is generally allowed. Besides, Instagram provides an official API for accessing data, so there should be no problems.
Cloud scraping, what is it?
It is a service that collects information from various sources and groups it in various formats, carried out on the cloud servers of the service provider.
How is scraping done?
It all depends on what you scrape and what you scrape with. You can scrape documents and tables, or you can scrape websites. Websites are harder to scrape than documents: there are many of them, and each has its own architecture, which greatly complicates scraping.
Which language should you use to write a scraper?
The most popular language for scrapers is Python: for almost any kind of data, in whatever form or format it is presented, there is a library. There are alternatives, though, such as JavaScript, Ruby, Go, C++, and PHP.