How to write a Google scraper using the official API

15.02.2025

What this article is about and who it's for

This article is about writing your own search results parser, for free and in about 5 minutes, without proxies or bs4, and without third-party tools to bypass captchas or imitate user activity in the browser, such as Selenium.
It is intended for beginner SEO specialists who know a bit of programming and understand Python syntax, but who do not have a lot of spare money.
So how am I going to parse Google search results? It's simple: I will connect to the Google Search API, which has a free tier of 100 requests per day. For the owner of a small site, that is just right. Here is a ready-made Google search results parser project.

Creating an API key and search engine ID

To use Google's API, we need to get an access key and a search engine ID. First, let's create our own search engine. Go to https://programmablesearchengine.google.com/controlpanel/all, click Add, and fill in all the form fields.
You will be redirected to the next page, where you can copy your search engine ID.
Once you have the ID, all that remains is to get an API key. Go to https://console.cloud.google.com/apis/dashboard?inv=1&invt=AbppVQ and register if necessary. On this page you will need to create a new project.
Next, fill in all the fields and you're done. Now, finally, create the API key itself. Go to Credentials, either by the link or by the button (。・∀・)ノ゙:
Create an API key:
After all these steps, you have created your own API key for your own search engine. Copy it and save it somewhere safe.
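Before writing the full scraper, it's worth checking that the key and ID actually work. Below is a minimal sketch of such a check; API_KEY and CX are placeholders for the values you just copied, and the query "test" is arbitrary:

import requests

API_KEY = 'YOUR_API_KEY'          # placeholder: the key from the Cloud console
CX = 'YOUR_SEARCH_ENGINE_ID'      # placeholder: the ID from Programmable Search Engine

# A single test request; a valid key and ID return HTTP 200 and an "items" list
response = requests.get(
    'https://www.googleapis.com/customsearch/v1',
    params={'key': API_KEY, 'cx': CX, 'q': 'test', 'num': 1},
)
print(response.status_code)
print(response.json().get('items', [{'link': 'no results'}])[0]['link'])

If you get a 400 or 403 response instead, double-check the key, the ID, and whether the Custom Search API is enabled for your Cloud project.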

Writing a scraper

Basic setup and preparation

Now we have everything we need; all that's left is to write the scraper. Let's create a couple of directories, set up a virtual environment, and install the necessary packages.

For Windows/PowerShell:

mkdir MyParser
mkdir MyParser/data
mkdir MyParser/data/serp
mkdir MyParser/data/temp
New-Item MyParser/main.py
New-Item MyParser/config.json
cd MyParser
python -m venv .venv
.\.venv\Scripts\Activate.ps1
pip install requests pandas openpyxl

For Linux/Bash:

mkdir MyParser
mkdir MyParser/data
mkdir MyParser/data/serp
mkdir MyParser/data/temp
touch MyParser/main.py
touch MyParser/config.json
cd MyParser
python -m venv .venv
source ./.venv/bin/activate
pip install requests pandas openpyxl
Installing pandas and openpyxl is optional: you only need them if you want to save the parsing results to XLSX files. I will, because it's more convenient for me. The data directory will store our temporary JSON files and the results themselves, either as JSON or as XLSX tables.
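One thing to keep in mind: the script writes into ./data/temp and ./data/serp and will fail if they are missing. If you prefer not to rely on the mkdir commands above, a small sketch like this (my own addition, not part of main.py) can create them at startup:

import os

# Create the working directories if they do not exist yet
for folder in ('./data/temp', './data/serp'):
    os.makedirs(folder, exist_ok=True)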

Configuration file

My parser will also have a configuration file, config.json, which tells it how to process requests. Here is the content of the configuration file; copy and paste it:
{
  "key": "11111111111111111111111111111111111",
  "cx": "11111111111111111",
  "save_to": "exel",
  "title": true,
  "description": false,
  "url": true,
  "depth": 1
}

Here is a general description of each key (a small sanity-check sketch follows the list):
  1. key - the API key we recently created
  2. cx - the ID created at the beginning for the custom search engine
  3. save_to - defines how to save the result; valid values are exel and json.
  4. depth - how many pages of search results to parse; Google allows you to get a maximum of 10 pages with 10 positions each
  5. title, description and url - what to scrape
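If you want to double-check the file before running the parser, here is a small standalone sketch (my own addition, not part of main.py) that catches missing keys early; the run function shown below performs a similar check itself:

import json

# The keys main.py expects to find in config.json
REQUIRED_KEYS = {'key', 'cx', 'save_to', 'title', 'description', 'url', 'depth'}

with open('config.json', 'r', encoding='utf-8') as file:
    config = json.load(file)

missing = REQUIRED_KEYS - set(config)
if missing:
    print(f'Missing keys in config.json: {", ".join(sorted(missing))}')
else:
    print('config.json looks complete')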

Script

The script is designed to accept arguments from the command line: the first, -q, is the query itself; the second, -C, is the path to the configuration file. I did this using the argparse Python module. All of this functionality is implemented in the run function:
def run():
    parser = argparse.ArgumentParser(add_help=True)
    parser.add_argument('-q', type=str, help='Query to parse', metavar='QUERY', required=True, nargs='*')
    parser.add_argument('-C', type=str, help='Path to config, in json format', metavar='CONFIG_FILE', required=True, nargs=1)
    args = parser.parse_args()
    # query
    raw_query = ' '.join(args.q)
    if not raw_query:
        return
    query = quote(raw_query)
    # check the config file
    options = {
        'key': '',
        'cx': '',
        'save_to': '',
        'title': '',
        'description': '',
        'url': '',
        'depth': ''
    }
    with open(args.C[0], 'r') as file:
        data = json.loads(file.read())
    for key in data:
        if options.get(key) is not None:
            options[key] = data[key]
        else:
            print(f'ERROR: Something went wrong in your config file, {key}')
            return False

    # check depth
    if options['depth'] > 10:
        print('WARNING: Google Search API allows a maximum of 100 search results (10 pages)')
        options['depth'] = 10
    serp_scrape_init(query, options)
    serp_page_scrape(query, options)
In this function, an argparse object is created, configured, and then the configuration file is processed. At the very bottom of this function, serp_scrape_init and serp_page_scrape are called. Let's look at them one by one.
The first function, serp_scrape_init, works with the Google Search API. Although it's hard to call it work: we simply make a request to this URL:
https://www.googleapis.com/customsearch/v1?key={options["key"]}&cx={options["cx"]}&q={query}&num=10&start={i * 10 + 1}
It is important to understand that we need to go through all the pages Google returns. Two parameters in the address handle this: num and start. The first controls how many results to return in one request (maximum 10). The second steps through the pages in increments of 10. There are many more query parameters; you can see all of them here. As a result, our function looks like this:
def serp_scrape_init(query: str, options: dict = {}) -> list:
    for i in range(0, options['depth']):
        response = requests.get(f'https://www.googleapis.com/customsearch/v1?key={options["key"]}&cx={options["cx"]}&q={query}&num=10&start={i * 10 + 1}')
        save_to_json(f'./data/temp/{query}_{i*10 + 1}-{i*10 + 10}.json', response.json())
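As a side note, the same request can be built with the params argument of requests.get, which URL-encodes the values for you and keeps nested quotes out of the f-string. Here is a sketch of that variant (it reuses save_to_json from the script and is not the version used in main.py); if you go this route, pass the raw query, because requests encodes it for you and calling quote() first would encode it twice:

def serp_scrape_init_params(query: str, options: dict) -> None:
    # Same requests as above, but with requests building the query string
    for i in range(0, options['depth']):
        response = requests.get(
            'https://www.googleapis.com/customsearch/v1',
            params={
                'key': options['key'],
                'cx': options['cx'],
                'q': query,  # raw, unencoded query
                'num': 10,
                'start': i * 10 + 1,
            },
        )
        save_to_json(f'./data/temp/{query}_{i*10 + 1}-{i*10 + 10}.json', response.json())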
As a result, the function creates JSON files, which are then processed by serp_page_scrape. Let's look at it next.
def serp_page_scrape(query: str, options: dict) -> list:
    data = []
    for i in range(0, options['depth']):
        try:
            with open(f'./data/temp/{query}_{i*10 + 1}-{i*10 + 10}.json', 'r', encoding='utf-8') as file:
                data_temp = json.loads(file.read())
                for item in data_temp['items']:
                    title = None
                    if options['title']:
                        title = item['title']
                    description = None
                    if options['description']:
                        description = item['snippet']
                    url = None
                    if options['url']:
                        url = item['link']

                    data.append({
                        'title': title,
                        'description': description,
                        'url': url,
                    })
        except (FileNotFoundError, KeyError, json.JSONDecodeError):
            # skip pages that were not fetched or contain no items
            continue
    if options['save_to'] == 'json':
        save_to_json(f'./data/serp/{query}.json', data)
    else:
        save_to_exel(f'./data/serp/{query}.xlsx', data)

    return data
Nothing extraordinary: it just opens the previously created JSON files and saves whatever was specified in the configuration file. And that's it, we now have a small Google in the console. Here's an example of usage:
python main.py -q The biggest cats in the world -C config.json
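With the configuration shown earlier (title and url enabled, description disabled), every scraped entry has the same three fields, whether it ends up in a JSON file or as a row of the XLSX table. The values below are purely illustrative, just to show the shape of the data list that serp_page_scrape builds:

# Illustrative only: the shape of the entries collected by serp_page_scrape
data = [
    {
        'title': 'Example page title',
        'description': None,  # None because "description" is false in the example config
        'url': 'https://example.com/some-page',
    },
]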
Here is the full code of the script, the main.py file:
import json
import argparse
import requests
import pandas
from urllib.parse import quote, unquote


def save_to_json(path, data):
    with open(path, 'w', encoding='utf-8') as file:
        json.dump(data, file, indent=2, ensure_ascii=False)

def save_to_exel(path, data):
    frame = pandas.DataFrame({
        'title': [],
        'link': [],
        'description': []
    })
    for indx, entry in enumerate(data):
        frame.at[indx, 'title'] = entry['title']
        frame.at[indx, 'link'] = entry['url']
        frame.at[indx, 'description'] = entry['description']
    frame.to_excel(path, index=False)

def serp_page_scrape(query: str, options: dict) -> list:
    data = []
    for i in range(0, options['depth']):
        try:
            with open(f'./data/temp/{query}_{i*10 + 1}-{i*10 + 10}.json', 'r', encoding='utf-8') as file:
                data_temp = json.loads(file.read())
                for item in data_temp['items']:
                    title = None
                    if options['title']:
                        title = item['title']
                    description = None
                    if options['description']:
                        description = item['snippet']
                    url = None
                    if options['url']:
                        url = item['link']

                    data.append({
                        'title': title,
                        'description': description,
                        'url': url,
                    })
        except (FileNotFoundError, KeyError, json.JSONDecodeError):
            # skip pages that were not fetched or contain no items
            continue
    if options['save_to'] == 'json':
        save_to_json(f'./data/serp/{query}.json', data)
    else:
        save_to_exel(f'./data/serp/{query}.xlsx', data)

    return data

def serp_scrape_init(query: str, options: dict = {}) -> list:
    print(f'Query: {unquote(query)},\nOptions: title={options["title"]} | description={options["description"]} | urls={options["url"]} | depth={options["depth"]} | save to={options["save_to"]}')
    for i in range(0, options['depth']):
        response = requests.get(f'https://www.googleapis.com/customsearch/v1?key={options["key"]}&cx={options["cx"]}&q={query}&num=10&start={i * 10 + 1}')
        save_to_json(f'./data/temp/{query}_{i*10 + 1}-{i*10 + 10}.json', response.json())

def run():
    # This is going to be only in standalone script
    # Get the options and query from CLI
    parser = argparse.ArgumentParser(add_help=True)
    parser.add_argument('-q', type=str, help='Query to parse', metavar='QUERY', required=True, nargs='*')
    parser.add_argument('-C', type=str, help='Path to config, in json format', metavar='CONFIG_FILE', required=True, nargs=1)
    args = parser.parse_args()
    # query
    raw_query = ' '.join(args.q)
    if not raw_query:
        return
    query = quote(raw_query)
    # check the config file
    options = {
        'key': '',
        'cx': '',
        'save_to': '',
        'title': '',
        'description': '',
        'url': '',
        'depth': ''
    }
    with open(args.C[0], 'r') as file:
        data = json.loads(file.read())
    for key in data:
        if options.get(key) is not None:
            options[key] = data[key]
        else:
            print(f'ERROR: Something went wrong in your config file, {key}')
            return False

    # check depth
    if options['depth'] > 10:
        print('WARNING: Google Search API allows a maximum of 100 search results (10 pages)')
        options['depth'] = 10
    serp_scrape_init(query, options)
    serp_page_scrape(query, options)

if __name__ == "__main__":
    run()
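Since everything lives in ordinary module-level functions, you can also drive the parser from another script instead of the command line. A hypothetical wrapper (the file name batch.py and the example queries are my own, not part of this project) might look like this:

# batch.py - assumes it sits next to main.py and config.json
import json
from urllib.parse import quote

from main import serp_scrape_init, serp_page_scrape

with open('config.json', 'r', encoding='utf-8') as file:
    options = json.load(file)

for raw_query in ['the biggest cats in the world', 'the smallest cats in the world']:
    query = quote(raw_query)  # main.py URL-encodes the query the same way
    serp_scrape_init(query, options)
    serp_page_scrape(query, options)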

Conclusion

You know, I originally planned to write the parser on the BeautifulSoup4 + Selenium + Python stack. But after googling a bit... no, I did not find an official tutorial from Google on how to create a legal search results parser. All I got were websites of agencies and companies offering to do the same thing, only for money.
Sure, if you are a large company and need to make 1000 requests per second, the Google Search API can provide higher limits for a small fee, very small compared to what those "unnamed" companies and websites charge. That's how it is. If you want to learn more about the Google Search API, check out their official blog. It is very informative.

