ImageThief

About image parser

Common

This is a web scraping tool that searches for and downloads all images from a site. It works in three modes. In single-page mode, it searches for and downloads images only from the specified page. In multi-page mode, it parses the list of pages you provide. Finally, the third mode looks for images across the entire site and downloads them where possible. Although you can't stop a scrape, you can close the tab and later continue from the last link: just enter the same address and mode and click the Start button.
Scraping runs in a single thread and rotates user agents and proxies. Both are picked at random using weights. That is, the more and the longer you scrape a site, the better and faster the scraper selects the most effective proxies and user agents.
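To make the weighted selection more concrete, here is a minimal sketch in Python. It is not the actual ImageThief code: the `WeightedPool` class, its method names, and the reward and penalty values are all assumptions made for illustration. It only shows the general idea that an item that worked gets a higher weight and is therefore picked more often.

```python
import random


class WeightedPool:
    """Hypothetical helper: pick items at random, favouring ones that worked."""

    def __init__(self, items, initial_weight=1.0, min_weight=0.1):
        # Every proxy or user agent starts with the same weight (assumed value).
        self.weights = {item: initial_weight for item in items}
        self.min_weight = min_weight

    def pick(self, rng=random):
        # random.choices draws one item with probability proportional to its weight.
        items = list(self.weights)
        weights = [self.weights[item] for item in items]
        return rng.choices(items, weights=weights, k=1)[0]

    def report(self, item, ok):
        # Assumed update rule: reward a success, halve the weight on a failure,
        # but never go below min_weight so the item can still be retried.
        if ok:
            self.weights[item] += 1.0
        else:
            self.weights[item] = max(self.min_weight, self.weights[item] * 0.5)
```

In use, the scraper would call `pick()` before each request and `report(item, ok)` after it, so the weights drift toward the proxies and user agents that succeed on that particular site.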
To save space on the server, I delete all collected parsing results every day at 0:00 Moscow time.
This tool comes in two variants: a Django application and a standalone CLI tool. An important note: I constantly update and improve the Django application, but not the CLI version. Keep this in mind. Here is a link to the Django app. Here is a link to the script.

About proxies

This tool supports proxies. Only public ones for now, but still. Here is an example of a file with proxies. It works with the http, https, socks4 and socks5 proxy protocols. Also, because the ProxyChecker tool is not ready yet, automatic generation and selection of proxies for a specific site is not available.
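As a rough illustration of filtering a proxy list by protocol, here is a small Python sketch. The file format is an assumption: one proxy per line in the form `scheme://host:port`, with blank lines and `#` comments ignored. The real example file may look different, and `parse_proxy_lines` is a hypothetical name, not part of ImageThief.

```python
from urllib.parse import urlsplit

# The four protocols named in the text above.
SUPPORTED_SCHEMES = {"http", "https", "socks4", "socks5"}


def parse_proxy_lines(lines):
    """Return (scheme, host, port) tuples for valid lines and skip the rest.

    Assumes one 'scheme://host:port' entry per line (hypothetical format).
    """
    proxies = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment
        parts = urlsplit(line)
        try:
            port = parts.port  # raises ValueError for a malformed or out-of-range port
        except ValueError:
            continue
        if parts.scheme in SUPPORTED_SCHEMES and parts.hostname and port:
            proxies.append((parts.scheme, parts.hostname, port))
    return proxies
```

Lines with an unsupported scheme or a bad port are silently dropped here; a real tool might prefer to log them.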

Limitations and disclaimer

This tool has several limitations when scraping. It does not scrape SVG files, and it does not scrape background images specified in styles. A dynamic scraping mode is not implemented yet, but it will be soon. This web tool is absolutely free. The only thing I ask is that you add it to your bookmarks or share a link to it. Thank you.
Also, the author of this tool bears no responsibility for what visitors scrape. It was created solely to save the time and nerves of those who simply need to collect all the images from a site.
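The limitations listed in this section can be pictured with a short Python sketch of static image extraction. This is not the tool's real parser: `ImgCollector` and `collect_image_urls` are hypothetical names, and the sketch assumes that only the `src` attribute of `<img>` tags is read. Under that assumption, images set through CSS `background-image` are never seen, and `.svg` files are skipped on purpose.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class ImgCollector(HTMLParser):
    """Hypothetical collector: gathers absolute URLs from <img src=...> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return  # style attributes and stylesheets are never inspected
        src = dict(attrs).get("src")
        if not src:
            return
        url = urljoin(self.base_url, src)  # resolve relative links
        if urlsplit(url).path.lower().endswith(".svg"):
            return  # SVG files are skipped
        if url not in self.urls:
            self.urls.append(url)  # keep first occurrence, drop duplicates


def collect_image_urls(html, base_url):
    parser = ImgCollector(base_url)
    parser.feed(html)
    return parser.urls
```

Because the sketch works on the HTML as delivered by the server, images that a page inserts later with JavaScript would also be missed, which is what a dynamic scraping mode would have to cover.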

Notes about this tool (devnotes)

Cleaning up the ImageThief tool

17.11.2024
Successfully migrated the .09 version of ImageThief to the server, with some major changes. Removed the ability to stop scraping and replaced Process-based concurrency with Thread-based concurrency. Also replaced several timers. More to come.

Working on proxy support for ImageThief tool

20.11.2024
Today I worked on ImageThief. I was busy with the layout and with preparing the backend to work with proxies. And spoiler, I did everything right. I probably could have done more, but I was too lazy. By the end of this year, I plan to finish ImageThief and add two smaller tools, ProxyChecker and ProxyParser.

Published the 9th version of ImageThief

21.11.2024
Now proxies are available for use. Let me be honest, the implementation of this feature leaves much to be desired, but as I usually say, first make "it" work, then make "it" work better. Or something like that.
