DatasetScraper

Tool to create image datasets for machine learning problems by scraping search engines like Google, Bing and Baidu.

Features:

  • Search engine support: Google, Bing, Baidu. In development: Yahoo, Yandex, DuckDuckGo
  • Image format support: jpg, png, svg, gif, jpeg
  • Fast, multiprocessing-enabled scraper
  • Very fast multithreaded downloader
  • Post-download verification that each downloaded file is a valid image
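
The verification step mentioned above could look roughly like the following. This is a minimal sketch, not the check DatasetScraper actually performs: the function name sniff_image_type is hypothetical, and the approach (inspecting the leading bytes of each file) is an assumption chosen because it needs only the standard library.

```python
def sniff_image_type(data: bytes):
    """Hypothetical helper: return a format name if `data` starts like a
    known image file, otherwise None."""
    if data.startswith(b"\xff\xd8\xff"):        # JPEG start-of-image marker
        return "jpeg"
    if data.startswith(b"\x89PNG\r\n\x1a\n"):   # 8-byte PNG signature
        return "png"
    if data[:6] in (b"GIF87a", b"GIF89a"):      # GIF version headers
        return "gif"
    # SVG is XML text, so look for an <svg tag near the start of the file.
    head = data[:1024].lstrip().lower()
    if head.startswith((b"<?xml", b"<svg")) and b"<svg" in head:
        return "svg"
    return None


# An HTML error page saved with a .jpg extension is rejected.
print(sniff_image_type(b"<html><body>404</body></html>"))
```

A downloader could call such a check on every saved file and delete the ones that return None, which is one way a dataset ends up smaller than the number of URLs fetched.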

Installation

  • Coming soon on PyPI

Usage:

  • Import: from datasetscraper import Scraper

  • Defaults

obj = Scraper()
urls = obj.fetch_urls('kiniro mosaic')
obj.download(urls, directory='kiniro_mosaic/')

  • Specify a search engine

obj = Scraper()
urls = obj.fetch_urls('kiniro mosaic', engine=['google'])
obj.download(urls, directory='kiniro_mosaic/')

  • Specify a list of search engines

obj = Scraper()
urls = obj.fetch_urls('kiniro mosaic', engine=['google', 'bing'])
obj.download(urls, directory='kiniro_mosaic/')

  • Specify the maximum number of images (default is 200)

obj = Scraper()
urls = obj.fetch_urls('kiniro mosaic', engine=['google', 'bing'], maxlist=[500, 300])
obj.download(urls, directory='kiniro_mosaic/')
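
The last example passes engine and maxlist as two parallel lists. A plausible reading, which this README does not confirm, is that each maxlist entry applies to the engine at the same position. The helper below is hypothetical (pair_limits is not part of datasetscraper) and only sketches that pairing, falling back to the stated default of 200 when a limit is missing.

```python
DEFAULT_MAX = 200  # default limit stated in the usage notes


def pair_limits(engines, maxlist=None):
    """Hypothetical helper: map each engine to its image limit by position."""
    maxlist = list(maxlist or [])
    if len(maxlist) > len(engines):
        raise ValueError("maxlist has more entries than engine")
    # Assumption: engines without an explicit limit get the default.
    padded = maxlist + [DEFAULT_MAX] * (len(engines) - len(maxlist))
    return dict(zip(engines, padded))


print(pair_limits(['google', 'bing'], [500, 300]))
```

Under this reading, keeping both lists the same length avoids any ambiguity about which limit belongs to which engine.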

FAQs

  • Why aren't Yandex, Yahoo, DuckDuckGo and other search engines supported? They are hard to scrape. I am working on them and will update as soon as I can.

  • I set maxlist=[500]. Why are fewer than 500 images downloaded? There can be several reasons for this:

    • Search ran out: This happens very often. Google or Bing might not have enough images for your query.
    • Slow internet: Increase the timeout (default is 60 seconds) as follows: obj.download(urls, directory='kiniro_mosaic/', timeout=100)
  • How to debug? You can change the logging level when creating the Scraper object: obj = Scraper(logger.INFO)
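
On the question of getting fewer images than requested: with a multithreaded downloader, every URL that times out or fails simply drops out of the final set. The sketch below illustrates that effect under stated assumptions. It is not DatasetScraper's implementation; download_all and fake_fetch are hypothetical names, and the fetch function is a stand-in so the example runs without a network.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def download_all(urls, fetch, workers=8):
    """Run fetch(url) in a thread pool and return (successes, failures)."""
    ok, failed = {}, {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(fetch, url): url for url in urls}
        for fut in as_completed(futures):
            url = futures[fut]
            try:
                ok[url] = fut.result()
            except Exception as exc:  # timeouts, HTTP errors, dead links
                failed[url] = exc
    return ok, failed


def fake_fetch(url):
    # Stand-in for a real HTTP request with a timeout.
    if "slow" in url:
        raise TimeoutError("timed out")
    return b"image-bytes"


ok, failed = download_all(["a.jpg", "slow.png", "b.gif"], fake_fetch)
print(len(ok), "downloaded,", len(failed), "failed")
```

In a real run, fetch would perform the HTTP request with the chosen timeout, so raising the timeout moves slow URLs from the failed group into the downloaded one.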

TODO:

  • More search engines
  • Better debugging
  • Write documentation
  • Text data? Audio data?

Download files

Source Distribution

datasetscraper-0.0.4.tar.gz (5.8 kB)

Uploaded Source

Built Distribution

datasetscraper-0.0.4-py3-none-any.whl (13.9 kB)

Uploaded Python 3
