
Chrome Web Scraper Tutorial From Semalt

Web scraping has become an indispensable tool for marketing and business in virtually all industries. The competition in the corporate world has snowballed into a real war. The importance of having regular access to data cannot be over-emphasized.

However, few people know that they can turn their web browser into a capable web scraping tool. All you have to do is install a web scraper extension from the Chrome Web Store. Once it is installed, your browser can scrape a site while you work. It does not require much technical skill; just follow the steps outlined below to get started:

Introduction to Web Scraper Extension

Web Scraper is an extension for the Chrome browser built for extracting data from websites. During setup, you give it instructions on how to navigate a source website and specify the data you need to scrape. The tool then follows those instructions to extract the required data, which you can also export to CSV. In addition, the extension can scrape several web pages simultaneously, as well as scrape data from pages that load their content with Ajax and JavaScript.

Requirements

  • Internet connection
  • Google Chrome installed

Setup Instructions

  • Open the following link: https://chrome.google.com/webstore/detail/web-scraper/jnhgnonknehpejjnehehllkliplmbmhn?hl=en
  • Add the extension to Chrome
  • Setup is now complete

How to Use the Tool

Open Chrome developer tools by right-clicking anywhere on the page and selecting 'Inspect'. A quicker way is to press F12 once Chrome is open. You will find a new tab labeled 'Web Scraper' among the other developer tools tabs.

Note that we used www.awesomegifs.com as an example for this tutorial. This is because the site has numerous gif images that can be scraped using this tool.

The first step is to create a sitemap:

  • Go to awesomegifs.com
  • Open developer tools by right-clicking on the page and then selecting 'Inspect'
  • Select the 'Web Scraper' tab
  • Go to 'Create new sitemap' and click 'Create sitemap'
  • Name your sitemap and enter the URL of the site in the Start URL field
  • Click on 'Create Sitemap'

You must understand the pagination structure of the site to be able to scrape multiple pages. Click the 'Next' button several times from the homepage to see how the page URLs are structured. On awesomegifs.com, page 1 adds /page/1/ to the URL, page 2 adds /page/2/ (as in http://awesomegifs.com/page/2/), and so on.

This means only the number at the end of the URL changes from page to page, and you can make the scraper step through it automatically. Assuming that the site has 125 pages, you can create a new sitemap with this start URL: http://awesomegifs.com/page/[1-125]. With this URL, the scraper will scrape images from page 1 to page 125.
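To make the range notation concrete, here is a minimal Python sketch of what such a start URL stands for. It is not the extension's own code: the function name expand_range_url is hypothetical, and the zero-padding rule (a start value such as 001 produces padded page numbers) is an assumption of this sketch rather than confirmed behavior of the tool.

```python
import re
from typing import List

# Matches a bracketed numeric range such as [1-125] or [001-125].
RANGE_PATTERN = re.compile(r"\[(\d+)-(\d+)\]")


def expand_range_url(pattern: str) -> List[str]:
    """Expand the first [start-end] range in a URL into one URL per page."""
    match = RANGE_PATTERN.search(pattern)
    if match is None:
        return [pattern]  # no range present, so the pattern is a single URL
    start_text, end_text = match.groups()
    # Assumption: a start value with a leading zero (e.g. 001) asks for zero padding.
    padded = len(start_text) > 1 and start_text.startswith("0")
    width = len(start_text) if padded else 0
    urls = []
    for page in range(int(start_text), int(end_text) + 1):
        number = str(page).zfill(width)
        urls.append(pattern[:match.start()] + number + pattern[match.end():])
    return urls


if __name__ == "__main__":
    pages = expand_range_url("http://awesomegifs.com/page/[1-125]")
    print(len(pages))   # number of page URLs generated
    print(pages[0])     # first page URL
    print(pages[-1])    # last page URL
```

Listing the URLs this way is a quick check that the range covers exactly the pages you expect before you start a long scrape.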

Scraping Elements

Elements have to be scraped from each page of the site. For this site, the elements are GIF image URLs. You should start by finding the CSS selector that matches the images. You can find it by looking at the source of the web page, or by letting the extension's selector tool build it for you:

  • Use the selector tool to click any element on the screen
  • Click on the newly created sitemap
  • Click on 'Add new selector'
  • Name the selector in the selector id field
  • Specify the type of data you want to scrape in the type field
  • Click on the select button and select the required elements on the web page
  • Click on 'Done selecting'

Finally, if the element you want to scrape appears multiple times on a web page, you should check the 'multiple' checkbox, so that the tool can scrape each of them.
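As a rough illustration of what selecting every matching element means, the sketch below collects the src attribute of every img tag in a page fragment using only Python's standard library. The sample markup and the names ImageSourceCollector and collect_image_sources are hypothetical; the real structure of awesomegifs.com may differ, and the extension does this work for you through its selector.

```python
from html.parser import HTMLParser
from typing import List


class ImageSourceCollector(HTMLParser):
    """Collect the src attribute of every <img> tag that is parsed."""

    def __init__(self) -> None:
        super().__init__()
        self.sources: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        for name, value in attrs:
            if name == "src" and value:
                self.sources.append(value)


def collect_image_sources(html: str) -> List[str]:
    """Return all image URLs in the given HTML, in document order."""
    parser = ImageSourceCollector()
    parser.feed(html)
    parser.close()
    return parser.sources


# Hypothetical page fragment, used only for illustration.
SAMPLE_HTML = """
<div class="post"><img src="http://awesomegifs.com/a.gif"></div>
<div class="post"><img src="http://awesomegifs.com/b.gif"></div>
"""

if __name__ == "__main__":
    for url in collect_image_sources(SAMPLE_HTML):
        print(url)
```

Without the 'multiple' option, only one of these matches would be kept per page; with it, every match is recorded.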

Now you can save the selector. To start scraping, select the sitemap tab and click 'Scrape'. A new window will pop up. You can stop the process early by closing that window; you will still get the data that has already been scraped up to that point.

After scraping, you can either browse the extracted data or export it to a CSV file from the sitemap menu. Unfortunately, this process cannot be automated. You'll have to carry it out manually every time. Also, scraping a very large amount of data may call for a dedicated data scraping service, as a browser extension may not cope well with that volume.
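Once the CSV file has been exported by hand, the follow-up processing can still be scripted. The sketch below reads an export and returns the unique GIF URLs in it. The column names "page-url" and "gif-src" and the sample rows are assumptions made for illustration; check the header row of your own export and adjust the column name accordingly.

```python
import csv
import io
from typing import List, TextIO

# Hypothetical export contents; real column names depend on your selector ids.
SAMPLE_CSV = "\n".join([
    "page-url,gif-src",
    "http://awesomegifs.com/page/1,http://awesomegifs.com/a.gif",
    "http://awesomegifs.com/page/1,http://awesomegifs.com/b.gif",
    "http://awesomegifs.com/page/2,http://awesomegifs.com/a.gif",
    "http://awesomegifs.com/page/2,",
])


def unique_gif_urls(csv_file: TextIO, column: str = "gif-src") -> List[str]:
    """Return the distinct .gif URLs found in one column, in first-seen order."""
    reader = csv.DictReader(csv_file)
    seen = set()
    urls = []
    for row in reader:
        url = (row.get(column) or "").strip()
        # Skip empty cells, non-GIF values, and URLs already collected.
        if url.lower().endswith(".gif") and url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


if __name__ == "__main__":
    # With a real export, open the file instead:
    #   with open("export.csv", newline="", encoding="utf-8") as f: ...
    for url in unique_gif_urls(io.StringIO(SAMPLE_CSV)):
        print(url)
```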