finddatalab.com
10 Tips for Web Scraping
With web scraping quickly becoming more popular as a business intelligence and research tool, it's increasingly important to do it properly. Whether you want to scrape prices from an e-retailer for comparison or build a database of all car crashes in a region along with their coordinates, there are a few things everyone should keep in mind.
Who is FindDataLab?

We're a web scraping service gathering data for companies of any size.
We can help you turn anything into data, be it websites, databases, PDFs, or your grandma's lover's letters from the 70s.

Just starting with your business and not sure if you need a dedicated data scraper?

Don't worry, FindDataLab can provide you with a web scraping script, as well as scale the data scraping solution as your business grows.


The following examples are written in Python 3. Let's dive in!

1. Website scraper time-outs and request randomization

One of the most common answers to the question 'How do I avoid getting blocked while web scraping?' is to scrape 'politely'. This can be done by implementing a simple time-out; a good request time-out is generally around 10 seconds.
A lot of traffic-heavy websites have a robots.txt file, where we can find a specific crawl delay to implement in our website scraper.

If the robots.txt file exists, it can be accessed by appending '/robots.txt' to the domain we're accessing, e.g., 'http://example.com/robots.txt'. If we see 'Crawl-delay: 5', it means that a bot (or a non-malicious web scraper) should wait 5 seconds between requests.
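As a sketch, the crawl delay can also be read programmatically with Python's built-in urllib.robotparser module. The robots.txt body below is a made-up example; in a real script you would point the parser at the live file instead:

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.modified()  # record a parse time; crawl_delay() returns None without it
# in a real script: rp.set_url('http://example.com/robots.txt'); rp.read()
rp.parse([
    'User-agent: *',
    'Crawl-delay: 5',
])
print(rp.crawl_delay('*'))  # 5
```

The returned value can then be passed straight to time.sleep() between requests.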

The sleep function from the Python time module is perfect for simple time-outs like this. It takes an integer or a float representing the number of seconds to delay script execution:
import requests
import time
headers = {
    # headers
}
pages = [
    # url 1
    # url 2
]
for url in pages:
    page = requests.get(url, headers = headers)
    # scraping
    time.sleep(5)
Another way of scraping politely while also appearing more human-like is to send the timed-out requests at a random rate. Requests sent at a regular, unchanging rate can be interpreted as bot activity (because most of the time, they are) and get blocked.

If we think about how we browse the web, requests or clicks are usually spaced out at random intervals: we click on something, read a little, then click on something else. It's practically impossible for a human end-user to send requests at an unchanging interval.

The following code example makes sure that script execution pauses for somewhere between 5 and 10 seconds, with a different time-out on every loop iteration:
import requests
import time
import random
headers = {
    # headers
}
pages = [
    # url 1
    # url 2
]
for url in pages:
    page = requests.get(url, headers = headers)
    # scraping
    time.sleep(random.uniform(5, 10))
Make sure to check out FindDataLab's ultimate guide for ethical web scraping to find out more about time-outs and strategies to not get blocked while scraping.

2. Using custom headers for a web scraper

When thinking about how to web scrape, we usually don't think about what information is sent to the server when we make requests. Depending on the tool we're using to scrape the web, if the 'headers' section is left unmodified, it's fairly easy to catch a bot.

Whenever you're accessing a webpage, two things are happening. First, a Request object is constructed, and second, a Response object is made once the server responds to the initial Request.

Now, we will make a simple request and access both the headers that we sent to the server and headers that the server sent back to us:
import requests
page = requests.get('https://httpbin.org/get')
print(page.request.headers)
The output is:
{'User-Agent': 'python-requests/2.22.0', 'Accept-Encoding': 'gzip, deflate',
 'Accept': '*/*', 'Connection': 'keep-alive'}
This shows that the request was sent by the Python Requests module, which compression algorithms the client accepts, what types of data can be sent back, and what to do with the network connection after the current transaction finishes.

We can also access the headers that the server sent back to us:

print(page.headers)
Output from the previous code:
{'Date': 'Tue, 18 Feb 2020 15:55:09 GMT', 'Content-Type': 'application/json', 
'Content-Length': '51', 'Connection': 'keep-alive', 'Server': 'gunicorn/19.9.0', 
'Access-Control-Allow-Origin': '*', 'Access-Control-Allow-Credentials': 'true'}
A great tool for decoding, as well as creating, custom headers is the Mozilla Developer Network's HTTP headers reference.

In order to not get blocked, we need to change our user-agent to something more typical of a human end-user, which can be done by setting custom headers:

import requests
headers = {
    'User-Agent' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:69.0), john@doe.com'
}
page = requests.get('https://www.google.com/', headers = headers)
print(page.request.headers)
Let's check what we sent to the server:
{'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:69.0), john@doe.com',
 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
As we can see, the sent request contains our new user-agent, with the rest of the headers being sent by default. Of course, we can also set custom values for every single one of the headers.

Another header that might come in handy is the Referer header. It contains a link to the previously visited page, i.e., the page through which the user navigated to the current one.

The following code accesses the Wikipedia page on web scraping with the referrer set to Google.

import requests
headers = {
    'User-Agent' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:69.0), john@doe.com',
    'Referer' : 'https://www.google.com/'
}
page = requests.get('https://en.wikipedia.org/wiki/Web_scraping', headers = headers)
print(page.request.headers)
The output shows the referrer we set previously:
{'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:69.0), john@doe.com',
 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*',
 'Connection': 'keep-alive', 'Referer': 'https://www.google.com/'}
This works for linking between pages most of the time; however, if you're concerned about deep links not working, set the referrer to a 'shallower' link accordingly.

3. IP address rotation

If you're interested in learning how to scrape a website while using multiple IP addresses, look no further. It's a good idea to start with the free proxies available on the web. However, keep in mind that because these proxies are free, a lot of people use them, so they're slow and often fail, either because servers have already blocked them or because they're overloaded. Therefore, use private proxies whenever possible.

Let's look at a brief example of building a simple IP address rotation tool. First, we'll need some example proxy addresses, which we'll get by scraping a web page that provides free IP addresses and corresponding ports for public use. We'll use the Python requests module and BeautifulSoup.
import requests
from bs4 import BeautifulSoup
The following code accesses the free proxy list web page and gets the IP addresses and ports to use for the proxy requests. Using BeautifulSoup, we access the <td> tag, which defines a standard cell in an HTML table. We'll use Python's built-in parser, 'html.parser'. We don't necessarily have to specify which parser to use, but if we don't, BeautifulSoup picks the best parser installed, which can differ from machine to machine and produce inconsistent results.
url = 'https://free-proxy-list.net/'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
By inspecting the page element, we can conclude that every 8th <td> tag, or table cell, contains an IP address. The next line of code takes every 8th cell starting from 0, strips the tags, and makes a list of 20 IP addresses. The find_all function takes a limit argument, which in our case means the web scraper will get 160 <td> tags. Since only every 8th cell contains an IP address, we get 20 IPs (160/8 = 20).
# every 8th element is the IP
ips = [x.text for x in soup.find_all('td', limit = 160)[::8]]
Next, we will get the ports. The same principle applies, only this time we start from the second cell and, again, take every 8th cell into the list.
# omit the 1st element (IP), get all the 8ths
ports = [x.text for x in soup.find_all('td', limit = 160)[1::8]]
Finally, we join each IP address to its corresponding port and return a list of proxy addresses.
proxies = [":".join([a, b]) for a, b in zip(ips, ports)]
This is how the finished function looks:
def new_proxies():
    url = 'https://free-proxy-list.net/'
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    ips = [x.text for x in soup.find_all('td', limit = 160)[::8]]
    ports = [x.text for x in soup.find_all('td', limit = 160)[1::8]]
    proxies = [":".join([a, b]) for a, b in zip(ips, ports)]
    return proxies
Next, we'll check that the requests are actually sent from our newly scraped proxy list. This can be done using httpbin.org's request inspection tool: a request sent to httpbin.org/ip returns the requester's IP address, either as a JSON object or as raw data.

The next code snippet prints the request count (we have a total of 20 proxy addresses) as well as the IP address the request was supposedly sent from.
addresses = new_proxies()
url = 'https://httpbin.org/ip'
for nb, i in enumerate(addresses, 1):
    proxies = {'http': 'http://' + i, 'https': 'http://' + i}
    print("Request Number %d" %nb)
    try:
        page = requests.get(url, proxies = proxies, timeout = 10)
        print(page.json())
    except requests.exceptions.RequestException:
        print('Connection failed. Skipping proxy.')
This is a simple example of implementing IP rotation yourself. Keeping in mind the dynamic nature of websites, make sure to verify the HTML tag positions yourself, since any modification to the source web page could break the script.

A relatively straightforward way of implementing IP rotation is by using ready-made Scrapy middleware. Scrapy is a Python framework developed specifically for web scraping and crawling. A useful Scrapy tool for rotating IP addresses is the scrapy-proxies middleware.

Another way of rotating IP addresses is by using a proxy service. Depending on the purchased plan, you'll get a set number of IP addresses based in a location of your choice, and all your requests will be sent through these IPs. Use elite proxies if you can, as they send the most user-like headers to the server you're trying to access.
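As a minimal sketch of rotating through a fixed pool, assuming you already have your own list of 'ip:port' proxy addresses (the addresses below are placeholders from a documentation range, not working proxies):

```python
import itertools
import requests

# placeholder proxies -- substitute addresses from your own provider
proxy_pool = itertools.cycle(['203.0.113.1:8080', '203.0.113.2:3128'])

def get_via_next_proxy(url):
    # each call routes the request through the next proxy in the cycle
    proxy = next(proxy_pool)
    proxies = {'http': 'http://' + proxy, 'https': 'http://' + proxy}
    return requests.get(url, proxies = proxies, timeout = 10)
```

itertools.cycle loops over the pool forever, so requests are spread evenly across all proxies without any bookkeeping.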

4. Web scrape while rotating user-agents
We can implement user-agent rotation either by modifying the headers manually or by writing a function that renews the user-agent list every time the web scraping script starts. This can be implemented similarly to the previous function for retrieving IP addresses. There are many sites where you can get a variety of user-agent strings.
Make sure to keep your user-agent strings updated, either manually or by automating the process. Since new browser releases are becoming more frequent, it's easier than ever for servers to detect an outdated user-agent string and block requests from it.
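A minimal sketch of the manual approach, with a hand-maintained pool of user-agent strings (the strings below are illustrative examples and will go stale; refresh them regularly):

```python
import random
import requests

# illustrative user-agent strings -- keep this list updated
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) '
    'Gecko/20100101 Firefox/115.0',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
]

def fetch(url):
    # pick a different user-agent for each request
    headers = {'User-Agent' : random.choice(USER_AGENTS)}
    return requests.get(url, headers = headers, timeout = 10)
```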
5. Synchronous vs asynchronous web scrapers
Synchronous web scraping means that we scrape only one page at a time and start scraping the next one only when the first has finished processing. Here we have to keep in mind that the biggest time sink is the network request.

Most of our time is spent waiting for the web server to respond to the request we sent and return the content of the page. During this downtime, the computer is effectively not doing any work. There are far better things the machine could do in that time, two of which are sending new requests and processing received data.

There are many libraries that make asynchronous request sending easy and fast, for example Python's grequests and aiohttp, as well as the concurrent.futures package and requests combined with threading from the standard library. Scrapy's default behaviour is to schedule and process requests asynchronously, so if you want to start and scale your project quickly, this framework could be an excellent tool for the job.
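As a sketch of the standard-library route, concurrent.futures can send several requests in parallel threads so the waiting overlaps (the httpbin URLs below are just test endpoints):

```python
import concurrent.futures
import requests

def fetch(url):
    # one blocking request; many of these run in parallel threads
    try:
        return url, requests.get(url, timeout = 10).status_code
    except requests.exceptions.RequestException:
        return url, None

urls = ['https://httpbin.org/get', 'https://httpbin.org/headers']

with concurrent.futures.ThreadPoolExecutor(max_workers = 5) as pool:
    # map() dispatches fetch() across the pool and yields results in order
    for url, status in pool.map(fetch, urls):
        print(url, status)
```

While one thread waits on the network, the others keep sending requests, which is exactly the downtime the synchronous version wastes.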
6. Selecting objects in a web page
Depending on the tool you're using, there are a few ways of selecting objects on a web page. If you've inspected the site you're trying to scrape, you're probably familiar with the structure of tags and how they're nested. Altogether there are two main ways of selecting page objects: CSS selectors and XPath navigation.

One of the main drawbacks of using XPath locators is that XPath engines differ between browsers, making the method inconsistent. Also, depending on how you map the location, the XPath may change unpredictably with every update to the web page, even if you're using the same browser for scraping.

It is generally agreed that the more robust way of finding objects on a web page is with CSS selectors. It's usually the easier approach, since applications are built with CSS; it will be easier to write the code, to talk about it with others, and to have others help you maintain the script.

However, if you're looking to walk up the page (from a child element to its parent) instead of just traversing the Document Object Model downwards (from parent to child), then XPath could be a better option.
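For a concrete comparison, here's a CSS selector pulling a value out of a small made-up HTML snippet with BeautifulSoup; the equivalent XPath expression is shown in a comment:

```python
from bs4 import BeautifulSoup

html = '''
<div class="product">
    <span class="price">19.99</span>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')
# CSS selector -- the equivalent XPath would be
# //div[@class="product"]/span[@class="price"]
price = soup.select_one('div.product span.price').text
print(price)  # 19.99
```

The CSS version reads the same way a stylesheet rule would, which is a large part of why it's easier to maintain.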

7. Avoiding 'honey pots'
A 'honey pot' is an HTML link that's not visible to the user in a browser but can be accessed by a web scraper. One of the simplest ways to create one is by setting the CSS property display: none on the link. If you try to scrape website data by sending a request to one of these hidden links, the web server can detect activity that a human end-user couldn't perform and block the scraper accordingly.

A simple way of bypassing this is by using Selenium's is_displayed() method. Since Selenium renders the web page, it can check whether the element is actually visible.

In any case, make sure to inspect the website you're trying to scrape carefully. Even though you don't want to visit hidden links, there might be pre-populated hidden forms that need to be sent together with your request. It's essential to look at the web page as a whole and not simply ignore all hidden fields.
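If you're parsing static HTML rather than rendering it with Selenium, a rough first pass is to skip links hidden with an inline style. This is a sketch over a made-up snippet, and it only catches the simplest case; styles applied from a stylesheet or by JavaScript won't be detected this way:

```python
from bs4 import BeautifulSoup

html = '''
<a href="/catalogue">Catalogue</a>
<a href="/trap" style="display: none">Hidden trap</a>
'''

soup = BeautifulSoup(html, 'html.parser')
# keep only links that are not hidden with an inline display: none
visible = [a['href'] for a in soup.find_all('a')
           if 'display: none' not in a.get('style', '')]
print(visible)  # ['/catalogue']
```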
8. Solving CAPTCHAs
An efficient way of blocking basic scripts is by tracking 'user' behaviour and saving it into cookies. If the 'user' behaviour seems suspicious, a CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart) is served for the user to solve. This way a potentially real user will not be blocked, but most web scrapers will be, since the test is based on the assumption that only a human can pass it. If you want to scrape data from websites that are known to test their users with CAPTCHAs, make sure to implement a solving strategy right away.

Depending on the type of CAPTCHA, the easiest way of solving it is with an open-source Optical Character Recognition (OCR) engine. Some of the most popular tools are Tesseract, GOCR, and Ocrad.

Tesseract is a command-line tool, but it has many wrappers and can be easily integrated into a web scraping script. If we're writing a Python web scraping script, we can use pytesseract to read the CAPTCHA images and get text output.

In case you want to take the DIY route, there's the possibility of using machine learning to train your own neural net to solve CAPTCHAs for a specific web page. This can get very time-consuming but will surely be a rewarding learning experience.

If you're stumped by CAPTCHAs, the OCR engines don't work, and your web scraper is stuck, you can always hire a CAPTCHA-solving service: either a purely human-based one, or one that combines automated CAPTCHA solving with human solutions for the more advanced problems.

9. Dealing with cookies
A cookie is a mechanism with which the web server remembers the HTTP state of a user's browsing session. In simple terms, it tracks user activity and remembers the language and other settings the user selected on previous visits. For example, if you're shopping online and add items to your cart, you expect the items to still be there when you go to checkout. A session cookie is what enables the web page to do this.

For web scraping a typical example of cookie usage would be to maintain a logged-in state if you need to scrape information that's password protected.

If you're wondering how to scrape a website with persisting cookies, wonder no more. One way of having parameters and cookies persist across requests is by using the Python requests module's Session object. A Session can also speed up web scraping when you're accessing the same website repeatedly, because it reuses the existing Transmission Control Protocol (TCP) connection instead of opening a new one for every request, thus saving time.

To test the HTTP requests we're sending to a webpage, we can use httpbin.org, which is 'a simple HTTP Request & Response Service'.
import requests
s = requests.Session()
# this is a cookie pre-set for testing purposes
s.get('http://httpbin.org/cookies/set/sessioncookie/1234')
# next request
page = s.get('http://httpbin.org/cookies')
print(page.request.headers)
# {'User-Agent': 'python-requests/2.22.0', 'Accept-Encoding': 'gzip, deflate',
#  'Accept': '*/*', 'Connection': 'keep-alive', 'Cookie': 'sessioncookie=1234'}
The output {'Cookie':'sessioncookie=1234'} shows us that the cookie that we retrieved from the previous page persisted when making the next request.

What activity will trigger a CAPTCHA, you might ask? It could be anything from using too many devices to log into one account to not rendering the site. This is why it's crucial to think about how humans browse the web and to simulate that behaviour in our web scraper.

10. Using Google's cached web content
If you're wondering how to web scrape without accessing the site's servers, wonder no more. Depending on how fresh you need the data to be, you can use Google's cached web content for undetected scraping. When Google indexes the web, it effectively scrapes it as well, and luckily for us, it also gives access to the cached version of the content.

Sites that are updated frequently will most likely also be crawled by Google more regularly, so the cached content can sometimes be only a few hours old. Scraping cached content is a great way to avoid overloading the web page's servers, since you're making requests to Google instead. You won't get bleeding-edge data, but you also won't have to worry about any security measures the actual server might have implemented.

Using cached web content to scrape website data can be done by Googling the web page you want to scrape and clicking on the green arrow next to the site's link. Select 'Cached' from the drop-down menu, and you'll be redirected to the Google user content web cache. You'll see a date signifying when the last snapshot was made (how old the information is), and you'll also have access to the text-only version, which makes scraping exceptionally simple.
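The same cached copy can also be requested directly in a script: cached pages live under webcache.googleusercontent.com, with the original URL appended after a 'cache:' prefix. A sketch, with example.com as a placeholder target:

```python
# build the Google web-cache URL for a page (target is a placeholder)
target = 'https://example.com/'
cache_url = 'https://webcache.googleusercontent.com/search?q=cache:' + target
print(cache_url)
```

The resulting URL can then be fetched with requests like any other page, keeping the load on Google's servers rather than the target site's.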