The Ultimate Guide To Ethical
Web Scraping

Web Scraping Ethics - How To Not Get Blocked.
Big data collection for analytics has become increasingly important for both businesses and research. Web scraping, as one way of achieving it, has arrived at the forefront of automated data collection, and with good reason. A web scraper is one of the best tools for gathering a large amount of data quickly, letting us analyze events and gain insight into them almost instantaneously. This ultimate guide to ethical web scraping covers most of the techniques you need to adopt when your web scraping gets blocked. If you need more help with your project or further consultation on how to prevent web scraping blocking, please feel free to contact our data team.


Why is ethical data scraping necessary?
Scraping a single page is pretty straightforward. Problems tend to arise when we want to scrape data from a website and collect a large amount of information in a short amount of time. While we can write a crude script that will scrape everything in a fraction of a second, it will most likely be the last time we get to access the web page in question.

This is where ethical data scraping comes in handy. If we respect the fact that a web page has finite resources at its disposal and scrape mindfully, we will most likely not get blocked when web scraping. However, if we want to save ourselves the headache, it's worth looking into web scraping services.

5 things to keep in mind when performing internet scraping:

  1. Taking note of terms of use and robots.txt
  2. Using an API
  3. Identifying yourself when sending requests
  4. Time-outs and responsive delays
  5. Simulating a real-world user

1. Taking note of terms of use and robots.txt in web scraping ethics

No matter what our intentions are, before starting anything pertaining to web crawling or internet scraping we need to read the website's Terms and Conditions. It's important to find out if the data is explicitly copyrighted or if any other restrictions are in place. If in doubt, we can always contact the webmaster of a site and ask them directly.

Most popular and traffic-heavy websites have a robots.txt file. This file contains a list of instructions for robots or web-crawlers - mostly search engines and polite web scrapers.
The robots.txt file can be accessed by appending "/robots.txt" at the end of a website's URL, e.g. "https://example.com/robots.txt":

User-agent: *
Disallow: /error/
Disallow: /browse/
Crawl-delay: 5

Sitemap: https://finddatalab.com/sitemap_index.xml
It's possible to create rules for multiple user-agent groups and their respective members, but mostly we will see "User-agent: * ", which means that the corresponding rules apply to all crawlers.

The "disallow" directive specifies site pages or paths that shouldn't be crawled.

Most importantly, robots.txt files often specify a crawl-delay that should be taken into consideration when sending requests to a web page. The crawl rate describes how frequently requests are sent to a web page, and the crawl-delay directive defines the number of seconds that should pass between consecutive requests. This is discussed further in "Time-outs and responsive delays".
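
If we want to check these rules programmatically before scraping, Python's standard library includes a robots.txt parser. Here's a minimal sketch that uses the example rules above and a placeholder domain:
from urllib import robotparser

# Download and parse the site's robots.txt (placeholder domain)
rp = robotparser.RobotFileParser()
rp.set_url('https://example.com/robots.txt')
rp.read()

# Check whether a given path may be crawled by any user-agent ('*')
print(rp.can_fetch('*', 'https://example.com/browse/page'))  # False, /browse/ is disallowed above
print(rp.can_fetch('*', 'https://example.com/products'))     # True, no rule forbids it

# Read the crawl-delay directive, if one is set (Python 3.6+)
print(rp.crawl_delay('*'))  # 5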


2. Using a web scraping API when web scraping is blocked

Some websites provide their users with an API (Application Programming Interface). An API makes it possible for developers to use the web page's data or functionalities for their own projects. However, it may not always work as a data-gathering solution. Nevertheless, it's worth looking into, especially if we plan on working with some more significant sites, like Twitter or Facebook.

Using an API for gathering data means that we're explicitly following the web page's rules. But what are the downsides? Generally, internet scraping is a more reliable way of gathering data, since an API needs to be developed and continuously maintained, and that is rarely the website's top priority.

Sometimes, if it hasn't been updated, an API could even provide us with out-of-date information and depending on the API's state of development, the data might be unstructured or not include the things we're looking for.
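
For illustration, this is roughly what requesting data through a JSON API looks like with the python Requests module; the endpoint, parameters, and token below are hypothetical and would be replaced by whatever the site's API documentation specifies:
import requests

# Hypothetical API endpoint and token; a real API documents its own
# URL structure, authentication scheme, and rate limits.
response = requests.get(
    'https://api.example.com/v1/products',
    params={'page': 1, 'per_page': 100},
    headers={'Authorization': 'Bearer YOUR_API_TOKEN'},
)
response.raise_for_status()  # stop early on an error status code
data = response.json()       # structured data, no HTML parsing needed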



3. Identifying yourself when sending requests

A user-agent is an HTTP header of a request that we're sending to a web page. This header contains information about the operating system and browser that we're using while visiting a web page. This information can be used to either tailor the end-user experience to a specific browser or identify the user.

By default, data scraping scripts have a user-agent identifier that can be easily distinguished from that of a human user, since a script does not use a browser.

Let's look at a user-agent web scraping example that a browser sends to a web page:
Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) 
Chrome/79.0.3945.88 Safari/537.36
One of the things that this user-agent string tells us (and the sysadmin) is that the web page is being accessed by a Linux operating system via a Chrome browser.

For comparison's sake, this is what the python Requests module sends by default:
{
    'User-Agent': 'python-requests/2.22.0', 
    'Accept-Encoding': 'gzip, deflate', 
    'Accept': '*/*', 'Connection': 'keep-alive'
}
This user-agent is quite different and easy to spot in the system's log files, and some servers might even have an automatic user-agent web scraping ban in place for certain users, such as "python-requests".
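
To see exactly which identifier our script announces, we can echo the request back; httpbin.org is used here purely as a convenient public test endpoint:
import requests

# httpbin.org returns the user-agent it received, so we can confirm what
# identifier our script is sending (the version depends on the installed
# Requests release).
print(requests.get('https://httpbin.org/user-agent').json())
# e.g. {'user-agent': 'python-requests/2.22.0'}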

In order to maintain transparency, we can provide our contact information in the scraper's user-agent string so that an admin of the target website can contact us if they notice our activity in their logs (assuming, of course, that the default user-agent hasn't already been blocked automatically). To bypass such an automatic ban, we can set a custom header for our request. Here's an example of creating a custom header with the python Requests module:
import requests

headers = {'User-Agent': 'This is my web scraping script; Contact me at [email protected]'}
page = requests.get('http://example.com', headers = headers)
This header very plainly announces that this is an internet scraping script and provides the contact information.

A better solution would be to opt for something in between, providing the start of a standard user-agent string, as well as our contact information:
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64); John Doe/[email protected]'}
Custom headers are commonly used to make it appear as if the request is coming from a human user, not a web scraping program, since, as mentioned previously, the system might have automatic user-agent web scraping bans in place.

In this case, the user-agent is being imitated completely:
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) '
                         'AppleWebKit/537.36 (KHTML, like Gecko) '
                         'Chrome/79.0.3945.88 Safari/537.36'}

4. Time-outs and responsive delays in web scraping ethics

Depending on the amount of bandwidth a website has, we need to be mindful of not overloading their server with our requests.

Multiple, fast-paced requests that are coming from the same IP address and the same user-agent will alert the system administrator that potentially unwanted actions are taking place. This will most likely result in a ban.

The simplest way to gather data without overloading a page's server is by setting time-outs.

Time-outs

This is where we return to the robots.txt file. As mentioned before, a comprehensive robots.txt page will include a crawl-delay directive which specifies how many seconds must pass between requests that a web scraper sends to a page.

Let's look at an example:
Crawl-delay: 5
This means that a crawler/web scraper should wait 5 seconds between requests.

When using the Python Scrapy framework (version 1.1+), newly generated projects respect the robots.txt file by default through the ROBOTSTXT_OBEY setting, which takes care of the restricted pages; the crawl delay still needs to be configured separately, for example with Scrapy's DOWNLOAD_DELAY setting.
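
For reference, the relevant lines in a Scrapy project's settings.py look roughly like this (a sketch; the delay value is just an example matching the robots.txt above):
# settings.py of a Scrapy project (sketch)

# Respect robots.txt allow/disallow rules (enabled by default in projects
# generated with Scrapy 1.1+)
ROBOTSTXT_OBEY = True

# Wait 5 seconds between requests to the same website
DOWNLOAD_DELAY = 5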

In other instances, we'll need to set the crawl delay manually; for example, it can be done with the Python time module:
import requests
import time

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64); John Doe/[email protected]',
    'Referer': 'https://finddatalab.com/'
}

urls = ['https://finddatalab.com/brand-review-price-tracking-and-monitoring', 'https://finddatalab.com/web-scraping-legal']

for n in urls:
    page = requests.get(n, headers=headers)
    # some code
    time.sleep(5)  # this is the time-out
Another approach to setting crawl delays involves being mindful of how long the site took to respond to the request, since a web page's response time increases as more requests are sent to it.

In this case, we set a crawl delay that is proportional to how long it took for the site to load. Specifically, the time-out will be 2 times longer than the time it took to load the page:
import requests
import time

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64); John Doe/[email protected]',
    'Referer': 'https://finddatalab.com'
}

urls = ['https://finddatalab.com/web-scraping-legal', 'https://finddatalab.com/brand-review-price-tracking-and-monitoring']

for n in urls:
    start = time.time()
    page = requests.get(n, headers=headers)
    delay = time.time() - start  # how long the request took
    # some code
    time.sleep(2 * delay)  # wait twice as long as the response took

It's also worth thinking about gathering data during the website's off-peak hours. Peak browsing times are broadly similar across countries (once the time zone is taken into account), with activity climbing after 5 PM, when most people get off work. Of course, exceptions apply, and we should look into the specific country and website we're interested in, so that we don't get caught in the internet rush hour.

The purpose of setting request delays and scraping during off-peak hours is to get the data we need without overwhelming the website's server and risking a ban. Human users will always rank higher in the access priority list; therefore, even though time-outs and responsive delays slow down our data collection significantly, they're worth implementing.

5. Simulating a real-world user

Imitating a regular user could be considered a web scraping ethical grey area. By simulating a real-world user, the web scraper won't be as easily detected and blocked.

There are a few aspects that go into making a web scraper appear as a real-world user: the user-agent string, IP address, time-outs between requests and the request rate.
The user-agent string

As previously discussed, we can easily set a custom HTTP header for the web scraper:
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) '
                         'AppleWebKit/537.36 (KHTML, like Gecko) '
                         'Chrome/79.0.3945.88 Safari/537.36'}
Now we have one user covered.
However, if we're planning on gathering a large amount of data, it's worth rotating custom user-agents to make it appear as if the requests are coming from different devices.
We can build a Python list of user-agent strings from any public user-agent database and rotate them between requests, as in the sketch below.
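
A minimal sketch of user-agent rotation with the python Requests module; the user-agent strings and URLs below are placeholders:
import random
import requests

# Placeholder pool of user-agent strings; in practice this list would be
# filled from a public user-agent database.
user_agents = [
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_2) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.4 Safari/605.1.15',
]

urls = ['https://example.com/page-1', 'https://example.com/page-2']

for n in urls:
    # Pick a different user-agent for each request
    headers = {'User-Agent': random.choice(user_agents)}
    page = requests.get(n, headers=headers)
    # some code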

The IP address

If a website has some anti-scraping bypass tools in place, it will most likely detect and ban an IP address that is sending too many repeated requests.

How many requests are too many requests? Only the sysadmin knows the rate limit. Therefore, the most basic way to ensure that your web scraper is not blocked is to rotate IP addresses.

Take note that the IP addresses should be completely random and not in a continuous range or belonging to the same group.

The most effective way of web scraping would be by combining user-agent rotation and IP rotation. This would make it appear as if the requests are being sent by different machines that belong to various networks.
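
A sketch of that combination with the python Requests module; the proxy addresses below come from documentation-reserved IP ranges and are placeholders for a real proxy pool:
import random
import requests

# Placeholder proxy pool (documentation-reserved IP ranges); replace with
# real proxy endpoints from a provider.
proxies_pool = [
    'http://203.0.113.10:8080',
    'http://198.51.100.23:3128',
    'http://192.0.2.45:8000',
]

# Placeholder user-agent strings, as in the previous sketch
user_agents = [
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36',
]

proxy = random.choice(proxies_pool)  # random IP address for this request
page = requests.get(
    'https://example.com',
    headers={'User-Agent': random.choice(user_agents)},  # random user-agent as well
    proxies={'http': proxy, 'https': proxy},             # route the request through the chosen proxy
)
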
Time-outs between requests and the request rate

As previously mentioned, we need to take note of the crawl-delay in robots.txt, and if we want to be even more polite, we can set an adaptive time-out that is proportional to how long the page took to load. Although this approach is "polite", it happens at regular intervals and can therefore be detected, interpreted as unwanted activity, and blocked.

So, how does a human user browse the web? Unlike an automated web scraper, a human user pauses for random amounts of time and sends requests at irregular intervals. This is exactly why a web scraper is easy to spot in the system log files: its repeated requests arrive at a regular, unchanging rate.

Randomizing the request rate isn't as necessary if we can provide our web scraper with random user-agents and proxies, but it's still worth looking into.

The next code snippet will pause our web scraper for a random interval of between 5 and 15 seconds:
import time
import random

time.sleep(random.uniform(5, 15))
We can also set an adaptive, random time-out to further mask the web scraper. Let's modify a previous example:
import requests
import time
import random

headers = { 
    # headers 
}

urls = {
    # urls
}

for n in urls:
    start = time.time()
    page = requests.get(n, headers = headers)
    delay = time.time() - start
    # some code
    time.sleep(random.uniform(1, 2) * delay)

By using the python Random module, we have set the web scraper to time-out at random rates that will be between 1 and 2 times longer than the time it took to load the page.

This will slow down our web scraper significantly, especially if we need to collect a large amount of data. This problem could be solved by looking into asynchronous web scraping solutions, as well as a web scraping service since the whole process might start to require some serious processing power and hardware to complete.
Conclusion
With web scraping becoming widespread, it's more important than ever to scrape mindfully, since automated data collection done crudely and inexpertly is a nuisance for web page sysadmins.

We should keep in mind that we need to take note of a web page's terms and conditions and the robots.txt file. Always use an API if it's available and if we can access the needed data that way. We should identify our web scraper, and provide contact information, as well as set time-outs between sending requests to a site.

After all of this is considered, we can think about imitating real-world users to evade a ban since the previous steps have ensured that the website would not be overloaded by requests.

Web scraping service compliance validation

If you are uncertain about the legality of your web scraping project, don't hesitate to contact our team so we can check it for you. You can also check out our how to scrape data from a website guide.