No need to protect your website from scraping: 8 reasons

In recent years, information has become the new gold.
As more of our lives move into the digital sphere, ever bigger chunks of data become available on the web. No surprise that all kinds of researchers and businesses strive to get a piece of this cake.

Today we'll define the web scraping and crawling processes (and the difference between them), review the technical ways of attempting to block them, examine the legal side of the topic with examples, and even cover the benefits of being scraped.

Web crawling is the process of obtaining all relevant information from web pages and iteratively fetching new links. It is mainly used by search engines to build a map of the whole Internet.

The process of extracting particular data from a website is called web scraping: for instance, data about real estate listings, product pricing, or all companies in a city. Technically speaking, web crawling includes some degree of web scraping, and web scraping itself is a subset of crawling.

Web scraping has become extremely popular, as it allows collecting unstructured, organic data on a short timeline and a low budget. This data can cover any topic and take any form, but it's always helpful for making business decisions, reaching an audience, or completing a scientific project.

At first, web scraping applications were implemented using general-purpose programming languages, such as Python or JavaScript, and specialized tools, for example, Selenium and PhantomJS. Manually created scrapers required programming experience and were applicable only to a single task. As demand grew, professional web scraping services began to offer an experience where scraping could be set up in minutes without any need to program at all.

Website owners, in turn, were often reluctant to share information they had worked so hard to collect. They began to interfere with web scraping by technical and legal means. As the years went by, web scraping methods evolved, and the defences against it became more sophisticated as well.

This confrontation is not going to end in the near future, so maybe it's time to stop for a second and ask whether attempts to block web scrapers make sense at all. Below I'll try to show that in many cases web scraping is not a threat, that there are no guaranteed means to obstruct it, and that sometimes it can even be useful for the website owner.

1. There is no technical way to prevent web scraping

Technically, web scraping is implemented as a series of requests to the website followed by parsing of the returned HTML code.
Let's review in detail how a website can try to withstand data-grabbing attempts.

Every web request contains a browser signature, the so-called User-Agent, and in theory, a web server can detect and reject requests that don't come from a human's browser. However, modern scrapers can impersonate different browsers and bypass this check.
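For illustration, here is how trivially the User-Agent check is bypassed with Python's standard library; this is a minimal sketch, not a hardened client, and the browser string is just one example of a mainstream signature:

```python
import urllib.request

# Instead of the default "Python-urllib/3.x" signature that a naive
# filter would reject, present the signature of a desktop Chrome browser.
BROWSER_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/120.0.0.0 Safari/537.36"
)

def build_request(url):
    """Build a request that impersonates a desktop Chrome browser."""
    return urllib.request.Request(url, headers={"User-Agent": BROWSER_UA})
```

One changed header, and the server can no longer tell this request from a real browser visit by the signature alone.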
A web server can notice requests that come too often from the same computer or IP address; this block can easily be avoided by adding random delays between calls and accessing URLs through proxy servers.
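As a sketch of how easily that block is sidestepped (the proxy addresses below are placeholders from a documentation range, not real servers):

```python
import random
import time
import urllib.request

# Hypothetical proxy pool; real scrapers rent lists like this.
PROXIES = ["203.0.113.10:8080", "203.0.113.11:8080", "203.0.113.12:8080"]

def pick_proxy():
    """Rotate the source address so no single IP makes every request."""
    return random.choice(PROXIES)

def polite_fetch(url):
    """Wait a random interval, then fetch the URL through a random proxy,
    so requests neither cluster on one IP nor arrive at a fixed rate."""
    time.sleep(random.uniform(1.0, 5.0))
    proxy = pick_proxy()
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    )
    return opener.open(url, timeout=10)
```

From the server's perspective, the traffic now looks like a handful of unrelated visitors browsing at a human pace.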

Some websites render data using JavaScript, so a direct data fetch is not possible. In response, web scrapers can now load the entire page and execute the JavaScript code before analyzing the content.
Other webmasters constantly change the page layout, tweak the HTML code, and alter the data format. This is meant to break the scraper's logic, which expects page elements at specific positions. Unfortunately, such changes are detected and handled quickly, and small changes might not even interfere with scraping!
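To illustrate why layout shuffling often fails: a scraper keyed to a semantic hook (here, a microdata `itemprop` attribute, parsed with only Python's standard library) keeps working however the surrounding markup is rearranged, while one that expects "the third cell in the second table" breaks. The markup below is invented for the example:

```python
from html.parser import HTMLParser

class PriceExtractor(HTMLParser):
    """Collect the text of any element tagged itemprop="price",
    regardless of where it sits in the page layout."""

    def __init__(self):
        super().__init__()
        self.in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        if ("itemprop", "price") in attrs:
            self.in_price = True

    def handle_data(self, data):
        if self.in_price:
            self.prices.append(data.strip())
            self.in_price = False

page = '<div><span itemprop="price">19.99</span></div>'
extractor = PriceExtractor()
extractor.feed(page)
print(extractor.prices)  # ['19.99']
```

Wrapping the price in a different tag, table, or nesting level changes nothing for this extractor, which is exactly why cosmetic reshuffles punish regular users more than they punish scrapers.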

Owners can add mandatory email registration and even checks that only real users can answer. Unfortunately, all this hassle makes things worse for site visitors, who now have to overcome new obstacles before accessing the content. Professional web scraping services and experienced developers usually keep a pool of available email addresses and can even solve CAPTCHAs (there are APIs that solve them for an affordable price).
As a last resort, some sites introduce user behaviour pattern matching. For popular services, such patterns are well known, and it doesn't take much to emulate all the requirements. Meanwhile, regular users start suffering from being accidentally blocked.

As we can see, technical means don't really prevent data grabbing, especially when a professional web scraping service is used.

2. You ruin your users' experience

A happy user is a must in the modern web industry.

People don't like to wait when a page loads slowly, and they hate solving CAPTCHAs and receiving email confirmations when it's not necessary. You can't ask them to wait longer or pass monotonous bot verification because "it is a web scraping precaution". Precaution against what? Nobody cares!

As a result, such visitors will simply close the page and go somewhere else. Can you justify your daily visitor count going down just because there is a chance that someone will access your page content automatically? It reminds me of the Simpsons episode where a new city tax was introduced to protect the people of Springfield from wild lions.

When a web developer starts changing the layout constantly, the only outcome will be the frustration of those who can't find the information they need in its usual place. Moreover, every change carries the risk of a programming mistake, which can worsen the user experience even further.

To sum up, an obvious choice has to be made between user comfort and protection from an unlikely potential threat.

3. Protection will increase your infrastructure costs

Every additional web server check adds latency to a served page and consumes part of your server's resources. The worst part is that you have to check all of your visitors to find the suspicious ones.

For example, if you want to identify someone who requests web pages too often, you'll have to track such actions for all of your users. Imagine that you now have to record every hit from hundreds or thousands of IP addresses, calculate limits, make the corresponding decisions, and still do all of this in real time without performance loss. This additional server load becomes constant, even when no web scraping services are crawling your pages.
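A minimal sketch of what that per-IP bookkeeping looks like (a sliding-window counter; the window and limit values are illustrative):

```python
import time
from collections import defaultdict, deque

WINDOW = 60    # seconds to look back
LIMIT = 100    # max requests per window per IP

# Every visitor's recent hits must be stored, just to spot the rare abuser.
hits = defaultdict(deque)

def allow(ip, now=None):
    """Record a hit and decide whether this IP is still within its limit."""
    now = time.monotonic() if now is None else now
    q = hits[ip]
    while q and now - q[0] > WINDOW:   # evict hits outside the window
        q.popleft()
    q.append(now)
    return len(q) <= LIMIT
```

Note that `allow` runs on every single request, for every single visitor, whether or not any scraper ever shows up; that is the constant overhead the section describes.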

Another risk here is a custom implementation of this functionality. One incorrect code change, and all of your users will be blocked from your website. It may sound far-fetched, but I've seen a real-life story where development settings were pushed to the production environment, and an IP filter closed the site for everyone except the developer's machine. The revenue loss was astronomical!

Also, don't forget about the most significant web scraping companies in the industry: Google and Bing. Yes, to provide us with search results, they have to crawl the web and scrape your pages as well, so technically they use web scraping, too. On average, most website traffic comes from search engines, so messing with crawling bots' access is very dangerous for any business.

As a completely different option, you can choose one of the 3rd-party solutions and, as a result, obtain an ineffective (all pattern checks are well known, remember?) and costly product.

As we can see, the idea of fighting web scrapers can turn into additional costs and even problems with search traffic.

4. Can other features wait?

Adding scraping protection means your development team will be busy implementing, testing, and then continuously monitoring and supporting this functionality. Remember the case of changing the layout to disrupt scrapers: how many hours per week are you ready to spend disappointing users and web scrapers alike by shuffling the page's look and feel?

As a businessman and product owner, you have to grow your app and roll out new features faster than your competitors, and on a regular basis, to keep your clients interested and engaged. Adding scraping protection to the feature list means delaying other important ones. Just look at your backlog (or whatever you use to organize your project plans) and compare every element with the anti-scraping module. My prediction is that in most cases, all the other features will be more important.

As another exercise (a regular one in the Scrum development approach), you can estimate the task's Business Value and Implementation Effort and obtain a pretty low value-to-effort ratio, indicating that this task should be done "sometime later, probably never".

It seems obvious that a formal, mathematical approach leads to the same conclusion: it makes no sense to spend dozens of hours implementing a scraping-proof application.

5. The legal side of web scraping

As with many other activities on the Internet, there is no simple answer about the legal aspects of web scraping. There is a common belief that it's illegal, but that's not true.

The answer can be divided into two parts: the ethical standpoint and the legal standpoint.

From the ethical point of view, if data can be publicly accessed, you can grab it, especially if it is standard information like a flight schedule or an NBA match score. These are just facts that don't belong to anyone.

The tricky part starts when the data you access is specific. For example, a site can contain hidden areas and provide the necessary instructions in its robots.txt file, telling crawlers and scrapers to ignore those URLs. The ethical approach tells us to skip that section of the site, as the owner has kindly asked us to. Another good gesture is to limit the number of simultaneous web requests, so the scraper doesn't drastically affect the site's performance.
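Python's standard library even ships a parser for this, so honouring robots.txt takes only a few lines; the user-agent name and the rules below are illustrative:

```python
import urllib.robotparser

# An ethical scraper consults robots.txt before touching a URL.
rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

print(rp.can_fetch("my-scraper", "https://example.com/private/report"))  # False
print(rp.can_fetch("my-scraper", "https://example.com/listings"))        # True
```

In a real scraper, `rp.set_url(...)` plus `rp.read()` would load the live file from the target site instead of the hard-coded lines above.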

Another point of concern is users' personal information. As time goes by, people are more and more concerned about how their personal data is used and who has obtained it. As a result, many countries have developed similar laws to make data holders responsible and transparent regarding personal data usage and storage. The two most important and well-known laws are the GDPR (General Data Protection Regulation) and the CCPA (California Consumer Privacy Act). These laws differ in their details but have one thing in common: if you deal with personal data (like emails, real names, etc.), you must get consent from your clients and store everything securely.

In fact, this makes scraping personal data almost impossible in the European Union, due to the consent required from each individual. The situation is different in the United States and the rest of the world; we'll review it a bit later.

As for the legal part, practice varies significantly from case to case.
The main points here are:
- Copyrighted content;
- Terms of Service (ToS);
- Related laws, for example, the CFAA - the Computer Fraud and Abuse Act (simply put: an anti-hacking law).

Everything is straightforward with copyrighted content: one can freely parse such information, for example, YouTube video titles, but can't repost the videos anywhere.

As for the Terms of Service, data owners often don't authorize automatic usage of data in their ToS agreements. However, sites with publicly available data can't force a data scraper to agree to their ToS before accessing the content, so users can use web scraping services as they wish.

Also, companies like to interpret the CFAA very broadly and try to insist that violating a ToS means breaking the CFAA.

So, are there any court decisions that can shed light on this? Over the years, there have been many trials, and many times the court took the side of the data owners, forcing those who scraped data to pay fines. Even so, a crucial court decision happened recently.

The most famous case is the standoff between LinkedIn and hiQ. HiQ is an analytics company that scraped public profile data from LinkedIn, determined who had started looking for a new job, and sold this information to employers. At some point, LinkedIn sent hiQ a request, a so-called cease-and-desist letter, demanding it stop automated data collection from profiles, citing a CFAA violation. After two years of litigation, the U.S. Court of Appeals for the 9th Circuit ruled that the CFAA applies only to information or computer systems that are initially closed to the public. Thus, hiQ can freely access data that was available to the public in the first place. The court also forbade LinkedIn from interfering with hiQ's web scraping. This is a huge step towards legal web scraping.

Evidently, with the latest court decisions and data protection laws, it is becoming clearer how to perform scraping without breaking the law.

One can research all the aspects and limitations of scraping, or use one of the legal web scraping services that implement the best practices.

To sum up: web scraping is legal by itself, but one should respect copyright and data protection laws.

If I can't stop scraping, are there any benefits?

Until this moment, we have looked only at the negative sides of fighting scraping. However, there are many positive effects your business can gain from someone scraping your website.

Let's consider the bright side of this process.

6. New sources of traffic

As you remember, Google scrapes your site on a regular basis. But its bot is not the only one that crawls and scrapes your data: dozens of other "good" bots do the same (Bing, DuckDuckGo, Yandex, Baidu, Sogou, Exalead, Facebook, Alexa, and many others), adding your content to search engines around the world. There is even a technology called microformats, designed to help parsers better understand the content of a web page. For many sites, this is a valuable and significant way to attract a new audience. To be specific, almost 93% of web traffic comes through search engines.

Also, there are specialized aggregators scraping only specific types of data: price monitors, blog and news aggregators, etc. If your website belongs to any of these groups, you get an easy way to promote your goods and services!

Simply put, giving aggregation services access to your pages is a vital part of the digital business.

7. Evaluation of industry and competitors

If you are serious about your digital strategy and use modern promotional tools, then no doubt you are already using 3rd-party services powered by global web scraping.

Nowadays, you can easily track all the mentions of your product or service (posted on numerous review services and social networks) and react accordingly, drastically improving public relations and reputation. To achieve this, the service provider has to constantly scrape media sources and parse out company and product names, expressed emotions, and feedback. Besides, it can be extremely useful for tracking the results of a PR campaign and adjusting it almost in real time.

Various 3rd-party services provide an analysis of the global market, where you can compare yourself against competitors. Monitoring rivals' performance, their ups and downs, can be crucial to your success. The same goes for understanding the demographics of your audience and altering your strategy accordingly.

Search engine optimization (SEO) is essential for being on the first page of related search results. Thus, it makes sense to dedicate time and effort to improving and maintaining it; to find the right keywords, you'll have to scrape information about your audience. In addition, there are SEO tools that track changes in search rankings, and to do so, those tools scrape search results as well.

A whole new area is sales lead generation and marketing analysis. As you might know, generating high-quality leads is currently the biggest challenge and focus of B2B marketers. Whether you need hotel contacts, local doctors' names, or game developers' emails, the fastest and most efficient way is to scrape this data from the Internet. According to statistics, this method of lead generation significantly outperforms traditional approaches such as email, PPC ads, content marketing, and social media. Also, it's more cost-effective, so you'll be able to spend the budget on other activities.

In other words, modern web scraping allows web businesses to attract new traffic, promote their services or products, and optimize their web presence with significantly cheaper and more accurate means than ever before! However, it only works if the global community allows access to its data.

8. Your data helps the academic world

Last but not least is the fact that scientists need a lot of data for their studies, and web scraping allows researchers to get it. Of course, web scraping can't help with math formulas, but it can be invaluable for social and medical science. Using scraped data, scientists can measure relationships between countries, predict virus outbreaks and famine, better understand public sentiment, forecast economic jolts, and make many other fascinating prognoses.
A real-life example is the prediction of a recent Ebola epidemic and hunger in Africa.

Think of your website as a potential source of data for research that may change the entire world.

So, we can conclude that web scraping has become a part of everyday web business.
Regardless of our wishes, all publicly available content is being scraped, as there is no way to restrict it programmatically, and many services we use to grow our businesses already contain scraped information as well.

United States law treats data scraping as legal, and there is no established practice worldwide of forbidding it explicitly.

Your data can be a way to attract new users or to help science make our planet and our lives better.

What line of action can be chosen to make the web scraping process smooth and comfortable for both sides?

As a website owner, you can create an API for your data and set up limitations that won't harm your servers. Prepare a valid robots.txt file explaining which paths of your site shouldn't be accessed. Clearly explain copyright and licensing terms to avoid any confusion.
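For example, a minimal robots.txt along these lines (the paths are illustrative) both welcomes well-behaved bots and marks the areas you'd rather keep out of crawls:

```text
User-agent: *
Crawl-delay: 10       # ask bots to pause between requests
Disallow: /admin/     # keep service areas out of crawls
Disallow: /private/
Allow: /
```

Note that Crawl-delay is a non-standard extension honoured by some engines (e.g. Bing and Yandex) but not by Google, and robots.txt is only a polite request, not an enforcement mechanism.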

As a scraper or researcher, ask yourself: "Will my actions disturb the work of the target site? Is this data open to being scraped? Can I store personal data securely?" Think about whether you can implement a proper web scraping app yourself, or whether it is better to use one of the legal web scraping services.

Web scraping is still in a grey area rather than a white one, and it's up to all of us to make it civilized, polite, and acceptable for all parties!