Legal Web Scraping for Legal Purposes

6 Tips for Healthy Web Scraping or
How to Extract Data from the Web Compliantly with the GDPR

Is web scraping legal? This question raises controversy among lawyers and practitioners. Scraping data from the web does indeed have some ethical, legal, and technical limitations. In May 2018, the General Data Protection Regulation (GDPR) came into force, creating challenges for all companies working with the personal data of EU residents. In June 2019, online media reported on the first GDPR fine issued in Poland, for a failure to inform data subjects about the processing of their data. However, web scraping is legal when it serves legal purposes and complies with the GDPR.

Image courtesy of Max Nelson (Unsplash)
Tip 1. Make sure that the purpose of web scraping is legal
Before you start to extract data, you should decide what information you want to receive, from which websites, and in what format. The next step is to determine how you plan to use the collected data.
Two important questions should be answered here:
1. Are you going to publish the data or will you use it for your company's needs only?
2. Could the data extraction cause any damage to the owners of the data (e.g. reputational or financial losses)?

If the data is extracted for your personal use and analysis, then web scraping is legal and ethical. But if you are going to pass it off as your own content and publish it on your website without any attribution to the original data owners, that runs completely against the interests of the data subjects and is neither ethical nor legal. So, if you plan to publish the scraped data, you should submit a download request to the data owners, or do some background research on the website's policies as well as on the data you are going to scrape. Remember that scraping information about individuals without their knowledge could infringe personal data protection laws.
Tip 2. Make sure that you want to get publicly available information
Though the data published by most websites is for public consumption and is legal to copy, it is better to double-check the website's policies. You can legally use web scraping to access and acquire public, authorized data. Make sure that the information on the sites you need does not contain personal data. Web scraping can generally be done without asking the data owner's permission, provided it does not violate the website's terms of service. Each website has Terms of Service (ToS); you can easily find that document in the footer of the page and check that there is no direct prohibition on scraping. If a website states in its ToS that data collection is not allowed, you risk being fined for web scraping, because it is done without the owner's permission. Also be aware that some information on the websites you need may be secured behind usernames, passwords, or access codes; you cannot collect such data either.

And of course, you may scrape your own website without any concerns.
Tip 3. Check copyrights
In addition to the ToS, all websites have copyright details, which web scraping users should respect as well. Before copying any content, make sure that the information you are about to extract is not copyrighted, including the rights to text, images, databases, and trademarks. Avoid republishing scraped data or datasets without verifying the data license or obtaining written consent from the copyright holder. If some data may not be used for commercial purposes because of copyright, you should steer clear of it. However, if the scraped data is a creative work, then usually only the way or format in which it is presented is copyrighted. So if you scrape 'facts' from the work, modify them, and present them in an original way, that is legal.

Tip 4. Set the optimal web scraping rate
Let's turn to the technical limitations of legal web scraping. Data scrapers can put heavy loads on a website's servers by requesting data far more often than a human would. You should maintain an optimal web scraping rate and avoid affecting the performance and bandwidth of the web server in any way. If you do, most web servers will simply block your IP automatically, preventing further access to their pages. Respect the crawl-delay setting provided in robots.txt. If there is none, use a conservative scraping rate, e.g. one request per 10-15 seconds.
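As a sketch of this approach, here is how you might read a site's crawl-delay from robots.txt using Python's standard library, falling back to a conservative default when none is declared (the bot name and the 12-second default are illustrative assumptions):

```python
from urllib.robotparser import RobotFileParser

def polite_delay(robots_txt: str, user_agent: str, default: float = 12.0) -> float:
    """Return the crawl delay declared in robots.txt, or a conservative default."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else default

# Example robots.txt asking all crawlers to wait 10 seconds between requests.
robots = """User-agent: *
Crawl-delay: 10
"""
print(polite_delay(robots, "example-bot"))  # 10.0
```

In a real scraper you would call `time.sleep()` with this value between requests, so the site sees no more traffic from your bot than it has asked for.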


Tip 5. Direct your web scrapers along a path similar to a search engine's
One more important aspect of healthy web scraping is how you reach the site and search for the information you need. Experienced coders and lawyers recommend using crawlers that access website data as a visitor would, following paths similar to a search engine's. Moreover, this can be done without registering as a user or explicitly accepting any terms. So a legal web scraper may scan and copy any public information available to a regular user, but it cannot, for example, damage the site's code, circumvent security measures, or interfere with normal website operation in any way.
Tip 6. Identify your web scrapers
Be respectful and identify your web scraper with a legitimate user agent string. Create a page that explains what you are doing and why, state your organization's name (if you are scraping for one), and add a link back to that page in your user agent string as well. Legitimate bots abide by a site's robots.txt file, which lists the pages a bot is permitted to access and those it is not. If the ToS or robots.txt prevent you from scraping, you should ask for written permission from the site owner before doing anything else.
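As a minimal sketch of such identification, a scraper might set its user agent like this (the bot name and info URL are hypothetical placeholders you would replace with your own):

```python
import urllib.request

# Hypothetical identifying string: bot name, version, and a link to a page
# explaining who runs the bot and why.
USER_AGENT = "example-research-bot/1.0 (+https://example.com/bot-info)"

def fetch(url: str) -> bytes:
    """Fetch a page while clearly identifying the scraper to the server."""
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read()
```

This way, a site administrator reviewing their server logs can see who is making the requests and contact you instead of silently blocking your IP.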

Web scraping is a valuable and cheap tool for businesses in the global competitive market. However, it should be done with respect and responsibility toward data owners and site administrators. By following our six tips for healthy web scraping, you can avoid many problems and protect yourself.
Is scraping LinkedIn legal?

One of the most highlighted cases of legal web scraping was LinkedIn vs HiQ. HiQ is a data science company that provides scraped data to corporate HR departments. Its business model is primarily focused on scraping publicly available data from the LinkedIn network. The data is used in analytics to determine key factors, such as whether an employee is likely to leave for another company, or what employees would like their training departments to invest in.

LinkedIn sent a cease and desist letter to HiQ, stating it would deploy technical methods to stop the activity. In response, HiQ filed a lawsuit to stop LinkedIn from blocking its access. On a technical basis, its web scraping was just an automated way to get publicly available data that a human visitor to LinkedIn could easily collect manually.

The court ruled in favour of HiQ, finding that accessing publicly available data falls far short of hacking or "breaking and entering", as it put it. This is a landmark case showing that scraping is a perfectly legitimate way for companies to gather data when used correctly and responsibly.
Quick checklist for legal web scraping:

Whilst web scraping itself isn't necessarily illegal, there are regulations governing data that do affect what firms should use the technique for. Such regulation is designed to ensure any activity is done on an ethical and responsible basis, rather than outlawing it completely. So is web scraping legal or not? Let's look at the issues for clarification.

Regulation
The General Data Protection Regulation (GDPR) in the EU came into force in 2018 to give the public control over their own data. The idea is that it puts limits on what businesses can do with personally identifiable data like names, addresses, phone numbers, or emails. The regulation does not state that scraping data is illegal; instead, it imposes limits on what firms can do when extracting it. For example, firms need explicit consent from consumers to be able to scrape their data.
A lot of the use cases we've described, like price tracking, do not require extraction of personally identifiable data. However, if a firm were scraping names and emails from a site to generate leads without the customers' consent, this would not be allowed under the GDPR. Other regulations, such as the CCPA in California, are now following suit when it comes to personally identifiable data.
So, scraping itself is not illegal but firms should be aware of other regulations surrounding how it is used.

Terms of Service

Many websites state in their terms of service that they do not allow scraping of their website. Again, whilst this doesn't make doing so illegal, the terms of service act a bit like a contract and could be used against companies who decide to scrape anyway. Ethically speaking, conducting any activity that another company has asked you to refrain from could be considered poor practice.

Robots.txt
Robots.txt is a file websites use to let others know how automated agents should behave. If you want to scrape a site, it is important to understand its robots.txt. The robots.txt file of a website tells you the type of access that scraping tools have, what times they are allowed on the site, and how many requests for information they can make.
Bots should comply with the robots.txt file of every website they visit and ensure they don't break any of its rules.
It should be said that ignoring a website's robots.txt file is not illegal, but it is highly unethical.
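These access rules can be checked programmatically before requesting any page. Here is a minimal sketch using Python's standard library (the bot name and paths are illustrative):

```python
from urllib.robotparser import RobotFileParser

def allowed(robots_txt: str, user_agent: str, path: str) -> bool:
    """Check whether robots.txt permits this user agent to fetch the given path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)

# Example robots.txt that blocks all bots from the /private/ section.
robots = """User-agent: *
Disallow: /private/
"""
print(allowed(robots, "example-bot", "/public/page.html"))   # True
print(allowed(robots, "example-bot", "/private/data.html"))  # False
```

In practice you would fetch the live file from `https://<site>/robots.txt` and run this check against every URL your scraper is about to visit.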
Web scraping service compliance validation
If you are uncertain about the legality of your web scraping project, don't hesitate to contact our team so we can check it for you.