Legal Web Scraping for Legal Purposes

6 Tips for Healthy Web Scraping, or How to Extract Data from the Web in Compliance with the GDPR
Web scraping is a fast and easy way to extract data from the web. How does it work? It is an automated process in which a bot or web crawler fetches pages over the HTTP protocol or through a web browser. The target data is stored in a central local database or a spreadsheet and is later used for retrieval or analysis. Web scraping is a tool that can be applied to many different business processes. Through web scraping, you can easily gather information for brand monitoring and market research (such as visitor statistics, product details, or customers' email addresses). Web scraping is also actively used to train artificial intelligence and to collect information for scientific research.
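To make this workflow concrete, here is a minimal sketch in Python using the widely available requests and BeautifulSoup libraries: it fetches one page, pulls out a few fields, and writes them to a local CSV file. The URL and the CSS selectors are hypothetical placeholders, not taken from any real site.

```python
# A minimal sketch of the scraping workflow described above:
# fetch a page over HTTP, parse the HTML, and store the extracted
# records in a local CSV file. The URL and selectors are hypothetical
# placeholders -- adjust them to the site you actually target.

import csv

import requests
from bs4 import BeautifulSoup

URL = "https://example.com/products"  # placeholder target page

response = requests.get(URL, timeout=30)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")

rows = []
for item in soup.select(".product"):            # hypothetical listing element
    name = item.select_one(".product-name")     # hypothetical name element
    price = item.select_one(".product-price")   # hypothetical price element
    if name and price:
        rows.append({"name": name.get_text(strip=True),
                     "price": price.get_text(strip=True)})

# Store the result in a local spreadsheet-style file for later analysis.
with open("products.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "price"])
    writer.writeheader()
    writer.writerows(rows)
```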

Is web scraping legal? This question raises controversy among lawyers and practitioners. Scraping data from the web does indeed have some ethical, legal, and technical limitations. In May 2018, the General Data Protection Regulation (GDPR) came into force, creating challenges for all companies working with the personal data of EU residents. In June 2019, online media reported on the first GDPR fine issued in Poland, imposed for a failure to inform data subjects about the processing of their data. However, web scraping is legal when it is done for legitimate purposes and in compliance with the GDPR. Check out our six tips for healthy web scraping.

Image courtesy of Max Nelson (Unsplash)
Tip 1. Make sure that the purpose of data collection is legal
Tip 2. Make sure that you want to get publicly available information
Tip 3. Check copyrights
Tip 4. Set the optimal web scraping rate
Tip 5. Direct your web scrapers along a path similar to a search engine's
Tip 6. Identify your web scrapers
Tip 1. Make sure that the purpose of data collection is legal
Before you start extracting data, decide what information you want to receive, from which websites, and in what format. Next, decide how you plan to use the collected data. Two important questions should be answered here: 1. Are you going to publish the data, or will you use it for your company's needs only? 2. Could the data extraction cause any damage to the owners of that data (e.g., reputational or financial losses)?

If the data is extracted for your personal use and analysis, it is generally legal and ethical. But if you are going to publish it on your website as your own content, without any attribution to the original data owners, that goes against the interests of the data subjects and is neither ethical nor legal. So, if you plan to publish the scraped data, you should send a download request to the data owners or do some background research on the website's policies as well as on the data you are going to scrape. Remember that scraping information about individuals without their knowledge could infringe personal data protection laws.
Tip 2. Make sure that you want to get publicly available information
Though the data published by most websites is there for public consumption and may legally be copied, it is better to double-check the website's policies. You can legally use web scraping to access and acquire public, authorized data. Make sure that the information on the sites you need does not contain personal data. Web scraping can generally be done without asking for the data owner's permission, as long as it does not violate the website's terms of service. Every website has Terms of Service (ToS); you can easily find that document in the footer of the page and check that there is no direct prohibition on scraping. If a website states in its ToS that data collection is not allowed, you risk being fined for web scraping, because it is done without the owner's permission. Also be aware that some information on the websites you need may be secured (behind usernames, passwords, or access codes); you cannot collect that data either.

And of course, you may scrape your own website without any concerns.
Tip 3. Check copyrights
In addition to the ToS, all websites have copyright details, which web scraping users should respect as well. Before copying any content, make sure that the information you are about to extract is not copyrighted, including the rights to text, images, databases, and trademarks. Avoid republishing scraped data or datasets without verifying the data license or obtaining written consent from the copyright holder. If some data may not be used for commercial purposes because of copyright, steer clear of it. However, if the scraped data is a creative work, then usually only the way or format in which it is presented is copyrighted. So if you scrape the 'facts' from the work, modify them, and present them in your own original way, that is legal.

Tip 4. Set the optimal web scraping rate
Let's move on to the technical limitations of healthy web scraping. Data scrapers can put a heavy load on a website's servers by requesting data far more often than a human would. You should set an optimal rate for the scraping process so that you do not affect the performance or bandwidth of the web server in any way. If you do, most web servers will simply block your IP automatically, preventing further access to their pages. Respect the crawl-delay setting provided in robots.txt; if there is none, use a conservative scraping rate, e.g. one request per 10-15 seconds.
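As a rough sketch of what a polite scraping rate can look like in practice, the Python snippet below reads the Crawl-delay value from robots.txt with the standard urllib.robotparser module and falls back to a conservative 12-second pause when none is declared. The site URL, paths, and user agent string are hypothetical placeholders.

```python
# A sketch of conservative rate limiting: honour the Crawl-delay from
# robots.txt if one is set, otherwise fall back to roughly one request
# every 10-15 seconds. The URLs and user agent are placeholders.

import time
import urllib.robotparser
from urllib.parse import urljoin

import requests

BASE_URL = "https://example.com"      # placeholder site
USER_AGENT = "my-research-bot/1.0"    # placeholder identifier

robots = urllib.robotparser.RobotFileParser()
robots.set_url(urljoin(BASE_URL, "/robots.txt"))
robots.read()

# Use the site's declared Crawl-delay if present, else a conservative default.
delay = robots.crawl_delay(USER_AGENT) or 12

urls = [urljoin(BASE_URL, path) for path in ("/page1", "/page2", "/page3")]

for url in urls:
    response = requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=30)
    print(url, response.status_code)
    time.sleep(delay)  # pause between requests so the server is not overloaded
```

Checking the declared Crawl-delay first, and only then falling back to a default, keeps the scraper aligned with whatever limits the site administrator has actually published.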


Tip 5. Direct your web scrapers along a path similar to a search engine's
One more important aspect of healthy web scraping is how your crawler reaches the site and searches for the information you need. Experienced coders and lawyers recommend using crawlers that access website data as a visitor would, following paths similar to a search engine's. Moreover, this can be done without registering as a user or explicitly accepting any terms. So a legal crawler may scan and copy any public information that is available to a regular user, but it cannot, for example, damage the site's code, break through security barriers, or interfere with normal website operation in any way.
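The sketch below illustrates this search-engine-like behaviour under simple assumptions: the crawler starts at a public entry page, follows ordinary links within the same domain, and never logs in or submits any forms. The start URL and the page limit are hypothetical placeholders.

```python
# A sketch of a crawler that behaves like a search engine: it starts at
# the public homepage, follows ordinary links within the same domain,
# and never logs in or fills in any forms.

from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

START_URL = "https://example.com/"   # placeholder public entry point
MAX_PAGES = 20                       # keep the crawl small and polite

domain = urlparse(START_URL).netloc
queue = deque([START_URL])
visited = set()

while queue and len(visited) < MAX_PAGES:
    url = queue.popleft()
    if url in visited:
        continue
    visited.add(url)

    response = requests.get(url, timeout=30)
    if response.status_code != 200:
        continue

    soup = BeautifulSoup(response.text, "html.parser")
    # ... extract whatever public data you need from `soup` here ...

    # Follow links the same way a search engine would, staying on-site.
    for link in soup.find_all("a", href=True):
        next_url = urljoin(url, link["href"])
        if urlparse(next_url).netloc == domain and next_url not in visited:
            queue.append(next_url)
```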
Tip 6. Identify your web scrapers
Be respectful and identify your web scraper with a legitimate user agent string. Create a page that explains what you are doing and why, state your organization's name (if you are scraping for one), and include a link back to that page in your user agent string. Legitimate bots abide by a site's robots.txt file, which lists the pages a bot is permitted to access and those it is not. If the ToS or robots.txt prevent you from scraping, ask the site owner for written permission before doing anything else.
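Putting this together, a minimal sketch of a self-identifying, robots.txt-abiding request might look like the Python snippet below. The bot name, the contact page URL, and the target URL are hypothetical placeholders for your own values.

```python
# A sketch of a self-identifying request that abides by robots.txt.
# The bot name, contact URL, and target URL are placeholders.

import urllib.robotparser

import requests

TARGET_URL = "https://example.com/public-data"
USER_AGENT = "acme-data-bot/1.0 (+https://example.org/about-our-bot)"

robots = urllib.robotparser.RobotFileParser()
robots.set_url("https://example.com/robots.txt")
robots.read()

# Only fetch pages that robots.txt allows for this user agent.
if robots.can_fetch(USER_AGENT, TARGET_URL):
    response = requests.get(TARGET_URL,
                            headers={"User-Agent": USER_AGENT},
                            timeout=30)
    print(response.status_code)
else:
    print("robots.txt disallows this URL -- ask the site owner for permission first.")
```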

Web scraping is a valuable and inexpensive tool for businesses in the competitive global market. However, it should be done with respect for, and responsibility toward, data owners and site administrators. By following our six tips for healthy web scraping, you can avoid many problems and protect yourself.
October 1, 2019
We'll take over all the routine work
If you are uncertain about the legality of your scraping project, don't hesitate to contact us and we'll find an individual solution for you.