As with many other activities on the Internet, there is no simple answer about the legality of web scraping. A common belief holds that it is illegal, but that is not true.
The answer can be divided into two parts: the ethical standpoint and the legal standpoint.
From the ethical point of view, if data is publicly accessible, you can collect it, especially if it is standard information such as a flight schedule or NBA game scores. These are just facts that don't belong to anyone.
The tricky part starts when the data you access is more specific. For example, a site can contain areas that are not meant for crawlers and provide the corresponding instructions in its robots.txt file, telling crawlers and scrapers to ignore these URLs. The ethical approach is to skip those sections of the site, since the owner has kindly asked for that. Another good gesture is to limit the number of simultaneous web requests, so that the scraper doesn't noticeably degrade the site's performance.
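The two courtesies described above (honoring robots.txt and capping simultaneous requests) can be sketched in Python using only the standard library. This is a minimal illustration, not a complete scraper: the robots.txt content, the example.com URLs, the user-agent name, and the helper names `filter_allowed` and `fetch_politely` are all hypothetical, and the `fetch` callable stands in for whatever HTTP client you actually use.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. A real scraper would download the file
# from the target site instead of hard-coding it.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def filter_allowed(robots_txt, user_agent, urls):
    """Return only the URLs that robots.txt permits for this user agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]


def fetch_politely(urls, fetch, max_concurrent=2):
    """Call fetch(url) for each URL with at most max_concurrent in flight.

    Results come back in the same order as the input URLs.
    """
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        return list(pool.map(fetch, urls))


if __name__ == "__main__":
    candidates = [
        "https://example.com/schedule",
        "https://example.com/private/accounts",
    ]
    allowed = filter_allowed(EXAMPLE_ROBOTS_TXT, "my-scraper", candidates)
    # Placeholder fetch function: replace with a real HTTP request.
    print(fetch_politely(allowed, fetch=lambda url: f"fetched {url}"))
```

A small worker pool is only one way to cap concurrency; adding a delay between requests is another common choice, and the right limit depends on the site.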
Another point of concern is users' personal information. As time goes on, people are increasingly concerned about how their personal data is used and who has obtained it. As a result, many countries have adopted similar laws that make data holders responsible and transparent about how they use and store personal data. The two most important and best-known laws are the GDPR (General Data Protection Regulation) and the CCPA (California Consumer Privacy Act). These laws differ considerably in their details but have one thing in common: if you deal with personal data (emails, real names, and so on), obtain consent from the people it belongs to and store everything securely.
In fact, this makes scraping of personal data almost impossible in the European Union due to the required consent from each individual. The situation is different in the United States and the rest of the world. We'll review it a bit later.
As for the legal part, practice varies significantly from case to case.
The main points here are:
- Copyrighted content;
- Terms of Service (ToS);
- Related laws, for example the Computer Fraud and Abuse Act (CFAA), in short an anti-hacking law.
Copyrighted content is the most straightforward case: one can freely parse such information, for example YouTube video titles, but can't repost the videos elsewhere.
As for the Terms of Service, data owners often prohibit automated use of their data in their ToS agreements. However, a site with publicly available data can't make a scraper agree to its ToS before accessing the content, so such terms are hard to enforce against the scraping of public pages.
Companies also like to interpret the CFAA very broadly and try to argue that violating the ToS means violating the CFAA.
So, are there any court decisions that can help shed light on this? Over the years there have been many trials, and courts have often sided with the data owners, forcing those who scraped the data to pay fines. Even so, a crucial court decision was handed down recently.
The most famous case is the dispute between LinkedIn and hiQ. hiQ is an analytics company that scraped public profile data from LinkedIn, determined who had started looking for a new job, and sold this information to employers. At some point, LinkedIn sent hiQ a so-called cease-and-desist letter demanding that it stop the automated collection of profile data. As grounds, LinkedIn cited a CFAA violation. After two years of litigation, the U.S. Court of Appeals for the Ninth Circuit ruled that the CFAA applies only to information or computer systems that are closed to the public in the first place. Thus hiQ could freely access data that was available to the public. The court also barred LinkedIn from interfering with hiQ's web scraping. This is a huge step toward legal web scraping.
With the latest court decisions and data protection laws, it is becoming clearer how to perform scraping without breaking the law.
One can research all the aspects and limitations of scraping, or use one of the legal web scraping services that implement the best practices.
To sum up: web scraping is legal in itself, but one should respect copyright and data protection laws.