Quick checklist for legal web scraping:
Whilst web scraping itself isn't necessarily illegal, there are data regulations that affect what firms can use the technique for. These regulations are designed to ensure that scraping is done ethically and responsibly, rather than to outlaw it completely. So is web scraping legal or not? Let's look at the issue in more detail.
Regulation
The General Data Protection Regulation (GDPR) came into effect in the EU in 2018 to give the public control over their own data. It puts limits on what businesses can do with personally identifiable data such as names, addresses, phone numbers or emails. The regulation does not state that scraping data is illegal; instead, it imposes limits on what firms can do when extracting it. For example, firms generally need a lawful basis, such as explicit consent from consumers, to scrape their personal data.
A lot of the use cases we've defined, like price tracking, do not require extraction of personally identifiable data. However, if a firm were scraping names and emails from a site to generate leads without the consent of the customer, this would not be allowed under GDPR. Other regulations, such as the CCPA in California, are now following suit when it comes to personally identifiable data.
So, scraping itself is not illegal but firms should be aware of other regulations surrounding how it is used.
Terms of Service
Many websites state in their terms of service that they do not allow scraping. Again, whilst this doesn't make doing so illegal, the terms of service act a bit like a contract and could be used against companies that decide to scrape. Ethically speaking, conducting any activity that another company has asked you to refrain from could be considered poor practice.
Robots.txt
Robots.txt is a file that websites use to tell bots how they may crawl or scrape the site. If you want to scrape a site, it is important to understand its robots.txt. The file tells you what type of access scraping tools have, what times they are allowed on the site and how many requests for information they can make.
Bots should comply with the robots.txt file of every website they visit and ensure they don't break any of its rules.
It should be said that it is not illegal to ignore a website's robots.txt file, but it is highly unethical.
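As a rough illustration of how a bot might check robots.txt before requesting a page, here is a minimal sketch using Python's standard urllib.robotparser module. The robots.txt content, the bot name "example-bot" and the example.com URLs are all made up for the example, and the Crawl-delay and Request-rate lines are non-standard extensions that many sites do not set (the parser returns None when they are absent).

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs without
# network access. A real bot would fetch the file from the target site.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
Request-rate: 10/60
"""

USER_AGENT = "example-bot"  # hypothetical bot name


def load_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a rule set that can be queried."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = load_rules(SAMPLE_ROBOTS_TXT)

# May this bot fetch a given URL?
allowed = rules.can_fetch(USER_AGENT, "https://example.com/products/")
blocked = rules.can_fetch(USER_AGENT, "https://example.com/private/data")

# How long to wait between requests, and how many requests per time window.
delay = rules.crawl_delay(USER_AGENT)   # seconds, or None if not specified
rate = rules.request_rate(USER_AGENT)   # (requests, seconds), or None

print(f"/products/ allowed: {allowed}")
print(f"/private/data allowed: {blocked}")
print(f"crawl delay: {delay} seconds")
if rate is not None:
    print(f"request rate: {rate.requests} requests per {rate.seconds} seconds")
```

Against a live site, the same parser can load the file itself by calling set_url() with the address of the site's robots.txt and then read(). A well-behaved scraper would skip any URL for which can_fetch() returns False and pause between requests for at least the stated delay.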