First, we need to think about web scraper request delays, as well as request randomization. These are fairly easy to implement if you have a custom web scraping script. For example, you can use the sleep function from Python's time module to space out your requests.
Request randomization comes in when we want the web scraper to appear more human-like in its browsing habits. Think about how you browse the web: you click on things, scroll and perform other actions at irregular intervals, not like clockwork. So, in order not to look like suspicious activity in the web server's logs, we can use Python's random module and make the web scraper pause for a random amount of time between requests.
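As a rough sketch, the two ideas can be combined in just a few lines. The URL list and the 2 to 8 second range below are placeholder values you would tune for your own project:

```python
import random
import time

import requests

# Placeholder list of URLs to scrape -- replace with your own targets.
urls = [
    "https://example.com/page/1",
    "https://example.com/page/2",
    "https://example.com/page/3",
]

for url in urls:
    response = requests.get(url)
    print(url, response.status_code)

    # Pause for a random interval (here 2-8 seconds) so the request
    # pattern doesn't look like a fixed-rate machine.
    time.sleep(random.uniform(2, 8))
```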
All in all, spacing out requests is important so that you don't overload the server and inadvertently cause damage to the website. You therefore have to be especially careful when selecting ready-made web scraping software, as it may not have this feature.
IP address rotation is one of the cornerstones of an unblockable web scraper. In addition to randomizing your request rate, you should also make sure the server doesn't see a single IP address browsing 10,000 pages of a website. IP address blacklisting is one of the most basic anti-scraping measures a server can implement, and it usually kicks in when many requests arrive one after another with little time in between.
Take, for example, a server log where one IP address is sending one request per second. Would you block it? Of course, since no human user browses the web like that. Usually this happens automatically, without anyone analyzing the user agent (where you might have identified yourself and provided contact information).
Because of this, we need the web scraper to randomize its requests and to change IP addresses. It's not about tricking the webmaster into thinking we're not a web scraper; it's about not getting banned automatically.
You can build a simple IP address rotation tool by using free proxy IP addresses available on the web. However, the price you pay for a free IP address is that the connection will be painfully slow, with many of the addresses timing out before getting a response from the web server. This is due to the sheer number of people rerouting their requests through these IP addresses.
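A minimal sketch of such a rotation tool might look like the following, assuming the requests library and a hand-collected pool of free proxies. The addresses below are placeholders from documentation IP ranges, and in practice many free proxies will be dead or overloaded, hence the retry loop:

```python
import random

import requests

# Placeholder list of free proxies -- expect many of them to be slow or dead.
free_proxies = [
    "http://203.0.113.10:8080",
    "http://198.51.100.24:3128",
    "http://192.0.2.55:8000",
]

def fetch_with_rotation(url, proxies, timeout=10):
    """Try proxies from the pool in random order until one responds."""
    for proxy in random.sample(proxies, len(proxies)):
        try:
            return requests.get(
                url,
                proxies={"http": proxy, "https": proxy},
                timeout=timeout,
            )
        except requests.RequestException:
            continue  # dead or overloaded proxy -- try the next one
    raise RuntimeError("All proxies in the pool failed")

response = fetch_with_rotation("https://example.com", free_proxies)
print(response.status_code)
```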
This is where you can use a proxy web scraping service, which gives you a set number of IP addresses in a location of your choice. HTTP proxies are classified into the following types, depending on the level of anonymity they provide: transparent, anonymous and elite proxies.
A transparent proxy surrenders all of your information to the server of the web page you're trying to access. Despite using a proxy IP, your real IP address gets passed along to the server in the HTTP request's headers. In other words, it provides zero anonymity. It may help you get around simple IP bans, but don't bank on it getting you far.
An anonymous proxy identifies itself to the server as a proxy, but doesn't disclose your IP address. It's detectable, but at least it provides a layer of anonymity.
An elite proxy does not notify the server that a proxy is being used, and it doesn't pass your IP address to the server either. Therefore, it's the best solution for web scraping securely.
Some ready-made web scraping software tools also offer IP address rotation, so make sure to check for this option when selecting the tools for your web scraping project.
A slight grey area of web scraping is user-agent rotation. Just as with IP address rotation, you can periodically change the user agent sent to the web server. Some people choose to spoof the whole user agent, making it appear that a human user is browsing the page. However, that might not be the best practice if you're really trying to stay within the bounds of the law.
A great middle ground is to provide a regular user agent and append your contact information to the end of it. This way you can still rotate user agents while also providing a channel for open communication with the sysadmin.
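A rough sketch of this approach, assuming the requests library; the user-agent strings and the contact details are placeholders you would replace with your own:

```python
import random

import requests

# Contact details appended to every user agent so the sysadmin can reach you.
CONTACT = "(+https://example.com/bot; scraping@example.com)"

# A few example browser user-agent strings to rotate through.
user_agents = [
    f"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 {CONTACT}",
    f"Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0 {CONTACT}",
    f"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 {CONTACT}",
]

# Pick a different user agent for each request.
headers = {"User-Agent": random.choice(user_agents)}
response = requests.get("https://example.com", headers=headers)
print(response.request.headers["User-Agent"])
```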
Some websites go above and beyond to avoid being scraped and to block all automated data collection. One way of catching a web scraper is a 'honey pot': an HTML link that isn't visible to a human user but can still be found by a web scraper, usually because the element is hidden with CSS such as display: none.
Avoiding 'honey pots' and other hidden-link trickery is a skill that takes time to master. The 'honey pot' concept itself is nothing revolutionary or hard to grasp. Just don't select the elements that aren't displayed, right? Yes, but there's another thing to keep in mind: if a developer has included hidden links, what stops them from adding prepopulated hidden form fields that need to be sent along with your request for it to be processed as valid?
A web scraper needs to be built with the website viewed as a whole. We can't simply ignore all hidden fields; some of them matter and some don't.
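As a starting point, a scraper built on requests and BeautifulSoup could filter out links hidden with inline CSS, as in the sketch below. Keep in mind this only catches inline styles; links hidden through classes and external stylesheets would require inspecting the page's CSS as well:

```python
import requests
from bs4 import BeautifulSoup

response = requests.get("https://example.com")
soup = BeautifulSoup(response.text, "html.parser")

def is_hidden(tag):
    """Very rough check for elements hidden with inline CSS."""
    style = (tag.get("style") or "").replace(" ", "").lower()
    return "display:none" in style or "visibility:hidden" in style

# Collect only the links a human user could actually see and click.
visible_links = [
    a["href"] for a in soup.find_all("a", href=True) if not is_hidden(a)
]
print(visible_links)
```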
Finally, we come to cookies and CAPTCHAs. With the e-commerce sector and digital marketing becoming a vital part of running a profitable business, cookies have earned their time in the spotlight. A cookie is a token by which the web server remembers the state of a user's browser session, or, in simpler terms, it tracks user activity.
Therefore, one way of not getting blocked is to use persistent cookies in your web scraper. This way it will appear that the requests are all made within one session. Not to mention that you can even speed up your web scraper by reusing the same TCP (Transmission Control Protocol) connection. A plus for browser plugins is that they scrape within your browser's session, so cookies are taken care of.
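With the requests library, a Session object gives you both benefits, persistent cookies and connection reuse, in one go. Here's a minimal sketch with placeholder URLs and a placeholder user agent:

```python
import requests

# A single Session keeps cookies between requests and reuses the underlying
# TCP connection, so the server sees one continuous visit.
with requests.Session() as session:
    session.headers.update(
        {"User-Agent": "MyScraper/1.0 (scraping@example.com)"}
    )

    first = session.get("https://example.com/")
    print(session.cookies.get_dict())  # cookies set by the first response

    # Subsequent requests automatically send those cookies back.
    second = session.get("https://example.com/products")
    print(second.status_code)
```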
CAPTCHA, or Completely Automated Public Turing test to tell Computers and Humans Apart, is served to a user once their activity becomes 'suspicious'. This activity is tracked by the aforementioned session cookies. Activities that will trigger a CAPTCHA include, but aren't limited to, using too many devices to log into one account, not rendering the site, and sending requests too quickly.
Working around CAPTCHAs is even trickier. One of the more reliable ways of solving simple text-based ones is with an Optical Character Recognition (OCR) engine. One of the more accurate OCR engines is Tesseract, which started out as proprietary software but has since been sponsored by Google and released for the open source community to use and develop.
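For a plain text-based CAPTCHA image, a minimal sketch using the pytesseract wrapper might look like this. It assumes the Tesseract binary plus the pytesseract and Pillow packages are installed, and the file name is a placeholder; puzzle-style CAPTCHAs such as reCAPTCHA won't be solved this way:

```python
from PIL import Image
import pytesseract

# Placeholder path to a saved CAPTCHA image.
captcha = Image.open("captcha.png")

# Converting to greyscale (and thresholding, for noisier images) usually
# improves recognition accuracy.
captcha = captcha.convert("L")

text = pytesseract.image_to_string(captcha)
print(text.strip())
```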
The last resort for dealing with CAPTCHAs is a CAPTCHA solving service. These services combine automated CAPTCHA solving software with human CAPTCHA solvers for the most advanced challenges.
Make sure to check out FindDataLab's 10 tips for web scraping to learn in more detail how to avoid getting blocked, among other tips.