First, we need to think about web scraper request delays, as well as request randomization. These are fairly easy to implement if you have a custom web scraping script. For example, you can use the sleep function from Python's time module to space out your requests.
Request randomization comes in when we want the web scraper to appear more human-like in its browsing habits. Think about how you browse the web: you click on things, scroll and perform other actions at irregular intervals, not like clockwork. So, in order not to look like suspicious activity in the web server's logs, we can use Python's random module and make the web scraper pause for a random amount of time between requests.
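As a rough sketch, the two ideas can be combined in just a few lines. The URL list and the 2 to 8 second range below are placeholder values you would tune for your own project:

```python
import random
import time

import requests

# Placeholder list of URLs to scrape -- replace with your own targets.
urls = [
    "https://example.com/page/1",
    "https://example.com/page/2",
    "https://example.com/page/3",
]

for url in urls:
    response = requests.get(url)
    print(url, response.status_code)

    # Pause for a random interval (here 2-8 seconds) so the request
    # pattern doesn't look like a fixed-rate machine.
    time.sleep(random.uniform(2, 8))
```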
All in all, spacing out requests is important so that you don't overload the server and inadvertently cause damage to the website. You therefore have to be especially careful when selecting ready-made web scraping software, as it may not have this feature.
IP address rotation is one of the cornerstones of an unblockable web scraper. In addition to randomizing your request rate, you should also make sure the server doesn't see a single IP address browsing 10,000 pages of a website. IP address blacklisting is one of the most basic anti-scraping measures a server can implement, and it usually kicks in when many requests arrive one after another with little time in between.
Take, for example, a server log where one IP address is sending one request per second. Would you block it? Of course, since no human user browses the web like that. Usually this happens automatically, without anyone analyzing the user agent (where you might have identified yourself and provided contact information).
Because of this, we need the web scraper to randomize its requests and to change IP addresses. It's not about tricking the webmaster into thinking we're not a web scraper; it's about not getting banned automatically.
You can build a simple IP address rotation tool by using free proxy IP addresses available on the web. However, the price you pay for a free IP address is that the connection will be painfully slow, with many of the addresses timing out before getting a response from the web server. This is due to the sheer number of people rerouting their requests through these IP addresses.
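A minimal sketch of such a rotation tool might look like the following, assuming the requests library and a hand-collected pool of free proxies. The addresses below are placeholders from documentation IP ranges, and in practice many free proxies will be dead or overloaded, hence the retry loop:

```python
import random

import requests

# Placeholder list of free proxies -- expect many of them to be slow or dead.
free_proxies = [
    "http://203.0.113.10:8080",
    "http://198.51.100.24:3128",
    "http://192.0.2.55:8000",
]

def fetch_with_rotation(url, proxies, timeout=10):
    """Try proxies from the pool in random order until one responds."""
    for proxy in random.sample(proxies, len(proxies)):
        try:
            return requests.get(
                url,
                proxies={"http": proxy, "https": proxy},
                timeout=timeout,
            )
        except requests.RequestException:
            continue  # dead or overloaded proxy -- try the next one
    raise RuntimeError("All proxies in the pool failed")

response = fetch_with_rotation("https://example.com", free_proxies)
print(response.status_code)
```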
This is where you can use a proxy web scraping service, which gives you a set number of IP addresses in a location of your choice. HTTP proxies are classified into the following types, depending on the level of anonymity they provide: transparent, anonymous and elite proxies.
A transparent proxy surrenders all of your information to the server of the web page you're trying to access. Despite using a proxy IP, your real IP address gets passed along to the server in the HTTP request's headers. In other words, it provides zero anonymity. It may help you get around simple IP bans, but don't bank on it getting you far.
An anonymous proxy identifies itself to the server as a proxy, but doesn't disclose your IP address. It's detectable, but at least it provides a layer of anonymity.
An elite proxy does not notify the server that a proxy is being used, and it doesn't pass your IP address to the server either. Therefore, it's the best solution for web scraping securely.
Some ready-made web scraping software tools also offer IP address rotation, so make sure to check for this option when selecting the tools for your web scraping project.
A slight grey area of web scraping is user-agent rotation. Just as with IP address rotation, you can periodically change the user agent sent to the web server. Some people choose to spoof the whole user agent, making it appear that a human user is browsing the page. However, that might not be the best practice if you're really trying to stay within the bounds of the law.
A great middle ground is to provide a regular user agent and append your contact information to the end of it. This way you can still rotate user agents while also providing a channel for open communication with the sysadmin.
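A rough sketch of this approach, assuming the requests library; the user-agent strings and the contact details are placeholders you would replace with your own:

```python
import random

import requests

# Contact details appended to every user agent so the sysadmin can reach you.
CONTACT = "(+https://example.com/bot; scraping@example.com)"

# A few example browser user-agent strings to rotate through.
user_agents = [
    f"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 {CONTACT}",
    f"Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0 {CONTACT}",
    f"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 {CONTACT}",
]

# Pick a different user agent for each request.
headers = {"User-Agent": random.choice(user_agents)}
response = requests.get("https://example.com", headers=headers)
print(response.request.headers["User-Agent"])
```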
Some websites go above and beyond to avoid being scraped and to block all automated data collection. One way of catching a web scraper is a 'honey pot': an HTML link that isn't visible to a human user but can still be found by a web scraper, usually because the element is hidden with CSS such as display: none.
Avoiding 'honey pots' and other hidden-link trickery is a skill that takes time to master. The 'honey pot' concept itself is nothing revolutionary or hard to grasp. Just don't select the elements that aren't displayed, right? Yes, but there's another thing to keep in mind: if a developer has included hidden links, what stops them from adding prepopulated hidden form fields that need to be sent along with your request for it to be processed as valid?
A web scraper needs to be built with the website viewed as a whole. We can't simply ignore all hidden fields; some of them matter and some don't.
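As a starting point, a scraper built on requests and BeautifulSoup could filter out links hidden with inline CSS, as in the sketch below. Keep in mind this only catches inline styles; links hidden through classes and external stylesheets would require inspecting the page's CSS as well:

```python
import requests
from bs4 import BeautifulSoup

response = requests.get("https://example.com")
soup = BeautifulSoup(response.text, "html.parser")

def is_hidden(tag):
    """Very rough check for elements hidden with inline CSS."""
    style = (tag.get("style") or "").replace(" ", "").lower()
    return "display:none" in style or "visibility:hidden" in style

# Collect only the links a human user could actually see and click.
visible_links = [
    a["href"] for a in soup.find_all("a", href=True) if not is_hidden(a)
]
print(visible_links)
```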
Finally, we come to cookies and CAPTCHAs. With the e-commerce sector and digital marketing becoming a vital part of running a profitable business, cookies have earned their time in the spotlight. A cookie is a token by which the web server remembers the state of a user's browser session, or, in simpler terms, it tracks user activity.
Therefore, one way of not getting blocked is to use persistent cookies in your web scraper. This way it will appear that the requests are all made within one session. Not to mention that you can even speed up your web scraper by reusing the same TCP (Transmission Control Protocol) connection. A plus for browser plugins is that they scrape within your browser's session, so cookies are taken care of.
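With the requests library, a Session object gives you both benefits, persistent cookies and connection reuse, in one go. Here's a minimal sketch with placeholder URLs and a placeholder user agent:

```python
import requests

# A single Session keeps cookies between requests and reuses the underlying
# TCP connection, so the server sees one continuous visit.
with requests.Session() as session:
    session.headers.update(
        {"User-Agent": "MyScraper/1.0 (scraping@example.com)"}
    )

    first = session.get("https://example.com/")
    print(session.cookies.get_dict())  # cookies set by the first response

    # Subsequent requests automatically send those cookies back.
    second = session.get("https://example.com/products")
    print(second.status_code)
```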
CAPTCHA, or Completely Automated Public Turing test to tell Computers and Humans Apart, is served to a user once their activity becomes 'suspicious'. This activity is tracked by the aforementioned session cookies. Activities that will trigger a CAPTCHA include, but aren't limited to, using too many devices to log into one account, not rendering the site, and sending requests too quickly.
Working around CAPTCHAs is even trickier. One of the more reliable ways of solving simple text-based ones is with an Optical Character Recognition (OCR) engine. One of the more accurate OCR engines is Tesseract, which started out as proprietary software but has since been sponsored by Google and released for the open source community to use and develop.
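For a plain text-based CAPTCHA image, a minimal sketch using the pytesseract wrapper might look like this. It assumes the Tesseract binary plus the pytesseract and Pillow packages are installed, and the file name is a placeholder; puzzle-style CAPTCHAs such as reCAPTCHA won't be solved this way:

```python
from PIL import Image
import pytesseract

# Placeholder path to a saved CAPTCHA image.
captcha = Image.open("captcha.png")

# Converting to greyscale (and thresholding, for noisier images) usually
# improves recognition accuracy.
captcha = captcha.convert("L")

text = pytesseract.image_to_string(captcha)
print(text.strip())
```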
The last resort for dealing with CAPTCHAs is a CAPTCHA solving service. These services combine automated CAPTCHA solving software with human CAPTCHA solvers for the most advanced challenges.
Make sure to check out FindDataLab's 10 tips for web scraping to learn in more detail how to avoid getting blocked, among other tips.