Web scraping is a technique used to extract information from web pages in an automated way. Translated literally, the term means something like “scraping the web”.
Applications and examples: What is web scraping for?
Its use is very clear: web scraping lets us gather huge amounts of information (big data) without typing a single word. Through search algorithms we can crawl hundreds of websites and extract only the information we need.
For this, mastering regex (regular expressions) will be very useful to narrow our searches, make them more precise and filter the information better.
Some examples for which we will need web scraping:
- For content marketing: we can design a robot that scrapes specific data from a website and use it to generate our own content. Example: scrape the statistics on the official website of a soccer league to build our own database.
- To gain visibility on social networks: we can use scraped data to interact with users on social networks through a robot. Example: create an Instagram bot that selects the link of each photo and then schedules a comment on each post.
- To monitor the image and visibility of our brand on the internet: through scraping we can automatically track the positions at which articles on our website rank on Google or, for example, monitor the presence of our brand name in certain forums. Example: track the Google ranking of all our blog posts.
Do you need help scraping information for your business?
Get in touch with us and we will give you a customized solution
How does web scraping work?
Let’s take a basic example of how a web scraper works. Imagine we want to extract the title of 400 pages that share the same format and live on the same site. On each of the 400 pages the title sits inside an h1 selector, which in turn is inside a .header class.
What our web scraper will do is target the h1 selector inside the header class (.header h1) and extract that information from each of the 400 pages. We can then export all of this data in formats such as a .json list or a .csv file.
What would take a few hours of absolute boredom and mechanical work done manually, our web scraper can do in just a couple of minutes.
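The extraction described above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library’s html.parser (real projects typically use a dedicated scraping library); the two sample pages and their titles are made up stand-ins for the 400 real ones:

```python
import json
from html.parser import HTMLParser

class TitleScraper(HTMLParser):
    """Collects the text of every <h1> found inside an element with class 'header'."""

    def __init__(self):
        super().__init__()
        self.header_depth = 0  # simplified: counts opened elements with class="header"
        self.in_h1 = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class", "").split()
        if "header" in classes:
            self.header_depth += 1
        if tag == "h1" and self.header_depth:
            self.in_h1 = True

    def handle_endtag(self, tag):
        if tag == "h1":
            self.in_h1 = False

    def handle_data(self, data):
        if self.in_h1:
            self.titles.append(data.strip())

# Two sample pages standing in for the 400 real ones.
pages = [
    '<div class="header"><h1>Page One</h1></div>',
    '<div class="header"><h1>Page Two</h1></div>',
]

titles = []
for html in pages:
    parser = TitleScraper()
    parser.feed(html)
    titles.extend(parser.titles)

print(json.dumps(titles))  # the extracted titles, ready to save as a .json list
```

The selector logic here is deliberately simplified (it never “closes” a header element), which is fine for pages with this exact structure; a production scraper would use a proper CSS-selector engine for .header h1.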
What knowledge do you need to be a good web scraper?
Web scraping is a discipline that combines two very different aspects of web knowledge, both essential for a versatile profile on the Internet. On the one hand, we must master data visualization at a conceptual level and, on the other, we must have the technical knowledge needed to extract the data accurately with specialized tools.
At the end of the day, this boils down to knowing how to manage large amounts of data (big data). We must be at least minimally familiar with visualizing large data sets in order to rank and interpret the data we extract from a website. And not only when extracting the data: when designing the extraction strategy we must already know which data we are going to extract, so that it carries informative value for the user.
There are 3 key points that we must master to be good web scrapers:
- 1. Knowledge of web layout. Web scrapers work by selecting HTML selectors, so we will need some basic knowledge of web architecture.
- 2. Knowing how to use data-visualization software such as Google’s spreadsheet processor, Google Spreadsheets, or a basic text editor such as Sublime.
- 3. Having knowledge of regex. Minimal knowledge of regex (also called regular expressions) will make our work much easier with large amounts of data, since it can save us thousands of hours of laborious work correcting or debugging the data before importing it to the desired platform.
Regular expression (regex) to narrow a search
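As a small sketch of this kind of filtering, assume we have scraped some lines from a soccer-league results page (the team names and line format here are invented for the example). A single regex both discards the noise and captures the fields we want:

```python
import re

# Hypothetical lines scraped from a soccer-league results page.
scraped = [
    "Real Madrid 3-1 Barcelona",
    "Ticket info and fixtures",   # noise we want to discard
    "Sevilla 0-0 Valencia",
]

# Named groups capture home team, both scores, and away team;
# lines that do not fit the pattern are filtered out.
pattern = re.compile(r"^(?P<home>.+?) (?P<hg>\d+)-(?P<ag>\d+) (?P<away>.+)$")

results = [m.groupdict() for line in scraped if (m := pattern.match(line))]
print(results)
```

Cleaning data this way before import is exactly the “debugging” step mentioned above: the noise line never reaches the spreadsheet.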
And after the web scraping? How to use the obtained data
Web scraping consists of obtaining the data, but obviously we will have to use that data for some purpose. This is where two key processes come into play once the data is obtained:
Data hierarchization, ordering and filtering. Often, when we extract huge amounts of data, we will have to ‘work’ this data carefully before importing it to another platform, in order to clean it for import.
Importing the data to another platform. Importing the data is another basic process. There are highly recommended tools for working with platforms such as WordPress, like the WP Ultimate CSV Importer plugin from the Smack Coders development studio (they also offer a paid version, Ultimate CSV Importer Pro).
Google Spreadsheets sheet with data extracted with a web scraper from the laliga.es and uefa.com websites. The data is ready to be imported into a website running the WordPress CMS through WP Ultimate CSV Importer.
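Producing such an import-ready file can be as simple as writing the cleaned rows out as CSV. In this sketch the column names are an assumption for illustration; the real headers must match whatever fields your importer plugin expects:

```python
import csv
import io

# Rows already cleaned after scraping. The column names here are
# hypothetical examples, not the plugin's required schema.
rows = [
    {"post_title": "Matchday 1 stats", "post_content": "Goals: 12, Cards: 3"},
    {"post_title": "Matchday 2 stats", "post_content": "Goals: 9, Cards: 5"},
]

# Using an in-memory buffer for the demo; swap it for
# open("posts.csv", "w", newline="") to write a real file.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["post_title", "post_content"])
writer.writeheader()
writer.writerows(rows)

print(buffer.getvalue())
```

The resulting file can then be uploaded through the importer’s admin screen like any hand-made CSV.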
What tools are there to do web scraping?
I would definitely opt for two main tools: webscraper.io and import.io. A third tool would be Scrapy.org. Here is a brief description of each tool along with a short assessment of who it will be useful for:
Webscraper.io. It is a plugin for Google’s browser, Chrome. From my point of view, it is the tool you can get the most juice out of, although to use it you need minimal knowledge of web layout to correctly identify the HTML selectors and, in some cases, some notions of regex (regular expressions) will also help to formulate the ‘scrape’ commands well.
It is a tool recommended for users with some knowledge of programming and web layout.
Import.io. It can be used from the web control panel for basic scraping, although for more complex operations you need to download the program. The program is nothing more than a browser built on the free Chromium software (Chrome’s engine), specially modified for web scraping. It is an easy-to-use tool, which means you do not need specific programming knowledge to start experimenting with it. It offers many options, although the freedom to program scrapes is greater in webscraper.io.
It can be used by all types of users as long as they are familiar with basic web concepts and with data-visualization tools such as Excel and Google Spreadsheets.
Scrapy.org. It is a tool that works with the Python programming language. To use it, obviously, you must have advanced knowledge of Python programming. From my point of view it is quite a complex tool to handle and not at all open to everyone. Another handicap is that if you want to use the extracted data in spreadsheets, the process becomes rather complicated.
It is a tool 100% designed for programmers with advanced knowledge of Python and for projects that do not require much data-visualization work with the ‘scraping’ results (that is, without using more visual tools such as spreadsheets).
Web scraping as a substitute for an API
An API (Application Programming Interface) is a tool that allows several websites to exchange data. Let’s say we run a sports newspaper that has a section with statistics on soccer matches.
These data are not, in principle, filled in manually after each game. What sports newspaper websites tend to do is connect to companies that run data centers. These companies are dedicated to providing access to this data through APIs, and an API of this type will normally be a paid service.
With a regularly scheduled web scraper we can achieve the same result and keep the data on our website up to date. In fact, import.io already offers a service that turns the web scraper into an API. Of course, unlike a real API, it will not be a real-time process.
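The “regularly scheduled scraper” idea can be sketched with the standard library’s sched module. Here fetch_stats is a hypothetical placeholder for the real scraping job, and the demo uses a tiny interval and only three iterations so it finishes immediately; a real deployment would run the job hourly under cron or systemd:

```python
import sched
import time

def fetch_stats():
    """Placeholder for the real scraping job (e.g. extracting match statistics)."""
    return {"updated": time.strftime("%H:%M:%S"), "matches": 10}

def refresh(scheduler, interval, store):
    store.append(fetch_stats())  # run the scrape
    # re-arm the job so it keeps firing at the chosen interval
    scheduler.enter(interval, 1, refresh, (scheduler, interval, store))

store = []
s = sched.scheduler(time.time, time.sleep)
s.enter(0, 1, refresh, (s, 0.1, store))

# Run just three iterations for the demo; a real deployment would
# call s.run() (blocking) with a sensible interval such as one hour.
for _ in range(3):
    s.run(blocking=False)
    time.sleep(0.11)
```

Each pass overwrites (here, appends) the latest snapshot, which is exactly the “API substitute” behavior: fresh data on a schedule rather than in real time.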