You may have used NodeJS as a web server, but did you know that you can also use it for web scraping? In this tutorial, we’ll review how to scrape static web pages – and those pesky ones with dynamic content – with the help of NodeJS and a few helpful NPM modules.
A Bit About Web Scraping
Web scraping has always had a negative connotation in the world of web development – and for good reason. In modern development, APIs are present for most popular services and they should be used to retrieve data rather than scraping. The inherent problem with scraping is that it relies on the visual structure of the page being scraped. Whenever that HTML changes – no matter how small the change may be – it can completely break your code.
Despite these flaws, it's important to learn a bit about web scraping and some of the tools available to help with this task. When a site does not reveal an API or any syndication feed (RSS/Atom, etc), the only option we're left with to get that content… is scraping.
Note: If you can't get the information you require through an API or a feed, it's a good sign that the owner does not want that information to be accessible. However, there are exceptions.
Why use NodeJS?
Scrapers can be written in any language, really. The reason why I enjoy using Node is because of its asynchronous nature, which means that my code is not blocked at any point in the process. I'm quite familiar with JavaScript so that's an added bonus. Finally, there are some new modules that have been written for NodeJS that make it easy to scrape websites in a reliable manner (well, as reliable as scraping can get!). Let's get started!
Simple Scraping With YQL
Let's start with the simple use-case: static web pages. These are your standard run-of-the-mill web pages. For these, Yahoo! Query Language (YQL) should do the job very well. For those unfamiliar with YQL, it's a SQL-like syntax that can be used to work with different APIs in a consistent manner.
YQL has some great tables to help developers get HTML off a page. The ones I want to highlight are the html table, the data.html.cssselect table, and the htmlstring table.
Let's go through each of them, and review how to implement them in NodeJS.
html table
The html table is the most basic way of scraping HTML from a URL. A regular query using this table looks like this:
select * from html where url="http://finance.yahoo.com/q?s=yhoo" and xpath='//div[@id="yfi_headlines"]/div[2]/ul/li/a'
This query consists of two parameters: the "url" and the "xpath". The url is self-explanatory. The xpath parameter is an XPath string that tells YQL which section of the HTML should be returned. Try this query here.
Additional parameters that you can use include browser (boolean), charset (string), and compat (string). I have not had to use these parameters, but refer to the documentation if you have specific needs.
Not comfortable with XPath?
Unfortunately, XPath is not a very popular way of traversing the HTML tree structure. It can be complicated to read and write for beginners.
Let's look at the next table, which does the same thing but lets you use CSS instead.
data.html.cssselect table
The data.html.cssselect table is my preferred way of scraping HTML off a page. It works the same way as the html table but allows you to use CSS selectors instead of XPath. In practice, this table converts the CSS to XPath under the hood and then calls the html table, so it is a little slower. The difference should be negligible for scraping needs.
A regular query using this table looks like:
select * from data.html.cssselect where url="www.yahoo.com" and css="#news a"
As you can see, it is much cleaner. I recommend you try this method first when you're attempting to scrape HTML using YQL. Try this query here.
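Since the CSS is converted to XPath behind the scenes, the query above is roughly equivalent to the following html table query (the XPath that YQL actually generates may differ slightly):

select * from html where url="www.yahoo.com" and xpath='//*[@id="news"]//a'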
htmlstring table
The htmlstring table is useful for cases where you are trying to scrape a large chunk of formatted text from a webpage.
Using this table allows you to retrieve the entire HTML content of that page in a single string, rather than as JSON that is split based on the DOM structure.
For example, a regular JSON response that scrapes an <a> tag looks like this:
"results": { "a": { "href": "...", "target": "_blank", "content": "Apple Chief Executive Cook To Climb on a New Stage" } }
See how the attributes are defined as properties? Instead, the response from the htmlstring table would look like this:
"results": { "result": { "<a href=\"…\" target="_blank">Apple Chief Executive Cook To Climb on a New Stage</a> } }
So, why would you use this? Well, from my experience, it comes in handy when you're trying to scrape a large amount of formatted text. For example, consider the following snippet:
<p>Lorem ipsum <strong>dolor sit amet</strong>, consectetur adipiscing elit.</p>
<p>Proin nec diam magna. Sed non lorem a nisi porttitor pharetra et non arcu.</p>
By using the htmlstring table, you are able to get this HTML as a string, and use regex to remove the HTML tags, which leaves you with just the text. This is an easier task than iterating through JSON that has been split into properties and child objects based on the DOM structure of the page.
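For illustration, here's a rough sketch of that cleanup in plain JavaScript (a simple tag-stripping regex is fine for markup this simple, though it's not a general-purpose HTML parser):

//The kind of string the htmlstring table would hand back for the snippet above
var html = '<p>Lorem ipsum <strong>dolor sit amet</strong>, consectetur adipiscing elit.</p>' +
           '<p>Proin nec diam magna. Sed non lorem a nisi porttitor pharetra et non arcu.</p>';

//Replace every tag with a space, then collapse the leftover whitespace
var text = html.replace(/<[^>]+>/g, ' ').replace(/\s+/g, ' ').trim();

console.log(text); //just the text, with the HTML tags removed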
Using YQL with NodeJS
Now that we know a little bit about some of the tables available to us in YQL, let's implement a web scraper using YQL and NodeJS. Fortunately, this is really simple, thanks to the node-yql module by Derek Gathright.
We can install the module using npm:
npm install yql
The module is extremely simple, consisting of only one method: the YQL.exec() method. It is defined as the following:
function exec (string query [, function callback] [, object params] [, object httpOptions])
We can use it by requiring it and calling YQL.exec(). For example, let's say we want to scrape the headlines from all the posts on the Nettuts main page:
var YQL = require("yql");

new YQL.exec('select * from data.html.cssselect where url="http://net.tutsplus.com/" and css=".post_title a"', function(response) {
    //response consists of JSON that you can parse
});
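What exactly is in that JSON depends on the page, but based on the html table response shown earlier, the matched anchors typically come back under response.results.a, each with its attributes as properties and its text as content. Here's a hedged sketch of parsing it (the exact shape can vary from page to page):

var YQL = require("yql");

new YQL.exec('select * from data.html.cssselect where url="http://net.tutsplus.com/" and css=".post_title a"', function(response) {
    var links = response.results && response.results.a;

    if (!links) {
        return console.log("No results returned");
    }

    //A single match comes back as an object rather than an array, so normalize it
    if (!Array.isArray(links)) {
        links = [links];
    }

    links.forEach(function(link) {
        console.log(link.content + " -> " + link.href); //headline text and its URL
    });
});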
The great thing about YQL is its ability to test your queries and determine what JSON you are getting back in real-time. Go to the console to try this query out, or click here to see the raw JSON.
The params and httpOptions objects are optional. Parameters can contain properties such as env (whether you are using a specific environment for the tables) and format (xml or json). All properties passed into params are URI-encoded and appended to the query string. The httpOptions object is passed into the header of the request. Here, you can specify whether you want to enable SSL, for instance.
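As a hedged illustration of how those two arguments might be passed (the exact option names that node-yql accepts are an assumption here, so check its documentation):

var YQL = require("yql");

var query = 'select * from data.html.cssselect where url="www.yahoo.com" and css="#news a"';

//params: URI-encoded and appended to the query string. The env below points at
//the community tables (where data.html.cssselect lives); format can be "json" or "xml".
var params = {
    env: "store://datatables.org/alltableswithkeys",
    format: "json"
};

//httpOptions: passed along with the request. The property name used to enable SSL
//here is an assumption; check the node-yql documentation for the exact option.
var httpOptions = {
    ssl: true
};

new YQL.exec(query, function(response) {
    console.log(response.results);
}, params, httpOptions);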
The JavaScript file, named yqlServer.js, contains the minimal code required to scrape using YQL. You can run it by issuing the following command in your terminal:
node yqlServer.js
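It is essentially the require/exec snippet from earlier saved to its own file; a minimal sketch (the actual file in the download may differ slightly) looks like this:

//yqlServer.js - a minimal YQL-based scraper
var YQL = require("yql");

new YQL.exec('select * from data.html.cssselect where url="http://net.tutsplus.com/" and css=".post_title a"', function(response) {
    console.log(response.results); //the scraped headlines, as JSON
});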
Exceptions and other notable tools
YQL is my preferred choice for scraping content off static web pages, because it's easy to read and easy to use. However, YQL will fail if the web page in question has a robots.txt file that denies YQL access to it. In this case, you can look at some of the utilities mentioned below, or use PhantomJS, which we'll cover in the following section.
Node.io is a useful Node utility that is specifically designed for data scraping. You can create jobs that take input, process it, and return some output. Node.io is well-watched on GitHub, and has some helpful examples to get you started.
JSDOM is a very popular project that implements the W3C DOM in JavaScript. When supplied HTML, it can construct a DOM that you can interact with. Check out the documentation to see how you can use JSDOM and any JS library (such as jQuery) together to scrape data from web pages.
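As a rough sketch of that approach (using the jsdom.env() helper available at the time of writing; the example URL and selector are just for illustration):

var jsdom = require("jsdom");

//Build a DOM from the page, load jQuery into it, then scrape with familiar selectors
jsdom.env(
    "http://net.tutsplus.com/",
    ["http://ajax.googleapis.com/ajax/libs/jquery/1.7.2/jquery.min.js"],
    function(errors, window) {
        var $ = window.$;
        $(".post_title a").each(function() {
            console.log($(this).text());
        });
    }
);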
Scraping Pages With Dynamic Content
So far, we've looked at some tools that can help us scrape web pages with static content. With YQL, it's relatively easy. Unfortunately, we are often presented with pages that have content which is loaded dynamically with JavaScript. In these cases, the page is often empty initially, and then the content is appended afterwards. How can we deal with this issue?
An Example
Let me provide an example of what I mean; I have uploaded a simple HTML file to my own website, which appends some content, via JavaScript, two seconds after the document.ready() function is called. You can check out the page here. Here's what the source looks like:
<!DOCTYPE html>
<html>
<head>
    <title>Test Page with content appended after page load</title>
</head>
<body>
    Content on this page is appended to the DOM after the page is loaded.
    <div id="content">
    </div>

    <script src="http://ajax.googleapis.com/ajax/libs/jquery/1.7.2/jquery.min.js"></script>
    <script>
        $(document).ready(function() {
            setTimeout(function() {
                $('#content').append("<h2>Article 1</h2><p>Lorem ipsum dolor sit amet, consectetur adipiscing elit.</p><h2>Article 2</h2><p>Ut sed nulla turpis, in faucibus ante. Vivamus ut malesuada est. Curabitur vel enim eget purus pharetra tempor id in tellus.</p><h2>Article 3</h2><p>Curabitur euismod hendrerit quam ut euismod. Ut leo sem, viverra nec gravida nec, tristique nec arcu.</p>");
            }, 2000);
        });
    </script>
</body>
</html>
Now, let's try scraping the text inside the <div id="content"> using YQL.
var YQL = require("yql");

new YQL.exec('select * from data.html.cssselect where url="http://tilomitra.com/repository/screenscrape/ajax.html" and css="#content"', function(response) {
    //This will return undefined! The scraping was unsuccessful!
    console.log(response.results);
});
You'll notice that YQL returns undefined because, when the page is loaded, the <div id="content"> is empty. The content has not been appended yet. You can try the query out for yourself here.
Let's look at how we can get around this issue!
Enter PhantomJS
My preferred method for scraping information from these sites is to use PhantomJS. PhantomJS describes itself as a "headless Webkit with a JavaScript API." In simplistic terms, this means that PhantomJS can load web pages and mimic a Webkit-based browser without the GUI. As developers, we can call specific methods that PhantomJS provides to execute code on the page. Since it behaves like a browser, scripts on the webpage run as they would in a regular browser.
To get data off our page, we are going to use PhantomJS-Node, a great little open-source project that bridges PhantomJS with NodeJS. Under the hood, this module runs PhantomJS as a child process.
Installing PhantomJS
Before you can install the PhantomJS-Node NPM module, you must install PhantomJS. Installing and building PhantomJS can be a little tricky, though.
First, head over to PhantomJS.org and download the appropriate version for your operating system. In my case, it was Mac OSX.
After downloading, unzip it to somewhere such as /Applications/. Next, you want to add it to your PATH:
sudo ln -s /Applications/phantomjs-1.5.0/bin/phantomjs /usr/local/bin/
Replace 1.5.0 with your downloaded version of PhantomJS. Be advised that not all systems will have /usr/local/bin/. Some systems will have /usr/bin/, /bin/, or /usr/X11/bin instead.
For Windows users, check the short tutorial here. You'll know you're all set up when you can open your Terminal, type phantomjs, and not get any errors.
If you are uncomfortable editing your PATH, make a note of where you unzipped PhantomJS and I'll show another way of setting it up in the next section, although I recommend you edit your PATH.
Installing PhantomJS-Node
Setting up PhantomJS-Node is much easier. Provided you have NodeJS installed, you can install via npm:
npm install phantom
If you did not edit your PATH in the previous step when installing PhantomJS, you can go into the phantom/ directory pulled down by npm and edit this line in phantom.js:
ps = child.spawn('phantomjs', args.concat([__dirname + '/shim.js', port]));
Change the path to:
ps = child.spawn('/path/to/phantomjs-1.5.0/bin/phantomjs', args.concat([__dirname + '/shim.js', port]));
Once that is done, you can test it out by running this code:
var phantom = require('phantom');

phantom.create(function(ph) {
    return ph.createPage(function(page) {
        return page.open("http://www.google.com", function(status) {
            console.log("opened google? ", status);
            return page.evaluate((function() {
                return document.title;
            }), function(result) {
                console.log('Page title is ' + result);
                return ph.exit();
            });
        });
    });
});
Running this on the command-line should bring up the following:
opened google?  success
Page title is Google
If you got this, you're all set and ready to go. If not, post a comment and I'll try to help you out!
Using PhantomJS-Node
To make it easier for you, I've included a JS file, called phantomServer.js, in the download that uses some of PhantomJS' API to load a webpage. It waits for 5 seconds before executing JavaScript that scrapes the page. You can run it by navigating to the directory and issuing the following command in your terminal:
node phantomServer.js
I'll give an overview of how it works here. First, we require PhantomJS:
var phantom = require('phantom');
Next, we implement some methods from the API. Namely, we create a page instance and then call the open() method:
phantom.create(function(ph) {
    return ph.createPage(function(page) {
        //From here on in, we can use PhantomJS' API methods
        return page.open("http://tilomitra.com/repository/screenscrape/ajax.html", function(status) {
            //The page is now open
            console.log("opened site? ", status);
        });
    });
});
Once the page is open, we can inject some JavaScript into the page. Let's inject jQuery via the page.injectJs() method:
phantom.create(function(ph) {
    return ph.createPage(function(page) {
        return page.open("http://tilomitra.com/repository/screenscrape/ajax.html", function(status) {
            console.log("opened site? ", status);
            page.injectJs('http://ajax.googleapis.com/ajax/libs/jquery/1.7.2/jquery.min.js', function() {
                //jQuery Loaded
                //We can use things like $("body").html() in here.
            });
        });
    });
});
jQuery is now loaded, but we don't know whether the dynamic content on the page has loaded yet. To account for this, I usually put my scraping code inside a setTimeout() function that executes after a certain time interval. If you want a more dynamic solution, the PhantomJS API lets you listen for and emulate certain events. Let's go with the simple case:
setTimeout(function() {
    return page.evaluate(function() {
        //Get what you want from the page using jQuery.
        //A good way is to populate an object with all the jQuery commands that you need and then return the object.
        var h2Arr = [], //array that holds all html for h2 elements
            pArr = [];  //array that holds all html for p elements

        //Populate the two arrays
        $('h2').each(function() {
            h2Arr.push($(this).html());
        });
        $('p').each(function() {
            pArr.push($(this).html());
        });

        //Return this data
        return {
            h2: h2Arr,
            p: pArr
        };
    }, function(result) {
        console.log(result); //Log out the data.
        ph.exit();
    });
}, 5000);
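If you'd rather not hard-code the delay, one rough alternative is to poll the page until the content shows up, using only page.evaluate(), which we've already seen (the selector below is specific to my demo page):

//Check every half-second whether the dynamic content has been appended yet
var poller = setInterval(function() {
    page.evaluate(function() {
        return $('#content h2').length; //how many headings exist so far
    }, function(count) {
        if (count > 0) {
            clearInterval(poller);
            //...run the same scraping code as in the setTimeout() version above...
        }
    });
}, 500);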
Putting it all together, our phantomServer.js file looks like this:
var phantom = require('phantom');

phantom.create(function(ph) {
    return ph.createPage(function(page) {
        return page.open("http://tilomitra.com/repository/screenscrape/ajax.html", function(status) {
            console.log("opened site? ", status);
            page.injectJs('http://ajax.googleapis.com/ajax/libs/jquery/1.7.2/jquery.min.js', function() {
                //jQuery Loaded.
                //Wait for a bit for AJAX content to load on the page. Here, we are waiting 5 seconds.
                setTimeout(function() {
                    return page.evaluate(function() {
                        //Get what you want from the page using jQuery.
                        //A good way is to populate an object with all the jQuery commands that you need and then return the object.
                        var h2Arr = [],
                            pArr = [];

                        $('h2').each(function() {
                            h2Arr.push($(this).html());
                        });
                        $('p').each(function() {
                            pArr.push($(this).html());
                        });

                        return {
                            h2: h2Arr,
                            p: pArr
                        };
                    }, function(result) {
                        console.log(result);
                        ph.exit();
                    });
                }, 5000);
            });
        });
    });
});
This implementation is a little crude and disorganized, but it makes the point. Using PhantomJS, we are able to scrape a page that has dynamic content! Your console should output the following:
→ node phantomServer.js
opened site?  success
{ h2: [ 'Article 1', 'Article 2', 'Article 3' ],
  p: [ 'Lorem ipsum dolor sit amet, consectetur adipiscing elit.',
       'Ut sed nulla turpis, in faucibus ante. Vivamus ut malesuada est. Curabitur vel enim eget purus pharetra tempor id in tellus.',
       'Curabitur euismod hendrerit quam ut euismod. Ut leo sem, viverra nec gravida nec, tristique nec arcu.' ] }
Conclusion
In this tutorial, we reviewed two different ways for performing web scraping. If scraping from a static web page, we can take advantage of YQL, which is easy to set up and use. On the other hand, for dynamic sites, we can leverage PhantomJS. It's a little harder to set up, but provides more capabilities. Remember: you can use PhantomJS for static sites too!
If you have any questions on this topic, feel free to ask below and I'll do my best to help you out.