missionbaymedia


Scraping Pages in Node.js

in scripting, node.js, javascript, jQuery, Internet | August 18, 2016

Node with cURL and jQuery? It sounds strange to talk about cURL in the context of a Node server, and stranger still to talk about jQuery on the backend, but bear with me as I explain how to set up a fast Node structure that lets you comfortably apply your jQuery skills to backend scraping. First, to get page contents we will use curlrequest, a fast, easy-to-use Node wrapper for cURL. To install it and save it to your package.json:

	$   npm i -S curlrequest

Using cURL over Node's http.request module is somewhat a matter of preference, but I have found that using curl lets me avoid coding a bunch of low-level request functions as the scope of my projects grows, without compromising versatility. Also, curl is fast, stable, and reliable, and can use all your CPUs for non-blocking requests that combine well with Node's asynchronous nature to make for some truly fast code if you do it right. And if you find yourself (outside the scope of this tutorial) sending multipart form data, POSTing, PUTing, or otherwise interacting with RESTful APIs over the course of your project, good old cURL is there to assist.

The next piece to add is cheerio, a truly neat and blazingly fast backend implementation of the jQuery core API that is perfectly suited to Node. Yes, you read that right: backend jQuery on Node. Loading the response from a cURL GET request into cheerio, for example, creates a DOM in which you can use almost all of your familiar jQuery selectors and functions. It is worth noting that this works with HTML or XML. Loading the cheerio instance in as the variable '$' should make it look as familiar as it feels. To install it and save it to your package.json, similarly:

	$   npm i -S cheerio

Now, to glue it all together in a generalized function. As an example, let’s write a function using these two modules to return an array of all the images in the page.

var curl = require('curlrequest'), // our cURL wrapper
    cheerio = require('cheerio');  // the core jQuery API implementation

function loadExtImgs(instance, cheerio, scrapeUrl, callback) {
    instance.request(scrapeUrl, function (err, stdout, meta) {
        if (err) { return callback(err); }
        // uncomment the next line to log the curl request
        // console.log('%s %s', meta.cmd, meta.args.join(' '));
        var $ = cheerio.load(stdout);
        var imgUrls = [];
        $('img').each(function () {
            // simple join; assumes root-relative src values
            imgUrls.push(scrapeUrl + $(this).attr('src'));
        });
        callback(null, imgUrls);
    });
}

// call our function on MBM's homepage
loadExtImgs(curl, cheerio, 'http://missionbaymedia.com', function (err, imgArr) {
    if (err) { return console.error(err); }
    console.log(imgArr);
});

There are two key things to notice here. First, the line where we set $ to cheerio.load(stdout), parsing a DOM from our curl response and assigning it to a variable that should make the subsequent jQuery work look familiar. Second, the use of standard jQuery selectors and functions, like the $('img').each() used here. Programmatically selecting and manipulating DOM elements on the backend has never been easier. The power and speed of the stripped-down cheerio implementation of the jQuery API should open up a lot of options for your project.
