Web Scraping NY Times with Node.js

Whether you’re a web developer, business analyst, marketer, or media guru, web scraping can benefit your business. Web scraping (or harvesting) is a technique for extracting large amounts of data from websites into a local file or database. This post adapts a tutorial from scotch.io, using node.js to scrape the three latest news stories from the NY Times business page. You can find the code for the project on GitHub.

First, what is node.js? Even if you’re an experienced developer, it’s normal to be a little confused about what this technology actually does (and the geeky descriptions found online certainly don’t help). In short, node.js is a server-side runtime built on Google’s V8 JavaScript engine – and it can do some pretty amazing things in terms of creating real-time web applications, with JavaScript running on both the client and the server. OK, that’s enough of that – there’s always the internet if you want to learn more.

Next on our agenda, why should you care about web scraping? How can it actually help your business? Here are some examples off of Quora:

  • scrape product listings from retailers or manufacturers for price comparison
  • compare real estate listings
  • scrape job ads from Applicant Tracking Systems
  • scrape news sites for custom analysis and curation
  • and more

In this tutorial, we’ll be working with the following dependencies: ExpressJS (a popular Node framework), Request (for making HTTP requests), and Cheerio (essentially jQuery for the server, used for navigating the DOM and extracting data). OK, it’s time to do the damn thang – let’s get our code on.

Open your terminal (I’m assuming you’re on a Mac; if not, open whatever console your system gives you). First, let’s check whether you already have node and npm (the package manager) installed:

$ node -v

$ npm -v

If you don’t see any versions come up, go ahead and install node using Homebrew:

$ brew install node

Now create a directory for your files; we can call it “web_scraper”. Into that folder, place the files from the GitHub link above: our core code (“server1.js”) and “package.json”, which specifies the dependencies. Then navigate to the directory and install the dependencies:

$ cd ~/path-to-your-file/web_scraper

$ npm install 
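
For reference, package.json is just a small manifest that names those dependencies, so npm install knows what to fetch. A minimal version might look something like this (the version ranges here are illustrative – the file in the repo is the source of truth):

{
  "name": "web_scraper",
  "version": "1.0.0",
  "dependencies": {
    "express": "latest",
    "request": "latest",
    "cheerio": "latest"
  }
}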

Now that housekeeping is out of the way, let’s get into the nitty gritty of our JavaScript:
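
(The original screenshot of server1.js doesn’t reproduce here, so below is a sketch of roughly what the file looks like, reconstructed from the walkthrough that follows and the scotch.io tutorial this post adapts. The URL is the business-page address as it stood at the time – check the GitHub repo for the exact code.)

var express = require('express');
var fs      = require('fs');
var request = require('request');
var cheerio = require('cheerio');
var app     = express();

app.get('/scrape', function(req, res) {
    // The URL we want to scrape -- the NY Times business page
    var url = 'https://www.nytimes.com/pages/business/index.html';

    request(url, function(error, response, html) {
        if (error) { return res.send('Something went wrong fetching the page.'); }

        // Load the HTML into cheerio so we can query it like jQuery
        var $ = cheerio.load(html);
        var json = { sectionHead: '', sectionMiddle: '', sectionEnd: '' };

        // '.scrollContent' is the class that wraps the headlines we want
        $('.scrollContent').each(function() {
            var data = $(this);
            json.sectionHead   = data.children().first().text().trim();
            json.sectionMiddle = data.children().eq(2).text().trim();
            json.sectionEnd    = data.children().eq(3).text().trim();
        });

        // Output the scraped data as a JSON file
        fs.writeFile('output1.json', JSON.stringify(json, null, 4), function(err) {
            console.log('File written! Check your project directory for output1.json');
        });

        res.send('Check your console!');
    });
});

app.listen(8081);
console.log('Magic happens on port 8081');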

Don’t fret about that mess of code above; here’s what you need to know. First we pull in our dependencies. Next, under app.get, we specify the URL we want to scrape (in this case the NY Times business page). Skipping down to fs.writeFile, that’s the instruction that outputs a JSON file. We use app.listen because the scraped content is being served by our server (in this case, our local machine). Now here’s the important part, so pay attention, folks: see $('.scrollContent')? That’s where we define the class from which we want to extract information.

Go ahead and open up the NY Times page in Chrome and inspect the area you’re interested in to see what its class is. In this case, the section we want has scrollContent as its class, so we’ll use that as our unique identifier.


So how do we specify which items to pull out? Let’s say we only want the first three headlines. Take a look at our code: for the first item we call .first(), and for subsequent items we use the zero-indexed .eq(n) selector (.eq(2), .eq(3), and so on – which indices actually hold headlines depends on the page’s markup, which is why the snippet skips .eq(1)):

sectionHead = data.children().first().text().trim();
sectionMiddle = data.children().eq(2).text().trim();
sectionEnd = data.children().eq(3).text().trim();

Once you’ve isolated the section of the page you want to extract and updated the code accordingly, open up the terminal again and, from your project directory, run:

$ node server1.js

Then point your web browser at our server: http://localhost:8081/scrape

If all goes well, you should see the output file “output1.json” in your directory, looking something like this:

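(The screenshot of the output doesn’t reproduce here, and the actual headlines will depend on what’s on the page when you run the scrape, but the file’s shape follows the three fields we set in the code – placeholder values shown:)

{
    "sectionHead": "<first headline>",
    "sectionMiddle": "<second headline>",
    "sectionEnd": "<third headline>"
}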

BAM! Pretty cool, right? Needless to say, this is only the tip of the iceberg. If you really want to bring this type of scraping into an application, you’ll have to automate the process and then feed your JSON data back into a web page. You might think about running a cron job (cron is a Linux utility for scheduling a command or script on your server to run at a specified time and date).
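
For instance, a crontab entry like this one (illustrative – it assumes the node server is already running on port 8081 and that curl is available) would hit our /scrape endpoint at the top of every hour:

0 * * * * curl -s http://localhost:8081/scrape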

OK nerds, hope that this helped in some way as either 1) an intro to node.js or 2) an intro to web scraping. Stay tuned!
