Whether you’re a web developer, business analyst, marketer, or media guru, web scraping can benefit your business. Web scraping, or harvesting, is a technique for extracting large amounts of data from websites into a local file or database. This post adapts a tutorial from scotch.io to use node.js to scrape the top three latest news stories from the NY Times business page. You can find the code for the project at GitHub.
Next on our agenda: why should you care about web scraping? How can it actually help your business? Here are some examples from Quora:
- scrape product listings from retailers or manufacturers for price comparison
- compare real estate listings
- scrape job ads from applicant tracking systems
- scrape news sites for custom analysis and curation
- and more
In this tutorial, we’ll be working with the following dependencies: ExpressJS (a popular Node framework), Request (for making HTTP requests), and Cheerio (jQuery for the server, for traversing the DOM and extracting data). OK, it’s time to do the damn thang – let’s get our code on.
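In package.json form, those dependencies look something like this (the version ranges here are illustrative, not pinned to what the GitHub repo actually uses):

```json
{
  "name": "web_scraper",
  "version": "1.0.0",
  "dependencies": {
    "express": "^4.0.0",
    "request": "^2.0.0",
    "cheerio": "^0.22.0"
  }
}
```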
Open your terminal (I’m assuming you’re on a Mac; if not, then open up the console, losers). First, let’s check if you already have node and npm (the package management system) installed:
$ node -v
$ npm -v
If you don’t see any version numbers come up, go ahead and install node using Homebrew:
$ brew install node
Now create a directory for your files. We can call it “web_scraper”. Grab the files from the GitHub link above, our core code (“server1.js”) and “package.json” (which specifies the dependencies), and place them in the folder. Now navigate to your directory and install the dependencies:
$ cd ~/path-to-your-file/web_scraper
$ npm install
Don’t fret, here’s what you need to know about that mess of code. First we require our dependencies. Next, under app.get, we specify the URL we want to scrape (in this case the NY Times business page). If you skip down to fs.writeFile, that’s the call that writes our results out to a JSON file. We call app.listen to start the server (in this case on our local machine) so it can respond to requests. Now here’s the important part, so pay attention, folks. See $('.scrollContent')? That’s where we define the class from which we want to extract information.
Go ahead and open up the NY Times page in Chrome and inspect the area to see what your class is. As you can see, in this case the section we’re interested in has scrollContent as its class, so we’ll use that as our unique identifier.
So how do we specify which items to pull out? Let’s say we only want the first three headlines. Take a look at our code: for the first item we call first(), and for subsequent items we use the format eq(2), eq(3), eq(4), etc. (note that eq() is zero-indexed, so first() is equivalent to eq(0)):
sectionHead = data.children().first().text().trim();
sectionMiddle = data.children().eq(2).text().trim();
sectionEnd = data.children().eq(3).text().trim();
Once you’ve isolated the section of the page you want to extract from and have updated the code accordingly, open up the terminal again and in your directory, write:
$ node server1.js
Point your web browser at our server: http://localhost:8081/scrape
If all goes well you should see the output file “output1.json” in your directory, looking something like this:
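The keys below follow the variable names used in the code; the headline text is made up and will of course be whatever is on the page the day you run it:

```json
{
    "sectionHead": "First headline text",
    "sectionMiddle": "Second headline text",
    "sectionEnd": "Third headline text"
}
```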
BAM! Pretty cool, right? Now, needless to say, this is only the tip of the iceberg. If you really want to bring this type of scraping into an application, you’ll have to automate the process and then take your JSON data and feed it back into a web page. You might think about running a cron job (cron is a utility on Linux and other Unix-like systems for scheduling a command or script on your server to run at a specified time and date).
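For example, a crontab entry (added via crontab -e) that hits the scrape endpoint every hour might look like this, assuming the node server is already running on port 8081:

```shell
# Hit the scrape endpoint at the top of every hour
0 * * * * curl -s http://localhost:8081/scrape > /dev/null 2>&1
```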
OK nerds, hope that this helped in some way as either 1) an intro to node.js or 2) an intro to web scraping. Stay tuned!