Written on 29 Nov 2017
Scraping web pages is a messy, error-prone, and brittle way to get data off the internet, but sometimes it is all you have. I have written a few scrapers and have always wondered what a good scraper setup might look like. In an attempt to scrape as many Gothamist articles as I could while the site was down, I came up with a solution that I really liked, using Docker, Node, and open-source tools.
My traditional scraping approach has been something like:
- Inspect some HTML in the browser and find just the right selectors for the data that I want. “Oh, this should be really easy since the HTML is nice and semantic”, I (sometimes) say out loud.
- Write some simple code that retrieves the HTML, parses it, extracts just that data from the page, and saves it to a file/database/something else
- Feel confident in my ability to boss around computers and run the scraping script with all 30,000 or whatever URLs. Seeing as how this might take a while, I get up, stretch, and go to my fridge to see if there is any cake
- Be disappointed that I already ate all the cake and come back to my computer, it having whizzed through 406 URLs until it came across something malformed and crashed because it couldn't find `.container article p:first-of-type span`. Oof.
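That "one big loop" approach can be sketched in a few lines. This is a toy version, not my actual scraper: `extractArticle` stands in for real selector logic (something like cheerio), using a naive regex just to show the failure mode, and the pages are made up.

```javascript
// A toy version of the traditional scraper loop: extract, collect, crash.
// extractArticle is a hypothetical stand-in for real selector logic.
function extractArticle(html) {
  const match = html.match(/<article>([\s\S]*?)<\/article>/);
  if (!match) {
    // One malformed page and the whole run dies here.
    throw new Error("couldn't find <article> in page");
  }
  return match[1].trim();
}

const pages = [
  "<html><body><article>Post one</article></body></html>",
  "<html><body><article>Post two</article></body></html>",
  "<html><body><div>No article element here!</div></body></html>",
];

const results = [];
let crashedAt = null;
try {
  for (const html of pages) {
    // No per-page error handling: the third page kills the loop,
    // losing any progress that wasn't already persisted.
    results.push(extractArticle(html));
  }
} catch (err) {
  crashedAt = results.length; // index of the page that ended the run
}
```

After the crash, `results` holds only the first two pages, which is exactly the "now slice the list and rerun" situation described above.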
At this point, I have to spend a bit more time making the parser logic conditional, slicing the initial data array so as not to re-run the successfully scraped URLs, and hoping that there aren't more than two different layouts I'm dealing with.
I don't really like this approach for a few reasons, beyond the frustrating mechanics I just described:
- It is hard to be sure that you are scraping all the URLs and not missing a few in between all the initial data slicing and index fudging.
- You can't really make it go much faster, because the script isn't written to be parallelized; there's just one scraper. You could split up the initial list and run two processes with the same script, but now you've got a bunch more accounting to make sure nothing slips through the cracks.
- Things can go wrong after you have parsed the HTML but before you have persisted the data somehow (like by writing it to a file), and if you don't store an intermediate state, you might find yourself having to make a few hundred HTTP requests all over again.
- Sometimes you just want to be able to close the computer or restart it, and it can be cumbersome to stop the scraping script and pick back up right where it left off.
Using a system with queues and workers, made extremely portable and powerful by Docker, I think I have overcome most of these frequent issues I have had with my past scrapers. Queues are great because they let you set up as many different data streams as you need to model your scraping process (and its error states), and as many workers as you can to process the data, all while being pretty fault tolerant and easy to shut down and restart.
Queues are not great because they generally require a database of some sort (I used a queue based on Redis), which, I have found, is a pain in the ass to install, configure, and maintain. However, with a tool like Docker, and docker-compose, it is really easy to get the database you want set up and running. On top of that, it is really easy to run your queue and scale your workers up quickly, but we'll get to that.
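The compose file for this kind of setup can be quite small. This is an illustrative sketch, not the exact file from my project; service names, image tags, and script names are all assumptions.

```yaml
# Hypothetical docker-compose.yml: a Redis box for the queue, a seeder
# that loads the initial jobs, and a worker service you can scale up.
version: "3"
services:
  redis:
    image: redis:alpine
    volumes:
      - redis-data:/data    # persist queue state across restarts
  seeder:
    build: .
    command: node seed.js   # puts the initial jobs on the queue
    depends_on:
      - redis
  worker:
    build: .
    command: node worker.js # pulls jobs off the queue and processes them
    depends_on:
      - redis
volumes:
  redis-data:
```

With something like this in place, a single `docker-compose up` brings up the database and the scraper together, with no manual Redis installation.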
For the sake of brevity, I'm going to slightly oversimplify my process for scraping all the Gothamist URLs. I came up with a basic process that I thought would yield as many posts as possible. I found a snapshot of the site on the Wayback Machine, and found a URL that listed the articles posted by each journalist on pages of a couple dozen URLs each. Luckily, once I had the URL of an article, it was pretty easy to pull the actual content out of the page thanks to the `article` element. I wanted to write that content (HTML was fine) to a markdown file and store some of the metadata as YAML frontmatter.
To model that process, I set up a handful of queues:
- `need-post-urls` for pages of an author's articles, which would put one job on the `need-post-content` queue for every URL it scraped from the listing
- `has-no-archive` if that page didn't exist, to check out later
- `need-to-write-file` for content that had been scraped but not yet successfully saved to a file
- `has-unmatched-author` to serve as a catch-all for various error states I was running into
I realize I could have named these queues much better and would like to explore a sort of naming convention that I can use in my next scraper. But the basic idea is that I have a trough-like system where jobs start and can either end up being completed successfully or put on another queue for later handling. Implementing a catch-all queue was really helpful as it meant I didn't have to tackle all of the failed scrapes at once but could analyze them for patterns and prioritize the changes accordingly.
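The flow between queues can be sketched without any Redis at all. The snippet below uses plain in-memory arrays as a stand-in for the Redis-backed queues (the real thing would use a queue library and persist jobs), just to show the trough idea: a worker takes a job and either fans out new jobs or parks the job on an error queue instead of crashing.

```javascript
// In-memory stand-in for the Redis-backed queues. Queue names match
// the ones above; the job shapes are illustrative assumptions.
const queues = {
  "need-post-urls": [],
  "need-post-content": [],
  "need-to-write-file": [],
  "has-no-archive": [],
  "has-unmatched-author": [],
};

const enqueue = (name, job) => queues[name].push(job);

// Worker for author listing pages: one job in, many jobs out.
function processListingJob(job) {
  if (job.missing) {
    // Page wasn't archived; park the job for later instead of dying.
    enqueue("has-no-archive", job);
    return;
  }
  for (const url of job.postUrls) {
    enqueue("need-post-content", { url });
  }
}

// Seed the listing queue with one good page and one missing one.
enqueue("need-post-urls", { postUrls: ["/post/1", "/post/2"] });
enqueue("need-post-urls", { missing: true, url: "/author/gone" });

// Drain it the way a worker loop would.
while (queues["need-post-urls"].length) {
  processListingJob(queues["need-post-urls"].shift());
}
```

After the loop, the good listing has produced two content jobs and the missing page sits on `has-no-archive` waiting for later analysis, instead of halting the whole run.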
By using docker-compose to run the worker scripts separately from the scripts that used the initial data to seed the entire process, I could use the `--scale` flag to turn off the initial loading entirely (which I did every time after the first run) or to crank up the number of instances running the worker code. But the best docker-compose moment came when I decided I wanted to run the scraper on a different machine with a more stable internet connection, namely a DigitalOcean droplet I had just spun up. I pushed the repo to GitHub, pulled it down on my droplet, and was able to get the scraper up and running in minutes without having to worry about setting up a database or anything! And then I could really crank up the number of workers to make the process go even faster.
I really liked using this setup for scraping, but that doesn't mean it was perfect. Aside from having terrible queue names, I ended up getting rate limited quite a bit. Next time I do some scraping, I think I will investigate using an IP masking service or proxy. I realize that I'm essentially running a tiny botnet, and I want to respect people's bandwidth costs and the fact that protecting your site from malicious actors is more important than enabling my janky scrapers. But I think that some sort of proxy service could help me avoid rate limiting, and while I employed a backoff scheme, I'd like my scraper to run as fast as possible.
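For what it's worth, the backoff scheme can be as simple as an exponential delay with a cap. This is a sketch of the general technique rather than my exact code; the base delay, cap, and attempt limit are illustrative numbers.

```javascript
// Exponential backoff with a ceiling: 1s, 2s, 4s, 8s, ... capped at 60s.
// All parameters here are illustrative defaults, not values from my scraper.
function backoffDelay(attempt, baseMs = 1000, capMs = 60000) {
  return Math.min(baseMs * 2 ** attempt, capMs);
}

// Retry a fetch-like function until it stops returning 429 (rate limited).
// doFetch is injected so this sketch doesn't assume any HTTP library.
async function fetchWithBackoff(url, doFetch, maxAttempts = 5) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const res = await doFetch(url);
    if (res.status !== 429) return res; // not rate limited: done
    await new Promise((resolve) => setTimeout(resolve, backoffDelay(attempt)));
  }
  throw new Error(`still rate limited after ${maxAttempts} attempts`);
}
```

A worker would call `fetchWithBackoff` instead of fetching directly, so a burst of 429s slows that worker down rather than filling the error queues.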
In summary, queues are cool and pretty handy for a process as brittle as scraping web pages, but they're even cooler with Docker since I don't have to do anything to make them scalable, portable, and simple enough for me to understand. If you have a preferred way of setting up your scrapers, or strong opinions about mine, I'd love to hear about it.