Written on 29 Jan 2023
Taking stock of my "data tool belt"
Twelve software projects and companies I use all the time as a working data journalist.
I’ve been doing data work for nearly a decade and, at this point, I’ve got a set of tools that I really enjoy and that make me really productive. At least, I feel that they do.
I wrote a recent post where I gushed about Datasette, a tool I use all the time. I likened it to my "data hammer" because it is always within reach of my work.
Well, now I want to talk about the other tools I use. In part to strain the metaphor and in part because I’ve never talked about the software I use day in and day out. There’s only one criterion for something to show up on this list: I have to use it all the time.
These tools are always close at hand because they’re that useful. But I also use these tools because they feel like magic. Casting data alchemy spells with non-trivial geospatial analysis is the sort of feeling that hooks you, that makes you go “I guess computers aren’t all terrible.”
It’s a bit hyperbolic, but I love each and every software project on this list. And I’m super grateful for the hundreds of thousands of hours that have been put into them.
I’ve broken my toolset into a few categories to provide some sort of organization. Also note that most of these tools are in the Javascript ecosystem. It’s a super expressive language that jibes with my brain, for better or worse.
General-use tools
D3
Books have been written about how useful, and complicated, the D3 library is. And even though I don't use it much for visualization anymore, it has some features that I just can't live without (there's a quick sketch after this list). Namely:
- CSV parsing and serializing
- Date and time parsing
- Scales and interpolation
- Basic math like sums and means
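Here's a minimal sketch of all four, pulling from D3's modular packages; the CSV string and numbers are made up:

```js
import { csvParse } from "d3-dsv";
import { timeParse } from "d3-time-format";
import { scaleLinear } from "d3-scale";
import { sum, mean } from "d3-array";

// Parse a CSV string into an array of objects.
const rows = csvParse("date,value\n2023-01-01,10\n2023-01-02,30");

// Turn date strings into Date objects and coerce values to numbers.
const parseDate = timeParse("%Y-%m-%d");
const parsed = rows.map(d => ({ date: parseDate(d.date), value: +d.value }));

// Basic math.
const total = sum(parsed, d => d.value); // 40
const avg = mean(parsed, d => d.value);  // 20

// A linear scale that maps data values onto pixel positions.
const x = scaleLinear().domain([0, total]).range([0, 600]);
x(avg); // 300
```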
lodash
This collection of utility functions is in nearly every one of my projects. I use it to sort data, ensure uniqueness, and do basic string manipulation.
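A small example of the kind of thing I mean; the records here are invented:

```js
import { sortBy, uniqBy, startCase } from "lodash";

const records = [
  { city: "fresno", pop: 542000 },
  { city: "oakland", pop: 433000 },
  { city: "fresno", pop: 542000 }, // a duplicate row to weed out
];

const deduped = uniqBy(records, d => d.city);      // ensure uniqueness
const ranked = sortBy(deduped, d => -d.pop);       // sort, largest first
const labels = ranked.map(d => startCase(d.city)); // ["Fresno", "Oakland"]
```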
Web scraping tools
p-queue
The thing for when I need promise-based control flow. It makes it easy to run tasks in parallel and lets me adjust how many run concurrently. Why is that cool? Because it means I can dramatically reduce the run time of some of my web scrapers.
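The shape of a typical scraper, sketched with made-up URLs:

```js
import PQueue from "p-queue";

// Never more than five requests in flight at once.
const queue = new PQueue({ concurrency: 5 });

const urls = [
  "https://example.com/page/1",
  "https://example.com/page/2",
  // ...hundreds more
];

// queue.add() returns a promise, so the whole run can be awaited at once.
const pages = await Promise.all(
  urls.map(url => queue.add(() => fetch(url).then(res => res.text())))
);
```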
cheerio
A jQuery-inspired API for parsing and manipulating HTML outside of a browser.
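For example, where the HTML string stands in for whatever page you've fetched:

```js
import * as cheerio from "cheerio";

const html = `<ul><li class="name">Ada</li><li class="name">Grace</li></ul>`;
const $ = cheerio.load(html);

// The same selector-based traversal you'd write in jQuery.
const names = $("li.name")
  .map((i, el) => $(el).text())
  .get(); // ["Ada", "Grace"]
```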
puppeteer
Scraping websites can mean you need to automate a browser. There are a number of options for controlling browsers, but I've found that I like puppeteer's API the most, so it's the one I use.
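A minimal sketch of the pattern, using a placeholder URL and selector:

```js
import puppeteer from "puppeteer";

const browser = await puppeteer.launch();
const page = await browser.newPage();

// Wait until network activity settles so client-rendered content exists.
await page.goto("https://example.com", { waitUntil: "networkidle0" });

// Run a function inside the page and bring the result back out.
const heading = await page.$eval("h1", el => el.textContent);

await browser.close();
```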
Geospatial tools
mapshaper
Need to convert geographic data between file formats? Need to simplify it in the process? Or filter out particular features that you don't want to include in a dataset?
Mapshaper does it all and because it's a command-line tool it's easy to integrate into data pipelines.
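A sketch of a typical invocation; the file names and the STATE field are hypothetical:

```sh
# Convert a shapefile to GeoJSON, simplify the geometry,
# and keep only the features we care about.
mapshaper counties.shp \
  -simplify 10% keep-shapes \
  -filter 'STATE == "06"' \
  -o format=geojson counties.json
```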
turfjs
This project is a fucking gem; it enables me to conduct geospatial analysis from the comfort of Javascript and JSON. The most magical function in this library is booleanPointInPolygon, which tells you whether a point is inside a polygon.
Sounds simple, but it turns out to be really powerful, and it's the crux of how I built a tool that lets California voters see how redistricting affected their home address.
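Here's roughly what that looks like; the district boundary below is a made-up rectangle, not a real map:

```js
import { point, polygon, booleanPointInPolygon } from "@turf/turf";

// Coordinates are [longitude, latitude]; the ring must close on itself.
const district = polygon([[
  [-118.5, 34.0],
  [-118.5, 34.3],
  [-118.1, 34.3],
  [-118.1, 34.0],
  [-118.5, 34.0],
]]);

const home = point([-118.3, 34.1]);

booleanPointInPolygon(home, district); // true
```

Geocode an address to a point, loop over the district polygons, and you've got the heart of that redistricting lookup.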
QGIS
And sometimes you just need a GUI to do... well, something. I'm just a novice when it comes to this open-source geospatial powerhouse, but I use it a lot to explore new data by putting it on a map.
Website tools
SvelteKit
If I'm making a website, SvelteKit is the first thing I'm going to reach for. Not only does it use the very popular component model for building interfaces, but it also makes it very easy to keep API/data fetching code and UI code in the same repository.
Tools that are also companies
Observable notebooks
Doing data analysis is hard, and ensuring that it's repeatable and auditable is crucial. When possible, I use Observable notebooks to handle that. The major thing I like about this tool over other notebook options, besides the fact that it's Javascript, is that I can just share a URL with a colleague. I love that.
GitHub Actions
GitHub Actions is where all of my scrapers and recurring tasks run. It's easy to schedule scrapers, read the logs of runs that succeeded or failed, and commit data back to a repository.
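A sketch of the kind of workflow I mean, with a made-up script and schedule:

```yaml
name: scrape
on:
  schedule:
    - cron: "0 6 * * *" # every day at 06:00 UTC
  workflow_dispatch: # allow manual runs too
jobs:
  scrape:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-node@v3
      - run: npm ci
      - run: node scrape.js
      - name: Commit any new data
        run: |
          git config user.name "github-actions"
          git config user.email "actions@users.noreply.github.com"
          git add -A data/
          git commit -m "Update data" || echo "Nothing new"
          git push
```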
Netlify
I don't have the skill to run web servers, let alone the interest. I want my websites to deploy whenever a change lands on GitHub, and I want to see when a deploy succeeds and when it fails. Netlify does it all.
Do you use any of these tools in your day to day work? Love them? Hate 'em? I'm super curious to hear from you.