Written on 11 Aug 2023

The lobbying data Cal-Access should be publishing

Getting structured data about lobbying out of the Secretary of State's office is much harder than it should be. I've got some new robots that scrape the site daily to generate JSON files that are ready to use for analysis.

Cal-Access was built during the Clinton administration and was innovative for its time: it provides a way for anybody to look up campaign finance and lobbying expenditure activity. Together, these two data sets represent a large part of the influence buying going on in California's government. It's a rickety yet useful website.

Starting with the positive, there are some things I adore about Cal-Access:

  1. URLs are stable and meaningful.
  2. The site is not interactive - HTML is returned from the server and rendered by the browser.
  3. It gets updated frequently.

But there's something that really bothers me about the site with respect to lobbying information: it's very hard to export the data for use in a subsequent workflow or tool.

The only SoS-sanctioned option is a 4+ gigabyte collection of TSV files that is updated daily and available for download. But it includes all of the data in Cal-Access, broken out across many tables. You have to be a subject matter expert to decipher the provided documentation and reconstitute the relationships between the different files. And even then, good luck.

But I've been looking at lobbying stuff recently and I needed to analyze all of the lobbying activity so far in 2023. I figured out a way to leverage the things I like about Cal-Access to fix the thing I don't and generate structured data: I wrote a few "git scraping" robots that crawl the site to extract and structure the data and save it as JSON files.
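Here's a minimal sketch of what the "extract and structure" step of a robot like this could look like. It is not the actual scraper code: the class and function names are made up for illustration, the sample markup is invented rather than copied from Cal-Access, and it assumes the page of interest contains a single HTML table whose first row is a header. It only uses Python's standard library.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <tr> as a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows
        self._row = None    # cells of the row being read, or None outside <tr>
        self._cell = None   # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse runs of whitespace so cell values are tidy.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None
            self._cell = None


def table_to_records(html):
    """Treat the first row as the header and return one dict per later row."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return []
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


# Hypothetical markup, invented for this example (not real Cal-Access output).
SAMPLE = """
<table>
  <tr><th>Name</th><th>Filer ID</th></tr>
  <tr><td>Example Lobbyist</td><td>0000001</td></tr>
  <tr><td>Another  Example</td><td>0000002</td></tr>
</table>
"""

records = table_to_records(SAMPLE)
print(records)
```

Because the site returns plain HTML from the server, a parser like this has everything it needs in the response body; there's no JavaScript to execute first.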

The Secretary of State breaks out lobbying activity into four main groups of data, which are visible in the left-hand menu.

[Screenshot of the lobbying menu on Cal-Access]
[Screenshot of the lobbying page on Cal-Access]

  1. Lobbyists: people who spend at least 1/3 of their compensated time on direct lobbying of officials.
  2. Lobbying firms: companies that hire lobbyists to advocate on behalf of clients.
  3. Lobbyist employers: companies and organizations that employ lobbyists and/or lobbying firms.
  4. $5k+ payments to influence: people and organizations that spend at least $5,000 to influence legislative or administrative action but don't employ a lobbyist or lobbying firm.

It made sense to me to just copy their setup. I have four separate repos, one for each data category. Each one checks for new data daily and updates its JSON file so that it's always up-to-date and ready to use.
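The "check daily and update the file" step can be sketched roughly like this. It is an illustration, not the code in the repos: `write_if_changed` is a hypothetical helper, and the file name and records are invented. The idea is to serialize the records the same way every time (sorted keys, fixed indentation) so that the file only changes, and git only records a commit, when the underlying data changes.

```python
import json
import tempfile
from pathlib import Path


def write_if_changed(path, records):
    """Serialize records deterministically; rewrite the file only when it differs."""
    text = json.dumps(records, indent=2, sort_keys=True, ensure_ascii=False) + "\n"
    path = Path(path)
    if path.exists() and path.read_text(encoding="utf-8") == text:
        return False  # nothing new, so nothing to commit
    path.write_text(text, encoding="utf-8")
    return True


# Invented records and file name, written to a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "example-2023.json"
    first = write_if_changed(target, [{"name": "Example", "id": "1"}])
    # Same data with keys in a different order: the output is identical.
    second = write_if_changed(target, [{"id": "1", "name": "Example"}])
    # A changed value produces a new file.
    third = write_if_changed(target, [{"name": "Example", "id": "2"}])

print(first, second, third)  # True False True
```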

Category                   | Current year JSON file                          | Scraper status (green is good, red is bad)
---------------------------|-------------------------------------------------|-------------------------------------------
Lobbyists                  | lobbyists-2023.json                             | Scrape lobbyists
Lobbying firms             | lobbying-firms-financial-activity-2023.json     | Scrape lobbying firms
Lobbyist employers         | lobbyist-employers-financial-activity-2023.json | Scrape lobbyist employers
$5k+ payments to influence | 5k-filers-financial-activity-2023.json          | Scrape lobbyists
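As a rough sketch of a first analysis step, this counts how many records share each value of a field. It assumes a file such as lobbyists-2023.json holds a JSON array of objects, which may not match the real layout. The `tally` helper and the `name` and `firm` keys are invented for illustration, and the inline sample stands in for the downloaded file.

```python
import json
from collections import Counter


def tally(records, field):
    """Count how many records share each value of `field` (missing counts as None)."""
    return Counter(record.get(field) for record in records)


# Stand-in for loading a downloaded file with json.load(); the keys are invented.
sample = '[{"name": "A", "firm": "X"}, {"name": "B", "firm": "X"}, {"name": "C"}]'
records = json.loads(sample)

counts = tally(records, "firm")
print(counts.most_common())  # [('X', 2), (None, 1)]
```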

If you end up using this data, I'd love to hear about it. And if you find a mistake in it, you really ought to let me know.