A few months ago, I made a rather rudimentary visualisation of some historical rainfall data for Seattle: Will it rain on my parade? I had some ideas about improving it, and got some helpful feedback. I didn’t initially have all the technical skills needed to get it done, so I’ve been using this as a platform to learn some new things. Now that it finally does everything I want it to, it’s time to publish the update (click the image for a full-size, interactive version):
It now displays temperatures as well as rainfall, has a longer history to make the data less noisy, and lets you choose between 14 cities. I’ve also published all the tools you would need to quickly make one for your locations of choice. It needs a catchy title, and I’m interested in any feedback about how to make it better still.
- City selection: by default it displays the main city of each of the 14 biggest metropolitan areas in the United States. Why 14? Well, I wanted to include where I live and some others for comparison, and I happen to live in the 14th-biggest.
- Date range: the last 19 completed calendar years. Why 19? Tableau Public has a data limit of 100,000 rows, and 19 × 14 × 365¼ ≈ 97,157 rows, just shy of that limit. I’ve been careful to use the same years for every city, to make comparisons between the cities more valid.
- Station selection: I’ve used a major airport from each city, not because airports necessarily have better data (actually there are reasons to expect them not to be perfectly representative), but because they tend to have the longest continuous data histories, and they’re good enough proxies for the general pattern of a place’s weather.
- Aggregation: I’ve used medians rather than means for rainfall amounts and high/low temperatures, because they are less skewed by one-off events like big storms and freak heatwaves.
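Two of the choices above are easy to sanity-check in a few lines of Python. This is only a sketch: the function names are mine, and the rainfall numbers at the end are made up purely to show how one storm drags the mean but not the median.

```python
from statistics import mean, median

ROW_LIMIT = 100_000     # Tableau Public's row limit, as quoted above
CITIES = 14
DAYS_PER_YEAR = 365.25  # average year length, allowing for leap years


def rows_needed(years, cities=CITIES):
    """Rows required at one row per city per day."""
    return years * cities * DAYS_PER_YEAR


def max_whole_years(row_limit=ROW_LIMIT, cities=CITIES):
    """Largest whole number of years that fits under the row limit."""
    return int(row_limit // (cities * DAYS_PER_YEAR))


print(rows_needed(19))    # 97156.5
print(max_whole_years())  # 19

# Made-up daily rainfall totals (inches) for one calendar date across
# five years, with a single big storm in the last one. Not real data.
sample_rain = [0.0, 0.1, 0.1, 0.2, 3.0]
print(median(sample_rain))                      # 0.1
print(mean(sample_rain) > median(sample_rain))  # True
```

A 20th year would need more than 102,000 rows, which is why 19 is the ceiling for 14 cities.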
For the rainfall, I’ve stuck with the size-and-darkness coding from last time, because people seemed to like that. The only thing I changed was to let the top end of the scale be set automatically, because it varies so much between cities that I couldn’t find a single value that worked for the whole range. It’s a shame I can’t manually set a different value for each city, because doing that would make the variation through the year a bit clearer for each place.
For temperatures, I experimented with a couple of ways of displaying two values (either the high and low or the median and variance) in one figure, like I did for the frequency and amount of rainfall, but I couldn’t come up with anything that wasn’t hopelessly confusing. So I’m leaving variance out, and using separate charts for highs and lows. I also made two scale choices that I’m second-guessing myself about:
1. I used a diverging colour scale with red for hot and blue for cold. I generally stay away from using colour alone to denote anything, because of accessibility for colour-blind readers, but I recently learned that true red-blue colour-blindness basically doesn’t exist, and full monochromacy is extraordinarily rare. Based on this, I’m not worrying about red-blue comparisons in the way that I do for red-green (which something like 1 in 20 men can’t use), but I’m still not quite sure I should therefore act as if the condition doesn’t exist at all.
2. I fixed the scale to fit the full range from Detroit and Chicago’s coldest nights to Phoenix’s hottest days. This makes comparisons between cities much easier, but it also makes LA’s already weak seasonal variation look even weaker. I think it’s the right trade-off, but I’m not quite sure.
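To put a rough number on that second trade-off, here is a small sketch of how a fixed diverging scale compresses a mild city’s range. The temperature ranges are hypothetical round numbers chosen for illustration, not values from the dataset, and `to_diverging` is my own helper rather than anything Tableau exposes.

```python
def to_diverging(value, low, high):
    """Map a value onto [-1, 1]: -1 is the bluest end, +1 the reddest,
    and 0 the midpoint of the scale."""
    mid = (low + high) / 2
    half_range = (high - low) / 2
    return (value - mid) / half_range


# Hypothetical ranges in degrees Fahrenheit (illustration only).
SHARED_SCALE = (10, 110)  # coldest night to hottest day across all cities
MILD_CITY = (50, 85)      # one city with little seasonal variation

# Share of the colour scale the mild city uses under each choice.
shared_span = to_diverging(85, *SHARED_SCALE) - to_diverging(50, *SHARED_SCALE)
own_span = to_diverging(85, *MILD_CITY) - to_diverging(50, *MILD_CITY)

print(round(shared_span, 2))  # 0.7
print(round(own_span, 2))     # 2.0
```

On the shared scale the mild city only ever uses about a third of the available colours, which is the flattening effect described above; on its own scale it would use all of them, but then the cities could no longer be compared by eye.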
Method, and how to adapt this to your needs
The main work behind this is a script that automates the downloading and processing of historical data files from NOAA. NOAA publishes an incredible trove of historical data from weather stations all over the world, but does so in an idiosyncratic format, split up into many, many separate files. To teach myself Python, which is a really great tool for this sort of thing, I wrote a script that asks how many years of data to fetch for which station, fetches all of the files, concatenates them and translates them into a much more standard CSV format that most tools can read as input. I’ve open-sourced the script, everything on Tableau Public is open, and both Python and Tableau Public are free tools, so if you want to make a version of this for other cities or just start playing around with it yourself, here’s all you have to do:
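To give a flavour of the concatenate-and-translate step, here is a minimal sketch. It is not the published script: it skips the downloading entirely, and it assumes (hypothetically) that each yearly file is whitespace-delimited with the same header line at the top. The real NOAA layout differs in its details, and the sample values in the usage example are made up.

```python
import csv
import io


def yearly_to_csv(yearly_texts):
    """Join per-year, whitespace-delimited texts into one CSV string.

    Assumption: every non-empty text starts with the same header line.
    The header is written once; all data rows follow in the order given.
    """
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    header = None
    for text in yearly_texts:
        lines = [line for line in text.splitlines() if line.strip()]
        if not lines:
            continue  # skip empty files
        file_header = lines[0].split()
        if header is None:
            header = file_header
            writer.writerow(header)
        elif file_header != header:
            raise ValueError("header mismatch between yearly files")
        for line in lines[1:]:
            writer.writerow(line.split())
    return out.getvalue()


# Usage with two tiny made-up "files":
year_one = "DATE      TEMP\n20010101  40.1\n"
year_two = "DATE      TEMP\n20020101  38.5\n"
print(yearly_to_csv([year_one, year_two]))
```

Going through the `csv` module rather than joining strings with commas by hand means any field that happens to contain a comma or quote is escaped correctly.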
1. There’s a good chance your computer already has some version of Python installed, but if you’re not sure then use these instructions to check, and install it if necessary.
2. Install Tableau Public. Note that this does require Windows. If, like me, you don’t have access to a Windows computer, you can use VirtualBox (another free tool – yay!) to fake it. You will need a Windows licence, though, so this option isn’t actually free – borrowing someone’s Windows laptop might be a better option.
3. Get the data processing script by clicking the “View Source” link from this page and saving it with a .py extension.
4. Run the script from the command line. It will interactively prompt you for all the input it needs, so you don’t need to know any special syntax or anything. If you download multiple stations, you will need to manually copy them all into one CSV file, but that’s easy to do with Excel or a text editor.
5. Download the viz itself from this link, and open it in Tableau Public.
6. Use the connect to data menus in Tableau to make it connect to the CSV file you created in step 4. Because the format is exactly the same as the one I used, you shouldn’t have to change anything else – just save that and share it.
If you find this useful, or have any feedback, I would love to hear from you.
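If you downloaded several stations and would rather not merge them by hand in Excel, something like the following would do the job. It is a sketch that assumes every station file has exactly the same header row (which should hold if they all came from the same script); the function name and the file names in the note below are placeholders of mine.

```python
import csv


def merge_station_csvs(input_paths, output_path):
    """Append several CSV files with identical headers into one file.

    Writes the header once, then every data row from each input in order.
    Returns the number of data rows written.
    """
    header = None
    rows_written = 0
    with open(output_path, "w", newline="") as out_file:
        writer = csv.writer(out_file)
        for path in input_paths:
            with open(path, newline="") as in_file:
                reader = csv.reader(in_file)
                file_header = next(reader, None)
                if file_header is None:
                    continue  # empty file, nothing to copy
                if header is None:
                    header = file_header
                    writer.writerow(header)
                elif file_header != header:
                    raise ValueError(f"{path} has a different header")
                for row in reader:
                    writer.writerow(row)
                    rows_written += 1
    return rows_written
```

You would call it as, for example, `merge_station_csvs(["seattle.csv", "phoenix.csv"], "all_cities.csv")`, and then point Tableau at the combined file. It is worth checking the returned row count against Tableau Public’s 100,000-row limit before uploading.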
UPDATE: There was some interest in being able to compare sunny/cloudy days, too. I don’t have the right data for that, but I do have visibility, so I tried adding that. I’m not all that pleased with the results, but it does give some sense of the character of the seasons, and I’d love feedback on how to improve it: Where the sun don’t shine