Looking at NOAA Storm Fatality Data with Python and Pandas: Part 1.


Introduction.

The NOAA has a whole collection of datasets relating to severe weather as well as weather-related fatalities. We're going to be looking at one of the datasets and seeing how to explore and plot a few neat things about it using Python.

Because I prefer using IPython notebook to do analysis (mostly because of the inline features of matplotlib), I'll be showing cells from my IPython notebook. Feel free to follow along with my code!

Getting the Data, Setting up Python.

Unfortunately, the data is in an irritating csv.gz format. While it is possible to download and decompress all of the files using the requests and gzip libraries, we'll focus on just one, which I've converted to csv here:

Get the data: StormEvents_fatalities-ftp_v1.0_d2011_c20150506
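
If you do want to grab the compressed files yourself, here's a minimal sketch using requests and gzip. The URL below is an assumption based on NOAA's public storm events directory, so double-check it against their download page before running this.

import gzip
import requests

# Hypothetical URL -- verify against NOAA's storm events file listing.
url = ("https://www.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/"
       "StormEvents_fatalities-ftp_v1.0_d2011_c20150506.csv.gz")

response = requests.get(url)
response.raise_for_status()

# Decompress the gzipped bytes and write out a plain csv.
with open("StormEvents_fatalities-ftp_v1.0_d2011_c20150506.csv", "wb") as f:
    f.write(gzip.decompress(response.content))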

We're now going to set up Python by loading in some libraries we'll use for analysis. Make sure you have all of these libraries before going on!

%matplotlib inline

# Analysis.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Misc.
import datetime
import time

Briefly, we're going to use numpy and pandas for most of the analysis, matplotlib.pyplot and seaborn (optional!) to graph our results, and the others for helper functions. Note that the top line, %matplotlib inline, is only for IPython notebooks; it displays your plots inline as you work, which is great when you're making lots of graphs.

Let's import the data now. We're going to use pandas to read our csv file into a dataframe, which we'll manipulate later. There are a ton of potentially confusing options you can specify when turning a file into a dataframe, so we'll go over them after we do it.

file_path = "../static/data/storm_fatality_pt_1/StormEvents_fatalities-ftp_v1.0_d2011_c20150506.csv"

date_parser_custom = lambda d : datetime.datetime.strptime(d, "%m/%d/%Y %H:%M:%S").date()

# the actual import of the csv to a DF.
raw_df = pd.read_csv(file_path, parse_dates=["FATALITY_DATE"],
                     date_parser=date_parser_custom)

# To cut out some unnecessary columns.
df = raw_df[["FATALITY_DATE", "FATALITY_TYPE", "FATALITY_AGE", "FATALITY_SEX", "FATALITY_LOCATION"]]

print(df.head())

We did a few things here, so let's go through them.

  1. First we made a variable for our file path. You don't need to do this, but if you like to store your data elsewhere it might be a good idea.
  2. This $date\_parser\_custom$ variable is sort of strange, no? We need it because our dataset has a date column, but the raw strings don't immediately translate into "nice" dates when we import them into a pandas dataframe. Luckily, we can parse the dates with a custom function! Here, our data gives us dates that look like "12/10/2011 13:21:01", in month, day, year, hour, minute, second form. I only want the date, so I call the strptime method to read in that string and spit out just the date part in a nice (read: Python datetime) format. There's a tiny sanity-check sketch just after this list.
  3. We use the read_csv function to take in the csv we give it and spit out a dataframe. Notice that we're explicitly telling pandas which column is a date column and then telling it how to parse those dates using the function we made above.
  4. We cut out some unnecessary columns to make $df$.
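
Just to make the date parsing concrete, here's a tiny sanity check on a made-up timestamp in the same format (the timestamp itself is invented for illustration):

import datetime

# Same parser as above: take "month/day/year hour:minute:second" and keep only the date.
date_parser_custom = lambda d : datetime.datetime.strptime(d, "%m/%d/%Y %H:%M:%S").date()
print(date_parser_custom("04/27/2011 13:21:01"))  # -> 2011-04-27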

We get the following output:

  FATALITY_DATE FATALITY_TYPE  FATALITY_AGE FATALITY_SEX   FATALITY_LOCATION
0    2011-07-21             D            72            M  Outside/Open Areas
1    2011-07-21             D            51            M  Outside/Open Areas
2    2011-07-22             D            69            F      Permanent Home
3    2011-07-24             D            55            M      Permanent Home
4    2011-07-24             D            46            M      Permanent Home

Okay. The data here is fairly clear, but let me just go over what we're looking at. For each person who died during an extreme weather event (tornado, hurricane, etc.), NOAA reports the fatality and notes the person's sex, their age, the fatality type (directly from the storm = D, indirectly from the storm = I), as well as the date. One of the more interesting things to me was that they also give an approximate location for where the person was killed. Sort of strange.
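
If you want to confirm for yourself which values actually show up in those columns, a quick check (output not shown here) is:

# Peek at the distinct values in the categorical columns.
print(df["FATALITY_TYPE"].unique())
print(df["FATALITY_LOCATION"].unique())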

How many fatalities per day?

At this point we ought to do some surface-level exploration on the data to see if anything strange sticks out. This also gives us an excuse to show off some pandas functionality.

First, let's see how many fatalities per day happened. We can do this by using the value_counts() function on a series (or a dataframe). After, let's plot this and see what we get.

df_fatalities_per_day = df["FATALITY_DATE"].value_counts()
df_fatalities_per_day.plot(figsize=(20,5))

Note that the second line here will produce a time series plot, since we have dates as our index. The figsize parameter will just make the figure larger if you're in IPython. We get the following figure.
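
One thing to be aware of: value_counts() returns the counts sorted by frequency, not by date, so depending on your pandas version the line plot can come out looking tangled. If that happens to you, a quick fix is to sort the series by its index (the dates) before plotting:

# Put the daily counts in chronological order, then plot.
df_fatalities_per_day = df_fatalities_per_day.sort_index()
df_fatalities_per_day.plot(figsize=(20,5))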

Woah, what happened around May and June? Let's zoom in a bit by cutting out some of the dates that aren't around those two big spikes. This gives us a chance to look at some masks.

mask = lambda d : datetime.date(2011,6,15) > d > datetime.date(2011, 4,15)
masked_data = df_fatalities_per_day.index.map(mask)

df_fatalities_per_day_cut = df_fatalities_per_day[masked_data]
df_fatalities_per_day_cut.plot(figsize = (20,10))

Let's explain this code a bit. The last part is just plotting, as usual, but what's the first part?

This is a mask. We first create a function (which I usually just call $mask$) which will look at every element in some column of a dataframe and spit out either True or False, depending on what our function is. Here, we're using a lambda function to send $d$, which will be a date in our dataframe, to some inequality; this is just asking, "Is $d$ between these other two dates? If so, return True. If not, return False."

The second line here is where we actually make the mask. We look at the index of $df\_fatalities\_per\_day$ and use the $map()$ function to apply our mask function to each element, returning True or False. If you look at $masked\_data$, you'll see a list of True or False values corresponding to whether that index is between the given dates or not.

Last, we pass that $masked\_data$ to our $df\_fatalities\_per\_day$ dataframe. This automatically filters the data by telling pandas, "Return only those values of $df\_fatalities\_per\_day$ which are True when we applied mask to them."
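
If masks are new to you, here's a toy version of the same pattern on a tiny made-up series (the values are just for illustration):

import pandas as pd  # already imported above

s = pd.Series([10, 20, 30], index=["a", "b", "c"])
keep = s.index.map(lambda label : label != "b")  # [True, False, True]
print(s[keep])  # keeps only the "a" and "c" rows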

The plotted data looks like this:

Well, that didn't tell us much more, but it did get us to see that these peaks seemed to happen on a single day. Maybe it'd be better to just look at the table itself in this case. Let's do that.

df_fatalities_per_day_cut.head(5)

2011-04-27    318
2011-05-22    305
2011-04-16     32
2011-05-24     18
2011-04-25     12
dtype: int64

This tells us that there were 318 deaths on 4-27-2011. A quick googling shows that there was a historic tornado outbreak on that day, which explains the number of fatalities. It's up to the reader to find out what happened on the other significant days!

Just looking at the data shows us that there were significant fatalities on 4-27-2011, but I wonder if it was because a number of trailer homes were uprooted. Because the NOAA data gives us an approximate location we can check to see where most of these fatalities happened.

df_2011_4_27 = df[df["FATALITY_DATE"] == datetime.date(2011,4,27)]
df_2011_4_27.groupby("FATALITY_LOCATION").agg("count").iloc[:, 0].plot(kind="bar", figsize=(20,10))

The first line isn't so bad. We're making a new dataframe using pandas boolean indexing. We're saying here that we want all elements in $df$ with the property that $df$["FATALITY_DATE"] is equal to the datetime specified. This returns only those elements — you can check this by printing out $df\_2011\_4\_27$.

The next line gets a bit wild. We first group $df\_2011\_4\_27$ by "FATALITY_LOCATION". This gives us a groupby object in pandas, which is sort of like a dataframe where rows have been collected into groups. In this case, we've grouped the rows by fatality location, so that all rows with the same location end up together in one group.

We want to count the number of times each location comes up, so we use the $agg()$ aggregate function and ask it to "count". At this point, if you print the dataframe, you'll see that it gives a count for every column (which, in this case, will be mostly the same), so we only need to look at the first column, which we can do by tacking $.iloc[:, 0]$ onto the result.

At long last, we can plot these values. But we don't want a time series chart now (and we couldn't do that anyway; our index is no longer a bunch of dates), so we're going to specify the kind of graph we want ("bar") and, as usual, the figure size.

Whew.

The output looks like this:

This graph shows us that, while a good number of mobile homes were fatality locations, the vast majority of fatalities happened in permanent homes and structures.
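
As an aside, if all you wanted was those per-location counts, the value_counts() function from earlier gets you the same numbers in one step (just sorted by frequency rather than alphabetically by location); the groupby route is still worth knowing for fancier aggregations.

# An equivalent, more direct way to get (and plot) the same counts.
df_2011_4_27["FATALITY_LOCATION"].value_counts().plot(kind="bar", figsize=(20,10))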

We can do a similar analysis with the distribution of age and sex of the fatalities.

df_2011_4_27.groupby("FATALITY_SEX").agg("count").iloc[:, 0].plot(kind="bar", figsize=(20,10))

The one for age is a little different due to some null values being present in the data (which we'll discuss more in a future post). For my analysis, I wanted to exclude these values. Here's some tweaking I did.

df_2011_4_27_ages = df_2011_4_27["FATALITY_AGE"]
df_2011_4_27_ages = df_2011_4_27_ages[df_2011_4_27_ages.notnull()]

print("The mean is: ", df_2011_4_27_ages.mean())
print("The median is: ", df_2011_4_27_ages.median())
print("The stdev is ", df_2011_4_27_ages.std())

df_2011_4_27_ages = df_2011_4_27_ages.value_counts().sort_index()
df_2011_4_27_ages.plot(kind="bar", figsize=(50,10))

The $notnull()$ function here is similar to the masks we made above: it returns True if and only if the value is not null. This mask gets rid of all of the null values in our age column. We then print out the mean, median, and standard deviation using the corresponding functions.
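
(As a shortcut, dropna() does the same thing as the notnull() mask in one call; the mask version is just more explicit about what's going on.)

# Equivalent to the notnull() mask above: drop the null ages directly.
df_2011_4_27_ages = df_2011_4_27["FATALITY_AGE"].dropna()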

Last, we use the $value\_counts$ function again, but since we want to graph the distribution by age, we don't want the result sorted from highest frequency to lowest frequency, we want it sorted by age. After ten minutes of agonizing googling, I found that tacking $sort\_index()$ onto the result does exactly that: it reorders the counts by their index, which here is the age. Leave it off and see what happens if you have no idea what I'm talking about here.

Hmm. Maybe it would be better to use a histogram with a bit larger bins, no? We'll explore this in a future post.
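
If you want to peek ahead, one rough way to try it, using the age series from before the value_counts() step, might look like this (the bin count is just a guess):

# A quick first pass at a histogram of ages on 4-27-2011.
df_2011_4_27["FATALITY_AGE"].dropna().plot(kind="hist", bins=20, figsize=(20,10))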

Homework.

A few things to think about.

  1. What is the seaborn library doing? Try turning it off and making some graphs to see.
  2. What happened on the date with the second-most fatalities? In what location did most of those fatalities occur?
  3. What was the average age for the fatalities on the date with the second-most fatalities? What does the distribution of sexes look like?
  4. In general, on all dates in this dataset, what is the most common age for fatalities?

Resources.