Gathering Historical NFL Weather Data

Warmish Welcome

Fantasy sports are a passion of mine because they reside at the intersection of two of my favorite things in the world: sports and data. Over the last four or so years, I have built data sets suitable for modeling fantasy football and basketball outcomes. I tend to favor daily fantasy, but weather data is valuable for season-long fantasy football as well.

In fantasy football, weather data are valuable for a variety of reasons. Conventional wisdom holds that temperature, wind, and precipitation affect throwing the football, the kicking game, holding on to the football, and so on. After reading through this post, you will have the data necessary to explore these claims.

Setting Up

To start, we’ll load the necessary libraries. From there, we’ll find a website with weather data, write a function to scrape the data and then run that function on several web pages automatically to build a complete historical data set.

The website we will be using is NFL Weather. Like most real-world exercises, the data is not presented to us in a perfectly built table. We’ll need to do a fair amount of data wrangling to mold the data into a usable structure.

# package management
library(tidyverse)
library(rvest)
library(janitor)
library(glue)
library(knitr)      # kable() for rendering tables
library(kableExtra) # kable_styling() for table formatting

Exploring NFLWeather.com

Once you navigate to NFL Weather, you can see the weeks indexed with small blue boxes towards the top, directly in the middle of the page. Clicking on the ‘1’ takes you to the weather for week 1 of the 2020 season. Notice the url for this page: http://nflweather.com/en/week/2020/week-1/. Simply replacing ‘2020’ in the url with ‘2019’ will direct you to the week 1 forecast for the 2019 season. As you might have guessed, swapping ‘week-1’ with ‘week-8’ will launch the forecast for week 8. Already, we can see that the forecast for each week of a given season can be accessed by switching the year and week parts of the url.
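
To make this concrete, here is a minimal sketch of building one of these urls by hand (we will automate this properly later in the post):

# build a url for an arbitrary season and week by hand
season <- 2019
week <- 8
paste0('http://nflweather.com/en/week/', season, '/week-', week, '/')
## [1] "http://nflweather.com/en/week/2019/week-8/"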

Let’s start by accessing the weather data for week 1 of the 2020 season. Using rvest, we can quickly scan the html for tables and parse them with the read_html() and html_table() functions. These commands parse the html from the given url and return the tables in a list. Using glimpse(), we see that the result is a list of one data frame, which contains several unnamed, all-NA columns and some duplicated column names.

url <- 'http://nflweather.com/en/week/2020/week-1/'

url %>% 
  read_html() %>% 
  html_table() %>%
  glimpse()
## List of 1
##  $ :'data.frame':    16 obs. of  13 variables:
##   ..$                  : logi [1:16] NA NA NA NA NA NA ...
##   ..$ Away             : chr [1:16] "Texans" "Eagles" "Dolphins" "Packers" ...
##   ..$ Game             : logi [1:16] NA NA NA NA NA NA ...
##   ..$ Game             : chr [1:16] "@" "@" "@" "@" ...
##   ..$ Game             : logi [1:16] NA NA NA NA NA NA ...
##   ..$ Home             : chr [1:16] "Chiefs" "Washington" "Patriots" "Vikings" ...
##   ..$ Time (ET)        : chr [1:16] "Final: 20 - 34" "Final: 17 - 27" "Final: 11 - 21" "Final: 43 - 34" ...
##   ..$ TV               : chr [1:16] "NBC" "FOX" "CBS" "FOX" ...
##   ..$                  : logi [1:16] NA NA NA NA NA NA ...
##   ..$ Forecast         : chr [1:16] "58f Overcast" "76f Partly Cloudy" "73f Clear" "DOME" ...
##   ..$ Extended Forecast: chr [1:16] "Overcast. Rain in the morning and afternoon." "Partly Cloudy. Clear throughout the day." "Clear. Partly cloudy throughout the day." "Clear. Clear throughout the day." ...
##   ..$ Wind             : chr [1:16] "6m NNE" "4m S" "6m S" "2m SW" ...
##   ..$                  : chr [1:16] "Details" "Details" "Details" "Details" ...

Data Wrangling

We’ve returned the data frame we want, but definitely not the data frame we need. The initial output requires additional work.

We’ll use clean_names() from the janitor package to give each column a nice, clean name.

url %>% 
  read_html() %>% 
  html_table() %>%
  .[[1]] %>%  # select the first and only element in the list 
  clean_names() %>%
  kable(format = 'html') %>% 
  kable_styling()
| x | away | game | game_2 | game_3 | home | time_et | tv | x_2 | forecast | extended_forecast | wind | x_3 |
|---|------|------|--------|--------|------|---------|----|-----|----------|-------------------|------|-----|
| NA | Texans | NA | @ | NA | Chiefs | Final: 20 - 34 | NBC | NA | 58f Overcast | Overcast. Rain in the morning and afternoon. | 6m NNE | Details |
| NA | Eagles | NA | @ | NA | Washington | Final: 17 - 27 | FOX | NA | 76f Partly Cloudy | Partly Cloudy. Clear throughout the day. | 4m S | Details |
| NA | Dolphins | NA | @ | NA | Patriots | Final: 11 - 21 | CBS | NA | 73f Clear | Clear. Partly cloudy throughout the day. | 6m S | Details |
| NA | Packers | NA | @ | NA | Vikings | Final: 43 - 34 | FOX | NA | DOME | Clear. Clear throughout the day. | 2m SW | Details |
| NA | Colts | NA | @ | NA | Jaguars | Final: 20 - 27 | CBS | NA | 86f Humid and Mostly Cloudy | Humid and Mostly Cloudy. Rain until evening. | 8m ESE | Details |
| NA | Bears | NA | @ | NA | Lions | Final: 27 - 23 | FOX | NA | DOME | Clear. Rain in the morning. | 11m W | Details |
| NA | Raiders | NA | @ | NA | Panthers | Final: 34 - 30 | CBS | NA | 79f Clear | Clear. Clear throughout the day. | 3m NE | Details |
| NA | Jets | NA | @ | NA | Bills | Final: 17 - 27 | CBS | NA | 65f Overcast | Overcast. Rain in the morning and afternoon. | 12m SSW | Details |
| NA | Browns | NA | @ | NA | Ravens | Final: 6 - 38 | CBS | NA | 75f Partly Cloudy | Partly Cloudy. Humid and partly cloudy throughout the day. | 5m SSE | Details |
| NA | Seahawks | NA | @ | NA | Falcons | Final: 38 - 25 | FOX | NA | DOME | Partly Cloudy. Partly cloudy throughout the day. | 3m ESE | Details |
| NA | Chargers | NA | @ | NA | Bengals | Final: 16 - 13 | CBS | NA | 83f Mostly Cloudy | Mostly Cloudy. Possible drizzle in the morning. | 4m WNW | Details |
| NA | Cardinals | NA | @ | NA | 49ers | Final: 24 - 20 | FOX | NA | 71f Clear | Clear. Clear throughout the day. | 6m NW | Details |
| NA | Buccaneers | NA | @ | NA | Saints | Final: 23 - 34 | FOX | NA | DOME | Humid and Partly Cloudy. Humid and partly cloudy throughout the day. | 10m ENE | Details |
| NA | Cowboys | NA | @ | NA | Rams | Final: 17 - 20 | NBC | NA | DOME | Clear. Clear throughout the day. | 7m W | Details |
| NA | Steelers | NA | @ | NA | Giants | Final: 26 - 16 | ESPN | NA | 70f Clear | Clear. Clear throughout the day. | 10m NNW | Details |
| NA | Titans | NA | @ | NA | Broncos | Final: 16 - 14 | ESPN | NA | 78f Foggy | Foggy. Foggy throughout the day. | 7m NE | Details |

From there, we’ll select both the home and away team columns, the forecast column, and the wind column. The forecast field holds a text description of the weather from which we can parse the temperature and precipitation.

url %>% 
  read_html() %>% 
  html_table() %>%
  .[[1]] %>% 
  clean_names() %>%
  select(away, home, forecast, wind) %>% 
  kable(format = 'html') %>% 
  kable_styling()
| away | home | forecast | wind |
|------|------|----------|------|
| Texans | Chiefs | 58f Overcast | 6m NNE |
| Eagles | Washington | 76f Partly Cloudy | 4m S |
| Dolphins | Patriots | 73f Clear | 6m S |
| Packers | Vikings | DOME | 2m SW |
| Colts | Jaguars | 86f Humid and Mostly Cloudy | 8m ESE |
| Bears | Lions | DOME | 11m W |
| Raiders | Panthers | 79f Clear | 3m NE |
| Jets | Bills | 65f Overcast | 12m SSW |
| Browns | Ravens | 75f Partly Cloudy | 5m SSE |
| Seahawks | Falcons | DOME | 3m ESE |
| Chargers | Bengals | 83f Mostly Cloudy | 4m WNW |
| Cardinals | 49ers | 71f Clear | 6m NW |
| Buccaneers | Saints | DOME | 10m ENE |
| Cowboys | Rams | DOME | 7m W |
| Steelers | Giants | 70f Clear | 10m NNW |
| Titans | Broncos | 78f Foggy | 7m NE |

Text Gymnastics

Since the season and week fields are not included in the data, we need to extract them from the url. Remembering back to when we changed the ‘2020’ in the url to ‘2019’, we know that the position of the season portion of the url will not change. Using str_sub() from stringr, we’re able to extract characters based on their position within the url - in this case, characters 31-34.

Extracting the week out of the url is a little more complex. To start, we use a regular expression to keep only the characters of the url following the final ‘-’. Using gsub(), we pattern match all characters prior to and including the ‘-’ and replace them with ’’ (nothing). For our example url, this leaves us with ‘1/’. From here, we just need to capture the number to serve as our week column. Here, I go back to str_sub() and extract characters by position from the right - starting 3 characters from the end and stopping 2 characters from the end - which captures double-digit weeks as well.
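
Before wiring these into the pipeline, here is a quick standalone check of both extractions against our example url (url is the week 1, 2020 address defined above):

# verify the season and week extractions in isolation
str_sub(url, start = 31, end = 34)        # "2020"
gsub('.*\\-', '', url)                    # "1/"
str_sub(gsub('.*\\-', '', url), -3, -2)   # "1"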

url %>% 
  read_html() %>% 
  html_table() %>%
  .[[1]] %>% 
  clean_names() %>%
  select(away, home, forecast, wind) %>%
  mutate(season = str_sub(url, start= 31, end = 34),
         week = str_sub(gsub('.*\\-', '', url), start = -3, end = -2)) %>% 
  kable(format = 'html') %>% 
  kable_styling()
| away | home | forecast | wind | season | week |
|------|------|----------|------|--------|------|
| Texans | Chiefs | 58f Overcast | 6m NNE | 2020 | 1 |
| Eagles | Washington | 76f Partly Cloudy | 4m S | 2020 | 1 |
| Dolphins | Patriots | 73f Clear | 6m S | 2020 | 1 |
| Packers | Vikings | DOME | 2m SW | 2020 | 1 |
| Colts | Jaguars | 86f Humid and Mostly Cloudy | 8m ESE | 2020 | 1 |
| Bears | Lions | DOME | 11m W | 2020 | 1 |
| Raiders | Panthers | 79f Clear | 3m NE | 2020 | 1 |
| Jets | Bills | 65f Overcast | 12m SSW | 2020 | 1 |
| Browns | Ravens | 75f Partly Cloudy | 5m SSE | 2020 | 1 |
| Seahawks | Falcons | DOME | 3m ESE | 2020 | 1 |
| Chargers | Bengals | 83f Mostly Cloudy | 4m WNW | 2020 | 1 |
| Cardinals | 49ers | 71f Clear | 6m NW | 2020 | 1 |
| Buccaneers | Saints | DOME | 10m ENE | 2020 | 1 |
| Cowboys | Rams | DOME | 7m W | 2020 | 1 |
| Steelers | Giants | 70f Clear | 10m NNW | 2020 | 1 |
| Titans | Broncos | 78f Foggy | 7m NE | 2020 | 1 |

For capturing the wind, we’ll simply remove all characters following and including the lower case ‘m’ in the wind column. This gives us the numerical representation of the wind. Similarly, for temperature we use the same style of regular expression targeting the lower case ‘f’ in the forecast column, defaulting dome games to a constant 71 degrees. The weather column is generated from the forecast column with a global gsub() call that strips every character up to and including each white space, leaving only the last word of the description.
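
As a sanity check, here is how each of those gsub() calls behaves on sample values taken from the table above:

# demonstrate the three extractions on sample values
gsub("m.*$", "", "6m NNE")                       # "6"
gsub("f.*$", "", "58f Overcast")                 # "58"
gsub(".*? ", "", "86f Humid and Mostly Cloudy")  # "Cloudy" - the global replace leaves only the last word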

url %>% 
  read_html() %>% 
  html_table() %>%
  .[[1]] %>% 
  clean_names() %>%
  select(away, home, forecast, wind) %>%
  mutate(season = str_sub(url, start= 31, end = 34),
         week = str_sub(gsub('.*\\-', '', url), start = -3, end = -2),
         wind = as.numeric(gsub( "m.*$", "",wind)),
         temperature = ifelse(forecast == 'DOME', 71, gsub( "f.*$", "", forecast)),
         weather = gsub(".*? ", "", forecast)) %>%
  kable(format = 'html') %>% 
  kable_styling()
| away | home | forecast | wind | season | week | temperature | weather |
|------|------|----------|------|--------|------|-------------|---------|
| Texans | Chiefs | 58f Overcast | 6 | 2020 | 1 | 58 | Overcast |
| Eagles | Washington | 76f Partly Cloudy | 4 | 2020 | 1 | 76 | Cloudy |
| Dolphins | Patriots | 73f Clear | 6 | 2020 | 1 | 73 | Clear |
| Packers | Vikings | DOME | 2 | 2020 | 1 | 71 | DOME |
| Colts | Jaguars | 86f Humid and Mostly Cloudy | 8 | 2020 | 1 | 86 | Cloudy |
| Bears | Lions | DOME | 11 | 2020 | 1 | 71 | DOME |
| Raiders | Panthers | 79f Clear | 3 | 2020 | 1 | 79 | Clear |
| Jets | Bills | 65f Overcast | 12 | 2020 | 1 | 65 | Overcast |
| Browns | Ravens | 75f Partly Cloudy | 5 | 2020 | 1 | 75 | Cloudy |
| Seahawks | Falcons | DOME | 3 | 2020 | 1 | 71 | DOME |
| Chargers | Bengals | 83f Mostly Cloudy | 4 | 2020 | 1 | 83 | Cloudy |
| Cardinals | 49ers | 71f Clear | 6 | 2020 | 1 | 71 | Clear |
| Buccaneers | Saints | DOME | 10 | 2020 | 1 | 71 | DOME |
| Cowboys | Rams | DOME | 7 | 2020 | 1 | 71 | DOME |
| Steelers | Giants | 70f Clear | 10 | 2020 | 1 | 70 | Clear |
| Titans | Broncos | 78f Foggy | 7 | 2020 | 1 | 78 | Foggy |
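
One subtle point before moving on: because ifelse() mixes a number (71) with gsub()’s character output, the temperature column comes back as character rather than numeric. If you plan to do arithmetic on it, a quick conversion helps - weather_df below is just an illustrative name for the pipeline’s result:

# convert temperature to numeric for downstream analysis
weather_df <- weather_df %>% 
  mutate(temperature = as.numeric(temperature))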

Pivot Action

At this point, the data is tidied enough to start using. However, before we move on to writing a function to scrape the data for us across multiple web pages at a time, we need to alter the orientation of the data. As is, there is a row for each game. Thinking bigger picture, we will want a row for each team - so that each row is unique by season, week, and team. Suppose you want to join or merge this weather data with a team’s box score data from one or multiple games? This final transformation will make that task much easier, as the sketch below shows.
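
To preview the payoff, here is the kind of join the one-row-per-team shape enables. Note that box_scores is a hypothetical table assumed to carry matching team, season, and week columns (with season and week stored as character, as they are in our data), and weather_data is the per-team table we build by the end of this post:

# hypothetical: attach weather to per-team box scores on shared keys
box_scores %>% 
  left_join(weather_data, by = c('team', 'season', 'week'))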

With the pivot_longer() function, we can elongate the data using the home and away columns - renaming the pivoted values column simply to ‘team’. For example, if the Texans and Chiefs played, this function ensures there is a row for each team along with the corresponding weather data from the remaining columns. What you’ll notice is that instead of 16 rows like the previous outputs, there will now be 32 (one row per team per game rather than one per game). The result is a clean dataset with a row for each season, week, and team.

url %>% 
  read_html() %>% 
  html_table() %>%
  .[[1]] %>% 
  clean_names() %>%
  select(away, home, forecast, wind) %>%
  mutate(season = str_sub(url, start= 31, end = 34),
         week = str_sub(gsub('.*\\-', '', url), start = -3, end = -2),
         wind = as.numeric(gsub( "m.*$", "",wind)),
         temperature = ifelse(forecast == 'DOME', 71, gsub( "f.*$", "", forecast)),
         weather = gsub(".*? ", "", forecast)) %>%
  pivot_longer(cols = c('away', 'home'), values_to = 'team') %>% 
  select(-name, -forecast) %>% 
  select(team, season, week, temperature, wind, weather) %>% 
  kable(format = 'html') %>% 
  kable_styling()
| team | season | week | temperature | wind | weather |
|------|--------|------|-------------|------|---------|
| Texans | 2020 | 1 | 58 | 6 | Overcast |
| Chiefs | 2020 | 1 | 58 | 6 | Overcast |
| Eagles | 2020 | 1 | 76 | 4 | Cloudy |
| Washington | 2020 | 1 | 76 | 4 | Cloudy |
| Dolphins | 2020 | 1 | 73 | 6 | Clear |
| Patriots | 2020 | 1 | 73 | 6 | Clear |
| Packers | 2020 | 1 | 71 | 2 | DOME |
| Vikings | 2020 | 1 | 71 | 2 | DOME |
| Colts | 2020 | 1 | 86 | 8 | Cloudy |
| Jaguars | 2020 | 1 | 86 | 8 | Cloudy |
| Bears | 2020 | 1 | 71 | 11 | DOME |
| Lions | 2020 | 1 | 71 | 11 | DOME |
| Raiders | 2020 | 1 | 79 | 3 | Clear |
| Panthers | 2020 | 1 | 79 | 3 | Clear |
| Jets | 2020 | 1 | 65 | 12 | Overcast |
| Bills | 2020 | 1 | 65 | 12 | Overcast |
| Browns | 2020 | 1 | 75 | 5 | Cloudy |
| Ravens | 2020 | 1 | 75 | 5 | Cloudy |
| Seahawks | 2020 | 1 | 71 | 3 | DOME |
| Falcons | 2020 | 1 | 71 | 3 | DOME |
| Chargers | 2020 | 1 | 83 | 4 | Cloudy |
| Bengals | 2020 | 1 | 83 | 4 | Cloudy |
| Cardinals | 2020 | 1 | 71 | 6 | Clear |
| 49ers | 2020 | 1 | 71 | 6 | Clear |
| Buccaneers | 2020 | 1 | 71 | 10 | DOME |
| Saints | 2020 | 1 | 71 | 10 | DOME |
| Cowboys | 2020 | 1 | 71 | 7 | DOME |
| Rams | 2020 | 1 | 71 | 7 | DOME |
| Steelers | 2020 | 1 | 70 | 10 | Clear |
| Giants | 2020 | 1 | 70 | 10 | Clear |
| Titans | 2020 | 1 | 78 | 7 | Foggy |
| Broncos | 2020 | 1 | 78 | 7 | Foggy |

Mapping the Function

The final objective here is to turn this code into a function. Functions can be used over and over without rewriting code. In this example, we can reuse our code for existing urls, grabbing data from a particular week of a particular season - or run it for all weeks across multiple seasons. For this blog post, we’re going to grab weather data for weeks 1 through 17 of the 2018 and 2019 seasons.

We actually need two functions: one for generating the web page urls and another for scraping and wrangling the data. For generating the urls, we establish the sequence of weeks and years. Then, with the help of tidyr and the crossing() function, we can generate all combinations of weeks and years. The result is a tibble with two columns containing a row for each of weeks 1-17 in each of the 2018 and 2019 seasons. The function that generates the urls simply concatenates the base url with the year and week provided. Using pmap(), we can iterate over each row of the weather_weeks_and_years tibble, passing the year and week from each row into our generate_urls() function. The result is a list of 34 urls or web pages that we can now scrape data from.

# set up weeks and years
weeks <- c(1:17)
years <- c(2018:2019)
weather_weeks_and_years <- crossing(year = years, week = weeks) # generate all unique combinations

# function to generate urls
generate_urls <- function(year, week) {
  glue::glue("http://nflweather.com/en/week/{year}/week-{week}/")
}

# pass weeks and years through function
url_list <- pmap(weather_weeks_and_years, generate_urls)

head(url_list)
## [[1]]
## http://nflweather.com/en/week/2018/week-1/
## 
## [[2]]
## http://nflweather.com/en/week/2018/week-2/
## 
## [[3]]
## http://nflweather.com/en/week/2018/week-3/
## 
## [[4]]
## http://nflweather.com/en/week/2018/week-4/
## 
## [[5]]
## http://nflweather.com/en/week/2018/week-5/
## 
## [[6]]
## http://nflweather.com/en/week/2018/week-6/

Below, we create a function that accepts a url, then scrapes, wrangles, and returns the data. The code is the same code written above, applied to each url. We execute the function for each url in url_list using map_df() from the purrr package, which applies a function to each element of an input list and row-binds the results into a single data frame. It might take a minute or two to run. Upon completion, you should have a solid weather dataset to start digging into.

# function to scrape weather
scrape_weather_data <- function(webpage_url) {
  webpage_url %>% 
    read_html() %>% 
    html_table() %>%
    .[[1]] %>%  # the page's first and only table
    clean_names() %>%
    select(away, home, forecast, wind) %>%
    mutate(season = str_sub(webpage_url, start = 31, end = 34),  # year occupies characters 31-34 of the url
           week = str_sub(gsub('.*\\-', '', webpage_url), start = -3, end = -2),  # digits after the last '-'
           wind = as.numeric(gsub("m.*$", "", wind)),  # keep the number before 'm'
           temperature = ifelse(forecast == 'DOME', 71, gsub("f.*$", "", forecast)),  # dome games default to 71
           weather = gsub(".*? ", "", forecast)) %>%  # keep the last word of the forecast
    pivot_longer(cols = c('away', 'home'), values_to = 'team') %>%  # one row per team per game
    select(-name, -forecast) %>% 
    select(team, season, week, temperature, wind, weather)
}
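
Since we are requesting 34 pages in a row, you might add a small courtesy pause between requests. This wrapper is optional and not part of the original workflow - the one-second delay is an arbitrary choice:

# optional: pause briefly before each request
scrape_politely <- function(webpage_url) {
  Sys.sleep(1)  # arbitrary one-second pause between requests
  scrape_weather_data(webpage_url)
}

If you use it, pass scrape_politely to map_df() below in place of scrape_weather_data.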

# pass urls through the function
weather_data <- map_df(url_list, scrape_weather_data)

weather_data %>% 
  sample_n(10) %>% 
  kable(format = 'html') %>% 
  kable_styling()
| team | season | week | temperature | wind | weather |
|------|--------|------|-------------|------|---------|
| Titans | 2018 | 14 | 44 | 4 | Overcast |
| Bengals | 2019 | 2 | 82 | 3 | Clear |
| Cowboys | 2018 | 1 | 83 | 6 | Drizzle |
| Buccaneers | 2018 | 6 | 71 | 3 | DOME |
| Ravens | 2018 | 11 | 46 | 4 | Cloudy |
| Browns | 2019 | 3 | 82 | 5 | Clear |
| Chiefs | 2018 | 17 | 40 | 11 | Clear |
| Jets | 2019 | 10 | 47 | 5 | Overcast |
| Chiefs | 2018 | 1 | 85 | 5 | Clear |
| Texans | 2018 | 1 | 62 | 6 | Cloudy |
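
If you want to keep the results around for later analysis, a one-liner will persist them to disk - the file name here is just a suggestion:

# save the scraped data set to a csv
write_csv(weather_data, 'nfl_weather_2018_2019.csv')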

Finally

That was a lot. From regular expressions, to writing functions, to iterating with purrr - quite a few different tasks were covered in this post. In the wild, most data are messy and need to be gathered and cleaned using a wide variety of tricks of the trade. Good bye.
