Warmish Welcome
Fantasy sports are a passion of mine because they reside at the intersection of two of my favorite things in the world: sports and data. Over the last four or so years, I have built data sets suitable for modeling fantasy football and basketball outcomes. I tend to favor daily fantasy, but weather data is valuable for season-long fantasy football as well.
In fantasy football, weather data is valuable for a variety of reasons. Conventional wisdom states that temperature, wind, and precipitation affect throwing the football, the kicking game, holding on to the football, and so on. After reading through this post, you will have the necessary data to explore these claims.
Setting Up
To start, we’ll load the necessary libraries. From there, we’ll find a website with weather data, write a function to scrape the data, and then run that function on several web pages automatically to build a complete historical data set.
The website we will be using is NFL Weather. Like most real-world exercises, the data is not presented to us in a perfectly built table. We’ll need to do a fair amount of data wrangling to mold the data into a usable structure.
# package management
library(tidyverse)
library(rvest)
library(janitor)
library(glue)
library(knitr)      # kable()
library(kableExtra) # kable_styling()
Exploring NFLWeather.com
Once you navigate to NFL Weather, you can see the weeks indexed with small blue boxes towards the top, directly in the middle of the page. Clicking on the ‘1’ takes you to the weather for week 1 of the 2020 season. Notice the url for this page: http://nflweather.com/en/week/2020/week-1/. Simply replacing ‘2020’ in the url with ‘2019’ will direct you to the week 1 forecast for the 2019 season. As you might have guessed, swapping ‘week-1’ with ‘week-8’ will launch the forecast for week 8. Already, we can see that the forecast for each week of a given season can be accessed by switching the year and week parts of the url.
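As a quick illustration of that pattern, here’s a sketch using glue (loaded above); the season and week values here are arbitrary:
# swap the season and week into the url template
season <- 2019
week_number <- 8
glue("http://nflweather.com/en/week/{season}/week-{week_number}/")
## http://nflweather.com/en/week/2019/week-8/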
Let’s start with accessing the weather data for week 1 of the 2020 season. Using rvest, we can quickly scan the html for tables and parse appropriately using the read_html() and html_table() functions. These commands parse the html from the url given and return the tables in a list. Using glimpse(), we see that a list of one data frame with some blank rows and null columns is returned.
url <- 'http://nflweather.com/en/week/2020/week-1/'
url %>%
read_html() %>%
html_table() %>%
glimpse()
## List of 1
## $ :'data.frame': 16 obs. of 13 variables:
## ..$ : logi [1:16] NA NA NA NA NA NA ...
## ..$ Away : chr [1:16] "Texans" "Eagles" "Dolphins" "Packers" ...
## ..$ Game : logi [1:16] NA NA NA NA NA NA ...
## ..$ Game : chr [1:16] "@" "@" "@" "@" ...
## ..$ Game : logi [1:16] NA NA NA NA NA NA ...
## ..$ Home : chr [1:16] "Chiefs" "Washington" "Patriots" "Vikings" ...
## ..$ Time (ET) : chr [1:16] "Final: 20 - 34" "Final: 17 - 27" "Final: 11 - 21" "Final: 43 - 34" ...
## ..$ TV : chr [1:16] "NBC" "FOX" "CBS" "FOX" ...
## ..$ : logi [1:16] NA NA NA NA NA NA ...
## ..$ Forecast : chr [1:16] "58f Overcast" "76f Partly Cloudy" "73f Clear" "DOME" ...
## ..$ Extended Forecast: chr [1:16] "Overcast. Rain in the morning and afternoon." "Partly Cloudy. Clear throughout the day." "Clear. Partly cloudy throughout the day." "Clear. Clear throughout the day." ...
## ..$ Wind : chr [1:16] "6m NNE" "4m S" "6m S" "2m SW" ...
## ..$ : chr [1:16] "Details" "Details" "Details" "Details" ...
Data Wrangling
We’ve returned the data frame we want, but definitely not the data frame we need. The initial output requires additional work.
We’ll use clean_names() from the janitor package to give each column a nice, clean name.
url %>%
read_html() %>%
html_table() %>%
.[[1]] %>% # select the first and only element in the list
clean_names() %>%
kable(format = 'html') %>%
kable_styling()
x | away | game | game_2 | game_3 | home | time_et | tv | x_2 | forecast | extended_forecast | wind | x_3 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
NA | Texans | NA | @ | NA | Chiefs | Final: 20 - 34 | NBC | NA | 58f Overcast | Overcast. Rain in the morning and afternoon. | 6m NNE | Details |
NA | Eagles | NA | @ | NA | Washington | Final: 17 - 27 | FOX | NA | 76f Partly Cloudy | Partly Cloudy. Clear throughout the day. | 4m S | Details |
NA | Dolphins | NA | @ | NA | Patriots | Final: 11 - 21 | CBS | NA | 73f Clear | Clear. Partly cloudy throughout the day. | 6m S | Details |
NA | Packers | NA | @ | NA | Vikings | Final: 43 - 34 | FOX | NA | DOME | Clear. Clear throughout the day. | 2m SW | Details |
NA | Colts | NA | @ | NA | Jaguars | Final: 20 - 27 | CBS | NA | 86f Humid and Mostly Cloudy | Humid and Mostly Cloudy. Rain until evening. | 8m ESE | Details |
NA | Bears | NA | @ | NA | Lions | Final: 27 - 23 | FOX | NA | DOME | Clear. Rain in the morning. | 11m W | Details |
NA | Raiders | NA | @ | NA | Panthers | Final: 34 - 30 | CBS | NA | 79f Clear | Clear. Clear throughout the day. | 3m NE | Details |
NA | Jets | NA | @ | NA | Bills | Final: 17 - 27 | CBS | NA | 65f Overcast | Overcast. Rain in the morning and afternoon. | 12m SSW | Details |
NA | Browns | NA | @ | NA | Ravens | Final: 6 - 38 | CBS | NA | 75f Partly Cloudy | Partly Cloudy. Humid and partly cloudy throughout the day. | 5m SSE | Details |
NA | Seahawks | NA | @ | NA | Falcons | Final: 38 - 25 | FOX | NA | DOME | Partly Cloudy. Partly cloudy throughout the day. | 3m ESE | Details |
NA | Chargers | NA | @ | NA | Bengals | Final: 16 - 13 | CBS | NA | 83f Mostly Cloudy | Mostly Cloudy. Possible drizzle in the morning. | 4m WNW | Details |
NA | Cardinals | NA | @ | NA | 49ers | Final: 24 - 20 | FOX | NA | 71f Clear | Clear. Clear throughout the day. | 6m NW | Details |
NA | Buccaneers | NA | @ | NA | Saints | Final: 23 - 34 | FOX | NA | DOME | Humid and Partly Cloudy. Humid and partly cloudy throughout the day. | 10m ENE | Details |
NA | Cowboys | NA | @ | NA | Rams | Final: 17 - 20 | NBC | NA | DOME | Clear. Clear throughout the day. | 7m W | Details |
NA | Steelers | NA | @ | NA | Giants | Final: 26 - 16 | ESPN | NA | 70f Clear | Clear. Clear throughout the day. | 10m NNW | Details |
NA | Titans | NA | @ | NA | Broncos | Final: 16 - 14 | ESPN | NA | 78f Foggy | Foggy. Foggy throughout the day. | 7m NE | Details |
From there, we’ll select both home and away team columns, the forecast column, and the wind column. The forecast field holds a text description of the weather from which we can parse the temperature and precipitation.
url %>%
read_html() %>%
html_table() %>%
.[[1]] %>%
clean_names() %>%
select(away, home, forecast, wind) %>%
kable(format = 'html') %>%
kable_styling()
away | home | forecast | wind |
---|---|---|---|
Texans | Chiefs | 58f Overcast | 6m NNE |
Eagles | Washington | 76f Partly Cloudy | 4m S |
Dolphins | Patriots | 73f Clear | 6m S |
Packers | Vikings | DOME | 2m SW |
Colts | Jaguars | 86f Humid and Mostly Cloudy | 8m ESE |
Bears | Lions | DOME | 11m W |
Raiders | Panthers | 79f Clear | 3m NE |
Jets | Bills | 65f Overcast | 12m SSW |
Browns | Ravens | 75f Partly Cloudy | 5m SSE |
Seahawks | Falcons | DOME | 3m ESE |
Chargers | Bengals | 83f Mostly Cloudy | 4m WNW |
Cardinals | 49ers | 71f Clear | 6m NW |
Buccaneers | Saints | DOME | 10m ENE |
Cowboys | Rams | DOME | 7m W |
Steelers | Giants | 70f Clear | 10m NNW |
Titans | Broncos | 78f Foggy | 7m NE |
Text Gymnastics
Since the season and week fields are not included in the data, we need to extract them from the url. Remembering back to when we changed the ‘2020’ in the url to ‘2019’, we know that the position of the season portion of the url will not change. Using stringr and str_sub(), we’re able to extract characters based on their position within the url - in this case characters 31-34.
Extracting the week out of the url is a little more complex. To start, we use a regular expression to extract everything after the last ‘-’. Using gsub(), we pattern match all characters prior to and including that ‘-’ and replace them with ’’ (nothing). This leaves us with ‘1/’. From here, we just need to capture the number to serve as our week column. Here, I go back to str_sub() and extract characters counted from the right, starting three characters from the end and stopping two from the end, which captures double-digit weeks as well.
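Before wiring these into the pipeline, here is a quick sanity check of each piece on its own (outputs shown as comments):
# season: characters 31 through 34 of the url
str_sub(url, start = 31, end = 34)
## [1] "2020"
# strip everything up to and including the last '-'
gsub('.*\\-', '', url)
## [1] "1/"
# week: count backwards from the right-hand side
str_sub(gsub('.*\\-', '', url), start = -3, end = -2)
## [1] "1"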
url %>%
read_html() %>%
html_table() %>%
.[[1]] %>%
clean_names() %>%
select(away, home, forecast, wind) %>%
mutate(season = str_sub(url, start= 31, end = 34),
week = str_sub(gsub('.*\\-', '', url), start = -3, end = -2)) %>%
kable(format = 'html') %>%
kable_styling()
away | home | forecast | wind | season | week |
---|---|---|---|---|---|
Texans | Chiefs | 58f Overcast | 6m NNE | 2020 | 1 |
Eagles | Washington | 76f Partly Cloudy | 4m S | 2020 | 1 |
Dolphins | Patriots | 73f Clear | 6m S | 2020 | 1 |
Packers | Vikings | DOME | 2m SW | 2020 | 1 |
Colts | Jaguars | 86f Humid and Mostly Cloudy | 8m ESE | 2020 | 1 |
Bears | Lions | DOME | 11m W | 2020 | 1 |
Raiders | Panthers | 79f Clear | 3m NE | 2020 | 1 |
Jets | Bills | 65f Overcast | 12m SSW | 2020 | 1 |
Browns | Ravens | 75f Partly Cloudy | 5m SSE | 2020 | 1 |
Seahawks | Falcons | DOME | 3m ESE | 2020 | 1 |
Chargers | Bengals | 83f Mostly Cloudy | 4m WNW | 2020 | 1 |
Cardinals | 49ers | 71f Clear | 6m NW | 2020 | 1 |
Buccaneers | Saints | DOME | 10m ENE | 2020 | 1 |
Cowboys | Rams | DOME | 7m W | 2020 | 1 |
Steelers | Giants | 70f Clear | 10m NNW | 2020 | 1 |
Titans | Broncos | 78f Foggy | 7m NE | 2020 | 1 |
For capturing the wind, we’ll simply remove all characters following and including the lower case ‘m’ in the wind column. This will give us the numerical representation of the wind. Similarly, for temperature we can use the same style of regular expression on the lower case ‘f’ in the forecast column; for dome games, where the forecast simply reads ‘DOME’, we substitute a constant indoor temperature of 71 degrees. The weather column will be generated by repeatedly removing characters up to and including a white space in the forecast column; because gsub() replaces every match, only the final word of the description survives.
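To see each expression in isolation, here is a small demo on sample values taken from the table above:
# wind: drop the 'm' and everything after it
gsub("m.*$", "", "6m NNE")
## [1] "6"
# temperature: drop the 'f' and everything after it
gsub("f.*$", "", "58f Overcast")
## [1] "58"
# weather: every lazy match ending in a space is removed, keeping the last word
gsub(".*? ", "", "76f Partly Cloudy")
## [1] "Cloudy"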
url %>%
read_html() %>%
html_table() %>%
.[[1]] %>%
clean_names() %>%
select(away, home, forecast, wind) %>%
mutate(season = str_sub(url, start= 31, end = 34),
week = str_sub(gsub('.*\\-', '', url), start = -3, end = -2),
wind = as.numeric(gsub( "m.*$", "",wind)),
temperature = ifelse(forecast == 'DOME', 71, gsub( "f.*$", "", forecast)),
weather = gsub(".*? ", "", forecast)) %>%
kable(format = 'html') %>%
kable_styling()
away | home | forecast | wind | season | week | temperature | weather |
---|---|---|---|---|---|---|---|
Texans | Chiefs | 58f Overcast | 6 | 2020 | 1 | 58 | Overcast |
Eagles | Washington | 76f Partly Cloudy | 4 | 2020 | 1 | 76 | Cloudy |
Dolphins | Patriots | 73f Clear | 6 | 2020 | 1 | 73 | Clear |
Packers | Vikings | DOME | 2 | 2020 | 1 | 71 | DOME |
Colts | Jaguars | 86f Humid and Mostly Cloudy | 8 | 2020 | 1 | 86 | Cloudy |
Bears | Lions | DOME | 11 | 2020 | 1 | 71 | DOME |
Raiders | Panthers | 79f Clear | 3 | 2020 | 1 | 79 | Clear |
Jets | Bills | 65f Overcast | 12 | 2020 | 1 | 65 | Overcast |
Browns | Ravens | 75f Partly Cloudy | 5 | 2020 | 1 | 75 | Cloudy |
Seahawks | Falcons | DOME | 3 | 2020 | 1 | 71 | DOME |
Chargers | Bengals | 83f Mostly Cloudy | 4 | 2020 | 1 | 83 | Cloudy |
Cardinals | 49ers | 71f Clear | 6 | 2020 | 1 | 71 | Clear |
Buccaneers | Saints | DOME | 10 | 2020 | 1 | 71 | DOME |
Cowboys | Rams | DOME | 7 | 2020 | 1 | 71 | DOME |
Steelers | Giants | 70f Clear | 10 | 2020 | 1 | 70 | Clear |
Titans | Broncos | 78f Foggy | 7 | 2020 | 1 | 78 | Foggy |
Pivot Action
At this point, the data is tidied enough to start using. However, before we move on to writing a function to scrape the data for us across multiple web pages at a time, we need to alter the orientation of the data. As is, there is a row for each game. Thinking bigger picture, we will want a row for each team - so that each row is unique by season, week, and team. Suppose you want to join or merge this weather data with a team’s box score data from one or multiple games? This final transformation will make that task much easier.
With the pivot_longer() function, we can elongate the data using the home and away columns - renaming the pivoted values column simply to ‘team’. For example, if the Texans and Chiefs played, this function ensures there is a row for each team along with the corresponding weather data from the remaining columns. What you’ll notice is that instead of 16 rows like the previous outputs, there will now be 32 (one for each team instead of one for each game). The result is a clean dataset with a row for each season, week, and team.
url %>%
read_html() %>%
html_table() %>%
.[[1]] %>%
clean_names() %>%
select(away, home, forecast, wind) %>%
mutate(season = str_sub(url, start= 31, end = 34),
week = str_sub(gsub('.*\\-', '', url), start = -3, end = -2),
wind = as.numeric(gsub( "m.*$", "",wind)),
temperature = ifelse(forecast == 'DOME', 71, gsub( "f.*$", "", forecast)),
weather = gsub(".*? ", "", forecast)) %>%
pivot_longer(cols = c('away', 'home'), values_to = 'team') %>%
select(-name, -forecast) %>%
select(team, season, week, temperature, wind, weather) %>%
kable(format = 'html') %>%
kable_styling()
team | season | week | temperature | wind | weather |
---|---|---|---|---|---|
Texans | 2020 | 1 | 58 | 6 | Overcast |
Chiefs | 2020 | 1 | 58 | 6 | Overcast |
Eagles | 2020 | 1 | 76 | 4 | Cloudy |
Washington | 2020 | 1 | 76 | 4 | Cloudy |
Dolphins | 2020 | 1 | 73 | 6 | Clear |
Patriots | 2020 | 1 | 73 | 6 | Clear |
Packers | 2020 | 1 | 71 | 2 | DOME |
Vikings | 2020 | 1 | 71 | 2 | DOME |
Colts | 2020 | 1 | 86 | 8 | Cloudy |
Jaguars | 2020 | 1 | 86 | 8 | Cloudy |
Bears | 2020 | 1 | 71 | 11 | DOME |
Lions | 2020 | 1 | 71 | 11 | DOME |
Raiders | 2020 | 1 | 79 | 3 | Clear |
Panthers | 2020 | 1 | 79 | 3 | Clear |
Jets | 2020 | 1 | 65 | 12 | Overcast |
Bills | 2020 | 1 | 65 | 12 | Overcast |
Browns | 2020 | 1 | 75 | 5 | Cloudy |
Ravens | 2020 | 1 | 75 | 5 | Cloudy |
Seahawks | 2020 | 1 | 71 | 3 | DOME |
Falcons | 2020 | 1 | 71 | 3 | DOME |
Chargers | 2020 | 1 | 83 | 4 | Cloudy |
Bengals | 2020 | 1 | 83 | 4 | Cloudy |
Cardinals | 2020 | 1 | 71 | 6 | Clear |
49ers | 2020 | 1 | 71 | 6 | Clear |
Buccaneers | 2020 | 1 | 71 | 10 | DOME |
Saints | 2020 | 1 | 71 | 10 | DOME |
Cowboys | 2020 | 1 | 71 | 7 | DOME |
Rams | 2020 | 1 | 71 | 7 | DOME |
Steelers | 2020 | 1 | 70 | 10 | Clear |
Giants | 2020 | 1 | 70 | 10 | Clear |
Titans | 2020 | 1 | 78 | 7 | Foggy |
Broncos | 2020 | 1 | 78 | 7 | Foggy |
Mapping the Function
The final objective here is to use this code as a function. Functions can be used over and over without rewriting code. In this example, we can reuse our code for any existing url, grabbing data for a particular week of a particular season - or we can run it for all weeks across multiple seasons. For this blog post, we’re going to grab weather data for weeks 1 through 17 of the 2018 and 2019 seasons.
We actually need two functions: one for generating the web page urls and another for scraping and wrangling the data. For generating the urls, we establish the sequence of weeks and years. Then, with the help of tidyr and the crossing() function, we can generate all combinations of weeks and years. The result is a tibble with two columns containing a row for each of weeks 1-17 for each of the 2018 and 2019 seasons. The function that generates the urls simply concatenates the base url with the year and week provided. Using pmap(), we can iterate over each row of the weather_weeks_and_years tibble, passing the year and week from each row into our generate_urls() function. The result is a list of 34 urls (17 weeks x 2 seasons) that we can now scrape data from.
# set up weeks and years
weeks <- c(1:17)
years <- c(2018:2019)
weather_weeks_and_years <- crossing(year = years, week = weeks) # generate all unique combinations
# function to generate urls
generate_urls <- function(year, week) {
  glue::glue("http://nflweather.com/en/week/{year}/week-{week}/")
}
# pass weeks and years through function
url_list <- pmap(weather_weeks_and_years, generate_urls)
head(url_list)
## [[1]]
## http://nflweather.com/en/week/2018/week-1/
##
## [[2]]
## http://nflweather.com/en/week/2018/week-2/
##
## [[3]]
## http://nflweather.com/en/week/2018/week-3/
##
## [[4]]
## http://nflweather.com/en/week/2018/week-4/
##
## [[5]]
## http://nflweather.com/en/week/2018/week-5/
##
## [[6]]
## http://nflweather.com/en/week/2018/week-6/
Below, we create a function that accepts a url, then scrapes, wrangles, and returns the data. The code is the same code written above, applied once per url. We execute the function for each url within url_list by using map_df() from the purrr package. map_df() accepts an input list and applies a function to each element while returning a single data frame. It might take a minute or two to run. Upon completion, you should have a solid weather dataset to start digging into.
# function to scrape weather
scrape_weather_data <- function(webpage_url) {
webpage_url %>%
read_html() %>%
html_table() %>%
.[[1]] %>%
clean_names() %>%
select(away, home, forecast, wind) %>%
mutate(season = str_sub(webpage_url, start= 31, end = 34),
week = str_sub(gsub('.*\\-', '', webpage_url), start = -3, end = -2),
wind = as.numeric(gsub( "m.*$", "",wind)),
temperature = ifelse(forecast == 'DOME', 71, gsub( "f.*$", "", forecast)),
weather = gsub(".*? ", "", forecast)) %>%
pivot_longer(cols = c('away', 'home'), values_to = 'team') %>%
select(-name, -forecast) %>%
select(team, season, week, temperature, wind, weather)
}
# pass urls through the function
weather_data <- map_df(url_list, scrape_weather_data)
weather_data %>%
sample_n(10) %>%
kable(format = 'html') %>%
kable_styling()
team | season | week | temperature | wind | weather |
---|---|---|---|---|---|
Titans | 2018 | 14 | 44 | 4 | Overcast |
Bengals | 2019 | 2 | 82 | 3 | Clear |
Cowboys | 2018 | 1 | 83 | 6 | Drizzle |
Buccaneers | 2018 | 6 | 71 | 3 | DOME |
Ravens | 2018 | 11 | 46 | 4 | Cloudy |
Browns | 2019 | 3 | 82 | 5 | Clear |
Chiefs | 2018 | 17 | 40 | 11 | Clear |
Jets | 2019 | 10 | 47 | 5 | Overcast |
Chiefs | 2018 | 1 | 85 | 5 | Clear |
Texans | 2018 | 1 | 62 | 6 | Cloudy |
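One caveat before wrapping up: map_df() will halt the entire run if a single page fails to load or parse. Here is a more defensive sketch, using purrr’s possibly() to swallow failures and Sys.sleep() to space out requests; the one-second delay and the empty-tibble fallback are arbitrary choices, not part of the pipeline above:
# wrap the scraper so a failed page yields an empty tibble instead of an error
safe_scrape <- possibly(scrape_weather_data, otherwise = tibble())
weather_data <- map_df(url_list, function(page_url) {
  Sys.sleep(1) # pause briefly between requests
  safe_scrape(page_url)
})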
Finally
That was a lot. From regular expressions, to creating functions, to iterating with purrr - there were quite a few different tasks described in this post. In the wild, most data is messy and needs to be gathered and cleaned using a wide variety of tricks of the trade. Goodbye.