Chapter 13 - Relational data

If you’ve ever worked with SQL before, this chapter should be relatively familiar to you. Here we learn about the different ways to work with relational data in R, using dplyr functions. Quoted from the book, the different types of thins you can perform are:

  • “Mutating joins, which add new variables to one data frame from matching observations in another.”
  • “Filtering joins, which filter observations from one data frame based on whether or not they match an observation in the other table.”
  • “Set operations, which treat observations as if they were set elements.”
library("tidyverse")
library("nycflights13")

The flights dataset in nycflights13 has tibbles that can be used to practice relational data management:

# the complete flights table
flights
## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      517            515         2      830
##  2  2013     1     1      533            529         4      850
##  3  2013     1     1      542            540         2      923
##  4  2013     1     1      544            545        -1     1004
##  5  2013     1     1      554            600        -6      812
##  6  2013     1     1      554            558        -4      740
##  7  2013     1     1      555            600        -5      913
##  8  2013     1     1      557            600        -3      709
##  9  2013     1     1      557            600        -3      838
## 10  2013     1     1      558            600        -2      753
## # … with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>
# info on airlines
airlines
## # A tibble: 16 x 2
##    carrier name                       
##    <chr>   <chr>                      
##  1 9E      Endeavor Air Inc.          
##  2 AA      American Airlines Inc.     
##  3 AS      Alaska Airlines Inc.       
##  4 B6      JetBlue Airways            
##  5 DL      Delta Air Lines Inc.       
##  6 EV      ExpressJet Airlines Inc.   
##  7 F9      Frontier Airlines Inc.     
##  8 FL      AirTran Airways Corporation
##  9 HA      Hawaiian Airlines Inc.     
## 10 MQ      Envoy Air                  
## 11 OO      SkyWest Airlines Inc.      
## 12 UA      United Air Lines Inc.      
## 13 US      US Airways Inc.            
## 14 VX      Virgin America             
## 15 WN      Southwest Airlines Co.     
## 16 YV      Mesa Airlines Inc.
# info on airports
airports
## # A tibble: 1,458 x 8
##    faa   name                    lat    lon   alt    tz dst   tzone        
##    <chr> <chr>                 <dbl>  <dbl> <int> <dbl> <chr> <chr>        
##  1 04G   Lansdowne Airport      41.1  -80.6  1044    -5 A     America/New_…
##  2 06A   Moton Field Municipa…  32.5  -85.7   264    -6 A     America/Chic…
##  3 06C   Schaumburg Regional    42.0  -88.1   801    -6 A     America/Chic…
##  4 06N   Randall Airport        41.4  -74.4   523    -5 A     America/New_…
##  5 09J   Jekyll Island Airport  31.1  -81.4    11    -5 A     America/New_…
##  6 0A9   Elizabethton Municip…  36.4  -82.2  1593    -5 A     America/New_…
##  7 0G6   Williams County Airp…  41.5  -84.5   730    -5 A     America/New_…
##  8 0G7   Finger Lakes Regiona…  42.9  -76.8   492    -5 A     America/New_…
##  9 0P2   Shoestring Aviation …  39.8  -76.6  1000    -5 U     America/New_…
## 10 0S9   Jefferson County Intl  48.1 -123.    108    -8 A     America/Los_…
## # … with 1,448 more rows
# info on individual planes
planes
## # A tibble: 3,322 x 9
##    tailnum  year type       manufacturer  model  engines seats speed engine
##    <chr>   <int> <chr>      <chr>         <chr>    <int> <int> <int> <chr> 
##  1 N10156   2004 Fixed win… EMBRAER       EMB-1…       2    55    NA Turbo…
##  2 N102UW   1998 Fixed win… AIRBUS INDUS… A320-…       2   182    NA Turbo…
##  3 N103US   1999 Fixed win… AIRBUS INDUS… A320-…       2   182    NA Turbo…
##  4 N104UW   1999 Fixed win… AIRBUS INDUS… A320-…       2   182    NA Turbo…
##  5 N10575   2002 Fixed win… EMBRAER       EMB-1…       2    55    NA Turbo…
##  6 N105UW   1999 Fixed win… AIRBUS INDUS… A320-…       2   182    NA Turbo…
##  7 N107US   1999 Fixed win… AIRBUS INDUS… A320-…       2   182    NA Turbo…
##  8 N108UW   1999 Fixed win… AIRBUS INDUS… A320-…       2   182    NA Turbo…
##  9 N109UW   1999 Fixed win… AIRBUS INDUS… A320-…       2   182    NA Turbo…
## 10 N110UW   1999 Fixed win… AIRBUS INDUS… A320-…       2   182    NA Turbo…
## # … with 3,312 more rows
# weater information
weather
## # A tibble: 26,115 x 15
##    origin  year month   day  hour  temp  dewp humid wind_dir wind_speed
##    <chr>  <dbl> <dbl> <int> <int> <dbl> <dbl> <dbl>    <dbl>      <dbl>
##  1 EWR     2013     1     1     1  39.0  26.1  59.4      270      10.4 
##  2 EWR     2013     1     1     2  39.0  27.0  61.6      250       8.06
##  3 EWR     2013     1     1     3  39.0  28.0  64.4      240      11.5 
##  4 EWR     2013     1     1     4  39.9  28.0  62.2      250      12.7 
##  5 EWR     2013     1     1     5  39.0  28.0  64.4      260      12.7 
##  6 EWR     2013     1     1     6  37.9  28.0  67.2      240      11.5 
##  7 EWR     2013     1     1     7  39.0  28.0  64.4      240      15.0 
##  8 EWR     2013     1     1     8  39.9  28.0  62.2      250      10.4 
##  9 EWR     2013     1     1     9  39.9  28.0  62.2      260      15.0 
## 10 EWR     2013     1     1    10  41    28.0  59.6      260      13.8 
## # … with 26,105 more rows, and 5 more variables: wind_gust <dbl>,
## #   precip <dbl>, pressure <dbl>, visib <dbl>, time_hour <dttm>
  • flights connects to planes via a single variable, tailnum.
  • flights connects to airlines through the carrier variable.
  • flights connects to airports in two ways: via the origin and dest variables.
  • flights connects to weather via origin (the location), and year, month, day and hour (the time).

13.2.1 Exercises

1. Imagine you wanted to draw (approximately) the route each plane flies from its origin to its destination. What variables would you need? What tables would you need to combine?

You would need the variables tailnum, which defines individual planes, and the latitude and longitude values of the origin and destination variables. To do this, you can combine the flights table, which connects tailnum to origin and destination, with the airports table, which contains the latitude and longitude of each airport. Here is the join that can be performed:

flights_1 <- left_join(flights, airports, by = c("dest" = "faa"))
flights_2 <- left_join(flights_1, airports, by = c("origin" = "faa"))
flights_2
## # A tibble: 336,776 x 33
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      517            515         2      830
##  2  2013     1     1      533            529         4      850
##  3  2013     1     1      542            540         2      923
##  4  2013     1     1      544            545        -1     1004
##  5  2013     1     1      554            600        -6      812
##  6  2013     1     1      554            558        -4      740
##  7  2013     1     1      555            600        -5      913
##  8  2013     1     1      557            600        -3      709
##  9  2013     1     1      557            600        -3      838
## 10  2013     1     1      558            600        -2      753
## # … with 336,766 more rows, and 26 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>, name.x <chr>, lat.x <dbl>,
## #   lon.x <dbl>, alt.x <int>, tz.x <dbl>, dst.x <chr>, tzone.x <chr>,
## #   name.y <chr>, lat.y <dbl>, lon.y <dbl>, alt.y <int>, tz.y <dbl>,
## #   dst.y <chr>, tzone.y <chr>

2. I forgot to draw the relationship between weather and airports. What is the relationship and how should it appear in the diagram?

The airports table can be joined to the weather table, in which the primary key is origin in the weather table and the foreign key is the faa column in the airports table. For each observation in weather, information about the airport can be appended as a new column. For example:

weather_airports <- left_join(weather, airports, by = c("origin" = "faa"))
weather_airports
## # A tibble: 26,115 x 22
##    origin  year month   day  hour  temp  dewp humid wind_dir wind_speed
##    <chr>  <dbl> <dbl> <int> <int> <dbl> <dbl> <dbl>    <dbl>      <dbl>
##  1 EWR     2013     1     1     1  39.0  26.1  59.4      270      10.4 
##  2 EWR     2013     1     1     2  39.0  27.0  61.6      250       8.06
##  3 EWR     2013     1     1     3  39.0  28.0  64.4      240      11.5 
##  4 EWR     2013     1     1     4  39.9  28.0  62.2      250      12.7 
##  5 EWR     2013     1     1     5  39.0  28.0  64.4      260      12.7 
##  6 EWR     2013     1     1     6  37.9  28.0  67.2      240      11.5 
##  7 EWR     2013     1     1     7  39.0  28.0  64.4      240      15.0 
##  8 EWR     2013     1     1     8  39.9  28.0  62.2      250      10.4 
##  9 EWR     2013     1     1     9  39.9  28.0  62.2      260      15.0 
## 10 EWR     2013     1     1    10  41    28.0  59.6      260      13.8 
## # … with 26,105 more rows, and 12 more variables: wind_gust <dbl>,
## #   precip <dbl>, pressure <dbl>, visib <dbl>, time_hour <dttm>,
## #   name <chr>, lat <dbl>, lon <dbl>, alt <int>, tz <dbl>, dst <chr>,
## #   tzone <chr>

3. weather only contains information for the origin (NYC) airports. If it contained weather records for all airports in the USA, what additional relation would it define with flights?

If it contained weather records for all airports, it could be joined with flights using the primary key consisting of the columns year, month, day, hour and origin in weather, which could be connected with the foreign key year, month, day, hour, and either origin or dest in flights.

4. We know that some days of the year are “special”, and fewer people than usual fly on them. How might you represent that data as a data frame? What would be the primary keys of that table? How would it connect to the existing tables?

You could have a separate data frame containing the variables month, day, and “special”, in which the “special” days would be designated TRUE. This could be connected to the existing tables using the primary key consisting of month, and day (should probably include the year as well, depending on the holiday). The value for the “special” column would be appended to the corresponding foriegn key values as TRUE or FALSE. For example, below is a table containing some special days of the year (Christmas, New Years), that is joined to the flights table.

special <- tribble (
  ~month, ~day, ~event,
  12, 25, TRUE,
  1, 1, TRUE
)
left_join(flights, special) %>%
  select (year, month, day, flight, event) %>%
  mutate ( event = (!is.na(event)))
## Joining, by = c("month", "day")
## # A tibble: 336,776 x 5
##     year month   day flight event
##    <int> <dbl> <dbl>  <int> <lgl>
##  1  2013     1     1   1545 TRUE 
##  2  2013     1     1   1714 TRUE 
##  3  2013     1     1   1141 TRUE 
##  4  2013     1     1    725 TRUE 
##  5  2013     1     1    461 TRUE 
##  6  2013     1     1   1696 TRUE 
##  7  2013     1     1    507 TRUE 
##  8  2013     1     1   5708 TRUE 
##  9  2013     1     1     79 TRUE 
## 10  2013     1     1    301 TRUE 
## # … with 336,766 more rows

13.3 Notes - Keys

A key is a value or combination of values that uniquely identifies an observation. From the book,

  • “A primary key uniquely identifies an observation in its own table. For example, planes$tailnum is a primary key because it uniquely identifies each plane in the planes table.”
  • “A foreign key uniquely identifies an observation in another table. For example, the flights$tailnum is a foreign key because it appears in the flights table where it matches each flight to a unique plane.”

You can verify that a variable can serve as a primary key by verifying that they do uniquely identify each observation. The book uses a combination of count() and filter(). If there are any values with count > 1, the variable is not a primary key.

planes %>%
  count(tailnum) %>%
  filter(n > 1)
## # A tibble: 0 x 2
## # … with 2 variables: tailnum <chr>, n <int>
weather %>%
  count(year, month, day, hour, origin) %>%
  filter(n > 1)
## # A tibble: 3 x 6
##    year month   day  hour origin     n
##   <dbl> <dbl> <int> <int> <chr>  <int>
## 1  2013    11     3     1 EWR        2
## 2  2013    11     3     1 JFK        2
## 3  2013    11     3     1 LGA        2

If there are no variables in the dataset that can serve as primary keys, we can create a primary key with mutate() and row_number(). This is called a surrogate key.

# combination of year, month, day, flight is not adequate as a primary key.
# A surrogate key must be added. See 13.3.1 exercises for an example of adding a surrogate key.
flights %>%
  count(year, month, day, flight) %>%
  filter(n > 1)
## # A tibble: 29,768 x 5
##     year month   day flight     n
##    <int> <int> <int>  <int> <int>
##  1  2013     1     1      1     2
##  2  2013     1     1      3     2
##  3  2013     1     1      4     2
##  4  2013     1     1     11     3
##  5  2013     1     1     15     2
##  6  2013     1     1     21     2
##  7  2013     1     1     27     4
##  8  2013     1     1     31     2
##  9  2013     1     1     32     2
## 10  2013     1     1     35     2
## # … with 29,758 more rows

How a primary key from one table matches a foreign key in another table is what is called the relation. Depending on how they match, we can perform joins or other set operations, based on the values present in both keys.

13.3.1 Exercises

1. Add a surrogate key to flights.

# add a surrogate key to flights, display vs flight
flights %>%
  transmute (flight, surrogate_key = row_number() )
## # A tibble: 336,776 x 2
##    flight surrogate_key
##     <int>         <int>
##  1   1545             1
##  2   1714             2
##  3   1141             3
##  4    725             4
##  5    461             5
##  6   1696             6
##  7    507             7
##  8   5708             8
##  9     79             9
## 10    301            10
## # … with 336,766 more rows

2. Identify the keys in the following datasets:

We can determine the primary keys by looking for unique groupings of variables that match the total number of observations in each table using group_by() and count(), and piping the ouput either to nrow() or filtering for groups that have n>1. If nrow() is equal to the total observations, or if filter(n>1) yields zero observations, then we have found the primary key.

The primary key for Lahman::Batting is the combination of playerID, yearID, and stint. The primary key for babynames::babynames is year, sex, and name. The primary key for nasaweather::atmos is lat, long, year, and month. The primary key for fueleconomy::vehicles is id. For ggplot2::diamonds, there is no primary key. We can generate a surrogate key using mutate() and row_number().

head(Lahman::Batting)
##    playerID yearID stint teamID lgID  G  AB  R  H X2B X3B HR RBI SB CS BB
## 1 abercda01   1871     1    TRO   NA  1   4  0  0   0   0  0   0  0  0  0
## 2  addybo01   1871     1    RC1   NA 25 118 30 32   6   0  0  13  8  1  4
## 3 allisar01   1871     1    CL1   NA 29 137 28 40   4   5  0  19  3  1  2
## 4 allisdo01   1871     1    WS3   NA 27 133 28 44  10   2  2  27  1  1  0
## 5 ansonca01   1871     1    RC1   NA 25 120 29 39  11   3  0  16  6  2  2
## 6 armstbo01   1871     1    FW1   NA 12  49  9 11   2   1  0   5  0  1  0
##   SO IBB HBP SH SF GIDP
## 1  0  NA  NA NA NA   NA
## 2  0  NA  NA NA NA   NA
## 3  5  NA  NA NA NA   NA
## 4  2  NA  NA NA NA   NA
## 5  1  NA  NA NA NA   NA
## 6  1  NA  NA NA NA   NA
# method 1
dim(Lahman::Batting)
## [1] 102816     22
group_by(Lahman::Batting, playerID, yearID,stint) %>%
  count() %>%
  nrow()
## [1] 102816
# method 2
group_by(Lahman::Batting, playerID, yearID,stint) %>%
  count() %>%
  nrow()
## [1] 102816
babynames::babynames
## # A tibble: 1,924,665 x 5
##     year sex   name          n   prop
##    <dbl> <chr> <chr>     <int>  <dbl>
##  1  1880 F     Mary       7065 0.0724
##  2  1880 F     Anna       2604 0.0267
##  3  1880 F     Emma       2003 0.0205
##  4  1880 F     Elizabeth  1939 0.0199
##  5  1880 F     Minnie     1746 0.0179
##  6  1880 F     Margaret   1578 0.0162
##  7  1880 F     Ida        1472 0.0151
##  8  1880 F     Alice      1414 0.0145
##  9  1880 F     Bertha     1320 0.0135
## 10  1880 F     Sarah      1288 0.0132
## # … with 1,924,655 more rows
dim(babynames::babynames)
## [1] 1924665       5
group_by(babynames::babynames, year, sex,name) %>%
  count() %>%
  nrow()
## [1] 1924665
nasaweather::atmos
## # A tibble: 41,472 x 11
##      lat  long  year month surftemp  temp pressure ozone cloudlow cloudmid
##    <dbl> <dbl> <int> <int>    <dbl> <dbl>    <dbl> <dbl>    <dbl>    <dbl>
##  1  36.2 -114.  1995     1     273.  272.      835   304      7.5     34.5
##  2  33.7 -114.  1995     1     280.  282.      940   304     11.5     32.5
##  3  31.2 -114.  1995     1     285.  285.      960   298     16.5     26  
##  4  28.7 -114.  1995     1     289.  291.      990   276     20.5     14.5
##  5  26.2 -114.  1995     1     292.  293.     1000   274     26       10.5
##  6  23.7 -114.  1995     1     294.  294.     1000   264     30        9.5
##  7  21.2 -114.  1995     1     295   295.     1000   258     29.5     11  
##  8  18.7 -114.  1995     1     298.  297.     1000   252     26.5     17.5
##  9  16.2 -114.  1995     1     300.  298.     1000   250     27.5     18.5
## 10  13.7 -114.  1995     1     300.  299.     1000   250     26       16.5
## # … with 41,462 more rows, and 1 more variable: cloudhigh <dbl>
dim(nasaweather::atmos)
## [1] 41472    11
group_by(nasaweather::atmos, lat, long, year, month) %>%
  count() %>%
  nrow()
## [1] 41472
fueleconomy::vehicles
## # A tibble: 33,442 x 12
##       id make   model  year class trans drive   cyl displ fuel    hwy   cty
##    <int> <chr>  <chr> <int> <chr> <chr> <chr> <int> <dbl> <chr> <int> <int>
##  1 27550 AM Ge… DJ P…  1984 Spec… Auto… 2-Wh…     4   2.5 Regu…    17    18
##  2 28426 AM Ge… DJ P…  1984 Spec… Auto… 2-Wh…     4   2.5 Regu…    17    18
##  3 27549 AM Ge… FJ8c…  1984 Spec… Auto… 2-Wh…     6   4.2 Regu…    13    13
##  4 28425 AM Ge… FJ8c…  1984 Spec… Auto… 2-Wh…     6   4.2 Regu…    13    13
##  5  1032 AM Ge… Post…  1985 Spec… Auto… Rear…     4   2.5 Regu…    17    16
##  6  1033 AM Ge… Post…  1985 Spec… Auto… Rear…     6   4.2 Regu…    13    13
##  7  3347 ASC I… GNX    1987 Mids… Auto… Rear…     6   3.8 Prem…    21    14
##  8 13309 Acura  2.2C…  1997 Subc… Auto… Fron…     4   2.2 Regu…    26    20
##  9 13310 Acura  2.2C…  1997 Subc… Manu… Fron…     4   2.2 Regu…    28    22
## 10 13311 Acura  2.2C…  1997 Subc… Auto… Fron…     6   3   Regu…    26    18
## # … with 33,432 more rows
dim(fueleconomy::vehicles)
## [1] 33442    12
group_by(fueleconomy::vehicles, id) %>%
  count() %>%
  nrow()
## [1] 33442
ggplot2::diamonds
## # A tibble: 53,940 x 10
##    carat cut       color clarity depth table price     x     y     z
##    <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
##  1 0.23  Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
##  2 0.21  Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
##  3 0.23  Good      E     VS1      56.9    65   327  4.05  4.07  2.31
##  4 0.290 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
##  5 0.31  Good      J     SI2      63.3    58   335  4.34  4.35  2.75
##  6 0.24  Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
##  7 0.24  Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
##  8 0.26  Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
##  9 0.22  Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
## 10 0.23  Very Good H     VS1      59.4    61   338  4     4.05  2.39
## # … with 53,930 more rows
diamonds_withkey <- ggplot2::diamonds %>%
  mutate (surrogate_key = row_number())
diamonds_withkey
## # A tibble: 53,940 x 11
##    carat cut   color clarity depth table price     x     y     z
##    <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
##  1 0.23  Ideal E     SI2      61.5    55   326  3.95  3.98  2.43
##  2 0.21  Prem… E     SI1      59.8    61   326  3.89  3.84  2.31
##  3 0.23  Good  E     VS1      56.9    65   327  4.05  4.07  2.31
##  4 0.290 Prem… I     VS2      62.4    58   334  4.2   4.23  2.63
##  5 0.31  Good  J     SI2      63.3    58   335  4.34  4.35  2.75
##  6 0.24  Very… J     VVS2     62.8    57   336  3.94  3.96  2.48
##  7 0.24  Very… I     VVS1     62.3    57   336  3.95  3.98  2.47
##  8 0.26  Very… H     SI1      61.9    55   337  4.07  4.11  2.53
##  9 0.22  Fair  E     VS2      65.1    61   337  3.87  3.78  2.49
## 10 0.23  Very… H     VS1      59.4    61   338  4     4.05  2.39
## # … with 53,930 more rows, and 1 more variable: surrogate_key <int>

3. Draw a diagram illustrating the connections between the Batting, Master, and Salaries tables in the Lahman package. Draw another diagram that shows the relationship between Master, Managers, AwardsManagers. How would you characterise the relationship between the Batting, Pitching, and Fielding tables?

  • Batting connects to Master using playerID, and with Salaries through playerID, yearID, teamID, and lgID. Master connects with Salaries through playerID.
  • Master connects with Managers and with AwardsManagers through playerID. Managers connects with AwardsManagers using playerID, year, and lgID.
  • Batting, Pitching, and Fielding all connect with each other through playerID, yearID, stint, teamID, and lgID. They are just variables for this key categorized into different tables based on the type of play. You could also just use playerID, yearID, and stint to get a unique key for these tables.
library(Lahman)

as_tibble(Batting)
## # A tibble: 102,816 x 22
##    playerID yearID stint teamID lgID      G    AB     R     H   X2B   X3B
##    <chr>     <int> <int> <fct>  <fct> <int> <int> <int> <int> <int> <int>
##  1 abercda…   1871     1 TRO    NA        1     4     0     0     0     0
##  2 addybo01   1871     1 RC1    NA       25   118    30    32     6     0
##  3 allisar…   1871     1 CL1    NA       29   137    28    40     4     5
##  4 allisdo…   1871     1 WS3    NA       27   133    28    44    10     2
##  5 ansonca…   1871     1 RC1    NA       25   120    29    39    11     3
##  6 armstbo…   1871     1 FW1    NA       12    49     9    11     2     1
##  7 barkeal…   1871     1 RC1    NA        1     4     0     1     0     0
##  8 barnero…   1871     1 BS1    NA       31   157    66    63    10     9
##  9 barrebi…   1871     1 FW1    NA        1     5     1     1     1     0
## 10 barrofr…   1871     1 BS1    NA       18    86    13    13     2     1
## # … with 102,806 more rows, and 11 more variables: HR <int>, RBI <int>,
## #   SB <int>, CS <int>, BB <int>, SO <int>, IBB <int>, HBP <int>,
## #   SH <int>, SF <int>, GIDP <int>
as_tibble(Master)
## # A tibble: 19,105 x 26
##    playerID birthYear birthMonth birthDay birthCountry birthState birthCity
##    <chr>        <int>      <int>    <int> <chr>        <chr>      <chr>    
##  1 aardsda…      1981         12       27 USA          CO         Denver   
##  2 aaronha…      1934          2        5 USA          AL         Mobile   
##  3 aaronto…      1939          8        5 USA          AL         Mobile   
##  4 aasedo01      1954          9        8 USA          CA         Orange   
##  5 abadan01      1972          8       25 USA          FL         Palm Bea…
##  6 abadfe01      1985         12       17 D.R.         La Romana  La Romana
##  7 abadijo…      1850         11        4 USA          PA         Philadel…
##  8 abbated…      1877          4       15 USA          PA         Latrobe  
##  9 abbeybe…      1869         11       11 USA          VT         Essex    
## 10 abbeych…      1866         10       14 USA          NE         Falls Ci…
## # … with 19,095 more rows, and 19 more variables: deathYear <int>,
## #   deathMonth <int>, deathDay <int>, deathCountry <chr>,
## #   deathState <chr>, deathCity <chr>, nameFirst <chr>, nameLast <chr>,
## #   nameGiven <chr>, weight <int>, height <int>, bats <fct>, throws <fct>,
## #   debut <chr>, finalGame <chr>, retroID <chr>, bbrefID <chr>,
## #   deathDate <date>, birthDate <date>
as_tibble(Salaries)
## # A tibble: 26,428 x 5
##    yearID teamID lgID  playerID  salary
##     <int> <fct>  <fct> <chr>      <int>
##  1   1985 ATL    NL    barkele01 870000
##  2   1985 ATL    NL    bedrost01 550000
##  3   1985 ATL    NL    benedbr01 545000
##  4   1985 ATL    NL    campri01  633333
##  5   1985 ATL    NL    ceronri01 625000
##  6   1985 ATL    NL    chambch01 800000
##  7   1985 ATL    NL    dedmoje01 150000
##  8   1985 ATL    NL    forstte01 483333
##  9   1985 ATL    NL    garbege01 772000
## 10   1985 ATL    NL    harpete01 250000
## # … with 26,418 more rows
as_tibble(Managers)
## # A tibble: 3,436 x 10
##    playerID  yearID teamID lgID  inseason     G     W     L  rank plyrMgr
##    <chr>      <int> <fct>  <fct>    <int> <int> <int> <int> <int> <fct>  
##  1 wrighha01   1871 BS1    NA           1    31    20    10     3 Y      
##  2 woodji01    1871 CH1    NA           1    28    19     9     2 Y      
##  3 paborch01   1871 CL1    NA           1    29    10    19     8 Y      
##  4 lennobi01   1871 FW1    NA           1    14     5     9     8 Y      
##  5 deaneha01   1871 FW1    NA           2     5     2     3     8 Y      
##  6 fergubo01   1871 NY2    NA           1    33    16    17     5 Y      
##  7 mcbridi01   1871 PH1    NA           1    28    21     7     1 Y      
##  8 hastisc01   1871 RC1    NA           1    25     4    21     9 Y      
##  9 pikeli01    1871 TRO    NA           1     4     1     3     6 Y      
## 10 cravebi01   1871 TRO    NA           2    25    12    12     6 Y      
## # … with 3,426 more rows
as_tibble(AwardsManagers)
## # A tibble: 179 x 6
##    playerID  awardID                   yearID lgID  tie   notes
##    <chr>     <chr>                      <int> <fct> <chr> <lgl>
##  1 larusto01 BBWAA Manager of the Year   1983 AL    <NA>  NA   
##  2 lasorto01 BBWAA Manager of the Year   1983 NL    <NA>  NA   
##  3 andersp01 BBWAA Manager of the Year   1984 AL    <NA>  NA   
##  4 freyji99  BBWAA Manager of the Year   1984 NL    <NA>  NA   
##  5 coxbo01   BBWAA Manager of the Year   1985 AL    <NA>  NA   
##  6 herzowh01 BBWAA Manager of the Year   1985 NL    <NA>  NA   
##  7 mcnamjo99 BBWAA Manager of the Year   1986 AL    <NA>  NA   
##  8 lanieha01 BBWAA Manager of the Year   1986 NL    <NA>  NA   
##  9 andersp01 BBWAA Manager of the Year   1987 AL    <NA>  NA   
## 10 rodgebu01 BBWAA Manager of the Year   1987 NL    <NA>  NA   
## # … with 169 more rows
as_tibble(Pitching)
## # A tibble: 44,963 x 30
##    playerID yearID stint teamID lgID      W     L     G    GS    CG   SHO
##    <chr>     <int> <int> <fct>  <fct> <int> <int> <int> <int> <int> <int>
##  1 bechtge…   1871     1 PH1    NA        1     2     3     3     2     0
##  2 brainas…   1871     1 WS3    NA       12    15    30    30    30     0
##  3 fergubo…   1871     1 NY2    NA        0     0     1     0     0     0
##  4 fishech…   1871     1 RC1    NA        4    16    24    24    22     1
##  5 fleetfr…   1871     1 NY2    NA        0     1     1     1     1     0
##  6 flowedi…   1871     1 TRO    NA        0     0     1     0     0     0
##  7 mackde01   1871     1 RC1    NA        0     1     3     1     1     0
##  8 mathebo…   1871     1 FW1    NA        6    11    19    19    19     1
##  9 mcbridi…   1871     1 PH1    NA       18     5    25    25    25     0
## 10 mcmuljo…   1871     1 TRO    NA       12    15    29    29    28     0
## # … with 44,953 more rows, and 19 more variables: SV <int>, IPouts <int>,
## #   H <int>, ER <int>, HR <int>, BB <int>, SO <int>, BAOpp <dbl>,
## #   ERA <dbl>, IBB <int>, WP <int>, HBP <int>, BK <int>, BFP <int>,
## #   GF <int>, R <int>, SH <int>, SF <int>, GIDP <int>
as_tibble(Fielding)
## # A tibble: 136,815 x 18
##    playerID yearID stint teamID lgID  POS       G    GS InnOuts    PO     A
##    <chr>     <int> <int> <fct>  <fct> <chr> <int> <int>   <int> <int> <int>
##  1 abercda…   1871     1 TRO    NA    SS        1    NA      NA     1     3
##  2 addybo01   1871     1 RC1    NA    2B       22    NA      NA    67    72
##  3 addybo01   1871     1 RC1    NA    SS        3    NA      NA     8    14
##  4 allisar…   1871     1 CL1    NA    2B        2    NA      NA     1     4
##  5 allisar…   1871     1 CL1    NA    OF       29    NA      NA    51     3
##  6 allisdo…   1871     1 WS3    NA    C        27    NA      NA    68    15
##  7 ansonca…   1871     1 RC1    NA    1B        1    NA      NA     7     0
##  8 ansonca…   1871     1 RC1    NA    2B        2    NA      NA     3     4
##  9 ansonca…   1871     1 RC1    NA    3B       20    NA      NA    38    52
## 10 ansonca…   1871     1 RC1    NA    C         5    NA      NA    10     0
## # … with 136,805 more rows, and 7 more variables: E <int>, DP <int>,
## #   PB <int>, WP <int>, SB <int>, CS <int>, ZR <int>
Batting %>%
  group_by(playerID, yearID, stint) %>%
  count() %>%
  filter(n>1)
## # A tibble: 0 x 4
## # Groups:   playerID, yearID, stint [0]
## # … with 4 variables: playerID <chr>, yearID <int>, stint <int>, n <int>

13.4 Notes - Mutating Joins

This section on mutating joins introduces the usage of the join functions and provides examples of how they could be used. Below are the provided examples.

# make a truncated dataset to work with
flights2 <- flights %>%
  select(year:day, hour, origin, dest, tailnum, carrier)
flights2
## # A tibble: 336,776 x 8
##     year month   day  hour origin dest  tailnum carrier
##    <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>  
##  1  2013     1     1     5 EWR    IAH   N14228  UA     
##  2  2013     1     1     5 LGA    IAH   N24211  UA     
##  3  2013     1     1     5 JFK    MIA   N619AA  AA     
##  4  2013     1     1     5 JFK    BQN   N804JB  B6     
##  5  2013     1     1     6 LGA    ATL   N668DN  DL     
##  6  2013     1     1     5 EWR    ORD   N39463  UA     
##  7  2013     1     1     6 EWR    FLL   N516JB  B6     
##  8  2013     1     1     6 LGA    IAD   N829AS  EV     
##  9  2013     1     1     6 JFK    MCO   N593JB  B6     
## 10  2013     1     1     6 LGA    ORD   N3ALAA  AA     
## # … with 336,766 more rows

Comparing how to perform joins using dplyr or baseR:

# left join using left_join() from dplyr
flights2 %>%
  select(-origin, -dest) %>%
  left_join(airlines, by = "carrier")
## # A tibble: 336,776 x 7
##     year month   day  hour tailnum carrier name                    
##    <int> <int> <int> <dbl> <chr>   <chr>   <chr>                   
##  1  2013     1     1     5 N14228  UA      United Air Lines Inc.   
##  2  2013     1     1     5 N24211  UA      United Air Lines Inc.   
##  3  2013     1     1     5 N619AA  AA      American Airlines Inc.  
##  4  2013     1     1     5 N804JB  B6      JetBlue Airways         
##  5  2013     1     1     6 N668DN  DL      Delta Air Lines Inc.    
##  6  2013     1     1     5 N39463  UA      United Air Lines Inc.   
##  7  2013     1     1     6 N516JB  B6      JetBlue Airways         
##  8  2013     1     1     6 N829AS  EV      ExpressJet Airlines Inc.
##  9  2013     1     1     6 N593JB  B6      JetBlue Airways         
## 10  2013     1     1     6 N3ALAA  AA      American Airlines Inc.  
## # … with 336,766 more rows
# left join using base R commands and mutate()
flights2 %>%
  select(-origin, -dest) %>%
  mutate(name = airlines$name[match(carrier, airlines$carrier)])
## # A tibble: 336,776 x 7
##     year month   day  hour tailnum carrier name                    
##    <int> <int> <int> <dbl> <chr>   <chr>   <chr>                   
##  1  2013     1     1     5 N14228  UA      United Air Lines Inc.   
##  2  2013     1     1     5 N24211  UA      United Air Lines Inc.   
##  3  2013     1     1     5 N619AA  AA      American Airlines Inc.  
##  4  2013     1     1     5 N804JB  B6      JetBlue Airways         
##  5  2013     1     1     6 N668DN  DL      Delta Air Lines Inc.    
##  6  2013     1     1     5 N39463  UA      United Air Lines Inc.   
##  7  2013     1     1     6 N516JB  B6      JetBlue Airways         
##  8  2013     1     1     6 N829AS  EV      ExpressJet Airlines Inc.
##  9  2013     1     1     6 N593JB  B6      JetBlue Airways         
## 10  2013     1     1     6 N3ALAA  AA      American Airlines Inc.  
## # … with 336,766 more rows

There are two classes of joins: inner joins, in which only observations with common key values are provided in the joined output, or outer joins, in which all observations from one or more of the tables are kept, in the joined output, and the rest of the missing values are filled in with NA.

From the text, the types of outer joins are defined as:

  • “A left join keeps all observations in x.”
  • “A right join keeps all observations in y.”
  • “A full join keeps all observations in x and y.”

These joins can be performed using dplyr commands (left_join(), right_join(), full_join(), inner_join(), semi_join(), anti_join()) or using the base R command merge(). The “by” argument is important to modify depending on the type of join you want to perform and the keys you want to use between the two tables. The book provides the examples below on the different ways the “by” argument can be used.

# left join, if "by" argument is unspecified, will perform natural join
flights2 %>%
  left_join(weather)
## Joining, by = c("year", "month", "day", "hour", "origin")
## # A tibble: 336,776 x 18
##     year month   day  hour origin dest  tailnum carrier  temp  dewp humid
##    <dbl> <dbl> <int> <dbl> <chr>  <chr> <chr>   <chr>   <dbl> <dbl> <dbl>
##  1  2013     1     1     5 EWR    IAH   N14228  UA       39.0  28.0  64.4
##  2  2013     1     1     5 LGA    IAH   N24211  UA       39.9  25.0  54.8
##  3  2013     1     1     5 JFK    MIA   N619AA  AA       39.0  27.0  61.6
##  4  2013     1     1     5 JFK    BQN   N804JB  B6       39.0  27.0  61.6
##  5  2013     1     1     6 LGA    ATL   N668DN  DL       39.9  25.0  54.8
##  6  2013     1     1     5 EWR    ORD   N39463  UA       39.0  28.0  64.4
##  7  2013     1     1     6 EWR    FLL   N516JB  B6       37.9  28.0  67.2
##  8  2013     1     1     6 LGA    IAD   N829AS  EV       39.9  25.0  54.8
##  9  2013     1     1     6 JFK    MCO   N593JB  B6       37.9  27.0  64.3
## 10  2013     1     1     6 LGA    ORD   N3ALAA  AA       39.9  25.0  54.8
## # … with 336,766 more rows, and 7 more variables: wind_dir <dbl>,
## #   wind_speed <dbl>, wind_gust <dbl>, precip <dbl>, pressure <dbl>,
## #   visib <dbl>, time_hour <dttm>
# left join specifying which column to use as key
flights2 %>%
  left_join(planes, by = "tailnum")
## # A tibble: 336,776 x 16
##    year.x month   day  hour origin dest  tailnum carrier year.y type 
##     <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>    <int> <chr>
##  1   2013     1     1     5 EWR    IAH   N14228  UA        1999 Fixe…
##  2   2013     1     1     5 LGA    IAH   N24211  UA        1998 Fixe…
##  3   2013     1     1     5 JFK    MIA   N619AA  AA        1990 Fixe…
##  4   2013     1     1     5 JFK    BQN   N804JB  B6        2012 Fixe…
##  5   2013     1     1     6 LGA    ATL   N668DN  DL        1991 Fixe…
##  6   2013     1     1     5 EWR    ORD   N39463  UA        2012 Fixe…
##  7   2013     1     1     6 EWR    FLL   N516JB  B6        2000 Fixe…
##  8   2013     1     1     6 LGA    IAD   N829AS  EV        1998 Fixe…
##  9   2013     1     1     6 JFK    MCO   N593JB  B6        2004 Fixe…
## 10   2013     1     1     6 LGA    ORD   N3ALAA  AA          NA <NA> 
## # … with 336,766 more rows, and 6 more variables: manufacturer <chr>,
## #   model <chr>, engines <int>, seats <int>, speed <int>, engine <chr>
# left join specifying column to use as key, if named differently in the two tables
flights2 %>%
  left_join(airports, c("dest" = "faa"))
## # A tibble: 336,776 x 15
##     year month   day  hour origin dest  tailnum carrier name    lat   lon
##    <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>   <chr> <dbl> <dbl>
##  1  2013     1     1     5 EWR    IAH   N14228  UA      Geor…  30.0 -95.3
##  2  2013     1     1     5 LGA    IAH   N24211  UA      Geor…  30.0 -95.3
##  3  2013     1     1     5 JFK    MIA   N619AA  AA      Miam…  25.8 -80.3
##  4  2013     1     1     5 JFK    BQN   N804JB  B6      <NA>   NA    NA  
##  5  2013     1     1     6 LGA    ATL   N668DN  DL      Hart…  33.6 -84.4
##  6  2013     1     1     5 EWR    ORD   N39463  UA      Chic…  42.0 -87.9
##  7  2013     1     1     6 EWR    FLL   N516JB  B6      Fort…  26.1 -80.2
##  8  2013     1     1     6 LGA    IAD   N829AS  EV      Wash…  38.9 -77.5
##  9  2013     1     1     6 JFK    MCO   N593JB  B6      Orla…  28.4 -81.3
## 10  2013     1     1     6 LGA    ORD   N3ALAA  AA      Chic…  42.0 -87.9
## # … with 336,766 more rows, and 4 more variables: alt <int>, tz <dbl>,
## #   dst <chr>, tzone <chr>

13.4.6 Exercises

1. Compute the average delay by destination, then join on the airports data frame so you can show the spatial distribution of delays. Here’s an easy way to draw a map of the United States:

library(nycflights13)
airports %>%
  semi_join(flights, c("faa" = "dest")) %>%
  ggplot(aes(lon, lat)) +
    borders("state") +
    geom_point() +
    coord_quickmap()
## 
## Attaching package: 'maps'
## The following object is masked from 'package:purrr':
## 
##     map

(Don’t worry if you don’t understand what semi_join() does — you’ll learn about it next.) You might want to use the size or colour of the points to display the average delay for each airport.

I first grouped by destination and computed the average delay per destination, and then joined the airports data to this metric using left_join(). Then, we can use this table to create a plot displaying the relative average delays at airports across the US by modifying the code provided by the book above.

# caluclate average delay by destination and join airports data
avg_delay_byDest <- flights %>%
  group_by(dest) %>%
  summarize ( avg_delay = mean(arr_delay, na.rm=T) ) %>%
  left_join(airports, by = c("dest" = "faa"))
avg_delay_byDest
## # A tibble: 105 x 9
##    dest  avg_delay name              lat    lon   alt    tz dst   tzone    
##    <chr>     <dbl> <chr>           <dbl>  <dbl> <int> <dbl> <chr> <chr>    
##  1 ABQ        4.38 Albuquerque In…  35.0 -107.   5355    -7 A     America/…
##  2 ACK        4.85 Nantucket Mem    41.3  -70.1    48    -5 A     America/…
##  3 ALB       14.4  Albany Intl      42.7  -73.8   285    -5 A     America/…
##  4 ANC       -2.5  Ted Stevens An…  61.2 -150.    152    -9 A     America/…
##  5 ATL       11.3  Hartsfield Jac…  33.6  -84.4  1026    -5 A     America/…
##  6 AUS        6.02 Austin Bergstr…  30.2  -97.7   542    -6 A     America/…
##  7 AVL        8.00 Asheville Regi…  35.4  -82.5  2165    -5 A     America/…
##  8 BDL        7.05 Bradley Intl     41.9  -72.7   173    -5 A     America/…
##  9 BGR        8.03 Bangor Intl      44.8  -68.8   192    -5 A     America/…
## 10 BHM       16.9  Birmingham Intl  33.6  -86.8   644    -6 A     America/…
## # … with 95 more rows
# plot the results on top of the map of the US
avg_delay_byDest %>%
  ggplot(aes(lon, lat)) +
    borders("state") +
    geom_point(aes(size = avg_delay, alpha = 1/5, color = avg_delay)) +
    coord_quickmap()
## Warning: Removed 5 rows containing missing values (geom_point).

2. Add the location of the origin and destination (i.e. the lat and lon) to flights.

To do this, I first make a smaller table from airports containing just the variables that we need: faa, lat, and lon. Then, I join this table to flights using origin to add the lat + lon values for origin. I join again using dest this time, to add the lat + lon values for dest. In order to label the lat and lon values according to origin or dest airport, I specify suffix = c(".origin", ".dest").

lat_lon_airports <- airports %>%
  select(faa, lat, lon)
with_origin <- left_join(flights, lat_lon_airports, by = c("origin" = "faa"))
with_origin
## # A tibble: 336,776 x 21
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      517            515         2      830
##  2  2013     1     1      533            529         4      850
##  3  2013     1     1      542            540         2      923
##  4  2013     1     1      544            545        -1     1004
##  5  2013     1     1      554            600        -6      812
##  6  2013     1     1      554            558        -4      740
##  7  2013     1     1      555            600        -5      913
##  8  2013     1     1      557            600        -3      709
##  9  2013     1     1      557            600        -3      838
## 10  2013     1     1      558            600        -2      753
## # … with 336,766 more rows, and 14 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>, lat <dbl>, lon <dbl>
with_origin_and_dest <- left_join(with_origin, lat_lon_airports, by = c("dest" = "faa"), suffix = c(".origin", ".dest"))
with_origin_and_dest
## # A tibble: 336,776 x 23
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      517            515         2      830
##  2  2013     1     1      533            529         4      850
##  3  2013     1     1      542            540         2      923
##  4  2013     1     1      544            545        -1     1004
##  5  2013     1     1      554            600        -6      812
##  6  2013     1     1      554            558        -4      740
##  7  2013     1     1      555            600        -5      913
##  8  2013     1     1      557            600        -3      709
##  9  2013     1     1      557            600        -3      838
## 10  2013     1     1      558            600        -2      753
## # … with 336,766 more rows, and 16 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>, lat.origin <dbl>, lon.origin <dbl>,
## #   lat.dest <dbl>, lon.dest <dbl>

3. Is there a relationship between the age of a plane and its delays?

First group by tailnum, then calculate avg delays for arr_delay and dep_delay, then join the info from planes. Then we can visualize any relationships between the year variable (age) of the plane and the delays associated with it.

# first group by tailnum, calculate avg delays, then join with planes data
avg_delay_plane <- flights %>%
  group_by(tailnum) %>%
  summarize(
    avg_delay_dep = mean (dep_delay, na.rm = T),
    avg_delay_arr = mean (arr_delay, na.rm = T)
  ) %>%
  left_join(planes, by = "tailnum")
avg_delay_plane
## # A tibble: 4,044 x 11
##    tailnum avg_delay_dep avg_delay_arr  year type  manufacturer model
##    <chr>           <dbl>         <dbl> <int> <chr> <chr>        <chr>
##  1 D942DN         31.5          31.5      NA <NA>  <NA>         <NA> 
##  2 N0EGMQ          8.49          9.98     NA <NA>  <NA>         <NA> 
##  3 N10156         17.8          12.7    2004 Fixe… EMBRAER      EMB-…
##  4 N102UW          8             2.94   1998 Fixe… AIRBUS INDU… A320…
##  5 N103US         -3.20         -6.93   1999 Fixe… AIRBUS INDU… A320…
##  6 N104UW          9.94          1.80   1999 Fixe… AIRBUS INDU… A320…
##  7 N10575         22.7          20.7    2002 Fixe… EMBRAER      EMB-…
##  8 N105UW          2.58         -0.267  1999 Fixe… AIRBUS INDU… A320…
##  9 N107US         -0.463        -5.73   1999 Fixe… AIRBUS INDU… A320…
## 10 N108UW          4.22         -1.25   1999 Fixe… AIRBUS INDU… A320…
## # … with 4,034 more rows, and 4 more variables: engines <int>,
## #   seats <int>, speed <int>, engine <chr>
# vizualize dep_delay vs year of plane
avg_delay_plane %>%
  ggplot( aes ( year, avg_delay_dep )) +
    geom_point() +
    geom_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## Warning: Removed 798 rows containing non-finite values (stat_smooth).
## Warning: Removed 798 rows containing missing values (geom_point).

# vizualize arr_delay vs year of plane
avg_delay_plane %>%
  ggplot( aes ( year, avg_delay_arr )) +
    geom_point() +
    geom_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## Warning: Removed 798 rows containing non-finite values (stat_smooth).

## Warning: Removed 798 rows containing missing values (geom_point).

4. What weather conditions make it more likely to see a delay?

Join the weather table with flights, and then we can visualize the relationship with delay and weather conditions. One way to do so is with a correlogram, which shows the correlation between each of the variables in a matrix. After joining weather with flights, we can remove the time variables (year, month, day, hour, etc) and use as.matrix() to turn the tibble into a matrix. Then, we can use the ggcorrplot package to plot a nice looking correllogram. From this plot, we can observe that variable in weather with the highest positive correlation with dep_delay is precip, at roughly 0.05. While this is a very small number (between 0-1), over a large number of observations it is still meaningful. This also makes sense since high precipitation values often make visibility an issue, which may lead to delayed flights.

It might make more sense to group dep_delay by day, and then look at the correlations in the manner I have done so below. I will revisit this in the 13.5.1 exercise 4, in which I find that lower visibility, higher humidity, and lower pressure on average correlate with higher cumulative delays.

# install.packages("ggcorrplot")
library(ggcorrplot)
# join weather table with flights, using natural join (no "by" argument specified)
weatherdelays <- flights %>%
  select(year, month, day, hour, time_hour, origin, dep_delay) %>%
  left_join(weather)%>%
  select(-year,-month,-day,-hour,-time_hour,-origin)
## Joining, by = c("year", "month", "day", "hour", "time_hour", "origin")
weatherdelays
## # A tibble: 336,776 x 10
##    dep_delay  temp  dewp humid wind_dir wind_speed wind_gust precip
##        <dbl> <dbl> <dbl> <dbl>    <dbl>      <dbl>     <dbl>  <dbl>
##  1         2  39.0  28.0  64.4      260       12.7      NA        0
##  2         4  39.9  25.0  54.8      250       15.0      21.9      0
##  3         2  39.0  27.0  61.6      260       15.0      NA        0
##  4        -1  39.0  27.0  61.6      260       15.0      NA        0
##  5        -6  39.9  25.0  54.8      260       16.1      23.0      0
##  6        -4  39.0  28.0  64.4      260       12.7      NA        0
##  7        -5  37.9  28.0  67.2      240       11.5      NA        0
##  8        -3  39.9  25.0  54.8      260       16.1      23.0      0
##  9        -3  37.9  27.0  64.3      260       13.8      NA        0
## 10        -2  39.9  25.0  54.8      260       16.1      23.0      0
## # … with 336,766 more rows, and 2 more variables: pressure <dbl>,
## #   visib <dbl>
rm_na_weatherdelays <- as.matrix (na.omit(weatherdelays))
head(rm_na_weatherdelays)
##      dep_delay  temp  dewp humid wind_dir wind_speed wind_gust precip
## [1,]         4 39.92 24.98 54.81      250   14.96014  21.86482      0
## [2,]        -6 39.92 24.98 54.81      260   16.11092  23.01560      0
## [3,]        -3 39.92 24.98 54.81      260   16.11092  23.01560      0
## [4,]        -2 39.92 24.98 54.81      260   16.11092  23.01560      0
## [5,]        -1 39.92 24.98 54.81      260   16.11092  23.01560      0
## [6,]         0 39.92 24.98 54.81      260   16.11092  23.01560      0
##      pressure visib
## [1,]   1011.4    10
## [2,]   1011.7    10
## [3,]   1011.7    10
## [4,]   1011.7    10
## [5,]   1011.7    10
## [6,]   1011.7    10
corr_weather <- cor(rm_na_weatherdelays)

ggcorrplot(corr_weather, hc.order = TRUE,
           type = "lower",
           lab = TRUE,
           lab_size = 3,
           method="circle",
           colors = c("steelblue", "white", "orangered3"),
           title="Correlogram",
           ggtheme=theme_bw)

weatherdelays %>%
  ggplot( aes (x = precip, y = dep_delay))+
    geom_point()+
    geom_smooth(se = F)
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## Warning: Removed 9783 rows containing non-finite values (stat_smooth).
## Warning: Removed 9783 rows containing missing values (geom_point).

weatherdelays %>%
  ggplot( aes (x = humid, y = dep_delay))+
    geom_point()+
    geom_smooth(se = F)
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## Warning: Removed 9800 rows containing non-finite values (stat_smooth).
## Warning: Removed 9800 rows containing missing values (geom_point).

5. What happened on June 13 2013? Display the spatial pattern of delays, and then use Google to cross-reference with the weather.

First filter flights for observations from June 13, 2013 using filter(). We can see that there were only flights comming out of EWR, JFK, and LGA on this day! Furthermore, all the flights were considerably delayed (high dep_delay value). I initially found this by overlaying the average dep_delay for the filtered data table onto the map of the US, and was surprised that there were so few origin airports on the map. Looking at the opposite, grouping by destination and looking at arr_delay values, we can see that there are many, many more airports in the destination variable. Most flights flew out of only a few airports (EWR, JFK, and LGA) but arrived at many different destinations, and these arrivals were delayed as well.

flights %>%
  filter (year == 2013, month == 6, day == 13) %>%
   group_by(origin) %>%
  summarize ( avg_delay = mean(dep_delay, na.rm=T) )
## # A tibble: 3 x 2
##   origin avg_delay
##   <chr>      <dbl>
## 1 EWR         45.6
## 2 JFK         43.7
## 3 LGA         48.4
avg_delay_byorigin <- flights %>%
  filter (year == 2013, month == 6, day == 13) %>%
   group_by(origin) %>%
  summarize ( avg_delay = mean(dep_delay, na.rm=T) ) %>%
  left_join(airports, by = c("origin" = "faa"))
avg_delay_byorigin
## # A tibble: 3 x 9
##   origin avg_delay name             lat   lon   alt    tz dst   tzone      
##   <chr>      <dbl> <chr>          <dbl> <dbl> <int> <dbl> <chr> <chr>      
## 1 EWR         45.6 Newark Libert…  40.7 -74.2    18    -5 A     America/Ne…
## 2 JFK         43.7 John F Kenned…  40.6 -73.8    13    -5 A     America/Ne…
## 3 LGA         48.4 La Guardia      40.8 -73.9    22    -5 A     America/Ne…
# plot the results on top of the map of the US
avg_delay_byorigin %>%
  ggplot(aes(lon, lat)) +
    borders("state") +
    geom_point(aes(size = avg_delay, alpha = 1/5, color = avg_delay)) +
    coord_quickmap()

avg_delay_byDest <- flights %>%
  filter (year == 2013, month == 6, day == 13) %>%
   group_by(dest) %>%
  summarize ( avg_delay = mean(arr_delay, na.rm=T) ) %>%
  left_join(airports, by = c("dest" = "faa"))
avg_delay_byDest
## # A tibble: 89 x 9
##    dest  avg_delay name               lat    lon   alt    tz dst   tzone   
##    <chr>     <dbl> <chr>            <dbl>  <dbl> <int> <dbl> <chr> <chr>   
##  1 ABQ        64   Albuquerque Int…  35.0 -107.   5355    -7 A     America…
##  2 ACK         8   Nantucket Mem     41.3  -70.1    48    -5 A     America…
##  3 ALB       -19   Albany Intl       42.7  -73.8   285    -5 A     America…
##  4 ATL       109.  Hartsfield Jack…  33.6  -84.4  1026    -5 A     America…
##  5 AUS        64.3 Austin Bergstro…  30.2  -97.7   542    -6 A     America…
##  6 AVL        90   Asheville Regio…  35.4  -82.5  2165    -5 A     America…
##  7 BDL        18.5 Bradley Intl      41.9  -72.7   173    -5 A     America…
##  8 BGR       102   Bangor Intl       44.8  -68.8   192    -5 A     America…
##  9 BHM       158   Birmingham Intl   33.6  -86.8   644    -6 A     America…
## 10 BNA        95.9 Nashville Intl    36.1  -86.7   599    -6 A     America…
## # … with 79 more rows
# plot the results on top of the map of the US
avg_delay_byDest %>%
  ggplot(aes(lon, lat)) +
    borders("state") +
    geom_point(aes(size = avg_delay, alpha = 1/5, color = avg_delay)) +
    coord_quickmap()
## Warning: Removed 7 rows containing missing values (geom_point).

13.5 Notes - Filtering Joins

There are two types of filtering joins that can be performed, listed by the book:

  • “semi_join(x, y) keeps all observations in x that have a match in y.”
  • “anti_join(x, y) drops all observations in x that have a match in y.”

The semi_join() function is useful when you want to filter a table for observations that match a certain criteria set by a second table. The provided example in the book is finding top destinations, then looking for the observations in the orginal data table that corresponded to the top destinations. You could use filter() but this might become difficult when analzying multiple variables.

# find top destinations
top_dest <- flights %>%
  count(dest, sort = TRUE) %>%
  head(10)
top_dest
## # A tibble: 10 x 2
##    dest      n
##    <chr> <int>
##  1 ORD   17283
##  2 ATL   17215
##  3 LAX   16174
##  4 BOS   15508
##  5 MCO   14082
##  6 CLT   14064
##  7 SFO   13331
##  8 FLL   12055
##  9 MIA   11728
## 10 DCA    9705
# use filter() to find observations in flights that have matched key in top destinations table
flights %>%
  filter(dest %in% top_dest$dest)
## # A tibble: 141,145 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      542            540         2      923
##  2  2013     1     1      554            600        -6      812
##  3  2013     1     1      554            558        -4      740
##  4  2013     1     1      555            600        -5      913
##  5  2013     1     1      557            600        -3      838
##  6  2013     1     1      558            600        -2      753
##  7  2013     1     1      558            600        -2      924
##  8  2013     1     1      558            600        -2      923
##  9  2013     1     1      559            559         0      702
## 10  2013     1     1      600            600         0      851
## # … with 141,135 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>
# use semi_join() to find obseravtions in flights that have matched key in top destinations table
flights %>%
  semi_join(top_dest)
## Joining, by = "dest"
## # A tibble: 141,145 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      542            540         2      923
##  2  2013     1     1      554            600        -6      812
##  3  2013     1     1      554            558        -4      740
##  4  2013     1     1      555            600        -5      913
##  5  2013     1     1      557            600        -3      838
##  6  2013     1     1      558            600        -2      753
##  7  2013     1     1      558            600        -2      924
##  8  2013     1     1      558            600        -2      923
##  9  2013     1     1      559            559         0      702
## 10  2013     1     1      600            600         0      851
## # … with 141,135 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

The opposite of the semi_join() function is the anti_join() function, which will return mismatches between the two tables (which values in x do not have a matched key in y?). The book uses anti_join() to find which flights do not have a match in the planes data table.

# anti join to find which observations in flights do not have a match in the planes table
flights %>%
  anti_join(planes, by = "tailnum") %>%
  count(tailnum, sort = TRUE)
## # A tibble: 722 x 2
##    tailnum     n
##    <chr>   <int>
##  1 <NA>     2512
##  2 N725MQ    575
##  3 N722MQ    513
##  4 N723MQ    507
##  5 N713MQ    483
##  6 N735MQ    396
##  7 N0EGMQ    371
##  8 N534MQ    364
##  9 N542MQ    363
## 10 N531MQ    349
## # … with 712 more rows

13.5.1 Exercises

1. What does it mean for a flight to have a missing tailnum? What do the tail numbers that don’t have a matching record in planes have in common? (Hint: one variable explains ~90% of the problems.)

The flights that have a missing tailnum can be found by filtering the flights table using filter(is.na(tailnum)). We can observe that these entries also have missing values for dep_time, suggesting that these flights were cancelled. If we remove all flights that have a missing dep_time, we observe that the flights with missing tailnum are also removed as well.

To find the tail numbers that don’t have a matching record in planes, we can use the anti_join() function. Then, we can examine the columns of the resulting table to see if any particular variables are enriched. We observe that some carriers are particularly enriched in the flights that have tailnums that are not recorded in the planes table, including MQ and AA.

# find flights that have a missing tailnum
flights %>%
  filter(is.na(tailnum))
## # A tibble: 2,512 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     2       NA           1545        NA       NA
##  2  2013     1     2       NA           1601        NA       NA
##  3  2013     1     3       NA            857        NA       NA
##  4  2013     1     3       NA            645        NA       NA
##  5  2013     1     4       NA            845        NA       NA
##  6  2013     1     4       NA           1830        NA       NA
##  7  2013     1     5       NA            840        NA       NA
##  8  2013     1     7       NA            820        NA       NA
##  9  2013     1     8       NA           1645        NA       NA
## 10  2013     1     9       NA            755        NA       NA
## # … with 2,502 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>
# remove flights with missing dep_time and look at whether missing tailnums still exist
not_cancelled <- flights %>%
  filter(!is.na(dep_time))
not_cancelled %>%
    filter(is.na(tailnum))
## # A tibble: 0 x 19
## # … with 19 variables: year <int>, month <int>, day <int>, dep_time <int>,
## #   sched_dep_time <int>, dep_delay <dbl>, arr_time <int>,
## #   sched_arr_time <int>, arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
# anti join to find which observations in flights do not have a match in the planes table
flights %>%
  anti_join(planes, by = "tailnum") %>%
  count(tailnum, sort = T)
## # A tibble: 722 x 2
##    tailnum     n
##    <chr>   <int>
##  1 <NA>     2512
##  2 N725MQ    575
##  3 N722MQ    513
##  4 N723MQ    507
##  5 N713MQ    483
##  6 N735MQ    396
##  7 N0EGMQ    371
##  8 N534MQ    364
##  9 N542MQ    363
## 10 N531MQ    349
## # … with 712 more rows
# observe that some carriers are enriched
flights %>%
  anti_join(planes, by = "tailnum") %>%
  count(carrier, sort = T) %>%
  ggplot (aes (x = carrier, y = n))+
  geom_bar( stat = "identity", aes(fill = carrier))

2. Filter flights to only show flights with planes that have flown at least 100 flights.

We can do this by using the semi_join() function. First, we have to identify which tailnums (planes) have flown at least 100 flights, using count() and filter(). We can store this output in a separate table which I call flights_100, then use semi_join() to keep only the observations in flights that also have an entry in flights_100.

# make a separate table containing tailnums that have over 100 flights
flights_100 <- flights %>%
  count(tailnum) %>%
  filter(n>100)

# use semi join to keep only tailnums in flights that exist in our flights_100 table
flights %>%
  semi_join (flights_100)
## Joining, by = "tailnum"
## # A tibble: 229,202 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      517            515         2      830
##  2  2013     1     1      533            529         4      850
##  3  2013     1     1      544            545        -1     1004
##  4  2013     1     1      554            558        -4      740
##  5  2013     1     1      555            600        -5      913
##  6  2013     1     1      557            600        -3      709
##  7  2013     1     1      557            600        -3      838
##  8  2013     1     1      558            600        -2      849
##  9  2013     1     1      558            600        -2      853
## 10  2013     1     1      558            600        -2      923
## # … with 229,192 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

3. Combine fueleconomy::vehicles and fueleconomy::common to find only the records for the most common models.

We can use the semi_join() function to combine vehicles and common, which will keep only the records in vehicles that also have a match in the common table.

library(fueleconomy)
semi_join(vehicles,common)
## Joining, by = c("make", "model")
## # A tibble: 14,531 x 12
##       id make  model   year class trans drive   cyl displ fuel    hwy   cty
##    <int> <chr> <chr>  <int> <chr> <chr> <chr> <int> <dbl> <chr> <int> <int>
##  1  1833 Acura Integ…  1986 Subc… Auto… Fron…     4   1.6 Regu…    28    22
##  2  1834 Acura Integ…  1986 Subc… Manu… Fron…     4   1.6 Regu…    28    23
##  3  3037 Acura Integ…  1987 Subc… Auto… Fron…     4   1.6 Regu…    28    22
##  4  3038 Acura Integ…  1987 Subc… Manu… Fron…     4   1.6 Regu…    28    23
##  5  4183 Acura Integ…  1988 Subc… Auto… Fron…     4   1.6 Regu…    27    22
##  6  4184 Acura Integ…  1988 Subc… Manu… Fron…     4   1.6 Regu…    28    23
##  7  5303 Acura Integ…  1989 Subc… Auto… Fron…     4   1.6 Regu…    27    22
##  8  5304 Acura Integ…  1989 Subc… Manu… Fron…     4   1.6 Regu…    28    23
##  9  6442 Acura Integ…  1990 Subc… Auto… Fron…     4   1.8 Regu…    24    20
## 10  6443 Acura Integ…  1990 Subc… Manu… Fron…     4   1.8 Regu…    26    21
## # … with 14,521 more rows

4. Find the 48 hours (over the course of the whole year) that have the worst delays. Cross-reference it with the weather data. Can you see any patterns?

To find the 48 hours that have the worst delays, we can use group_by() and summarize() to find the total delay per day, then combine the total delay per day with lag() to get the total delay per 48 hours. There will be a missing value for the first day of the year, since the method that I use will sum the total delay for the current day with the previous day. Sorting this table will identify the days with the worst dep_delay over 48 hours, which turns out to be March 7-8th, 2013, with a total delay of 84713 min. We can left_join() the weather conditions to this table, and then analyze the weather patterns for this day in particular. We can see that on average, higher 48hr delays are associated with higher humidity values, higher temp/dewpoint, lower pressure, and lower visibility, based on both the correlogram and scatterplots.

# identify the 48 hrs with highest combined dep_delay, then join weather conditions
most_delayed_48 <- flights %>%
  group_by(year, month, day) %>%
  summarize ( delay_per_day = sum( dep_delay, na.rm=T))%>%
  mutate ( delay_48hr = delay_per_day + lag(delay_per_day)) %>%
  arrange(desc(delay_48hr)) %>%
  left_join(weather, by = c("year", "month", "day"))
most_delayed_48
## # A tibble: 26,116 x 17
## # Groups:   year, month [?]
##     year month   day delay_per_day delay_48hr origin  hour  temp  dewp
##    <dbl> <dbl> <int>         <dbl>      <dbl> <chr>  <int> <dbl> <dbl>
##  1  2013     3     8         66746      84713 EWR        0  33.8  32  
##  2  2013     3     8         66746      84713 EWR        1  33.1  32  
##  3  2013     3     8         66746      84713 EWR        2  33.1  30.9
##  4  2013     3     8         66746      84713 EWR        3  33.1  30.9
##  5  2013     3     8         66746      84713 EWR        4  33.8  30.9
##  6  2013     3     8         66746      84713 EWR        5  32    30.9
##  7  2013     3     8         66746      84713 EWR        6  32    30.0
##  8  2013     3     8         66746      84713 EWR        7  32    30.0
##  9  2013     3     8         66746      84713 EWR        8  32    30.2
## 10  2013     3     8         66746      84713 EWR        9  32    30.2
## # … with 26,106 more rows, and 8 more variables: humid <dbl>,
## #   wind_dir <dbl>, wind_speed <dbl>, wind_gust <dbl>, precip <dbl>,
## #   pressure <dbl>, visib <dbl>, time_hour <dttm>
# visualize correlations of variables with delay_48hr
most_delayed_48 %>%
  ggplot( aes (x = delay_48hr, y = humid))+
  geom_point(aes (size = precip))+
  geom_smooth(method = "lm", se = F)
## Warning: Removed 857 rows containing non-finite values (stat_smooth).
## Warning: Removed 857 rows containing missing values (geom_point).

most_delayed_48 %>%
  ggplot( aes (x = delay_48hr, y = visib))+
  geom_point()+
  geom_smooth(method = "lm", se = F)
## Warning: Removed 856 rows containing non-finite values (stat_smooth).
## Warning: Removed 856 rows containing missing values (geom_point).

most_delayed_48 %>%
  ggplot( aes (x = delay_48hr, y = pressure))+
  geom_point()+
  geom_smooth(method = "lm", se = F)
## Warning: Removed 3477 rows containing non-finite values (stat_smooth).
## Warning: Removed 3477 rows containing missing values (geom_point).

# plot a correlogram to observe how delay_48hr is associated with other weather variables
library(ggcorrplot)
weatherdelays <- most_delayed_48 %>%
  select(-day,-hour,-delay_per_day,-origin, -time_hour)
rm_na_weatherdelays <- as.matrix (na.omit(weatherdelays)[,-c(1:2)])
corr_weather <- cor(rm_na_weatherdelays)

ggcorrplot(corr_weather, hc.order = TRUE,
           type = "lower",
           lab = TRUE,
           lab_size = 3,
           method="circle",
           colors = c("steelblue", "white", "orangered3"),
           title="Correlogram",
           ggtheme=theme_bw)

5. What does anti_join(flights, airports, by = c(“dest” = “faa”)) tell you? What does anti_join(airports, flights, by = c(“faa” = “dest”)) tell you?

anti_join(flights, airports, by = c("dest" = "faa")) tells you which observations in flights do not have a destination that is listed in the ‘faa’ variable of the airports table.

anti_join(airports, flights, by = c("faa" = "dest")) tells you which observations in airports do not have a ‘faa’ value that is present in the ‘dest’ variable of the flights table.

anti_join(flights, airports, by = c("dest" = "faa"))
## # A tibble: 7,602 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      544            545        -1     1004
##  2  2013     1     1      615            615         0     1039
##  3  2013     1     1      628            630        -2     1137
##  4  2013     1     1      701            700         1     1123
##  5  2013     1     1      711            715        -4     1151
##  6  2013     1     1      820            820         0     1254
##  7  2013     1     1      820            820         0     1249
##  8  2013     1     1      840            845        -5     1311
##  9  2013     1     1      909            810        59     1331
## 10  2013     1     1      913            918        -5     1346
## # … with 7,592 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>
anti_join(airports, flights, by = c("faa" = "dest"))
## # A tibble: 1,357 x 8
##    faa   name                    lat    lon   alt    tz dst   tzone        
##    <chr> <chr>                 <dbl>  <dbl> <int> <dbl> <chr> <chr>        
##  1 04G   Lansdowne Airport      41.1  -80.6  1044    -5 A     America/New_…
##  2 06A   Moton Field Municipa…  32.5  -85.7   264    -6 A     America/Chic…
##  3 06C   Schaumburg Regional    42.0  -88.1   801    -6 A     America/Chic…
##  4 06N   Randall Airport        41.4  -74.4   523    -5 A     America/New_…
##  5 09J   Jekyll Island Airport  31.1  -81.4    11    -5 A     America/New_…
##  6 0A9   Elizabethton Municip…  36.4  -82.2  1593    -5 A     America/New_…
##  7 0G6   Williams County Airp…  41.5  -84.5   730    -5 A     America/New_…
##  8 0G7   Finger Lakes Regiona…  42.9  -76.8   492    -5 A     America/New_…
##  9 0P2   Shoestring Aviation …  39.8  -76.6  1000    -5 U     America/New_…
## 10 0S9   Jefferson County Intl  48.1 -123.    108    -8 A     America/Los_…
## # … with 1,347 more rows

6. You might expect that there’s an implicit relationship between plane and airline, because each plane is flown by a single airline. Confirm or reject this hypothesis using the tools you’ve learned above.

To explore whether each plane exclusively flies with a single airline, we can group the flights table by tailnum, then by carrier. Counting this grouped table will yield the number of flights each plane made with its carrier or carriers. We can group this table again by tailnum, then count how many carriers it flew with. If the hypothesis is true, then we would expect that there would be no entries with more than one carrier per plane. However, when we filter this table for planes with carriers > 1, we observe that 17 planes have flown with 2 carriers. Out of the roughly 4000 planes, this is only a very small fraction. This evidence supports our hypothesis that, in general, there is an association between plane and airline. Most of the time, planes only fly with a single carrier. However, a small amount of planes have flown with multiple airlines. This may have occured if an airline sold their planes to another airline.

# find the number of carriers that each plane has flown with
flights %>%
  group_by (tailnum, carrier) %>%
  count() %>%
  group_by(tailnum)%>%
  count() %>%
  arrange(desc(nn)) %>%
  filter(nn>1)
## # A tibble: 18 x 2
## # Groups:   tailnum [18]
##    tailnum    nn
##    <chr>   <int>
##  1 <NA>        7
##  2 N146PQ      2
##  3 N153PQ      2
##  4 N176PQ      2
##  5 N181PQ      2
##  6 N197PQ      2
##  7 N200PQ      2
##  8 N228PQ      2
##  9 N232PQ      2
## 10 N933AT      2
## 11 N935AT      2
## 12 N977AT      2
## 13 N978AT      2
## 14 N979AT      2
## 15 N981AT      2
## 16 N989AT      2
## 17 N990AT      2
## 18 N994AT      2

13.6 & 13.7 Notes - Join problems and set operations

The book recommends the following process for setting up your data to perform join operations.

  1. First, identify the variables that form the primary key in each table, usually thinking about the types of variables that would make sense to be keys (id, or a date/time, etc.).
  2. See if there are any missing values in the variable(s) for your primary key. Since values cannot be joined to missing values, this might result in issues.
  3. Make sure that all the values of the keys match between the tables you are joining. If a value for a key in one of the tables was missing due to a data entry error or other reason, the value will be lost in the subsequent join. Values that are not consistent between keys can be assessed using anti_join().

The last segment of the chapter focuses on set operations. Below are the functions listed by the book for performing set operations. They treat the observations (rows) of two tables that have the same variables as sets.

  • intersect(x, y): return only observations in both x and y.
  • union(x, y): return unique observations in x and y.
  • setdiff(x, y): return observations in x, but not in y.

Here are the examples provided by the book for the set operations described above:

df1 <- tribble(
  ~x, ~y,
   1,  1,
   2,  1
)
df2 <- tribble(
  ~x, ~y,
   1,  1,
   1,  2
)
intersect(df1, df2)
## # A tibble: 1 x 2
##       x     y
##   <dbl> <dbl>
## 1     1     1
union(df1, df2)
## # A tibble: 3 x 2
##       x     y
##   <dbl> <dbl>
## 1     1     2
## 2     2     1
## 3     1     1
setdiff(df1, df2)
## # A tibble: 1 x 2
##       x     y
##   <dbl> <dbl>
## 1     2     1
setdiff(df2, df1)
## # A tibble: 1 x 2
##       x     y
##   <dbl> <dbl>
## 1     1     2