Chapter 25 - Many models

First, we’ll load and prepare the data used for the exercises in this chapter.

library(gapminder)
by_country <- gapminder %>% 
  group_by(country, continent) %>% 
  nest()

by_country
## # A tibble: 142 x 3
##    country     continent data             
##    <fct>       <fct>     <list>           
##  1 Afghanistan Asia      <tibble [12 × 4]>
##  2 Albania     Europe    <tibble [12 × 4]>
##  3 Algeria     Africa    <tibble [12 × 4]>
##  4 Angola      Africa    <tibble [12 × 4]>
##  5 Argentina   Americas  <tibble [12 × 4]>
##  6 Australia   Oceania   <tibble [12 × 4]>
##  7 Austria     Europe    <tibble [12 × 4]>
##  8 Bahrain     Asia      <tibble [12 × 4]>
##  9 Bangladesh  Asia      <tibble [12 × 4]>
## 10 Belgium     Europe    <tibble [12 × 4]>
## # … with 132 more rows
## wrapper function to model lifeExp by year
country_model <- function(df) {
  lm(lifeExp ~ year, data = df)
}

## add the model as a list-column to the nested data
by_country <- by_country %>% 
  mutate(model = map(data, country_model))
## Error in mutate_impl(.data, dots): Evaluation error: cannot coerce type 'closure' to vector of type 'character'.
by_country
## # A tibble: 142 x 3
##    country     continent data             
##    <fct>       <fct>     <list>           
##  1 Afghanistan Asia      <tibble [12 × 4]>
##  2 Albania     Europe    <tibble [12 × 4]>
##  3 Algeria     Africa    <tibble [12 × 4]>
##  4 Angola      Africa    <tibble [12 × 4]>
##  5 Argentina   Americas  <tibble [12 × 4]>
##  6 Australia   Oceania   <tibble [12 × 4]>
##  7 Austria     Europe    <tibble [12 × 4]>
##  8 Bahrain     Asia      <tibble [12 × 4]>
##  9 Bangladesh  Asia      <tibble [12 × 4]>
## 10 Belgium     Europe    <tibble [12 × 4]>
## # … with 132 more rows
## calculate residuals based on the nested data and the associated model
by_country <- by_country %>% 
  mutate(
    resids = map2(data, model, add_residuals)
  )
## Error in mutate_impl(.data, dots): Evaluation error: object 'model' not found.
by_country
## # A tibble: 142 x 3
##    country     continent data             
##    <fct>       <fct>     <list>           
##  1 Afghanistan Asia      <tibble [12 × 4]>
##  2 Albania     Europe    <tibble [12 × 4]>
##  3 Algeria     Africa    <tibble [12 × 4]>
##  4 Angola      Africa    <tibble [12 × 4]>
##  5 Argentina   Americas  <tibble [12 × 4]>
##  6 Australia   Oceania   <tibble [12 × 4]>
##  7 Austria     Europe    <tibble [12 × 4]>
##  8 Bahrain     Asia      <tibble [12 × 4]>
##  9 Bangladesh  Asia      <tibble [12 × 4]>
## 10 Belgium     Europe    <tibble [12 × 4]>
## # … with 132 more rows
resids <- unnest(by_country, resids)
## Error in mutate_impl(.data, dots): Binding not found: resids.
resids %>% group_by(country) %>% summarise (rsme = sqrt(mean(resid^2)))
## Error in eval(lhs, parent, parent): object 'resids' not found

25.2.5 Exercises

1. A linear trend seems to be slightly too simple for the overall trend. Can you do better with a quadratic polynomial? How can you interpret the coefficients of the quadratic? (Hint you might want to transform year so that it has mean zero.)

We can fit a quadratic polynomial either by using the equation y ~ poly(x,2) or writing it out as y ~ I(x^2) + x. When we fit this model to the data, we see that the majority of the rmse values calculated using the residuals by country are reduced compared to the original linear model used in this chapter. This suggests that we are doing better.

country_model_poly <- function(df) {
  lm(lifeExp ~ poly(year,2), data = df)
  # alternatively can use lm(lifeExp ~ I(year^2)+year, data = df) for raw polynomial
}

country_model_poly_centered <- function(df) {
  lm(lifeExp ~ poly(year-mean(year),2), data = df)
}

by_country <- by_country %>% 
  mutate(model2 = map(data, country_model_poly),
         model3 = map(data, country_model_poly_centered)) %>%
  mutate(resids2 = map2(data, model2, add_residuals))
## Error in mutate_impl(.data, dots): Evaluation error: cannot coerce type 'closure' to vector of type 'character'.
## residual freq-poly plot
resids2 <- unnest(by_country, resids2)
## Error in mutate_impl(.data, dots): Binding not found: resids2.
resids2 %>% 
  ggplot(aes(year, resid)) +
    geom_line(aes(group = country), alpha = 1 / 3) + 
    geom_smooth(se = FALSE)
## Error in eval(lhs, parent, parent): object 'resids2' not found
## R-squared value
glance <- by_country %>% 
  mutate(glance = map(model2, broom::glance)) %>% 
  unnest(glance, .drop = TRUE) 
## Error in mutate_impl(.data, dots): Evaluation error: object 'model2' not found.
glance %>% ggplot(aes(continent, r.squared)) + 
    geom_jitter(width = 0.5)
## Error in eval(lhs, parent, parent): object 'glance' not found
# rmse using the polynomial model
resids2 %>% group_by(country) %>% summarise (rmse = sqrt(mean(resid^2)))
## Error in eval(lhs, parent, parent): object 'resids2' not found
# rmse using the original linear model
resids %>% group_by(country) %>% summarise (rmse = sqrt(mean(resid^2)))
## Error in eval(lhs, parent, parent): object 'resids' not found

When we compare the model summaries for the centered (mean 0) model vs the non-centered model, the estimates for the coefficients are the same. Poly() creates orthogonal polynomials for the fit. The coefficients may be interpreted as the weight associated with a unit of change in year.

summary(by_country$model2[[1]])
## Warning: Unknown or uninitialised column: 'model2'.
## Length  Class   Mode 
##      0   NULL   NULL
summary(by_country$model3[[1]])
## Warning: Unknown or uninitialised column: 'model3'.
## Length  Class   Mode 
##      0   NULL   NULL

2. Explore other methods for visualising the distribution of R2 per continent. You might want to try the ggbeeswarm package, which provides similar methods for avoiding overlaps as jitter, but uses deterministic methods.

To use the ggbeeswarm package, simply replace geom_jitter() with geom_beeswarm(). The output is below. We can augment the graph by adding additional graphics such as a boxplot.

library(ggbeeswarm)
glance %>% 
  ggplot(aes(continent, r.squared)) + 
    geom_beeswarm()
## Error in eval(lhs, parent, parent): object 'glance' not found
glance %>% 
  ggplot(aes(continent, r.squared)) + 
    geom_boxplot( aes(color = continent))+
    geom_beeswarm()
## Error in eval(lhs, parent, parent): object 'glance' not found

3. To create the last plot (showing the data for the countries with the worst model fits), we needed two steps: we created a data frame with one row per country and then semi-joined it to the original dataset. It’s possible to avoid this join if we use unnest() instead of unnest(.drop = TRUE). How?

Instead of using unnest(glance,.drop = TRUE), using unnest(glance) will retain the other columns in the dataset in addition to unnesting the glance column. This means that the list-column “data” will still be associated with the glance output, which contains the year and lifeExp variables that are needed to plot the final graph. We can then perform the filtering on the r.squared value, unnest the data, and then plot the graph, as shown below, without needing to semi-join.

glance <- by_country %>% 
  mutate(glance = map(model, broom::glance)) %>%
  unnest(glance)
## Error in mutate_impl(.data, dots): Evaluation error: object 'model' not found.
glance
## Error in eval(expr, envir, enclos): object 'glance' not found
bad_fit <- filter(glance, r.squared < 0.25)
## Error in filter(glance, r.squared < 0.25): object 'glance' not found
bad_fit
## Error in eval(expr, envir, enclos): object 'bad_fit' not found
bad_fit %>% unnest (data) %>%
  ggplot(aes(year, lifeExp, colour = country)) +
    geom_line()
## Error in eval(lhs, parent, parent): object 'bad_fit' not found

25.4.5 Exercises

1. List all the functions that you can think of that take a atomic vector and return a list.

enframe() or as_tibble() will convert an atomic vector into a list. The map() function will return a list in which it applies a function that you specify on each element of the atomic vector (examples below). stringr functions return lists of strings, such as str_split(). You may also encounter package-specific functions which may create their own types of objects. For example, in the bioinformatics world, you may have DESeq2 objects (bulk RNA-sequencing) or Seurat objects (single cell RNA-sequencing), which are S4 objects which contain several data types, including lists. Other functions that return lists are those shown in this chapter, such as broom::glance(), although in this instance the input is a model and not an atomic vector. split() also returns a list when applied to a dataframe.

my_atomic_vector <- c("hello","world")

map(my_atomic_vector, length)
## Error: 'helloMapEnv' is not an exported object from 'namespace:maps'
str_split(my_atomic_vector, "l")
## [[1]]
## [1] "he" ""   "o" 
## 
## [[2]]
## [1] "wor" "d"
typeof(as_tibble(my_atomic_vector))
## Warning: Calling `as_tibble()` on a vector is discouraged, because the behavior is likely to change in the future. Use `enframe(name = NULL)` instead.
## This warning is displayed once per session.
## [1] "list"
typeof(enframe(my_atomic_vector))
## [1] "list"
summary(my_atomic_vector)
##    Length     Class      Mode 
##         2 character character
mtcars %>% 
  split(.$cyl)
## $`4`
##                 mpg cyl     disp  hp drat    wt  qsec vs     am gear carb
## Datsun 710     22.8   4 1.769807  93 3.85 2.320 18.61  1 manual    4    1
## Merc 240D      24.4   4 2.403988  62 3.69 3.190 20.00  1   auto    4    2
## Merc 230       22.8   4 2.307304  95 3.92 3.150 22.90  1   auto    4    2
## Fiat 128       32.4   4 1.289665  66 4.08 2.200 19.47  1 manual    4    1
## Honda Civic    30.4   4 1.240503  52 4.93 1.615 18.52  1 manual    4    2
## Toyota Corolla 33.9   4 1.165123  65 4.22 1.835 19.90  1 manual    4    1
## Toyota Corona  21.5   4 1.968091  97 3.70 2.465 20.01  1   auto    3    1
## Fiat X1-9      27.3   4 1.294581  66 4.08 1.935 18.90  1 manual    4    1
## Porsche 914-2  26.0   4 1.971368  91 4.43 2.140 16.70  0 manual    5    2
## Lotus Europa   30.4   4 1.558413 113 3.77 1.513 16.90  1 manual    5    2
## Volvo 142E     21.4   4 1.982839 109 4.11 2.780 18.60  1 manual    4    2
## 
## $`6`
##                 mpg cyl     disp  hp drat    wt  qsec vs     am gear carb
## Mazda RX4      21.0   6 2.621936 110 3.90 2.620 16.46  0 manual    4    4
## Mazda RX4 Wag  21.0   6 2.621936 110 3.90 2.875 17.02  0 manual    4    4
## Hornet 4 Drive 21.4   6 4.227872 110 3.08 3.215 19.44  1   auto    3    1
## Valiant        18.1   6 3.687098 105 2.76 3.460 20.22  1   auto    3    1
## Merc 280       19.2   6 2.746478 123 3.92 3.440 18.30  1   auto    4    4
## Merc 280C      17.8   6 2.746478 123 3.92 3.440 18.90  1   auto    4    4
## Ferrari Dino   19.7   6 2.376130 175 3.62 2.770 15.50  0 manual    5    6
## 
## $`8`
##                      mpg cyl     disp  hp drat    wt  qsec vs     am gear
## Hornet Sportabout   18.7   8 5.899356 175 3.15 3.440 17.02  0   auto    3
## Duster 360          14.3   8 5.899356 245 3.21 3.570 15.84  0   auto    3
## Merc 450SE          16.4   8 4.519562 180 3.07 4.070 17.40  0   auto    3
## Merc 450SL          17.3   8 4.519562 180 3.07 3.730 17.60  0   auto    3
## Merc 450SLC         15.2   8 4.519562 180 3.07 3.780 18.00  0   auto    3
## Cadillac Fleetwood  10.4   8 7.734711 205 2.93 5.250 17.98  0   auto    3
## Lincoln Continental 10.4   8 7.538066 215 3.00 5.424 17.82  0   auto    3
## Chrysler Imperial   14.7   8 7.210324 230 3.23 5.345 17.42  0   auto    3
## Dodge Challenger    15.5   8 5.211098 150 2.76 3.520 16.87  0   auto    3
## AMC Javelin         15.2   8 4.981678 150 3.15 3.435 17.30  0   auto    3
## Camaro Z28          13.3   8 5.735485 245 3.73 3.840 15.41  0   auto    3
## Pontiac Firebird    19.2   8 6.554840 175 3.08 3.845 17.05  0   auto    3
## Ford Pantera L      15.8   8 5.751872 264 4.22 3.170 14.50  0 manual    5
## Maserati Bora       15.0   8 4.932517 335 3.54 3.570 14.60  0 manual    5
##                     carb
## Hornet Sportabout      2
## Duster 360             4
## Merc 450SE             3
## Merc 450SL             3
## Merc 450SLC            3
## Cadillac Fleetwood     4
## Lincoln Continental    4
## Chrysler Imperial      4
## Dodge Challenger       2
## AMC Javelin            2
## Camaro Z28             4
## Pontiac Firebird       2
## Ford Pantera L         4
## Maserati Bora          8

2. Brainstorm useful summary functions that, like quantile(), return multiple values.

Summary functions in addition to quantile() include: summary(), range(), dim(), and coef() for linear models.

x <- 1:10
summary(x)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    3.25    5.50    5.50    7.75   10.00
range(x)
## [1]  1 10
dim(mtcars)
## [1] 32 11
coef(lm(mpg ~cyl, data = mtcars))
## (Intercept)         cyl 
##    37.88458    -2.87579

3. What’s missing in the following data frame? How does quantile() return that missing piece? Why isn’t that helpful here?

The data frame shows the output of unnesting the column q, which contains the quantile() values for each cylinder type in mtcars stored as a list-column. The important information that is missing is the label corresponding to each of the quantile values. For example, 21.4 (the first entry) corresponds to the 0% quantile. Without knowing how the function works, someone looking at the table would not know this important information. quantile() returns this information by naming the output vector with these labels. This is not helpful when creating list-columns, because the names are not stored automatically.

# what quantile should return
quantile(mtcars[which(mtcars$cyl==4),"mpg"])
##   0%  25%  50%  75% 100% 
## 21.4 22.8 26.0 30.4 33.9
names(quantile(mtcars[which(mtcars$cyl==4),"mpg"]))
## [1] "0%"   "25%"  "50%"  "75%"  "100%"
# quantile is missing the labels for each of the statistics (0%, 25%, ...)
mtcars %>% 
  group_by(cyl) %>% 
  summarise(q = list(quantile(mpg))) %>% 
  unnest()
## # A tibble: 15 x 2
##      cyl     q
##    <dbl> <dbl>
##  1     4  21.4
##  2     4  22.8
##  3     4  26  
##  4     4  30.4
##  5     4  33.9
##  6     6  17.8
##  7     6  18.6
##  8     6  19.7
##  9     6  21  
## 10     6  21.4
## 11     8  10.4
## 12     8  14.4
## 13     8  15.2
## 14     8  16.2
## 15     8  19.2
mtcars %>% 
  group_by(cyl) %>% 
  summarise(q = list(quantile(mpg))) %>% 
  unnest()
## # A tibble: 15 x 2
##      cyl     q
##    <dbl> <dbl>
##  1     4  21.4
##  2     4  22.8
##  3     4  26  
##  4     4  30.4
##  5     4  33.9
##  6     6  17.8
##  7     6  18.6
##  8     6  19.7
##  9     6  21  
## 10     6  21.4
## 11     8  10.4
## 12     8  14.4
## 13     8  15.2
## 14     8  16.2
## 15     8  19.2

4. What does this code do? Why might might it be useful?

This code will group the dataset mtcars by cylinder type (in this case, 4, 6, or 8), then aggregate the values for each of the columns into a list. Specifically, there are 11 rows in the dataset that have a cyl value of 4, 7 rows that have a cyl value of 6, and 14 rows with a cyl value of 8. This is why the entries now look like , , and , respectively. Retrieving the value of at the first row of column mpg, for example, will return the 11 values associated with cylinder 4.

mtcars2 <- mtcars %>% 
  group_by(cyl) %>% 
  summarise_each(funs(list))
## `summarise_each()` is deprecated.
## Use `summarise_all()`, `summarise_at()` or `summarise_if()` instead.
## To map `funs` over all variables, use `summarise_all()`
mtcars2
## # A tibble: 3 x 11
##     cyl mpg    disp   hp     drat   wt     qsec   vs     am     gear  carb 
##   <dbl> <list> <list> <list> <list> <list> <list> <list> <list> <lis> <lis>
## 1     4 <dbl … <dbl … <dbl … <dbl … <dbl … <dbl … <dbl … <fct … <dbl… <dbl…
## 2     6 <dbl … <dbl … <dbl … <dbl … <dbl … <dbl … <dbl … <fct … <dbl… <dbl…
## 3     8 <dbl … <dbl … <dbl … <dbl … <dbl … <dbl … <dbl … <fct … <dbl… <dbl…
mtcars2$mpg[1]
## [[1]]
##  [1] 22.8 24.4 22.8 32.4 30.4 33.9 21.5 27.3 26.0 30.4 21.4

25.5.3 Exercises

1. Why might the lengths() function be useful for creating atomic vector columns from list-columns?

The lengths() function will return the length of each element in a list. This differs from the length() function which will return the number of elements contained in a list. For an example of the differences, see the code chunk below. However, for whatever reason, when lengths() is used with mutate() on a list-column, it will return the number of elements in the list rather than returning the vector with the lengths for each element in the list. In other words, when used with mutate(), lengths() performs what we normally would have thought length() would do. Even more confusing is how when length() is used with mutate, it breaks down and now returns the number of rows in the data frame regardless of what is stored in each row.

lengths(list(a = 1:10, b = 2, z = 5))
##  a  b  z 
## 10  1  1
length(mtcars)
## [1] 11
lengths(mtcars)
##  mpg  cyl disp   hp drat   wt qsec   vs   am gear carb 
##   32   32   32   32   32   32   32   32   32   32   32
df <- tribble(
  ~x,
  list(a = 1:10, b = 2, z = 5),
  list(a = 2:5, c = 4),
  list(a = 1, b = 2)
)

# lengths() returns the number of elements in the list-column
df %>% mutate(
  length_of_a = lengths(x)
)
## # A tibble: 3 x 2
##   x          length_of_a
##   <list>           <int>
## 1 <list [3]>           3
## 2 <list [2]>           2
## 3 <list [2]>           2
# length() returns the number of rows regardless of the value in the list-column
df %>% mutate(
  length_of_a = length(x)
)
## # A tibble: 3 x 2
##   x          length_of_a
##   <list>           <int>
## 1 <list [3]>           3
## 2 <list [2]>           3
## 3 <list [2]>           3

2. List the most common types of vector found in a data frame. What makes lists different?

In a data frame, I usually encounter numerics, characters, and factors, in which there is just a single value at any row x column designation. Lists are different because they can contain any number of data types, and multiple values for each. You can even have a list of lists! This means that a very diverse data set can be stored within lists in a single column of your data frame. That is the beauty of list-columns that this chapter tries to highlight.