Chapter 25 - Many models
First, we’ll load and prepare the data used for the exercises in this chapter.
library(gapminder)
by_country <- gapminder %>%
group_by(country, continent) %>%
nest()
by_country
## # A tibble: 142 x 3
## country continent data
## <fct> <fct> <list>
## 1 Afghanistan Asia <tibble [12 × 4]>
## 2 Albania Europe <tibble [12 × 4]>
## 3 Algeria Africa <tibble [12 × 4]>
## 4 Angola Africa <tibble [12 × 4]>
## 5 Argentina Americas <tibble [12 × 4]>
## 6 Australia Oceania <tibble [12 × 4]>
## 7 Austria Europe <tibble [12 × 4]>
## 8 Bahrain Asia <tibble [12 × 4]>
## 9 Bangladesh Asia <tibble [12 × 4]>
## 10 Belgium Europe <tibble [12 × 4]>
## # … with 132 more rows
## wrapper function to model lifeExp by year
country_model <- function(df) {
lm(lifeExp ~ year, data = df)
}
## add the model as a list-column to the nested data
by_country <- by_country %>%
mutate(model = map(data, country_model))
## Error in mutate_impl(.data, dots): Evaluation error: cannot coerce type 'closure' to vector of type 'character'.
by_country
## # A tibble: 142 x 3
## country continent data
## <fct> <fct> <list>
## 1 Afghanistan Asia <tibble [12 × 4]>
## 2 Albania Europe <tibble [12 × 4]>
## 3 Algeria Africa <tibble [12 × 4]>
## 4 Angola Africa <tibble [12 × 4]>
## 5 Argentina Americas <tibble [12 × 4]>
## 6 Australia Oceania <tibble [12 × 4]>
## 7 Austria Europe <tibble [12 × 4]>
## 8 Bahrain Asia <tibble [12 × 4]>
## 9 Bangladesh Asia <tibble [12 × 4]>
## 10 Belgium Europe <tibble [12 × 4]>
## # … with 132 more rows
## calculate residuals based on the nested data and the associated model
by_country <- by_country %>%
mutate(
resids = map2(data, model, add_residuals)
)
## Error in mutate_impl(.data, dots): Evaluation error: object 'model' not found.
by_country
## # A tibble: 142 x 3
## country continent data
## <fct> <fct> <list>
## 1 Afghanistan Asia <tibble [12 × 4]>
## 2 Albania Europe <tibble [12 × 4]>
## 3 Algeria Africa <tibble [12 × 4]>
## 4 Angola Africa <tibble [12 × 4]>
## 5 Argentina Americas <tibble [12 × 4]>
## 6 Australia Oceania <tibble [12 × 4]>
## 7 Austria Europe <tibble [12 × 4]>
## 8 Bahrain Asia <tibble [12 × 4]>
## 9 Bangladesh Asia <tibble [12 × 4]>
## 10 Belgium Europe <tibble [12 × 4]>
## # … with 132 more rows
resids <- unnest(by_country, resids)
## Error in mutate_impl(.data, dots): Binding not found: resids.
resids %>% group_by(country) %>% summarise (rsme = sqrt(mean(resid^2)))
## Error in eval(lhs, parent, parent): object 'resids' not found
25.2.5 Exercises
1. A linear trend seems to be slightly too simple for the overall trend. Can you do better with a quadratic polynomial? How can you interpret the coefficients of the quadratic? (Hint you might want to transform year so that it has mean zero.)
We can fit a quadratic polynomial either by using the equation y ~ poly(x,2) or writing it out as y ~ I(x^2) + x. When we fit this model to the data, we see that the majority of the rmse values calculated using the residuals by country are reduced compared to the original linear model used in this chapter. This suggests that we are doing better.
country_model_poly <- function(df) {
lm(lifeExp ~ poly(year,2), data = df)
# alternatively can use lm(lifeExp ~ I(year^2)+year, data = df) for raw polynomial
}
country_model_poly_centered <- function(df) {
lm(lifeExp ~ poly(year-mean(year),2), data = df)
}
by_country <- by_country %>%
mutate(model2 = map(data, country_model_poly),
model3 = map(data, country_model_poly_centered)) %>%
mutate(resids2 = map2(data, model2, add_residuals))
## Error in mutate_impl(.data, dots): Evaluation error: cannot coerce type 'closure' to vector of type 'character'.
## residual freq-poly plot
resids2 <- unnest(by_country, resids2)
## Error in mutate_impl(.data, dots): Binding not found: resids2.
resids2 %>%
ggplot(aes(year, resid)) +
geom_line(aes(group = country), alpha = 1 / 3) +
geom_smooth(se = FALSE)
## Error in eval(lhs, parent, parent): object 'resids2' not found
## R-squared value
glance <- by_country %>%
mutate(glance = map(model2, broom::glance)) %>%
unnest(glance, .drop = TRUE)
## Error in mutate_impl(.data, dots): Evaluation error: object 'model2' not found.
glance %>% ggplot(aes(continent, r.squared)) +
geom_jitter(width = 0.5)
## Error in eval(lhs, parent, parent): object 'glance' not found
# rmse using the polynomial model
resids2 %>% group_by(country) %>% summarise (rmse = sqrt(mean(resid^2)))
## Error in eval(lhs, parent, parent): object 'resids2' not found
# rmse using the original linear model
resids %>% group_by(country) %>% summarise (rmse = sqrt(mean(resid^2)))
## Error in eval(lhs, parent, parent): object 'resids' not found
When we compare the model summaries for the centered (mean 0) model vs the non-centered model, the estimates for the coefficients are the same. Poly() creates orthogonal polynomials for the fit. The coefficients may be interpreted as the weight associated with a unit of change in year.
summary(by_country$model2[[1]])
## Warning: Unknown or uninitialised column: 'model2'.
## Length Class Mode
## 0 NULL NULL
summary(by_country$model3[[1]])
## Warning: Unknown or uninitialised column: 'model3'.
## Length Class Mode
## 0 NULL NULL
2. Explore other methods for visualising the distribution of R2 per continent. You might want to try the ggbeeswarm package, which provides similar methods for avoiding overlaps as jitter, but uses deterministic methods.
To use the ggbeeswarm package, simply replace geom_jitter() with geom_beeswarm(). The output is below. We can augment the graph by adding additional graphics such as a boxplot.
library(ggbeeswarm)
glance %>%
ggplot(aes(continent, r.squared)) +
geom_beeswarm()
## Error in eval(lhs, parent, parent): object 'glance' not found
glance %>%
ggplot(aes(continent, r.squared)) +
geom_boxplot( aes(color = continent))+
geom_beeswarm()
## Error in eval(lhs, parent, parent): object 'glance' not found
3. To create the last plot (showing the data for the countries with the worst model fits), we needed two steps: we created a data frame with one row per country and then semi-joined it to the original dataset. It’s possible to avoid this join if we use unnest() instead of unnest(.drop = TRUE). How?
Instead of using unnest(glance,.drop = TRUE), using unnest(glance) will retain the other columns in the dataset in addition to unnesting the glance column. This means that the list-column “data” will still be associated with the glance output, which contains the year
and lifeExp
variables that are needed to plot the final graph. We can then perform the filtering on the r.squared value, unnest the data, and then plot the graph, as shown below, without needing to semi-join.
glance <- by_country %>%
mutate(glance = map(model, broom::glance)) %>%
unnest(glance)
## Error in mutate_impl(.data, dots): Evaluation error: object 'model' not found.
glance
## Error in eval(expr, envir, enclos): object 'glance' not found
bad_fit <- filter(glance, r.squared < 0.25)
## Error in filter(glance, r.squared < 0.25): object 'glance' not found
bad_fit
## Error in eval(expr, envir, enclos): object 'bad_fit' not found
bad_fit %>% unnest (data) %>%
ggplot(aes(year, lifeExp, colour = country)) +
geom_line()
## Error in eval(lhs, parent, parent): object 'bad_fit' not found
25.4.5 Exercises
1. List all the functions that you can think of that take a atomic vector and return a list.
enframe() or as_tibble() will convert an atomic vector into a list. The map() function will return a list in which it applies a function that you specify on each element of the atomic vector (examples below). stringr functions return lists of strings, such as str_split(). You may also encounter package-specific functions which may create their own types of objects. For example, in the bioinformatics world, you may have DESeq2 objects (bulk RNA-sequencing) or Seurat objects (single cell RNA-sequencing), which are S4 objects which contain several data types, including lists. Other functions that return lists are those shown in this chapter, such as broom::glance(), although in this instance the input is a model and not an atomic vector. split() also returns a list when applied to a dataframe.
my_atomic_vector <- c("hello","world")
map(my_atomic_vector, length)
## Error: 'helloMapEnv' is not an exported object from 'namespace:maps'
str_split(my_atomic_vector, "l")
## [[1]]
## [1] "he" "" "o"
##
## [[2]]
## [1] "wor" "d"
typeof(as_tibble(my_atomic_vector))
## Warning: Calling `as_tibble()` on a vector is discouraged, because the behavior is likely to change in the future. Use `enframe(name = NULL)` instead.
## This warning is displayed once per session.
## [1] "list"
typeof(enframe(my_atomic_vector))
## [1] "list"
summary(my_atomic_vector)
## Length Class Mode
## 2 character character
mtcars %>%
split(.$cyl)
## $`4`
## mpg cyl disp hp drat wt qsec vs am gear carb
## Datsun 710 22.8 4 1.769807 93 3.85 2.320 18.61 1 manual 4 1
## Merc 240D 24.4 4 2.403988 62 3.69 3.190 20.00 1 auto 4 2
## Merc 230 22.8 4 2.307304 95 3.92 3.150 22.90 1 auto 4 2
## Fiat 128 32.4 4 1.289665 66 4.08 2.200 19.47 1 manual 4 1
## Honda Civic 30.4 4 1.240503 52 4.93 1.615 18.52 1 manual 4 2
## Toyota Corolla 33.9 4 1.165123 65 4.22 1.835 19.90 1 manual 4 1
## Toyota Corona 21.5 4 1.968091 97 3.70 2.465 20.01 1 auto 3 1
## Fiat X1-9 27.3 4 1.294581 66 4.08 1.935 18.90 1 manual 4 1
## Porsche 914-2 26.0 4 1.971368 91 4.43 2.140 16.70 0 manual 5 2
## Lotus Europa 30.4 4 1.558413 113 3.77 1.513 16.90 1 manual 5 2
## Volvo 142E 21.4 4 1.982839 109 4.11 2.780 18.60 1 manual 4 2
##
## $`6`
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 2.621936 110 3.90 2.620 16.46 0 manual 4 4
## Mazda RX4 Wag 21.0 6 2.621936 110 3.90 2.875 17.02 0 manual 4 4
## Hornet 4 Drive 21.4 6 4.227872 110 3.08 3.215 19.44 1 auto 3 1
## Valiant 18.1 6 3.687098 105 2.76 3.460 20.22 1 auto 3 1
## Merc 280 19.2 6 2.746478 123 3.92 3.440 18.30 1 auto 4 4
## Merc 280C 17.8 6 2.746478 123 3.92 3.440 18.90 1 auto 4 4
## Ferrari Dino 19.7 6 2.376130 175 3.62 2.770 15.50 0 manual 5 6
##
## $`8`
## mpg cyl disp hp drat wt qsec vs am gear
## Hornet Sportabout 18.7 8 5.899356 175 3.15 3.440 17.02 0 auto 3
## Duster 360 14.3 8 5.899356 245 3.21 3.570 15.84 0 auto 3
## Merc 450SE 16.4 8 4.519562 180 3.07 4.070 17.40 0 auto 3
## Merc 450SL 17.3 8 4.519562 180 3.07 3.730 17.60 0 auto 3
## Merc 450SLC 15.2 8 4.519562 180 3.07 3.780 18.00 0 auto 3
## Cadillac Fleetwood 10.4 8 7.734711 205 2.93 5.250 17.98 0 auto 3
## Lincoln Continental 10.4 8 7.538066 215 3.00 5.424 17.82 0 auto 3
## Chrysler Imperial 14.7 8 7.210324 230 3.23 5.345 17.42 0 auto 3
## Dodge Challenger 15.5 8 5.211098 150 2.76 3.520 16.87 0 auto 3
## AMC Javelin 15.2 8 4.981678 150 3.15 3.435 17.30 0 auto 3
## Camaro Z28 13.3 8 5.735485 245 3.73 3.840 15.41 0 auto 3
## Pontiac Firebird 19.2 8 6.554840 175 3.08 3.845 17.05 0 auto 3
## Ford Pantera L 15.8 8 5.751872 264 4.22 3.170 14.50 0 manual 5
## Maserati Bora 15.0 8 4.932517 335 3.54 3.570 14.60 0 manual 5
## carb
## Hornet Sportabout 2
## Duster 360 4
## Merc 450SE 3
## Merc 450SL 3
## Merc 450SLC 3
## Cadillac Fleetwood 4
## Lincoln Continental 4
## Chrysler Imperial 4
## Dodge Challenger 2
## AMC Javelin 2
## Camaro Z28 4
## Pontiac Firebird 2
## Ford Pantera L 4
## Maserati Bora 8
2. Brainstorm useful summary functions that, like quantile(), return multiple values.
Summary functions in addition to quantile() include: summary(), range(), dim(), and coef() for linear models.
x <- 1:10
summary(x)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 3.25 5.50 5.50 7.75 10.00
range(x)
## [1] 1 10
dim(mtcars)
## [1] 32 11
coef(lm(mpg ~cyl, data = mtcars))
## (Intercept) cyl
## 37.88458 -2.87579
3. What’s missing in the following data frame? How does quantile() return that missing piece? Why isn’t that helpful here?
The data frame shows the output of unnesting the column q
, which contains the quantile() values for each cylinder type in mtcars stored as a list-column. The important information that is missing is the label corresponding to each of the quantile values. For example, 21.4 (the first entry) corresponds to the 0% quantile. Without knowing how the function works, someone looking at the table would not know this important information. quantile() returns this information by naming the output vector with these labels. This is not helpful when creating list-columns, because the names are not stored automatically.
# what quantile should return
quantile(mtcars[which(mtcars$cyl==4),"mpg"])
## 0% 25% 50% 75% 100%
## 21.4 22.8 26.0 30.4 33.9
names(quantile(mtcars[which(mtcars$cyl==4),"mpg"]))
## [1] "0%" "25%" "50%" "75%" "100%"
# quantile is missing the labels for each of the statistics (0%, 25%, ...)
mtcars %>%
group_by(cyl) %>%
summarise(q = list(quantile(mpg))) %>%
unnest()
## # A tibble: 15 x 2
## cyl q
## <dbl> <dbl>
## 1 4 21.4
## 2 4 22.8
## 3 4 26
## 4 4 30.4
## 5 4 33.9
## 6 6 17.8
## 7 6 18.6
## 8 6 19.7
## 9 6 21
## 10 6 21.4
## 11 8 10.4
## 12 8 14.4
## 13 8 15.2
## 14 8 16.2
## 15 8 19.2
mtcars %>%
group_by(cyl) %>%
summarise(q = list(quantile(mpg))) %>%
unnest()
## # A tibble: 15 x 2
## cyl q
## <dbl> <dbl>
## 1 4 21.4
## 2 4 22.8
## 3 4 26
## 4 4 30.4
## 5 4 33.9
## 6 6 17.8
## 7 6 18.6
## 8 6 19.7
## 9 6 21
## 10 6 21.4
## 11 8 10.4
## 12 8 14.4
## 13 8 15.2
## 14 8 16.2
## 15 8 19.2
4. What does this code do? Why might might it be useful?
This code will group the dataset mtcars by cylinder type (in this case, 4, 6, or 8), then aggregate the values for each of the columns into a list. Specifically, there are 11 rows in the dataset that have a cyl value of 4, 7 rows that have a cyl value of 6, and 14 rows with a cyl value of 8. This is why the entries now look like mpg
, for example, will return the 11 values associated with cylinder 4.
mtcars2 <- mtcars %>%
group_by(cyl) %>%
summarise_each(funs(list))
## `summarise_each()` is deprecated.
## Use `summarise_all()`, `summarise_at()` or `summarise_if()` instead.
## To map `funs` over all variables, use `summarise_all()`
mtcars2
## # A tibble: 3 x 11
## cyl mpg disp hp drat wt qsec vs am gear carb
## <dbl> <list> <list> <list> <list> <list> <list> <list> <list> <lis> <lis>
## 1 4 <dbl … <dbl … <dbl … <dbl … <dbl … <dbl … <dbl … <fct … <dbl… <dbl…
## 2 6 <dbl … <dbl … <dbl … <dbl … <dbl … <dbl … <dbl … <fct … <dbl… <dbl…
## 3 8 <dbl … <dbl … <dbl … <dbl … <dbl … <dbl … <dbl … <fct … <dbl… <dbl…
mtcars2$mpg[1]
## [[1]]
## [1] 22.8 24.4 22.8 32.4 30.4 33.9 21.5 27.3 26.0 30.4 21.4
25.5.3 Exercises
1. Why might the lengths() function be useful for creating atomic vector columns from list-columns?
The lengths() function will return the length of each element in a list. This differs from the length() function which will return the number of elements contained in a list. For an example of the differences, see the code chunk below. However, for whatever reason, when lengths() is used with mutate() on a list-column, it will return the number of elements in the list rather than returning the vector with the lengths for each element in the list. In other words, when used with mutate(), lengths() performs what we normally would have thought length() would do. Even more confusing is how when length() is used with mutate, it breaks down and now returns the number of rows in the data frame regardless of what is stored in each row.
lengths(list(a = 1:10, b = 2, z = 5))
## a b z
## 10 1 1
length(mtcars)
## [1] 11
lengths(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## 32 32 32 32 32 32 32 32 32 32 32
df <- tribble(
~x,
list(a = 1:10, b = 2, z = 5),
list(a = 2:5, c = 4),
list(a = 1, b = 2)
)
# lengths() returns the number of elements in the list-column
df %>% mutate(
length_of_a = lengths(x)
)
## # A tibble: 3 x 2
## x length_of_a
## <list> <int>
## 1 <list [3]> 3
## 2 <list [2]> 2
## 3 <list [2]> 2
# length() returns the number of rows regardless of the value in the list-column
df %>% mutate(
length_of_a = length(x)
)
## # A tibble: 3 x 2
## x length_of_a
## <list> <int>
## 1 <list [3]> 3
## 2 <list [2]> 3
## 3 <list [2]> 3
2. List the most common types of vector found in a data frame. What makes lists different?
In a data frame, I usually encounter numerics, characters, and factors, in which there is just a single value at any row x column designation. Lists are different because they can contain any number of data types, and multiple values for each. You can even have a list of lists! This means that a very diverse data set can be stored within lists in a single column of your data frame. That is the beauty of list-columns that this chapter tries to highlight.