Chapter 10 - Tibbles

Notes - Creating Tibbles

You can create tibbles from existing data frames using as_tibble(), or create brand new tibbles using tibble():

tibble(
  x = 1:5,
  y = 1,
  z = x ^ 2 + y
)

## # A tibble: 5 x 3
##       x     y     z
##   <int> <dbl> <dbl>
## 1     1     1     2
## 2     2     1     5
## 3     3     1    10
## 4     4     1    17
## 5     5     1    26
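
One behaviour worth seeing next to the example above: as far as I understand the tibble documentation, tibble() only recycles inputs of length 1 (which is why y = 1 works), whereas data.frame() recycles any length that divides evenly. A minimal sketch, with hypothetical object names (ok, base_df, res):

```r
library(tibble)

# Length-1 inputs are recycled to the common length, as with y = 1 above.
ok <- tibble(x = 1:4, y = 1)

# data.frame() silently recycles 1:2 to fill four rows.
base_df <- data.frame(x = 1:4, y = 1:2)

# tibble() refuses to recycle a length-2 vector; capture the error instead of stopping.
res <- tryCatch(
  tibble(x = 1:4, y = 1:2),
  error = function(e) e
)
inherits(res, "error")
```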

A cousin of tibble(), tribble() (short for transposed tibble), lets you enter data by hand, one row at a time:

tribble(
  ~x, ~y, ~z,
  #--|--|----
  "a", 2, 3.6,
  "b", 1, 8.5
)

## # A tibble: 2 x 3
##   x         y     z
##   <chr> <dbl> <dbl>
## 1 a         2   3.6
## 2 b         1   8.5

You can also use non-syntactic names for variables in tibbles:

tb <- tibble(
  `:)` = "smile",
  ` ` = "space",
  `2000` = "number"
)
tb

## # A tibble: 1 x 3
##   `:)`  ` `   `2000`
##   <chr> <chr> <chr> 
## 1 smile space number

Compared to a data.frame in base R, a tibble is more user-friendly. Printing a tibble shows only the first few rows and the columns that fit on screen rather than filling up your entire console (think of it as a default head(data.frame) display). Other nice features include never converting strings to factors and never changing variable names.
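
To make the point about names and strings concrete, here is a small sketch comparing the two constructors. The column name `my var` and the object names are made up for illustration:

```r
library(tibble)

# data.frame() rewrites non-syntactic names unless check.names = FALSE.
base_df <- data.frame(`my var` = "a")
names(base_df)

# tibble() keeps the name exactly as typed and leaves the string as character.
tb <- tibble(`my var` = "a")
names(tb)
class(tb[["my var"]])
```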

To convert tibbles to or from data frames, use as_tibble() and as.data.frame():

class(iris)

## [1] "data.frame"

class(as_tibble(iris))

## [1] "tbl_df"     "tbl"        "data.frame"

class(as.data.frame(as_tibble(iris)))

## [1] "data.frame"
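
One thing to watch for when converting: tibbles do not carry row names, so a data frame like mtcars loses its car models unless you ask as_tibble() to keep them. A short sketch, where the column name "model" is my own choice:

```r
library(tibble)

# mtcars stores the car models as row names rather than as a column.
head(rownames(mtcars), 3)

# Plain conversion drops them; rownames = "model" keeps them in a new column.
cars_plain <- as_tibble(mtcars)
cars_named <- as_tibble(mtcars, rownames = "model")
names(cars_named)[1]
```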

You can extract columns from tibbles the same way you would with a data.frame:

df <- tibble(
  x = runif(5),
  y = rnorm(5)
)
# extract column 'x' using either $ or [[]]
df$x

## [1] 0.84011082 0.09959577 0.27459996 0.87146539 0.36318207

df[["x"]]

## [1] 0.84011082 0.09959577 0.27459996 0.87146539 0.36318207

df[[1]]

## [1] 0.84011082 0.09959577 0.27459996 0.87146539 0.36318207
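
For contrast, single brackets behave differently from $ and [[ ]]. A quick sketch of what each form returns, reusing the same kind of df as above:

```r
library(tibble)

df <- tibble(x = runif(5), y = rnorm(5))

# [[ and $ return the column itself, a plain numeric vector.
is.numeric(df[["x"]])

# Single brackets keep the tibble structure, even for a single column.
class(df["x"])
```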

10.5 Exercises

1. How can you tell if an object is a tibble? (Hint: try printing mtcars, which is a regular data frame).

You can tell if an object is a tibble because the output you get by printing it will say “tibble”! For example, printing the diamonds tibble returns # A tibble: 53,940 x 10 as the first line of the output. You can also tell something is a tibble from the column type abbreviations (such as <int> or <dbl>) printed underneath each variable name. A tibble will also only print the first 10 rows by default, whereas a data.frame prints every row, up to R's max.print limit. Last, the definitive way to tell something is a tibble is to use the class() function.

class(diamonds)

## [1] "tbl_df"     "tbl"        "data.frame"

class(mtcars)

## [1] "data.frame"
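
If you want a TRUE/FALSE answer rather than reading the class vector, tibble also exports is_tibble(). A minimal sketch using mtcars so that no extra packages are needed:

```r
library(tibble)

# A regular data frame is not a tibble.
is_tibble(mtcars)

# After conversion it is, and it carries the "tbl_df" class.
is_tibble(as_tibble(mtcars))
inherits(as_tibble(mtcars), "tbl_df")
```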

2. Compare and contrast the following operations on a data.frame and equivalent tibble. What is different? Why might the default data frame behaviours cause you frustration?

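On a data.frame, df$x will still return the values for column xyz, because $ does partial matching on column names. This behavior does not occur for a tibble, which requires the exact column name, df$xyz. The data.frame behavior might cause frustration if you mistype or abbreviate a name and another column happens to start with it, in which case you silently get a column you did not ask for. Subsetting with [ also differs: df[, "xyz"] returns a vector on a data.frame but a one-column tibble on a tibble, whereas df[, c("abc", "xyz")] keeps its data frame or tibble type in both cases. One more distinction to note is that, in the output below, data.frame() converts “a” into a factor with 1 level, while tibble() leaves it as a character. (That factor conversion was the default before R 4.0.0; from 4.0.0 on, stringsAsFactors defaults to FALSE.)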

df <- data.frame(abc = 1, xyz = "a")
df$x

## [1] a
## Levels: a

df[, "xyz"]

## [1] a
## Levels: a

df[, c("abc", "xyz")]

##   abc xyz
## 1   1   a

df <- tibble(abc = 1, xyz = "a")
df$x

## Warning: Unknown or uninitialised column: 'x'.

## NULL

df[, "xyz"]

## # A tibble: 1 x 1
##   xyz  
##   <chr>
## 1 a

df[, c("abc", "xyz")]

## # A tibble: 1 x 2
##     abc xyz  
##   <dbl> <chr>
## 1     1 a
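
The differences above can also be checked programmatically. This is just a sketch; base_df and tb are hypothetical names chosen so the two objects can sit side by side:

```r
library(tibble)

base_df <- data.frame(abc = 1, xyz = "a")
tb <- tibble(abc = 1, xyz = "a")

# One column via [ : the data.frame drops to a vector, the tibble does not.
is.data.frame(base_df[, "xyz"])
is_tibble(tb[, "xyz"])

# drop = FALSE makes the data.frame keep its shape.
is.data.frame(base_df[, "xyz", drop = FALSE])

# Partial matching: base_df$x finds xyz, tb$x returns NULL with a warning.
is.null(base_df$x)
is.null(suppressWarnings(tb$x))
```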

3. If you have the name of a variable stored in an object, e.g. var <- "mpg", how can you extract the reference variable from a tibble?

If the name of the variable is stored in an object, you can pass the object in place of the variable name using [[ ]] or [ ], just as you would with the explicit variable name. You can also combine the object with another variable name using c() to extract multiple variables. I provide an example below using the diamonds dataset.

var <- "carat"
var2 <- c("carat", "price")

# extract only carat
diamonds[, var]

## # A tibble: 53,940 x 1
##    carat
##    <dbl>
##  1 0.23 
##  2 0.21 
##  3 0.23 
##  4 0.290
##  5 0.31 
##  6 0.24 
##  7 0.24 
##  8 0.26 
##  9 0.22 
## 10 0.23 
## # … with 53,930 more rows

# extract carat and price
diamonds[, c(var, "price")]

## # A tibble: 53,940 x 2
##    carat price
##    <dbl> <int>
##  1 0.23    326
##  2 0.21    326
##  3 0.23    327
##  4 0.290   334
##  5 0.31    335
##  6 0.24    336
##  7 0.24    336
##  8 0.26    337
##  9 0.22    337
## 10 0.23    338
## # … with 53,930 more rows

diamonds[, var2]

## # A tibble: 53,940 x 2
##    carat price
##    <dbl> <int>
##  1 0.23    326
##  2 0.21    326
##  3 0.23    327
##  4 0.290   334
##  5 0.31    335
##  6 0.24    336
##  7 0.24    336
##  8 0.26    337
##  9 0.22    337
## 10 0.23    338
## # … with 53,930 more rows
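
The calls above all return tibbles. If you want the column as a plain vector, [[ ]] accepts the stored name, while $ does not. A small sketch on a made-up two-row tibble standing in for diamonds:

```r
library(tibble)

# Hypothetical stand-in for a real dataset such as diamonds.
tb <- tibble(carat = c(0.23, 0.21), price = c(326L, 326L))
var <- "carat"

# [[ looks up the value stored in var and returns the column as a vector.
tb[[var]]

# $ takes its argument literally, so this looks for a column named "var".
suppressWarnings(tb$var)
```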

4. Practice referring to non-syntactic names in the following data frame by:

annoying <- tibble(
  `1` = 1:10,
  `2` = `1` * 2 + rnorm(length(`1`))
)

Extracting the variable called 1.

annoying[,"1"]

## # A tibble: 10 x 1
##      `1`
##    <int>
##  1     1
##  2     2
##  3     3
##  4     4
##  5     5
##  6     6
##  7     7
##  8     8
##  9     9
## 10    10
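
The call above returns a one-column tibble. To pull the variable out as a vector instead, either of the following should work; this is a sketch using the same annoying tibble:

```r
library(tibble)

annoying <- tibble(
  `1` = 1:10,
  `2` = `1` * 2 + rnorm(length(`1`))
)

# Backticks are needed with $; [[ takes the name as an ordinary string.
annoying$`1`
annoying[["1"]]
```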

Plotting a scatterplot of 1 vs 2.

ggplot(annoying, aes(`1`, `2`)) +
  geom_point()

Creating a new column called 3 which is 2 divided by 1.

annoying %>%
  mutate(`3` = `2`/`1`)

## # A tibble: 10 x 3
##      `1`   `2`   `3`
##    <int> <dbl> <dbl>
##  1     1  1.58  1.58
##  2     2  3.45  1.73
##  3     3  7.28  2.43
##  4     4  7.28  1.82
##  5     5  9.54  1.91
##  6     6  9.81  1.63
##  7     7 13.6   1.94
##  8     8 15.4   1.93
##  9     9 19.4   2.15
## 10    10 20.4   2.04

Renaming the columns to one, two and three.

annoying %>%
  mutate(`3` = `2` / `1`) %>%
  rename(one = `1`, two = `2`, three = `3`)

## # A tibble: 10 x 3
##      one   two three
##    <int> <dbl> <dbl>
##  1     1  1.58  1.58
##  2     2  3.45  1.73
##  3     3  7.28  2.43
##  4     4  7.28  1.82
##  5     5  9.54  1.91
##  6     6  9.81  1.63
##  7     7 13.6   1.94
##  8     8 15.4   1.93
##  9     9 19.4   2.15
## 10    10 20.4   2.04

5. What does tibble::enframe() do? When might you use it?

Taken from the documentation: “enframe() converts named atomic vectors or lists to two-column data frames. For unnamed vectors, the natural sequence is used as name column.” I might use this when I have a vector that I want to turn into a data frame for plotting with ggplot, which requires its data to be a data.frame or tibble.

x <- rnorm(100)
names(x) <- 5:104
enframe(x)

## # A tibble: 100 x 2
##    name   value
##    <chr>  <dbl>
##  1 5      1.21 
##  2 6     -0.197
##  3 7      0.227
##  4 8     -0.749
##  5 9     -0.262
##  6 10    -1.42 
##  7 11     0.178
##  8 12     3.02 
##  9 13     1.07 
## 10 14     1.49 
## # … with 90 more rows

class(enframe(x))

## [1] "tbl_df"     "tbl"        "data.frame"
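
Two related details that I find useful: enframe() lets you choose the column names, and deframe() goes the other way. A minimal sketch; the vector scores and the column names "id" and "score" are made up:

```r
library(tibble)

scores <- c(a = 1.5, b = 2.5, c = 3.5)

# name = and value = rename the two output columns.
scores_tb <- enframe(scores, name = "id", value = "score")
names(scores_tb)

# deframe() turns a two-column tibble back into a named vector.
round_trip <- deframe(scores_tb)
```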

6. What option controls how many additional column names are printed at the footer of a tibble?

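The documentation for ?format.tbl (tibble formatting) says that the n_extra argument controls how many additional columns to print abbreviated information for. The example provided in the documentation is below; it lists only 2 of the additional columns in the footer, whereas the unmodified print(flights) would list all of them. Since the question asks for an option, note that the tibble package documentation also describes a tibble.max_extra_cols option that sets this limit globally, and that, as far as I know, newer tibble releases deprecate n_extra in favor of a max_extra_cols argument.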

print(nycflights13::flights, n_extra = 2)

## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      517            515         2      830
##  2  2013     1     1      533            529         4      850
##  3  2013     1     1      542            540         2      923
##  4  2013     1     1      544            545        -1     1004
##  5  2013     1     1      554            600        -6      812
##  6  2013     1     1      554            558        -4      740
##  7  2013     1     1      555            600        -5      913
##  8  2013     1     1      557            600        -3      709
##  9  2013     1     1      557            600        -3      838
## 10  2013     1     1      558            600        -2      753
## # … with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, …
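
To set the limit for a whole session rather than per call, something like the following should work. This is a sketch under the assumption that your tibble version honours the tibble.max_extra_cols option; the 26-column tibble wide is made up so the example does not need nycflights13:

```r
library(tibble)

# A hypothetical 26-column tibble, wide enough that most columns overflow.
wide <- as_tibble(setNames(as.list(1:26), letters))

# Assumption: the tibble.max_extra_cols option caps the columns listed in the footer.
old <- options(tibble.max_extra_cols = 2)
print(wide, width = 30)

# Restore the previous setting.
options(old)
```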