Chapter 24 - Model building
For this chapter, the following packages and modifications to datasets were used:
library(tidyverse)
library(modelr)
library(nycflights13)
library(lubridate)
options(na.action = na.warn)
diamonds2 <- diamonds %>%
  filter(carat <= 2.5) %>%
  mutate(lprice = log2(price), lcarat = log2(carat))
24.2.3 Exercises
1. In the plot of lcarat vs. lprice, there are some bright vertical strips. What do they represent?
The color of each hexagon in geom_hex corresponds to the number of observations that fall within it, so the bright vertical strips represent carat values where observations are highly concentrated relative to the surrounding dim hexagons. Hadley alludes to the likely reason these stripes exist in the previous chapters, where he mentions that humans are inclined to report "pretty" values such as 1.0, 1.5, 2.0, etc., so observations pile up at those carat sizes.
ggplot(diamonds2, aes(lcarat, lprice)) +
geom_hex(bins = 50)
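As a quick check (my own addition, not from the chapter), counting the observations at each carat value should show spikes at round sizes such as 0.3, 0.7, and 1.0:
diamonds2 %>%
  count(carat, sort = TRUE) %>%  # most frequent carat values first
  head()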
2. If log(price) = a_0 + a_1 * log(carat), what does that say about the relationship between price and carat?
This equation says that there is a linear relationship between log(price) and log(carat), with a_0 as the intercept and a_1 as the slope. The relationship between the untransformed variables is not linear: since the logs here are base 2, raising 2 to both sides gives price = 2^a_0 * carat^a_1, a power-law relationship in which each doubling of carat multiplies the price by 2^a_1.
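A minimal sketch of the back-transformation, assuming a hypothetical mod_loglog name for the single-predictor fit:
# Fit the log-log model and recover the power-law form price = 2^a_0 * carat^a_1.
mod_loglog <- lm(lprice ~ lcarat, data = diamonds2)
a <- coef(mod_loglog)
power_law_price <- function(carat) 2 ^ a[["(Intercept)"]] * carat ^ a[["lcarat"]]
power_law_price(0.5)  # implied dollar price of a 0.5 carat diamond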
3. Extract the diamonds that have very high and very low residuals. Is there anything unusual about these diamonds? Are they particularly bad or good, or do you think these are pricing errors?
Hadley extracts the diamonds with high or low residuals (abs(lresid2) > 1) in the code below. After examining these, we find that most of the outlier diamonds priced much lower than predicted have a specific flaw associated with them. For example, the diamond that sold for $1262 but was predicted to cost $2644 has a large carat size but a clarity of I1 (the worst grade), for which there are relatively few observations in the dataset (707 observations for grade I1 versus 13055 for grade SI1).
The diamonds in this list usually have a combination of very good qualities and very bad ones. There may be some multicollinearity in the model, in which some of the predictor variables (lcarat, color, cut, and clarity) are correlated with one another. The model can then break down when the values of two usually-correlated variables do not follow the overall trend for a particular observation. For example, the diamond priced at $2366 is predicted to cost only $774; it pairs one of the best clarity grades (VVS2) with the best color (D) on a very small stone. Looking at a 2d-bin plot of color versus clarity, there are relatively few observations with this combination of color and clarity, which may be why the model breaks down.
mod_diamond2 <- lm(lprice ~ lcarat + color + cut + clarity, data = diamonds2)
diamonds2 <- diamonds2 %>%
add_residuals(mod_diamond2, "lresid2")
ggplot(diamonds2, aes(lcarat, lresid2)) +
geom_hex(bins = 50)
diamonds2 %>%
filter(abs(lresid2) > 1) %>%
add_predictions(mod_diamond2) %>%
mutate(pred = round(2 ^ pred)) %>%
select(price, pred, carat:table, x:z, lresid2) %>%
arrange(price)
## # A tibble: 16 x 12
## price pred carat cut color clarity depth table x y z
## <int> <dbl> <dbl> <ord> <ord> <ord> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1013 264 0.25 Fair F SI2 54.4 64 4.3 4.23 2.32
## 2 1186 284 0.25 Prem… G SI2 59 60 5.33 5.28 3.12
## 3 1186 284 0.25 Prem… G SI2 58.8 60 5.33 5.28 3.12
## 4 1262 2644 1.03 Fair E I1 78.2 54 5.72 5.59 4.42
## 5 1415 639 0.35 Fair G VS2 65.9 54 5.57 5.53 3.66
## 6 1415 639 0.35 Fair G VS2 65.9 54 5.57 5.53 3.66
## 7 1715 576 0.32 Fair F VS2 59.6 60 4.42 4.34 2.61
## 8 1776 412 0.290 Fair F SI1 55.8 60 4.48 4.41 2.48
## 9 2160 314 0.34 Fair F I1 55.8 62 4.72 4.6 2.6
## 10 2366 774 0.3 Very… D VVS2 60.6 58 4.33 4.35 2.63
## 11 3360 1373 0.51 Prem… F SI1 62.7 62 5.09 4.96 3.15
## 12 3807 1540 0.61 Good F SI2 62.5 65 5.36 5.29 3.33
## 13 3920 1705 0.51 Fair F VVS2 65.4 60 4.98 4.9 3.23
## 14 4368 1705 0.51 Fair F VVS2 60.7 66 5.21 5.11 3.13
## 15 10011 4048 1.01 Fair D SI2 64.6 58 6.25 6.2 4.02
## 16 10470 23622 2.46 Prem… E SI2 59.7 59 8.82 8.76 5.25
## # … with 1 more variable: lresid2 <dbl>
diamonds2 %>% group_by(clarity) %>% summarize(num = n())
## # A tibble: 8 x 2
## clarity num
## <ord> <int>
## 1 I1 707
## 2 SI2 9116
## 3 SI1 13055
## 4 VS2 12256
## 5 VS1 8169
## 6 VVS2 5066
## 7 VVS1 3655
## 8 IF 1790
diamonds2 %>%
  ggplot(aes(color, clarity)) +
  geom_bin2d()
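A numeric version of the same check (my own addition): counting each color-clarity combination directly shows which ones are rare.
diamonds2 %>%
  count(color, clarity) %>%  # one row per color-clarity combination
  arrange(n) %>%             # rarest combinations first
  head()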
4. Does the final model, mod_diamond2, do a good job of predicting diamond prices? Would you trust it to tell you how much to spend if you were buying a diamond?
Judging by the plot of the residuals, the model does a decent job of removing the patterns in the data for the log-transformed version: the residuals are fairly flat, and only a handful exceed 1 in absolute value (on the log2 scale, a residual of 1 means the price is off by a factor of two). The model could still be improved to reduce the variance of the residuals (compress the points toward h = 0) and get more accurate predictions. However, since we aren't dealing with log-transformed money when buying diamonds in the real world, we should examine how the residuals look when transformed back into dollars. To do so, we subtract the back-transformed prediction from the true price to get the non-transformed residual. Plotting these non-transformed residuals shows that their variability increases as carat increases (the same is true for lcarat). I would have moderate faith in the model for diamonds under 1 carat, but for larger diamonds I would be cautious. The model would be useful for determining whether you were being completely scammed, but not for detecting small variations in price.
nontransformed_residual <- diamonds2 %>%
  filter(abs(lresid2) < 1) %>%
  add_predictions(mod_diamond2) %>%
  mutate(pred = round(2 ^ pred)) %>%
  mutate(resid = price - pred)
nontransformed_residual %>%
  ggplot(aes(carat, resid)) +
  geom_hex()
nontransformed_residual %>%
  summarize(resid_mean = mean(resid),
            resid_sd = sd(resid),
            resid_interval_low = mean(resid) - 1.96 * sd(resid),
            resid_interval_high = mean(resid) + 1.96 * sd(resid),
            limit_upper = max(resid),
            limit_lower = min(resid),
            maxprice = max(price),
            minprice = min(price))
## # A tibble: 1 x 8
## resid_mean resid_sd resid_interval_… resid_interval_… limit_upper
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 49.0 731. -1385. 1483. 8994
## # … with 3 more variables: limit_lower <dbl>, maxprice <dbl>,
## # minprice <dbl>
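To quantify the fan shape rather than just eyeball it, here is a small sketch (my own addition; the bin breaks are arbitrary):
nontransformed_residual %>%
  mutate(carat_bin = cut(carat, breaks = c(0, 0.5, 1, 1.5, 2, 2.5))) %>%
  group_by(carat_bin) %>%
  summarize(n = n(), resid_sd = sd(resid))  # spread of the dollar residuals per bin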
24.3.5 Exercises
daily <- flights %>%
mutate(date = make_date(year, month, day)) %>%
group_by(date) %>%
summarise(n = n())
daily
## # A tibble: 365 x 2
## date n
## <date> <int>
## 1 2013-01-01 842
## 2 2013-01-02 943
## 3 2013-01-03 914
## 4 2013-01-04 915
## 5 2013-01-05 720
## 6 2013-01-06 832
## 7 2013-01-07 933
## 8 2013-01-08 899
## 9 2013-01-09 902
## 10 2013-01-10 932
## # … with 355 more rows
ggplot(daily, aes(date, n)) +
geom_line()
daily <- daily %>%
mutate(wday = wday(date, label = TRUE))
mod <- lm(n ~ wday, data = daily)
daily <- daily %>%
add_residuals(mod)
daily %>%
ggplot(aes(date, resid)) +
geom_ref_line(h = 0) +
geom_line()
term <- function(date) {
cut(date,
breaks = ymd(20130101, 20130605, 20130825, 20140101),
labels = c("spring", "summer", "fall")
)
}
daily <- daily %>%
mutate(term = term(date))
1. Use your Google sleuthing skills to brainstorm why there were fewer than expected flights on Jan 20, May 26, and Sep 1. (Hint: they all have the same explanation.) How would these days generalise to another year?
These days are all the Sunday before a major US holiday that always falls on a Monday (Jan 21, 2013 was Martin Luther King Jr. Day, May 27 was Memorial Day, and Sep 2 was Labor Day). There may be fewer flights than expected because of the extended weekend and other holiday-associated factors. For another year, these days can be found by looking for the day before the third Monday in January, the last Monday in May, and the first Monday in September.
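A minimal sketch of that rule, using the hypothetical helpers nth_wday() and last_monday() (not from the chapter); weekday is coded 1 = Monday through 7 = Sunday:
nth_wday <- function(year, month, n, weekday) {
  first <- make_date(year, month, 1)
  offset <- (weekday - wday(first, week_start = 1)) %% 7  # days until the first such weekday
  first + offset + 7 * (n - 1)
}
last_monday <- function(year, month) {
  first <- make_date(year, month, 1)
  last <- first + days_in_month(first) - 1  # last day of the month
  last - (wday(last, week_start = 1) - 1)   # step back to the preceding Monday
}
nth_wday(2013, 1, 3, 1) - 1  # day before MLK Day (3rd Monday of January)
last_monday(2013, 5) - 1     # day before Memorial Day (last Monday of May)
nth_wday(2013, 9, 1, 1) - 1  # day before Labor Day (1st Monday of September)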
2. What do the three days with high positive residuals represent? How would these days generalise to another year?
These days also appear to be associated with major US holidays: 2013-11-30 and 2013-12-01 are the weekend after Thanksgiving, and 2013-12-28 is the Saturday after Christmas. Flights may peak on these days because families who visited relatives for the holiday are flying back home. These generalize to another year as the weekends following those holidays, which typically fall in the last week of November and December.
daily %>%
top_n(3, resid)
## # A tibble: 3 x 5
## date n wday resid term
## <date> <int> <ord> <dbl> <fct>
## 1 2013-11-30 857 Sat 112. fall
## 2 2013-12-01 987 Sun 95.5 fall
## 3 2013-12-28 814 Sat 69.4 fall
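Reusing the hypothetical nth_wday() helper sketched in the previous answer, the post-Thanksgiving weekend for any year can be computed directly:
thanksgiving <- nth_wday(2013, 11, 4, 4)  # 4th Thursday of November
thanksgiving + 2                          # the Saturday after (2013-11-30)
thanksgiving + 3                          # the Sunday after (2013-12-01)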
3. Create a new variable that splits the wday variable into terms, but only for Saturdays, i.e. it should have Thurs, Fri, but Sat-summer, Sat-spring, Sat-fall. How does this model compare with the model with every combination of wday and term?
To split the Saturdays into terms, I wrote an annotate_sat() function that takes the wday and term columns from daily and appends the appropriate term suffix to each "Sat". Fitting this model (mod3) and comparing it to the mod1 and mod2 described in the chapter shows a slight improvement over mod1 (no term variable), but slightly worse performance than mod2 (every day is termed). The residual standard error of mod3 is midway between those of mod1 and mod2.
annotate_sat <- function(wday, term) {
  index <- which(wday == "Sat")
  wday <- as.character(wday)
  wday[index] <- str_c("Sat-", as.character(term)[index])
  wday
}
daily <- daily %>% mutate(wday_sat = annotate_sat(wday, term))
daily
## # A tibble: 365 x 6
## date n wday resid term wday_sat
## <date> <int> <ord> <dbl> <fct> <chr>
## 1 2013-01-01 842 Tue -109. spring Tue
## 2 2013-01-02 943 Wed -19.7 spring Wed
## 3 2013-01-03 914 Thu -51.8 spring Thu
## 4 2013-01-04 915 Fri -52.5 spring Fri
## 5 2013-01-05 720 Sat -24.6 spring Sat-spring
## 6 2013-01-06 832 Sun -59.5 spring Sun
## 7 2013-01-07 933 Mon -41.8 spring Mon
## 8 2013-01-08 899 Tue -52.4 spring Tue
## 9 2013-01-09 902 Wed -60.7 spring Wed
## 10 2013-01-10 932 Thu -33.8 spring Thu
## # … with 355 more rows
mod1 <- lm(n ~ wday, data = daily)
mod2 <- lm(n ~ wday * term, data = daily)
mod3 <- lm(n ~ wday_sat, data = daily)
daily %>%
  gather_residuals(no_term = mod1, all_combinations = mod2, only_sat_term = mod3) %>%
  ggplot(aes(date, resid, colour = model)) +
  geom_line(alpha = 0.75)
sigma(mod1)
## [1] 48.79656
sigma(mod2)
## [1] 46.16568
sigma(mod3)
## [1] 47.35969
4. Create a new wday variable that combines the day of week, term (for Saturdays), and public holidays. What do the residuals of that model look like?
The code below adds a column to the data frame indicating which days are holidays (I've chosen the major US public holidays for 2013). Fitting a model to this gives a residual standard error lower than that of the model in which only the Saturdays are termed (it drops from 47.36 to 42.94), suggesting we are doing at least a slightly better job. Looking at the graph, the residual spikes around the holidays are smaller (but still quite visible). The model could be optimized further to shrink these residuals, such as by annotating wday with the few days before and after each holiday (see the sketch after the code below).
daily
## # A tibble: 365 x 6
## date n wday resid term wday_sat
## <date> <int> <ord> <dbl> <fct> <chr>
## 1 2013-01-01 842 Tue -109. spring Tue
## 2 2013-01-02 943 Wed -19.7 spring Wed
## 3 2013-01-03 914 Thu -51.8 spring Thu
## 4 2013-01-04 915 Fri -52.5 spring Fri
## 5 2013-01-05 720 Sat -24.6 spring Sat-spring
## 6 2013-01-06 832 Sun -59.5 spring Sun
## 7 2013-01-07 933 Mon -41.8 spring Mon
## 8 2013-01-08 899 Tue -52.4 spring Tue
## 9 2013-01-09 902 Wed -60.7 spring Wed
## 10 2013-01-10 932 Thu -33.8 spring Thu
## # … with 355 more rows
annotate_sat_holiday <- function(date, wday, term) {
  index <- which(wday == "Sat")
  wday <- as.character(wday)
  wday[index] <- str_c("Sat-", as.character(term)[index])
  holidays <- ymd(20130101, 20130527, 20130704, 20130902, 20131111, 20131128, 20131225)
  holiday_index <- which(date %in% holidays)
  wday[holiday_index] <- "holiday"
  wday
}
daily <- daily %>% mutate(wday_sat_holiday = annotate_sat_holiday(date, wday, term))
daily
## # A tibble: 365 x 7
## date n wday resid term wday_sat wday_sat_holiday
## <date> <int> <ord> <dbl> <fct> <chr> <chr>
## 1 2013-01-01 842 Tue -109. spring Tue holiday
## 2 2013-01-02 943 Wed -19.7 spring Wed Wed
## 3 2013-01-03 914 Thu -51.8 spring Thu Thu
## 4 2013-01-04 915 Fri -52.5 spring Fri Fri
## 5 2013-01-05 720 Sat -24.6 spring Sat-spring Sat-spring
## 6 2013-01-06 832 Sun -59.5 spring Sun Sun
## 7 2013-01-07 933 Mon -41.8 spring Mon Mon
## 8 2013-01-08 899 Tue -52.4 spring Tue Tue
## 9 2013-01-09 902 Wed -60.7 spring Wed Wed
## 10 2013-01-10 932 Thu -33.8 spring Thu Thu
## # … with 355 more rows
mod4 <- lm(n ~ wday_sat_holiday, data = daily)
daily %>%
gather_residuals(only_sat_term = mod3, sat_with_holiday = mod4) %>%
ggplot(aes(date, resid, colour = model)) +
geom_line(alpha = 0.75)
sigma(mod3)
## [1] 47.35969
sigma(mod4)
## [1] 42.94494
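As suggested above, we could go one step further and flag the days adjacent to each holiday. A minimal sketch with a hypothetical annotate_near_holiday() function; putting the call in the formula avoids adding yet another column to daily:
annotate_near_holiday <- function(date, wday, term) {
  wday <- annotate_sat_holiday(date, wday, term)
  holidays <- ymd(20130101, 20130527, 20130704, 20130902, 20131111, 20131128, 20131225)
  near <- c(holidays - 1, holidays + 1)  # the day before and after each holiday
  wday[date %in% near & wday != "holiday"] <- "near-holiday"
  wday
}
mod4b <- lm(n ~ annotate_near_holiday(date, wday, term), data = daily)
sigma(mod4b)  # compare against sigma(mod4) above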
5. What happens if you fit a day of week effect that varies by month (i.e. n ~ wday * month)? Why is this not very helpful?
Adding the month variable to the model reduces the residuals by a small amount (and consequently the residual standard error), so it is slightly helpful to include. It is not very helpful overall, though, because it risks over-fitting the data: there is no strong reason to assume that each individual month has its own unique day-of-the-week pattern, and the interaction adds a large number of predictor variables that are not all independent of one another (see the coefficient count after the code below).
# add month column to daily
daily <- daily %>% mutate(month = factor(month(date)))
daily
## # A tibble: 365 x 8
## date n wday resid term wday_sat wday_sat_holiday month
## <date> <int> <ord> <dbl> <fct> <chr> <chr> <fct>
## 1 2013-01-01 842 Tue -109. spring Tue holiday 1
## 2 2013-01-02 943 Wed -19.7 spring Wed Wed 1
## 3 2013-01-03 914 Thu -51.8 spring Thu Thu 1
## 4 2013-01-04 915 Fri -52.5 spring Fri Fri 1
## 5 2013-01-05 720 Sat -24.6 spring Sat-spring Sat-spring 1
## 6 2013-01-06 832 Sun -59.5 spring Sun Sun 1
## 7 2013-01-07 933 Mon -41.8 spring Mon Mon 1
## 8 2013-01-08 899 Tue -52.4 spring Tue Tue 1
## 9 2013-01-09 902 Wed -60.7 spring Wed Wed 1
## 10 2013-01-10 932 Thu -33.8 spring Thu Thu 1
## # … with 355 more rows
mod5 <- lm(n ~ wday * month, data = daily)
daily %>%
gather_residuals(original_model = mod1, with_month = mod5) %>%
ggplot(aes(date, resid, colour = model)) +
geom_line(alpha = 0.75)
sigma(mod1)
## [1] 48.79656
sigma(mod5)
## [1] 42.0489
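A quick way to see the over-fitting risk (my own addition) is to count the coefficients each model estimates:
length(coef(mod1))  # 7: intercept + 6 day-of-week contrasts
length(coef(mod5))  # 84: 1 + 6 + 11 + 66, so roughly 4 observations per wday-month cell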
6. What would you expect the model n ~ wday + ns(date, 5) to look like? Knowing what you know about the data, why would you expect it to be not particularly effective?
The book displays what the model n ~ wday * ns(date, 5) looks like, which allows the day-of-week effect to vary with the time of year. The model n ~ wday + ns(date, 5) uses the "+" operator instead of "*", so it does not account for any interaction between the two variables: every day of the week gets the same seasonal trend, just shifted up or down. I would expect n ~ wday + ns(date, 5) to do a worse job as a predictive model. Testing this below, the residual standard error is indeed (slightly) larger, and the additive model also estimates far fewer coefficients.
library(splines)
mod6 <- MASS::rlm(n ~ wday * ns(date, 5), data = daily)
mod7 <- MASS::rlm(n ~ wday + ns(date, 5), data = daily)
summary(mod6)
##
## Call: rlm(formula = n ~ wday * ns(date, 5), data = daily)
## Residuals:
## Min 1Q Median 3Q Max
## -339.23 -8.00 1.33 6.94 119.10
##
## Coefficients:
## Value Std. Error t value
## (Intercept) 839.1368 3.2049 261.8314
## wday.L -58.0966 8.7636 -6.6293
## wday.Q -177.5614 8.4931 -20.9065
## wday.C -76.1745 8.5575 -8.9015
## wday^4 -106.5457 8.6304 -12.3454
## wday^5 14.7852 8.3463 1.7715
## wday^6 -18.4661 8.0673 -2.2890
## ns(date, 5)1 90.3704 4.0032 22.5743
## ns(date, 5)2 133.7194 5.1333 26.0493
## ns(date, 5)3 41.3817 3.9584 10.4542
## ns(date, 5)4 180.6645 8.1033 22.2952
## ns(date, 5)5 28.9177 3.5076 8.2443
## wday.L:ns(date, 5)1 -35.0805 10.6922 -3.2809
## wday.Q:ns(date, 5)1 11.6486 10.5964 1.0993
## wday.C:ns(date, 5)1 -6.8751 10.6179 -0.6475
## wday^4:ns(date, 5)1 23.7107 10.6477 2.2268
## wday^5:ns(date, 5)1 -21.0685 10.5431 -1.9983
## wday^6:ns(date, 5)1 6.6103 10.4507 0.6325
## wday.L:ns(date, 5)2 19.8440 13.8192 1.4360
## wday.Q:ns(date, 5)2 66.9609 13.5972 4.9246
## wday.C:ns(date, 5)2 54.5911 13.6393 4.0025
## wday^4:ns(date, 5)2 59.4655 13.7181 4.3348
## wday^5:ns(date, 5)2 -21.2690 13.4571 -1.5805
## wday^6:ns(date, 5)2 8.7439 13.2506 0.6599
## wday.L:ns(date, 5)3 -104.5352 10.4834 -9.9715
## wday.Q:ns(date, 5)3 -83.6926 10.4858 -7.9815
## wday.C:ns(date, 5)3 -77.0187 10.4500 -7.3702
## wday^4:ns(date, 5)3 -36.3447 10.5211 -3.4545
## wday^5:ns(date, 5)3 -25.0391 10.4165 -2.4038
## wday^6:ns(date, 5)3 0.8459 10.4801 0.0807
## wday.L:ns(date, 5)4 -29.5596 22.0451 -1.3409
## wday.Q:ns(date, 5)4 58.6436 21.4762 2.7306
## wday.C:ns(date, 5)4 32.5922 21.5929 1.5094
## wday^4:ns(date, 5)4 61.0040 21.7796 2.8010
## wday^5:ns(date, 5)4 -41.3648 21.1311 -1.9575
## wday^6:ns(date, 5)4 13.1682 20.5795 0.6399
## wday.L:ns(date, 5)5 39.3654 9.1661 4.2947
## wday.Q:ns(date, 5)5 59.7652 9.3101 6.4194
## wday.C:ns(date, 5)5 39.0233 9.1742 4.2536
## wday^4:ns(date, 5)5 22.0797 9.3456 2.3626
## wday^5:ns(date, 5)5 2.1498 9.1824 0.2341
## wday^6:ns(date, 5)5 -4.9030 9.4987 -0.5162
##
## Residual standard error: 10.69 on 323 degrees of freedom
summary(mod7)
##
## Call: rlm(formula = n ~ wday + ns(date, 5), data = daily)
## Residuals:
## Min 1Q Median 3Q Max
## -339.2832 -8.1235 0.1509 9.3763 111.5557
##
## Coefficients:
## Value Std. Error t value
## (Intercept) 843.7636 3.5622 236.8666
## wday.L -80.1706 2.1137 -37.9291
## wday.Q -156.9939 2.1125 -74.3180
## wday.C -68.1163 2.1111 -32.2664
## wday^4 -78.8676 2.1143 -37.3026
## wday^5 -6.3087 2.1084 -2.9922
## wday^6 -11.5275 2.1095 -5.4646
## ns(date, 5)1 87.4028 4.4681 19.5615
## ns(date, 5)2 121.4169 5.7198 21.2276
## ns(date, 5)3 53.7103 4.4159 12.1630
## ns(date, 5)4 169.0229 9.0113 18.7567
## ns(date, 5)5 18.4859 3.9036 4.7356
##
## Residual standard error: 13.24 on 353 degrees of freedom
daily %>%
gather_residuals(multiplicative_splines = mod6, additive_splines = mod7) %>%
ggplot(aes(date, resid, colour = model)) +
geom_line(alpha = 0.75)
sigma(mod6)
## [1] 43.92417
sigma(mod7)
## [1] 44.30197
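To literally see what the two models look like, we can overlay their predictions (my own addition; gather_predictions() is from modelr). The additive model should produce the same seasonal curve merely shifted up or down for each weekday, while the interactive model lets each weekday's curve take its own shape:
daily %>%
  gather_predictions(mod6, mod7) %>%
  ggplot(aes(date, pred, colour = wday)) +
  geom_line() +
  facet_wrap(~model)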
7. We hypothesised that people leaving on Sundays are more likely to be business travellers who need to be somewhere on Monday. Explore that hypothesis by seeing how it breaks down based on distance and time: if it’s true, you’d expect to see more Sunday evening flights to places that are far away.
To explore this hypothesis, we first generate a new data frame from flights containing a factor that splits each day into four intervals: 12am-8am, 8am-12pm, 12pm-5pm, and 5pm-12am. We then summarize the data to obtain the mean distance of flights departing in each interval, grouped by day of week. Plotting this with boxplots, we see that flights taking off between 12am and 8am on Sundays have a much higher mean distance than every other category. So the data do support the idea that Sunday departures skew long-distance, but contrary to the expectation that the evening (5pm-12am) flights would be the longest, the longest flights actually leave early on Sunday morning.
# summarize mean flight distance by date and time-of-day interval
timeofday <- function(hour) {
  cut(hour,
    breaks = c(0, 7, 12, 17, 23),
    labels = c("12am-8am", "8am-12pm", "12pm-5pm", "5pm-12am")
  )
}
flights2 <- flights %>%
  mutate(date = make_date(year, month, day), timeofday = timeofday(hour)) %>%
  group_by(date, timeofday) %>%
  summarize(n = n(),
            mean_dist = mean(distance)) %>%
  mutate(wday = wday(date, label = TRUE))
flights2
## # A tibble: 1,460 x 5
## # Groups: date [365]
## date timeofday n mean_dist wday
## <date> <fct> <int> <dbl> <ord>
## 1 2013-01-01 12am-8am 107 1234. Tue
## 2 2013-01-01 8am-12pm 246 1074. Tue
## 3 2013-01-01 12pm-5pm 301 1040. Tue
## 4 2013-01-01 5pm-12am 188 1051. Tue
## 5 2013-01-02 12am-8am 146 1132. Wed
## 6 2013-01-02 8am-12pm 274 1045. Wed
## 7 2013-01-02 12pm-5pm 318 1023. Wed
## 8 2013-01-02 5pm-12am 205 1054. Wed
## 9 2013-01-03 12am-8am 144 1118. Thu
## 10 2013-01-03 8am-12pm 262 1037. Thu
## # … with 1,450 more rows
ggplot(flights2, aes(x = wday, y = mean_dist)) +
  geom_boxplot(aes(color = timeofday))
# The same comparison on the raw per-flight distances (rather than the
# daily means) provides a sanity check on the pattern above:
flights %>%
  mutate(date = make_date(year, month, day), timeofday = timeofday(hour)) %>%
  mutate(wday = wday(date, label = TRUE)) %>%
  ggplot(aes(x = wday, y = distance)) +
  geom_boxplot(aes(color = timeofday))
8. It’s a little frustrating that Sunday and Saturday are on separate ends of the plot. Write a small function to set the levels of the factor so that the week starts on Monday.
The levels of a factor can be re-ordered with the factor() function by passing the desired order to its levels argument. Below is an example of a function that reorders the levels so that the week starts on Monday.
# before reordering the factors
ggplot(daily, aes(wday, n)) +
geom_boxplot()
reorder_week <- function(mydata) {
  factor(mydata, levels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"))
}
# after reordering the factors
daily %>%
  mutate(wday2 = reorder_week(wday)) %>%
  ggplot(aes(wday2, n)) +
  geom_boxplot()
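An alternative worth noting (assuming lubridate >= 1.7, which added the week_start argument) is to generate Monday-first labels directly when the column is created:
daily %>%
  mutate(wday2 = wday(date, label = TRUE, week_start = 1)) %>%
  ggplot(aes(wday2, n)) +
  geom_boxplot()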