Chapter 11 - Data import
11.2 Notes - Reading in files
To practice the various utilities for reading in files, we can pass inline CSV strings, where each newline in the string marks a new row of data. Below are some examples of reading in inline CSV chunks, with arguments tailored to the type of data being read in.
# basic read_csv()
read_csv("a,b,c
1,2,3
4,5,6")
## # A tibble: 2 x 3
## a b c
## <dbl> <dbl> <dbl>
## 1 1 2 3
## 2 4 5 6
# read in data, ignoring metadata lines
read_csv("The first line of metadata
The second line of metadata
x,y,z
1,2,3", skip = 2)
## # A tibble: 1 x 3
## x y z
## <dbl> <dbl> <dbl>
## 1 1 2 3
# designate lines to skip that start with a specific symbol
read_csv("# A comment I want to skip
x,y,z
1,2,3", comment = "#")
## # A tibble: 1 x 3
## x y z
## <dbl> <dbl> <dbl>
## 1 1 2 3
# read in a file that doesn't have column names
read_csv("1,2,3\n4,5,6", col_names = FALSE)
## # A tibble: 2 x 3
## X1 X2 X3
## <dbl> <dbl> <dbl>
## 1 1 2 3
## 2 4 5 6
# read in file and specify column names
read_csv("1,2,3\n4,5,6", col_names = c("x", "y", "z"))
## # A tibble: 2 x 3
## x y z
## <dbl> <dbl> <dbl>
## 1 1 2 3
## 2 4 5 6
# read in a file, replacing a specified symbol with missing values (NA)
read_csv("a,b,c\n1,2,.", na = ".")
## # A tibble: 1 x 3
## a b c
## <dbl> <dbl> <lgl>
## 1 1 2 NA
11.2.2 Exercises
1. What function would you use to read a file where fields were separated with “|”?
I would use read_delim() to read in a file where fields are separated with “|”. For example:
read_delim ("a|b|c
1|2|3
4|5|6", "|")
## # A tibble: 2 x 3
## a b c
## <chr> <dbl> <dbl>
## 1 " 1" 2 3
## 2 " 4" 5 6
2. Apart from file, skip, and comment, what other arguments do read_csv() and read_tsv() have in common?
Based on the documentation, read_csv() and read_tsv() share col_names, col_types, locale, na, quoted_na, quote, trim_ws, n_max, guess_max, and progress.
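One way to check this programmatically is to compare the formal argument lists of the two functions. Below is a quick sketch using base R's formals() (output not shown):
# arguments common to read_csv() and read_tsv()
intersect(names(formals(read_csv)), names(formals(read_tsv)))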
3. What are the most important arguments to read_fwf()?
The most important arguments are file and col_positions. There are several ways to specify col_positions, including fwf_empty(), fwf_widths(), fwf_positions(), and fwf_cols(). Below is the example provided in the documentation:
fwf_sample <- readr_example("fwf-sample.txt")
cat(read_lines(fwf_sample))
## John Smith WA 418-Y11-4111 Mary Hartford CA 319-Z19-4341 Evan Nolan IL 219-532-c301
# You can specify column positions in several ways:
# 1. Guess based on position of empty columns
read_fwf(fwf_sample, fwf_empty(fwf_sample, col_names = c("first", "last", "state", "ssn")))
## Parsed with column specification:
## cols(
## first = col_character(),
## last = col_character(),
## state = col_character(),
## ssn = col_character()
## )
## # A tibble: 3 x 4
## first last state ssn
## <chr> <chr> <chr> <chr>
## 1 John Smith WA 418-Y11-4111
## 2 Mary Hartford CA 319-Z19-4341
## 3 Evan Nolan IL 219-532-c301
# 2. A vector of field widths
read_fwf(fwf_sample, fwf_widths(c(20, 10, 12), c("name", "state", "ssn")))
## Parsed with column specification:
## cols(
## name = col_character(),
## state = col_character(),
## ssn = col_character()
## )
## # A tibble: 3 x 3
## name state ssn
## <chr> <chr> <chr>
## 1 John Smith WA 418-Y11-4111
## 2 Mary Hartford CA 319-Z19-4341
## 3 Evan Nolan IL 219-532-c301
# 3. Paired vectors of start and end positions
read_fwf(fwf_sample, fwf_positions(c(1, 30), c(10, 42), c("name", "ssn")))
## Parsed with column specification:
## cols(
## name = col_character(),
## ssn = col_character()
## )
## # A tibble: 3 x 2
## name ssn
## <chr> <chr>
## 1 John Smith 418-Y11-4111
## 2 Mary Hartf 319-Z19-4341
## 3 Evan Nolan 219-532-c301
# 4. Named arguments with start and end positions
read_fwf(fwf_sample, fwf_cols(name = c(1, 10), ssn = c(30, 42)))
## Parsed with column specification:
## cols(
## name = col_character(),
## ssn = col_character()
## )
## # A tibble: 3 x 2
## name ssn
## <chr> <chr>
## 1 John Smith 418-Y11-4111
## 2 Mary Hartf 319-Z19-4341
## 3 Evan Nolan 219-532-c301
# 5. Named arguments with column widths
read_fwf(fwf_sample, fwf_cols(name = 20, state = 10, ssn = 12))
## Parsed with column specification:
## cols(
## name = col_character(),
## state = col_character(),
## ssn = col_character()
## )
## # A tibble: 3 x 3
## name state ssn
## <chr> <chr> <chr>
## 1 John Smith WA 418-Y11-4111
## 2 Mary Hartford CA 319-Z19-4341
## 3 Evan Nolan IL 219-532-c301
4. Sometimes strings in a CSV file contain commas. To prevent them from causing problems they need to be surrounded by a quoting character, like " or '. By convention, read_csv() assumes that the quoting character will be ", and if you want to change it you'll need to use read_delim() instead. What arguments do you need to specify to read the following text into a data frame?
"x,y\n1,'a,b'"
In this example, the string is surrounded by the quoting character '. This is not what read_csv() assumes by default. The documentation shows that you can specify the quote argument for both read_csv() and read_delim(). Below I read in the text using both methods, specifying quote = "\'".
read_csv ("x,y\n1,'a,b'", quote = "\'")
## # A tibble: 1 x 2
## x y
## <dbl> <chr>
## 1 1 a,b
read_delim("x,y\n1,'a,b'", delim = ",", quote = "\'")
## # A tibble: 1 x 2
## x y
## <dbl> <chr>
## 1 1 a,b
5. Identify what is wrong with each of the following inline CSV files. What happens when you run the code?
I have annotated the code below with the problems for each of the inline CSV files. The output should be displayed if you are viewing the rendered .md file (you won't see the output if this is a .Rmd file).
# There are not enough column names for the number of columns in the data.
read_csv("a,b\n1,2,3\n4,5,6")
## Warning: 2 parsing failures.
## row col expected actual file
## 1 -- 2 columns 3 columns literal data
## 2 -- 2 columns 3 columns literal data
## # A tibble: 2 x 2
## a b
## <dbl> <dbl>
## 1 1 2
## 2 4 5
# Mismatched numbers of columns again. The first data row has only 2 values, the second has 4, and the header has 3.
read_csv("a,b,c\n1,2\n1,2,3,4")
## Warning: 2 parsing failures.
## row col expected actual file
## 1 -- 3 columns 2 columns literal data
## 2 -- 3 columns 4 columns literal data
## # A tibble: 2 x 3
## a b c
## <dbl> <dbl> <dbl>
## 1 1 2 NA
## 2 1 2 3
# Two header columns, only one data value, and the closing quote is missing. Note that the "1" is still parsed as a number (double).
read_csv("a,b\n\"1")
## Warning: 2 parsing failures.
## row col expected actual file
## 1 a closing quote at end of file literal data
## 1 -- 2 columns 1 columns literal data
## # A tibble: 1 x 2
## a b
## <dbl> <chr>
## 1 1 <NA>
# Since the columns contain both numeric and character values, both columns are parsed as character.
read_csv("a,b\n1,2\na,b")
## # A tibble: 2 x 2
## a b
## <chr> <chr>
## 1 1 2
## 2 a b
# I assume ";" was meant to be the delimiter, so read_csv() lumps everything into one column with one observation. Use read_delim("a;b\n1;3", ";") instead (see the sketch below).
read_csv("a;b\n1;3")
## # A tibble: 1 x 1
## `a;b`
## <chr>
## 1 1;3
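As a quick sketch of the fix suggested in the comment above (output omitted), either of these should parse two columns:
# treat ";" as the delimiter explicitly
read_delim("a;b\n1;3", ";")
# or use read_csv2(), which defaults to ";" as the separator
read_csv2("a;b\n1;3")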
11.3 Notes - Parsing a vector
Parsing vectors or files is useful for converting variables to their appropriate classes. For example, if a column of integer values was read in as character, we can convert the data back into integers using parse_integer(). Below are the provided examples of use:
str(parse_logical(c("TRUE", "FALSE", "NA")))
## logi [1:3] TRUE FALSE NA
str(parse_integer(c("1", "2", "3")))
## int [1:3] 1 2 3
str(parse_date(c("2010-01-01", "1979-10-14")))
## Date[1:2], format: "2010-01-01" "1979-10-14"
Below are further examples of the parsing functions. Using the problems() function on a parsed vector seems especially useful!
# can specify na values if present in data
parse_integer(c("1", "231", ".", "456"), na = ".")
## [1] 1 231 NA 456
x <- parse_integer(c("123", "345", "abc", "123.45"))
## Warning: 2 parsing failures.
## row col expected actual
## 3 -- an integer abc
## 4 -- no trailing characters .45
x
## [1] 123 345 NA NA
## attr(,"problems")
## # A tibble: 2 x 4
## row col expected actual
## <int> <int> <chr> <chr>
## 1 3 NA an integer abc
## 2 4 NA no trailing characters .45
problems(x)
## # A tibble: 2 x 4
## row col expected actual
## <int> <int> <chr> <chr>
## 1 3 NA an integer abc
## 2 4 NA no trailing characters .45
11.3.1 Parsing Numbers
Reading in data obtained from outside the US is also tricky, since different conventions are used to display numerical data, for example "," instead of "." to mark decimal places or groupings. The parsing functions have locale arguments that allow you to specify these marks. The examples provided in the book are below. parse_number() also ignores non-numerical symbols such as $ or %; however, parse_double() does not seem to have this feature (see the quick check after the examples).
# you can specify the decimal mark symbol if needed
parse_double("1.23")
## [1] 1.23
parse_double("1,23", locale = locale(decimal_mark = ","))
## [1] 1.23
# parse_number() ignores non-numerical symbols
parse_number("$100")
## [1] 100
parse_number("20%")
## [1] 20
parse_number("It cost $123.45")
## [1] 123.45
# You can also specify grouping marks if needed.
# Used in America
parse_number("$123,456,789")
## [1] 123456789
# Used in many parts of Europe
parse_number("123.456.789", locale = locale(grouping_mark = "."))
## [1] 123456789
# Used in Switzerland
parse_number("123'456'789", locale = locale(grouping_mark = "'"))
## [1] 123456789
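As a quick check of that last claim (a sketch; this should produce a parsing failure and return NA rather than 100):
# unlike parse_number(), parse_double() does not strip non-numeric symbols
parse_double("$100")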
11.3.2 Parsing Strings
You can use parse_character() to parse strings. Strings are stored in a particular encoding, and you can specify the encoding as an argument to parse_character() through locale(). To guess the encoding of a particular string you are parsing, you can use guess_encoding().
x1 <- "El Ni\xf1o was particularly bad this year"
x2 <- "\x82\xb1\x82\xf1\x82\xc9\x82\xbf\x82\xcd"
parse_character(x1, locale = locale(encoding = "Latin1"))
## [1] "El Niño was particularly bad this year"
parse_character(x2, locale = locale(encoding = "Shift-JIS"))
## [1] "こんにちは"
guess_encoding(charToRaw(x1))
## # A tibble: 2 x 2
## encoding confidence
## <chr> <dbl>
## 1 ISO-8859-1 0.46
## 2 ISO-8859-9 0.23
guess_encoding(charToRaw(x2))
## # A tibble: 1 x 2
## encoding confidence
## <chr> <dbl>
## 1 KOI8-R 0.42
11.3.3 Parsing Factors
To parse a vector into a factor, you can use parse_factor(), specifying the levels that you expect to see. If a value in the vector does not exist in the levels argument, a parsing failure is generated and the value becomes NA.
fruit <- c("apple", "banana")
parse_factor(c("apple", "banana", "banana"), levels = fruit)
## [1] apple banana banana
## Levels: apple banana
parse_factor(c("apple", "banana", "bananana"), levels = fruit)
## Warning: 1 parsing failure.
## row col expected actual
## 3 -- value in level set bananana
## [1] apple banana <NA>
## attr(,"problems")
## # A tibble: 1 x 4
## row col expected actual
## <int> <int> <chr> <chr>
## 1 3 NA value in level set bananana
## Levels: apple banana
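If you don't know the levels ahead of time, a minimal sketch (assuming a recent version of readr that accepts levels = NULL): the levels are then taken from the unique values of the data itself.
# levels inferred from the data (assumes readr supports levels = NULL)
parse_factor(c("apple", "banana", "banana"), levels = NULL)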
11.3.4 Parsing Dates, Date-times, and Times
There are three parsers for these purposes, which return a date, a time, or a date-time. Below are the book's examples for each of the parsers.
# date-time
# expects year, month, day (mandatory), optionally followed by hour, minute, second
parse_datetime("2010-10-01T2010")
## [1] "2010-10-01 20:10:00 UTC"
parse_datetime("20101010")
## [1] "2010-10-10 UTC"
# date - year, month, day
# expects a four-digit year, a - or /, the month, a - or /, then the day
parse_date("2010-10-01")
## [1] "2010-10-01"
# expects the hour, :, minutes, optionally : and seconds, and an optional am/pm specifier:
library(hms)
parse_time("01:10 am")
## 01:10:00
parse_time("01:10 pm")
## 13:10:00
parse_time("20:10:01")
## 20:10:01
You can also create your own date-time format. There are many components you can combine to build a format string; see ?parse_date for the options. Depending on the format you provide, different dates may be parsed from the same set of numbers (book example below).
# different dates are parsed depending on the format that you provide
parse_date("01/02/15", "%m/%d/%y")
## [1] "2015-01-02"
parse_date("01/02/15", "%d/%m/%y")
## [1] "2015-02-01"
parse_date("01/02/15", "%y/%m/%d")
## [1] "2001-02-15"
Last, as with parsing numbers, different countries may have different date formats. You can solve this by specifying the locale argument, as we did with parse_double() and parse_number().
parse_date("1 janvier 2015", "%d %B %Y", locale = locale("fr"))
## [1] "2015-01-01"
11.3.5 Exercises
1. What are the most important arguments to locale()?
If you are using locale() with parse_number(), then the most important arguments are decimal_mark and grouping_mark. For parse_character(), you should specify encoding. For parse_date(), you should specify date_names for the appropriate region.
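A few quick examples of these arguments in action (a sketch; output omitted):
# European-style number: "," as the decimal mark, "." as the grouping mark
parse_number("1.234,56", locale = locale(decimal_mark = ",", grouping_mark = "."))
# French month names via date_names
parse_date("14 juillet 1789", "%d %B %Y", locale = locale("fr"))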
2. What happens if you try and set decimal_mark and grouping_mark to the same character? What happens to the default value of grouping_mark when you set decimal_mark to “,”? What happens to the default value of decimal_mark when you set the grouping_mark to “.”?
When you try to set them to the same character, you get an error: Error: `decimal_mark` and `grouping_mark` must be different. The default grouping_mark becomes "." when decimal_mark is set to ",", and conversely the default decimal_mark becomes "," when grouping_mark is set to ".".
# parse_number("1,234.567", locale = locale(grouping_mark = '.', decimal_mark = '.')) # This Errors!
parse_number("1.234,567", locale = locale(decimal_mark = ','))
## [1] 1234.567
parse_number("1.234,567", locale = locale(grouping_mark = '.'))
## [1] 1234.567
3. I didn’t discuss the date_format and time_format options to locale(). What do they do? Construct an example that shows when they might be useful.
The date_format and time_format arguments specify the default date and time formats used by the parsing functions; the defaults are date_format = "%AD" and time_format = "%AT". From the readr vignette, for date_format: "The default value is %AD which uses an automatic date parser that recognises dates of the format Y-m-d or Y/m/d." For time_format: "The default value is %At which uses an automatic time parser that recognises times of the form H:M optionally followed by seconds and am/pm." I could see this being useful for specifying a custom date_format for American dates, which are often entered as m/d/Y instead of the default Y-m-d. The four-digit year is also often truncated to the last two digits, which would cause a parsing failure without specifying it in date_format. Below is an example.
# an American-style date is parsed incorrectly using the default date_format (parsing failure)
parse_date("05/24/18")
## Warning: 1 parsing failure.
## row col expected actual
## 1 -- date like 05/24/18
## [1] NA
# the same date is parsed correctly by specifying date_format
parse_date("05/24/18", locale = locale(date_format = "%m/%d/%y"))
## [1] "2018-05-24"
4. If you live outside the US, create a new locale object that encapsulates the settings for the types of file you read most commonly.
I live in the US, but for practice purposes let's say I move to Colombia. I might commonly have to read in old files that are latin1 encoded rather than UTF-8. The decimal mark is also "," instead of ".". Below is an example locale object for these requirements.
locale(date_names = "es", decimal_mark = ",", encoding = "latin1")
## <locale>
## Numbers: 123.456,78
## Formats: %AD / %AT
## Timezone: UTC
## Encoding: latin1
## <date_names>
## Days: domingo (dom.), lunes (lun.), martes (mar.), miércoles (mié.),
## jueves (jue.), viernes (vie.), sábado (sáb.)
## Months: enero (ene.), febrero (feb.), marzo (mar.), abril (abr.), mayo
## (may.), junio (jun.), julio (jul.), agosto (ago.),
## septiembre (sept.), octubre (oct.), noviembre (nov.),
## diciembre (dic.)
## AM/PM: a. m./p. m.
5. What’s the difference between read_csv() and read_csv2()?
Based on the documentation, read_csv2() uses semicolons (";") as separators instead of commas. read_csv2() would ideally be used when the comma is used as a decimal mark within the file, which would confuse read_csv(). Below is an example:
# messed up because "," is used as a decimal point
read_csv("a;b\n1,0;2,0")
## Warning: 1 parsing failure.
## row col expected actual file
## 1 -- 1 columns 3 columns literal data
## # A tibble: 1 x 1
## `a;b`
## <dbl>
## 1 1
# using read_csv2 fixes the problem
read_csv2("a;b\n1,0;2,0")
## Using ',' as decimal and '.' as grouping mark. Use read_delim() for more control.
## # A tibble: 1 x 2
## a b
## <dbl> <dbl>
## 1 1 2
6. What are the most common encodings used in Europe? What are the most common encodings used in Asia? Do some googling to find out.
I got this info from the ?stringi::stri_enc_detect documentation. Common encodings used in Europe are ISO-8859-1, ISO-8859-2, windows-1252, and ISO-8859-7. Common encodings used in Asia are Shift_JIS, ISO-2022-JP, ISO-2022-CN, ISO-2022-KR, GB18030, EUC-JP, and EUC-KR.
UTF-8 is widely popular now, and you can also use guess_encoding() if you are unsure which encoding to use. There is also a lot of info about encodings on Wikipedia.
7. Generate the correct format string to parse each of the following dates and times:
My answers are in the R code below. Helpful descriptions of the format string parameters can be found at ?parse_datetime.
d1 <- "January 1, 2010"
# %B matches the full month name, %e the day, and %* skips any non-digits (here ", ")
parse_date(d1, "%B%e%*%Y")
## [1] "2010-01-01"
d2 <- "2015-Mar-07"
parse_date(d2, "%Y-%b-%d")
## [1] "2015-03-07"
d3 <- "06-Jun-2017"
parse_date(d3, "%d-%b-%Y")
## [1] "2017-06-06"
d4 <- c("August 19 (2015)", "July 1 (2015)")
parse_date(d4, "%B %d (%Y)")
## [1] "2015-08-19" "2015-07-01"
d5 <- "12/30/14" # Dec 30, 2014
parse_date(d5, "%m/%d/%y")
## [1] "2014-12-30"
t1 <- "1705"
parse_time(t1, "%H%M")
## 17:05:00
t2 <- "11:15:10.12 PM"
parse_time(t2, "%I:%M:%OS %p")
## 23:15:10.12
11.4 Notes - Parsing a file
The parsers that we learned about in the previous section are automatically applied by readr when reading in a file with read_csv() or the other reading functions. readr guesses the type of data in each column from the first 1000 rows, using the heuristics exposed by guess_parser() (which returns the guessed type) and parse_guess() (which parses using that guess).
guess_parser(c("TRUE", "FALSE"))
## [1] "logical"
parse_guess("2010-10-10")
## [1] "2010-10-10"
There are usually a lot of issues when parsing a large, unorganized file. readr includes a "challenge" example that displays some of the issues that arise:
challenge <- read_csv(readr_example("challenge.csv"))
## Parsed with column specification:
## cols(
## x = col_double(),
## y = col_logical()
## )
## Warning: 1000 parsing failures.
## row col expected actual file
## 1001 y 1/0/T/F/TRUE/FALSE 2015-01-16 '/Library/Frameworks/R.framework/Versions/3.5/Resources/library/readr/extdata/challenge.csv'
## 1002 y 1/0/T/F/TRUE/FALSE 2018-05-18 '/Library/Frameworks/R.framework/Versions/3.5/Resources/library/readr/extdata/challenge.csv'
## 1003 y 1/0/T/F/TRUE/FALSE 2015-09-05 '/Library/Frameworks/R.framework/Versions/3.5/Resources/library/readr/extdata/challenge.csv'
## 1004 y 1/0/T/F/TRUE/FALSE 2012-11-28 '/Library/Frameworks/R.framework/Versions/3.5/Resources/library/readr/extdata/challenge.csv'
## 1005 y 1/0/T/F/TRUE/FALSE 2020-01-13 '/Library/Frameworks/R.framework/Versions/3.5/Resources/library/readr/extdata/challenge.csv'
## .... ... .................. .......... ............................................................................................
## See problems(...) for more details.
Since the default only looks at the first 1000 rows, we can run into issues when the data changes after row 1000. The error output helpfully displays what the function attempted to do: it parsed column x with col_double(), and column y with col_logical(), because the first 1000 values of y are all NA. The parsing failures occur because the values of y after row 1000 are dates, which do not match the logical parser. We can see more details by using problems():
problems(challenge)
## # A tibble: 1,000 x 5
## row col expected actual file
## <int> <chr> <chr> <chr> <chr>
## 1 1001 y 1/0/T/F/TRUE… 2015-01… '/Library/Frameworks/R.framework/Ver…
## 2 1002 y 1/0/T/F/TRUE… 2018-05… '/Library/Frameworks/R.framework/Ver…
## 3 1003 y 1/0/T/F/TRUE… 2015-09… '/Library/Frameworks/R.framework/Ver…
## 4 1004 y 1/0/T/F/TRUE… 2012-11… '/Library/Frameworks/R.framework/Ver…
## 5 1005 y 1/0/T/F/TRUE… 2020-01… '/Library/Frameworks/R.framework/Ver…
## 6 1006 y 1/0/T/F/TRUE… 2016-04… '/Library/Frameworks/R.framework/Ver…
## 7 1007 y 1/0/T/F/TRUE… 2011-05… '/Library/Frameworks/R.framework/Ver…
## 8 1008 y 1/0/T/F/TRUE… 2020-07… '/Library/Frameworks/R.framework/Ver…
## 9 1009 y 1/0/T/F/TRUE… 2011-04… '/Library/Frameworks/R.framework/Ver…
## 10 1010 y 1/0/T/F/TRUE… 2010-05… '/Library/Frameworks/R.framework/Ver…
## # … with 990 more rows
tail(challenge)
## # A tibble: 6 x 2
## x y
## <dbl> <lgl>
## 1 0.805 NA
## 2 0.164 NA
## 3 0.472 NA
## 4 0.718 NA
## 5 0.270 NA
## 6 0.608 NA
typeof(challenge$x)
## [1] "double"
typeof(challenge$y)
## [1] "logical"
unique(challenge$y[1:1000])
## [1] NA
Column x parses fine as doubles, but column y actually contains dates; it was guessed as logical because its first 1000 values were all NA. We can fix this by supplying an explicit column specification, setting y to col_date() (and, to be explicit, x to col_double()).
challenge <- read_csv(
readr_example("challenge.csv"),
col_types = cols(
x = col_double(),
y = col_date()
)
)
tail(challenge)
## # A tibble: 6 x 2
## x y
## <dbl> <date>
## 1 0.805 2019-11-21
## 2 0.164 2018-03-29
## 3 0.472 2014-08-04
## 4 0.718 2015-08-16
## 5 0.270 2020-02-04
## 6 0.608 2019-01-06
typeof(challenge$x)
## [1] "double"
typeof(challenge$y)
## [1] "double"
Hadley recommends always examining the output of the read_*() functions and re-specifying the column types to match what is appropriate for the data. Another strategy he describes, which is probably more straightforward if you have many, many columns of data, is to read everything in as character and then use type_convert() on the table to convert each column to its appropriate type. We can see in the example below that type_convert() properly converts column x to double and column y to date.
challenge2 <- read_csv(readr_example("challenge.csv"),
col_types = cols(.default = col_character())
)
challenge2
## # A tibble: 2,000 x 2
## x y
## <chr> <chr>
## 1 404 <NA>
## 2 4172 <NA>
## 3 3004 <NA>
## 4 787 <NA>
## 5 37 <NA>
## 6 2332 <NA>
## 7 2489 <NA>
## 8 1449 <NA>
## 9 3665 <NA>
## 10 3863 <NA>
## # … with 1,990 more rows
type_convert(challenge2)
## Parsed with column specification:
## cols(
## x = col_double(),
## y = col_date(format = "")
## )
## # A tibble: 2,000 x 2
## x y
## <dbl> <date>
## 1 404 NA
## 2 4172 NA
## 3 3004 NA
## 4 787 NA
## 5 37 NA
## 6 2332 NA
## 7 2489 NA
## 8 1449 NA
## 9 3665 NA
## 10 3863 NA
## # … with 1,990 more rows
The functions read_lines() and read_file() also seem useful for reading in the raw lines or unstructured text of a file, in order to better understand the type of data it contains. Unless you want to manipulate the strings or extract data using regexes, it might be more efficient to view the data file with less in your terminal rather than with read_file().
# use read_lines to read individual lines of the file
head(read_lines(readr_example("challenge.csv")))
## [1] "x,y" "404,NA" "4172,NA" "3004,NA" "787,NA" "37,NA"
#use read_file to read the entire file in as one string
substr(read_file(readr_example("challenge.csv")), 1, 100)
## [1] "x,y\n404,NA\n4172,NA\n3004,NA\n787,NA\n37,NA\n2332,NA\n2489,NA\n1449,NA\n3665,NA\n3863,NA\n4374,NA\n875,NA\n172,N"
11.5 Notes - Writing to a file
The functions write_csv() and write_tsv() are useful for writing data frames or tibbles in R out to files. When writing files, it is important to use UTF-8 encoding for strings and to save dates/date-times in ISO8601 format. There is also a special function for writing CSVs destined for Excel: write_excel_csv(). write_rds() saves the actual R object containing the data frame, so that if you load the .rds using read_rds() you can access the data exactly as it was at the time of saving. Think of this as using save() in base R to save an .Robj for a variable you want to keep track of. You could also use write_feather() from the feather package to save the data in a format accessible by other programming languages, or read it back into R using read_feather().
# write_csv(challenge, "challenge.csv")
# write_rds(challenge, "challenge.rds")
# library(feather)
# write_feather(challenge, "challenge.feather")
When executing the above write_csv() or write_rds() calls, the files will appear in your working directory, which, if you are using an R notebook, is conveniently where your .Rmd file is kept!
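A small round-trip sketch using temporary files: column type information survives write_rds()/read_rds(), whereas a CSV round trip has to re-guess the types from text (output omitted):
tmp_csv <- tempfile(fileext = ".csv")
tmp_rds <- tempfile(fileext = ".rds")
write_csv(challenge, tmp_csv)
write_rds(challenge, tmp_rds)
read_csv(tmp_csv) # column types are guessed again from the text
read_rds(tmp_rds) # column types restored exactly as saved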
There are also other types of files that might need to be read in. You can use haven for SPSS, Stata, and SAS files; readxl for Excel files; DBI to run SQL queries against databases (returning a data frame); jsonlite for JSON; and xml2 for XML.
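For example, minimal sketches for two of these (the file paths are hypothetical, so the calls are left commented out like the write examples above):
# library(readxl)
# read_excel("my-spreadsheet.xlsx", sheet = 1)
# library(haven)
# read_sav("my-survey.sav")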