R has specific support for UTF-8 and latin1 encoded strings. This
mostly matters for internal conversions. Thanks to this support,
you can reencode strings to UTF-8 or latin1 for internal
processing, and return these strings without having to convert them
back to the native encoding. However, it is important to make sure
the encoding mark has not been lost in the process, otherwise the
output will be treated as if encoded according to the current
locale (see mut_utf8_locale() for documentation about locale
codesets), which is not appropriate if it does not coincide with
the actual encoding. In those situations, you can use these
functions to ensure an encoding mark in your strings.
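To illustrate what an encoding mark is, here is a minimal sketch using only base R (Encoding(), enc2utf8(), rawToChar(), charToRaw()) rather than the functions documented on this page. The variable names x and y are chosen for illustration only. It builds a latin1 string from raw bytes, shows that it starts out unmarked, marks it, and then reencodes it to UTF-8 with the mark preserved.

```r
# Build "café" from raw latin1 bytes. rawToChar() does not set any
# encoding mark, so the string is treated as native-encoded.
x <- rawToChar(as.raw(c(0x63, 0x61, 0x66, 0xE9)))
Encoding(x)
#> [1] "unknown"

# Declare the actual encoding. This changes only the mark, not the bytes.
Encoding(x) <- "latin1"
Encoding(x)
#> [1] "latin1"

# Reencode to UTF-8. The result is marked as UTF-8 and the bytes change.
y <- enc2utf8(x)
Encoding(y)
#> [1] "UTF-8"
charToRaw(y)
#> [1] 63 61 66 c3 a9
```

Without the latin1 mark, enc2utf8() would interpret the bytes according to the current locale, which is the situation the paragraph above warns about.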
set_chr_encoding(x, encoding = c("unknown", "UTF-8", "latin1", "bytes"))

chr_encoding(x)

set_str_encoding(x, encoding = c("unknown", "UTF-8", "latin1", "bytes"))

str_encoding(x)
x: A string or character vector.
encoding: Either an encoding specially handled by R
# Encoding marks are always ignored on ASCII strings:
str_encoding(set_str_encoding("cafe", "UTF-8"))
#> [1] "unknown"

# You can specify the encoding of strings containing non-ASCII
# characters (0xC3 0xA9 is the UTF-8 byte sequence for "é"):
cafe <- string(c(0x63, 0x61, 0x66, 0xC3, 0xA9))
str_encoding(cafe)
#> [1] "unknown"
str_encoding(set_str_encoding(cafe, "UTF-8"))
#> [1] "UTF-8"

# It is important to consistently mark the encoding of strings
# because R and other packages perform internal string conversions
# all the time. Here is an example with the names attribute:
latin1 <- string(c(0x63, 0x61, 0x66, 0xE9), "latin1")
latin1 <- set_names(latin1)

# The names attribute is encoded in latin1 as we would expect:
str_encoding(names(latin1))
#> [1] "latin1"

# However the names are converted to UTF-8 by the c() function:
str_encoding(names(c(latin1)))
#> [1] "UTF-8"
as_bytes(names(c(latin1)))
#> [1] 63 61 66 c3 a9