R has specific support for UTF-8 and latin1 encoded strings, which mostly matters for internal conversions. Thanks to this support, you can reencode strings to UTF-8 or latin1 for internal processing and return them without converting back to the native encoding. However, make sure the encoding mark has not been lost in the process. Otherwise the output is treated as if it were encoded according to the current locale (see mut_utf8_locale() for documentation about locale codesets), which is wrong whenever the locale codeset differs from the actual encoding. In those situations, you can use these functions to ensure your strings carry an encoding mark.

set_chr_encoding(x, encoding = c("unknown", "UTF-8", "latin1", "bytes"))

chr_encoding(x)

set_str_encoding(x, encoding = c("unknown", "UTF-8", "latin1", "bytes"))

str_encoding(x)

Arguments

x

A string or character vector.

encoding

Either an encoding specially handled by R ("UTF-8" or "latin1"), "bytes" to inhibit all encoding conversions, or "unknown" if the string should be treated as encoded in the current locale codeset.

See also

mut_utf8_locale() about the effects of the locale, and as_utf8_string() about encoding conversion.

Examples

# Encoding marks are always ignored on ASCII strings:
str_encoding(set_str_encoding("cafe", "UTF-8"))
#> [1] "unknown"
# You can specify the encoding of strings containing non-ASCII
# characters. The last two bytes (0xC3 0xA9) are the UTF-8
# encoding of "é":
cafe <- string(c(0x63, 0x61, 0x66, 0xC3, 0xA9))
str_encoding(cafe)
#> [1] "unknown"
str_encoding(set_str_encoding(cafe, "UTF-8"))
#> [1] "UTF-8"
# It is important to consistently mark the encoding of strings
# because R and other packages perform internal string conversions
# all the time. Here is an example with the names attribute:
latin1 <- string(c(0x63, 0x61, 0x66, 0xE9), "latin1")
latin1 <- set_names(latin1)

# The names attribute is encoded in latin1 as we would expect:
str_encoding(names(latin1))
#> [1] "latin1"
# However the names are converted to UTF-8 by the c() function:
str_encoding(names(c(latin1)))
#> [1] "UTF-8"
as_bytes(names(c(latin1)))
#> [1] 63 61 66 c3 a9
# Bad things happen when the encoding marker is lost and R performs
# a conversion. R will assume that the string is encoded according
# to the current locale:
not_run({
  bad <- set_names(set_str_encoding(latin1, "unknown"))
  mut_utf8_locale()
  str_encoding(names(c(bad)))
  as_bytes(names(c(bad)))
})
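
For comparison, the same ideas can be sketched with base R only. This is a minimal illustration that assumes base R's Encoding(), enc2utf8(), rawToChar() and charToRaw() behave as documented. It is not a description of how the functions on this page are implemented, and the variable names are only illustrative.

```r
# Build "café" from raw latin1 bytes. Base R leaves the result unmarked.
x <- rawToChar(as.raw(c(0x63, 0x61, 0x66, 0xE9)))
Encoding(x)

# Declare the encoding. This sets the mark without changing the bytes.
Encoding(x) <- "latin1"
Encoding(x)

# Reencode to UTF-8. The bytes change and the mark follows.
y <- enc2utf8(x)
Encoding(y)
charToRaw(y)

# ASCII strings never carry an encoding mark.
a <- "cafe"
Encoding(a) <- "UTF-8"
Encoding(a)
```

Declaring an encoding and converting to one are different operations: the first only changes how R interprets the existing bytes, while the second rewrites them.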