R and foreign characters


Working with Russian characters can be mind-numbingly frustrating. This is true for R, as for other applications, so below I've written out my top five tricks for making Russian input work in R; I believe they should be transferable to most other languages.



Having forced any number of programs to accept Russian characters in the past, I have come to appreciate UTF-8 as the only sensible encoding for non-Latin scripts. R operates with UTF-8 as default, so using Russian or other foreign scripts should be straightforward, right?
Wrong. There is no end to the annoyance experienced when attempting to import data into R by appending
encoding = "utf-8"
to the end of every line. Sometimes it will work, but rarely for both the characters displayed on screen and those R writes out. So, annoyingly, characters that display as Russian in a data.frame will magically turn into gobbledygook when written to an output file, or even a plot. Infuriating. The solution is brutal in its simplicity: don't rely on R's UTF-8 handling to display characters for you. Instead, start each session in the appropriate locale, using the line
Sys.setlocale("LC_CTYPE", "russian")
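The locale name is platform-specific: "russian" is the name Windows accepts, while other systems tend to want something like "ru_RU.UTF-8". Here is a minimal sketch of a helper that tries several names and keeps the first one the operating system accepts. The function name set_ctype_locale and its list of candidates are my own illustration, not something built into R.

```r
# Hypothetical helper: try candidate locale names in order and return the
# first one the operating system accepts ("" if none of them work).
set_ctype_locale <- function(candidates = c("russian", "ru_RU.UTF-8")) {
  for (loc in candidates) {
    # Sys.setlocale() returns "" (with a warning) when the request fails
    result <- suppressWarnings(Sys.setlocale("LC_CTYPE", loc))
    if (nzchar(result)) return(result)
  }
  ""
}

active <- set_ctype_locale()
if (!nzchar(active)) message("No Russian locale available on this machine")
```

Checking the return value matters because a failed Sys.setlocale() only raises a warning, which is easy to miss at the top of a long script.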
Now that solves all the problems, right?
Almost. Often when scraping data or when taking input (e.g. through Shiny apps), strings need to be marked as UTF-8 as follows:
> Encoding(annoyingMisbehavingString) <- "UTF-8"
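Be careful with this one, though. The assignment only changes how R labels the bytes of the string; it does not convert them. Applying it, or a real conversion, to text that R already handles correctly as UTF-8 will not work well.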
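To see what the assignment actually does, here is a small sketch. It builds the UTF-8 bytes of the Russian word used later in this post from raw values, so the example does not depend on how the script itself is saved; the variable names are made up for the illustration.

```r
# UTF-8 bytes of a six-letter Russian word, built from raw values.
x <- rawToChar(as.raw(c(0xd0, 0xba, 0xd0, 0xb0, 0xd1, 0x82,
                        0xd1, 0x8b, 0xd0, 0xbd, 0xd1, 0x8c)))
Encoding(x)                      # "unknown": R has no label for the bytes yet
bytes_before <- charToRaw(x)

Encoding(x) <- "UTF-8"           # relabel only; no conversion takes place
identical(bytes_before, charToRaw(x))            # TRUE: same bytes as before
x == "\u043a\u0430\u0442\u044b\u043d\u044c"      # TRUE: now read as Cyrillic
```

If the bytes are really in another encoding, such as Windows-1251, relabelling is not enough and a real conversion with iconv() is needed. The encoding names iconv() accepts vary by platform, so check iconvlist() on your own machine.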
Finally, if you ever want to save .R scripts with non-Latin characters in them, do so with care. When you reopen the files the strings will be scrambled, for some reason not quite clear to me. If you run the script with source(), any command reliant on the non-Latin string (e.g. grep) will return errors or no hits. The solution is to use a different function altogether:
eval(parse("iPolarCalc.R", encoding = "UTF-8"))
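A self-contained way to try this out, assuming nothing beyond base R: the sketch below writes a one-line script with a Cyrillic string to a temporary file, reads it back with the same eval(parse()) call, and then uses the string in a search. The file and the variable names needle and haystack are invented for the example.

```r
# Write a tiny script containing a Cyrillic string literal as UTF-8 bytes.
script <- tempfile(fileext = ".R")
code <- "needle <- \"\u043a\u0430\u0442\u044b\u043d\u044c\""
writeLines(enc2utf8(code), script, useBytes = TRUE)

# Parse the file as UTF-8 and evaluate it in the current environment.
eval(parse(script, encoding = "UTF-8"))

# The string defined by the script can now be matched against Cyrillic text.
haystack <- "\u043a\u0430\u0442\u044b\u043d\u044c 1940"
found <- grepl(needle, haystack, fixed = TRUE)
```

source() also has an encoding argument, so source(script, encoding = "UTF-8") may be worth a try as well; I have not compared the two approaches systematically.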
And that's about it. For Windows systems at least.

======
Update: 06/02/2013
Except encoding issues never really end. Enter the latest problem: displaying Cyrillic characters with knitr.

Knitr is great. It takes R code and combines it with markdown, letting you create ready-formatted web pages with calculations and graphics generated on the fly from R. But it doesn't work properly with non-ASCII characters. The solution: don't use RStudio's built-in knit to HTML (Ctrl-Shift-H). Instead, save the Rmd file in your working directory and run these lines:
library(knitr)     # provides knit()
library(markdown)  # provides markdownToHTML()
knit("test.rmd", encoding = "utf-8")
markdownToHTML("test.md", "test.html")
browseURL(paste("file://", file.path(getwd(), "test.html"), sep = ""))

=====
Update 21/11/2013

Here's my latest discovery: you know when you have foreign characters in a URL? Chances are you didn't notice, because most browsers can handle this. Paste this into your browser, and you will get search results for the Katyn massacre:
https://www.google.co.uk/search?q=катынь

However, this is all smoke and mirrors: copy the address back out of the browser and paste it into Notepad, and you will see this:
https://www.google.co.uk/search?q=%D0%BA%D0%B0%D1%82%D1%8B%D0%BD%D1%8C

What does this have to do with R? Well, we need some way to convert the former to the latter if we want to access URLs with foreign characters in them. To do that, use curlEscape() from the RCurl package:

> curlEscape("катынь")
[1] "%D0%BA%D0%B0%D1%82%D1%8B%D0%BD%D1%8C"
Perfect.
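If you would rather not depend on RCurl, base R ships URLencode() in the utils package, which I would expect to give the same result for a single query term like this one. Treat the following as a sketch rather than a claim that the two functions agree in every case; the variable names are my own.

```r
# The search term, written with \u escapes so the example does not depend
# on the encoding this script is saved in.
term <- "\u043a\u0430\u0442\u044b\u043d\u044c"

# Percent-encode the UTF-8 bytes of the term (base R, utils package).
encoded <- utils::URLencode(enc2utf8(term), reserved = TRUE)
url <- paste0("https://www.google.co.uk/search?q=", encoded)

# Decoding returns unlabelled bytes, so they need the UTF-8 mark again.
decoded <- utils::URLdecode(encoded)
Encoding(decoded) <- "UTF-8"
round_trip_ok <- decoded == term
```

Note that the decoding step is the Encoding() trick from earlier in this post in action: the bytes come back correct, but R has to be told they are UTF-8.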

6 comments:

  1. I have an SPSS file in Russian encoding, apparently it's 1251, and I can't read it either in R or in SPSS 21.


    Sys.setlocale("LC_CTYPE", "russian") doesn't work on my Mac machine for some reason. Is there any other way of solving this issue? Or, perhaps, there is something that I'm not doing right?

    1. Hi Valery, the short answer is I don't know, because I don't know how to use a mac. But this post seems to have something that may be of interest:
      http://stackoverflow.com/questions/17031002/get-weekdays-in-english-in-rstudio
      I would guess you are looking for "ru_RU.UTF-8". Best, R

    2. Sys.setlocale("LC_CTYPE", "ru_RU.UTF-8")

      worked like a charm!

    3. Hi Rolf, I have to deal with Vietnamese data and I would like to set locale in R to be "en_US.UTF-8" but it doesn't work.
      My code is: Sys.setlocale(category="LC_ALL", locale = "en_US.UTF-8")
      However, the warning message in console is:
      In Sys.setlocale(category = "LC_ALL", locale = "en_US.UTF-8") :
      OS reports request to set locale to "en_US.UTF-8" cannot be honored
      And the locale did not change to "en_US.UTF-8"
      I have tried several ways but nothing worked. Could you help me to set my locale to be "en_US.UTF-8"?

  2. data_heatmap can't handle Russian letters as well, see https://github.com/yihui/knitr/issues/436#issuecomment-32781891
    Oh, it's bad, feel like back in 1999. Windows did not want to change my world after seeing Sys.setlocale("LC_CTYPE", "russian")

  3. Hi Rolf! Thank you very much for your useful advice on dealing with Russian characters in the R programming language. You saved me a lot of time on this matter. But one never knows what to expect from text in Cyrillic.
