Top Seven Tips for Processing 'Foreign' Text in Python (2.7)

Following on from my guide to making R play nice with UTF-8, here is a seven-step guide to understanding Python's handling of unicode. Trust me, if you work with non-Latin characters, you need to know this stuff.

As ever the usual caveats: I have no idea whether any of this works for Python 3, or on a Mac.

1) Reading files in:
Make sure your files are saved in UTF-8 format. Then use this function to read them in correctly:

Use this function as follows:
text = file_contents('my_file.txt')
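
A minimal sketch of what such a file_contents function might look like, assuming it does nothing more than open the file and decode it as UTF-8. The body is a guess rather than the original gist, and io.open is my choice because it behaves the same way in Python 2.7 and 3:

```python
# -*- coding: utf-8 -*-
import io


def file_contents(path, encoding='utf-8'):
    """Return the whole file at `path` as decoded (unicode) text."""
    # io.open decodes while reading; 'with' closes the file afterwards.
    with io.open(path, 'r', encoding=encoding) as f:
        return f.read()
```

The result is a unicode object, so everything downstream works on characters rather than raw bytes.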

2) Printing text in the console:
Use 'print'. Echoing a variable shows its escaped representation; 'print' shows the actual characters. Thus:
>>> word
>>> print word

3) Using UTF-8 in a script:
As the example above illustrates, unicode text should be indicated by a 'u' preceding the single or double quotes containing your string. This makes the type 'unicode' rather than a plain byte string ('str'):
>>> type(u'enter something here')
<type 'unicode'>

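To make sure your script is recognised as UTF-8, though, add this as the first or second line of your script:
# -*- coding: utf-8 -*-
This tells Python that the source file itself is saved as UTF-8, so non-ASCII characters inside string literals are read correctly.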
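
A small sketch of the difference between the two string types, assuming a script saved as UTF-8 (the variable names are mine). The unicode string counts characters, while its encoded form counts bytes:

```python
# -*- coding: utf-8 -*-
text = u'слово'               # unicode literal: five characters
data = text.encode('utf-8')   # byte string: two bytes per Cyrillic letter

print(len(text))   # 5
print(len(data))   # 10
```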

4) Saving output:
Use the 'encode' method as follows:

This gives you beautiful unicode output.
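
A minimal sketch of saving with 'encode', assuming the text is already a unicode object. The helper name save_text is hypothetical, not taken from the original post:

```python
# -*- coding: utf-8 -*-
def save_text(path, text, encoding='utf-8'):
    """Encode unicode `text` and write the resulting bytes to `path`."""
    # Binary mode, so exactly the encoded bytes end up in the file.
    with open(path, 'wb') as f:
        f.write(text.encode(encoding))
```

Used as save_text('my_output.txt', text), where 'my_output.txt' is just a placeholder file name.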

5) Entering characters from the console:
This one can be a pain. First make sure that your operating system is set up to display 'special characters' correctly. For Windows, do this:

Now here is the trick: characters you enter on the console are already encoded. Python needs to *decode* them rather than encode. Consider:

>>> "слово"

>>> u'слово'

>>> "слово".encode("UTF-8")

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf1 in position 0: ordinal not in range(128)

>>> u"слово".encode("UTF-8")

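Basically there's any number of exciting errors. The one above happens because calling 'encode' on a byte string makes Python 2 first decode it implicitly as ASCII, which fails on the very first Cyrillic byte.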

To enter text correctly, you need to know the encoding for stdin (this is Unix speak for 'standard input'):
>>> import sys
>>> sys.stdin.encoding
'cp1251'

Eureka. In brief, this means whatever I type into the console is encoded using 'cp1251'. According to Wikipedia: 'Windows-1251 (a.k.a. code page CP1251) is a popular 8-bit character encoding, designed to cover languages that use the Cyrillic script such as Russian, Bulgarian, Serbian Cyrillic and other languages.' So because my system locale is set to display Russian characters, this is the encoding used by default. If you want to display a German umlaut, your system default will probably be different.

But, back to entering input. Do this:
>>> "слово".decode("cp1251")
>>> print "слово".decode("cp1251")

And voila.
Or voilà - as the cool kids with the correct unicode setup say
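
A rough sketch of the same idea as a reusable function, assuming the input arrives as raw bytes. The helper name decode_console_input and the UTF-8 fallback are my own choices:

```python
# -*- coding: utf-8 -*-
import sys


def decode_console_input(raw, encoding=None):
    """Decode bytes typed at the console into unicode text."""
    if encoding is None:
        # sys.stdin.encoding can be None (e.g. when input is piped),
        # so fall back to UTF-8 in that case.
        encoding = getattr(sys.stdin, 'encoding', None) or 'utf-8'
    return raw.decode(encoding)


# The five cp1251 bytes of the word used above; 0xf1 is the byte
# the traceback complained about.
word = decode_console_input(b'\xf1\xeb\xee\xe2\xee', 'cp1251')
```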

6) Piping unicode output:

Imagine this scenario: doing stuff with text is faster in Python, but you like to analyse results in R. Consequently you sometimes execute a bit of Python from R using the system command. For instance, I use a SQL database which I access through Python. I have a lookup function which I can run from R, and which returns the content of the desired file. The R code is like this:

Here 'path' is the id of the file I want fetched from the database. intern=T means that R captures whatever the Python script prints. But what does it see? Bytes in some encoding. The problem is that if the script prints UTF-8 while the receiving end expects its own encoding (as we have seen, in my case 'CP1251'), the UTF-8 bytes get read as CP1251. This makes for a bad combination, and my lookup function returns something like this:
рети армии его тылы были атакованы русской легкой конницей казаками и калмыками | Сражение Карл проиграл и бежал в Османскую империю | Дмитрий Табачник народный депутат Украины доктор исторических наук Киев "

Yuck. Instead, you guessed it, use something like the following in your script:
output.encode("cp1251"). This gives perfectly formatted output.
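
A minimal sketch of that step, assuming a helper that picks the target encoding (the name encode_for_stdout is mine). When output is captured by another program, Python 2 typically reports no encoding for stdout, so passing the encoding explicitly is probably the safer route:

```python
import sys


def encode_for_stdout(text, encoding=None):
    """Encode unicode `text` in the encoding the receiving end expects."""
    if encoding is None:
        # Fall back to whatever stdout reports, or UTF-8 if it reports
        # nothing (as Python 2 does when output is piped).
        encoding = getattr(sys.stdout, 'encoding', None) or 'utf-8'
    # 'replace' swaps characters the target encoding lacks for '?'
    # instead of raising UnicodeEncodeError.
    return text.encode(encoding, 'replace')
```

In a 2.7 script this would be used as: print encode_for_stdout(output, 'cp1251').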

7) And finally: Beware of capitalisation. The code points for capital and lower-case letters are different. See here for an example with everyone's favourite Russian character, 'Ya':

>>> "я".decode("cp1251")
>>> "Я".decode("cp1251")

Consequently, never assume your application thinks these are the same. For instance, pymongo (the Python wrapper for the popular NoSQL database MongoDB) is case-insensitive for Latin characters. But, as this individual found out, the capability does not at the moment extend to all unicode characters. So remember: what you read in the English language documentation may be irrelevant or wrong. Always check.
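
A short sketch of checking this yourself, assuming you normalise case on unicode strings before comparing them (the variable names are mine):

```python
# -*- coding: utf-8 -*-
lower_ya = b'\xff'.decode('cp1251')   # cp1251 byte for lower-case ya
upper_ya = b'\xdf'.decode('cp1251')   # cp1251 byte for capital Ya

# Distinct code points, so a plain comparison treats them as different.
print(lower_ya == upper_ya)           # False

# Lower-casing the unicode strings first gives a case-insensitive match.
print(lower_ya == upper_ya.lower())   # True
```

Note that the lower-casing is done on decoded unicode, not on the raw bytes.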

To be continued.


  1. Just a small thing: your gist uses 'with' like it's in R (as a temporary scope adjustment) but actually it does more in python. Specifically, it looks after final file closing. So your statement could be:

    with, encoding="utf-8") as f:

    Also, since you were wondering, codecs is mostly redundant in this context in Python 3 because open takes an encoding argument.

    1. Oh cool - I never knew 'with' worked like that in R =) Thanks for the comment


  2. Thank you very much, Rolf, I found your article very useful, you helped me a lot