Top Seven Tips for Processing 'Foreign' Text in Python (2.7)


Following on from my guide to making R play nice with UTF-8, here is a seven-step guide to understanding Python's handling of unicode. Trust me, if you work with non-Latin characters, you need to know this stuff:



As ever, the usual caveats apply: I have no idea whether any of this works for Python 3, or on a Mac.


1) Reading files in:
Make sure your files are saved in UTF-8 format. Then use this function to read them in correctly:
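A minimal sketch, using the codecs module (the same approach the comments below suggest):

import codecs

def file_contents(file_name):
    # codecs.open decodes the UTF-8 bytes as it reads, so you get unicode back
    with codecs.open(file_name, encoding="utf-8") as f:
        return f.read()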




Use this function as follows:
text = file_contents('my_file.txt')


2) Printing text in the console:
Use 'print'. Thus:
>>> word = u'слово'
>>> word
u'\u0441\u043b\u043e\u0432\u043e'
>>> print word
слово


3) Using utf-8 in a script:
As the example above illustrates, unicode text should be indicated by a 'u' preceding the single or double quotes containing your string. This makes the type 'unicode':
>>> type(u'enter something here')
<type 'unicode'>

To make sure your script is recognised as UTF-8, though, add this to the start of your script:
# -*- coding: utf-8 -*-
This tells Python to expect UTF-8 characters in the script itself.
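By way of illustration, a minimal script might look like this (save the file itself as UTF-8, of course):

# -*- coding: utf-8 -*-
word = u'слово'  # the coding declaration lets Python read this literal correctly
print word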


4) Saving output: Use the 'encode' method as follows:
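A minimal sketch, where 'output.txt' is just a placeholder name:

text = u'слово'  # whatever unicode string you produced earlier
out_file = open('output.txt', 'w')
out_file.write(text.encode('utf-8'))  # turn unicode into UTF-8 bytes before writing
out_file.close()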



This gives you beautiful unicode output.


5) Entering characters from the console: This one can be a pain. First make sure that your operating system is set up to display 'special characters' correctly. For Windows, do this:
http://windows.microsoft.com/en-gb/windows-vista/change-the-system-locale

Now here is the trick: characters you enter on the console are already encoded. Python needs to *decode* them rather than encode. Consider:

>>> "слово"
'\xf1\xeb\xee\xe2\xee'

>>> u'слово'
u'\xf1\xeb\xee\xe2\xee'

>>> "слово".encode("UTF-8")

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf1 in position 0: ordinal not in range(128)

>>> u"слово".encode("UTF-8")
'\xc3\xb1\xc3\xab\xc3\xae\xc3\xa2\xc3\xae'

Basically, there is any number of exciting ways for this to go wrong: calling 'encode' on a byte string makes Python first decode it with the default ASCII codec (hence the UnicodeDecodeError above), and the unicode literal was already garbled on input.

To enter text correctly, you need to know the encoding of stdin (Unix speak for 'standard input'):
>>> import sys
>>> sys.stdin.encoding
'cp1251'

Eureka. In brief, this means whatever I type into the console is encoded using 'cp1251'. According to Wikipedia: 'Windows-1251 (a.k.a. code page CP1251) is a popular 8-bit character encoding, designed to cover languages that use the Cyrillic script such as Russian, Bulgarian, Serbian Cyrillic and other languages.' So because my system locale is set to display Russian characters, this is the encoding used by default. If you want to display a German umlaut, your system default will probably be different.

But, back to entering input. Do this:
>>> "слово".decode("cp1251")
u'\u0441\u043b\u043e\u0432\u043e'
>>> print "слово".decode("cp1251")
слово

And voila.
Or voilà - as the cool kids with the correct unicode setup say.
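More generally, here is a minimal sketch for reading console input into unicode using whatever encoding stdin reports, rather than hard-coding cp1251:

import sys

raw = raw_input('Enter some text: ')   # a byte string in the console's encoding
text = raw.decode(sys.stdin.encoding)  # decode those bytes into unicode
print text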


6) Piping unicode output:

FYI: http://en.wikipedia.org/wiki/Pipeline_(Unix)
Imagine this scenario: doing stuff with text is faster in Python, but you like to analyse results in R. Consequently you sometimes execute a bit of Python from R using the system command. For instance, I use a SQL database which I access through Python. I have a lookup function which I can run from R, which returns the content of the desired file. The R code is like this:
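Something along these lines (a minimal sketch - the script name 'lookup.py' and its command-line interface are placeholders):

fetch_text <- function(path) {
  # run the Python lookup script and capture whatever it prints
  system(paste("python lookup.py", path), intern = T)
}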



Here 'path' is the id of the file I want fetched from the database. intern=T means that the R console will record the output it sees printed by the Python script. But what does it see? Characters in some encoding or other. Now the problem here is that if you print something as UTF-8, whatever reads stdout (standard output) will interpret the bytes in its own encoding - as we have seen, in my case cp1251. This makes for a bad combination, and my lookup function returns something like this:
рети армии его тылы были атакованы русской легкой конницей казаками и калмыками | Сражение Карл проиграл и бежал в Османскую империю | Дмитрий Табачник народный депутат Украины доктор исторических наук Киев "

Yuck. Instead, you guessed it, encode the output to match the console encoding in your script:

print output.encode("cp1251")

This gives perfectly formatted output.


7) And finally: Beware of capitalisation. The unicode code points for capital and lower-case letters are different - see below for an example with everyone's favourite Russian character, 'Ya':

>>> "я".decode("cp1251")
u'\u044f'
>>> "Я".decode("cp1251")
u'\u042f'
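
If you need case-insensitive matching yourself, one approach is to lower-case the unicode strings before comparing:

>>> u'Я'.lower() == u'я'
True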

Consequently, never assume your application thinks these are the same. For instance, pymongo - the Python wrapper to the popular NoSQL database MongoDB - is case-insensitive for Latin characters. But, as this individual found out, the capability does not at the moment extend to all unicode characters. So remember: what you read in the English-language documentation may be irrelevant or wrong. Always check.



To be continued.

Comments:

  1. Just a small thing: your gist uses 'with' like it's in R (as a temporary scope adjustment) but actually it does more in python. Specifically, it looks after final file closing. So your statement could be:

    with codecs.open(file_name, encoding="utf-8") as f:
        return f.read()

    Also, since you were wondering, codecs is mostly redundant in this context in Python 3 because open takes an encoding argument.

    Reply: Oh cool - I never knew 'with' worked like that in R =) Thanks for the comment

