Top Seven Tips for Processing 'Foreign' Text in Python (2.7)

Following on from my guide to making R play nice with UTF-8, here is a seven-step guide to understanding Python's handling of unicode. Trust me, if you work with non-Latin characters, you need to know this stuff.

As ever, the usual caveats apply: I have no idea whether any of this works for Python 3, or on a Mac.

1) Reading files in:
Make sure your files are saved in UTF-8 format. Then use this function to read them in correctly:
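The function itself is not reproduced here, so what follows is only a minimal sketch of what such a reader might look like. The name file_contents comes from the usage line below; the body, and the choice of codecs.open, are my assumptions.

```python
# -*- coding: utf-8 -*-
import codecs


def file_contents(path, encoding="utf-8"):
    """Return the whole file as a unicode string (assumed implementation)."""
    # codecs.open decodes the bytes while reading, so the caller gets
    # unicode text back. The 'with' block closes the file afterwards,
    # even if reading raises an error.
    with codecs.open(path, "r", encoding=encoding) as f:
        return f.read()
```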

Use this function as follows:
text = file_contents('my_file.txt')

2) Printing text in the console:
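Use 'print'. Typing a variable name on its own shows its escaped representation; 'print' shows the characters themselves. Thus: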
>>> word
>>> print word
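The console output is missing above, so here is a rough sketch of the difference. The value of 'word' is my assumption (the Russian for 'word', which is used later on). The escaped form is what Python 2.7 echoes when you type the bare name, wrapped in u'...'.

```python
# -*- coding: utf-8 -*-

# Assumed example value: the five Cyrillic letters of 'слово'.
word = u"\u0441\u043b\u043e\u0432\u043e"

# Build the escaped form that the Python 2.7 prompt shows for a bare name.
escaped = word.encode("unicode_escape").decode("ascii")
print(escaped)    # \u0441\u043b\u043e\u0432\u043e

# 'print word' would instead write the five letters themselves,
# provided the console encoding can represent Cyrillic.
print(len(word))  # 5
```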

3) Using UTF-8 in a script:
As the example above illustrates, unicode text is indicated by a 'u' preceding the single or double quotes containing your string. This makes the type 'unicode' rather than a plain byte string ('str'):
>>> type(u'enter something here')
<type 'unicode'>

To make sure your script is recognised as UTF-8, though, add this to the start of your script:
# -*- coding: utf-8 -*-
This tells Python that the source file itself is encoded in UTF-8, so non-ASCII characters inside string literals are read correctly.
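A small sketch of a script that combines the two points: the coding declaration on top and a 'u' literal below it. The variable names are made up for illustration.

```python
# -*- coding: utf-8 -*-
# The declaration above has to sit on the first or second line of the file.

word = u"слово"             # unicode literal: five characters
raw = word.encode("utf-8")  # byte string: two bytes per Cyrillic letter

print(len(word))  # 5
print(len(raw))   # 10
```

The different lengths are a handy reminder that a unicode string and its UTF-8 bytes are not the same thing.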

4) Saving output:
Use the 'encode' method as follows:

This gives you beautiful unicode output.
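The snippet that belongs here is not shown, so this is only a sketch of one way to do it. save_text is a made-up name, and writing the encoded bytes in binary mode is my assumption about the approach.

```python
# -*- coding: utf-8 -*-


def save_text(text, path, encoding="utf-8"):
    """Write a unicode string to 'path' as encoded bytes (hypothetical helper)."""
    # Binary mode means exactly the bytes produced by encode() end up in
    # the file, with no further conversion on the way.
    with open(path, "wb") as f:
        f.write(text.encode(encoding))
```

Called as save_text(text, 'my_output.txt'), this should give a file that any UTF-8 aware editor opens correctly.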

5) Entering characters from the console:
This one can be a pain. First make sure that your operating system is set up to display 'special characters' correctly. For Windows, do this:

Now here is the trick: characters you enter on the console are already encoded. Python needs to *decode* them rather than encode. Consider:

>>> "слово"

>>> u'слово'

>>> "слово".encode("UTF-8")

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf1 in position 0: ordinal not in range(128)

>>> u"слово".encode("UTF-8")

Basically there's any number of exciting errors.

To enter text correctly, you need to know the encoding for stdin (Unix speak for 'standard input'):
>>> import sys
>>> sys.stdin.encoding
'cp1251'

Eureka. In brief, this means whatever I type into the console is encoded using 'cp1251'. According to Wikipedia: 'Windows-1251 (a.k.a. code page CP1251) is a popular 8-bit character encoding, designed to cover languages that use the Cyrillic script such as Russian, Bulgarian, Serbian Cyrillic and other languages.' So because my system locale is set to display Russian characters, this is the encoding used by default. If you want to display a German umlaut, your system default will probably be different.

But, back to entering input. Do this:
>>> "слово".decode("cp1251")
>>> print "слово".decode("cp1251")

And voila.
Or voilà - as the cool kids with the correct unicode setup say
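The decoding step can be wrapped up so that the encoding is looked up rather than hard-coded. This is only a sketch: decode_console_bytes is a made-up helper, and the byte string stands in for what a cp1251 console hands to Python 2.7 when you type the word above.

```python
# -*- coding: utf-8 -*-
import sys


def decode_console_bytes(raw, encoding=None):
    """Turn bytes typed at the console into unicode (hypothetical helper)."""
    # sys.stdin.encoding can be None, for instance when input is piped,
    # so fall back to UTF-8 in that case (an assumption, adjust to taste).
    encoding = encoding or getattr(sys.stdin, "encoding", None) or "utf-8"
    return raw.decode(encoding)


# The five cp1251 bytes for the word typed above; note the leading 0xf1
# that the traceback complained about.
typed = b"\xf1\xeb\xee\xe2\xee"
word = decode_console_bytes(typed, "cp1251")
print(word == u"\u0441\u043b\u043e\u0432\u043e")  # True
```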

6) Piping unicode output:

Imagine this scenario: doing stuff with text is faster in Python, but you like to analyse results in R. Consequently you sometimes execute a bit of Python from R using the system command. For instance, I use a SQL database which I access through Python. I have a lookup function which I can run from R, which returns the content of the desired file. The R code is like this:

Here 'path' is the id of the file I want fetched from the database. intern=T means that the R console will record the output it sees printed by the Python script. But what does it see? Characters with some sort of encoding. Now the problem here is that if you print something as UTF-8, whatever reads stdout (standard output) interprets those bytes using its own encoding - as we have seen, in my case 'CP1251'. This makes for a bad combination, and my lookup function returns something like this:
рети армии его тылы были атакованы русской легкой конницей казаками и калмыками | Сражение Карл проиграл и бежал в Османскую империю | Дмитрий Табачник народный депутат Украины доктор исторических наук Киев "

Yuck. Instead, you guessed it, use something like the following in your script:
output.encode("cp1251"). This gives perfectly formatted output.
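Spelled out a little more, the idea might look like the sketch below. Both function names are made up, and the 'replace' error handler (which substitutes '?' for anything cp1251 cannot represent) is my addition, not part of the original script.

```python
# -*- coding: utf-8 -*-
import sys


def encode_for_pipe(text, encoding="cp1251"):
    """Encode unicode text for whoever reads our stdout (hypothetical helper)."""
    # 'replace' swaps unencodable characters for '?' instead of raising.
    return text.encode(encoding, "replace")


def write_bytes(data, stream=None):
    """Write raw bytes to a stream, standard output by default."""
    if stream is None:
        # Python 3 keeps the byte stream in sys.stdout.buffer;
        # in Python 2.7 sys.stdout accepts byte strings directly.
        stream = getattr(sys.stdout, "buffer", sys.stdout)
    stream.write(data)
```

In the script called from R, the last line would then be something like write_bytes(encode_for_pipe(output)).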

7) And finally: Beware of capitalisation. The unicode code points for capital and lower-case letters are different - see here for an example with everyone's favourite Russian character, 'Ya':

>>> "я".decode("cp1251")
>>> "Я".decode("cp1251")

Consequently, never assume your application thinks these are the same. For instance, pymongo, the Python wrapper to the popular NoSQL database MongoDB, is case-insensitive for Latin characters. But, as this individual found out, the capability does not at the moment extend to all unicode characters. So remember: what you read in the English language documentation may be irrelevant or wrong. Always check.
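If you need case-insensitive matching yourself, one option is to normalise the case on unicode strings before comparing. A minimal sketch, with a made-up helper name:

```python
# -*- coding: utf-8 -*-

lower_ya = u"\u044f"  # lower-case 'Ya'
upper_ya = u"\u042f"  # capital 'Ya'


def same_ignoring_case(a, b):
    """Compare two unicode strings case-insensitively (hypothetical helper)."""
    # lower() on unicode strings knows about Cyrillic; on raw byte
    # strings it cannot be relied on for anything beyond ASCII.
    return a.lower() == b.lower()


print(lower_ya == upper_ya)                    # False
print(same_ignoring_case(lower_ya, upper_ya))  # True
```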

To be continued.


  1. Just a small thing: your gist uses 'with' like it's in R (as a temporary scope adjustment) but actually it does more in python. Specifically, it looks after final file closing. So your statement could be:

    with, encoding="utf-8") as f:

    Also, since you were wondering, codecs is mostly redundant in this context in Python 3 because open takes an encoding argument.

    1. Oh cool - I never knew 'with' worked like that in R =) Thanks for the comment


  2. Thank you very much, Rolf, I found your article very useful, you helped me a lot
