Top Seven Tips for Processing 'Foreign' Text in Python (2.7)


Following on from my guide to making R play nice with UTF-8, here is a seven-step guide to understanding Python's handling of Unicode. Trust me, if you work with non-Latin characters, you need to know this stuff.



As ever, the usual caveats apply: I have no idea whether any of this works for Python 3, or on a Mac.


1) Reading files in:
Make sure your files are saved in UTF-8 format. Then use this function to read them in correctly:
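A minimal sketch of such a function, not necessarily the author's original code. It assumes the name `file_contents` from the usage line below and uses `io.open`, which behaves the same way in Python 2.7 and 3; the first comment under this post suggests the original used `codecs.open` instead.

```python
# -*- coding: utf-8 -*-
import io


def file_contents(file_name, encoding="utf-8"):
    """Return the whole file as a unicode string (assumes the file is UTF-8)."""
    # io.open decodes the bytes while reading; 'with' closes the file afterwards.
    with io.open(file_name, "r", encoding=encoding) as f:
        return f.read()
```

The `encoding` parameter is an addition for illustration, so that files saved in another encoding can be read with the same function.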




Use this function as follows:
text = file_contents('my_file.txt')


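2) Printing text in the console:
Use 'print'. Typing the bare variable name shows the escaped representation; 'print' shows the actual characters. Thus:
>>> word = u'слово'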
>>> word
u'\u0441\u043b\u043e\u0432\u043e'
>>> print word
слово


3) Using Unicode in a script:
As the example above illustrates, Unicode text should be indicated by a 'u' preceding the single or double quotes containing your string. This makes the type 'unicode' rather than the byte-string type 'str':
>>> type(u'enter something here')
<type 'unicode'>

To make sure your script is recognised as UTF-8, though, add this to the start of your script:
# -*- coding: utf-8 -*-
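This tells Python that the source file itself is encoded in UTF-8, so non-ASCII characters in your string literals are read correctly. The line must be the first or second line of the script.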
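The coding line only affects how the source file is read; unicode text and UTF-8 bytes remain two different things. A small sketch of the difference (the variable names are only illustrative): the same word is five characters as unicode but ten bytes once encoded as UTF-8, because each Cyrillic letter takes two bytes.

```python
# -*- coding: utf-8 -*-
word = u'слово'                     # unicode text: a sequence of code points
utf8_bytes = word.encode('utf-8')   # the same text serialised as UTF-8 bytes

print(len(word))                           # 5
print(len(utf8_bytes))                     # 10
print(utf8_bytes.decode('utf-8') == word)  # True
```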


4) Saving output:
Use the 'encode' method as follows:
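A minimal sketch of the idea, not necessarily the author's original snippet. It assumes a hypothetical helper called `save_text` and UTF-8 as the target encoding:

```python
# -*- coding: utf-8 -*-
def save_text(text, file_name, encoding="utf-8"):
    """Encode a unicode string and write the resulting bytes to file_name."""
    # Open in binary mode: the file receives exactly the bytes we encoded.
    with open(file_name, "wb") as f:
        f.write(text.encode(encoding))
```

Called as `save_text(u'слово', 'my_output.txt')`, where the file name is just an example.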



This gives you beautiful unicode output.


5) Entering characters from the console:
This one can be a pain. First make sure that your operating system is set up to display 'special characters' correctly. For Windows, do this:
http://windows.microsoft.com/en-gb/windows-vista/change-the-system-locale

Now here is the trick: characters you enter on the console are already encoded. Python needs to *decode* them rather than encode. Consider:

>>> "слово"
'\xf1\xeb\xee\xe2\xee'

>>> u'слово'
u'\xf1\xeb\xee\xe2\xee'

>>> "слово".encode("UTF-8")

Traceback (most recent call last):
File "", line 1, in
"слово".encode("UTF-8")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf1 in position 0: ordinal not in range(128)

>>> u"слово".encode("UTF-8")
'\xc3\xb1\xc3\xab\xc3\xae\xc3\xa2\xc3\xae'

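Basically there's any number of exciting errors. The third attempt fails because calling encode on a byte string makes Python 2 first decode it implicitly as ASCII. The fourth runs without complaint, but the result is the UTF-8 encoding of the wrong characters ('ñëîâî' rather than 'слово').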

To enter text correctly, you need to know the encoding for stdin (this is Unix speak for 'standard input'):
>>> import sys
>>> sys.stdin.encoding
'cp1251'

Eureka. In brief, this means whatever I type into the console is encoded using 'cp1251'. According to Wikipedia: 'Windows-1251 (a.k.a. code page CP1251) is a popular 8-bit character encoding, designed to cover languages that use the Cyrillic script such as Russian, Bulgarian, Serbian Cyrillic and other languages.' So because my system locale is set to display Russian characters, this is the encoding used by default. If you want to display a German umlaut, your system default will probably be different.

But, back to entering input. Do this:
>>> "слово".decode("cp1251")
u'\u0441\u043b\u043e\u0432\u043e'
>>> print "слово".decode("cp1251")
слово

And voila.
Or voilà - as the cool kids with the correct unicode setup say
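Rather than hard-coding 'cp1251', the same step can ask stdin for its encoding. A hedged sketch: `decode_console_bytes` is a hypothetical helper name, and the UTF-8 fallback is an assumption for the case where stdin reports no encoding.

```python
# -*- coding: utf-8 -*-
import sys


def decode_console_bytes(raw, encoding=None):
    """Turn bytes typed at the console into a unicode string."""
    if encoding is None:
        # sys.stdin.encoding can be None (e.g. when input is piped);
        # falling back to UTF-8 is an assumption, not a guarantee.
        encoding = getattr(sys.stdin, "encoding", None) or "utf-8"
    return raw.decode(encoding)


# The cp1251 bytes from the example above decode back to the Russian word.
print(decode_console_bytes(b'\xf1\xeb\xee\xe2\xee', 'cp1251') == u'слово')  # True
```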


6) Piping unicode output:

FYI: http://en.wikipedia.org/wiki/Pipeline_(Unix)
Imagine this scenario: doing stuff with text is faster in Python, but you like to analyse results in R. Consequently you sometimes execute a bit of Python from R using the 'system' command. For instance, I use a SQL database which I access through Python. I have a lookup function which I can run from R, which returns the content of the desired file. The R code is like this:



Here 'path' is the id of the file I want fetched from the database. intern=T means that R captures whatever the Python script prints. But what does it capture? Bytes in some encoding. The problem is that if the script prints UTF-8 bytes, the receiving side interprets them using the system encoding (as we have seen, in my case 'CP1251'). This makes for a bad combination, and my lookup function returns something like this:
рети армии его тылы были атакованы русской легкой конницей казаками и калмыками | Сражение Карл проиграл и бежал в Османскую империю | Дмитрий Табачник народный депутат Украины доктор исторических наук Киев "

Yuck. Instead, you guessed it, use something like the following in your script:
output.encode("cp1251")
This gives perfectly formatted output.
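Hard-coding "cp1251" only works on a machine set up like mine. A hedged sketch of a more portable version: `encode_for_pipe` is a hypothetical helper name, and falling back to `locale.getpreferredencoding()` when stdout reports no encoding (as happens in Python 2.7 when output is piped) is an assumption. Whether that matches what the reading program expects is something to check on your own system.

```python
# -*- coding: utf-8 -*-
import locale
import sys


def encode_for_pipe(text, encoding=None):
    """Encode unicode text into the bytes the receiving program expects."""
    if encoding is None:
        # Assumption: use stdout's encoding if it has one, else the locale's.
        encoding = (getattr(sys.stdout, "encoding", None)
                    or locale.getpreferredencoding())
    # "replace" turns characters the code page cannot represent into "?"
    # instead of raising UnicodeEncodeError.
    return text.encode(encoding, "replace")
```

In a Python 2.7 script the result can be written out with `sys.stdout.write(encode_for_pipe(output))`.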


7) And finally: Beware of capitalisation. The Unicode code points for capital and lower-case letters are different - see here for an example with everyone's favourite Russian character, 'Ya':

>>> "я".decode("cp1251")
u'\u044f'
>>> "Я".decode("cp1251")
u'\u042f'

Consequently, never assume your application thinks these are the same. For instance, pymongo - the Python wrapper to the popular NoSQL database MongoDB - is case-insensitive for Latin characters. But, as this individual found out, the capability does not at the minute extend to all Unicode characters. So remember: what you read in the English language documentation may be irrelevant or wrong. Always check.
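If you need a case-insensitive comparison yourself, one simple approach is to lower-case both sides first. A minimal sketch, assuming both arguments are unicode strings (`same_ignoring_case` is a hypothetical helper name); in Python 2.7, calling lower() on byte strings is not reliable for non-ASCII letters, so decode first.

```python
# -*- coding: utf-8 -*-
def same_ignoring_case(a, b):
    """Compare two unicode strings without regard to case."""
    return a.lower() == b.lower()


print(u'Я' == u'я')                    # False
print(same_ignoring_case(u'Я', u'я'))  # True
```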



To be continued.

84 comments:

  1. Just a small thing: your gist uses 'with' like it's in R (as a temporary scope adjustment) but actually it does more in python. Specifically, it looks after final file closing. So your statement could be:

    with codecs.open(file_name, encoding="utf-8") as f:
        return f.read()

    Also, since you were wondering, codecs is mostly redundant in this context in Python 3 because open takes an encoding argument.

    Replies
    1. Oh cool - I never knew 'with' worked like that in R =) Thanks for the comment

