Top Seven Tips for Processing 'Foreign' Text in Python (2.7)


Following on from my guide to making R play nice with UTF-8, here is a seven-step guide to understanding Python's handling of unicode. Trust me, if you work with non-Latin characters, you need to know this stuff:



As ever, the usual caveats: I have no idea whether any of this works for Python 3, or on a Mac.


1) Reading files in:
Make sure your files are saved in UTF-8 format. Then use this function to read them in correctly:
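A minimal sketch of such a function, assuming the file really is UTF-8 and using codecs.open. The name file_contents matches the usage below; the body and the 'encoding' parameter are an illustration, not necessarily the original gist:

```python
# -*- coding: utf-8 -*-
import codecs


def file_contents(file_name, encoding="utf-8"):
    """Return the whole file as unicode text (assumes the file is UTF-8)."""
    # codecs.open decodes the bytes to unicode while reading;
    # the 'with' block closes the file even if reading fails.
    with codecs.open(file_name, "r", encoding=encoding) as f:
        return f.read()
```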




Use this function as follows:
text = file_contents('my_file.txt')


2) Printing text in the console:
Use 'print'. Thus:
>>> word = u'слово'
>>> word
u'\u0441\u043b\u043e\u0432\u043e'
>>> print word
слово


3) Using utf in a script:
As the example above illustrates, unicode text should be indicated by a 'u' preceding the single or double quotes containing your string. This makes the type 'unicode' rather than a plain byte string:
>>> type(u'enter something here')
<type 'unicode'>

To make sure your script is recognised as UTF-8, though, add this to the start of your script:
# -*- coding: utf-8 -*-
This tells Python that the source file is encoded in UTF-8, so non-ASCII characters in your string literals are read correctly.
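As a small illustration of the difference between unicode text and its encoded bytes, here is a sketch of a script that starts with the coding line. The helper name char_and_byte_counts is made up for this example:

```python
# -*- coding: utf-8 -*-
# The coding line above has to be the first or second line of the file.


def char_and_byte_counts(text):
    """Return (number of characters, number of UTF-8 bytes) for unicode text."""
    return len(text), len(text.encode("utf-8"))


word = u"слово"  # a unicode literal containing non-ASCII characters
counts = char_and_byte_counts(word)  # five characters, two bytes each in UTF-8
```

The two numbers differ because every Cyrillic letter takes two bytes in UTF-8, which is why it matters whether you are holding unicode text or encoded bytes.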


4) Saving output:
Use the 'encode' method as follows:
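A minimal sketch of what this can look like, assuming you want a UTF-8 file. The function name save_text and its parameters are illustrative, not taken from the original snippet:

```python
# -*- coding: utf-8 -*-


def save_text(text, file_name, encoding="utf-8"):
    """Encode unicode text and write the resulting bytes to file_name."""
    data = text.encode(encoding)  # unicode -> bytes
    with open(file_name, "wb") as f:  # binary mode: write the bytes untouched
        f.write(data)
```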



This gives you beautiful unicode output.


5) Entering characters from the console:
This one can be a pain. First make sure that your operating system is set up to display 'special characters' correctly. For Windows, do this:
http://windows.microsoft.com/en-gb/windows-vista/change-the-system-locale

Now here is the trick: characters you enter on the console are already encoded. Python needs to *decode* them rather than encode. Consider:

>>> "слово"
'\xf1\xeb\xee\xe2\xee'

>>> u'слово'
u'\xf1\xeb\xee\xe2\xee'

>>> "слово".encode("UTF-8")

Traceback (most recent call last):
File "", line 1, in
"слово".encode("UTF-8")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf1 in position 0: ordinal not in range(128)

>>> u"слово".encode("UTF-8")
'\xc3\xb1\xc3\xab\xc3\xae\xc3\xa2\xc3\xae'

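Basically there's any number of exciting errors. The UnicodeDecodeError above arises because calling 'encode' on a byte string makes Python 2 first decode it implicitly as ASCII, which fails on the first non-ASCII byte.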

To enter text correctly, you need to know the encoding for stdin (this is Unix speak for 'standard input'):
>>> import sys
>>> sys.stdin.encoding
'cp1251'

Eureka. In brief, this means whatever I type into the console is encoded using 'cp1251'. According to Wikipedia: 'Windows-1251 (a.k.a. code page CP1251) is a popular 8-bit character encoding, designed to cover languages that use the Cyrillic script such as Russian, Bulgarian, Serbian Cyrillic and other languages.' So because my system locale is set to display Russian characters, this is the encoding used by default. If you want to display a German umlaut, your system default will probably be different.

But, back to entering input. Do this:
>>> "слово".decode("cp1251")
u'\u0441\u043b\u043e\u0432\u043e'
>>> print "слово".decode("cp1251")
слово

And voila.
Or voilà - as the cool kids with the correct unicode setup say
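The same idea as a small sketch that does not depend on what your console does with typed characters. The function name decode_console_bytes is made up for this example, and the UTF-8 fallback is an assumption for when sys.stdin.encoding is not set:

```python
import sys


def decode_console_bytes(raw, encoding=None):
    """Decode bytes that came from the console into unicode text."""
    if encoding is None:
        # sys.stdin.encoding can be None (for example when input is piped),
        # so fall back to UTF-8 as an assumed default.
        encoding = getattr(sys.stdin, "encoding", None) or "utf-8"
    return raw.decode(encoding)


typed = b"\xf1\xeb\xee\xe2\xee"  # the bytes the console produced above
word = decode_console_bytes(typed, "cp1251")  # the intended Russian word

# Decoding with the wrong codec and re-encoding reproduces the bad bytes
# from the u"...".encode("UTF-8") example above.
mojibake = typed.decode("latin-1").encode("utf-8")
```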


6) Piping unicode output:

FYI: http://en.wikipedia.org/wiki/Pipeline_(Unix)
Imagine this scenario: doing stuff with text is faster in Python, but you like to analyse results in R. Consequently you sometimes execute a bit of Python from R using the system command. For instance, I use a SQL database which I access through Python. I have a lookup function which I can run from R, which returns the content of the desired file. The R code is like this:



Here 'path' is the id of the file I want fetched from the database. intern=T means that R will capture the output printed by the Python script. But what does it see? Characters with some sort of encoding. Now the problem here is that if you print something as UTF-8, the bytes arriving over stdout (standard output) are still read using the console's own encoding, which, as we have seen, is 'CP1251' in my case. This makes for a bad combination, and my lookup function returns something like this:
рети армии его тылы были атакованы русской легкой конницей казаками и калмыками | Сражение Карл проиграл и бежал в Османскую империю | Дмитрий Табачник народный депутат Украины доктор исторических наук Киев "

Yuck. Instead, you guessed it, use something like the following in your script:
output.encode("cp1251")
This gives perfectly formatted output.
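If you would rather not hard-code 'cp1251', a sketch along these lines picks up whatever encoding stdout reports. The function name encode_for_stdout is hypothetical, and both the UTF-8 fallback and the 'replace' error handling are assumptions you may want to change:

```python
import sys


def encode_for_stdout(text, encoding=None):
    """Encode unicode text into the bytes the receiving console expects."""
    if encoding is None:
        # sys.stdout.encoding may be None when output is piped,
        # so fall back to UTF-8 as an assumed default.
        encoding = getattr(sys.stdout, "encoding", None) or "utf-8"
    # 'replace' turns characters the target encoding lacks into '?'
    # instead of raising UnicodeEncodeError.
    return text.encode(encoding, "replace")
```

In Python 2.7 the result can be passed straight to sys.stdout.write. When stdout reports no encoding, pass the one the receiving side expects explicitly, for example encode_for_stdout(output, "cp1251").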


7) And finally:
Beware of capitalisation. The unicode code points for capital and lower-case letters are different. See here for an example with everyone's favourite Russian character, 'Ya':

>>> "я".decode("cp1251")
u'\u044f'
>>> "Я".decode("cp1251")
u'\u042f'

Consequently, never assume your application thinks these are the same. For instance, pymongo (the Python wrapper to the popular NoSQL database MongoDB) is case-insensitive, but only for Latin characters. As this individual found out, the capability does not at the minute extend to all unicode characters. So remember: what you read in the English-language documentation may be irrelevant or wrong. Always check.
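If you need a case-insensitive comparison yourself, one hedged option is to lower-case both sides as unicode before comparing. The function name same_ignoring_case is made up for this sketch, and it assumes both arguments are already decoded unicode strings, not byte strings:

```python
# -*- coding: utf-8 -*-


def same_ignoring_case(a, b):
    """Compare two unicode strings without regard to case."""
    # lower() on unicode text knows about Cyrillic and other scripts.
    return a.lower() == b.lower()


lower_ya = u"\u044f"  # the lower-case 'Ya' from the example above
upper_ya = u"\u042f"  # the capital 'Ya' from the example above
```

A plain == on lower_ya and upper_ya is False, while same_ignoring_case treats them as equal.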



To be continued.

92 comments:

  1. Just a small thing: your gist uses 'with' like it's in R (as a temporary scope adjustment) but actually it does more in python. Specifically, it looks after final file closing. So your statement could be:

    with codecs.open(file_name, encoding="utf-8") as f:
        return f.read()

    Also, since you were wondering, codecs is mostly redundant in this context in Python 3 because open takes an encoding argument.

    1. Oh cool - I never knew 'with' worked like that in R =) Thanks for the comment
