Databases for text analysis: archive and access texts using SQL

This post is a collection of scripts I've found useful for integrating a SQL database into more complex applications. SQL allows quickish access to largish repositories of text (I wrote about this at some length here), and is a good starting point for taking textual analysis beyond thousands of texts.



I timed Python to be thirteen times quicker than R when reading and performing basic operations on text. Consequently I decided to create the database in SQLite and access it through a Python script. Finally, I put a wrapper around all of this so I can study metadata about the texts in R. This process, then, shields R from having to see large quantities of text, and keeps both me and my computer relatively sane.

All the code is available in an unseemly GitHub dump here.

We start with a folder of text files. If you need some, try these, 'borrowed' from the R Text Mining package.

First the setup: we need to be able to read the files. We do that with a handy function which conveniently smooths over many text-encoding woes. We also load the required libraries: sqlite3 provides the SQLite database interface, codecs allows us to work with UTF-8 characters, and os allows us to browse the operating system's file tree.
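Something like the following does the job (the helper name read_file is illustrative):

```python
import sqlite3  # Python's built-in interface to SQLite databases
import codecs   # read files as UTF-8, smoothing over text-encoding woes
import os       # browse the operating system's file tree

def read_file(path):
    """Read a text file as UTF-8, skipping any undecodable bytes."""
    with codecs.open(path, "r", encoding="utf-8", errors="ignore") as f:
        return f.read()
```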



The next step is to initialise the database. This is quite straightforward, and described in wonderful detail here.
The code does what it says on the box: the first line establishes a connection, which we name 'conn', and the second creates a cursor, which allows us to access individual records within the database. The bigger block of code is a SQL statement wrapped in a try/except clause: it attempts to create the table, and does nothing if the table already exists.
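A minimal sketch of that step (the database file name texts.db is my own choice, and I have made id the primary key so that the INSERT OR IGNORE statement used below has a constraint to trigger on):

```python
conn = sqlite3.connect("texts.db")  # establish a connection to the database file
c = conn.cursor()                   # the cursor is our handle on individual records

try:
    c.execute("""CREATE TABLE texts
                 (id INTEGER PRIMARY KEY NOT NULL,
                  file_name TEXT NOT NULL,
                  content TEXT NOT NULL)""")
except sqlite3.OperationalError:
    pass  # the table already exists, so there is nothing to do
```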



The syntax of the SQL statement is not complicated: the first line creates a table called 'texts'. Within this table we create three columns: id, file_name, and content. After the name of each column, 'integer' or 'text' specifies the column type. Finally, NOT NULL specifies that the field must not be left empty.

Now we load the files into Python and save them to the database:
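A sketch, assuming the files sit in a single folder (the folder name is illustrative):

```python
folder = "texts_folder"  # the folder of text files named in the first line

for counter, file_name in enumerate(os.listdir(folder)):
    try:
        text = read_file(os.path.join(folder, file_name))
    except IOError:
        continue  # skip anything we cannot read
    entry = (counter, file_name, text)  # id, file name, content
    c.execute("INSERT OR IGNORE INTO texts VALUES (?, ?, ?)", entry)
    conn.commit()  # write the entry to the database
```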



This script iterates over all the files in the folder named in the first line. It then takes each file_name, attempts to read in the corresponding file, and creates a SQL entry made up of three pieces of data:
1 - the ID, here taken from the counter;
2 - the file name, taken from the list of names;
3 - the text we read in.

All this gets saved as a tuple (one of Python's most basic data types) called 'entry'. This gets passed to the database using the connection we opened previously, and the SQL statement:
INSERT OR IGNORE INTO [table name] VALUES([three placeholders]), [our data]

It is worth familiarising oneself with the different types of SQL operation. Here we are telling the database to add the entry unless it already exists ('insert or ignore'). We could also just use 'insert', which would throw an error in the case of duplicates. Similarly, we might want to overwrite previous entries using 'replace'.
Finally we run conn.commit(), which commits the transaction and actually writes the entry to the database.
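Swapping strategies is a one-word change; for instance, to overwrite instead of skip:

```python
# overwrite any existing row that collides on the primary key
c.execute("INSERT OR REPLACE INTO texts VALUES (?, ?, ?)", entry)
conn.commit()
```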

And that is it. All done: the data is now archived away.

The eagle-eyed will have noted that this is a pretty inefficient method: we are entering texts into the database one at a time. We can build up a list of entries instead; this gives a performance boost:
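Roughly like this, reusing the folder and read_file from above:

```python
holder = []  # collect every entry before touching the database

for counter, file_name in enumerate(os.listdir(folder)):
    try:
        text = read_file(os.path.join(folder, file_name))
    except IOError:
        continue
    holder.append((counter, file_name, text))

c.executemany("INSERT OR IGNORE INTO texts VALUES (?, ?, ?)", holder)
conn.commit()  # one commit for the whole batch
```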



Here we do exactly the same thing, except we add each entry to a list called 'holder'. Then we commit all the entries in one go using 'executemany'.

Now, we may have many files to archive, so let's not assume they will all fit into memory; instead we will have the code archive every 100 entries and then reset the holder. To ensure that any residual texts are also entered, we keep the final executemany command:
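A sketch of the batched version:

```python
holder = []

for counter, file_name in enumerate(os.listdir(folder)):
    try:
        text = read_file(os.path.join(folder, file_name))
    except IOError:
        continue
    holder.append((counter, file_name, text))
    if len(holder) == 100:  # archive every 100 entries, then reset the holder
        c.executemany("INSERT OR IGNORE INTO texts VALUES (?, ?, ?)", holder)
        conn.commit()
        holder = []

# enter any residual texts still left in the holder
c.executemany("INSERT OR IGNORE INTO texts VALUES (?, ?, ?)", holder)
conn.commit()
```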



Using this method we can archive thousands of texts relatively efficiently, and without needing to worry about the data fitting in memory.



So, we filed away the data. How do we get it out again?

To get the first entry in the database, we run this code:
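Along these lines (the variable names are mine):

```python
c.execute("SELECT id, file_name, content FROM texts")
text_id, file_name, content = c.fetchone()  # unpack the first entry
```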


The SELECT statement allows us to retrieve one or more fields - here I am retrieving the whole entry of id, file_name, and content. FROM specifies that we want to fetch this data from the table named 'texts', and the fetchone() method returns the first matching entry in the database. Using this code we might cycle through the database's contents to perform some calculation. Python receives the entry as a tuple; the statement above immediately unpacks it into its constituent parts.

To get a specific entry, we add a condition to the SQL statement using WHERE:
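For example (the target file name is purely illustrative):

```python
target = "some_file.txt"  # hypothetical file name
c.execute("SELECT id, file_name, content FROM texts WHERE file_name = ?", (target,))
text_id, file_name, content = c.fetchone()
```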

This allows us to name a target file_name and find the corresponding entry.

Similarly, we can use the fetchall() method to get all the entries in the database (or all those matching a particular condition, such as an id range, date of publication, author, etc.):
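For instance (the id range is illustrative):

```python
c.execute("SELECT id, file_name, content FROM texts WHERE id < 100")
entries = c.fetchall()  # a list of (id, file_name, content) tuples
for text_id, file_name, content in entries:
    print(file_name, len(content))
```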



WHERE statements can be used to access manageable chunks of data, but in some situations we might want to see all the texts in the archive, even though the archive doesn't fit into memory. Here a generator lets us yield the files in batches: we load the first 100 files, do something with these, load the next 100, and so on. Consequently, we never have more than 100 files in memory at a given time:
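One way to write this is with the cursor's fetchmany() method (the function name yield_batches is my own):

```python
def yield_batches(cursor, batch_size=100):
    """Yield the archive in batches of batch_size entries."""
    cursor.execute("SELECT id, file_name, content FROM texts")
    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:
            break
        yield batch

for batch in yield_batches(c):
    for text_id, file_name, content in batch:
        pass  # do something with each text; at most 100 are in memory
```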



And that's that - now you have the necessary tools to archive a folder of files, and to retrieve them later. Next up I'll show how we can wrap all of this in a function, allowing retrieval of data from within R.

Comments:

  1. Great series of posts, Rolf, thank you for sharing these insights and scripts. I'm looking forward to the post on indexing.

    Out of curiosity, are you doing any merge operations with your data? If so, are you doing them in sqlite, R, Python, or something else? I've been looking for an efficient merge implementation (I have some 200GB of texts that I reduce to term-frequency matrices and merge; I'm trying different subsets of the texts, so I have to do this multiple times). So far the fastest thing I've found is Python's 'pandas' library, but it requires a huge amount of memory (590GB to be precise; I've been using a high-performance cluster that has 1TB of RAM) and takes forever (over 10 hours). I tried sqlite3 too (from a Python script, like you do here), but it was even slower (630 seconds vs 30 seconds, on a tiny subset of the data). I was going to try R but I gave up after seeing some benchmarks (http://wesmckinney.com/blog/?p=395). So, any suggestions? How do you go about merging your own data?

    Your data is much bigger than mine, but my initial response is: find ways to reduce the number of words, e.g. through stop lists, reducing to root forms, excluding proper nouns, place names, etc. If I understand you right and you have a massive document/term matrix, then I would consider using sparse matrices and in general following Andrew Goldstein's comments on this post: http://quantifyingmemory.blogspot.com/2013/10/scaling-up-and-shutting-up-r-topic.html

      I have found pandas to be pretty efficient, but you might speed things up by using noSQL instead of SQL - I have found mongoDB to be a fair bit faster, for instance.

  2. Thank you for the reply, Rolf.

    I tried pandas' sparse matrix feature, but it doesn't allow merges. I'm already removing proper nouns (well, in a probabilistic way), but for now I'd rather keep stop words and avoid stemming (I'm trying to come up with a "language-blind" algorithm). I'll look into noSQL and Goldstein's comments then.

    Thiago
