Databases for text analysis: archive and access texts using SQL

This post is a collection of scripts I've found useful for integrating a SQL database into more complex applications. SQL allows quickish access to largish repositories of text (I wrote about this at some length here), and is a good starting point for taking textual analysis beyond thousands of texts.



I timed Python to be thirteen times quicker than R when reading and performing basic operations on text. Consequently I decided to create the database in SQLite and access it through a Python script. Finally, I put a wrapper around all of this so I can study metadata about the texts in R. This process, then, shields R from having to see large quantities of text, and keeps both me and my computer relatively sane.

All the code is available in an unseemly GitHub dump here.

We start with a folder of text files. If you need some, try these, 'borrowed' from the R Text Mining package.

First the setup: we need to be able to read the files. We do that with a handy function which conveniently smooths over many text-encoding woes. We also load the required libraries: sqlite3 provides the SQLite database interface, codecs allows us to work with UTF-8 characters, and os allows us to browse the operating system's file tree.
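Something like the following does the job (the helper name read_file is illustrative):

```python
import sqlite3  # Python's built-in interface to SQLite databases
import codecs   # read files as UTF-8, smoothing over text-encoding woes
import os       # browse the operating system's file tree

def read_file(path):
    """Read a text file as UTF-8, skipping any undecodable bytes."""
    with codecs.open(path, "r", encoding="utf-8", errors="ignore") as f:
        return f.read()
```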



The next step is to initialise the database. This is quite straightforward, and described in wonderful detail here.
The code does what it says on the box: the first line establishes a connection, which we name 'conn', and the second creates a cursor, which allows us to access individual records within the database. The bigger block of code is a SQL statement wrapped in a try/except clause: it attempts to create the table, and does nothing if the table already exists.
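A minimal sketch of that step (the database file name texts.db is my own choice, and I have made id the primary key so that the INSERT OR IGNORE statement used below has a constraint to trigger on):

```python
conn = sqlite3.connect("texts.db")  # establish a connection to the database file
c = conn.cursor()                   # the cursor is our handle on individual records

try:
    c.execute("""CREATE TABLE texts
                 (id INTEGER PRIMARY KEY NOT NULL,
                  file_name TEXT NOT NULL,
                  content TEXT NOT NULL)""")
except sqlite3.OperationalError:
    pass  # the table already exists, so there is nothing to do
```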



The syntax of the SQL statement is not complicated: the first line creates a table called 'texts'. Within this table we create three columns: id, file_name, and content. After the name of each column, 'integer' or 'text' specifies the column type. Finally, NOT NULL specifies that the field must not be left empty.

Now we load the files into Python and save them to the database:
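A sketch, assuming the files sit in a single folder (the folder name is illustrative):

```python
folder = "texts_folder"  # the folder of text files named in the first line

for counter, file_name in enumerate(os.listdir(folder)):
    try:
        text = read_file(os.path.join(folder, file_name))
    except IOError:
        continue  # skip anything we cannot read
    entry = (counter, file_name, text)  # id, file name, content
    c.execute("INSERT OR IGNORE INTO texts VALUES (?, ?, ?)", entry)
    conn.commit()  # write the entry to the database
```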



This script iterates over all the files in the folder named in the first line. It then takes each file_name, attempts to read in the corresponding file, and creates a SQL entry made up of three pieces of data:
1 - the ID, here taken from the counter;
2 - the file name, taken from the list of names;
3 - the text we read in.

All this gets saved as a tuple (one of Python's most basic data types) called 'entry'. This gets passed to the database using the connection we opened previously, and the SQL statement:
INSERT OR IGNORE INTO [table name] VALUES([three placeholders]), [our data]

It is worth familiarising oneself with the different types of SQL operation. Here we are telling the database to add the entry unless it already exists ('insert or ignore'). We could also just use 'insert', which would throw an error in the case of duplicates. Similarly, we might want to overwrite previous entries using 'replace'.
Finally we run conn.commit(), which commits the transaction and actually writes the entry to the database.
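Swapping strategies is a one-word change; for instance, to overwrite instead of skip:

```python
# overwrite any existing row that collides on the primary key
c.execute("INSERT OR REPLACE INTO texts VALUES (?, ?, ?)", entry)
conn.commit()
```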

And that is it. All done: the data is now archived away.

The eagle-eyed will have noted that this is a pretty inefficient method: we are entering texts into the database one at a time. We can build up a list of entries instead; this gives a performance boost:
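Roughly like this, reusing the folder and read_file from above:

```python
holder = []  # collect every entry before touching the database

for counter, file_name in enumerate(os.listdir(folder)):
    try:
        text = read_file(os.path.join(folder, file_name))
    except IOError:
        continue
    holder.append((counter, file_name, text))

c.executemany("INSERT OR IGNORE INTO texts VALUES (?, ?, ?)", holder)
conn.commit()  # one commit for the whole batch
```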



Here we do exactly the same thing, except we add each entry to a list called 'holder'. Then we commit all the entries in one go using 'executemany'.

Now, we may have many files to archive, so let's not assume they will all fit into memory; instead we will have the code archive every 100 entries and then reset the holder. To ensure that any residual texts are also entered, we keep the final executemany command:
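A sketch of the batched version:

```python
holder = []

for counter, file_name in enumerate(os.listdir(folder)):
    try:
        text = read_file(os.path.join(folder, file_name))
    except IOError:
        continue
    holder.append((counter, file_name, text))
    if len(holder) == 100:  # archive every 100 entries, then reset the holder
        c.executemany("INSERT OR IGNORE INTO texts VALUES (?, ?, ?)", holder)
        conn.commit()
        holder = []

# enter any residual texts still left in the holder
c.executemany("INSERT OR IGNORE INTO texts VALUES (?, ?, ?)", holder)
conn.commit()
```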



Using this method we can archive thousands of texts relatively efficiently, and without needing to worry about the data fitting in memory.



So, we filed away the data. How do we get it out again?

To get the first entry in the database, we run this code:
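Along these lines (the variable names are mine):

```python
c.execute("SELECT id, file_name, content FROM texts")
text_id, file_name, content = c.fetchone()  # unpack the first entry
```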


The SELECT statement allows us to retrieve one or more fields - here I am retrieving the whole entry of id, file_name, and content. FROM specifies that we want to fetch this data from the table named 'texts', and the fetchone() method returns the first matching entry in the database. Using this code we might cycle through the database's contents to perform some calculation. Python receives the entry as a tuple; the statement above immediately unpacks it into its constituent parts.

To get a specific entry, we add a condition to the SQL statement using WHERE:
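For example (the target file name is purely illustrative):

```python
target = "some_file.txt"  # hypothetical file name
c.execute("SELECT id, file_name, content FROM texts WHERE file_name = ?", (target,))
text_id, file_name, content = c.fetchone()
```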

This allows us to name a target file_name and find the corresponding entry.

Similarly, we can use the fetchall() method to get all the entries in the database (or all those matching a particular condition, such as an id range, date of publication, author, etc.):
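For instance (the id range is illustrative):

```python
c.execute("SELECT id, file_name, content FROM texts WHERE id < 100")
entries = c.fetchall()  # a list of (id, file_name, content) tuples
for text_id, file_name, content in entries:
    print(file_name, len(content))
```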



WHERE statements can be used to access manageable chunks of data, but in some situations we might want to see all the texts in the archive, even though the archive doesn't fit into memory. Here a generator lets us yield the files in batches: we load the first 100 files, do something with these, load the next 100, and so on. Consequently, we never have more than 100 files in memory at a given time:
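One way to write this is with the cursor's fetchmany() method (the function name yield_batches is my own):

```python
def yield_batches(cursor, batch_size=100):
    """Yield the archive in batches of batch_size entries."""
    cursor.execute("SELECT id, file_name, content FROM texts")
    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:
            break
        yield batch

for batch in yield_batches(c):
    for text_id, file_name, content in batch:
        pass  # do something with each text; at most 100 are in memory
```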



And that's that - now you have the necessary tools to archive a folder of files, and to retrieve them later. Next up I'll show how we can wrap all of this in a function, allowing retrieval of data from within R.

Comments:

  1. Great series of posts, Rolf, thank you for sharing these insights and scripts. I'm looking forward to the post on indexing.

    Out of curiosity, are you doing any merge operations with your data? If so, are you doing them in sqlite, R, Python, or something else? I've been looking for an efficient merge implementation (I have some 200GB of texts that I reduce to term-frequency matrices and merge; I'm trying different subsets of the texts, so I have to do this multiple times). So far the fastest thing I've found is Python's 'pandas' library, but it requires a huge amount of memory (590GB to be precise; I've been using a high-performance cluster that has 1TB of RAM) and takes forever (over 10 hours). I tried sqlite3 too (from a Python script, like you do here), but it was even slower (630 seconds vs 30 seconds, on a tiny subset of the data). I was going to try R but I gave up after seeing some benchmarks (http://wesmckinney.com/blog/?p=395). So, any suggestions? How do you go about merging your own data?

    Your data is much bigger than mine, but my initial response is: find ways to reduce the number of words, e.g. through stop lists, reducing to root forms, excluding proper nouns, place names, etc. If I understand you right and you have a massive document/term matrix, then I would consider using sparse matrices and in general following Andrew Goldstein's comments on this post: http://quantifyingmemory.blogspot.com/2013/10/scaling-up-and-shutting-up-r-topic.html

      I have found pandas to be pretty efficient, but you might speed things up by using noSQL instead of SQL - I have found mongoDB to be a fair bit faster, for instance.

  2. Thank you for the reply, Rolf.

    I tried pandas' sparse matrix feature, but it doesn't allow merges. I'm already removing proper nouns (well, in a probabilistic way), but for now I'd rather keep stop words and avoid stemming (I'm trying to come up with a "language-blind" algorithm). I'll look into noSQL and Goldstein's comments then.

    Thiago
