Databases for text analysis: archive and access texts using SQL

This post is a collection of scripts I've found useful for integrating a SQL database into more complex applications. A SQL database allows quickish access to largish repositories of text (I wrote about this at some length here), and is a good starting point for taking textual analysis beyond thousands of texts.

I timed Python to be thirteen times quicker than R when reading and performing basic operations on text. Consequently I decided to create a SQL database which I access through a Python script. Finally I put a wrapper around all of this so I can study metadata about the texts in R. This process shields R from having to see large quantities of text, and keeps both me and my computer relatively sane.

All the code is available in an unseemly github dump here.

We start with a folder of text files. If you need some, try these, 'borrowed' from the R Text Mining package.

First the setup: we need to be able to read the files. We do that with a handy function which conveniently smooths over many text-encoding woes. We also load the required libraries: sqlite3 provides the SQL database, codecs allows us to work with UTF-8 characters, and os allows us to browse the operating system's file tree.
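A minimal sketch of that setup (the helper's name, read_text, is my own label, not from the original):

```python
import codecs   # open files with an explicit encoding
import os       # browse the file system (used when we walk the folder later)
import sqlite3  # the SQL database engine (used once we build the archive)

def read_text(path, encoding="utf-8"):
    """Read a file, replacing any bytes that can't be decoded
    instead of raising an error - this smooths over encoding woes."""
    with codecs.open(path, "r", encoding=encoding, errors="replace") as f:
        return f.read()
```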

The next step is to initialise the database. This is quite straightforward, and described in wonderful detail here.
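A sketch of that initialisation. One assumption on my part: I've made id the primary key, which gives the INSERT OR IGNORE used later a uniqueness constraint to act on:

```python
import sqlite3

conn = sqlite3.connect("texts.db")  # creates the file if it doesn't exist
c = conn.cursor()                   # lets us access individual records

try:
    # id is the primary key (my assumption); the other two
    # columns simply must not be left empty
    c.execute("""CREATE TABLE IF NOT EXISTS texts(
                     id integer PRIMARY KEY,
                     file_name text NOT NULL,
                     content text NOT NULL)""")
except sqlite3.OperationalError as e:
    print("Could not create table:", e)
```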
The code does what it says on the box: the first line establishes a connection, which we name 'conn', and the second creates a cursor, which allows us to access individual records within the database. The bigger block of code is a SQL statement wrapped in a try/except clause: this attempts to create the table if it doesn't already exist.

The syntax of the SQL statement is not complicated: the first line creates a table called 'texts'. Within this table we create three columns: id, file_name, and content. After the name of each column, 'integer' and 'text' specify the column type. Finally we specify that this information must not be left empty.

Now we load the files into Python, and save them to the database.
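A sketch of that loop. The folder name is illustrative, and the first few lines manufacture a tiny demo corpus so the snippet runs on its own:

```python
import os
import sqlite3

# manufacture a tiny demo corpus (skip this if you have real files)
os.makedirs("corpus", exist_ok=True)
for name, body in [("a.txt", "first text"), ("b.txt", "second text")]:
    with open(os.path.join("corpus", name), "w", encoding="utf-8") as f:
        f.write(body)

conn = sqlite3.connect("texts.db")
c = conn.cursor()
c.execute("""CREATE TABLE IF NOT EXISTS texts(
                 id integer PRIMARY KEY,
                 file_name text NOT NULL,
                 content text NOT NULL)""")

for counter, file_name in enumerate(sorted(os.listdir("corpus"))):
    with open(os.path.join("corpus", file_name), encoding="utf-8") as f:
        text = f.read()
    entry = (counter, file_name, text)  # the ID, the file name, the text
    c.execute("INSERT OR IGNORE INTO texts VALUES(?,?,?)", entry)
    conn.commit()  # write the entry to the database
```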

This script iterates over all the files in the folder named in the first line. It then takes each file_name, attempts to read in the corresponding file, and creates a SQL entry made up of three pieces of data:
1 - the ID, here taken from the counter.
2 - the file name, taken from the list of names
3 - the text we read in.

All this gets saved as a tuple (Python's basic immutable sequence type) called 'entry'. This gets passed to the database using the connection we opened previously, and the SQL statement:
INSERT OR IGNORE INTO [table name] VALUES([three placeholders]), [our data]

It is worth familiarising oneself with the different types of SQL operation. Here we are telling the database to add the entry unless it already exists (insert or ignore). We could also just use 'insert' - which would throw an error in the case of duplicates. Similarly, we might want to overwrite previous entries using 'replace'.
Finally we run conn.commit(), which commits the transaction and actually writes the entry to the database.

And that is it. All done, the data is now archived away.

The eagle-eyed will have noted that this is a pretty inefficient method: we are entering texts into the database one at a time. We can use a list of entries instead; this gives a performance boost:
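A sketch of the batched version, reusing the same illustrative demo-corpus setup so it runs on its own:

```python
import os
import sqlite3

# demo corpus (skip this if you have real files)
os.makedirs("corpus", exist_ok=True)
for name, body in [("a.txt", "first text"), ("b.txt", "second text")]:
    with open(os.path.join("corpus", name), "w", encoding="utf-8") as f:
        f.write(body)

conn = sqlite3.connect("texts.db")
c = conn.cursor()
c.execute("""CREATE TABLE IF NOT EXISTS texts(
                 id integer PRIMARY KEY,
                 file_name text NOT NULL,
                 content text NOT NULL)""")

holder = []  # collect every entry before touching the database
for counter, file_name in enumerate(sorted(os.listdir("corpus"))):
    with open(os.path.join("corpus", file_name), encoding="utf-8") as f:
        holder.append((counter, file_name, f.read()))

# one executemany call instead of one execute per file
c.executemany("INSERT OR IGNORE INTO texts VALUES(?,?,?)", holder)
conn.commit()
```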

Here we do exactly the same thing, except we add each entry to a list called 'holder'. Then we commit all the entries in one go using 'executemany'.

Now, we may have many files to archive, so let's not assume they will all fit into memory; instead we will set the code to archive every 100 entries, and then reset the holder. To ensure that any residual texts are also entered, we keep the final executemany command:
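A sketch of the chunked version; the 250 demo files at the top are generated purely so there is something to batch:

```python
import os
import sqlite3

# generate a few hundred demo files (skip this if you have real files)
os.makedirs("corpus", exist_ok=True)
for i in range(250):
    with open(os.path.join("corpus", "doc%03d.txt" % i), "w", encoding="utf-8") as f:
        f.write("text number %d" % i)

conn = sqlite3.connect("texts.db")
c = conn.cursor()
c.execute("""CREATE TABLE IF NOT EXISTS texts(
                 id integer PRIMARY KEY,
                 file_name text NOT NULL,
                 content text NOT NULL)""")

holder = []
for counter, file_name in enumerate(sorted(os.listdir("corpus"))):
    with open(os.path.join("corpus", file_name), encoding="utf-8") as f:
        holder.append((counter, file_name, f.read()))
    if len(holder) == 100:  # archive every 100 entries...
        c.executemany("INSERT OR IGNORE INTO texts VALUES(?,?,?)", holder)
        conn.commit()
        holder = []         # ...and reset the holder

# pick up any residual texts (here the last 50)
c.executemany("INSERT OR IGNORE INTO texts VALUES(?,?,?)", holder)
conn.commit()
```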

Using this method we can archive thousands of texts relatively efficiently, and without needing to worry about the data fitting in memory.

So, we filed away the data. How do we get it out again?

To get the first entry in the database, we run this code:
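Something like this; the first few lines rebuild a tiny in-memory stand-in for the archive so the snippet stands alone (I use id_ rather than id to avoid shadowing Python's built-in):

```python
import sqlite3

# a tiny stand-in for the archive built earlier
conn = sqlite3.connect(":memory:")
c = conn.cursor()
c.execute("CREATE TABLE texts(id integer PRIMARY KEY, file_name text, content text)")
c.executemany("INSERT INTO texts VALUES(?,?,?)",
              [(0, "a.txt", "first text"), (1, "b.txt", "second text")])

# fetch the first entry and unpack it in one go
c.execute("SELECT id, file_name, content FROM texts")
id_, file_name, content = c.fetchone()
```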

The SELECT statement allows us to retrieve one or more fields; here I am retrieving the whole entry of id, file_name, and content. FROM specifies that we want to fetch this data from the table named 'texts', and the fetchone() method returns the first entry in the database. Using this code we might cycle through the database's contents to perform some calculation. Python receives the entry as a tuple; the statement above immediately unpacks it into its constituent parts.

To get a specific entry we add a condition to the SQL statement using 'WHERE':
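For instance (again with a small in-memory stand-in for the archive, and an illustrative file name):

```python
import sqlite3

# stand-in archive
conn = sqlite3.connect(":memory:")
c = conn.cursor()
c.execute("CREATE TABLE texts(id integer PRIMARY KEY, file_name text, content text)")
c.executemany("INSERT INTO texts VALUES(?,?,?)",
              [(0, "a.txt", "first text"), (1, "b.txt", "second text")])

# the ? placeholder keeps the query safe from stray quotes in file names
c.execute("SELECT id, file_name, content FROM texts WHERE file_name = ?",
          ("b.txt",))
id_, file_name, content = c.fetchone()
```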

This allows us to name a target file_name and find the corresponding entry.

Similarly, we can use the fetchall() method to get all the entries in the database (or those matching a particular condition, such as id range, date of publication, author, etc.):
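A sketch, here filtering on an id range over a small stand-in archive:

```python
import sqlite3

# stand-in archive
conn = sqlite3.connect(":memory:")
c = conn.cursor()
c.execute("CREATE TABLE texts(id integer PRIMARY KEY, file_name text, content text)")
c.executemany("INSERT INTO texts VALUES(?,?,?)",
              [(0, "a.txt", "first text"),
               (1, "b.txt", "second text"),
               (2, "c.txt", "third text")])

# every entry with an id in the given range
c.execute("SELECT id, file_name, content FROM texts WHERE id <= ?", (1,))
entries = c.fetchall()  # a list of (id, file_name, content) tuples
```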

WHERE statements can be used to access manageable chunks of data, but in some situations we might want to see all the texts in the archive, even when the archive doesn't fit into memory. Instead, let's use a generator to yield batches of files.
This means we load x (here: 100) files, do something with them, load the next 100 files, and so on. Consequently, we never have more than 100 files in memory at any given time:
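One way to write such a generator is with the cursor's fetchmany() method (the function name yield_batches is mine, and the 250-row stand-in archive is generated so the snippet runs on its own):

```python
import sqlite3

# stand-in archive with 250 rows
conn = sqlite3.connect(":memory:")
c = conn.cursor()
c.execute("CREATE TABLE texts(id integer PRIMARY KEY, file_name text, content text)")
c.executemany("INSERT INTO texts VALUES(?,?,?)",
              [(i, "doc%03d.txt" % i, "text number %d" % i) for i in range(250)])

def yield_batches(cursor, batch_size=100):
    """Yield the archive as lists of at most batch_size entries."""
    cursor.execute("SELECT id, file_name, content FROM texts")
    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:  # no rows left
            return
        yield batch

for batch in yield_batches(c):
    for id_, file_name, content in batch:
        pass  # do something with each text here
```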

And that's that - now you have the necessary tools to archive a folder of files, and to retrieve them later. Next up I'll show how we can wrap all of this in a function, allowing retrieval of data from within R.


  1. Great series of posts Rolf, thank you for sharing these insights and scripts. I'm looking forward to the post on indexing.

    Out of curiosity, are you doing any merge operations with your data? If so, are you doing them in sqlite, R, Python, or something else? I've been looking for an efficient merge implementation (I have some 200GB of texts that I reduce to term-frequency matrices and merge; I'm trying different subsets of the texts, so I have to do this multiple times). So far the fastest thing I've found is Python's 'pandas' library, but it requires a huge amount of memory (590GB to be precise; I've been using a high-performance cluster that has 1TB of RAM) and takes forever (over 10 hours). I tried sqlite3 too (from a Python script, like you do here), but it was even slower (630 seconds vs 30 seconds, on a tiny subset of the data). I was going to try R but I gave up after seeing some benchmarks. So, any suggestions? How do you go about merging your own data?

    1. Your data is much bigger than mine, but my initial response is: find ways to reduce the number of words, e.g. through stop lists, reducing to root forms, excluding proper nouns, place names, etc. If I understand you right and you have a massive document/term matrix, then I would consider using sparse matrices and in general following Andrew Goldstein's comments to this

      I have found pandas to be pretty efficient, but you might speed things up by using noSQL instead of SQL - I have found mongoDB to be a fair bit faster, for instance.

  2. Thank you for the reply, Rolf.

    I tried pandas' sparse matrix feature, but it doesn't allow merges. I'm already removing proper nouns (well, in a probabilistic way), but for now I'd rather keep stop words and avoid stemming (I'm trying to come up with a "language-blind" algorithm). I'll look into noSQL and Goldstein's comments then.