The Challenge of not-quite-Gargantuan Data (and why DH needs SQL)

I felt strangely belittled when Andrew Goldstone tweeted about a recent blog entry:

Not even medium-sized? But... but... my archive is really big! I am working on more than a million texts! Of course, he is right - and it occurs to me that medium-sized data such as mine is in its own way quite tricky to handle: small enough to be archived on a laptop, too big to fit into memory. An archive of this size creates the illusion of being both Big and easily manageable - when in reality it is neither.

This post, the first in a series of three, explains why I decided to use a database of texts. The second post will explore how to archive and retrieve data from a SQL database, while the third will introduce how to use indexes to keep textual data at arm's length and facilitate quick information retrieval.

Big Data is by definition difficult to handle. Following Manovich, who himself cited the 2011 Wikipedia page on Big Data: 'in computer industry the term has a more precise meaning: "Big Data is a term applied to data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process the data within a tolerable elapsed time."'

Big, then, is defined by capabilities, and to most humanities scholars Big might be any collection too large to be analysed through close reading - possibly only a few thousand texts. My medium-sized data is small enough that I could collect and archive it independently, for instance through web-scraping, but too big to analyse easily.

There are any number of tutorials easily available showing how to work with textual data. Take Matthew Jockers' forthcoming book about literary analysis through R: Jockers makes available a whole toolkit for textual analysis, into which the user can 'plug' their own texts. I don't mean to belittle Jockers' efforts - I think the book is an excellent and well-written introduction. I find, though, that applications which work well for Small example data often don't seamlessly accommodate the Bigger data we want to analyse. In practice, sample code needs to be plugged into a flexible research workflow that can scale beyond a few thousand entries.

There are, to my mind, two essential difficulties with medium-sized data. First, it might fit on your hard drive, but probably not in computer memory - I wrote a little bit about that here. Secondly, as scholars generally without formal computer-science training, our go-to methods are usually iterative.

Imagine the following scenario: we have 100 files, and want to know the word count of each. In R, for instance, it is very easy to load in a file, count the number of words, store that number in a vector, and repeat for the remaining 99 files.

The procedure is pretty straightforward, if remarkably inefficient:
1) get a list of the file names
2) read in one file in UTF-8 format (or rather: open the file, read it into memory, close the file)
3) split the text at space characters
4) make a list of word counts
I wrapped that loop into a function and ran it an increasing number of times on one book of 80 000 words:
1 file: 0.14 seconds
10 files: 1.3 seconds
100 files: 13.02 seconds

Not bad, right? Well, actually, yes, yes it is.

As we would expect, the cost of this function increases linearly: 100 files take about 100 times as long as a single file. Still only 13 seconds, but with 1 000 000 files the operation would take more than 36 hours. This, then, is the point about medium-sized data: somewhere between 100 files and 1 million files this kind of inefficient approach becomes too slow, especially if we want to know more than just a word count - which we usually do. Imagine we wanted to conduct a search: scanning every file for a term in this way would also take 36 hours. In short, we need proper ways to archive and index text.
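The back-of-the-envelope arithmetic behind that 36-hour figure:

```python
# 100 files took 13.02 seconds, so, assuming linear scaling:
seconds_per_file = 13.02 / 100                      # ~0.13 seconds per file
seconds_for_million = seconds_per_file * 1_000_000  # 130 200 seconds
hours_for_million = seconds_for_million / 3600      # just over 36 hours
```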

The main reason this was slow is that R is not very good at working with text - reading the same file 100 times in Python took exactly 1 second (script). The point I want to make here is not primarily that R is slower than Python when working with text - processing 1 million files would still take the best part of three hours in Python - but that working with many small files is inefficient.

To demonstrate this I created a second file of 800 words (1% of the original), and loaded this file 100 times in 0.05 seconds. The longer file, though, was processed in just 0.002 seconds - the same amount of text in a fraction of the time. The lesson, then, is that it is much quicker to work with a few large files than with many small ones.
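The comparison can be reproduced with a sketch along these lines (the exact timings will of course vary by machine; the file sizes mirror the 80 000-word and 800-word files above):

```python
import os
import tempfile
import time

# build a throwaway corpus: one large file and 100 small ones (1% the size)
tmp = tempfile.mkdtemp()
large_path = os.path.join(tmp, "large.txt")
with open(large_path, "w", encoding="utf-8") as f:
    f.write("word " * 80_000)

small_paths = []
for i in range(100):
    path = os.path.join(tmp, f"small_{i}.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("word " * 800)
    small_paths.append(path)

start = time.perf_counter()
for path in small_paths:  # 100 small files, 80 000 words in total
    with open(path, encoding="utf-8") as f:
        f.read()
small_elapsed = time.perf_counter() - start

start = time.perf_counter()
with open(large_path, encoding="utf-8") as f:  # one file, same amount of text
    f.read()
large_elapsed = time.perf_counter() - start

# the per-file overhead (opening and closing each file) is what makes
# the many-small-files version slower, even though the text is the same
```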

In reality there are other considerations too: try putting more than 100 000 files in a directory and see how your computer reacts - most Windows systems will struggle even to open the folder. In most cases, we are better off working with a single large file than with many small ones.

There are many ways to group files - we might, for instance, concatenate multiple texts into one file, one text per line. This requires some thought about how to deal with line breaks, as well as other formatting questions, but it can work. The difficulty with a single large file, though, is that it has to be loaded into memory. That is no problem for a few thousand texts, but again, as we approach a million texts, we are certain to run out of free RAM. This, then, is the dilemma of medium-sized data: processing is probably too slow if the data is stored in individual files, while merging everything into one large file means running into memory issues. One compromise is to have several big files, but then we need to solve the problem of retrieval: in which large file is text 'x' stored? In the past I have solved this using hashing, or by archiving the files in order, but neither approach proved very robust.
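One robust way to handle the line-break problem - my own suggestion, not necessarily the scheme alluded to above - is to JSON-encode each text, so that internal newlines are escaped and one line in the combined file corresponds to exactly one text:

```python
import json

def pack(texts):
    """Join many texts into one string, one JSON-encoded text per line.

    json.dumps escapes internal newlines, so each text is guaranteed
    to occupy exactly one line of the combined file.
    """
    return "\n".join(json.dumps(text) for text in texts)

def unpack(packed):
    """Recover the original texts from the packed string."""
    return [json.loads(line) for line in packed.split("\n")]
```

A file written this way can also be read back one line at a time, so a scan of the archive never needs to hold more than one text in memory at once.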

As scholars working with medium-sized data and low-level methods, we tend to face the question: is our data big enough to force us to get more efficient, or do we put up with slow operations that get the job done? When I decided to 'find a better way', I turned to databases. A SQL database allows retrieval of individual texts, meaning the whole archive never has to be loaded into memory, and it uses indexes, which makes retrieval generally quite fast.
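As a minimal taste of the idea - the next post covers it properly - here is a sketch using Python's built-in sqlite3 module; the table and column names are my own invention:

```python
import sqlite3

# an in-memory database for illustration; pass a filename for a real archive
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE texts (id TEXT PRIMARY KEY, body TEXT)")

# archive a text; the PRIMARY KEY column is indexed automatically
conn.execute(
    "INSERT INTO texts VALUES (?, ?)",
    ("text_001", "the full text of one document ..."),
)
conn.commit()

# retrieve one text by id - no need to load the rest of the archive
row = conn.execute(
    "SELECT body FROM texts WHERE id = ?", ("text_001",)
).fetchone()
```

Because lookups go through the index on `id`, retrieval time barely grows with the size of the archive.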

This post is already quite long, so I'll leave the code for archiving and retrieving text for the next post. In the third post I will explore how to use NoSQL and indexing to dramatically speed up working with archives of text.