Digital Data Collection course

Another year, another web scraping course. Taught through SSRMC at the University of Cambridge. Below are slides from all three sessions.

In the course I tried to achieve the following:
- Show how to connect R to resources online
- Use loops and functions to iteratively access online content
- Work with APIs
- Harvest data manually using XPath expressions

What's new?
- Many more examples and practice tasks
- Updated API usage
- Some bug fixes (and probably many new bugs introduced)

Better handling of JSON data in R?

What is the best way to read data in JSON format into R? Though really common for almost all modern online applications, JSON is not every R user's best friend. After seeing the slides for my Web Scraping course, in which I somewhat arbitrarily veered between using the packages rjson and RJSONIO, the creator of a third JSON package, Jeroen Ooms, urged me to reconsider my package selection process. So without further ado, is jsonlite any better? Does it get rid of the problem of seemingly infinitely nested lists?

As part of exploring digital data collection we used a range of sources that provide JSON data - from Wikipedia page views to social media sharing stats to YouTube Comments and real-time cricket scores. A persistent annoyance for students was navigating the JSON structure, typically translated into R as a list. Here is what my YouTube stats scraper looks like:

getStats <- function(id) {
    url <- paste0("", id, "?v=2&alt=json")
    raw <- readLines(url, warn = FALSE)
    rd <- fromJSON(paste(raw, collapse = ""))
    dop <- as.character(rd$entry$published)
    term <- rd$entry$category[[2]]["term"]
    label <- rd$entry$category[[2]]["label"]
    title <- rd$entry$title
    author <- rd$entry$author[[1]]$name
    duration <- rd$entry$`media$group`$`media$content`[[1]]["duration"]
    favs <- rd$entry$`yt$statistics`["favoriteCount"]
    views <- rd$entry$`yt$statistics`["viewCount"]
    dislikes <- rd$entry$`yt$rating`["numDislikes"]
    likes <- rd$entry$`yt$rating`["numLikes"]
    return(list(id, dop, term, label, title, author, duration, favs, views,
        dislikes, likes))
}


## [1] "Ya2elsR5s5s"
## [1] "2013-12-17T19:01:44.000Z"


Now, this is all fine, except that, upon closer inspection, the scraper function has to burrow deep into nested lists to extract the correct fields. We need backticks to accommodate names containing dollar signs, to name but one challenge.
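That backtick detail can be reproduced with a toy list, no API call required (the values here are invented):

```r
# a list whose element names contain dollar signs, as in the YouTube JSON
rd <- list(entry = list(`yt$statistics` = c(viewCount = "1234", favoriteCount = "7")))

# plain $ would misparse the name; backticks quote it
views <- rd$entry$`yt$statistics`["viewCount"]

# alternatively, [[ ]] with the name as a string avoids backticks entirely
views2 <- rd$entry[["yt$statistics"]]["viewCount"]
```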

Is this any easier using jsonlite?
id <- "Ya2elsR5s5s"
url <- paste0("", id, "?v=2&alt=json")
raw <- readLines(url, warn = FALSE)
rd <- fromJSON(paste(raw, collapse = ""))
term <- rd$entry$category$term[2]
label <- rd$entry$category$label[2]
title <- rd$entry$title
author <- rd$entry$author[1]
duration <- rd$entry$`media$group`$`media$content`$duration[1]

Is this any better? I'm not convinced there's much in it: because of the JSON structure used by the YouTube API, jsonlite can only coerce a few elements into data.frames, and these are still buried deep in the list structure. The object 'rd' contains a mix of named entities and data.frames, and in this case we have to do similar excavation to get at the interesting data.
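jsonlite's simplification is also configurable, which is easiest to see on a small self-contained example (toy JSON, not the actual API response):

```r
library(jsonlite)

txt <- '{"entry":{"category":[{"term":"a"},{"term":"b"}]}}'

# default: an array of objects collapses into a data.frame
rd <- fromJSON(txt)
rd$entry$category$term[2]              # "b"

# simplifyDataFrame = FALSE keeps the rjson-style nested lists
rd2 <- fromJSON(txt, simplifyDataFrame = FALSE)
rd2$entry$category[[2]]$term           # "b"
```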

What about social stats, e.g. facebook shares?

Here is my approach from the web scraping tutorials: first we construct the HTTP request, then we read the response using rjson:

fqlQuery = "select share_count,like_count,comment_count from link_stat where url=\""
url = ""
queryUrl = paste0("", fqlQuery, url, "\"")  # ignoring the callback part
lookUp <- URLencode(queryUrl)  # What do you think this does?
## [1] ",like_count,comment_count%20from%20link_stat%20where%20url=%22"
rd <- readLines(lookUp, warn = FALSE)

dat <- fromJSON(rd)
## $data
## $data[[1]]
## $data[[1]]$share_count
## [1] 388
## $data[[1]]$like_count
## [1] 430
## $data[[1]]$comment_count
## [1] 231
## $share_count
## [1] 388

How does jsonlite compare?

dat <- fromJSON(rd)
## $data
##   share_count like_count comment_count
## 1         388        430           231
## [1] 388

Is that better? Yes, I think jsonlite in this case offers a significant improvement.

What about writing to JSON?

Not long ago I did a bit of work involving exporting data from R for use in d3 visualisations. This data had to be in a nested JSON format, which I approximated through a (to me) rather complex process using split and lapply. Can jsonlite simplify this at all?
Possibly. Though my gut reaction is that creating nested data.frames is not much simpler than manually creating nested lists. I repeatedly used the split function to chop up the data into a nested structure. Once this was done, however, toJSON wrote very nice output:

"9" : {
  "33" : {
    "74" : [
      { "label" : "V155", "labs" : "Bird Flu and Epidemics" },
      { "label" : "V415", "labs" : "Fowl and Meat Industry" }
    ],
    "75" : [
      { "label" : "V166", "labs" : "Academics" },
      { "label" : "V379", "labs" : "Places Of Study and Investigation" }
    ],
    "76" : [
      { "label" : "V169", "labs" : "Space Exploration" },
      { "label" : "V261", "labs" : "Cosmonauts" }
    ]
  }
}
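For reference, here is a minimal version of that split-then-toJSON step, using invented toy data rather than the topic-model output:

```r
library(jsonlite)

# toy data in the flat form the topic-model output arrived in
df <- data.frame(group = c(9, 9, 33),
                 label = c("V155", "V415", "V166"),
                 labs  = c("Bird Flu", "Meat Industry", "Academics"))

# split() turns the flat table into a list of data.frames, one per group;
# toJSON then writes each group as a JSON array of objects
nested <- split(df[, c("label", "labs")], df$group)
toJSON(nested, pretty = TRUE)
```

Deeper hierarchies need repeated splits, which is exactly the fiddly part.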

My verdict: jsonlite makes saving a data.frame in JSON very easy indeed, and the fact that we can turn a data.frame seamlessly into a 'flat' JSON file is excellent. In many real-world situations the reason for using JSON in the first place (rather than, say, csv) is that a column/row structure is either inefficient or plain inappropriate. jsonlite is a welcome addition, though transporting data between R and JavaScript applications is not seamless just yet. The bottom line: great for simple cases; tricky structures remain tricky.

Seriously: does anyone know how to automatically create nested data frames or lists?

Web Scraping: working with APIs

APIs present researchers with a diverse set of data sources through a standardised access mechanism: send a pasted together HTTP request, receive JSON or XML in return. Today we tap into a range of APIs to get comfortable sending queries and processing responses.
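The request-construction half of that pattern needs no network connection to demonstrate. A sketch with a made-up endpoint and parameters (not any of the real APIs used in class):

```r
# build a query URL from a base endpoint and parameters (hypothetical endpoint)
base  <- "https://example.com/api/geocode"
place <- "Trinity College, Cambridge"
query <- paste0(base, "?address=", place, "&format=json")

# spaces are not legal in URLs; URLencode escapes them as %20
url <- URLencode(query)

# the response would then be fetched and parsed, e.g.
# rd <- jsonlite::fromJSON(readLines(url, warn = FALSE))
```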

These are the slides from the final class in Web Scraping through R: Web scraping for the humanities and social sciences

This week we explore how to use APIs in R, focusing on the Google Maps API. We then attempt to transfer this approach to query the Yandex Maps API. Finally, the practice section includes examples of working with the YouTube V2 API, a few 'social' APIs such as LinkedIn and Twitter, as well as APIs less off the beaten track (Cricket scores, anyone?).

I enjoyed teaching this course and hope to repeat and improve on it next year. When designing the course I tried to cram in everything I wish I had been taught early on in my PhD (resulting in information overload, I fear). Still, hopefully it has been useful to students getting started with digital data collection, showing on the one hand what is possible, and on the other giving some idea of key steps in achieving research objectives.

Download the .Rpres file to use in Rstudio here

A regular R script with code-snippets only can be accessed here

Slides from the first session here

Slides from the second session here

Slides from the third session here

UPDATE March 2015:
New 2015 version of slides here
PDFs of slides available here

Web Scraping: Scaling up Digital Data Collection

The latest slides from web scraping through R: Web scraping for the humanities and social sciences

Slides from the first session here

Slides from the second session here

Slides from the fourth and final session here

This week we look in greater detail at scaling up digital data collection: coercing scraper output into data frames, downloading files (along with a cursory look at the state of IP law), basic text manipulation in R, and a first look at working with APIs (share counts on Facebook).

Download the .Rpres file to use in Rstudio here

A regular R script with code-snippets only can be accessed here

UPDATE March 2015:
New 2015 version of slides here
PDFs of slides available here

Should Ukraine and Russia be united?

New adventures in d3, this time with some Ukrainian polling data.

According to a poll collected in the period 8-18 February, 13% of Ukrainians would wish for Ukraine to join Russia. The numbers all around seem pretty low, though newer data might present a different picture.

Source: KIIS

Web Scraping part2: Digging deeper

Slides from the second web scraping through R session: Web scraping for the humanities and social sciences

Slides from the first session here

...the third session here

... and the fourth and final session here

In which we make sure we are comfortable with functions, before looking at XPath queries to download data from newspaper articles. Examples include BBC News and Guardian comments.

Download the .Rpres file to use in Rstudio here

A regular R script with the code only can be accessed here

UPDATE March 2015:
New 2015 version of slides here
PDFs of slides available here

Detecting bots

This is part 3 of the series about Nashi bots. 
If I were doing this today I would approach the problem differently, but to my knowledge NodeXL is still a viable way of accessing the Twitter API.

Part 1 - theory
Part 2 - the leaked email correspondence

Detecting Nashi’s bots
Kambulov was hired to create and populate a number of social network accounts, and he apparently completed his task efficiently enough. His motivation, though, was to deliver a set number of user profiles rather than to ensure these appeared realistic upon close inspection. Consequently, the Nashi bots are easy to identify. The creation of online accounts leaves a multitude of traces: when the account was created, what email address or phone number was used to activate it, and of course links to other online users through correspondence or lists of 'friends' and 'followers'.

To give the semblance of being live accounts, followers had to be acquired for each bot. One way of doing this is by following a list of accounts, seeing which of these users 'follow back', and removing those which do not. The second way is to set the bot accounts to follow each other. One way the Nashi Twitter bots functioned was by automatically reposting content originally hosted by Nashi commissars. Consequently, bots may be identified by locating accounts that immediately reposted messages from Nashi commissars. Exploring the followers of these users reveals a mixture predominantly made up of follow-back accounts and bots.

Bot accounts may also be identified by clusters in account details, for instance user-name patterns or date of creation. Consequently, whole networks may be unravelled by identifying one bot, downloading lists of its followers, and identifying clusters or patterns in user details. The details of Twitter accounts may be accessed in numerous ways through the Twitter API. This is relatively simple with a library for Python or a package for R, but it can also be done in a user-friendly environment such as Excel. Using the NodeXL plugin I was able, in a matter of minutes, to identify a large botnet created in November-December 2011, linked to Kristina Potupchik and other Nashi commissars, and dormant until February 2012. A small part of the data is included below:

Notice how the usernames (Vertex) look randomly generated, how the ‘Location’ looks like it may be from a list of Russian names, while the ‘Joined Twitter’ times are very closely clustered.

Here is a clear example of efficient automation techniques being used. The simplest, most efficient automation techniques are both the most profitable for the person creating them, and the easiest to detect due to patterns left by lazy automation techniques. Many of the accounts I found in this network were well maintained, possibly using chatterbots or rewriters, but because they shared many characteristics with dormant, obviously fake accounts, they could easily be traced. Had Nashi themselves created powerful automation techniques, such patterns could have been removed altogether, and researchers would struggle to pinpoint fake accounts. They could also have isolated a set of bots used for spamming, from a set of bots they hoped would pass off as live users.

As Nashi expanded their online activity in an attempt to dominate an increasingly active online community they began to rely on technically skilled but less politically committed outsiders. The technical possibilities of macros and other forms of automation seem to have aided middlemen in defrauding their superiors, rather than in streamlining Nashi’s online activity. Considering that the higher levels of the pyramid exhibited the least understanding of automated techniques, there is no guarantee that mid-level actors who claimed to hire ‘internet lemmings’ actually did so.

Let’s return briefly to the robot analogy from the introduction: hiring bot services is like buying a robot pre-programmed to fulfil particular tasks. As the robot owner you are hopeful the product will work, but probably you don’t understand how it functions. If it malfunctions, you are unlikely to be able to fix it; if a task is too complicated for the robot, you won’t be able to reprogram it. Consequently, you will either have to buy a new robot, build a new robot, or forget the robot and pay someone to do the task. Nashi overwhelmingly favoured options one and three – they brought in technology from outside, and shaped it to be used through subordinates in campaigns where a few individuals simulated the activity of many.
In conclusion, there is no evidence that Nashi invested in creating sustainable, hard to detect bots. The correspondence indicates that the campaign was seen as a one off, and that it expanded a pre-existing program where activists were paid to promote Putin online. Nashi’s focus was to create and disseminate online content; in so doing they borrowed techniques from the internet shadow economy, but they wanted a high standard maintained in their online footprint, and consequently preferred to pay humans to simulate grass-root support for the organisation online. Bot activity was either unsanctioned, or provided by outsiders, which resulted in a diverse and poorly disguised online trace. Consequently it is easy to trace Nashi’s ‘dead souls’.

Protests in Ukraine 18-20 February

The map below, based on data from here, shows where the people killed in the protests of 18-20 February came from.* I can't vouch for the accuracy of the data. Apparently most of the protesters killed in Kiev came from the central regions and Lviv Oblast', though almost all regions are represented in the list of fatalities.

Mouse over for info about numbers and people.

The data is changing all the time - the list is twice as long as it was 24 hours ago. I will try to keep the visualization up to date.

Updated 22/02/2014

[*] People with missing 'place of origin' are not included. I removed one entry which looked like a duplicate. Age may be off by up to a year.

I transliterated names using a slightly modified Russian language transliteration scheme. I will be delighted to take corrections:


Web-Scraping: the Basics

Slides from the first session of my course about web scraping through R: Web scraping for the humanities and social sciences

Includes an introduction to the paste function, working with URLs, functions and loops.
Putting it all together we fetch data in JSON format about Wikipedia page views from
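The paste-and-loop pattern the session builds up to can be sketched like this (the endpoint here is a placeholder, not the actual source):

```r
# build one URL per article title (base URL is a made-up placeholder)
base   <- "http://stats.example.com/en/"
titles <- c("Russia", "Ukraine", "Belarus")

# sapply applies the paste0 call to every title, returning a named character vector
urls <- sapply(titles, function(t) paste0(base, "201401/", t))

# each URL would then be fetched with readLines() and parsed with fromJSON()
```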

Solutions here:

Download the .Rpres file to use in Rstudio here

Slides from part two can be seen here

Slides from part three here

Slides from the fourth and final session here

UPDATE March 2015:
New 2015 version of slides here
PDFs of slides available here

Slides from part two can be seen here

Plugging hierarchical data from R into d3

Here I show how to convert tabulated data into a JSON format that can be used in d3 graphics. The motivation for this was an attempt at getting an overview of topic models (link). Illustrations like the one to the right are very attractive; my motivation to learn how to make them was that the radial layout sometimes saves a lot of space - in my case when visualising tree diagrams. But this type of layout is hard to do in R. d3 can be used with data in both csv and json formats, and has a method, 'nest', to convert tabular data into a hierarchical structure. When I started out with d3, though, this was all over my head, and this post shows how to make the conversion from tabular data to json in R.

This post has three parts:
1) I map topics about Stalin to illustrate how this approach can be used to visualise topic models
2) I go through a function to shape data for use in d3 illustrations
3) I end with variations on how to show complexity in topic models

Visualising Structure in Topic Models

How exactly should we visualise topic models to get an overview of how topics relate to each other? This post is a brief lit review of that debate - I realise the subject matter is sooo last year. I also present my chosen solution to the dilemma: I use dendrograms to position topics, and add a network visualisation using an arcplot to expose linkages between subjects that frequently co-occur without being correlated.

For details of what topic models are, read Ted Underwood's blog posts here, and Matthew Jockers' Macroanalysis. I wrote a little bit about it elsewhere, so I will go straight for the jugular:

Topic Modelling Media Coverage of Memory Conflicts

Ostensibly this is a blog about memory conflict. It has become more of a repository of script snippets and visualisations, but here I get back to my roots and apply topic modelling to the representation of memory in the Russian media. 

Topic models are discussed really well elsewhere, and rather superficially by me here. In my topic model for the Russian media over the period of 2003-2013 I found seven or eight topics about history and memory. One of them was clearly about Katyn and about Stalinist repression.

Top Seven Tips for Processing 'Foreign' Text in Python (2.7)

Following on from my guide to making R play nice with utf-8, here is a seven-step guide to understanding Python's handling of unicode. Trust me, if you work with non-latin characters, you need to know this stuff:

[deleted post on] d3 visualisations of the GDELT data

I accidentally deleted my post on visualising the GDELT data using d3, and because it was really fiddly to make blogger display it properly in the first place, I won't reupload it. Instead here is the intro along with the RMD file (compile it in R using knitr). With any luck, running the code will generate a nice d3 visualisation where you can click shiny buttons to make data appear or disappear at will. I have not maintained this script since the early days of GDELT, so there's every likelihood it will need some tinkering to work with the latest data. 

[start of deleted post]
Below I take the example of the GDELT data to demonstrate how Python can very quickly slice data into manageable chunks, which in turn can be formatted with R and visualised using rCharts to create interactive d3 visualisations. I don't think I present anything new here, but rather give a quick demo of how easy it can be to peek into relatively big data. Regarding GDELT specifically, I've included a quick-fix for importing the details of the event-codes into the data, which helps clarify some specifics. Some users of the GDELT data have apparently found it a bit impenetrable, a bit dense, because of the sheer quantity of data. With any luck this post should lower this barrier to getting started with GDELT.

But more generally I utilise the rCharts package which draws on d3 to make interactive charts. This makes getting an overview very much easier, as irrelevant event categories can easily be hidden. The workflow below is much more efficient than it would be if the whole process was done in just one language. Timed on my old laptop the 3.3 million events were filtered and made interactive in about 30 seconds.

Databases for text analysis: archive and access texts using SQL

This post is a collection of scripts I've found useful for integrating a SQL database into more complex applications. SQL allows quickish access to largish repositories of text (I wrote about this at some length here), and is a good starting point for taking textual analysis beyond thousands of texts.
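A minimal version of that setup, using DBI and RSQLite (in-memory here; pointing dbConnect at a file path makes the archive persistent; the article texts are invented):

```r
library(DBI)

# an in-memory SQLite database stands in for the on-disk text archive
con <- dbConnect(RSQLite::SQLite(), ":memory:")

texts <- data.frame(id   = 1:3,
                    date = c("2011-12-05", "2012-03-04", "2012-05-07"),
                    body = c("election protests", "presidential vote", "inauguration"))
dbWriteTable(con, "articles", texts)

# retrieve only the rows needed, without loading the whole archive into memory
hits <- dbGetQuery(con, "SELECT id, body FROM articles WHERE body LIKE '%vote%'")
dbDisconnect(con)
```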

The Challenge of not-quite-Gargantuan Data (and why DH needs SQL)

I felt strangely belittled when Andrew Goldstone tweeted about a recent blog entry:

Not even medium sized? But... but... my archive is really big! I am working on more than a million texts! Of course, he is right - and it occurs to me that medium sized data such as mine is in its own way quite tricky to handle: small enough to be archived on a laptop, too big to fit into memory. An archive of this size creates the illusion of being both Big and easily manageable - when in reality it is neither.

This post, the first in a series of three, explains why I decided to use a database of texts. The second post will explore how to archive and retrieve data from a SQL database, while the third will introduce how to use indexes to keep textual data at arms length and facilitate quick information retrieval.

Scaling up text processing and Shutting up R: Topic modelling and MALLET

In this post I show how a combination of MALLET, Python, and data.table means we can analyse quite Big data in R, even though R itself buckles when confronted by textual data. 

Topic modelling is great fun. Using topic modelling I have been able to separate articles about the 'Kremlin' as a) a building, b) an international actor c) the adversary of the Russian opposition, and d) as a political and ideological symbol.  But for topic modelling to be really useful it needs to see a lot of text. I like to do my analysis in R, but R tends to disagree with large quantities of text. Reading digital humanities literature lately, I suspect I am not alone in confronting this four-step process, of which this post is a result:

1) We want more data to get better results (and really just because it's available)
2) More data makes our software crash. Sometimes spectacularly, and particularly if it's R.
3) Ergo we need new or more efficient methods.
4) We write a blog post about our achievements in reaching step 3
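The data.table step looks roughly like this, with an invented doc-topic table standing in for MALLET's output:

```r
library(data.table)

# stand-in for MALLET's doc-topics output: one row per document-topic weight
dt <- data.table(doc   = c(1, 1, 2, 2, 3),
                 topic = c("kremlin", "opposition", "kremlin", "sport", "kremlin"),
                 prop  = c(0.6, 0.4, 0.7, 0.3, 0.9))

# data.table aggregates millions of such rows far faster than base R:
# mean weight of each topic across documents
topic_means <- dt[, .(mean_prop = mean(prop)), by = topic]
```

With real MALLET output the table would come from data.table's fread, which copes with files far larger than read.table manages.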

Failures in Gephi

I'm sure most users of Gephi have had moments where they stumble across incredible visualizations that make no apparent sense. I have found the most spectacular failures often to give the most jaw-dropping results:

This is a graph mapping grammatical similarities between 4000 random Russian news articles; links in gold occurred at election time, while dark red are all other articles. It seems to form a single long chain of connectivity that makes no sense, apart from on a grammatical level, and even there the links are pretty spurious.

Fun simulating Wimbledon in R and Python

R and Python have different strengths. There's little you can do in R that you absolutely can't do in Python and vice versa, but there's a lot of stuff that's really annoying in one and nice and simple in the other. I'm sure simulations can be run in R, but it seems frightfully tricky. Recently I wrote a simple tennis simulator in Python, which follows all the tennis rules and allows player skill to be entered. It would print running scores as the game went, or, if asked to, would run a large number of matches and calculate win percentages. I quickly found that the structure of tennis is such that marginal gains are really valuable, as only a small increase in skill translated into a large increase in the number of matches won. How about mapping this? What does the relationship between skill and tennis matches won look like? Where exactly is the cut-off point of skill, below which winning is not just unlikely, but impossible? Does increasing the 'serve bonus', meaning service holds are very likely, improve or reduce the odds for the underdog?
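A stripped-down version of such a simulator does fit in a few lines of R (scoring simplified to a single game, with the point-win probability p as the skill parameter):

```r
# simulate one tennis game: first to 4 points, at least 2 clear
sim_game <- function(p) {
  a <- 0; b <- 0
  while (TRUE) {
    if (runif(1) < p) a <- a + 1 else b <- b + 1
    if (a >= 4 && a - b >= 2) return(TRUE)   # player wins
    if (b >= 4 && b - a >= 2) return(FALSE)  # opponent wins
  }
}

# small per-point edges translate into much larger per-game edges
set.seed(1)
mean(replicate(10000, sim_game(0.55)))  # noticeably above 0.55
```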

Topic maps

I've been exploring ways of calculating the subject matter discussed by Russian newspapers recently. In the end I settled on using TF-IDF (Term Frequency - Inverse Document Frequency) keywords on a large (read: near exhaustive) database of the Russian press. Taking each keyword as a node, and each keyword pair that occurs in a document as an edge (I collected the top ten keywords for each text), quite nice topic maps can be created.
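The keywords-to-edges step can be sketched on toy keyword lists (the document contents here are invented):

```r
# top keywords per document (toy data); each within-document pair becomes an edge
keywords <- list(doc1 = c("putin", "election", "protest"),
                 doc2 = c("putin", "election"),
                 doc3 = c("hockey", "putin"))

# combn() enumerates all keyword pairs within a document
edges <- do.call(rbind, lapply(keywords, function(k) {
  if (length(k) < 2) return(NULL)
  t(combn(sort(k), 2))
}))

# repeated pairs become heavier edges in the topic map
edge_weights <- table(paste(edges[, 1], edges[, 2], sep = "--"))
```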

Putin's bot army - part two: Nashi's online campaign (and undesirable bots)

Part one of 'Putin's bot army' examined what bots are, and how they have been used to simulate online interaction. In part two I explore in detail the email correspondence between Nashi activists to demonstrate how they conducted their pro-Putin campaign. Nashi were happy to pay people to simulate online engagement with pro-Putin content. However, it emerges from the correspondence that the greater technical expertise existed at the lower echelons of Nashi's organisational structure, resulting in a tension between superiors who wanted meticulous manual labour, and employees who used automation techniques to take shortcuts when completing tasks.

Nashi’s use of bots
How is it that pro-Kremlin ‘dead-souls’ populated Russian social media and news outlets at the time of the 2011 and 2012 elections? Some insights come from a series of emails leaked in February 2012. The correspondence between Nashi press-secretary Kristina Potupchik, Nashi founder and then head of Rosmolodezh’ Vasilii Yakemenko, and other Nashi activists revealed the murky details of Nashi’s activity, much of which was online.

A giant step for man...

no wait, 'a giant step for me'... and a massively underwhelming one for mankind: the Memory at War Project newsletter features a write-up I did on detecting Memory Events in Russian press coverage of the August Putsch. If that sounds like your sort of thing, I understand you can read about it here, unabridged and all:

Not that I made the front-page. I'll tell myself it was only a beauty-contest, anyway!

Simulating skill in the Premier League: part1

I love sports. Every week I watch Tottenham play, and just as regularly I go through the emotional roller-coaster that this entails. As a sports fan I use the first-person plural to describe 'my' team, and am convinced we need to keep Gareth Bale, just as I hold the conviction that Tottenham will bottle it come May. But at the same time, I get frustrated when pundits bring out stats along the lines of 'Wigan haven't lost in their last 5 trips to X' (OK, that would be quite a good stat), or 'Liverpool haven't won without Suarez since February', or 'player X has scored in five games in a row, wow is he on fire'. I am convinced sports pundits highlight 'trends' that are actually almost all random. Of course skill plays a big part in sport, but when everyone has skill, how important is chance? Below I demonstrate that doubling skill levels, rather than guaranteeing victory, results in only a 50% increase in points gained.

big geo-data visualisations

Spotting international conflict is very easy with the GDELT data set, combined with ggplot and R. The simple gif above shows snapshots of Russian/Soviet activity from January 1980 and January 2000. I think it also illustrates how Russia nowadays looks more to the east and south than it did during the Cold War. The trend, though not very strong above, gets even clearer by the end of the 2000s.

This blog post is really two in one: first I argue the GDELT data is great for the grand overviews it allows, and second I present a few tentative thoughts about the coding scheme that allows us to slice up the data. Specifically, I think a structure has been chosen that maximises the number of entries, and as a result, also the number of false positives. I would be delighted to be proven wrong, though - I think this is a fantastic tool!

I wanted to go one step further than the gif above, so I made an animation of all the events in the GDELT dataset featuring Russia. That's 3.3 million entries, each mapped 12 times (for blur).

Mapping the GDELT data in R (and some Russian protests, too)

In this post I show how to select relevant bits of the GDELT data in R and present some introductory ideas about how to visualise it as a network map. I've included all the code used to generate the illustrations. Because of this, if you're here for the shiny visualisations, you'll have to scroll way down.

The Guardian recently published an article linking to a database of 250 million events. Sounds too good to be true, but as I'm writing a PhD on recent Russian memory events, I was excited to try it out. I downloaded the data, generously made available by Kalev Leetaru of the University of Illinois, and got going. It's a large 650mb zip file (4.6gb uncompressed!), and this is apparently the abbreviated version. Consequently this early stage of the analysis was dominated by eager anticipation, as the Cambridge University internet did its thing.

plot textual differences in Shiny

Wordclouds such as Wordle are pretty rubbish, so I thought I'd try to make a better one, one that actually produces (statistically) meaningful results. I was so happy with the outcome I decided to make it interactive, so go on, have a play!

Compare any two text files (it turns out file uploading in Shiny is pretty experimental/dysfunctional), and graphically map the differences between them. The application will stem the files, remove stop words, and calculate statistical significance, all in a few clicks. Using the controls below you can also change the text size, plot title, and the positioning of the terms (to avoid overlap), add transparency, and change the number of words plotted.

Better modelling and visualisation of newspaper count data


In this post I outline how count data may be modelled using a negative binomial distribution, in order to present trends in time-series counts more accurately than linear methods allow. I also show how to use ANOVA to identify the point at which one model gains explanatory power, and how confidence intervals may be calculated and plotted around the predicted values. The resulting illustration gives a robust visualisation of how the Beslan hostage crisis has taken on features of a memory event.
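A sketch of those modelling steps with simulated data (MASS provides glm.nb; the counts here are invented, not the newspaper data from the post):

```r
library(MASS)

# simulated weekly article counts with an upward trend and overdispersion
set.seed(42)
weeks  <- 1:100
counts <- rnbinom(100, mu = exp(0.5 + 0.02 * weeks), size = 2)

# negative binomial regression handles the overdispersion Poisson cannot
m1 <- glm.nb(counts ~ 1)
m2 <- glm.nb(counts ~ weeks)

# ANOVA shows whether adding the trend term gains explanatory power
anova(m1, m2)

# predicted values with standard errors, for confidence bands around the fit
pr    <- predict(m2, type = "link", se.fit = TRUE)
upper <- exp(pr$fit + 1.96 * pr$se.fit)
lower <- exp(pr$fit - 1.96 * pr$se.fit)
```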
