Quantifying Memory
Digital Data Collection course
In the course I tried to achieve the following:
- Show how to connect R to resources online
- Use loops and functions to iteratively access online content
- Work with APIs
- Harvest data manually using XPath expressions
What's new?
- Many more examples and practice tasks
- Updated API usage
- Some bug fixes (and probably many new bugs introduced)
Better handling of JSON data in R?
What is the best way to read JSON-formatted data into R? Though ubiquitous in modern online applications, JSON is not every R user's best friend. After seeing the slides for my Web Scraping course, in which I somewhat arbitrarily veered between the rjson and RJSONIO packages, the creator of a third JSON package, Jeroen Ooms, urged me to reconsider my package selection process. So, without further ado: is jsonlite any better? Does it get rid of the problem of seemingly infinitely nested lists?
As part of exploring digital data collection we used a range of sources that provide JSON data - from Wikipedia page views to social media sharing stats, YouTube comments and real-time cricket scores. A persistent annoyance for students was navigating the JSON structure, typically translated into R as a deeply nested list. Here is what my YouTube stats scraper looks like:
getStats <- function(id) {
  url <- paste0("https://gdata.youtube.com/feeds/api/videos/", id, "?v=2&alt=json")
  raw.data <- readLines(url, warn = "F")
  rd <- fromJSON(raw.data)
  dop <- as.character(rd$entry$published)
  term <- rd$entry$category[[2]]["term"]
  label <- rd$entry$category[[2]]["label"]
  title <- rd$entry$title
  author <- rd$entry$author[[1]]$name
  duration <- rd$entry$`media$group`$`media$content`[[1]]["duration"]
  favs <- rd$entry$`yt$statistics`["favoriteCount"]
  views <- rd$entry$`yt$statistics`["viewCount"]
  dislikes <- rd$entry$`yt$rating`["numDislikes"]
  likes <- rd$entry$`yt$rating`["numLikes"]
  return(list(id, dop, term, label, title, author, duration, favs, views, dislikes, likes))
}

getStats("Ya2elsR5s5s")
[[1]]
[1] "Ya2elsR5s5s"
[[2]]
[1] "2013-12-17T19:01:44.000Z"
etc.
Now, this is all fine, except that, on closer inspection, the scraper has to burrow deep into nested lists to extract each field. We need backticks to accommodate names containing dollar signs, to name but one challenge.
Is this any easier using jsonlite?
require(jsonlite)
id <- "Ya2elsR5s5s"
url <- paste0("https://gdata.youtube.com/feeds/api/videos/", id, "?v=2&alt=json")
raw.data <- readLines(url, warn = "F")
rd <- fromJSON(raw.data)
term <- rd$entry$category$term[2]
label <- rd$entry$category$label[2]
title <- rd$entry$title
author <- rd$entry$author[1]
duration <- rd$entry$`media$group`$`media$content`$duration[1]
Is this any better? I'm not convinced there's much in it: because of the JSON structure used by the YouTube API, jsonlite can only coerce a few elements into data.frames, and these are still buried deep in the list structure. The object 'rd' contains a mix of named entities and data.frames, and in this case we have to do similar excavation to get at the interesting data.
What about social stats, e.g. facebook shares?
Here is my approach from the web scraping tutorials: first we construct the HTTP request, then we read the response using rjson.
fqlQuery = "select share_count,like_count,comment_count from link_stat where url=\"" url = "http://www.theguardian.com/world/2014/mar/03/ukraine-navy-officers-defect-russian-crimea-berezovsky" queryUrl = paste0("http://graph.facebook.com/fql?q=", fqlQuery, url, "\"") #ignoring the callback part lookUp <- span=""> URLencode(queryUrl) #What do you think this does? lookUp ->
## [1] "http://graph.facebook.com/fql?q=select%20share_count,like_count,comment_count%20from%20link_stat%20where%20url=%22http://www.theguardian.com/world/2014/mar/03/ukraine-navy-officers-defect-russian-crimea-berezovsky%22"
require(rjson)
rd <- readLines(lookUp, warn = "F")
dat <- fromJSON(rd)
dat
## $data
## $data[[1]]
## $data[[1]]$share_count
## [1] 388
##
## $data[[1]]$like_count
## [1] 430
##
## $data[[1]]$comment_count
## [1] 231
dat$data[[1]]["share_count"]
## $share_count
## [1] 388
How does jsonlite compare?
require(jsonlite)
dat <- fromJSON(rd)
dat
## $data
##   share_count like_count comment_count
## 1         388        430           231
dat$data$share_count
## [1] 388
Is that better? Yes, I think jsonlite in this case offers a significant improvement.
What about writing to JSON?
Not long ago I did a bit of work involving exporting data from R for use in d3 visualisations. This data had to be in a nested JSON format, which I approximated through a (to me) rather complex process using split and lapply. Can jsonlite simplify this at all?
Possibly. Though my gut reaction is that creating nested data.frames is not much simpler than manually creating nested lists. I repeatedly used the split function to chop the data up into a nested structure. Once that was done, however, toJSON wrote very nice output:
"9" : { "33" : { "74" : [ { "label" : "V155", "labs" : "Bird Flu and Epidemics" }, { "label" : "V415", "labs" : "Fowl and Meat Industry" } ], "75" : [ { "label" : "V166", "labs" : "Academics" }, { "label" : "V379", "labs" : "Places Of Study and Investigation" } ], "76" : [ { "label" : "V169", "labs" : "Space Exploration" }, { "label" : "V261", "labs" : "Cosmonauts" } ] }
My verdict: jsonlite makes saving a data.frame as JSON very easy indeed, and the fact that we can turn a data.frame seamlessly into a 'flat' JSON file is excellent. In many real-world situations the reason for using JSON in the first place (rather than, say, CSV) is that a columns-and-rows structure is either inefficient or plain inappropriate. jsonlite is a welcome addition, though transporting data between R and JavaScript applications is not seamless just yet. The bottom line: great for simple cases; tricky structures remain tricky.
Seriously: does anyone know how to automatically create nested data frames or lists?
Web Scraping: working with APIs
These are the slides from the final class in Web Scraping through R: Web scraping for the humanities and social sciences
This week we explore how to use APIs in R, focusing on the Google Maps API. We then attempt to transfer this approach to query the Yandex Maps API. Finally, the practice section includes examples of working with the YouTube v2 API, a few 'social' APIs such as LinkedIn and Twitter, and some APIs further off the beaten track (cricket scores, anyone?).
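As a flavour of the approach, here is a minimal sketch of the kind of geocoding request covered in the slides. The endpoint and parameters reflect the Google Maps geocoding API as it was at the time of the course; the current API requires a key, so treat this as an illustration of the build-URL-then-parse pattern rather than a working recipe.

require(jsonlite)

address <- "Cambridge, UK"
url <- paste0("http://maps.googleapis.com/maps/api/geocode/json?address=",
              URLencode(address))

geo <- fromJSON(url)            # jsonlite reads straight from the URL
geo$results$geometry$location   # a data.frame with lat and lng columns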
I enjoyed teaching this course and hope to repeat and improve on it next year. When designing the course I tried to cram in everything I wish I had been taught early on in my PhD (resulting in information overload, I fear). Still, hopefully it has been useful to students getting started with digital data collection, showing on the one hand what is possible, and on the other giving some idea of key steps in achieving research objectives.
Download the .Rpres file to use in Rstudio here
A regular R script with code-snippets only can be accessed here
Slides from the first session here
Slides from the second session here
Slides from the third session here
UPDATE March 2015:
New 2015 version of slides here
PDFs of slides available here
Web Scraping: Scaling up Digital Data Collection
Slides from the first session here
Slides from the second session here
Slides from the fourth and final session here
This week we look in greater detail at scaling up digital data collection: coercing scraper output into data.frames, downloading files (along with a cursory look at the state of IP law), basic text manipulation in R, and a first look at working with APIs (share counts on Facebook).
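As a taster, here is a minimal sketch of the first of those steps: coercing scraper output into a data.frame. It assumes a scraper like the getStats() function above, which returns one list of fields per video ID (and since the v2 YouTube API has been retired, it illustrates the pattern rather than being runnable today).

ids <- c("Ya2elsR5s5s", "Ya2elsR5s5s")    # vector of video IDs to look up
stats <- lapply(ids, getStats)            # one list of fields per ID

# flatten each list to a character vector, stack them, and name the columns
df <- as.data.frame(do.call(rbind, lapply(stats, function(x)
  unlist(lapply(x, as.character)))), stringsAsFactors = FALSE)
names(df) <- c("id", "published", "term", "label", "title", "author",
               "duration", "favs", "views", "dislikes", "likes")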
Download the .Rpres file to use in Rstudio here
A regular R script with code-snippets only can be accessed here
UPDATE March 2015:
New 2015 version of slides here
PDFs of slides available here
Should Ukraine and Russia be united?
According to a poll conducted between 8 and 18 February, 13% of Ukrainians would want Ukraine to join Russia. The numbers all around seem pretty low, though newer data might present a different picture.
Source: KIIS
Data
Web Scraping part2: Digging deeper
Slides from the first session here
...the third session here
... and the fourth and final session here
In which we make sure we are comfortable with functions, before looking at XPath queries to download data from newspaper articles. Examples include BBC News articles and Guardian comments.
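For a flavour of the XPath pattern used in the slides, here is a minimal sketch with the XML package. The URL and the expression are placeholders: real pages need selectors matched to their actual markup, and news sites change their templates often.

require(XML)

url  <- "http://www.bbc.co.uk/news/world-europe-26333587"   # placeholder article
page <- htmlParse(readLines(url, warn = FALSE), asText = TRUE)

# extract the text of every paragraph node on the page
paragraphs <- xpathSApply(page, "//p", xmlValue)
head(paragraphs)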
Download the .Rpres file to use in Rstudio here
A regular R script with the code only can be accessed here
UPDATE March 2015:
New 2015 version of slides here
PDFs of slides available here
Detecting bots
If I were doing this now I would approach the problem differently, but to my knowledge NodeXL is still a viable way of accessing the Twitter API.
Part 1 - theory
Part 2 - the leaked email correspondence
Detecting Nashi’s bots
Protests in Ukraine 18-20 February
Mouse over for info about numbers and people.
The data is changing all the time - the list is twice as long as it was 24 hours ago. I will try to keep the visualization up to date.
Updated 22/02/2014
[*] People with missing 'place of origin' are not included. I removed one entry which looked like a duplicate. Age may be off by up to a year.
I transliterated names using a slightly modified Russian language transliteration scheme. I will be delighted to take corrections:
a["Ё"]="Yo";a["Й"]="I";a["Ц"]="Ts";a["У"]="U";a["К"]="K";a["Е"]="E";a["Н"]="N";a["Г"]="G";a["Ш"]="Sh";a["Щ"]="Shch";a["З"]="Z";a["Х"]="Kh";a["Ъ"]="'";a["ё"]="yo";a["й"]="i";a["ц"]="ts";a["у"]="u";a["к"]="k";a["е"]="e";a["н"]="n";a["г"]="g";a["ш"]="sh";a["щ"]="shch";a["з"]="z";a["х"]="kh";a["ъ"]="'";a["Ф"]="F";a["Ы"]="Y";a["В"]="V";a["А"]="A";a["П"]="P";a["Р"]="R";a["О"]="O";a["Л"]="L";a["Д"]="D";a["Ж"]="Zh";a["Э"]="E";a["ф"]="f";a["ы"]="y";a["в"]="v";a["а"]="a";a["п"]="p";a["р"]="r";a["о"]="o";a["л"]="l";a["д"]="d";a["ж"]="zh";a["э"]="e";a["Я"]="Ya";a["Ч"]="Ch";a["С"]="S";a["М"]="M";a["И"]="I";a["Т"]="T";a["Ь"]="'";a["Б"]="B";a["Ю"]="Yu";a["я"]="ia";a["ч"]="ch";a["с"]="s";a["м"]="m";a["и"]="i";a["т"]="t";a["ь"]="'";a["б"]="b";a["ю"]="iu";a["є"]="e";a["Є"]="Ye";a["і"]="i";a["Ї"]="Yi";
Web-Scraping: the Basics
Includes an introduction to the paste function, working with URLs, and writing functions and loops.
Putting it all together, we fetch Wikipedia page-view data in JSON format from http://stats.grok.se/.
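Here is a minimal sketch of that pattern, with the caveat that stats.grok.se has since been shut down, so the URL and the daily_views field are as I remember the service rather than something you can run today:

require(rjson)

page  <- "Euromaidan"
url   <- paste("http://stats.grok.se/json/en/201402/", page, sep = "")
raw   <- readLines(url, warn = "F")
views <- fromJSON(raw)
sum(unlist(views$daily_views))   # total page views for the month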
Solutions here:
Download the .Rpres file to use in Rstudio here
Slides from part two can be seen here
Slides from part three here
Slides from the fourth and final session here
UPDATE March 2015:
New 2015 version of slides here
PDFs of slides available here
Plugging hierarchical data from R into d3
This post has three parts:
1) I map topics about Stalin to illustrate how this approach can be used to visualise topic models
2) I go through a function to shape data for use in d3 illustrations
3) I end with variations on how to show complexity in topic models
Visualising Structure in Topic Models
For details of what topic models are, read Ted Underwood's blog posts here, and Matthew Jockers' Macroanalysis. I wrote a little bit about them elsewhere, so I will go straight for the jugular:
Topic Modelling Media Coverage of Memory Conflicts
Topic models are discussed really well elsewhere, and rather superficially by me here. In my topic model of the Russian media over the period 2003-2013 I found seven or eight topics about history and memory. One of them was clearly about Katyn and Stalinist repression.
Top Seven Tips for Processing 'Foreign' Text in Python (2.7)
Following on from my guide to making R play nicely with UTF-8, here is a seven-step guide to understanding Python's handling of Unicode. Trust me, if you work with non-Latin characters, you need to know this stuff:
[deleted post on] d3 visualisations of the GDELT data
Databases for text analysis: archive and access texts using SQL
The Challenge of not-quite-Gargantuan Data (and why DH needs SQL)
So familiar…Dealing w/ R's habit of choking on not-even-medium data. MT @RolfFredheim: Shutting up R: http://t.co/XKr0gIGoNz via @dmimno
— Andrew Goldstone (@goldstoneandrew) October 30, 2013
Not even medium sized? But... but... my archive is really big! I am working on more than a million texts! Of course, he is right - and it occurs to me that medium sized data such as mine is in its own way quite tricky to handle: small enough to be archived on a laptop, too big to fit into memory. An archive of this size creates the illusion of being both Big and easily manageable - when in reality it is neither.
This post, the first in a series of three, explains why I decided to use a database of texts. The second post will explore how to archive and retrieve data from a SQL database, while the third will show how to use indexes to keep textual data at arm's length and facilitate quick information retrieval.
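To give a sense of where the series is heading, here is a minimal sketch of the archiving idea using RSQLite; the table and column names are invented for illustration:

require(RSQLite)

con <- dbConnect(SQLite(), "articles.sqlite")

# store one article per row; append as the scraper collects more
dbWriteTable(con, "articles",
             data.frame(id = 1, date = "2013-10-30",
                        text = "Full text of the article...",
                        stringsAsFactors = FALSE),
             append = TRUE)

# pull back only what is needed, rather than loading everything into memory
res <- dbGetQuery(con, "SELECT id, date FROM articles WHERE date > '2013-01-01'")
dbDisconnect(con)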
Scaling up text processing and Shutting up R: Topic modelling and MALLET
Failures in Gephi
This is a graph mapping grammatical similarities between 4000 random Russian news articles; links in gold are from election time, while those in dark red are from all other articles. It seems to form a single long chain of connectivity that makes no sense except on a grammatical level, and even there the links are pretty spurious.
Fun simulating Wimbledon in R and Python
Topic maps
Putin's bot army - part two: Nashi's online campaign (and undesirable bots)
A giant step for man...
Not that I made the front page. I'll tell myself it was only a beauty contest, anyway!
Simulating skill in the Premier League: part 1
big geo-data visualisations
Spotting international conflict is very easy with the GDELT data set, combined with ggplot and R. The simple GIF above shows snapshots of Russian/Soviet activity from January 1980 and January 2000. I think it also illustrates how Russia nowadays looks more to the east and the south than during the Cold War. The trend, though not very strong above, gets even clearer by the end of the 2000s.
I wanted to go one step further than the gif above, so I made an animation of all the events in the GDELT dataset featuring Russia. That's 3.3 million entries, each mapped 12 times (for blur).
Mapping the GDELT data in R (and some Russian protests, too)
In this post I show how to select relevant bits of the GDELT data in R and present some introductory ideas about how to visualise it as a network map. I've included all the code used to generate the illustrations; because of this, if you're here for the shiny visualisations, you'll have to scroll way down.
The Guardian recently published an article linking to a database of 250 million events. Sounds too good to be true, but as I'm writing a PhD on recent Russian memory events, I was excited to try it out. I downloaded the data, generously made available by Kalev Leetaru of the University of Illinois, and got going. It's a large 650MB zip file (4.6GB uncompressed!), and this is apparently the abbreviated version. Consequently this early stage of the analysis was dominated by eager anticipation, as the Cambridge University internet did its thing.
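To give an idea of the first step, here is a minimal sketch of pulling out the Russia-related rows. It assumes a tab-separated GDELT 1.0 extract with column names attached from the codebook (the files ship without a header row); gdelt_colnames stands in for that vector of names:

gdelt <- read.delim("gdelt_subset.tsv", header = FALSE, sep = "\t",
                    stringsAsFactors = FALSE)
names(gdelt) <- gdelt_colnames   # column names taken from the GDELT codebook

# keep events where either actor is coded as Russia
russia <- subset(gdelt, Actor1CountryCode == "RUS" | Actor2CountryCode == "RUS")
nrow(russia)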
Plot textual differences in Shiny
Wordclouds such as Wordle are pretty rubbish, so I thought I'd try to make a better one, one that actually produces (statistically) meaningful results. I was so happy with the outcome I decided to make it interactive, so go on, have a play!
Compare any two
Better modelling and visualisation of newspaper count data
In this post I outline how count data may be modelled using a negative binomial distribution, which captures trends in time-series counts more accurately than linear methods. I also show how to use ANOVA to identify the point at which one model gains explanatory power, and how confidence intervals may be calculated and plotted around the predicted values. The resulting illustration gives a robust visualisation of how the Beslan hostage crisis has taken on features of a memory event.
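Below is a minimal sketch of those steps, assuming a data frame counts with one row per month: n (number of articles), date, and a dummy period marking before/after the hypothesised break point; the variable names are mine, not from the original analysis.

require(MASS)

m1 <- glm.nb(n ~ date, data = counts)            # baseline trend
m2 <- glm.nb(n ~ date + period, data = counts)   # trend plus break point
anova(m1, m2)                                    # does the break add explanatory power?

# predict on the link scale, then exponentiate to get counts and a 95% CI
pred  <- predict(m2, type = "link", se.fit = TRUE)
fit   <- exp(pred$fit)
upper <- exp(pred$fit + 1.96 * pred$se.fit)
lower <- exp(pred$fit - 1.96 * pred$se.fit)

plot(counts$date, counts$n, pch = 16, col = "grey")
lines(counts$date, fit)
lines(counts$date, upper, lty = 2)
lines(counts$date, lower, lty = 2)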