Better handling of JSON data in R?


What is the best way to read data in JSON format into R? Though really common for almost all modern online applications, JSON is not every R user's best friend. After seeing the slides for my Web Scraping course, in which I somewhat arbitrarily veered between using the packages rjson and RJSONIO, the creator of a third JSON package, Jeroen Ooms, urged me to reconsider my package selection process. So without further ado, is jsonlite any better? Does it get rid of the problem of seemingly infinitely nested lists?

As part of exploring digital data collection we used a range of sources that provide JSON data - from Wikipedia page views to social media sharing stats to YouTube Comments and real-time cricket scores. A persistent annoyance for students was navigating the JSON structure, typically translated into R as a list. Here is what my YouTube stats scraper looks like:

getStats <- span=""> function(id) {
    url = paste0("https://gdata.youtube.com/feeds/api/videos/", id, "?v=2&alt=json")
    raw.data <- span=""> readLines(url, warn = "F")
    rd <- span=""> fromJSON(raw.data)
    dop <- span=""> as.character(rd$entry$published)
    term <- span=""> rd$entry$category[[2]]["term"]
    label <- span=""> rd$entry$category[[2]]["label"]
    title <- span=""> rd$entry$title
    author <- span=""> rd$entry$author[[1]]$name
    duration <- span=""> rd$entry$`media$group`$`media$content`[[1]]["duration"]
    favs <- span=""> rd$entry$`yt$statistics`["favoriteCount"]
    views <- span=""> rd$entry$`yt$statistics`["viewCount"]
    dislikes <- span=""> rd$entry$`yt$rating`["numDislikes"]
    likes <- span=""> rd$entry$`yt$rating`["numLikes"]
    return(list(id, dop, term, label, title, author, duration, favs, views,
        dislikes, likes))
}

(getStats("Ya2elsR5s5s"))


[[1]]
[1] "Ya2elsR5s5s"
[[2]]
[1] "2013-12-17T19:01:44.000Z"

etc.


Now, this is all fine, except that, upon closer inspection, the scraper function burrows into lists to extract the correct field. We use special ticks to accommodate names with dollar-signs in them, to name but one challenge.

Is this any easier using jsonlite?
require(jsonlite)
id = "Ya2elsR5s5s"
url = paste0("https://gdata.youtube.com/feeds/api/videos/", id, "?v=2&alt=json")
raw.data <- span=""> readLines(url, warn = "F")
rd <- span=""> fromJSON(raw.data)
term <- span=""> rd$entry$category$term[2]
label <- span=""> rd$entry$category$label[2]
title <- span=""> rd$entry$title
author <- span=""> rd$entry$author[1]
duration <- span=""> rd$entry$`media$group`$`media$content`$duration[1]


is this any better? I’m not convinced there's much in it: because of the JSON structure used by the YouTube API, jsonlite can only coerce a few elements into data.frames, and these are still buried deep in the list structure. The object 'rd' contains a mix of named entities and data.frames, and in this case we have to do similar excavation to get at interesting data.

What about social stats, e.g. facebook shares?


Here is my approach from the web scraping tutorials: first we construct the HTTP request, then we read the response using rjson


fqlQuery = "select share_count,like_count,comment_count from link_stat where url=\""
url = "http://www.theguardian.com/world/2014/mar/03/ukraine-navy-officers-defect-russian-crimea-berezovsky"
queryUrl = paste0("http://graph.facebook.com/fql?q=", fqlQuery, url, "\"")  #ignoring the callback part
lookUp <- span=""> URLencode(queryUrl)  #What do you think this does?
lookUp
## [1] "http://graph.facebook.com/fql?q=select%20share_count,like_count,comment_count%20from%20link_stat%20where%20url=%22http://www.theguardian.com/world/2014/mar/03/ukraine-navy-officers-defect-russian-crimea-berezovsky%22"
rd <- span=""> readLines(lookUp, warn = "F")

require(rjson)
dat <- span=""> fromJSON(rd)
dat
## $data
## $data[[1]]
## $data[[1]]$share_count
## [1] 388
## 
## $data[[1]]$like_count
## [1] 430
## 
## $data[[1]]$comment_count
## [1] 231
dat$data[[1]]["share_count"]
## $share_count
## [1] 388


How does jsonlite compare?

require(jsonlite)
dat <- span=""> fromJSON(rd)
dat
## $data
##   share_count like_count comment_count
## 1         388        430           231
dat$data$share_count
## [1] 388



Is that better? Yes, I think jsonlite in this case offers a significant improvement.



What about writing to JSON?

Not long ago I did a bit of work involving exporting data from R for use in d3 visualisations. This data had to be in a nested JSON format, which I approximated through a (to me) rather complex process using split and lapply. Can jsonlite simplify this at all?
Possibly. Though my gut reaction is that creating nested data.frames is not much simpler than manually creating creating nested lists. I repeatedly used the split function to chop up the data into a nested structure. Once this was done, however, toJSON wrote very nice output:

"9" : {
"33" : {
  "74" : [
    {
      "label" : "V155",
      "labs" : "Bird Flu and Epidemics"
    },
    {
      "label" : "V415",
      "labs" : "Fowl and Meat Industry"
    }
  ],
  "75" : [
    {
      "label" : "V166",
      "labs" : "Academics"
    },
    {
      "label" : "V379",
      "labs" : "Places Of Study and Investigation"
    }
  ],
  "76" : [
    {
      "label" : "V169",
      "labs" : "Space Exploration"
    },
    {
      "label" : "V261",
      "labs" : "Cosmonauts"
    }
  ]
}





My verdict: jsonlite makes saving a data.frame in JSON very easy indeed, and the fact we can turn a data.frame seamlessly into a 'flat' JSON file is excellent. In many real-world situations the reason for using JSON in the first place (rather than say csv) is that a columns/row structure is either inefficient or plain inappropriate. jsonlite is a welcome addition, though transporting data between R and javascript and applications is not seamless just yet. The bottom-line: great for simple cases; tricky structures remain tricky.



Seriously: does anyone know how to automatically created nested data frames or lists?



2 comments:

  1. I have also had a hard time creating those nested lists in R and output to the "flare.json" format. It's far more intuitive imho in JS, where I can easily move around the nodes and at each node determine whether to add a new node or keep moving down the tree by using push as needed, eg: new = {"name": node, "children": []}; I guess it makes sense that JSON would be easy to use in JS!

    ReplyDelete
  2. > Seriously: does anyone know how to automatically create[] nested data frames or lists?

    I have dealt with nested lists by using tons of lapply / rbind.fill / rbind.fill.matrix calls, and it works alright, albeit slowly in some circumstances. I share your skepticism about the ergonomics of manipulating data this way.

    My own impression is that JSON just adds tons of brackets in your R code, which is probably already riddled with 2-3-bracket-thick multiple function closures. It's not really helping much, in my limited experience of it.

    Still, it's absolutely great to have the I/O capability, principally for d3.js applications, and the vignette in jsonlite shows that it's doing many things better than the previous solutions.

    ReplyDelete