Better handling of JSON data in R?


What is the best way to read data in JSON format into R? Though really common for almost all modern online applications, JSON is not every R user's best friend. After seeing the slides for my Web Scraping course, in which I somewhat arbitrarily veered between using the packages rjson and RJSONIO, the creator of a third JSON package, Jeroen Ooms, urged me to reconsider my package selection process. So without further ado, is jsonlite any better? Does it get rid of the problem of seemingly infinitely nested lists?

As part of exploring digital data collection we used a range of sources that provide JSON data - from Wikipedia page views to social media sharing stats to YouTube Comments and real-time cricket scores. A persistent annoyance for students was navigating the JSON structure, typically translated into R as a list. Here is what my YouTube stats scraper looks like:

getStats <- span=""> function(id) {
    url = paste0("https://gdata.youtube.com/feeds/api/videos/", id, "?v=2&alt=json")
    raw.data <- span=""> readLines(url, warn = "F")
    rd <- span=""> fromJSON(raw.data)
    dop <- span=""> as.character(rd$entry$published)
    term <- span=""> rd$entry$category[[2]]["term"]
    label <- span=""> rd$entry$category[[2]]["label"]
    title <- span=""> rd$entry$title
    author <- span=""> rd$entry$author[[1]]$name
    duration <- span=""> rd$entry$`media$group`$`media$content`[[1]]["duration"]
    favs <- span=""> rd$entry$`yt$statistics`["favoriteCount"]
    views <- span=""> rd$entry$`yt$statistics`["viewCount"]
    dislikes <- span=""> rd$entry$`yt$rating`["numDislikes"]
    likes <- span=""> rd$entry$`yt$rating`["numLikes"]
    return(list(id, dop, term, label, title, author, duration, favs, views,
        dislikes, likes))
}

(getStats("Ya2elsR5s5s"))


[[1]]
[1] "Ya2elsR5s5s"
[[2]]
[1] "2013-12-17T19:01:44.000Z"

etc.


Now, this is all fine, except that, upon closer inspection, the scraper function burrows into lists to extract the correct field. We use special ticks to accommodate names with dollar-signs in them, to name but one challenge.

Is this any easier using jsonlite?
require(jsonlite)
id = "Ya2elsR5s5s"
url = paste0("https://gdata.youtube.com/feeds/api/videos/", id, "?v=2&alt=json")
raw.data <- span=""> readLines(url, warn = "F")
rd <- span=""> fromJSON(raw.data)
term <- span=""> rd$entry$category$term[2]
label <- span=""> rd$entry$category$label[2]
title <- span=""> rd$entry$title
author <- span=""> rd$entry$author[1]
duration <- span=""> rd$entry$`media$group`$`media$content`$duration[1]


is this any better? I’m not convinced there's much in it: because of the JSON structure used by the YouTube API, jsonlite can only coerce a few elements into data.frames, and these are still buried deep in the list structure. The object 'rd' contains a mix of named entities and data.frames, and in this case we have to do similar excavation to get at interesting data.

What about social stats, e.g. facebook shares?


Here is my approach from the web scraping tutorials: first we construct the HTTP request, then we read the response using rjson


fqlQuery = "select share_count,like_count,comment_count from link_stat where url=\""
url = "http://www.theguardian.com/world/2014/mar/03/ukraine-navy-officers-defect-russian-crimea-berezovsky"
queryUrl = paste0("http://graph.facebook.com/fql?q=", fqlQuery, url, "\"")  #ignoring the callback part
lookUp <- span=""> URLencode(queryUrl)  #What do you think this does?
lookUp
## [1] "http://graph.facebook.com/fql?q=select%20share_count,like_count,comment_count%20from%20link_stat%20where%20url=%22http://www.theguardian.com/world/2014/mar/03/ukraine-navy-officers-defect-russian-crimea-berezovsky%22"
rd <- span=""> readLines(lookUp, warn = "F")

require(rjson)
dat <- span=""> fromJSON(rd)
dat
## $data
## $data[[1]]
## $data[[1]]$share_count
## [1] 388
## 
## $data[[1]]$like_count
## [1] 430
## 
## $data[[1]]$comment_count
## [1] 231
dat$data[[1]]["share_count"]
## $share_count
## [1] 388


How does jsonlite compare?

require(jsonlite)
dat <- span=""> fromJSON(rd)
dat
## $data
##   share_count like_count comment_count
## 1         388        430           231
dat$data$share_count
## [1] 388



Is that better? Yes, I think jsonlite in this case offers a significant improvement.



What about writing to JSON?

Not long ago I did a bit of work involving exporting data from R for use in d3 visualisations. This data had to be in a nested JSON format, which I approximated through a (to me) rather complex process using split and lapply. Can jsonlite simplify this at all?
Possibly. Though my gut reaction is that creating nested data.frames is not much simpler than manually creating creating nested lists. I repeatedly used the split function to chop up the data into a nested structure. Once this was done, however, toJSON wrote very nice output:

"9" : {
"33" : {
  "74" : [
    {
      "label" : "V155",
      "labs" : "Bird Flu and Epidemics"
    },
    {
      "label" : "V415",
      "labs" : "Fowl and Meat Industry"
    }
  ],
  "75" : [
    {
      "label" : "V166",
      "labs" : "Academics"
    },
    {
      "label" : "V379",
      "labs" : "Places Of Study and Investigation"
    }
  ],
  "76" : [
    {
      "label" : "V169",
      "labs" : "Space Exploration"
    },
    {
      "label" : "V261",
      "labs" : "Cosmonauts"
    }
  ]
}





My verdict: jsonlite makes saving a data.frame in JSON very easy indeed, and the fact we can turn a data.frame seamlessly into a 'flat' JSON file is excellent. In many real-world situations the reason for using JSON in the first place (rather than say csv) is that a columns/row structure is either inefficient or plain inappropriate. jsonlite is a welcome addition, though transporting data between R and javascript and applications is not seamless just yet. The bottom-line: great for simple cases; tricky structures remain tricky.



Seriously: does anyone know how to automatically created nested data frames or lists?



7 comments:

  1. I have also had a hard time creating those nested lists in R and output to the "flare.json" format. It's far more intuitive imho in JS, where I can easily move around the nodes and at each node determine whether to add a new node or keep moving down the tree by using push as needed, eg: new = {"name": node, "children": []}; I guess it makes sense that JSON would be easy to use in JS!

    ReplyDelete
    Replies
    1. KARNA RASA HATI YANG GEMBIRA BERKAT BANTUAN AKI SOLEH
      MAKANYA SENGAJA NAMA BELIAU SAYA CANTUNKAN DI INTERNET !!!

      assalamualaikum wr, wb, saya IBU PUSPITA WATI saya Mengucapkan banyak2
      Terima kasih kepada: AKI SOLEH
      atas nomor togelnya yang kemarin AKI berikan "4D"
      alhamdulillah ternyata itu benar2 tembus AKI
      dan berkat bantuan AKI SOLEH saya bisa melunasi semua hutan2…
      orang tua saya yang ada di BANK BRI dan bukan hanya itu AKI alhamdulillah,
      sekarang saya sudah bisa bermodal sedikit untuk mencukupi kebutuhan keluarga saya sehari2.
      Itu semua berkat bantuan AKI SOLEH sekali lagi makasih banyak ya, AKI
      yang ingin merubah nasib
      seperti saya...?
      SILAHKAN GABUNG SAMA AKI SOLEH No; { 082-313-336-747 }

      Sebelum Gabung Sama AKI Baca Duluh Kata2 Yang Dibawah Ini
      Apakah anda termasuk dalam kategori di bawah ini...!!
      1: Di kejar2 tagihan hutang..
      2: Selaluh kalah dalam bermain togel
      3: Barang berharga sudah
      terjual buat judi togel..
      4: Sudah kemana2 tapi tidak
      menghasilkan, solusi yang tepat..!
      5: Sudah banyak dukun ditempati minta angka ritual blom dapat juga,
      satu jalan menyelesaikan masalah anda..
      Dijamin anda akan berhasil
      silahkan buktikan sendiri
      Atau Chat/Tlpn di WhatsApp (WA)
      No WA Aki : 082313336747

      TERIMA KASIH YANG PUNYA
      ROOM ATAS TUMPANGANYA SALAM KOMPAK SELALU
      "KLIK DISINI BOCORAN TOGEL SGP HK SDY DAN DLL"

      Delete
    2. KARNA RASA HATI YANG GEMBIRA BERKAT BANTUAN AKI SOLEH
      MAKANYA SENGAJA NAMA BELIAU SAYA CANTUNKAN DI INTERNET !!!

      assalamualaikum wr, wb, saya IBU PUSPITA WATI saya Mengucapkan banyak2
      Terima kasih kepada: AKI SOLEH
      atas nomor togelnya yang kemarin AKI berikan "4D"
      alhamdulillah ternyata itu benar2 tembus AKI
      dan berkat bantuan AKI SOLEH saya bisa melunasi semua hutan2…
      orang tua saya yang ada di BANK BRI dan bukan hanya itu AKI alhamdulillah,
      sekarang saya sudah bisa bermodal sedikit untuk mencukupi kebutuhan keluarga saya sehari2.
      Itu semua berkat bantuan AKI SOLEH sekali lagi makasih banyak ya, AKI
      yang ingin merubah nasib
      seperti saya...?
      SILAHKAN GABUNG SAMA AKI SOLEH No; { 082-313-336-747 }

      Sebelum Gabung Sama AKI Baca Duluh Kata2 Yang Dibawah Ini
      Apakah anda termasuk dalam kategori di bawah ini...!!
      1: Di kejar2 tagihan hutang..
      2: Selaluh kalah dalam bermain togel
      3: Barang berharga sudah
      terjual buat judi togel..
      4: Sudah kemana2 tapi tidak
      menghasilkan, solusi yang tepat..!
      5: Sudah banyak dukun ditempati minta angka ritual blom dapat juga,
      satu jalan menyelesaikan masalah anda..
      Dijamin anda akan berhasil
      silahkan buktikan sendiri
      Atau Chat/Tlpn di WhatsApp (WA)
      No WA Aki : 082313336747

      TERIMA KASIH YANG PUNYA
      ROOM ATAS TUMPANGANYA SALAM KOMPAK SELALU
      "KLIK DISINI BOCORAN TOGEL SGP HK SDY DAN DLL"

      Delete
  2. > Seriously: does anyone know how to automatically create[] nested data frames or lists?

    I have dealt with nested lists by using tons of lapply / rbind.fill / rbind.fill.matrix calls, and it works alright, albeit slowly in some circumstances. I share your skepticism about the ergonomics of manipulating data this way.

    My own impression is that JSON just adds tons of brackets in your R code, which is probably already riddled with 2-3-bracket-thick multiple function closures. It's not really helping much, in my limited experience of it.

    Still, it's absolutely great to have the I/O capability, principally for d3.js applications, and the vignette in jsonlite shows that it's doing many things better than the previous solutions.

    ReplyDelete
  3. This is so fun! What a great idea. Also I love how authentic you seem to be. Your style and passion for blogging is contagious. Thank you for sharing your life!
    kajal agarwal hot

    ReplyDelete

  4. Hi everybody, I hope you are doing good. Dear admin I had scroll down your site and found it really useful for web development tools. Since I am a developer, I most often refer people to use best website in the town for web development. Your efforts are really appreciable. We also provide web development tool. Here is the link of our site jsononline


    ReplyDelete
  5. I just found this blog and have high hopes for it to continue. Keep up the great work, its hard to find good ones. I have added to my favorites. Thank You دمج ملفات pdf

    ReplyDelete