In April, Hans Rosling examined the influence of religion on fertility. I used R to replicate a graphic from his talk:
> library(datamart)
> gm <- gapminder()
> #queries(gm)
> #
> # babies per woman
> tmp <- query(gm, "TotalFertilityRate")
> babies <- as.vector(tmp["2008"])
> names(babies) <- names(tmp)
> babies <- babies[!is.na(babies)]
> countries <- names(babies)
> #
> # income per capita, PPP adjusted
> tmp <- query(gm, "IncomePerCapita")
> income <- as.vector(tmp["2008"])
> names(income) <- names(tmp)
> income <- income[!is.na(income)]
> countries <- intersect(countries, names(income))
> #
> # religion
> tmp <- query(gm, "MainReligion")
> religion <- tmp[,"Group"]
> names(religion) <- tmp[,"Entity"]
> religion[religion==""] <- "unknown"
> colcodes <- c(
+ Christian="blue",
+ "Eastern religions"="red",
+ Muslim="green", "unknown"="grey"
+ )
> countries <- intersect(countries, names(religion))
> #
> # plot
> par(mar=c(4,4,0,0)+0.1)
> plot(
+ x=income[countries],
+ y=babies[countries],
+ col=colcodes[religion[countries]],
+ log="x",
+ xlab="Income per Person, PPP-adjusted",
+ ylab="Babies per Woman"
+ )
> legend(
+ "topright",
+ legend=names(colcodes),
+ fill=colcodes,
+ border=colcodes
+ )
One of the points Rosling wanted to make is that religion has little or no influence on fertility, while economic welfare does. I wonder whether demographers agree and take this economic effect into account.
If you want to know more about that gapminder function and that query method, read on.
The result of calling gapminder() is an object of (S4) class UrlData. The class defines a three-step query process. Each step can be customized. The steps are
- Map the resource parameter to a URL.
- Extract the data from the web (i.e. download it).
- Transform the data into a suitable R object.
There is also a scrape method that adds a fourth step and stores the extracted and transformed data in a local SQLite database. But that is the topic of another post and will not be covered here.
The gapminder function passes suitable parameters to the constructor for each of the steps.
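Conceptually, the query method chains these three steps roughly as follows. The sketch below only illustrates the idea and is not the actual datamart implementation; obj stands in for a plain list holding the pieces an UrlData object carries.
# Illustrative sketch only -- not the real datamart code
my_query <- function(obj, resource) {
  uri <- sprintf(obj$template, obj$map.lst[[resource]])  # 1. map the resource name to a URL
  raw <- obj$extract.fct(uri)                            # 2. extract, i.e. download the raw data
  obj$transform.fct(raw)                                 # 3. transform it into a suitable R object
}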
Map resource names to URLs
Gapminder's datasets are hosted at Google Spreadsheets. Each dataset has a URL like "https://docs.google.com/spreadsheet/pub?key=%s&output=csv", where %s is a unique but unmemorizable key. The constructor urldata offers two parameters to handle this situation:
> gm <- urldata(
+ template="https://docs.google.com/spreadsheet/pub?key=%s&output=csv",
+ map.lst=list(
+ "TotalFertilityRate"="phAwcNAVuyj0TAlJeCEzcGQ&gid=0",
+ "IncomePerCapita"="phAwcNAVuyj1jiMAkmq1iMg&gid=0"
+ )
+ )
When we provide these two parameters, then on query(gm, "IncomePerCapita") the URL is constructed by calling sprintf(template, map.lst[["IncomePerCapita"]]). (It is possible to provide several parameters or to provide a function map.fct instead of map.lst, but I will not go into that now.)
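Spelled out for the example above, the mapping step boils down to this sprintf call (the key is the one given in map.lst):
sprintf(
  "https://docs.google.com/spreadsheet/pub?key=%s&output=csv",
  "phAwcNAVuyj1jiMAkmq1iMg&gid=0"
)
# [1] "https://docs.google.com/spreadsheet/pub?key=phAwcNAVuyj1jiMAkmq1iMg&gid=0&output=csv"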
Extracting, i.e. downloading the data
By default, UrlData uses readLines for downloading the dataset. In this example, this fails, at least on Windows, since readLines does not support HTTPS. One solution proposed on Stack Overflow is to use RCurl::getURL with suitable parameters. Thus the object construction becomes:
> gm <- urldata(
+ #template="...", see above
+ #map.lst=list(TotalFertility="..."), see above
+ extract.fct=function(uri)
+ RCurl::getURL(
+ uri,
+ cainfo = system.file(
+ "CurlSSL",
+ "cacert.pem",
+ package = "RCurl"
+ )
+ )
+ )
Now, query(gm, "IncomePerCapita") returns the dataset as a vector of strings. Other use cases of UrlData may use fromJSON, readLines(gzcon(url(uri))), or pass an authentication object to getURL.
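As an illustration of that flexibility, an extract.fct for a gzip-compressed plain-text resource might look roughly like this (my own sketch, not code from the datamart package):
# Hypothetical extract.fct for a gzip-compressed text resource (sketch only)
extract_gz <- function(uri) {
  con <- gzcon(url(uri))  # wrap the URL connection with a gzip decompressor
  readLines(con)          # readLines opens, reads and closes the connection itself
}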
Transform the raw data to an R object
The last step of the query converts the raw data. It takes care of character encoding, separator and comment characters, and type conversions. In the gapminder example, a call to read.csv is performed, followed by numerical conversions (because the thousands separator is not detected automatically), and an xts object is returned. The conversion function is passed as the transform.fct parameter:
> gm <- urldata(
+ #template="...", see above
+ #map.lst=list(TotalFertility="..."), see above
+ #extract.fct=function(uri) ..., see above
+ transform.fct=function(x) {
+ dat <- read.csv(
+ textConnection(x),
+ na.strings=c("..", "-"),
+ stringsAsFactors=FALSE
+ )
+ # other steps omitted
+ }
+ )
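The omitted steps are roughly what one would expect: strip the thousands separator, coerce to numeric and build the time series object. The continuation below is my own reconstruction of that idea, not the code actually used in datamart; the column and date handling is illustrative only.
# Sketch of a possible continuation of transform.fct (not the actual datamart code)
transform_sketch <- function(x) {
  dat <- read.csv(
    textConnection(x),
    na.strings=c("..", "-"),
    stringsAsFactors=FALSE
  )
  # first column: country names; remaining columns: one per year ("X1800", "X1801", ...)
  num <- apply(dat[, -1], 2, function(col) as.numeric(gsub(",", "", col)))
  rownames(num) <- dat[[1]]
  years <- as.numeric(sub("^X", "", colnames(num)))
  # one row per year, one column per country, indexed by a yearly date
  xts::xts(t(num), order.by=as.Date(paste0(years, "-01-01")))
}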
With this last customization, the gm object works as in the opening example. This is how the gapminder function is defined in the datamart package.
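Putting the pieces of the previous sections together, the construction looks roughly like this. It is a condensed sketch; the real gapminder function defines more datasets (for example MainReligion) and a more careful transformation.
# Condensed sketch combining the snippets above (the real gapminder() covers more)
library(datamart)
library(RCurl)
gm <- urldata(
  template="https://docs.google.com/spreadsheet/pub?key=%s&output=csv",
  map.lst=list(
    "TotalFertilityRate"="phAwcNAVuyj0TAlJeCEzcGQ&gid=0",
    "IncomePerCapita"="phAwcNAVuyj1jiMAkmq1iMg&gid=0"
  ),
  extract.fct=function(uri)
    getURL(uri, cainfo=system.file("CurlSSL", "cacert.pem", package="RCurl")),
  transform.fct=function(x) {
    dat <- read.csv(textConnection(x), na.strings=c("..", "-"), stringsAsFactors=FALSE)
    dat  # numeric conversion and xts construction omitted, see the sketch above
  }
)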
Conclusion and other examples
The class UrlData introduced in this blog post aims to make it easy to access data from the web in a unified way. The class inherits from Xdata and hence takes advantage of the infrastructure provided by that class. I hope UrlData invites you to create your own data classes for other web data. I think the class is one step towards playable data. The other steps involve convenient storage of the scraped data, mashups of several data sources, and some tools that make use of the unified interface of Xdata.
As a proof of concept, I implemented other data objects. The following functions are part of the datamart package and can be inspected by looking at the source code, for example with showMethods("query", includeDefs=TRUE). Most of the examples use code snippets from the R blogosphere:
- mauna_loa, a simple example for CO2 data.
- tourdefrance, sports data collected by Martin Theus.
- sourceforge, access to the JSON stats API for a given project.
- gscholar, counting hits for given search terms, after Robert A. Muenchen.
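Each of these objects can then be explored with the same interface as gm in the opening example; assuming they behave like gm, something along these lines lists the available resources:
ml <- mauna_loa()  # CO2 data object from the datamart package
queries(ml)        # list the resources this object offers (cf. queries(gm) above)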