
2012-06-19

A wrapper for R's data() function

The workflow for statistical analyses is discussed in several places. Common recommendations are:

  • never change the raw data, but transform it,
  • keep your analysis reproducible,
  • separate functions and data,
  • use R's package system as the organizing structure.

In some recent projects I tried an S4 class approach for this workflow, which I want to present and discuss. It makes use of the package datamart, which I recently submitted to CRAN. Here is a sample session:

> library(datamart)
> library(beeswarm)
> # load one of my datasets
> xp <- expenditures()
> # introspection: what
> # "resources" for this
> # dataset did I once define?
> queries(xp)
Evs#Categories Evs#Elasticities   Evs#Elasticity 
"Categories"   "Elasticities"     "Elasticity" 
InternalData#Raw 
"Raw" 
> # get me a resource
> head(query(xp, "Raw"))
coicop2                                    coicop2de
1      15 Expenditures (exclusive private consumption)
2      15 Expenditures (exclusive private consumption)
3      15 Expenditures (exclusive private consumption)
4      15 Expenditures (exclusive private consumption)
5      15 Expenditures (exclusive private consumption)
6      15 Expenditures (exclusive private consumption)
income               hhtype value
1  (all)                (all)  2539
2  (all)               Single  1462
3  (all)         Single woman  1232
4  (all)           Single man  1866
5  (all)        Single parent  1004
6  (all) Single parent, 1 kid   991

Read on to see how an S4 dataset object is defined and accessed, and what I see as the arguments for and against this approach.

Here is a use case. An R user gets a data file, maybe a CSV or an SPSS SAV file, along with some questions to answer using this data. Instead of putting the data file and some script files into a folder and then sourcing the scripts or relying on R's workspace, here is a different approach.

Set up the project’s package structure

The proposed workflow uses R’s package structure to organize code, data, documentation and additional information. This has some advantages:

  • the directory structure stays the same across projects, so it is easier to find your way back into a project after a couple of months,
  • the packaging mechanism, especially the R CMD check command, guides the archiving process,
  • it is possible but not necessary to make the data and the analysis public,
  • there are tools that support the development of packages. Even base R has the package.skeleton function, and the add-on package devtools has a load_all function which loads a development version of a project’s package without having to detach, build, and attach it. (However, I am still using my own less mature function devlib as part of my .Rprofile for this.)
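
As a rough sketch of that last point, a project skeleton can be generated with package.skeleton. The package name evsdemo and the helper function evs_helper are made up for this illustration, and the skeleton is written to a temporary directory so nothing in the working directory is touched.

```r
# Hypothetical project name and helper; neither comes from the post.
proj_env <- new.env()
assign("evs_helper", function(x) x, envir = proj_env)

# package.skeleton() needs at least one object to put into the package.
package.skeleton(name = "evsdemo", list = "evs_helper",
                 environment = proj_env, path = tempdir(), force = TRUE)

pkg_dir <- file.path(tempdir(), "evsdemo")

# The data subdirectory is where a raw CSV and its loader script would go.
dir.create(file.path(pkg_dir, "data"), showWarnings = FALSE)
list.files(pkg_dir)

# Once the generated files are filled in, devtools::load_all(pkg_dir)
# would load the development version (requires devtools to be installed).
```

The generated DESCRIPTION and man pages are only placeholders and still need editing before R CMD check will pass.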

As an example I use a CSV file derived from the German Income and Expenditure Survey. This file evs2008.lvl2.csv is put in the data subdirectory. I add a file evs2008.lvl2.R that basically calls read.csv2 with the right parameters.
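
A minimal sketch of what such a loader script might contain follows. The file layout (semicolon separator, header row) and the choice to read coicop2 as character are assumptions, not taken from the actual file. To keep the sketch runnable on its own, it first writes a two-row stand-in file to a temporary directory; in the project, the script would read evs2008.lvl2.csv from the data subdirectory instead.

```r
# Stand-in file with two rows copied from the sample session above.
csv_path <- file.path(tempdir(), "evs2008.lvl2.csv")
writeLines(c(
  "coicop2;coicop2de;income;hhtype;value",
  "15;Expenditures (exclusive private consumption);(all);(all);2539",
  "15;Expenditures (exclusive private consumption);(all);Single;1462"
), csv_path)

# read.csv2 defaults to sep = ";" and dec = ",".
# Reading coicop2 as character keeps codes such as "00" intact (assumption).
evs2008.lvl2 <- read.csv2(csv_path, stringsAsFactors = FALSE,
                          colClasses = c(coicop2 = "character"))
str(evs2008.lvl2)
```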

Declare an S4 class for the dataset

The new step, and the point of this blog post, is that I now define an S4 class Evs that is derived from datamart::InternalData:

> setClass(
+   Class="Evs", 
+   contains="InternalData"
+ )

The InternalData class is itself derived from Xdata. Xdata defines two generics: query to access the data, and queries for introspection (more on that later). The InternalData class just adds a simple wrapper for the data function. On instantiation, the dataset is loaded into the object’s private environment. It can then be accessed by querying the Raw resource. So the usual call data(evs2008.lvl2) now becomes

> xp <- expenditures()
> evs2008.lvl2 <- query(xp, "Raw")

This divides the data loading process into two steps: the import from the data subdirectory, and the querying of the dataset. A second call to query(xp, "Raw") does not trigger the import again; instead, the already imported dataset is handed out. If the import is slow, this saves some time. The second advantage is that it is possible to define custom queries.
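
The two-step idea can be sketched in plain R without datamart. The class CachedData and the functions cached_data and get_raw are hypothetical names for this illustration, not the package's actual implementation, and the built-in cars dataset stands in for the project data.

```r
library(methods)

# Hypothetical stand-in for InternalData: a name plus a private environment.
setClass("CachedData", representation(name = "character", cache = "environment"))

# Step 1: the import happens once, when the object is created.
cached_data <- function(name, package = "datasets") {
  env <- new.env()
  data(list = name, package = package, envir = env)
  new("CachedData", name = name, cache = env)
}

# Step 2: querying only hands out the copy stored in the environment.
get_raw <- function(x) get(x@name, envir = x@cache)

cars_data <- cached_data("cars")
first  <- get_raw(cars_data)
second <- get_raw(cars_data)  # no second import takes place
identical(first, second)
```

Because the data lives in an environment owned by the object, it does not clutter the global workspace the way a plain data() call does.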

Define custom queries

Because Evs is an S4 class, it can be extended using inheritance and other object-oriented techniques. For instance, it is possible to create a parameterized query for expenditure categories in certain household types and/or income situations:

> setMethod(
+   f="query",
+   signature=c(self="Evs", resource=resource("Categories")),
+   definition=function(self, resource, income="(all)", hhtype="(all)", relative=TRUE, ...) {
+     
+     dat <- subset(query(self, "Raw"), coicop2 != "15" & coicop2 != "00")
+     income_lvls <- unique(dat$income)
+     if(!income %in% income_lvls) stop("invalid 'income' argument, expected one of '", paste(income_lvls, collapse="', '"), "'.")
+     hhtype_lvls <- unique(dat$hhtype)
+     if(!hhtype %in% hhtype_lvls) stop("invalid 'hhtype' argument, expected one of '", paste(hhtype_lvls, collapse="', '"), "'.")
+     dat <- dat[dat$income==income & dat$hhtype==hhtype,]
+     if(relative) dat <- transform(dat, value=value/sum(value))
+     res <- dat$value
+     names(res) <- dat$coicop2de
+     return(res)
+   }
+ )

This query provides a simple interface to subset, with some argument checking and an optional transformation. The second argument to query names the resource and is usually passed as a string. However, S4 dispatches on the class of an argument, not on its value, so it is not possible to dispatch on the content of the string. That is why a not-so-clean (and not very efficient) workaround was necessary. For each custom query, a new class derived from Resource must be declared; there is a convenience function resource for this. The query method dispatches on this class. If a string is passed as the resource argument, an object of the corresponding class is instantiated, and query is called again with this object as second argument. This is a somewhat internal detail, but it matters for the definition of custom queries: the second element of the signature should be defined with the help of the resource function.
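
The mechanism can be illustrated with a small self-contained sketch. It does not reproduce datamart's code: DemoResource, DemoData, demo_resource and demo_query are hypothetical names, and the naming scheme for the generated classes is an assumption made for this example.

```r
library(methods)

# Hypothetical stand-ins for datamart's Resource and data classes.
setClass("DemoResource", representation(name = "character"))
setClass("DemoData", representation(values = "numeric"))

# Declares a marker class for a resource name and returns the class name,
# so the call can be used directly inside a method signature.
demo_resource <- function(name) {
  cls <- paste0("DemoResource_", name)
  if (!isClass(cls)) setClass(cls, contains = "DemoResource")
  cls
}

setGeneric("demo_query",
           function(self, resource, ...) standardGeneric("demo_query"))

# String entry point: turn the string into a marker object, dispatch again.
setMethod("demo_query",
  signature = c(self = "DemoData", resource = "character"),
  definition = function(self, resource, ...) {
    cls <- paste0("DemoResource_", resource)
    if (!isClass(cls)) stop("unknown resource '", resource, "'")
    demo_query(self, new(cls, name = resource), ...)
  })

# A custom query: the signature is built with the helper function.
setMethod("demo_query",
  signature = c(self = "DemoData", resource = demo_resource("Mean")),
  definition = function(self, resource, ...) mean(self@values))

d <- new("DemoData", values = c(1, 2, 3, 6))
demo_query(d, "Mean")
```

Every string query costs one extra object creation and a second dispatch, which is the inefficiency mentioned above.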

Other queries may produce graphics, for example:

> setMethod(
+   f="query",
+   signature=c(self="Evs", resource=resource("Elasticity")),
+   definition=function(self, resource, categ="", xlab="", ylab="", main=NULL, ...) {
+     if(!require(beeswarm)) stop("could not load required package 'beeswarm'")
+     dat <- subset(query(self, "Raw"), coicop2 != "15" & coicop2 != "00" & income != "(all)")
+     cat_lvls <- unique(dat$coicop2)
+     if (!categ %in% cat_lvls) stop("invalid 'categ' argument, expected one of '", paste(cat_lvls, collapse="', '"), "'.")
+     
+     income_lvls <- c("lt900", "900to1300", "1300to1500", "1500t2000", "2000t2600",
+                     "2600t3600", "3600t5000", "5000t18000")
+     dat$income <- factor(dat$income, levels=income_lvls)
+     dat <- subset(dat, coicop2==categ)
+     if(is.null(main)) main <- dat[1, "coicop2de"]
+     
+     par(mar=c(2,2,3,0)+0.1)
+     beeswarm(value ~ income, data=dat, ylab=ylab, main=main, ylim=c(0, 1000), ...)
+     
+     return(dat)
+   }
+ )

Similarities and overlaps with other approaches

One “S3 way” to achieve something similar would be to create an S3 class evs for the data structure and then define new methods like categories.evs or elasticity.evs. The resources are thus defined by the method names. This works well as long as there is no more than one argument to dispatch on, and as long as all that is done with the data object is to query it.

The biomaRt package of the Bioconductor project is somewhat similar, for instance, with its useMart function. It is focused on certain bioinformatics databases.
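
For comparison, here is a minimal sketch of that S3 variant. The toy data frame only mimics the shape of the raw table; its category names and values are made up for illustration.

```r
# Toy data in the shape of the raw table (made-up values).
raw <- data.frame(
  coicop2de = c("Food", "Housing", "Food", "Housing"),
  hhtype    = c("(all)", "(all)", "Single", "Single"),
  value     = c(300, 700, 200, 600),
  stringsAsFactors = FALSE
)
evs <- structure(list(raw = raw), class = "evs")

# One generic per resource; the method name encodes the resource.
categories <- function(x, ...) UseMethod("categories")

categories.evs <- function(x, hhtype = "(all)", relative = TRUE, ...) {
  dat <- x$raw[x$raw$hhtype == hhtype, ]
  res <- dat$value
  if (relative) res <- res / sum(res)
  names(res) <- dat$coicop2de
  res
}

categories(evs, hhtype = "Single")
```

Note that there is no counterpart to queries here: finding out which resources exist means searching for methods defined for the class, for example with methods(class = "evs").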

The original inspiration for datamart comes from the CountryData function in Mathematica. This and similar xxxData functions in Mathematica provide introspection facilities like CountryData[], CountryData["Tags"], CountryData["Properties"] and deliver the data in various formats: as numbers, as polygons, etc.

Future directions

In addition to the InternalData class, there is already a class for spatial data that uses the sp package, and various classes for web APIs such as the MediaWiki API or SPARQL endpoints. More on those in future posts.

One next step I often think about is enabling the class to collect data. So maybe the next version will have a scrape method.

I also like the idea of using the uniform query interface and introspection to build something on top of it. I am thinking of a simple templating mechanism for creating Markdown reports or PowerPoint slides. Interactive SVG or Tk windows are another direction.
