2012-06-19

A wrapper for R's data() function

The workflow for statistical analyses is discussed in several places. It is often recommended to:

  • never change the raw data, but transform it,
  • keep your analysis reproducible,
  • separate functions and data,
  • use the R package system as organizing structure.

In some recent projects I tried an S4 class approach for this workflow, which I want to present and discuss. It makes use of the package datamart, which I recently submitted to CRAN. Here is a sample session:

library(datamart) # version 0.5 or later

# load one of my datasets
xp <- expenditures()

# introspection: what "resources"
# for this dataset did I once define?
queries(xp)

# get me a resource
head(query(xp, "evs2008.lvl2"))

Read on to see how an S4 dataset object is defined and accessed, and what I see as speaking for and against this approach.

<!-- more -->

Here is a use case: an R user develops some R functions that depend on data that does not change very often, and decides to put the functions and the dataset into an R package. The approach described below wraps both the functions and the data into one object.

As an example I use a CSV file derived from the German Income and Expenditure Survey. This file, evs2008.lvl2.csv, is placed in the data subdirectory. I add a file evs2008.lvl2.R that basically calls read.csv2 with the right parameters. The example is part of the datamart package.
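Such a loader file might look like the following minimal sketch. The read.csv2 parameters shown here are assumptions for illustration, not the actual file shipped with the package; data() sources an .R file in the data subdirectory with that directory as working directory.

# data/evs2008.lvl2.R -- sourced by data()
# NOTE: the parameters below are assumptions, adjust to the CSV at hand
evs2008.lvl2 <- read.csv2(
    "evs2008.lvl2.csv",
    stringsAsFactors = FALSE
)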

Wrapping the dataset into an S4 object

The new step, and the point of this blog post, is that I now define an S4 object of class datamart::InternalData2 for the dataset:

dat <- internalData2(resource="evs2008.lvl2", package="datamart")

The InternalData2 class is derived from Xdata. Xdata defines two generics: a method query to access the data and a method queries for introspection (more on that later). The InternalData2 class just adds a simple wrapper around the data function. On instantiation, the dataset is loaded into the object's private environment. It can then be accessed by querying the evs2008.lvl2 resource. So the usual call data(evs2008.lvl2) now becomes

query(dat, "evs2008.lvl2")

This divides the data loading process into two steps: the import from the data subdirectory, and the querying of the dataset. A second call query(dat, "evs2008.lvl2") does not trigger the import again; instead, the already imported dataset is handed out. If the import takes some time, this caching saves it.
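The caching behaviour can be observed with system.time; the actual timings depend on the dataset, but the second call should be faster since it skips the import:

# first call: imports the dataset into the private environment
system.time(d1 <- query(dat, "evs2008.lvl2"))
# second call: handed out from the environment, no import
system.time(d2 <- query(dat, "evs2008.lvl2"))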

Define custom queries

Now if we want to, for example, create a parameterized query for expenditure categories in certain household types and/or income situations, we first create the function

evs.categories <- function(self, resource, income="(all)", hhtype="(all)", relative=TRUE, ...) {
    dat <- subset(query(self, "evs2008.lvl2"), coicop2 != "15" & coicop2 != "00")
    income_lvls <- unique(dat$income)
    if(!income %in% income_lvls) stop("invalid 'income' argument, expected one of '", paste(income_lvls, collapse="', '"), "'.")
    hhtype_lvls <- unique(dat$hhtype)
    if(!hhtype %in% hhtype_lvls) stop("invalid 'hhtype' argument, expected one of '", paste(hhtype_lvls, collapse="', '"), "'.")
    dat <- dat[dat$income==income & dat$hhtype==hhtype,]
    if(relative) dat <- transform(dat, value=value/sum(value))
    res <- dat$value
    names(res) <- dat$coicop2de
    return(res)
}

This function provides a simple interface to subset, with some argument checking and an optional transformation. The second argument to query specifies the resource name; usually it is passed as a string.
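Once the function is registered under a resource name (the mashup call below takes care of that), it can be invoked like any other resource. The argument values here are just the defaults from the signature:

# parameterized query; income and hhtype are checked
# against the levels present in the dataset
query(xp, "categories", income="(all)", hhtype="(all)")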

Other functions may produce graphics, for example

evs.elasticity <- function(self, resource, categ="", xlab="", ylab="", main=NULL, ...) {
    dat <- subset(query(self, "evs2008.lvl2"), coicop2 != "15" & coicop2 != "00" & income != "(all)")
    cat_lvls <- unique(dat$coicop2)
    if (!categ %in% cat_lvls) stop("invalid 'categ' argument, expected one of '", paste(cat_lvls, collapse="', '"), "'.")
    income_lvls <- c("lt900", "900to1300", "1300to1500", "1500t2000", "2000t2600",
                    "2600t3600", "3600t5000", "5000t18000")
    dat$income <- factor(dat$income, levels=income_lvls)
    dat <- subset(dat, coicop2==categ)
    if(is.null(main)) main <- dat[1, "coicop2de"]
    par(mar=c(2,2,3,0)+0.1)
    boxplot(value ~ income, data=dat, ylab=ylab, main=main, ylim=c(0, 1000), ...)
    return(dat)
}

Now, in order to wrap data and functions together, we use the mashup function, which is also contained in the datamart package:

expenditures <- function() mashup(
    evs2008.lvl2=internalData2(resource="evs2008.lvl2", package="datamart"),
    categories=evs.categories,
    elasticity=evs.elasticity
)

The function expenditures thus defined now works as in the opening example.

Similarities and overlaps with other approaches

One “S3 way” to achieve something similar would be to create an S3 class evs for the data structure and then define new methods like categories.evs or elasticity.evs. The resources are thus defined by the method names. This works well as long as there is no more than one argument to dispatch on, and as long as all that is done with the data object is querying it.
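For comparison, here is a minimal sketch of that S3 alternative; the evs class and the data field are hypothetical:

# one generic per resource, dispatching on the object's class
categories <- function(x, ...) UseMethod("categories")
categories.evs <- function(x, income="(all)", hhtype="(all)", ...) {
    # ... subset and transform x$data as evs.categories does above ...
}

# the data object carries the class attribute
# xp <- structure(list(data=evs2008.lvl2), class="evs")
# categories(xp)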

The biomaRt package from the Bioconductor project is somewhat similar, for instance with its useMart function, but it is focused on certain bioinformatics databases.

The original inspiration for the datamart package comes from the CountryData function in Mathematica. This and similar xxxData functions in Mathematica provide introspection facilities like CountryData[], CountryData["Tags"], CountryData["Properties"] and deliver the data in various formats: as numbers, as polygons, etc.

Future directions

In addition to the InternalData2 class, there is already a class for web APIs such as the MediaWiki API or SPARQL endpoints. More on those in future posts.

Another direction is to support writing and updating datasets. A wrapper for data() works for read-only datasets that are part of a package, but it does not seem a good idea to update data in installed packages. Another class is needed for that.

I also like the idea of using the uniform query interface and introspection to build something on top of it, for example a simple templating mechanism for creating Markdown reports or PowerPoint slides. Interactive SVG or Tk windows are another direction.
