The workflow for statistical analyses is discussed in many places. Common recommendations include:
- never change the raw data; transform it instead,
- keep your analysis reproducible,
- separate functions and data,
- use the R package system as an organizing structure.
In some recent projects I tried an S4 class approach to this workflow, which I want to present and discuss here. It makes use of the package datamart, which I recently submitted to CRAN. Here is a sample session:
library(datamart)  # version 0.5 or later

# load one of my datasets
xp <- expenditures()

# introspection: what "resources" did I once define for this dataset?
queries(xp)

# get me a resource
head(query(xp, "evs2008.lvl2"))
Read on to see how an S4 dataset object is defined and accessed, and what I see in favour of and against this approach.
Here is a use case: an R user develops some functions that depend on data that do not change very often, and decides to put the functions and the dataset into an R package. The approach described below wraps both the functions and the data into one object.
As an example I use a CSV file derived from the German Income and Expenditure Survey. This file, evs2008.lvl2.csv, is put in the data subdirectory. I add a file evs2008.lvl2.R that basically calls read.csv2 with the right parameters. The example is part of the datamart package.
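Such a file in the data subdirectory might look roughly like the following sketch. The column names and values are made up for illustration (the real survey file is larger); the point is that data() sources the .R file and picks up the object it creates, and that read.csv2 handles the semicolon-separated, comma-decimal format common in German data:

```r
# Hypothetical sketch of data/evs2008.lvl2.R -- column names are illustrative.
# A temporary file stands in for the real "evs2008.lvl2.csv".
csv <- tempfile(fileext = ".csv")
writeLines(c("coicop2;coicop2de;income;hhtype;value",
             "01;Nahrungsmittel;(all);(all);300,5"), csv)

# read.csv2 defaults to sep=";" and dec=",", so "300,5" is read as 300.5
evs2008.lvl2 <- read.csv2(csv, stringsAsFactors = FALSE)
evs2008.lvl2$value
```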
Wrapping the dataset into an S4 object
The new step, and the point of this blog post, is that I now define an S4 object of class datamart::InternalData2 for the dataset:
dat <- internalData2(resource="evs2008.lvl2", package="datamart")
The InternalData2 class is itself derived from Xdata. Xdata defines two generics: a method called query to access the data, and a method queries for introspection (more on that later). The InternalData2 class just adds a simple wrapper around the data function. On instantiation, the dataset is loaded into the object's private environment. It can then be accessed by querying the evs2008.lvl2 resource. So the usual call data(evs2008.lvl2) now becomes
query(dat, "evs2008.lvl2")
This divides the data loading process into two steps: the import from the data subdirectory, and the querying of the dataset. On a second call such as query(xp, "evs2008.lvl2"), the import does not take place again; the already imported dataset is handed out. If the import is slow, this caching saves time.
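The load-once behaviour can be sketched in plain base R with a closure over a private environment. This is a toy illustration of the idea, not datamart's actual implementation:

```r
# A loader function is wrapped so that each resource is imported only once;
# later calls hand out the cached copy from the private environment.
make_cached_loader <- function(loader) {
  cache <- new.env(parent = emptyenv())
  function(resource) {
    if (!exists(resource, envir = cache, inherits = FALSE)) {
      assign(resource, loader(resource), envir = cache)
    }
    get(resource, envir = cache)
  }
}

loads <- 0
q <- make_cached_loader(function(res) { loads <<- loads + 1; toupper(res) })
q("evs2008.lvl2")  # first call: runs the (possibly slow) import
q("evs2008.lvl2")  # second call: served from the cache
loads              # the import ran only once
```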
Define custom queries
Now if we want, for example, to create a parameterized query for expenditure categories in certain household types and/or income situations, we first create the function:
evs.categories <- function(self, resource, income="(all)", hhtype="(all)", relative=TRUE, ...) {
  dat <- subset(query(self, "evs2008.lvl2"), coicop2 != "15" & coicop2 != "00")
  income_lvls <- unique(dat$income)
  if(!income %in% income_lvls) stop("invalid 'income' argument, expected one of '", paste(income_lvls, collapse="', '"), "'.")
  hhtype_lvls <- unique(dat$hhtype)
  if(!hhtype %in% hhtype_lvls) stop("invalid 'hhtype' argument, expected one of '", paste(hhtype_lvls, collapse="', '"), "'.")
  dat <- dat[dat$income==income & dat$hhtype==hhtype,]
  if(relative) dat <- transform(dat, value=value/sum(value))
  res <- dat$value
  names(res) <- dat$coicop2de
  return(res)
}
This function provides a simple interface to subset with some argument checking and an optional transformation. The second argument to query defines the resource name; usually it is passed as a string.
Other functions may produce graphics, for example
evs.elasticity <- function(self, resource, categ="", xlab="", ylab="", main=NULL, ...) {
  dat <- subset(query(self, "evs2008.lvl2"), coicop2 != "15" & coicop2 != "00" & income != "(all)")
  cat_lvls <- unique(dat$coicop2)
  if (!categ %in% cat_lvls) stop("invalid 'categ' argument, expected one of '", paste(cat_lvls, collapse="', '"), "'.")
  income_lvls <- c("lt900", "900to1300", "1300to1500", "1500t2000", "2000t2600",
                   "2600t3600", "3600t5000", "5000t18000")
  dat$income <- factor(dat$income, levels=income_lvls)
  dat <- subset(dat, coicop2==categ)
  if(is.null(main)) main <- dat[1, "coicop2de"]
  par(mar=c(2,2,3,0)+0.1)
  boxplot(value ~ income, data=dat, ylab=ylab, main=main, ylim=c(0, 1000), ...)
  return(dat)
}
Now, in order to wrap data and functions together, we use the mashup function, which is also contained in the datamart package:
expenditures <- function() mashup(
  evs2008.lvl2=internalData2(resource="evs2008.lvl2", package="datamart"),
  elasticities=evs.elasticities,
  categories=evs.categories,
  elasticity=evs.elasticity
)
The function expenditures thus defined now works as in the opening example.
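To make the dispatch mechanism concrete, here is a toy version of the mashup idea in plain base R. The names toy_mashup, toy_query and toy_queries are made up; datamart's real implementation uses S4 generics, but the principle is the same: a resource is either data (handed out directly) or a function taking (self, resource, ...):

```r
# A mashup-like container: named resources, queried by name.
toy_mashup <- function(...) {
  obj <- list(resources = list(...))
  class(obj) <- "toy_mashup"
  obj
}

# Dispatch on the resource name: call it if it is a function, return it if data.
toy_query <- function(self, resource, ...) {
  r <- self$resources[[resource]]
  if (is.function(r)) r(self, resource, ...) else r
}

# Introspection: which resources were defined?
toy_queries <- function(self) names(self$resources)

xp <- toy_mashup(
  evs2008.lvl2 = data.frame(coicop2 = "01", value = 100),
  categories   = function(self, resource, ...)
    toy_query(self, "evs2008.lvl2")$value
)
toy_queries(xp)              # "evs2008.lvl2" "categories"
toy_query(xp, "categories")  # 100
```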
Similarities and overlaps with other approaches
One “S3 way” to achieve something similar would be to create an S3 class evs for the data structure and then define new methods like categories.evs or elasticity.evs. The resources would thus be defined by the method names. This works well as long as there is no more than one argument to dispatch on, and as long as all that is done with the data object is querying it.
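For comparison, the S3 variant might look like this sketch, with a toy data frame standing in for the survey data (all names here are illustrative, not from any package):

```r
# A minimal S3 object of (hypothetical) class "evs" holding toy data.
evs_obj <- structure(
  list(data = data.frame(coicop2 = c("01", "02"),
                         value   = c(300, 100))),
  class = "evs")

# One generic per resource: the resource names live in the method names.
categories <- function(x, ...) UseMethod("categories")
categories.evs <- function(x, relative = TRUE, ...) {
  v <- x$data$value
  if (relative) v <- v / sum(v)
  setNames(v, x$data$coicop2)
}

categories(evs_obj)  # named shares: "01" -> 0.75, "02" -> 0.25
```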
The biomaRt package from the Bioconductor project is somewhat similar, for instance with its useMart function, but it is focused on certain bioinformatics databases.
The original inspiration for datamart comes from the CountryData function in Mathematica. This and similar xxxData functions in Mathematica provide introspection facilities such as CountryData[], CountryData["Tags"], and CountryData["Properties"], and deliver the data in various formats: as numbers, as polygons, etc.
Future directions
In addition to the InternalData2 class, there is already a class for web APIs such as the MediaWiki API or SPARQL endpoints. More on those in future posts.
Another direction is to support writing and updating datasets. A wrapper around data works for read-only datasets that are part of a package, but updating data in installed packages does not seem like a good idea. Another class is needed for that.
I also like the idea of using the uniform query interface and introspection to build something on top of it. I am thinking of a simple templating mechanism for creating markdown reports or PowerPoint slides. Interactive SVG or Tk windows are another direction.