2012-06-24

Querying DBpedia from R

DBpedia is an extract of structured information from Wikipedia. The structured data can be retrieved with SPARQL, an SQL-like query language for RDF. There is already an R package for this kind of query, named SPARQL.
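
For an ad-hoc query, the SPARQL package alone is enough. Here is a minimal sketch; the endpoint URL and the example query are only illustrative and not the query behind the datamart package:

library(SPARQL)   # install.packages("SPARQL")

endpoint <- "http://dbpedia.org/sparql"
q <- "SELECT ?label
      WHERE { <http://dbpedia.org/resource/Hamburg> rdfs:label ?label .
              FILTER (lang(?label) = 'en') }"
res <- SPARQL(url=endpoint, query=q)
res$results   # a data.frame with one column, 'label'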

There is an S4 class Dbpedia in my datamart package that aims to support the creation of predefined, parameterized queries. Here is an example that retrieves data on the German federal states:

> library(datamart) # version 0.5 or later
> dbp <- dbpedia()

# see a list of predefined queries
> queries(dbp)
[1] "Nuts1"  "PlzAgs"

# lists Federal States
> head(query(dbp, "Nuts1"))[, c("name", "nuts", "gdp")]
                    name nuts    gdp
1                Hamburg  DE6  94.43
2      Baden-Württemberg  DE1 376.28
3 Mecklenburg-Vorpommern  DE8  35.78
4        Rheinland-Pfalz  DEB 107.63
5              Thüringen  DEG  49.87
6                 Berlin  DE3 101.40

It is straightforward to extend the Dbpedia class for further queries. More challenging in my opinion is to figure out useful queries. Some examples can be found at Bob DuCharme's blog, in the article by Jos van den Oever at kde.org, in a discussion on a mailing list and a tutorial at the W3C, at Kingsley Idehen's blog and at DBpedia's wiki.

2012-06-19

A wrapper for R's data() function

The workflow for statistical analyses is discussed in many places. Often, it is recommended to:

  • never change the raw data, but transform it,
  • keep your analysis reproducible,
  • separate functions and data,
  • use the R package system as an organizing structure.

In some recent projects I tried an S4 class approach for this workflow, which I want to present and discuss. It makes use of the package datamart, which I recently submitted to CRAN. Here is a sample session:

library(datamart) # version 0.5 or later

# load one of my datasets
xp <- expenditures()

# introspection: what "resources"
# did I once define for this dataset?
queries(xp)

# get me a resource
head(query(xp, "evs2008.lvl2"))

Read on to see how an S4 dataset object is defined and accessed, and what I see speaking for and against this approach.

<!-- more -->

Here is a use case. The R user develops some R functions that depend on data that do not change very often. The R user decides to put the functions and the dataset into an R package. The approach described below wraps both the functions and the data into one object.

As an example I use a CSV file derived from the German Income and Expenditure Survey. The file evs2008.lvl2.csv is put into the package's data subdirectory. I add a file evs2008.lvl2.R that basically calls read.csv2 with the right parameters. The example is part of the datamart package.
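
Sketched, the loader script might look like the following; the exact read.csv2 arguments are an assumption, not necessarily the ones used in the package:

# data/evs2008.lvl2.R -- loads the CSV that sits next to it; data() sources
# this file with the data directory as the working directory, so a relative
# path works
evs2008.lvl2 <- read.csv2("evs2008.lvl2.csv", stringsAsFactors=FALSE)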

Wrapping the dataset into an S4 object

The new step, and the point of this blog post, is that I now define an S4 object of class datamart::InternalData2 for the dataset:

dat <- internalData2(resource="evs2008.lvl2", package="datamart")

The InternalData2 class is itself derived from Xdata. Xdata defines two generics: a method called query to access the data and a method queries for introspection (more on that later). The InternalData2 class just adds a simple wrapper for the data function. On instantiation, the dataset is loaded into the object's private environment. It can then be accessed by querying the evs2008.lvl2 resource. So the usual call data(evs2008.lvl2) now becomes

query(dat, "evs2008.lvl2")

This divides the data loading process into two steps: the import from the data subdirectory, and the querying of the dataset. A second call query(dat, "evs2008.lvl2") would not trigger the import again; instead, the already imported dataset is handed out. If the import takes a while, this can save time.
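
To make the two steps visible, one can time two consecutive calls (illustrative only):

system.time(query(dat, "evs2008.lvl2"))   # first call: imports via data()
system.time(query(dat, "evs2008.lvl2"))   # second call: returns the cached data frame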

Define custom queries

Now if we want to, for example, create a parameterized query for expenditure categories in certain household types and/or income situations, we first create the function

evs.categories <- function(self, resource, income="(all)", hhtype="(all)", relative=TRUE, ...) {
    dat <- subset(query(self, "evs2008.lvl2"), coicop2 != "15" & coicop2 != "00")
    income_lvls <- unique(dat$income)
    if(!income %in% income_lvls) stop("invalid 'income' argument, expected one of '", paste(income_lvls, collapse="', '"), "'.")
    hhtype_lvls <- unique(dat$hhtype)
    if(!hhtype %in% hhtype_lvls) stop("invalid 'hhtype' argument, expected one of '", paste(hhtype_lvls, collapse="', '"), "'.")
    dat <- dat[dat$income==income & dat$hhtype==hhtype,]
    if(relative) dat <- transform(dat, value=value/sum(value))
    res <- dat$value
    names(res) <- dat$coicop2de
    return(res)
}

This function provides a simple interface to subset with some argument checking and an optional transformation. The second argument to query is the resource name that selects this function; usually it is passed as a string.

Other functions may produce graphics, for example

evs.elasticity <- function(self, resource, categ="", xlab="", ylab="", main=NULL, ...) {
    dat <- subset(query(self, "evs2008.lvl2"), coicop2 != "15" & coicop2 != "00" & income != "(all)")
    cat_lvls <- unique(dat$coicop2)
    if (!categ %in% cat_lvls) stop("invalid 'categ' argument, expected one of '", paste(cat_lvls, collapse="', '"), "'.")
    income_lvls <- c("lt900", "900to1300", "1300to1500", "1500t2000", "2000t2600",
                    "2600t3600", "3600t5000", "5000t18000")
    dat$income <- factor(dat$income, levels=income_lvls)
    dat <- subset(dat, coicop2==categ)
    if(is.null(main)) main <- dat[1, "coicop2de"]
    par(mar=c(2,2,3,0)+0.1)
    boxplot(value ~ income, data=dat, ylab=ylab, main=main, ylim=c(0, 1000), ...)
    return(dat)
}

Now, in order to wrap data and functions together, we use the mashup function, which is also contained in the datamart package:

expenditures <- function() mashup(
    evs2008.lvl2=internalData2(resource="evs2008.lvl2", package="datamart"),
    categories=evs.categories,
    elasticity=evs.elasticity
)

The function expenditures thus defined now works as in the opening example.
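
A short usage sketch; the category code "01" is a made-up example, the valid codes are the coicop2 values in the dataset:

xp <- expenditures()
queries(xp)                             # introspection, as in the opening example

# relative expenditure shares, using the default "(all)" arguments
query(xp, "categories", relative=TRUE)

# boxplot of one category by income class; "01" is a hypothetical coicop2 code
query(xp, "elasticity", categ="01")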

Similarities and overlaps with other approaches

One “S3 way” to achieve something similar would be to create an S3 class evs for the data structure and then define new methods like categories.evs or elasticity.evs. The resources are thus defined by the method names. This works well as long as there is no more than one argument to dispatch on, and as long as all that is done with the data object is querying it.
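
For comparison, such an S3 version might look like this (class and method names are hypothetical):

evs <- function(df) structure(list(data=df), class="evs")

categories <- function(x, ...) UseMethod("categories")
categories.evs <- function(x, income="(all)", hhtype="(all)", ...) {
    d <- x$data
    d[d$income == income & d$hhtype == hhtype, ]
}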

The biomaRt package of the Bioconductor project, with its useMart function for instance, is somewhat similar. It is, however, focused on certain bioinformatics databases.

The original inspiration for the datamart package comes from the CountryData function in Mathematica. This and similar xxxData functions in Mathematica provide introspection facilities such as CountryData[], CountryData["Tags"] and CountryData["Properties"], and deliver the data in various formats: as numbers, as polygons, etc.

Future directions

In addition to the InternalData2 class, there is already a class for web APIs such as the MediaWiki API or SPARQL endpoints. More on those in future posts.

Another direction to go is to support writing and updating datasets. A wrapper for data works for read-only datasets that are part of a package, but it does not seem a good idea to update data in installed packages. Another class is needed for that.

I also like the idea of using the uniform query interface and introspection to build something on top of it. I am thinking of a simple templating mechanism for creating markdown reports or PowerPoint slides. Interactive SVG or Tk windows are another direction.

2012-06-06

Is defrosting the freezer worth the trouble?

Short answer: No, not in my case.

Long answer: My refrigerator is about 5 years old, has a volume of 103 liters and a freezing compartment of 16 liters. Its energy class is “A”, and the label states a power consumption of 219 kWh per year.

I borrowed a power meter and measured consumption before and after defrosting. Here is a chart of the measured data. I measured only 5 values: one week before defrosting, 1 hour after turning the fridge on, and then 8, 23 and 47 hours after defrosting. The y-axis shows the average power consumption, i.e. the rate at which the fridge draws energy.


The data suggest that the power consumption averages 15.9 watts. The difference before and after defrosting is 0.5 watts. I do not want to attribute this to the defrosting: the outside temperature in the week before defrosting was higher, and I also used the fridge more often because of holidays, when I was at home. But even if defrosting did have this effect, these 0.5 watts add up to only 4.38 kWh, or about 1.50 EUR, per year.

Another insight was that 15.9 watts project to about 139 kWh per year, which is well below the 219 kWh stated on the energy label, even after 5 years. Maybe this is because I set the thermostat to a rather low level (2 out of 5), or maybe the consumption in summer will be much higher.
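
For reference, the arithmetic behind these numbers; the price per kWh is my assumption, chosen to reproduce the 1.50 EUR figure:

hours <- 365 * 24        # 8760 hours per year
0.5 / 1000 * hours       # 4.38 kWh per year from the 0.5 W difference
15.9 / 1000 * hours      # about 139 kWh per year at 15.9 W
4.38 * 0.34              # roughly 1.50 EUR at an assumed 0.34 EUR/kWh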

2012-06-05

AUM -- a Sensor to Track Your Computer Usage

A few days ago, I released a first beta version of the Application Usage Monitor AUM. It is a simple sensor for the Windows operating system that logs the window names you are actively using. The program thus enables you to analyse how much time you spend surfing (and where), and how much time you work productively.

So the sensor is somewhat similar to RescueTime or TimeDoctor (I did not try these programs, however). The difference is that AUM is a really small program written in C that does nothing more than collect the data. It leaves the analysis of the data up to you. Another difference is that AUM is free (as in GPL).

Installing the program is simple. Just download and run the installation program from SourceForge. You may choose to automatically start the sensor at login, which is recommended. Once started, the program logs the time, process name, process ID and window title to a semicolon-separated text file. You will also see a little blue butterfly in the system tray. Click on the icon to stop monitoring.

The default location for the log file is %APPDATA%\AUM\AUM.csv. You may provide a different file path on the command line. You can open the file with any text editor. There is also a Visual Basic Script that tries to import the data into Excel; this requires a working Excel installation on your PC. In the future, I plan to support analysis of the data from within R.
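
Until then, here is a sketch of how the log could be read into R; the column names are my guess from the description above, not a documented format:

logfile <- file.path(Sys.getenv("APPDATA"), "AUM", "AUM.csv")
aum <- read.table(logfile, sep=";", header=FALSE,
                  col.names=c("time", "process", "pid", "title"),
                  stringsAsFactors=FALSE)

# which processes appear most often in the log
head(sort(table(aum$process), decreasing=TRUE))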

Please give it a try! But be warned that the program is beta and not yet well tested. Please leave feedback in the comment box below.