ScraperWiki in R

ScraperWiki describes itself as an online tool for gathering, cleaning and analysing data from the web. The approach is programming-oriented: users implement ETL processes in Python, PHP or Ruby, share them with the community (or pay for privacy) and schedule automated runs. The software behind the service is open source, and there is also an offline version of the program library.

As far as I know, ScraperWiki has no R support yet. This is where the scrape method of my datamart package comes in. It provides an S4 generic for storing scraped data in an offline database. It is not a wiki (although collaboration can take place via r-forge or CRAN), has no scheduling support (scheduling the start of R scripts is better left to tools outside R) and is not intended as a web application.

Example use case — Berlin’s bathing water quality

Here is an example use case. It is (finally) summer in Berlin, and the warm weather invites a swim. For the health-aware and/or data-affine (or those with a strong imagination), the public authorities publish the latest water quality assessments for around 30 bathing areas as open data.

Only the last measurement is available online. If we were interested in collecting the quality measurements over time, perhaps to find a trend or to search for contributing factors, we would need a mechanism that regularly (in this case, weekly) scrapes the data via the online API and saves it locally.

With the framework proposed in datamart, once the details are defined, the mechanism is executed with just three lines:

> ds <- datastore("path/to/local.db")
> bq <- be_bathing_water_quality()
> scrape(bq, ds) # get & save the data

The rest of this blog post is about defining that process, which boils down to defining the be_bathing_water_quality function.

ETL process — extract, transform, load

The task of the be_bathing_water_quality function is to define a three-step process by passing the process details to the urldata function. In the database or data warehouse context this is often referred to as an ETL process. In our context, the steps are:

  • After mapping a resource name to a URL, the data is extracted, i.e. downloaded. If the network is unavailable, authentication fails, or similar, the process ends here.
  • The extracted data is then transformed into an object R can work with. Currently, only data.frame objects are supported; xts and igraph are envisioned. If the extracted data was not as expected and the transformation fails, the process ends here.
  • The tidy data is then stored (or loaded) into a local SQLite database.

The urldata function provides parameters for each of these steps.

Example: extract and load

In the bathing water quality example, there is one URL for all bathing areas. We map the resource BathingWaterQuality to this URL. The data is returned as JSON, which is why we use fromJSON as the extraction function. The data is then transformed into a data.frame. (We do not go into every detail of that transformation step.) Hence, the call to urldata is:

> be_bathing_water_quality <- function() urldata(
+   template="",
+   map.lst=list(BathingWaterQuality=""),
+   extract.fct=fromJSON,
+   transform.fct=function(x) {
+     tbl <- x[["index"]]
+     nm <- names(tbl[[1]])
+     dat <- matrix(NA, length(tbl), length(nm))
+     colnames(dat) <- nm
+     for(i in 1:length(tbl)) 
+        dat[i,nm] <- tbl[[i]][nm]
+     #some steps omitted...
+     return(dat)
+   }
+ )

Now we can create a data object, inspect it using the queries method, and access the latest measurements from the web using query:

> bq <- be_bathing_water_quality()
> queries(bq)
[1] "BathingWaterQuality"
> nrow(query(bq, "BathingWaterQuality"))
[1] 38

The urldata function, part of the datamart package, was described in more detail in an earlier post on the Gapminder datasets.

Example: load

In order to save the scraped data, one more step is necessary. The urldata function provides a parameter scrape.lst for specifying which resources to save locally, and how:

> be_bathing_water_quality <- function() urldata(
+   template="...", # see above
+   map.lst=list(BathingWaterQuality=""),
+   extract.fct=fromJSON,
+   #transform.fct= # see above
+   scrape.lst=list(
+     BathingWaterQuality=list(
+       uniq=c("id", "badname", "dat"), 
+       saveMode="u", 
+       dates="dat"
+     )
+   )
+ )

The name(s) of the scrape.lst argument must be a subset of the names of map.lst. The entries of the list are passed as arguments to dbWriteTable:

  • saveMode="u" indicates that existing observations should be replaced and new observations appended. uniq defines the columns that determine whether an observation already exists: in the example, an observation is identified by the name of the bathing area (id, badname) and the measurement date (dat). If a place/date combination exists both in the database and in the newly scraped data, the row in the database is overwritten.
  • There are further options to dbWriteTable, such as naming columns that hold dates or timestamps (here dates="dat") to enforce data type conversion.
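The effect of saveMode="u" can be illustrated at the data.frame level. The upsert helper below is made up for illustration only (the real work happens inside dbWriteTable): rows whose key columns match an incoming row are replaced, new rows are appended.

```r
# Hypothetical sketch of the saveMode="u" semantics
upsert <- function(old, new, uniq) {
  # build a composite key from the uniq columns
  key <- function(d) do.call(paste, c(d[uniq], sep = "\r"))
  # drop old rows that are superseded, then append the new rows
  rbind(old[!(key(old) %in% key(new)), ], new)
}

old <- data.frame(id=1, badname="Wannsee",
                  dat=c("2012-06-01", "2012-06-08"),
                  quality=c("good", "poor"), stringsAsFactors=FALSE)
new <- data.frame(id=1, badname="Wannsee", dat="2012-06-08",
                  quality="good", stringsAsFactors=FALSE)
upsert(old, new, uniq=c("id", "badname", "dat"))
# one row per place/date; the 2012-06-08 measurement is replaced
```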

Now with this extended function definition we can decide to use local or web data:

> bq <- be_bathing_water_quality()
> ds <- datastore(":memory:") # not persistent
> scrape(bq, ds)
> #
> #local:
> system.time(query(bq, "BathingWaterQuality", dbconn=ds))
user  system elapsed 
0       0       0 
> #
> #web:
> system.time(query(bq, "BathingWaterQuality"))
user  system elapsed 
0.04    0.00    0.23 

Usually, the local data is accessed faster. Additionally, we can keep observations for as long as we want.


This blog post belongs to a series of articles on an S4 data concept I am currently prototyping. At the moment, the framework allows accessing and querying “conventional” internal datasets, SPARQL endpoints and datasets on the web. The topic of the next blog post is combining data objects using the Mashup class.

On the other hand, I am thinking of building on top of this infrastructure. Combining data objects with text templates similar to brew is one idea; interactive views using rpanel or similar is another.

The datamart package is in its early stage, changes to the classes are likely to happen. I would appreciate feedback on the concept. Please leave a comment or send an email.


Convenient access to Gapminder's datasets from R

In April, Hans Rosling examined the influence of religion on fertility. I used R to replicate a graphic from his talk:

> library(datamart)
> gm <- gapminder()
> #queries(gm)
> #
> # babies per woman
> tmp <- query(gm, "TotalFertilityRate")
> babies <- as.vector(tmp["2008"])
> names(babies) <- names(tmp)
> babies <- babies[!is.na(babies)]
> countries <- names(babies)
> #
> # income per capita, PPP adjusted
> tmp <- query(gm, "IncomePerCapita")
> income <- as.vector(tmp["2008"])
> names(income) <- names(tmp)
> income <- income[!is.na(income)]
> countries <- intersect(countries, names(income))
> #
> # religion
> tmp <- query(gm, "MainReligion")
> religion <- tmp[,"Group"]
> names(religion) <- tmp[,"Entity"]
> religion[religion==""] <- "unknown"
> colcodes <- c(
+   Christian="blue", 
+   "Eastern religions"="red", 
+   Muslim="green", "unknown"="grey"
+ )
> countries <- intersect(countries, names(religion))
> #
> # plot
> par(mar=c(4,4,0,0)+0.1)
> plot(
+   x=income[countries], 
+   y=babies[countries], 
+   col=colcodes[religion[countries]], 
+   log="x",
+   xlab="Income per Person, PPP-adjusted", 
+   ylab="Babies per Woman"
+ )
> legend(
+   "topright", 
+   legend=names(colcodes), 
+   fill=colcodes, 
+   border=colcodes
+ )

One of the points Rosling wanted to make: religion has little or no influence on fertility, but economic welfare does. I wonder if demographers agree and take this economic effect into account.

If you want to know more about that gapminder function and that query method, read on.


Querying DBpedia from R

DBpedia is an extract of the structured information in Wikipedia. The structured data can be retrieved using an SQL-like query language for RDF called SPARQL. There is already an R package for this kind of query, named SPARQL.

My datamart package contains an S4 class Dbpedia that aims to support the creation of predefined parameterized queries. Here is an example that retrieves data on German Federal States:

> library(datamart) # version 0.5 or later
> dbp <- dbpedia()

# see a list of predefined queries
> queries(dbp)
[1] "Nuts1"  "PlzAgs"

# lists Federal States
> head(query(dbp, "Nuts1"))[, c("name", "nuts", "gdp")]
                    name nuts    gdp
1                Hamburg  DE6  94.43
2       Baden-Württemberg  DE1 376.28
3 Mecklenburg-Vorpommern  DE8  35.78
4        Rheinland-Pfalz  DEB 107.63
5               Thüringen  DEG  49.87
6                 Berlin  DE3  101.4

It is straightforward to extend the Dbpedia class with further queries. More challenging, in my opinion, is figuring out useful queries. Some examples can be found at Bob DuCharme's blog, in the article by Jos van den Oever, in a discussion on a mailing list, in a tutorial at the W3C, at Kingsley Idehen's blog and in DBpedia's wiki.


A wrapper for R's data() function

The workflow for statistical analyses is discussed in several places. Often, it is recommended to:

  • never change the raw data, but transform it,
  • keep your analysis reproducible,
  • separate functions and data,
  • use the R package system as an organizing structure.

In some recent projects I tried an S4 class approach for this workflow, which I want to present and discuss. It makes use of the package datamart, which I recently submitted to CRAN. Here is a sample session:

library(datamart) # version 0.5 or later

# load one of my datasets
xp <- expenditures()

# introspection: what
# "resources" for this
# dataset did I once define?
queries(xp)

# get me a resource
head(query(xp, "evs2008.lvl2"))

Read on to see how an S4 dataset object is defined and accessed, and what I see as speaking for and against this approach.


Here is a use case. An R user develops some functions that depend on data that does not change very often. The user decides to put the functions and the dataset into an R package. The approach described below wraps both the functions and the data into one object.

As an example I use a CSV file derived from the German Income and Expenditure Survey. This file evs2008.lvl2.csv is put in the data subdirectory. I add a file evs2008.lvl2.R that basically calls read.csv2 with the right parameters. The example is part of the datamart package.

Wrapping the dataset into an S4 object

The new step, and the point of this blog post, is that I now define an S4 object of class datamart::InternalData2 for the dataset:

dat <- internalData2(resource="evs2008.lvl2", package="datamart")

The InternalData2 class is derived from Xdata. Xdata defines two generics: a method called query to access the data and a method queries for introspection (more on that later). The InternalData2 class just adds a simple wrapper around the data function. On instantiation, the dataset is loaded into the object's private environment. It can then be accessed by querying the evs2008.lvl2 resource. So the usual call data(evs2008.lvl2) now becomes

query(dat, "evs2008.lvl2")

This divides the data loading process into two steps: the import from the data subdirectory, and the querying of the dataset. On a second call query(dat, "evs2008.lvl2"), the import does not take place again; instead the already imported dataset is handed out. If the import is slow, this can save some time.
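The load-once behaviour can be sketched with a plain closure. The name make_cached_loader is made up here; InternalData2 keeps the data in an S4 object's environment, but the idea is the same:

```r
# Sketch of load-once semantics via a private environment
make_cached_loader <- function(load_fun) {
  cache <- new.env(parent = emptyenv())
  function(resource) {
    # import only if the resource is not cached yet
    if (!exists(resource, envir = cache))
      assign(resource, load_fun(resource), envir = cache)
    get(resource, envir = cache)
  }
}

loader <- make_cached_loader(function(r) {
  message("importing ", r)   # happens only on the first call
  toupper(r)
})
loader("evs2008.lvl2")  # imports
loader("evs2008.lvl2")  # served from the cache
```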

Define custom queries

Now if we want to, for example, create a parameterized query for expenditure category in certain household types and/or income situation, we first create the function

evs.categories <- function(self, resource, income="(all)", hhtype="(all)", relative=TRUE, ...) {
    dat <- subset(query(self, "evs2008.lvl2"), coicop2 != "15" & coicop2 != "00")
    income_lvls <- unique(dat$income)
    if(!income %in% income_lvls) stop("invalid 'income' argument, expected one of '", paste(income_lvls, collapse="', '"), "'.")
    hhtype_lvls <- unique(dat$hhtype)
    if(!hhtype %in% hhtype_lvls) stop("invalid 'hhtype' argument, expected one of '", paste(hhtype_lvls, collapse="', '"), "'.")
    dat <- dat[dat$income==income & dat$hhtype==hhtype,]
    if(relative) dat <- transform(dat, value=value/sum(value))
    res <- dat$value
    names(res) <- dat$coicop2de
    res
}

This function provides a simple interface to subset with some argument checking and an optional transformation. The second argument to query defines its name. Usually it is passed as a string.

Other functions may produce graphics, for example

evs.elasticity <- function(self, resource, categ="", xlab="", ylab="", main=NULL, ...) {
    dat <- subset(query(self, "evs2008.lvl2"), coicop2 != "15" & coicop2 != "00" & income != "(all)")
    cat_lvls <- unique(dat$coicop2)
    if (!categ %in% cat_lvls) stop("invalid 'categ' argument, expected one of '", paste(cat_lvls, collapse="', '"), "'.")
    income_lvls <- c("lt900", "900to1300", "1300to1500", "1500t2000", "2000t2600",
                    "2600t3600", "3600t5000", "5000t18000")
    dat$income <- factor(dat$income, levels=income_lvls)
    dat <- subset(dat, coicop2==categ)
    if(is.null(main)) main <- dat[1, "coicop2de"]
    boxplot(value ~ income, data=dat, ylab=ylab, main=main, ylim=c(0, 1000), ...)
}

Now in order to wrap data and functions together, we use the mashup function, that is also contained in the datamart package:

expenditures <- function() mashup(
    evs2008.lvl2=internalData2(resource="evs2008.lvl2", package="datamart"),
    categories=evs.categories,
    elasticity=evs.elasticity
)

The function expenditures thus defined now works as in the opening example.

Similarities and overlaps with other approaches

One “S3 way” to achieve something similar would be to create an S3 class evs for the data structure and then define new generics and methods like categories.evs or elasticity.evs. The resources are then defined by the method names. This works well as long as there is no more than one argument to dispatch on, and as long as all that is done with the data object is querying it.
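That S3 route can be sketched in a few lines (the class and method names below are made up for illustration):

```r
# Hypothetical S3 variant: each resource is its own generic/method pair
evs <- function(dat) structure(list(dat = dat), class = "evs")

categories <- function(x, ...) UseMethod("categories")
categories.evs <- function(x, ...) unique(x$dat$coicop2)

e <- evs(data.frame(coicop2 = c("01", "01", "02"), stringsAsFactors = FALSE))
categories(e)
# adding a resource means adding another generic + method,
# whereas the query() approach only adds a resource name (a string)
```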

The biomaRt package of the bioconductor project is somewhat similar, for instance, with its useMart function. It is focused on certain bioinformatics databases.

The original inspiration for datamart comes from the CountryData function in Mathematica. This and similar xxxData functions in Mathematica provide introspection facilities like CountryData[], CountryData["Tags"] and CountryData["Properties"], and deliver the data in various formats: as numbers, as polygons, etc.

Future directions

In addition to the InternalData2 class, there is already a class for web APIs such as the MediaWiki API or SPARQL end points. More on those in future posts.

Another direction to go is to support writing and updating datasets. A wrapper around data works for read-only datasets that are part of a package, but updating data in installed packages does not seem like a good idea. Another class is needed for that.

Also, I like the idea of using the uniform query interface and introspection to build something on top of it. I am thinking of a simple templating mechanism for creating markdown reports or PowerPoint slides. Interactive SVG or Tk windows are another direction.


Is defrosting the freezer worth the trouble?

Short answer: No, not in my case.

Long answer: My refrigerator is about 5 years old, has a volume of 103 liters and a freezing compartment of 16 liters. Its energy class is “A”; the label states a power consumption of 219 kWh per year.

I borrowed a power meter and measured consumption before and after defrosting. Here is a chart of the measured data. I took only 5 measurements: one a week before defrosting, one an hour after turning the fridge back on, and then 8, 23 and 47 hours after defrosting. The y-axis shows the average power consumption, i.e. the rate at which the fridge draws energy.

The data suggests that the power consumption averages 15.9 watts. The difference before and after defrosting is 0.5 watts. I do not want to attribute this to the defrosting: the outside temperature in the week before defrosting was higher, and I used the fridge more often because there were holidays and I was at home. But even if defrosting did have this effect, these 0.5 watts add up to only 4.38 kWh, or about 1.50 EUR, per year.

Another insight was that 15.9 watts project to 139 kWh per year, which is well below the 219 kWh stated on the energy label. Even after 5 years! Maybe this is because I set the thermostat to a rather low level (2 out of 5), or maybe the consumption in summer will be much higher.
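The projections above are simple arithmetic and easy to check in R:

```r
watts <- 15.9
hours_per_year <- 365 * 24       # 8760 hours

watts * hours_per_year / 1000    # about 139 kWh per year
0.5 * hours_per_year / 1000      # the 0.5 W difference: 4.38 kWh per year
```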


AUM -- a Sensor to Track Your Computer Usage

A few days ago, I released the first beta version of the Application Usage Monitor AUM. It is a simple sensor for the Windows operating system that logs the names of the windows you are actively using. Thus the program enables you to analyse: how much time do you spend surfing (and where)? How much time do you work productively?

So the sensor is somewhat similar to RescueTime or TimeDoctor (I have not tried these programs, however). The difference is that AUM is a really small program written in C that does nothing more than collect the data; it leaves the analysis up to you. Another difference is that AUM is free (as in GPL).

Installing the program is simple: just download and run the installer from sourceforge. You may choose to start the sensor automatically at login, which is recommended. Once started, the program logs the time, process name, process id and window title to a semicolon-separated text file. You will also see a little blue butterfly in the system tray. Click on the icon to stop monitoring.

The default location for the log file is %APPDATA%\AUM\AUM.csv. You may provide a different file path on the command line. You can open the file with any text editor. There is also a Visual Basic Script that tries to import the data into Excel; this requires a working Excel installation on your PC. In the future, I plan to support analysis of the data from within R.
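Until then, reading the log into R is already straightforward. A sketch (the column names are my guess from the field list above, not fixed by AUM):

```r
# Hypothetical reader for the AUM log: four semicolon-separated fields
read_aum <- function(path) {
  read.table(path, sep = ";", quote = "", stringsAsFactors = FALSE,
             col.names = c("time", "process", "pid", "wtitle"))
}

# example log with two entries
tf <- tempfile(fileext = ".csv")
writeLines(c("2012-07-01 10:00:00;chrome.exe;6810;Inbox - Google Mail",
             "2012-07-01 10:05:00;notepad++.exe;2946;file 1 - Notepad++"), tf)
read_aum(tf)
```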

Please give it a try! But be warned that the program is beta and not so well tested yet. Please leave feedback in the comment box below.


Working with strings

R has a lot of string functions; many of them can be found with ls("package:base", pattern="str"). Additionally, there are add-on packages such as stringr, gsubfn and brew that enhance R's string processing capabilities. As a statistical language and environment, R has an edge over other programming languages when it comes to text mining algorithms and natural language processing. There is even a task view for this on CRAN.

I am currently playing with markdown files in R, which will eventually result in a new version of mdtools, and have collected or created some string functions that I present in this blog post. The source code of the functions is at the end of the post; first I show how to use them.

Head and tail for strings

I had the idea for the first two functions earlier, and I had to learn that providing S3 methods for head and tail is not a good idea. But strhead and strtail did prove handy. Here are some usage examples:

> strhead("egghead", 3)
[1] "egg"
> strhead("beagle", -1) # negative index
[1] "beagl"
> strtail(c("bowl", "snowboard"), 3) # vector-able in the first argument
[1] "owl" "ard"

These functions are only syntactic sugar, hopefully easy to memorize because of their similarity to existing R functions. For packages, they are probably not worth introducing an extra dependency. I thought about defining a replacement function as substr has, but I did not try it, because head and tail do not have replacement functions either.
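For reference, a possible implementation that matches the examples above (the version at the end of the post may differ in details):

```r
strhead <- function(s, n = 1) {
  # negative n drops the last |n| characters, like head() on vectors
  if (n < 0) substring(s, 1, nchar(s) + n) else substring(s, 1, n)
}

strtail <- function(s, n = 1) {
  # negative n drops the first |n| characters, like tail() on vectors
  if (n < 0) substring(s, 1 - n) else substring(s, nchar(s) - n + 1)
}
```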

Bare minimum template

With sprintf, format and pretty, there are powerful functions for formatting strings. However, sometimes I miss the named template syntax of Python or Makefiles. So I implemented it in R. Here are some usage examples:

> strsubst(
+   "$(WHAT) is $(HEIGHT) meters high.", 
+   list(
+     WHAT="Berlin's teletower",
+     HEIGHT=348
+   )
+ )
[1] "Berlin's teletower is 348 meters high."
> d <- strptime("2012-03-18", "%Y-%m-%d")
> strsubst(c(
+   "Be careful with dates.",
+   "$(NO_CONV) shows a list.",
+   "$(CONV) is more helpful."),
+   list(
+     NO_CONV=d,
+     CONV= as.character(d)
+   )
+ )
[1] "Be careful with dates."                                                                                        
[2] "list(sec = 0, min = 0, hour = 0, mday = 18, mon = 2, year = 112, wday = 0, yday = 77, isdst = 0) shows a list."
[3] "2012-03-18 is more helpful."                                                                                   

The first argument can be a string or a vector of strings, such as the output of readLines. The second argument can be any indexable object (i.e. with a working [ operator), such as a list. Environments are not indexable and hence won't work.
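A minimal sketch of strsubst along these lines (the actual version at the end of the post coerces values differently, as the date example above shows):

```r
strsubst <- function(template, map) {
  # replace every $(KEY) in template with the corresponding entry of map
  for (key in names(map)) {
    pattern <- paste("$(", key, ")", sep = "")
    template <- gsub(pattern, paste(map[[key]], collapse = ", "),
                     template, fixed = TRUE)
  }
  template
}

strsubst("$(WHAT) is $(HEIGHT) meters high.",
         list(WHAT = "Berlin's teletower", HEIGHT = 348))
```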

Parse raw text

Frequently, I need to extract parts from raw text data. For instance, a few weeks ago I had to parse an SPSS script (some variable labels were hard-coded there and not in the .sav file). The script contained lines such as VARIABLE LABELS some_var "<some_label>". I was interested in some_var and <some_label>. The examples from the R documentation on regexpr pointed me in the right direction and led me to the strparse function, which is applied as follows:

> lines <- c(
+     'VARIABLE LABELS weight "weight".',
+     'VARIABLE LABELS altq "Year of birth".',
+     'VARIABLE LABELS hhg "Household size".',
+     'missing values all (-1).',
+     'EXECUTE.'
+ )
> pat <- 'VARIABLE LABELS (?<name>[^\\s]+) \\"(?<lbl>.*)\\".$'
> matches <- grepl(pat, lines, perl=TRUE)
> strparse(pat, lines[matches])
name     lbl             
[1,] "weight" "weight"        
[2,] "altq"   "Year of birth" 
[3,] "hhg"    "Household size"

The function returns a vector if one line was parsed, and a matrix otherwise. It supports named groups.
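A sketch of strparse using base R's named-capture support (with perl=TRUE, regexpr attaches capture.start, capture.length and capture.names attributes to its result):

```r
strparse <- function(pat, x) {
  m <- regexpr(pat, x, perl = TRUE)
  st  <- attr(m, "capture.start")
  len <- attr(m, "capture.length")
  # cut out each captured group; x is recycled across the groups
  res <- matrix(substring(x, as.vector(st), as.vector(st + len - 1)),
                nrow = length(x),
                dimnames = list(NULL, attr(m, "capture.names")))
  if (length(x) == 1) res[1, ] else res
}
```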

Recoding with regular expressions

Sometimes I need to recode a vector of strings: I find all matches for a particular regular expression and replace these matches with one string. Then I match all remaining strings against a second regular expression and replace the hits with a second replacement. And so on. I wrote the strrecode function to support this operation. The function can be seen as a generalisation of the gsub function. It is the only function without test code. Here is a made-up example analysing process information from the task manager:

> dat <- data.frame(
+     wtitle=c(paste(c("Inbox", "Starred", "All"), "- Google Mail"), paste("file", 1:4, "- Notepad++")),
+     pid=c(6810, 2488, 4086, 2946, 112, 1176, 8881),
+     exe=c(rep("chrome.exe",3), rep("notepad++.exe", 4))
+ )
> dat <- transform(
+     dat,
+     usage=strrecode(c("Google Mail$|Microsoft Outlook$", " - Notepad\\+\\+$|Microsoft Word$"), c("Mail", "Text"), dat$wtitle)
+ )
> dat
wtitle  pid           exe usage
1   Inbox - Google Mail 6810    chrome.exe  Mail
2 Starred - Google Mail 2488    chrome.exe  Mail
3     All - Google Mail 4086    chrome.exe  Mail
4    file 1 - Notepad++ 2946 notepad++.exe  Text
5    file 2 - Notepad++  112 notepad++.exe  Text
6    file 3 - Notepad++ 1176 notepad++.exe  Text
7    file 4 - Notepad++ 8881 notepad++.exe  Text
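A sketch of strrecode with a first-match-wins rule (how the real version at the end of the post treats unmatched entries may differ; here they become NA):

```r
strrecode <- function(pats, repls, x) {
  res  <- rep(NA_character_, length(x))
  todo <- rep(TRUE, length(x))
  for (i in seq_along(pats)) {
    # only strings not yet recoded are matched against the next pattern
    hit <- todo & grepl(pats[i], x)
    res[hit] <- repls[i]
    todo <- todo & !hit
  }
  res
}

strrecode(c("Google Mail$", " - Notepad\\+\\+$"),
          c("Mail", "Text"),
          c("Inbox - Google Mail", "file 1 - Notepad++", "cmd.exe"))
# "Mail" "Text" NA
```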

Interested in the source code of these helper functions? Read on.


Comparing my expenses

Now that I have collected my expenditures via Twitter and bank transaction data, and categorized them according to COICOP, in this blog post I compare them with the typical expenditures of a household of my type and income level.

The German Federal Statistical Office conducts every five years (last time in 2008) a survey on Income and Consumption. On their website, you can find this nice visualization.

The dataset I used is provided here. It is also part of the pft package. The dataset is not identical to the official data; some information is lost in processing.

Here is the chart comparing my expenses in 2011 with the typical expenses.

The main problem when comparing myself with a typical consumer is that 15% of my expenses remained uncategorized. If I assume they go into either Food, Recreation, Restaurants or Misc, and add those amounts, it turns out that almost half of my expenses go into these four categories, while 35% would be typical. In particular, I tend to spend more money on Recreation and Restaurants. On the other hand, I spend less money on Housing (small apartment) and Transportation (no car).


Berlin's children

A few years ago, a newspaper claimed that the block I live in — Prenzlauer Berg in Berlin — is the most fertile region in Europe. It was a hoax, as this (German) newspaper article points out. (The article has become quite famous because it coined the term Bionade-Biedermeier to describe the lifestyle in this area.)

However, there are more children in my district than in the other parts of Berlin. Have a look at this map:

(The base map and population data come from the State’s statistical office. Data at block level is not readily available, though.)

The place I live is marked by a crosshair. Indeed, in this district there is a “higher exposure to kids” than in the other districts: one child per 1000 inhabitants more than in Friedrichshain-Kreuzberg, and twelve children per 1000 more than in Charlottenburg-Wilmersdorf. Exposure is of course different from fertility; maybe that is what I learned from playing with the map.

If you want to know how to draw this thematic map with R, and add the point and the legends to it, or if you are just looking for a shapefile of Berlin, then read on.


Categorizing my expenses

In order to analyse my expenses, a classification scheme is necessary: I need to identify categories that are meaningful to me. I decided to go with the “Classification of Individual Consumption by Purpose” (COICOP), for three reasons:

  • It is made by people who have thought more about consumption classification than I ever will.
  • It is feasible to assign bank transactions and tracked cash spending to one of the 12 top-level categories.
  • It is widely used by statistics divisions, e.g. the Federal Statistical Office of Germany, Eurostat, and the UN. This means I can make social comparisons: in which categories do I spend more money than the average? Do the prices I pay rise faster than the price indices suggest?

So I classified my last year’s expense data according to COICOP. Here is a chart showing the portions of the categories for each month:

For me, the holidays, prepared in August and traveled in September (shown as unknown expenses), are much more dominant than I expected. Except for the new glasses in September I did not make any larger investments.

I like this kind of chart more than stacked bar charts because the history of each category remains visible. This type of chart is called an inkblot chart. I stumbled upon it on junk charts, asked on StackOverflow how to implement it in R, and included a revised version in the latest pft package. See below for more information.


Tracking my expenses

One new-year's resolution I made last year was to understand where my money goes. From previous experiments I know that expense tracking has to be as simple as possible. My approach is to:

  • Use my cash card as often as possible. This automatically tracks the date and some information on the vendor.
  • Use Twitter to track my cash expenses. This supplements the bank account statement data.
  • Edit, enrich, merge and visualise the two data sources with R. Because it is fun playing with R!

Now, after more than one year of expense tracking, I can analyse the results. The first result, however, was disappointing: my cash tracking with Twitter was not as complete as I thought. Below is a figure that displays, for each month of 2011, the sum tracked with Twitter divided by the sum withdrawn from my bank account.

If I had tracked my cash expenses completely, the ratio would be around 100 percent, the gray dashed line. However, it is systematically below that. For September, there is an explanation: I was on holiday and intentionally did not track expenses. But even allowing for that, 18 percent of my cash spending remains unexplained!

More analysis results will follow. If you are interested in technical aspects of the expense tracking, such as importing the tweets and bank statements, read on. However, there is no R code today, since there is no example data.