2013-05-17

Unit conversion in R

Last weekend I submitted an update of my R package datamart to CRAN. It has been more than half a year since the last update, but there are only minor advances. The package is still in its early stages and very experimental.

One new feature is the function uconv. Think iconv, but instead of converting character vectors between different encodings, this function converts numerical vectors between different units of measurement. Now if you want to know how many centimeters one horse length is, you can write in R:

> #install.packages("datamart")
> library(datamart)
> uconv(1, "horse length", "cm")

and you will get the answer 240. I had the idea for this function when I had to convert between various energy units, including natural units of energy fuels like cubic metres of natural gas. The uconv function supports this, using common constants for the conversion.

> uconv(1, "Mtoe", "PJ")
[1] 41.88
> uconv(1, "m³ NG", "kWh")
[1] 10.55556

These conversions may be ambiguous. For instance, the last one combines a volume and an energy dimension. An optional parameter allows the specification of the context, or unit set:

> uconv(1, "Mtoe", "PJ", uset="Energy")

The currently available unit sets and units therein can be inspected with

> uconvlist()

The first argument can be a numerical vector:

> set.seed(13)
> uconv(37+2*rnorm(5), "°C", "°F", uset="Temperature")
[1] 100.59558  97.59102 104.99059  99.27435 102.71309
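
To make the idea of unit sets more concrete, here is a minimal sketch of how such a conversion could work. It is not the implementation in datamart: the table unit_sets and the function convert_units are hypothetical names, and the sketch only covers purely multiplicative conversions (temperature scales would additionally need an offset).

```r
# Hypothetical factor tables: each unit set maps unit names to a factor
# relative to one base unit (J for energy, m for length).
unit_sets <- list(
  Energy = c(J = 1, kJ = 1e3, kWh = 3.6e6),
  Length = c(m = 1, cm = 0.01, km = 1000)
)

convert_units <- function(x, from, to, uset = NULL) {
  # Restrict the search to one unit set if requested, otherwise try all
  candidates <- if (is.null(uset)) unit_sets else unit_sets[uset]
  for (factors in candidates) {
    if (all(c(from, to) %in% names(factors))) {
      # Go from the source unit to the base unit, then to the target unit
      return(x * factors[[from]] / factors[[to]])
    }
  }
  stop("no unit set contains both '", from, "' and '", to, "'")
}

convert_units(1, "kWh", "kJ", uset = "Energy")
convert_units(c(100, 250), "cm", "m")
```

Because the lookup requires both units to live in the same set, a request that mixes dimensions fails instead of silently returning a number.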

2013-05-11

Highlights of Re:publica 13

From May 6th to 8th, Berlin was the host for the re:publica 13. I did not have time to attend it, but many of the talks of this internet culture conference are online. Here are my highlights (mostly in German, though):

2013-02-16

Some of Excel's Finance Functions in R

Last year I took a free online class on finance by Gautam Kaul. I recommend it, although I have no other classes to compare it to. The instructor put great effort into motivating the concepts, structuring the material, and encouraging critical thinking and intuition. I believe this is an advantage of video lectures over books. Textbooks often cover a broader area and are more subtle when it comes to recommendations.

One fun exercise for me was porting the classic Excel functions FV, PV, NPV, PMT and IRR to R. I partly relied on the PHP class by Enrique Garcia M. You can find the R code at pastebin. By looking at the source code, you will understand how sensitive IRR is to its start value:

> source("http://pastebin.com/raw.php?i=q7tyiEmM")
> irr(c(-100, 230, -132), start=0.14)
[1] 0.09999995
> irr(c(-100, 230, -132), start=0.16)
[1] 0.1999999
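
The sensitivity comes from the cash flow itself: its net present value has two roots, at 10% and at 20%, and an iterative solver ends up at one or the other depending on where it starts. The following sketch illustrates this with a plain Newton iteration. It is not the pastebin code (which may use a different root finder); npv and irr_newton are illustrative names, and unlike Excel's NPV this npv treats the first cash flow as occurring at time 0.

```r
# Net present value, first cash flow at time 0
npv <- function(rate, cashflows) {
  periods <- seq_along(cashflows) - 1
  sum(cashflows / (1 + rate)^periods)
}

# Newton iteration on npv(rate) = 0, starting from `start`
irr_newton <- function(cashflows, start = 0.1, tol = 1e-10, maxit = 100) {
  periods <- seq_along(cashflows) - 1
  rate <- start
  for (i in seq_len(maxit)) {
    value <- npv(rate, cashflows)
    # Derivative of the NPV with respect to the rate
    slope <- sum(-periods * cashflows / (1 + rate)^(periods + 1))
    step  <- value / slope
    rate  <- rate - step
    if (abs(step) < tol) return(rate)
  }
  warning("Newton iteration did not converge")
  rate
}

cf <- c(-100, 230, -132)
sapply(c(0.10, 0.20), npv, cashflows = cf)  # both rates give an NPV of (nearly) zero
irr_newton(cf, start = 0.14)                # ends up near 0.1
irr_newton(cf, start = 0.16)                # ends up near 0.2
```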

I still do not understand the sign of the return values. This I have to figure out every time I use the function. If you have a memory hook for this, please leave a comment.
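
One possible memory hook, assuming the port follows Excel's cash flow sign convention: money you receive is positive, money you pay out is negative, and the result takes the opposite sign of the input it balances. The sketch below uses simplified, hypothetical versions (fv_simple and pmt_simple, payments at the end of each period, no optional arguments), not the functions from the pastebin code.

```r
# Future value of a single amount pv after nper periods
fv_simple <- function(rate, nper, pv) -pv * (1 + rate)^nper

# Constant payment that repays a loan of pv over nper periods
pmt_simple <- function(rate, nper, pv) -pv * rate / (1 - (1 + rate)^(-nper))

fv_simple(0.05, 10, -1000)  # I pay 1000 now (negative), I get back a positive amount
pmt_simple(0.05, 10, 1000)  # I receive 1000 now (positive), I pay negative instalments
```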

Of course, the class did not only cover the time value of money; it was also a non-rigorous introduction to bonds and perpetuities (which I found interesting, too), as well as to CAPM and portfolio theory.

2013-02-14

Reflections on a Free Online Course on Quantitative Methods in Political Sciences

Last year I watched some videos of Gary King's lectures on Advanced Quantitative Research Methodology (GOV 2001). The course teaches aspiring political scientists how to develop new approaches to research methods, data analysis, and statistical theory. The course material (videos and slides) still seems to be online; a follow-up course apparently started at the end of January 2013.

I only watched some videos and did not work through the assignments. Nevertheless I learned a lot, and I am writing this post to reduce my pile of loose leaves (new year's resolution) and summarize the take-aways.

Theoretical concepts

In one of the first lessons, the goals of empirical research are partitioned step by step until the concept of counterfactual inference appears, a new term for me. It denotes “using facts you know to learn facts you cannot know, because they do not (yet) exist” and can be further differentiated into prediction, what-if analysis, and causal inference. I liked the stepwise approach to the concept: summary vs. inference, then descriptive inference vs. counterfactual inference.

The course presented a likelihood theory of inference. New to me was the likelihood axiom, which states that a likelihood function L(t',y) must be proportional to the probability of the data given the hypothetical parameter and the “model world”. Proportional here means up to a constant that depends only on the data y, i.e. L(t',y) = k(y) P(y|t'). Likelihood is a relative measure of uncertainty, relative to the data set y. Comparisons of values of the likelihood function across data sets are meaningless. The data affect inferences only through the likelihood function.
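
A small numerical illustration of why the constant k(y) does not matter, using a binomial example of my own choosing (7 successes in 10 trials; the numbers are not from the course): the ratio of likelihoods for two parameter values is the same whether or not the constant is included.

```r
y <- 7    # observed successes (made-up data)
n <- 10   # number of trials
theta <- c(0.5, 0.7)   # two hypothetical parameter values

# Full probability of the data, i.e. with k(y) = choose(n, y)
lik_full <- dbinom(y, size = n, prob = theta)

# Likelihood kernel with the constant k(y) dropped
lik_kernel <- theta^y * (1 - theta)^(n - y)

# Relative support for theta = 0.7 over theta = 0.5: identical in both versions
lik_full[2] / lik_full[1]
lik_kernel[2] / lik_kernel[1]
```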

In contrast to likelihood inference, Bayesian inference models a posterior distribution P(t'|y) which incorporates prior information P(t'), i.e. P(t'|y) = P(t') P(y|t')/P(y). To me, the likelihood theory of inference seems more straightforward, as it is not necessary to treat prior information P(t'). I had heard that there are discussions between “frequentists” and “Bayesians”, but it was new to me to hear of a third group, the “Likelihoodists”.
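
To see the difference in practice, here is a sketch that applies the formula above on a grid of parameter values. The data (7 successes in 10 trials) and the Beta(2, 2) prior are assumptions chosen for illustration only.

```r
y <- 7    # observed successes (made-up data)
n <- 10   # number of trials
theta <- seq(0, 1, length.out = 1001)   # grid of parameter values

prior <- dbeta(theta, 2, 2)             # assumed prior P(t')
lik   <- dbinom(y, size = n, prob = theta)  # P(y|t')

# Posterior is proportional to prior times likelihood; normalising over
# the grid plays the role of dividing by P(y)
posterior <- prior * lik
posterior <- posterior / sum(posterior)

post_mean <- sum(theta * posterior)  # Bayesian point summary
mle <- theta[which.max(lik)]         # likelihood-only point summary
c(posterior_mean = post_mean, mle = mle)
```

With this conjugate prior the exact posterior is Beta(9, 5), so the grid result can be checked against the analytic mean 9/14; the prior pulls the estimate from 0.7 towards 0.5.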

Modeling

At the beginning of the course, some model specifications with “systematic” and “stochastic” components were introduced. I like this notation; it makes very clear what is going on and where the uncertainty is.

The negative binomial distribution was motivated as a compound distribution of the Poisson and the Gamma distribution (aka Gamma mixture). It can be viewed as a Poisson distribution where the Poisson parameter is itself a random variable, distributed according to a Gamma distribution. With g(y|\lambda) as density of the Poisson distribution and h(\lambda|\phi, \sigma^2) as density of the Gamma distribution, the negative binomial distribution f arises after collapsing their joint distribution: f(y|\phi, \sigma^2) = \int_0^{\infty} g(y|\lambda) h(\lambda|\phi, \sigma^2) \, d\lambda
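
The collapsing step can be checked numerically. The sketch below integrates the Poisson density against a Gamma density and compares the result with R's dnbinom. Note that it uses R's shape/rate parametrisation of the Gamma distribution rather than the (\phi, \sigma^2) notation of the course, and that the parameter values are arbitrary.

```r
shape <- 2     # Gamma shape (arbitrary choice)
rate  <- 0.5   # Gamma rate (arbitrary choice)

# f(y) = integral over lambda of Poisson(y | lambda) * Gamma(lambda | shape, rate)
nb_by_mixing <- function(y) {
  integrand <- function(lambda) {
    dpois(y, lambda) * dgamma(lambda, shape = shape, rate = rate)
  }
  integrate(integrand, lower = 0, upper = Inf)$value
}

y <- 0:5
mixed  <- sapply(y, nb_by_mixing)
# Negative binomial with size = shape and prob = rate / (rate + 1)
direct <- dnbinom(y, size = shape, prob = rate / (rate + 1))

all.equal(mixed, direct, tolerance = 1e-4)
```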

There were many other modeling topics, including missing value imputation and matching as a technique for causal inference. I did not look into them, maybe later/someday.

The assignments moved very quickly to simulation techniques. I did not work through them, but I got interested in the subject and will work through some chapters of Ripley's “Stochastic Simulation” book when time permits.

Didactical notes

I was really impressed by the effort Mr. King and his teaching assistants put into teaching their material. Students taking the (non-free) full course prepare replication papers. The assignments involve programming. Quizzes are shown during the lectures; the students vote using Facebook, and the result is presented two minutes later. Once per lecture, the professor interrupted his talk and said “Here is the question; discuss this with your neighbour for five minutes”. Very good idea.

2012-07-29

ScraperWiki in R

ScraperWiki describes itself as an online tool for gathering, cleaning and analysing data from the web. It is a programming-oriented approach: users can implement ETL processes in Python, PHP or Ruby, share these processes with the community (or pay for privacy), and schedule automated runs. The software behind the service is open source, and there is also an offline version of the program library.

As far as I know, ScraperWiki has no R support yet. This is where the scrape method of my datamart package comes in. It provides an S4 generic for storing scraped data in an offline database. It is not a wiki (although collaboration can take place via R-Forge or CRAN), has no scheduling support (scheduling the start of R scripts should take place outside R), and is not intended as a web application.

Example use case — Berlin’s bathing water quality

Here is an example use case. It is (finally) summer in Berlin, and the warm weather invites a swim. For health-conscious and/or data-minded people (or those with a strong imagination), the public authorities publish the latest water quality assessments for around 30 bathing areas online as open data.

Only the last measurement is available. Now if we were interested in collecting the quality measurements, maybe to find a trend or to search for contributing factors, we would need a mechanism to regularly (in this case, weekly) scrape the data via the online API and save the data locally.

With the framework proposed in datamart, once the details are defined, the mechanism would be executed with just three lines

> ds <- datastore("path/to/local.db")
> bq <- be_bathing_water_quality()
> scrape(bq, ds) # get & save the data

The rest of this blog post is on defining the process, which boils down to defining the be_bathing_water_quality function.

ETL process — extract, transform, load

The task the be_bathing_water_quality function has to accomplish is to define a three-step process by passing the process details to the urldata function. In the database or data warehouse context, this process is often referred to as an ETL process. In our context, the steps are:

  • After mapping a resource name to a URL, the data is extracted, i.e. downloaded. If the network is unavailable, authentication fails, or a similar problem occurs, the process ends here.
  • The extracted data is then transformed into an object R can work with. Currently, only data.frame objects are supported; xts and igraph are envisioned. If the extracted data is not as expected and the transformation fails, the process ends here.
  • The tidy data is then stored in (or loaded into) a local sqlite database.

The urldata function provides parameters for each of these steps.
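
The control flow of these three steps can be sketched in a few lines of base R. This is only an illustration of the idea, not how urldata is implemented: run_etl is a hypothetical function, the extract step returns made-up records instead of downloading anything, and an environment stands in for the sqlite database.

```r
# Hypothetical driver: each step is a function, and the process stops
# at the first step that fails.
run_etl <- function(resource, extract, transform, load) {
  raw <- tryCatch(extract(resource), error = function(e) NULL)
  if (is.null(raw)) return(FALSE)          # extraction failed: stop here
  dat <- tryCatch(transform(raw), error = function(e) NULL)
  if (!is.data.frame(dat)) return(FALSE)   # transformation failed: stop here
  load(resource, dat)
  TRUE
}

store <- new.env()   # stand-in for the local database

ok <- run_etl(
  "Demo",
  extract   = function(res) list(list(id = 1, value = "a"),
                                 list(id = 2, value = "b")),
  transform = function(raw) do.call(rbind, lapply(raw, as.data.frame)),
  load      = function(res, dat) assign(res, dat, envir = store)
)

# A failing extraction leaves the store untouched
failed <- run_etl(
  "Demo2",
  extract   = function(res) stop("network down"),
  transform = identity,
  load      = function(res, dat) assign(res, dat, envir = store)
)
```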

Example: extract and transform

In the bathing water quality example, there is one URL for all bathing areas. We map the resource BathingWaterQuality to this URL. The data is returned as JSON, which is why we use fromJSON as the extraction function. The data is then transformed into a data.frame. (We do not go into every detail of that transformation step.) Hence, the call to urldata is:

> be_bathing_water_quality <- function() urldata(
+   template="http://www.berlin.de/badegewaesser/baden-details/index.php/index/all.json?q=%s",
+   map.lst=list(BathingWaterQuality=""),
+   extract.fct=fromJSON,
+   transform.fct=function(x) {
+     tbl <- x[["index"]]
+     nm <- names(tbl[[1]])
+     dat <- as.data.frame(
+       matrix(NA, length(tbl), length(nm))
+     )
+     colnames(dat) <- nm
+     for(i in seq_along(tbl))
+        dat[i,nm] <- tbl[[i]][nm]
+     #some steps omitted...
+     return(dat)
+   }
+ )
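
As an aside, the row-by-row loop in the transformation step can be written more compactly with lapply and rbind. The records below are made up and only mimic the shape I assume the parsed JSON has (a list of named lists); the real field values differ.

```r
# Made-up records in the assumed shape of the parsed JSON
records <- list(
  list(id = "1", badname = "Lake A", dat = "2012-07-24"),
  list(id = "2", badname = "Lake B", dat = "2012-07-25")
)

# Turn each record into a one-row data.frame, then stack the rows
rows <- lapply(records, function(r) as.data.frame(r, stringsAsFactors = FALSE))
dat <- do.call(rbind, rows)
str(dat)
```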

Now, we can create a data object, inspect it using the queries method, and access the latest measurements from the web using query:

> bq <- be_bathing_water_quality()
> queries(bq)
[1] "BathingWaterQuality"
> nrow(query(bq, "BathingWaterQuality"))
[1] 38

The urldata function as part of the datamart package has been described in more detail in an earlier post on the Gapminder datasets.

Example: load

In order to save the scraped data, one more step is necessary. The urldata function provides a parameter scrape.lst for specifying which resources to save locally, and how:

> be_bathing_water_quality <- function() urldata(
+   template="...", # see above
+   map.lst=list(BathingWaterQuality=""),
+   extract.fct=fromJSON,
+   #transform.fct= # see above
+   scrape.lst=list(
+     BathingWaterQuality=list(
+       uniq=c("id", "badname", "dat"), 
+       saveMode="u", 
+       dates="dat"
+     )
+   )
+ )

The name(s) of the scrape.lst argument must be a subset of the names of map.lst. The entries of the list are passed as arguments to dbWriteTable:

  • saveMode="u" indicates that existing observations should be replaced and new observations appended, uniq defines the columns that determine if an observation exists or not. In the example, an observation is identified by the name of the bathing area (id, badname) and the measurement date (dat). If a place/date combination already exists both in the database and in the newly scraped data, the row in the database gets overwritten.
  • There are other options to dbWriteTable such as specification of columns with dates or timestamps to enforce data type conversion.
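
The replace-or-append behaviour can be illustrated with plain data frames. The function upsert below is a hypothetical stand-in written for this post; it only mimics the semantics described for saveMode="u" and says nothing about how the database layer does it. The sample rows are made up.

```r
# Replace rows of `stored` whose key also occurs in `fresh`, append the rest
upsert <- function(stored, fresh, uniq) {
  # Build one key string per row from the columns named in `uniq`
  key <- function(d) do.call(paste, c(unname(as.list(d[uniq])), sep = "\r"))
  keep <- !(key(stored) %in% key(fresh))
  rbind(stored[keep, , drop = FALSE], fresh)
}

stored <- data.frame(
  id = c(1, 2), dat = c("2012-07-17", "2012-07-17"),
  quality = c("good", "good"), stringsAsFactors = FALSE
)
fresh <- data.frame(
  id = c(2, 3), dat = c("2012-07-17", "2012-07-24"),
  quality = c("poor", "good"), stringsAsFactors = FALSE
)

# Row (2, 2012-07-17) is overwritten, row (3, 2012-07-24) is appended
result <- upsert(stored, fresh, uniq = c("id", "dat"))
result
```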

Now with this extended function definition we can decide to use local or web data:

> bq <- be_bathing_water_quality()
> ds <- datastore(":memory:") # not persistent
> scrape(bq, ds)
> #
> #local:
> system.time(query(bq, "BathingWaterQuality", dbconn=ds))
   user  system elapsed 
      0       0       0 
> #
> #web:
> system.time(query(bq, "BathingWaterQuality"))
   user  system elapsed 
   0.04    0.00    0.23 

Usually, the local data is accessed faster. Additionally, it is possible to keep observations as long as we want to.

Conclusion

This blog post belongs to a series of articles on an S4 data concept I am currently prototyping. So far, the framework supports accessing and querying “conventional” internal datasets, SPARQL endpoints and datasets on the web. The topic of the next blog post is combining data objects using the Mashup class.

On the other hand, I am thinking of building on top of this infrastructure. Combining data objects with text templates similar to brew is one idea; interactive views using rpanel or similar are another.

The datamart package is at an early stage, and changes to the classes are likely to happen. I would appreciate feedback on the concept. Please leave a comment or send an email.

2012-07-16

Convenient access to Gapminder's datasets from R

In April, Hans Rosling examined the influence of religion on fertility. I used R to replicate a graphic of his talk:

> library(datamart)
> gm <- gapminder()
> #queries(gm)
> #
> # babies per woman
> tmp <- query(gm, "TotalFertilityRate")
> babies <- as.vector(tmp["2008"])
> names(babies) <- names(tmp)
> babies <- babies[!is.na(babies)]
> countries <- names(babies)
> #
> # income per capita, PPP adjusted
> tmp <- query(gm, "IncomePerCapita")
> income <- as.vector(tmp["2008"])
> names(income) <- names(tmp)
> income <- income[!is.na(income)]
> countries <- intersect(countries, names(income))
> #
> # religion
> tmp <- query(gm, "MainReligion")
> religion <- tmp[,"Group"]
> names(religion) <- tmp[,"Entity"]
> religion[religion==""] <- "unknown"
> colcodes <- c(
+   Christian="blue", 
+   "Eastern religions"="red", 
+   Muslim="green", "unknown"="grey"
+ )
> countries <- intersect(countries, names(religion))
> #
> # plot
> par(mar=c(4,4,0,0)+0.1)
> plot(
+   x=income[countries], 
+   y=babies[countries], 
+   col=colcodes[religion[countries]], 
+   log="x",
+   xlab="Income per Person, PPP-adjusted", 
+   ylab="Babies per Woman"
+ )
> legend(
+   "topright", 
+   legend=names(colcodes), 
+   fill=colcodes, 
+   border=colcodes
+ )

One of the points Rosling wanted to make is: Religion has no or very little influence on fertility, but economic welfare has. I wonder if demographers agree and take this economic effect into account.

If you want to know more about that gapminder function and that query method, read on.

2012-06-24

Querying DBpedia from R

DBpedia is an extract of structured information from Wikipedia. The structured data can be retrieved using SPARQL, an SQL-like query language for RDF. There is already an R package for this kind of query, named SPARQL.

My datamart package includes an S4 class Dbpedia that aims to support the creation of predefined, parameterized queries. Here is an example that retrieves data on the German federal states:

> library(datamart)
> dbp <- dbpedia()
> # see a list of predefined queries
> queries(dbp)
Dbpedia#Nuts1 Xsparql#character 
      "Nuts1"       "character" 
> # lists Federal States
> head(query(dbp, "Nuts1"))
                     name nuts    popDate      pop
1           Niedersachsen  DE9 2007-10-31  7977000
2                  Hessen  DE7 2007-09-30  6073000
3     Nordrhein-Westfalen  DEA 2009-01-31 17920000
4 Freie Hansestadt Bremen  DE5 2007-10-31   664000
5                  Berlin  DE3 2010-09-30  3450889
6             Brandenburg  DE4 2008-12-31  2522493
         area   gdp popMetro
1 47624200000   188       NA
2 21100000000   225       NA
3 34084100000 54107       NA
4   408000000    24       NA
5   891850000    95  4429847
6 29478600000    48       NA

It is straightforward to extend the Dbpedia class for further queries. More challenging in my opinion is to figure out useful queries. Some examples can be found at Bob DuCharme’s blog, in the article by Jos van den Oever at kde.org, in a discussion on a mailing list and a tutorial at the W3C, at Kingsley Idehen’s blog and at DBpedia’s wiki.
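
To give an idea of what a parameterized query boils down to, here is a sketch that fills a SPARQL template with sprintf. The helper make_query and the query itself are illustrative assumptions and do not show how the Dbpedia class stores its queries; the resulting string would still have to be sent to an endpoint with a SPARQL client.

```r
# Hypothetical helper: fill a SPARQL template with a class name and a limit
make_query <- function(class_name, limit = 10) {
  template <- paste(
    "PREFIX dbo: <http://dbpedia.org/ontology/>",
    "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>",
    "SELECT ?item ?label WHERE {",
    "  ?item a dbo:%s ;",
    "        rdfs:label ?label .",
    "  FILTER (lang(?label) = 'de')",
    "} LIMIT %d",
    sep = "\n"
  )
  sprintf(template, class_name, as.integer(limit))
}

cat(make_query("Country", limit = 5))
```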