2012-04-10

Working with strings

R has a lot of string functions; many of them can be found with ls("package:base", pattern="str"). Additionally, there are add-on packages such as stringr, gsubfn and brew that enhance R's string processing capabilities. As a statistical language and environment, R has an edge over other programming languages when it comes to text mining algorithms or natural language processing. There is even a task view for this on CRAN.

I am currently playing with markdown files in R, which will eventually result in a new version of mdtools, and I have collected or created some string functions that I would like to present in this blog post. The source code of the functions is at the end of the post; first I show how to use them.

Head and tail for strings

I had the idea for the first two functions earlier, and I had to learn that providing an S3 method for head and tail is not a good idea. But strhead and strtail did prove handy. Here are some usage examples:

> strhead("egghead", 3)
[1] "egg"
> strhead("beagle", -1) # negative index
[1] "beagl"
> strtail(c("bowl", "snowboard"), 3) # vectorized over the first argument
[1] "owl" "ard"

These functions are only syntactic sugar, hopefully easy to memorize because of their similarity to existing R functions. For packages, they are probably not worth introducing an extra dependency. I thought about defining a replacement function like the one substr has, but I did not try it because head and tail do not have replacement functions.
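
For readers who want to try the idea right away, here is a minimal sketch of how such helpers could be written on top of substr and substring. It is not the source code from the end of the post; the names strhead_sketch and strtail_sketch are made up for this illustration, and n is assumed to be a single integer.

```r
# Sketch only: hypothetical re-implementation, not the original functions.
# A negative n drops characters, mirroring head(x, -1) and tail(x, -1).
strhead_sketch <- function(s, n) {
  if (n < 0) substr(s, 1, nchar(s) + n) else substr(s, 1, n)
}

strtail_sketch <- function(s, n) {
  if (n < 0) substring(s, 1 - n) else substring(s, nchar(s) - n + 1)
}

strhead_sketch("egghead", 3)
strhead_sketch("beagle", -1)
strtail_sketch(c("bowl", "snowboard"), 3)
```

Because substr and substring are vectorized over their first argument, the sketch handles character vectors without an explicit loop.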

Bare minimum template

With sprintf, format and pretty, R has powerful functions for formatting strings. However, sometimes I miss the named template syntax found in Python or in Makefiles. So I implemented this in R. Here are some usage examples:

> strsubst(
+   "$(WHAT) is $(HEIGHT) meters high.", 
+   list(
+     WHAT="Berlin's teletower",
+     HEIGHT=348
+   )
+ )
[1] "Berlin's teletower is 348 meters high."
> d <- strptime("2012-03-18", "%Y-%m-%d")
> strsubst(c(
+   "Be careful with dates.",
+   "$(NO_CONV) shows a list.",
+   "$(CONV) is more helpful."),
+   list(
+     NO_CONV=d,
+     CONV= as.character(d)
+   )
+ )
[1] "Be careful with dates."                                                                                        
[2] "list(sec = 0, min = 0, hour = 0, mday = 18, mon = 2, year = 112, wday = 0, yday = 77, isdst = 0) shows a list."
[3] "2012-03-18 is more helpful."                                                                                   

The first argument can be a string or a vector of strings, such as the output of readLines. The second argument can be any indexable object (i.e. one with a working [ operator), such as a list. Environments are not indexable and hence won’t work.
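
The substitution itself can be sketched in a few lines. The function below is a hypothetical stand-in (strsubst_sketch is my name for it, not the function from this post): it loops over the names of the mapping and replaces each literal $(NAME) placeholder with gsub and fixed = TRUE. It also shows one possible workaround for environments, namely converting them with as.list first.

```r
# Sketch only: a hypothetical minimal template substitution.
strsubst_sketch <- function(template, map) {
  for (key in names(map)) {
    placeholder <- paste0("$(", key, ")")
    # Collapse the value into one string; gsub expects a single replacement.
    value <- paste(as.character(map[[key]]), collapse = " ")
    template <- gsub(placeholder, value, template, fixed = TRUE)
  }
  template
}

strsubst_sketch(
  "$(WHAT) is $(HEIGHT) meters high.",
  list(WHAT = "Berlin's teletower", HEIGHT = 348)
)

# An environment can be converted to a list before it is passed in.
env <- new.env()
assign("WHAT", "R", envir = env)
strsubst_sketch("$(WHAT) is fun.", as.list(env))
```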

Parse raw text

Frequently, I need to extract parts from raw text data. For instance, a few weeks ago I had to parse an SPSS script (some variable labels were hard-coded there and not in the .sav file). The script contained lines of the form VARIABLE LABELS some_var "<some_label>". I was interested in some_var and <some_label>. The examples from the R documentation on regexpr gave me the direction and led me to the strparse function, which is applied as follows:

> lines <- c(
+     'VARIABLE LABELS weight "weight".',
+     'VARIABLE LABELS altq "Year of birth".',
+     'VARIABLE LABELS hhg "Household size".',
+     'missing values all (-1).',
+     'EXECUTE.'
+ )
> pat <- 'VARIABLE LABELS (?<name>[^\\s]+) \\"(?<lbl>.*)\\".$'
> matches <- grepl(pat, lines, perl=TRUE)
> strparse(pat, lines[matches])
     name     lbl             
[1,] "weight" "weight"        
[2,] "altq"   "Year of birth" 
[3,] "hhg"    "Household size"

The function returns a vector if one line was parsed and a matrix otherwise. It supports named groups.
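
The mechanism behind this is the capture information that regexpr attaches to its result when perl=TRUE is used. The following is a rough sketch along the lines of the regexpr documentation example, not the strparse source; strparse_sketch is a hypothetical name, and unlike the function described above it always returns a matrix.

```r
# Sketch only: extract named capture groups via regexpr(..., perl = TRUE).
strparse_sketch <- function(pattern, x) {
  m <- regexpr(pattern, x, perl = TRUE)
  starts <- attr(m, "capture.start")           # one row per string, one column per group
  ends <- starts + attr(m, "capture.length") - 1
  out <- substring(x, starts, ends)            # x is recycled across the columns
  dim(out) <- dim(starts)
  colnames(out) <- attr(m, "capture.names")
  out
}

lines <- c(
  'VARIABLE LABELS weight "weight".',
  'VARIABLE LABELS altq "Year of birth".'
)
pat <- 'VARIABLE LABELS (?<name>[^\\s]+) \\"(?<lbl>.*)\\".$'
strparse_sketch(pat, lines)
```

Lines that do not match yield empty strings in this sketch, which is why the example above filters with grepl first.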

Recoding with regular expressions

Sometimes I need to recode a vector of strings in such a way that I find all matches for a particular regular expression and replace these matches with one string. Then I match all remaining strings with a second regular expression and replace the hits with a second replacement. And so on. I wrote the strrecode function to support this operation. The function can be seen as a generalisation of the gsub function. It is the only function without test code. Here is a made-up example analysing process information from the task manager:

> dat <- data.frame(
+     wtitle=c(paste(c("Inbox", "Starred", "All"), "- Google Mail"), paste("file", 1:4, "- Notepad++")),
+     pid=sample.int(9999,7),
+     exe=c(rep("chrome.exe",3), rep("notepad++.exe", 4))
+ )
> dat <- transform(
+     dat,
+     usage=strrecode(c("Google Mail$|Microsoft Outlook$", " - Notepad\\+\\+$|Microsoft Word$"), c("Mail", "Text"), dat$wtitle)
+ )
> dat
                 wtitle  pid           exe usage
1   Inbox - Google Mail 6810    chrome.exe  Mail
2 Starred - Google Mail 2488    chrome.exe  Mail
3     All - Google Mail 4086    chrome.exe  Mail
4    file 1 - Notepad++ 2946 notepad++.exe  Text
5    file 2 - Notepad++  112 notepad++.exe  Text
6    file 3 - Notepad++ 1176 notepad++.exe  Text
7    file 4 - Notepad++ 8881 notepad++.exe  Text
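
The cascading logic described above can be sketched as follows. This is a hypothetical version (strrecode_sketch is not the function from this post), and it makes one assumption the text does not settle: strings that match none of the patterns are left unchanged.

```r
# Sketch only: recode whole strings by the first pattern that matches.
strrecode_sketch <- function(patterns, replacements, x) {
  x <- as.character(x)                 # also accepts factors
  out <- x
  done <- rep(FALSE, length(x))
  for (i in seq_along(patterns)) {
    # Only strings not recoded by an earlier pattern are considered.
    hit <- !done & grepl(patterns[i], x, perl = TRUE)
    out[hit] <- replacements[i]
    done <- done | hit
  }
  out                                  # assumption: unmatched strings stay as they are
}

titles <- c("Inbox - Google Mail", "file 1 - Notepad++", "Solitaire")
strrecode_sketch(
  c("Google Mail$|Microsoft Outlook$", " - Notepad\\+\\+$|Microsoft Word$"),
  c("Mail", "Text"),
  titles
)
```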

Interested in the source code of these helper functions? Read on.

2012-02-25

Comparing my expenses

Now that I have collected my expenditures via Twitter and bank transaction data, and categorized them according to COICOP, I compare them in this blog post with the typical expenditures of a household of my type with an income level like mine.

Every five years (last time in 2008), the German Federal Statistical Office conducts a survey on income and consumption. On their website, you can find this nice visualization.

The dataset I used is provided here. It is also part of the pft package. The dataset is not identical to the official data; some information is lost in the processing.

Here is the chart comparing my expenses in 2011 with the typical expenses.

The main problem when comparing myself with a typical consumer is that 15% of my expenses remained uncategorized. If I assume they go to Food, Recreation, Restaurants or Misc, and add those amounts, it turns out that almost half of my expenses go to these four categories, while 35% would be typical. In particular, I tend to spend more money on Recreation and Restaurants. On the other hand, I spend less money on Housing (small apartment) and Transportation (no car).

2012-02-04

Berlin's children

A few years ago, a newspaper claimed that the block I live in, Prenzlauer Berg in Berlin, is the most fertile region in Europe. It was a hoax, as this (German) newspaper article points out. (The article has become quite famous because it coined the term Bionade Biedermeier to describe the lifestyle in this area.)

However, there are more children in my district than in the other parts of Berlin. Have a look at this map:


(The base map and population data come from the State’s statistical office. Data at block level is not readily available, though.)

The place where I live is marked by a crosshair. Indeed, in this district there is a “higher exposure to kids” than in the other districts: one child per 1000 inhabitants more than in Friedrichshain-Kreuzberg, and twelve children per 1000 more than in Charlottenburg-Wilmersdorf. Exposure is of course different from fertility; maybe that is what I learned from playing with the map.

If you want to know how to draw this thematic map with R, and add the point and the legends to it, or if you are just looking for a shapefile of Berlin, then read on.

2012-01-28

Categorizing my expenses

In order to analyse my expenses, a classification scheme is necessary. I need to identify categories that are meaningful to me. I decided to go with the “Classification of Individual Consumption by Purpose” (COICOP), for three reasons:

  • It is made by people who have thought more about consumption classification than I ever will.
  • It is feasible to assign bank transactions and tracked cash spending to one of the 12 top-level categories.
  • It is widely used by statistics divisions, e.g. the Federal Statistical Office of Germany, Eurostat, and the UN. This means I can do social comparisons: In which categories do I spend more money than the average? Do the prices I pay rise faster than the price indices suggest?

So I classified my last year’s expense data according to COICOP. Here is a chart showing the portions of the categories for each month:

For me, the holidays, prepared in August and taken in September (shown as unknown expenses), are much more dominant than I expected. Except for the new glasses in September, I did not make any larger investments.

I like this kind of chart more than stacked bar charts because the history for each category is very visible. This chart is called an inkblot chart. I stumbled on it on junk charts, asked on StackOverflow how to implement it in R, and included a revised version in the latest pft package. See below for more information.

2012-01-08

Tracking my expenses

One new-year resolution I made last year was to understand where my money goes. From previous experiments I know that expense tracking has to be as simple as possible. My approach is to

  • Use my cash card as often as possible. This automatically tracks the date and some information on the vendor.
  • Use twitter to track my cash expenses. This supplements the bank account statement data.
  • Edit, enrich, merge and visualise the two data sources with R. Because it is fun playing with R!

After more than one year of expense tracking, I can now analyse the results. The first result, however, was disappointing. My cash tracking with twitter was not as complete as I thought it was. Below is a figure that displays the sum tracked with twitter divided by the sum withdrawn from my bank account for each month of 2011.

If I had tracked my cash expenses completely, the ratio would be around 100 percent, the gray dashed line. However, it is systematically below that. For September, there is an explanation: I was on holiday and intentionally did not track the expenses. But even considering that, 18 percent of my cash spending remains unexplained!

More analysis results will follow. If you are interested in technical aspects of the expense tracking, such as importing the tweets and bank statements, read on. However, there is no R code today, since there is no example data.

2011-12-30

Electricity prices rose by 16% in two years

With about 490 kWh of electricity consumption per year, I am a rather small customer. Over the year this sums up to about 160 EUR. So I had a look at the costs.

I was surprised to learn that the prices rose quite clearly. My tariff is two-part: a base price and a kilowatt-hour rate. If I look at the total costs and divide them by my consumption, I get

year  EUR-Cent/kWh
2008  29.8
2009  33.1
2010  34.6

This means that in 2010 I had to pay 4% more than in 2009 and 16% more than in 2008. My energy use per day, on the other hand, remained quite stable at 1.35 kWh/day.
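
The percentage changes can be recomputed from the table in a few lines of R. This sketch uses only the rounded prices shown above, so the results (roughly 4.5% and 16%) may differ slightly from percentages based on the unrounded bills; the variable names are my own.

```r
# Average price in EUR-Cent/kWh, rounded values from the table above.
price <- c("2008" = 29.8, "2009" = 33.1, "2010" = 34.6)

# Relative increase of the 2010 price over the two earlier years, in percent.
change_vs_2009 <- (price[["2010"]] / price[["2009"]] - 1) * 100
change_vs_2008 <- (price[["2010"]] / price[["2008"]] - 1) * 100

round(c(vs_2009 = change_vs_2009, vs_2008 = change_vs_2008), 1)
```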

What can I do about it? The first impulse was to collect more data, maybe by using a WiFi energy sensor. That would be really fun. However, I cannot see many options to reduce my consumption. Maybe I can turn off the internet router while at work.

It seems more sensible to me to switch my energy provider, which I did today. My new energy provider will charge a bit more, but gives me a 50 EUR new-customer bonus and claims to deliver certified renewable energy.

2011-12-29

How much is a shower?

After looking at my heating expenses, I turned to the costs for water heating. For some time, I looked at my water meter before and after taking a shower or a bath. Quite often, I forgot one or the other measurement, but I collected about 40 observations. Here is what they look like:

The data suggest that a shower takes between 17 and 26.5 liters of hot water and between 11 and 16.5 liters of cold water. For a bath, it is 60 to 77 liters of hot water and 24 to 32.5 liters of cold water. (The numbers refer to the 25th and 75th percentiles, respectively.) The larger share of cold water for a shower makes sense, since I use cold water at the end of the shower for its “invigorating effect”.

Multiplied by the average costs, as charged by my landlord over the last three years, a bath costs 0.94 to 1.22 EUR and a shower costs 0.29 to 0.45 EUR (again, first and third quartiles). So taking a shower for 0.50 EUR at the fitness club is not optimal, but also not very expensive.

There are water-saving shower heads for 30 EUR. It is advertised that such a shower head uses 6.5 liters per minute instead of the usual 15 to 16 liters per minute. I believe 15 liters per minute is too much. So let’s assume I save 3.5 liters per minute or (using the median water use of the data) 12 liters per shower. Is it cost-efficient?

Twelve liters less per shower at 10.5 EUR/cbm means a saving of 0.126 EUR per shower. I assume 20 showers a month. This is tentative, since with this assumption the costs sum up to 70 EUR a year for hot water, while my bills amount on average to 110 EUR total costs for hot water. So the new shower head saves 20*0.126=2.52 EUR per month.
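
The arithmetic of this paragraph, written out as a small R sketch (the variable names are made up here; the numbers are the ones from the text):

```r
liters_saved_per_shower <- 12     # assumed saving per shower, in liters
price_per_cbm <- 10.5             # EUR per cubic meter (1 cbm = 1000 liters)
showers_per_month <- 20           # tentative assumption from the text

saving_per_shower <- liters_saved_per_shower / 1000 * price_per_cbm
saving_per_month <- showers_per_month * saving_per_shower

c(per_shower = saving_per_shower, per_month = saving_per_month)
```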

Let’s calculate the payback period. With an interest rate of 4% p.a. and 5 years expected serviceable life the monthly gain calculates to

2.52 - 30 * 0.04 / 12 - 30 / (5 * 12) = 1.92

so after 30 / 1.92 ≈ 16 months, the investment is repaid. This seems acceptable. But just to be sure, let’s calculate the internal rate of return, too. In Excel, there is a function IRR for this procedure, which is implemented in three lines in R. For convenience, the irr function is stored in the pft package. The result is 8.3%, which seems decent.
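
Both calculations can be sketched in R. The irr_sketch function below is a hypothetical stand-in, not the irr function from pft: it finds the rate at which the net present value of the cash flows is zero, using uniroot. I am assuming cash flows of -30 EUR up front followed by 60 monthly savings of 2.52 EUR; under that assumption the root is a monthly rate of roughly 8.3%, which is consistent with the figure above, but the original function may be defined differently.

```r
investment <- 30                  # price of the shower head in EUR
saving_per_month <- 2.52          # EUR saved per month
interest_rate <- 0.04             # per year
lifetime_months <- 5 * 12

# Monthly gain after interest and straight-line depreciation, then payback time.
monthly_gain <- saving_per_month -
  investment * interest_rate / 12 -
  investment / lifetime_months
payback_months <- investment / monthly_gain

# Sketch only: internal rate of return as the root of the net present value.
irr_sketch <- function(cashflows, interval = c(1e-6, 1)) {
  npv <- function(r) sum(cashflows / (1 + r)^(seq_along(cashflows) - 1))
  uniroot(npv, interval = interval, tol = 1e-9)$root
}

# Assumption: one payment now, then one saving at the end of each month.
cashflows <- c(-investment, rep(saving_per_month, lifetime_months))
monthly_irr <- irr_sketch(cashflows)

c(monthly_gain = monthly_gain, payback_months = payback_months, monthly_irr = monthly_irr)
```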

Just for reference, here comes the R code for the plot and the irr function: