2012-04-10

Working with strings

R has a lot of string functions, many of them can be found with ls("package:base", pattern="str"). Additionally, there are add-on packages such as stringr, gsubfn and brew that enhance R string processing capabilities. As a statistical language and environment, R has an edge compared to other programming languages when it comes to text mining algorithms or natural language processing. There is even a taskview for this on CRAN.

I am currently playing with markdown files in R, which eventually will result in a new version of mdtools, and collected or created some string functions I like to present in this blogpost. The source code of the functions is at the end of the post, first I show how to use these functions.

Head and tail for strings

The idea for the first two functions I had earlier, and I had to learn that providing a S3 method for head and tail is not an good idea. But strhead and strtail did prove as handy. Here are some usage examples:

> strhead("egghead", 3)
[1] "egg"
> strhead("beagle", -1) # negative index
[1] "beagl"
> strtail(c("bowl", "snowboard"), 3) # vector-able in the first argument
[1] "owl" "ard"

These functions are only syntactic sugar, hopefully easy to memorize because of their similarity to existing R functions. For packages, they are probably not worth introducing an extra dependency. I thought about defining an replacement function like substr does, but I did not try it because head and tail do not have replacement functions.

Bare minimum template

With sprintf, format and pretty, there are powerful functions for formatting strings. However, sometimes I miss the named template syntax as in Python or in Makefiles. So I implemented this in R. Here are some usage examples:

> strsubst(
+   "$(WHAT) is $(HEIGHT) meters high.", 
+   list(
+     WHAT="Berlin's teletower",
+     HEIGHT=348
+   )
+ )
[1] "Berlin's teletower is 348 meters high."
> d <- strptime("2012-03-18", "%Y-%m-%d")
> strsubst(c(
+   "Be careful with dates.",
+   "$(NO_CONV) shows a list.",
+   "$(CONV) is more helpful."),
+   list(
+     NO_CONV=d,
+     CONV= as.character(d)
+   )
+ )
[1] "Be careful with dates."                                                                                        
[2] "list(sec = 0, min = 0, hour = 0, mday = 18, mon = 2, year = 112, wday = 0, yday = 77, isdst = 0) shows a list."
[3] "2012-03-18 is more helpful."                                                                                   

The first argument can be string or a vector of strings such as the output of readLines. The second argument can be any indexable object (i.e. with working [ operator) such as lists. Environments are not indexable hence won’t work.

Parse raw text

Frequently, I need to extract parts from raw text data. For instance, few weeks ago I had to parse a SPSS script (some variable labels were hard-coded theree and not in the .sav file). The script contained lines VARIABLE LABELS some_var "<some_label>". I was interested in some_var and <some_label>. The examples from the R documentation on regexpr gave me the direction and led me to the strparse function that is applied as follows:

> lines <- c(
+     'VARIABLE LABELS weight "weight".',
+     'VARIABLE LABELS altq "Year of birth".',
+     'VARIABLE LABELS hhg "Household size".',
+     'missing values all (-1).',
+     'EXECUTE.'
+ )
> pat <- 'VARIABLE LABELS (?<name>[^\\s]+) \\"(?<lbl>.*)\\".$'
> matches <- grepl(pat, lines, perl=TRUE)
> strparse(pat, lines[matches])
name     lbl             
[1,] "weight" "weight"        
[2,] "altq"   "Year of birth" 
[3,] "hhg"    "Household size"

The function returns a vector if one line was parsed and a matrix otherwise. It supports named groups.

Recoding with regular expressions

Sometimes I need to recode a vector of strings in a way that I find all mathces for a particular regular expression and replace these matches with one string. The I match all remaining strings with a second regular expression and replace the hits with a second replacement. And so on. I wrote the strrecode function to support this operation. The function can be seen as an generalisation of the gsub function. It is the only function without test code. Here is a made-up example analysing process information from the task manager:

> dat <- data.frame(
+     wtitle=c(paste(c("Inbox", "Starred", "All"), "- Google Mail"), paste("file", 1:4, "- Notepad++")),
+     pid=sample.int(9999,7),
+     exe=c(rep("chrome.exe",3), rep("notepad++.exe", 4))
+ )
> dat <- transform(
+     dat,
+     usage=strrecode(c("Google Mail$|Microsoft Outlook$", " - Notepad\\+\\+$|Microsoft Word$"), c("Mail", "Text"), dat$wtitle)
+ )
> dat
wtitle  pid           exe usage
1   Inbox - Google Mail 6810    chrome.exe  Mail
2 Starred - Google Mail 2488    chrome.exe  Mail
3     All - Google Mail 4086    chrome.exe  Mail
4    file 1 - Notepad++ 2946 notepad++.exe  Text
5    file 2 - Notepad++  112 notepad++.exe  Text
6    file 3 - Notepad++ 1176 notepad++.exe  Text
7    file 4 - Notepad++ 8881 notepad++.exe  Text

Interested in the source code of these helper functions? Read on.

The function definitions

This blog post is an intermediary step for a package I am working on. It is called mdtools and will provide a framework for working with markdown formatted text files. So in (the hopefully near) future, you can access the function via the mdtools package. Until then, you can copy and paste:

> strhead
function(s,n=1) {
if(n<0) 
substr(s,1,nchar(s)+n) 
else 
substr(s,1,n)
}
> strtail
function(s,n=1) {
if(n<0) 
substring(s,1-n) 
else 
substring(s,nchar(s)-n+1)
}
> strsubst
function(template, map, verbose=getOption("verbose")) {
pat <- "\\$\\([^\\)]+\\)"
res <- template
map[["$"]] <- "$"
m <- gregexpr(pat, template)
idx <- which(sapply(m, function(x) x[[1]]!=-1)) # faster than 1:length(template)?
for (i in idx) {
line <- template[[i]]
if(verbose) cat("input: |", template[[i]], "|\n")
starts <- m[[i]]
ml <- attr(m[[i]], "match.length")
sym <- substring(line, starts+2, starts+ml-2)
repl <- map[sym]
idx1 <- is.null(repl)
repl[idx1] <- sym[idx1]
norepl <- substring(line, c(1, starts+ml), c(starts-1, nchar(line)))
res[[i]] <- paste(norepl, c(repl, ""), sep="", collapse="") # more elegant?
if (verbose) cat("output: |", res[[i]], "|\n")
}
return(res)
}
> strparse
function(pat, x) {
parsed <- regexpr(pat, x, perl=TRUE)
if (length(x)==1) {
if(parsed[1]==-1) return(NULL)
st <- attr(parsed, "capture.start")[1,]
m <- substring(x, st, st + attr(parsed, "capture.length")[1,]-1)
names(m) <- attr(parsed, "capture.names")
} else {
m <- do.call(rbind, lapply(seq_along(parsed), function(i) {
if(parsed[i] == -1) return("")
st <- attr(parsed, "capture.start")[i, ]
substring(x[i], st, st + attr(parsed, "capture.length")[i, ] - 1)
}))
colnames(m) <- attr(parsed, "capture.names")
}
return(m)
}
> strrecode
function(pats, repls, x, ...) {
res <- rep(NA, length(x))
hits <- rep(FALSE, length(x))
for (i in seq_along(pats)) {
#browser()
new_hits <- grepl(pats[[i]],x[!hits],...)
res[!hits][new_hits] <- repls[[i]]
hits[!hits][new_hits] <- TRUE
if(all(hits)) break
}
return(res)
}

This post has been submitted to R bloggers, a RSS aggregator for R news and tutorials. If you are interested in R, check it out!

No comments: