ScraperWiki describes itself as an online tool for gathering, cleaning and analysing data from the web. It takes a programming-oriented approach: users can implement ETL processes in Python, PHP or Ruby, share these processes with the community (or pay for privacy), and schedule automated runs. The software behind the service is open source, and there is also an offline version of the program library.
As far as I know, ScraperWiki has no R support yet. This is where the scrape method of my datamart package chimes in. It provides an S4 generic for storing scraped data in an offline database. It is not a wiki (although collaboration can take place via r-forge or CRAN), has no scheduling support (scheduling the start of R scripts should take place outside R) and is not intended as a web application.
Example use case — Berlin’s bathing water quality
Here is an example use case. It is (finally) summer in Berlin, and the warm weather invites a swim. For the health-aware and/or data-affine (or those with a strong imagination), the public authorities provide the latest water quality assessments for around 30 bathing areas online as open data.
Only the latest measurement is available. Now if we were interested in collecting the quality measurements, perhaps to find a trend or to search for contributing factors, we would need a mechanism to regularly (in this case, weekly) scrape the data via the online API and save it locally.
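For illustration, a quick manual fetch of the raw JSON might look like the sketch below. It is not part of the datamart framework; I assume here that fromJSON comes from the RJSONIO package (rjson would work similarly), and the URL is the one used in the urldata template later in this post.
> library(RJSONIO)
> # URL taken from the urldata template used later in this post (empty q parameter)
> u <- "http://www.berlin.de/badegewaesser/baden-details/index.php/index/all.json?q="
> raw <- fromJSON(paste(readLines(u, warn=FALSE), collapse=""))
> length(raw[["index"]])   # one list entry per bathing area, latest measurement only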
With the framework proposed in datamart, once the details are defined, the mechanism is executed with just three lines:
> ds <- datastore("path/to/local.db")
> bq <- be_bathing_water_quality()
> scrape(bq, ds) # get & save the data
The rest of this blog post is about defining the process, which boils down to defining the be_bathing_water_quality function.
ETL process — extract, transform, load
The task of the be_bathing_water_quality function is to define a three-step process by passing the process details to the urldata function. In the database or data warehouse context this process is often referred to as an ETL process. In our context, the steps are:
- After mapping a resource name to a URL, the data is extracted, i.e. downloaded. If the network is not available, authentication fails, or something similar happens, the process ends here.
- The extracted data is then transformed into an object R can work with. Currently, only data.frame objects are supported; xts and igraph are envisioned. If the extracted data is not as expected and the transformation fails, the process ends here.
- The tidy data is then stored (or loaded) into a local SQLite database.
The urldata function provides parameters for each of these steps.
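Schematically, the three steps map onto urldata arguments as in the following sketch. The argument names are the ones used in the concrete examples below; the function name my_dataset, the resource name MyResource and the example.org URL are placeholders, not real resources.
> # schematic only -- the concrete call for the bathing water data follows below
> my_dataset <- function() urldata(
+   template="http://example.org/resource?q=%s",   # maps a resource name to a URL
+   map.lst=list(MyResource=""),                   # the resource name(s)
+   extract.fct=fromJSON,                          # step 1: extract (download and parse)
+   transform.fct=function(x) as.data.frame(x),    # step 2: transform into a data.frame
+   scrape.lst=list(MyResource=list(uniq="id"))    # step 3: load into the local database
+ )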
Example: extract and transform
In the bathing water quality example, there is one URL for all bathing areas. We map the resource BathingWaterQuality to this URL. The data is returned as JSON, which is why we use fromJSON as the extraction function. The data is then transformed into a data.frame. (We do not go into every detail of that transformation step.) Hence, the call to urldata is:
> be_bathing_water_quality <- function() urldata(
+   template="http://www.berlin.de/badegewaesser/baden-details/index.php/index/all.json?q=%s",
+   map.lst=list(BathingWaterQuality=""),
+   extract.fct=fromJSON,
+   transform.fct=function(x) {
+     tbl <- x[["index"]]
+     nm <- names(tbl[[1]])
+     dat <- as.data.frame(
+       matrix(NA, length(tbl), length(nm))
+     )
+     colnames(dat) <- nm
+     for(i in 1:length(tbl))
+       dat[i,nm] <- tbl[[i]][nm]
+     # some steps omitted...
+     return(dat)
+   }
+ )
Now we can create a data object, inspect it using the queries method, and access the latest measurements from the web using query:
> bq <- be_bathing_water_quality()
> queries(bq)
[1] "BathingWaterQuality"
> nrow(query(bq, "BathingWaterQuality"))
[1] 38
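To get a feel for the content, we could also peek at a few columns of the returned data.frame. The column names used here (id, badname, dat) are the ones referenced in the scrape.lst example below; the full set of columns depends on the JSON delivered by the API.
> dat <- query(bq, "BathingWaterQuality")
> head(dat[, c("id", "badname", "dat")])   # bathing area id, name and measurement date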
The urldata function, which is part of the datamart package, has been described in more detail in an earlier post on the Gapminder datasets.
Example: load
In order to save the scraped data, one more step is necessary. The urldata function provides a parameter scrape.lst for specifying which resources to save locally, and how:
> be_bathing_water_quality <- function() urldata(
+   template="...", # see above
+   map.lst=list(BathingWaterQuality=""),
+   extract.fct=fromJSON,
+   #transform.fct= # see above
+   scrape.lst=list(
+     BathingWaterQuality=list(
+       uniq=c("id", "badname", "dat"),
+       saveMode="u",
+       dates="dat"
+     )
+   )
+ )
The names in the scrape.lst argument must be a subset of the names of map.lst. The entries of the list are passed as arguments to dbWriteTable:
- saveMode="u" indicates that existing observations should be replaced and new observations appended; uniq defines the columns that determine whether an observation already exists. In the example, an observation is identified by the name of the bathing area (id, badname) and the measurement date (dat). If a place/date combination exists both in the database and in the newly scraped data, the row in the database gets overwritten (see the sketch below).
- There are other options to dbWriteTable, such as specifying columns with dates or timestamps to enforce data type conversion.
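To make the saveMode="u" semantics more concrete, here is a rough stand-alone sketch of such an update-or-append step using RSQLite. It only illustrates the behaviour described above; it is not the actual datamart/dbWriteTable implementation, and the function name upsert is made up for this example.
> library(DBI)
> library(RSQLite)
> upsert <- function(con, table, dat, uniq) {
+   # keep old rows whose key columns are not present in the new data, then append the new data
+   old <- if (dbExistsTable(con, table)) dbReadTable(con, table) else dat[0, ]
+   key_old <- do.call(paste, old[uniq])   # e.g. "id badname dat" per row
+   key_new <- do.call(paste, dat[uniq])
+   kept <- old[!(key_old %in% key_new), ]
+   dbWriteTable(con, table, rbind(kept, dat), overwrite=TRUE)
+ }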
Now, with this extended function definition, we can decide whether to use local or web data:
> bq <- be_bathing_water_quality()
> ds <- datastore(":memory:") # not persistent
> scrape(bq, ds)
> #
> #local:
> system.time(query(bq, "BathingWaterQuality", dbconn=ds))
user system elapsed
0 0 0
> #
> #web:
> system.time(query(bq, "BathingWaterQuality"))
user system elapsed
0.04 0.00 0.23
Usually, the local data is accessed faster. In addition, it allows us to keep observations as long as we want.
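Putting the pieces together, a scheduled weekly run could look like the following sketch: a file-backed SQLite database keeps the history between runs, and the script is started by an external scheduler such as cron, since (as noted above) scheduling is not part of the framework. The file path is only illustrative.
> library(datamart)
> ds <- datastore("~/data/bathing.db")   # file-backed, persists between runs
> bq <- be_bathing_water_quality()
> scrape(bq, ds)                         # appends/updates this week's measurements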
Conclusion
This blog post belongs to a series of articles on an S4 data concept I am currently prototyping. At the moment, the framework allows accessing and querying "conventional" internal datasets, SPARQL endpoints and datasets on the web. The topic of the next blog post will be combining data objects using the Mashup class.
I am also thinking of building on top of this infrastructure. Combining data objects with text templates, similar to brew, is one idea; interactive views using rpanel or something similar is another.
The datamart package is at an early stage, and changes to the classes are likely to happen. I would appreciate feedback on the concept. Please leave a comment or send an email.