2023-02-18

Large Matrix Multiplication: DuckDB vs. SQLite

On my laptop with 16 GB RAM, I would like to perform a matrix-vector multiplication with a sparse matrix of around 10 million columns and 2500 rows. The matrix has only about 2% non-zero entries, but that is still 500 million numbers, plus the row/column indices, which is too much to work with comfortably in memory.
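A rough back-of-the-envelope estimate (my own, assuming triplet storage with two integer indices and one double value per entry) shows why:

# 2% of a 2500 x 10,000,000 matrix
0.02 * 2500 * 1e7                      # 500 million non-zero entries
# triplet form: integer row index + integer column index + double value
0.02 * 2500 * 1e7 * (4 + 4 + 8) / 1e9  # roughly 8 GB, before any copies R makes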

A while ago, I tried using sqlite for this task. It kind of worked, but was too slow to be useful. This weekend, I revisited the problem and tried using duckdb.

TLDR: duckdb is around 20 times faster than sqlite for this problem. The matrix multiplication above takes around 20-30 seconds with duckdb.

First, let’s build a small example and see if the approaches produce the right solution.

set.seed(13)
nvars <- 1000 # columns
ncons <- 500  # rows
n_nonzero <- round(0.2*nvars*ncons) # approximate, there may actually be fewer values
Amat <- data.frame(
    i=sample.int(ncons, n_nonzero, replace=TRUE),
    j=sample.int(nvars, n_nonzero, replace=TRUE),
    x=runif(n_nonzero)
)
Amat <- Amat[!duplicated(Amat[,c("i", "j")]),]
bvec <- runif(nvars)

# Matrix solution
library(Matrix)
AmatSparse <- Matrix::sparseMatrix(i=Amat[,"i"], j=Amat[,"j"], x=Amat[,"x"])
expected <- as.vector(AmatSparse %*% bvec)
str(expected)
# num [1:500] 45.7 41 43.5 38.1 40.8 ...

For sqlite and duckdb, the matrix multiplication is performed as SQL. First, we create connection objects. The function that performs the SQL is identical:

library(DBI)

# set up SQLite
library(RSQLite)
dbname <- "sparsemat.db"
con1 <- dbConnect(drv=RSQLite::SQLite(), dbname=dbname)

# set up duckdb
library(duckdb)
dbname <- "sparsemat.duckdb"
con2 <- dbConnect(duckdb(), dbname)
dbExecute(con2, "PRAGMA memory_limit='1GB';")

# save to SQLite / duckdb
dbWriteTable(con1, name="Amat", value=Amat, overwrite=TRUE)
dbWriteTable(con1, name="bvec", value=data.frame(j=seq(1:nvars), x=bvec), overwrite=TRUE)   
dbWriteTable(con2, name="Amat", value=Amat, overwrite=TRUE)
dbWriteTable(con2, name="bvec", value=data.frame(j=seq(1:nvars), x=bvec), overwrite=TRUE)

mat_mult_sql <- function(con) {
    # one result row per matrix row i: the sum over j of A[i,j]*b[j]
    ans <- dbGetQuery(con, paste(
      "select sum(Amat.x*bvec.x) from Amat",
      "join bvec on Amat.j=bvec.j",
      "group by Amat.i order by Amat.i"
    ))
    return(ans[,1, drop=TRUE])
}
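Before benchmarking, a quick sanity check that both connections reproduce the Matrix result from above:

# both calls should return TRUE
all.equal(expected, mat_mult_sql(con1))
all.equal(expected, mat_mult_sql(con2))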

Let’s benchmark the different approaches:

library(microbenchmark)
chk_fun <- function(...) {
    # microbenchmark passes the list of results as a single argument
    res <- list(...)[[1]]
    for (x in res) {
        if (!isTRUE(all.equal(expected, x))) return(FALSE)
    }
    return(TRUE)
}
microbenchmark(
    sqlite=mat_mult_sql(con1),
    duckdb=mat_mult_sql(con2),
    times=100, check=chk_fun
)
# Unit: milliseconds
#    expr     min       lq      mean   median       uq     max neval
#  sqlite 59.9924 61.77500 62.595894 62.49010 63.17130 75.5612   100
#  duckdb  1.6891  1.77065  1.942087  1.98605  2.04365  2.9142   100

Both approaches produce the correct solution. The duckdb approach is 30 times faster.

But do the approaches also work with the big matrix? And how big are the files created? Let's find out. We build the matrix in 250 chunks, each small enough to fit in memory, and write them to disk one after the other.

library(data.table)
nvars <- 10000000                    # columns
ncons <- 10                          # rows per chunk (250 chunks -> 2500 rows in total)
n_nonzero <- round(0.02*nvars*ncons) # approximate, there may actually be fewer values
set.seed(13)

# the first table
Amat <- data.frame(
    i=sample.int(ncons, n_nonzero, replace=TRUE),
    j=sample.int(nvars, n_nonzero, replace=TRUE),
    x=runif(n_nonzero)
)
setDT(Amat)
Amat <- unique(Amat, by=c("i", "j"))

# the vector to multiply with
bvec <- runif(nvars)

# write to sqlite
dbWriteTable(con1, name="Amat", value=Amat, overwrite=TRUE)
dbWriteTable(con1, name="bvec", value=data.frame(j=seq(1:nvars), x=bvec), overwrite=TRUE)

# write to duckdb
dbWriteTable(con2, name="Amat", value=Amat, overwrite=TRUE)
dbWriteTable(con2, name="bvec", value=data.frame(j=seq(1:nvars), x=bvec), overwrite=TRUE)

# the 249 other tables
for (iter in 2:250) {
    Amat <- data.frame(
        i=sample.int(ncons, n_nonzero, replace=TRUE),
        j=sample.int(nvars, n_nonzero, replace=TRUE),
        x=runif(n_nonzero)
    )
    setDT(Amat)
    Amat <- unique(Amat, by=c("i", "j"))
    Amat[, i := i + (iter-1)*ncons]  # shift row indices so chunk 'iter' covers its own 10 rows
    dbWriteTable(con1, name="Amat", value=Amat, append=TRUE)
    dbWriteTable(con2, name="Amat", value=Amat, append=TRUE)
}

Now the matrix has 500 million non-zero elements. The sqlite database file is 12 GB, the duckdb file is 7 GB.
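The on-disk footprint can be checked directly from R, using the file names from above:

file.size("sparsemat.db") / 1e9      # sqlite, ~12 GB
file.size("sparsemat.duckdb") / 1e9  # duckdb, ~7 GB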

(I also ran both database backends separately. Building the matrix with sqlite took 350 s, with duckdb 230 s.)

So, let’s benchmark the matrix multiplication:

microbenchmark(
    sqlite=mat_mult_sql(con1), # 500sec
    duckdb=mat_mult_sql(con2), # 23 sec
    times=1
)

I did not save the detailed timing results, but the sqlite approach took about 500 s, while the duckdb approach took only 23 s. Again, duckdb is much faster, around 20 times.

So I am deeply impressed by the speed of duckdb. There has clearly been huge progress in OLAP query engines. One small part of it is that duckdb can use more than one thread:

dbExecute(con2, "PRAGMA threads=4;")
dbGetQuery(con2, "SELECT current_setting('threads');")
#   current_setting('threads')
# 1                          4

There is also the bigsparser package, which looks as if it could be even faster than duckdb; however, I could not yet figure out how to create a big matrix on disk with this package.
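For a matrix that already fits in memory, like the small example at the top, my (untested) understanding of the bigsparser API is that as_SFBM() converts a dgCMatrix into a file-backed SFBM and sp_prodVec() performs the matrix-vector product:

library(bigsparser)
# convert an in-memory dgCMatrix (the small example's AmatSparse) to a file-backed SFBM
AmatSFBM <- as_SFBM(AmatSparse, backingfile = tempfile())
# matrix-vector product; the vector must have length ncol(AmatSparse)
sp_prodVec(AmatSFBM, runif(ncol(AmatSparse)))

The missing piece for me is how to build such an SFBM chunk by chunk on disk, without first holding the full sparse matrix in memory.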

2023-02-07

Using chatGPT in the teaching of R

I am currently organising an introductory course in programming with R within my organisation. We use datacamp.com as a resource and meet online once a week to exchange experiences.

This week, my input was an interaction with chatGPT, see below. My impression is that AI can be quite helpful in learning a programming language, especially if you can ask good questions. In the future, the focus for me as a course organiser will be less on the content and more on being together as a learning group and sharing.

Also worth reading on the topic of chatGPT is a discussion thread on datamethods.org.


Here is the AI interaction:

I want to extract the first five characters of a string using R. What would the code look like?

You can extract the first five characters of a string in R using the substring function. Here is an example:

string <- "Hello World"
result <- substring(string, 1, 5)
print(result)

This code will return the following output:

[1] "Hello"

Explain how vectorisation works in R!

In R, vectorization is the process of performing operations on entire arrays of data, rather than iterating over the elements one by one. This is important because vectorized operations are typically much faster than non-vectorized operations, due to the optimized underlying code.

The key to vectorization in R is that all basic data structures in R are arrays, or vectors. This means that operations performed on them are automatically vectorized. For example, if you add two vectors of the same length, the result is a new vector where each element is the sum of the corresponding elements from the input vectors.

Here is an example to illustrate this:

a <- c(1, 2, 3, 4, 5)
b <- c(5, 4, 3, 2, 1)
c <- a + b
print(c)

This code will return the following output:

[1] 6 6 6 6 6

In this example, the + operator is applied to the entire vectors a and b, producing the new vector c that contains the sums of corresponding elements.

It’s worth noting that not all operations in R are vectorized, and some operations can only be vectorized under certain conditions. For example, some operations require that all elements in a vector have the same length and structure, and some operations may return unexpected results if applied to vectors with different lengths.

Give me another example!

Sure, here’s another example that demonstrates vectorization in R:

x <- c(1, 2, 3, 4, 5)
y <- x * 2
print(y)

In this code, we create a vector x with 5 elements, and then use the * operator to multiply each element in x by 2. The result is a new vector y that contains the doubled values of the elements in x.

[1]  2  4  6  8 10

This demonstrates how vectorization allows us to perform operations on entire arrays of data, without having to loop over the elements one by one. This is a more efficient and concise way to work with data in R.

What are the most important string functions in base R?

Here are some of the most commonly used string functions in base R:

  1. nchar: returns the number of characters in a string
  2. substring: extracts a substring from a string
  3. paste: concatenates strings into a single string
  4. toupper: converts all characters in a string to uppercase
  5. tolower: converts all characters in a string to lowercase
  6. gsub: replaces all occurrences of a pattern in a string with a replacement string
  7. grep: searches for a pattern in a string and returns the matching elements
  8. grepl: similar to grep, but returns a logical vector indicating which elements match the pattern
  9. strsplit: splits a string into substrings based on a specified separator
  10. trimws: removes leading and trailing whitespace from a string.

These functions form the basis for many common string manipulation tasks in R, and are commonly used in data cleaning and preparation. It’s worth noting that there are many more string functions available in R, and many specialized string manipulation packages available in the R ecosystem.

Translate excel string functions to their corresponding R functions!

Here are some common Excel string functions and their equivalent R functions:

Excel function   R function
LEFT             substring
RIGHT            substring
MID              substring
LEN              nchar
UPPER            toupper
LOWER            tolower
CONCATENATE      paste
REPLACE          gsub
SEARCH           grep or grepl

Note that the R functions are not always a direct match to their Excel counterparts, but they can be used to achieve the same results. For example, substring can be used to implement the functionality of the LEFT, RIGHT, and MID functions by specifying different arguments for the start and end positions of the desired substring.
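To make that last point concrete, here is my own small illustration of how substring can stand in for LEFT, RIGHT and MID:

s <- "Hello World"
substring(s, 1, 4)                    # LEFT(s, 4)   -> "Hell"
substring(s, nchar(s) - 4, nchar(s))  # RIGHT(s, 5)  -> "World"
substring(s, 7, 11)                   # MID(s, 7, 5) -> "World"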