On my laptop with 16 GB RAM, I would like to perform a matrix-vector multiplication with a sparse matrix of around 10 million columns and 2500 rows. The matrix has approximately only 2% non-zero entries, but this are still 500 million numbers and the column/row information, too large to work comfortably in-memory.
A while ago, I tried using sqlite
for this task. It kind
of worked, but was too slow to be useful. This weekend, I revisited the
problem and tried using duckdb
.
TLDR: duckdb is amazingly faster than sqlite for this problem, around 20 times. The matrix multiplication above takes around 20-30seconds.
First, let’s build a small example and see if the approaches produce the right solution.
set.seed(13)
nvars <- 1000 # columns
ncons <- 500 # rows
n_nonzero <- round(0.2*nvars*ncons) # approximate, there may be actually less values
Amat <- data.frame(
i=sample.int(ncons, n_nonzero, replace=TRUE),
j=sample.int(nvars, n_nonzero, replace=TRUE),
x=runif(n_nonzero)
)
Amat <- Amat[!duplicated(Amat[,c("i", "j")]),]
bvec <- runif(nvars)
# Matrix solution
library(Matrix)
AmatSparse <- Matrix::sparseMatrix(i=Amat[,"i"], j=Amat[,"j"], x=Amat[,"x"])
expected <- as.vector(AmatSparse %*% bvec)
str(expected)
# num [1:500] 45.7 41 43.5 38.1 40.8 ...
For sqlite and duckdb, the matrix multiplication is performed as SQL. First, we create connection objects. The function that performs the SQL is identical:
library(DBI)
# set up SQLite
library(RSQLite)
dbname <- "sparsemat.db"
con1 <- dbConnect(drv=RSQLite::SQLite(), dbname=dbname)
# set up duckdb
library(duckdb)
dbname <- "sparsemat.duckdb"
con2 <- dbConnect(duckdb(), dbname)
dbExecute(con2, "PRAGMA memory_limit='1GB';")
# save to SQLite / duckdb
dbWriteTable(con1, name="Amat", value=Amat, overwrite=TRUE)
dbWriteTable(con1, name="bvec", value=data.frame(j=seq(1:nvars), x=bvec), overwrite=TRUE)
dbWriteTable(con2, name="Amat", value=Amat, overwrite=TRUE)
dbWriteTable(con2, name="bvec", value=data.frame(j=seq(1:nvars), x=bvec), overwrite=TRUE)
mat_mult_sql <- function(con) {
ans <- dbGetQuery(con, paste(
"select sum(Amat.x*bvec.x) from Amat",
"join bvec on Amat.j=bvec.j",
"group by Amat.i order by Amat.i"
))
return(ans[,1, drop=TRUE])
}
Let’s benchmark the different approaches:
library(microbenchmark)
chk_fun <- function(...) {
res <- list(...)[[1]]
for (x in res) {
if(!all.equal(expected, x)) return(FALSE)
}
return(TRUE)
}
microbenchmark(
sqlite=mat_mult_sql(con1),
duckdb=mat_mult_sql(con2),
times=100, check=chk_fun
)
# Unit: milliseconds
# expr min lq mean median uq max neval
# sqlite 59.9924 61.77500 62.595894 62.49010 63.17130 75.5612 100
# duckdb 1.6891 1.77065 1.942087 1.98605 2.04365 2.9142 100
Both approaches produce the correct solution. The duckdb
approach is 30 times faster.
But do the approaches work with the big matrix, too? And how big are the files created? Let’s find out. We create 250 small-enough matrices in memory and write them to the disk.
library(data.table)
nvars <- 10000000
ncons <- 10
n_nonzero <- round(0.02*nvars*ncons) # approximate, there may be actually less values
set.seed(13)
# the first table
Amat <- data.frame(
i=sample.int(ncons, n_nonzero, replace=TRUE),
j=sample.int(nvars, n_nonzero, replace=TRUE),
x=runif(n_nonzero)
)
setDT(Amat)
Amat <- unique(Amat, by=c("i", "j"))
# save to SQLite / duckdb
bvec <- runif(nvars)
# write to sqlite
dbWriteTable(con1, name="Amat", value=Amat, overwrite=TRUE)
dbWriteTable(con1, name="bvec", value=data.frame(j=seq(1:nvars), x=bvec), overwrite=TRUE)
# write to duckdb
dbWriteTable(con2, name="Amat", value=Amat, overwrite=TRUE)
dbWriteTable(con2, name="bvec", value=data.frame(j=seq(1:nvars), x=bvec), overwrite=TRUE)
# the 249 other tables
for (iter in 2:250) {
Amat <- data.frame(
i=sample.int(ncons, n_nonzero, replace=TRUE),
j=sample.int(nvars, n_nonzero, replace=TRUE),
x=runif(n_nonzero)
)
setDT(Amat)
Amat <- unique(Amat, by=c("i", "j"))
Amat[,i:=i+(iter-1)*500]
dbWriteTable(con1, name="Amat", value=Amat, append=TRUE)
dbWriteTable(con2, name="Amat", value=Amat, append=TRUE)
}
Now the matrix has 500 million non-zero elements. The sqlite database is 12GB, the duckdb is 7GB large.
(I also did run both database backends separately. Building the matrix with sqlite took 350s, with duckdb 230s.)
So, let’s benchmark the matrix multiplication:
microbenchmark(
sqlite=mat_mult_sql(con1), # 500sec
duckdb=mat_mult_sql(con2), # 23 sec
times=1
)
I did not save the detailled timing results, but the sqlite approach took about 500s, while the duckdb approach took only 23s. Again, duckdb is much faster, around 20 times.
So I am deeply impressed by the speed of duckdb. There seem to be huge progressions in OLAP queries. One small part of it is that duckdb can use more than one thread:
dbExecute(con2, "PRAGMA threads=4;")
dbGetQuery(con2, "SELECT current_setting('threads');")
# current_setting('threads')
# 1 4
There is also the bigsparser
which looks as if it can be
even faster than duckdb
, however, I could not yet figure
out how to create a big matrix on disk with this package.