You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
The elapsed time to fetch, convert and materialize 100k rows of random strings, of various column sizes, are shown below. The route via Arrow::Table scales about linearly, while the direct route from nanoarrow_arrow_stream to data.frame scales much worse than linearly.
num_cols
elapsed_with_arrow
elapsed_without_arrow
10
2.4 secs
1.3 secs
20
3.2 secs
2.9 secs
40
6.1 secs
7.0 secs
80
12.7 secs
30.8 secs
160
26.9 secs
920.7 secs
get_con<-function() {
uri<-"secret"adbc_con<-adbcsnowflake::adbcsnowflake() |>adbcdrivermanager::adbc_database_init(uri=uri) |>adbcdrivermanager::adbc_connection_init()
adbc_con
}
sweep_cols<- c(10, 20, 40, 80, 160)
results_df<-purrr::map_dfr(sweep_cols, function(num_cols) {
print(sprintf("num_cols = %i", num_cols))
# write table to snowflake. use separate R process to avoid common string poolcallr::r(args=list(get_con=get_con, num_cols=num_cols), function(get_con, num_cols) {
adbc_con<- get_con()
num_rows<-1e5df<- lapply(seq(num_cols), function(jj) uuid::UUIDgenerate(n=num_rows)) |>rlang::set_names(paste0('x', seq(num_cols))) |>dplyr::bind_cols()
lobstr::obj_size(df)
adbcdrivermanager::execute_adbc(adbc_con, "DROP TABLE IF EXISTS LARGE1;")
adbcdrivermanager::write_adbc(df, adbc_con, "LARGE1", mode="default")
NULL
})
# read table with as_arrow_table, use separate R process to avoid common string poolelapsed_with_arrow<-callr::r(args=list(get_con=get_con), function(get_con) {
adbc_con<- get_con()
t1<- Sys.time()
df_read<-adbcdrivermanager::read_adbc(adbc_con, 'SELECT * from "LARGE1"') |>arrow::as_arrow_table() |>
as.data.frame()
.Internal(inspect(df_read[[1]])) # shows 16 STRSXP g0c0 [REF(65535)] arrow::array_string_vector<0x00000299299da350, string, 687 chunks, 0 nulls> len=100000tibble_read<-tibble::as_tibble(df_read) # required in order for print to trigger ALTREP materialization
.Internal(inspect(df_read[[1]])) # shows: 16 STRSXP g0c0 [REF(65535)] arrow::array_string_vector<0x00000299299da350, string, 687 chunks, 0 nulls> len=100000
print(tibble_read) # takes a bit of time. ALTREP materialization happens here
.Internal(inspect(df_read[[1]])) # shows: 16 STRSXP g1c0 [MARK,REF(65535)] materialized arrow::array_string_vector len=100000
Sys.time() -t1
})
# read table without as_arrow_table, use separate R process to avoid common string pool. SLOWelapsed_without_arrow<-callr::r(args=list(get_con=get_con), function(get_con) {
adbc_con<- get_con()
t1<- Sys.time()
df_read<-adbcdrivermanager::read_adbc(adbc_con, 'SELECT * from "LARGE1"') |>
as.data.frame() # slow: materialization happened here
.Internal(inspect(df_read[[1]])) # shows STRSXP g1c7 [MARK,REF(1)] (len=100000, tl=0) ie fully materialized standard R vectortibble_read<-tibble::as_tibble(df_read) # instant
print(tibble_read) # instant
Sys.time() -t1
})
data.frame(num_cols, elapsed_with_arrow, elapsed_without_arrow)
})
results_df# num_cols elapsed_with_arrow elapsed_without_arrow# 1 10 2.454586 secs 1.327180 secs# 2 20 3.211402 secs 2.942045 secs# 3 40 6.179819 secs 7.045094 secs# 4 80 12.713749 secs 30.818208 secs
> library(arrow)
> library(nanoarrow)
> library(adbcdrivermanager)
> library(adbcsnowflake)
> sessionInfo()
> sessionInfo()
R version 4.5.2 (2025-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 22631)
Matrix products: default
LAPACK version 3.12.1
locale:
[1] LC_COLLATE=English_Australia.utf8 LC_CTYPE=English_Australia.utf8
[3] LC_MONETARY=English_Australia.utf8 LC_NUMERIC=C
[5] LC_TIME=English_Australia.utf8
time zone: Australia/Sydney
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] arrow_22.0.0 adbcsnowflake_0.20.0 adbcdrivermanager_0.20.0
[4] nanoarrow_0.7.0-1
loaded via a namespace (and not attached):
[1] assertthat_0.2.1 R6_2.6.1 bit_4.6.0 tidyselect_1.2.1
[5] magrittr_2.0.4 glue_1.8.0 tibble_3.3.0 pkgconfig_2.0.3
[9] bit64_4.6.0-1 dplyr_1.1.4 generics_0.1.4 lifecycle_1.0.4
[13] ps_1.9.1 cli_3.6.5 processx_3.8.6 callr_3.7.6
[17] vctrs_0.6.5 compiler_4.5.2 purrr_1.1.0 pillar_1.11.1
[21] rlang_1.1.6
Suggestions from ChatGPT (I have absolutely no idea if it's sensible):
Replicate Arrow's faster materialization algorithm inside nanoarrow (C/C++), focusing on:
Chunked processing of record batches (avoid per‑row overhead across the whole table).
Minimized PROTECT/UNPROTECT frequency (allocate target STRSXP once per column; protect vectors, not each element).
Avoid repeated re‑encoding: call Rf_mkCharLenCE(..., CE_UTF8) using direct string buffer/offsets; bypass translateCharUTF8 in tight loops.
Efficient null handling (bitmap/offset reads with branch‑light loops).
Optional parallel chunk materialization for very wide tables (behind an option; respect R’s thread‑safety constraints).
.Internal(inspect(x)).The elapsed time to fetch, convert and materialize 100k rows of random strings, of various column sizes, are shown below. The route via Arrow::Table scales about linearly, while the direct route from nanoarrow_arrow_stream to data.frame scales much worse than linearly.
Suggestions from ChatGPT (I have absolutely no idea if it's sensible):