Skip to content

slow nanoarrow_array_stream to data.frame for strings; slow ALTREP materialization #822

@klin333

Description

@klin333
  • Creating a data.frame directly from a nanoarrow_array_stream is much slower than first converting to an arrow::Table and then to data.frame.
  • I have tried to force ALTREP materialization in both paths for a fair comparison, and confirmed via .Internal(inspect(x)).
  • The key observation is that the direct nanoarrow_array_stream → data.frame() path materializes much slower than the Arrow Table path.
  • This is very similar to R: Benchmarking ADBC (Snowflake) vs ODBC shows issue with larger datasets arrow-adbc#2508 . In fact, the workaround of going via arrow Table is inspired by that issue. However, even though that issue is closed, the reprex below for strings show significant performance difference persists.

The elapsed time to fetch, convert and materialize 100k rows of random strings, of various column sizes, are shown below. The route via Arrow::Table scales about linearly, while the direct route from nanoarrow_arrow_stream to data.frame scales much worse than linearly.

num_cols elapsed_with_arrow elapsed_without_arrow
10 2.4 secs 1.3 secs
20 3.2 secs 2.9 secs
40 6.1 secs 7.0 secs
80 12.7 secs 30.8 secs
160 26.9 secs 920.7 secs
get_con <- function() {
  uri <- "secret"
  adbc_con <- adbcsnowflake::adbcsnowflake() |>
    adbcdrivermanager::adbc_database_init(uri = uri) |>
    adbcdrivermanager::adbc_connection_init()
  adbc_con
}

sweep_cols <- c(10, 20, 40, 80, 160)
results_df <- purrr::map_dfr(sweep_cols, function(num_cols) {
  print(sprintf("num_cols = %i", num_cols))
  
  # write table to snowflake. use separate R process to avoid common string pool
  callr::r(args = list(get_con = get_con, num_cols = num_cols), function(get_con, num_cols) {
    adbc_con <- get_con()
    num_rows <- 1e5
    df <- lapply(seq(num_cols), function(jj) uuid::UUIDgenerate(n = num_rows)) |> 
      rlang::set_names(paste0('x', seq(num_cols))) |>  
      dplyr::bind_cols()
    lobstr::obj_size(df)
    
    adbcdrivermanager::execute_adbc(adbc_con, "DROP TABLE IF EXISTS LARGE1;")
    adbcdrivermanager::write_adbc(df, adbc_con, "LARGE1", mode = "default")
    NULL
  })
  
  # read table with as_arrow_table, use separate R process to avoid common string pool
  elapsed_with_arrow <- callr::r(args = list(get_con = get_con), function(get_con) {
    adbc_con <- get_con()
    t1 <- Sys.time()
    df_read <- adbcdrivermanager::read_adbc(adbc_con, 'SELECT * from "LARGE1"') |> 
      arrow::as_arrow_table() |>
      as.data.frame()
    .Internal(inspect(df_read[[1]]))          # shows 16 STRSXP g0c0 [REF(65535)] arrow::array_string_vector<0x00000299299da350, string, 687 chunks, 0 nulls> len=100000
    tibble_read <- tibble::as_tibble(df_read) # required in order for print to trigger ALTREP materialization
    .Internal(inspect(df_read[[1]]))          # shows: 16 STRSXP g0c0 [REF(65535)] arrow::array_string_vector<0x00000299299da350, string, 687 chunks, 0 nulls> len=100000
    print(tibble_read)                        # takes a bit of time. ALTREP materialization happens here
    .Internal(inspect(df_read[[1]]))          # shows: 16 STRSXP g1c0 [MARK,REF(65535)] materialized arrow::array_string_vector len=100000
    Sys.time() - t1
  })
  
  # read table without as_arrow_table, use separate R process to avoid common string pool. SLOW
  elapsed_without_arrow <- callr::r(args = list(get_con = get_con), function(get_con) {
    adbc_con <- get_con()
    t1 <- Sys.time()
    df_read <- adbcdrivermanager::read_adbc(adbc_con, 'SELECT * from "LARGE1"') |> 
      as.data.frame()                         # slow: materialization happened here
    .Internal(inspect(df_read[[1]]))          # shows STRSXP g1c7 [MARK,REF(1)] (len=100000, tl=0) ie fully materialized standard R vector
    tibble_read <- tibble::as_tibble(df_read) # instant
    print(tibble_read)                        # instant
    Sys.time() - t1
  })
  
  data.frame(num_cols, elapsed_with_arrow, elapsed_without_arrow)
})

results_df
# num_cols elapsed_with_arrow elapsed_without_arrow
# 1       10      2.454586 secs         1.327180 secs
# 2       20      3.211402 secs         2.942045 secs
# 3       40      6.179819 secs         7.045094 secs
# 4       80     12.713749 secs        30.818208 secs
> library(arrow)
> library(nanoarrow)
> library(adbcdrivermanager)
> library(adbcsnowflake)
> sessionInfo()
> sessionInfo()
R version 4.5.2 (2025-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 22631)

Matrix products: default
  LAPACK version 3.12.1

locale:
[1] LC_COLLATE=English_Australia.utf8  LC_CTYPE=English_Australia.utf8
[3] LC_MONETARY=English_Australia.utf8 LC_NUMERIC=C
[5] LC_TIME=English_Australia.utf8

time zone: Australia/Sydney
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] arrow_22.0.0             adbcsnowflake_0.20.0     adbcdrivermanager_0.20.0
[4] nanoarrow_0.7.0-1

loaded via a namespace (and not attached):
 [1] assertthat_0.2.1 R6_2.6.1         bit_4.6.0        tidyselect_1.2.1
 [5] magrittr_2.0.4   glue_1.8.0       tibble_3.3.0     pkgconfig_2.0.3
 [9] bit64_4.6.0-1    dplyr_1.1.4      generics_0.1.4   lifecycle_1.0.4
[13] ps_1.9.1         cli_3.6.5        processx_3.8.6   callr_3.7.6
[17] vctrs_0.6.5      compiler_4.5.2   purrr_1.1.0      pillar_1.11.1
[21] rlang_1.1.6

Suggestions from ChatGPT (I have absolutely no idea if it's sensible):

  • Replicate Arrow's faster materialization algorithm inside nanoarrow (C/C++), focusing on:
  • Chunked processing of record batches (avoid per‑row overhead across the whole table).
  • Minimized PROTECT/UNPROTECT frequency (allocate target STRSXP once per column; protect vectors, not each element).
  • Avoid repeated re‑encoding: call Rf_mkCharLenCE(..., CE_UTF8) using direct string buffer/offsets; bypass translateCharUTF8 in tight loops.
  • Efficient null handling (bitmap/offset reads with branch‑light loops).
  • Optional parallel chunk materialization for very wide tables (behind an option; respect R’s thread‑safety constraints).

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions