Data corruption when inserting OracleDataFrames that are views #574

@vladidobro

Description

Hi,

  1. What versions are you using?
con.version: 23.26.1.1.0
platform.platform: macOS-26.2-arm64-arm-64bit
sys.maxsize > 2**32: True
platform.python_version: 3.12.12
oracledb.__version__: 3.4.2
  2. Is it an error or a hang or a crash?

Error

  3. What error(s) or behavior are you seeing?

import pandas as pd
import oracledb

with oracledb.connect(...) as con:
    cur = con.cursor()
    cur.execute('create private temporary table ora$ptt_mytest (a int) on commit drop definition')
    df = pd.DataFrame({'A': [0, 1]}).convert_dtypes(dtype_backend='pyarrow')
    df = df.iloc[1:2]
    print('Inserted:\n', df)
    cur.executemany('insert into ora$ptt_mytest (a) values (:a)', df)
    print('Selected:\n', pd.DataFrame.from_arrow(con.fetch_df_all('select * from ora$ptt_mytest')))

Both the "inserted" and "selected" dataframes should be the same, but they differ:

Inserted:
    A
1  1
Selected:
      A
0  0.0

We have isolated the problem: it occurs only when the dataframe is some kind of view of another one.
Here we create a dataframe with two rows and then call df.iloc[1:2], which selects only the second row.
Instead, oracledb inserted the first row.

This does not happen if we (I believe):

  • don't call .convert_dtypes(), because then the Arrow PyCapsule interface creates new buffers
  • round trip the dataframe e.g. through a file, so that we are sure it owns its data

This leads me to believe that when I call df.iloc[1:2], the result points into the same Arrow buffer with a non-zero offset, and oracledb ignores this offset (but not the length, because only one row from the buffer is inserted).

We ran into this problem with manual batching, i.e.

big_df: pd.DataFrame
batches = [big_df.iloc[i:i+1000] for i in range(0, len(big_df), 1000)]
for b in batches:
    cur.executemany('INSERT ...', b)

which kept inserting the first batch again and again.

This could be a problem with pandas not correctly exporting the Arrow PyCapsule, but probably not, because polars.from_arrow(df) gives the correct result.

  4. Does your application call init_oracle_client()?

No, using thin mode.
