Most BCP improvements from NCBI, as of May 2024 (#558)
Conversation
Force-pushed from e6facc4 to 7b71bdb.
I've force-pushed a corrected version superseding #570.
Meanwhile, I'm looking to extend this series with BCP-specific tuneups superseding #573. To that end, I wanted to compare behavior for three cases: normal BCP in, BCP into an alternate version of the table where column types are all

On the DBLIB front, I found that all three cases fully succeeded when built against this branch, but

Any idea what I might be missing? I took care to predefine

UPDATE: Those errors turned out to stem from the use of
Here are my summarized comparison results (comparing against version 16.0 of Sybase's client libraries):

CTLIB BCP truncation differences w/CS layer messages ignored (see below for corrected version):

DBLIB BCP truncation differences (also corrected below):
I'll turn back to other PRs for now, but do still plan to add a proper replacement for #573 at some point.

Ah,
CTLIB BCP truncation differences w/CS layer messages printed, and more corner cases examined

General notes
Notably, all errors under Sybase officially come from the CT/BLK layer, whereas FreeTDS only ever issues CS-layer errors, which offer less context and may be likelier to slip through the cracks. That should be possible to remedy by letting
Corrected DBLIB truncation differences (modulo formatting details along the above lines and FreeTDS's support for additional client types):
Force-pushed from 7b71bdb to 468812f.

Force-pushed from 468812f to 8739305.
My latest version of this series adds a commit to address a corner case I recently encountered by (re)allowing BCP into computed columns in the context of a
Picking up a commit to replace one: work I was doing, specifically ae519f8.
Another commit merged! I'm rebasing the initial merge on master to get an updated version. See https://github.com/freddy77/freetds/tree/pr/558/ncbi-2024-05-bcp-improvements.

Some considerations
    if (s[n] == '\0') {
        return s; /* already all uppercase */
    } else {
        char * result = malloc(n + 1);
Are you sure about this? Isn't this function converting ASCII to upper case but stopping at the first lower-case character?
Oops, yeah, I needed to reset n to the length before proceeding here.
Alternatively, the calling code could use strcasestr, but that's not portable, so we'd need to add an implementation to replacements.
#define TDS_ISSPACE(c) isspace((unsigned char) (c))
- const char STD_DATETIME_FMT[] = "%b %e %Y %I:%M%p";
+ const char STD_DATETIME_FMT[] = "%Y-%m-%d %H:%M:%S.%z";
Surely useful for BCP, but I suppose it breaks a lot of tests.
Don't worry, I checked for regressions and didn't find any.
I found from profiling on VMS that a significant amount of time was spent in strftime for bcp out -c operations (actually more than 50% of the entire runtime), so I think there are noticeable improvements to be had from a hand-tuned date/time formatter for bcp operations.
The ASE bcp client appears to use the locale for bcp out -c date formatting, but I don't think that's a requirement we need to follow.
I suppose it depends on how many date/time fields you have in your table. tds_strftime allocates a temporary buffer on the heap; maybe that's the issue?
Or maybe it would be worth coding it directly instead of calling strftime at the end?
Not something I care about in the near future, though; there's quite a long list of things I want to do first.
* Read defaults, send them back explicitly as needed, and clean up their storage afterwards.
* tds5_bcp_add_variable_columns:
  * Send spaces in lieu of fully empty SYB(VAR)CHAR data.
  * Send correct textpos values in all cases.
  * When writing the adjustment table, avoid a potential Sybase "row format error" by eliding the (effective) column count when it would have repeated the largest adjustment table entry.
* tds_deinit_bcpinfo: Free and reset current_row for transfers in. Avoid potentially trying to interpret fragments of outgoing TDS data as BLOB pointers.

Signed-off-by: Aaron M. Ucko <ucko@ncbi.nlm.nih.gov>
Under these protocol versions, MS SQL Server describes these columns as having NVARCHAR(n) types rather than traditional DATETIME types, due to DATETIME's range limitations. (Likewise for TIME, which however still needs to come in as a preconverted UTF-16 string to compensate for combined API and protocol limitations.)

- ctlib/blk.c: Perform character conversion on strings converted from fixed-width types such as CS_DATETIME_TYPE when a comparison of column-size limits appears to indicate an encoding difference.
- tds/config.c: Adjust STD_DATETIME_FMT to match MS's (ISO-ish) tastes; in particular, ensure that the entire date portion fits within the first 10 characters.

Signed-off-by: Aaron M. Ucko <ucko@ncbi.nlm.nih.gov>
* Extend _cs_convert to take an additional handle argument (allowed to be NULL) to provide for the possibility of replacing the caller's original output buffer with the one tdsconvert allocated itself (as it always does for blobs), rather than copying the blob data.
* blk.c (_blk_get_col_data): Provide a non-NULL handle, both for efficiency's sake and because tds_alloc_bcp_column_data imposes a 4 KiB cap with the expectation that callers will take this approach. (dblib already does.)

Signed-off-by: Aaron M. Ucko <ucko@ncbi.nlm.nih.gov>
To that end, factor a static _blk_clean_desc function out of blk_done's CS_BLK_ALL case.

Signed-off-by: Aaron M. Ucko <ucko@ncbi.nlm.nih.gov>
* tds.h (struct tds_bcpinfo): Add more fields to enable the necessary
bookkeeping.
* ctlib/blk.c: Implement blk_textfer; adjust other functions accordingly.
* tds/{bulk,data}.c: Allow for deferred, and possibly piecemeal, blob
submission; in particular, make tds_bcp_send_record resumable.
Signed-off-by: Aaron M. Ucko <ucko@ncbi.nlm.nih.gov>
Signed-off-by: Aaron M. Ucko <ucko@ncbi.nlm.nih.gov>
Adjust existing failure modes' boundaries and introduce the possibility of proceeding with truncated input, with or without a message (the latter occurring in some cases that require no conversion).

As for messages, arrange to emit them via _ctclient_msg rather than _csclient_msg with the same (English) wording and context as Sybase yields, modulo formal punctuation differences and the usual discrepancies in layer and origin numbering. To that end:

* Add internal BLK_CONV_* constants that serve both as (new) client message numbers and as possible return values for _cs_convert.
* Cover those numbers in _ct_get_user_api_layer_error.
* Have _cs_convert take the presence of a non-NULL handle (which only _blk_get_col_data supplies) to indicate that the caller wants bulk-insertion semantics, including in particular the possible use of those return values.
* Keep a running count of rows sent for reporting purposes (determining the column number to cite by a linear scan of the bindinfo's columns array).

Signed-off-by: Aaron M. Ucko <ucko@ncbi.nlm.nih.gov>
The destination view may well have an INSTEAD OF INSERT trigger that makes meaningful use of the supplied values.

Signed-off-by: Aaron M. Ucko <ucko@ncbi.nlm.nih.gov>
Signed-off-by: Aaron M. Ucko <ucko@ncbi.nlm.nih.gov>
Force-pushed from 8739305 to 1c45955.
Thanks for all that and for taking another look in general, and sorry for not getting back to you sooner -- I work for a US federal agency, and the new administration imposed a broad communications ban until recently.
Is there some other name you'd prefer?
Fair point. I was also using the presence of a non-
Good question. I was preserving the status quo here, but perhaps some simplification is possible.
Wouldn't that be a subtle API break?
By my reading of https://learn.microsoft.com/en-us/sql/tools/bcp-utility?view=sql-server-ver16, it defaults to a legacy 8-bit encoding (e.g. CP-1252), but the Windows version can be directed to use UTF-16 instead. I'm not sure we have any installations here I can use for empirical testing. At any rate, I'm specifically concerned with bulk insertion.
Perhaps, but that's also out of scope here.
Back to this series, or at least the c3f9ec7 commit. Yes, this type is currently broken with very old protocols.

    /* convert string if needed */
    if (!bcp7 && curcol->char_conv && curcol->char_conv->flags != TDS_ENCODING_MEMCPY && colsize) {
        size_t output_size;
    ...

What

    /* convert string if needed */
    if (USE_ICONV && curcol->char_conv && curcol->char_conv->flags != TDS_ENCODING_MEMCPY && colsize) {
        size_t output_size;
    ...

(

    /* convert string if needed */
    if (curcol->char_conv && curcol->char_conv->flags != TDS_ENCODING_MEMCPY && colsize) {
        size_t output_size;
    ...

(always conversion coherence) and just some bulk tests are failing.

Those changes should make the code more consistent and would allow easily fixing the initial conversion issue.
@freddy77 I have found that all of the other commits from this PR can be applied in order, with a few trivial conflict resolutions, and the tests all pass against both MS (tdsver 7.4) and Sybase. See mm_pr558 for reference. I haven't done much testing outside of the FreeTDS test suite, though (e.g. the trigger application and blob input). This was just before your cherry-pick of "BCP into computed columns again...". Are you planning to pick all of these (just so I can rebase my #685 changes to the right place)?
@mmcnabb-vms added a commit to entirely replace the "ctlib: Optimize blob conversion for bulk transfers in." commit in this series. The series is surely nice, but it entirely lacks tests. It's not clear whether it's doing what it's supposed to do. For instance, "Fix DATE and DATETIME2 bulk insertion via TDS 7.0 and 7.1." is not fixing the issue.
Slowly getting this in. It's less a merge than a kind of rewrite. Now the behavior of char -> binary in ctlib is correct, and I've added tests for this.
This PR roughly corresponds to the primarily BCP-related commits from #555, adding 02aa63c per dependency considerations and omitting a couple of commits that can receive individual PRs, which I'll open shortly.