[fix](paimon) infer manifest format from split file format in cpp reader#60795

Merged
morningman merged 6 commits into apache:master from xylaaaaa:fix/paimoncpp-manifest-format-from-split
Mar 7, 2026

@xylaaaaa xylaaaaa commented Feb 22, 2026

Problem

Follow-up to #60676.

When FE does not pass full table options in scan ranges, paimon-cpp may default manifest.format to avro.
For non-avro environments, this can fail in PaimonCppReader initialization with:
Could not find a FileFormatFactory implementation class for format avro.

Solution

In PaimonCppReader::_build_options, if split-level file_format exists and table options are missing/empty:

  • set file.format from split file_format
  • set manifest.format from split file_format

This keeps paimon-cpp format resolution consistent with the actual split format and avoids unintended avro fallback.

Verification

  • Incremental BE build succeeded for doris_be target.
  • Change scope is limited to be/src/vec/exec/format/table/paimon_cpp_reader.cpp.

Copilot AI review requested due to automatic review settings February 22, 2026 13:40
@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

Copilot AI left a comment

Pull request overview

Adjusts Doris BE’s Paimon C++ reader option construction to avoid incorrect/default manifest format selection when FE scan ranges omit table options, by inferring formats from split metadata.

Changes:

  • Infer paimon::Options::FILE_FORMAT from split-level paimon_params.file_format when the option is missing/empty.
  • Infer paimon::Options::MANIFEST_FORMAT from split-level paimon_params.file_format when the option is missing/empty.

Comment on lines +314 to +328
// FE currently may not pass paimon table options in scan ranges.
// Avoid paimon-cpp defaulting manifest.format to avro when split file format is known.
if (_range.__isset.table_format_params && _range.table_format_params.__isset.paimon_params &&
    _range.table_format_params.paimon_params.__isset.file_format &&
    !_range.table_format_params.paimon_params.file_format.empty()) {
    const auto& split_file_format = _range.table_format_params.paimon_params.file_format;
    auto file_format_it = options.find(paimon::Options::FILE_FORMAT);
    if (file_format_it == options.end() || file_format_it->second.empty()) {
        options[paimon::Options::FILE_FORMAT] = split_file_format;
    }
    auto manifest_format_it = options.find(paimon::Options::MANIFEST_FORMAT);
    if (manifest_format_it == options.end() || manifest_format_it->second.empty()) {
        options[paimon::Options::MANIFEST_FORMAT] = split_file_format;
    }
}

Copilot AI Feb 22, 2026

New option inference logic isn’t covered by existing PaimonCppReader unit tests. Consider adding a test that asserts when split-level file_format is set and options lacks/has empty paimon::Options::FILE_FORMAT / MANIFEST_FORMAT, _build_options() (or an observable init path) populates them, and does not override non-empty values.
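
A minimal, self-contained sketch of the kind of check this comment asks for. It does not use the real Doris test fixtures. `infer_formats` and `check_inference_cases` are hypothetical names, the option container is assumed to be a `std::map<std::string, std::string>`, and the key strings `file.format` and `manifest.format` (taken from the PR description) stand in for `paimon::Options::FILE_FORMAT` and `paimon::Options::MANIFEST_FORMAT`.

```cpp
#include <cassert>
#include <initializer_list>
#include <map>
#include <string>

using OptionMap = std::map<std::string, std::string>;

// Assumed key strings, taken from the PR description.
const std::string kFileFormat = "file.format";
const std::string kManifestFormat = "manifest.format";

// Hypothetical helper mirroring the quoted branch of _build_options():
// each key is filled from the split format only when it is absent or empty.
void infer_formats(OptionMap& options, const std::string& split_file_format) {
    if (split_file_format.empty()) {
        return; // split format unknown: leave the options untouched
    }
    for (const std::string& key : {kFileFormat, kManifestFormat}) {
        auto it = options.find(key);
        if (it == options.end() || it->second.empty()) {
            options[key] = split_file_format;
        }
    }
}

// Exercises the three cases named in the review comment.
void check_inference_cases() {
    // Keys absent: both are populated from the split format.
    OptionMap missing;
    infer_formats(missing, "parquet");
    assert(missing.at(kFileFormat) == "parquet");
    assert(missing.at(kManifestFormat) == "parquet");

    // Keys present but empty: both are populated from the split format.
    OptionMap empty_values{{kFileFormat, ""}, {kManifestFormat, ""}};
    infer_formats(empty_values, "parquet");
    assert(empty_values.at(kFileFormat) == "parquet");
    assert(empty_values.at(kManifestFormat) == "parquet");

    // Non-empty values: neither is overridden.
    OptionMap preset{{kFileFormat, "orc"}, {kManifestFormat, "avro"}};
    infer_formats(preset, "parquet");
    assert(preset.at(kFileFormat) == "orc");
    assert(preset.at(kManifestFormat) == "avro");
}
```

In the real code base the same three cases would presumably be driven through `_build_options()` or an observable init path, as the comment suggests, rather than through a free function.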

Comment on lines +314 to +328
// FE currently may not pass paimon table options in scan ranges.
// Avoid paimon-cpp defaulting manifest.format to avro when split file format is known.
if (_range.__isset.table_format_params && _range.table_format_params.__isset.paimon_params &&
    _range.table_format_params.paimon_params.__isset.file_format &&
    !_range.table_format_params.paimon_params.file_format.empty()) {
    const auto& split_file_format = _range.table_format_params.paimon_params.file_format;
    auto file_format_it = options.find(paimon::Options::FILE_FORMAT);
    if (file_format_it == options.end() || file_format_it->second.empty()) {
        options[paimon::Options::FILE_FORMAT] = split_file_format;
    }
    auto manifest_format_it = options.find(paimon::Options::MANIFEST_FORMAT);
    if (manifest_format_it == options.end() || manifest_format_it->second.empty()) {
        options[paimon::Options::MANIFEST_FORMAT] = split_file_format;
    }
}

Copilot AI Feb 22, 2026

This behavior is slightly broader than the PR description (“only when table options are missing/empty”): it will also set FILE_FORMAT/MANIFEST_FORMAT when other table options are present but just these keys are absent/empty. If that’s intended, the PR description should be updated; if not, consider tightening the condition to only apply when the table-level paimon options map is missing/empty.
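
The difference between the two readings can be sketched side by side. This is an illustration only, not code from the PR: `infer_per_key` mirrors the quoted diff, `infer_only_if_no_options` is a hypothetical tightened variant, the container is assumed to be a `std::map<std::string, std::string>`, and the `bucket` entry is just a placeholder for "some other table option is present".

```cpp
#include <cassert>
#include <map>
#include <string>

using OptionMap = std::map<std::string, std::string>;

// Sets `key` from the split format unless a non-empty value is already there.
void fill_if_unset(OptionMap& options, const std::string& key,
                   const std::string& split_file_format) {
    auto it = options.find(key);
    if (it == options.end() || it->second.empty()) {
        options[key] = split_file_format;
    }
}

// Per-key rule, as in the quoted diff: other options may be present.
void infer_per_key(OptionMap& options, const std::string& split_file_format) {
    if (split_file_format.empty()) {
        return;
    }
    fill_if_unset(options, "file.format", split_file_format);
    fill_if_unset(options, "manifest.format", split_file_format);
}

// Hypothetical tightened rule: apply only when no table options were passed at all.
void infer_only_if_no_options(OptionMap& options, const std::string& split_file_format) {
    if (split_file_format.empty() || !options.empty()) {
        return;
    }
    options["file.format"] = split_file_format;
    options["manifest.format"] = split_file_format;
}
```

With an input such as `{{"bucket", "4"}}` and a split format of `orc`, the per-key rule adds both format keys, while the tightened rule leaves the map unchanged, so format resolution would again be left to whatever default the library applies.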

@xylaaaaa
Contributor Author

run buildall

@doris-robot

TPC-H: Total hot run time: 28640 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 0d91196b69ceea46a7b74db43daca438dddf9880, data reload: false

------ Round 1 ----------------------------------
============================================
q1	17628	4481	4309	4309
q2	q3	10648	780	524	524
q4	4680	349	251	251
q5	7554	1187	1023	1023
q6	177	177	149	149
q7	779	829	661	661
q8	9284	1431	1319	1319
q9	4896	4728	4738	4728
q10	6820	1872	1626	1626
q11	476	269	236	236
q12	707	565	463	463
q13	17791	4215	3427	3427
q14	225	234	227	227
q15	943	800	781	781
q16	747	712	660	660
q17	730	851	405	405
q18	6195	5406	5261	5261
q19	1119	964	600	600
q20	518	496	382	382
q21	4402	1815	1362	1362
q22	343	282	246	246
Total cold run time: 96662 ms
Total hot run time: 28640 ms

----- Round 2, with runtime_filter_mode=off -----
============================================
q1	4396	4347	4349	4347
q2	q3	1770	2161	1723	1723
q4	847	1148	760	760
q5	3997	4288	4314	4288
q6	173	170	139	139
q7	1726	1606	1491	1491
q8	2433	2630	2515	2515
q9	7204	7737	7533	7533
q10	2812	2960	2443	2443
q11	527	431	418	418
q12	526	590	464	464
q13	4055	4490	3649	3649
q14	291	308	286	286
q15	871	830	823	823
q16	749	787	784	784
q17	1215	1659	1296	1296
q18	7357	6870	6731	6731
q19	943	910	882	882
q20	2105	2173	1994	1994
q21	4151	3484	3392	3392
q22	480	445	404	404
Total cold run time: 48628 ms
Total hot run time: 46362 ms

@doris-robot

TPC-DS: Total hot run time: 183793 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 0d91196b69ceea46a7b74db43daca438dddf9880, data reload: false

query5	4763	630	508	508
query6	337	219	210	210
query7	4218	467	280	280
query8	344	269	228	228
query9	8717	2713	2702	2702
query10	534	369	319	319
query11	16981	16771	16490	16490
query12	184	128	130	128
query13	1250	448	341	341
query14	6126	3154	2934	2934
query14_1	2787	2790	2747	2747
query15	201	192	177	177
query16	1002	474	464	464
query17	1049	709	578	578
query18	2466	428	337	337
query19	200	195	169	169
query20	132	124	128	124
query21	221	151	114	114
query22	5018	6001	5728	5728
query23	17622	17129	16931	16931
query23_1	17188	16975	16980	16975
query24	7431	1609	1228	1228
query24_1	1229	1210	1220	1210
query25	560	488	424	424
query26	1230	266	157	157
query27	2780	483	300	300
query28	4532	1871	1847	1847
query29	820	579	491	491
query30	309	244	214	214
query31	865	720	655	655
query32	83	76	75	75
query33	529	348	293	293
query34	925	974	573	573
query35	636	671	593	593
query36	1098	1140	1004	1004
query37	140	93	84	84
query38	2934	2917	2872	2872
query39	918	865	849	849
query39_1	823	821	823	821
query40	238	159	136	136
query41	70	65	63	63
query42	105	101	100	100
query43	370	376	358	358
query44	
query45	201	189	183	183
query46	881	977	598	598
query47	2117	2161	2050	2050
query48	317	320	234	234
query49	640	468	381	381
query50	678	288	220	220
query51	4089	4081	4017	4017
query52	106	107	98	98
query53	292	335	292	292
query54	320	296	289	289
query55	93	81	82	81
query56	323	334	342	334
query57	1381	1345	1272	1272
query58	299	279	286	279
query59	2558	2711	2522	2522
query60	361	333	340	333
query61	180	196	151	151
query62	620	588	526	526
query63	311	276	265	265
query64	4872	1226	979	979
query65	
query66	1440	453	350	350
query67	16361	16308	16185	16185
query68	
query69	398	286	290	286
query70	1002	1005	955	955
query71	330	302	299	299
query72	2816	2710	2466	2466
query73	541	540	309	309
query74	10010	9880	9707	9707
query75	2847	2726	2592	2592
query76	2290	1013	670	670
query77	353	373	314	314
query78	11123	11296	10696	10696
query79	3121	799	597	597
query80	1781	613	537	537
query81	574	282	244	244
query82	1003	148	117	117
query83	333	260	244	244
query84	252	123	95	95
query85	908	504	432	432
query86	436	297	305	297
query87	3074	3080	2965	2965
query88	3524	2630	2626	2626
query89	425	366	335	335
query90	1940	176	166	166
query91	167	152	132	132
query92	76	74	72	72
query93	1268	828	488	488
query94	632	312	279	279
query95	600	396	322	322
query96	639	516	224	224
query97	2493	2550	2410	2410
query98	232	214	214	214
query99	964	1007	950	950
Total cold run time: 255979 ms
Total hot run time: 183793 ms

@hello-stephen
Contributor

BE UT Coverage Report

Increment line coverage 0.00% (0/13) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 52.52% (19553/37230)
Line Coverage 36.14% (182414/504767)
Region Coverage 32.48% (141563/435802)
Branch Coverage 33.42% (61323/183487)

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 0.00% (0/13) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 73.20% (26711/36488)
Line Coverage 56.50% (284506/503527)
Region Coverage 53.91% (237302/440186)
Branch Coverage 55.59% (102393/184191)

@morningman morningman left a comment

LGTM

@github-actions github-actions Bot added the approved Indicates a PR has been approved by one committer. label Mar 2, 2026
@github-actions
Contributor

github-actions Bot commented Mar 2, 2026

PR approved by at least one committer and no changes requested.

@github-actions
Contributor

github-actions Bot commented Mar 2, 2026

PR approved by anyone and no changes requested.

@xylaaaaa
Contributor Author

xylaaaaa commented Mar 2, 2026

run buildall

@doris-robot

TPC-H: Total hot run time: 28877 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit e9ceb1cee24e61257385ec3830fa9d0fbd44f445, data reload: false

------ Round 1 ----------------------------------
============================================
q1	17651	4484	4310	4310
q2	q3	10638	818	535	535
q4	4673	355	266	266
q5	7614	1212	1028	1028
q6	174	175	145	145
q7	802	861	665	665
q8	10101	1452	1330	1330
q9	5096	4758	4720	4720
q10	6844	1875	1629	1629
q11	448	247	229	229
q12	737	557	467	467
q13	17775	4256	3419	3419
q14	233	234	212	212
q15	968	800	790	790
q16	727	726	667	667
q17	705	885	443	443
q18	6064	5443	5373	5373
q19	1396	990	628	628
q20	501	491	383	383
q21	4526	1851	1389	1389
q22	341	280	249	249
Total cold run time: 98014 ms
Total hot run time: 28877 ms

----- Round 2, with runtime_filter_mode=off -----
============================================
q1	4464	4315	4364	4315
q2	q3	1766	2166	1738	1738
q4	830	1156	756	756
q5	4026	4322	4337	4322
q6	177	172	140	140
q7	1751	1582	1469	1469
q8	2420	2663	2501	2501
q9	7668	7437	7373	7373
q10	2651	2945	2394	2394
q11	515	443	420	420
q12	482	592	473	473
q13	3955	4503	3706	3706
q14	287	310	274	274
q15	851	831	807	807
q16	709	764	706	706
q17	1176	1591	1324	1324
q18	7149	6897	6555	6555
q19	910	919	989	919
q20	2100	2147	1951	1951
q21	4141	3466	3463	3463
q22	516	434	376	376
Total cold run time: 48544 ms
Total hot run time: 45982 ms

@doris-robot

TPC-DS: Total hot run time: 184287 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit e9ceb1cee24e61257385ec3830fa9d0fbd44f445, data reload: false

query5	4786	641	500	500
query6	332	222	206	206
query7	4243	474	278	278
query8	352	248	237	237
query9	8706	2789	2775	2775
query10	540	371	370	370
query11	17003	16932	16588	16588
query12	186	124	125	124
query13	1265	459	349	349
query14	6255	3220	2968	2968
query14_1	2817	2843	2817	2817
query15	202	195	183	183
query16	979	469	440	440
query17	1067	722	602	602
query18	2519	445	346	346
query19	209	207	186	186
query20	153	130	132	130
query21	232	140	117	117
query22	5026	5799	5439	5439
query23	17468	17139	16965	16965
query23_1	17296	17237	17195	17195
query24	7389	1638	1262	1262
query24_1	1248	1248	1233	1233
query25	560	479	477	477
query26	1233	259	149	149
query27	2787	471	291	291
query28	4488	1859	1868	1859
query29	786	560	464	464
query30	307	242	207	207
query31	879	725	637	637
query32	77	68	68	68
query33	522	319	285	285
query34	942	915	562	562
query35	632	672	578	578
query36	1098	1115	944	944
query37	134	94	77	77
query38	2959	2888	2939	2888
query39	910	865	840	840
query39_1	848	824	850	824
query40	227	151	132	132
query41	61	59	57	57
query42	106	105	102	102
query43	377	395	356	356
query44	
query45	194	190	187	187
query46	909	984	601	601
query47	2119	2199	2010	2010
query48	306	327	230	230
query49	625	469	376	376
query50	682	274	214	214
query51	4145	4090	4058	4058
query52	106	107	97	97
query53	291	344	280	280
query54	301	262	254	254
query55	89	88	81	81
query56	309	300	301	300
query57	1356	1337	1287	1287
query58	289	278	286	278
query59	2575	2693	2556	2556
query60	339	316	328	316
query61	146	145	148	145
query62	636	584	552	552
query63	317	285	272	272
query64	4845	1239	992	992
query65	
query66	1437	452	356	356
query67	16349	16422	16304	16304
query68	
query69	395	306	288	288
query70	993	942	882	882
query71	336	311	301	301
query72	2835	2621	2420	2420
query73	560	551	330	330
query74	10014	10003	9758	9758
query75	2845	2742	2443	2443
query76	2305	1022	676	676
query77	363	377	312	312
query78	11166	11408	10728	10728
query79	2490	758	632	632
query80	1802	601	543	543
query81	568	286	242	242
query82	1011	158	112	112
query83	340	280	239	239
query84	251	119	95	95
query85	902	467	441	441
query86	406	312	300	300
query87	3132	3088	2980	2980
query88	3600	2677	2655	2655
query89	423	360	337	337
query90	2056	177	175	175
query91	160	158	132	132
query92	73	79	70	70
query93	1038	845	518	518
query94	638	330	287	287
query95	585	395	309	309
query96	627	530	227	227
query97	2455	2460	2399	2399
query98	230	220	219	219
query99	998	981	909	909
Total cold run time: 256119 ms
Total hot run time: 184287 ms

@doris-robot

BE UT Coverage Report

Increment line coverage 0.00% (0/13) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 52.58% (19631/37339)
Line Coverage 36.21% (183398/506463)
Region Coverage 32.49% (142232/437709)
Branch Coverage 33.44% (61672/184430)

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 0.00% (0/13) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 71.52% (26149/36561)
Line Coverage 54.29% (274124/504910)
Region Coverage 51.51% (227600/441846)
Branch Coverage 52.90% (97853/184994)

@morningman morningman merged commit 16f2dda into apache:master Mar 7, 2026
28 of 30 checks passed
morningman pushed a commit that referenced this pull request Mar 10, 2026
…1-rc01 (#61125)

## Summary
Backport paimon-cpp reader integration chain into
tmp-branch-2.1.11-rc01-paimon-cpp as a single squashed change.

Included upstream PRs:
- #60296
- #60676
- #60711
- #60730
- #60795
- #60883
- #60946



## Notes
- This PR is intentionally squashed into one commit for release-branch
delivery.
- Local uncommitted changes in the original workspace were not touched;
work was done in an isolated worktree.
xylaaaaa added a commit to xylaaaaa/doris that referenced this pull request Mar 16, 2026
…der (apache#60795)

## Problem
Followup apache#60676

When FE does not pass full table options in scan ranges, paimon-cpp may
default manifest.format to avro.
For non-avro environments, this can fail in PaimonCppReader
initialization with:
Could not find a FileFormatFactory implementation class for format avro.

## Solution
In PaimonCppReader::_build_options, if split-level file_format exists
and table options are missing/empty:
- set file.format from split file_format
- set manifest.format from split file_format

This keeps paimon-cpp format resolution consistent with the actual split
format and avoids unintended avro fallback.

## Verification
- Incremental BE build succeeded for doris_be target.
- Change scope is limited to
be/src/vec/exec/format/table/paimon_cpp_reader.cpp.
xylaaaaa added a commit to xylaaaaa/doris that referenced this pull request Mar 19, 2026
…der (apache#60795)
yiguolei pushed a commit that referenced this pull request Mar 19, 2026
…61379)

## Summary
- Cherry-pick 
- #60676: [feat](paimon) integrate paimon-cpp reader
- #60795: [fix](paimon) infer manifest format from split file format in
cpp reader
- #60883 
## Conflict Resolution
- `gensrc/thrift/PaloInternalService.thrift`: kept both new fields from
branch-4.1 and the PR (200: `enable_adjust_conjunct_order_by_cost`, 201:
`enable_paimon_cpp_reader`, 202: `single_backend_query`)

---------

Co-authored-by: morningman <yunyou@selectdb.com>
Labels

approved (Indicates a PR has been approved by one committer.), dev/4.1.0-merged, reviewed

7 participants