mm/shmem: optimize read with reduced xarray lookups and folio batching#1437

Open
vfsci-bot[bot] wants to merge 5 commits into vfs.base.ci from pw/1097925/vfs.base.ci

vfsci-bot[bot] commented May 20, 2026

Series: https://patchwork.kernel.org/project/linux-fsdevel/list/?series=1097925
Submitter: Chi Zhiling
Version: 1
Patches: 5/5
Message-ID: <20260520101538.58745-1-chizhiling@163.com>
Base: vfs.base.ci
Lore: https://lore.kernel.org/linux-fsdevel/20260520101538.58745-1-chizhiling@163.com


Automated by ml2pr

Chi Zhiling added 5 commits May 20, 2026 10:50
When reading small amounts of data from the page cache, only a single
folio is typically returned from filemap_get_read_batch(). In this case,
calling xas_advance() or xas_next() after adding the folio to the batch
is unnecessary and only introduces extra branches.

The same issue exists for large reads, where one additional xarray walk
is always performed before termination.

Move the boundary check to after the folio is added to the batch so the
final redundant xarray advancement can be avoided. This significantly
reduces the branch count in the read path.

xas_next() does not update xa_index when xas->xa_node is set to
XAS_RESTART, so checking the boundary before updating xa_index is
sufficient to keep the folio within range. The warning should therefore
never trigger.
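
As a rough illustration of the reordering described above, the following
userspace sketch models the batch loop. It is not the kernel code: the
struct walker cursor, walker_advance() (standing in for xas_next() or
xas_advance()), and both collect_*() functions are hypothetical names
invented for this example. It only shows why checking the boundary after
adding an entry saves one cursor step per call.

```c
#include <stddef.h>

/* Hypothetical cursor; "advances" counts how often we step it. */
struct walker {
    size_t pos;
    size_t advances;
};

/* Stand-in for xas_next()/xas_advance(): step and return the new index. */
static size_t walker_advance(struct walker *w)
{
    w->advances++;
    return ++w->pos;
}

/* Old shape: boundary checked in the loop condition, so the cursor is
 * stepped once more after the last in-range entry was added. */
size_t collect_check_first(struct walker *w, size_t first, size_t last,
                           size_t *out, size_t cap)
{
    size_t n = 0;
    size_t idx;

    w->pos = first;
    w->advances = 0;
    for (idx = first; idx <= last; idx = walker_advance(w)) {
        if (n == cap)
            break;
        out[n++] = idx;
    }
    return n;
}

/* New shape: add the entry, then check the boundary before stepping. */
size_t collect_check_after_add(struct walker *w, size_t first, size_t last,
                               size_t *out, size_t cap)
{
    size_t n = 0;
    size_t idx = first;

    w->pos = first;
    w->advances = 0;
    if (first > last || cap == 0)
        return 0;
    for (;;) {
        out[n++] = idx;
        if (idx >= last || n == cap)
            break;              /* no final redundant step */
        idx = walker_advance(w);
    }
    return n;
}
```

In this model a single-entry read steps the cursor once in the old shape
and not at all in the new one; a longer range saves exactly one step.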

Branch rate (from the perf stat output below):
654.198 M/sec -> 646.444 M/sec

Performance counter stats for 'fio --ioengine=sync --rw=read --bs=4k --size=1G
--runtime=300 --time_based --group_reporting --name=seq_read_test --filename=file':

before:
READ: bw=2697MiB/s (2828MB/s), 2697MiB/s-2697MiB/s (2828MB/s-2828MB/s), io=790GiB (848GB), run=300001-300001msec
      245602051556      task-clock                       #    0.821 CPUs utilized
             78467      context-switches                 #  319.488 /sec
                40      cpu-migrations                   #    0.163 /sec
              3388      page-faults                      #   13.795 /sec
      758312319204      instructions                     #    0.74  insn per cycle
     1025881497502      cycles                           #    4.177 GHz
      160672383734      branches                         #  654.198 M/sec
         361904512      branch-misses                    #    0.23% of all branches

after:
READ: bw=2709MiB/s (2841MB/s), 2709MiB/s-2709MiB/s (2841MB/s-2841MB/s), io=794GiB (852GB), run=300000-300000msec
      243985503670      task-clock                       #    0.812 CPUs utilized
             79004      context-switches                 #  323.806 /sec
                30      cpu-migrations                   #    0.123 /sec
              3355      page-faults                      #   13.751 /sec
      747830935069      instructions                     #    0.73  insn per cycle
     1019609333322      cycles                           #    4.179 GHz
      157722976668      branches                         #  646.444 M/sec
         348984893      branch-misses                    #    0.22% of all branches
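
For readers who want the relative change rather than raw counters, the
small helper below computes a percentage drop. The function name is
hypothetical and only serves this example. Applied to the figures above,
the branch rate falls by roughly 1.2% and the total branch count by
roughly 1.8%; note that the two runs are time-based and moved slightly
different amounts of data, so the totals are not measured over identical
work.

```c
/* Percentage by which "after" is lower than "before". */
double relative_drop_percent(double before, double after)
{
    return (before - after) / before * 100.0;
}
```

For example, relative_drop_percent(654.198, 646.444) gives the branch-rate
change, and passing the two "branches" counters gives the change in total
branches.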

Signed-off-by: Chi Zhiling <chizhiling@kylinos.cn>

Apply the same optimization used in filemap_get_read_batch() by moving
the boundary check from the loop condition to before xas_advance(),
avoiding an unnecessary xarray lookup and reducing branches in the fast
path.

Signed-off-by: Chi Zhiling <chizhiling@kylinos.cn>

Change SGP_NOALLOC to return 0 with NULL folio on hole, matching
SGP_READ behavior. This simplifies the sgp_type handling by unifying
hole semantics across these types.

Previously, SGP_NOALLOC returned -ENOENT on hole, while SGP_READ
returned 0. This inconsistency required special handling in callers
like khugepaged and userfaultfd.

After this change:
- khugepaged: behavior unchanged (checks both error and NULL folio)
- userfaultfd: behavior unchanged (both -ENOENT and NULL are converted
  to -EFAULT before returning to userspace)
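
The claim that such callers are unaffected can be pictured with a small
userspace model. Everything here is an assumption made for illustration:
the enum, struct folio_model, get_folio_model() and caller_sees_hole() are
hypothetical and do not mirror the real shmem_get_folio() signature. The
point is only that a caller testing "error or NULL folio" reaches the same
decision under the old and the new hole semantics.

```c
#include <errno.h>
#include <stddef.h>

enum sgp_model { MODEL_SGP_READ, MODEL_SGP_NOALLOC };

struct folio_model { int unused; };

/* Hypothetical lookup. "present" says whether the index is cached;
 * "old_semantics" selects the pre-patch behavior for SGP_NOALLOC. */
int get_folio_model(int present, enum sgp_model sgp, int old_semantics,
                    struct folio_model **foliop)
{
    static struct folio_model cached;

    *foliop = NULL;
    if (present) {
        *foliop = &cached;
        return 0;
    }
    if (sgp == MODEL_SGP_NOALLOC && old_semantics)
        return -ENOENT;         /* old: hole reported as an error */
    return 0;                   /* new (and SGP_READ): 0 with NULL folio */
}

/* A caller that checks both the error and the folio pointer. */
int caller_sees_hole(int present, enum sgp_model sgp, int old_semantics)
{
    struct folio_model *folio;
    int err = get_folio_model(present, sgp, old_semantics, &folio);

    return err || !folio;
}
```

A caller that tested only the error code would behave differently after
the change, which is why the commit message lists the existing callers
explicitly.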

Signed-off-by: Chi Zhiling <chizhiling@kylinos.cn>

This is a prep patch for shmem folio batching in the read path, where
non-uptodate folios need to be handled in the main iteration loop. A
large non-uptodate folio should be treated as a hole.

Currently, holes larger than PAGE_SIZE cannot be handled because
ZERO_PAGE is limited to a single page. Add copy_zero_to_iter() as a
wrapper to support copying larger zero ranges to the iterator.
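
One way such a wrapper could work is to reuse a single zero-filled page
in a loop until the requested length is covered. The sketch below is a
userspace model under that assumption, not the patch's implementation:
copy_zero_model(), zero_page_model and MODEL_PAGE_SIZE are hypothetical
names, and a plain byte buffer stands in for the iov_iter destination.

```c
#include <stddef.h>
#include <string.h>

#define MODEL_PAGE_SIZE 4096u

/* Stand-in for ZERO_PAGE: one page of zeros (static storage is zeroed). */
static const unsigned char zero_page_model[MODEL_PAGE_SIZE];

/* Copy "len" zero bytes to dst, at most one page per step.
 * Returns the number of bytes written. */
size_t copy_zero_model(unsigned char *dst, size_t len)
{
    size_t copied = 0;

    while (copied < len) {
        size_t chunk = len - copied;

        if (chunk > MODEL_PAGE_SIZE)
            chunk = MODEL_PAGE_SIZE;
        memcpy(dst + copied, zero_page_model, chunk);
        copied += chunk;
    }
    return copied;
}
```

A real iterator copy can return short, so a kernel version would also
need to stop when a chunk is only partially copied; the model omits that.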

Signed-off-by: Chi Zhiling <chizhiling@kylinos.cn>

Optimize shmem file read by using filemap_get_folios_contig() to
batch fetch contiguous folios from the page cache, reducing the
overhead of repeated shmem_get_folio() calls.

When the folio batch is exhausted, attempt to refill it with
filemap_get_folios_contig(). If no folios are found (a hole or
swapped-out pages), fall back to shmem_get_folio() to handle these
cases individually.

Additionally:
- Defer folio_put() until the batch is exhausted or on exit
- Add folio_test_uptodate() check before copying to ensure data
  validity
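
The control flow described above (refill the batch, fall back to a
per-index lookup when the refill finds nothing) can be sketched in
userspace as follows. This is a simplified model, not the patch: the
page cache is an array of present/absent flags, get_contig_model() stands
in for filemap_get_folios_contig(), the slow path stands in for
shmem_get_folio() and always finds a hole, and all names and the batch
capacity of 4 are assumptions for the example.

```c
#include <stddef.h>

#define MODEL_BATCH_CAP 4

struct read_stats {
    size_t batch_refills;   /* calls to the batched lookup */
    size_t slow_lookups;    /* calls to the per-index fallback */
    size_t copied;          /* slots served from a batch */
};

/* Collect contiguous present slots starting at "start". */
static size_t get_contig_model(const int *present, size_t nslots,
                               size_t start, size_t *batch)
{
    size_t n = 0;

    while (n < MODEL_BATCH_CAP && start + n < nslots && present[start + n]) {
        batch[n] = start + n;
        n++;
    }
    return n;
}

struct read_stats read_model(const int *present, size_t nslots)
{
    struct read_stats st = {0};
    size_t batch[MODEL_BATCH_CAP];
    size_t nr = 0, used = 0, idx = 0;

    while (idx < nslots) {
        if (used == nr) {
            /* Batch exhausted: deferred puts would happen here, then refill. */
            nr = get_contig_model(present, nslots, idx, batch);
            used = 0;
            st.batch_refills++;
        }
        if (nr == 0) {
            /* Nothing contiguous at idx: handle this index individually. */
            st.slow_lookups++;
            idx++;
            continue;
        }
        st.copied++;        /* copy from batch[used] */
        used++;
        idx++;
    }
    return st;
}
```

With eight slots where only slot 5 is absent, the model performs four
batched lookups and one fallback, instead of eight individual lookups.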

Signed-off-by: Chi Zhiling <chizhiling@kylinos.cn>