Skip to content

mcviewer: add syntax highlighting via external command#5065

Open
noctuum wants to merge 6 commits intoMidnightCommander:masterfrom
noctuum:viewer_syntax_highlighting
Open

mcviewer: add syntax highlighting via external command#5065
noctuum wants to merge 6 commits intoMidnightCommander:masterfrom
noctuum:viewer_syntax_highlighting

Conversation

@noctuum
Copy link

@noctuum noctuum commented Mar 13, 2026

Checklist

👉 Our coding style can be found here: https://midnight-commander.org/coding-style/ 👈

  • I have referenced the issue(s) resolved by this PR (if any)
  • I have signed-off my contribution with git commit --amend -s
  • Lint and unit tests pass locally with my changes (make indent && make check)
  • I have added tests that prove my fix is effective or that my feature works
  • I have added the necessary documentation (if appropriate)

Proposed changes

Add optional syntax highlighting to the internal file viewer (mcview) via configurable external command. This resolves feature requests #28 (syntax highlighting in viewer, originally filed in 2005) and #1849 (ANSI color escape support in viewer, 2009).

The implementation has two layers:

  1. ANSI SGR parser — a state machine that renders \033[...m escape
    sequences with proper color mapping. Useful standalone for viewing colored
    logs, git diff --color output, CI logs, and any ANSI-colored content.

  2. Syntax highlighting — pipes the viewed file through a configurable
    external command, feeds the ANSI-colored output to the parser. Five
    backends are supported out of the box; users can also set an arbitrary
    command via the Shift-SCustom... dialog.

Supported backends

Backend Command template Notes
bat bat --color=always --style=plain --paging=never %s Built-in binary detection
chroma chroma --formatter terminal256 --fail %s --fail for fast exit on unknown files
highlight highlight -O xterm256 --force %s Lua-based definitions
pygmentize pygmentize -f terminal256 -O stripnl=False %s stripnl=False preserves leading blank lines
source-highlight source-highlight --failsafe --out-format=esc -i %s Default. GPL-3, available in 190+ package repos

Default is source-highlight because it's GPL-3 compatible, available in
190+ package repositories (every major Linux distro, all BSDs, macOS via
Homebrew/MacPorts), produces clean ANSI output, and is purpose-built for
exactly this use case.

User interaction

  • s — toggle syntax highlighting on/off
  • Shift-S — open backend chooser (only installed backends are shown,
    plus Custom... for arbitrary commands with %s placeholder)
  • Status bar shows short backend name: [bat], [chrm], [hi], [pyg],
    [src-hl], or [ext] with the detected file type
  • Setting persists within the MC session via mcview_global_flags
  • Syntax and nroff modes are mutually exclusive

Error handling and graceful degradation

Scenario Behavior
No highlighter installed Feature silently disabled, viewer works as before
Highlighter not found at toggle time Error dialog listing available backends
Highlighter can't process file (e.g. binary) Friendly "Syntax Highlighting" dialog, automatic fallback to plain view
Highlighter produces empty output Same friendly dialog + fallback
User disables at runtime Instant return to plain text, file position preserved

Architecture

┌─────────────┐     ┌──────────────────┐     ┌──────────────┐
│  file.py    │────>│ source-highlight │────>│ ANSI-colored │
│  (on disk)  │     │  (or any tool)   │     │   output     │
└─────────────┘     └──────────────────┘     └──────┬───────┘
                                                    │
                                              ┌─────▼───────┐
                                              │ mcview ANSI │
                                              │   parser    │
                                              └─────┬───────┘
                                                    │
                                              ┌─────▼───────┐
                                              │  terminal   │
                                              │  (colored)  │
                                              └─────────────┘

This is the same pattern used by less (via LESSOPEN), ranger, nnn, lf,
and vifm. Every modern file manager delegates syntax highlighting to an
external tool — none have built-in syntax engines for their viewer.

Why not port the editor's syntax engine?

src/editor/syntax.c (1639 lines) is tightly coupled to WEdit — text
access via edit_buffer_get_byte() (11 call sites), state stored in WEdit
fields, assumes full file in memory. The viewer handles files larger than
RAM. Estimated 2000–3000 lines of refactoring vs. ~740 lines for this
approach with more language coverage.


Testing

Unit tests (35 total, all pass)

  • ANSI parser — 23 tests: color parsing (basic, bright, 256-color,
    truecolor), attribute tracking (bold, underline, italic), reset sequences,
    malformed input, combined sequences, non-SGR CSI skip, plain text passthrough
  • Syntax command — 12 tests: %s substitution with shell escaping,
    binary name extraction, edge cases (NULL, empty, missing %s)

Manual testing — backend matrix

Tested on Arch Linux with all five backends:

Backend Version Normal file Binary file (5 MB ELF) Behavior on binary
bat 0.26.1 0.26.1 379/379 lines 226 B message Built-in "binary input" notice
chroma 2.23.1 379/379 lines 0 B, exit 1 --fail: fast exit → friendly dialog → plain fallback
highlight 4.19 379/379 lines 18 B (ELF magic) Outputs minimal header, exit 0
pygmentize 2.19.2 379/379 lines 0 B, stderr error Friendly dialog → plain fallback
source-highlight 3.1.9 379/379 lines 5.2 MB as-is --failsafe: outputs raw content

Manual testing — edge cases

  • Binary files — each backend handled gracefully (see matrix above)
  • Files without extension — backends that can't detect type fall back cleanly
  • Large files (5 MB ELF binary) — drain loop completes, percentage accurate
  • Pipe/stdin input — syntax mode correctly disabled (no filename for %s)
  • Toggle persistence — setting preserved across file opens within session
  • Nroff mutual exclusion — enabling syntax disables nroff and vice versa
  • Custom command dialog — validates %s placeholder, placeholder shows current backend template
  • Backend switching — Shift-S shows only installed backends + Custom
  • Leading blank lines — pygmentize -O stripnl=False preserves them
  • Empty output fallback — friendly non-error dialog, automatic plain view reload

Build verification

make indent  — clean (clang-format 20.1.3)
make check   — 66 tests, 0 failures, 0 errors (12 test suites)

Commit structure

1. mcviewer (ansi.c): add ANSI SGR escape sequence parser.
   8 files, +976 lines — standalone parser + 23 unit tests

2. mcviewer (ascii.c): wire ANSI parser into text rendering pipeline.
   4 files, +172 lines — integration into viewer rendering

3. mcviewer: add syntax highlighting via configurable external command.
   18 files, +1097 lines — full feature: 5 backends, Custom dialog,
   toggle, config, drain loop, error handling, 12 unit tests

4. doc: add syntax highlighting documentation.
   2 files, +38 lines — man page + hint

Each commit passes make indent && make check.


Configuration

[Viewer]
syntax_command=source-highlight --failsafe --out-format=esc -i %s

Alternative commands:

syntax_command=bat --color=always --style=plain --paging=never %s
syntax_command=chroma --formatter terminal256 --fail %s
syntax_command=highlight -O xterm256 --force %s
syntax_command=pygmentize -f terminal256 -O stripnl=False %s

Or any tool that outputs ANSI escape codes to stdout. %s is replaced with
the shell-escaped file path.


Stats

  • 35 unit tests (23 ANSI parser + 12 syntax command)
  • 0 new compile-time dependencies
  • 0 configure.ac flags needed (runtime detection only)

@github-actions github-actions bot added needs triage Needs triage by maintainers prio: medium Has the potential to affect progress labels Mar 13, 2026
@github-actions github-actions bot added this to the Future Releases milestone Mar 13, 2026
noctuum added 4 commits March 13, 2026 17:28
Add a state machine parser for ANSI SGR (Select Graphic Rendition)
escape sequences.  Supports standard colors (0-7), bright (8-15),
256-color (38;5;N / 48;5;N), and truecolor (38;2;R;G;B / 48;2;R;G;B)
which is approximated to the nearest xterm 256-color palette index.
Bold, underline, italic, and individual/global resets are handled.

Includes 23 unit tests covering color parsing, attribute tracking,
truecolor approximation, escape sequence edge cases, and combined
sequences.

Signed-off-by: noctuum <25441068+noctuum@users.noreply.github.com>
Integrate the ANSI SGR parser into mcview's text display pipeline.
When syntax mode is active, bytes are fed through the parser and
ANSI escape sequences are consumed while displayable characters
are rendered with the appropriate colors.

Key additions:
- mcview_ansi_get_color(): maps ANSI parser state to MC color pairs,
  inheriting the viewer skin's fg/bg when ANSI values are default
  (fixes background mismatch with source-highlight)
- mcview_get_next_maybe_ansi_char(): new rendering layer between raw
  byte reading and nroff processing
- mcview_fill_line_remaining(): fills remaining columns on each line
  with spaces using the last drawn color, giving full-width background
- syntax_fill_color tracking across lines for consistent empty-line
  coloring

Signed-off-by: noctuum <25441068+noctuum@users.noreply.github.com>
Pipe files through an external syntax highlighter and render the
ANSI-colored output in the internal viewer.  Five backends supported
out of the box: bat, chroma, highlight, pygmentize, source-highlight.
Default is source-highlight; user can also set an arbitrary command
via the Shift-S "Custom..." dialog.

Implementation details:
- syntax.c: backend registry, command building with g_shell_quote(),
  binary detection via g_find_program_in_path(), options dialog
- mcviewer.c: drain loop for accurate file-size percentage,
  empty-output detection with friendly fallback to plain view
- growbuf.c: suppress redundant error dialog in syntax mode
- Toggle with 's' key; Shift-S opens backend chooser
- Syntax and nroff modes are mutually exclusive
- chroma uses --fail to exit fast on unrecognized files
- pygmentize uses -O stripnl=False to preserve leading blank lines
- Status bar shows short backend name: [bat], [chrm], [hi], [pyg],
  [src-hl], or [ext] with the actual file type

Includes 12 unit tests for command parsing and binary extraction.

Signed-off-by: noctuum <25441068+noctuum@users.noreply.github.com>
Add user-facing documentation for the new syntax highlighting feature
in the internal file viewer:

- mc.1.in: document the feature in the Internal File Viewer section,
  including supported backends (bat, chroma, highlight, pygmentize,
  source-highlight), key bindings (s to toggle, S for backend chooser),
  Custom command option, status bar format, and persistence behavior
- mc.hint: add a random hint about the s/Shift-S keys

Signed-off-by: noctuum <25441068+noctuum@users.noreply.github.com>
@noctuum noctuum force-pushed the viewer_syntax_highlighting branch from b7ab89f to ba2a3b3 Compare March 13, 2026 10:31
@zyv zyv added area: mcview mcview, the built-in text editor and removed needs triage Needs triage by maintainers labels Mar 13, 2026
@zyv
Copy link
Member

zyv commented Mar 13, 2026

Overall, very impressive PR! Thank you. I appreciate a clean description of the intent, considered options, the notes on the architecture, and, of course, the tests.

I think that for the most part, it goes in the direction we want to go. However, I don't agree with all choices and would like to have some discussion on that. Whether the PR is a good place for it or we should open a GitHub discussion, I'm not sure. Let's see.

I think that the good part is the idea of implementing the ANSI SGR parser into the viewer. Maybe it's not the best solution for all cases, but in the case of mc, which features an own viewer with pluggable stream filters and an editor, it sounds like a good abstraction layer to draw a boundary between viewer/editor machinery and syntax highlighting.

The part that I don't like so much is the focus of this PR to support external highlighters to supply highlighted source. In my opinion, the optimal structure for mc would be along the lines of:

  1. Internal restricted highlighting system for use in editor and viewer
    a. Interface to the editor via low-level API
    b. Interface to the viewer via high-level API
  2. Editor with support for only internal highlighting system
  3. Viewer with support for internal highlighting system AND external systems via SGR
    a. Default would be source-highlight, possibly with pre-configured presets for others
    b. External highlighters would support more formats than we ever can and with higher quality as compared to the restricted engine

The syntax detection would be shared between the editor and the viewer, unless for viewer an external system is configured, in which case it would take over syntax detection for the viewer.

Why not port the editor's syntax engine?

src/editor/syntax.c (1639 lines) is tightly coupled to WEdit — text access via edit_buffer_get_byte() (11 call sites), state stored in WEdit fields, assumes full file in memory. The viewer handles files larger than RAM. Estimated 2000–3000 lines of refactoring vs. ~740 lines for this approach with more language coverage.

This I find a bit weak of an argument.

I'm neither an expert on highlighting systems, nor am I familiar with our editor & viewer codebase, but let me try to put down some thoughts on it from the maintainer perspective... think of it as a gardener, who has to somehow manage an estate that was entrusted to him, which he neither planned himself, nor is deeply familiar with all corners and individual plants.

The basic premise of mc is that it offers a portable baseline set of file management tools on supported systems in a mostly self-contained way while trying to stay tractable. So it ships editor and viewer, and (limited) syntax highlighting in the editor.

The editor's main selling point is being able to process the binary files, while the viewer can work with large files. The syntax engine needs to always terminate, be tolerant of invalid syntax, be reasonably fast and ideally work on buffers.

We do have an engine, which has served us well in the past, except for the problem with nested constructs (which might be fixable or not...). Unfortunately, as you mention, it's coded in a very brittle way, untested and indeed tightly coupled with WEdit.

However, thinking long term, I'd rather very much have it extracted, refactored and covered with tests. So these drawbacks are not an argument for me not to use it in the viewer, but rather to refactor it. In fact, I would be much more excited seeing a PR that does exactly that, even though it's an order of magnitude more effort, instead of just giving up and using it as an argument to focus on external highlighters.

Do you see any technical arguments on principle against this, other than it being a lot of work? To my mind it should be possible to have a slim restricted engine usable in both cases.

Besides, my understanding is that any external highlighted would first process the entire file anyways and pipe its SGRed version into the viewer, so I don't see how you solve the problem of our highlighting engine currently needing to load the whole file in memory with external highlighters.


Anyways, in terms of going forward with this, I think that if we all quickly agree that the SGR layer is a good thing, maybe we can start by getting this in via a separate PR. Then, hopefully, yet another one could happen for the internal engine (if we agree on that and you would be willing to do the work), and another one for external ones?

@egmontkob, since you did a lot of work in this area, I would be interested in your opinions.

@egmontkob
Copy link
Contributor

I probably won't have the capacity to review every single aspect of this feature, all the way from the global design down to the smallest bits, so here are a few things that occur to me (without having looked at the change):

(Foreword: I'll use the terminology "paragraph" for newline-separated sections of a file, in other words logical lines, which may end up being wrapped into multiple physical lines of the display.)

A huge selling point of the viewer is indeed to be able to "view" giant files (we're not talking about 5MB that you tested but rather 5GB and more, e.g. /dev/hda), and display it reasonably.

That involves a few tricks, e.g. when you press End then (if I remember correctly) we seek to the end of the file and then walk backwards until we find enough newline characters to be able to start rendering. And it's surely more complex if line wrapping is enabled: we seek back enough, then start to wrap the paragraph in memory, I guess we have to wrap all the way through to count the resulting number of lines, then start it again, this time already knowing when to start actually rendering them on the screen. I can't remember but it must be something like this.

As far as I vaguely remember, we also have a safety cap, after which (if walking backwards didn't result in enough newline characters soon enough) we display the file possibly beginning at a "wrong" position, rather than hanging for too long.

We also do this backwards walking when the user scrolls backwards (e.g. Up or PgUp key). Due to Unicode having tricks like double-wide characters, and also us supporting the *roff backspace-overstrike formatting syntax, we can't simply walk backwards by let's say 80 characters on an 80-column display. We walk backwards until the beginning of the paragraph and imagine to re-render that paragraph from its beginning (possibly some of its lines not making it into the screen).

(Much of this was done by me in a big viewer revamp in #3250.)

When it comes to syntax highlighting, there's two aspects of this story.

One is that we surely need to pipe the file through the highlighting engine, and then read its output to memory. This is similar to how we handle e.g. manual pages, F8-F9 changing whether or how to process the file. (The viewer definitely should have a mode where the SGR sequences are displayed rather than interpreted.)

The other is: if an SGR highlighting can span across paragraphs, our backwards walking strategy cannot work. Which leaves us with two possible approaches to take.

One possibility is to have two different backwards walking strategies implemented in the viewer. A new strategy that properly supports reasonably sized files, and always scans it from the beginning, or remembers for each newline the SGR state we carry there, or something like this; and the current strategy that can handle enormous files, although not interpret SGR sequences.

The second possibility is to do what less does, and since there's a prominent precedence for this, I think I'm leaning towards this. This is to slightly deviate from how a terminal would interpret the data if you cat'ed it, namely: every newline character of the file resets the SGR attributes. Every paragraph starts clean. This way you can keep our current backwards walking strategy.

This second possibility implies two things. One is that if you ever view a file containing raw SGR sequences (which you likely won't do too much) then you'll get this less-like interpretation. The other is that if you hook up to an external highlighter, it has to know to re-enable the active SGR modes at the beginning of every paragraph.

Which possibility should we go for?

You list 5 highlighters that you support. How many and which of them support this mode of reassuring any non-default SGR state after every newline character?

@egmontkob
Copy link
Contributor

Things to verify about the SGR parser (I haven't verified them yet):

mc (as for its skins) supports the following attributes, supported by both of our backends (ncurses and slang): bold, italic, underline, blink, reverse: SGR 1, 3, 4, 5, 7. Arguably the last two are rarely used, but given that we currently support them as much as the other three, I think they should be recognized and handled here too.

To turn them off individually: SGR 22 (yup, not 21!), 23, 24, 25, 27.

Escape sequences with 0 or an empty value anywhere inside, e.g. \e[m, \e[0m, \e[1;0;3m, \e[1;;3m all reset the attributes (the latter ones reset all (incl. bold) and then enabling italic).

Extended colors de jure use : as internal separator, in practice de facto ; is used instead. E.g. frequently used \e[38;5;255m should really be \e[38:5:255m.

RGB truecolors de jure have one more parameter. What is commonly incorrectly written as \e[38;2;10;20;30m should actually be \e[38:2::10:20:30m.

These so far should all be supported. Sometimes you can see mixed notation. The first separator is ;, the rest are :. Or RGB with : as separator but without the additional parameter. I don't think these need to be supported.

The very point of having : as the nested separator is future extendibility. \e[38;5;4;7m enables reverse but not blink or underline. In order to know this you have to recognize 38 as extended foreground color, then you have to recognize 5 as meaning 256-color mode, then you have to recognize 4 as the parameter to the former, then revert back to the default state and recognize 7 as reverse mode. However, seeing \e[38:5:4:7m you must know that it does not enable reverse (even if the foreground color looks incorrect, taking too many parameters). So, proper parsing is not just taking ; and : as synonyms, it's properly understanding the flat (incorrect, ; only) as well as the nested (correct, ; is the outer and : is the inner separator) notation, and know when to leave the nest.

You might want to recognize underline style extensions and (due to lack of support in both of our underlying libraries) replace with regular underline.

(See also https://askubuntu.com/a/985386.)

Recognize SGR 30..37, 90..97 as synonyms with 38:5:0..15, and similarly 40..47, 100..107 the same as 48:5:0..15. Strictly speaking they are not all fully synonyms. In many terminals (or settings thereof), SGR 30..37 combined with SGR 1 makes the color brigther, but SGR 38:5:0..7 combined with SGR 1 leaves the color the dark variant. Neither ncurses nor slang lets you keep this distinction, you have to merge them which will be sent out the terminal in whichever of these two formats.

@egmontkob
Copy link
Contributor

egmontkob commented Mar 13, 2026

Unfortunately we build upon two braindamaged libraries (ncurses and slang), both having an unreasonable palette-based approach to the color pairs (ncurses) or color pairs and attributes combos (slang), rather than being able to specify the colors directly for each cell.

You have to allocate a new color index for every new combo (which you surely do), and you'll never know when you'll run out of available slots.

Our palette handling component has an is_temp flag, not sure if it's used currently at all. You might want to look into this.

With modern terminal descriptions, terminfo asks the libraries to support 65536 color pairs. Ncurses supports this, not sure about Slang. This should be more than enough in practice, unless you begin to stress-test the system. Then sometimes (maybe before each screen update) carefully gargabe-collecting the no longer used indices could remedy the problem. Not sure if it's worth it dealing with this pain, though.

@egmontkob
Copy link
Contributor

You also have to somehow (how?) deal with the situation if the text file (or syntax highlighter) uses more advanced extended colors than the terminal supports them.

@zyv
Copy link
Member

zyv commented Mar 13, 2026

One is that we surely need to pipe the file through the highlighting engine, and then read its output to memory. This is similar to how we handle e.g. manual pages, F8-F9 changing whether or how to process the file. (The viewer definitely should have a mode where the SGR sequences are displayed rather than interpreted.)

Well, how about limiting the SGR parser (and highlighting) only to the parsed mode? Would this not make the rest much easier and leave whatever we have for big files intact?

@egmontkob
Copy link
Contributor

Well, how about limiting the SGR parser (and highlighting) only to the parsed mode? Would this not make the rest much easier and leave whatever we have for big files intact?

Running through the highlighter should be analogous to "F8 Parse", like running through the manpage formatter, gzip uncompressor etc.

If we go with less-like SGR resetting for every paragraph (subject to the backends supporting this behavior), SGR parsing-highlighting could remain orthogonal to this, depend on "F9 Format" just like with the *roff backspace-overstrike handler.

This is a well-established behavior that users didn't complain about. (Mind you, I never know what F8 vs. F9 do, I just randomly press them until I end up with the desired formatting :))

If we go with the less-like approach, I think syntax highlighting fits nicely into this picture with very little change hereabouts.

This means that the proposed s shortcut wouldn't be needed. Shift-s for choosing the backend could be replaced by the more convenient s, to which I'd also add Shift+F8 as an alternate shortcut since it extends F8's behavior.

@egmontkob
Copy link
Contributor

A big question (especially if we wait with 4.9.0 for this change to land):

If we can hook up to multiple external syntax highlighters, would we want to keep our own built-in one as well? Or could we simply ditch that?

@noctuum
Copy link
Author

noctuum commented Mar 13, 2026

Thank you both @zyv and @egmontkob for the incredibly thorough and thoughtful feedback — this is exactly the kind of discussion that makes the result better.

I'll address everything in one message to keep the thread manageable.


SGR across newlines

Tested all 5 supported highlighters with a Python file containing a multi-line docstring ("""...""" spanning 5 lines). Results:

Highlighter Re-emits SGR after \n Resets at EOL
bat yes \033[0m
chroma yes \033[0m
highlight yes \033[m
pygmentize yes \033[39m
source-highlight yes \033[m

All 5 produce less-compatible output — every line is SGR-independent. This is by design: they all target less -R as their primary consumer, which has the same paragraph-independence requirement. So we naturally align with option B (every \n resets SGR), and the current backwards walking strategy works unchanged.

F8/F9 paradigm and keybindings

I really like the idea of fitting syntax highlighting into the existing F8/F9 paradigm. Making it analogous to "F8 Parse" (like manpage processing or gzip) is a clean mental model. Happy to remap: drop the standalone s toggle, use s for backend selection, and add Shift+F8 as an alternate shortcut extending F8's behavior. This feels much more natural for existing mc users.

SGR parser audit

@egmontkob, thank you for the extremely detailed SGR requirements list. I audited the parser against every point. Here's the honest picture:

Already working correctly:

  • Bold (1), Underline (4) — with individual off codes (22, 24)
  • \e[m, \e[0m — full reset
  • FG 30–37, BG 40–47, Bright FG 90–97, Bright BG 100–107
  • 256-color: 38;5;N and 48;5;N (semicolon notation)
  • Truecolor: 38;2;R;G;B — approximated to nearest 256-color via 6x6x6 cube mapping
  • Default FG (39), Default BG (49)
  • Flat \e[38;5;4;7m correctly consumes 2 sub-params for the color, then processes 7 separately
  • Non-SGR CSI sequences consumed without affecting color state
  • Overflow protection (params clamped at 65535, max 16 params)

Gaps found (will fix in follow-up):

  • Italic (3), Blink (5), Reverse (7) and their off codes (23, 25, 27) — not parsed. MC's tty layer supports all of these via skins, so adding them to the parser and attribute builder is straightforward.
  • Empty parameter bug: \e[1;;3m currently drops the empty param instead of treating it as 0 (reset). The has_current_param guard in mcview_ansi_csi_finalize_param needs to be removed so empty slots push the default value of 0. This means \e[1;;3m would correctly behave as \e[1;0;3m (bold on → reset all → italic on).
  • Colon : as sub-separator: \e[38:5:255m doesn't parse — : (0x3A) falls into the intermediate byte range and is consumed silently. All 5 supported highlighters use ;, so there's no practical impact today, but for correctness and future-proofing this should be handled. This also gives us proper nested semantics: \e[38:5:4:7m would correctly NOT interpret 7 as reverse.
  • Underline style extensions (SGR 21 / 4:N variants like curly, dotted) — should be recognized and mapped to regular underline, given ncurses/slang limitations.

None of these gaps affect the output of any of the 5 supported highlighters (they all use basic SGR with ; separators), but I agree they should be fixed for robustness.

Color pair allocation

Already using tty_try_alloc_color_pair(&color, TRUE) — the is_temp flag marks our pairs as garbage-collectible via tty_color_free_temp(). With terminfo supporting 65536 pairs on ncurses, exhaustion is unlikely in practice, but the temp mechanism provides a safety net.

Truecolor → terminal downgrade

Already handled: mcview_ansi_rgb_to_256() maps 24-bit RGB to the nearest xterm 256-color index using the 6x6x6 cube. If the terminal supports fewer colors, ncurses/slang handle the further downgrade at the rendering layer — the same way they handle skin colors today.

Large file protection

Great point about 5GB+ files. I benchmarked the highlighters on generated Python files:

Highlighter 10 MB time 10 MB RSS
source-highlight 7.8s 9 MB
highlight 9.5s 5.5 MB
bat 19.5s 12.5 MB
chroma 26.8s 93 MB
pygmentize 27.2s 56 MB

This scales linearly — 5GB would take hours. Plan: add a file size threshold (a few MB) above which syntax highlighting is silently skipped, falling back to plain view. Same approach as less. The viewer's large-file capability remains fully intact because syntax mode operates through the pipe/growbuf path, completely separate from direct file access.

On the internal engine

@zyv, I appreciate the long-term vision for a unified highlighting engine, and I understand the gardener's perspective.

An interesting counterpoint was raised — whether we even need the built-in engine if external highlighters are available. I think the answer depends on the use case:

  • Editor (mcedit): users who need serious syntax highlighting in their editor typically already use vim, emacs, or vscode — and mc lets them do exactly that via EDITOR/VISUAL. The internal engine serves those who prefer the built-in editor for quick edits, and it works within its limitations for that purpose.
  • Viewer (mcview): external highlighters provide dramatically better coverage (500+ languages vs ~104 hand-maintained syntax files) with proper grammar-based parsing, and they already exist as mature, well-tested tools. The SGR layer in this PR is engine-agnostic — if the internal engine is ever refactored to emit SGR sequences, it plugs right in with zero viewer-side changes.

The current src/editor/syntax.c is 1639 lines of untested, tightly-coupled code. Refactoring it into a reusable, buffer-agnostic, testable library is realistically a multi-month project — and the result would still be a basic regex engine that can't handle nested constructs. That's a worthy long-term goal, but I don't think it should block a working solution that's available today.

There's also a pragmatic question of audience. The viewer is mc's most universally used tool — everyone opens files to look at them. The editor, on the other hand, competes with vim, emacs, vscode, and a dozen others that people already have muscle memory for. The intersection of "needs syntax highlighting" and "chooses mcedit as their primary editor" is modest. Investing months of refactoring into the internal engine to serve that niche, when external tools already cover 500+ languages with better quality, is a hard trade-off to justify.

This PR doesn't close any doors. It opens a clean SGR abstraction layer that works with any source of highlighted content — external tools today, potentially a refactored internal engine tomorrow.

Splitting the PR

Shipping the SGR parser without anything feeding it highlighted content would produce dead code with no user-visible effect. The parser and the external highlighter integration are two halves of a single feature — one interprets SGR, the other produces it. I'd prefer to keep them together as a cohesive, testable unit (35 tests covering both halves).

That said, I'm happy to restructure the commits if it helps review — the current 4-commit structure already separates the layers cleanly:

  1. ANSI SGR parser (pure, self-contained, 23 tests)
  2. Wire parser into ascii.c rendering pipeline
  3. Syntax command infrastructure + backend registry + dialog (12 tests)
  4. Documentation

Next steps

I'll prepare a follow-up addressing the gaps found in the audit:

  • Add italic, blink, reverse support (and their off codes)
  • Fix the empty parameter bug
  • Add colon : sub-separator support with proper nested semantics
  • Map underline style extensions to regular underline
  • Add file size threshold for large file protection
  • Remap keybindings to fit F8/Shift+F8 paradigm

Happy to discuss any of these points further.

@zyv
Copy link
Member

zyv commented Mar 13, 2026

A big question (especially if we wait with 4.9.0 for this change to land):

If we can hook up to multiple external syntax highlighters, would we want to keep our own built-in one as well? Or could we simply ditch that?

Well, I don't think that there is a chance to ditch the internal one (here I mean the one used in the editor), opting for the external one, because the requirements are different from the one used in the viewer, alone for performance reasons.

The highlighter used in the editor should be (ideally) streaming, error-recovering, and performant. We can't just pipe the edited file through an external highlighter upon every change. I'm not saying that ours is great and ticks all these boxes, but ripping it out for Pygments or so just isn't an option. We could replace it with a specialized library, but then we have a rather heavy external dependency, and it's not like there are lots of awesome contenders in the C-space.

My preference (or dream?) would be to extract, clean up, and cover with tests our internal one and make it our primary choice for editor and viewer with support for external ones as an option. See my musings above.

noctuum added 2 commits March 13, 2026 21:20
Address reviewer feedback on the ANSI SGR parser:

Parser improvements:
- Add italic (3), blink (5), reverse (7) attribute support
- Add individual off codes: 23, 25, 27
- Map SGR 21 (double underline) to regular underline
- Fix empty parameter bug: \e[1;;3m now correctly treats
  empty param as 0 (reset)
- Add colon ':' sub-separator with proper nested semantics:
  \e[38:5:4:7m does NOT set reverse (all sub-params belong
  to the color group), while \e[38;5;4;7m correctly does
- Support de jure truecolor: \e[38:2:CS:R:G:Bm

Large file protection:
- Skip syntax highlighting for files > 2 MB to avoid long
  processing times from external highlighters

12 new tests (35 total for ANSI parser).

Signed-off-by: noctuum <25441068+noctuum@users.noreply.github.com>
Per reviewer feedback, align syntax highlighting shortcuts with
the existing F8 (Parse/Raw) paradigm:

- Shift+F8 toggles syntax highlighting (was 's')
- 's' opens backend chooser dialog (was 'Shift-S')

Updated in all three keymaps (default, emacs, vim),
hint file, and man page.

Signed-off-by: noctuum <25441068+noctuum@users.noreply.github.com>
@noctuum
Copy link
Author

noctuum commented Mar 13, 2026

Follow-up: the parser fixes and keybinding remap mentioned above are now pushed.

SGR parser hardening

All gaps identified in the audit are addressed:

  • Italic (3), Blink (5), Reverse (7) — now parsed and passed to the renderer via dynamic attribute builder
  • Individual off codes (23, 25, 27) — each attribute can be disabled independently
  • SGR 21 (double underline) — mapped to regular underline (ncurses/slang limitation)
  • Empty parameter bug\e[1;;3m now correctly treats the empty param as 0, producing bold→reset→italic
  • Colon : sub-separator — fully supported with proper nested semantics:
    • \e[38:5:4:7m → foreground color 4, reverse is not set (7 belongs to the color group)
    • \e[38;5;4;7m → foreground color 4, reverse is set (7 is an independent SGR param)
  • De jure truecolor\e[38:2:CS:R:G:Bm recognized, color space parameter skipped

Large file protection

Files larger than 2 MB now silently skip syntax highlighting, falling back to plain view. The viewer's large-file capability remains fully intact — syntax mode operates through the pipe/growbuf path, completely separate from direct file access.

The 2 MB threshold is based on the benchmark data shared above: even the fastest highlighter (source-highlight) takes ~1.5s at 2 MB, scaling linearly from there.

Keybinding remap

Remapped per @egmontkob's suggestion to fit the F8/F9 paradigm:

  • Shift+F8 — toggle syntax highlighting (was s), extending F8's "parse" behavior
  • s — open backend chooser dialog (was Shift-S)

Updated in all three keymaps (default, emacs, vim), hint, and man page.

Tests

12 new tests added (35 total for the ANSI parser), covering every fix listed above including the colon-vs-semicolon semantic distinction.

@ossilator
Copy link
Contributor

the ticket for using an external library is #2931, and the merits and caveats of going that route should be discussed there.

it seems somewhat obvious that syntax hl in the editor requires an interactive provider. but that doesn't imply an in-process library; the language server protocol is all the rage currently.
going from there, it would seem kind of silly to have separate support for external engines in the viewer.

i don't buy the argument that the PR cannot be split. text files with embedded ansi colors are absolutely a thing, which is why i created #31.

@noctuum
Copy link
Author

noctuum commented Mar 14, 2026

@ossilator — I can see the argument for splitting. If you'd like me to restructure this into separate PRs, let me know — the commit history already separates the layers cleanly. Or feel free to cherry-pick the SGR parser commits directly if that's easier on your end.

A note on scope, since the discussion is drifting toward LSP and internal engine refactoring:

Issue #28 ("syntax highlighting in viewer wanted") has been open since July 2005. Issue #1849 ("mcview: color output / ansi escapes support") since 2009. Issue #2931 ("replace mcedit syntax highlighting engine with an external library") since 2012. In all that time, no PR and no concrete implementation proposal has materialized for any of them.

This PR exists precisely because none of those discussions led to code. It solves one specific, narrowly-scoped problem: let the viewer display syntax-highlighted output from tools that already exist. It's ~740 lines of new code, 35 tests, zero architectural changes, zero new dependencies. It doesn't touch the editor, doesn't modify the internal syntax engine, doesn't alter the viewer's large-file or backwards-walking strategies.

LSP is a fundamentally different project — JSON-RPC transport, capabilities negotiation, server lifecycle, per-language runtime dependencies. An internal engine refactoring of the 1639-line untested syntax.c is another. Both are multi-month efforts that would need dedicated contributors. Neither competes with what this PR does — the SGR rendering layer is source-agnostic and would serve as the display backend for either approach if they ever materialize.

I'm happy to split into:

But I want to be transparent: my contribution scope is limited to what's here. I built this because it's a simple, self-contained solution to a long-standing request — not as a stepping stone toward a larger architecture rework. If the external highlighter path doesn't align with the project's direction, that's fine — the SGR parser stands on its own.

@ossilator
Copy link
Contributor

i have a strong backwards compatibility mindset, so it irks me to add features from which are an architectural dead end. but then, most users wouldn't be seriously bothered by the highlighting slightly changing due to the backend being replaced. so ... fair enough?

the inconsistent highlighting between viewer and editor will raise some eyebrows, though. so this "architectural limitation" should be clearly documented as such.

your solution also has a runtime dependency. i don't think it's a big deal, but let's keep the comparisons honest.

so overall the approach is good enough given available manpower.
splitting PRs is a PITA on GH, so don't bother unless yuri disagrees with my assessment above.
we like a nicely polished git history, though.

@egmontkob
Copy link
Contributor

egmontkob commented Mar 16, 2026

I didn't follow the evolution of this discussion thoroughly. From a huge distance, everything makes sense to me, including this:

But I want to be transparent: my contribution scope is limited to what's here


I'm happy to split into:

Yup I think that extending the viewer's capabilities, vs. the wiring with other tools, should preferably be separate commits.


Another "tool" (not syntax highlighting of text files but rather something completely different) is an "image viewer" here: https://gitlab.gnome.org/GNOME/vte/-/issues/2941

I'm not saying mc should hook up to it by default – in fact, I really think it shouldn't – but it could be a cool tool for showcasing and debugging the new feature, as well stress-testing how it deals with running out of color pairs.

@ossilator
Copy link
Contributor

i don't think anyone even considered that it shouldn't be separate commits (anymore). the question was whether the PR should be split.
at this point, the newly added #5067 provides a tangible reason why separating the second part may make sense.

@egmontkob
Copy link
Contributor

the question was whether the PR should be split.

Ahh, okay :) I couldn't care less about that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area: mcview mcview, the built-in text editor prio: medium Has the potential to affect progress

Development

Successfully merging this pull request may close these issues.

mcview: color output (ansi escapes) support savannah: syntax highlighting in viewer wanted

4 participants