mcviewer: add syntax highlighting via external command#5065
mcviewer: add syntax highlighting via external command#5065noctuum wants to merge 6 commits intoMidnightCommander:masterfrom
Conversation
Add a state machine parser for ANSI SGR (Select Graphic Rendition) escape sequences. Supports standard colors (0-7), bright (8-15), 256-color (38;5;N / 48;5;N), and truecolor (38;2;R;G;B / 48;2;R;G;B) which is approximated to the nearest xterm 256-color palette index. Bold, underline, italic, and individual/global resets are handled. Includes 23 unit tests covering color parsing, attribute tracking, truecolor approximation, escape sequence edge cases, and combined sequences. Signed-off-by: noctuum <25441068+noctuum@users.noreply.github.com>
Integrate the ANSI SGR parser into mcview's text display pipeline. When syntax mode is active, bytes are fed through the parser and ANSI escape sequences are consumed while displayable characters are rendered with the appropriate colors. Key additions: - mcview_ansi_get_color(): maps ANSI parser state to MC color pairs, inheriting the viewer skin's fg/bg when ANSI values are default (fixes background mismatch with source-highlight) - mcview_get_next_maybe_ansi_char(): new rendering layer between raw byte reading and nroff processing - mcview_fill_line_remaining(): fills remaining columns on each line with spaces using the last drawn color, giving full-width background - syntax_fill_color tracking across lines for consistent empty-line coloring Signed-off-by: noctuum <25441068+noctuum@users.noreply.github.com>
Pipe files through an external syntax highlighter and render the ANSI-colored output in the internal viewer. Five backends supported out of the box: bat, chroma, highlight, pygmentize, source-highlight. Default is source-highlight; user can also set an arbitrary command via the Shift-S "Custom..." dialog. Implementation details: - syntax.c: backend registry, command building with g_shell_quote(), binary detection via g_find_program_in_path(), options dialog - mcviewer.c: drain loop for accurate file-size percentage, empty-output detection with friendly fallback to plain view - growbuf.c: suppress redundant error dialog in syntax mode - Toggle with 's' key; Shift-S opens backend chooser - Syntax and nroff modes are mutually exclusive - chroma uses --fail to exit fast on unrecognized files - pygmentize uses -O stripnl=False to preserve leading blank lines - Status bar shows short backend name: [bat], [chrm], [hi], [pyg], [src-hl], or [ext] with the actual file type Includes 12 unit tests for command parsing and binary extraction. Signed-off-by: noctuum <25441068+noctuum@users.noreply.github.com>
Add user-facing documentation for the new syntax highlighting feature in the internal file viewer: - mc.1.in: document the feature in the Internal File Viewer section, including supported backends (bat, chroma, highlight, pygmentize, source-highlight), key bindings (s to toggle, S for backend chooser), Custom command option, status bar format, and persistence behavior - mc.hint: add a random hint about the s/Shift-S keys Signed-off-by: noctuum <25441068+noctuum@users.noreply.github.com>
b7ab89f to
ba2a3b3
Compare
|
Overall, very impressive PR! Thank you. I appreciate a clean description of the intent, considered options, the notes on the architecture, and, of course, the tests. I think that for the most part, it goes in the direction we want to go. However, I don't agree with all choices and would like to have some discussion on that. Whether the PR is a good place for it or we should open a GitHub discussion, I'm not sure. Let's see. I think that the good part is the idea of implementing the ANSI SGR parser into the viewer. Maybe it's not the best solution for all cases, but in the case of mc, which features an own viewer with pluggable stream filters and an editor, it sounds like a good abstraction layer to draw a boundary between viewer/editor machinery and syntax highlighting. The part that I don't like so much is the focus of this PR to support external highlighters to supply highlighted source. In my opinion, the optimal structure for mc would be along the lines of:
The syntax detection would be shared between the editor and the viewer, unless for viewer an external system is configured, in which case it would take over syntax detection for the viewer.
This I find a bit weak of an argument. I'm neither an expert on highlighting systems, nor am I familiar with our editor & viewer codebase, but let me try to put down some thoughts on it from the maintainer perspective... think of it as a gardener, who has to somehow manage an estate that was entrusted to him, which he neither planned himself, nor is deeply familiar with all corners and individual plants. The basic premise of mc is that it offers a portable baseline set of file management tools on supported systems in a mostly self-contained way while trying to stay tractable. So it ships editor and viewer, and (limited) syntax highlighting in the editor. The editor's main selling point is being able to process the binary files, while the viewer can work with large files. The syntax engine needs to always terminate, be tolerant of invalid syntax, be reasonably fast and ideally work on buffers. We do have an engine, which has served us well in the past, except for the problem with nested constructs (which might be fixable or not...). Unfortunately, as you mention, it's coded in a very brittle way, untested and indeed tightly coupled with However, thinking long term, I'd rather very much have it extracted, refactored and covered with tests. So these drawbacks are not an argument for me not to use it in the viewer, but rather to refactor it. In fact, I would be much more excited seeing a PR that does exactly that, even though it's an order of magnitude more effort, instead of just giving up and using it as an argument to focus on external highlighters. Do you see any technical arguments on principle against this, other than it being a lot of work? To my mind it should be possible to have a slim restricted engine usable in both cases. Besides, my understanding is that any external highlighted would first process the entire file anyways and pipe its SGRed version into the viewer, so I don't see how you solve the problem of our highlighting engine currently needing to load the whole file in memory with external highlighters. Anyways, in terms of going forward with this, I think that if we all quickly agree that the SGR layer is a good thing, maybe we can start by getting this in via a separate PR. Then, hopefully, yet another one could happen for the internal engine (if we agree on that and you would be willing to do the work), and another one for external ones? @egmontkob, since you did a lot of work in this area, I would be interested in your opinions. |
|
I probably won't have the capacity to review every single aspect of this feature, all the way from the global design down to the smallest bits, so here are a few things that occur to me (without having looked at the change): (Foreword: I'll use the terminology "paragraph" for newline-separated sections of a file, in other words logical lines, which may end up being wrapped into multiple physical lines of the display.) A huge selling point of the viewer is indeed to be able to "view" giant files (we're not talking about 5MB that you tested but rather 5GB and more, e.g. That involves a few tricks, e.g. when you press End then (if I remember correctly) we seek to the end of the file and then walk backwards until we find enough newline characters to be able to start rendering. And it's surely more complex if line wrapping is enabled: we seek back enough, then start to wrap the paragraph in memory, I guess we have to wrap all the way through to count the resulting number of lines, then start it again, this time already knowing when to start actually rendering them on the screen. I can't remember but it must be something like this. As far as I vaguely remember, we also have a safety cap, after which (if walking backwards didn't result in enough newline characters soon enough) we display the file possibly beginning at a "wrong" position, rather than hanging for too long. We also do this backwards walking when the user scrolls backwards (e.g. Up or PgUp key). Due to Unicode having tricks like double-wide characters, and also us supporting the *roff backspace-overstrike formatting syntax, we can't simply walk backwards by let's say 80 characters on an 80-column display. We walk backwards until the beginning of the paragraph and imagine to re-render that paragraph from its beginning (possibly some of its lines not making it into the screen). (Much of this was done by me in a big viewer revamp in #3250.) When it comes to syntax highlighting, there's two aspects of this story. One is that we surely need to pipe the file through the highlighting engine, and then read its output to memory. This is similar to how we handle e.g. manual pages, F8-F9 changing whether or how to process the file. (The viewer definitely should have a mode where the SGR sequences are displayed rather than interpreted.) The other is: if an SGR highlighting can span across paragraphs, our backwards walking strategy cannot work. Which leaves us with two possible approaches to take. One possibility is to have two different backwards walking strategies implemented in the viewer. A new strategy that properly supports reasonably sized files, and always scans it from the beginning, or remembers for each newline the SGR state we carry there, or something like this; and the current strategy that can handle enormous files, although not interpret SGR sequences. The second possibility is to do what This second possibility implies two things. One is that if you ever view a file containing raw SGR sequences (which you likely won't do too much) then you'll get this Which possibility should we go for? You list 5 highlighters that you support. How many and which of them support this mode of reassuring any non-default SGR state after every newline character? |
|
Things to verify about the SGR parser (I haven't verified them yet): mc (as for its skins) supports the following attributes, supported by both of our backends (ncurses and slang): bold, italic, underline, blink, reverse: SGR 1, 3, 4, 5, 7. Arguably the last two are rarely used, but given that we currently support them as much as the other three, I think they should be recognized and handled here too. To turn them off individually: SGR 22 (yup, not 21!), 23, 24, 25, 27. Escape sequences with 0 or an empty value anywhere inside, e.g. Extended colors de jure use RGB truecolors de jure have one more parameter. What is commonly incorrectly written as These so far should all be supported. Sometimes you can see mixed notation. The first separator is The very point of having You might want to recognize underline style extensions and (due to lack of support in both of our underlying libraries) replace with regular underline. (See also https://askubuntu.com/a/985386.) Recognize SGR 30..37, 90..97 as synonyms with 38:5:0..15, and similarly 40..47, 100..107 the same as 48:5:0..15. Strictly speaking they are not all fully synonyms. In many terminals (or settings thereof), SGR 30..37 combined with SGR 1 makes the color brigther, but SGR 38:5:0..7 combined with SGR 1 leaves the color the dark variant. Neither ncurses nor slang lets you keep this distinction, you have to merge them which will be sent out the terminal in whichever of these two formats. |
|
Unfortunately we build upon two braindamaged libraries (ncurses and slang), both having an unreasonable palette-based approach to the color pairs (ncurses) or color pairs and attributes combos (slang), rather than being able to specify the colors directly for each cell. You have to allocate a new color index for every new combo (which you surely do), and you'll never know when you'll run out of available slots. Our palette handling component has an With modern terminal descriptions, terminfo asks the libraries to support 65536 color pairs. Ncurses supports this, not sure about Slang. This should be more than enough in practice, unless you begin to stress-test the system. Then sometimes (maybe before each screen update) carefully gargabe-collecting the no longer used indices could remedy the problem. Not sure if it's worth it dealing with this pain, though. |
|
You also have to somehow (how?) deal with the situation if the text file (or syntax highlighter) uses more advanced extended colors than the terminal supports them. |
Well, how about limiting the SGR parser (and highlighting) only to the parsed mode? Would this not make the rest much easier and leave whatever we have for big files intact? |
Running through the highlighter should be analogous to "F8 Parse", like running through the manpage formatter, gzip uncompressor etc. If we go with This is a well-established behavior that users didn't complain about. (Mind you, I never know what F8 vs. F9 do, I just randomly press them until I end up with the desired formatting :)) If we go with the This means that the proposed |
|
A big question (especially if we wait with 4.9.0 for this change to land): If we can hook up to multiple external syntax highlighters, would we want to keep our own built-in one as well? Or could we simply ditch that? |
|
Thank you both @zyv and @egmontkob for the incredibly thorough and thoughtful feedback — this is exactly the kind of discussion that makes the result better. I'll address everything in one message to keep the thread manageable. SGR across newlinesTested all 5 supported highlighters with a Python file containing a multi-line docstring (
All 5 produce F8/F9 paradigm and keybindingsI really like the idea of fitting syntax highlighting into the existing F8/F9 paradigm. Making it analogous to "F8 Parse" (like manpage processing or gzip) is a clean mental model. Happy to remap: drop the standalone SGR parser audit@egmontkob, thank you for the extremely detailed SGR requirements list. I audited the parser against every point. Here's the honest picture: Already working correctly:
Gaps found (will fix in follow-up):
None of these gaps affect the output of any of the 5 supported highlighters (they all use basic SGR with Color pair allocationAlready using Truecolor → terminal downgradeAlready handled: Large file protectionGreat point about 5GB+ files. I benchmarked the highlighters on generated Python files:
This scales linearly — 5GB would take hours. Plan: add a file size threshold (a few MB) above which syntax highlighting is silently skipped, falling back to plain view. Same approach as On the internal engine@zyv, I appreciate the long-term vision for a unified highlighting engine, and I understand the gardener's perspective. An interesting counterpoint was raised — whether we even need the built-in engine if external highlighters are available. I think the answer depends on the use case:
The current There's also a pragmatic question of audience. The viewer is mc's most universally used tool — everyone opens files to look at them. The editor, on the other hand, competes with vim, emacs, vscode, and a dozen others that people already have muscle memory for. The intersection of "needs syntax highlighting" and "chooses mcedit as their primary editor" is modest. Investing months of refactoring into the internal engine to serve that niche, when external tools already cover 500+ languages with better quality, is a hard trade-off to justify. This PR doesn't close any doors. It opens a clean SGR abstraction layer that works with any source of highlighted content — external tools today, potentially a refactored internal engine tomorrow. Splitting the PRShipping the SGR parser without anything feeding it highlighted content would produce dead code with no user-visible effect. The parser and the external highlighter integration are two halves of a single feature — one interprets SGR, the other produces it. I'd prefer to keep them together as a cohesive, testable unit (35 tests covering both halves). That said, I'm happy to restructure the commits if it helps review — the current 4-commit structure already separates the layers cleanly:
Next stepsI'll prepare a follow-up addressing the gaps found in the audit:
Happy to discuss any of these points further. |
Well, I don't think that there is a chance to ditch the internal one (here I mean the one used in the editor), opting for the external one, because the requirements are different from the one used in the viewer, alone for performance reasons. The highlighter used in the editor should be (ideally) streaming, error-recovering, and performant. We can't just pipe the edited file through an external highlighter upon every change. I'm not saying that ours is great and ticks all these boxes, but ripping it out for Pygments or so just isn't an option. We could replace it with a specialized library, but then we have a rather heavy external dependency, and it's not like there are lots of awesome contenders in the C-space. My preference (or dream?) would be to extract, clean up, and cover with tests our internal one and make it our primary choice for editor and viewer with support for external ones as an option. See my musings above. |
Address reviewer feedback on the ANSI SGR parser: Parser improvements: - Add italic (3), blink (5), reverse (7) attribute support - Add individual off codes: 23, 25, 27 - Map SGR 21 (double underline) to regular underline - Fix empty parameter bug: \e[1;;3m now correctly treats empty param as 0 (reset) - Add colon ':' sub-separator with proper nested semantics: \e[38:5:4:7m does NOT set reverse (all sub-params belong to the color group), while \e[38;5;4;7m correctly does - Support de jure truecolor: \e[38:2:CS:R:G:Bm Large file protection: - Skip syntax highlighting for files > 2 MB to avoid long processing times from external highlighters 12 new tests (35 total for ANSI parser). Signed-off-by: noctuum <25441068+noctuum@users.noreply.github.com>
Per reviewer feedback, align syntax highlighting shortcuts with the existing F8 (Parse/Raw) paradigm: - Shift+F8 toggles syntax highlighting (was 's') - 's' opens backend chooser dialog (was 'Shift-S') Updated in all three keymaps (default, emacs, vim), hint file, and man page. Signed-off-by: noctuum <25441068+noctuum@users.noreply.github.com>
|
Follow-up: the parser fixes and keybinding remap mentioned above are now pushed. SGR parser hardeningAll gaps identified in the audit are addressed:
Large file protectionFiles larger than 2 MB now silently skip syntax highlighting, falling back to plain view. The viewer's large-file capability remains fully intact — syntax mode operates through the pipe/growbuf path, completely separate from direct file access. The 2 MB threshold is based on the benchmark data shared above: even the fastest highlighter (source-highlight) takes ~1.5s at 2 MB, scaling linearly from there. Keybinding remapRemapped per @egmontkob's suggestion to fit the F8/F9 paradigm:
Updated in all three keymaps (default, emacs, vim), hint, and man page. Tests12 new tests added (35 total for the ANSI parser), covering every fix listed above including the colon-vs-semicolon semantic distinction. |
|
the ticket for using an external library is #2931, and the merits and caveats of going that route should be discussed there. it seems somewhat obvious that syntax hl in the editor requires an interactive provider. but that doesn't imply an in-process library; the language server protocol is all the rage currently. i don't buy the argument that the PR cannot be split. text files with embedded ansi colors are absolutely a thing, which is why i created #31. |
|
@ossilator — I can see the argument for splitting. If you'd like me to restructure this into separate PRs, let me know — the commit history already separates the layers cleanly. Or feel free to cherry-pick the SGR parser commits directly if that's easier on your end. A note on scope, since the discussion is drifting toward LSP and internal engine refactoring: Issue #28 ("syntax highlighting in viewer wanted") has been open since July 2005. Issue #1849 ("mcview: color output / ansi escapes support") since 2009. Issue #2931 ("replace mcedit syntax highlighting engine with an external library") since 2012. In all that time, no PR and no concrete implementation proposal has materialized for any of them. This PR exists precisely because none of those discussions led to code. It solves one specific, narrowly-scoped problem: let the viewer display syntax-highlighted output from tools that already exist. It's ~740 lines of new code, 35 tests, zero architectural changes, zero new dependencies. It doesn't touch the editor, doesn't modify the internal syntax engine, doesn't alter the viewer's large-file or backwards-walking strategies. LSP is a fundamentally different project — JSON-RPC transport, capabilities negotiation, server lifecycle, per-language runtime dependencies. An internal engine refactoring of the 1639-line untested syntax.c is another. Both are multi-month efforts that would need dedicated contributors. Neither competes with what this PR does — the SGR rendering layer is source-agnostic and would serve as the display backend for either approach if they ever materialize. I'm happy to split into:
But I want to be transparent: my contribution scope is limited to what's here. I built this because it's a simple, self-contained solution to a long-standing request — not as a stepping stone toward a larger architecture rework. If the external highlighter path doesn't align with the project's direction, that's fine — the SGR parser stands on its own. |
|
i have a strong backwards compatibility mindset, so it irks me to add features from which are an architectural dead end. but then, most users wouldn't be seriously bothered by the highlighting slightly changing due to the backend being replaced. so ... fair enough? the inconsistent highlighting between viewer and editor will raise some eyebrows, though. so this "architectural limitation" should be clearly documented as such. your solution also has a runtime dependency. i don't think it's a big deal, but let's keep the comparisons honest. so overall the approach is good enough given available manpower. |
|
I didn't follow the evolution of this discussion thoroughly. From a huge distance, everything makes sense to me, including this:
Yup I think that extending the viewer's capabilities, vs. the wiring with other tools, should preferably be separate commits. Another "tool" (not syntax highlighting of text files but rather something completely different) is an "image viewer" here: https://gitlab.gnome.org/GNOME/vte/-/issues/2941 I'm not saying mc should hook up to it by default – in fact, I really think it shouldn't – but it could be a cool tool for showcasing and debugging the new feature, as well stress-testing how it deals with running out of color pairs. |
|
i don't think anyone even considered that it shouldn't be separate commits (anymore). the question was whether the PR should be split. |
Ahh, okay :) I couldn't care less about that. |
Checklist
👉 Our coding style can be found here: https://midnight-commander.org/coding-style/ 👈
git commit --amend -smake indent && make check)Proposed changes
Add optional syntax highlighting to the internal file viewer (
mcview) via configurable external command. This resolves feature requests #28 (syntax highlighting in viewer, originally filed in 2005) and #1849 (ANSI color escape support in viewer, 2009).The implementation has two layers:
ANSI SGR parser — a state machine that renders
\033[...mescapesequences with proper color mapping. Useful standalone for viewing colored
logs,
git diff --coloroutput, CI logs, and any ANSI-colored content.Syntax highlighting — pipes the viewed file through a configurable
external command, feeds the ANSI-colored output to the parser. Five
backends are supported out of the box; users can also set an arbitrary
command via the
Shift-S→Custom...dialog.Supported backends
bat --color=always --style=plain --paging=never %schroma --formatter terminal256 --fail %s--failfor fast exit on unknown fileshighlight -O xterm256 --force %spygmentize -f terminal256 -O stripnl=False %sstripnl=Falsepreserves leading blank linessource-highlight --failsafe --out-format=esc -i %sDefault is
source-highlightbecause it's GPL-3 compatible, available in190+ package repositories (every major Linux distro, all BSDs, macOS via
Homebrew/MacPorts), produces clean ANSI output, and is purpose-built for
exactly this use case.
User interaction
s— toggle syntax highlighting on/offShift-S— open backend chooser (only installed backends are shown,plus
Custom...for arbitrary commands with%splaceholder)[bat],[chrm],[hi],[pyg],[src-hl], or[ext]with the detected file typemcview_global_flagsError handling and graceful degradation
Architecture
This is the same pattern used by
less(viaLESSOPEN), ranger, nnn, lf,and vifm. Every modern file manager delegates syntax highlighting to an
external tool — none have built-in syntax engines for their viewer.
Why not port the editor's syntax engine?
src/editor/syntax.c(1639 lines) is tightly coupled toWEdit— textaccess via
edit_buffer_get_byte()(11 call sites), state stored inWEditfields, assumes full file in memory. The viewer handles files larger than
RAM. Estimated 2000–3000 lines of refactoring vs. ~740 lines for this
approach with more language coverage.
Testing
Unit tests (35 total, all pass)
truecolor), attribute tracking (bold, underline, italic), reset sequences,
malformed input, combined sequences, non-SGR CSI skip, plain text passthrough
%ssubstitution with shell escaping,binary name extraction, edge cases (NULL, empty, missing
%s)Manual testing — backend matrix
Tested on Arch Linux with all five backends:
0.26.12.23.1--fail: fast exit → friendly dialog → plain fallback4.192.19.23.1.9--failsafe: outputs raw contentManual testing — edge cases
%s)%splaceholder, placeholder shows current backend template-O stripnl=Falsepreserves themBuild verification
Commit structure
Each commit passes
make indent && make check.Configuration
Alternative commands:
Or any tool that outputs ANSI escape codes to stdout.
%sis replaced withthe shell-escaped file path.
Stats