Tree-sitter integration for better syntax highlighting #5067
jtyr wants to merge 8 commits into MidnightCommander:master
Signed-off-by: Jiri Tyr <jiri.tyr@gmail.com>
src/editor/edit.c
Outdated
    || (status_before_ok && status_after_ok && status_after.st_size != 0
    && (status_after.st_size != status_before.st_size
    || status_after.st_mtime != status_before.st_mtime));
    #ifdef HAVE_STRUCT_STAT_ST_MTIM
Please move this to a separate commit.
I will create a separate PR for this.
src/editor/editcmd.c
Outdated
    {
    char *fname;
    char *macros_fname = NULL;
    off_t start_mark, end_mark;
Please move this to a separate commit.
I will create a separate PR for this (together with the change in src/editor/edit.c).
    unsigned int skip_detach_prompt : 1; // Do not prompt whether to detach a file anymore
    // syntax highlighting
    // syntax highlighting (tree-sitter)
Can it be moved to a separate structure?
I have moved it to a separate structure.
src/editor/syntax.c
Outdated
    edit_syntax_rule_t rule;
    } syntax_marker_t;
    #ifdef HAVE_TREE_SITTER
Can it be moved to a separate file?
I have separated the TS syntax handling.
src/subshell/common.c
Outdated
    if (use_persistent_buffer)
    clear_cwd_pipe ();
    else
    else if (mc_global.mc_run_mode == MC_RUN_FULL)
Please move this to a separate commit.
I will create a separate PR for this.
    AC_CHECK_TOOLS([AR], [ar gar])
    AC_PROG_CC
    AC_PROG_CXX
Do we really need a C++ compiler?
Yes, this is needed for the SQL grammar, which has a C++ scanner.
    AC_SUBST(LDFLAGS)
    AC_SUBST(LIBS)
    dnl ############################################################################
Please move this to a separate m4 file.
I have moved that to a separate m4 file.
|
The last commit demonstrates how simple it is to add an extra language. I have added the QML language requested in PR #5022. All known parsers are listed on this page.
|
Thanks for the PR! Unfortunately, I don't think it is seriously reviewable as a blob, but it's nice to have a prototype on the basis of which we can have specific discussions and decide how we can move forward, instead of just hand-waving and doing some theoretical exercises. Before we get into the technical details, I would like to discuss some general design and maintenance issues. First off, I think it's certainly desirable to replace our internal regex-based highlighting engine if at all possible with something a little bit more maintainable and powerful (more grammar-based at least to the extent that nested constructs are supported). The priorities to my mind should be:
Non-goals would be:
I would absolutely like to avoid maintaining several systems in parallel, or connections to several systems. This PR comes directly after #5065 - not sure if that's a coincidence. Anyway, I think we have the following options:
I think (1) will happen on its own if our discussions end up nowhere, (2) is not something I can accept, and (3) is maybe what we should aim for and discuss how we get there.
I've had a cursory look at Tree Sitter (I hadn't heard of it before). I think it looks pretty much like highlighting done the right way, although it still seems not to be without difficulties in the context of mc's requirements. My first question is: are there any comparable alternatives at all, that is, highlighting libraries that have a slim C core runtime and can generate parsers in C from grammars, or is it pretty much the only contender in this space?
If Tree Sitter is the only alternative, then I would like to clarify the integration aspects. You mention that adding parsers for languages inflates the binary. How bad is it really, and what do the numbers mean? I'm surprised about ~5M baseline for So with
Now, adding dynamically loaded parser libraries is not something I would accept. We've been moving in the opposite direction for years, and this would be a huge setback. If this is non-negotiable, this would kind of kill Tree Sitter for me.
The next big issue for me is the grammar fetching magic at configure time. This is absolutely unacceptable. We should be able to bootstrap self-contained tarballs, which have everything in them that is needed for the build. My understanding is that it's not impossible to vendor Tree Sitter and parsers. Is there any reason why you didn't consider that? If it is possible, then we should think about which vendoring strategy is best. How often does the Tree Sitter runtime change, and how big is it in terms of source code? Can we just get away with committing it to a subdirectory? Same question regarding the parsers: can we just commit them, or is it dozens of megabytes of generated source code changing often that will completely blow up our repository?
Are there any other meaningful approaches to that, as in create a separate repository for vendored Tree Sitter stuff and include it as a submodule and/or fetch a ZIP of a hash at the
I understand your ideas about pinning to tags being reasonably secure, but we have other issues to consider. Our main git repository is on GitHub and we have mirrors. I want us to have reasonable control over what we depend upon. Yes, we already lost control of many things. But that's no reason to exacerbate the situation by several orders of magnitude. What happens if GitLab-hosted projects shut down, like it happened to Bitbucket? What happens if one of the grammar maintainers stops taking his pills and nukes his repository? Someone takes over and changes the tag, because it was an unsigned one? When relying on so many moving parts, I'm afraid we'll be dealing with this stuff all the time.
Also, if we come up with a good vendoring strategy (separate repository?), I would like us to have something in the CI generating a report of what the state of the vendored grammars is, and ideally semi-automated updates, which we could e.g. do once before release, or immediately after releases, or sometime in the middle of the cycle without it being a major project like tar updates. Any ideas how this can be done cheaply and reliably?
Finally, regarding grammar support, I would actually like to have a comparison table between what we have now and Tree Sitter. We have about ~100 definitions of variable quality. I understand that Tree Sitter brings in ~60, including community stuff. How much do we lose? How bad is that? Can you make a tabular overview? Is it realistic to lose as little as possible, or are there just no grammars for what we have, which is why you picked built-in, high-quality community grammars and stopped at that?
And one final point: requiring a C++ compiler is absolutely a no-go.
|
Thank you for the thorough and thoughtful review, @zyv. These are exactly the right questions to ask, and I'm glad we're having this discussion with a working prototype rather than in the abstract. I'll try to address each point. The full documentation is in
Why tree-sitter?
No. Tree-sitter is uniquely positioned for MC's needs:
The alternatives don't fit: TextMate grammars are regex-based (same fundamental limitations as CoolEdit), Pygments is Python-only, Lezer is JavaScript-only. Tree-sitter is the only C-native option with grammar-based parsing.
Option 3: Replace CoolEdit with tree-sitter
I agree this is the right direction. The tree-sitter integration already falls back to legacy highlighting when a grammar is unavailable, so the transition can be gradual:
Viewer support
Tree-sitter can absolutely be extended to the viewer. The infrastructure is the same -
Language coverage
Here's a comparison of what MC has today vs what tree-sitter covers:
The ~51 MC-only languages are mostly niche or domain-specific formats:
The notable gaps are LaTeX (no tree-sitter grammar with pre-built parser.c exists - only a scanner), DOS Batch, M4 macros, and Puppet. These could be added over time as tree-sitter grammars become available, or the community could contribute them. All mainstream programming languages are covered with significantly better highlighting quality than CoolEdit - especially languages with complex syntax like Bash (heredocs, nested quoting), Perl, Ruby, and HTML (embedded JS/CSS). The full list of available tree-sitter parsers is at https://github.com/tree-sitter/tree-sitter/wiki/List-of-parsers - there are 200+ grammars, many of which could fill the gaps.
Binary size
Here are the real numbers (Linux x86_64,
The size comes from the generated
A curated set of ~22 common languages (C, C++, Python, Bash, Go, Rust, Java, JS, TS, JSON, YAML, HTML, CSS, XML, Lua, Ruby, PHP, Perl, Markdown, Make, Dockerfile) comes to 18 MB stripped. Dropping the largest grammars (Verilog, Fortran, COBOL) helps, but doesn't eliminate the issue - modern languages like C# and Kotlin also have large parsers because the generated code is inherently proportional to grammar complexity. The
Dynamic loading
I understand, and I respect the project's direction. Could you help me understand the reasoning? The grammar module use case seems quite different from general plugin/VFS module loading:
That said, if static is the firm decision, then
Grammar fetching and vendoring
Agreed. The configure-time download was added purely for developer convenience and is not essential. Grammar sources can be provided from any local directory via
For vendoring, I'd propose a separate repository (
This way:
Tree-sitter runtime vendoring
The tree-sitter runtime is small (~15 C source files, ~200KB compiled) and very stable. The C API (
Vendoring the entire tree-sitter runtime into MC's source tree is feasible and would eliminate the
C++ compiler requirement
Only one grammar (SQL) has a C++ scanner (
Next steps
Based on your feedback, I think the path forward is:
I'm happy to discuss any of these points further or adjust the approach based on your preferences.
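To make the fallback behaviour concrete for a statically linked build: the decision per file is just "use tree-sitter if a grammar was compiled in, otherwise keep the legacy engine". The following is only a rough sketch under assumptions. The table `ts_static_grammars`, the functions `ts_find_static_grammar` and `choose_highlighter`, and the stub entry point are hypothetical names for illustration, not code from this PR.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Opaque stand-in for tree-sitter's TSLanguage. The real type comes from
   <tree_sitter/api.h>, which this sketch deliberately does not require. */
typedef struct TSLanguage TSLanguage;
typedef const TSLanguage *(*ts_language_fn) (void);

typedef enum
{
    HL_TREE_SITTER,
    HL_LEGACY
} highlighter_kind;

/* Hypothetical registry entry: language name plus its entry point. */
typedef struct
{
    const char *name;
    ts_language_fn language;
} ts_grammar_entry;

/* Stub standing in for the generated tree_sitter_<name>() functions. */
static const TSLanguage *
stub_language (void)
{
    return NULL;
}

/* Grammars assumed to be linked into the binary (illustrative subset). */
const ts_grammar_entry ts_static_grammars[] = {
    { "c", stub_language },
    { "python", stub_language },
    { "bash", stub_language },
};

/* Linear lookup by language name; returns NULL when nothing is compiled in. */
const ts_grammar_entry *
ts_find_static_grammar (const char *name)
{
    size_t i;
    const size_t n = sizeof (ts_static_grammars) / sizeof (ts_static_grammars[0]);

    assert (name != NULL);
    for (i = 0; i < n; i++)
        if (strcmp (ts_static_grammars[i].name, name) == 0)
            return &ts_static_grammars[i];
    return NULL;
}

/* Prefer tree-sitter; fall back to the legacy engine when no grammar exists. */
highlighter_kind
choose_highlighter (const char *name)
{
    return ts_find_static_grammar (name) != NULL ? HL_TREE_SITTER : HL_LEGACY;
}
```

With a table like this, a language such as LaTeX simply takes the `HL_LEGACY` branch until a grammar is added to the table, which is what would allow the transition to be gradual.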
|
this is all just meta; i didn't look at the code.
the idea to vendor the runtime lib is nonsense. afaict, it would be the first external lib that mc vendors.
but i don't understand why the grammars have to be compiled by every user, rather than just being installed as binaries (plug-ins for a loader included in the runtime). this is totally bonkers from a packaging perspective, esp. considering how huge these things are. yet, debian ships only with -src packages (for a select set of grammars). the faq totally fails to address this fundamental question.
the hand-waving about the yet-unsupported grammars is a tad unconvincing. porting mc's (generally very simple) syntax defs sounds like a relatively minor (though probably boring) task, and a commitment to doing so (and therefore enabling the elimination of the internal engine) would be quite a boost in my eyes. similarly, a commitment to actually port the viewer would also be a major boost.
there is a flip side to eliminating the internal engine: given how huge the TS grammars are, this step would imply the removal of syntax hl for resource constrained systems. well, actually, there is a middle way: syntaxes that are actually relevant for such systems (we're talking home routers and such) are very simple (config files of various types) and could actually be included. so non-issue, i guess?
i don't understand yuri's opposition to c++, esp. as an optional build dependency for 3rd-party components. the only situation where this would have a practical impact is if a new-ish c++ version is required, as some niche/legacy systems fail to deliver in that regard. and of course, this is relevant only if the distribution question above does not get a good answer.
jiri, your answer has a whiff of chatgpt. i don't mind if you use it, but it has a bad taste when it's recognizable as such.
we like a nice clean git history. you are likely to end up in merge hell if you don't start cleaning up early on during the major refactoring of this PR.
|
Thanks for the feedback, @ossilator.
On the shared libraries point - yes, that's exactly what the prototype already implements. Each grammar compiles to a standalone
Regarding why grammars need to be compiled rather than distributed as binaries - they don't, in principle. The
On vendoring the tree-sitter runtime - I agree, that was not a good suggestion. Keeping it as a system dependency (
On the missing grammars - I can commit to expanding coverage from the existing tree-sitter grammar ecosystem (200+ grammars available). Most of the ~51 MC-only syntaxes are simple keyword/string/comment definitions that translate straightforwardly to tree-sitter queries, provided a grammar exists. For the handful of truly niche formats where no grammar exists, keeping CoolEdit as a fallback seems reasonable. CoolEdit is also lighter on CPU/memory than tree-sitter parsing, which could matter on embedded systems. Having
On the viewer - I can commit to extending tree-sitter highlighting to the viewer. The
On C++ - only one grammar (SQL) needs it. Happy to drop it or keep it, depending on what the project prefers.
On AI usage - yes, I used AI assistance for both the implementation and drafting the previous response. The design decisions are mine but the tooling helped enormously with the volume of work. I find the structured format contributes to better readability even if it has that recognizable feel.
On git history - I can squash and rebase the commits if that's how the project prefers PRs to be structured. Just let me know when the design discussion settles and I'll clean it up before final review.
|
yeah, i know that this is what you implemented, and i appreciate it (i also don't understand why yuri doesn't like the idea). my question was basically: why isn't this upstream? they could even provide multiple loaders (in separate libraries) if they don't want to commit to a particular toolkit, provided the resulting plugins are binary compatible. if they fail to do that, they work against their own adoption, given that each user needs to a) obtain, b) build, and c) install an own copy. i hate that as a user (both developer and end-user), and it goes against (some) distros' policy to unbundle dependencies. mind researching this further and potentially raising it with upstream?
|
Good question @ossilator. I did some more digging.
Tree-sitter upstream provides no loader or standard distribution mechanism. Their documentation only covers compile-time integration. Each editor rolls its own: Neovim compiles grammars at runtime via nvim-treesitter, Helix compiles at build time into the binary, and Emacs has a third-party project (tree-sitter-langs) that publishes pre-built
The Emacs bundle is interesting - it has 120 grammars as
The tree-sitter CLI can also build grammars (
I think the real gap isn't the loader (50 lines of
For MC specifically, I think our best path is the separate
I could raise the general distribution question with tree-sitter upstream if that's useful, though I suspect the answer will be that they consider it an editor concern rather than a core library concern.
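To put the loader size in perspective, here is a minimal sketch of what such a loader could look like. It uses plain POSIX `dlopen`/`dlsym` rather than the gmodule-2.0 wrapper this PR depends on, and it assumes the `mc-ts-<name>.so` module naming from the PR description together with tree-sitter's `tree_sitter_<name>` entry-point convention. The function names `ts_symbol_name` and `ts_load_grammar` are hypothetical, not taken from the PR or from upstream.

```c
#include <assert.h>
#include <dlfcn.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

typedef struct TSLanguage TSLanguage;   /* opaque; defined by tree-sitter */
typedef const TSLanguage *(*ts_language_fn) (void);

/* Build "tree_sitter_<name>", mapping '-' to '_' (for example, the C#
   grammar exports tree_sitter_c_sharp). Returns 0 on success, -1 if the
   result does not fit into out. */
int
ts_symbol_name (const char *name, char *out, size_t out_size)
{
    int n;
    char *p;

    assert (name != NULL && out != NULL);
    n = snprintf (out, out_size, "tree_sitter_%s", name);
    if (n < 0 || (size_t) n >= out_size)
        return -1;
    for (p = out; *p != '\0'; p++)
        if (*p == '-')
            *p = '_';
    return 0;
}

/* Load <dir>/mc-ts-<name>.so and return its language, or NULL so that the
   caller can fall back to legacy highlighting. On success the handle is
   intentionally left open: the language data lives inside the module. */
const TSLanguage *
ts_load_grammar (const char *dir, const char *name)
{
    char path[4096];
    char symbol[256];
    void *handle;
    ts_language_fn fn;
    int n;

    assert (dir != NULL && name != NULL);
    n = snprintf (path, sizeof (path), "%s/mc-ts-%s.so", dir, name);
    if (n < 0 || (size_t) n >= sizeof (path))
        return NULL;
    if (ts_symbol_name (name, symbol, sizeof (symbol)) != 0)
        return NULL;

    handle = dlopen (path, RTLD_NOW | RTLD_LOCAL);
    if (handle == NULL)
        return NULL;

    /* Standard idiom for converting dlsym's void * to a function pointer. */
    *(void **) (&fn) = dlsym (handle, symbol);
    if (fn == NULL)
    {
        dlclose (handle);
        return NULL;
    }
    return fn ();
}
```

A real loader would also want to compare the grammar's ABI version against the runtime before using it; that check is omitted here. On older glibc versions this needs `-ldl` at link time.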
|
the queries are application-specific, but syntax hl is part of the ts core, so installing the matching .scm files along with the .so files seems rather obvious at first sight. that emacs project's mission statement is basically exactly what we need, though in practice they intentionally deviate from the upstream queries. so go figure. 🤔 you are quite correct that this is a ts ecosystem problem. from the practical side, i'd probably fork the emacs project into the ts namespace and "de-emacsify" it. the emacs project in turn would then need to pin itself to specific versions of the central project, so they could ship matching vendored .scm files in an emacs-specific directory. plan b for mc would be to simply use the emacs project verbatim, but this doesn't feel quite right. having our own build integration is at best plan c as far as i'm concerned. |
Summary
.so module (~5 MB mc binary)
--with-tree-sitter-static)
.syntax highlighting when a tree-sitter grammar is not available
Motivation
MC's regex-based syntax highlighting is fundamentally limited by its line-oriented pattern matching. Languages with complex nesting, heredocs, multi-line strings, or context-dependent syntax (Bash, Perl, Ruby, Python, and many others) are notoriously difficult to highlight correctly with regexes. Users regularly encounter broken highlighting that propagates through the rest of the file from a single unmatched delimiter.
Tree-sitter solves this by parsing the actual AST of each language. Highlighting is always structurally correct because it operates on the parse tree, not on regex patterns. Incremental re-parsing makes it fast enough for interactive editing where only the changed portion of the tree is rebuilt.
Beyond correctness, tree-sitter enables features that are impossible with regex-based highlighting. Language injection allows one grammar to delegate to another: HTML files get proper JavaScript highlighting inside <script> tags and CSS inside <style> tags. Markdown fenced code blocks are highlighted according to their language tag. For example, a ```python block gets full Python highlighting, ```bash gets Bash, and so on for any installed grammar. This kind of multi-language highlighting is a natural fit for tree-sitter's AST approach and opens the door for similar injection support in other templating and polyglot file formats.
Build options
Grammar sources are automatically downloaded from upstream repositories during configure (pinned to specific commit SHAs for reproducibility).
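Pinning only gives reproducibility if every entry really is a full commit SHA rather than a tag or branch name, which can be re-pointed. A minimal sketch of such a check follows, assuming a hypothetical in-memory manifest and 40-character SHA-1 object names. The `grammar_pin` struct and both function names are illustrative and not part of this PR.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical manifest entry: grammar name plus its pinned revision. */
typedef struct
{
    const char *name;
    const char *revision;
} grammar_pin;

/* Return 1 if rev is a full 40-character hexadecimal SHA-1, 0 otherwise.
   Tags and branch names fail this test because they are not 40 hex digits. */
int
is_full_commit_sha (const char *rev)
{
    size_t i;

    if (rev == NULL || strlen (rev) != 40)
        return 0;
    for (i = 0; i < 40; i++)
        if (!isxdigit ((unsigned char) rev[i]))
            return 0;
    return 1;
}

/* Count the entries whose revision is not a full commit SHA. */
size_t
count_unpinned (const grammar_pin *pins, size_t n)
{
    size_t i;
    size_t bad = 0;

    assert (pins != NULL || n == 0);
    for (i = 0; i < n; i++)
        if (!is_full_commit_sha (pins[i].revision))
            bad++;
    return bad;
}
```

A check along these lines could run in CI and fail the build when `count_unpinned` is non-zero.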
Packaging
In shared mode, each grammar produces two files:
mc-ts-<name>.so -- loadable grammar module (installed to $(libdir)/mc/ts-grammars/)
<name>-highlights.scm -- highlight query file (installed to $(datadir)/mc/syntax-ts/queries/)
Distros can package these as separate optional packages (e.g., mc-ts-grammar-python), letting users install only the languages they need. When a grammar module is not installed, MC transparently falls back to legacy regex highlighting for that language.
Users can also build and install custom grammar modules in ~/.local/lib/mc/ts-grammars/ (with query files in ~/.local/share/mc/syntax-ts/queries/) without root access.
Note on grammar sources
This feature requires building tree-sitter grammar sources (downloaded C files from upstream projects) into shared libraries. This adds a build-time dependency on libtree-sitter and gmodule-2.0, and the responsibility of maintaining version pins for 61 grammar repositories. I believe the benefit of reliable, AST-based syntax highlighting, eliminating the class of bugs that regex highlighting cannot solve, justifies this added maintenance scope. The grammar sources are not vendored in the repository; they are downloaded at configure time and pinned to specific commits, making updates straightforward.
Test plan
./configure --with-tree-sitter && make && make check (shared mode, all grammars)
./configure --with-tree-sitter --with-tree-sitter-static && make && make check (static mode)
./configure && make (without tree-sitter, unchanged behavior)
<script> and <style> tags - JS and CSS should be highlighted
mc -e file and execute a macro script - should not hang anymore
See doc/TREE-SITTER for full documentation.
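For packagers, the install layout described under Packaging boils down to two path templates per grammar. The sketch below spells them out. The function names `ts_join`, `ts_module_path` and `ts_query_path` are hypothetical helpers for illustration only, and the base directories are passed in by the caller rather than read from the build configuration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Format "<base>/<subdir>/<prefix><name><suffix>" into out.
   Returns 0 on success, -1 if the result would not fit. */
int
ts_join (char *out, size_t out_size, const char *base, const char *subdir,
         const char *prefix, const char *name, const char *suffix)
{
    int n;

    assert (out != NULL && base != NULL && subdir != NULL);
    assert (prefix != NULL && name != NULL && suffix != NULL);
    n = snprintf (out, out_size, "%s/%s/%s%s%s", base, subdir, prefix, name, suffix);
    return (n < 0 || (size_t) n >= out_size) ? -1 : 0;
}

/* Grammar module: <libdir>/mc/ts-grammars/mc-ts-<name>.so */
int
ts_module_path (char *out, size_t out_size, const char *libdir, const char *name)
{
    return ts_join (out, out_size, libdir, "mc/ts-grammars", "mc-ts-", name, ".so");
}

/* Highlight query: <datadir>/mc/syntax-ts/queries/<name>-highlights.scm */
int
ts_query_path (char *out, size_t out_size, const char *datadir, const char *name)
{
    return ts_join (out, out_size, datadir, "mc/syntax-ts/queries", "", name,
                    "-highlights.scm");
}
```

Calling the same helpers with a per-user base such as `$HOME/.local/lib` and `$HOME/.local/share` yields the per-user locations mentioned above. In which order the system and per-user directories are searched is not stated in this description, so the sketch leaves that to the caller.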