Tree-sitter integration for better syntax highlighting #5067

Open
jtyr wants to merge 8 commits into MidnightCommander:master from jtyr:jtyr-ts

@jtyr jtyr commented Mar 14, 2026

Summary

  • Adds tree-sitter-based syntax highlighting as an alternative to MC's regex-based system
  • 61 languages supported with AST-accurate parsing (no more broken highlighting from unmatched quotes or nested constructs)
  • Language injection for HTML (JavaScript + CSS) and Markdown (inline elements + fenced code blocks with per-language highlighting)
  • Shared library build (default): each grammar is a separate .so module (~5 MB mc binary)
  • Static build option: all grammars linked into the binary (--with-tree-sitter-static)
  • Falls back to legacy .syntax highlighting when a tree-sitter grammar is not available

Motivation

MC's regex-based syntax highlighting is fundamentally limited by its line-oriented pattern matching. Languages with complex nesting, heredocs, multi-line strings, or context-dependent syntax (Bash, Perl, Ruby, Python, and many others) are notoriously difficult to highlight correctly with regexes. Users regularly encounter broken highlighting that propagates through the rest of the file from a single unmatched delimiter.

Tree-sitter solves this by parsing the actual AST of each language. Highlighting is always structurally correct because it operates on the parse tree, not on regex patterns. Incremental re-parsing makes it fast enough for interactive editing where only the changed portion of the tree is rebuilt.

Beyond correctness, tree-sitter enables features that are impossible with regex-based highlighting. Language injection allows one grammar to delegate to another: HTML files get proper JavaScript highlighting inside <script> tags and CSS inside <style> tags. Markdown fenced code blocks are highlighted according to their language tag. For example a ```python block gets full Python highlighting, ```bash gets Bash, and so on for any installed grammar. This kind of multi-language highlighting is a natural fit for tree-sitter's AST approach and opens the door for similar injection support in other templating and polyglot file formats.

Build options

# Shared mode (default) - grammars as .so modules, mc binary stays ~5 MB
./configure --with-tree-sitter

# Static mode - grammars linked into binary
./configure --with-tree-sitter --with-tree-sitter-static

# Build only specific grammars
./configure --with-tree-sitter --with-tree-sitter-grammars=c,python,bash

# Without tree-sitter (default, unchanged behavior)
./configure

Grammar sources are automatically downloaded from upstream repositories during configure (pinned to specific commit SHAs for reproducibility).

Packaging

In shared mode, each grammar produces two files:

  • mc-ts-<name>.so -- loadable grammar module (installed to $(libdir)/mc/ts-grammars/)
  • <name>-highlights.scm -- highlight query file (installed to $(datadir)/mc/syntax-ts/queries/)

Distros can package these as separate optional packages (e.g., mc-ts-grammar-python), letting users install only the languages they need. When a grammar module is not installed, MC transparently falls back to legacy regex highlighting for that language.

Users can also build and install custom grammar modules in ~/.local/lib/mc/ts-grammars/ (with query files in ~/.local/share/mc/syntax-ts/queries/) without root access.
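
As an illustration of the layout described above, here is a minimal sketch of how a lookup could build the two file names and probe a list of directories. The function names (`ts_module_path`, `ts_query_path`, `ts_find_module`) are hypothetical and not taken from the PR, and the search order (user directory before system directory) is an assumption. Only the file-name patterns come from the list above.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: build "<grammar_dir>/mc-ts-<name>.so".
   Returns 1 on success, 0 if the result does not fit into out. */
int
ts_module_path (char *out, size_t out_size, const char *grammar_dir, const char *name)
{
    int n = snprintf (out, out_size, "%s/mc-ts-%s.so", grammar_dir, name);

    return n >= 0 && (size_t) n < out_size;
}

/* Hypothetical helper: build "<query_dir>/<name>-highlights.scm". */
int
ts_query_path (char *out, size_t out_size, const char *query_dir, const char *name)
{
    int n = snprintf (out, out_size, "%s/%s-highlights.scm", query_dir, name);

    return n >= 0 && (size_t) n < out_size;
}

/* Probe the directories in order and leave the first existing module path
   in out. Returns 0 when no module is found, so that the caller can fall
   back to legacy highlighting. */
int
ts_find_module (const char *const *dirs, size_t n_dirs, const char *name,
                char *out, size_t out_size)
{
    for (size_t i = 0; i < n_dirs; i++)
    {
        FILE *f;

        if (!ts_module_path (out, out_size, dirs[i], name))
            continue;

        f = fopen (out, "rb");
        if (f != NULL)
        {
            fclose (f);
            return 1;
        }
    }

    return 0;
}
```

A caller would pass the expanded `~/.local/lib/mc/ts-grammars` directory first and the `$(libdir)/mc/ts-grammars` directory second, and treat a return value of 0 as "use the legacy .syntax rules".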

Note on grammar sources

This feature requires building tree-sitter grammar sources (downloaded C files from upstream projects) into shared libraries. This adds a build-time dependency on libtree-sitter and gmodule-2.0, and the responsibility of maintaining version pins for 61 grammar repositories. I believe the benefit of reliable, AST-based syntax highlighting, eliminating the class of bugs that regex highlighting cannot solve, justifies this added maintenance scope. The grammar sources are not vendored in the repository; they are downloaded at configure time and pinned to specific commits, making updates straightforward.

Test plan

  • ./configure --with-tree-sitter && make && make check (shared mode, all grammars)
  • ./configure --with-tree-sitter --with-tree-sitter-static && make && make check (static mode)
  • ./configure && make (without tree-sitter, unchanged behavior)
  • Open files in various languages and verify highlighting matches MC's legacy colors
  • Open an HTML file with <script> and <style> tags - JS and CSS should be highlighted
  • Open a Markdown file with fenced code blocks - code should be highlighted per language
  • mc -e file and execute a macro script - should not hang anymore

See doc/TREE-SITTER for full documentation.

Signed-off-by: Jiri Tyr <jiri.tyr@gmail.com>
@github-actions github-actions bot added needs triage Needs triage by maintainers prio: medium Has the potential to affect progress labels Mar 14, 2026
@github-actions github-actions bot added this to the Future Releases milestone Mar 14, 2026
@zyv zyv added area: mcedit mcedit, the built-in text editor and removed needs triage Needs triage by maintainers labels Mar 14, 2026
Signed-off-by: Jiri Tyr <jiri.tyr@gmail.com>
Contributor

@mc-worker mc-worker left a comment

Quick view. Not related to tree-sitter itself, but to the content of your PR.

Please remove *.po and Makefile.in from your PR.

|| (status_before_ok && status_after_ok && status_after.st_size != 0
&& (status_after.st_size != status_before.st_size
|| status_after.st_mtime != status_before.st_mtime));
#ifdef HAVE_STRUCT_STAT_ST_MTIM
Contributor

Please move this to a separate commit.

Author

I will create a separate PR for this.

{
char *fname;
char *macros_fname = NULL;
off_t start_mark, end_mark;
Contributor

Please move this to a separate commit.

Author

I will create a separate PR for this (together with the change in src/editor/edit.c).

unsigned int skip_detach_prompt : 1; // Do not prompt whether to detach a file anymore

// syntax highlighting
// syntax highlighting (tree-sitter)
Contributor

Can it be moved to a separate structure?

Author

I have moved it to a separate structure.

edit_syntax_rule_t rule;
} syntax_marker_t;

#ifdef HAVE_TREE_SITTER
Contributor

Can it be moved to a separate file?

Author

I have separated the TS syntax handling.

if (use_persistent_buffer)
clear_cwd_pipe ();
else
else if (mc_global.mc_run_mode == MC_RUN_FULL)
Contributor

Please move this to a separate commit.

Author

I will create a separate PR for this.

AC_CHECK_TOOLS([AR], [ar gar])

AC_PROG_CC
AC_PROG_CXX
Contributor

Do we really need C++ compiler?

Author

Yes, this is needed for the SQL grammar which has a C++ scanner.

AC_SUBST(LDFLAGS)
AC_SUBST(LIBS)

dnl ############################################################################
Contributor

Please move this to a separate m4 file.

Author

I have moved that to a separate m4 file.

jtyr added 6 commits March 15, 2026 12:39
Signed-off-by: Jiri Tyr <jiri.tyr@gmail.com>
Signed-off-by: Jiri Tyr <jiri.tyr@gmail.com>
Signed-off-by: Jiri Tyr <jiri.tyr@gmail.com>
Signed-off-by: Jiri Tyr <jiri.tyr@gmail.com>
Signed-off-by: Jiri Tyr <jiri.tyr@gmail.com>
Signed-off-by: Jiri Tyr <jiri.tyr@gmail.com>
@jtyr
Author

jtyr commented Mar 15, 2026

The last commit demonstrates how simple it is to add an extra language: I have added the QML language requested in PR #5022. All known parsers are listed on this page.

@zyv
Member

zyv commented Mar 15, 2026

Thanks for the PR!

Unfortunately, I don't think it is seriously reviewable as a blob, but it's nice to have a prototype on the basis of which we can have specific discussions and decide how we can move forward, instead of just hand-waving and doing some theoretical exercises.

Before we get into the technical details, I would like to discuss some general design and maintenance issues.

First off, I think it's certainly desirable to replace our internal regex-based highlighting engine if at all possible with something a little bit more maintainable and powerful (more grammar-based at least to the extent that nested constructs are supported). The priorities to my mind should be:

  1. Usable both in editor and viewer
  2. Written in C, slim, and if externally maintained, then vendorable
  3. Available on all supported configurations including embedded systems
  4. Supporting most common languages, ideally uncomplicated to add new languages

Non-goals would be:

  1. Ultimate highlighting quality

I would absolutely like to avoid maintaining several systems in parallel, or connections to several systems. This PR comes directly after #5065 - not sure if that's a coincidence. Anyway, I think we have the following options:

  1. Reject both, insist on improving CoolEdit engine to make it usable for both viewer and editor, at least go that far in the grammar direction as to support nested constructs, make code clean, maintainable, and testable
     • Realistically, I guess this is never going to happen, although probably the most appealing option
  2. Accept both, ending with 3 different highlighting systems: CoolEdit and Tree Sitter for editor, a bunch of external ones for the viewer bolted on via SGR1 parser
     • I don't want to have to maintain this
     • Bad user experience leading to maximally inconsistent highlighting between competing parts of the program
  3. Accept Tree-Sitter, extending it to the viewer
     • In this case, I would like to remove the CoolEdit engine even with some loss of functionality

I think (1) will happen on its own if our discussions end up nowhere, (2) is not something I can accept, and (3) maybe is what we should aim for and discuss how we get there.

I've had a cursory look at Tree Sitter (I haven't heard about it before). I think it looks pretty much like highlighting done the right way, although it still seems to not be without difficulties in the context of mc's requirements.

My first question is, are there any comparable alternatives at all, that is highlighting libraries that have a slim C core runtime and can generate parsers in C from grammars, or is it pretty much the only contender in this space?

If Tree Sitter is the only alternative, then I would like to clarify the integration aspects.


You mention that adding parsers for languages inflates the binary. How bad is it really, and what do the numbers mean? I'm surprised about the ~5M baseline for mc. On macOS, without optimization, with debugging information and unstripped, the executable is about 1M. I think we should consider only something equivalent to -O3 and without debugging information. The distributions split it anyway, and the ones looking for savings do more to the binaries.

So with -O3 and stripped, what would be the real difference between a static Tree Sitter build and a normal build? If the difference is "big", why is that, and can it be somehow optimized? The parsers should have a lot in common. Can parts of the code not be re-used?

Now, adding dynamically loaded parser libraries is not something I would accept. We've been moving in the opposite direction for years, and this would be a huge setback. If this is non-negotiable, this would kind of kill Tree Sitter for me.


The next big issue for me is the grammar fetching magic at configure time. This is absolutely unacceptable. We should be able to bootstrap self-contained tarballs, which have everything in them that is needed for the build.

My understanding is that it's not impossible to vendor Tree Sitter and parsers. Is there any reason why you didn't consider that? If it is possible, then we should think of what vendoring strategy is best.

How often does Tree Sitter runtime change and how big is that in terms of source code? Can we just get away with committing it to a subdirectory?

Same question regarding the parsers, can we just commit them or is it dozens of megabytes of generated source code changing often that will completely blow up our repository?

Are there any other meaningful approaches to that, as in create a separate repository for vendored Tree Sitter stuff and include it as a submodule and/or fetch a ZIP of a hash at the ./autogen.sh stage?

I understand your ideas about pinning to tags being reasonably secure, but we have other issues to consider. Our main git repository is on GitHub and we have mirrors. I want us to have reasonable control over what we depend upon. Yes, we already lost control of many things. But that's no reason to exacerbate the situation by several orders of magnitude.

What happens if GitLab-hosted projects shut down, like it happened to Bitbucket? What happens if one of the grammar maintainers stops taking his pills and nukes his repository? Someone takes over and changes the tag, because it was an unsigned one? When relying on so many moving parts, I'm afraid, we'll be dealing with this stuff all the time.

Also, if we come up with a good vendoring strategy (separate repository?), I would like us to have something in the CI generating a report of what the state of vendored grammars are, and ideally semi-automated updates, which we could e.g. do once before release, or immediately after releases, or sometime in the middle of the cycle without it being a major project like tar updates.

Any ideas how this can be done cheaply and reliably?


Finally, regarding grammar support, I would actually like to have a comparison table between what we have now and Tree Sitter. We have ~100 definitions of variable quality. I understand that Tree Sitter brings in ~60, including community stuff.

How much do we lose? How bad is that? Can you make a tabular overview? Is it realistic to lose as little as possible, or are there just no grammars for what we have, which is why you picked built-in, high-quality community grammars and stopped at that?


And one final point, requiring a C++ compiler is absolutely a no-go.

@jtyr
Author

jtyr commented Mar 15, 2026

Thank you for the thorough and thoughtful review, @zyv. These are exactly the right questions to ask, and I'm glad we're having this discussion with a working prototype rather than in the abstract.

I'll try to address each point. The full documentation is in doc/TREE-SITTER in the branch if you'd like more detail on any specific aspect.

Why tree-sitter?

> My first question is, are there any comparable alternatives at all?

No. Tree-sitter is uniquely positioned for MC's needs:

  • C runtime (~200KB shared library, ~15 source files) - no foreign language runtimes
  • Incremental parsing - re-parses only the changed portion of the tree, fast enough for real-time editing
  • Generated C parsers - each grammar compiles to plain C (one parser.c + optional scanner.c)
  • Large ecosystem - 200+ grammars maintained by the community, covering all mainstream languages
  • Battle-tested - used by Neovim, Helix, Zed, GitHub's code search, and many others

The alternatives don't fit: TextMate grammars are regex-based (same fundamental limitations as CoolEdit), Pygments is Python-only, Lezer is JavaScript-only. Tree-sitter is the only C-native option with grammar-based parsing.

Option 3: Replace CoolEdit with tree-sitter

I agree this is the right direction. The tree-sitter integration already falls back to legacy highlighting when a grammar is unavailable, so the transition can be gradual:

  1. Ship tree-sitter alongside CoolEdit (current state)
  2. As grammar coverage improves, mark CoolEdit as deprecated
  3. Eventually remove CoolEdit once the remaining gaps are filled

Viewer support

Tree-sitter can absolutely be extended to the viewer. The infrastructure is the same - ts_grammar_registry_lookup() finds the grammar, ts_query_new() compiles the highlight query, and ts_parser_parse_string() produces the tree. The viewer would parse the file once (or incrementally as the user scrolls), run the query cursor over the visible range, and produce color entries. The viewer's rendering would consume these the same way the editor does. The core tree-sitter code in syntax_ts.c is already separated from the editor-specific parts, which makes this extension straightforward.
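
To make the "visible range" step concrete, here is a minimal, self-contained sketch that models only the last part of that pipeline: taking already computed highlight spans and clipping them to the bytes currently on screen. The `hl_span_t` type and `clip_spans()` function are hypothetical names, not code from syntax_ts.c. With the real library, restricting the query itself would presumably be done with `ts_query_cursor_set_byte_range()` before iterating over the matches.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical color entry: a half-open byte range [start, end) and a color id. */
typedef struct
{
    uint32_t start;
    uint32_t end;
    int color;
} hl_span_t;

/* Copy the spans that overlap the window [view_start, view_end) into out,
   clipped to the window. Returns the number of spans written, which never
   exceeds out_cap. */
size_t
clip_spans (const hl_span_t *in, size_t n_in, uint32_t view_start, uint32_t view_end,
            hl_span_t *out, size_t out_cap)
{
    size_t n_out = 0;

    for (size_t i = 0; i < n_in && n_out < out_cap; i++)
    {
        hl_span_t s = in[i];

        if (s.end <= view_start || s.start >= view_end)
            continue;           /* entirely outside the window */

        if (s.start < view_start)
            s.start = view_start;
        if (s.end > view_end)
            s.end = view_end;

        out[n_out++] = s;
    }

    return n_out;
}
```

The viewer would call something like this once per redraw with the byte offsets of the first and last visible lines; an editor integration could reuse the same entries.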

Language coverage

Here's a comparison of what MC has today vs what tree-sitter covers:

| Category | Count |
|---|---|
| MC CoolEdit languages | ~104 |
| Tree-sitter languages in this PR | 63 (62 + markdown_inline) |
| MC-only (would lose highlighting) | ~51 |
| Tree-sitter only (new) | 3 (HCL/Terraform, QML, Scala) |

The ~51 MC-only languages are mostly niche or domain-specific formats:

  • Packaging/distro: Debian control/changelog, RPM spec, Ebuild, YUM repo
  • Obscure/legacy: Eiffel, Nemerle, Jal, Yabasic, B language, J language
  • Domain-specific: POV-Ray, Spice circuits, D-Link switch configs, PIC linker scripts
  • Internal: MC's own syntax/file highlighting definition formats

The notable gaps are LaTeX (no tree-sitter grammar with pre-built parser.c exists - only a scanner), DOS Batch, M4 macros, and Puppet. These could be added over time as tree-sitter grammars become available, or the community could contribute them.

All mainstream programming languages are covered with significantly better highlighting quality than CoolEdit - especially languages with complex syntax like Bash (heredocs, nested quoting), Perl, Ruby, and HTML (embedded JS/CSS).

The full list of available tree-sitter parsers is at https://github.com/tree-sitter/tree-sitter/wiki/List-of-parsers - there are 200+ grammars, many of which could fill the gaps.

Binary size

Here are the real numbers (Linux x86_64, -O2, stripped):

| Build | Size |
|---|---|
| MC without tree-sitter | 1.1 MB |
| MC + 22 common grammars (static) | 18 MB |
| MC + all 63 grammars (static) | 100 MB |

The size comes from the generated parser.c files. Some grammars are enormous:

| Grammar | parser.c size |
|---|---|
| Verilog | 46 MB |
| Fortran | 36 MB |
| OCaml | 36 MB |
| SQL | 36 MB |
| C# | 35 MB |
| COBOL | 31 MB |

A curated set of ~22 common languages (C, C++, Python, Bash, Go, Rust, Java, JS, TS, JSON, YAML, HTML, CSS, XML, Lua, Ruby, PHP, Perl, Markdown, Make, Dockerfile) comes to 18 MB stripped. Dropping the largest grammars (Verilog, Fortran, COBOL) helps, but doesn't eliminate the issue - modern languages like C# and Kotlin also have large parsers because the generated code is inherently proportional to grammar complexity.

The --with-tree-sitter-grammars=LIST option already allows selecting exactly which grammars to include, so distributions can choose their own subset.

Dynamic loading

> Adding dynamically loaded parser libraries is not something I would accept.

I understand, and I respect the project's direction. Could you help me understand the reasoning? The grammar module use case seems quite different from general plugin/VFS module loading:

  • Grammar .so files are pure data (parser tables + a single function), not code that hooks into MC internals
  • Missing modules are silently ignored (fallback to legacy highlighting)
  • No ABI stability concern - grammars only use the tree-sitter API, not MC's
  • The size difference is dramatic: 1.1 MB (shared) vs 18-100 MB (static)
  • It enables per-language packaging, which is how both Neovim and distros handle this

That said, if static is the firm decision, then --with-tree-sitter-grammars=LIST makes it workable. The distribution maintainer picks a reasonable default set, and users who want more can rebuild.

Grammar fetching and vendoring

> The grammar fetching magic at configure time is absolutely unacceptable. We should be able to bootstrap self-contained tarballs.

Agreed. The configure-time download was added purely for developer convenience and is not essential. Grammar sources can be provided from any local directory via TREE_SITTER_GRAMMARS_DIR=/path/to/grammars at configure time - this is already implemented and documented.

For vendoring, I'd propose a separate repository (mc-tree-sitter-grammars) that:

  1. Contains all grammar sources (the parser.c + scanner.c files, ~478 MB total)
  2. Has a CI workflow that periodically checks upstream grammar repos for new tags, downloads updates, compiles, and runs the query validation tests
  3. Produces versioned release tarballs (e.g., mc-tree-sitter-grammars-2026.1.tar.gz)
  4. MC's ./autogen.sh (not ./configure) fetches a pinned tarball from this repo, or it's included in make dist output

This way:

  • make dist produces a self-contained tarball - no network access needed to build from it
  • MC controls exactly which grammar versions are included via the pinned tarball hash
  • Grammar updates are vetted by CI before being tagged
  • If any upstream grammar repo disappears, the vendored copy in mc-tree-sitter-grammars is unaffected
  • No git submodule complexity - just a tarball URL and a hash in autogen.sh

Tree-sitter runtime vendoring

> How often does Tree Sitter runtime change and how big is that in terms of source code?

The tree-sitter runtime is small (~15 C source files, ~200KB compiled) and very stable. The C API (tree_sitter/api.h) has been backward-compatible for years. The current ABI version is 14-15. We already vendor the internal headers (parser.h, alloc.h, array.h) that grammars need but the system package doesn't install.
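
As a rough illustration of what the ABI version means in practice: a runtime accepts a grammar only if the ABI number the grammar was generated with lies inside the range the runtime supports. The sketch below is an assumption-laden model of that check, not code from the PR. `grammar_abi_supported()` is a hypothetical name, and the bounds are passed in because the real ones come from the installed tree_sitter/api.h (`TREE_SITTER_MIN_COMPATIBLE_LANGUAGE_VERSION` and `TREE_SITTER_LANGUAGE_VERSION`), while the grammar's own number would be read with `ts_language_version()` or its newer replacement.

```c
#include <stdint.h>

/* Hypothetical check: returns 1 when a grammar generated for ABI
   grammar_abi can be used by a runtime that supports the inclusive
   range [runtime_min, runtime_max], 0 otherwise. */
int
grammar_abi_supported (uint32_t grammar_abi, uint32_t runtime_min, uint32_t runtime_max)
{
    return grammar_abi >= runtime_min && grammar_abi <= runtime_max;
}
```

A loader could run this right after obtaining the language pointer and treat a failure the same way as a missing grammar, that is, fall back to legacy highlighting.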

Vendoring the entire tree-sitter runtime into MC's source tree is feasible and would eliminate the libtree-sitter system dependency. It would add roughly 15 C files to the build. For reference, Neovim uses tree-sitter as a system shared library dependency, but for MC's conservative approach, vendoring may be more appropriate.

C++ compiler requirement

> Requiring a C++ compiler is absolutely a no-go.

Only one grammar (SQL) has a C++ scanner (scanner.cc). All other 62 grammars are pure C. We can drop SQL to eliminate the C++ requirement entirely. If SQL support is desired later, the scanner could potentially be rewritten in C - it's typically a small file (~200 lines).

Next steps

Based on your feedback, I think the path forward is:

  1. Create mc-tree-sitter-grammars repository with CI-automated grammar updates and release tarballs
  2. Remove configure-time downloading, replace with autogen.sh tarball fetch + TREE_SITTER_GRAMMARS_DIR for builds from tarballs
  3. Drop SQL (C++ scanner) unless the scanner can be rewritten in C
  4. Consider vendoring the tree-sitter runtime to eliminate the system library dependency
  5. Keep the --with-tree-sitter-grammars=LIST for static builds to manage binary size
  6. Plan viewer integration as a follow-up

I'm happy to discuss any of these points further or adjust the approach based on your preferences.

@ossilator
Contributor

this is all just meta; i didn't look at the code.

the idea to vendor the runtime lib is nonsense. afaict, it would be the first external lib that mc vendors.

but i don't understand why the grammars have to be compiled by every user, rather than just being installed as binaries (plug-ins for a loader included in the runtime). this is totally bonkers from a packaging perspective, esp. considering how huge these things are. yet, debian ships only with -src packages (for a select set of grammars). the faq totally fails to address this fundamental question.
i can think of two reasons:

  • plug-in systems are platform dependent, which results in a lot of ugly code. as TS wants no dependencies, it can't just use e.g. glib's module stuff. but this would be addressable by a separate library that offers this "service". isn't there one?
  • the grammars might have compile-time options that cannot be implemented as runtime options without major performance impact.

the hand-waving about the yet-unsupported grammars is a tad unconvincing. porting mc's (generally very simple) syntax defs sounds like a relatively minor (though probably boring) task, and a commitment to doing so (and therefore enabling the elimination of the internal engine) would be quite a boost in my eyes.

similarly, a commitment to actually port the viewer would also be a major boost.

there is a flip side to eliminating the internal engine: given how huge the TS grammars are, this step would imply the removal of syntax hl for resource constrained systems. well, actually, there is a middle way: syntaxes that are actually relevant for such systems (we're talking home routers and such) are very simple (config files of various types) and could actually be included. so non-issue, i guess?

i don't understand yuri's opposition to c++, esp. as an optional build dependency for 3rd-party components. the only situation where this would have a practical impact is if a new-ish c++ version is required, as some niche/legacy systems fail to deliver in that regard. and of course, this is relevant only if the distribution question above does not get a good answer.

jiri, your answer has a whiff of chatgpt. i don't mind if you use it, but it has a bad taste when it's recognizable as such.

we like a nice clean git history. you are likely to end up in merge hell if you don't start cleaning up early on during the major refactoring of this PR.

@jtyr
Author

jtyr commented Mar 15, 2026

Thanks for the feedback, @ossilator.

On the shared libraries point - yes, that's exactly what the prototype already implements. Each grammar compiles to a standalone .so exporting a single tree_sitter_<name>() function. MC loads them on demand via g_module_open() (GLib's GModule, which MC already depends on). The loader is about 100 lines of code. Missing grammars are silently skipped with fallback to CoolEdit. So distros could package each grammar as a separate optional package - just an .so file and an .scm query file per language. Users install only what they need.
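
For readers unfamiliar with this pattern, here is a minimal sketch of such a loader. It is not the code from the PR: to stay self-contained without GLib it uses POSIX `dlopen()`/`dlsym()` as a stand-in for `g_module_open()`/`g_module_symbol()`, and `ts_symbol_name()` and `ts_load_grammar()` are hypothetical names. `TSLanguage` is declared here only as an opaque type. On older glibc versions linking may additionally need `-ldl`.

```c
#include <dlfcn.h>
#include <stddef.h>
#include <stdio.h>

/* Opaque stand-in for tree-sitter's language type. */
typedef struct TSLanguage TSLanguage;
typedef const TSLanguage *(*ts_language_fn) (void);

/* Build the exported symbol name "tree_sitter_<name>".
   Returns 1 on success, 0 if it does not fit into out. */
int
ts_symbol_name (char *out, size_t out_size, const char *name)
{
    int n = snprintf (out, out_size, "tree_sitter_%s", name);

    return n >= 0 && (size_t) n < out_size;
}

/* Load a grammar module and call its single entry point.
   Returns NULL when the module or the symbol is missing, so that the
   caller can fall back to legacy highlighting. */
const TSLanguage *
ts_load_grammar (const char *module_path, const char *name)
{
    char symbol[128];
    void *handle;
    ts_language_fn fn;

    if (!ts_symbol_name (symbol, sizeof (symbol), name))
        return NULL;

    handle = dlopen (module_path, RTLD_NOW | RTLD_LOCAL);
    if (handle == NULL)
        return NULL;            /* grammar not installed */

    /* Conversion idiom recommended by POSIX for dlsym() results. */
    *(void **) (&fn) = dlsym (handle, symbol);
    if (fn == NULL)
    {
        dlclose (handle);
        return NULL;
    }

    /* The handle is deliberately kept open: the returned language data
       lives inside the loaded module. */
    return fn ();
}
```

Calling `ts_load_grammar ("/path/to/mc-ts-python.so", "python")` would look up `tree_sitter_python`; a NULL result means "no grammar, use the old engine".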

Regarding why grammars need to be compiled rather than distributed as binaries - they don't, in principle. The .so files are the binary form. The question is who compiles them. Neovim delegates this to nvim-treesitter which downloads and compiles grammars at runtime. We could theoretically reuse that, but it's tightly coupled to Neovim and depends on Node.js tooling. I think MC is better off owning its grammar build pipeline - a dedicated repo with CI that produces tested release tarballs of .so files. That way make dist (or a distro build) can include prebuilt binaries without end users needing to compile anything.

On vendoring the tree-sitter runtime - I agree, that was not a good suggestion. Keeping it as a system dependency (libtree-sitter) is the right approach, consistent with how MC handles other libraries.

On the missing grammars - I can commit to expanding coverage from the existing tree-sitter grammar ecosystem (200+ grammars available). Most of the ~51 MC-only syntaxes are simple keyword/string/comment definitions that translate straightforwardly to tree-sitter queries, provided a grammar exists. For the handful of truly niche formats where no grammar exists, keeping CoolEdit as a fallback seems reasonable. CoolEdit is also lighter on CPU/memory than tree-sitter parsing, which could matter on embedded systems. Having --with-tree-sitter as an opt-in feature means resource-constrained builds can stick with CoolEdit entirely. I'd leave this decision to the maintainers.

On the viewer - I can commit to extending tree-sitter highlighting to the viewer. The syntax_ts.c separation was done with this in mind. The infrastructure (grammar lookup, query compilation, highlight cache) is reusable; it just needs a different integration point in the viewer's rendering path.

On C++ - only one grammar (SQL) needs it. Happy to drop it or keep it, depending on what the project prefers.

On AI usage - yes, I used AI assistance for both the implementation and drafting the previous response. The design decisions are mine but the tooling helped enormously with the volume of work. I find the structured format contributes to better readability even if it has that recognizable feel.

On git history - I can squash and rebase the commits if that's how the project prefers PRs to be structured. Just let me know when the design discussion settles and I'll clean it up before final review.

@ossilator
Contributor

yeah, i know that this is what you implemented, and i appreciate it (i also don't understand why yuri doesn't like the idea). my question was basically: why isn't this upstream? they could even provide multiple loaders (in separate libraries) if they don't want to commit to a particular toolkit, provided the resulting plugins are binary compatible. if they fail to do that, they work against their own adoption, given that each user needs to a) obtain, b) build, and c) install their own copy. i hate that as a user (both developer and end-user), and it goes against (some) distros' policy to unbundle dependencies. mind researching this further and potentially raising it with upstream?

@jtyr
Author

jtyr commented Mar 15, 2026

Good question @ossilator. I did some more digging.

Tree-sitter upstream provides no loader or standard distribution mechanism. Their documentation only covers compile-time integration. Each editor rolls its own: Neovim compiles grammars at runtime via nvim-treesitter, Helix compiles at build time into the binary, and Emacs has a third-party project (tree-sitter-langs) that publishes pre-built .so bundles.

The Emacs bundle is interesting - it has 120 grammars as .so files, built via CI for Linux/macOS/Windows (x86_64 + aarch64), ~13 MB per platform. The symbols are identical to ours (tree_sitter_python, etc.), so they're technically loadable. But there's a catch: query files (the .scm highlight rules) are tightly coupled to specific grammar versions because node names might change between releases. If the grammar version in the Emacs bundle doesn't match what our query file expects, highlighting silently breaks. So reusing someone else's pre-built binaries only works if you also pin the query files to the same grammar versions.

The tree-sitter CLI can also build grammars (tree-sitter build . in a grammar repo produces a .so), but that still requires obtaining the source and having a C toolchain.

I think the real gap isn't the loader (about 100 lines around g_module_open, as mentioned above) but the lack of a standard convention for grammar location and version pinning. Each editor needs its own query files matched to specific grammar versions, which is why everyone ends up building their own pipeline. A shared grammar registry with versioned query bundles would be the proper solution, but that's a tree-sitter ecosystem problem beyond what we can solve for MC.

For MC specifically, I think our best path is the separate mc-tree-sitter-grammars repo with CI that builds both the .so files and ships the matching .scm query files together as a versioned bundle. That gives us version consistency and takes the build burden off end users.

I could raise the general distribution question with tree-sitter upstream if that's useful, though I suspect the answer will be that they consider it an editor concern rather than a core library concern.

@ossilator
Contributor

the queries are application-specific, but syntax hl is part of the ts core, so installing the matching .scm files along with the .so files seems rather obvious at first sight.

that emacs project's mission statement is basically exactly what we need, though in practice they intentionally deviate from the upstream queries. so go figure. 🤔

you are quite correct that this is a ts ecosystem problem.
but it's somewhat crucial to address it to increase ts' acceptance by projects to which ts' added value doesn't justify the price that needs to be paid given the status quo. my role in mc is only advisory, but i strongly feel that mc falls into that category.
and it's clearly a problem that doesn't solve itself (i'll note here that ts is 13 years old and has already "survived" one lead developer change). so if you want to push things forward in mc, it would be advisable to take a leading role on the ts side as well, and maybe even become a maintainer of that sub-project.

from the practical side, i'd probably fork the emacs project into the ts namespace and "de-emacsify" it. the emacs project in turn would then need to pin itself to specific versions of the central project, so they could ship matching vendored .scm files in an emacs-specific directory.

plan b for mc would be to simply use the emacs project verbatim, but this doesn't feel quite right.

having our own build integration is at best plan c as far as i'm concerned.

Labels

area: mcedit mcedit, the built-in text editor prio: medium Has the potential to affect progress

4 participants