TextParser

TextParser is a high-performance, extensible text parsing library written in C. It uses regular expressions to define language grammars and generates a hierarchical Abstract Syntax Tree (AST) for parsed documents.

The project currently provides robust support for CFML (ColdFusion Markup Language) and JSON, with a flexible architecture that allows for easy addition of new language definitions.

Features

High Performance: Written in optimized C for fast parsing of large codebases.
Small Footprint: The library is designed to be small and easy to integrate into other projects.
Minimal Dependencies: The only external dependency is the PCRE2 library, used for regex matching.
Regex-Based Grammars: Define language syntax using flexible regular expressions.
Hierarchical AST: Generates a structured tree of tokens (textparser_token_item) representing the code structure.
Syntax Highlighting Support: Tokens track metadata like color, background, and flags, making it suitable for building syntax highlighters and editors.
Extensibility: Language definitions are decoupled from the core parsing logic. They are written in JSON and can be loaded at compile time (from a generated header file) or at runtime (from a JSON file).
Python Tooling: Includes Python scripts for: prototyping and validation of the core algorithm, generation of C header files, and other parser verification tools.

Project Structure

src/: Core C library implementation (textparser.c, adv_regex.c).
include/: Public header files (textparser.h).
cli/: Command-line tool mainly for testing and demonstrating the library.
definitions/: Language definitions (e.g., CFML, JSON).
python/: Python bindings, prototypes, and validation tools (validate_cfml.py).
tests/: Unit and integration tests.
ccat/: Utilities for text processing (e.g., color cat).

Build Instructions

Prerequisites

CMake (version 3.15 or higher)
Ninja build system
A C compiler (GCC/Clang)

Building

You can use the provided build script for a quick start:

./build.sh

Alternatively, you can build using standard CMake commands:

cmake -B build -G Ninja
cmake --build build

Artifacts (libraries and executables) will be output to the bin/ directory.

Installation

Arch Linux

textparser is available on the Arch User Repository (AUR). You can install it using an AUR helper like yay:

yay -S textparser

Or view the package details at https://aur.archlinux.org/packages/textparser.

Usage

CLI Tool

The textparser CLI tool can be used to parse files and visualize the resulting token tree.

bin/textparser path/to/file.cfm

C Library Integration

To use TextParser in your C project, include textparser.h and link against libtextparser.

Basic Example:

#include <textparser.h>
#include <stdio.h>

// Assume 'my_lang_definition' is defined elsewhere
extern const textparser_language_definition my_lang_definition;

int main(void) {
    textparser_defer(handle); // Auto-cleanup

    // Open a file
    int err = textparser_openfile("example.txt", TEXTPARSER_ENCODING_LATIN1, &handle);
    if (err) {
        fprintf(stderr, "Failed to open file\n");
        return 1;
    }

    // Parse using the language definition
    err = textparser_parse(handle, &my_lang_definition);
    if (err) {
        fprintf(stderr, "Parse error\n");
        return 1;
    }

    // Iterate through tokens
    for (textparser_token_item *item = textparser_get_first_token(handle); item != NULL; item = item->next) {
        // ... process item ...
    }
    
    return 0;
}
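
The loop above follows only the `next` pointer. Because the AST is hierarchical, a consumer will usually also want to descend into nested tokens. The following is a minimal sketch of such a depth-first walk. It uses a hypothetical stand-in struct, `demo_token`, with assumed `name` and `child` fields; only `next` is confirmed by the example above, so check textparser.h for the real field names of textparser_token_item before adapting it.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for textparser_token_item.
 * Only `next` appears in the README example; `name` and `child`
 * are assumptions made for illustration. */
typedef struct demo_token {
    const char *name;          /* token type name, e.g. "Object" */
    struct demo_token *child;  /* first nested token, or NULL */
    struct demo_token *next;   /* next sibling, or NULL */
} demo_token;

/* Prints one token per line, indented two spaces per nesting level,
 * and returns the number of tokens visited. */
size_t demo_walk(const demo_token *item, int depth, FILE *out)
{
    size_t count = 0;
    for (; item != NULL; item = item->next) {
        fprintf(out, "%*s%s\n", depth * 2, "", item->name);
        count += 1 + demo_walk(item->child, depth + 1, out);
    }
    return count;
}
```

Calling `demo_walk(root, 0, stdout)` on the first top-level token prints an indented outline of the tree. Recursion keeps the sketch short; for very deeply nested input an explicit stack may be the safer choice.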

Language Definition Example

TextParser uses a JSON-based format to define language grammars. This allows for defining complex syntax rules using regular expressions and hierarchical token structures.

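Here is a simplified example of what a JSON definition might look like (based on definitions/json_definition.json). Token types that are referenced but not shown here, such as Array, Key, ValueSeparator, and StringEscape, are omitted for brevity: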

{
  "name": "json",
  "version": 1.0,
  "startTokens": ["Object", "Array"],
  "tokens": {
    "Object": {
      "type": "StartStop",
      "startRegex": "{",
      "endRegex": "}",
      "textColor": "0xffd700",
      "nestedTokens": ["Key", "String", "Number", "ValueSeparator"]
    },
    "String": {
      "type": "StartStop",
      "startRegex": "\"",
      "endRegex": "\"",
      "textColor": "0xce9178",
      "nestedTokens": ["StringEscape"]
    },
    "Number": {
      "type": "SimpleToken",
      "startRegex": "-?\\d+(?:\\.\\d+)?",
      "textColor": "0xb5cea8"
    }
  }
}
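
The textColor values in the definition above are stored as hex strings such as "0xffd700". A highlighter that consumes them needs separate red, green, and blue components. The following is a minimal sketch of that decoding step, assuming the string is always "0x" followed by exactly six hex digits (RRGGBB). The names `demo_rgb` and `demo_parse_color` are hypothetical and not part of the TextParser API; the library may represent colors differently internally.

```c
#include <ctype.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper type, not part of the TextParser API. */
typedef struct {
    uint8_t r, g, b;
} demo_rgb;

/* Decodes a color string of the assumed form "0xRRGGBB".
 * Returns 0 on success and fills *out; returns -1 on malformed input. */
int demo_parse_color(const char *text, demo_rgb *out)
{
    if (text == NULL || out == NULL)
        return -1;
    if (text[0] != '0' || (text[1] != 'x' && text[1] != 'X'))
        return -1;

    const char *digits = text + 2;
    if (strlen(digits) != 6)
        return -1;
    for (size_t i = 0; i < 6; i++) {
        if (!isxdigit((unsigned char)digits[i]))
            return -1;
    }

    /* All six characters are hex digits, so strtoul cannot fail here. */
    unsigned long value = strtoul(digits, NULL, 16);
    out->r = (uint8_t)((value >> 16) & 0xFF);
    out->g = (uint8_t)((value >> 8) & 0xFF);
    out->b = (uint8_t)(value & 0xFF);
    return 0;
}
```

For the Object token above, `demo_parse_color("0xffd700", &c)` yields r = 0xff, g = 0xd7, b = 0x00. Validating the digits before calling `strtoul` rejects inputs such as a missing prefix or a short value instead of silently accepting them.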

Development and Verification

The python/ directory contains tools for verifying the parser's correctness, particularly for CFML.

validate_cfml.py: A validation script that compares the AST generated by this project against reference parsers (e.g., a Java-based CFML parser) to ensure high fidelity and correctness.

python3 python/validate_cfml.py /path/to/cfml/files

License

See LICENSE file for details.
