TextParser

TextParser is a high-performance, extensible text parsing library written in C. It uses regular expressions to define language grammars and generates a hierarchical Abstract Syntax Tree (AST) for parsed documents.

The project currently provides robust support for CFML (ColdFusion Markup Language) and JSON, with a flexible architecture that allows for easy addition of new language definitions.

Features

  • High Performance: Written in optimized C for fast parsing of large codebases.
  • Small Footprint: The library is designed to be small and easy to integrate into other projects.
  • Minimal Dependencies: The only external dependency is the PCRE2 library, used for regex matching.
  • Regex-Based Grammars: Define language syntax using flexible regular expressions.
  • Hierarchical AST: Generates a structured tree of tokens (textparser_token_item) representing the code structure.
  • Syntax Highlighting Support: Tokens track metadata like color, background, and flags, making it suitable for building syntax highlighters and editors.
  • Extensibility: Language definitions are decoupled from the core parsing logic. They are written in JSON and can be loaded at compile time (via a generated header file) or at runtime (by loading a JSON file).
  • Python Tooling: Includes Python scripts for prototyping and validating the core algorithm, generating C header files, and verifying parser output.
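
To illustrate how per-token color metadata could drive a terminal highlighter, here is a minimal sketch. It assumes colors are stored as 0xRRGGBB hex strings, as in the project's JSON definitions (for example "0xffd700"), and that the terminal understands 24-bit ANSI escape sequences. The demo_* functions are hypothetical helpers written for this example; they are not part of the TextParser API.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: parse a color string such as "0xffd700" into
 * 8-bit R, G, B components (assumed layout: 0xRRGGBB).
 * Returns 0 on success, -1 if the text is not a 24-bit hex value. */
static int demo_parse_color(const char *text, unsigned *r, unsigned *g, unsigned *b) {
    assert(text != NULL && r != NULL && g != NULL && b != NULL);
    char *end = NULL;
    /* Base 16 lets strtoul accept an optional "0x" prefix. */
    unsigned long value = strtoul(text, &end, 16);
    if (end == text || *end != '\0' || value > 0xFFFFFFUL) {
        return -1;
    }
    *r = (unsigned)((value >> 16) & 0xFF);
    *g = (unsigned)((value >> 8) & 0xFF);
    *b = (unsigned)(value & 0xFF);
    return 0;
}

/* Hypothetical helper: write the ANSI "set 24-bit foreground color"
 * escape sequence for `text` into `out`.
 * Returns 0 on success, -1 on a bad color or a too-small buffer. */
static int demo_ansi_foreground(const char *text, char *out, size_t out_size) {
    assert(out != NULL);
    unsigned r, g, b;
    if (demo_parse_color(text, &r, &g, &b) != 0) {
        return -1;
    }
    int n = snprintf(out, out_size, "\x1b[38;2;%u;%u;%um", r, g, b);
    return (n < 0 || (size_t)n >= out_size) ? -1 : 0;
}
```

A highlighter built this way would print the escape sequence, then the token text, then "\x1b[0m" to reset the terminal color.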

Project Structure

  • src/: Core C library implementation (textparser.c, adv_regex.c).
  • include/: Public header files (textparser.h).
  • cli/: Command-line tool mainly for testing and demonstrating the library.
  • definitions/: Language definitions (e.g., CFML, JSON).
  • python/: Python bindings, prototypes, and validation tools (validate_cfml.py).
  • tests/: Unit and integration tests.
  • ccat/: Utilities for text processing (e.g., color cat).

Build Instructions

Prerequisites

  • CMake (version 3.15 or higher)
  • Ninja build system
  • A C compiler (GCC/Clang)

Building

You can use the provided build script for a quick start:

./build.sh

Alternatively, you can build using standard CMake commands:

cmake -B build -G Ninja
cmake --build build

Artifacts (libraries and executables) will be output to the bin/ directory.

Installation

Arch Linux

textparser is available on the Arch User Repository (AUR). You can install it using an AUR helper like yay:

yay -S textparser

Or view the package details at https://aur.archlinux.org/packages/textparser.

Usage

CLI Tool

The textparser CLI tool can be used to parse files and visualize the resulting token tree.

bin/textparser path/to/file.cfm

C Library Integration

To use TextParser in your C project, include textparser.h and link against libtextparser.

Basic Example:

#include <textparser.h>
#include <stdio.h>

// Assume 'my_lang_definition' is defined elsewhere
extern const textparser_language_definition my_lang_definition;

int main(void) {
    textparser_defer(handle); // Auto-cleanup

    // Open a file
    int err = textparser_openfile("example.txt", TEXTPARSER_ENCODING_LATIN1, &handle);
    if (err) {
        fprintf(stderr, "Failed to open file\n");
        return 1;
    }

    // Parse using the language definition
    err = textparser_parse(handle, &my_lang_definition);
    if (err) {
        fprintf(stderr, "Parse error\n");
        return 1;
    }

    // Iterate through tokens
    for (textparser_token_item *item = textparser_get_first_token(handle); item != NULL; item = item->next) {
        // ... process item ...
    }
    
    return 0;
}
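
The loop above visits tokens through their next pointer. Since the AST is hierarchical, a full traversal also has to descend into nested tokens. The following sketch shows one way to do that with a recursive walk. It uses a hypothetical demo_token struct with an assumed child pointer rather than the real textparser_token_item, whose exact fields should be checked in textparser.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for a token node; NOT the library's
 * textparser_token_item. The `child` field is an assumption. */
typedef struct demo_token {
    const char *name;          /* token name, e.g. "Object" */
    struct demo_token *child;  /* first nested token, or NULL */
    struct demo_token *next;   /* next sibling, or NULL */
} demo_token;

/* Count every node reachable from `item`:
 * siblings are walked iteratively, children recursively. */
static size_t demo_count_tokens(const demo_token *item) {
    size_t count = 0;
    for (; item != NULL; item = item->next) {
        count += 1 + demo_count_tokens(item->child);
    }
    return count;
}

/* Print the tree, indenting two spaces per nesting level. */
static void demo_print_tree(const demo_token *item, int depth) {
    assert(depth >= 0);
    for (; item != NULL; item = item->next) {
        printf("%*s%s\n", depth * 2, "", item->name);
        demo_print_tree(item->child, depth + 1);
    }
}
```

Calling demo_print_tree(root, 0) on a tree where an "Object" token contains a "String" and a "Number" prints "Object" at the left margin, followed by the two nested names indented by two spaces.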

Language Definition Example

TextParser uses a JSON-based format to define language grammars. This allows for defining complex syntax rules using regular expressions and hierarchical token structures.

Here is a simplified example of what a JSON definition might look like (based on definitions/json_definition.json):

{
  "name": "json",
  "version": 1.0,
  "startTokens": ["Object", "Array"],
  "tokens": {
    "Object": {
      "type": "StartStop",
      "startRegex": "{",
      "endRegex": "}",
      "textColor": "0xffd700",
      "nestedTokens": ["Key", "String", "Number", "ValueSeparator"]
    },
    "String": {
      "type": "StartStop",
      "startRegex": "\"",
      "endRegex": "\"",
      "textColor": "0xce9178",
      "nestedTokens": ["StringEscape"]
    },
    "Number": {
      "type": "SimpleToken",
      "startRegex": "-?\\d+(?:\\.\\d+)?",
      "textColor": "0xb5cea8"
    }
  }
}
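
For the compile-time route, a definition like the one above could be represented in C as a static table. The sketch below is only an illustration of that idea: the demo_token_def struct, the enum, and demo_find_token are hypothetical names invented here, and the header actually generated by the project's Python scripts may be laid out quite differently.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical token kinds, named after the "type" values in the JSON. */
typedef enum { DEMO_SIMPLE_TOKEN, DEMO_START_STOP } demo_token_type;

/* Hypothetical compile-time token definition; NOT the library's struct. */
typedef struct {
    const char *name;
    demo_token_type type;
    const char *start_regex;
    const char *end_regex;  /* NULL when the token has no end pattern */
    unsigned text_color;    /* assumed layout: 0xRRGGBB */
} demo_token_def;

/* The three tokens from the JSON example, as a constant table. */
static const demo_token_def demo_json_tokens[] = {
    {"Object", DEMO_START_STOP, "{", "}", 0xffd700},
    {"String", DEMO_START_STOP, "\"", "\"", 0xce9178},
    {"Number", DEMO_SIMPLE_TOKEN, "-?\\d+(?:\\.\\d+)?", NULL, 0xb5cea8},
};

/* Look up a definition by name; returns NULL if it is not in the table. */
static const demo_token_def *demo_find_token(const char *name) {
    assert(name != NULL);
    size_t count = sizeof demo_json_tokens / sizeof demo_json_tokens[0];
    for (size_t i = 0; i < count; i++) {
        if (strcmp(demo_json_tokens[i].name, name) == 0) {
            return &demo_json_tokens[i];
        }
    }
    return NULL;
}
```

A constant table like this needs no JSON parsing at startup, which is the main appeal of a generated header. Loading the JSON file at runtime trades that for the ability to change a grammar without recompiling.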

Development and Verification

The python/ directory contains tools for verifying the parser's correctness, particularly for CFML.

  • validate_cfml.py: A validation script that compares the AST generated by this project against reference parsers (e.g., a Java-based CFML parser) to ensure fidelity and correctness.
python3 python/validate_cfml.py /path/to/cfml/files

License

See LICENSE file for details.