Add unicode encoding and decoding functionality #256

@vkusrorof

Recommendations on requirements for {{unicode(encode,decode)}} #6177

① This verification uses only Chinese text as Unicode test data and decodes it with a small Go program.
② The complete Unicode character data is available online: https://www.unicode.org/versions/Unicode16.0.0/#Components
③ The ultimate goal is for Unicode escapes to be decoded whenever they are matched. In addition, it would be helpful if nuclei could encode a specified text for use during verification.
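
The encode direction requested in ③ is not covered by any code in this issue. A minimal sketch of what it could look like in Go follows. `encodeUnicode` is a hypothetical name, and the choices to leave ASCII untouched and to emit lowercase hex are assumptions, not anything nuclei defines.

```go
package main

import (
	"fmt"
	"strings"
)

// encodeUnicode is a hypothetical helper: it rewrites every non-ASCII rune as
// a \uXXXX escape (or \UXXXXXXXX above U+FFFF) and copies ASCII through as is.
func encodeUnicode(s string) string {
	var b strings.Builder
	for _, r := range s {
		switch {
		case r < 0x80:
			b.WriteRune(r)
		case r <= 0xFFFF:
			fmt.Fprintf(&b, `\u%04x`, r)
		default:
			fmt.Fprintf(&b, `\U%08x`, r)
		}
	}
	return b.String()
}

func main() {
	fmt.Println(encodeUnicode("百度一下 ok")) // \u767e\u5ea6\u4e00\u4e0b ok
}
```

The `\U%08x` form matches what the Go decoder further down accepts. Output intended for JavaScript or JSON consumers would need surrogate pairs instead.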

I wrote a probe script, but the responses from some URLs contain Unicode-escaped characters. For example, baidu.com shows "\u767e\u5ea6\u4e00\u4e0b\uff0c\u4f60\u5c31\u77e5\u9053".
[image: 20250416104843]
After extracting the escaped Chinese text, I usually decode it with an online converter (http://www.jsons.cn/unicode) to validate it.
[image: 20250416105000]
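
The same conversion can be done locally instead of through a website. A minimal sketch, assuming the extracted text contains only well-formed \uXXXX escapes and no raw double quotes or line breaks. `decodeEscaped` is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"strconv"
)

// decodeEscaped is a hypothetical helper. Wrapping the text in double quotes
// turns it into a Go string literal that strconv.Unquote can interpret.
func decodeEscaped(escaped string) (string, error) {
	return strconv.Unquote(`"` + escaped + `"`)
}

func main() {
	// The escaped text from the baidu.com example in this issue.
	decoded, err := decodeEscaped(`\u767e\u5ea6\u4e00\u4e0b\uff0c\u4f60\u5c31\u77e5\u9053`)
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	fmt.Println(decoded) // 百度一下，你就知道
}
```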

yaml:
id: alive-check-20250328

info:
  name: alive-check
  author: alive-check
  severity: info
  description: status test

http:
  - raw:
      - |
        GET / HTTP/1.1
        Host: {{Hostname}}

    matchers-condition: and
    matchers:
      - type: status
        status:
          - 200

    extractors:
      - type: regex
        name: title
        group: 1
        regex:
          - "<title>(.*?)</title>"
      - type: regex
        group: 1
        regex:
          - 'top.location.replace\("([^"]+)"\)'
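
What the two extractors do can be sketched in plain Go, assuming nuclei evaluates them with Go's regexp package. The literal parentheses in the second pattern are escaped here, because an unescaped `(` opens a capture group instead of matching a parenthesis. `firstGroup` and the response body are made up for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	titleRe    = regexp.MustCompile(`<title>(.*?)</title>`)
	redirectRe = regexp.MustCompile(`top.location.replace\("([^"]+)"\)`)
)

// firstGroup returns capture group 1 of the first match, or "" if none.
func firstGroup(re *regexp.Regexp, body string) string {
	m := re.FindStringSubmatch(body)
	if m == nil {
		return ""
	}
	return m[1]
}

func main() {
	// Hypothetical response body, not taken from a real site.
	body := `<html><head><title>\u767e\u5ea6</title></head>` +
		`<script>top.location.replace("/login")</script></html>`

	fmt.Println(firstGroup(titleRe, body))    // \u767e\u5ea6
	fmt.Println(firstGroup(redirectRe, body)) // /login
}
```

The extracted title is still in escaped form, which is the gap this request is about.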

After some research, I found that Go supports Unicode decoding. The following test code reads a local "unicodelist.txt" file and decodes each line.

unicodelist.txt sample:

\u767e\u5ea6\u4e00\u4e0b\u006f\u006b\uff0c\u4f60\u5c31\u77e5\u9053\uff0c\ua\u662f\u0020\u0061\u0061
\u767e\u5ea6\u4e00\u4e0b\u006f\u006b
\u767e\u5ea6\u4e00\u4e0b\u006f\u006b
\u006f\u006b
[image: 20250508093059]

main.go run result:
[image: 20250508093303]

go test code:

package main

import (
	"bufio"
	"fmt"
	"os"
	"regexp"
	"strconv"
	"strings"
)

// Patterns are compiled once at package level instead of on every call or
// loop iteration.
var (
	hex4Re          = regexp.MustCompile(`^[0-9a-fA-F]{4}$`)
	unicodeEscapeRe = regexp.MustCompile(`(\\u[0-9a-fA-F]{4}|\\U[0-9a-fA-F]{8})+`)
)

// fixBrokenUnicode replaces each malformed \u escape (such as \ua, which has
// fewer than four hex digits) with a line break. Well-formed \uXXXX escapes
// are copied through unchanged.
func fixBrokenUnicode(data string) string {
	var result strings.Builder
	i := 0
	for i < len(data) {
		if strings.HasPrefix(data[i:], `\u`) {
			// Try to take the four hexadecimal characters after \u
			end := i + 6
			if end <= len(data) {
				hex := data[i+2 : end]
				if hex4Re.MatchString(hex) {
					result.WriteString(data[i:end])
					i = end
					continue
				}
			}
			// Malformed \u escape: skip the \u and any hex digits (up to four) that follow it
			j := i + 2
			for j < len(data) && j-i < 6 {
				if !((data[j] >= '0' && data[j] <= '9') ||
					(data[j] >= 'a' && data[j] <= 'f') ||
					(data[j] >= 'A' && data[j] <= 'F')) {
					break
				}
				j++
			}
			result.WriteString("\n")
			i = j // Continue after the malformed escape
		} else {
			result.WriteByte(data[i])
			i++
		}
	}
	return result.String()
}

// EscapeUnicode decodes runs of \uXXXX or \UXXXXXXXX escapes into UTF-8.
// Each run is replaced where it occurs, so a short run can no longer corrupt
// a longer run that contains it. Runs that strconv.Unquote rejects (for
// example, an escape that is not a valid code point) are left unchanged.
func EscapeUnicode(data []byte) []byte {
	return unicodeEscapeRe.ReplaceAllFunc(data, func(match []byte) []byte {
		str, err := strconv.Unquote(`"` + string(match) + `"`)
		if err != nil {
			return match
		}
		return []byte(str)
	})
}

func main() {
	file, err := os.Open("unicodelist.txt")
	if err != nil {
		fmt.Println("Failed to open the file:", err)
		return
	}
	defer file.Close()

	scanner := bufio.NewScanner(file)
	lineNum := 1
	for scanner.Scan() {
		line := scanner.Text()

		// Fix illegal Unicode escape (line breaks)
		fixedLine := fixBrokenUnicode(line)

		// Decode as UTF-8
		decoded := EscapeUnicode([]byte(fixedLine))

		fmt.Printf("line %d : %s\n", lineNum, string(decoded))
		lineNum++
	}

	if err := scanner.Err(); err != nil {
		fmt.Println("Error in reading the file:", err)
	}
}
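
One case the code above leaves undecoded is a character outside the Basic Multilingual Plane written as a UTF-16 surrogate pair (for example \ud83d\ude00), which is how JSON and JavaScript escape such characters. strconv.Unquote returns an error for a surrogate half, so the whole run stays escaped. A possible pre-pass is sketched below. `decodeSurrogatePairs` is a hypothetical helper, and whether nuclei should handle this case is an open question, not something this issue settles.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"unicode/utf16"
)

// surrogatePairRe matches an escaped high surrogate (D800-DBFF) immediately
// followed by an escaped low surrogate (DC00-DFFF).
var surrogatePairRe = regexp.MustCompile(
	`\\u([dD][89abAB][0-9a-fA-F]{2})\\u([dD][c-fC-F][0-9a-fA-F]{2})`)

// decodeSurrogatePairs replaces each escaped surrogate pair with the single
// character it encodes and leaves all other text, including ordinary \uXXXX
// escapes, untouched.
func decodeSurrogatePairs(s string) string {
	return surrogatePairRe.ReplaceAllStringFunc(s, func(m string) string {
		parts := surrogatePairRe.FindStringSubmatch(m)
		// The pattern guarantees valid hex, so parse errors cannot occur here.
		hi, _ := strconv.ParseUint(parts[1], 16, 32)
		lo, _ := strconv.ParseUint(parts[2], 16, 32)
		return string(utf16.DecodeRune(rune(hi), rune(lo)))
	})
}

func main() {
	fmt.Println(decodeSurrogatePairs(`\ud83d\ude00 ok`)) // 😀 ok
}
```

Running this before the \uXXXX decoding step would let the remaining escapes be handled as before.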
