Add unicode encoding and decoding functionality

[Recommendations on requirements for {{unicode(encode,decode)}} #6177](https://github.com/orgs/projectdiscovery/discussions/6177)

① This test only covers Chinese text: it decodes Unicode-escaped Chinese using Go.
② The complete Unicode character data is published at https://www.unicode.org/versions/Unicode16.0.0/#Components
③ The end goal is for nuclei to decode Unicode escapes whenever they are matched. It would also help if nuclei could encode a given text for use during verification.
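
The encoding half of the request can be sketched in a few lines of Go. This is only an illustration under my own assumptions: `unicodeEncode` is a hypothetical name, not an existing nuclei function, and the choice to leave ASCII untouched and escape everything else is one possible convention.

```go
package main

import (
	"fmt"
	"strings"
)

// unicodeEncode is a hypothetical helper (not an existing nuclei function).
// It keeps ASCII characters as they are and writes every other rune as
// \uXXXX, or as \UXXXXXXXX when the code point is above U+FFFF.
func unicodeEncode(s string) string {
	var b strings.Builder
	for _, r := range s {
		switch {
		case r < 0x80:
			b.WriteRune(r)
		case r <= 0xFFFF:
			fmt.Fprintf(&b, `\u%04x`, r)
		default:
			fmt.Fprintf(&b, `\U%08x`, r)
		}
	}
	return b.String()
}

func main() {
	fmt.Println(unicodeEncode("百度一下 ok")) // \u767e\u5ea6\u4e00\u4e0b ok
}
```

The output uses the same `\uXXXX` / `\UXXXXXXXX` forms that Go string literals accept, so it can be decoded again with `strconv.Unquote`.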


I wrote a probe template, but some URLs return Unicode-escaped characters. For example, baidu.com shows "\u767e\u5ea6\u4e00\u4e0b\uff0c\u4f60\u5c31\u77e5\u9053".
![20250416104843](https://github.com/user-attachments/assets/487feb80-280a-4c24-a41e-cd1db44351a0)
After extracting the escaped text, I usually convert it back to Chinese on a conversion website (http://www.jsons.cn/unicode) to validate it.
![20250416105000](https://github.com/user-attachments/assets/13e50125-de98-43cb-820a-debdba205014)

Nuclei template (YAML):
id: alive-check-20250328

info:
  name: alive-check
  author: alive-check
  severity: info
  description: status test

http:
  - raw:
      - |
        GET / HTTP/1.1
        Host: {{Hostname}}

    matchers-condition: and
    matchers:
      - type: status
        status:
          - 200
    extractors:
      - type: regex
        name: title
        group: 1
        regex:
          - "<title>(.*?)</title>"
      - type: regex
        group: 1
        regex:
          - 'top.location.replace\("([^"]+)"\)'

After some research, I found that Go can decode Unicode escapes with its standard library. The test code in this issue reads a local "unicodelist.txt" file and decodes each line.
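
Before the file-based test, the core idea can be shown on a single string. The sketch below assumes the input contains nothing but escapes and plain text: it wraps the string in double quotes and lets `strconv.Unquote` interpret it as a Go string literal. `decodeEscapes` is a hypothetical name used only for this illustration.

```go
package main

import (
	"fmt"
	"strconv"
)

// decodeEscapes is a hypothetical helper: it wraps s in double quotes and lets
// strconv.Unquote parse it as a Go string literal, which resolves \uXXXX.
func decodeEscapes(s string) (string, error) {
	return strconv.Unquote(`"` + s + `"`)
}

func main() {
	// The escaped title from the baidu.com example in this issue.
	title := `\u767e\u5ea6\u4e00\u4e0b\uff0c\u4f60\u5c31\u77e5\u9053`
	decoded, err := decodeEscapes(title)
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	fmt.Println(decoded) // 百度一下，你就知道
}
```

`strconv.Unquote` returns an error for the whole string if it contains a raw double quote, a line break, or a malformed escape such as `\ua`, so real response bodies need the escapes to be located first, which is what the regex-based test code does.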

unicodelist.txt sample:

\u767e\u5ea6\u4e00\u4e0b\u006f\u006b\uff0c\u4f60\u5c31\u77e5\u9053\uff0c\ua\u662f\u0020\u0061\u0061
\u767e\u5ea6\u4e00\u4e0b\u006f\u006b
\u767e\u5ea6\u4e00\u4e0b\u006f\u006b
\u006f\u006b
![20250508093059](https://github.com/user-attachments/assets/16cad639-1a11-423c-be1e-b4e71b0de3f3)

Output of running main.go:
![20250508093303](https://github.com/user-attachments/assets/b80490c7-08a2-4a2c-bc8e-ca042105ccfb)

Go test code:
```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"regexp"
	"strconv"
	"strings"
)

// unicodeEscapeRe matches a run of \uXXXX or \UXXXXXXXX escapes.
var unicodeEscapeRe = regexp.MustCompile(`(\\u[0-9a-fA-F]{4}|\\U[0-9a-fA-F]{8})+`)

// isHexDigit reports whether c is an ASCII hexadecimal digit.
func isHexDigit(c byte) bool {
	return (c >= '0' && c <= '9') || (c >= 'a' && c <= 'f') || (c >= 'A' && c <= 'F')
}

// fixBrokenUnicode replaces each malformed \u escape (fewer than four hex
// digits, such as \ua) with a line break and leaves valid \uXXXX escapes untouched.
func fixBrokenUnicode(data string) string {
	var result strings.Builder
	i := 0
	for i < len(data) {
		if !strings.HasPrefix(data[i:], `\u`) {
			result.WriteByte(data[i])
			i++
			continue
		}
		// Count the hex digits that follow \u, up to four.
		j := i + 2
		for j < len(data) && j-i < 6 && isHexDigit(data[j]) {
			j++
		}
		if j-i == 6 {
			// Valid escape: copy it through unchanged.
			result.WriteString(data[i:j])
		} else {
			// Malformed escape: drop it and emit a line break instead.
			result.WriteString("\n")
		}
		i = j
	}
	return result.String()
}

// EscapeUnicode decodes every run of \uXXXX or \UXXXXXXXX escapes in data to
// UTF-8 (despite its name, it decodes rather than escapes). Each run is
// replaced in place, so a short run can no longer corrupt a longer run that
// contains it. Runs that strconv.Unquote rejects are left unchanged.
func EscapeUnicode(data []byte) []byte {
	return unicodeEscapeRe.ReplaceAllFunc(data, func(match []byte) []byte {
		str, err := strconv.Unquote(`"` + string(match) + `"`)
		if err != nil {
			return match
		}
		return []byte(str)
	})
}

func main() {
	file, err := os.Open("unicodelist.txt")
	if err != nil {
		fmt.Println("Failed to open the file:", err)
		return
	}
	defer file.Close()

	scanner := bufio.NewScanner(file)
	lineNum := 1
	for scanner.Scan() {
		line := scanner.Text()

		// Fix illegal Unicode escape (line breaks)
		fixedLine := fixBrokenUnicode(line)

		// Decode as UTF-8
		decoded := EscapeUnicode([]byte(fixedLine))

		fmt.Printf("line %d : %s\n", lineNum, string(decoded))
		lineNum++
	}

	if err := scanner.Err(); err != nil {
		fmt.Println("Error in reading the file:", err)
	}
}

```
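
One case the test code leaves undecoded is a UTF-16 surrogate pair such as `\ud83d\ude00`, the form JSON uses for characters above U+FFFF: `strconv.Unquote` rejects surrogate code points, so the run stays escaped. A possible pre-processing step, sketched here under my own assumptions (`decodeSurrogatePairs` is a hypothetical name), combines each pair with `utf16.DecodeRune` before the normal decoding runs.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"unicode/utf16"
)

// surrogatePairRe matches a high-surrogate escape (D800-DBFF) followed
// directly by a low-surrogate escape (DC00-DFFF).
var surrogatePairRe = regexp.MustCompile(
	`\\u([dD][89abAB][0-9a-fA-F]{2})\\u([dD][c-fC-F][0-9a-fA-F]{2})`)

// decodeSurrogatePairs is a hypothetical pre-processing step: it replaces
// each escaped surrogate pair with the single character the pair encodes.
func decodeSurrogatePairs(s string) string {
	return surrogatePairRe.ReplaceAllStringFunc(s, func(m string) string {
		parts := surrogatePairRe.FindStringSubmatch(m)
		// The regex guarantees four hex digits, so parsing cannot fail.
		hi, _ := strconv.ParseUint(parts[1], 16, 32)
		lo, _ := strconv.ParseUint(parts[2], 16, 32)
		return string(utf16.DecodeRune(rune(hi), rune(lo)))
	})
}

func main() {
	fmt.Println(decodeSurrogatePairs(`ok \ud83d\ude00`)) // ok 😀
}
```

Unpaired surrogates do not match the pattern and are passed through unchanged.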
