diff --git a/src/SUMMARY.md b/src/SUMMARY.md
index 0692b3f433..2fa3622266 100644
--- a/src/SUMMARY.md
+++ b/src/SUMMARY.md
@@ -7,6 +7,7 @@
 - [Lexical structure](lexical-structure.md)
     - [Input format](input-format.md)
     - [Shebang](shebang.md)
+    - [Frontmatter](frontmatter.md)
     - [Keywords](keywords.md)
     - [Identifiers](identifiers.md)
     - [Comments](comments.md)
diff --git a/src/frontmatter.md b/src/frontmatter.md
new file mode 100644
index 0000000000..3fe03294d8
--- /dev/null
+++ b/src/frontmatter.md
@@ -0,0 +1,67 @@
+r[frontmatter]
+# Frontmatter
+
+r[frontmatter.intro]
+Frontmatter is an optional section of metadata whose syntax allows external tools to read it without parsing Rust.
+
+> [!EXAMPLE]
+>
+> ```rust,ignore
+> #!/bin/env cargo
+> --- cargo
+> package.edition = "2024"
+> ---
+>
+> fn main() {}
+> ```
+
+r[frontmatter.syntax]
+```grammar,lexer
+@root FRONTMATTER ->
+    WHITESPACE_ONLY_LINE*
+    !FRONTMATTER_INVALID
+    FRONTMATTER_MAIN
+
+WHITESPACE_ONLY_LINE -> (!LF WHITESPACE)* LF
+
+FRONTMATTER_INVALID -> (!LF WHITESPACE)+ `---` ^ ⊥
+
+FRONTMATTER_MAIN ->
+    `-`{n:3..=255} ^ FRONTMATTER_REST
+
+FRONTMATTER_REST ->
+    FRONTMATTER_FENCE_START
+    FRONTMATTER_LINE*
+    FRONTMATTER_FENCE_END
+
+FRONTMATTER_FENCE_START ->
+    MAYBE_INFOSTRING_OR_WS LF
+
+FRONTMATTER_FENCE_END ->
+    `-`{n} HORIZONTAL_WHITESPACE* ( LF | EOF )
+
+FRONTMATTER_LINE -> !`-`{n} ~[LF CR]* LF
+
+MAYBE_INFOSTRING_OR_WS ->
+    HORIZONTAL_WHITESPACE* INFOSTRING? HORIZONTAL_WHITESPACE*
+
+INFOSTRING -> (XID_Start | `_`) ( XID_Continue | `-` | `.` )*
+```
+
+r[frontmatter.position]
+Frontmatter may appear at the start of the file (after the optional [byte order mark]) or after a [shebang]. In either case, it may be preceded by [whitespace].
+
+r[frontmatter.fence]
+Frontmatter must start and end with a *fence*. Each fence must start at the beginning of a line. The opening fence must consist of at least 3 and no more than 255 hyphens (`-`). The closing fence must have exactly the same number of hyphens as the opening fence. The hyphens of either fence may be followed by [horizontal whitespace].
+
+r[frontmatter.infostring]
+The opening fence, after optional [horizontal whitespace], may be followed by an infostring that identifies the format or purpose of the body. An infostring may be followed by horizontal whitespace.
+
+r[frontmatter.body]
+No line in the body may start with a sequence of hyphens (`-`) equal to or longer than the opening fence. The body may not contain any carriage returns (that survive [CRLF normalization]).
+
+[byte order mark]: https://en.wikipedia.org/wiki/Byte_order_mark#UTF-8
+[CRLF normalization]: input.crlf
+[horizontal whitespace]: grammar-HORIZONTAL_WHITESPACE
+[shebang]: input-format.md#shebang-removal
+[whitespace]: whitespace.md
diff --git a/src/input-format.md b/src/input-format.md
index 88ab3658ba..1e5432feff 100644
--- a/src/input-format.md
+++ b/src/input-format.md
@@ -44,6 +44,25 @@ r[input.shebang]
 r[input.shebang.removal]
 If a [shebang] is present, it is removed from the input sequence (and is therefore ignored).
 
+r[input.frontmatter]
+## Frontmatter removal
+
+r[input.frontmatter.removal]
+If the remaining input begins with a [frontmatter] fence, optionally preceded by lines containing only [whitespace], the [frontmatter] and any preceding whitespace are removed.
+
+For example, given the following file:
+
+
+```rust,ignore
+--- cargo
+package.edition = "2024"
+---
+
+fn main() {}
+```
+
+The first three lines (the opening fence, body, and closing fence) would be removed, leaving an empty line followed by `fn main() {}`.
+
 r[input.tokenization]
 ## Tokenization
 
@@ -54,11 +73,12 @@ The resulting sequence of characters is then converted into tokens as described
 >
 > - Byte order mark removal.
 > - CRLF normalization.
-> - Shebang removal when invoked in an item context (as opposed to expression or statement contexts).
+> - Shebang and frontmatter removal when invoked in an item context (as opposed to expression or statement contexts).
 >
 > The [`include_str!`] and [`include_bytes!`] macros do not apply these transformations.
 
 [BYTE ORDER MARK]: https://en.wikipedia.org/wiki/Byte_order_mark#UTF-8
 [Crates and source files]: crates-and-source-files.md
+[frontmatter]: frontmatter.md
 [shebang]: shebang.md
 [whitespace]: whitespace.md
diff --git a/src/items/modules.md b/src/items/modules.md
index 3cc015025b..2164051f84 100644
--- a/src/items/modules.md
+++ b/src/items/modules.md
@@ -123,7 +123,7 @@ r[items.mod.attributes]
 ## Attributes on modules
 
 r[items.mod.attributes.intro]
-Modules, like all items, accept outer attributes. They also accept inner attributes: either after `{` for a module with a body, or at the beginning of the source file, after the optional BOM and shebang.
+Modules, like all items, accept outer attributes. They also accept inner attributes: either after `{` for a module with a body, or at the beginning of the source file, after the optional BOM, shebang, and frontmatter.
 
 r[items.mod.attributes.supported]
 The built-in attributes that have meaning on a module are [`cfg`], [`deprecated`], [`doc`], [the lint check attributes], [`path`], and [`no_implicit_prelude`]. Modules also accept macro attributes.
diff --git a/src/notation.md b/src/notation.md
index 7537c67ddc..ed21e6a386 100644
--- a/src/notation.md
+++ b/src/notation.md
@@ -45,6 +45,18 @@ Mizushima et al. introduced [cut operators][cut operator paper] to parsing expre
 
 The hard cut operator is necessary because some tokens in Rust begin with a prefix that is itself a valid token. For example, `c"` begins a C string literal, but `c` alone is a valid identifier. Without the cut, if `c"\0"` failed to lex as a C string literal (because null bytes are not allowed in C strings), the parser could backtrack and lex it as two tokens: the identifier `c` and the string literal `"\0"`. The [cut after `c"`] prevents this --- once the opening delimiter is recognized, the parser cannot go back. The same reasoning applies to [byte literals], [byte string literals], [raw string literals], and other literals with prefixes that are themselves valid tokens.
 
+r[notation.grammar.bottom]
+### The bottom rule
+
+In logic, ⊥ (*bottom*) represents *absurdity* --- a proposition that is always false. In type theory, it is the *empty type* --- a type with no inhabitants. The grammar borrows both senses: the rule ⊥ matches nothing --- not any character, not even the end of input.
+
+```grammar,notation
+// The bottom rule does not match anything.
+⊥ -> !(CHAR | EOF)
+```
+
+Placed after a [hard cut operator], ⊥ makes a rule fail unconditionally once the parser has committed past the cut. This gives the grammar a way to express recognition without acceptance. The parser identifies the input, commits so that no other alternative can be tried, and then rejects it. In the frontmatter grammar, for example, [FRONTMATTER_INVALID] uses `^ ⊥` to recognize an opening fence preceded by whitespace on the same line --- input that is close enough to frontmatter to rule out other interpretations but is not valid.
+
 r[notation.grammar.string-tables]
 ### String table productions
 
diff --git a/src/whitespace.md b/src/whitespace.md
index 7e16c51d41..da0d8502b5 100644
--- a/src/whitespace.md
+++ b/src/whitespace.md
@@ -16,6 +16,10 @@ WHITESPACE ->
     | U+2028 // Line separator
     | U+2029 // Paragraph separator
 
+HORIZONTAL_WHITESPACE ->
+      U+0009 // Horizontal tab, `'\t'`
+    | U+0020 // Space, `' '`
+
 TAB -> U+0009 // Horizontal tab, `'\t'`
 
 LF -> U+000A // Line feed, `'\n'`
@@ -26,6 +30,9 @@ CR -> U+000D // Carriage return, `'\r'`
 r[lex.whitespace.intro]
 Whitespace is any non-empty string containing only characters that have the [`Pattern_White_Space`] Unicode property.
 
+r[lex.whitespace.horizontal]
+[HORIZONTAL_WHITESPACE] is the horizontal space subset of [`Pattern_White_Space`] as categorized by [UAX #31, Section 4.1][uax31-4.1].
+
 r[lex.whitespace.token-sep]
 Rust is a "free-form" language, meaning that all forms of whitespace serve only to separate _tokens_ in the grammar, and have no semantic significance.
 
@@ -33,3 +40,4 @@ r[lex.whitespace.replacement]
 A Rust program has identical meaning if each whitespace element is replaced with any other legal whitespace element, such as a single space character.
 
 [`Pattern_White_Space`]: https://www.unicode.org/reports/tr31/
+[uax31-4.1]: https://www.unicode.org/reports/tr31/#Whitespace_and_Syntax
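
Illustrative note, not part of the patch: the `FRONTMATTER` grammar and the removal rule in `input-format.md` can be read as a small line-oriented scanner. The sketch below is one possible reading, not how rustc or any tool implements it. The names `Frontmatter`, `strip_frontmatter`, `is_whitespace`, `is_horizontal`, and `is_infostring` are hypothetical, and `char::is_alphabetic` / `char::is_alphanumeric` only approximate `XID_Start` / `XID_Continue`, since the standard library has no XID tables. It assumes the byte order mark and shebang are already removed and CRLF normalization has already run.

```rust
/// Outcome of looking for frontmatter at the start of the input.
#[derive(Debug, PartialEq)]
enum Frontmatter<'a> {
    /// No opening fence; the input is left untouched.
    Absent,
    /// Well-formed frontmatter; `rest` is what remains after removal.
    Present { info: Option<&'a str>, body: &'a str, rest: &'a str },
}

/// The `Pattern_White_Space` code points.
fn is_whitespace(c: char) -> bool {
    matches!(c, '\u{9}'..='\u{D}' | ' ' | '\u{85}' | '\u{200E}' | '\u{200F}' | '\u{2028}' | '\u{2029}')
}

/// HORIZONTAL_WHITESPACE: tab or space.
fn is_horizontal(c: char) -> bool {
    c == '\t' || c == ' '
}

/// Approximation of INFOSTRING (assumption: alphabetic/alphanumeric stand in
/// for XID_Start/XID_Continue).
fn is_infostring(s: &str) -> bool {
    let mut chars = s.chars();
    let first_ok = matches!(chars.next(), Some(c) if c == '_' || c.is_alphabetic());
    first_ok && chars.all(|c| matches!(c, '_' | '-' | '.') || c.is_alphanumeric())
}

fn strip_frontmatter(src: &str) -> Result<Frontmatter<'_>, &'static str> {
    // WHITESPACE_ONLY_LINE*: skip leading lines that hold only whitespace.
    let mut start = src;
    while let Some((line, tail)) = start.split_once('\n') {
        if !line.chars().all(is_whitespace) {
            break;
        }
        start = tail;
    }

    // FRONTMATTER_INVALID: whitespace before `---` on the same line is rejected.
    let unindented = start.trim_start_matches(|c: char| c != '\n' && is_whitespace(c));
    if unindented.len() < start.len() && unindented.starts_with("---") {
        return Err("opening fence must start at the beginning of a line");
    }

    // FRONTMATTER_MAIN: the opening fence is 3 to 255 hyphens.
    let n = start.bytes().take_while(|&b| b == b'-').count();
    if n < 3 {
        return Ok(Frontmatter::Absent);
    }
    if n > 255 {
        return Err("opening fence has more than 255 hyphens");
    }
    let fence = &start[..n];

    // Past the cut, every failure is an error rather than "no frontmatter".
    let (open, body) = match start[n..].split_once('\n') {
        Some(parts) => parts,
        None => return Err("unclosed frontmatter"),
    };
    let info = open.trim_matches(is_horizontal);
    if !info.is_empty() && !is_infostring(info) {
        return Err("invalid infostring");
    }

    // Walk the body line by line; `body_len` is the byte length consumed so far.
    let mut body_len = 0;
    loop {
        let (line, tail) = match body[body_len..].split_once('\n') {
            Some((line, tail)) => (line, Some(tail)),
            None => (&body[body_len..], None),
        };
        if let Some(after) = line.strip_prefix(fence) {
            // A line that starts with `n` hyphens must be the closing fence.
            if !after.chars().all(is_horizontal) {
                return Err("body line starts with a fence-length run of hyphens");
            }
            let info = if info.is_empty() { None } else { Some(info) };
            return Ok(Frontmatter::Present { info, body: &body[..body_len], rest: tail.unwrap_or("") });
        }
        if tail.is_none() {
            return Err("unclosed frontmatter");
        }
        if line.contains('\r') {
            return Err("carriage return in frontmatter body");
        }
        body_len += line.len() + 1;
    }
}
```

Applied to the example file from `input-format.md`, this sketch reports the infostring `cargo`, the single body line, and a remainder that begins with the empty line before `fn main() {}`. Returning `Err` after the opening fence is recognized mirrors the hard cut in `FRONTMATTER_MAIN`: the input is never reinterpreted as ordinary tokens.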