1 change: 1 addition & 0 deletions src/SUMMARY.md
@@ -7,6 +7,7 @@
- [Lexical structure](lexical-structure.md)
    - [Input format](input-format.md)
    - [Shebang](shebang.md)
    - [Frontmatter](frontmatter.md)
    - [Keywords](keywords.md)
    - [Identifiers](identifiers.md)
    - [Comments](comments.md)
67 changes: 67 additions & 0 deletions src/frontmatter.md
@@ -0,0 +1,67 @@
r[frontmatter]
# Frontmatter

r[frontmatter.intro]
Frontmatter is an optional section of metadata whose syntax allows external tools to read it without parsing Rust.

> [!EXAMPLE]
> <!-- ignore: test runner doesn't support frontmatter -->
> ```rust,ignore
> #!/bin/env cargo
> --- cargo
> package.edition = "2024"
> ---
>
> fn main() {}
> ```

r[frontmatter.syntax]
```grammar,lexer
@root FRONTMATTER ->
    WHITESPACE_ONLY_LINE*
    !FRONTMATTER_INVALID
    FRONTMATTER_MAIN

WHITESPACE_ONLY_LINE -> (!LF WHITESPACE)* LF

FRONTMATTER_INVALID -> (!LF WHITESPACE)+ `---` ^ ⊥

FRONTMATTER_MAIN ->
    `-`{n:3..=255} ^ FRONTMATTER_REST

FRONTMATTER_REST ->
    FRONTMATTER_FENCE_START
    FRONTMATTER_LINE*
    FRONTMATTER_FENCE_END

FRONTMATTER_FENCE_START ->
    MAYBE_INFOSTRING_OR_WS LF

FRONTMATTER_FENCE_END ->
    `-`{n} HORIZONTAL_WHITESPACE* ( LF | EOF )

FRONTMATTER_LINE -> !`-`{n} ~[LF CR]* LF

MAYBE_INFOSTRING_OR_WS ->
    HORIZONTAL_WHITESPACE* INFOSTRING? HORIZONTAL_WHITESPACE*

INFOSTRING -> (XID_Start | `_`) ( XID_Continue | `-` | `.` )*
```

r[frontmatter.position]
Frontmatter may appear at the start of the file (after the optional [byte order mark]) or after a [shebang]. In either case, it may be preceded by [whitespace].

r[frontmatter.fence]
Frontmatter must start and end with a *fence*. Each fence must start at the beginning of a line. The opening fence must consist of at least 3 and no more than 255 hyphens (`-`). The closing fence must have exactly the same number of hyphens as the opening fence. The hyphens of either fence may be followed by [horizontal whitespace].

r[frontmatter.infostring]
The opening fence, after optional [horizontal whitespace], may be followed by an infostring that identifies the format or purpose of the body. An infostring may be followed by horizontal whitespace.

r[frontmatter.body]
No line in the body may start with a run of hyphens (`-`) that is as long as or longer than the run of hyphens in the opening fence. The body may not contain any carriage returns (that survive [CRLF normalization]).
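
The fence and infostring rules above can be sketched as a small line classifier. This is an illustrative sketch, not the compiler's implementation: `OpeningFence`, `parse_opening_fence`, and `is_closing_fence` are hypothetical names, each function takes a single line without its trailing LF, and `char::is_alphabetic` and `char::is_alphanumeric` stand in for `XID_Start` and `XID_Continue`, which the standard library does not expose. The sketch also returns `None` both for "not a fence" and for "malformed fence", a distinction the grammar does make.

```rust
/// Hypothetical result type: the hyphen count and optional infostring
/// of an opening fence.
#[derive(Debug, PartialEq)]
struct OpeningFence<'a> {
    hyphens: usize,
    infostring: Option<&'a str>,
}

/// HORIZONTAL_WHITESPACE: tab or space.
fn is_horizontal_whitespace(c: char) -> bool {
    c == '\t' || c == ' '
}

/// Approximation of `XID_Start | '_'` (assumption: std has no XID tables).
fn is_infostring_start(c: char) -> bool {
    c.is_alphabetic() || c == '_'
}

/// Approximation of `XID_Continue | '-' | '.'`.
fn is_infostring_continue(c: char) -> bool {
    c.is_alphanumeric() || c == '_' || c == '-' || c == '.'
}

/// Number of hyphens at the start of `line`.
fn leading_hyphens(line: &str) -> usize {
    line.chars().take_while(|&c| c == '-').count()
}

/// Parses one line (without its LF) as an opening fence.
fn parse_opening_fence(line: &str) -> Option<OpeningFence<'_>> {
    let hyphens = leading_hyphens(line);
    if !(3..=255).contains(&hyphens) {
        return None;
    }
    // A hyphen is one byte in UTF-8, so the count is also a byte offset.
    let rest = line[hyphens..].trim_matches(is_horizontal_whitespace);
    if rest.is_empty() {
        return Some(OpeningFence { hyphens, infostring: None });
    }
    let mut chars = rest.chars();
    let first = chars.next()?;
    if is_infostring_start(first) && chars.all(is_infostring_continue) {
        Some(OpeningFence { hyphens, infostring: Some(rest) })
    } else {
        None
    }
}

/// Checks that `line` is exactly `hyphens` hyphens followed only by
/// horizontal whitespace.
fn is_closing_fence(line: &str, hyphens: usize) -> bool {
    let count = leading_hyphens(line);
    count == hyphens && line[count..].chars().all(is_horizontal_whitespace)
}
```

Under these assumptions, `parse_opening_fence("--- cargo")` yields three hyphens with the infostring `cargo`, while `--- two words` is rejected because an infostring cannot contain a space. A body line that starts with at least as many hyphens as the opening fence but fails `is_closing_fence` violates the body rule.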

[byte order mark]: https://en.wikipedia.org/wiki/Byte_order_mark#UTF-8
[CRLF normalization]: input.crlf
[horizontal whitespace]: grammar-HORIZONTAL_WHITESPACE
[shebang]: input-format.md#shebang-removal
[whitespace]: whitespace.md
22 changes: 21 additions & 1 deletion src/input-format.md
@@ -44,6 +44,25 @@ r[input.shebang]
r[input.shebang.removal]
If a [shebang] is present, it is removed from the input sequence (and is therefore ignored).

r[input.frontmatter]
## Frontmatter removal

r[input.frontmatter.removal]
If the remaining input begins with a [frontmatter] fence, optionally preceded by lines containing only [whitespace], the [frontmatter] and any preceding whitespace are removed.

For example, given the following file:

<!-- ignore: test runner doesn't support frontmatter -->
```rust,ignore
--- cargo
package.edition = "2024"
---

fn main() {}
```

The first three lines (the opening fence, body, and closing fence) would be removed, leaving an empty line followed by `fn main() {}`.
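
The removal step can be sketched as follows. This is a simplified illustration, not the compiler's implementation: `remove_frontmatter` and its helpers are hypothetical names, the input is assumed to have already been through byte order mark removal, CRLF normalization, and shebang removal, the infostring is not validated, and malformed frontmatter is returned unchanged where a real implementation would report an error.

```rust
/// `Pattern_White_Space` code points, as listed in the WHITESPACE production.
fn is_pattern_white_space(c: char) -> bool {
    matches!(
        c,
        '\u{0009}'..='\u{000D}'
            | '\u{0020}'
            | '\u{0085}'
            | '\u{200E}'
            | '\u{200F}'
            | '\u{2028}'
            | '\u{2029}'
    )
}

/// Number of hyphens at the start of `line`.
fn leading_hyphens(line: &str) -> usize {
    line.bytes().take_while(|&b| b == b'-').count()
}

/// Returns the input with leading whitespace-only lines and the frontmatter
/// removed, or the input unchanged if it does not begin with frontmatter.
fn remove_frontmatter(src: &str) -> &str {
    // Skip leading lines that contain only whitespace.
    let mut rest = src;
    while let Some((line, tail)) = rest.split_once('\n') {
        if !line.chars().all(is_pattern_white_space) {
            break;
        }
        rest = tail;
    }

    // The opening fence needs 3 to 255 hyphens and a terminating LF.
    let n = leading_hyphens(rest);
    if !(3..=255).contains(&n) {
        return src;
    }
    let mut tail = match rest.split_once('\n') {
        Some((_opening_fence, tail)) => tail,
        None => return src,
    };

    // Scan body lines until the closing fence.
    loop {
        let (line, next, at_eof) = match tail.split_once('\n') {
            Some((line, next)) => (line, next, false),
            None => (tail, "", true),
        };
        if leading_hyphens(line) >= n {
            // A line starting with `n` or more hyphens must be the closing
            // fence: exactly `n` hyphens, then only horizontal whitespace.
            let closes = leading_hyphens(line) == n
                && line[n..].chars().all(|c| c == ' ' || c == '\t');
            return if closes { next } else { src };
        }
        if at_eof || line.contains('\r') {
            // Unclosed frontmatter or a stray carriage return in the body.
            return src;
        }
        tail = next;
    }
}
```

For a file that has an empty line between the closing fence and `fn main() {}`, the function returns that empty line followed by `fn main() {}`, because the closing fence consumes only its own line terminator.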

r[input.tokenization]
## Tokenization

@@ -54,11 +73,12 @@ The resulting sequence of characters is then converted into tokens as described
>
> - Byte order mark removal.
> - CRLF normalization.
> - Shebang removal when invoked in an item context (as opposed to expression or statement contexts).
> - Shebang and frontmatter removal when invoked in an item context (as opposed to expression or statement contexts).
>
> The [`include_str!`] and [`include_bytes!`] macros do not apply these transformations.

[BYTE ORDER MARK]: https://en.wikipedia.org/wiki/Byte_order_mark#UTF-8
[Crates and source files]: crates-and-source-files.md
[frontmatter]: frontmatter.md
[shebang]: shebang.md
[whitespace]: whitespace.md
2 changes: 1 addition & 1 deletion src/items/modules.md
@@ -123,7 +123,7 @@ r[items.mod.attributes]
## Attributes on modules

r[items.mod.attributes.intro]
Modules, like all items, accept outer attributes. They also accept inner attributes: either after `{` for a module with a body, or at the beginning of the source file, after the optional BOM and shebang.
Modules, like all items, accept outer attributes. They also accept inner attributes: either after `{` for a module with a body, or at the beginning of the source file, after the optional BOM, shebang, and frontmatter.

r[items.mod.attributes.supported]
The built-in attributes that have meaning on a module are [`cfg`], [`deprecated`], [`doc`], [the lint check attributes], [`path`], and [`no_implicit_prelude`]. Modules also accept macro attributes.
12 changes: 12 additions & 0 deletions src/notation.md
@@ -45,6 +45,18 @@ Mizushima et al. introduced [cut operators][cut operator paper] to parsing expre

The hard cut operator is necessary because some tokens in Rust begin with a prefix that is itself a valid token. For example, `c"` begins a C string literal, but `c` alone is a valid identifier. Without the cut, if `c"\0"` failed to lex as a C string literal (because null bytes are not allowed in C strings), the parser could backtrack and lex it as two tokens: the identifier `c` and the string literal `"\0"`. The [cut after `c"`] prevents this --- once the opening delimiter is recognized, the parser cannot go back. The same reasoning applies to [byte literals], [byte string literals], [raw string literals], and other literals with prefixes that are themselves valid tokens.

r[notation.grammar.bottom]
### The bottom rule

In logic, ⊥ (*bottom*) represents *absurdity* --- a proposition that is always false. In type theory, it is the *empty type* --- a type with no inhabitants. The grammar borrows both senses: the rule ⊥ matches nothing --- not any character, not even the end of input.

```grammar,notation
// The bottom rule does not match anything.
⊥ -> !(CHAR | EOF)
```

Placed after a [hard cut operator], ⊥ makes a rule fail unconditionally once the parser has committed past the cut. This gives the grammar a way to express recognition without acceptance. The parser identifies the input, commits so that no other alternative can be tried, and then rejects it. In the frontmatter grammar, for example, [FRONTMATTER_INVALID] uses `^ ⊥` to recognize an opening fence preceded by whitespace on the same line --- input that is close enough to frontmatter to rule out other interpretations but is not valid.
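
As a rough illustration of "recognition without acceptance", the sketch below models a rule result that distinguishes failing before a hard cut from failing after one. `Failure` and `frontmatter_invalid` are hypothetical names chosen for this example. The function models only the FRONTMATTER_INVALID rule itself, not how the enclosing grammar reacts to its result.

```rust
/// How a rule can fail (hypothetical model for this sketch).
#[derive(Debug, PartialEq)]
enum Failure {
    /// Failed before any cut: the parser may backtrack and try alternatives.
    Soft,
    /// Failed after a hard cut: no alternative may be tried.
    Hard,
}

/// `Pattern_White_Space` code points, as listed in the WHITESPACE production.
fn is_pattern_white_space(c: char) -> bool {
    matches!(
        c,
        '\u{0009}'..='\u{000D}'
            | '\u{0020}'
            | '\u{0085}'
            | '\u{200E}'
            | '\u{200F}'
            | '\u{2028}'
            | '\u{2029}'
    )
}

/// Sketch of FRONTMATTER_INVALID: one or more non-LF whitespace characters,
/// then three hyphens, then a hard cut, then the bottom rule.
/// On success a rule would return the number of bytes consumed, but this
/// rule ends in the bottom rule and so can never succeed.
fn frontmatter_invalid(input: &str) -> Result<usize, Failure> {
    // (!LF WHITESPACE)+
    let ws_len: usize = input
        .chars()
        .take_while(|&c| c != '\n' && is_pattern_white_space(c))
        .map(char::len_utf8)
        .sum();
    if ws_len == 0 || !input[ws_len..].starts_with("---") {
        // Nothing has been committed yet, so this is an ordinary failure.
        return Err(Failure::Soft);
    }
    // The hard cut has been passed. The bottom rule matches nothing, so the
    // only possible outcome from here is a committed failure.
    Err(Failure::Hard)
}
```

In this model, input such as two spaces followed by `---` produces `Failure::Hard`, while `---` at the start of a line, or whitespace followed by anything else, produces `Failure::Soft`.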

r[notation.grammar.string-tables]
### String table productions

8 changes: 8 additions & 0 deletions src/whitespace.md
@@ -16,6 +16,10 @@ WHITESPACE ->
    | U+2028 // Line separator
    | U+2029 // Paragraph separator

HORIZONTAL_WHITESPACE ->
      U+0009 // Horizontal tab, `'\t'`
    | U+0020 // Space, `' '`

TAB -> U+0009 // Horizontal tab, `'\t'`

LF -> U+000A // Line feed, `'\n'`
@@ -26,10 +30,14 @@ CR -> U+000D // Carriage return, `'\r'`
r[lex.whitespace.intro]
Whitespace is any non-empty string containing only characters that have the [`Pattern_White_Space`] Unicode property.

r[lex.whitespace.horizontal]
[HORIZONTAL_WHITESPACE] is the horizontal space subset of [`Pattern_White_Space`] as categorized by [UAX #31, Section 4.1][uax31-4.1].
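
The relationship between the two character classes can be sketched as a pair of predicates. The function names are hypothetical. The code points are those of the WHITESPACE and HORIZONTAL_WHITESPACE productions.

```rust
/// Characters with the `Pattern_White_Space` property.
fn is_pattern_white_space(c: char) -> bool {
    matches!(
        c,
        '\u{0009}'..='\u{000D}' // tab, line feed, vertical tab, form feed, carriage return
            | '\u{0020}' // space
            | '\u{0085}' // next line
            | '\u{200E}' // left-to-right mark
            | '\u{200F}' // right-to-left mark
            | '\u{2028}' // line separator
            | '\u{2029}' // paragraph separator
    )
}

/// The horizontal subset: tab and space only.
fn is_horizontal_whitespace(c: char) -> bool {
    matches!(c, '\u{0009}' | '\u{0020}')
}
```

Every character accepted by `is_horizontal_whitespace` is also accepted by `is_pattern_white_space`. Note that `is_pattern_white_space` differs from the standard library's `char::is_whitespace`, which follows the Unicode `White_Space` property: for example, U+00A0 (no-break space) is `White_Space` but not `Pattern_White_Space`.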

r[lex.whitespace.token-sep]
Rust is a "free-form" language, meaning that all forms of whitespace serve only to separate _tokens_ in the grammar, and have no semantic significance.

r[lex.whitespace.replacement]
A Rust program has identical meaning if each whitespace element is replaced with any other legal whitespace element, such as a single space character.

[`Pattern_White_Space`]: https://www.unicode.org/reports/tr31/
[uax31-4.1]: https://www.unicode.org/reports/tr31/#Whitespace_and_Syntax