Fix the bug, update the doc changes.

bitdancer · bitdancer · commit 73c7b6b60036 · 2026-04-26T14:49:51.000-04:00
This is a more complete fix, covering any syntax part where encoded
words are not permitted, and the doc changes are adjusted accordingly.
There is also no need for a new exception, since HeaderWriteError
already exists.

The fix itself is to use a separate code loop to fold parts that
may not have encoded words, guaranteeing that we do not do incorrect
encoding.  This opens a door to simplifying the main folding loop,
but that is a much bigger refactoring job better left for another time.
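
As a rough illustration of the split described above (not the code in this commit), the dispatch can be sketched with toy tokens. Every name below is a hypothetical stand-in: the real tokens live in `email._header_value_parser`, the real exception is `email.errors.HeaderWriteError`, and the real loops also handle line folding, which this sketch omits.

```python
from email.header import Header


class HeaderWriteError(Exception):
    """Toy stand-in for email.errors.HeaderWriteError."""


class Token(str):
    """Hypothetical terminal token; encoded words are allowed by default."""
    as_ew_allowed = True
    token_type = 'token'


class AddrSpecToken(Token):
    """Hypothetical token for a syntax part where RFC 2047 is not permitted."""
    as_ew_allowed = False
    token_type = 'addr-spec'


def fold_without_ew(token, *, utf8):
    """Pass the text through unchanged, or raise if it cannot be written."""
    try:
        token.encode('utf-8' if utf8 else 'us-ascii')
    except UnicodeEncodeError:
        raise HeaderWriteError(
            f"Non-ASCII {token.token_type} '{token}' is invalid"
            " under current policy setting (utf8=False)"
        ) from None
    return str(token)


def fold_with_ew(token, *, utf8):
    """Toy placeholder: RFC 2047-encode non-ASCII text when utf8 is off."""
    if utf8 or token.isascii():
        return str(token)
    return Header(str(token), 'utf-8').encode()


def refold(token, *, utf8=False):
    """Choose the loop from the token's class-level as_ew_allowed flag."""
    if token.as_ew_allowed:
        return fold_with_ew(token, utf8=utf8)
    return fold_without_ew(token, utf8=utf8)
```

Because the branch that forbids encoded words never calls the encoder, it cannot emit an encoded word by mistake; the only outcomes are pass-through or an error.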
diff --git a/Doc/library/email.errors.rst b/Doc/library/email.errors.rst
@@ -59,15 +59,6 @@ The following exception classes are defined in the :mod:`!email.errors` module:
    headers.
 
 
-.. exception:: InvalidMailboxError()
-
-   Raised when serializing a message with an address header that contains
-   a mailbox incompatible with the policy in use.
-   (See :attr:`email.policy.EmailPolicy.utf8`.)
-
-   .. versionadded:: 3.15
-
-
 .. exception:: MessageDefect()
 
    This is the base class for all defects found when parsing email messages.
diff --git a/Doc/library/email.policy.rst b/Doc/library/email.policy.rst
@@ -409,16 +409,19 @@ added matters.  To illustrate::
       the ``SMTPUTF8`` extension (:rfc:`6531`).
 
       When ``False``, the generator will raise an
-      :exc:`~email.errors.InvalidMailboxError` if any address header includes
-      a mailbox ("addr-spec") with non-ASCII characters. To use a mailbox with
-      an internationalized domain name, first encode the domain using the
-      third-party :pypi:`idna` or :pypi:`uts46` module or with
-      :mod:`encodings.idna`. It is not possible to use a non-ASCII username
-      ("local-part") in a mailbox when ``utf8=False``.
+      :exc:`~email.errors.HeaderWriteError` if any header includes non-ASCII
+      characters in a context where :rfc:`2047` does not permit encoded words.
+      This particularly applies to mailboxes ("addr-spec") with non-ASCII
+      characters, which can be created via :class:`~email.headerregistry.Address`.
+      To use a mailbox with a non-ASCII domain name with ``utf8=False``, first
+      encode the domain using the third-party :pypi:`idna` or :pypi:`uts46`
+      module or with :mod:`encodings.idna`. It is not possible to use a
+      non-ASCII username ("local-part") in a mailbox when ``utf8=False``.
 
       .. versionchanged:: 3.14
-         Raises :exc:`~email.errors.InvalidMailboxError`. (Earlier versions
-         incorrectly applied :rfc:`2047` to non-ASCII addr-specs.)
+         Can trigger the raising of :exc:`~email.errors.HeaderWriteError`.
+         (Earlier versions incorrectly applied :rfc:`2047` in certain contexts,
+         most notably in addr-specs.)
 
    .. attribute:: refold_source
 
diff --git a/Lib/email/_header_value_parser.py b/Lib/email/_header_value_parser.py
@@ -157,10 +157,7 @@ def all_defects(self):
     def startswith_fws(self):
         return self[0].startswith_fws()
 
-    @property
-    def as_ew_allowed(self):
-        """True if all top level tokens of this part may be RFC2047 encoded."""
-        return all(part.as_ew_allowed for part in self)
+    as_ew_allowed = True
 
     @property
     def comments(self):
@@ -429,6 +426,7 @@ def addr_spec(self):
 class AngleAddr(TokenList):
 
     token_type = 'angle-addr'
+    as_ew_allowed = False
 
     @property
     def local_part(self):
@@ -847,26 +845,22 @@ def params(self):
 
 class ContentType(ParameterizedHeaderValue):
     token_type = 'content-type'
-    as_ew_allowed = False
     maintype = 'text'
     subtype = 'plain'
 
 
 class ContentDisposition(ParameterizedHeaderValue):
     token_type = 'content-disposition'
-    as_ew_allowed = False
     content_disposition = None
 
 
 class ContentTransferEncoding(TokenList):
     token_type = 'content-transfer-encoding'
-    as_ew_allowed = False
     cte = '7bit'
 
 
 class HeaderLabel(TokenList):
     token_type = 'header-label'
-    as_ew_allowed = False
 
 
 class MsgID(TokenList):
@@ -2838,13 +2832,68 @@ def _steal_trailing_WSP_if_exists(lines):
 
 
 def _refold_parse_tree(parse_tree, *, policy):
-    """Return string of contents of parse_tree folded according to RFC rules.
-
-    """
     # max_line_length 0/None means no limit, ie: infinitely long.
     maxlen = policy.max_line_length or sys.maxsize
     encoding = 'utf-8' if policy.utf8 else 'us-ascii'
     lines = ['']  # Folded lines to be output
+    if parse_tree.as_ew_allowed:
+        _refold_with_ew(parse_tree, lines, maxlen, encoding, policy=policy)
+    else:
+        _refold_without_ew(parse_tree, lines, maxlen, encoding, policy=policy)
+    return policy.linesep.join(lines) + policy.linesep
+
+def _refold_without_ew(parse_tree, lines, maxlen, encoding, *, policy):
+    parts = list(parse_tree)
+    while parts:
+        part = parts.pop(0)
+        tstr = str(part)
+        try:
+            tstr.encode(encoding)
+        except UnicodeEncodeError:
+            if any(isinstance(x, errors.UndecodableBytesDefect)
+                   for x in part.all_defects):
+                # There is garbage data from parsing a message in binary mode,
+                # just pass it through.  Not good, but the best we can do.
+                pass
+            elif policy.utf8:
+                # If this happens, it's a programmer error.
+                raise
+            else:
+                raise errors.HeaderWriteError(
+                    f"Non-ASCII {part.token_type} '{part}' is invalid"
+                    " under current policy setting (utf8=False)"
+                )
+        if len(tstr) <= maxlen - len(lines[-1]):
+            lines[-1] += tstr
+            continue
+        # This part is too long to fit.  The RFC wants us to break at
+        # "major syntactic breaks", so unless we don't consider this
+        # to be one, check if it will fit on the next line by itself.
+        if (part.syntactic_break and
+                len(tstr) + 1 <= maxlen):
+            newline = _steal_trailing_WSP_if_exists(lines)
+            if newline or part.startswith_fws():
+                lines.append(newline + tstr)
+                continue
+        if not hasattr(part, 'encode'):
+            # It's not a terminal, try folding the subparts.
+            newparts = list(part)
+            parts = newparts + parts
+            continue
+        # We can't figure out how to wrap it, so give up.
+        newline = _steal_trailing_WSP_if_exists(lines)
+        if newline or part.startswith_fws():
+            lines.append(newline + tstr)
+        else:
+            # We can't fold it onto the next line either...
+            lines[-1] += tstr
+    return
+
+
+def _refold_with_ew(parse_tree, lines, maxlen, encoding, *, policy):
+    """Return string of contents of parse_tree folded according to RFC rules.
+
+    """
     last_word_is_ew = False
     last_ew = None  # if there is an encoded word in the last line of lines,
                     # points to the encoded word's first character
@@ -2885,7 +2934,10 @@ def _refold_parse_tree(parse_tree, *, policy):
             want_encoding = True
 
         if want_encoding and not wrap_as_ew_blocked:
-            if not part.as_ew_allowed:
+            if any(
+                    not x.as_ew_allowed for x in part
+                    if hasattr(x, 'as_ew_allowed')
+                ):
                 want_encoding = False
                 last_ew = None
                 if part.syntactic_break:
@@ -2966,6 +3018,8 @@ def _refold_parse_tree(parse_tree, *, policy):
                     [ValueTerminal(make_quoted_pairs(p), 'ptext')
                      for p in newparts] +
                     [ValueTerminal('"', 'ptext')])
+                _refold_without_ew(newparts, lines, maxlen, encoding, policy=policy)
+                continue
             if part.token_type == 'comment':
                 newparts = (
                     [ValueTerminal('(', 'ptext')] +
@@ -2993,7 +3047,7 @@ def _refold_parse_tree(parse_tree, *, policy):
             lines[-1] += tstr
         last_word_is_ew = last_word_is_ew and not bool(tstr.strip(_WSP))
 
-    return policy.linesep.join(lines) + policy.linesep
+    return
 
 def _fold_as_ew(to_encode, lines, maxlen, last_ew, ew_combine_allowed, charset, last_word_is_ew):
     """Fold string to_encode into lines as encoded word, combining if allowed.
diff --git a/Lib/email/errors.py b/Lib/email/errors.py
@@ -33,10 +33,6 @@ class HeaderWriteError(MessageError):
     """Error while writing headers."""
 
 
-class InvalidMailboxError(MessageError, ValueError):
-    """A mailbox was not compatible with the policy in use."""
-
-
 # These are parsing defects which the parser was able to work around.
 class MessageDefect(ValueError):
     """Base class for a message defect."""
diff --git a/Lib/test/test_email/test__header_value_parser.py b/Lib/test/test_email/test__header_value_parser.py
@@ -3364,10 +3364,12 @@ def test_fold_unfoldable_element_stealing_whitespace(self):
         self._test(token, expected, policy=policy)
 
     def test_encoded_word_with_undecodable_bytes(self):
-        self._test(parser.get_address_list(
-            ' =?utf-8?Q?=E5=AE=A2=E6=88=B6=E6=AD=A3=E8=A6=8F=E4=BA=A4=E7?='
+        self._test(
+            parser.get_address_list(
+                ' =?utf-8?Q?=E5=AE=A2=E6=88=B6=E6=AD=A3=E8=A6=8F=E4=BA=A4=E7?='
+                ' <xyz@abc.com>'
                 )[0],
-            ' =?unknown-8bit?b?5a6i5oi25q2j6KaP5Lqk5w==?=\n',
+            ' =?unknown-8bit?b?5a6i5oi25q2j6KaP5Lqk5w==?= <xyz@abc.com>\n',
             )
 
 
diff --git a/Lib/test/test_email/test_generator.py b/Lib/test/test_email/test_generator.py
@@ -296,30 +296,43 @@ def test_keep_long_encoded_newlines(self):
         g.flatten(msg)
         self.assertEqual(s.getvalue(), self.typ(expected))
 
-    # XXX renable after fix.
-    def xest_non_ascii_addr_spec_raises(self):
-        # RFC2047 encoded-word is not permitted in any part of an addr-spec.
-        # (See also test_non_ascii_addr_spec_preserved below.)
+    def test_non_ascii_addr_spec_raises(self):
+        # non-ascii is not permitted in any part of an addr-spec.  If the
+        # programmer generated it, it's an error.  (See also
+        # test_non_ascii_addr_spec_preserved below.)
         g = self.genclass(self.ioclass(), policy=self.policy.clone(utf8=False))
+        # XXX The particular part detected here isn't part of a behavioral
+        # spec and may change in the future.
         cases = [
-            'wők@example.com',
-            'wok@exàmple.com',
-            'wők@exàmple.com',
-            '"Name, for display" <wők@example.com>',
-            'Näyttönimi <wők@example.com>',
+            ('wők@example.com', 'wők', 'local-part'),
+            ('wok@exàmple.com', 'exàmple.com', 'domain'),
+            ('wők@exàmple.com', 'wők', 'local-part'),
+            (
+                '"Name, for display" <wők@example.com>',
+                'wők@example.com',
+                'addr-spec',
+                ),
+            (
+                'Näyttönimi <wők@example.com>',
+                'wők@example.com',
+                'addr-spec',
+                ),
         ]
-        for address in cases:
+        for address, badtoken, partname in cases:
             with self.subTest(address=address):
                 msg = EmailMessage()
                 msg['To'] = address
-                addr_spec = msg['To'].addresses[0].addr_spec
                 expected_error = (
-                    fr"(?i)(?=.*non-ascii)(?=.*{re.escape(addr_spec)})(?=.*policy.*utf8)"
+                    fr"(?i)(?=.*non-ascii)"
+                    fr"(?=.*{re.escape(badtoken)})"
+                    fr"(?=.*{partname})"
+                    fr"(?=.*policy.*utf8)"
                 )
                 with self.assertRaisesRegex(
-                    email.errors.InvalidMailboxError, expected_error
+                    email.errors.HeaderWriteError, expected_error
                 ):
                     g.flatten(msg)
+
     def _test_boundary_detection(self, linesep):
         # Generate a boundary token in the same way as _make_boundary
         token = random.randrange(sys.maxsize)
@@ -580,8 +593,7 @@ def test_smtp_policy(self):
         g.flatten(msg)
         self.assertEqual(s.getvalue(), expected)
 
-    # XXX renable after fix.
-    def xest_non_ascii_addr_spec_preserved(self):
+    def test_non_ascii_addr_spec_preserved(self):
         # A defective non-ASCII addr-spec parsed from the original
         # message is left unchanged when flattening.
         # (See also test_non_ascii_addr_spec_raises above.)
diff --git a/Misc/NEWS.d/next/Library/2024-07-31-17-22-10.gh-issue-83938.TtUa-c.rst b/Misc/NEWS.d/next/Library/2024-07-31-17-22-10.gh-issue-83938.TtUa-c.rst
@@ -1,7 +1,7 @@
 The :mod:`email` module no longer incorrectly uses :rfc:`2047` encoding for
 a mailbox with non-ASCII characters in its domain. Under a policy with
 :attr:`~email.policy.EmailPolicy.utf8` set ``False``, attempting to serialize
-such a message will now raise an :exc:`~email.errors.InvalidMailboxError`.
+such a message will now raise an :exc:`~email.errors.HeaderWriteError`.
 Either apply an appropriate IDNA encoding to convert the domain to ASCII before
 serialization, or use :data:`email.policy.SMTPUTF8` (or another policy with
 ``utf8=True``) to correctly pass through the internationalized domain name
diff --git a/Misc/NEWS.d/next/Library/2024-07-31-17-23-06.gh-issue-122476.TtUa-c.rst b/Misc/NEWS.d/next/Library/2024-07-31-17-23-06.gh-issue-122476.TtUa-c.rst
@@ -1,7 +1,7 @@
 The :mod:`email` module no longer incorrectly uses :rfc:`2047` encoding for
 a mailbox with non-ASCII characters in its local-part. Under a policy with
 :attr:`~email.policy.EmailPolicy.utf8` set ``False``, attempting to serialize
-such a message will now raise an :exc:`~email.errors.InvalidMailboxError`.
+such a message will now raise an :exc:`~email.errors.HeaderWriteError`.
 There is no valid 7-bit encoding for an internationalized local-part. Use
 :data:`email.policy.SMTPUTF8` (or another policy with ``utf8=True``) to
 correctly pass through the local-part as Unicode characters.
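
The IDNA workaround mentioned in the first NEWS entry and in the policy documentation can be sketched as follows. This is a minimal example using the standard library `idna` codec (`encodings.idna`, which implements the older IDNA 2003 rules); the helper name `ascii_mailbox` is hypothetical, and the third-party modules named in the docs may be preferable for current IDNA rules.

```python
def ascii_mailbox(local_part, domain):
    """Return 'local_part@domain' with the domain IDNA-encoded to ASCII.

    Hypothetical helper for illustration. A non-ASCII local-part has no
    7-bit encoding, so it is rejected here with UnicodeEncodeError.
    """
    local_part.encode('ascii')  # raises UnicodeEncodeError if non-ASCII
    ascii_domain = domain.encode('idna').decode('ascii')
    return f"{local_part}@{ascii_domain}"


# The resulting address is pure ASCII and can be used with utf8=False.
address = ascii_mailbox('wok', 'exàmple.com')
```

An address with a non-ASCII local-part still needs a policy with `utf8=True`, such as `email.policy.SMTPUTF8`.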