Commit 73c7b6b

Fix the bug, update the doc changes.
This is a more complete fix, covering any syntax part where encoded words are not permitted, and the doc changes are adjusted accordingly. There is also no need for a new exception, since HeaderWriteError already exists. The fix itself is to use a separate code loop to fold parts that may not have encoded words, guaranteeing that we do not do incorrect encoding. This opens a door to simplifying the main folding loop, but that is a much bigger refactoring job better left for another time.
1 parent f47e029 commit 73c7b6b
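
The "separate code loop" described in the commit message can be pictured with a much smaller model. The sketch below is not the CPython implementation: the function name `fold_without_ew`, the flat list of string tokens, and the use of `ValueError` in place of `HeaderWriteError` are all illustrative assumptions. It only shows the idea of folding at whitespace and refusing, rather than encoding, any text that cannot be written in the target charset.

```python
def fold_without_ew(tokens, maxlen, *, utf8=False):
    """Greedily fold string tokens into lines without ever encoding them.

    Hypothetical simplification: tokens are plain strings, and a token may
    start a new line only if it begins with whitespace.
    """
    encoding = "utf-8" if utf8 else "us-ascii"
    lines = [""]
    for tok in tokens:
        try:
            tok.encode(encoding)
        except UnicodeEncodeError:
            # Encoded words are not allowed here, so there is no fallback.
            raise ValueError(
                f"Non-ASCII token {tok!r} is invalid when utf8=False"
            ) from None
        if len(tok) <= maxlen - len(lines[-1]):
            lines[-1] += tok          # fits on the current line
        elif tok[:1] in (" ", "\t"):
            lines.append(tok)         # fold at the leading whitespace
        else:
            lines[-1] += tok          # cannot fold; let the line run long
    return lines
```

Calling `fold_without_ew(["To:", " a@example.com,", " b@example.com"], maxlen=20)` in this model puts the second address on a continuation line, while a non-ASCII token raises instead of being turned into an encoded word.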

8 files changed

Lines changed: 112 additions & 54 deletions

Doc/library/email.errors.rst

Lines changed: 0 additions & 9 deletions
@@ -59,15 +59,6 @@ The following exception classes are defined in the :mod:`!email.errors` module:
    headers.
 
 
-.. exception:: InvalidMailboxError()
-
-   Raised when serializing a message with an address header that contains
-   a mailbox incompatible with the policy in use.
-   (See :attr:`email.policy.EmailPolicy.utf8`.)
-
-   .. versionadded:: 3.15
-
-
 .. exception:: MessageDefect()
 
    This is the base class for all defects found when parsing email messages.

Doc/library/email.policy.rst

Lines changed: 11 additions & 8 deletions
@@ -409,16 +409,19 @@ added matters. To illustrate::
       the ``SMTPUTF8`` extension (:rfc:`6531`).
 
       When ``False``, the generator will raise an
-      :exc:`~email.errors.InvalidMailboxError` if any address header includes
-      a mailbox ("addr-spec") with non-ASCII characters. To use a mailbox with
-      an internationalized domain name, first encode the domain using the
-      third-party :pypi:`idna` or :pypi:`uts46` module or with
-      :mod:`encodings.idna`. It is not possible to use a non-ASCII username
-      ("local-part") in a mailbox when ``utf8=False``.
+      :exc:`~email.errors.HeaderWriteError` if any header includes non-ASCII
+      characters in a context where :rfc:`2047` does not permit encoded words.
+      This particularly applies to mailboxes ("addr-spec") with non-ASCII
+      characters, which can be created via :class:`~email.headerregistry.Address`.
+      To use a mailbox with a non-ASCII domain name with ``utf8=False``, first
+      encode the domain using the third-party :pypi:`idna` or :pypi:`uts46`
+      module or with :mod:`encodings.idna`. It is not possible to use a
+      non-ASCII username ("local-part") in a mailbox when ``utf8=False``.
 
       .. versionchanged:: 3.14
-         Raises :exc:`~email.errors.InvalidMailboxError`. (Earlier versions
-         incorrectly applied :rfc:`2047` to non-ASCII addr-specs.)
+         Can trigger the raising of :exc:`~email.errors.HeaderWriteError`.
+         (Earlier versions incorrectly applied :rfc:`2047` in certain contexts,
+         most notably in addr-specs.)
 
    .. attribute:: refold_source
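
As a rough illustration of the documented behavior, calling code might guard serialization as sketched below. The helper name `serialize_or_none` is hypothetical, and the `getattr` fallback to `MessageError` is an assumption added so the sketch still imports on Python releases that lack `HeaderWriteError`. On interpreters without this change, a non-ASCII mailbox is likely to be RFC 2047-encoded rather than rejected, so the `None` branch is only expected once the fix is present.

```python
import email.errors
from email.message import EmailMessage
from email.policy import default

# HeaderWriteError is absent from older Python releases; fall back to its
# base class there (a portability assumption, not part of the commit).
HeaderWriteError = getattr(
    email.errors, "HeaderWriteError", email.errors.MessageError
)


def serialize_or_none(to_address):
    """Return the flattened message, or None if a header is rejected."""
    msg = EmailMessage(policy=default.clone(utf8=False))
    msg["To"] = to_address
    try:
        return msg.as_string()
    except HeaderWriteError:
        # Raised when non-ASCII text sits where no encoded word is allowed.
        return None
```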

Lib/email/_header_value_parser.py

Lines changed: 67 additions & 13 deletions
@@ -157,10 +157,7 @@ def all_defects(self):
     def startswith_fws(self):
         return self[0].startswith_fws()
 
-    @property
-    def as_ew_allowed(self):
-        """True if all top level tokens of this part may be RFC2047 encoded."""
-        return all(part.as_ew_allowed for part in self)
+    as_ew_allowed = True
 
     @property
     def comments(self):
@@ -429,6 +426,7 @@ def addr_spec(self):
 class AngleAddr(TokenList):
 
     token_type = 'angle-addr'
+    as_ew_allowed = False
 
     @property
     def local_part(self):
@@ -847,26 +845,22 @@ def params(self):
 
 class ContentType(ParameterizedHeaderValue):
     token_type = 'content-type'
-    as_ew_allowed = False
     maintype = 'text'
     subtype = 'plain'
 
 
 class ContentDisposition(ParameterizedHeaderValue):
     token_type = 'content-disposition'
-    as_ew_allowed = False
     content_disposition = None
 
 
 class ContentTransferEncoding(TokenList):
     token_type = 'content-transfer-encoding'
-    as_ew_allowed = False
     cte = '7bit'
 
 
 class HeaderLabel(TokenList):
     token_type = 'header-label'
-    as_ew_allowed = False
 
 
 class MsgID(TokenList):
@@ -2838,13 +2832,68 @@ def _steal_trailing_WSP_if_exists(lines):
 
 
 def _refold_parse_tree(parse_tree, *, policy):
-    """Return string of contents of parse_tree folded according to RFC rules.
-
-    """
     # max_line_length 0/None means no limit, ie: infinitely long.
     maxlen = policy.max_line_length or sys.maxsize
     encoding = 'utf-8' if policy.utf8 else 'us-ascii'
     lines = [''] # Folded lines to be output
+    if parse_tree.as_ew_allowed:
+        _refold_with_ew(parse_tree, lines, maxlen, encoding, policy=policy)
+    else:
+        _refold_without_ew(parse_tree, lines, maxlen, encoding, policy=policy)
+    return policy.linesep.join(lines) + policy.linesep
+
+def _refold_without_ew(parse_tree, lines, maxlen, encoding, *, policy):
+    parts = list(parse_tree)
+    while parts:
+        part = parts.pop(0)
+        tstr = str(part)
+        try:
+            tstr.encode(encoding)
+        except UnicodeEncodeError:
+            if any(isinstance(x, errors.UndecodableBytesDefect)
+                   for x in part.all_defects):
+                # There is garbage data from parsing a message in binary mode,
+                # just pass it through. Not good, but the best we can do.
+                pass
+            elif policy.utf8:
+                # If this happens, it's a programmer error.
+                raise
+            else:
+                raise errors.HeaderWriteError(
+                    f"Non-ASCII {part.token_type} '{part}' is invalid"
+                    " under current policy setting (utf8=False)"
+                )
+        if len(tstr) <= maxlen - len(lines[-1]):
+            lines[-1] += tstr
+            continue
+        # This part is too long to fit. The RFC wants us to break at
+        # "major syntactic breaks", so unless we don't consider this
+        # to be one, check if it will fit on the next line by itself.
+        if (part.syntactic_break and
+                len(tstr) + 1 <= maxlen):
+            newline = _steal_trailing_WSP_if_exists(lines)
+            if newline or part.startswith_fws():
+                lines.append(newline + tstr)
+                continue
+        if not hasattr(part, 'encode'):
+            # It's not a terminal, try folding the subparts.
+            newparts = list(part)
+            parts = newparts + parts
+            continue
+        # We can't figure out how to wrap it, so give up.
+        newline = _steal_trailing_WSP_if_exists(lines)
+        if newline or part.startswith_fws():
+            lines.append(newline + tstr)
+        else:
+            # We can't fold it onto the next line either...
+            lines[-1] += tstr
+    return
+
+
+def _refold_with_ew(parse_tree, lines, maxlen, encoding, *, policy):
+    """Return string of contents of parse_tree folded according to RFC rules.
+
+    """
     last_word_is_ew = False
     last_ew = None # if there is an encoded word in the last line of lines,
                    # points to the encoded word's first character
@@ -2885,7 +2934,10 @@ def _refold_parse_tree(parse_tree, *, policy):
             want_encoding = True
 
         if want_encoding and not wrap_as_ew_blocked:
-            if not part.as_ew_allowed:
+            if any(
+                    not x.as_ew_allowed for x in part
+                    if hasattr(x, 'as_ew_allowed')
+            ):
                 want_encoding = False
                 last_ew = None
                 if part.syntactic_break:
@@ -2966,6 +3018,8 @@ def _refold_parse_tree(parse_tree, *, policy):
                     [ValueTerminal(make_quoted_pairs(p), 'ptext')
                      for p in newparts] +
                     [ValueTerminal('"', 'ptext')])
+                _refold_without_ew(newparts, lines, maxlen, encoding, policy=policy)
+                continue
             if part.token_type == 'comment':
                 newparts = (
                     [ValueTerminal('(', 'ptext')] +
@@ -2993,7 +3047,7 @@ def _refold_parse_tree(parse_tree, *, policy):
             lines[-1] += tstr
             last_word_is_ew = last_word_is_ew and not bool(tstr.strip(_WSP))
 
-    return policy.linesep.join(lines) + policy.linesep
+    return
 
 def _fold_as_ew(to_encode, lines, maxlen, last_ew, ew_combine_allowed, charset, last_word_is_ew):
     """Fold string to_encode into lines as encoded word, combining if allowed.

Lib/email/errors.py

Lines changed: 0 additions & 4 deletions
@@ -33,10 +33,6 @@ class HeaderWriteError(MessageError):
     """Error while writing headers."""
 
 
-class InvalidMailboxError(MessageError, ValueError):
-    """A mailbox was not compatible with the policy in use."""
-
-
 # These are parsing defects which the parser was able to work around.
 class MessageDefect(ValueError):
     """Base class for a message defect."""

Lib/test/test_email/test__header_value_parser.py

Lines changed: 5 additions & 3 deletions
@@ -3364,10 +3364,12 @@ def test_fold_unfoldable_element_stealing_whitespace(self):
         self._test(token, expected, policy=policy)
 
     def test_encoded_word_with_undecodable_bytes(self):
-        self._test(parser.get_address_list(
-            ' =?utf-8?Q?=E5=AE=A2=E6=88=B6=E6=AD=A3=E8=A6=8F=E4=BA=A4=E7?='
+        self._test(
+            parser.get_address_list(
+                ' =?utf-8?Q?=E5=AE=A2=E6=88=B6=E6=AD=A3=E8=A6=8F=E4=BA=A4=E7?='
+                ' <xyz@abc.com>'
             )[0],
-            ' =?unknown-8bit?b?5a6i5oi25q2j6KaP5Lqk5w==?=\n',
+            ' =?unknown-8bit?b?5a6i5oi25q2j6KaP5Lqk5w==?= <xyz@abc.com>\n',
         )
 
 

Lib/test/test_email/test_generator.py

Lines changed: 27 additions & 15 deletions
@@ -296,30 +296,43 @@ def test_keep_long_encoded_newlines(self):
         g.flatten(msg)
         self.assertEqual(s.getvalue(), self.typ(expected))
 
-    # XXX renable after fix.
-    def xest_non_ascii_addr_spec_raises(self):
-        # RFC2047 encoded-word is not permitted in any part of an addr-spec.
-        # (See also test_non_ascii_addr_spec_preserved below.)
+    def test_non_ascii_addr_spec_raises(self):
+        # non-ascii is not permitted in any part of an addr-spec. If the
+        # programmer generated it, it's an error. (See also
+        # test_non_ascii_addr_spec_preserved below.)
         g = self.genclass(self.ioclass(), policy=self.policy.clone(utf8=False))
+        # XXX The particular part detected here isn't part of a behavioral
+        # spec and may change in the future.
         cases = [
-            'wők@example.com',
-            'wok@exàmple.com',
-            'wők@exàmple.com',
-            '"Name, for display" <wők@example.com>',
-            'Näyttönimi <wők@example.com>',
+            ('wők@example.com', 'wők', 'local-part'),
+            ('wok@exàmple.com', 'exàmple.com', 'domain'),
+            ('wők@exàmple.com', 'wők', 'local-part'),
+            (
+                '"Name, for display" <wők@example.com>',
+                'wők@example.com',
+                'addr-spec',
+            ),
+            (
+                'Näyttönimi <wők@example.com>',
+                'wők@example.com',
+                'addr-spec',
+            ),
         ]
-        for address in cases:
+        for address, badtoken, partname in cases:
             with self.subTest(address=address):
                 msg = EmailMessage()
                 msg['To'] = address
-                addr_spec = msg['To'].addresses[0].addr_spec
                 expected_error = (
-                    fr"(?i)(?=.*non-ascii)(?=.*{re.escape(addr_spec)})(?=.*policy.*utf8)"
+                    fr"(?i)(?=.*non-ascii)"
+                    fr"(?=.*{re.escape(badtoken)})"
+                    fr"(?=.*{partname})"
+                    fr"(?=.*policy.*utf8)"
                 )
                 with self.assertRaisesRegex(
-                    email.errors.InvalidMailboxError, expected_error
+                    email.errors.HeaderWriteError, expected_error
                 ):
                     g.flatten(msg)
+
     def _test_boundary_detection(self, linesep):
         # Generate a boundary token in the same way as _make_boundary
         token = random.randrange(sys.maxsize)
@@ -580,8 +593,7 @@ def test_smtp_policy(self):
         g.flatten(msg)
         self.assertEqual(s.getvalue(), expected)
 
-    # XXX renable after fix.
-    def xest_non_ascii_addr_spec_preserved(self):
+    def test_non_ascii_addr_spec_preserved(self):
         # A defective non-ASCII addr-spec parsed from the original
         # message is left unchanged when flattening.
         # (See also test_non_ascii_addr_spec_raises above.)

Misc/NEWS.d/next/Library/2024-07-31-17-22-10.gh-issue-83938.TtUa-c.rst

Lines changed: 1 addition & 1 deletion
@@ -1,7 +1,7 @@
 The :mod:`email` module no longer incorrectly uses :rfc:`2047` encoding for
 a mailbox with non-ASCII characters in its domain. Under a policy with
 :attr:`~email.policy.EmailPolicy.utf8` set ``False``, attempting to serialize
-such a message will now raise an :exc:`~email.errors.InvalidMailboxError`.
+such a message will now raise a :exc:`~email.errors.HeaderWriteError`.
 Either apply an appropriate IDNA encoding to convert the domain to ASCII before
 serialization, or use :data:`email.policy.SMTPUTF8` (or another policy with
 ``utf8=True``) to correctly pass through the internationalized domain name
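
The first option in this NEWS entry, encoding the domain before serialization, might look like the sketch below. It uses the standard library `idna` codec (from `encodings.idna`, which implements the older IDNA 2003 rules); the third-party packages named in the documentation may give different results for some domains. The helper name `ascii_mailbox` and the sample address are illustrative assumptions.

```python
from email.headerregistry import Address
from email.message import EmailMessage


def ascii_mailbox(display_name, username, domain):
    """Build an Address whose domain has been IDNA-encoded to ASCII."""
    ascii_domain = domain.encode("idna").decode("ascii")
    return Address(
        display_name=display_name, username=username, domain=ascii_domain
    )


msg = EmailMessage()  # the default policy has utf8=False
msg["To"] = ascii_mailbox("Example User", "wok", "exàmple.com")
serialized = msg.as_string()  # the domain is already ASCII, so this succeeds
```

The local-part is passed through unchanged here, so this approach only helps when the username itself is ASCII.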
Lines changed: 1 addition & 1 deletion
@@ -1,7 +1,7 @@
 The :mod:`email` module no longer incorrectly uses :rfc:`2047` encoding for
 a mailbox with non-ASCII characters in its local-part. Under a policy with
 :attr:`~email.policy.EmailPolicy.utf8` set ``False``, attempting to serialize
-such a message will now raise an :exc:`~email.errors.InvalidMailboxError`.
+such a message will now raise a :exc:`~email.errors.HeaderWriteError`.
 There is no valid 7-bit encoding for an internationalized local-part. Use
 :data:`email.policy.SMTPUTF8` (or another policy with ``utf8=True``) to
 correctly pass through the local-part as Unicode characters.
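
For a non-ASCII local-part, the remedy named in this NEWS entry is a policy with `utf8=True`. A minimal sketch, assuming the receiving side supports the `SMTPUTF8` extension (the sample address is illustrative):

```python
from email.message import EmailMessage
from email.policy import SMTPUTF8

msg = EmailMessage(policy=SMTPUTF8)  # SMTPUTF8 is a policy with utf8=True
msg["To"] = "wők@example.com"

# With utf8=True the address is written as Unicode text, not encoded words.
wire_text = msg.as_string()
```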
