[mypyc] Add str.isalnum() primitive #20852

Open
VaggelisD wants to merge 1 commit into python:master from VaggelisD:str_isalnum

Conversation

@VaggelisD
Contributor

Added str.isalnum() similar to str.isspace().

One interesting thing to point out here is that the speedup over CPython shrinks as the string gets longer, and for the non-ASCII strings it turns into a slowdown at length 100:

| All-alphanumeric | mypyc (s) | Python (s) | Speedup |
| --- | --- | --- | --- |
| length 1 ('a') | 0.645 | 2.036 | 3.16x |
| length 10 ('abcde12345') | 1.026 | 2.607 | 2.54x |
| length 100 ('a' * 100) | 3.599 | 7.848 | 2.18x |
| length 1 (UCS-2: U+00E9 é) | 0.816 | 1.976 | 2.42x |
| length 10 (UCS-2: U+00E9 * 10) | 2.091 | 2.587 | 1.24x |
| length 100 (UCS-2: U+00E9 * 100) | 14.298 | 7.814 | 0.55x |

| Non-alphanumeric (early exit) | mypyc (s) | Python (s) | Speedup |
| --- | --- | --- | --- |
| length 1 (' ') | 0.622 | 2.006 | 3.22x |
| length 100 ('!' * 100) | 0.617 | 2.024 | 3.28x |
| length 100 ('a' * 99 + '!') | 3.453 | 10.246 | 2.97x |

I'm not entirely sure how to interpret this. Could it be that Py_UNICODE_ISALNUM makes 4 function calls internally, and that CPython's own build optimizes those calls better thanks to PGO and LTO?

* ``s1.find(s2: str)``
* ``s1.find(s2: str, start: int)``
* ``s1.find(s2: str, start: int, end: int)``
* ``s.isspace()``
Contributor Author

This was not documented in the str.isspace() PR, so I added it now.


int kind = PyUnicode_KIND(str);
const void *data = PyUnicode_DATA(str);
for (Py_ssize_t i = 0; i < len; i++) {
Collaborator

Performance might improve if there were separate loops for the 2-byte and 4-byte kinds. That way the read operation wouldn't need to branch on the kind for every character, which might result in better generated code. Can you try this out?
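To make the suggestion above concrete, here is a minimal, self-contained sketch that contrasts a loop dispatching on the kind at every read (similar in spirit to reading through PyUnicode_READ) with a version that branches once and then runs a loop over a fixed element type. It does not use the CPython API: `toy_isalnum`, `all_alnum_generic` and `all_alnum_split` are hypothetical names, and the predicate only knows ASCII letters and digits, so this is an illustration of the loop shape and not the PR's implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the real predicate: ASCII letters and digits only. */
static bool toy_isalnum(uint32_t ch) {
    return (ch >= '0' && ch <= '9') ||
           (ch >= 'a' && ch <= 'z') ||
           (ch >= 'A' && ch <= 'Z');
}

/* Generic loop: every read re-dispatches on `kind` (1, 2 or 4 bytes per code point). */
static bool all_alnum_generic(int kind, const void *data, size_t len) {
    assert(kind == 1 || kind == 2 || kind == 4);
    if (len == 0) return false;  /* empty string is not alphanumeric */
    for (size_t i = 0; i < len; i++) {
        uint32_t ch = kind == 1 ? ((const uint8_t *)data)[i]
                    : kind == 2 ? ((const uint16_t *)data)[i]
                                : ((const uint32_t *)data)[i];
        if (!toy_isalnum(ch)) return false;
    }
    return true;
}

/* Split version: branch on `kind` once, then loop over a fixed-width array. */
static bool all_alnum_split(int kind, const void *data, size_t len) {
    assert(kind == 1 || kind == 2 || kind == 4);
    if (len == 0) return false;
    if (kind == 2) {
        const uint16_t *p = (const uint16_t *)data;
        for (size_t i = 0; i < len; i++)
            if (!toy_isalnum(p[i])) return false;
        return true;
    }
    if (kind == 4) {
        const uint32_t *p = (const uint32_t *)data;
        for (size_t i = 0; i < len; i++)
            if (!toy_isalnum(p[i])) return false;
        return true;
    }
    const uint8_t *p = (const uint8_t *)data;
    for (size_t i = 0; i < len; i++)
        if (!toy_isalnum(p[i])) return false;
    return true;
}
```

Both functions return the same results; whether the split form is measurably faster depends on the compiler and on how expensive the predicate is relative to the read.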

Contributor Author

@VaggelisD VaggelisD Feb 23, 2026

I tried it locally. It only slightly reduced the tail end (still 13+ seconds for the 2-byte, length-100 case), so we'd still see a significant regression for larger strings.

I also tried calling PyObject_CallMethodNoArgs for larger strings, to fall back to the interpreter's implementation, but it doesn't make a difference. If LTO/PGO inlining is what makes CPython fast here, we can't seem to reuse it at this point.

What is the preferred action here? Do we keep the primitive in the hope that most strings are small, or should mypyc always be at least on par with CPython?

Collaborator

This still looks better than CPython on average, as ASCII strings and short strings are common. To match CPython performance we might need to have a custom implementation of Py_UNICODE_ISALNUM, which doesn't seem worth it. I'll experiment with this a little, but this might be close to as good as we can easily achieve.

Contributor Author

Sounds good! I also wondered what it would take to mirror Py_UNICODE_ISALNUM. Claude advised against it, since CPython calls gettyperecord() in each of the internal functions, and that reads from its internal Unicode database (supposedly hard to replicate).

Py_ssize_t len = PyUnicode_GET_LENGTH(str);
if (len == 0) return false;

if (PyUnicode_IS_ASCII(str)) {
Collaborator

Could this be PyUnicode_KIND(str) == PyUnicode_1BYTE_KIND instead? This would be needed if the loop below were split into dedicated 2-byte and 4-byte loops.

Contributor Author

@VaggelisD VaggelisD Feb 23, 2026

Switching from the ASCII path to a 1-byte-kind path puts a very big dent in performance. I assume this is because Py_ISALNUM uses a lookup table, whereas Py_UNICODE_ISALNUM makes 4 separate function calls:


| All-alphanumeric | ASCII fast path | 1 Byte kind | Speedup (ASCII vs 1 Byte) |
| --- | --- | --- | --- |
| length 1 ('a') | 0.623 | 0.873 | 1.40x |
| length 10 ('abcde12345') | 1.003 | 2.708 | 2.70x |
| length 100 ('a' * 100) | 3.139 | 14.147 | 4.51x |

| Non-alphanumeric (early exit) | ASCII fast path | 1 Byte kind | Speedup (ASCII vs 1 Byte) |
| --- | --- | --- | --- |
| length 1 (' ') | 0.609 | 1.118 | 1.84x |
| length 100 ('!' * 100) | 0.617 | 1.126 | 1.82x |
| length 100 ('a' * 99 + '!') | 3.322 | 14.802 | 4.46x |
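
The contrast assumed in the comment above (a table lookup versus a chain of predicate calls) can be illustrated with a small toy model. Everything here is hypothetical and ASCII-only: `toy_table`, `table_isalnum` and the `toy_is*` predicates are stand-ins, not CPython's actual table or functions. In CPython the chained predicates consult the Unicode database, so each call costs more than these one-line range checks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Table-driven check: one array load and one mask test per character. */
enum { F_ALPHA = 1, F_DIGIT = 2 };
static uint8_t toy_table[256];

/* Fill the toy table; must be called once before table_isalnum(). */
static void toy_table_init(void) {
    for (int c = 0; c < 256; c++) {
        toy_table[c] = 0;
        if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z'))
            toy_table[c] |= F_ALPHA;
        if (c >= '0' && c <= '9')
            toy_table[c] |= F_DIGIT;
    }
}

static bool table_isalnum(uint8_t c) {
    return (toy_table[c] & (F_ALPHA | F_DIGIT)) != 0;
}

/* Chained check: up to four separate predicate calls per character.
   A non-alphanumeric character pays for all four. */
static bool toy_isalpha(uint32_t ch) {
    return (ch >= 'a' && ch <= 'z') || (ch >= 'A' && ch <= 'Z');
}
static bool toy_isdecimal(uint32_t ch) { return ch >= '0' && ch <= '9'; }
static bool toy_isdigit(uint32_t ch)   { return ch >= '0' && ch <= '9'; }
static bool toy_isnumeric(uint32_t ch) { return ch >= '0' && ch <= '9'; }

static bool chained_isalnum(uint32_t ch) {
    return toy_isalpha(ch) || toy_isdecimal(ch) ||
           toy_isdigit(ch) || toy_isnumeric(ch);
}
```

In this toy model the two checks agree on every byte value; the difference is only in how much work each character costs, which is the effect the numbers above are attributed to.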

However, I can combine all 4 cases (ASCII plus the 3 byte kinds) and hide their for loops behind a macro.
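
One way the macro idea could look is sketched below, outside the CPython API. The macro `DEFINE_ALL_MATCH`, the generated function names, the dispatcher `toy_str_isalnum` and both toy predicates are hypothetical; `toy_wide_isalnum` accepts only two hard-coded non-ASCII code points so that the wide paths have something to check. The real patch would use the CPython macros and predicates instead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cheap predicate for the ASCII fast path. */
static bool toy_ascii_isalnum(uint32_t ch) {
    return (ch >= '0' && ch <= '9') ||
           (ch >= 'a' && ch <= 'z') ||
           (ch >= 'A' && ch <= 'Z');
}

/* Hypothetical "full" predicate: ASCII alnum plus U+00E9 and U+03B1. */
static bool toy_wide_isalnum(uint32_t ch) {
    return toy_ascii_isalnum(ch) || ch == 0x00E9 || ch == 0x03B1;
}

/* Stamp out one fixed-width loop per (element type, predicate) pair. */
#define DEFINE_ALL_MATCH(NAME, TYPE, PRED)                   \
    static bool NAME(const void *data, size_t len) {         \
        const TYPE *p = (const TYPE *)data;                  \
        for (size_t i = 0; i < len; i++) {                   \
            if (!PRED(p[i])) return false;                   \
        }                                                    \
        return true;                                         \
    }

DEFINE_ALL_MATCH(all_alnum_ascii, uint8_t,  toy_ascii_isalnum)
DEFINE_ALL_MATCH(all_alnum_1byte, uint8_t,  toy_wide_isalnum)
DEFINE_ALL_MATCH(all_alnum_2byte, uint16_t, toy_wide_isalnum)
DEFINE_ALL_MATCH(all_alnum_4byte, uint32_t, toy_wide_isalnum)

/* Dispatch once per string, then run the matching loop. */
static bool toy_str_isalnum(bool is_ascii, int kind,
                            const void *data, size_t len) {
    assert(data != NULL || len == 0);
    if (len == 0) return false;  /* empty string is not alphanumeric */
    if (is_ascii) return all_alnum_ascii(data, len);
    switch (kind) {
    case 1:  return all_alnum_1byte(data, len);
    case 2:  return all_alnum_2byte(data, len);
    default: return all_alnum_4byte(data, len);
    }
}
```

The ASCII case keeps its own cheap predicate, so the fast path measured above is preserved while the three wider loops share one definition.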

Contributor Author

Judging from these numbers, checking any non-trivial string with Py_UNICODE_ISALNUM is already on par with or worse than CPython.

@VaggelisD
Contributor Author

This might be of interest: tobymao/sqlglot#7120
