[mypyc] Add str.isalnum() primitive#20852
Conversation
| * ``s1.find(s2: str)`` | ||
| * ``s1.find(s2: str, start: int)`` | ||
| * ``s1.find(s2: str, start: int, end: int)`` | ||
| * ``s.isspace()`` |
There was a problem hiding this comment.
This was not documented in the str.isspace() PR, added it now
|
|
||
| int kind = PyUnicode_KIND(str); | ||
| const void *data = PyUnicode_DATA(str); | ||
| for (Py_ssize_t i = 0; i < len; i++) { |
There was a problem hiding this comment.
Performance might increase if there was a separate loop for 2 byte and 4 byte kinds. This way the read operation wouldn't need to branch based on kind, which might result in better code. Can you try this out?
There was a problem hiding this comment.
I tried it locally, it only slightly reduced the tail end (still 13+ seconds for the 2 byte 100 length one) so we'd still spot a significant regression for the larger strings.
I also tried calling PyObject_CallMethodNoArgs for larger strings in case we can fallback to the interpreter function but it doesn't make a difference; If it's the LTO/PGO inlining doing its magic we can't seem to reuse it at this point.
What is the preferred action here, do we still keep the primitive in the hopes that most strings are small or should mypyc always be at least on par or better than CPython?
There was a problem hiding this comment.
This still looks better than CPython on average, as ASCII strings and short strings are common. To match CPython performance we might need to have a custom implementation of Py_UNICODE_ISALNUM, which doesn't seem worth it. I'll experiment with this a little, but this might be close to as good as we can easily achieve.
There was a problem hiding this comment.
Sounds good! I also wondered what it'd take to mirror PY_UNICODE_ISALNUM, Claude suggested against it as CPython is using gettyperecord() at each internal function call which operates on its internal unicode database (supposedly, hard to replicate)
| Py_ssize_t len = PyUnicode_GET_LENGTH(str); | ||
| if (len == 0) return false; | ||
|
|
||
| if (PyUnicode_IS_ASCII(str)) { |
There was a problem hiding this comment.
Could this be PyUnicode_KIND(obj) == PyUnicode_1BYTE_KIND instead? This would be needed if the loop below was split into dedicated 2/4 byte loops.
There was a problem hiding this comment.
Switching from ASCII path to the 1 byte puts a very big dent on performance, I assume because Py_ISALNUM operates off a lookup table whereas Py_UNICODE_ISALNUM has 4 separate function calls in it:
| All-alphanumeric | ASCII fast path | 1 Byte kind | Speedup |
|---|---|---|---|
length 1 ('a') |
0.623 | 0.873 | 1.40x |
length 10 ('abcde12345') |
1.003 | 2.708 | 2.70x |
length 100 ('a' * 100) |
3.139 | 14.147 | 4.51x |
| Non-alphanumeric (early exit) | ASCII fast path | 1 Byte kind | Speedup |
|---|---|---|---|
length 1 (' ') |
0.609 | 1.118 | 1.84x |
length 100 ('!' * 100) |
0.617 | 1.126 | 1.82x |
length 100 ('a' * 99 + '!') |
3.322 | 14.802 | 4.46x |
However, I can combine all 4 cases (ASCII plus 3 byte kinds) and hide their for loops behind a macro
There was a problem hiding this comment.
Judging from these numbers, it seems like checking any non-trivial string with Py_UNICODE_ISALNUM is already on par or worse than CPython
|
This might be of interest: tobymao/sqlglot#7120 |
Added
str.isalnum()similar tostr.isspace().One interesting thing to point out here is that the benchmarks decline in speed relative to the string's length:
'a')'abcde12345')'a' * 100)é)' ')'!' * 100)'a' * 99 + '!')Not entirely sure how to interpret this but could it be because the Py_UNICODE_ISALNUM calls 4 functions internally which is more optimized in CPython due to PGO & LTO (?)