Fix ExifTool test failures: regex, eval, and bytecode interpreter bugs#250
Open
Fix ExifTool test failures: regex, eval, and bytecode interpreter bugs#250
Conversation
In Perl, \1-\7 inside character classes are always octal escapes, not backreferences. Java regex requires a leading zero for octal sequences. The preprocessor was passing single/double digit octals through as-is, causing Java Pattern.compile to reject them. Fixes 11 ExifTool test failures caused by MakerNoteHP2 condition regex [\0-\4]. Generated with [Devin](https://cli.devin.ai/docs) Co-Authored-By: Devin <noreply@cognition.ai>
Two bugs fixed: 1. \G regex anchor was never active: RuntimeRegex had a hardcoded `private final boolean useGAssertion = false` field that shadowed the actual flag from RegexFlags. Now useGAssertion is set from regexFlags.useGAssertion() after compilation and in all copy paths. 2. grep in scalar context always returned 1 in bytecode interpreter: ListOperators.grep() already wraps the count in scalar context, but BytecodeInterpreter was re-wrapping with elements.size() which is always 1. Removed the double-wrapping to match the MAP handler. Generated with [Devin](https://cli.devin.ai/docs) Co-Authored-By: Devin <noreply@cognition.ai>
ScalarSpecialVariable (regex captures like $1, $&) computes its value dynamically via getValueAsScalar(), but its type/value fields stay as UNDEF/null. The RuntimeScalar copy constructor and set() method directly accessed these fields, resulting in UNDEF when $1 was used in array ref construction ([$1] -> empty), hash construction, or any copy context. Added instanceof checks to resolve ScalarSpecialVariable before copying. Fixes RTF Unicode parsing (Recompose([$2])) and MIFF metadata. Generated with [Devin](https://cli.devin.ai/docs) Co-Authored-By: Devin <noreply@cognition.ai>
In Perl, pos() only affects /g matches. Non-/g matches always start from position 0 regardless of pos(). PerlOnJava was incorrectly using pos() for all matches, causing regex failures when a prior /g match had set pos() on a string. This fixed ExifTool Text_5 test where IsUTF8() used /g matches on a string, leaving pos() set, and a subsequent non-/g match on the same string via $$dataPt !~ /[\x80-\x9f]/ would incorrectly start from the pos() position instead of 0. Generated with [Devin](https://cli.devin.ai/docs) Co-Authored-By: Devin <noreply@cognition.ai>
- Fix not() operator for ScalarSpecialVariable ($1, $2, etc.) which always returned true because it checked the type field (always UNDEF) instead of delegating to getBoolean(). This caused DICOM parsing to incorrectly treat explicit VR as implicit VR. - Fix Digest::MD5 and Digest::SHA add() methods to use ISO_8859_1 instead of UTF_8 for byte string conversion, matching the addfile() methods. Fixes wrong IPTC digest computation for data with bytes >= 0x80. - Fix lengthBytes (use bytes; length) to return character count for BYTE_STRING type instead of UTF-8 byte count. Bytes >= 0x80 were being double-counted as 2 UTF-8 bytes. - Fix string concatenation to preserve BYTE_STRING type when either operand is BYTE_STRING and all characters are <= 255. Previously, concatenating a BYTE_STRING with a STRING would always upgrade to STRING, causing downstream binary data corruption. - Fix unpack float/double handlers to use code point path for UTF-8 flagged data, matching the existing short/long handler pattern. Bytes >= 0x80 were being UTF-8 expanded before float interpretation, corrupting values like 1.0 into -1.78e-38. Fixes ExifTool tests: DICOM_2, PDF_2,3, PhotoMechanic_2 Generated with [Devin](https://cli.devin.ai/docs) Co-Authored-By: Devin <noreply@cognition.ai>
The is_utf8() function was checking type == BYTE_STRING when it should check type != BYTE_STRING. In Perl 5, is_utf8() returns true when the string has the internal UTF-8 flag (character string), and false for byte strings. PerlOnJava BYTE_STRING type represents byte strings (no UTF-8 flag), so is_utf8 should return false for them. This caused ExifTool to incorrectly UTF-8 encode binary image data when reading from scalar references, because it uses is_utf8() to decide whether to re-encode the data. Fixes ExifTool tests: PPM_3, GIF_2, IPTC_4, IPTC_5 Generated with [Devin](https://cli.devin.ai/docs) Co-Authored-By: Devin <noreply@cognition.ai>
…re, substr lvalue - joinInternal: use OR logic for BYTE_STRING (any byte element makes result bytes) - stringConcat/stringConcatWarnUninitialized: preserve BYTE_STRING in fallthrough when one operand is UNDEF/INTEGER - ParseMapGrepSort: detect sort SUBNAME LIST pattern before trying block parse - EmitForeach: save/restore pre-existing lexical variables used as loop variables - RuntimeSubstrLvalue.set(): preserve BYTE_STRING type after substr assignment Generated with [Devin](https://cli.devin.ai/docs) Co-Authored-By: Devin <noreply@cognition.ai>
tryWholeBlockRefactoring wraps large blocks in anonymous subs, breaking return semantics (return exits the anon sub, not the enclosing function). The interpreter fallback handles "Method too large" safely. Also adds return detection to ControlFlowDetectorVisitor as defense-in-depth, implements goto LABEL in bytecode compiler, and caches overload method resolution. Generated with [Devin](https://cli.devin.ai/docs) Co-Authored-By: Devin <noreply@cognition.ai>
HASH_SET was casting value register to RuntimeScalar without checking type, causing ClassCastException when a RuntimeList was in the register. NOT opcode similarly hardened. Also improved bytecode context display to show full 16-bit opcode values. Generated with [Devin](https://cli.devin.ai/docs) Co-Authored-By: Devin <noreply@cognition.ai>
…opcode Three interpreter bugs fixed plus a runtime fix: 1. undef $scalar in interpreter: was creating a new undef value and ignoring the operand. Now emits UNDEFINE_SCALAR opcode that calls .undefine() on the variable register in-place. 2. HASH_SET opcode stored values directly in hash, allowing RuntimeScalarReadOnly cached integers to be stored. Later += modifications failed with read-only error. Now wraps values in new RuntimeScalar() to ensure mutability. 3. print/say/printf return value: was returning byte count instead of 1 on success (Perl spec). Fixed in CustomFileChannel and PipeOutputChannel write() methods. 4. MY_SCALAR superinstruction (opcode 352): combines LOAD_UNDEF + SET_SCALAR for lexical scalar assignment. Prevents aliasing to read-only cached values. Generated with [Devin](https://cli.devin.ai/docs) Co-Authored-By: Devin <noreply@cognition.ai>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes multiple bugs discovered while running the ExifTool test suite, improving pass rate from ~78% to 93% (294/316 individual tests, 102/113 test files fully passing).
Relates to #93
Bugs Fixed
Fix
\Gregex anchor —RuntimeRegex.useGAssertionwas hardcoded tofalse(private final boolean), so the\Gassertion was never enforced. Fixed by making it mutable and setting fromregexFlags.useGAssertion(). This fixed XISF/XMP parsing and RTF title extraction.Fix
grepscalar context in bytecode interpreter —ListOperators.grep()already wraps the match count in scalar context, butBytecodeInterpreterwas re-counting the wrapper list (always returning 1). Removed the double-wrapping.Fix regex octal escapes inside character classes —
[\0-\4]was passing single-digit octals as-is, but Java regex requires\04. Prepended leading zero for short octals in character classes.Fix eval STRING leaking captured variable aliases — eval STRING pre-compilation created global aliases for captured
myvariables but never cleaned them up, causing variables to not re-initialize in loops.Dynamic scoping for regex match variables —
$1,$&,@-,@+etc. now properly save/restore acrossevaland subroutine boundaries.Fix
sysread/readtype corruption on tied scalars — byte-string type was being lost.Fix
stat _/lstat _— cached stat buffer now works correctly with proper context handling.Fix scalar
localtime/gmtime— use Perl ctime format; fix for large epoch values.Fix list slice with range indices —
(expr)[0..2]now works correctly.Fix bytecode compiler my-extraction — short-circuit ops and AUTOLOAD with forward declarations.
Other fixes: BYTE_STRING type preservation,
(@)prototype parsing, signature error messages, Time::Local update to 1.35.Test plan
makepasses (all unit tests)Generated with Devin