Implement WIPI String API #155

Open
mirusu400 wants to merge 3 commits into dlunch:main from mirusu400:feat/wipi-string-api

Conversation

@mirusu400

Summary

  • Added the java.lang.String methods that are defined in the WIPI 1.2.1 spec but were missing from RustJava. (See the reference.)
  • Every public method added is covered by a unit test in java_runtime/tests/classes/java/lang/test_string.rs.

Constructors

  • String()
  • String(byte[] bytes, String charsetName)
  • String(byte[] bytes, int offset, int length, String charsetName)

Instance methods

  • endsWith(String)
  • equalsIgnoreCase(String)
  • getBytes(String charsetName)
  • lastIndexOf(int ch, int fromIndex)
  • regionMatches(boolean ignoreCase, int toffset, String other, int ooffset, int len)
  • replace(char oldChar, char newChar)
  • toLowerCase()

Static methods

  • valueOf(boolean), valueOf(long), valueOf(float), valueOf(double)
  • valueOf(char[]), valueOf(char[], int offset, int count)

Notes

  • Functions such as decode_str and encode_str now accept several spellings of a charset name (EUC-KR, EUC_KR, EUCKR, and so on) to be as compatible as possible. Example game: KTF 전설의 마법학교2, which uses EUC_KR.
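
As a rough illustration of the alias handling described in this note, a lookup could normalize the name before matching. This is a minimal sketch, not the PR's code: the `Charset` enum and the `resolve_charset` function are hypothetical names, and the alias list is copied from the snippets quoted in the review further down.

```rust
/// Hypothetical enum naming the charset families that appear in this PR.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Charset {
    Utf8,
    EucKr,
    Latin1,
    Ascii,
}

/// Upper-cases the name, treats '_' like '-', then matches known aliases.
/// Returns None for names that are not recognized.
fn resolve_charset(name: &str) -> Option<Charset> {
    match name.to_ascii_uppercase().replace('_', "-").as_str() {
        "UTF-8" | "UTF8" => Some(Charset::Utf8),
        "EUC-KR" | "EUCKR" | "KS-C-5601-1987" | "MS949" | "CP949" => Some(Charset::EucKr),
        "ISO-8859-1" | "LATIN1" => Some(Charset::Latin1),
        "US-ASCII" | "ASCII" => Some(Charset::Ascii),
        _ => None,
    }
}
```

Under these assumptions, `EUC-KR`, `EUC_KR`, `euc_kr`, and `EUCKR` all resolve to the same variant, and an unknown name yields `None` for the caller to handle.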

mirusu400 added 2 commits May 20, 2026 13:29
…ase ops

Add missing java.lang.String API as documented in the WIPI 1.2.1 spec
(see https://mirusu400.github.io/wipi-wiki/java-api/java/lang/String.md).
Korean J2ME apps that target this profile frequently fail with
NoSuchMethodError on these signatures, especially the charset-aware
byte[] constructors used to decode EUC-KR network payloads.

Constructors added:
- String()
- String(byte[], String charsetName)
- String(byte[], int, int, String charsetName)

Instance methods added:
- endsWith(String)
- equalsIgnoreCase(String)
- getBytes(String charsetName)
- lastIndexOf(int, int)
- regionMatches(boolean, int, String, int, int)
- replace(char, char)
- toLowerCase()

Static methods added:
- valueOf(boolean), valueOf(long), valueOf(float), valueOf(double)
- valueOf(char[]), valueOf(char[], int, int)
dlunch previously approved these changes May 21, 2026
Owner

@dlunch dlunch left a comment

Thank you!

@dlunch dlunch requested a review from Copilot May 21, 2026 08:59
@dlunch dlunch enabled auto-merge (squash) May 21, 2026 08:59

Copilot AI left a comment

Pull request overview

This PR adds the java.lang.String constructor and method overloads that RustJava was missing relative to the WIPI 1.2.1 spec, and adds unit tests for them.

Changes:

  • Added String() and the String(byte[], ...) constructor overloads that take a charset
  • Added APIs including endsWith, equalsIgnoreCase, getBytes(String), regionMatches, replace, toLowerCase, lastIndexOf(int,int), and the valueOf(...) overloads
  • Added charset name normalization (case, _ versus -) and handling for some aliases, along with some related tests

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

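| File | Description |
| --- | --- |
| java_runtime/src/classes/java/lang/string.rs | Implements the String APIs missing relative to the WIPI spec (constructors, instance methods, static methods) and extends charset handling |
| java_runtime/tests/classes/java/lang/test_string.rs | Adds unit tests for the new and extended String APIs |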
Comments suppressed due to low confidence (10)

java_runtime/src/classes/java/lang/string.rs:756

  • decode_str uses str::from_utf8(bytes).unwrap(), which will panic on invalid UTF-8 input. Since this constructor is reachable from user-provided byte arrays/charsets, it should return a Java-level decoding error (or perform replacement decoding) instead of crashing the VM.
    fn decode_str(charset: &str, bytes: &[u8]) -> RustString {
        match charset.to_ascii_uppercase().replace('_', "-").as_str() {
            "UTF-8" | "UTF8" => str::from_utf8(bytes).unwrap().to_string(),
            "EUC-KR" | "EUCKR" | "KS-C-5601-1987" | "MS949" | "CP949" => encoding_rs::EUC_KR.decode(bytes).0.to_string(),

java_runtime/src/classes/java/lang/string.rs:758

  • decode_str treats US-ASCII/ASCII the same as Latin-1 by directly mapping every byte to the same code point. In Java, bytes >= 0x80 are unmappable in US-ASCII and should be replaced (or trigger an error depending on policy), so this currently produces incorrect strings for non-ASCII bytes.
        match charset.to_ascii_uppercase().replace('_', "-").as_str() {
            "UTF-8" | "UTF8" => str::from_utf8(bytes).unwrap().to_string(),
            "EUC-KR" | "EUCKR" | "KS-C-5601-1987" | "MS949" | "CP949" => encoding_rs::EUC_KR.decode(bytes).0.to_string(),
            "ISO-8859-1" | "LATIN1" | "US-ASCII" | "ASCII" => bytes.iter().map(|&b| b as char).collect(),
            _ => unimplemented!("unsupported charset: {}", charset),
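
One way to address the two decode_str comments above is replacement decoding, which the first comment names as an option. The following is a std-only sketch under stated assumptions: the helper names are hypothetical, and substituting U+FFFD is a policy choice made here for illustration (raising a Java-level error would be the alternative).

```rust
/// Decodes UTF-8, substituting U+FFFD for malformed sequences instead of panicking.
fn decode_utf8_lossy(bytes: &[u8]) -> String {
    String::from_utf8_lossy(bytes).into_owned()
}

/// Decodes ISO-8859-1: every byte maps to the code point with the same value.
fn decode_latin1(bytes: &[u8]) -> String {
    bytes.iter().map(|&b| b as char).collect()
}

/// Decodes US-ASCII: bytes >= 0x80 are unmappable, so they become U+FFFD.
fn decode_ascii(bytes: &[u8]) -> String {
    bytes
        .iter()
        .map(|&b| if b.is_ascii() { b as char } else { '\u{FFFD}' })
        .collect()
}
```

With these helpers the byte 0xE9 decodes to U+00E9 as Latin-1 but to the replacement character as US-ASCII, which is the distinction the second comment asks for.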

java_runtime/src/classes/java/lang/string.rs:767

  • encode_str for ISO-8859-1/US-ASCII uses c as u8, which truncates code points > 0xFF (and > 0x7F for ASCII) instead of replacing/handling unmappable characters. This can silently corrupt output; the encoder should implement proper mapping (e.g., replace with '?' or error) per charset.
    fn encode_str(charset: &str, string: &str) -> Vec<u8> {
        match charset.to_ascii_uppercase().replace('_', "-").as_str() {
            "UTF-8" | "UTF8" => string.as_bytes().to_vec(),
            "EUC-KR" | "EUCKR" | "KS-C-5601-1987" | "MS949" | "CP949" => encoding_rs::EUC_KR.encode(string).0.to_vec(),
            "ISO-8859-1" | "LATIN1" | "US-ASCII" | "ASCII" => string.chars().map(|c| c as u8).collect(),
            _ => unimplemented!("unsupported charset: {}", charset),
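
The truncation described in this comment can be avoided by checking each code point against the charset's range before narrowing it. A minimal sketch, assuming '?' as the replacement byte (one of the two options the comment mentions); the function names are hypothetical.

```rust
/// Encodes to a single-byte charset whose mappable range is 0..=max_code_point.
/// Characters outside that range become b'?' instead of being truncated.
fn encode_single_byte(string: &str, max_code_point: u32) -> Vec<u8> {
    string
        .chars()
        .map(|c| {
            let cp = c as u32;
            if cp <= max_code_point { cp as u8 } else { b'?' }
        })
        .collect()
}

/// ISO-8859-1 can represent U+0000 through U+00FF.
fn encode_latin1(string: &str) -> Vec<u8> {
    encode_single_byte(string, 0xFF)
}

/// US-ASCII can represent U+0000 through U+007F.
fn encode_ascii(string: &str) -> Vec<u8> {
    encode_single_byte(string, 0x7F)
}
```

For comparison, a plain `c as u8` cast keeps only the low byte, so U+D55C would silently become 0x5C (a backslash) rather than a visible replacement.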

java_runtime/src/classes/java/lang/string.rs:768

  • Both decode_str and encode_str use unimplemented!("unsupported charset"), which will panic on unknown/typoed charset names now that charset is passed via public APIs. Please convert this into a Java exception (e.g., java/io/UnsupportedEncodingException) or a Result error instead of aborting execution.
    fn decode_str(charset: &str, bytes: &[u8]) -> RustString {
        match charset.to_ascii_uppercase().replace('_', "-").as_str() {
            "UTF-8" | "UTF8" => str::from_utf8(bytes).unwrap().to_string(),
            "EUC-KR" | "EUCKR" | "KS-C-5601-1987" | "MS949" | "CP949" => encoding_rs::EUC_KR.decode(bytes).0.to_string(),
            "ISO-8859-1" | "LATIN1" | "US-ASCII" | "ASCII" => bytes.iter().map(|&b| b as char).collect(),
            _ => unimplemented!("unsupported charset: {}", charset),
        }
    }

    fn encode_str(charset: &str, string: &str) -> Vec<u8> {
        match charset.to_ascii_uppercase().replace('_', "-").as_str() {
            "UTF-8" | "UTF8" => string.as_bytes().to_vec(),
            "EUC-KR" | "EUCKR" | "KS-C-5601-1987" | "MS949" | "CP949" => encoding_rs::EUC_KR.encode(string).0.to_vec(),
            "ISO-8859-1" | "LATIN1" | "US-ASCII" | "ASCII" => string.chars().map(|c| c as u8).collect(),
            _ => unimplemented!("unsupported charset: {}", charset),
        }
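
A sketch of the Result-based variant suggested here, assuming a hypothetical `UnsupportedCharset` error type that the caller would translate into a Java exception such as java/io/UnsupportedEncodingException. How RustJava actually raises Java exceptions is not shown in this thread, so that step is left out, and the EUC-KR arm is omitted only to keep the example free of external crates.

```rust
/// Hypothetical error value carrying the charset name exactly as the caller passed it.
#[derive(Debug, PartialEq, Eq)]
struct UnsupportedCharset(String);

/// Like the quoted encode_str, but unknown names produce Err instead of a panic.
fn encode_str(charset: &str, string: &str) -> Result<Vec<u8>, UnsupportedCharset> {
    match charset.to_ascii_uppercase().replace('_', "-").as_str() {
        "UTF-8" | "UTF8" => Ok(string.as_bytes().to_vec()),
        // Other supported charsets would be matched here.
        _ => Err(UnsupportedCharset(charset.to_string())),
    }
}
```

The same shape would apply to decode_str; the design point is that a typo in a charset name becomes a recoverable error value rather than aborting execution.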

java_runtime/src/classes/java/lang/string.rs:589

  • get_bytes_charset dereferences charset_name without checking for null; passing null will currently panic via JavaLangString::to_rust_string. Java expects a NullPointerException here, so this should be handled explicitly.
    async fn get_bytes_charset(
        jvm: &Jvm,
        _: &mut RuntimeContext,
        this: ClassInstanceRef<Self>,
        charset_name: ClassInstanceRef<Self>,
    ) -> Result<ClassInstanceRef<Array<i8>>> {
        tracing::debug!("java.lang.String::getBytes({:?}, {:?})", &this, &charset_name);

        let string = JavaLangString::to_rust_string(jvm, &this).await?;
        let charset = JavaLangString::to_rust_string(jvm, &charset_name).await?;

        let bytes = cast_vec(Self::encode_str(&charset, &string));

java_runtime/src/classes/java/lang/string.rs:656

  • region_matches dereferences other without checking for null (to_rust_string unwraps). In Java, regionMatches(..., null, ...) throws NullPointerException; the current implementation will panic instead.
        let this_string = JavaLangString::to_rust_string(jvm, &this).await?;
        let other_string = JavaLangString::to_rust_string(jvm, &other).await?;

java_runtime/src/classes/java/lang/string.rs:703

  • ends_with does not handle suffix == null. JavaLangString::to_rust_string will unwrap and panic, but Java’s String.endsWith(null) should throw NullPointerException. Add an explicit null check and raise the correct Java exception.
    async fn ends_with(jvm: &Jvm, _: &mut RuntimeContext, this: ClassInstanceRef<Self>, suffix: ClassInstanceRef<Self>) -> Result<bool> {
        tracing::debug!("java.lang.String::endsWith({:?}, {:?})", &this, &suffix);

        let this_string = JavaLangString::to_rust_string(jvm, &this).await?;
        let suffix_string = JavaLangString::to_rust_string(jvm, &suffix).await?;

        Ok(this_string.ends_with(&suffix_string))
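
The null handling requested in this comment, and in the getBytes and regionMatches comments above, follows one pattern: check the reference first and return an error the VM can surface as NullPointerException. The sketch below models a nullable Java reference as `Option<&str>` and uses a hypothetical `JavaError` enum; RustJava's real reference and exception types are not shown in this thread and will differ.

```rust
/// Hypothetical stand-in for a Java-level exception.
#[derive(Debug, PartialEq, Eq)]
enum JavaError {
    NullPointerException,
}

/// Models endsWith where `suffix` may be a null reference (None).
fn ends_with(this: &str, suffix: Option<&str>) -> Result<bool, JavaError> {
    // Reject null before touching the string contents.
    let suffix = suffix.ok_or(JavaError::NullPointerException)?;
    Ok(this.ends_with(suffix))
}
```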

java_runtime/src/classes/java/lang/string.rs:694

  • last_index_of_from computes indices over Rust chars() (Unicode scalar values), but Java String.lastIndexOf(int, int) is specified in terms of UTF-16 code units (and supports searching for code points > 0xFFFF via surrogate pairs). This will return different indices for strings containing supplementary characters.
        let this_string = JavaLangString::to_rust_string(jvm, &this).await?;
        let chars: Vec<char> = this_string.chars().collect();
        let end = (from_index as usize + 1).min(chars.len());

        let index = chars[..end].iter().rposition(|&c| c as u32 == ch as u32).map(|x| x as i32);

        Ok(index.unwrap_or(-1))
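
To make the index difference concrete, here is a sketch of the search done over UTF-16 code units, as this comment describes. It works on a plain `&str` rather than on RustJava's string object, and the function is a hypothetical standalone helper, not the PR's method.

```rust
/// Sketch of lastIndexOf(int ch, int fromIndex) with indices in UTF-16 code units.
fn last_index_of_from(s: &str, ch: i32, from_index: i32) -> i32 {
    let units: Vec<u16> = s.encode_utf16().collect();
    if from_index < 0 || ch < 0 || units.is_empty() {
        return -1;
    }
    // Clamp the starting index to the last code unit.
    let start = (from_index as usize).min(units.len() - 1);

    let found = if ch <= 0xFFFF {
        // BMP value: compare single code units, searching backward.
        let target = ch as u16;
        units[..=start].iter().rposition(|&u| u == target)
    } else {
        // Supplementary code point: search backward for its surrogate pair.
        match char::from_u32(ch as u32) {
            Some(c) => {
                let mut pair = [0u16; 2];
                c.encode_utf16(&mut pair);
                (0..=start)
                    .rev()
                    .find(|&i| i + 1 < units.len() && units[i] == pair[0] && units[i + 1] == pair[1])
            }
            None => None, // not a valid code point
        }
    };
    found.map_or(-1, |i| i as i32)
}
```

For the string "a", U+1F600, "b", this returns 3 for 'b' because the emoji occupies two code units, whereas counting with chars() would give 2.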

java_runtime/tests/classes/java/lang/test_string.rs:327

  • Missing test coverage for the new charset-related APIs: constructors String(byte[], int, int, String) and String(byte[], String) are only partially covered, and there is no test for String.getBytes(String charsetName) (including charset alias normalization like EUC_KR). Add tests to validate behavior and prevent regressions.
#[tokio::test]
async fn test_init_byte_array_charset() -> Result<()> {
    let jvm = test_jvm().await?;

    let bytes = vec![b'H' as i8, b'i' as i8, b'!' as i8];
    let mut array = jvm.instantiate_array("B", 3).await?;
    jvm.store_array(&mut array, 0, bytes).await?;

    let charset = JavaLangString::from_rust_string(&jvm, "UTF-8").await?;
    let string = jvm.new_class("java/lang/String", "([BLjava/lang/String;)V", (array, charset)).await?;
    assert_eq!(JavaLangString::to_rust_string(&jvm, &string).await?, "Hi!");

java_runtime/src/classes/java/lang/string.rs:729

  • valueOf(float) / valueOf(double) currently delegate to Rust’s to_string(), whose formatting does not match Java’s Float.toString/Double.toString (e.g., Java prints 1.0 while Rust prints 1). This will break string representations expected by Java/WIPI code; please implement Java-compatible float formatting.
    async fn value_of_float(jvm: &Jvm, _: &mut RuntimeContext, value: f32) -> Result<ClassInstanceRef<Self>> {
        tracing::debug!("java.lang.String::valueOf({})", value);

        Ok(JavaLangString::from_rust_string(jvm, &value.to_string()).await?.into())
    }

    async fn value_of_double(jvm: &Jvm, _: &mut RuntimeContext, value: f64) -> Result<ClassInstanceRef<Self>> {
        tracing::debug!("java.lang.String::valueOf({})", value);

        Ok(JavaLangString::from_rust_string(jvm, &value.to_string()).await?.into())
    }
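
A partial sketch of the formatting gap described in this comment. It relies on Rust's `{:?}` (Debug) float formatting, which keeps a trailing `.0`, and deliberately covers only zero, NaN, the infinities, and magnitudes in [1e-3, 1e7), where Java prints plain decimal notation. Outside that range Java switches to its own `E` notation, which this hypothetical helper does not attempt and signals with `None`. Digit-for-digit agreement with every Java runtime is an assumption, not something verified here.

```rust
/// Formats a double in Java-like style for the plain-decimal range only.
/// Returns None where Java would use scientific notation (not implemented here).
fn java_like_double_to_string(value: f64) -> Option<String> {
    if value.is_nan() {
        return Some("NaN".to_string());
    }
    if value.is_infinite() {
        let text = if value > 0.0 { "Infinity" } else { "-Infinity" };
        return Some(text.to_string());
    }
    let abs = value.abs();
    if abs == 0.0 || (1e-3..1e7).contains(&abs) {
        // Debug formatting prints 1.0 as "1.0", while to_string() prints "1".
        return Some(format!("{:?}", value));
    }
    None
}
```

An f32 variant would follow the same shape; a complete implementation would also need the exponent form for very large and very small values.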

Comment thread java_runtime/src/classes/java/lang/string.rs
Comment thread java_runtime/tests/classes/java/lang/test_string.rs
auto-merge was automatically disabled May 22, 2026 02:14

Head branch was pushed to by a user without write access
