Set MC_STORE_MEMCPY=0 by default when the protocol is TCP #104

Merged: yubofredwang merged 2 commits into lightseekorg:main from lengrongfu:fix/moonkec-tcp on May 22, 2026

@lengrongfu
Contributor

The current Mooncake version has a bug on TCP-only hosts that causes a segmentation fault. Set MC_STORE_MEMCPY=0 until kvcache-ai/Mooncake#1986 is fixed.

Although the documentation already gives this hint, we can improve usability by setting the environment variable by default in the code when the protocol is TCP.

If a user overlooks the guidance in the documentation, it is extremely difficult to pinpoint the root cause from the error message alone:

(raylet) A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. Lease ID: 0b00000043c351dfa676154eb6e2d2d83b12cc84d4b0dd6f7c95895706e201b2 Worker ID: e80b5521d830d80614c25bcf5227a307abb4372b9112056245654785 Node ID: 881e316ecdb052ae3a7b7405e2cab0cc775044e5492e1473e235f653 Worker IP address: 10.233.74.188 Worker port: 35501 Worker PID: 887276 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. Some common causes include: (1) the process was killed by the OOM killer due to high memory usage, (2) ray stop --force was called, or (3) the worker crashed unexpectedly due to SIGSEGV or another unexpected error.
[2026-05-18 03:12:01,209] loop.py:130 INFO Shutting down 1 inference engine(s)...
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/home/jovyan/code/TorchSpec/torchspec/train_entry.py", line 416, in <module>
    train_async_no_generation(args)
  File "/home/jovyan/code/TorchSpec/torchspec/train_entry.py", line 403, in train_async_no_generation
    run_training_loop(
  File "/home/jovyan/code/TorchSpec/torchspec/controller/loop.py", line 470, in run_training_loop
    return training_loop(
           ^^^^^^^^^^^^^^
  File "/home/jovyan/code/TorchSpec/torchspec/controller/loop.py", line 314, in training_loop
    train_results = ray.get(train_futures)
                    ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jovyan/code/TorchSpec/.venv/lib/python3.11/site-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/jovyan/code/TorchSpec/.venv/lib/python3.11/site-packages/ray/_private/client_mode_hook.py", line 107, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/jovyan/code/TorchSpec/.venv/lib/python3.11/site-packages/ray/_private/worker.py", line 2980, in get
    values, debugger_breakpoint = worker.get_objects(
                                  ^^^^^^^^^^^^^^^^^^^
  File "/home/jovyan/code/TorchSpec/.venv/lib/python3.11/site-packages/ray/_private/worker.py", line 1025, in get_objects
    raise value
ray.exceptions.ActorDiedError: The actor died unexpectedly before finishing this task.
        class_name: TrainerActor
        actor_id: 3b496743efbcb5868b136aba01000000
        pid: 887276
        namespace: 4b085ef6-2e8d-467e-b0b2-edf27331584d
        ip: 10.233.74.188
The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. Some common causes include: (1) the process was killed by the OOM killer due to high memory usage, (2) ray stop --force was called, or (3) the worker crashed unexpectedly due to SIGSEGV or another unexpected error.
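
For illustration, the default described above could look roughly like the sketch below. This is not the code merged in this PR: the function name `apply_tcp_memcpy_default` and its parameters are hypothetical, and the sketch assumes the variable only needs to be present in the process environment before the Mooncake store is initialized. Using `setdefault` so that an explicit user setting wins is a design choice of this sketch, not something the PR description states.

```python
import os
from typing import MutableMapping, Optional


def apply_tcp_memcpy_default(
    protocol: str,
    env: Optional[MutableMapping[str, str]] = None,
) -> Optional[str]:
    """Default MC_STORE_MEMCPY to "0" when the transfer protocol is TCP.

    Hypothetical helper: the name and signature are assumptions made for
    this example and are not taken from torchspec/config/mooncake_config.py.
    Returns the value of MC_STORE_MEMCPY after the default is applied,
    or None if the variable is still unset.
    """
    target = os.environ if env is None else env
    if protocol.strip().lower() == "tcp":
        # setdefault leaves a value the user exported explicitly untouched.
        target.setdefault("MC_STORE_MEMCPY", "0")
    return target.get("MC_STORE_MEMCPY")
```

Passing a plain dictionary as `env` makes the behavior easy to check without touching the real environment; in actual use the call would have to happen in the same process that later initializes Mooncake, before that initialization runs.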

@chatgpt-codex-connector (bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c81cdf697d

Comment thread torchspec/config/mooncake_config.py Outdated
lengrongfu and others added 2 commits May 22, 2026 13:31
Signed-off-by: rongfu.leng <lenronfu@gmail.com>
Signed-off-by: Yubo Wang <yubowang2019@gmail.com>
@yubofredwang yubofredwang merged commit 244b638 into lightseekorg:main May 22, 2026
2 checks passed