memory_manager::alloc should return error on OOM #1475

@wks

TODO:

  • Make memory_manager::alloc and memory_manager::alloc_with_options return Result<Address, AllocError>.
    • Make them return an error when OOM happens, even at a safepoint.
  • Document that Collection::out_of_memory is expected to return.
  • Refactor the allocation code (specifically Allocator::alloc_once_inline) for the changed APIs mentioned above.

The JikesRVM legacy

Our memory allocation architecture is ported from JikesRVM. In JikesRVM, MemoryManager.allocateSpace never returns null, even in the case of OOM. When OOM happens, MMTk calls Collection.outOfMemory(). The VM implements this method by throwing an exception.

public class Collection extends org.mmtk.vm.Collection {
  @Override
  @UninterruptibleNoWarn
  public void outOfMemory() {
    throw RVMThread.getOutOfMemoryError();
  }
}

This works because JikesRVM's (JIT and AoT) compilers compile both VM and MMTk code using the same ABI. The uniform stack layout and the uniform stack unwinding mechanism allow an exception to be thrown from Collection.outOfMemory() into MMTk frames, and then into VM frames, and then all the way to application frames, where the exception can be caught and handled.

JikesRVM has always assumed that Collection.outOfMemory() never returns.

        if (failWithOOM) {
          VM.collection.outOfMemory();
          VM.assertions.fail("Not Reached"); // THIS IS UNREACHABLE!
          return Address.zero();
        }

There is no way to express in Java that a function never returns (like Rust's fn () -> !), so JikesRVM uses assertions. The return statement is there to make compilation successful.

It doesn't work in the Rust MMTk

In Rust MMTk, the VM binding cannot unwind the stack in Collection::out_of_memory, at least not in a portable way. When using the Rust MMTk, the VM, the MMTk core and the application code may have different ABIs. Take the OpenJDK binding for example.

  • The VM and the C++ part of the VM binding are implemented in C++ and are AoT compiled.
  • MMTk and the Rust part of the binding are implemented in Rust and are AoT compiled.
  • The application code is provided as bytecode, and is either interpreted or JIT compiled.

If Collection::out_of_memory were to throw an exception directly, it would throw from C++ code (VM and VM binding) through Rust code (VM binding and MMTk core) into JIT-compiled machine code. This crosses three languages. Rust doesn't have the concept of exceptions (although panic!() is implemented with some form of stack unwinding), and C++ exceptions and Java exceptions are implemented in different ways.

The only safe way to transfer control back to the application is returning frame by frame out of memory_manager::alloc.

Proposed API changes

Allocation

First of all, memory_manager::alloc and memory_manager::alloc_with_options shall be able to return error values.

pub fn memory_manager::alloc(...) -> Result<Address, AllocationFailure> { ... }
pub fn memory_manager::alloc_with_options(...) -> Result<Address, AllocationFailure> { ... }

pub enum AllocationFailure {
    /// Memory is exhausted.  The VM binding should raise an out-of-memory error to the application.
    OutOfMemory,
    /// The allocation is not at a safepoint, but it could not be satisfied without a GC.
    WouldBlock,
}

Currently, these are the two possible errors the VM binding could get.

The caller of alloc should match against the Result<Address, AllocationFailure> and handle errors accordingly. Specifically, if it is Err(AllocationFailure::OutOfMemory), the caller should raise an OOM exception.
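A minimal sketch of that match, assuming the proposed AllocationFailure enum. Address here is a stand-in for the MMTk type, and AllocOutcome and handle_alloc_result are hypothetical names used only for illustration. What a VM does on WouldBlock is not specified above, so the sketch merely reports it.

```rust
/// Stand-in for MMTk's `Address`; only its shape matters here.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Address(usize);

/// The error type proposed in this issue.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AllocationFailure {
    OutOfMemory,
    WouldBlock,
}

/// Hypothetical: what an interpreter does next after an allocation attempt.
#[derive(Debug, PartialEq, Eq)]
pub enum AllocOutcome {
    /// Continue executing with the new object.
    Allocated(Address),
    /// Raise the language-level out-of-memory exception.
    ThrowOutOfMemoryError,
    /// Assumption: the VM must reach a safepoint before it can allocate again.
    NeedsGcAtSafepoint,
}

/// Maps the proposed return value of `memory_manager::alloc` to a VM action.
pub fn handle_alloc_result(result: Result<Address, AllocationFailure>) -> AllocOutcome {
    match result {
        Ok(addr) => AllocOutcome::Allocated(addr),
        Err(AllocationFailure::OutOfMemory) => AllocOutcome::ThrowOutOfMemoryError,
        Err(AllocationFailure::WouldBlock) => AllocOutcome::NeedsGcAtSafepoint,
    }
}
```

Because the match is exhaustive, adding a variant to AllocationFailure later would force every caller written this way to handle it.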

The application code can be either interpreted or compiled. Handling errors in the interpreter is straightforward.

JIT-compiled code needs some tricks. First of all, the JIT-compiled code should use bump-pointer fast paths when possible. For the slow path, the VM binding is advised to wrap the raw memory_manager::alloc(...) -> Result<..., ...> into a function void* mmtk_alloc(...). It shall follow the C calling convention so that it is easy to emit code that calls from JIT-compiled code into the runtime. On success, it simply returns the pointer. On failure, there are two strategies.

  1. Return 0 to the JIT-compiled code, and generate a check instruction after each allocation slow path that branches to a code stub that throws OutOfMemoryError.
  2. Modify the return address before returning from void* mmtk_alloc(...) and use a return barrier to raise the exception.

Using a return barrier eliminates a check on the code path where the allocation is successful. But that is probably not important because this is the slow path.
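The first strategy could look roughly like the sketch below. The inner alloc function is a hypothetical stand-in for memory_manager::alloc with the proposed signature (it fakes a 1 KiB budget), and the sketch maps every failure to a null pointer, which is an assumption: a real binding would need a side channel to tell OutOfMemory from WouldBlock, and would also export the symbol under an unmangled name.

```rust
use std::ffi::c_void;
use std::ptr;

/// Stand-in for MMTk's `Address`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Address(usize);

/// The error type proposed in this issue.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AllocationFailure {
    OutOfMemory,
    WouldBlock,
}

/// Hypothetical stand-in for `memory_manager::alloc`: requests larger than
/// 1024 bytes fail with OOM, everything else "succeeds" at a fixed address.
fn alloc(size: usize) -> Result<Address, AllocationFailure> {
    if size <= 1024 {
        Ok(Address(0x1000))
    } else {
        Err(AllocationFailure::OutOfMemory)
    }
}

/// C-ABI slow-path entry point for JIT-compiled code.
/// Returns null on any failure; the JIT-emitted check after the call
/// branches to the stub that throws OutOfMemoryError.
pub extern "C" fn mmtk_alloc(size: usize) -> *mut c_void {
    match alloc(size) {
        Ok(Address(raw)) => raw as *mut c_void,
        Err(_) => ptr::null_mut(),
    }
}
```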

About the existing AllocationError

We already have the AllocationError type, which is currently used by Collection::out_of_memory.

pub enum AllocationError {
    /// The specified heap size is too small for the given program to continue.
    HeapOutOfMemory,
    /// The OS is unable to mmap or acquire more memory. Critical error. MMTk expects the VM to
    /// abort if such an error is thrown.
    MmapOutOfMemory,
}

AllocationError::HeapOutOfMemory is equivalent to the AllocationFailure::OutOfMemory I proposed.

AllocationError::MmapOutOfMemory, as the doc says, is a critical error and should result in immediate VM termination.

There is no equivalent to AllocationFailure::WouldBlock.

We should probably let AllocationError and AllocationFailure coexist because they are used by two different API functions and have different sets of values.

Collection::out_of_memory

We need to explicitly document that this function is expected to return.

Even though the VM binding cannot unwind the stack from within Collection::out_of_memory in a portable way, it still allows the VM binding to set thread-local states so that after returning from memory_manager::alloc, it can check the state and handle OOM errors. I (Kunshan) am not sure how useful this is, given that alloc is able to return Err(AllocationFailure::OutOfMemory), but we'd better keep it for a while just in case any VM actually needs that. For example, it can still panic fast when AllocationError::MmapOutOfMemory happens.
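The thread-local approach could be sketched as follows. All names here (PENDING_OOM, take_pending_oom) are hypothetical, and out_of_memory is a free-function stand-in for the binding's Collection::out_of_memory implementation.

```rust
use std::cell::Cell;

thread_local! {
    /// Hypothetical per-thread flag recording that MMTk reported OOM.
    static PENDING_OOM: Cell<bool> = Cell::new(false);
}

/// Stand-in for the binding's `Collection::out_of_memory`.
/// It records the condition and returns normally instead of unwinding.
pub fn out_of_memory() {
    PENDING_OOM.with(|flag| flag.set(true));
}

/// Called by the binding after `memory_manager::alloc` returns.
/// Clears the flag and reports whether an OOM was pending on this thread.
pub fn take_pending_oom() -> bool {
    PENDING_OOM.with(|flag| flag.replace(false))
}
```

The binding would call take_pending_oom() right after the allocation call returns and raise the exception from its own frames, where unwinding is under the VM's control.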

I am not sure if we should allow the VM binding to override Collection::out_of_memory and translate AllocationError::MmapOutOfMemory into an AllocationFailure::OutOfMemory to be returned from memory_manager::alloc. When mmap cannot allocate more memory, it doesn't mean the VM cannot continue. If the VM has pre-allocated OutOfMemoryError object instances, it can still unwind the stack (without a stack trace or with a limited stack trace) and let the application "limp" for a while and shut down gracefully.

Proposed refactoring

We need an InternalAllocationFailure type.

pub(crate) enum InternalAllocationFailure {
    BadRequest,
    Retry,
    WouldBlock,    
}

Space::acquire and its sub-functions Space::get_new_pages_and_initialize and Space::not_acquiring need to distinguish between two cases:

  1. If it is at a safepoint and it blocked for GC, it shall return Err(Retry).
  2. If it is not at a safepoint but GC is needed, it shall return Err(WouldBlock).

All functions in the call chain to Space::acquire, such as BumpAllocator::acquire_block and ImmixSpace::get_clean_block, should forward that error to their callers, such as ImmixAllocator::acquire_clean_block, all the way up to Allocator::alloc_slow_inline. Along this path, some functions may check for obviously unsatisfiable requests (Space::handle_obvious_oom_request). If such a check fails, the function should return Err(BadRequest).
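A rough sketch of this propagation, using the variant names of the InternalAllocationFailure enum above. ToySpace and its fields are hypothetical, the obvious-OOM check is folded into acquire for brevity, and the actual blocking for GC is elided.

```rust
/// Stand-in for MMTk's `Address`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Address(usize);

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub(crate) enum InternalAllocationFailure {
    BadRequest,
    Retry,
    WouldBlock,
}

/// Hypothetical, heavily simplified space.
pub struct ToySpace {
    pub max_pages: usize,
    pub free_pages: usize,
    pub at_safepoint: bool,
}

impl ToySpace {
    /// Stand-in for `Space::acquire`.
    pub(crate) fn acquire(&mut self, pages: usize) -> Result<Address, InternalAllocationFailure> {
        // Obvious OOM: the request can never fit in this space.
        if pages > self.max_pages {
            return Err(InternalAllocationFailure::BadRequest);
        }
        if pages <= self.free_pages {
            self.free_pages -= pages;
            return Ok(Address(0x1000));
        }
        if self.at_safepoint {
            // A real implementation would block for GC here before returning.
            Err(InternalAllocationFailure::Retry)
        } else {
            Err(InternalAllocationFailure::WouldBlock)
        }
    }
}

/// Stand-in for an intermediate function such as `ImmixSpace::get_clean_block`.
/// The `?` operator forwards any failure to the caller unchanged.
pub(crate) fn get_clean_block(space: &mut ToySpace) -> Result<Address, InternalAllocationFailure> {
    let block = space.acquire(8)?;
    Ok(block)
}
```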

Allocator::alloc_slow_inline should match against the errors.

  • When InternalAllocationFailure::BadRequest, it should immediately fail with OOM. (This fixes the problem that #1473, "Fix infinite loop if we return from Collection::out_of_memory", is trying to solve.)
  • When InternalAllocationFailure::WouldBlock, it should immediately return AllocationFailure::WouldBlock. (We don't need to check if we are at safepoint now because Space::acquire checked it for us.)
  • When InternalAllocationFailure::Retry, it should loop and try allocating again. But if after an emergency collection it still returns Retry, it shall fail with OOM.

When OOM, it shall call Collection::out_of_memory (if AllocationOptions::allow_oom_call is true) and then return AllocationFailure::OutOfMemory.
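The control flow described above might be sketched like this. The function alloc_slow and its closure parameters are hypothetical: attempt stands for one allocation attempt, last_gc_was_emergency reports whether the GC we just blocked on was an emergency collection, and out_of_memory stands for Collection::out_of_memory. How the real code learns that an emergency collection happened is an assumption of this sketch.

```rust
/// Stand-in for MMTk's `Address`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Address(usize);

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AllocationFailure {
    OutOfMemory,
    WouldBlock,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub(crate) enum InternalAllocationFailure {
    BadRequest,
    Retry,
    WouldBlock,
}

/// Hypothetical stand-in for the loop in `Allocator::alloc_slow_inline`.
pub(crate) fn alloc_slow(
    mut attempt: impl FnMut() -> Result<Address, InternalAllocationFailure>,
    mut last_gc_was_emergency: impl FnMut() -> bool,
    mut out_of_memory: impl FnMut(),
    allow_oom_call: bool,
) -> Result<Address, AllocationFailure> {
    loop {
        let failure = match attempt() {
            Ok(addr) => return Ok(addr),
            Err(failure) => failure,
        };
        match failure {
            // Not at a safepoint: report it without checking again.
            InternalAllocationFailure::WouldBlock => {
                return Err(AllocationFailure::WouldBlock);
            }
            // A non-emergency GC ran: try allocating again.
            InternalAllocationFailure::Retry if !last_gc_was_emergency() => continue,
            // Obviously bad request, or still failing after an emergency GC.
            InternalAllocationFailure::BadRequest | InternalAllocationFailure::Retry => {
                if allow_oom_call {
                    out_of_memory();
                }
                return Err(AllocationFailure::OutOfMemory);
            }
        }
    }
}
```

Since BadRequest exits the loop unconditionally, a returning out_of_memory callback can no longer cause the allocation to be retried forever.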

Performance concerns

The VM should use bump pointer fast paths whenever possible, and avoid calling memory_manager::alloc for most of the allocations. This means all of the API changes and refactoring happen on the slow paths. We shouldn't see obvious performance change with plans that support bump-pointer allocation, which should be everything except MarkSweep and PageProtect.

Related issues

#1223 proposed introducing NonZeroAddress because successful allocations should never return Address::ZERO. It mentioned returning error states using None, while this issue proposes using Err(...).

But regardless of whether we introduce NonZeroAddress, once we start using Result<Address, AllocationFailure> or Result<Address, InternalAllocationFailure>, we should stop checking against Address::ZERO and start using Err(...) to report errors.
