[RFC]: Add support for string arrays in stdlib #218

@Amansingh0807

Full name

Aman Singh

University status

Yes

University name

Guru Gobind Singh Indraprastha University

University program

Bachelors in Technology (Major IT)

Expected graduation

May, 2027

Short biography

I am a pre-final year CS/IT undergraduate at GGSIPU, New Delhi, with a deep-rooted passion for low-level architecture, algorithms, and building high-performance systems. My programming journey started with competitive problem-solving in C++, which naturally evolved into engineering scalable web and backend systems using JavaScript, TypeScript, and Next.js.

Currently, I am working as a PwC Launchpad Trainee, gaining hands-on experience with enterprise-grade software solutions. At the same time I am serving as Campus Crew at HackerRank. Previously, I spent time exploring technical problem spaces alongside the team at Atlas Research. Beyond corporate roles, I am heavily invested in the open-source ecosystem. I lead the Technical division at our college club, where I regularly organize hackathons (like 'Xen-O-Thon') and mentor peers in algorithmic problem-solving.

I am fascinated by the intersection of JavaScript and C, and the challenge of managing complex memory architectures is exactly what drew me to stdlib. When I am away from my keyboard, you can usually find me on a badminton court or talking about cricket, a sport I previously played professionally for the U-16 Delhi state team.

Timezone

Indian Standard Time (GMT+5:30)

Contact details

Email: amansingh080704@gmail.com
GitHub: Amansingh0807
LinkedIn: amansingh08

Platform

Windows

Editor

My exclusive and preferred code editor is Visual Studio Code (VSCode). I love it because of its lightweight nature and incredibly powerful extension ecosystem, which I have heavily customized for open-source development. To align with stdlib's rigorous codebase standards, my workspace is strictly configured with ESLint for real-time linting and style enforcement. Additionally, I rely heavily on VSCode's built-in TypeScript language server to ensure that any complex type definitions and signatures (like the ones I worked on in the ndarray packages) are perfectly accurate before I even run a local build.

Programming experience

My programming journey began in 2020 during the global lockdown. What started as a sheer fascination with how software operates under the hood quickly escalated into a deep passion for software engineering and open-source development. Over the past few years, I have transitioned from writing basic scripts to architecting scalable, real-world applications.

Some of the key projects that define my experience include:

GenForm
An open-source project where I serve as the core maintainer and Project Admin under the Social Winter of Code (SWOC). It currently supports over 600 users. Managing this project taught me how to handle community contributions, enforce code quality, and maintain production-grade repositories.
GitHub Repository | Live Demo

Nextric Hire
A SaaS AI platform that enables users to intelligently interact with job descriptions and auto-generate tailored, ATS-friendly resumes. Built with Next.js 15, Convex, and Clerk, this project heavily refined my skills in integrating Generative AI (Gemini), managing complex real-time backend states, and building scalable full-stack architectures.
GitHub Repository | Live Demo

AI Road Segmentation
An AI/ML project focused on road segmentation, which required processing complex datasets. This exposed me to the performance bottlenecks of heavy data manipulation and taught me the critical need for highly optimized, low-level computations when dealing with multidimensional arrays.
GitHub Repository

MemG Vision
A computer vision-oriented project where I handled dynamic data processing and system integration. Building this further strengthened my backend, data streaming, and overall system architecture skills.
GitHub Repository

JavaScript experience

I initially learned JavaScript to build full-stack web applications using the React and Next.js ecosystems. However, my true appreciation for the language blossomed when I started exploring its lower-level capabilities, particularly during my contributions to stdlib. Moving away from standard web development to manipulating flat memory structures completely changed my perspective on the language.

My favorite feature: TypedArrays and ArrayBuffer. I am fascinated by how JavaScript allows us to allocate contiguous blocks of memory and manipulate raw bytes using views like Uint8Array or Float64Array. It bridges the gap between high-level scripting and low-level system performance, which is exactly why I am so drawn to the StringArray interop challenge.

My least favorite feature: Implicit Type Coercion. While it makes JavaScript flexible for beginners, it often leads to silent, catastrophic bugs in complex computational libraries where strict type integrity is required. This is precisely why I heavily prefer writing strict TypeScript and enforcing rigorous ESLint rules to catch these issues at compile-time rather than runtime.

Node.js experience

My experience with Node.js goes far beyond just spinning up REST APIs with Express.js. Through my work on GenForm and my backend projects, I have developed a solid grasp of the Node.js event loop, asynchronous file system operations (using the fs module), and stream processing.

Most importantly for this proposal, I have spent time understanding Node.js Buffer objects. Understanding that Node.js Buffers are essentially subclasses of JavaScript's native Uint8Array is crucial for the architecture I am proposing for StringArray, as it dictates how we will handle UTF-8 string encoding and memory allocation before passing data down to the C-level macros.

C/Fortran experience

C/C++ Experience: C and C++ form the absolute core of my computer science foundation. Because of my heavy involvement in competitive programming, I am highly comfortable with manual memory management, pointer arithmetic, and optimizing contiguous memory arrays. I understand the strict requirements C demands, such as handling null-terminated strings, avoiding memory leaks, and writing cache-friendly loops. This background gives me the exact low-level intuition required to build the C-structs and iteration macros needed for the StringArray JS/C interop.

Fortran Experience: I want to be completely transparent: I do not have hands-on experience writing Fortran code. Currently, when I encounter Fortran logic or legacy numerical libraries, I use AI tools to help me parse the syntax and understand the underlying mathematical models. However, I am a fast and eager learner. If the project requires translating or interacting with Fortran routines, I am fully prepared to adapt and learn it on the fly.

Interest in stdlib

When I first started my journey with competitive programming in C++, I treated standard libraries as magic "black boxes" that just worked. As I transitioned into the JavaScript and Node.js ecosystem for building full-stack applications, I frequently felt the absence of that raw, low-level numerical computing power. Discovering stdlib was a lightbulb moment for me. It wasn't just another npm package; it was a massive, ambitious bridge connecting the accessibility of the web with the bare-metal performance of C.

On a personal level, my journey here has been deeply transformative. I vividly remember one of my early PRs for the BLAS layer (dapx) receiving an extensive review with over 40 meticulous comments. Instead of feeling overwhelmed, I felt a profound sense of respect. The maintainers weren't just looking for a quick bug fix; they took the time to teach me strict architectural discipline, Tuple typing in TypeScript, and robust memory mutation documentation. That level of uncompromising mentorship is incredibly rare, and it fundamentally shifted my mindset from just being a "coder" to striving to be a "system architect."

If I have to pick my absolute favorite aspects of stdlib, it would be the ndarray iteration machinery and the rigorous benchmarking standards. I love the sheer engineering beauty of how flat memory buffers are manipulated through strides and offsets to achieve C-like speeds in JavaScript. Writing mathematical functions (like roundnf) and proving their efficiency through parameterized benchmarks gives a textbook-to-reality thrill that I haven't found anywhere else. stdlib has become my ultimate training ground, and I am deeply invested in helping it grow.

Version control

Yes

Contributions to stdlib

I started my journey with stdlib by picking up 'Good First Issues' to understand the repository's architecture and strict CI/CD pipelines, primarily refactoring benchmark files to use string interpolation. As I grew more comfortable with the codebase, I moved on to implementing numerical constants for the newly introduced float16 data type.

From there, I transitioned to core mathematical functions in the math/base/special namespace (such as roundnf and complex number utilities). Most recently, I have been deeply involved in adding and refining BLAS ndarray interfaces (like dapx, sfill, and drev). Working on these BLAS packages has been my biggest learning curve, teaching me the intricacies of strict TypeScript tuple types, 1D memory manipulation, and C-level array iteration.

Merged/Closed PRs (55+ Pull Requests)
My merged work primarily consists of float16 mathematical constants, base special math functions, and extensive benchmark refactoring.

Key Merges: math/base/special/roundnf (#9389), constants/float16/e (#8996), constants/float16/eulergamma (#9002), and structured package data for complex math like cround and csignumf.

View all my Merged/Closed PRs on GitHub

Open PRs (15 Pull Requests)
My currently open PRs are mostly heavy BLAS operations and ndarray implementations that are undergoing rigorous review or awaiting maintainer bandwidth.

Key Open PRs: blas/ext/base/ndarray/dapx (#9220 — Under extensive review), sfill (#9094), drev (#9056), and math/base/special/roundbf (#9679).

View all my Open PRs on GitHub

stdlib showcase

To truly demonstrate my ability to integrate stdlib's high-performance numerical utilities into modern, complex web environments, I built The StdLib Landscape, a visually rich, interactive 3D terrain generator built with Next.js and React Three Fiber.

Rather than relying on generic JavaScript math objects, the core rendering loop strictly utilizes focused stdlib modules to compute real-time geometry updates across a 50×50 terrain grid (2,500 vertices).

@stdlib/math-base-special-sin: Computes smooth, overlapping wave patterns for the base landscape elevation.
@stdlib/random-base-normal: Injects seeded Gaussian noise into each vertex for natural, deterministic variation.
@stdlib/stats-base-nanmean: Rapidly calculates the mean terrain height to re-center the mesh dynamically upon parameter changes.
This project showcases how stdlib's modular architecture can act as the mathematical engine behind a modern React/Three.js render loop without performance bottlenecks.

GitHub Repository | Live Demo

Goals

The goal of this project is to introduce a dedicated variable-length string typed array (StringArray) to stdlib, enabling efficient representation and manipulation of string data in both JavaScript and C. This is tracked in Issue #44.

Main Goals

  1. Design and implement @stdlib/array/string : A new StringArray constructor backed by raw byte buffers (Uint8Array) that stores variable-length UTF-8 encoded strings using an Offset Table architecture (data buffer + offset buffer).
  2. Implement all standard TypedArray prototype methods : Following the exact same API surface as Complex64Array and BooleanArray, including: get, set, at, map, filter, slice, fill, find, findIndex, findLast, findLastIndex, forEach, every, some, reduce, reduceRight, includes, indexOf, lastIndexOf, join, keys, values, entries, copyWithin, reverse, sort, subarray, toReversed, toSorted, toString, toLocaleString, with, and static methods from and of.
  3. Add supporting assert packages : Create @stdlib/array/base/assert/is-stringarray, @stdlib/array/base/assert/is-string-data-type, and @stdlib/assert/is-stringarray.
  4. Integrate StringArray throughout @stdlib/array/* : Register the "string" dtype in dtypes.json, add the constructor to ctors.js, update dtype resolution, accessor-getter/setter, and array creation utilities (empty, zeros, filled, from-iterator, convert).

Supporting Goals

  1. Design a C struct for StringArray : One that enables future ndarray integration, following NumPy's NpyString_load/NpyString_pack pattern for safe string access from C.
  2. Research and document an SSO (Small String Optimization) strategy : Where strings ≤14 bytes are stored directly in fixed 16-byte slots, eliminating arena lookups. This is a future optimization to be proposed after the base API is merged.
  3. Improve test coverage : Ensure every prototype method has comprehensive tests, including edge cases for empty strings, Unicode (multi-byte UTF-8), very long strings, and boundary conditions.
  4. Add benchmarks : Following the patterns in @stdlib/array/complex64/benchmark/ and @stdlib/array/bool/benchmark/, benchmark construction, get/set performance, iteration, and memory usage.

The main and supporting goals can be worked on independently, with main goals taking priority. By the end of the program, any unfinished tasks will be properly documented as new issues for future contributors or for me to continue working on.


Approach

The Core Problem

Numbers have fixed sizes (Float64 = 8 bytes, Uint8 = 1 byte). Booleans are 1 byte. Complex numbers (Complex64) are 8 bytes (2 × Float32). Strings, however, are variable-length: "Hi" is 2 bytes, while "JavaScript" is 10 bytes. The fundamental challenge is: how do you store variable-length data in a fixed, contiguous memory layout that C can iterate over?
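
To make the size mismatch concrete, here is a minimal sketch that measures UTF-8 byte lengths with the standard TextEncoder. The helper name `byteLength` and the non-ASCII samples are illustrative assumptions, not part of the proposal; they show that byte length can also differ from character count once multi-byte characters are involved.

```javascript
// Sketch: UTF-8 byte length is not fixed per element and can differ
// from the number of characters in the string.
var encoder = new TextEncoder();

// Hypothetical helper: number of bytes needed to store `str` as UTF-8.
function byteLength( str ) {
    return encoder.encode( str ).length;
}

// 'h\u00E9llo' is "hello" with an e-acute (2 bytes in UTF-8);
// '\u65E5\u672C\u8A9E' is three CJK characters (3 bytes each).
var samples = [ 'Hi', 'JavaScript', 'h\u00E9llo', '\u65E5\u672C\u8A9E' ];
var sizes = samples.map( byteLength );

console.log( sizes );
// => [ 2, 10, 6, 9 ]
```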

Prior Art Analysis

Before proposing a design, I studied three major approaches:

1. Apache Arrow : Variable-Size Binary Layout

Arrow uses a data buffer + offset buffer architecture:

offsets: [0, 2, 6, 11]                    ← Int32Array (N+1 entries)
data:    [H][i][A][m][a][n][H][e][l][l][o] ← Uint8Array (UTF-8 bytes)
  • Pros: Memory-efficient, O(1) indexed access, industry standard, great for immutable/read-heavy workloads.
  • Cons: Mutation (set with larger string) requires rebuilding or appending.
  • Used by: PyArrow, Pandas (via Arrow backend), DuckDB.
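
A minimal sketch of how O(1) indexed access falls out of this layout, using the same three strings as the example above. The helper `readString` is a hypothetical name used only for illustration.

```javascript
// Sketch: reading the i-th string from an Arrow-style data + offsets pair.
var decoder = new TextDecoder( 'utf-8' );

// Layout for the strings "Hi", "Aman", "Hello":
var offsets = new Int32Array( [ 0, 2, 6, 11 ] );
var data = new TextEncoder().encode( 'HiAmanHello' );

// Hypothetical helper: two offset lookups and one decode, independent
// of how many strings precede index `i`.
function readString( data, offsets, i ) {
    return decoder.decode( data.subarray( offsets[ i ], offsets[ i+1 ] ) );
}

var out = [
    readString( data, offsets, 0 ),
    readString( data, offsets, 1 ),
    readString( data, offsets, 2 )
];
console.log( out );
// => [ 'Hi', 'Aman', 'Hello' ]
```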

2. NumPy NEP 55 : Three-Tier Storage (SSO + Arena + Heap)

NumPy's new StringDType (merged in NumPy 2.0) uses a sophisticated union-based layout:

// Each element = 16-byte union:
typedef union {
    struct { size_t offset; size_t size_and_flags; } vstring;    // arena/heap
    struct { char buf[15]; unsigned char size_and_flags; } direct_buffer; // SSO
} packed_string;

Three tiers:

  • Short strings (≤15 bytes): Stored directly inline in the 16-byte slot, zero heap access.
  • Medium strings (16–255 bytes): Stored in a contiguous arena buffer with 1-byte size prefix.
  • Long strings (>255 bytes): Stored via direct heap allocation (malloc).

Mutation strategy : "Reuse-or-Abandon":

  • If new string fits in old slot → reuse the space.
  • If new string is larger → old space is abandoned (never shifted/compacted), new space allocated.
  • Arena grows with a 1.25× expansion factor.

Key insight: Why arena becomes inefficient after 255 bytes:
Up to 255 bytes, the size prefix in the arena is just 1 byte (low overhead). Above 255 bytes, the size prefix jumps to size_t (8 bytes on 64-bit platforms), so the per-entry overhead grows significantly. Additionally, mutation of large strings forces a fallback to direct heap allocation anyway, making the arena pointless for large entries.

  • Pros: SSO eliminates heap access for short strings (most real-world strings are short), excellent cache locality, constant BYTES_PER_ELEMENT = 16.
  • Cons: Complex implementation, union-based layout less natural in JS, three code paths to maintain.
  • Used by: NumPy 2.0+.

3. Java : Heap + String Constant Pool

Java stores strings on the heap with an internal byte[] array and uses a String Constant Pool for deduplication. Out of scope for stdlib's use case.

Proposed Design: Offset Table with Reuse-or-Abandon Mutation

After studying all three approaches, I propose an Offset Table architecture (inspired by Arrow) combined with NumPy's "Reuse-or-Abandon" mutation strategy. This balances simplicity with efficiency and follows stdlib's established patterns.

Internal Layout

function StringArray() {
    // ...constructor logic (length, array, ArrayBuffer, iterable)...

    // Follow stdlib's _buffer + _length pattern:
    setReadOnly( this, '_buffer', dataBuffer );    // Uint8Array : concatenated UTF-8 bytes
    setReadOnly( this, '_offsets', offsetBuffer );  // Int32Array : byte boundaries (N+1 entries)
    setReadOnly( this, '_length', numStrings );     // Number of string elements
}

Visual example:

Strings: ["Hello", "stdlib", "Hi"]

_offsets (Int32Array):  [0, 5, 11, 13]      ← 4 entries for 3 strings
                         ↑  ↑   ↑   ↑
                         |  |   |   └─ end of "Hi"
                         |  |   └─ start of "Hi" (length = 13-11 = 2)
                         |  └─ start of "stdlib" (length = 11-5 = 6)
                         └─ start of "Hello" (length = 5-0 = 5)

_buffer (Uint8Array):   [72,101,108,108,111,115,116,100,108,105,98,72,105]
                         H  e   l   l   o   s   t   d   l   i   b  H  i

Why this design:

| Feature | Offset Table (Proposed) | NumPy SSO+Arena |
| --- | --- | --- |
| Follows stdlib pattern | _buffer + _length | Would need _slotBuffer + _dataBuffer |
| Memory per ASCII char | 1 byte | 1 byte |
| Encoding | UTF-8 | UTF-8 |
| O(1) indexed access | Yes (via offsets) | Yes (via slots) |
| BYTES_PER_ELEMENT | Variable (needs design decision) | Fixed 16 |
| Implementation complexity | Medium | High |
| C interop | Two pointers (data + offsets) | Two pointers (slots + arena) |
| Explainability for RFC | Simple to diagram | Complex union |

The get() Implementation

    // Module-level cached decoder for performance:
    var DECODER = new TextDecoder( 'utf-8' );
    var ENCODER = new TextEncoder();

    setReadOnly( StringArray.prototype, 'get', function get( idx ) {
        var start;
        var end;

        if ( !isStringArray( this ) ) {
            throw new TypeError( 'invalid invocation. `this` is not a string array.' );
        }
        if ( !isNonNegativeInteger( idx ) ) {
            throw new TypeError( format(
                'invalid argument. Must provide a nonnegative integer. Value: `%s`.', idx
            ));
        }
        if ( idx >= this._length ) {
            return;
        }
        start = this._offsets[ idx ];
        end = this._offsets[ idx + 1 ];
        if ( start === end ) {
            return ''; // empty string
        }
        return DECODER.decode( this._buffer.subarray( start, end ) );
    });

The set() Implementation : Reuse-or-Abandon Strategy

This is the most critical method. When setting a value that's larger than the existing string, we use NumPy's "Reuse-or-Abandon" approach:

    setReadOnly( StringArray.prototype, 'set', function set( value ) {
        var oldStart;
        var oldEnd;
        var oldSize;
        var newSize;
        var encoded;
        var sbuf;
        var idx;
        var buf;
        var off;
        var N;
        var i;

        if ( !isStringArray( this ) ) {
            throw new TypeError( 'invalid invocation. `this` is not a string array.' );
        }
        buf = this._buffer;
        off = this._offsets;

        if ( arguments.length > 1 ) {
            idx = arguments[ 1 ];
            if ( !isNonNegativeInteger( idx ) ) {
                throw new TypeError( format(
                    'invalid argument. Index argument must be a nonnegative integer. Value: `%s`.', idx
                ));
            }
        } else {
            idx = 0;
        }

        // Case 1: Setting a single string value
        if ( isString( value ) ) {
            if ( idx >= this._length ) {
                throw new RangeError( format(
                    'invalid argument. Index argument is out-of-bounds. Value: `%u`.', idx
                ));
            }
            encoded = ENCODER.encode( value );

            oldStart = off[ idx ];
            oldEnd = off[ idx + 1 ];
            oldSize = oldEnd - oldStart;
            newSize = encoded.length;

            if ( newSize <= oldSize ) {
                // REUSE: New string fits in old slot, overwrite in place
                buf.set( encoded, oldStart );
                if ( newSize < oldSize ) {
                    this._rebuildOffsets( idx, newSize - oldSize );
                }
            } else {
                // ABANDON old space, APPEND to end of buffer
                this._appendAndUpdate( idx, encoded );
            }
            return;
        }

        // Case 2: Setting from a collection (array of strings)
        if ( isCollection( value ) ) {
            N = value.length;
            if ( idx + N > this._length ) {
                throw new RangeError(
                    'invalid arguments. Target array lacks sufficient storage to accommodate source values.'
                );
            }
            for ( i = 0; i < N; i++ ) {
                this.set( value[ i ], idx + i );
            }
            return;
        }

        throw new TypeError( format(
            'invalid argument. First argument must be either a string, an array-like object, or a string array. Value: `%s`.', value
        ));
    });
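
The reuse and abandon branches can be exercised in isolation with a small standalone sketch. It is not the proposed implementation: the names `createStore`, `setString`, and `getString` are hypothetical, and the sketch assumes each element tracks its own start and byte length, because once an element is relocated to the end of the buffer, adjacent offsets alone no longer delimit it. How `_rebuildOffsets` and `_appendAndUpdate` handle this bookkeeping is left open here.

```javascript
var encoder = new TextEncoder();
var decoder = new TextDecoder( 'utf-8' );

// Hypothetical store: an append-only byte buffer plus a start and a
// byte length per element (an assumption made for this sketch).
function createStore( strings ) {
    var chunks = strings.map( function encode( s ) {
        return encoder.encode( s );
    });
    var total = chunks.reduce( function sum( acc, c ) {
        return acc + c.length;
    }, 0 );
    var store = {
        'data': new Uint8Array( Math.max( total, 64 ) ),
        'starts': new Int32Array( strings.length ),
        'sizes': new Int32Array( strings.length ),
        'used': 0
    };
    chunks.forEach( function write( c, i ) {
        store.data.set( c, store.used );
        store.starts[ i ] = store.used;
        store.sizes[ i ] = c.length;
        store.used += c.length;
    });
    return store;
}

function setString( store, idx, value ) {
    var bytes = encoder.encode( value );
    var grown;
    if ( bytes.length <= store.sizes[ idx ] ) {
        // REUSE: overwrite in place; leftover bytes in the old slot go unused.
        store.data.set( bytes, store.starts[ idx ] );
        store.sizes[ idx ] = bytes.length;
        return 'reuse';
    }
    // ABANDON: leave the old bytes where they are and append at the end.
    if ( store.used + bytes.length > store.data.length ) {
        grown = new Uint8Array( Math.ceil( ( store.used + bytes.length ) * 1.25 ) );
        grown.set( store.data.subarray( 0, store.used ) );
        store.data = grown;
    }
    store.data.set( bytes, store.used );
    store.starts[ idx ] = store.used;
    store.sizes[ idx ] = bytes.length;
    store.used += bytes.length;
    return 'abandon';
}

function getString( store, idx ) {
    var start = store.starts[ idx ];
    return decoder.decode( store.data.subarray( start, start + store.sizes[ idx ] ) );
}

var store = createStore( [ 'Hello', 'stdlib', 'Hi' ] );
var first = setString( store, 0, 'Hey' );        // 3 bytes fit in the 5-byte slot
var second = setString( store, 2, 'Hi there' );  // 8 bytes exceed the 2-byte slot
var result = [ getString( store, 0 ), getString( store, 1 ), getString( store, 2 ) ];

console.log( first, second, result );
// => reuse abandon [ 'Hey', 'stdlib', 'Hi there' ]
```

Note that the abandoned bytes are never reclaimed in this sketch, which is the trade-off the compaction question in the design discussion is about.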

Arena Growth Strategy

Following NumPy's 1.25× growth factor:

    function growBuffer( currentBuffer, neededCapacity ) {
        var newBuffer;
        var newSize;

        // Start from at least 64 bytes to avoid tiny allocations. This also
        // guarantees progress when the current buffer is empty, since
        // Math.ceil( 0 * 1.25 ) would otherwise stay at 0 forever:
        newSize = Math.max( currentBuffer.length, 64 );
        while ( newSize < neededCapacity ) {
            newSize = Math.ceil( newSize * 1.25 );
        }
        newBuffer = new Uint8Array( newSize );
        newBuffer.set( currentBuffer );
        return newBuffer;
    }

Why 1.25× and not 2×?

  • 2× wastes too much memory for large arrays (a 100MB buffer would jump to 200MB)
  • 1.1× causes too many reallocations (expensive Uint8Array copy each time)
  • 1.25× is NumPy's empirically chosen sweet spot (good balance of memory and reallocation cost)
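
The trade-off between reallocation count and overshoot can be checked with a short simulation. This is only a sketch: `countReallocations` is a hypothetical helper, and the 64-byte starting size and 100 MiB target are assumptions chosen to mirror the numbers mentioned in this section.

```javascript
// Hypothetical helper: how many reallocations does it take to grow from
// `start` bytes to at least `target` bytes with a given growth factor?
function countReallocations( start, target, factor ) {
    var size = start;
    var count = 0;
    while ( size < target ) {
        size = Math.ceil( size * factor );
        count += 1;
    }
    return {
        'factor': factor,
        'count': count,
        'finalSize': size
    };
}

var TARGET = 100 * 1024 * 1024; // 100 MiB
var results = [ 1.1, 1.25, 2 ].map( function run( factor ) {
    return countReallocations( 64, TARGET, factor );
});

results.forEach( function print( r ) {
    console.log( 'factor=%d reallocations=%d overshoot=%d bytes', r.factor, r.count, r.finalSize - TARGET );
});
```

A smaller factor produces more reallocations but a smaller overshoot past the target, while doubling needs the fewest reallocations (21 here) at the cost of the largest overshoot.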

The Constructor : All Input Forms

Following Complex64Array and BooleanArray exactly:

    function StringArray() {
        var byteOffset;
        var result;
        var nargs;
        var iter;
        var tmp;
        var buf;
        var off;
        var len;
        var arg;

        nargs = arguments.length;

        // Allow calling without new:
        if ( !(this instanceof StringArray) ) {
            if ( nargs === 0 ) return new StringArray();
            if ( nargs === 1 ) return new StringArray( arguments[0] );
            if ( nargs === 2 ) return new StringArray( arguments[0], arguments[1] );
            return new StringArray( arguments[0], arguments[1], arguments[2] );
        }

        if ( nargs === 0 ) {
            // Empty array:
            buf = new Uint8Array( 0 );
            off = new Int32Array( [ 0 ] );
            len = 0;
        } else if ( nargs === 1 ) {
            arg = arguments[ 0 ];
            if ( isNonNegativeInteger( arg ) ) {
                // new StringArray( 5 ) → 5 empty strings
                buf = new Uint8Array( 0 );
                off = new Int32Array( arg + 1 ); // all zeros = all empty strings
                len = arg;
            } else if ( isCollection( arg ) ) {
                // new StringArray( ['hello', 'world'] )
                result = fromStringCollection( arg );
                buf = result.buffer;
                off = result.offsets;
                len = result.length;
            } else if ( isObject( arg ) ) {
                // Iterable support
                if ( HAS_ITERATOR_SYMBOL === false ) {
                    throw new TypeError( '...' );
                }
                if ( !isFunction( arg[ ITERATOR_SYMBOL ] ) ) {
                    throw new TypeError( '...' );
                }
                iter = arg[ ITERATOR_SYMBOL ]();
                tmp = fromIterator( iter );
                result = fromStringCollection( tmp );
                buf = result.buffer;
                off = result.offsets;
                len = result.length;
            } else {
                throw new TypeError( '...' );
            }
        }

        setReadOnly( this, '_buffer', buf );
        setReadOnly( this, '_offsets', off );
        setReadOnly( this, '_length', len );

        return this;
    }
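
The constructor above relies on a `fromStringCollection` helper that is not shown. One possible shape for it, as a sketch under the assumption that it returns an object with `buffer`, `offsets`, and `length` fields (the names the constructor reads), could look like this:

```javascript
var encoder = new TextEncoder();

// Hypothetical helper: encodes every string once, then packs the bytes
// into a single data buffer with an N+1 entry offset table.
function fromStringCollection( collection ) {
    var chunks = [];
    var total = 0;
    var offsets;
    var buffer;
    var bytes;
    var i;

    for ( i = 0; i < collection.length; i++ ) {
        if ( typeof collection[ i ] !== 'string' ) {
            throw new TypeError( 'invalid argument. Each element must be a string. Index: ' + i + '.' );
        }
        bytes = encoder.encode( collection[ i ] );
        chunks.push( bytes );
        total += bytes.length;
    }
    buffer = new Uint8Array( total );
    offsets = new Int32Array( collection.length + 1 ); // offsets[ 0 ] is 0
    for ( i = 0; i < chunks.length; i++ ) {
        buffer.set( chunks[ i ], offsets[ i ] );
        offsets[ i+1 ] = offsets[ i ] + chunks[ i ].length;
    }
    return {
        'buffer': buffer,
        'offsets': offsets,
        'length': collection.length
    };
}

var result = fromStringCollection( [ 'Hello', 'stdlib', 'Hi' ] );

console.log( Array.from( result.offsets ) );
// => [ 0, 5, 11, 13 ]

console.log( result.buffer.length );
// => 13
```

Encoding each string exactly once and sizing the data buffer from the summed byte lengths avoids any intermediate growth during construction.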

C Struct for ndarray Interop

// Proposed C representation for StringArray data:
typedef struct {
    uint8_t  *data;        // UTF-8 byte buffer (the _buffer)
    int32_t  *offsets;     // Offset table (the _offsets, length = n+1)
    int64_t  length;       // Number of strings
    int64_t  data_len;     // Total bytes used in data buffer
    int64_t  data_cap;     // Allocated capacity of data buffer
} stdlib_strarray_t;

// Safe access API (inspired by NumPy's NpyString_load / NpyString_pack):
int stdlib_strarray_load(
    const stdlib_strarray_t *arr,
    int64_t idx,
    const char **out_buf,   // Pointer to string data (read-only)
    size_t *out_size         // Length in bytes
);

int stdlib_strarray_pack(
    stdlib_strarray_t *arr,
    int64_t idx,
    const char *buf,
    size_t size
);

Why load/pack and not direct access?
Following NumPy's design philosophy: by abstracting string access behind functions, we can change the internal memory layout (e.g., add SSO) without breaking C consumers. This is the same reason NumPy uses npy_packed_static_string as an opaque type.

Future Optimization: Small String Optimization (SSO)

While this initial RFC proposes the Offset Table approach for architectural simplicity, I have also researched Small String Optimization (SSO) : storing strings ≤14 bytes directly in fixed 16-byte slots, eliminating arena lookups for short strings.

How SSO would work:

Each element = 16-byte slot in a Uint8Array:

SHORT STRING (≤14 bytes):
┌──────┬──────────────────────────────────────────────┬──────┐
│ Flags│  Inline UTF-8 data (up to 14 bytes)          │ Len  │
│ 1B   │  14 bytes                                     │ 1B   │
└──────┴──────────────────────────────────────────────┴──────┘

ARENA STRING (>14 bytes):
┌──────┬──────────────┬──────────────┬────────────────────────┐
│ Flags│  Arena Offset │  Byte Length │  (unused padding)      │
│ 1B   │  4 bytes      │  4 bytes     │  7 bytes               │
└──────┴──────────────┴──────────────┴────────────────────────┘
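
The two slot layouts can be sketched in JavaScript with a DataView over a Uint8Array. Everything below is illustrative: the flag values, the little-endian byte order, the fixed-size arena, and the names `packSlot` and `loadSlot` are assumptions, not decided parts of the design.

```javascript
var encoder = new TextEncoder();
var decoder = new TextDecoder( 'utf-8' );

var SLOT_SIZE = 16;
var MAX_INLINE = 14;
var FLAG_INLINE = 0x01; // hypothetical flag value
var FLAG_ARENA = 0x02;  // hypothetical flag value

// Writes `value` into slot `idx` and returns the updated arena usage.
// The arena is assumed to be large enough (no growth in this sketch).
function packSlot( slots, arena, used, idx, value ) {
    var bytes = encoder.encode( value );
    var base = idx * SLOT_SIZE;
    var view = new DataView( slots.buffer, slots.byteOffset + base, SLOT_SIZE );

    slots.fill( 0, base, base + SLOT_SIZE );
    if ( bytes.length <= MAX_INLINE ) {
        view.setUint8( 0, FLAG_INLINE );      // byte 0: flags
        slots.set( bytes, base + 1 );         // bytes 1-14: inline data
        view.setUint8( 15, bytes.length );    // byte 15: length
        return used;
    }
    view.setUint8( 0, FLAG_ARENA );           // byte 0: flags
    view.setUint32( 1, used, true );          // bytes 1-4: arena offset
    view.setUint32( 5, bytes.length, true );  // bytes 5-8: byte length
    arena.set( bytes, used );
    return used + bytes.length;
}

function loadSlot( slots, arena, idx ) {
    var base = idx * SLOT_SIZE;
    var view = new DataView( slots.buffer, slots.byteOffset + base, SLOT_SIZE );
    var start;
    var len;

    if ( view.getUint8( 0 ) === FLAG_INLINE ) {
        len = view.getUint8( 15 );
        return decoder.decode( slots.subarray( base + 1, base + 1 + len ) );
    }
    start = view.getUint32( 1, true );
    len = view.getUint32( 5, true );
    return decoder.decode( arena.subarray( start, start + len ) );
}

var slots = new Uint8Array( 2 * SLOT_SIZE );
var arena = new Uint8Array( 64 );
var used = 0;

used = packSlot( slots, arena, used, 0, 'IN' );                           // inline
used = packSlot( slots, arena, used, 1, 'a considerably longer label' );  // arena

var loaded = [ loadSlot( slots, arena, 0 ), loadSlot( slots, arena, 1 ) ];
console.log( loaded, used );
// => [ 'IN', 'a considerably longer label' ] 27
```

Only the second string touches the arena; the first is read entirely from its own 16-byte slot.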

Benefits of SSO:

  • Most real-world strings are short (variable names, labels, categories, country codes), so they would all be stored inline.
  • Eliminates a pointer dereference for short strings → better cache performance.
  • Makes BYTES_PER_ELEMENT a constant 16.

Why defer SSO:

  • Increases implementation complexity significantly (two code paths for every method).
  • The Offset Table design is correct, explainable, and performant enough for initial adoption.
  • SSO can be introduced as a backward-compatible optimization once the base API is stable.
  • Better to discuss SSO with mentors during the community bonding period.

Once the base API is merged, SSO can be introduced to further eliminate arena lookups for short strings without changing the public API.


Why this project?

I've always been fascinated by the gap between how we use data structures at a high level and how they're actually represented in memory. When I saw Issue #44, I didn't just see "add string arrays"; I saw a deep systems design problem: how do you represent variable-length data in contiguous memory that both JavaScript and C can efficiently traverse?

What excites me most is that this problem has been tackled by some of the best engineers in the world (the NumPy team with NEP 55, Apache Arrow with its columnar format, Julia with its UTF-8 strings), and each made different tradeoffs. The opportunity to study these approaches and design a solution specifically tailored to stdlib's architecture is exactly the kind of challenge I want to take on.

I also believe this project has outsized impact. StringArray isn't just one package, it touches the entire stdlib ecosystem. Every array utility, every ndarray operation, every dtype resolver needs to learn about strings. Successfully completing this means I'll have touched nearly every corner of the codebase, and that depth of understanding is incredibly valuable, both for me as a developer and for stdlib as a project.

Finally, there's something deeply satisfying about working on infrastructure that other developers will build on. When someone writes new StringArray(['hello', 'world']) and it just works (fast, memory-efficient, C-interoperable), that's a legacy worth contributing to.

Qualifications

With 55+ merged PRs and 15 open PRs across stdlib, I have deep familiarity with the codebase's architecture, coding conventions, testing patterns, and review process. My contributions span benchmark refactoring, float16 constants (gamma-lanczos-g, eulergamma), base special math functions (roundnf, roundbf), complex number utilities (cround, csignumf), and BLAS ndarray interfaces (dapx, sfill, drev).

Through these contributions, I've developed a working understanding of how custom typed arrays (Complex64Array, BooleanArray) are structured internally, how the dtype registry works, and how accessor-based array patterns are used throughout the library. The BLAS work in particular taught me strict TypeScript tuple types, 1D memory manipulation, and C-level array iteration, skills directly applicable to StringArray.

I have taken courses in Data Structures, Algorithms, Operating Systems, and Computer Architecture, which give me a strong foundation for understanding memory layouts, encoding schemes, and performance tradeoffs. My experience with C (including string manipulation and memory management) prepares me for the ndarray C integration portion of this project.

I have also studied NumPy's NEP 55 in depth, understanding the three-tier storage model (SSO/Arena/Heap), the arena allocator with 1.25× growth, the "Reuse-or-Abandon" mutation strategy, and why the arena becomes inefficient after 255 bytes (size metadata jumps from 1 byte to size_t). This research directly informs my design decisions for stdlib's StringArray.

Prior art

This area has been extensively explored in major libraries and standards:

| Library/Standard | Approach | Key Insight for stdlib |
| --- | --- | --- |
| NumPy NEP 55 | Three-tier (SSO + Arena + Heap), UTF-8, packed unions | The gold standard for variable-length string arrays. Reuse-or-Abandon mutation, 1.25× arena growth, load/pack C API abstraction. |
| Apache Arrow | Offset table (data + offsets), UTF-8, immutable | Simple and proven. The basis for our proposed architecture. Used by Pandas, DuckDB, Spark. |
| stdlib Complex64Array | Float32Array backing, 2 floats per element, accessor pattern | The template for our constructor, get/set, and all prototype methods. |
| stdlib BooleanArray | Uint8Array backing, 1 byte per element, accessor pattern | Shows how a non-numeric dtype was recently integrated (2024). Closest precedent for StringArray integration. |
| Julia | UTF-8 encoded byte buffers, array of pointers | Simpler approach, but no special optimization for string arrays. |
| Java | Heap allocation + String Constant Pool | Out of scope, GC-managed, not applicable to typed array context. |

Of particular relevance is the recently added BooleanArray (@stdlib/array/bool), which demonstrates the full integration path for a new non-numeric dtype: constructor, 30+ prototype methods, assert packages, dtype registration, accessor support, and test/benchmark suites. I will follow this precedent exactly.

Commitment

I am fully committed to this project as a full-time, large project (350-hour commitment) and am prepared to go beyond if needed. I will dedicate 35-40 hours per week during my summer break and 25 hours per week during my exam period (last week of May through first week of June), focusing on steady progress, well-structured pull requests, and thorough testing.

Exam Period Note: My university exams fall in the last week of May through the first week of June. For this period, I have intentionally scheduled lighter tasks (constructor implementation + core get/set methods) that will already have been prototyped during the bonding period, so I can maintain momentum at a reduced 25 hrs/week pace without blocking progress.

Before GSoC officially begins, I will:

  1. Build a working prototype of the core StringArray (constructor + get/set) to validate my design.
  2. Post an RFC comment on Issue #44 ([Idea]: add support for string arrays in stdlib) presenting my Offset Table architecture and asking for mentor feedback on key design decisions.
  3. Continue making contributions to stdlib to deepen my familiarity with the codebase.

After GSoC, I plan to stay involved by addressing any remaining integration work, implementing SSO as a follow-up optimization, and contributing to ndarray C integration.

Schedule

Implementation Blueprint

The project is divided into 5 phases with clear deliverables. Each phase builds on the previous one, and phases are designed so that midterm evaluation has a substantial, working deliverable.

Community Bonding Period (Weeks C1-C3)

Week C1: Design Validation & Environment Setup

  • Post an RFC comment on Issue #44 ([Idea]: add support for string arrays in stdlib) with my Offset Table design, including diagrams and code sketches.
  • Discuss key design decisions with mentors:
    • Should BYTES_PER_ELEMENT be fixed (16, slot-based) or omitted?
    • Should uninitialized elements default to '' (empty string) or null?
    • Is Reuse-or-Abandon acceptable, or should we implement compaction?
  • Set up local development environment, run existing test suites.

Week C2: Prototype & Validate

  • Build a standalone prototype of StringArray core (constructor, get, set, _offsets) outside the main repo.
  • Test with various string types: ASCII, multi-byte Unicode (emoji, CJK), empty strings, very long strings.
  • Benchmark get/set performance against plain Array of strings.
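
The string categories in the Week C2 list matter because a JavaScript string's `length` counts UTF-16 code units, while a UTF-8 backing buffer stores bytes, and the two diverge for non-ASCII input. The short sketch below shows the difference using only the standard `TextEncoder`; `utf8ByteLength` is a hypothetical helper name, not an existing stdlib function.

```javascript
// Hypothetical helper: number of bytes a string needs in a UTF-8 buffer.
function utf8ByteLength( str ) {
    return new TextEncoder().encode( str ).length;
}

// One sample per category: empty, ASCII, Latin-1 accent, CJK, emoji.
const samples = [ '', 'hello', '\u00e9', '你好', '😀' ];
for ( const s of samples ) {
    // `length` counts UTF-16 code units; the byte length is what the buffer stores.
    console.log( JSON.stringify( s ), s.length, utf8ByteLength( s ) );
}
```

For example, the emoji has a `length` of 2 but needs 4 bytes, and each CJK character here has a `length` of 1 but needs 3 bytes, so any bookkeeping in the prototype has to be done in bytes rather than in characters.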

Week C3: Study Integration Points

  • Map every file that needs updating by grepping for BooleanArray, bool, and complex64 across the codebase.
  • Create a tracking issue listing all ~100+ packages that need StringArray support.
  • Begin implementing based on mentor's go-ahead.

Phase 1: Core StringArray Constructor (Weeks 1–2)

Deliverables:

  • @stdlib/array/string/lib/main.js (full constructor), supporting:
    • new StringArray(): empty array
    • new StringArray( 5 ): 5 empty strings
    • new StringArray( ['hello', 'world'] ): from array
    • new StringArray( iterable ): from iterable
  • @stdlib/array/string/lib/from_array.js: Helper for collection input
  • @stdlib/array/string/lib/from_iterator.js: Helper for iterable input
  • @stdlib/array/string/lib/from_iterator_map.js: Helper with callback
  • Static properties: StringArray.name = 'StringArray'
  • Prototype accessors: buffer, byteLength, byteOffset, length
  • Core methods: get( idx ), set( value, idx )
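
As a rough illustration of how the four constructor forms listed above could be told apart, the sketch below normalizes each supported input to a plain array of strings. `normalizeInput` is a hypothetical helper, not the planned `main.js`. Two choices in it are assumptions: uninitialized elements default to `''` (one of the options still open for mentor discussion), and non-string elements are rejected rather than coerced.

```javascript
// Hypothetical helper: maps each supported constructor argument to an array of strings.
function normalizeInput( arg ) {
    if ( arg === undefined ) {
        return []; // new StringArray()
    }
    if ( typeof arg === 'number' ) {
        if ( !Number.isInteger( arg ) || arg < 0 ) {
            throw new TypeError( 'invalid argument. Length must be a nonnegative integer.' );
        }
        // ASSUMPTION: uninitialized elements default to the empty string.
        return new Array( arg ).fill( '' ); // new StringArray( 5 )
    }
    // Arrays are iterable, so one branch covers both arrays and generic iterables.
    if ( arg !== null && typeof arg === 'object' && typeof arg[ Symbol.iterator ] === 'function' ) {
        const out = Array.from( arg );
        for ( const v of out ) {
            // ASSUMPTION: reject non-strings instead of coercing them.
            if ( typeof v !== 'string' ) {
                throw new TypeError( 'invalid argument. Every element must be a string.' );
            }
        }
        return out;
    }
    throw new TypeError( 'invalid argument. Must be a length, an array, or an iterable.' );
}

console.log( normalizeInput( 2 ) );                       // two empty strings
console.log( normalizeInput( new Set( [ 'a', 'b' ] ) ) ); // elements taken from the iterable
```

Restricting the iterable branch to objects keeps a bare string argument from being silently split into characters.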

Files:

lib/node_modules/@stdlib/array/string/
├── lib/
│   ├── main.js              [NEW] Constructor + get/set + accessors
│   ├── from_array.js         [NEW] Collection → StringArray
│   ├── from_iterator.js      [NEW] Iterator → StringArray
│   ├── from_iterator_map.js  [NEW] Iterator with map → StringArray
│   └── index.js              [NEW] Module entry point
├── package.json              [NEW]
├── README.md                 [NEW]
├── test/
│   └── test.js               [NEW] Constructor + get/set tests
├── benchmark/
│   └── benchmark.js          [NEW] Construction + access benchmarks
└── examples/
    └── index.js              [NEW] Usage examples

Phase 2: Standard TypedArray Prototype Methods (Weeks 3–5)

Week 3: Iteration & Search

  • at( idx ), entries(), keys(), values()
  • forEach( fcn, thisArg ), every( predicate ), some( predicate )
  • find(), findIndex(), findLast(), findLastIndex()
  • includes( searchElement, fromIndex ), indexOf(), lastIndexOf()

Week 4: Transformation

  • map( fcn, thisArg ), filter( predicate, thisArg )
  • reduce( reducer, initialValue ), reduceRight()
  • fill( value, start, end )
  • join( separator )

Week 5: Copy & Reorder

  • slice( begin, end ), subarray( begin, end )
  • copyWithin( target, start, end )
  • reverse(), sort( compareFn )
  • toReversed(), toSorted( compareFn ), with( idx, value )
  • toString(), toLocaleString()
  • Static: StringArray.from( src, clbk, thisArg ), StringArray.of( ...elements )

Tests: Each method gets dedicated test cases following @stdlib/array/bool/test/ patterns.


Phase 3: Assert Packages & Dtype Registration (Week 6: Midterm)

Midterm deliverable: A fully working StringArray with all 30+ prototype methods, comprehensive tests, and dtype registration.

New Packages:

@stdlib/array/base/assert/is-stringarray/         [NEW]
@stdlib/array/base/assert/is-string-data-type/     [NEW]
@stdlib/assert/is-stringarray/                     [NEW]

Modified Files:

| File | Change |
| --- | --- |
| @stdlib/array/dtypes/lib/dtypes.json | Add "string" to all and typed categories |
| @stdlib/array/ctors/lib/ctors.js | Add 'string': StringArray mapping |
| @stdlib/array/dtype/ | Add StringArray → 'string' dtype resolution |
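
The registration changes above amount to a two-way mapping between the dtype string 'string' and the StringArray constructor. The sketch below illustrates that mapping in isolation; `CTORS`, `ctorFor`, and `dtypeOf` are hypothetical names, the `StringArray` class is only a stand-in, and none of this reproduces the actual contents of the stdlib packages listed.

```javascript
// Stand-in class; the real StringArray is the deliverable of Phases 1 and 2.
class StringArray {
    constructor( values = [] ) {
        this._values = Array.from( values );
    }
    get length() {
        return this._values.length;
    }
}

// Hypothetical lookup table: dtype string to constructor.
const CTORS = {
    'float64': Float64Array,
    'uint8': Uint8Array,
    'string': StringArray
};

// Resolves a constructor from a dtype string, or null if the dtype is unknown.
function ctorFor( dtype ) {
    return Object.prototype.hasOwnProperty.call( CTORS, dtype ) ? CTORS[ dtype ] : null;
}

// Resolves a dtype string from an array instance, or null if unrecognized.
function dtypeOf( arr ) {
    for ( const [ name, Ctor ] of Object.entries( CTORS ) ) {
        if ( arr instanceof Ctor ) {
            return name;
        }
    }
    return null;
}

console.log( ctorFor( 'string' ) === StringArray );    // true
console.log( dtypeOf( new StringArray( [ 'a' ] ) ) );  // string
```

Once both directions resolve, utilities that accept a `dtype` option can construct a StringArray without knowing anything else about it.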

Phase 4: Ecosystem Integration (Weeks 7–9)

This is the largest phase, updating 50+ packages to recognize StringArray. Work is prioritized by dependency order:

Week 7: Core Accessors

| Package | Change |
| --- | --- |
| @stdlib/array/base/getter | Add accessor for StringArray |
| @stdlib/array/base/setter | Add accessor for StringArray |
| @stdlib/array/base/accessor-getter | Add 'string' accessor |
| @stdlib/array/base/accessor-setter | Add 'string' accessor |
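
The accessor packages in the Week 7 table exist because a StringArray, like Complex64Array and BooleanArray, exposes its elements through `get`/`set` methods rather than bracket indexing. The sketch below shows the general idea of choosing a read strategy once and reusing it. `getterFor` and `FakeStringArray` are hypothetical stand-ins and do not reproduce the signatures of the stdlib packages.

```javascript
// Stand-in accessor array: elements are reachable only through get/set.
class FakeStringArray {
    constructor( values ) {
        this._values = values.slice();
    }
    get( idx ) {
        return this._values[ idx ];
    }
    set( value, idx ) {
        this._values[ idx ] = value;
    }
}

// Hypothetical dispatcher: pick the element-read strategy once, then reuse it in loops.
function getterFor( arr ) {
    if ( typeof arr.get === 'function' && typeof arr.set === 'function' ) {
        return ( x, idx ) => x.get( idx ); // accessor protocol
    }
    return ( x, idx ) => x[ idx ];         // plain indexed access
}

const plain = [ 'a', 'b' ];
const accessorArray = new FakeStringArray( [ 'c', 'd' ] );

console.log( getterFor( plain )( plain, 1 ) );                 // b
console.log( getterFor( accessorArray )( accessorArray, 1 ) ); // d
```

Generic utilities written against such a getter work unchanged for both plain arrays and accessor arrays, which is what lets the later integration weeks proceed package by package.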

Week 8: Array Creation Utilities

| Package | Change |
| --- | --- |
| @stdlib/array/empty | Support dtype='string' |
| @stdlib/array/zeros | Support dtype='string' (array of empty strings) |
| @stdlib/array/filled | Support dtype='string' |
| @stdlib/array/from-iterator | Support dtype='string' |
| @stdlib/array/from-scalar | Support dtype='string' |
| @stdlib/array/convert | Support conversion to/from 'string' |

Week 9: Additional Integration

| Package | Change |
| --- | --- |
| @stdlib/array/convert-same | StringArray support |
| @stdlib/array/slice | StringArray support |
| @stdlib/array/take | StringArray support |
| @stdlib/array/put | StringArray support |
| @stdlib/array/place | StringArray support |
| @stdlib/array/mskfilter | StringArray support |
| @stdlib/array/mskreject | StringArray support |
| @stdlib/array/mskput | StringArray support |
| @stdlib/array/to-fancy | StringArray support |

Note: For the sake of brevity and focus, the tables above highlight the 19 most critical dependency bottlenecks. The remaining 30+ packages in this phase include high-level utilities that simply need dtype resolution updates or minor accessor integrations, such as: @stdlib/array/any, @stdlib/array/every, @stdlib/array/some, @stdlib/array/none, @stdlib/array/count, @stdlib/array/max, @stdlib/array/min, @stdlib/array/reverse, @stdlib/array/sort, @stdlib/array/shuffle, @stdlib/array/sample, @stdlib/array/unique, @stdlib/array/map, @stdlib/array/filter, @stdlib/array/to-iterator, @stdlib/array/to-json, @stdlib/array/pool, @stdlib/array/complex, @stdlib/array/int8, @stdlib/array/uint8, @stdlib/array/base/stride2offset, @stdlib/array/base/broadcast-array, and various multidimensional array utilities under @stdlib/ndarray/*.


Phase 5: C Design & Documentation (Weeks 10–12)

Week 10: C Struct & Header

  • Define stdlib_strarray_t struct in C header.
  • Implement stdlib_strarray_load() and stdlib_strarray_pack() functions.
  • Write a basic Node-API (N-API) native add-on wrapping these functions.

Week 11: Documentation & Final Testing

  • Comprehensive README.md with full API documentation and examples.
  • Ensure all tests pass across Node.js versions.
  • Run full benchmark suite, compare with plain arrays.
  • Code freeze.

Week 12: Polish & Submission

  • Address any remaining review feedback.
  • Create tracking issues for remaining integration work (ndarray support, SSO optimization).
  • Write a summary of completed work and future directions.
  • Final submission.

Detailed Day-wise Schedule Blueprint

For a granular, day-by-day breakdown of all 15 weeks (including exact hours, tasks, and file-level deliverables per day), see the full schedule document:

StringArray Implementation Blueprint, Day-wise Schedule

Stretch Goals (If Ahead of Schedule)

  1. Implement SSO (Small String Optimization) for strings ≤14 bytes.
  2. Begin ndarray string dtype support in @stdlib/ndarray/.
  3. Add SIMD-friendly batch operations for string comparison/search in C.

Open Questions for Mentors

  1. BYTES_PER_ELEMENT: Should we define BYTES_PER_ELEMENT for StringArray? Since strings are variable-length, it doesn't have a fixed meaningful value.
    Options: (a) omit it, (b) set to 1 (byte-level granularity), (c) set to 16 if we adopt SSO slots.

  2. Default value for uninitialized elements: Should new StringArray(5) produce 5 empty strings ('') or 5 null entries? NumPy defaults to empty strings.

  3. Mutation strategy: Is "Reuse-or-Abandon" (NumPy's approach) acceptable, or should we implement compaction? I recommend Reuse-or-Abandon for simplicity and O(1) mutation.

  4. C API priority: Should the C struct and load/pack API be part of the initial implementation, or deferred to a follow-up after the JS API is stable?
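
To give the mutation question (item 3) something concrete to react to, here is one simplified reading of Reuse-or-Abandon combined with 1.25× buffer growth. It is a sketch under stated assumptions, not NumPy's implementation and not the final design: `ByteArena` and its fields are hypothetical, each element tracks its own start, length, and reserved capacity, and an abandoned slot is never reclaimed.

```javascript
const encoder = new TextEncoder();
const decoder = new TextDecoder();

// Hypothetical mutable byte arena with per-element slots.
class ByteArena {
    constructor( n, capacity = 16 ) {
        this.data = new Uint8Array( capacity );
        this.starts = new Uint32Array( n );  // byte offset of each element's slot
        this.lengths = new Uint32Array( n ); // byte length of the current value
        this.caps = new Uint32Array( n );    // bytes reserved for the slot
        this.used = 0;                       // bytes handed out so far
        this.abandoned = 0;                  // bytes no longer reachable
    }

    set( value, idx ) {
        const bytes = encoder.encode( value );
        if ( bytes.length > this.caps[ idx ] ) {
            // Abandon: the old slot is too small, so append a new slot at the end.
            this.abandoned += this.caps[ idx ];
            this._reserve( this.used + bytes.length );
            this.starts[ idx ] = this.used;
            this.caps[ idx ] = bytes.length;
            this.used += bytes.length;
        }
        // Reuse (or first write into the fresh slot): overwrite in place.
        this.data.set( bytes, this.starts[ idx ] );
        this.lengths[ idx ] = bytes.length;
    }

    get( idx ) {
        const start = this.starts[ idx ];
        return decoder.decode( this.data.subarray( start, start + this.lengths[ idx ] ) );
    }

    // Grows the buffer to at least `needed` bytes, by 1.25x or more.
    _reserve( needed ) {
        if ( needed <= this.data.length ) {
            return;
        }
        const grown = Math.max( needed, Math.ceil( this.data.length * 1.25 ) );
        const next = new Uint8Array( grown );
        next.set( this.data.subarray( 0, this.used ) );
        this.data = next;
    }
}

const arena = new ByteArena( 2, 8 );
arena.set( 'hello', 0 );       // first write: 5-byte slot at offset 0
arena.set( 'hi', 0 );          // fits in the 5-byte slot: reused in place
arena.set( 'hello world', 0 ); // too long: old slot abandoned, new slot appended
console.log( arena.get( 0 ), arena.abandoned ); // hello world 5
```

In this sketch a write never moves other elements, so mutation cost is independent of array length apart from occasional buffer growth. The price is the `abandoned` counter, which only ever increases, and that is the cost a compaction pass would address.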

Related issues

  • #44 [Idea]: Add support for string arrays in stdlib

Checklist

  • I have read and understood the Code of Conduct.
  • I have read and understood the application materials found in this repository.
  • I understand that plagiarism will not be tolerated, and I have authored this application in my own words.
  • I have read and understood the patch requirement which is necessary for my application to be considered for acceptance.
  • I have read and understood the stdlib showcase requirement which is necessary for my application to be considered for acceptance.
  • The issue name begins with [RFC]: and succinctly describes your proposal.
  • I understand that, in order to apply to be a GSoC contributor, I must submit my final application to https://summerofcode.withgoogle.com/ before the submission deadline.
