[RFC]: Add support for string arrays in stdlib

### Full name

Aman Singh

### University status

Yes

### University name

Guru Gobind Singh Indraprastha University

### University program

Bachelors in Technology (Major IT)

### Expected graduation

May, 2027

### Short biography

I am a pre-final year CS/IT undergraduate at GGSIPU, New Delhi, with a deep-rooted passion for low-level architecture, algorithms, and building high-performance systems. My programming journey started with competitive problem-solving in `C++`, which naturally evolved into engineering scalable web and backend systems using `JavaScript`, `TypeScript`, and `Next.js`.

Currently, I am working as a `PwC` Launchpad Trainee, gaining hands-on experience with enterprise-grade software solutions. At the same time I am serving as `Campus Crew` at `HackerRank`. Previously, I spent time exploring technical problem spaces alongside the team at `Atlas Research`. Beyond corporate roles, I am heavily invested in the `open-source ecosystem`. I lead the Technical division at our college club, where I regularly organize hackathons (like 'Xen-O-Thon') and mentor peers in algorithmic problem-solving.

I am fascinated by the intersection of `JavaScript` and `C`, and the challenge of managing complex memory architectures is exactly what drew me to stdlib. When I am away from my keyboard, you can usually find me on a badminton court or talking about cricket, a sport I previously played professionally for the U-16 Delhi state team.


### Timezone

Indian Standard Time (GMT+5:30)

### Contact details

Email:  amansingh080704@gmail.com 
GitHub: Amansingh0807 
LinkedIn: amansingh08

### Platform

Windows

### Editor

My exclusive and preferred code editor is `Visual Studio Code (VSCode)`. I love it because of its lightweight nature and incredibly powerful extension ecosystem, which I have heavily customized for open-source development. To align with `stdlib's` rigorous codebase standards, my workspace is strictly configured with `ESLint` for real-time linting and style enforcement. Additionally, I rely heavily on VSCode's built-in `TypeScript language` server to ensure that any `complex type` definitions and signatures (like the ones I worked on in the ndarray packages) are perfectly accurate before I even run a local build.

### Programming experience

My programming journey began in **2020** during the **global lockdown**. What started as a sheer fascination with how software operates under the hood quickly escalated into a deep passion for software engineering and open-source development. Over the past few years, I have transitioned from writing basic scripts to **architecting scalable**, **real-world applications**.

Some of the key projects that define my experience include:

**GenForm** 
An open-source project where I serve as the core maintainer and Project Admin under the Social Winter of Code (SWOC). It currently supports _over 600 users_. Managing this project taught me how to handle **community contributions**, **enforce code quality**, and **maintain production-grade** repositories.
[GitHub Repository](https://github.com/Amansingh0807/GenForm) | [Live Demo](https://genforma.vercel.app/)

**Nextric Hire**
A SaaS AI platform that enables users to intelligently interact with job descriptions and auto-generate tailored, **ATS-friendly resumes**. Built with Next.js 15, Convex, and Clerk, this project heavily refined my skills in integrating Generative AI (Gemini), managing complex real-time backend states, and building scalable full-stack architectures.
[GitHub Repository](https://github.com/Amansingh0807/Nextric-hire) | [Live Demo](https://nextric-hire.vercel.app/)

**AI Road Segmentation** 
An AI/ML project focused on road segmentation, which required processing complex datasets. This exposed me to the performance bottlenecks of heavy data manipulation and taught me the critical need for highly optimized, low-level computations when dealing with multidimensional arrays.
[GitHub Repository](https://github.com/Ayushrai987/Creative-codex--xenothon)

**MemG Vision** 
A _computer vision-oriented_ project where I handled _dynamic data processing_ and _system integration_. Building this further strengthened my backend, data streaming, and overall system architecture skills.
[GitHub Repository](https://github.com/Amansingh0807/MemG-Vision)


### JavaScript experience

I initially learned `JavaScript` to build full-stack web applications using the `React` and `Next.js` ecosystems. However, my true appreciation for the language blossomed when I started exploring its lower-level capabilities, particularly during my contributions to stdlib. Moving away from standard web development to manipulating flat memory structures completely changed my perspective on the language.

My favorite feature: `TypedArrays` and `ArrayBuffer`. I am fascinated by how JavaScript allows us to allocate contiguous blocks of memory and manipulate raw bytes using views like `Uint8Array` or `Float64Array`. It bridges the gap between high-level scripting and low-level system performance, which is exactly why I am so drawn to the StringArray interop challenge.

My least favorite feature: `Implicit Type Coercion`. While it makes JavaScript flexible for beginners, it often leads to silent, `catastrophic bugs` in complex computational libraries where `strict type integrity` is required. This is precisely why I heavily prefer writing `strict TypeScript` and enforcing rigorous `ESLint rules` to catch these issues at compile-time rather than runtime.

### Node.js experience

My experience with `Node.js` goes far beyond just spinning up `REST APIs` with `Express.js`. Through my work on GenForm and my backend projects, I have developed a solid grasp of the `Node.js event loop`, `asynchronous file system operations` (using the fs module), and `stream processing`.

Most importantly for this proposal, I have spent time understanding `Node.js Buffer objects`. Because the Node.js `Buffer` class is a subclass of JavaScript's native `Uint8Array`, this understanding is crucial for the architecture I am proposing for **StringArray**: it dictates how we will handle `UTF-8 string encoding` and `memory allocation` before passing data down to the `C-level macros`.

### C/Fortran experience

C/C++ Experience: `C` and `C++` form the absolute core of my computer science foundation. Because of my heavy involvement in competitive programming, I am highly comfortable with manual memory management, `pointer arithmetic`, and `optimizing contiguous memory arrays`. I understand the strict requirements C demands, such as handling null-terminated strings, avoiding memory leaks, and writing cache-friendly loops. This background gives me the exact `low-level intuition` required to build the `C-structs` and iteration macros needed for the `StringArray JS/C interop`.

Fortran Experience: I want to be completely transparent: I do not have hands-on experience writing Fortran code. Currently, when I encounter Fortran logic or legacy numerical libraries, I leverage AI tools to help me parse the syntax and understand the underlying mathematical models. However, I am a fast and eager learner. If the project requires translating or interacting with Fortran routines, I am fully prepared to adapt and learn it on the fly.

### Interest in stdlib

When I first started my journey with competitive programming in `C++`, I treated standard libraries as magic "**black boxes**" that just worked. As I transitioned into the `JavaScript` and `Node.js` ecosystem for building full-stack applications, I frequently felt the absence of that **raw**, **low-level numerical computing power**. Discovering stdlib was a lightbulb moment for me. It wasn't just another `npm package`; it was a massive, ambitious bridge connecting the accessibility of the web with the bare-metal performance of C.

On a personal level, my journey here has been deeply transformative. I vividly remember one of my early PRs for the **BLAS layer (dapx)** receiving an extensive review with over **40 meticulous comments**. Instead of feeling overwhelmed, I felt a profound sense of respect. The maintainers weren't just looking for a quick bug fix; they took the time to teach me strict architectural discipline, Tuple typing in TypeScript, and robust memory mutation documentation. That level of uncompromising mentorship is incredibly rare, and it fundamentally shifted my mindset from just being a "coder" to striving to be a "system architect."

If I have to pick my absolute favorite aspects of stdlib, it would be the `ndarray` iteration machinery and the rigorous benchmarking standards. I love the sheer engineering beauty of how flat memory buffers are manipulated through strides and offsets to achieve `C-like speeds` in `JavaScript`. Writing mathematical functions (like `roundnf`) and proving their efficiency through parameterized benchmarks gives a textbook-to-reality thrill that I haven't found anywhere else. stdlib has become my ultimate training ground, and I am deeply invested in helping it grow.

### Version control

Yes

### Contributions to stdlib

I started my journey with stdlib by picking up 'Good First Issues' to understand the repository's architecture and strict CI/CD pipelines, primarily refactoring benchmark files to use string interpolation. As I grew more comfortable with the codebase, I moved on to implementing numerical constants for the newly introduced `float16` data type.

From there, I transitioned to core mathematical functions in the `math/base/special` namespace (such as `roundnf` and complex number utilities). Most recently, I have been deeply involved in adding and refining BLAS ndarray interfaces (like `dapx`, `sfill`, and `drev`). Working on these BLAS packages has been my biggest learning curve, teaching me the intricacies of strict TypeScript tuple types, 1D memory manipulation, and C-level array iteration.

Merged/Closed PRs (55+ Pull Requests)
My merged work primarily consists of float16 mathematical constants, base special math functions, and extensive benchmark refactoring.

Key Merges: `math/base/special/roundnf` ([#9389](https://github.com/stdlib-js/stdlib/pull/9389)), `constants/float16/e` ([#8996](https://github.com/stdlib-js/stdlib/pull/8996)), `constants/float16/eulergamma` ([#9002](https://github.com/stdlib-js/stdlib/pull/9002)), and structured package data for complex math like `cround` and `csignumf`.

[View all my Merged/Closed PRs on GitHub](https://github.com/stdlib-js/stdlib/pulls?q=is%3Apr+author%3AAmansingh0807+is%3Aclosed)

Open PRs (15 Pull Requests)
My currently open PRs are mostly heavy BLAS operations and ndarray implementations that are undergoing rigorous review or awaiting maintainer bandwidth.

Key Open PRs: `blas/ext/base/ndarray/dapx` ([#9220](https://github.com/stdlib-js/stdlib/pull/9220) — Under extensive review), `sfill` ([#9094](https://github.com/stdlib-js/stdlib/pull/9094)), `drev` ([#9056](https://github.com/stdlib-js/stdlib/pull/9056)), and `math/base/special/roundbf` ([#9679](https://github.com/stdlib-js/stdlib/pull/9679)).

[View all my Open PRs on GitHub](https://github.com/stdlib-js/stdlib/pulls?q=is%3Apr+author%3AAmansingh0807+is%3Aopen)

### stdlib showcase

To truly demonstrate my ability to integrate `stdlib's` high-performance numerical utilities into modern, complex web environments, I built The StdLib Landscape, a visually rich, `interactive 3D terrain generator` built with `Next.js` and `React Three Fiber`.

Rather than relying on `generic JavaScript math objects`, the core rendering loop strictly utilizes focused stdlib modules to compute real-time geometry updates across a `50×50 terrain grid (2,500 vertices)`.

- `@stdlib/math-base-special-sin`: Computes smooth, overlapping wave patterns for the base landscape elevation.
- `@stdlib/random-base-normal`: Injects seeded Gaussian noise into each vertex for natural, deterministic variation.
- `@stdlib/stats-base-nanmean`: Rapidly calculates the mean terrain height to re-center the mesh dynamically upon parameter changes.

This project showcases how stdlib's modular architecture can act as the mathematical engine behind a modern `React/Three.js` render loop without performance bottlenecks.

[GitHub Repository](https://github.com/Amansingh0807/springer_assignment) | [Live Demo](https://springer-maths.vercel.app/)

### Goals

The goal of this project is to introduce a dedicated variable-length string typed array (`StringArray`) to stdlib, enabling efficient representation and manipulation of string data in both JavaScript and C. This is tracked in **Issue [#44](https://github.com/stdlib-js/google-summer-of-code/issues/44)**.

### Main Goals

1. **Design and implement `@stdlib/array/string`**  : A new `StringArray` constructor backed by raw byte buffers (`Uint8Array`) that stores variable-length UTF-8 encoded strings using an **Offset Table architecture** (data buffer + offset buffer).
2. **Implement all standard TypedArray prototype methods**  : Following the exact same API surface as `Complex64Array` and `BooleanArray`, including: `get`, `set`, `at`, `map`, `filter`, `slice`, `fill`, `find`, `findIndex`, `findLast`, `findLastIndex`, `forEach`, `every`, `some`, `reduce`, `reduceRight`, `includes`, `indexOf`, `lastIndexOf`, `join`, `keys`, `values`, `entries`, `copyWithin`, `reverse`, `sort`, `subarray`, `toReversed`, `toSorted`, `toString`, `toLocaleString`, `with`, and static methods `from` and `of`.
3. **Add supporting assert packages** :  Create `@stdlib/array/base/assert/is-stringarray`, `@stdlib/array/base/assert/is-string-data-type`, and `@stdlib/assert/is-stringarray`.
4. **Integrate `StringArray` throughout `@stdlib/array/*`**  : Register the `"string"` dtype in `dtypes.json`, add the constructor to `ctors.js`, update `dtype` resolution, accessor-getter/setter, and array creation utilities (`empty`, `zeros`, `filled`, `from-iterator`, `convert`).

### Supporting Goals

5. **Design a C struct for `StringArray`** : Enable future ndarray integration, following NumPy's `NpyString_load`/`NpyString_pack` pattern for safe string access from C.
6. **Research and document an SSO (Small String Optimization) strategy**  : Where strings ≤14 bytes are stored directly in fixed 16-byte slots, eliminating arena lookups. This is a **future optimization** to be proposed after the base API is merged.
7. **Improve test coverage** : Ensure every prototype method has comprehensive tests, including edge cases for empty strings, Unicode (multi-byte UTF-8), very long strings, and boundary conditions.
8. **Add benchmarks** : Following the patterns in `@stdlib/array/complex64/benchmark/` and `@stdlib/array/bool/benchmark/`, benchmark construction, `get`/`set` performance, iteration, and memory usage.

The main and supporting goals can be worked on independently, with main goals taking priority. By the end of the program, any unfinished tasks will be properly documented as new issues for future contributors or for me to continue working on.

---

## Approach

### The Core Problem

Numbers have fixed sizes (`Float64` = 8 bytes, `Uint8` = 1 byte). Booleans are 1 byte. `Complex64` numbers are 8 bytes (2 × Float32). But strings are **variable-length**: `"Hi"` is 2 bytes, while `"JavaScript"` is 10 bytes. The fundamental challenge is: **how do you store variable-length data in a fixed, contiguous memory layout that C can iterate over?**
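
To make the variable-length point concrete, the short sketch below measures UTF-8 byte lengths with the standard `TextEncoder` API. The helper name `utf8ByteLength` is illustrative only and is not an existing stdlib function.

```javascript
// Illustrative helper (hypothetical name): bytes needed to store a string as UTF-8.
var encoder = new TextEncoder();

function utf8ByteLength( str ) {
    return encoder.encode( str ).length;
}

// ASCII, accented Latin (U+00E9), euro sign (U+20AC), emoji (U+1F600), and the empty string:
var samples = [ 'Hi', 'JavaScript', '\u00e9', '\u20ac', '\ud83d\ude00', '' ];
var lengths = samples.map( utf8ByteLength );

console.log( lengths );
// => [ 2, 10, 2, 3, 4, 0 ]
```

Note that the byte length can differ from `String.prototype.length`, which counts UTF-16 code units, so any offset bookkeeping has to work in bytes.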

### Prior Art Analysis

Before proposing a design, I studied three major approaches:

#### 1. Apache Arrow : Variable-Size Binary Layout

Arrow uses a **data buffer + offset buffer** architecture:

```
offsets: [0, 2, 6, 11]                    ← Int32Array (N+1 entries)
data:    [H][i][A][m][a][n][H][e][l][l][o] ← Uint8Array (UTF-8 bytes)
```

- **Pros:** Memory-efficient, O(1) indexed access, industry standard, great for immutable/read-heavy workloads.
- **Cons:** Mutation (set with larger string) requires rebuilding or appending.
- **Used by:** PyArrow, Pandas (via Arrow backend), DuckDB.
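
As a minimal sketch of this layout (not Arrow's actual implementation), the example buffers above can be read back in JavaScript as follows. The function names `getString` and `byteLength` are hypothetical.

```javascript
// Sketch of reading an Arrow-style data + offsets pair (hypothetical helper names).
var decoder = new TextDecoder( 'utf-8' );

// The offsets table has N+1 entries; element i spans bytes [ offsets[i], offsets[i+1] ).
var offsets = new Int32Array( [ 0, 2, 6, 11 ] );
var data = new TextEncoder().encode( 'HiAmanHello' );

function getString( data, offsets, i ) {
    // `subarray` creates a view, so no bytes are copied before decoding:
    return decoder.decode( data.subarray( offsets[ i ], offsets[ i+1 ] ) );
}

function byteLength( offsets, i ) {
    return offsets[ i+1 ] - offsets[ i ];
}

console.log( getString( data, offsets, 1 ) );
// => 'Aman'

console.log( byteLength( offsets, 2 ) );
// => 5
```

Because each element ends exactly where the next one begins, replacing `"Hi"` with a longer string cannot be done in place without shifting every later byte and offset, which is the mutation cost listed above.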

#### 2. NumPy NEP 55 : Three-Tier Storage (SSO + Arena + Heap)

NumPy's new `StringDType` (merged in NumPy 2.0) uses a sophisticated union-based layout:

```c
// Each element = 16-byte union:
typedef union {
    struct { size_t offset; size_t size_and_flags; } vstring;    // arena/heap
    struct { char buf[15]; unsigned char size_and_flags; } direct_buffer; // SSO
} packed_string;
```

**Three tiers:**

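- **Short strings (≤15 bytes):** Stored directly inline in the 16-byte slot, zero heap access.
- **Arena strings (>15 bytes):** Stored in a contiguous arena buffer behind a size prefix: 1 byte for medium strings (16–255 bytes) and a `size_t` for long strings (>255 bytes).
- **Heap strings:** Stored via direct heap allocation (`malloc`) when the arena cannot be used, for example when a mutated string no longer fits its existing arena allocation.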

<img width="2522" height="1536" alt="Image" src="https://github.com/user-attachments/assets/c9614326-3c3f-4d4d-94a2-3b0d385b7531" />

**Mutation strategy : "Reuse-or-Abandon":**

- If new string fits in old slot → reuse the space.
- If new string is larger → old space is **abandoned** (never shifted/compacted), new space allocated.
- Arena grows with a **1.25× expansion factor**.
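
The three rules above can be sketched with a toy arena in JavaScript. This is an illustration under stated assumptions, not NumPy's code: every name (`createArena`, `push`, `set`, `get`, and the `slots` records) is hypothetical, and each element tracks its own start, size, and capacity.

```javascript
// Toy arena illustrating "Reuse-or-Abandon" (all names are hypothetical).
var encoder = new TextEncoder();
var decoder = new TextDecoder( 'utf-8' );

function createArena( capacity ) {
    return {
        'bytes': new Uint8Array( capacity ),
        'used': 0,      // next free byte in the arena
        'abandoned': 0, // bytes no longer referenced by any element
        'slots': []     // per-element records: { start, size, cap }
    };
}

function append( arena, encoded ) {
    var start = arena.used;
    var grown;
    var cap;
    if ( start + encoded.length > arena.bytes.length ) {
        // Grow by 1.25x, or by whatever is needed if that is still too small:
        cap = Math.max( Math.ceil( arena.bytes.length * 1.25 ), start + encoded.length );
        grown = new Uint8Array( cap );
        grown.set( arena.bytes );
        arena.bytes = grown;
    }
    arena.bytes.set( encoded, start );
    arena.used += encoded.length;
    return start;
}

function push( arena, str ) {
    var encoded = encoder.encode( str );
    var start = append( arena, encoded );
    arena.slots.push( { 'start': start, 'size': encoded.length, 'cap': encoded.length } );
}

function set( arena, i, str ) {
    var encoded = encoder.encode( str );
    var slot = arena.slots[ i ];
    if ( encoded.length <= slot.cap ) {
        // Reuse: overwrite in place; the slot keeps its original capacity.
        arena.bytes.set( encoded, slot.start );
        slot.size = encoded.length;
        return;
    }
    // Abandon: the old bytes stay where they are (no shifting) and the new value is appended.
    arena.abandoned += slot.cap;
    slot.start = append( arena, encoded );
    slot.size = encoded.length;
    slot.cap = encoded.length;
}

function get( arena, i ) {
    var slot = arena.slots[ i ];
    return decoder.decode( arena.bytes.subarray( slot.start, slot.start + slot.size ) );
}

var arena = createArena( 16 );
push( arena, 'Hello' );
push( arena, 'stdlib' );
set( arena, 0, 'Hi' );         // fits in the 5-byte slot: reused
set( arena, 1, 'JavaScript' ); // 10 bytes > 6 bytes: old slot abandoned

console.log( get( arena, 0 ), get( arena, 1 ), arena.abandoned );
// => 'Hi' 'JavaScript' 6
```

The sketch makes the trade-off visible: no element ever moves, at the cost of abandoned bytes that are only reclaimed if the whole arena is rebuilt.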

**Key insight: Why arena becomes inefficient after 255 bytes:**
Up to 255 bytes, the size prefix in the arena is just 1 byte (low overhead). Above 255 bytes, the size prefix jumps to `size_t` (8 bytes on 64-bit platforms), so the per-string overhead grows from 1 byte to 8 bytes. Additionally, mutating a large string to a larger value forces a fallback to direct heap allocation anyway, which reduces the arena's benefit for large entries.

- **Pros:** SSO eliminates heap access for short strings (most real-world strings are short), excellent cache locality, constant `BYTES_PER_ELEMENT = 16`.
- **Cons:** Complex implementation, union-based layout less natural in JS, three code paths to maintain.
- **Used by:** NumPy 2.0+.

#### 3. Java : Heap + String Constant Pool

Java stores strings on the heap with an internal `byte[]` array and uses a String Constant Pool for deduplication. Out of scope for stdlib's use case.

### Proposed Design: Offset Table with Reuse-or-Abandon Mutation

After studying all three approaches, I propose an **Offset Table architecture** (inspired by Arrow) combined with NumPy's **"Reuse-or-Abandon" mutation strategy**. This balances simplicity with efficiency and follows stdlib's established patterns.

#### Internal Layout

```javascript
function StringArray() {
    // ...constructor logic (length, array, ArrayBuffer, iterable)...

    // Follow stdlib's _buffer + _length pattern:
    setReadOnly( this, '_buffer', dataBuffer );    // Uint8Array : concatenated UTF-8 bytes
    setReadOnly( this, '_offsets', offsetBuffer );  // Int32Array : byte boundaries (N+1 entries)
    setReadOnly( this, '_length', numStrings );     // Number of string elements
}
```

**Visual example:**

```
Strings: ["Hello", "stdlib", "Hi"]

_offsets (Int32Array):  [0, 5, 11, 13]      ← 4 entries for 3 strings
                         ↑  ↑   ↑   ↑
                         |  |   |   └─ end of "Hi"
                         |  |   └─ start of "Hi" (length = 13-11 = 2)
                         |  └─ start of "stdlib" (length = 11-5 = 6)
                         └─ start of "Hello" (length = 5-0 = 5)

_buffer (Uint8Array):   [72,101,108,108,111,115,116,100,108,105,98,72,105]
                         H  e   l   l   o   s   t   d   l   i   b  H  i
```

<img width="1978" height="977" alt="Image" src="https://github.com/user-attachments/assets/cfc1efa4-906f-414a-b512-4c0ecd880cd6" />

**Why this design:**

| Feature | Offset Table (Proposed) | NumPy SSO+Arena |
| :--- | :--- | :--- |
| **Follows stdlib pattern** | `_buffer` + `_length` | Would need `_slotBuffer` + `_dataBuffer` |
| **Memory per ASCII char** | 1 byte | 1 byte |
| **Encoding** | UTF-8 | UTF-8 |
| **O(1) indexed access** | Yes (via offsets) | Yes (via slots) |
| **BYTES_PER_ELEMENT** | Variable (needs design decision) | Fixed 16 |
| **Implementation complexity** | Medium | High |
| **C interop** | Two pointers (data + offsets) | Two pointers (slots + arena) |
| **Explainability for RFC** | Simple to diagram | Complex union |

#### The `get()` Implementation

```javascript
    // Module-level cached decoder for performance:
    var DECODER = new TextDecoder( 'utf-8' );
    var ENCODER = new TextEncoder();

    setReadOnly( StringArray.prototype, 'get', function get( idx ) {
        var start;
        var end;

        if ( !isStringArray( this ) ) {
            throw new TypeError( 'invalid invocation. `this` is not a string array.' );
        }
        if ( !isNonNegativeInteger( idx ) ) {
            throw new TypeError( format(
                'invalid argument. Must provide a nonnegative integer. Value: `%s`.', idx
            ));
        }
        if ( idx >= this._length ) {
            return;
        }
        start = this._offsets[ idx ];
        end = this._offsets[ idx + 1 ];
        if ( start === end ) {
            return ''; // empty string
        }
        return DECODER.decode( this._buffer.subarray( start, end ) );
    });
```

#### The `set()` Implementation : Reuse-or-Abandon Strategy

This is the most critical method. When setting a value that's larger than the existing string, we use NumPy's "Reuse-or-Abandon" approach:

```javascript
    setReadOnly( StringArray.prototype, 'set', function set( value ) {
        var oldStart;
        var oldEnd;
        var oldSize;
        var newSize;
        var encoded;
        var idx;
        var buf;
        var off;
        var N;
        var i;

        if ( !isStringArray( this ) ) {
            throw new TypeError( 'invalid invocation. `this` is not a string array.' );
        }
        buf = this._buffer;
        off = this._offsets;

        if ( arguments.length > 1 ) {
            idx = arguments[ 1 ];
            if ( !isNonNegativeInteger( idx ) ) {
                throw new TypeError( format(
                    'invalid argument. Index argument must be a nonnegative integer. Value: `%s`.', idx
                ));
            }
        } else {
            idx = 0;
        }

        // Case 1: Setting a single string value
        if ( isString( value ) ) {
            if ( idx >= this._length ) {
                throw new RangeError( format(
                    'invalid argument. Index argument is out-of-bounds. Value: `%u`.', idx
                ));
            }
            encoded = ENCODER.encode( value );

            oldStart = off[ idx ];
            oldEnd = off[ idx + 1 ];
            oldSize = oldEnd - oldStart;
            newSize = encoded.length;

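            if ( newSize <= oldSize ) {
                // REUSE: New string fits in old slot, overwrite in place
                buf.set( encoded, oldStart );
                if ( newSize < oldSize ) {
                    // `_rebuildOffsets` is a private helper (not shown) which must record the shorter length without disturbing subsequent elements:
                    this._rebuildOffsets( idx, newSize - oldSize );
                }
            } else {
                // ABANDON old space, APPEND to end of buffer. `_appendAndUpdate` is a private helper (not shown) which grows the buffer as needed:
                this._appendAndUpdate( idx, encoded );
            }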
            return;
        }

        // Case 2: Setting from a collection (array of strings)
        if ( isCollection( value ) ) {
            N = value.length;
            if ( idx + N > this._length ) {
                throw new RangeError(
                    'invalid arguments. Target array lacks sufficient storage to accommodate source values.'
                );
            }
            for ( i = 0; i < N; i++ ) {
                this.set( value[ i ], idx + i );
            }
            return;
        }

        throw new TypeError( format(
            'invalid argument. First argument must be either a string, an array-like object, or a string array. Value: `%s`.', value
        ));
    });
```

<img width="2784" height="1536" alt="Image" src="https://github.com/user-attachments/assets/b7d15396-7198-4f20-83fc-a46148aa0ef2" />

#### Arena Growth Strategy

Following NumPy's 1.25× growth factor:

```javascript
    function growBuffer( currentBuffer, neededCapacity ) {
        var newBuffer;
        var newSize;

        // Start from at least 64 bytes to avoid tiny allocations (and to guarantee progress when the current buffer is empty):
        newSize = Math.max( currentBuffer.length, 64 );
        while ( newSize < neededCapacity ) {
            newSize = Math.ceil( newSize * 1.25 );
        }
        newBuffer = new Uint8Array( newSize );
        newBuffer.set( currentBuffer );
        return newBuffer;
    }
```

**Why 1.25× and not 2×?**

- 2× wastes too much memory for large arrays (a 100MB buffer would jump to 200MB)
- 1.1× causes too many reallocations (expensive `Uint8Array` copy each time)
- 1.25× is NumPy's empirically chosen sweet spot (good balance of memory and reallocation cost)
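
To get a rough feel for this trade-off, the sketch below counts how many reallocations each growth factor needs to reach a target capacity. The helper `countReallocations` is hypothetical, and the model ignores allocator behavior and copy cost, so it is only a back-of-the-envelope comparison.

```javascript
// Hypothetical helper: reallocations needed to grow from `start` to at least `target` bytes.
// Assumes `start > 0` and `factor > 1`; otherwise the loop would never terminate.
function countReallocations( start, target, factor ) {
    var size = start;
    var n = 0;
    while ( size < target ) {
        size = Math.ceil( size * factor );
        n += 1;
    }
    return {
        'reallocations': n,
        'finalSize': size
    };
}

var target = 100 * 1024 * 1024; // 100 MB
var factors = [ 1.1, 1.25, 2 ];
var results = factors.map( function compute( factor ) {
    return countReallocations( 64, target, factor );
});

var i;
for ( i = 0; i < factors.length; i++ ) {
    console.log( 'factor %d: %d reallocations, final size %d bytes', factors[ i ], results[ i ].reallocations, results[ i ].finalSize );
}
```

Smaller factors need more reallocations but overshoot the target by less; larger factors do the opposite.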

#### The Constructor : All Input Forms

Following `Complex64Array` and `BooleanArray` exactly:

```javascript
    function StringArray() {
        var byteOffset;
        var result;
        var nargs;
        var iter;
        var tmp;
        var buf;
        var off;
        var len;
        var arg;

        nargs = arguments.length;

        // Allow calling without new:
        if ( !(this instanceof StringArray) ) {
            if ( nargs === 0 ) return new StringArray();
            if ( nargs === 1 ) return new StringArray( arguments[0] );
            if ( nargs === 2 ) return new StringArray( arguments[0], arguments[1] );
            return new StringArray( arguments[0], arguments[1], arguments[2] );
        }

        if ( nargs === 0 ) {
            // Empty array:
            buf = new Uint8Array( 0 );
            off = new Int32Array( [ 0 ] );
            len = 0;
        } else if ( nargs === 1 ) {
            arg = arguments[ 0 ];
            if ( isNonNegativeInteger( arg ) ) {
                // new StringArray( 5 ) → 5 empty strings
                buf = new Uint8Array( 0 );
                off = new Int32Array( arg + 1 ); // all zeros = all empty strings
                len = arg;
            } else if ( isCollection( arg ) ) {
                // new StringArray( ['hello', 'world'] )
                result = fromStringCollection( arg );
                buf = result.buffer;
                off = result.offsets;
                len = result.length;
            } else if ( isObject( arg ) ) {
                // Iterable support
                if ( HAS_ITERATOR_SYMBOL === false ) {
                    throw new TypeError( '...' );
                }
                if ( !isFunction( arg[ ITERATOR_SYMBOL ] ) ) {
                    throw new TypeError( '...' );
                }
                iter = arg[ ITERATOR_SYMBOL ]();
                tmp = fromIterator( iter );
                result = fromStringCollection( tmp );
                buf = result.buffer;
                off = result.offsets;
                len = result.length;
            } else {
                throw new TypeError( '...' );
            }
        }
        // NOTE: the `ArrayBuffer`-based forms (`nargs > 1`) are omitted from this sketch.

        setReadOnly( this, '_buffer', buf );
        setReadOnly( this, '_offsets', off );
        setReadOnly( this, '_length', len );

        return this;
    }
```
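
The constructor above relies on a `fromStringCollection` helper which is not shown. A minimal sketch of what it could look like follows, assuming it returns an object with `buffer`, `offsets`, and `length` properties (as the constructor reads them) and that it rejects non-string elements. The validation behavior and error message are assumptions.

```javascript
// Hypothetical helper: encode a collection of strings into a data buffer plus an N+1 offset table.
var encoder = new TextEncoder();

function fromStringCollection( arr ) {
    var encoded = [];
    var total = 0;
    var off;
    var buf;
    var i;

    // First pass: encode each string and record the running byte total as the next offset.
    off = new Int32Array( arr.length + 1 );
    for ( i = 0; i < arr.length; i++ ) {
        if ( typeof arr[ i ] !== 'string' ) {
            throw new TypeError( 'invalid argument. Each element must be a string. Index: ' + i + '.' );
        }
        encoded.push( encoder.encode( arr[ i ] ) );
        total += encoded[ i ].length;
        off[ i+1 ] = total;
    }
    // Second pass: copy the encoded bytes into a single contiguous buffer.
    buf = new Uint8Array( total );
    for ( i = 0; i < arr.length; i++ ) {
        buf.set( encoded[ i ], off[ i ] );
    }
    return {
        'buffer': buf,
        'offsets': off,
        'length': arr.length
    };
}

var out = fromStringCollection( [ 'Hello', 'stdlib', 'Hi' ] );

console.log( Array.from( out.offsets ) );
// => [ 0, 5, 11, 13 ]

console.log( out.buffer.length );
// => 13
```

Encoding in two passes allocates the data buffer exactly once. Since the offsets are stored in an `Int32Array`, this sketch can only address up to 2^31 - 1 bytes of string data.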

#### C Struct for ndarray Interop

```c
#include <stddef.h>
#include <stdint.h>

// Proposed C representation for StringArray data:
typedef struct {
    uint8_t  *data;        // UTF-8 byte buffer (the _buffer)
    int32_t  *offsets;     // Offset table (the _offsets, length = n+1)
    int64_t  length;       // Number of strings
    int64_t  data_len;     // Total bytes used in data buffer
    int64_t  data_cap;     // Allocated capacity of data buffer
} stdlib_strarray_t;

// Safe access API (inspired by NumPy's NpyString_load / NpyString_pack):
int stdlib_strarray_load(
    const stdlib_strarray_t *arr,
    int64_t idx,
    const char **out_buf,   // Pointer to string data (read-only)
    size_t *out_size         // Length in bytes
);

int stdlib_strarray_pack(
    stdlib_strarray_t *arr,
    int64_t idx,
    const char *buf,
    size_t size
);
```

**Why load/pack and not direct access?**
Following NumPy's design philosophy: by abstracting string access behind functions, we can change the internal memory layout (e.g., add SSO) without breaking C consumers. This is the same reason NumPy uses `npy_packed_static_string` as an opaque type.

### Future Optimization: Small String Optimization (SSO)

While this initial RFC proposes the Offset Table approach for architectural simplicity, I have also researched **Small String Optimization (SSO)** : storing strings ≤14 bytes directly in fixed 16-byte slots, eliminating arena lookups for short strings.

**How SSO would work:**

```
Each element = 16-byte slot in a Uint8Array:

SHORT STRING (≤14 bytes):
┌──────┬──────────────────────────────────────────────┬──────┐
│ Flags│  Inline UTF-8 data (up to 14 bytes)          │ Len  │
│ 1B   │  14 bytes                                     │ 1B   │
└──────┴──────────────────────────────────────────────┴──────┘

ARENA STRING (>14 bytes):
┌──────┬──────────────┬──────────────┬────────────────────────┐
│ Flags│  Arena Offset │  Byte Length │  (unused padding)      │
│ 1B   │  4 bytes      │  4 bytes     │  7 bytes               │
└──────┴──────────────┴──────────────┴────────────────────────┘
```
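
The slot layout in the diagram can be sketched in JavaScript with a `DataView`. This is an exploratory sketch rather than a proposed implementation: the flag values, the little-endian byte order, and the function names `writeSlot` and `readSlot` are all assumptions, and arena growth is omitted.

```javascript
// Sketch of the 16-byte slot layout (flag values, endianness, and names are assumptions).
var SLOT_SIZE = 16;
var MAX_INLINE = 14;
var FLAG_INLINE = 0;
var FLAG_ARENA = 1;

var encoder = new TextEncoder();
var decoder = new TextDecoder( 'utf-8' );

// Writes `str` into slot `i` and returns the updated number of arena bytes in use.
function writeSlot( slots, arena, arenaUsed, i, str ) {
    var bytes = encoder.encode( str );
    var base = i * SLOT_SIZE;
    var view = new DataView( slots.buffer, slots.byteOffset + base, SLOT_SIZE );

    slots.fill( 0, base, base + SLOT_SIZE );
    if ( bytes.length <= MAX_INLINE ) {
        // Short string: flag (1 byte), inline data (up to 14 bytes), length (1 byte).
        view.setUint8( 0, FLAG_INLINE );
        slots.set( bytes, base + 1 );
        view.setUint8( 15, bytes.length );
        return arenaUsed;
    }
    // Arena string: flag (1 byte), arena offset (4 bytes), byte length (4 bytes), padding.
    arena.set( bytes, arenaUsed ); // throws a RangeError if the arena is too small
    view.setUint8( 0, FLAG_ARENA );
    view.setUint32( 1, arenaUsed, true );
    view.setUint32( 5, bytes.length, true );
    return arenaUsed + bytes.length;
}

function readSlot( slots, arena, i ) {
    var base = i * SLOT_SIZE;
    var view = new DataView( slots.buffer, slots.byteOffset + base, SLOT_SIZE );
    var start;
    var len;
    if ( view.getUint8( 0 ) === FLAG_INLINE ) {
        len = view.getUint8( 15 );
        return decoder.decode( slots.subarray( base + 1, base + 1 + len ) );
    }
    start = view.getUint32( 1, true );
    len = view.getUint32( 5, true );
    return decoder.decode( arena.subarray( start, start + len ) );
}

var slots = new Uint8Array( 2 * SLOT_SIZE );
var arena = new Uint8Array( 64 );
var used = 0;

used = writeSlot( slots, arena, used, 0, 'IN' );
used = writeSlot( slots, arena, used, 1, 'a string longer than fourteen bytes' );

console.log( readSlot( slots, arena, 0 ) );
// => 'IN'

console.log( readSlot( slots, arena, 1 ) );
// => 'a string longer than fourteen bytes'

console.log( used );
// => 35
```

Only the second string touches the arena; the first is read entirely from its slot, which is the lookup that SSO is meant to avoid.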

**Benefits of SSO:**

- Most real-world strings are short (variable names, labels, categories, country codes), so they would all be stored inline.
- Eliminates a pointer dereference for short strings → better cache performance.
- Makes `BYTES_PER_ELEMENT` a constant `16`.

**Why defer SSO:**

- Increases implementation complexity significantly (two code paths for every method).
- The Offset Table design is correct, explainable, and performant enough for initial adoption.
- SSO can be introduced as a backward-compatible optimization once the base API is stable.
- Better to discuss SSO with mentors during the community bonding period.

Once the base API is merged, SSO can be introduced to further eliminate arena lookups for short strings without changing the public API.

<img width="2742" height="1194" alt="Image" src="https://github.com/user-attachments/assets/1521aa64-1069-462d-a43e-54d86ed41546" />

### Why this project?

I've always been fascinated by the gap between how we use data structures at a high level and how they're actually represented in memory. When I saw **Issue [#44](https://github.com/stdlib-js/google-summer-of-code/issues/44)**, I didn't just see "add string arrays"; I saw a deep systems design problem: 
**how do you represent variable-length data in contiguous memory that both JavaScript and C can efficiently traverse?**

What excites me most is that this problem has been tackled by some of the best engineers in the world: the NumPy team with NEP 55, Apache Arrow with their columnar format, and Julia with their UTF-8 strings. Each made different tradeoffs. The opportunity to study these approaches and design a solution specifically tailored to stdlib's architecture is exactly the kind of challenge I want to take on.

I also believe this project has **outsized impact**. StringArray isn't just one package; it touches the entire stdlib ecosystem. Every array utility, every ndarray operation, and every dtype resolver needs to learn about strings. Successfully completing this means I'll have touched nearly every corner of the codebase, and that depth of understanding is incredibly valuable, both for me as a developer and for stdlib as a project.

Finally, there's something deeply satisfying about working on infrastructure that other developers will build on. When someone writes `new StringArray(['hello', 'world'])` and it just works (fast, memory-efficient, C-interoperable), that's a legacy worth contributing to.


### Qualifications

With **55+ merged PRs** and **15 open PRs** across stdlib, I have deep familiarity with the codebase's architecture, coding conventions, testing patterns, and review process. My contributions span benchmark refactoring, float16 constants (`gamma-lanczos-g`, `eulergamma`), base special math functions (`roundnf`, `roundbf`), complex number utilities (`cround`, `csignumf`), and BLAS ndarray interfaces (`dapx`, `sfill`, `drev`).

Through these contributions, I've developed a working understanding of how custom typed arrays (`Complex64Array`, `BooleanArray`) are structured internally, how the dtype registry works, and how accessor-based array patterns are used throughout the library. The BLAS work in particular taught me strict TypeScript tuple types, 1D memory manipulation, and C-level array iteration, skills directly applicable to StringArray.

I have taken courses in Data Structures, Algorithms, Operating Systems, and Computer Architecture, which give me a strong foundation for understanding memory layouts, encoding schemes, and performance tradeoffs. My experience with C (including string manipulation and memory management) prepares me for the ndarray C integration portion of this project.

I have also studied NumPy's NEP 55 in depth, understanding the three-tier storage model (SSO/Arena/Heap), the arena allocator with 1.25× growth, the "Reuse-or-Abandon" mutation strategy, and why the arena becomes inefficient after 255 bytes (size metadata jumps from 1 byte to `size_t`). This research directly informs my design decisions for stdlib's `StringArray`.

### Prior art

This area has been extensively explored in major libraries and standards:

| Library/Standard | Approach | Key Insight for stdlib |
|---|---|---|
| **NumPy NEP 55** | Three-tier (SSO + Arena + Heap), UTF-8, packed unions | The gold standard for variable-length string arrays. Reuse-or-Abandon mutation, 1.25× arena growth, load/pack C API abstraction. |
| **Apache Arrow** | Offset table (data + offsets), UTF-8, immutable | Simple and proven. The basis for our proposed architecture. Used by Pandas, DuckDB, Spark. |
| **stdlib Complex64Array** | `Float32Array` backing, 2 floats per element, accessor pattern | The template for our constructor, `get`/`set`, and all prototype methods. |
| **stdlib BooleanArray** | `Uint8Array` backing, 1 byte per element, accessor pattern | Shows how a non-numeric dtype was recently integrated (2024). Closest precedent for StringArray integration. |
| **Julia** | UTF-8 encoded byte buffers, array of pointers | Simpler approach, but no special optimization for string arrays. |
| **Java** | Heap allocation + String Constant Pool | Out of scope, GC-managed, not applicable to typed array context. |

Of particular relevance is the recently added `BooleanArray` (`@stdlib/array/bool`), which demonstrates the full integration path for a new non-numeric dtype: constructor, 30+ prototype methods, assert packages, dtype registration, accessor support, and test/benchmark suites. I will follow this precedent exactly.
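
To make the Arrow-style row in the table above concrete, here is a minimal sketch of an offset-table layout in plain JavaScript. The helper names `encodeStrings` and `decodeAt` are hypothetical (they are not existing stdlib APIs), and the sketch assumes a runtime that provides the standard `TextEncoder`/`TextDecoder` globals. It is only meant to illustrate the idea of one data buffer plus `n+1` offsets, not the final design.

```javascript
var encoder = new TextEncoder();
var decoder = new TextDecoder();

// Hypothetical helper (not an existing stdlib API): pack strings into one
// UTF-8 byte buffer plus an offsets table with `n+1` entries.
function encodeStrings( strings ) {
	var offsets = new Uint32Array( strings.length + 1 );
	var chunks = [];
	var total = 0;
	var data;
	var i;
	for ( i = 0; i < strings.length; i++ ) {
		chunks.push( encoder.encode( strings[ i ] ) );
		total += chunks[ i ].length;
		offsets[ i+1 ] = total; // end of element `i`, start of element `i+1`
	}
	data = new Uint8Array( total );
	for ( i = 0; i < chunks.length; i++ ) {
		data.set( chunks[ i ], offsets[ i ] );
	}
	return {
		'data': data,
		'offsets': offsets
	};
}

// Hypothetical helper: decode element `i` from bytes [offsets[i], offsets[i+1]).
function decodeAt( packed, i ) {
	var start = packed.offsets[ i ];
	var end = packed.offsets[ i+1 ];
	return decoder.decode( packed.data.subarray( start, end ) );
}

var packed = encodeStrings( [ 'hello', '', 'wörld' ] );
console.log( Array.from( packed.offsets ) ); // offsets: 0, 5, 5, 11
console.log( decodeAt( packed, 2 ) ); // 'wörld'
```

Note that `'wörld'` occupies 6 bytes rather than 5, because `ö` is a two-byte sequence in UTF-8. Indexed access stays O(1) because each element's byte range is found with two offset reads.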


### Commitment

I am fully committed to this project as a **full-time, large project (350-hour commitment)** and am prepared to go beyond if needed. I will dedicate **35-40 hours per week** during my summer break and **25 hours per week** during my exam period (last week of May through first week of June), focusing on steady progress, well-structured pull requests, and thorough testing.

**Exam Period Note:** My university exams fall in the last week of May through the first week of June. For this period, I have intentionally scheduled lighter tasks (constructor implementation + core `get`/`set` methods) that will already have been prototyped during the bonding period, allowing me to maintain momentum at a reduced 25 hrs/week pace without blocking progress.

Before GSoC officially begins, I will:

1. Build a **working prototype** of the core `StringArray` (constructor + `get`/`set`) to validate my design.
2. Post an **RFC comment on Issue #44** presenting my Offset Table architecture and asking for mentor feedback on key design decisions.
3. Continue making contributions to stdlib to deepen my familiarity with the codebase.

After GSoC, I plan to stay involved by addressing any remaining integration work, implementing SSO as a follow-up optimization, and contributing to ndarray C integration.


### Schedule

### Implementation Blueprint

The project is divided into **5 phases** with clear deliverables. Each phase builds on the previous one, and phases are designed so that midterm evaluation has a substantial, working deliverable.

### Community Bonding Period (Weeks C1-C3)

**Week C1: Design Validation & Environment Setup**

- Post RFC comment on Issue #44 with my Offset Table design, including diagrams and code sketches.
- Discuss key design decisions with mentors:
  - Should `BYTES_PER_ELEMENT` be fixed (16, slot-based) or omitted?
  - Should uninitialized elements default to `''` (empty string) or `null`?
  - Is Reuse-or-Abandon acceptable, or should we implement compaction?
- Set up local development environment, run existing test suites.

**Week C2: Prototype & Validate**

- Build a standalone prototype of `StringArray` core (constructor, `get`, `set`, `_offsets`) outside the main repo.
- Test with various string types: ASCII, multi-byte Unicode (emoji, CJK), empty strings, very long strings.
- Benchmark `get`/`set` performance against plain `Array` of strings.
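
As a rough illustration of why the prototype needs the string categories listed above, the following sketch compares JavaScript string length (UTF-16 code units) with UTF-8 byte length and checks encode/decode round trips. The helper names are hypothetical, and the sketch assumes the standard `TextEncoder`/`TextDecoder` globals are available.

```javascript
var encoder = new TextEncoder();
var decoder = new TextDecoder();

// Number of bytes a string occupies once encoded as UTF-8.
function utf8ByteLength( str ) {
	return encoder.encode( str ).length;
}

// Whether a string survives an encode/decode round trip unchanged.
function roundTrips( str ) {
	return decoder.decode( encoder.encode( str ) ) === str;
}

var samples = [ 'hello', '', '日本語', '😀', 'a'.repeat( 10000 ) ];
var i;
for ( i = 0; i < samples.length; i++ ) {
	// Prints: UTF-16 length, UTF-8 byte length, round-trip result
	console.log( samples[ i ].length, utf8ByteLength( samples[ i ] ), roundTrips( samples[ i ] ) );
}

// Edge case: a lone surrogate is not valid Unicode text, so `TextEncoder`
// substitutes U+FFFD and the round trip does not preserve the input.
console.log( roundTrips( '\uD83D' ) ); // false
```

The CJK sample has 3 code units but 9 bytes, and the emoji has 2 code units but 4 bytes, so buffer sizing has to be based on encoded byte length rather than `String.prototype.length`. The lone-surrogate case is one the prototype tests would need an explicit policy for.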

**Week C3: Study Integration Points**

- Map every file that needs updating by grepping for `BooleanArray`, `bool`, and `complex64` across the codebase.
- Create a tracking issue listing the 100+ packages that need StringArray support.
- Begin implementing based on mentor's go-ahead.

---

### Phase 1: Core StringArray Constructor (Weeks 1–2)

**Deliverables:**

- `@stdlib/array/string/lib/main.js`: full constructor supporting:
  - `new StringArray()`: empty array
  - `new StringArray( 5 )`: 5 empty strings
  - `new StringArray( ['hello', 'world'] )`: from array
  - `new StringArray( iterable )`: from iterable
- `@stdlib/array/string/lib/from_array.js`: helper for collection input
- `@stdlib/array/string/lib/from_iterator.js`: helper for iterable input
- `@stdlib/array/string/lib/from_iterator_map.js`: helper with callback
- Static properties: `StringArray.name = 'StringArray'`
- Prototype accessors: `buffer`, `byteLength`, `byteOffset`, `length`
- Core methods: `get( idx )`, `set( value, idx )`
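
The constructor forms listed above could be dispatched roughly as follows. This is a simplified sketch under stated assumptions: `StringArraySketch` is a hypothetical stand-in name, the `_buffer`/`_offsets`/`_length` fields mirror the naming used in this proposal, and validation is much lighter than what a real stdlib constructor would perform. It relies on the standard `TextEncoder`/`TextDecoder` globals.

```javascript
var encoder = new TextEncoder();
var decoder = new TextDecoder();

// Hypothetical sketch (illustrative names; not the final API).
function StringArraySketch( arg ) {
	var values;
	var chunks;
	var total;
	var i;
	if ( arg === void 0 ) {
		values = []; // no arguments: empty array
	} else if ( typeof arg === 'number' ) {
		values = new Array( arg ).fill( '' ); // `arg` empty strings
	} else if ( arg !== null && typeof arg[ Symbol.iterator ] === 'function' ) {
		values = Array.from( arg ); // arrays and other iterables
	} else {
		throw new TypeError( 'invalid argument. Must provide a length, an array, or an iterable.' );
	}
	chunks = [];
	total = 0;
	this._offsets = new Uint32Array( values.length + 1 );
	for ( i = 0; i < values.length; i++ ) {
		if ( typeof values[ i ] !== 'string' ) {
			throw new TypeError( 'invalid argument. Every element must be a string.' );
		}
		chunks.push( encoder.encode( values[ i ] ) );
		total += chunks[ i ].length;
		this._offsets[ i+1 ] = total;
	}
	this._buffer = new Uint8Array( total );
	for ( i = 0; i < chunks.length; i++ ) {
		this._buffer.set( chunks[ i ], this._offsets[ i ] );
	}
	this._length = values.length;
}

Object.defineProperty( StringArraySketch.prototype, 'length', {
	'get': function getLength() {
		return this._length;
	}
});

// Returns `undefined` for out-of-bounds indices.
StringArraySketch.prototype.get = function get( idx ) {
	if ( idx < 0 || idx >= this._length ) {
		return void 0;
	}
	return decoder.decode( this._buffer.subarray( this._offsets[ idx ], this._offsets[ idx+1 ] ) );
};

var a = new StringArraySketch( 3 );
var b = new StringArraySketch( [ 'hello', 'world' ] );
var c = new StringArraySketch( new Set( [ 'x', 'y' ] ) );
console.log( a.length, b.get( 1 ), c.get( 0 ) );
```

Here a numeric argument produces empty strings, which is one of the two defaults still open for mentor discussion. The sketch omits `set`, since mutation depends on the strategy chosen for variable-length replacement.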

**Files:**

```
lib/node_modules/@stdlib/array/string/
├── lib/
│   ├── main.js              [NEW] Constructor + get/set + accessors
│   ├── from_array.js         [NEW] Collection → StringArray
│   ├── from_iterator.js      [NEW] Iterator → StringArray
│   ├── from_iterator_map.js  [NEW] Iterator with map → StringArray
│   └── index.js              [NEW] Module entry point
├── package.json              [NEW]
├── README.md                 [NEW]
├── test/
│   └── test.js               [NEW] Constructor + get/set tests
├── benchmark/
│   └── benchmark.js          [NEW] Construction + access benchmarks
└── examples/
    └── index.js              [NEW] Usage examples
```

---

### Phase 2: Standard TypedArray Prototype Methods (Weeks 3–5)

**Week 3 : Iteration & Search:**

- `at( idx )`, `entries()`, `keys()`, `values()`
- `forEach( fcn, thisArg )`, `every( predicate )`, `some( predicate )`
- `find()`, `findIndex()`, `findLast()`, `findLastIndex()`
- `includes( searchElement, fromIndex )`, `indexOf()`, `lastIndexOf()`

**Week 4 : Transformation:**

- `map( fcn, thisArg )`, `filter( predicate, thisArg )`
- `reduce( reducer, initialValue )`, `reduceRight()`
- `fill( value, start, end )`
- `join( separator )`

**Week 5 : Copy & Reorder:**

- `slice( begin, end )`, `subarray( begin, end )`
- `copyWithin( target, start, end )`
- `reverse()`, `sort( compareFn )`
- `toReversed()`, `toSorted( compareFn )`, `with( idx, value )`
- `toString()`, `toLocaleString()`
- Static: `StringArray.from( src, clbk, thisArg )`, `StringArray.of( ...elements )`

**Tests:** Each method gets dedicated test cases following `@stdlib/array/bool/test/` patterns.
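
Many of the Phase 2 methods can be layered on top of `length` and `get( idx )` alone. The sketch below shows that layering for `indexOf`, `includes`, and `join`, using a hypothetical `accessorArray` stand-in backed by a plain array so that the example is self-contained. The real methods would live on the `StringArray` prototype and include stdlib's usual argument validation.

```javascript
// Stand-in for a StringArray: anything exposing `length` and `get( idx )`.
function accessorArray( values ) {
	return {
		'length': values.length,
		'get': function get( idx ) {
			return values[ idx ];
		}
	};
}

// Search forward from `fromIndex`; negative indices count from the end.
function indexOf( arr, searchElement, fromIndex ) {
	var i = ( fromIndex === void 0 ) ? 0 : fromIndex;
	if ( i < 0 ) {
		i += arr.length;
		if ( i < 0 ) {
			i = 0;
		}
	}
	for ( ; i < arr.length; i++ ) {
		if ( arr.get( i ) === searchElement ) {
			return i;
		}
	}
	return -1;
}

function includes( arr, searchElement, fromIndex ) {
	return indexOf( arr, searchElement, fromIndex ) !== -1;
}

// Concatenate all elements, defaulting to a comma separator.
function join( arr, separator ) {
	var sep = ( separator === void 0 ) ? ',' : separator;
	var out = [];
	var i;
	for ( i = 0; i < arr.length; i++ ) {
		out.push( arr.get( i ) );
	}
	return out.join( sep );
}

var x = accessorArray( [ 'a', 'b', 'c' ] );
console.log( indexOf( x, 'b' ), includes( x, 'z' ), join( x, '-' ) );
```

An actual implementation would likely compare encoded bytes directly for search methods instead of decoding every element, but the element-wise version is a useful reference for tests.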

---

### Phase 3: Assert Packages & Dtype Registration (Week 6 : Midterm)

> **Midterm deliverable:** A fully working `StringArray` with all 30+ prototype methods, comprehensive tests, and dtype registration.

**New Packages:**

```
@stdlib/array/base/assert/is-stringarray/         [NEW]
@stdlib/array/base/assert/is-string-data-type/     [NEW]
@stdlib/assert/is-stringarray/                     [NEW]
```

**Modified Files:**

| File | Change |
|---|---|
| `@stdlib/array/dtypes/lib/dtypes.json` | Add `"string"` to `all` and `typed` categories |
| `@stdlib/array/ctors/lib/ctors.js` | Add `'string': StringArray` mapping |
| `@stdlib/array/dtype/` | Add StringArray → `'string'` dtype resolution |

---

### Phase 4: Ecosystem Integration (Weeks 7–9)

This is the largest phase, updating 50+ packages to recognize StringArray. Work is prioritized by dependency order:

**Week 7 : Core Accessors:**

| Package | Change |
|---|---|
| `@stdlib/array/base/getter` | Add accessor for StringArray |
| `@stdlib/array/base/setter` | Add accessor for StringArray |
| `@stdlib/array/base/accessor-getter` | Add `'string'` accessor |
| `@stdlib/array/base/accessor-setter` | Add `'string'` accessor |
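
The accessor changes follow a dispatch pattern: given a dtype, return a function that reads or writes an element through the array's `get`/`set` methods instead of bracket indexing. The sketch below models that pattern with hypothetical tables and a stand-in accessor array. It is loosely inspired by the packages in the table above and should not be read as their actual source.

```javascript
// Stand-in accessor array exposing `get( idx )` and `set( value, idx )`.
function AccessorArray( values ) {
	this._values = values.slice();
}
AccessorArray.prototype.get = function get( idx ) {
	return this._values[ idx ];
};
AccessorArray.prototype.set = function set( value, idx ) {
	this._values[ idx ] = value;
};

// Hypothetical dispatch tables keyed by dtype.
var GETTERS = {
	'string': function getString( arr, idx ) {
		return arr.get( idx );
	},
	'default': function getIndexed( arr, idx ) {
		return arr[ idx ];
	}
};
var SETTERS = {
	'string': function setString( arr, idx, value ) {
		arr.set( value, idx );
	},
	'default': function setIndexed( arr, idx, value ) {
		arr[ idx ] = value;
	}
};

function getter( dtype ) {
	return Object.prototype.hasOwnProperty.call( GETTERS, dtype ) ? GETTERS[ dtype ] : GETTERS.default;
}

function setter( dtype ) {
	return Object.prototype.hasOwnProperty.call( SETTERS, dtype ) ? SETTERS[ dtype ] : SETTERS.default;
}

var strs = new AccessorArray( [ 'a', 'b' ] );
setter( 'string' )( strs, 1, 'beep' );
console.log( getter( 'string' )( strs, 1 ) ); // 'beep'

var nums = new Float64Array( [ 1.0, 2.0 ] );
setter( 'float64' )( nums, 0, 5.0 );
console.log( getter( 'float64' )( nums, 0 ) ); // 5
```

Generic utilities can then be written once against `getter( dtype )` and `setter( dtype )` without knowing whether the underlying array is indexed or accessor-based.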

**Week 8 : Array Creation Utilities:**

| Package | Change |
|---|---|
| `@stdlib/array/empty` | Support `dtype='string'` |
| `@stdlib/array/zeros` | Support `dtype='string'` (array of empty strings) |
| `@stdlib/array/filled` | Support `dtype='string'` |
| `@stdlib/array/from-iterator` | Support `dtype='string'` |
| `@stdlib/array/from-scalar` | Support `dtype='string'` |
| `@stdlib/array/convert` | Support conversion to/from `'string'` |

**Week 9 : Additional Integration:**

| Package | Change |
|---|---|
| `@stdlib/array/convert-same` | StringArray support |
| `@stdlib/array/slice` | StringArray support |
| `@stdlib/array/take` | StringArray support |
| `@stdlib/array/put` | StringArray support |
| `@stdlib/array/place` | StringArray support |
| `@stdlib/array/mskfilter` | StringArray support |
| `@stdlib/array/mskreject` | StringArray support |
| `@stdlib/array/mskput` | StringArray support |
| `@stdlib/array/to-fancy` | StringArray support |

*Note: For the sake of brevity and focus, the tables above highlight the 19 most critical dependency bottlenecks. The remaining 30+ packages in this phase include high-level utilities that simply need dtype resolution updates or minor accessor integrations, such as: `@stdlib/array/any`, `@stdlib/array/every`, `@stdlib/array/some`, `@stdlib/array/none`, `@stdlib/array/count`, `@stdlib/array/max`, `@stdlib/array/min`, `@stdlib/array/reverse`, `@stdlib/array/sort`, `@stdlib/array/shuffle`, `@stdlib/array/sample`, `@stdlib/array/unique`, `@stdlib/array/map`, `@stdlib/array/filter`, `@stdlib/array/to-iterator`, `@stdlib/array/to-json`, `@stdlib/array/pool`, `@stdlib/array/complex`, `@stdlib/array/int8`, `@stdlib/array/uint8`, `@stdlib/array/base/stride2offset`, `@stdlib/array/base/broadcast-array`, and various multidimensional array utilities under `@stdlib/ndarray/*`.*

---

### Phase 5: C Design & Documentation (Weeks 10–12)

**Week 10 : C Struct & Header:**

- Define `stdlib_strarray_t` struct in C header.
- Implement `stdlib_strarray_load()` and `stdlib_strarray_pack()` functions.
- Write a basic Node-API (N-API) native add-on wrapping these functions.

**Week 11 : Documentation & Final Testing:**

- Comprehensive `README.md` with full API documentation and examples.
- Ensure all tests pass across Node.js versions.
- Run full benchmark suite, compare with plain arrays.
- Code freeze.

**Week 12 : Polish & Submission:**

- Address any remaining review feedback.
- Create tracking issues for remaining integration work (ndarray support, SSO optimization).
- Write a summary of completed work and future directions.
- Final submission.

### Detailed Day-wise Schedule Blueprint

For a granular, day-by-day breakdown of all 15 weeks (including exact hours, tasks, and file-level deliverables per day), see the full schedule document:

[StringArray Implementation Blueprint, Day-wise Schedule](https://docs.google.com/document/d/1lwWZItVJ0_pWOaYHXYgCnsiTXlwr7CKDK9aov8dHf8M/edit?usp=sharing)

### Stretch Goals (If Ahead of Schedule)

1. Implement SSO (Small String Optimization) for strings ≤14 bytes.
2. Begin ndarray string dtype support in `@stdlib/ndarray/`.
3. Add SIMD-friendly batch operations for string comparison/search in C.

## Open Questions for Mentors

1. **`BYTES_PER_ELEMENT`:** Should we define `BYTES_PER_ELEMENT` for `StringArray`? Since strings are variable-length, it doesn't have a fixed meaningful value. Options: (a) omit it, (b) set to `1` (byte-level granularity), (c) set to `16` if we adopt SSO slots.

2. **Default value for uninitialized elements:** Should `new StringArray(5)` produce 5 empty strings (`''`) or 5 `null` entries? NumPy defaults to empty strings.

3. **Mutation strategy:** Is "Reuse-or-Abandon" (NumPy's approach) acceptable, or should we implement compaction? I recommend Reuse-or-Abandon for simplicity and O(1) mutation.

4. **C API priority:** Should the C struct and `load`/`pack` API be part of the initial implementation, or deferred to a follow-up after the JS API is stable?
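
To ground the mutation-strategy question, here is a minimal sketch of Reuse-or-Abandon over a single byte arena. It makes several simplifying assumptions that are not settled by this proposal: each element tracks its own start and byte length (rather than sharing contiguous offsets), the reuse check compares against the element's current length, and the arena grows by at least 1.25× when full. All names are hypothetical.

```javascript
var encoder = new TextEncoder();
var decoder = new TextDecoder();

// Hypothetical state: one byte arena plus a start and byte length per element.
function createState( strings ) {
	var chunks = strings.map( function encode( s ) {
		return encoder.encode( s );
	});
	var total = chunks.reduce( function sum( acc, c ) {
		return acc + c.length;
	}, 0 );
	var state = {
		'data': new Uint8Array( total ),
		'starts': new Uint32Array( strings.length ),
		'lengths': new Uint32Array( strings.length ),
		'used': 0 // index of the next free byte
	};
	chunks.forEach( function append( c, i ) {
		state.data.set( c, state.used );
		state.starts[ i ] = state.used;
		state.lengths[ i ] = c.length;
		state.used += c.length;
	});
	return state;
}

function getElement( state, idx ) {
	var start = state.starts[ idx ];
	return decoder.decode( state.data.subarray( start, start + state.lengths[ idx ] ) );
}

function setElement( state, idx, value ) {
	var bytes = encoder.encode( value );
	var grown;
	var cap;
	if ( bytes.length <= state.lengths[ idx ] ) {
		// Reuse: the new bytes fit in the element's existing slot.
		state.data.set( bytes, state.starts[ idx ] );
		state.lengths[ idx ] = bytes.length;
		return;
	}
	// Abandon: leave the old bytes behind and append at the end of the arena.
	if ( state.used + bytes.length > state.data.length ) {
		cap = Math.max( state.used + bytes.length, Math.ceil( state.data.length * 1.25 ) );
		grown = new Uint8Array( cap );
		grown.set( state.data.subarray( 0, state.used ) );
		state.data = grown;
	}
	state.data.set( bytes, state.used );
	state.starts[ idx ] = state.used;
	state.lengths[ idx ] = bytes.length;
	state.used += bytes.length;
}

var state = createState( [ 'hello', 'world' ] );
setElement( state, 0, 'hi' ); // reuse: 2 bytes fit in the old 5-byte slot
setElement( state, 1, 'worlds!' ); // abandon: 7 bytes do not fit, so append
console.log( getElement( state, 0 ), getElement( state, 1 ), state.used );
```

The tradeoff is visible in the example: after both writes the arena holds 17 used bytes for 9 bytes of live string data, and the abandoned bytes are only reclaimed if a compaction pass is added later.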


### Related issues

- [#44](https://github.com/stdlib-js/google-summer-of-code/issues/44) [Idea]: Add support for string arrays in stdlib


### Checklist

- [x] I have read and understood the [Code of Conduct](https://github.com/stdlib-js/stdlib/blob/develop/CODE_OF_CONDUCT.md).
- [x] I have read and understood the application materials found in this repository.
- [x] I understand that plagiarism will not be tolerated, and I have authored this application in my own words.
- [x] I have read and understood the [patch requirement](https://github.com/stdlib-js/google-summer-of-code/blob/main/README.md#patch-requirement) which is necessary for my application to be considered for acceptance.
- [x] I have read and understood the [stdlib showcase requirement](https://github.com/stdlib-js/google-summer-of-code/blob/main/README.md#showcase-requirement) which is necessary for my application to be considered for acceptance.
- [x] The issue name begins with `[RFC]:` and succinctly describes your proposal.
- [x] I understand that, in order to apply to be a GSoC contributor, I must submit my final application to <https://summerofcode.withgoogle.com/> **before** the submission deadline.
