Parallel Processing
Wikigen implements a two-level concurrency model for efficient wiki generation: repository-level parallelism controls how many projects are processed simultaneously, while page-level parallelism controls how many wiki pages are generated in parallel within each project. This document explains the underlying mechanisms including Go channels, WaitGroups, semaphore-based concurrency control, and atomic operations for lock-free progress tracking.
Wikigen uses a hierarchical concurrency model to balance throughput with resource utilization. The system supports two parallelism dimensions that can be tuned independently:
- Repository-level parallelism (-p flag): Controls how many projects or repository groups are processed concurrently
- Page-level parallelism (-pp flag): Controls how many wiki pages within a single project are generated concurrently
graph TD
A["Start: Load Tasks<br/>(Projects/Repos)"] --> B["Repository Semaphore<br/>-p parallelism"]
B --> C["Project Task 1"]
B --> D["Project Task 2"]
B --> E["Project Task N"]
C --> F["Page Semaphore<br/>-pp parallelism"]
D --> F
E --> F
F --> G["Generate Page 1"]
F --> H["Generate Page 2"]
F --> I["Generate Page M"]
G --> J["Track Progress<br/>Atomic Operations"]
H --> J
I --> J
J --> K["Collect Results"]
Repository-level parallelism is controlled by the -p flag, which specifies how many projects can be processed simultaneously. Each project in the tasks list runs in its own goroutine, and a buffered channel acts as a semaphore to limit concurrency.
Implementation details:
- Semaphore channel: A buffered channel of size parallel contains placeholder structs that control concurrency
- WaitGroup: Tracks completion of all repository-level goroutines
- Mutex: Protects shared result collection from concurrent writes
Sources: main.go:1018-1047
Code snippet:
sem := make(chan struct{}, parallel) // Semaphore with parallelism slots
var wg sync.WaitGroup
var mu sync.Mutex
var results []*WikiResult
for _, t := range tasks {
    wg.Add(1)
    sem <- struct{}{} // Acquire semaphore slot
    go func(t task) {
        defer wg.Done()
        defer func() { <-sem }() // Release semaphore slot
        result, err := generateWiki(...)
        mu.Lock() // Protect shared results
        if result != nil {
            results = append(results, result)
        }
        mu.Unlock()
    }(t)
}
wg.Wait() // Wait for all goroutines to complete
The semaphore works by blocking on the send operation (sem <- struct{}{}) when all slots are occupied, naturally enforcing the parallelism limit. Because the send happens in the loop before each goroutine is spawned, it is the loop itself that waits. When a goroutine completes and receives from the channel (<-sem), it frees a slot, allowing the loop to launch the next goroutine.
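The limit can be observed directly with a small stand-alone program. This is a minimal sketch rather than wikigen's code: processAll, the string tasks, and the sleep that stands in for generateWiki are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// processAll (hypothetical) starts one goroutine per task but lets at most
// `parallel` of them run at once. It returns the collected results and the
// highest number of workers that were observed running at the same time.
func processAll(tasks []string, parallel int) (results []string, peak int) {
	sem := make(chan struct{}, parallel)
	var wg sync.WaitGroup
	var mu sync.Mutex
	running := 0

	for _, t := range tasks {
		wg.Add(1)
		sem <- struct{}{} // blocks the loop while every slot is taken
		go func(t string) {
			defer wg.Done()
			defer func() { <-sem }() // free the slot when the worker returns

			mu.Lock()
			running++
			if running > peak {
				peak = running
			}
			mu.Unlock()

			time.Sleep(5 * time.Millisecond) // stand-in for generateWiki

			mu.Lock()
			running--
			results = append(results, "wiki:"+t)
			mu.Unlock()
		}(t)
	}
	wg.Wait()
	return results, peak
}

func main() {
	results, peak := processAll([]string{"a", "b", "c", "d", "e"}, 2)
	fmt.Println(len(results), "results, peak concurrency:", peak)
}
```

With parallel set to 2, all five results are collected and the reported peak never exceeds 2, because a worker only starts after the loop has acquired a slot for it.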
Page-level parallelism is controlled by the -pp flag, which specifies how many wiki pages within a single project are generated concurrently. This parallelism level is independent of repository-level parallelism and allows fine-grained control over resource usage during the page generation phase.
Implementation details:
- Semaphore channel: A buffered channel of size pageParallel controls page generation concurrency
- WaitGroup: Tracks completion of all page generation goroutines
- Atomic counter: pageDone is incremented atomically to track progress without lock contention
- Retry logic: Each page is attempted up to 3 times before failing
Sources: main.go:564-615
Code snippet:
var pageDone int32
pageSem := make(chan struct{}, pageParallel)
var pageWg sync.WaitGroup
for i := range allPages {
    pageWg.Add(1)
    pageSem <- struct{}{} // Acquire semaphore slot
    go func(idx int) {
        defer pageWg.Done()
        defer func() { <-pageSem }() // Release semaphore slot
        page := &allPages[idx]
        maxRetries := 3
        var success bool
        for attempt := 1; attempt <= maxRetries; attempt++ {
            _, err := claudeCall(...)
            if err == nil {
                success = true
                break
            }
        }
        if !success {
            appendError(wikiDir, ...)
        }
        atomic.AddInt32(&pageDone, 1)
    }(i)
}
pageWg.Wait()
The page-level parallelism operates within the context of a single project. When multiple projects are being processed at the repository level, each project independently runs up to pageParallel page generation goroutines at a time.
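How the two levels compose can be sketched with nested semaphores. The example below is an illustration under assumptions, not the project's implementation: runNested, its parameters, and the sleep that replaces a page generation call are hypothetical. It records the highest number of page workers that were running at once across all projects.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// runNested (hypothetical) simulates `projects` projects with `pages` pages
// each and returns the peak number of simultaneously running page workers.
func runNested(projects, pages, parallel, pageParallel int) int32 {
	var running, peak int32
	sem := make(chan struct{}, parallel)
	var wg sync.WaitGroup

	for p := 0; p < projects; p++ {
		wg.Add(1)
		sem <- struct{}{} // repository-level slot
		go func() {
			defer wg.Done()
			defer func() { <-sem }()

			// Every project owns a separate page-level semaphore.
			pageSem := make(chan struct{}, pageParallel)
			var pageWg sync.WaitGroup
			for i := 0; i < pages; i++ {
				pageWg.Add(1)
				pageSem <- struct{}{} // page-level slot
				go func() {
					defer pageWg.Done()
					defer func() { <-pageSem }()

					n := atomic.AddInt32(&running, 1)
					for { // raise peak to n unless it is already higher
						old := atomic.LoadInt32(&peak)
						if n <= old || atomic.CompareAndSwapInt32(&peak, old, n) {
							break
						}
					}
					time.Sleep(2 * time.Millisecond) // stand-in for page generation
					atomic.AddInt32(&running, -1)
				}()
			}
			pageWg.Wait()
		}()
	}
	wg.Wait()
	return atomic.LoadInt32(&peak)
}

func main() {
	peak := runNested(4, 6, 2, 3)
	fmt.Println("peak page workers:", peak, "(upper bound 2*3 = 6)")
}
```

Because each project creates its own page semaphore, the overall ceiling is the product of the two limits rather than pageParallel alone.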
Wikigen uses buffered channels as semaphores to implement bounded concurrency. This pattern needs no dedicated semaphore type: it relies only on Go's built-in channel primitives, and a blocked sender is parked by the goroutine scheduler until a slot becomes free.
How it works:
- Create a buffered channel with capacity equal to the desired parallelism limit
- Before spawning work, send a placeholder value to the channel (blocking if full)
- When work completes, receive from the channel to release the slot
- The defer statement ensures cleanup even if the goroutine panics
Advantages of this pattern:
- No additional data structures needed beyond channels
- Natural blocking behavior when concurrency limit is reached
- Integrates seamlessly with goroutine scheduler
- Fair queue of waiting goroutines maintained by the Go runtime
Sources: main.go:565, 570-573, 1018, 1027-1030
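The role of defer in the release step can be illustrated with a job that panics. Note that an unrecovered panic in any goroutine terminates the whole Go program, so the deferred release only matters in practice when the worker also recovers. The recover call, runAll, and the job functions below are assumptions added for this sketch; the source does not show wikigen recovering from panics.

```go
package main

import (
	"fmt"
	"sync"
)

// quiet and noisy are hypothetical jobs: one returns normally, one panics.
func quiet() {}
func noisy() { panic("boom") }

// runAll (hypothetical) runs every job under a semaphore of size `limit`.
// It returns how many jobs panicked and how many slots are still held once
// all workers have finished.
func runAll(jobs []func(), limit int) (panics int, heldSlots int) {
	sem := make(chan struct{}, limit)
	var wg sync.WaitGroup
	var mu sync.Mutex

	for _, job := range jobs {
		wg.Add(1)
		sem <- struct{}{}
		go func(job func()) {
			// Deferred calls run in reverse order: recover first, then the
			// slot release, then wg.Done.
			defer wg.Done()
			defer func() { <-sem }()
			defer func() {
				if r := recover(); r != nil {
					mu.Lock()
					panics++
					mu.Unlock()
				}
			}()
			job()
		}(job)
	}
	wg.Wait()
	return panics, len(sem)
}

func main() {
	panics, held := runAll([]func(){quiet, noisy, quiet}, 2)
	fmt.Println(panics, held) // prints: 1 0
}
```

One job panicked, yet no slot remains occupied, so later work is not starved by a leaked slot.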
WaitGroups are used at both parallelism levels to ensure the launching goroutine waits for all of its worker goroutines to complete before proceeding:
- At repository level, the main function uses wg.Wait() to block until all projects complete
- At page level, within each project, the code uses pageWg.Wait() to block until all pages complete
- The pattern Add(1) before spawning and Done() in a defer ensures proper counting even with panics
Sources: main.go:566, 615, 1019, 1049
Progress tracking uses atomic operations to avoid the overhead of mutexes for simple counters. This allows multiple goroutines to update the progress count without waiting for locks.
The Progress struct combines mutexes for complex state with atomic operations for high-frequency updates:
classDiagram
class Progress {
-mu sync.Mutex
-totalItems int
-doneItems int32
-current map[string]string
+set(name string, status string)
+done(name string)
+print()
}
Structure:
type Progress struct {
    mu         sync.Mutex
    totalItems int
    doneItems  int32             // Atomic counter
    current    map[string]string // Protected by mutex
}
Atomic progress updates:
The done() method updates the atomic counter without acquiring the mutex for the counter itself:
func (p *Progress) done(name string) {
    atomic.AddInt32(&p.doneItems, 1) // No mutex needed
    p.mu.Lock()
    defer p.mu.Unlock()
    delete(p.current, name)
    p.print()
}
Reading atomic values:
The print() method reads the atomic counter to display progress:
func (p *Progress) print() {
    done := int(atomic.LoadInt32(&p.doneItems))
    pct := 0
    if p.totalItems > 0 {
        pct = done * 100 / p.totalItems
    }
    // ... format and display progress
}
Benefits of this hybrid approach:
- doneItems is updated with atomic operations for minimal contention
- current map is protected by mutex but only updated when status changes
- Progress display percentage can be calculated from the atomic value without locking
- The separation prevents the high-frequency atomic counter updates from blocking status map access
Sources: main.go:21-58
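The hybrid counter can be exercised in isolation. The following is a reduced sketch built on the struct fields listed above; newProgress, percent, and simulate are hypothetical helpers introduced for the example, and the display logic of print() is left out.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// Progress mirrors the fields described above.
type Progress struct {
	mu         sync.Mutex
	totalItems int
	doneItems  int32             // updated only through sync/atomic
	current    map[string]string // guarded by mu
}

// newProgress is a hypothetical constructor.
func newProgress(total int) *Progress {
	return &Progress{totalItems: total, current: make(map[string]string)}
}

func (p *Progress) set(name, status string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.current[name] = status
}

func (p *Progress) done(name string) {
	atomic.AddInt32(&p.doneItems, 1) // counter update needs no lock
	p.mu.Lock()
	defer p.mu.Unlock()
	delete(p.current, name)
}

// percent is a hypothetical stand-in for the calculation inside print().
func (p *Progress) percent() int {
	if p.totalItems == 0 {
		return 0
	}
	return int(atomic.LoadInt32(&p.doneItems)) * 100 / p.totalItems
}

// simulate starts and finishes `total` items from separate goroutines and
// reports the final percentage and the number of items still marked active.
func simulate(total int) (pct int, active int) {
	p := newProgress(total)
	var wg sync.WaitGroup
	for i := 0; i < total; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			name := fmt.Sprintf("page-%d", id)
			p.set(name, "generating")
			p.done(name)
		}(i)
	}
	wg.Wait()
	p.mu.Lock()
	defer p.mu.Unlock()
	return p.percent(), len(p.current)
}

func main() {
	pct, active := simulate(50)
	fmt.Println(pct, active) // prints: 100 0
}
```

No increment is lost even though fifty goroutines update the counter concurrently, and the status map ends up empty because every access to it goes through the mutex.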
The -p and -pp flags allow users to tune parallelism according to their system resources and network constraints:
| Flag | Name | Default | Environment Variable | Description |
|---|---|---|---|---|
| -p | Repository Parallelism | 1 | WIKI_PARALLEL | Number of projects/repos to process concurrently |
| -pp | Page Parallelism | 3 | WIKI_PAGE_PARALLEL | Number of pages to generate concurrently within each project |
Flag parsing and defaults:
Sources: main.go:912-913, 699-708
flag.IntVar(&parallel, "p", envOrDefaultInt("WIKI_PARALLEL", 1),
    "parallel projects/repos")
flag.IntVar(&pageParallel, "pp", envOrDefaultInt("WIKI_PAGE_PARALLEL", 3),
    "parallel pages per project")

func envOrDefaultInt(key string, fallback int) int {
    if v := os.Getenv(key); v != "" {
        var i int
        fmt.Sscan(v, &i) // if nothing parses, i stays 0 and the fallback is used
        if i > 0 {
            return i
        }
    }
    return fallback
}
Usage examples:
# Process 4 projects in parallel, 6 pages per project
wikigen -p 4 -pp 6 owner/repo1 owner/repo2 ...
# Use environment variables
export WIKI_PARALLEL=2
export WIKI_PAGE_PARALLEL=4
wikigen owner/repo
# Slow system: sequential processing
wikigen -p 1 -pp 1 owner/repo
# High-performance system: aggressive parallelism
wikigen -p 8 -pp 10 owner/repo
The execution flow at each parallelism level follows a similar pattern:
sequenceDiagram
participant Main as Main Goroutine
participant Sem as Semaphore<br/>Channel
participant Worker as Worker<br/>Goroutine
participant Done as Work<br/>Complete
Main->>Sem: Check slot available
Sem-->>Main: Send placeholder (block if full)
Main->>Worker: Spawn goroutine
activate Worker
Worker->>Done: Execute work
activate Done
Done-->>Worker: Complete
deactivate Done
Worker->>Sem: Receive (release slot)
Sem-->>Main: Slot freed
Worker-->>Main: Done signal via WaitGroup
deactivate Worker
Key points in the execution flow:
- For each item to process (project or page), call WaitGroup.Add(1)
- Send to semaphore channel (blocks if semaphore is full)
- Spawn a goroutine that defers Done() and semaphore receive
- Worker performs actual work (wiki generation, page generation)
- Worker receives from semaphore to release slot
- Worker's Done() method signals completion to WaitGroup
- Main function calls WaitGroup.Wait() to block until all workers finish
The page-level parallelism directly integrates with the retry mechanism. Each page generation attempt is counted as a separate operation, and the semaphore ensures that no more than pageParallel pages are being generated at any time, regardless of retry attempts.
Retry integration:
Sources: main.go:580-605
maxRetries := 3
var success bool
for attempt := 1; attempt <= maxRetries; attempt++ {
    // ... attempt page generation ...
    if success {
        break // Exit retry loop after successful generation
    }
}
if !success {
    appendError(wikiDir, ...) // Log failure
    // Write placeholder file indicating failure
}
atomic.AddInt32(&pageDone, 1) // Update progress after all retries
The semaphore slot is held for the entire duration of all retry attempts for a single page, preventing the page generation parallelism from exceeding the configured limit even during retries.
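A runnable version of this interaction might look like the sketch below. It is written under assumptions: generatePages, the call parameter, and the flaky function that simulates failures are hypothetical, and a mutex-guarded slice stands in for appendError.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"sync/atomic"
)

// flaky is a hypothetical generator: every first attempt fails, and the page
// named "bad" fails on every attempt.
func flaky(page string, attempt int) error {
	if page == "bad" || attempt == 1 {
		return errors.New("simulated failure")
	}
	return nil
}

// generatePages (hypothetical) tries each page up to maxRetries times while
// holding a single semaphore slot for the whole retry loop. It returns the
// number of pages processed and the pages that never succeeded.
func generatePages(pages []string, pageParallel, maxRetries int,
	call func(page string, attempt int) error) (int32, []string) {
	var pageDone int32
	var mu sync.Mutex
	var failed []string
	pageSem := make(chan struct{}, pageParallel)
	var pageWg sync.WaitGroup

	for _, page := range pages {
		pageWg.Add(1)
		pageSem <- struct{}{}
		go func(page string) {
			defer pageWg.Done()
			defer func() { <-pageSem }() // released only after the last attempt

			success := false
			for attempt := 1; attempt <= maxRetries; attempt++ {
				if err := call(page, attempt); err == nil {
					success = true
					break
				}
			}
			if !success {
				mu.Lock()
				failed = append(failed, page) // stand-in for appendError
				mu.Unlock()
			}
			atomic.AddInt32(&pageDone, 1) // counted once per page, not per attempt
		}(page)
	}
	pageWg.Wait()
	return atomic.LoadInt32(&pageDone), failed
}

func main() {
	done, failed := generatePages([]string{"intro", "bad", "api"}, 2, 3, flaky)
	fmt.Println(done, failed) // prints: 3 [bad]
}
```

All three pages are counted as processed, including the one that exhausted its retries, which matches the counter being incremented after the retry loop rather than inside it.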
The optimal settings depend on several factors:
Repository-level parallelism (-p):
- Limited by available memory and network bandwidth
- Each project requires space for cloned repositories
- Multiple concurrent Claude API calls can be rate-limited
- Starting point: number of CPU cores / 2
Page-level parallelism (-pp):
- Limited by Claude API rate limits (typically more restrictive)
- Each page generation attempt makes a single Claude API call, so a page that exhausts its retries makes up to 3 calls
- Higher values increase throughput but may hit API quotas
- Default of 3 is conservative to avoid rate limiting
Combined effect:
- Maximum concurrent Claude API calls during page generation = -p × -pp
- Network bandwidth scales with (-p × number of repos) × (-pp)
- Memory usage scales roughly with -p (each project clones repos)
The hybrid approach to progress tracking minimizes contention:
- Atomic doneItems counter avoids mutex lock on every page completion
- The mutex protects only the current status map, which changes less frequently
- Display updates are throttled by the progress.print() implementation
- No locks are held during the actual page generation work
Sources: main.go:32-44, 47-58
Wikigen provides thread-safe operations through:
- Goroutine-safe channels: Used for semaphore implementation
- WaitGroup: Provides happens-before guarantees for goroutine completion
- Mutex protection: Result collection and progress state modification
- Atomic operations: Lock-free progress counter updates
- Value semantics: Each goroutine receives its own copy of loop variables via closure parameters
Sources: main.go:1025-1047
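The pattern go func(t task) { ... }(t) is critical for correctness: it passes the loop variable by value to the closure, so each goroutine sees the value from its own iteration. Before Go 1.22 a single loop variable was shared across all iterations, so a closure that captured it directly could observe a later value; passing it as an argument is safe in every Go version.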
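A small check of the by-value pattern, written as an illustration only: collect and the repository names are hypothetical. Each goroutine records the argument it received, and the sorted result is compared with the input.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// collect (hypothetical) launches one goroutine per name, hands the loop
// variable over as an argument, and returns what the goroutines saw, sorted.
func collect(names []string) []string {
	var wg sync.WaitGroup
	var mu sync.Mutex
	var seen []string

	for _, name := range names {
		wg.Add(1)
		go func(n string) { // n is a private copy for this goroutine
			defer wg.Done()
			mu.Lock()
			seen = append(seen, n)
			mu.Unlock()
		}(name)
	}
	wg.Wait()
	sort.Strings(seen) // goroutine order is not deterministic, so sort first
	return seen
}

func main() {
	fmt.Println(collect([]string{"repo-c", "repo-a", "repo-b"})) // prints: [repo-a repo-b repo-c]
}
```

Every name appears exactly once in the output, which would not be guaranteed on older Go versions if the closure read the shared loop variable instead of its own parameter.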
- Architecture & Design — Overall system architecture, core data structures, and design patterns
- Wiki Generation Pipeline — Complete wiki generation process including concurrency control
- Configuration & Environment — Environment variables for parallelism configuration
- Error Handling & Retry Strategy — Retry logic integration with parallel page generation
- System Overview — Introduction to wikigen and its architecture