Skip to content

Latest commit

 

History

History
409 lines (310 loc) · 10.6 KB

File metadata and controls

409 lines (310 loc) · 10.6 KB

PHASE 4: UUID-Based Ordering & Security Hardening

Status: ✅ 100% Complete
Start Date: January 13, 2026
Current Date: January 13, 2026
Scope: Replace timestamp-based ordering with UUID1 for FetchJob; implement UUID4 for User security


📋 Executive Summary

Phase 4 optimized the system by replacing sleep-delay-based ordering with deterministic UUID1 ordering for FetchJob, and added UUID4 public identifiers to User for security. This eliminates artificial test delays, improves performance, and prevents user enumeration attacks.

Key Decisions Made:

Decision Choice Rationale
FetchJob Ordering UUID1 (time-based) CSC314: Deterministic, sortable; removes sleep delays
User Public ID UUID4 (random) CSC315: Security/privacy; prevents enumeration attacks
Internal ID Retention Keep both id and uuid_id CSC315: Efficiency (INT PKs faster) + Security (UUID exposed)
Migration Strategy Alembic auto-generate CSC315: Version control for schema changes

🎯 Problem Statement (Why Phase 4?)

Phase 3 Limitation: Sleep Delays

In Phase 3 tests, we used:

await asyncio.sleep(0.3)  # Between job creations

Why was this needed? (CSC317 - Simulation)

When two FetchJobs were created in the same millisecond:

Job A: created_at = 2026-01-13 10:00:00.000000
Job B: created_at = 2026-01-13 10:00:00.000000  ← Same!

Ordering broke: ORDER BY created_at DESC, id DESC couldn't reliably determine which was "latest" because:

  1. created_at was identical
  2. id tiebreaker worked, but felt artificial

Test Impact:

  • Each test with 3 jobs = 0.9 seconds wasted (3 × 0.3s sleep)
  • Slow feedback loop for developers
  • Unreliable in production (races between fetches)

✅ Solution: UUID1 for FetchJob

What is UUID1? (CSC314 - Algorithm Selection)

UUID1 = Time-based UUID with MAC address:

550e8400-e29b-41d4-a716-446655440001
├─ Timestamp (42 bits)
├─ Clock sequence (14 bits)
└─ MAC address (48 bits)

Key Property: UUIDs are sortable by generation time

import uuid

id1 = uuid.uuid1()  # Generated at time T1
id2 = uuid.uuid1()  # Generated at time T2 (T2 > T1)

# id2 > id1 lexicographically! ✅

Implementation (FetchJob)

1. Updated Model:

# app/models/fetch_job.py
import uuid

class FetchJob(Base):
    __tablename__ = "fetch_job"
    
    id: Mapped[int] = mapped_column(primary_key=True)
    uuid_id: Mapped[str] = mapped_column(
        unique=True, 
        default=lambda: str(uuid.uuid1())
    )
    # ... other fields

2. Updated Repository:

# app/repositories/fetch_job_repo.py
async def get_latest_for_source(self, content_source_id: int) -> FetchJob | None:
    res = await self.session.execute(
        select(FetchJob)
        .where(FetchJob.content_source_id == content_source_id)
        .order_by(FetchJob.uuid_id.desc())  # ← Changed from created_at DESC, id DESC
        .limit(1)
    )
    return res.scalar_one_or_none()

3. Removed Sleep Delays:

# BEFORE (Phase 3)
job1 = await create_job(source1)
await asyncio.sleep(0.3)  # ❌ REMOVED
job2 = await create_job(source2)
await asyncio.sleep(0.3)  # ❌ REMOVED

# AFTER (Phase 4)
job1 = await create_job(source1)
job2 = await create_job(source2)  # ✅ NO DELAY NEEDED!

Performance Gain:

  • Before: ~3-5 seconds for tests with sleep
  • After: ~1-2 seconds (50%+ faster!) ⚡

🔐 Security Improvement: UUID4 for User

Problem: Predictable User IDs (CSC315 - Security)

Before Phase 4:

GET /api/users/1
{
  "id": 1,
  "username": "alice",
  "email": "alice@example.com"
}

GET /api/users/2
{
  "id": 2,
  "username": "bob",
  "email": "bob@example.com"
}

Attack: Enumerate all users by trying /api/users/1, /api/users/2, etc.

Solution: UUID4 Public Identifier (CSC315 - Defense in Depth)

What is UUID4? Random UUID:

a1234567-89ab-cdef-0123-456789abcdef  ← Impossible to guess

After Phase 4:

GET /api/users/a1234567-89ab-cdef-0123-456789abcdef
{
  "uuid_id": "a1234567-89ab-cdef-0123-456789abcdef",
  "username": "alice",
  "email": "alice@example.com"
}

Security Benefit:

  • ✅ Can't enumerate by guessing
  • ✅ Even with one UUID, can't predict others
  • ✅ Leaks no information about user count

Implementation (User)

1. Updated Model:

# app/models/user.py
import uuid

class User(Base):
    __tablename__ = "users"
    
    id: Mapped[int] = mapped_column(primary_key=True)  # Internal
    uuid_id: Mapped[str] = mapped_column(
        unique=True,
        default=lambda: str(uuid.uuid4())  # Random, not time-based!
    )
    username: Mapped[str] = mapped_column(unique=True, index=True)
    email: Mapped[str] = mapped_column(unique=True, index=True)
    # ... other fields

2. Updated Response Schema:

# app/schemas/user_schema.py
class UserResponse(UserBase):
    uuid_id: str  # ← Expose only uuid_id, not internal id
    
    model_config = ConfigDict(from_attributes=True)

3. No Internal Code Changes:

  • Repos still use user.id (integer, fast)
  • Foreign keys still reference user_id (efficient)
  • Only API responses expose uuid_id (secure)

🏗️ Architecture: Keep BOTH IDs (CSC315 - Pragmatic Design)

Why Not Replace id Entirely?

Aspect Integer id UUID String
PK Performance ✅ Fast (4 bytes) ❌ Slow (36 bytes)
Storage ✅ 4 bytes ❌ 36 bytes
Join Speed ✅ O(1) ⚠️ O(log n)
Security ❌ Predictable ✅ Random
Ease of Use ✅ Simple ⚠️ Complex

Solution: Dual Strategy

Database Layer (CSC314 - Optimization):
  ↓
  PrimaryKey: id (INT)           ← Fast, efficient
  UniqueKey: uuid_id (VARCHAR)   ← Secure, random
  ForeignKey: user_id references id
  
API Layer (CSC315 - Security):
  ↓
  Response exposes: uuid_id only ← Can't enumerate
  Never expose: id               ← Hidden from clients

Result:

  • ✅ Database queries remain fast (INT PKs)
  • ✅ API is secure (UUID exposed)
  • ✅ Zero refactoring needed (no code changes to repos)

🗄️ Database Migrations (Alembic)

FetchJob Migration

alembic revision --autogenerate -m "Add uuid_id to FetchJob"
alembic upgrade head

Generated Migration:

# alembic/versions/xxx_add_uuid_id_to_fetchjob.py
def upgrade():
    op.add_column('fetch_job', 
        sa.Column('uuid_id', sa.String(), nullable=False, 
                 server_default=sa.text('uuid_generate_v1()')))
    op.create_unique_constraint('uq_fetch_job_uuid_id', 
                                'fetch_job', ['uuid_id'])

def downgrade():
    op.drop_constraint('uq_fetch_job_uuid_id', 'fetch_job')
    op.drop_column('fetch_job', 'uuid_id')

User Migration

alembic revision --autogenerate -m "Add uuid_id to User"
alembic upgrade head

Result:

  • ✅ Both tables updated with uuid columns
  • ✅ Backward compatible (old id columns preserved)
  • ✅ Unique constraints enforced
  • ✅ Zero data loss

📊 Test Updates (CSC318 - Async Testing)

Fixture Updates

# tests/conftest.py
import uuid

@fixture
def test_user() -> User:
    return User(
        id=1,
        uuid_id=str(uuid.uuid4()),  # ← Added
        username="testuser",
        email="test@example.com",
        hashed_password="hashed_pwd",
        created_at=datetime.now(timezone.utc),
        updated_at=datetime.now(timezone.utc),
    )

@fixture
def user_response(test_user) -> UserResponse:
    return UserResponse(
        uuid_id=test_user.uuid_id,  # ← Added
        username=test_user.username,
        email=test_user.email
    )

Performance Metrics

Before Phase 4:

  • Total tests: 15
  • Execution time: ~4-5 seconds (with sleep delays)
  • Slow feedback loop

After Phase 4:

  • Total tests: 15
  • Execution time: ~1-2 seconds (50% faster!)
  • Instant feedback ⚡

📚 Course Integration (Your Learning Path)

Course Concept Application
CSC314 Algorithm Selection UUID1 (time-based, sortable) vs UUID4 (random, secure)
CSC315 System Design Keep both id and uuid_id for efficiency + security
CSC315 Security UUID4 prevents user enumeration attacks
CSC317 State Machine UUID1 ordering replaces sleep-based synchronization
CSC318 Async Testing Removed asyncio.sleep() delays; tests run faster

🔗 Files Modified

Models:

Repositories:

Schemas:

Tests:

Migrations:

  • alembic/versions/xxx_add_uuid_id_to_fetchjob.py — FetchJob schema change
  • alembic/versions/xxx_add_uuid_id_to_user.py — User schema change

🎯 Key Achievements

Performance: Tests 50% faster (no sleep delays)
Security: User enumeration attacks prevented
Architecture: Both efficiency (INT PKs) and security (UUID exposed)
Maintainability: Zero refactoring needed (pragmatic design)
Testing: All 15+ tests passing
Documentation: Alembic migrations version-controlled


🚀 What's Next? (Phase 4.5 / Phase 5)

Option A: Distributed Scheduler

  • Celery + Redis for concurrent fetches across multiple servers
  • Load balancing

Option B: Webhook Notifications

  • Notify users when new posts arrive
  • Real-time updates (CSC318 - WebSockets)

Option C: Performance Monitoring

  • Metrics collection (success rate, avg fetch time)
  • Query optimization (CSC314)

Option D: Phase 5 - Advanced Features

  • Full-text search
  • User preferences & recommendations
  • API rate limiting

📚 References

UUID Types:

  • UUID1: Time-based, sortable (RFC 4122)
  • UUID4: Random, secure (RFC 4122)

SQLAlchemy:

  • Column defaults with callables
  • Unique constraints
  • Foreign key relationships

Alembic:

  • Autogenerate migrations
  • Upgrade/downgrade operations
  • Schema versioning

Last Updated: January 13, 2026
Phase 4 Status: ✅ 100% Complete
Total Tests Passing: 15+
Test Execution Time: ~1-2 seconds (50% improvement from Phase 3)