Oak-B/bigdata-analysis-skill

bigdata-analysis-skill

An AI coding skill that makes your AI assistant production-safe when writing Hive, Impala, and Spark ETL code on HDFS/YARN.

It targets the class of bugs that are invisible: row counts look correct and no errors are thrown, but the data is silently wrong or performance collapses.

Install

npx skills add Oak-B/bigdata-analysis-skill@bigdata-analysis

What It Covers

| Rule | Problem It Prevents |
| --- | --- |
| Rule 0 | DESCRIBE before coding; never guess column names or types |
| Rule 1 | Never hard-code table names in Spark source |
| Rule 2 | Keep long-text fields out of GROUP BY (control characters cause silent row explosion) |
| Rule 3 | Filter first, then aggregate; prevents OOM on billion-row tables |
| Rule 4 | Use Spark SQL, not DataFrame API (real benchmark: 3h → 15min) |
| Rule 5 | Control broadcast JOIN threshold; prevents task explosion |
| Rule 6 | Never use SELECT * in INSERT; prevents silent column shifts |
| Rule 7 | Use LEFT JOIN for optional fields; prevents silent row loss |
| Rule 8 | Refresh metadata after Spark write |
| Rule 9 | UDF type safety; nested collection return types crash at runtime |
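
To make Rule 6 concrete, here is a minimal sketch of a silent column shift. It is not taken from the skill itself: it uses Python's built-in `sqlite3` as a stand-in for Hive/Impala, and the tables `src` and `dst` and their columns are hypothetical. The assumption being illustrated is that `INSERT ... SELECT *` maps columns by position rather than by name, so a differing column order produces wrong data without any error. How a stricter engine coerces the mismatched values (for example to NULL) may differ from SQLite.

```python
import sqlite3

# In-memory SQLite database standing in for a Hive/Impala warehouse (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE src (user_id INTEGER, city TEXT, amount REAL);
    -- Same columns as src, but in a different order (hypothetical schema drift).
    CREATE TABLE dst (user_id INTEGER, amount REAL, city TEXT);
    INSERT INTO src VALUES (1, 'Oslo', 9.5);
""")

# Positional mapping: src.city lands in dst.amount and src.amount in dst.city.
# No error is raised and the row count is still correct.
conn.execute("INSERT INTO dst SELECT * FROM src")

# Naming the columns on both sides makes the mapping independent of column order.
conn.execute(
    "INSERT INTO dst (user_id, amount, city) "
    "SELECT user_id, amount, city FROM src"
)

rows = conn.execute(
    "SELECT user_id, amount, city FROM dst ORDER BY rowid"
).fetchall()
shifted_row, correct_row = rows

print("SELECT * insert:      ", shifted_row)
print("explicit column list: ", correct_row)
```

Both inserts succeed and both add exactly one row, which is why this kind of bug is hard to spot from row counts alone.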

Plus: date window off-by-one, Scala string interpolation pitfalls, regex engine differences, and more.
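
The README does not spell out which date window off-by-one it has in mind. One common form, assumed here, is computing an "N-day" window for an inclusive filter such as `BETWEEN start AND end` by subtracting N days, which actually covers N + 1 days. The helper name `inclusive_window` is hypothetical; this is a sketch in plain Python rather than the skill's own code.

```python
from datetime import date, timedelta
from typing import Tuple


def inclusive_window(end_day: date, days: int) -> Tuple[date, date]:
    """Return (start, end) covering exactly `days` calendar days, both ends inclusive."""
    if days < 1:
        raise ValueError("days must be >= 1")
    # Subtract days - 1, because the end day itself counts as one day.
    return end_day - timedelta(days=days - 1), end_day


# Correct 7-day window ending on 2024-03-07.
start, end = inclusive_window(date(2024, 3, 7), 7)
covered = (end - start).days + 1

# Naive version: subtracting the full 7 days yields an 8-day inclusive window.
naive_start = end - timedelta(days=7)
naive_covered = (end - naive_start).days + 1

print(start, end, covered)
print(naive_start, end, naive_covered)
```

If the filter uses a half-open range instead (`>= start AND < end_exclusive`), the arithmetic changes accordingly, so it helps to state the boundary convention next to the query.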

Two Modes

| Mode | Behavior |
| --- | --- |
| Analysis | Run SQL → present numbers → ask the user before making decisions |
| Coding | Follow the 10 rules strictly; never guess types, column order, or table names |

Quick Error Reference

| Symptom | Likely Root Cause |
| --- | --- |
| New column all NULL / field values shifted | SELECT * + schema change (Rule 6) |
| 45+ Spark Jobs, 3-hour runtime | DataFrame API + multiple .count() (Rule 4) |
| Job timeout, 26k+ tasks | Auto-broadcast on medium table (Rule 5) |
| Row explosion, field misalignment | Control characters in GROUP BY field (Rule 2) |
| OOM on aggregation | Direct GROUP BY on billion-row table (Rule 3) |
| Silent row loss after JOIN | INNER JOIN on optional field (Rule 7) |
| Hive/Impala sees no data after write | Metadata not refreshed (Rule 8) |
| UDF NoClassDefFoundError | Nested Scala collection return type (Rule 9) |
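
The "silent row loss after JOIN" symptom can be reproduced in a few lines. The sketch below again uses Python's built-in `sqlite3` as a stand-in for Hive/Impala; the `orders` and `coupons` tables are hypothetical and only illustrate the standard SQL join semantics behind Rule 7, where the joined field is optional (NULL or without a match).

```python
import sqlite3

# In-memory SQLite database standing in for a Hive/Impala warehouse (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER, coupon_id INTEGER);
    CREATE TABLE coupons (coupon_id INTEGER, label TEXT);
    -- Order 2 has no coupon; order 3 references a coupon that does not exist.
    INSERT INTO orders VALUES (1, 10), (2, NULL), (3, 99);
    INSERT INTO coupons VALUES (10, 'SPRING');
""")

order_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]

# INNER JOIN keeps only orders with a matching coupon: two orders vanish silently.
inner_rows = conn.execute("""
    SELECT o.order_id, c.label
    FROM orders o JOIN coupons c ON o.coupon_id = c.coupon_id
    ORDER BY o.order_id
""").fetchall()

# LEFT JOIN keeps every order and fills the optional field with NULL.
left_rows = conn.execute("""
    SELECT o.order_id, c.label
    FROM orders o LEFT JOIN coupons c ON o.coupon_id = c.coupon_id
    ORDER BY o.order_id
""").fetchall()

print("orders:", order_count)
print("inner join:", inner_rows)
print("left join: ", left_rows)
```

Comparing the joined row count against the driving table's row count, as the last three lines do, is a cheap check for this class of loss.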

File Structure

bigdata-analysis/
├── SKILL.md                          # Main skill instructions (10 rules + quick reference)
└── references/
    ├── spark-pitfalls.md             # Deep-dive: root cause analysis & extended examples
    └── sql-patterns.md               # AI-specific SQL anti-patterns

Who Is This For

  • Data engineers writing Hive/Impala SQL or Spark Scala ETL jobs
  • Anyone using AI coding assistants for big data workflows
  • Teams that have been bitten by "data looks right but isn't" bugs

License

MIT
