bigdata-analysis-skill

An AI coding skill that makes your AI assistant production-safe when writing Hive, Impala, and Spark ETL code on HDFS/YARN.

It targets bugs that are invisible at first glance: row counts look correct and no errors are thrown, but the data is silently wrong or performance collapses.

Install

npx skills add Oak-B/bigdata-analysis-skill@bigdata-analysis

What It Covers

Rule	What It Requires and Prevents
Rule 0	Run `DESCRIBE` before coding: never guess column names or types
Rule 1	Never hard-code table names in Spark source
Rule 2	Keep long-text fields out of `GROUP BY`: control characters cause silent row explosion
Rule 3	Filter first, then aggregate: prevents OOM on billion-row tables
Rule 4	Use Spark SQL rather than the DataFrame API with repeated `.count()` calls (reported benchmark: 3 h → 15 min)
Rule 5	Control the broadcast JOIN threshold: prevents task explosion when a medium-sized table is auto-broadcast
Rule 6	Never use `SELECT *` in INSERT: prevents silent column shifts after a schema change
Rule 7	Use `LEFT JOIN` for optional fields: prevents silent row loss
Rule 8	Refresh metadata after a Spark write: otherwise Hive/Impala may see no data
Rule 9	Enforce UDF type safety: nested Scala collection return types crash at runtime
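
To make Rule 7 concrete, here is a minimal sketch of how an `INNER JOIN` on an optional field drops rows without raising any error. It is an illustration only: SQLite (from the Python standard library) stands in for Hive/Impala, and the `orders` and `coupons` tables and their columns are hypothetical names that do not come from the skill itself.

```python
import sqlite3

# Assumption: SQLite is used only as a lightweight stand-in for Hive/Impala.
# Table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER, coupon_id INTEGER);
    CREATE TABLE coupons (coupon_id INTEGER, discount REAL);
    INSERT INTO orders VALUES (1, 10), (2, NULL), (3, 11);
    INSERT INTO coupons VALUES (10, 0.2);
""")

# INNER JOIN keeps only orders whose coupon_id matches a coupon row.
inner_rows = conn.execute(
    "SELECT o.order_id, c.discount FROM orders o "
    "INNER JOIN coupons c ON o.coupon_id = c.coupon_id "
    "ORDER BY o.order_id"
).fetchall()

# LEFT JOIN keeps every order; discount is NULL where no coupon matched.
left_rows = conn.execute(
    "SELECT o.order_id, c.discount FROM orders o "
    "LEFT JOIN coupons c ON o.coupon_id = c.coupon_id "
    "ORDER BY o.order_id"
).fetchall()

print(len(inner_rows), len(left_rows))
```

Both queries succeed, so comparing the joined row count against the row count of the driving table is one simple way to notice the loss.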

It also covers date-window off-by-one errors, Scala string interpolation pitfalls, regex engine differences, and more.
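
The README does not spell out the date-window pitfall, so the sketch below assumes one common form of it: subtracting N days from the end date and then filtering with inclusive bounds (as SQL `BETWEEN` does), which covers N + 1 days instead of N. The helper `last_n_days` is a hypothetical name used only for illustration.

```python
from datetime import date, timedelta

def last_n_days(end, n):
    """Return inclusive (start, end) bounds covering exactly n calendar days."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return end - timedelta(days=n - 1), end

end = date(2024, 3, 31)

# Naive window: "end minus 7 days" with inclusive bounds spans 8 days.
naive_start = end - timedelta(days=7)
days_naive = (end - naive_start).days + 1

# Corrected window: subtract n - 1 so the inclusive range spans 7 days.
start, stop = last_n_days(end, 7)
days_correct = (stop - start).days + 1

print(days_naive, days_correct)
```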

Two Modes

Mode	Behavior
Analysis	Run SQL → present numbers → ask the user before making decisions
Coding	Follow the 10 rules strictly; never guess types, column order, or table names

Quick Error Reference

Symptom	Likely Root Cause
New column all NULL / field values shifted	`SELECT *` + schema change (Rule 6)
45+ Spark Jobs, 3-hour runtime	DataFrame API + multiple `.count()` (Rule 4)
Job timeout, 26k+ tasks	Auto-broadcast on medium table (Rule 5)
Row explosion, field misalignment	Control characters in `GROUP BY` field (Rule 2)
OOM on aggregation	Direct `GROUP BY` on billion-row table (Rule 3)
Silent row loss after JOIN	`INNER JOIN` on optional field (Rule 7)
Hive/Impala sees no data after write	Metadata not refreshed (Rule 8)
UDF `NoClassDefFoundError`	Nested Scala collection return type (Rule 9)
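
The first symptom in the table (field values shifted) can be reproduced in miniature. The sketch below again uses SQLite as a stand-in for Hive/Impala, with hypothetical `src` and `dst` tables: because `INSERT ... SELECT *` maps columns by position rather than by name, a differing column order moves values into the wrong fields and no error is raised.

```python
import sqlite3

# Assumption: SQLite stands in for Hive/Impala; src and dst are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Same columns in both tables, but in a different order.
    CREATE TABLE src (user_id TEXT, city TEXT, amount TEXT);
    CREATE TABLE dst (user_id TEXT, amount TEXT, city TEXT);
    INSERT INTO src VALUES ('u1', 'Berlin', '42');
""")

# Positional mapping: src.city lands in dst.amount, src.amount in dst.city.
conn.execute("INSERT INTO dst SELECT * FROM src")
shifted = conn.execute("SELECT user_id, amount, city FROM dst").fetchone()

# Explicit column lists on both sides keep every value in its own field.
conn.execute("DELETE FROM dst")
conn.execute(
    "INSERT INTO dst (user_id, amount, city) "
    "SELECT user_id, amount, city FROM src"
)
aligned = conn.execute("SELECT user_id, amount, city FROM dst").fetchone()

print(shifted)
print(aligned)
```

Listing the columns explicitly costs a few extra lines, but it means a later schema change in the source table cannot silently reorder what lands in the target.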

File Structure

bigdata-analysis/
├── SKILL.md                          # Main skill instructions (10 rules + quick reference)
└── references/
    ├── spark-pitfalls.md             # Deep-dive: root cause analysis & extended examples
    └── sql-patterns.md               # AI-specific SQL anti-patterns

Who Is This For

Data engineers writing Hive/Impala SQL or Spark Scala ETL jobs
Anyone using AI coding assistants for big data workflows
Teams that have been bitten by "data looks right but isn't" bugs

License

MIT
