OpenPecha/tipitaka-data-extractor

Tipitaka APK Database Extraction

This repository contains tools for extracting data from the Tipitakapali.org Android APK (v26.2.5), together with the extracted data. The app provides access to the complete Pali Tipitaka (Buddhist canon), with full-text search and multiple dictionaries.

Overview

The APK contains 8 SQLite databases that together form a comprehensive Pali Buddhist text corpus with:

  • 522,747 text segments from the complete Tipitaka
  • Full-text search index for the entire corpus
  • 8,967 HTML files with formatted book content
  • Multiple dictionaries: Digital Pali Dictionary (DPD), PTS Pali-English Dictionary, Abhidhana (Myanmar)
  • Grammatical tools: inflection tables, compound word splitter, synonyms

Repository Structure

├── README.md                    # This file
├── DATABASE_DOCUMENTATION.md    # Detailed database schema documentation
├── DATA_EXTRACTION_PLAN.md      # Extraction methodology and plans
├── extract_tipitaka.py          # Main extraction script
├── Tipitakapali.org_v26.2.5.apk # Original APK file
├── apk_extracted/               # Decompiled APK contents
├── db_extracted/                # Extracted SQLite databases
└── output/                      # Processed and extracted data
    ├── texts/                   # JSON files with extracted text content
    └── metadata/                # Database metadata and schemas

Key Features

Databases Included

  1. fts_tipitaka.db - Full-text search index (522,747 segments)
  2. cstpali.db - Complete HTML book content (8,967 files)
  3. dpd_tipitakapali.db - Digital Pali Dictionary
  4. dpd_inflection_tipitakapali.db - Grammatical inflection tables
  5. dpd_synonyms_tipitakapali.db - Synonym mappings
  6. dpd_splitter_tipitakapali.db - Compound word splitter
  7. abhidhan_tipitakapali.db - Abhidhana Myanmar dictionary
  8. ptsped2015ed_tipitakapali.db - PTS Pali-English Dictionary
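
Because the table layout of each database is not spelled out here, a quick way to get oriented is to list the tables a file contains. The following is a minimal sketch using only the standard library. The helper name `list_tables` is hypothetical, and the path `db_extracted/fts_tipitaka.db` is an assumption based on the repository layout above.

```python
import sqlite3
from pathlib import Path


def list_tables(db_path):
    """Return the names of all tables in a SQLite database, sorted by name."""
    # Open read-only through a URI so the extracted database is never modified.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]


if __name__ == "__main__":
    # Assumed location; adjust to wherever the databases were extracted.
    db_file = Path("db_extracted") / "fts_tipitaka.db"
    if db_file.exists():
        for table in list_tables(db_file):
            print(table)
```

For a full-text search database the result will likely include the search index's internal helper tables alongside the main one; DATABASE_DOCUMENTATION.md remains the reference for what each table means.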

Text Collections

  • Vinaya Piṭaka (Monastic rules)
  • Sutta Piṭaka (Discourses)
  • Abhidhamma Piṭaka (Analytical teachings)
  • Commentaries (Aṭṭhakathā)
  • Sub-commentaries (Ṭīkā)

Usage

Prerequisites

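The sqlite3, json, and zipfile modules are part of the Python standard library, so no pip install step is needed for them; a recent Python 3 interpreter should be enough.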

Extract Text Data

python extract_tipitaka.py

This will:

  1. Extract all databases from the APK
  2. Process the full-text search database
  3. Generate JSON files with structured text data
  4. Create metadata files with database schemas
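
The first step above can be illustrated with a short sketch. An APK is a ZIP archive, so the standard `zipfile` module can read it. This is not the code in extract_tipitaka.py: the function name `extract_databases` is hypothetical, and it assumes the databases are stored in the archive as plain files ending in `.db`, which may not hold for this APK.

```python
import zipfile
from pathlib import Path


def extract_databases(apk_path, out_dir, suffix=".db"):
    """Copy every archive member ending in `suffix` into out_dir; return the written paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    with zipfile.ZipFile(apk_path) as apk:
        for info in apk.infolist():
            if info.is_dir() or not info.filename.endswith(suffix):
                continue
            # Keep only the base name so nested archive paths are flattened.
            target = out / Path(info.filename).name
            target.write_bytes(apk.read(info))
            written.append(target)
    return written
```

A call such as `extract_databases("Tipitakapali.org_v26.2.5.apk", "db_extracted")` would mirror the layout shown under Repository Structure. Flattening to base names keeps the output directory simple, at the cost of overwriting files that share a name in different archive folders.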

Access the Data

The extracted data is available in multiple formats:

  • JSON files in output/texts/ - structured text data by book
  • SQLite databases in db_extracted/ - original database files
  • Metadata in output/metadata/ - database schemas and statistics
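
As a rough illustration of working with the JSON output, the sketch below counts the segments in each book. It assumes every file in output/texts/ has a `.json` extension and follows the `book_info` / `segments` layout described under Data Format; the function name `summarize_books` is hypothetical.

```python
import json
from pathlib import Path


def summarize_books(texts_dir):
    """Map each book code to its number of segments, one entry per JSON file."""
    summary = {}
    for json_file in sorted(Path(texts_dir).glob("*.json")):
        with json_file.open(encoding="utf-8") as fh:
            data = json.load(fh)
        # Count the segments actually present rather than trusting total_segments.
        summary[data["book_info"]["code"]] = len(data["segments"])
    return summary
```

Calling `summarize_books("output/texts")` returns a plain dictionary, which can also be compared against each file's `total_segments` value as a consistency check.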

Data Format

Each JSON file contains structured text segments:

{
  "book_info": {
    "code": "vin01m",
    "total_segments": 1234,
    "description": "Vinaya Mahavibhanga"
  },
  "segments": [
    {
      "path": "11@vin01m.mul0@k1",
      "content": "Pali text content...",
      "book_code": "11",
      "filename": "vin01m.mul0",
      "paragraph": "k1"
    }
  ]
}
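
In the example above, the `path` value appears to be the `book_code`, `filename`, and `paragraph` fields joined with `@`. The sketch below splits a path back into those parts. The three-part structure is inferred from this single example, so treat it as an assumption; `SegmentPath` and `parse_segment_path` are hypothetical names.

```python
from typing import NamedTuple


class SegmentPath(NamedTuple):
    book_code: str
    filename: str
    paragraph: str


def parse_segment_path(path):
    """Split a path such as '11@vin01m.mul0@k1' into its three '@'-separated parts."""
    parts = path.split("@")
    if len(parts) != 3:
        raise ValueError(f"expected 3 '@'-separated parts, got {len(parts)}: {path!r}")
    return SegmentPath(*parts)


print(parse_segment_path("11@vin01m.mul0@k1"))
# SegmentPath(book_code='11', filename='vin01m.mul0', paragraph='k1')
```

Raising on an unexpected number of parts makes any path that does not fit the assumed pattern visible instead of silently mis-assigning fields.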

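Documentation

  • DATABASE_DOCUMENTATION.md - Detailed database schema documentation
  • DATA_EXTRACTION_PLAN.md - Extraction methodology and plans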

License

This project extracts data from the Tipitakapali.org APK for research and preservation purposes. Please respect the original creators' work and any applicable licenses.

Contributing

Contributions are welcome! Please feel free to submit issues or pull requests to improve the extraction tools or documentation.

Acknowledgments

  • Tipitakapali.org - Original APK developers
  • Digital Pali Dictionary (DPD) - Comprehensive Pali dictionary project
  • Chaṭṭha Saṅgāyana - Digital Tipitaka source text
  • PTS - Pali Text Society dictionary
