FatineHic/Pred_TF_cancer


🧬 Pred_tf_cancer

Analysis of NGS data from TCGA to study chromatin accessibility and predict transcription factor (TF) binding, with a particular focus on CEBPB.

🔬 Overview

This project focuses on the analysis of ATAC-seq data from TCGA (The Cancer Genome Atlas) to study chromatin accessibility and predict transcription factor (TF) binding, with a particular focus on CEBPB.

The pipeline covers:

  • Raw read alignment
  • Signal track generation
  • Normalization
  • Peak calling
  • TF binding prediction (MaxATAC)
  • Downstream analysis (PCA, t-SNE, DESeq2, clustering)

The goal is to produce high-quality, normalized data suitable for downstream analysis such as transcription factor binding prediction.


⚠️ Important: Code vs Data

This project is split into two parts:

💻 1. This Git Repository (Code Only)

This repository contains the code required to run the analysis, but NOT the large datasets.

📦 2. Server Data (NOT in Git)

All large files are stored on a remote server at:

/data/hichamif/pred_tf_cancer/

This includes:

  • 🧬 Raw ATAC-seq BAM files (TCGA)
  • 📊 BigWig / BedGraph signal tracks
  • 📍 Peak files (MACS2)
  • 🔮 TF predictions (MaxATAC)
  • 📈 QC results
  • 🧪 Processed intermediate files

👉 These files are not included in the GitHub repository because of their size.


📁 Repository Structure

.
├── src/                # Core logic (pipeline + analysis)
│   ├── preprocessing/  # BAM processing, indexing, normalization
│   ├── analysis/       # PCA, t-SNE, DESeq2, clustering
│   ├── benchmarking/   # Evaluation of TF predictions
│   ├── visualization/  # Plots and heatmaps
│   └── utils/          # Helper scripts (logging, config, merging)
├── workflows/
│   └── pipeline/       # Full pipeline (step-by-step scripts)
├── notebooks/          # Exploratory & validation notebooks (Jupyter)
├── config/             # Configuration (SLURM, parameters)
├── archive/            # Old / deprecated files
└── README.md

🗂 Server Data Structure

/data/hichamif/pred_tf_cancer/
│
├── reads/                          # Raw ATAC-seq data (symlinked BAMs from TCGA)
├── new_tracks/                     # Initial coverage tracks (BigWig, scaled to 1M reads)
├── normalized_tracks/              # Normalized coverage tracks (RP20M)
├── cluster_normalization/          # TF-specific normalization (e.g., for CEBPB)
├── peak/                           # MACS2 peak calls (filtered)
├── predicions_cluster/             # Subset of ATAC-seq predictions (e.g., LIHC)
├── predictions_cluster_all/        # Full set of ATAC-seq predictions (across TCGA)
├── pca_analysis/                   # Downstream analysis (PCA, DESeq2, clustering)
├── QC_results/                     # Output from fast QC pipeline
├── cell_line_data/                 # Reference cell line data for comparison
├── processed_files/                # Generated during processing and feature annotation
├── data/                           # Reference data files used for annotations
├── others/                         # Misc files (e.g., hg38_chrom.sizes)
├── subsets/                        # Cancer-type-specific subset analyses
├── annotated_peaks/                # Peak data organized by cancer/sample/genes
├── cancers/                        # Peak overlap analysis results
├── SAMPLEFILE                      # List of samples and metadata
└── [scripts, logs, other metadata]

📂 Folder Descriptions

reads/

Contains symbolic links to BAM files downloaded from TCGA, each holding the aligned reads for one sample. Each BAM file is accompanied by an index (.bai) file. The file number_reads_1.txt provides statistics such as the number of mapped reads and converted fragments.

reads/
├── ATAC_TCGA-XXX_YYY_1.bam        # BAM: mapped reads for each sample
├── ATAC_TCGA-XXX_YYY_1.bam.bai    # BAM index for fast access
└── number_reads_1.txt             # QC: read counts and mapping stats

⚠️ Note: number_reads_1.txt is incomplete and some samples are missing from it; check QC_results/ for complete data.


new_tracks/

First-stage signal tracks in BigWig format, generated from the BAM files and scaled to 1 million reads. They provide genome-wide coverage for each sample.

new_tracks/
└── ATAC_TCGA-XXX_YYY_1.bw         # BigWig: coverage per sample

normalized_tracks/

Coverage tracks normalized to a common read depth of 20 million (RP20M) to allow comparison across samples. Includes both BedGraph and BigWig formats. Intermediate uncompressed versions are also retained for inspection. The total signal of a sample is computed as the sum of the signal over all regions, with each region weighted by its length.

normalized_tracks/
├── ATAC_TCGA-XXX_YYY_1.bedgraph           # BedGraph: raw coverage (genomic coordinates + signal values)
├── ATAC_TCGA-XXX_YYY_1_RP20M.bedgraph     # BedGraph: scaled to 20M reads
└── ATAC_TCGA-XXX_YYY_1_RP20M.bw           # BigWig: scaled to 20M reads
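The RP20M scaling described above can be sketched roughly as follows. This is a minimal illustration, not the repository's script: it assumes the scaling factor is 20 million divided by the length-weighted total signal of the track, and the function names are hypothetical. The actual pipeline may derive the factor differently (for example, from mapped read counts).

```python
TARGET_DEPTH = 20_000_000  # RP20M target stated in the text


def weighted_total(intervals):
    """Sum of signal * interval length over (chrom, start, end, value) tuples."""
    return sum((end - start) * value for _chrom, start, end, value in intervals)


def scale_to_target(intervals, target=TARGET_DEPTH):
    """Return intervals rescaled so their weighted total equals `target`."""
    total = weighted_total(intervals)
    if total <= 0:
        raise ValueError("track has no signal to normalize")
    factor = target / total  # assumed definition of the scaling factor
    return [(c, s, e, v * factor) for c, s, e, v in intervals]


# Toy bedGraph content (made-up values, not project data)
raw = [("chr1", 0, 100, 2.0), ("chr1", 100, 150, 4.0)]
scaled = scale_to_target(raw)
```

Relative signal between regions is preserved; only the overall scale changes, which is what makes tracks from samples of different depth comparable.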

cluster_normalization/

Signal tracks that have undergone TF-specific normalization using the maxatac normalize tool. These are further processed to be used for clustering or input into predictive models. Includes per-chromosome and genome-wide summary statistics.

cluster_normalization/
├── ATAC_TCGA-XXX_YYY_1_RP20M.bw                        # Normalized BigWig
├── ATAC_TCGA-XXX_YYY_1_RP20M_chromosome_min_max.txt    # Per-chromosome min/max stats
└── ATAC_TCGA-XXX_YYY_1_RP20M_genome_stats.txt          # Genome-wide stats

peak/

Peak files output by MACS2, filtered to keep only standard chromosomes. One BED file per sample lists identified open chromatin regions. A summary file logs peak counts per sample.

peak/
├── ATAC_TCGA-XXX_YYY_1_peaks_macs.bed     # BED: called peaks per sample
└── peaks_summary.txt                      # Number of peaks per sample
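The chromosome filter mentioned above might look like the sketch below. It assumes that "standard chromosomes" means chr1 to chr22 plus chrX and chrY; the README does not say whether chrM is kept, so that set and the function name are assumptions. The second and third input lines are invented for illustration.

```python
# Assumption: "standard chromosomes" = chr1-chr22, chrX, chrY (chrM excluded).
STANDARD_CHROMS = {f"chr{i}" for i in range(1, 23)} | {"chrX", "chrY"}


def keep_standard(bed_lines):
    """Yield BED lines whose first column is a standard chromosome."""
    for line in bed_lines:
        # Skip blank lines and optional BED header lines
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom = line.split("\t", 1)[0]
        if chrom in STANDARD_CHROMS:
            yield line


lines = [
    "chr1\t10184\t10284\t441\t199.492\t4.68993\n",
    "chrUn_KI270742v1\t100\t200\t50\t3.2\t1.1\n",  # invented example row
    "chrM\t5\t105\t90\t12.0\t2.0\n",               # invented example row
]
filtered = list(keep_standard(lines))
```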

predicions_cluster/

Contains a subset of samples: all LIHC samples and one representative from each other cancer type (22 types excluding LIHC).

predicions_cluster/
├── 1LIHC/                                              # Folder for ONE peak file (test)
│   ├── ATAC_TCGA-LIHC_TCGA-BC-A3KF_1.bed               # All peaks, unfiltered by MaxATAC
│   ├── ATAC_TCGA-LIHC_TCGA-BC-A3KF_1_RP20M.bw
│   └── ATAC_TCGA-LIHC_TCGA-BC-A3KF_1_RP20M_peaks.bed
├── ATAC_TCGA-[CANCER]_[SAMPLE]_RP20M.bw                # One per cancer type (22 types, excl. LIHC)
├── ATAC_TCGA-[CANCER]_[SAMPLE]_peaks.bed               # One per cancer type (22 types, excl. LIHC)
└── LIHC/                                               # All LIHC samples
    ├── ATAC_TCGA-LIHC_TCGA-[SAMPLE]_RP20M.bw
    ├── ATAC_TCGA-LIHC_TCGA-[SAMPLE]_RP20M_peaks.bed
    └── logs/                                           # Logs for prediction runs

predictions_cluster_all/

Intended to hold ATAC-seq predictions for all TCGA samples. It currently contains all ACC samples, all BLCA samples, and subsets of the BRCA and CESC samples.

predictions_cluster_all/
├── ATAC_TCGA-[CANCER]_[SAMPLE]_RP20M.bw
└── ATAC_TCGA-[CANCER]_[SAMPLE]_peaks.bed

pca_analysis/

⚠️ Status: May need verification or cleanup.

Contains results from dimensionality reduction analyses on the ATAC-seq data (PCA and t-SNE) and from differential accessibility analysis with DESeq2.

pca_analysis/
├── filtered_bams/                          # Processed filtered BAMs (NOT ALL FILES)
├── generate_counts_*.log
└── results/
    ├── ATAC_all_regions_count*.txt
    ├── cell_line_counts/
    ├── sample_counts/
    ├── metadata.csv, samples.txt
    ├── DESeq_results_*.csv
    ├── DESeq2_significant_genes.csv
    ├── PCA_plot*.png
    ├── tSNE_plot1.png
    ├── heatmap_DESeq2.png
    ├── volcano_plot_*.pdf
    └── CancerType_Clustering1.png

QC_results/

Comprehensive quality control metrics for all samples, divided into three categories:

QC_results/
├── basic/                                  # Basic BAM-level QC
│   ├── SAMPLE_flagstat.txt
│   ├── SAMPLE_idxstats.txt
│   └── summary.txt                         # Aggregated metrics: mapped %, mt %, pairing
├── fragments/                              # Insert size distributions (subsampled reads)
│   └── summary.txt                         # Fragment type proportions (NFR, mono, di, tri)
└── peaks/                                  # Peak-level statistics
    └── counts.txt                          # Total peak counts per sample
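To make the fragment type proportions in fragments/summary.txt concrete, here is a small sketch of how insert sizes could be binned into NFR, mono-, di- and tri-nucleosomal classes. The cutoffs below are hypothetical placeholders; the boundaries used by the QC pipeline are not stated in this README.

```python
from collections import Counter

# Hypothetical cutoffs in bp (low inclusive, high exclusive); adjust to the
# boundaries actually used by the QC pipeline.
BINS = [("NFR", 0, 100), ("mono", 100, 300), ("di", 300, 500), ("tri", 500, 700)]


def classify(insert_size):
    """Assign an insert size to a fragment class."""
    size = abs(insert_size)  # BAM TLEN is negative for the reverse mate
    for name, low, high in BINS:
        if low <= size < high:
            return name
    return "other"


def proportions(insert_sizes):
    """Fraction of fragments falling in each class."""
    counts = Counter(classify(s) for s in insert_sizes)
    total = sum(counts.values())
    return {name: counts[name] / total for name in counts}


summary = proportions([50, 150, -150, 350])
```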

cell_line_data/

Reference data from cancer cell lines corresponding to TCGA tumor types, including both ATAC-seq and ChIP-seq peaks. Data was copied from ENCODE for comparison purposes (no direct Jupyter access to ENCODE).

cell_line_data/
├── atac/                                   # ATAC-seq peaks from cell lines
│   └── ATAC_[CELL_LINE]_*_peaks_macs.bed
└── chip/                                   # ChIP-seq peaks from cell lines
    └── CEBPB_[CELL_LINE]_*_peaks_peakzilla.bed

processed_files/

Contains intermediate and final processed files for downstream analyses, including summit locations, fixed-width windows (100bp around summits), and annotation files.

processed_files/
├── summits/                                # Peak summit information
│   ├── *_summit.bed                        # Summits from ATAC-seq, ChIP-seq, predictions
│   └── motif_summits.bed                   # Summit locations for motif hits
└── windows/clean/                          # Fixed-width 100bp windows around summits
    └── *_window.bed
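A fixed-width window around a summit can be derived as in the sketch below. It assumes the window is centred on the summit (50 bp on each side) and clipped at chromosome boundaries; the function name and the clipping behaviour are illustrative assumptions, and the summit value in the example is invented so that the result matches a 100 bp window.

```python
def summit_window(chrom, summit, width=100, chrom_size=None):
    """Return (chrom, start, end) for a fixed-width window centred on a summit."""
    half = width // 2
    start = max(0, summit - half)       # do not run off the chromosome start
    end = start + width
    if chrom_size is not None and end > chrom_size:
        end = chrom_size                # clip at the chromosome end
        start = max(0, end - width)
    return chrom, start, end


window = summit_window("chr1", 10234)   # hypothetical summit position
```

Chromosome lengths for the optional clipping could be read from others/hg38_chrom.sizes.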

data/

Reference data files used throughout the pipeline.

data/
├── hg38_maxatac_blacklist.bed              # Genomic regions to exclude (blacklist)
└── CEBPB_filtered_6mer.bed                 # Filtered CEBPB motif locations

others/

Miscellaneous files.

others/
└── hg38_chrom.sizes                        # Chromosome sizes for hg38

subsets/

Cancer-type-specific subset analyses with visualization outputs.

subsets/
└── {BRCA, LIHC, COAD, LUAD, ...}/
    ├── subset_{cancer_type}.tsv
    └── plots/
        ├── plot1.1_box_atac.png
        ├── plot1.2_access_overlap_bar.png
        ├── plot1.3_scatter_patient_vs_cell.png
        ├── plot2.2_motif_heatmap.png
        ├── plot2.3_motif_access_combo.png
        ├── plot4.1_feature_overlap_combo.png
        └── plot4.3_confidence_dist.png

annotated_peaks/

Peak data organized in three different ways:

annotated_peaks/
├── by_cancer_type/         # Subdirectories for 23 TCGA cancer types
│   └── {ACC, BLCA, BRCA, CESC, CHOL, COAD, ESCA, GBM, HNSC, KIRC,
│        KIRP, LGG, LIHC, LUAD, LUSC, MESO, PCPG, PRAD, SKCM,
│        STAD, TGCT, THCA, UCEC}/
├── by_sample/              # Per-sample annotated peak files
│   └── ATAC_TCGA-[CANCER]_TCGA-[PATIENT]_1_peaks_macs_annotated.bed
└── top_genes/              # Top genes per cancer type
    ├── [CANCER_TYPE]_top_100_genes.txt
    └── all_cancer_types_top10_summary.csv

cancers/

Peak overlap analysis results.

cancers/
├── peak_overlap_TCGA-[CANCER_TYPE].txt               # Raw peak overlap data
├── peak_overlap_TCGA-[CANCER_TYPE]_percent_table.txt # Percentage tables
├── heatmap_[CANCER_TYPE].png                         # Heatmap visualizations
└── cancers/
    ├── heatmap_results/                              # Heatmaps for all 23 cancer types
    └── heatmapsresults_num/                          # Numerical data for heatmaps

SAMPLEFILE

A tab-delimited file containing metadata per sample, including:

  • Sample name
  • Genome reference (e.g., hg38)
  • Additional required columns for downstream processing
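A minimal reader for SAMPLEFILE could look like the sketch below. The column names are assumptions inferred from the five tab-separated fields written by the `echo` command in the Quick Start (sample name, genome, project, sample ID, replicate), and the file is assumed to have no header row.

```python
import csv
import io

# Assumed column order, inferred from the five-field Quick Start example.
FIELDS = ["sample", "genome", "project", "sample_id", "replicate"]


def read_samplefile(handle):
    """Parse a tab-delimited SAMPLEFILE into a list of dicts."""
    reader = csv.DictReader(handle, fieldnames=FIELDS, delimiter="\t")
    return [row for row in reader if row["sample"]]


# In-memory stand-in for open("SAMPLEFILE", newline="")
demo = io.StringIO("ATAC_TCGA-XXX_YYY_1\thg38\tTCGA-XXX\tYYY\t1\n")
samples = read_samplefile(demo)
```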

📋 File Formats & Column Descriptions

Patient ATAC-seq Peaks

📂 Location: peak/
📄 Format: ATAC_TCGA-{CANCER_TYPE}_{SAMPLE_ID}_{REPLICATE}_peaks_macs.bed

Column              Description
chr                 Chromosome
start               Start position
end                 End position
length              Peak length
signal_value        Signal intensity
p-value             Statistical significance
CEBPB Motif Data

📂 Location: data/CEBPB_filtered_6mer.bed

Column              Description
chr                 Chromosome
start               Start position
end                 End position
motif_sequence      Motif sequence
score               Motif score
strand              Strand (+/-)

Prediction Files

📂 Location:

  • Multiple samples: predictions_cluster_all/
  • Single sample: predicions_cluster/

Column              Description
chr                 Chromosome
start               Start position
end                 End position
prediction_score    Predicted binding probability

Cell Line ATAC-seq Peaks

📂 Location: cell_line_data/atac/
📄 Format: ATAC_{CELL_LINE}_{REPLICATE}_peaks_macs.bed

Column              Description
chr                 Chromosome
start               Start position
end                 End position
length              Peak length
signal_value        Signal intensity
p-value             Statistical significance

Cell Line ChIP-seq Peaks

📂 Location: cell_line_data/chip/
📄 Format: CEBPB_{CELL_LINE}_{REPLICATE}_peaks_peakzilla.bed

Column              Description
chr                 Chromosome
start               Start position
end                 End position
summit              Summit position
fold_change         Fold change
q-value             Adjusted p-value

100bp Window Files

📂 Location: processed_files/windows/clean/

If 100bp windows around summits are needed, they are available here.

Patient ATAC-seq peaks (windowed):
📄 ATAC_{CANCER_TYPE}_{PATIENT_ID}_{REPLICATE}_peaks_macs_window.bed

chr1    10184   10284   441     199.492   4.68993
chr1    14482   14582   2327    44.2601   2.22135

Column              Description
chr                 Chromosome
start               Start position
end                 End position
length              Peak length
signal_value        Signal intensity
p-value             Statistical significance
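As an illustration of the six-column layout, the sketch below parses one windowed ATAC-seq row into a typed record. The class and function names are hypothetical; the field order follows the column table.

```python
from typing import NamedTuple


class AtacPeak(NamedTuple):
    chrom: str
    start: int
    end: int
    length: int
    signal_value: float
    p_value: float


def parse_atac_peak(line):
    """Parse one whitespace- or tab-separated row of a *_peaks_macs_window.bed file."""
    chrom, start, end, length, signal, pval = line.split()[:6]
    return AtacPeak(chrom, int(start), int(end), int(length), float(signal), float(pval))


peak = parse_atac_peak("chr1    10184   10284   441     199.492   4.68993")
```

In the example row the window spans 100 bp (10184 to 10284) while the length column still reports 441, so the length field refers to the original peak rather than to the window.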

Cell line ATAC-seq peaks (windowed):
📄 ATAC_{CELL_LINE}_{REPLICATE}_peaks_macs_window.bed

chr1    181569  181669  444     36.79459  7.72453
chr1    191425  191525  219     8.13932   3.35317

Cell line ChIP-seq peaks (windowed):
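📄 CEBPB_{CELL_LINE}_{REPLICATE}_peaks_peakzilla_window.bed

chr1    920269  920369  1.14    6.04
chr1    1000848 1000948 2.26    11.61

Column              Description
chr                 Chromosome
start               Start position
end                 End position
summit              Summit position
fold_change         Fold change
q-value             Adjusted p-value

⚠️ Note: the example rows above contain only five fields, so the windowed files may omit one of the columns listed here. Check a file before parsing columns by position.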

Patient predicted ChIP-seq peaks (windowed):
📄 ATAC_{CANCER_TYPE}_{PATIENT_ID}_{REPLICATE}_RP20M_peaks_pred_window.bed

⚠️ Not present for all patients; 78 samples are available.

chr1    10366   10466   0.99420804
chr1    15134   15234   0.5894148

Column              Description
chr                 Chromosome
start               Start position
end                 End position
prediction_score    Predicted binding probability
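Prediction windows are often filtered by score before further analysis. The sketch below keeps windows at or above a cutoff; the default of 0.5 is an arbitrary illustrative value, not a threshold recommended by MaxATAC or used by this project, and the function name is hypothetical.

```python
def confident_sites(lines, min_score=0.5):
    """Return (chrom, start, end, score) for windows with score >= min_score."""
    sites = []
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed rows
        score = float(fields[3])
        if score >= min_score:
            sites.append((fields[0], int(fields[1]), int(fields[2]), score))
    return sites


rows = [
    "chr1    10366   10466   0.99420804",
    "chr1    15134   15234   0.5894148",
]
strong = confident_sites(rows, min_score=0.9)
```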

Motif file (windowed):
📄 motif_summits_window.bed

chr1    19523   19623   ATTGTGAAAT   0.000176   -
chr1    33155   33255   ATTGTGTAAT   7.22e-05   +

Column              Description
chr                 Chromosome
start               Start position
end                 End position
motif_sequence      Motif sequence
score               Motif score
strand              Strand (+/-)
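One common use of the windowed files is to check which ATAC-seq or prediction windows overlap a motif window. The following sketch does this with a simple pairwise test on half-open BED intervals; function names are hypothetical, the motif coordinates come from the example rows above, and the query windows are invented. For genome-scale files a dedicated tool such as `bedtools intersect` would be far more efficient than this quadratic loop.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) half-open intervals share at least 1 bp."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def windows_with_motif(windows, motifs):
    """Keep the windows that overlap at least one motif window."""
    return [w for w in windows if any(overlaps(w, m) for m in motifs)]


motifs = [("chr1", 19523, 19623), ("chr1", 33155, 33255)]
windows = [("chr1", 19600, 19700), ("chr1", 15134, 15234)]  # invented query windows
hits = windows_with_motif(windows, motifs)
```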

🧬 Workflow Overview

BAM Files (TCGA)
      │
      ▼
┌─────────────────┐
│  1. BAM QC &    │──→ reads/
│     Indexing    │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  2. Signal Track│──→ new_tracks/
│     Generation  │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  3. Normalize   │──→ normalized_tracks/
│     (RP20M)     │
└────────┬────────┘
         │
    ┌────┴────┐
    ▼         ▼
┌────────┐ ┌────────────────┐
│ 4. Peak│ │ 5. TF-specific │──→ cluster_normalization/
│ Calling│ │  Normalization │
└───┬────┘ └───────┬────────┘
    │              │
    ▼              ▼
  peak/    ┌────────────────┐
           │ 6. TF Binding  │──→ predicions_cluster/
           │  Prediction    │    predictions_cluster_all/
           │  (MaxATAC)     │
           └───────┬────────┘
                   │
                   ▼
           ┌────────────────┐
           │ 7. Feature     │──→ processed_files/
           │  Integration   │
           └───────┬────────┘
                   │
                   ▼
           ┌────────────────┐
           │ 8. Downstream  │──→ pca_analysis/
           │  Analysis      │    subsets/
           └────────────────┘

Step Details

  1. BAM Acquisition & QC: BAM files are linked from TCGA repositories and stored in reads/. Indexing and QC are performed.
  2. Signal Track Generation: Using bedtools, BAMs are converted to genome-wide coverage tracks and stored in new_tracks/.
  3. Normalization: Tracks are scaled to 20M reads and saved in normalized_tracks/.
  4. Peak Calling: MACS2 is run on each BAM file. Peaks are filtered and saved in peak/.
  5. TF-specific Normalization: Optional normalization (e.g., for CEBPB) using maxatac normalize, stored in cluster_normalization/.
  6. TF Binding Prediction: MaxATAC is used to predict CEBPB binding in each sample.
  7. Feature Integration: Peak summit locations are extracted from ATAC-seq, ChIP-seq, and predictions. 100bp windows are created around summits. Windows are merged to create a unified region set.
  8. QC and Metadata Tracking: Sample info is tracked in SAMPLEFILE. QC summaries are included in each respective folder. Fragment size distributions are analyzed for NFR and nucleosome patterns.
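The window-merging part of step 7 can be sketched as a standard interval merge, similar in spirit to `bedtools merge`. The function name is hypothetical, and whether the pipeline merges touching (book-ended) windows as well as overlapping ones is an assumption made here.

```python
def merge_regions(regions):
    """Merge overlapping or touching (chrom, start, end) intervals."""
    merged = []
    for chrom, start, end in sorted(regions):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            # Extend the previous region instead of starting a new one
            prev = merged[-1]
            merged[-1] = (chrom, prev[1], max(prev[2], end))
        else:
            merged.append((chrom, start, end))
    return merged


# Windows pooled from ATAC-seq, ChIP-seq and predictions (toy coordinates)
pooled = [("chr1", 150, 250), ("chr1", 100, 200), ("chr2", 100, 200), ("chr1", 300, 400)]
unified = merge_regions(pooled)
```

Sorting first guarantees that every interval only needs to be compared with the most recently merged one, so the merge itself is a single linear pass.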

📊 Cancer Type Summary

Total Samples: 410
Based on available reads, peaks, and tracks (predictions still in progress).

Cancer Type                              TCGA Code   Samples
Adrenocortical carcinoma                 ACC         9
Bladder urothelial carcinoma             BLCA        10
Breast invasive carcinoma                BRCA        75
Cervical squamous cell carcinoma         CESC        4
Cholangiocarcinoma                       CHOL        5
Colon adenocarcinoma                     COAD        41
Esophageal carcinoma                     ESCA        18
Glioblastoma multiforme                  GBM         9
Head and neck squamous cell carcinoma    HNSC        9
Kidney renal clear cell carcinoma        KIRC        16
Kidney renal papillary cell carcinoma    KIRP        34
Brain lower grade glioma                 LGG         13
Liver hepatocellular carcinoma           LIHC        17
Lung adenocarcinoma                      LUAD        22
Lung squamous cell carcinoma             LUSC        16
Mesothelioma                             MESO        7
Pheochromocytoma and Paraganglioma       PCPG        9
Prostate adenocarcinoma                  PRAD        26
Skin cutaneous melanoma                  SKCM        13
Stomach adenocarcinoma                   STAD        21
Testicular germ cell tumors              TGCT        9
Thyroid carcinoma                        THCA        14
Uterine corpus endometrial carcinoma     UCEC        13
Total                                                410

🔁 Cancer Type ↔ Cell Line Mapping

Cell line data available on the server for comparison:

Cancer Type     Cell Line   Tissue Type
BRCA            MCF7        Breast Cancer
LIHC            HepG2       Liver Cancer
LUAD / LUSC     A549        Lung Cancer
COAD            HCT116      Colon Cancer
STAD            SNU719      Stomach Cancer
BLCA            T24         Bladder Cancer
GBM             U87         Glioblastoma
PRAD            LNCaP       Prostate Cancer
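For scripted comparisons, the mapping above can be kept as a small lookup table. The sketch below is illustrative only: the dictionary and function names are hypothetical, and the glob patterns follow the file naming shown in the cell_line_data/ section.

```python
# TCGA code -> matching reference cell line, copied from the table above
CELL_LINE_BY_CANCER = {
    "BRCA": "MCF7", "LIHC": "HepG2", "LUAD": "A549", "LUSC": "A549",
    "COAD": "HCT116", "STAD": "SNU719", "BLCA": "T24", "GBM": "U87",
    "PRAD": "LNCaP",
}


def reference_patterns(cancer_code):
    """Return glob patterns for the cell line peak files, or None if unmapped."""
    cell_line = CELL_LINE_BY_CANCER.get(cancer_code)
    if cell_line is None:
        return None
    return {
        "atac": f"cell_line_data/atac/ATAC_{cell_line}_*_peaks_macs.bed",
        "chip": f"cell_line_data/chip/CEBPB_{cell_line}_*_peaks_peakzilla.bed",
    }


lusc = reference_patterns("LUSC")
```

Cancer types without a listed cell line (for example ACC) return None, which makes the missing reference explicit rather than silently skipping the comparison.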

🧪 Key Analysis Questions

  • How does CEBPB binding differ across cancer types?
  • Do cancer cell lines accurately represent patient tissue for CEBPB binding?
  • Can we identify cancer type-specific CEBPB binding sites?
  • What is the correlation between CEBPB binding and chromatin accessibility?
  • How does CEBPB binding relate to known cancer pathways?

🛠️ Tools & Environments

Software

Tool                     Purpose
samtools                 BAM indexing and statistics (module load)
bedtools                 Coverage and read manipulation
macs2                    Peak calling (pipeline compatible with both versions)
ucsc-bedgraphtobigwig    BedGraph → BigWig conversion
bigWigToBedGraph         BigWig → BedGraph conversion
maxatac                  TF-specific normalization, predictions, benchmarking
R (DESeq2, ggplot2)      Statistical analysis and visualization
Python                   Scripting and utilities
SLURM                    HPC job scheduling

Conda Environments

# MaxATAC environment
source /shared/software/miniconda3/etc/profile.d/conda.sh
conda activate /shared/home/bancquaa/.conda/envs/maxatac

# Python analysis environment
conda activate /data/hichamif/envs/pred_tf_env

🚀 Quick Start

Process a New Sample

1. Link BAM file to reads/ directory:

ln -s /path/to/original/sample.bam reads/ATAC_TCGA-XXX_YYY_1.bam

2. Generate signal track:

bash others/scripts/generate_tracks.sh reads/ATAC_TCGA-XXX_YYY_1.bam

3. Call peaks:

bash others/scripts/call_peaks.sh reads/ATAC_TCGA-XXX_YYY_1.bam

4. Predict TF binding:

# Activate MaxATAC environment
source /shared/software/miniconda3/etc/profile.d/conda.sh
conda activate /shared/home/bancquaa/.conda/envs/maxatac

# Run prediction
# To verify: --model may expect a path to a trained .h5 model rather than a
# TF name; check `maxatac predict --help` for the installed version.
maxatac predict \
  --signal normalized_tracks/ATAC_TCGA-XXX_YYY_1_RP20M.bw \
  --model CEBPB \
  --output predictions_cluster_all/ATAC_TCGA-XXX_YYY_1

5. Add to metadata:

echo -e "ATAC_TCGA-XXX_YYY_1\thg38\tTCGA-XXX\tYYY\t1" >> SAMPLEFILE

💡 For batch processing of multiple samples, use the provided SLURM scripts in the others/scripts/ directory.


📎 Notes

  • All files are named in a consistent format: ATAC_TCGA-<CANCER>_<ID>_<REPLICATE> for traceability.
  • Scripts used for each step are available on request or in the supplementary scripts folder (if included).
  • This repository is optimized for reproducibility and modular processing.
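Because every file follows the ATAC_TCGA-<CANCER>_<ID>_<REPLICATE> convention, sample names can be split back into their parts programmatically. The regular expression below is a sketch based on the names shown in this README (for example ATAC_TCGA-LIHC_TCGA-BC-A3KF_1); the function name is hypothetical, and suffixes such as .bam or _RP20M.bw must be stripped before parsing.

```python
import re

# <CANCER> is assumed to be uppercase letters; <ID> may itself contain
# hyphens and underscores, so the replicate is anchored at the end.
NAME_RE = re.compile(
    r"^ATAC_TCGA-(?P<cancer>[A-Z]+)_(?P<sample_id>.+)_(?P<replicate>\d+)$"
)


def parse_sample_name(name):
    """Split a sample name into cancer type, sample ID and replicate."""
    match = NAME_RE.match(name)
    if match is None:
        raise ValueError(f"unexpected sample name: {name!r}")
    parts = match.groupdict()
    parts["replicate"] = int(parts["replicate"])
    return parts


info = parse_sample_name("ATAC_TCGA-LIHC_TCGA-BC-A3KF_1")
```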

Added details for me

🧬 1. ATAC-seq data

📁 /data/hichamif/pred_tf_cancer/peak/

This directory contains the ATAC-seq peak files. Each BED file lists the open (accessible) regions in the genome of one sample.

Useful for:

  • Locating active regulatory regions.
  • Intersecting these regions with the CEBPB predictions and motifs.

🔮 2. CEBPB predictions (from ATAC-seq)

📁 /data/hichamif/pred_tf_cancer/predictions_cluster_all/
📁 /data/hichamif/pred_tf_cancer/predicions_cluster/

  • predictions_cluster_all/ → contains the MaxATAC predictions for all samples of each cancer type.
  • predicions_cluster/ → contains one prediction per representative sample of each cancer type.

💡 Each file lists regions (BED format) predicted to be CEBPB binding sites.

🧬 3. CEBPB motifs

📁 /data/hichamif/pred_tf_cancer/data/CEBPB_filtered_6mer.bed

BED file of identified CEBPB motifs (probably obtained with FIMO or MOODS, or from a database such as HOCOMOCO or JASPAR).

Useful for:

  • Checking whether predicted ChIP/ATAC peaks contain a canonical CEBPB motif.
  • Adding a validation layer (filtering by motif presence).

🧪 4. Windowed prediction files (clean)

📂 /data/hichamif/pred_tf_cancer/processed_files/windows/clean/

These files contain the genomic windows used for prediction, after filtering (quality, scores, size, etc.).

Useful for:

  • Linking predictions to precise genomic regions.
  • Enabling a clean intersection between predictions, motifs, and ATAC/ChIP peaks.

About

Predicting CEBPB transcription factor binding from TCGA NGS data across 23 cancer types using MaxATAC. Includes full preprocessing, normalization, peak calling, and downstream analysis pipeline.
