Overview¶
In this notebook, we will annotate the transcription starts sites (TSSs) in bulk scATAC-seq data to get the active promoter information for base GRN construction.
Notebook file¶
Notebook file is available here. https://github.com/morris-lab/CellOracle/blob/master/docs/notebooks/01_ATAC-seq_data_processing/option2_Bulk_ATAC-seq_data/01_preprocess_Bulk_ATAC_seq_peak_data.ipynb
0. Import libraries¶
[7]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import os, sys, shutil, importlib, glob
from tqdm import tqdm_notebook as tqdm
%config InlineBackend.figure_format = 'retina'
plt.rcParams['figure.figsize'] = [6, 4.5]
plt.rcParams["savefig.dpi"] = 300
[8]:
# Import celloracle function
from celloracle import motif_analysis as ma
1. Load input data¶
Import ATAC-seq bed file. This script can also be used with DNase-seq or Chip-seq data.
Here, we use bulk ATAC-seq data. Please prepare the bulk ATAC-seq data as a bed file format.
You can download the demo file by running the following command:
Note: If the file download fails, please manually download and unzip the data.
[3]:
# Download file.
!wget https://raw.githubusercontent.com/morris-lab/CellOracle/master/docs/demo_data/bulk_ATAC_seq_peak_data.bed
# If you are using macOS, please try the following command.
#!curl -O https://raw.githubusercontent.com/morris-lab/CellOracle/master/docs/demo_data/bulk_ATAC_seq_peak_data.bed
--2021-07-07 21:38:59-- https://raw.githubusercontent.com/morris-lab/CellOracle/master/docs/demo_data/bulk_ATAC_seq_peak_data.bed
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.109.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10446347 (10.0M) [text/plain]
Saving to: ‘bulk_ATAC_seq_peak_data.bed’
bulk_ATAC_seq_peak_ 100%[===================>] 9.96M --.-KB/s in 0.1s
2021-07-07 21:39:00 (80.3 MB/s) - ‘bulk_ATAC_seq_peak_data.bed’ saved [10446347/10446347]
[9]:
# Load bed_file
file_path_of_bed_file = "bulk_ATAC_seq_peak_data.bed"
bed = ma.read_bed(file_path_of_bed_file)
print(bed.shape)
bed.head()
(436206, 4)
[9]:
chrom | start | end | seqname | |
---|---|---|---|---|
0 | chr1 | 3002478 | 3002968 | chr1_3002478_3002968 |
1 | chr1 | 3084739 | 3085712 | chr1_3084739_3085712 |
2 | chr1 | 3103576 | 3104022 | chr1_3103576_3104022 |
3 | chr1 | 3106871 | 3107210 | chr1_3106871_3107210 |
4 | chr1 | 3108932 | 3109158 | chr1_3108932_3109158 |
[10]:
# Convert bed file into peak name list
peaks = ma.process_bed_file.df_to_list_peakstr(bed)
peaks
[10]:
array(['chr1_3002478_3002968', 'chr1_3084739_3085712',
'chr1_3103576_3104022', ..., 'chrY_631222_631480',
'chrY_795887_796426', 'chrY_2397419_2397628'], dtype=object)
2. Make TSS annotation¶
IMPORTANT: Please make sure that you are setting the correct reference genome!
[11]:
tss_annotated = ma.get_tss_info(peak_str_list=peaks, ref_genome="mm9")
# Check results
tss_annotated.tail()
que bed peaks: 436206
tss peaks in que: 24822
[11]:
chr | start | end | gene_short_name | strand | |
---|---|---|---|---|---|
24817 | chr2 | 60560211 | 60561602 | Itgb6 | - |
24818 | chr15 | 3975177 | 3978654 | BC037032 | - |
24819 | chr14 | 67690701 | 67692101 | Ppp2r2a | - |
24820 | chr17 | 48455247 | 48455773 | B430306N03Rik | + |
24821 | chr10 | 59861192 | 59861608 | Gm17455 | + |
[12]:
# Change format
peak_id_tss = ma.process_bed_file.df_to_list_peakstr(tss_annotated)
tss_annotated = pd.DataFrame({"peak_id": peak_id_tss,
"gene_short_name": tss_annotated.gene_short_name.values})
tss_annotated = tss_annotated.reset_index(drop=True)
print(tss_annotated.shape)
tss_annotated.head()
(24822, 2)
[12]:
peak_id | gene_short_name | |
---|---|---|
0 | chr7_50691730_50692032 | Nkg7 |
1 | chr7_50692077_50692785 | Nkg7 |
2 | chr13_93564413_93564836 | Thbs4 |
3 | chr13_14613429_14615645 | Hecw1 |
4 | chr3_99688753_99689665 | Spag17 |
3. Save data¶
[10]:
tss_annotated.to_csv("processed_peak_file.csv")
Please go to the next step: Transcriptioin factor motif scan
https://morris-lab.github.io/CellOracle.documentation/tutorials/motifscan.html
[ ]: