Overview

In this notebook, we will annotate the transcription starts sites (TSSs) in bulk scATAC-seq data to get the active promoter information for base GRN construction.

Notebook file

Notebook file is available here. https://github.com/morris-lab/CellOracle/blob/master/docs/notebooks/01_ATAC-seq_data_processing/option2_Bulk_ATAC-seq_data/01_preprocess_Bulk_ATAC_seq_peak_data.ipynb

0. Import libraries

[7]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns

import os, sys, shutil, importlib, glob
from tqdm import tqdm_notebook as tqdm

%config InlineBackend.figure_format = 'retina'

plt.rcParams['figure.figsize'] = [6, 4.5]
plt.rcParams["savefig.dpi"] = 300

[8]:
# Import celloracle function
from celloracle import motif_analysis as ma

1. Load input data

Import ATAC-seq bed file. This script can also be used with DNase-seq or Chip-seq data.

Here, we use bulk ATAC-seq data. Please prepare the bulk ATAC-seq data as a bed file format.

You can download the demo file by running the following command:

Note: If the file download fails, please manually download and unzip the data.

https://raw.githubusercontent.com/morris-lab/CellOracle/master/docs/demo_data/bulk_ATAC_seq_peak_data.bed

[3]:

# Download file.
!wget https://raw.githubusercontent.com/morris-lab/CellOracle/master/docs/demo_data/bulk_ATAC_seq_peak_data.bed

# If you are using macOS, please try the following command.
#!curl -O https://raw.githubusercontent.com/morris-lab/CellOracle/master/docs/demo_data/bulk_ATAC_seq_peak_data.bed
--2021-07-07 21:38:59--  https://raw.githubusercontent.com/morris-lab/CellOracle/master/docs/demo_data/bulk_ATAC_seq_peak_data.bed
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.109.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10446347 (10.0M) [text/plain]
Saving to: ‘bulk_ATAC_seq_peak_data.bed’

bulk_ATAC_seq_peak_ 100%[===================>]   9.96M  --.-KB/s    in 0.1s

2021-07-07 21:39:00 (80.3 MB/s) - ‘bulk_ATAC_seq_peak_data.bed’ saved [10446347/10446347]

[9]:
# Load bed_file
file_path_of_bed_file = "bulk_ATAC_seq_peak_data.bed"
bed = ma.read_bed(file_path_of_bed_file)
print(bed.shape)
bed.head()
(436206, 4)
[9]:
chrom start end seqname
0 chr1 3002478 3002968 chr1_3002478_3002968
1 chr1 3084739 3085712 chr1_3084739_3085712
2 chr1 3103576 3104022 chr1_3103576_3104022
3 chr1 3106871 3107210 chr1_3106871_3107210
4 chr1 3108932 3109158 chr1_3108932_3109158
[10]:
# Convert bed file into peak name list
peaks = ma.process_bed_file.df_to_list_peakstr(bed)
peaks
[10]:
array(['chr1_3002478_3002968', 'chr1_3084739_3085712',
       'chr1_3103576_3104022', ..., 'chrY_631222_631480',
       'chrY_795887_796426', 'chrY_2397419_2397628'], dtype=object)

2. Make TSS annotation

IMPORTANT: Please make sure that you are setting the correct reference genome!

[11]:
tss_annotated = ma.get_tss_info(peak_str_list=peaks, ref_genome="mm9")

# Check results
tss_annotated.tail()
que bed peaks: 436206
tss peaks in que: 24822
[11]:
chr start end gene_short_name strand
24817 chr2 60560211 60561602 Itgb6 -
24818 chr15 3975177 3978654 BC037032 -
24819 chr14 67690701 67692101 Ppp2r2a -
24820 chr17 48455247 48455773 B430306N03Rik +
24821 chr10 59861192 59861608 Gm17455 +
[12]:
# Change format
peak_id_tss = ma.process_bed_file.df_to_list_peakstr(tss_annotated)
tss_annotated = pd.DataFrame({"peak_id": peak_id_tss,
                              "gene_short_name": tss_annotated.gene_short_name.values})
tss_annotated = tss_annotated.reset_index(drop=True)
print(tss_annotated.shape)
tss_annotated.head()
(24822, 2)
[12]:
peak_id gene_short_name
0 chr7_50691730_50692032 Nkg7
1 chr7_50692077_50692785 Nkg7
2 chr13_93564413_93564836 Thbs4
3 chr13_14613429_14615645 Hecw1
4 chr3_99688753_99689665 Spag17

3. Save data

[10]:
tss_annotated.to_csv("processed_peak_file.csv")

Please go to the next step: Transcriptioin factor motif scan

https://morris-lab.github.io/CellOracle.documentation/tutorials/motifscan.html

[ ]: