Overview¶

In this notebook, we will annotate the transcription starts sites (TSSs) in bulk scATAC-seq data to get the active promoter information for base GRN construction.

Notebook file¶

Notebook file is available here. https://github.com/morris-lab/CellOracle/blob/master/docs/notebooks/01_ATAC-seq_data_processing/option2_Bulk_ATAC-seq_data/01_preprocess_Bulk_ATAC_seq_peak_data.ipynb

0. Import libraries¶

[7]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns

import os, sys, shutil, importlib, glob
from tqdm import tqdm_notebook as tqdm

%config InlineBackend.figure_format = 'retina'

plt.rcParams['figure.figsize'] = [6, 4.5]
plt.rcParams["savefig.dpi"] = 300

[8]:

# Import celloracle function
from celloracle import motif_analysis as ma

1. Load input data¶

Import ATAC-seq bed file. This script can also be used with DNase-seq or Chip-seq data.

Here, we use bulk ATAC-seq data. Please prepare the bulk ATAC-seq data as a bed file format.

You can download the demo file by running the following command:

Note: If the file download fails, please manually download and unzip the data.

https://raw.githubusercontent.com/morris-lab/CellOracle/master/docs/demo_data/bulk_ATAC_seq_peak_data.bed

[3]:

# Download file.
!wget https://raw.githubusercontent.com/morris-lab/CellOracle/master/docs/demo_data/bulk_ATAC_seq_peak_data.bed

# If you are using macOS, please try the following command.
#!curl -O https://raw.githubusercontent.com/morris-lab/CellOracle/master/docs/demo_data/bulk_ATAC_seq_peak_data.bed

--2021-07-07 21:38:59--  https://raw.githubusercontent.com/morris-lab/CellOracle/master/docs/demo_data/bulk_ATAC_seq_peak_data.bed
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.109.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10446347 (10.0M) [text/plain]
Saving to: ‘bulk_ATAC_seq_peak_data.bed’

bulk_ATAC_seq_peak_ 100%[===================>]   9.96M  --.-KB/s    in 0.1s

2021-07-07 21:39:00 (80.3 MB/s) - ‘bulk_ATAC_seq_peak_data.bed’ saved [10446347/10446347]

[9]:

# Load bed_file
file_path_of_bed_file = "bulk_ATAC_seq_peak_data.bed"
bed = ma.read_bed(file_path_of_bed_file)
print(bed.shape)
bed.head()

(436206, 4)

[9]:

	chrom	start	end	seqname
0	chr1	3002478	3002968	chr1_3002478_3002968
1	chr1	3084739	3085712	chr1_3084739_3085712
2	chr1	3103576	3104022	chr1_3103576_3104022
3	chr1	3106871	3107210	chr1_3106871_3107210
4	chr1	3108932	3109158	chr1_3108932_3109158

[10]:

# Convert bed file into peak name list
peaks = ma.process_bed_file.df_to_list_peakstr(bed)
peaks

[10]:

array(['chr1_3002478_3002968', 'chr1_3084739_3085712',
       'chr1_3103576_3104022', ..., 'chrY_631222_631480',
       'chrY_795887_796426', 'chrY_2397419_2397628'], dtype=object)

2. Make TSS annotation¶

IMPORTANT: Please make sure that you are setting the correct reference genome!

[11]:

tss_annotated = ma.get_tss_info(peak_str_list=peaks, ref_genome="mm9")

# Check results
tss_annotated.tail()

que bed peaks: 436206
tss peaks in que: 24822

[11]:

	chr	start	end	gene_short_name	strand
24817	chr2	60560211	60561602	Itgb6	-
24818	chr15	3975177	3978654	BC037032	-
24819	chr14	67690701	67692101	Ppp2r2a	-
24820	chr17	48455247	48455773	B430306N03Rik	+
24821	chr10	59861192	59861608	Gm17455	+

[12]:

# Change format
peak_id_tss = ma.process_bed_file.df_to_list_peakstr(tss_annotated)
tss_annotated = pd.DataFrame({"peak_id": peak_id_tss,
                              "gene_short_name": tss_annotated.gene_short_name.values})
tss_annotated = tss_annotated.reset_index(drop=True)
print(tss_annotated.shape)
tss_annotated.head()

(24822, 2)

[12]:

	peak_id	gene_short_name
0	chr7_50691730_50692032	Nkg7
1	chr7_50692077_50692785	Nkg7
2	chr13_93564413_93564836	Thbs4
3	chr13_14613429_14615645	Hecw1
4	chr3_99688753_99689665	Spag17

3. Save data¶

[10]:

tss_annotated.to_csv("processed_peak_file.csv")

Please go to the next step: Transcriptioin factor motif scan

https://morris-lab.github.io/CellOracle.documentation/tutorials/motifscan.html

[ ]: