Overview¶
Before building the base GRN, we need to annotate the coaccessible peaks and filter our active promoter/enhancer elements. First, we will identify the peaks around transcription starting sites (TSS). We will then merge the Cicero data with the TSS peak information and filter any peaks with weak connections to the TSS peaks. As such, the filtered peak data will only include TSS peaks and peaks with strong TSS connections. These will be our active promoter/enhancer elements for our base GRN.
Notebook file¶
Notebook file is available on CellOracle GitHub page. https://github.com/morris-lab/CellOracle/blob/master/docs/notebooks/01_ATAC-seq_data_processing/option1_scATAC-seq_data_analysis_with_cicero/02_preprocess_peak_data.ipynb
0. Import libraries¶
[1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import os, sys, shutil, importlib, glob
from tqdm.notebook import tqdm
[2]:
from celloracle import motif_analysis as ma
import celloracle as co
co.__version__
2023-01-20 17:19:19,057 [7367] WARNING py.warnings:99: [JupyterRequire] /home/k/anaconda3/envs/pandas1/lib/python3.6/site-packages/requests/__init__.py:91: RequestsDependencyWarning:
urllib3 (1.26.9) or chardet (3.0.4) doesn't match a supported version!
[2]:
'0.10.14'
[3]:
%config InlineBackend.figure_format = 'retina'
plt.rcParams['figure.figsize'] = [6, 4.5]
plt.rcParams["savefig.dpi"] = 300
1. Load scATAC peak data and peak connection data made with Cicero¶
In this notebook, we will annotate and filter output from Cicero. Please refer to the previous step to learn about data preparation with Cicero. https://morris-lab.github.io/CellOracle.documentation/tutorials/base_grn.html#step1-scatac-seq-analysis-with-cicero
Here, we will use the preprocessed fetal brain scATAC-seq data from step 1.
You can download the demo file by running the following command.
Note: If the download fails, please manually download and unzip the data. https://raw.githubusercontent.com/morris-lab/CellOracle/master/docs/demo_data/all_peaks.csv
https://raw.githubusercontent.com/morris-lab/CellOracle/master/docs/demo_data/cicero_connections.csv
[4]:
# Download file.
!wget https://raw.githubusercontent.com/morris-lab/CellOracle/master/docs/demo_data/all_peaks.csv
!wget https://raw.githubusercontent.com/morris-lab/CellOracle/master/docs/demo_data/cicero_connections.csv
# If you are using macOS, please try the following command.
#!curl -O https://raw.githubusercontent.com/morris-lab/CellOracle/master/docs/demo_data/all_peaks.csv
#!curl -O https://raw.githubusercontent.com/morris-lab/CellOracle/master/docs/demo_data/cicero_connections.csv
--2023-01-20 17:19:21-- https://raw.githubusercontent.com/morris-lab/CellOracle/master/docs/demo_data/all_peaks.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2940392 (2.8M) [text/plain]
Saving to: ‘all_peaks.csv’
all_peaks.csv 100%[===================>] 2.80M --.-KB/s in 0.03s
2023-01-20 17:19:22 (92.3 MB/s) - ‘all_peaks.csv’ saved [2940392/2940392]
--2023-01-20 17:19:22-- https://raw.githubusercontent.com/morris-lab/CellOracle/master/docs/demo_data/cicero_connections.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.110.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 22749615 (22M) [text/plain]
Saving to: ‘cicero_connections.csv’
cicero_connections. 100%[===================>] 21.70M 109MB/s in 0.2s
2023-01-20 17:19:23 (109 MB/s) - ‘cicero_connections.csv’ saved [22749615/22749615]
[5]:
# Load scATAC-seq peak list.
peaks = pd.read_csv("all_peaks.csv", index_col=0)
peaks = peaks.x.values
peaks
[5]:
array(['chr10_100006139_100006389', 'chr10_100015291_100017830',
'chr10_100018677_100020384', ..., 'chrY_90804622_90805450',
'chrY_90808626_90809117', 'chrY_90810560_90811167'], dtype=object)
[6]:
# Load Cicero coaccessibility scores.
cicero_connections = pd.read_csv("cicero_connections.csv", index_col=0)
cicero_connections.head()
[6]:
Peak1 | Peak2 | coaccess | |
---|---|---|---|
1 | chr10_100006139_100006389 | chr10_99774288_99774570 | -0.003546 |
2 | chr10_100006139_100006389 | chr10_99825945_99826237 | -0.027536 |
3 | chr10_100006139_100006389 | chr10_99830012_99830311 | 0.009588 |
4 | chr10_100006139_100006389 | chr10_99833211_99833540 | -0.008067 |
5 | chr10_100006139_100006389 | chr10_99941805_99941955 | 0.000000 |
2. Annotate transcription start sites (TSSs)¶¶
If your scATAC-seq data was generated with mm10 reference genome, please set ref_genome="mm10"
.
You can check supported reference genome using ma.SUPPORTED_REF_GENOME
If your reference genome is not in the list, please send a request to us through CellOracle GitHub issue page.
[6]:
ma.SUPPORTED_REF_GENOME
[6]:
species | ref_genome | provider | |
---|---|---|---|
0 | Human | hg38 | UCSC |
1 | Human | hg19 | UCSC |
2 | Mouse | mm39 | UCSC |
3 | Mouse | mm10 | UCSC |
4 | Mouse | mm9 | UCSC |
5 | S.cerevisiae | sacCer2 | UCSC |
6 | S.cerevisiae | sacCer3 | UCSC |
7 | Zebrafish | danRer7 | UCSC |
8 | Zebrafish | danRer10 | UCSC |
9 | Zebrafish | danRer11 | UCSC |
10 | Xenopus | xenTro2 | UCSC |
11 | Xenopus | xenTro3 | UCSC |
12 | Rat | rn4 | UCSC |
13 | Rat | rn5 | UCSC |
14 | Rat | rn6 | UCSC |
15 | Drosophila | dm3 | UCSC |
16 | Drosophila | dm6 | UCSC |
17 | C.elegans | ce6 | UCSC |
18 | C.elegans | ce10 | UCSC |
19 | Arabidopsis | TAIR10 | Ensembl |
20 | Chicken | galGal4 | UCSC |
21 | Chicken | galGal5 | UCSC |
22 | Chicken | galGal6 | UCSC |
23 | Guinea_Pig | Cavpor3.0 | Ensembl |
24 | Axolotl | AmexG_v6.0-DD | Axolotl-omics.org |
[8]:
##!! Please make sure to specify the correct reference genome here
tss_annotated = ma.get_tss_info(peak_str_list=peaks, ref_genome="mm10")
# Check results
tss_annotated.tail()
que bed peaks: 86935
tss peaks in que: 17238
[8]:
chr | start | end | gene_short_name | strand | |
---|---|---|---|---|---|
17233 | chr1 | 55130650 | 55132118 | Mob4 | + |
17234 | chr6 | 94499875 | 94500767 | Slc25a26 | + |
17235 | chr19 | 45659222 | 45660823 | Fbxw4 | - |
17236 | chr12 | 100898848 | 100899597 | Gpr68 | - |
17237 | chr4 | 129491262 | 129492047 | Fam229a | - |
3. Integrate TSS info and cicero connections¶
The output file after the integration process has three columns: ["peak_id", "gene_short_name", "coaccess"
].
“peak_id” is either the TSS peak or the peaks that have a connection to a TSS peak.
“gene_short_name” is the gene name that associated with the TSS site.
“coaccess” is the coaccessibility score between the peak and a TSS peak. If the score is 1, it means that the peak is a TSS itself.
[9]:
integrated = ma.integrate_tss_peak_with_cicero(tss_peak=tss_annotated,
cicero_connections=cicero_connections)
print(integrated.shape)
integrated.head()
(44309, 3)
[9]:
peak_id | gene_short_name | coaccess | |
---|---|---|---|
0 | chr10_100006139_100006389 | Tmtc3 | 0.017915 |
1 | chr10_100015291_100017830 | Kitl | 1.000000 |
2 | chr10_100018677_100020384 | Kitl | 0.146517 |
3 | chr10_100050858_100051762 | Kitl | 0.069751 |
4 | chr10_100052829_100053395 | Kitl | 0.202670 |
4. Filter peaks¶
Remove peaks with weak coaccessibility scores.
[10]:
peak = integrated[integrated.coaccess >= 0.8]
peak = peak[["peak_id", "gene_short_name"]].reset_index(drop=True)
[11]:
print(peak.shape)
peak.head()
(15779, 2)
[11]:
peak_id | gene_short_name | |
---|---|---|
0 | chr10_100015291_100017830 | Kitl |
1 | chr10_100486534_100488209 | Tmtc3 |
2 | chr10_100588641_100589556 | 4930430F08Rik |
3 | chr10_100741247_100742505 | Gm35722 |
4 | chr10_101681379_101682124 | Mgat4c |
5. Save data¶
Save the promoter/enhancer peaks.
[12]:
peak.to_csv("processed_peak_file.csv")
Please go to next step: Transcriptoin factor motif scan
https://morris-lab.github.io/CellOracle.documentation/tutorials/motifscan.html
[ ]: