Overview¶

Before building the base GRN, we need to annotate the coaccessible peaks and filter our active promoter/enhancer elements. First, we will identify the peaks around transcription starting sites (TSS). We will then merge the Cicero data with the TSS peak information and filter any peaks with weak connections to the TSS peaks. As such, the filtered peak data will only include TSS peaks and peaks with strong TSS connections. These will be our active promoter/enhancer elements for our base GRN.

Notebook file¶

Notebook file is available on CellOracle GitHub page. https://github.com/morris-lab/CellOracle/blob/master/docs/notebooks/01_ATAC-seq_data_processing/option1_scATAC-seq_data_analysis_with_cicero/02_preprocess_peak_data.ipynb

0. Import libraries¶

[1]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns


import os, sys, shutil, importlib, glob
from tqdm.notebook import tqdm

[2]:

from celloracle import motif_analysis as ma
import celloracle as co
co.__version__

2023-01-20 17:19:19,057 [7367] WARNING  py.warnings:99: [JupyterRequire] /home/k/anaconda3/envs/pandas1/lib/python3.6/site-packages/requests/__init__.py:91: RequestsDependencyWarning:

urllib3 (1.26.9) or chardet (3.0.4) doesn't match a supported version!

[2]:

'0.10.14'

[3]:

%config InlineBackend.figure_format = 'retina'

plt.rcParams['figure.figsize'] = [6, 4.5]
plt.rcParams["savefig.dpi"] = 300

1. Load scATAC peak data and peak connection data made with Cicero¶

In this notebook, we will annotate and filter output from Cicero. Please refer to the previous step to learn about data preparation with Cicero. https://morris-lab.github.io/CellOracle.documentation/tutorials/base_grn.html#step1-scatac-seq-analysis-with-cicero

Here, we will use the preprocessed fetal brain scATAC-seq data from step 1.

You can download the demo file by running the following command.

Note: If the download fails, please manually download and unzip the data. https://raw.githubusercontent.com/morris-lab/CellOracle/master/docs/demo_data/all_peaks.csv

https://raw.githubusercontent.com/morris-lab/CellOracle/master/docs/demo_data/cicero_connections.csv

[4]:

# Download file.
!wget https://raw.githubusercontent.com/morris-lab/CellOracle/master/docs/demo_data/all_peaks.csv
!wget https://raw.githubusercontent.com/morris-lab/CellOracle/master/docs/demo_data/cicero_connections.csv

# If you are using macOS, please try the following command.
#!curl -O https://raw.githubusercontent.com/morris-lab/CellOracle/master/docs/demo_data/all_peaks.csv
#!curl -O https://raw.githubusercontent.com/morris-lab/CellOracle/master/docs/demo_data/cicero_connections.csv

--2023-01-20 17:19:21--  https://raw.githubusercontent.com/morris-lab/CellOracle/master/docs/demo_data/all_peaks.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2940392 (2.8M) [text/plain]
Saving to: ‘all_peaks.csv’

all_peaks.csv       100%[===================>]   2.80M  --.-KB/s    in 0.03s

2023-01-20 17:19:22 (92.3 MB/s) - ‘all_peaks.csv’ saved [2940392/2940392]

--2023-01-20 17:19:22--  https://raw.githubusercontent.com/morris-lab/CellOracle/master/docs/demo_data/cicero_connections.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.110.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 22749615 (22M) [text/plain]
Saving to: ‘cicero_connections.csv’

cicero_connections. 100%[===================>]  21.70M   109MB/s    in 0.2s

2023-01-20 17:19:23 (109 MB/s) - ‘cicero_connections.csv’ saved [22749615/22749615]

[5]:

# Load scATAC-seq peak list.
peaks = pd.read_csv("all_peaks.csv", index_col=0)
peaks = peaks.x.values
peaks

[5]:

array(['chr10_100006139_100006389', 'chr10_100015291_100017830',
       'chr10_100018677_100020384', ..., 'chrY_90804622_90805450',
       'chrY_90808626_90809117', 'chrY_90810560_90811167'], dtype=object)

[6]:

# Load Cicero coaccessibility scores.
cicero_connections = pd.read_csv("cicero_connections.csv", index_col=0)
cicero_connections.head()

[6]:

	Peak1	Peak2	coaccess
1	chr10_100006139_100006389	chr10_99774288_99774570	-0.003546
2	chr10_100006139_100006389	chr10_99825945_99826237	-0.027536
3	chr10_100006139_100006389	chr10_99830012_99830311	0.009588
4	chr10_100006139_100006389	chr10_99833211_99833540	-0.008067
5	chr10_100006139_100006389	chr10_99941805_99941955	0.000000

2. Annotate transcription start sites (TSSs)¶¶

If your scATAC-seq data was generated with mm10 reference genome, please set ref_genome="mm10".

You can check supported reference genome using ma.SUPPORTED_REF_GENOME

If your reference genome is not in the list, please send a request to us through CellOracle GitHub issue page.

[6]:

ma.SUPPORTED_REF_GENOME

[6]:

	species	ref_genome	provider
0	Human	hg38	UCSC
1	Human	hg19	UCSC
2	Mouse	mm39	UCSC
3	Mouse	mm10	UCSC
4	Mouse	mm9	UCSC
5	S.cerevisiae	sacCer2	UCSC
6	S.cerevisiae	sacCer3	UCSC
7	Zebrafish	danRer7	UCSC
8	Zebrafish	danRer10	UCSC
9	Zebrafish	danRer11	UCSC
10	Xenopus	xenTro2	UCSC
11	Xenopus	xenTro3	UCSC
12	Rat	rn4	UCSC
13	Rat	rn5	UCSC
14	Rat	rn6	UCSC
15	Drosophila	dm3	UCSC
16	Drosophila	dm6	UCSC
17	C.elegans	ce6	UCSC
18	C.elegans	ce10	UCSC
19	Arabidopsis	TAIR10	Ensembl
20	Chicken	galGal4	UCSC
21	Chicken	galGal5	UCSC
22	Chicken	galGal6	UCSC
23	Guinea_Pig	Cavpor3.0	Ensembl
24	Axolotl	AmexG_v6.0-DD	Axolotl-omics.org

[8]:

##!! Please make sure to specify the correct reference genome here
tss_annotated = ma.get_tss_info(peak_str_list=peaks, ref_genome="mm10")

# Check results
tss_annotated.tail()

que bed peaks: 86935
tss peaks in que: 17238

[8]:

	chr	start	end	gene_short_name	strand
17233	chr1	55130650	55132118	Mob4	+
17234	chr6	94499875	94500767	Slc25a26	+
17235	chr19	45659222	45660823	Fbxw4	-
17236	chr12	100898848	100899597	Gpr68	-
17237	chr4	129491262	129492047	Fam229a	-

3. Integrate TSS info and cicero connections¶

The output file after the integration process has three columns: ["peak_id", "gene_short_name", "coaccess"].

“peak_id” is either the TSS peak or the peaks that have a connection to a TSS peak.
“gene_short_name” is the gene name that associated with the TSS site.
“coaccess” is the coaccessibility score between the peak and a TSS peak. If the score is 1, it means that the peak is a TSS itself.

[9]:

integrated = ma.integrate_tss_peak_with_cicero(tss_peak=tss_annotated,
                                               cicero_connections=cicero_connections)
print(integrated.shape)
integrated.head()

(44309, 3)

[9]:

	peak_id	gene_short_name	coaccess
0	chr10_100006139_100006389	Tmtc3	0.017915
1	chr10_100015291_100017830	Kitl	1.000000
2	chr10_100018677_100020384	Kitl	0.146517
3	chr10_100050858_100051762	Kitl	0.069751
4	chr10_100052829_100053395	Kitl	0.202670

4. Filter peaks¶

Remove peaks with weak coaccessibility scores.

[10]:

peak = integrated[integrated.coaccess >= 0.8]
peak = peak[["peak_id", "gene_short_name"]].reset_index(drop=True)

[11]:

print(peak.shape)
peak.head()

(15779, 2)

[11]:

	peak_id	gene_short_name
0	chr10_100015291_100017830	Kitl
1	chr10_100486534_100488209	Tmtc3
2	chr10_100588641_100589556	4930430F08Rik
3	chr10_100741247_100742505	Gm35722
4	chr10_101681379_101682124	Mgat4c

5. Save data¶

Save the promoter/enhancer peaks.

[12]:

peak.to_csv("processed_peak_file.csv")

Please go to next step: Transcriptoin factor motif scan

https://morris-lab.github.io/CellOracle.documentation/tutorials/motifscan.html

[ ]: