{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Overview\n", "\n", "Before building the base GRN, we need to annotate the coaccessible peaks and filter our active promoter/enhancer elements. First, we will identify the peaks around transcription starting sites (TSS). We will then merge the Cicero data with the TSS peak information and filter any peaks with weak connections to the TSS peaks. As such, the filtered peak data will only include TSS peaks and peaks with strong TSS connections. These will be our active promoter/enhancer elements for our base GRN. \n", "\n", "### Notebook file\n", "\n", "Notebook file is available on CellOracle GitHub page.\n", "https://github.com/morris-lab/CellOracle/blob/master/docs/notebooks/01_ATAC-seq_data_processing/option1_scATAC-seq_data_analysis_with_cicero/02_preprocess_peak_data.ipynb\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 0. Import libraries" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "%matplotlib inline\n", "\n", "import seaborn as sns\n", "\n", "\n", "import os, sys, shutil, importlib, glob\n", "from tqdm.notebook import tqdm\n" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "scrolled": true }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2023-01-20 17:19:19,057 [7367] WARNING py.warnings:99: [JupyterRequire] /home/k/anaconda3/envs/pandas1/lib/python3.6/site-packages/requests/__init__.py:91: RequestsDependencyWarning:\n", "\n", "urllib3 (1.26.9) or chardet (3.0.4) doesn't match a supported version!\n", "\n", "\n" ] }, { "data": { "text/plain": [ "'0.10.14'" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from celloracle import motif_analysis as ma\n", "import celloracle as co\n", "co.__version__" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "%config InlineBackend.figure_format = 'retina'\n", "\n", "plt.rcParams['figure.figsize'] = [6, 4.5]\n", "plt.rcParams[\"savefig.dpi\"] = 300" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 1. Load scATAC peak data and peak connection data made with Cicero" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1.0. Download data\n", "\n", "In this notebook, we will annotate and filter output from Cicero. Please refer to the previous step to learn about data preparation with Cicero.\n", "https://morris-lab.github.io/CellOracle.documentation/tutorials/base_grn.html#step1-scatac-seq-analysis-with-cicero\n", "\n", "\n", "Here, we will use the preprocessed fetal brain scATAC-seq data from step 1.\n", "\n", "\n", "You can download the demo file by running the following command. \n", "\n", "Note: If the download fails, please manually download and unzip the data.\n", "https://raw.githubusercontent.com/morris-lab/CellOracle/master/docs/demo_data/all_peaks.csv\n", "\n", "https://raw.githubusercontent.com/morris-lab/CellOracle/master/docs/demo_data/cicero_connections.csv" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "--2023-01-20 17:19:21-- https://raw.githubusercontent.com/morris-lab/CellOracle/master/docs/demo_data/all_peaks.csv\n", "Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...\n", "Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.\n", "HTTP request sent, awaiting response... 200 OK\n", "Length: 2940392 (2.8M) [text/plain]\n", "Saving to: ‘all_peaks.csv’\n", "\n", "all_peaks.csv 100%[===================>] 2.80M --.-KB/s in 0.03s \n", "\n", "2023-01-20 17:19:22 (92.3 MB/s) - ‘all_peaks.csv’ saved [2940392/2940392]\n", "\n", "--2023-01-20 17:19:22-- https://raw.githubusercontent.com/morris-lab/CellOracle/master/docs/demo_data/cicero_connections.csv\n", "Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.110.133, 185.199.109.133, ...\n", "Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.\n", "HTTP request sent, awaiting response... 200 OK\n", "Length: 22749615 (22M) [text/plain]\n", "Saving to: ‘cicero_connections.csv’\n", "\n", "cicero_connections. 100%[===================>] 21.70M 109MB/s in 0.2s \n", "\n", "2023-01-20 17:19:23 (109 MB/s) - ‘cicero_connections.csv’ saved [22749615/22749615]\n", "\n" ] } ], "source": [ "# Download file. \n", "!wget https://raw.githubusercontent.com/morris-lab/CellOracle/master/docs/demo_data/all_peaks.csv\n", "!wget https://raw.githubusercontent.com/morris-lab/CellOracle/master/docs/demo_data/cicero_connections.csv\n", " \n", "# If you are using macOS, please try the following command.\n", "#!curl -O https://raw.githubusercontent.com/morris-lab/CellOracle/master/docs/demo_data/all_peaks.csv\n", "#!curl -O https://raw.githubusercontent.com/morris-lab/CellOracle/master/docs/demo_data/cicero_connections.csv\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1.1. Load data" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array(['chr10_100006139_100006389', 'chr10_100015291_100017830',\n", " 'chr10_100018677_100020384', ..., 'chrY_90804622_90805450',\n", " 'chrY_90808626_90809117', 'chrY_90810560_90811167'], dtype=object)" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Load scATAC-seq peak list.\n", "peaks = pd.read_csv(\"all_peaks.csv\", index_col=0)\n", "peaks = peaks.x.values\n", "peaks" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Peak1Peak2coaccess
1chr10_100006139_100006389chr10_99774288_99774570-0.003546
2chr10_100006139_100006389chr10_99825945_99826237-0.027536
3chr10_100006139_100006389chr10_99830012_998303110.009588
4chr10_100006139_100006389chr10_99833211_99833540-0.008067
5chr10_100006139_100006389chr10_99941805_999419550.000000
\n", "
" ], "text/plain": [ " Peak1 Peak2 coaccess\n", "1 chr10_100006139_100006389 chr10_99774288_99774570 -0.003546\n", "2 chr10_100006139_100006389 chr10_99825945_99826237 -0.027536\n", "3 chr10_100006139_100006389 chr10_99830012_99830311 0.009588\n", "4 chr10_100006139_100006389 chr10_99833211_99833540 -0.008067\n", "5 chr10_100006139_100006389 chr10_99941805_99941955 0.000000" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Load Cicero coaccessibility scores.\n", "cicero_connections = pd.read_csv(\"cicero_connections.csv\", index_col=0)\n", "cicero_connections.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 2. Annotate transcription start sites (TSSs)¶\n", "## IMPORTANT: Please make sure that you are setting correct reference genoms.\n", " If your scATAC-seq data was generated with mm10 reference genome, please set `ref_genome=\"mm10\"`.\n", " \n", "You can check supported reference genome using `ma.SUPPORTED_REF_GENOME`\n", "\n", " If your reference genome is not in the list, please send a request to us through CellOracle GitHub issue page." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
speciesref_genomeprovider
0Humanhg38UCSC
1Humanhg19UCSC
2Mousemm39UCSC
3Mousemm10UCSC
4Mousemm9UCSC
5S.cerevisiaesacCer2UCSC
6S.cerevisiaesacCer3UCSC
7ZebrafishdanRer7UCSC
8ZebrafishdanRer10UCSC
9ZebrafishdanRer11UCSC
10XenopusxenTro2UCSC
11XenopusxenTro3UCSC
12Ratrn4UCSC
13Ratrn5UCSC
14Ratrn6UCSC
15Drosophiladm3UCSC
16Drosophiladm6UCSC
17C.elegansce6UCSC
18C.elegansce10UCSC
19ArabidopsisTAIR10Ensembl
20ChickengalGal4UCSC
21ChickengalGal5UCSC
22ChickengalGal6UCSC
23Guinea_PigCavpor3.0Ensembl
24AxolotlAmexG_v6.0-DDAxolotl-omics.org
\n", "
" ], "text/plain": [ " species ref_genome provider\n", "0 Human hg38 UCSC\n", "1 Human hg19 UCSC\n", "2 Mouse mm39 UCSC\n", "3 Mouse mm10 UCSC\n", "4 Mouse mm9 UCSC\n", "5 S.cerevisiae sacCer2 UCSC\n", "6 S.cerevisiae sacCer3 UCSC\n", "7 Zebrafish danRer7 UCSC\n", "8 Zebrafish danRer10 UCSC\n", "9 Zebrafish danRer11 UCSC\n", "10 Xenopus xenTro2 UCSC\n", "11 Xenopus xenTro3 UCSC\n", "12 Rat rn4 UCSC\n", "13 Rat rn5 UCSC\n", "14 Rat rn6 UCSC\n", "15 Drosophila dm3 UCSC\n", "16 Drosophila dm6 UCSC\n", "17 C.elegans ce6 UCSC\n", "18 C.elegans ce10 UCSC\n", "19 Arabidopsis TAIR10 Ensembl\n", "20 Chicken galGal4 UCSC\n", "21 Chicken galGal5 UCSC\n", "22 Chicken galGal6 UCSC\n", "23 Guinea_Pig Cavpor3.0 Ensembl\n", "24 Axolotl AmexG_v6.0-DD Axolotl-omics.org" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ma.SUPPORTED_REF_GENOME" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "que bed peaks: 86935\n", "tss peaks in que: 17238\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
chrstartendgene_short_namestrand
17233chr15513065055132118Mob4+
17234chr69449987594500767Slc25a26+
17235chr194565922245660823Fbxw4-
17236chr12100898848100899597Gpr68-
17237chr4129491262129492047Fam229a-
\n", "
" ], "text/plain": [ " chr start end gene_short_name strand\n", "17233 chr1 55130650 55132118 Mob4 +\n", "17234 chr6 94499875 94500767 Slc25a26 +\n", "17235 chr19 45659222 45660823 Fbxw4 -\n", "17236 chr12 100898848 100899597 Gpr68 -\n", "17237 chr4 129491262 129492047 Fam229a -" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "##!! Please make sure to specify the correct reference genome here\n", "tss_annotated = ma.get_tss_info(peak_str_list=peaks, ref_genome=\"mm10\") \n", "\n", "# Check results\n", "tss_annotated.tail()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 3. Integrate TSS info and cicero connections\n", "\n", "The output file after the integration process has three columns: `[\"peak_id\", \"gene_short_name\", \"coaccess\"`].\n", "\n", "- \"peak_id\" is either the TSS peak or the peaks that have a connection to a TSS peak.\n", "- \"gene_short_name\" is the gene name that associated with the TSS site. \n", "- \"coaccess\" is the coaccessibility score between the peak and a TSS peak. If the score is 1, it means that the peak is a TSS itself." ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(44309, 3)\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
peak_idgene_short_namecoaccess
0chr10_100006139_100006389Tmtc30.017915
1chr10_100015291_100017830Kitl1.000000
2chr10_100018677_100020384Kitl0.146517
3chr10_100050858_100051762Kitl0.069751
4chr10_100052829_100053395Kitl0.202670
\n", "
" ], "text/plain": [ " peak_id gene_short_name coaccess\n", "0 chr10_100006139_100006389 Tmtc3 0.017915\n", "1 chr10_100015291_100017830 Kitl 1.000000\n", "2 chr10_100018677_100020384 Kitl 0.146517\n", "3 chr10_100050858_100051762 Kitl 0.069751\n", "4 chr10_100052829_100053395 Kitl 0.202670" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "integrated = ma.integrate_tss_peak_with_cicero(tss_peak=tss_annotated, \n", " cicero_connections=cicero_connections)\n", "print(integrated.shape)\n", "integrated.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 4. Filter peaks\n", "Remove peaks with weak coaccessibility scores." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "peak = integrated[integrated.coaccess >= 0.8]\n", "peak = peak[[\"peak_id\", \"gene_short_name\"]].reset_index(drop=True)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(15779, 2)\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
peak_idgene_short_name
0chr10_100015291_100017830Kitl
1chr10_100486534_100488209Tmtc3
2chr10_100588641_1005895564930430F08Rik
3chr10_100741247_100742505Gm35722
4chr10_101681379_101682124Mgat4c
\n", "
" ], "text/plain": [ " peak_id gene_short_name\n", "0 chr10_100015291_100017830 Kitl\n", "1 chr10_100486534_100488209 Tmtc3\n", "2 chr10_100588641_100589556 4930430F08Rik\n", "3 chr10_100741247_100742505 Gm35722\n", "4 chr10_101681379_101682124 Mgat4c" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "print(peak.shape)\n", "peak.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 5. Save data\n", "Save the promoter/enhancer peaks." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "peak.to_csv(\"processed_peak_file.csv\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Please go to next step: Transcriptoin factor motif scan**\n", "\n", "https://morris-lab.github.io/CellOracle.documentation/tutorials/motifscan.html" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "finalized": { "timestamp": 1674256776783, "trusted": true }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.10" } }, "nbformat": 4, "nbformat_minor": 2 }