{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Overview\n",
    "\n",
    "Before building the base GRN, we need to annotate the coaccessible peaks and filter our active promoter/enhancer elements. First, we will identify the peaks around transcription starting sites (TSS). We will then merge the Cicero data with the TSS peak information and filter any peaks with weak connections to the TSS peaks. As such, the filtered peak data will only include TSS peaks and peaks with strong TSS connections. These will be our active promoter/enhancer elements for our base GRN. \n",
    "\n",
    "### Notebook file\n",
    "\n",
    "Notebook file is available on CellOracle GitHub page.\n",
    "https://github.com/morris-lab/CellOracle/blob/master/docs/notebooks/01_ATAC-seq_data_processing/option1_scATAC-seq_data_analysis_with_cicero/02_preprocess_peak_data.ipynb\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 0. Import libraries"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "%matplotlib inline\n",
    "\n",
    "import seaborn as sns\n",
    "\n",
    "\n",
    "import os, sys, shutil, importlib, glob\n",
    "from tqdm.notebook import tqdm\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2023-01-20 17:19:19,057 [7367] WARNING  py.warnings:99: [JupyterRequire] /home/k/anaconda3/envs/pandas1/lib/python3.6/site-packages/requests/__init__.py:91: RequestsDependencyWarning:\n",
      "\n",
      "urllib3 (1.26.9) or chardet (3.0.4) doesn't match a supported version!\n",
      "\n",
      "\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "'0.10.14'"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from celloracle import motif_analysis as ma\n",
    "import celloracle as co\n",
    "co.__version__"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "%config InlineBackend.figure_format = 'retina'\n",
    "\n",
    "plt.rcParams['figure.figsize'] = [6, 4.5]\n",
    "plt.rcParams[\"savefig.dpi\"] = 300"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 1. Load scATAC peak data and peak connection data made with Cicero"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1.0. Download data\n",
    "\n",
    "In this notebook, we will annotate and filter output from Cicero. Please refer to the previous step to learn about data preparation with Cicero.\n",
    "https://morris-lab.github.io/CellOracle.documentation/tutorials/base_grn.html#step1-scatac-seq-analysis-with-cicero\n",
    "\n",
    "\n",
    "Here, we will use the preprocessed fetal brain scATAC-seq data from step 1.\n",
    "\n",
    "\n",
    "You can download the demo file by running the following command. \n",
    "\n",
    "Note: If the download fails, please manually download and unzip the data.\n",
    "https://raw.githubusercontent.com/morris-lab/CellOracle/master/docs/demo_data/all_peaks.csv\n",
    "\n",
    "https://raw.githubusercontent.com/morris-lab/CellOracle/master/docs/demo_data/cicero_connections.csv"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "--2023-01-20 17:19:21--  https://raw.githubusercontent.com/morris-lab/CellOracle/master/docs/demo_data/all_peaks.csv\n",
      "Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...\n",
      "Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.\n",
      "HTTP request sent, awaiting response... 200 OK\n",
      "Length: 2940392 (2.8M) [text/plain]\n",
      "Saving to: ‘all_peaks.csv’\n",
      "\n",
      "all_peaks.csv       100%[===================>]   2.80M  --.-KB/s    in 0.03s   \n",
      "\n",
      "2023-01-20 17:19:22 (92.3 MB/s) - ‘all_peaks.csv’ saved [2940392/2940392]\n",
      "\n",
      "--2023-01-20 17:19:22--  https://raw.githubusercontent.com/morris-lab/CellOracle/master/docs/demo_data/cicero_connections.csv\n",
      "Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.110.133, 185.199.109.133, ...\n",
      "Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.\n",
      "HTTP request sent, awaiting response... 200 OK\n",
      "Length: 22749615 (22M) [text/plain]\n",
      "Saving to: ‘cicero_connections.csv’\n",
      "\n",
      "cicero_connections. 100%[===================>]  21.70M   109MB/s    in 0.2s    \n",
      "\n",
      "2023-01-20 17:19:23 (109 MB/s) - ‘cicero_connections.csv’ saved [22749615/22749615]\n",
      "\n"
     ]
    }
   ],
   "source": [
    "# Download file. \n",
    "!wget https://raw.githubusercontent.com/morris-lab/CellOracle/master/docs/demo_data/all_peaks.csv\n",
    "!wget https://raw.githubusercontent.com/morris-lab/CellOracle/master/docs/demo_data/cicero_connections.csv\n",
    "    \n",
    "# If you are using macOS, please try the following command.\n",
    "#!curl -O https://raw.githubusercontent.com/morris-lab/CellOracle/master/docs/demo_data/all_peaks.csv\n",
    "#!curl -O https://raw.githubusercontent.com/morris-lab/CellOracle/master/docs/demo_data/cicero_connections.csv\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1.1. Load data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array(['chr10_100006139_100006389', 'chr10_100015291_100017830',\n",
       "       'chr10_100018677_100020384', ..., 'chrY_90804622_90805450',\n",
       "       'chrY_90808626_90809117', 'chrY_90810560_90811167'], dtype=object)"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Load scATAC-seq peak list.\n",
    "peaks = pd.read_csv(\"all_peaks.csv\", index_col=0)\n",
    "peaks = peaks.x.values\n",
    "peaks"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Peak1</th>\n",
       "      <th>Peak2</th>\n",
       "      <th>coaccess</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>chr10_100006139_100006389</td>\n",
       "      <td>chr10_99774288_99774570</td>\n",
       "      <td>-0.003546</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>chr10_100006139_100006389</td>\n",
       "      <td>chr10_99825945_99826237</td>\n",
       "      <td>-0.027536</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>chr10_100006139_100006389</td>\n",
       "      <td>chr10_99830012_99830311</td>\n",
       "      <td>0.009588</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>chr10_100006139_100006389</td>\n",
       "      <td>chr10_99833211_99833540</td>\n",
       "      <td>-0.008067</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>chr10_100006139_100006389</td>\n",
       "      <td>chr10_99941805_99941955</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                       Peak1                    Peak2  coaccess\n",
       "1  chr10_100006139_100006389  chr10_99774288_99774570 -0.003546\n",
       "2  chr10_100006139_100006389  chr10_99825945_99826237 -0.027536\n",
       "3  chr10_100006139_100006389  chr10_99830012_99830311  0.009588\n",
       "4  chr10_100006139_100006389  chr10_99833211_99833540 -0.008067\n",
       "5  chr10_100006139_100006389  chr10_99941805_99941955  0.000000"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Load Cicero coaccessibility scores.\n",
    "cicero_connections = pd.read_csv(\"cicero_connections.csv\", index_col=0)\n",
    "cicero_connections.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 2. Annotate transcription start sites (TSSs)¶\n",
    "## IMPORTANT: Please make sure that you are setting correct reference genoms.\n",
    " If your scATAC-seq data was generated with mm10 reference genome, please set `ref_genome=\"mm10\"`.\n",
    " \n",
    "You can check supported reference genome using `ma.SUPPORTED_REF_GENOME`\n",
    "\n",
    " If your reference genome is not in the list, please send a request to us through CellOracle GitHub issue page."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>species</th>\n",
       "      <th>ref_genome</th>\n",
       "      <th>provider</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Human</td>\n",
       "      <td>hg38</td>\n",
       "      <td>UCSC</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Human</td>\n",
       "      <td>hg19</td>\n",
       "      <td>UCSC</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Mouse</td>\n",
       "      <td>mm39</td>\n",
       "      <td>UCSC</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Mouse</td>\n",
       "      <td>mm10</td>\n",
       "      <td>UCSC</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Mouse</td>\n",
       "      <td>mm9</td>\n",
       "      <td>UCSC</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>S.cerevisiae</td>\n",
       "      <td>sacCer2</td>\n",
       "      <td>UCSC</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>S.cerevisiae</td>\n",
       "      <td>sacCer3</td>\n",
       "      <td>UCSC</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>Zebrafish</td>\n",
       "      <td>danRer7</td>\n",
       "      <td>UCSC</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>Zebrafish</td>\n",
       "      <td>danRer10</td>\n",
       "      <td>UCSC</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>Zebrafish</td>\n",
       "      <td>danRer11</td>\n",
       "      <td>UCSC</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>Xenopus</td>\n",
       "      <td>xenTro2</td>\n",
       "      <td>UCSC</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>Xenopus</td>\n",
       "      <td>xenTro3</td>\n",
       "      <td>UCSC</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>Rat</td>\n",
       "      <td>rn4</td>\n",
       "      <td>UCSC</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>13</th>\n",
       "      <td>Rat</td>\n",
       "      <td>rn5</td>\n",
       "      <td>UCSC</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14</th>\n",
       "      <td>Rat</td>\n",
       "      <td>rn6</td>\n",
       "      <td>UCSC</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>15</th>\n",
       "      <td>Drosophila</td>\n",
       "      <td>dm3</td>\n",
       "      <td>UCSC</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>16</th>\n",
       "      <td>Drosophila</td>\n",
       "      <td>dm6</td>\n",
       "      <td>UCSC</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>17</th>\n",
       "      <td>C.elegans</td>\n",
       "      <td>ce6</td>\n",
       "      <td>UCSC</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>18</th>\n",
       "      <td>C.elegans</td>\n",
       "      <td>ce10</td>\n",
       "      <td>UCSC</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>19</th>\n",
       "      <td>Arabidopsis</td>\n",
       "      <td>TAIR10</td>\n",
       "      <td>Ensembl</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>20</th>\n",
       "      <td>Chicken</td>\n",
       "      <td>galGal4</td>\n",
       "      <td>UCSC</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>21</th>\n",
       "      <td>Chicken</td>\n",
       "      <td>galGal5</td>\n",
       "      <td>UCSC</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>22</th>\n",
       "      <td>Chicken</td>\n",
       "      <td>galGal6</td>\n",
       "      <td>UCSC</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>23</th>\n",
       "      <td>Guinea_Pig</td>\n",
       "      <td>Cavpor3.0</td>\n",
       "      <td>Ensembl</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>24</th>\n",
       "      <td>Axolotl</td>\n",
       "      <td>AmexG_v6.0-DD</td>\n",
       "      <td>Axolotl-omics.org</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "         species     ref_genome           provider\n",
       "0          Human           hg38               UCSC\n",
       "1          Human           hg19               UCSC\n",
       "2          Mouse           mm39               UCSC\n",
       "3          Mouse           mm10               UCSC\n",
       "4          Mouse            mm9               UCSC\n",
       "5   S.cerevisiae        sacCer2               UCSC\n",
       "6   S.cerevisiae        sacCer3               UCSC\n",
       "7      Zebrafish        danRer7               UCSC\n",
       "8      Zebrafish       danRer10               UCSC\n",
       "9      Zebrafish       danRer11               UCSC\n",
       "10       Xenopus        xenTro2               UCSC\n",
       "11       Xenopus        xenTro3               UCSC\n",
       "12           Rat            rn4               UCSC\n",
       "13           Rat            rn5               UCSC\n",
       "14           Rat            rn6               UCSC\n",
       "15    Drosophila            dm3               UCSC\n",
       "16    Drosophila            dm6               UCSC\n",
       "17     C.elegans            ce6               UCSC\n",
       "18     C.elegans           ce10               UCSC\n",
       "19   Arabidopsis         TAIR10            Ensembl\n",
       "20       Chicken        galGal4               UCSC\n",
       "21       Chicken        galGal5               UCSC\n",
       "22       Chicken        galGal6               UCSC\n",
       "23    Guinea_Pig      Cavpor3.0            Ensembl\n",
       "24       Axolotl  AmexG_v6.0-DD  Axolotl-omics.org"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ma.SUPPORTED_REF_GENOME"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "que bed peaks: 86935\n",
      "tss peaks in que: 17238\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>chr</th>\n",
       "      <th>start</th>\n",
       "      <th>end</th>\n",
       "      <th>gene_short_name</th>\n",
       "      <th>strand</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>17233</th>\n",
       "      <td>chr1</td>\n",
       "      <td>55130650</td>\n",
       "      <td>55132118</td>\n",
       "      <td>Mob4</td>\n",
       "      <td>+</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>17234</th>\n",
       "      <td>chr6</td>\n",
       "      <td>94499875</td>\n",
       "      <td>94500767</td>\n",
       "      <td>Slc25a26</td>\n",
       "      <td>+</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>17235</th>\n",
       "      <td>chr19</td>\n",
       "      <td>45659222</td>\n",
       "      <td>45660823</td>\n",
       "      <td>Fbxw4</td>\n",
       "      <td>-</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>17236</th>\n",
       "      <td>chr12</td>\n",
       "      <td>100898848</td>\n",
       "      <td>100899597</td>\n",
       "      <td>Gpr68</td>\n",
       "      <td>-</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>17237</th>\n",
       "      <td>chr4</td>\n",
       "      <td>129491262</td>\n",
       "      <td>129492047</td>\n",
       "      <td>Fam229a</td>\n",
       "      <td>-</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "         chr      start        end gene_short_name strand\n",
       "17233   chr1   55130650   55132118            Mob4      +\n",
       "17234   chr6   94499875   94500767        Slc25a26      +\n",
       "17235  chr19   45659222   45660823           Fbxw4      -\n",
       "17236  chr12  100898848  100899597           Gpr68      -\n",
       "17237   chr4  129491262  129492047         Fam229a      -"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "##!! Please make sure to specify the correct reference genome here\n",
    "tss_annotated = ma.get_tss_info(peak_str_list=peaks, ref_genome=\"mm10\") \n",
    "\n",
    "# Check results\n",
    "tss_annotated.tail()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 3. Integrate TSS info and cicero connections\n",
    "\n",
    "The output file after the integration process has three columns: `[\"peak_id\", \"gene_short_name\", \"coaccess\"`].\n",
    "\n",
    "- \"peak_id\" is either the TSS peak or the peaks that have a connection to a TSS peak.\n",
    "- \"gene_short_name\" is the gene name that associated with the TSS site. \n",
    "- \"coaccess\" is the coaccessibility score between the peak and a TSS peak. If the score is 1, it means that the peak is a TSS itself."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(44309, 3)\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>peak_id</th>\n",
       "      <th>gene_short_name</th>\n",
       "      <th>coaccess</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>chr10_100006139_100006389</td>\n",
       "      <td>Tmtc3</td>\n",
       "      <td>0.017915</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>chr10_100015291_100017830</td>\n",
       "      <td>Kitl</td>\n",
       "      <td>1.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>chr10_100018677_100020384</td>\n",
       "      <td>Kitl</td>\n",
       "      <td>0.146517</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>chr10_100050858_100051762</td>\n",
       "      <td>Kitl</td>\n",
       "      <td>0.069751</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>chr10_100052829_100053395</td>\n",
       "      <td>Kitl</td>\n",
       "      <td>0.202670</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                     peak_id gene_short_name  coaccess\n",
       "0  chr10_100006139_100006389           Tmtc3  0.017915\n",
       "1  chr10_100015291_100017830            Kitl  1.000000\n",
       "2  chr10_100018677_100020384            Kitl  0.146517\n",
       "3  chr10_100050858_100051762            Kitl  0.069751\n",
       "4  chr10_100052829_100053395            Kitl  0.202670"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "integrated = ma.integrate_tss_peak_with_cicero(tss_peak=tss_annotated, \n",
    "                                               cicero_connections=cicero_connections)\n",
    "print(integrated.shape)\n",
    "integrated.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 4. Filter peaks\n",
    "Remove peaks with weak coaccessibility scores."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "peak = integrated[integrated.coaccess >= 0.8]\n",
    "peak = peak[[\"peak_id\", \"gene_short_name\"]].reset_index(drop=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(15779, 2)\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>peak_id</th>\n",
       "      <th>gene_short_name</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>chr10_100015291_100017830</td>\n",
       "      <td>Kitl</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>chr10_100486534_100488209</td>\n",
       "      <td>Tmtc3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>chr10_100588641_100589556</td>\n",
       "      <td>4930430F08Rik</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>chr10_100741247_100742505</td>\n",
       "      <td>Gm35722</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>chr10_101681379_101682124</td>\n",
       "      <td>Mgat4c</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                     peak_id gene_short_name\n",
       "0  chr10_100015291_100017830            Kitl\n",
       "1  chr10_100486534_100488209           Tmtc3\n",
       "2  chr10_100588641_100589556   4930430F08Rik\n",
       "3  chr10_100741247_100742505         Gm35722\n",
       "4  chr10_101681379_101682124          Mgat4c"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "print(peak.shape)\n",
    "peak.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 5. Save data\n",
    "Save the promoter/enhancer peaks."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "peak.to_csv(\"processed_peak_file.csv\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Please go to next step: Transcriptoin factor motif scan**\n",
    "\n",
    "https://morris-lab.github.io/CellOracle.documentation/tutorials/motifscan.html"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "finalized": {
   "timestamp": 1674256776783,
   "trusted": true
  },
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}