Overview

This notebook describes how to run in silico TF perturbations with the GRN models. Please read our paper to learn more about the CellOracle algorithm.

Notebook file

Notebook file is available at CellOracle GitHub. https://github.com/morris-lab/CellOracle/blob/master/docs/notebooks/05_simulation/Gata1_KO_simulation_with_Paul_etal_2015_data.ipynb

Data

In this notebook, CellOracle uses two types of input data.

In this tutorial, we will use the mouse hematopoiesis demo data (Paul et al., 2015). We can load the demo Oracle object using the command below.

oracle = co.data.load_tutorial_oracle_object()

  • Input data 2: Links object. The Links oobject stores the GRN data used in the simulations. In this tutorial, we use demo GRNs calculated from the hematopoiesis scRNA-seq data above and one of the mouse sci-ATAC-seq atlas base GRNs. We can load the demo Links object with this command.

links = co.data.load_tutorial_links_object()

What you can do

In this notebook, we perform two analyses.

  1. in silico TF perturbation to simulate cell identity shifts. CellOracle uses the GRN model to simulate cell identity shifts in response to TF perturbations. For this analysis, you will need the GRN models from the previous notebook.

  2. Compare simulation vectors with developmental vectors. In order to properly interpret the simulation results, it is also important to consider the natural direction of development. First, we will calculate a pseudotime gradient vector field to recapitulate the developmental flow. Then, we will compare the CellOracle TF perturbation vector field with the developmental vector field by calculating the inner product scores. Let’s call the inner product value as perturbation score (PS). Please see the step 5.6 for detail.

Custom data class / object

In this notebook, CellOracle uses four custom classes, Oracle, Links, Gradient_calculator, and Oracle_development_module.

  • Oracle is the main class in the CellOracle package. It is responsible for most of the calculations during GRN model construction and TF perturbation simulations.

  • Links is a class to store GRN data.

  • The Gradient_calculator calculates the developmental vector field from the pseudotime results. If you do not have pseudotime data for your trajectory, please see the pseudotime notebook to calculate this information. https://morris-lab.github.io/CellOracle.documentation/tutorials/pseudotime.html

  • The Oracle_development_module integrates the Oracle object data and the Gradient_calculator object data to analyze how TF perturbation affects on the developmental process. It also has many visualization functions.

0. Import libraries

[1]:
import os
import sys

import matplotlib.colors as colors
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scanpy as sc
import seaborn as sns

This notebook was made with celloracle version 0.10.8. Please use celloracle>=0.10.8 Otherwise, you may get an error.

[2]:
import celloracle as co
co.__version__
/home/k/anaconda3/envs/pandas1/lib/python3.6/site-packages/requests/__init__.py:91: RequestsDependencyWarning:

urllib3 (1.26.9) or chardet (3.0.4) doesn't match a supported version!

[2]:
'0.10.8'
[3]:
#plt.rcParams["font.family"] = "arial"
plt.rcParams["figure.figsize"] = [6,6]
%config InlineBackend.figure_format = 'retina'
plt.rcParams["savefig.dpi"] = 600

%matplotlib inline
[4]:
# Make folder to save plots
save_folder = "figures"
os.makedirs(save_folder, exist_ok=True)

1. Load data

Load the oracle object. See the previous notebook for information on initializing an Oracle object.

[5]:
# Load the tutorial oracle object.
oracle = co.data.load_tutorial_oracle_object()
oracle

# Attention!! Please use the function below when you use your data.
# oracle = co.load_hdf5("ORACLE OBJECT PATH")
[5]:
Oracle object

Meta data
    celloracle version used for instantiation: 0.10.0
    n_cells: 2671
    n_genes: 1999
    cluster_name: louvain_annot
    dimensional_reduction_name: X_draw_graph_fa
    n_target_genes_in_TFdict: 21259 genes
    n_regulatory_in_TFdict: 1093 genes
    n_regulatory_in_both_TFdict_and_scRNA-seq: 90 genes
    n_target_genes_both_TFdict_and_scRNA-seq: 1850 genes
    k_for_knn_imputation: 66
Status
    Gene expression matrix: Ready
    BaseGRN: Ready
    PCA calculation: Done
    Knn imputation: Done
    GRN calculation for simulation: Not finished

In the previous notebook, we calculated the GRNs. Now, we will use these GRNs for the perturbation simulations. First, we will import the GRNs from the Links object.

[6]:
# Here, we load demo links object for the training purpose.
links = co.data.load_tutorial_links_object()

# Attention!! Please use the function below when you use your data.
# links = co.load_hdf5("YOUR LINK OBJCT PATH")

2. Make predictive models for simulation

Here, we will need to fit the ridge regression models again. This process will take less time than the GRN inference in the previous notebook, because we are using the filtered GRN models.

[7]:
links.filter_links()
oracle.get_cluster_specific_TFdict_from_Links(links_object=links)
oracle.fit_GRN_for_simulation(alpha=10,
                              use_cluster_specific_TFdict=True)

3. In silico TF perturbation analysis

Next, we will simulate the TF perturbation effects on cell identity to investigate its potential functions and regulatory mechanisms. Please see the CellOracle paper for more details on scientific rationale.

In this notebook, we will simulate the knockout of the Gata1 gene in the hematopoiesis trajectory.

Previous studies have shown that Gata1 is one of the TFs that regulates cell fate decisions in myeloid progenitors. Additionally, Gata1 has been shown to affect erythroid cell differentiation.

Here, we will use CellOracle to analyze Gata1 and attempt to recapitulate the previous findings from above.

[8]:
# Check gene expression
goi = "Gata1"
sc.pl.draw_graph(oracle.adata, color=[goi, oracle.cluster_column_name],
                 layer="imputed_count", use_raw=False, cmap="viridis")
../../_images/notebooks_05_simulation_Gata1_KO_simulation_with_Paul_etal_2015_data_17_0.png
[9]:
# Plot gene expression in histogram
sc.get.obs_df(oracle.adata, keys=[goi], layer="imputed_count").hist()
plt.show()
../../_images/notebooks_05_simulation_Gata1_KO_simulation_with_Paul_etal_2015_data_18_0.png
  • You can use any gene expression value in the in silico perturbations, but please avoid extremely high values that are far from the natural gene expression range. The upper limit allowed is twice the maximum gene expression.

To simulate Gata1 KO, we will set Gata1 expression to 0.

[10]:
# Enter perturbation conditions to simulate signal propagation after the perturbation.
oracle.simulate_shift(perturb_condition={goi: 0.0},
                      n_propagation=3)

After simulation, celloracle automatically evaluates whether the range of simulated values is appropriate and returns a warning if there is an atypical distribution. You can also evaluate the distribution pattern of simulated values in detail yourself. See the notebook below for more information. https://github.com/morris-lab/CellOracle/blob/master/docs/notebooks/05_simulation/misc/Check_distribution_of_simulation_value.ipynb

  • The steps above simulated tge global future gene expression shift after perturbation. This prediction is based on iterative calculations of signal propagation calculations within the GRN. Please look at our paper for more information.

  • The next step calculates the probability of cell state transitions based on the simulation data. You can use the transition probabilities between cells to predict how cells will change after a perturbation.

  • This transition probability will be used later.

[11]:
# Get transition probability
oracle.estimate_transition_prob(n_neighbors=200,
                                knn_random=True,
                                sampled_fraction=1)

# Calculate embedding
oracle.calculate_embedding_shift(sigma_corr=0.05)

4. Visualization

Caution: It is very important to find the optimal scale parameter.

  • We will need to adjust the scale parameter. Please seek to find the optimal scale parameter for the data based on your data.

  • If the vectors are not visible, you can try a smaller scale parameter to magnify the vector length. However, if you see large vectors in the randomized results (right panel), it means that the scale parameter is too small.

[12]:
fig, ax = plt.subplots(1, 2,  figsize=[13, 6])

scale = 25
# Show quiver plot
oracle.plot_quiver(scale=scale, ax=ax[0])
ax[0].set_title(f"Simulated cell identity shift vector: {goi} KO")

# Show quiver plot that was calculated with randomized graph.
oracle.plot_quiver_random(scale=scale, ax=ax[1])
ax[1].set_title(f"Randomized simulation vector")

plt.show()
../../_images/notebooks_05_simulation_Gata1_KO_simulation_with_Paul_etal_2015_data_26_0.png

4.2. Vector field graph

We will visualize the simulation results as a vector field on a digitized grid. Single cell transition vectors are grouped by grid point.

4.2.1 Find parameters for n_grid and min_mass

n_grid: Number of grid points.

min_mass: Threshold value for the cell density. The appropriate values for these parameters depends on the data. Please find appropriate values using the helper functions below.

[13]:
# n_grid = 40 is a good starting value.
n_grid = 40
oracle.calculate_p_mass(smooth=0.8, n_grid=n_grid, n_neighbors=200)

Please run oracle.suggest_mass_thresholds() to display a range of min_mass parameter values and choose a value to fit the data.

[14]:
# Search for best min_mass.
oracle.suggest_mass_thresholds(n_suggestion=12)
../../_images/notebooks_05_simulation_Gata1_KO_simulation_with_Paul_etal_2015_data_30_0.png

According to the results, the optimal min_mass is around 0.011.

[15]:
min_mass = 0.01
oracle.calculate_mass_filter(min_mass=min_mass, plot=True)
../../_images/notebooks_05_simulation_Gata1_KO_simulation_with_Paul_etal_2015_data_32_0.png

4.2.2 Plot vector fields

  • Again, we need to adjust the scale parameter. Please seek to find the optimal scale parameter that provides good visualization.

  • If you don’t see any vector, you can try the smaller scale parameter to magnify the vector length. However, if you see large vectors in the randomized results (right panel), it means the scale parameter is too small.

[16]:
fig, ax = plt.subplots(1, 2,  figsize=[13, 6])

scale_simulation = 0.5
# Show quiver plot
oracle.plot_simulation_flow_on_grid(scale=scale_simulation, ax=ax[0])
ax[0].set_title(f"Simulated cell identity shift vector: {goi} KO")

# Show quiver plot that was calculated with randomized graph.
oracle.plot_simulation_flow_random_on_grid(scale=scale_simulation, ax=ax[1])
ax[1].set_title(f"Randomized simulation vector")

plt.show()
../../_images/notebooks_05_simulation_Gata1_KO_simulation_with_Paul_etal_2015_data_34_0.png
[17]:
# Plot vector field with cell cluster
fig, ax = plt.subplots(figsize=[8, 8])

oracle.plot_cluster_whole(ax=ax, s=10)
oracle.plot_simulation_flow_on_grid(scale=scale_simulation, ax=ax, show_background=False)
../../_images/notebooks_05_simulation_Gata1_KO_simulation_with_Paul_etal_2015_data_35_0.png

5. [This step is optional] Compare simulation vector with development vectors

  • As shown above, we can simulate how TF perturbations affect cell identity and visualize the results as a vector field map.

  • To interpret the results, it is necessary to take into account the direction of natural differentiation. We will compare the simulated perturbation vectors with the developmental gradient vectors. By comparing them, we can intuitively understand how TFs impact cell fate determination during development. This perspective is also important for estimating the experimental perturbation results.

  • Here, we will calculate the development vector field using pseudotime gradient as follows.

  1. Transfer the pseudotime data into an n x n digitized grid.

  2. Calculate the 2D gradient of pseudotime to get vector field.

  3. Compare the in silico TF perturbation vector field with the development vector field by calculating inner product between these two vectors.

  • Note: Other methods may be used to represent a continuous scRNA-seq trajectory flow. For example, RNA velocity analysis is a good way to estimate the direction of cell differentiation. Choose the method that best suits your data.

In the analysis below, we will use pseudotime data. Pseudotime data is included in the demo data. If you are analyzing your own scRNA-seq data, please calculate pseudotime before starting this analysis.

Refer to the following link below to see an example about calculating pseudotime for scRNA-seq data. https://morris-lab.github.io/CellOracle.documentation/tutorials/pseudotime.html

[18]:
# Visualize pseudotime
fig, ax = plt.subplots(figsize=[6,6])

sc.pl.embedding(adata=oracle.adata, basis=oracle.embedding_name, ax=ax, cmap="rainbow",
                color=["Pseudotime"])
../../_images/notebooks_05_simulation_Gata1_KO_simulation_with_Paul_etal_2015_data_39_0.png
[19]:
from celloracle.applications import Gradient_calculator

# Instantiate Gradient calculator object
gradient = Gradient_calculator(oracle_object=oracle, pseudotime_key="Pseudotime")

We need to set n_grid and min_mass for the pseudotime grid point calculation too. n_grid: Number of grid point.

We already know approproate values for them. Please set the same values as step 4.2.1 above.

[20]:
gradient.calculate_p_mass(smooth=0.8, n_grid=n_grid, n_neighbors=200)
gradient.calculate_mass_filter(min_mass=min_mass, plot=True)
../../_images/notebooks_05_simulation_Gata1_KO_simulation_with_Paul_etal_2015_data_43_0.png

Next, we convert the pseudotime data into grid points. For this calculation we can chose one of two methods.

  • knn: K-Nearesr Neighbor regression. You will need to set number of neighbors. Please adjust n_knn for best results.This will depend on the population size and density of your scRNA-seq data.

gradient.transfer_data_into_grid(args={"method": "knn", "n_knn":50})

  • polynomial: Polynomial regression using x-axis and y-axis of dimensional reduction space.

In general, this method will be more robust. Please use this method if knn method does not work. n_poly is the number of degrees for the polynomial regression model. Please try to find appropriaten_poly searching for best results.

gradient.transfer_data_into_grid(args={"method": "polynomial", "n_poly":3})

[21]:
gradient.transfer_data_into_grid(args={"method": "polynomial", "n_poly":3}, plot=True)
../../_images/notebooks_05_simulation_Gata1_KO_simulation_with_Paul_etal_2015_data_46_0.png

Calculate the 2D vector map to represents the pseudotime gradient. After the gradient calculation, the length of the vector will be normalized automatically.

Please adjust the scale parameter to adjust vector length visualization.

[22]:
# Calculate graddient
gradient.calculate_gradient()

# Show results
scale_dev = 40
gradient.visualize_results(scale=scale_dev, s=5)
../../_images/notebooks_05_simulation_Gata1_KO_simulation_with_Paul_etal_2015_data_48_0.png
[23]:
# Visualize results
fig, ax = plt.subplots(figsize=[6, 6])
gradient.plot_dev_flow_on_grid(scale=scale_dev, ax=ax)
../../_images/notebooks_05_simulation_Gata1_KO_simulation_with_Paul_etal_2015_data_49_0.png
[24]:
# Save gradient object if you want.
#gradient.to_hdf5("Paul_etal.celloracle.gradient")

We will use the inner product calculations to quantitatively compare the 2D developmental flow and perturbation simulation vectors. > If you are not familiar with Inner product / Dot product, please see https://en.wikipedia.org/wiki/Dot_product

  • The inner product represents the similarity between two vectors.

  • Using the inner product, we compare the 2D vector field of perturbation simulation and development flow.

  • The inner product will be a positive value when two vectors are pointing in the same direction.

  • Conversely, the inner product will be a negative value when two vectors are pointing in the opposing directions.

51c9a14e23d94c219dae3531da1d3b3a

  • The length of the vectors also affects the magnitude of inner product value.

In summary, we quantitatively compare the directionality and size of vectors between perturbation simulation and natural differentiation using inner product, and we define the score as perturbation score (PS).

  • A negative PS means that the TF perturbation would block differentiation.

  • A positive PS means that the TF perturbation would promote differentiation.

[25]:
from celloracle.applications import Oracle_development_module

# Make Oracle_development_module to compare two vector field
dev = Oracle_development_module()

# Load development flow
dev.load_differentiation_reference_data(gradient_object=gradient)

# Load simulation result
dev.load_perturb_simulation_data(oracle_object=oracle)


# Calculate inner produc scores
dev.calculate_inner_product()
dev.calculate_digitized_ip(n_bins=10)
  • Here, need to adjust the vm parameter for the PS score color visualization. Please seek to find the optimal vm parameter that provides good visualization.

  • If you don’t see any color in the left panel below, you can try the smaller vm parameter to magnify the scale of vm visualization. However, if you see colors in the randomized results (right panel), it means the vm parameter is too small.

[26]:
# Show perturbation scores
vm = 0.02

fig, ax = plt.subplots(1, 2, figsize=[12, 6])
dev.plot_inner_product_on_grid(vm=0.02, s=50, ax=ax[0])
ax[0].set_title(f"PS")

dev.plot_inner_product_random_on_grid(vm=vm, s=50, ax=ax[1])
ax[1].set_title(f"PS calculated with Randomized simulation vector")
plt.show()
../../_images/notebooks_05_simulation_Gata1_KO_simulation_with_Paul_etal_2015_data_55_0.png
[27]:
# Show perturbation scores with perturbation simulation vector field
fig, ax = plt.subplots(figsize=[6, 6])
dev.plot_inner_product_on_grid(vm=vm, s=50, ax=ax)
dev.plot_simulation_flow_on_grid(scale=scale_simulation, show_background=False, ax=ax)
../../_images/notebooks_05_simulation_Gata1_KO_simulation_with_Paul_etal_2015_data_56_0.png
[28]:
# Let's visualize the results
dev.visualize_development_module_layout_0(s=5,
                                          scale_for_simulation=scale_simulation,
                                          s_grid=50,
                                          scale_for_pseudotime=scale_dev,
                                          vm=vm)
../../_images/notebooks_05_simulation_Gata1_KO_simulation_with_Paul_etal_2015_data_57_0.png

6. Focus on a single development lineage to interpret the results in detail

So far, we have used the Oracle_development_module to analyze the whole cell population. However, the Oracle_development_module can also analyze specific subpopulations.

In this example, we will analyze the MEP lineage.

[29]:
# Get cell index list for the cells of interest
clusters = ['Ery_0', 'Ery_1', 'Ery_2', 'Ery_3', 'Ery_4', 'Ery_5', 'Ery_6',
            'Ery_7', 'Ery_8', 'Ery_9', 'MEP_0', 'Mk_0']
cell_idx = np.where(oracle.adata.obs["louvain_annot"].isin(clusters))[0]

# Check index
print(cell_idx)
[   0    2    4 ... 2666 2668 2670]
[30]:
dev = Oracle_development_module()

# Load development flow
dev.load_differentiation_reference_data(gradient_object=gradient)

# Load simulation result
dev.load_perturb_simulation_data(oracle_object=oracle,
                                 cell_idx_use=cell_idx, # Enter cell id list
                                 name="Lineage_MEP" # Name of this cell group. You can enter any name.
                                 )

# Calculation
dev.calculate_inner_product()
dev.calculate_digitized_ip(n_bins=10)
[31]:
# Let's visualize the results
dev.visualize_development_module_layout_0(s=5,
                                          scale_for_simulation=scale_simulation,
                                          s_grid=50,
                                          scale_for_pseudotime=scale_dev,
                                          vm=vm)
../../_images/notebooks_05_simulation_Gata1_KO_simulation_with_Paul_etal_2015_data_61_0.png

Attention: We have changed the default color scheme for the perturbation score visualization at CellOracle = 0.9.0. If you would like to use the previous blue and red color scheme, please run the following command to change the color scheme.

[32]:
from celloracle.visualizations.config import CONFIG
CONFIG["cmap_ps"] = "coolwarm"
[33]:
dev.visualize_development_module_layout_0(s=5,
                                          scale_for_simulation=scale_simulation,
                                          s_grid=50,
                                          scale_for_pseudotime=scale_dev,
                                          vm=vm)
../../_images/notebooks_05_simulation_Gata1_KO_simulation_with_Paul_etal_2015_data_64_0.png
[ ]: