0. Overview: How to use different motifs for the celloracle motif scan.

This notebook shows how to prepare a motif dataset for celloracle motif analysis. Celloracle uses a list of motif objects from the gimmemotifs package. See the gimmemotifs documentation for more details (https://gimmemotifs.readthedocs.io/en/master/api.html#).

1. Import motifs from gimmemotifs dataset.

Gimmemotifs provides many motif datasets that were generated from public motif databases, including CIS-BP, ENCODE, HOMER, and JASPAR. https://gimmemotifs.readthedocs.io/en/master/overview.html

1.1 gimme.vertebrate.v5.0.

By default, GimmeMotifs uses a non-redundant, clustered database of known vertebrate motifs. These motifs come from CIS-BP (http://cisbp.ccbr.utoronto.ca/) and other sources. This motif dataset can be loaded with the following command.

If your data come from mouse or human, this dataset is a good default choice.

[1]:
# Load the default motifs in gimmemotifs
from gimmemotifs.motif import default_motifs
motifs = default_motifs()

# Check first 10 motifs
motifs[:10]
[1]:
[GM.5.0.Sox.0001_AACAAT,
 GM.5.0.Homeodomain.0001_AGCTGTCAnnA,
 GM.5.0.Mixed.0001_snnGGsssGGs,
 GM.5.0.Nuclear_receptor.0001_TAwsTrGGTCAsTrGGTCA,
 GM.5.0.Mixed.0002_GCTAATTA,
 GM.5.0.Nuclear_receptor.0002_wnyrCTTCCGGGkC,
 GM.5.0.bHLH.0001_ACGTG,
 GM.5.0.Myb_SANT.0001_rrCCGTTAAACnGyy,
 GM.5.0.C2H2_ZF.0001_GCGkGGGCGG,
 GM.5.0.GATA.0001_TTATCTsnnnnnnnCA]

1.2 Motifs that are provided with gimmemotifs package

Many other motif databases are included with GimmeMotifs. You can load them as follows.

[2]:
# Get folder path that stores motif data.
import os
from gimmemotifs.motif import MotifConfig
config = MotifConfig()
motif_dir = config.get_motif_dir()

# Get motif data names
motifs_data_name = [i for i in os.listdir(motif_dir) if i.endswith(".pfm")]
motifs_data_name.sort()
motifs_data_name
[2]:
['CIS-BP.pfm',
 'ENCODE.pfm',
 'HOCOMOCOv10_HUMAN.pfm',
 'HOCOMOCOv10_MOUSE.pfm',
 'HOCOMOCOv11_HUMAN.pfm',
 'HOCOMOCOv11_MOUSE.pfm',
 'HOMER.pfm',
 'IMAGE.pfm',
 'JASPAR2018.pfm',
 'JASPAR2018_fungi.pfm',
 'JASPAR2018_insects.pfm',
 'JASPAR2018_nematodes.pfm',
 'JASPAR2018_plants.pfm',
 'JASPAR2018_urochordates.pfm',
 'JASPAR2018_vertebrates.pfm',
 'JASPAR2020.pfm',
 'JASPAR2020_fungi.pfm',
 'JASPAR2020_insects.pfm',
 'JASPAR2020_nematodes.pfm',
 'JASPAR2020_plants.pfm',
 'JASPAR2020_urochordates.pfm',
 'JASPAR2020_vertebrates.pfm',
 'RSAT_insects.pfm',
 'RSAT_plants.pfm',
 'RSAT_vertebrates.pfm',
 'SwissRegulon.pfm',
 'factorbook.pfm',
 'gimme.vertebrate.v5.0.pfm']
[3]:
# You can load motif files with "read_motifs"
from gimmemotifs.motif import read_motifs

path = os.path.join(motif_dir, "JASPAR2018_plants.pfm")
motifs = read_motifs(path)

# Check first 10 motifs
motifs[:10]
[3]:
[MA0020.1_Dof2_AAAGCn,
 MA0021.1_Dof3_AAAGyn,
 MA0034.1_Gam1_nnyAACCGmC,
 MA0044.1_HMG-1_sTTGTnyTy,
 MA0045.1_HMG-I/Y_nwAnAAAnrnmrAmAy,
 MA0053.1_MNB1A_AAAGC,
 MA0054.1_myb.Ph3_TAACnGTTw,
 MA0064.1_PBF_AAAGy,
 MA0082.1_squamosa_mCAwAwATrGwAAn,
 MA0096.1_bZIP910_mTGACGT]

2. Import motifs from Celloracle dataset.

Celloracle provides many motif datasets that were generated from CisBP. These motifs are divided by species. Please select the motifs for your species.

[4]:
from celloracle import motif_analysis as ma

# Check available motifs
ma.MOTIFS_LIST
[4]:
['CisBP_ver2_Arabidopsis_thaliana.pfm',
 'CisBP_ver2_Caenorhabditis_elegans.pfm',
 'CisBP_ver2_Danio_rerio.pfm',
 'CisBP_ver2_Drosophila_ananassae.pfm',
 'CisBP_ver2_Drosophila_erecta.pfm',
 'CisBP_ver2_Drosophila_grimshawi.pfm',
 'CisBP_ver2_Drosophila_melanogaster.pfm',
 'CisBP_ver2_Drosophila_mix.pfm',
 'CisBP_ver2_Drosophila_mojavensis.pfm',
 'CisBP_ver2_Drosophila_persimilis.pfm',
 'CisBP_ver2_Drosophila_pseudoobscura.pfm',
 'CisBP_ver2_Drosophila_sechellia.pfm',
 'CisBP_ver2_Drosophila_simulans.pfm',
 'CisBP_ver2_Drosophila_virilis.pfm',
 'CisBP_ver2_Drosophila_willistoni.pfm',
 'CisBP_ver2_Drosophila_yakuba.pfm',
 'CisBP_ver2_Homo_sapiens.pfm',
 'CisBP_ver2_Mus_musculus.pfm',
 'CisBP_ver2_Rattus_norvegicus.pfm',
 'CisBP_ver2_Saccharomyces_cerevisiae.pfm',
 'CisBP_ver2_Xenopus_laevis.pfm',
 'CisBP_ver2_Xenopus_tropicalis.pfm',
 'CisBP_ver2_Xenopus_tropicalis_and_Xenopus_laevis.pfm']
[5]:
# Load motifs from celloracle dataset.
motifs = ma.load_motifs("CisBP_ver2_Homo_sapiens.pfm")

# Check first 10 motifs
motifs[:10]
[5]:
[M00056_2.00_TAATAAAT,
 M00070_2.00_nrAACAATAnn,
 M00111_2.00_nGCCynnGGs,
 M00112_2.00_CCTsrGGCnA,
 M00113_2.00_nsCCnnAGGs,
 M00114_2.00_nnGCCynnGG,
 M00115_2.00_nnATnAAAn,
 M00116_2.00_nnAATATTAnn,
 M00130_2.00_nnnGCCCCnn,
 M00142_2.00_GTrCTCmy]
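
Picking the file name for a species can also be scripted. The following is a minimal sketch, not part of celloracle: the helper motif_file_for_species is a hypothetical name, and the naming pattern "CisBP_ver2_<Genus>_<species>.pfm" is inferred from the list printed above. The list is copied in by hand (shortened) so that the example runs without celloracle installed; in a real session you would pass ma.MOTIFS_LIST instead.

```python
# Shortened copy of the file names printed by ma.MOTIFS_LIST above.
AVAILABLE_MOTIF_FILES = [
    "CisBP_ver2_Danio_rerio.pfm",
    "CisBP_ver2_Homo_sapiens.pfm",
    "CisBP_ver2_Mus_musculus.pfm",
    "CisBP_ver2_Xenopus_tropicalis.pfm",
    "CisBP_ver2_Xenopus_tropicalis_and_Xenopus_laevis.pfm",
]


def motif_file_for_species(species, available):
    """Return the motif file name for a species such as "Mus musculus".

    Hypothetical helper: it assumes the "CisBP_ver2_<Genus>_<species>.pfm"
    pattern seen in the list above and uses an exact match, so that
    "Xenopus tropicalis" does not also match the combined Xenopus file.
    """
    wanted = "CisBP_ver2_" + species.strip().replace(" ", "_") + ".pfm"
    if wanted not in available:
        raise KeyError("no motif file found for species: " + species)
    return wanted


mouse_file = motif_file_for_species("Mus musculus", AVAILABLE_MOTIF_FILES)
frog_file = motif_file_for_species("Xenopus tropicalis", AVAILABLE_MOTIF_FILES)
print(mouse_file)
print(frog_file)
```

The returned name can then be passed to ma.load_motifs as in the cell above.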

3. Import motifs from custom motif dataset.

If you want to use another motif data source, you need to make a list of gimmemotifs motif objects. The easiest way to make such a list is to use the “read_motifs” function in gimmemotifs.

This function loads motif data from text files. You need to prepare two files, XXX.motif2factors.txt and XXX.pfm.

3.1 XXX.motif2factors.txt

The text file XXX.motif2factors.txt contains the transcription factor (TF) annotation for each motif. It should be a tab-separated file like the one below.

The first column is the motif name, which must match the motif name in the pfm file. The second column is the gene symbol of the factor. The third column is the evidence (data source) for the annotation; this column is not important. The fourth column gives additional information about the factor: it is labeled “Y” if the factor information was confirmed by some evidence, and “N” otherwise.

[52]:
path_motif2factors = path.replace(".pfm", ".motif2factors.txt")

with open(path_motif2factors, "r") as f:
    for i, j in enumerate(f):
        print(j)
        if i>5:
            break
Motif   Factor  Evidence        Curated

MA0020.1_Dof2   Dof2    SELEX   Y

MA0021.1_Dof3   Dof3    SELEX   Y

MA0034.1_Gam1   Gam1    SELEX   Y

MA0044.1_HMG-1  HMG-1   SELEX   Y

MA0045.1_HMG-I/Y        HMG-I/Y SELEX   Y

MA0053.1_MNB1A  MNB1A   SELEX   Y
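
To make the four-column layout concrete, here is a minimal sketch that parses such a table with the Python standard library only. It is not how gimmemotifs reads the file; parse_motif2factors is a hypothetical helper, and the sample text is copied from the first lines of the output above, assuming the columns are separated by tabs.

```python
import csv
import io

# First lines of the output above, assumed to be tab-separated.
SAMPLE_MOTIF2FACTORS = (
    "Motif\tFactor\tEvidence\tCurated\n"
    "MA0020.1_Dof2\tDof2\tSELEX\tY\n"
    "MA0021.1_Dof3\tDof3\tSELEX\tY\n"
    "MA0034.1_Gam1\tGam1\tSELEX\tY\n"
)


def parse_motif2factors(handle):
    """Return {motif name: [(factor, evidence, curated), ...]}.

    Hypothetical helper. A motif may be listed on several rows,
    so each motif name maps to a list of annotations.
    """
    table = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        annotation = (row["Factor"], row["Evidence"], row["Curated"])
        table.setdefault(row["Motif"], []).append(annotation)
    return table


motif_to_factors = parse_motif2factors(io.StringIO(SAMPLE_MOTIF2FACTORS))
for motif_name, annotations in motif_to_factors.items():
    print(motif_name, annotations)
```

For a real file, open it with open(path_motif2factors, newline="") and pass the file handle instead of the io.StringIO object.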

3.2 XXX.pfm

The second file, XXX.pfm, should contain the motif matrix (PFM/PWM) information. The file should look like the example below.

The motif names in this pfm file should exactly match the motif names in the motif2factors.txt file.

[55]:
with open(path, "r") as f:
    for i, j in enumerate(f):
        print(j)
        if i>10:
            break
# JASPAR2018_plants motif database

# Retrieved from: http://jaspar.genereg.net/download/CORE/JASPAR2018_CORE_plants_non-redundant_pfms_jaspar.txt

# Date: 2018-10-17

>MA0020.1_Dof2

0.9999  0.0000  0.0000  0.0000

0.9999  0.0000  0.0000  0.0000

0.9999  0.0000  0.0000  0.0000

0.0000  0.0000  0.9999  0.0000

0.1429  0.6666  0.0953  0.0953

0.3333  0.2857  0.1429  0.2381

>MA0021.1_Dof3

0.9999  0.0000  0.0000  0.0000
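
The layout above (comment lines starting with "#", a ">" header per motif, then one row of four values per position) can be read with a few lines of plain Python. This is a minimal sketch under those assumptions, not the gimmemotifs parser; parse_pfm is a hypothetical helper. The A, C, G, T column order is also an assumption, although it agrees with the consensus AAAGCn shown for MA0020.1_Dof2 earlier. The sketch also checks the rule that motif names must match the names in the motif2factors.txt file.

```python
import io

# Sample copied from the output above (values assumed to be tab-separated).
SAMPLE_PFM = (
    "# JASPAR2018_plants motif database\n"
    ">MA0020.1_Dof2\n"
    "0.9999\t0.0000\t0.0000\t0.0000\n"
    "0.9999\t0.0000\t0.0000\t0.0000\n"
    "0.9999\t0.0000\t0.0000\t0.0000\n"
    "0.0000\t0.0000\t0.9999\t0.0000\n"
    "0.1429\t0.6666\t0.0953\t0.0953\n"
    "0.3333\t0.2857\t0.1429\t0.2381\n"
    ">MA0021.1_Dof3\n"
    "0.9999\t0.0000\t0.0000\t0.0000\n"
)


def parse_pfm(handle):
    """Return {motif name: list of rows}, one row of floats per position."""
    matrices = {}
    current = None
    for raw_line in handle:
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.startswith(">"):
            current = line[1:]
            matrices[current] = []
        elif current is None:
            raise ValueError("matrix row found before any '>' header")
        else:
            matrices[current].append([float(value) for value in line.split()])
    return matrices


pfm_matrices = parse_pfm(io.StringIO(SAMPLE_PFM))

# Motif names taken from the motif2factors.txt example shown earlier.
annotated_names = {"MA0020.1_Dof2", "MA0021.1_Dof3"}
missing_annotation = sorted(set(pfm_matrices) - annotated_names)
missing_matrix = sorted(annotated_names - set(pfm_matrices))
print("motifs without annotation:", missing_annotation)
print("annotations without motif:", missing_matrix)
```

Both lists should be empty for a consistent pair of files; any name printed here points to a mismatch between the two files.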

3.3 Load files as motif list

We can load the files using the read_motifs function in gimmemotifs.

First, please prepare the two files, XXX.motif2factors.txt and XXX.pfm, in the same directory. If the two files are in different places, we cannot use the read_motifs function.

Then pass the file path of XXX.pfm as the argument of the read_motifs function.

[58]:
from gimmemotifs.motif import read_motifs

# Check path for pfm file
print(path)

# Read motifs
motifs = read_motifs(path)

# Check first 10 motifs
motifs[:10]
/home/k/anaconda3/envs/pandas1/lib/python3.6/site-packages/data/motif_databases/JASPAR2018_plants.pfm
[58]:
[MA0020.1_Dof2_AAAGCn,
 MA0021.1_Dof3_AAAGyn,
 MA0034.1_Gam1_nnyAACCGmC,
 MA0044.1_HMG-1_sTTGTnyTy,
 MA0045.1_HMG-I/Y_nwAnAAAnrnmrAmAy,
 MA0053.1_MNB1A_AAAGC,
 MA0054.1_myb.Ph3_TAACnGTTw,
 MA0064.1_PBF_AAAGy,
 MA0082.1_squamosa_mCAwAwATrGwAAn,
 MA0096.1_bZIP910_mTGACGT]
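
Because both files must sit in the same directory and share the XXX prefix, it can help to check this before loading. The following is a minimal sketch using only the standard library; companion_annotation_path and check_motif_files are hypothetical helpers, and the file name my_motifs.pfm and its contents are made up for the demonstration.

```python
import tempfile
from pathlib import Path


def companion_annotation_path(pfm_path):
    """Return the expected XXX.motif2factors.txt path next to XXX.pfm."""
    pfm_path = Path(pfm_path)
    if pfm_path.suffix != ".pfm":
        raise ValueError("expected a file ending in .pfm: " + str(pfm_path))
    return pfm_path.with_name(pfm_path.stem + ".motif2factors.txt")


def check_motif_files(pfm_path):
    """Raise FileNotFoundError unless both files of the pair exist."""
    pfm_path = Path(pfm_path)
    annotation_path = companion_annotation_path(pfm_path)
    missing = [str(p) for p in (pfm_path, annotation_path) if not p.is_file()]
    if missing:
        raise FileNotFoundError("missing motif file(s): " + ", ".join(missing))
    return pfm_path, annotation_path


# Demonstration in a temporary directory with made-up file contents.
with tempfile.TemporaryDirectory() as tmp_dir:
    pfm_file = Path(tmp_dir) / "my_motifs.pfm"
    pfm_file.write_text(">motif_1\n0.25\t0.25\t0.25\t0.25\n")

    # Only the pfm file exists so far, so the check should fail.
    try:
        check_motif_files(pfm_file)
        lone_pfm_accepted = True
    except FileNotFoundError:
        lone_pfm_accepted = False

    # Add the annotation file next to it; now the check should pass.
    companion_annotation_path(pfm_file).write_text(
        "Motif\tFactor\tEvidence\tCurated\nmotif_1\tGeneA\tSELEX\tN\n"
    )
    checked_pfm, checked_annotation = check_motif_files(pfm_file)
    checked_names = (checked_pfm.name, checked_annotation.name)

print(lone_pfm_accepted)
print(checked_names)
```

Once the check passes, the pfm path can be handed to read_motifs as in the cell above.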

Another notebook shows how to make the XXX.pfm and XXX.motif2factors.txt files. Please look at it if you want to make your own motif data.