celloracle.motif_analysis module¶
The motif_analysis
module implements transcription factor (TF) motif scanning.
Genomic activity information (ATAC-seq or ChIP-seq peaks) is extracted first. The DNA sequence of each peak is then scanned for TF motifs. The result is a list of TFs that potentially bind to a specific gene.
-
class
celloracle.motif_analysis.
TFinfo
(peak_data_frame, ref_genome, genomes_dir=None)¶ Bases:
object
This is a custom class for motif analysis in celloracle. A TFinfo object performs a motif scan using the TF motif database in gimmemotifs and several functions of genomepy. Analysis results can be exported as a Python dictionary or a dataframe. These outputs (a Python dictionary or dataframe of TF binding information) are needed during GRN inference.
-
peak_df
¶ dataframe about DNA peak and target gene data.
- Type
pandas.dataframe
-
all_target_gene
¶ target genes.
- Type
array of str
-
ref_genome
¶ reference genome name that was used in DNA peak generation.
- Type
str
-
scanned_df
¶ Results of motif scan. Key is a peak name. Value is a dataframe of motif scan.
- Type
dictionary
-
dic_targetgene2TFs
¶ Final product of motif scan. Key is a target gene. Value is a list of regulatory candidate genes.
- Type
dictionary
-
dic_peak2Targetgene
¶ Dictionary. Key is a peak name. Value is a list of target genes.
- Type
dictionary
-
dic_TF2targetgenes
¶ Final product of motif scan. Key is a TF. Value is a list of potential target genes of the TF.
- Type
dictionary
-
copy
()¶ Return a deep copy of itself.
-
filter_motifs_by_score
(threshold, method='cumulative_score')¶ Remove motifs with low binding scores.
- Parameters
method (str) – Thresholding method. Select one of [“indivisual_score”, “cumulative_score”].
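The entry above does not define the two thresholding methods. The sketch below shows one plausible reading, which is an assumption and not confirmed by this page: the per-hit method keeps each motif hit whose own score reaches the threshold, while the cumulative method sums the scores of all hits of the same motif within a peak and keeps the (peak, motif) pairs whose total reaches the threshold. The `hits` tuples and both helper functions are hypothetical and are not part of celloracle.

```python
from collections import defaultdict

# Hypothetical motif hits: (peak name, motif name, binding score).
hits = [
    ("peak1", "motifA", 6.0),
    ("peak1", "motifA", 5.0),
    ("peak1", "motifB", 11.0),
    ("peak2", "motifA", 4.0),
]


def filter_individual(hits, threshold):
    """Keep single hits whose own score is at least the threshold (assumed meaning)."""
    return [hit for hit in hits if hit[2] >= threshold]


def filter_cumulative(hits, threshold):
    """Sum scores per (peak, motif) pair and keep pairs whose total is at least the threshold (assumed meaning)."""
    totals = defaultdict(float)
    for peak, motif, score in hits:
        totals[(peak, motif)] += score
    return {key: total for key, total in totals.items() if total >= threshold}


individual = filter_individual(hits, threshold=10)
cumulative = filter_cumulative(hits, threshold=10)
```

Under this reading, two weak hits of the same motif in one peak can pass the cumulative filter together even though neither passes on its own.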
-
filter_peaks
(peaks_to_be_remained)¶ Filter peaks.
- Parameters
peaks_to_be_remained (array of str) – list of peaks. Peaks that are NOT in the list will be removed.
-
make_TFinfo_dataframe_and_dictionary
(verbose=True)¶ This is the final step of motif_analysis. Convert scanned results into a data frame and dictionaries.
- Parameters
verbose (bool) – Whether to show a progress bar.
-
reset_dictionary_and_df
()¶ Reset TF dictionary and TF dataframe. The following attributes will be erased: TF_onehot, dic_targetgene2TFs, dic_peak2Targetgene, dic_TF2targetgenes.
-
reset_filtering
()¶ Reset filtering information. You can use this function to start over the filtering step with new conditions. The following attributes will be erased: TF_onehot, dic_targetgene2TFs, dic_peak2Targetgene, dic_TF2targetgenes.
-
save_as_parquet
(folder_path=None)¶ Save itself. Some attributes are saved as parquet file.
- Parameters
folder_path (str) – folder path
-
scan
(background_length=200, fpr=0.02, n_cpus=-1, verbose=True, motifs=None, TF_evidence_level='direct_and_indirect', TF_formatting='auto', batch_size=None, divide=100000)¶ Scan DNA sequences searching for TF binding motifs.
- Parameters
background_length (int) – background length. This is used for the calculation of the binding score.
fpr (float) – False positive rate for motif identification.
n_cpus (int) – number of CPUs for parallel calculation.
verbose (bool) – Whether to show a progress bar.
motifs (list) – A list of gimmemotifs motifs. Reverts to default_motifs() if None.
TF_evidence_level (str) – Select one of [“direct”, “direct_and_indirect”]. If “direct” is selected, only TFs that have binding evidence are used. If “direct_and_indirect” is selected, TFs with binding evidence and inferred TFs are used. For more information, see the explanation of the Motif class in the gimmemotifs documentation (https://gimmemotifs.readthedocs.io/en/master/index.html)
-
set_motifs
(motifs=None, TF_formatting='auto', verbose=True)¶
-
to_dataframe
(verbose=True)¶ Return results as a dataframe. Rows are peak_id, and columns are TFs.
- Parameters
verbose (bool) – Whether to show a progress bar.
- Returns
TFinfo matrix.
- Return type
pandas.dataframe
-
to_dictionary
(dictionary_type='targetgene2TFs', verbose=True)¶ Return TF information as a python dictionary.
- Parameters
dictionary_type (str) – Type of dictionary. Select from [“targetgene2TFs”, “TF2targetgenes”]. If you chose “targetgene2TFs”, it returns a dictionary in which a key is a target gene, and a value is a list of regulatory candidate genes (TFs) of the target. If you chose “TF2targetgenes”, it returns a dictionary in which a key is a TF and a value is a list of potential target genes of the TF.
- Returns
dictionary.
- Return type
dictionary
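The two dictionary types describe the same TF-to-target relationships from opposite directions. The sketch below illustrates that relationship in plain Python by inverting a small targetgene2TFs-style dictionary. The gene names and the `invert_targetgene2tfs` helper are hypothetical; how celloracle builds these dictionaries internally is not stated here.

```python
from collections import defaultdict


def invert_targetgene2tfs(targetgene2tfs):
    """Turn {target gene: [TFs]} into {TF: [target genes]}."""
    tf2targets = defaultdict(list)
    for target, tfs in targetgene2tfs.items():
        for tf in tfs:
            tf2targets[tf].append(target)
    # Sort target lists so the result does not depend on insertion order.
    return {tf: sorted(targets) for tf, targets in tf2targets.items()}


# Hypothetical example data in the "targetgene2TFs" layout.
targetgene2tfs = {
    "Gata1": ["Tal1", "Klf1"],
    "Hbb": ["Gata1", "Klf1"],
}
tf2targetgenes = invert_targetgene2tfs(targetgene2tfs)
```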
-
to_hdf5
(file_path)¶ Save object as hdf5.
- Parameters
file_path (str) – file path to save file. Filename needs to end with ‘.celloracle.tfinfo’
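The methods above are typically called in sequence: scan, filter, build the dataframe and dictionaries, then export. The following is a minimal sketch of that sequence using only the signatures listed in this class. The `run_motif_scan` wrapper, the default file name, and the threshold value of 10 are illustrative assumptions, not recommendations.

```python
def run_motif_scan(peak_df, ref_genome, out_path="example.celloracle.tfinfo"):
    """Sketch: scan peaks for TF motifs, filter, and export the results."""
    # to_hdf5 requires this suffix, so fail early before any heavy work.
    if not out_path.endswith(".celloracle.tfinfo"):
        raise ValueError("out_path must end with '.celloracle.tfinfo'")

    # Imported lazily so this sketch can be defined without celloracle installed.
    from celloracle import motif_analysis as ma

    # peak_df is the dataframe of DNA peaks and target genes.
    tfi = ma.TFinfo(peak_data_frame=peak_df, ref_genome=ref_genome)

    # motifs=None falls back to the default motifs.
    tfi.scan(fpr=0.02, motifs=None, verbose=True)

    # Clear any earlier filtering, then remove low-scoring motifs.
    tfi.reset_filtering()
    tfi.filter_motifs_by_score(threshold=10)  # illustrative threshold only

    # Final step: convert scanned results into a dataframe and dictionaries.
    tfi.make_TFinfo_dataframe_and_dictionary(verbose=True)

    df = tfi.to_dataframe()
    target_to_tfs = tfi.to_dictionary(dictionary_type="targetgene2TFs")

    tfi.to_hdf5(file_path=out_path)
    return df, target_to_tfs
```

The file-name check runs first because the scan can be slow, and a wrong suffix would otherwise only surface at the very end.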
-
-
celloracle.motif_analysis.
check_peak_format
(peaks_df, ref_genome, genomes_dir=None)¶ Check peak format.
Check chromosome names.
Check peak size (length) and remove short DNA sequences (<5 bp).
- Parameters
peaks_df (pandas.DataFrame) –
ref_genome (str) – Reference genome name. e.g. “mm9”, “mm10”, “hg19” etc
genomes_dir (str) – Installation directory of Genomepy reference genome data.
- Returns
Peaks data after filtering.
- Return type
pandas.DataFrame
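The two checks described for this function can be illustrated in plain Python. The sketch below assumes peak IDs follow the "chrom_start_end" pattern used in the examples elsewhere on this page, and that the set of valid chromosome names is supplied by the caller. Both helpers are hypothetical stand-ins and do not reproduce the actual celloracle implementation.

```python
def parse_peak_id(peak_id):
    """Split 'chrom_start_end' into (chrom, start, end)."""
    # rsplit keeps chromosome names that themselves contain underscores intact.
    chrom, start, end = peak_id.rsplit("_", 2)
    return chrom, int(start), int(end)


def filter_peaks_sketch(peak_ids, valid_chroms, min_length=5):
    """Keep peaks on a known chromosome that are at least min_length bp long."""
    kept = []
    for peak_id in peak_ids:
        chrom, start, end = parse_peak_id(peak_id)
        if chrom not in valid_chroms:
            continue  # chromosome name not found in the reference genome
        if end - start < min_length:
            continue  # sequence shorter than 5 bp
        kept.append(peak_id)
    return kept


peaks = [
    "chr5_930303_930803",
    "chr11_123445555_123445577",
    "chrZ_10_500",   # unknown chromosome in this toy example
    "chr1_100_103",  # only 3 bp long
]
kept = filter_peaks_sketch(peaks, valid_chroms={"chr1", "chr5", "chr11"})
```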
-
celloracle.motif_analysis.
get_tss_info
(peak_str_list, ref_genome, verbose=True, custom_tss_file_path=None)¶ Get transcription start site (TSS) annotation.
- Parameters
peak_str_list (list of str) – list of peak_id. e.g., [“chr5_0930303_9499409”, “chr11_123445555_123445577”]
ref_genome (str) – reference genome name.
verbose (bool) – verbosity.
custom_tss_file_path (str) – File path to a custom TSS reference bed file. If you only use a reference genome that is supported by CellOracle, you don’t need to set this parameter.
-
celloracle.motif_analysis.
integrate_tss_peak_with_cicero
(tss_peak, cicero_connections)¶ Process Cicero output and return DNA peak information for motif analysis in celloracle. Please see the celloracle tutorial for more information.
- Parameters
tss_peak (pandas.dataframe) – dataframe of TSS information. Use the “get_tss_info” function to get this dataframe.
cicero_connections (dataframe) – dataframe that stores the results of cicero analysis.
- Returns
DNA peaks (promoters/enhancers) annotated with their target genes.
- Return type
pandas.dataframe
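A minimal sketch of how get_tss_info and integrate_tss_peak_with_cicero fit together, using only the parameter names listed in this module. The `annotate_peaks` wrapper is hypothetical, and the `cicero_connections` dataframe is assumed to have been loaded elsewhere from a Cicero run.

```python
def annotate_peaks(peak_ids, cicero_connections, ref_genome):
    """Sketch: annotate peaks with TSS info, then merge with Cicero connections."""
    # Imported lazily so this sketch can be defined without celloracle installed.
    from celloracle import motif_analysis as ma

    # Step 1: TSS annotation for the list of peak IDs.
    tss_annotated = ma.get_tss_info(peak_str_list=peak_ids, ref_genome=ref_genome)

    # Step 2: combine the TSS peaks with the Cicero co-accessibility results.
    integrated = ma.integrate_tss_peak_with_cicero(
        tss_peak=tss_annotated,
        cicero_connections=cicero_connections,
    )
    return integrated
```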
-
celloracle.motif_analysis.
is_genome_installed
(ref_genome, genomes_dir=None)¶ The celloracle motif_analysis module uses gimmemotifs and genomepy internally. Reference genome files must be installed locally to use gimmemotifs and genomepy. This function checks the installation status of the reference genome.
- Parameters
ref_genome (str) – Name of the reference genome. e.g. “mm10”, “hg19”
genomes_dir (str) – Installation directory of Genomepy reference genome data.
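A common pattern is to check the installation status and install the genome only when it is missing. The sketch below assumes that is_genome_installed returns a truthy value when the genome is present (the return value is not documented above) and uses genomepy's `install_genome` function. The `ensure_genome` wrapper and the default "UCSC" provider are illustrative choices.

```python
def ensure_genome(ref_genome, genomes_dir=None, provider="UCSC"):
    """Sketch: install the reference genome with genomepy if it is missing."""
    # Imported lazily so this sketch can be defined without celloracle installed.
    from celloracle import motif_analysis as ma

    # Assumed to return a truthy value when the genome is already installed.
    installed = ma.is_genome_installed(ref_genome=ref_genome, genomes_dir=genomes_dir)

    if not installed:
        import genomepy

        # Downloads the genome; this can take a while for large genomes.
        genomepy.install_genome(name=ref_genome, provider=provider, genomes_dir=genomes_dir)
    return installed
```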
-
celloracle.motif_analysis.
load_TFinfo
(file_path)¶ Load a TFinfo object that was saved as an hdf5 file.
- Parameters
file_path (str) – file path.
- Returns
Loaded TFinfo object.
- Return type
TFinfo
-
celloracle.motif_analysis.
load_TFinfo_from_parquets
(folder_path)¶ Load a TFinfo object that was saved with the “save_as_parquet” function.
- Parameters
folder_path (str) – folder path
- Returns
Loaded TFinfo object.
- Return type
TFinfo
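The two loaders pair with the two save methods of TFinfo: to_hdf5 with load_TFinfo, and save_as_parquet with load_TFinfo_from_parquets. A minimal round-trip sketch follows. The `save_and_reload` wrapper and both default paths are hypothetical, and `tfi` is assumed to be an existing TFinfo object.

```python
def save_and_reload(tfi, file_path="example.celloracle.tfinfo", folder_path="tfinfo_parquet"):
    """Sketch: save a TFinfo object in both formats and load each copy back."""
    # Imported lazily so this sketch can be defined without celloracle installed.
    from celloracle import motif_analysis as ma

    # hdf5 round trip; the file name must end with '.celloracle.tfinfo'.
    tfi.to_hdf5(file_path=file_path)
    from_hdf5 = ma.load_TFinfo(file_path=file_path)

    # parquet round trip; attributes are written into a folder.
    tfi.save_as_parquet(folder_path=folder_path)
    from_parquet = ma.load_TFinfo_from_parquets(folder_path=folder_path)

    return from_hdf5, from_parquet
```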
-
celloracle.motif_analysis.
load_motifs
(motifs_name, force_download=False)¶ Load motifs from celloracle motif database
- Parameters
motifs_name (str) – Name of motifs.
- Returns
List of gimmemotifs.motif object.
- Return type
list
-
celloracle.motif_analysis.
make_TFinfo_from_scanned_file
(path_to_raw_bed, path_to_scanned_result_bed, ref_genome)¶ This function is currently unavailable.
-
celloracle.motif_analysis.
peak2fasta
(peak_ids, ref_genome, genomes_dir)¶ Convert peak_id into fasta object.
- Parameters
peak_ids (str or list of str) – Peak_id. e.g. “chr5_0930303_9499409” or it can be a list of peak_id. e.g. [“chr5_0930303_9499409”, “chr11_123445555_123445577”]
ref_genome (str) – Reference genome name. e.g. “mm9”, “mm10”, “hg19” etc
genomes_dir (str) – Installation directory of Genomepy reference genome data.
- Returns
DNA sequence in fasta format
- Return type
gimmemotifs fasta object
-
celloracle.motif_analysis.
read_bed
(bed_path)¶ Load bed file and return as dataframe.
- Parameters
bed_path (str) – File path.
- Returns
bed file in dataframe.
- Return type
pandas.dataframe
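For readers unfamiliar with the bed format, the sketch below parses the first three tab-separated columns (chromosome, start, end) and derives a peak ID in the "chrom_start_end" style used on this page. It is a standard-library illustration only: `read_bed_sketch` is hypothetical, and the column names and layout of the dataframe that read_bed actually returns are not documented here.

```python
import io


def read_bed_sketch(handle):
    """Parse bed lines into a list of dicts with chrom, start, end, and peak_id."""
    rows = []
    for line in handle:
        line = line.strip()
        # Skip blank lines and common bed header or comment lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split("\t")[:3]
        rows.append(
            {
                "chrom": chrom,
                "start": int(start),
                "end": int(end),
                "peak_id": f"{chrom}_{start}_{end}",
            }
        )
    return rows


# In-memory stand-in for a bed file on disk.
bed_text = "track name=example\nchr1\t100\t600\nchr2\t50\t450\n"
rows = read_bed_sketch(io.StringIO(bed_text))
```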
-
celloracle.motif_analysis.
remove_zero_seq
(fasta_object)¶ Remove DNA sequence with zero length
-
celloracle.motif_analysis.
scan_dna_for_motifs
(scanner_object, motifs_object, sequence_object, divide=100000, verbose=True, batch_size=None)¶ This is a wrapper function that scans DNA sequences searching for motifs.
- Parameters
scanner_object (gimmemotifs.scanner) – Object that performs the motif scan.
motifs_object (gimmemotifs.motifs) – Object that stores motif data.
sequence_object (gimmemotifs.fasta) – Object that stores sequence data.
- Returns
Scan results stored in a dataframe.
- Return type
pandas.dataframe