celloracle.motif_analysis module
The motif_analysis module implements transcription factor (TF) motif scanning.
Genomic activity information (ATAC-seq or ChIP-seq peaks) is extracted first. The DNA sequence of each peak is then scanned for TF motifs. The result is a list of TFs that potentially bind to a given gene.
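The steps above can be chained using only the class and functions documented on this page. The following is a minimal sketch under stated assumptions, not a definitive recipe: the helper name `run_motif_scan` and its `score_threshold` argument are hypothetical, the input dataframe is assumed to already hold peaks with target-gene annotation, and the import is deferred so the function can be defined without celloracle or a reference genome installed.

```python
def run_motif_scan(peak_df, ref_genome, score_threshold, fpr=0.02):
    """Sketch of a motif-scan workflow built from the documented API.

    peak_df: dataframe of peaks and target genes (assumed input, e.g. the
        output of integrate_tss_peak_with_cicero()).
    score_threshold: hypothetical argument; choose a value for your own data.
    """
    # Deferred import: celloracle is only required when the function is called.
    from celloracle import motif_analysis as ma

    # Check chromosome names and drop very short peaks.
    peak_df = ma.check_peak_format(peak_df, ref_genome)

    # Build the TFinfo object and scan the peak sequences for TF motifs.
    tfi = ma.TFinfo(peak_data_frame=peak_df, ref_genome=ref_genome)
    tfi.scan(fpr=fpr)

    # Start from a clean filtering state, then remove low-scoring motifs.
    tfi.reset_filtering()
    tfi.filter_motifs_by_score(threshold=score_threshold)

    # Final step: convert the scan results into a dataframe and dictionaries.
    tfi.make_TFinfo_dataframe_and_dictionary(verbose=True)
    return tfi.to_dataframe(), tfi.to_dictionary(dictionary_type="targetgene2TFs")
```

Calling the function requires celloracle, gimmemotifs, genomepy, and an installed reference genome; the scan itself can take a long time on large peak sets.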
- 
class celloracle.motif_analysis.TFinfo(peak_data_frame, ref_genome, genomes_dir=None)
- Bases: object - This is a custom class for motif analysis in celloracle. A TFinfo object performs a motif scan using the TF motif database in gimmemotifs and several functions of genomepy. Analysis results can be exported as a Python dictionary or a dataframe. These outputs (a Python dictionary or a dataframe of TF binding information) are needed during GRN inference. - 
peak_df
- Dataframe of DNA peaks and their target genes. - Type
- pandas.DataFrame 
 
 - 
all_target_gene
- target genes. - Type
- array of str 
 
 - 
ref_genome
- reference genome name that was used in DNA peak generation. - Type
- str 
 
 - 
scanned_df
- Results of motif scan. Key is a peak name. Value is a dataframe of motif scan. - Type
- dictionary 
 
 - 
dic_targetgene2TFs
- Final product of motif scan. Key is a target gene. Value is a list of regulatory candidate genes. - Type
- dictionary 
 
 - 
dic_peak2Targetgene
- Dictionary. Key is a peak name. Value is a list of target genes. - Type
- dictionary 
 
 - 
dic_TF2targetgenes
- Final product of motif scan. Key is a TF. Value is a list of potential target genes of the TF. - Type
- dictionary 
 
 - 
copy()
- Return a deep copy of the object. 
 - 
filter_motifs_by_score(threshold, method='cumulative_score')
- Remove motifs with low binding scores. - Parameters
- method (str) – Thresholding method. Select either “indivisual_score” or “cumulative_score”. 
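The difference between the two thresholding methods can be illustrated with a small sketch. The interpretation used here is an assumption that the source does not confirm: “indivisual_score” is taken to keep each motif hit whose own score reaches the threshold, while “cumulative_score” is taken to sum the scores of all hits of the same motif within the same peak and keep the hits whose sum reaches the threshold. The function `filter_hits` and the tuple layout are hypothetical and are not part of celloracle.

```python
from collections import defaultdict


def filter_hits(hits, threshold, method="cumulative_score"):
    """Filter motif hits given as (peak_id, motif_id, score) tuples.

    The tuple layout and both filtering rules are illustrative assumptions.
    """
    if method == "indivisual_score":  # spelling follows the documented option
        # Keep each hit whose own score reaches the threshold.
        return [hit for hit in hits if hit[2] >= threshold]
    if method == "cumulative_score":
        # Sum the scores of all hits of the same motif in the same peak.
        totals = defaultdict(float)
        for peak_id, motif_id, score in hits:
            totals[(peak_id, motif_id)] += score
        # Keep the hits whose (peak, motif) total reaches the threshold.
        return [hit for hit in hits if totals[(hit[0], hit[1])] >= threshold]
    raise ValueError(f"Unknown method: {method!r}")


# Made-up hits for illustration only.
hits = [
    ("chr1_100_600", "motif_A", 6.0),
    ("chr1_100_600", "motif_A", 5.0),
    ("chr1_100_600", "motif_B", 7.0),
]
individual = filter_hits(hits, threshold=10, method="indivisual_score")
cumulative = filter_hits(hits, threshold=10, method="cumulative_score")
```

With these made-up values no single hit passes the individual threshold, while the two motif_A hits pass together under the cumulative rule.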
 
 - 
filter_peaks(peaks_to_be_remained)
- Filter peaks. - Parameters
- peaks_to_be_remained (array of str) – list of peaks. Peaks that are NOT in the list will be removed. 
 
 - 
make_TFinfo_dataframe_and_dictionary(verbose=True)
- This is the final step of motif_analysis. Convert scanned results into a data frame and dictionaries. - Parameters
- verbose (bool) – Whether to show a progress bar. 
 
 - 
reset_dictionary_and_df()
- Reset TF dictionary and TF dataframe. The following attributes will be erased: TF_onehot, dic_targetgene2TFs, dic_peak2Targetgene, dic_TF2targetgenes. 
 - 
reset_filtering()
- Reset filtering information. You can use this function to start over the filtering step with new conditions. The following attributes will be erased: TF_onehot, dic_targetgene2TFs, dic_peak2Targetgene, dic_TF2targetgenes. 
 - 
save_as_parquet(folder_path=None)
- Save the object. Some attributes are saved as parquet files. - Parameters
- folder_path (str) – folder path 
 
 - 
scan(background_length=200, fpr=0.02, n_cpus=-1, verbose=True, motifs=None, TF_evidence_level='direct_and_indirect', TF_formatting='auto', batch_size=None, divide=100000)
- Scan DNA sequences searching for TF binding motifs. - Parameters
- background_length (int) – background length. This is used for the calculation of the binding score. 
- fpr (float) – False positive rate for motif identification. 
- n_cpus (int) – number of CPUs for parallel calculation. 
- verbose (bool) – Whether to show a progress bar. 
- motifs (list) – A list of gimmemotifs motifs. Reverts to default_motifs() if None. 
- TF_evidence_level (str) – Select one of [“direct”, “direct_and_indirect”]. If “direct” is selected, TFs that have binding evidence are used. If “direct_and_indirect” is selected, TFs with binding evidence and inferred TFs are used. For more information, please read the explanation of the Motif class in the gimmemotifs documentation (https://gimmemotifs.readthedocs.io/en/master/index.html) 
 
 
 - 
set_motifs(motifs=None, TF_formatting='auto', verbose=True)
 - 
to_dataframe(verbose=True)
- Return results as a dataframe. Rows are peak_id, and columns are TFs. - Parameters
- verbose (bool) – Whether to show a progress bar. 
- Returns
- TFinfo matrix. 
- Return type
- pandas.DataFrame 
 
 - 
to_dictionary(dictionary_type='targetgene2TFs', verbose=True)
- Return TF information as a Python dictionary. - Parameters
- dictionary_type (str) – Type of dictionary. Select from [“targetgene2TFs”, “TF2targetgenes”]. If you choose “targetgene2TFs”, it returns a dictionary in which a key is a target gene and a value is a list of regulatory candidate genes (TFs) of the target. If you choose “TF2targetgenes”, it returns a dictionary in which a key is a TF and a value is a list of potential target genes of the TF. 
- Returns
- dictionary. 
- Return type
- dictionary 
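The two dictionary types describe the same TF-to-gene links from opposite directions. The sketch below shows that relationship with plain Python dictionaries; it does not call celloracle, the helper `invert_gene_map` is hypothetical, and the gene and TF names are placeholders rather than real results.

```python
from collections import defaultdict


def invert_gene_map(targetgene2tfs):
    """Turn {target gene: [TFs]} into {TF: [target genes]}."""
    tf2targets = defaultdict(list)
    for target, tfs in targetgene2tfs.items():
        for tf in tfs:
            tf2targets[tf].append(target)
    return dict(tf2targets)


# Placeholder names for illustration only.
targetgene2TFs = {"GeneA": ["TF1", "TF2"], "GeneB": ["TF2", "TF3"]}
TF2targetgenes = invert_gene_map(targetgene2TFs)
# TF2targetgenes == {"TF1": ["GeneA"], "TF2": ["GeneA", "GeneB"], "TF3": ["GeneB"]}
```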
 
 - 
to_hdf5(file_path)
- Save object as hdf5. - Parameters
- file_path (str) – file path to save file. Filename needs to end with ‘.celloracle.tfinfo’ 
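Because the file name must end with ‘.celloracle.tfinfo’, it can help to build the path in one place. The helper below is a hypothetical convenience sketch, not part of celloracle; it only manipulates strings.

```python
import os

REQUIRED_SUFFIX = ".celloracle.tfinfo"


def tfinfo_path(folder, name):
    """Return a path in `folder` whose file name ends with the required suffix."""
    if not name.endswith(REQUIRED_SUFFIX):
        name += REQUIRED_SUFFIX
    return os.path.join(folder, name)


path = tfinfo_path("results", "test1")
# Intended use, assuming `tfi` is a TFinfo object:
#   tfi.to_hdf5(file_path=path)
#   tfi = celloracle.motif_analysis.load_TFinfo(file_path=path)
```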
 
 
- 
- 
celloracle.motif_analysis.check_peak_format(peaks_df, ref_genome, genomes_dir=None)
- Check peak format.
- Check chromosome name. 
- Check peak size (length) and remove short DNAs (<5bp) 
 
 - Parameters
- peaks_df (pandas.DataFrame) – 
- ref_genome (str) – Reference genome name. e.g. “mm9”, “mm10”, “hg19” etc 
- genomes_dir (str) – Installation directory of Genomepy reference genome data. 
 
- Returns
- Peaks data after filtering. 
- Return type
- pandas.DataFrame 
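The two checks listed above can be sketched in plain Python. This is an illustration under assumptions, not the library's implementation: peak IDs are assumed to follow the “chr_start_end” pattern used in the examples on this page, peak length is assumed to be end minus start, and the helper names and the set of known chromosomes are hypothetical.

```python
def parse_peak_id(peak_id):
    """Split 'chrom_start_end' into (chrom, start, end).

    Splitting from the right keeps chromosome names that contain underscores.
    """
    chrom, start, end = peak_id.rsplit("_", 2)
    return chrom, int(start), int(end)


def keep_valid_peaks(peak_ids, known_chromosomes, min_length=5):
    """Keep peaks on a known chromosome that are at least min_length bp long."""
    kept = []
    for peak_id in peak_ids:
        chrom, start, end = parse_peak_id(peak_id)
        if chrom in known_chromosomes and (end - start) >= min_length:
            kept.append(peak_id)
    return kept


# Made-up peaks: the second is too short, the third has an unknown chromosome.
peaks = ["chr1_100_600", "chr1_700_703", "chrUnknown_10_500"]
kept = keep_valid_peaks(peaks, known_chromosomes={"chr1", "chr2"})
```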
 
- 
celloracle.motif_analysis.get_tss_info(peak_str_list, ref_genome, verbose=True, custom_tss_file_path=None)
- Get transcription start site (TSS) annotation. - Parameters
- peak_str_list (list of str) – list of peak_id. e.g., [“chr5_0930303_9499409”, “chr11_123445555_123445577”] 
- ref_genome (str) – reference genome name. 
- verbose (bool) – verbosity. 
- custom_tss_file_path (str) – File path to a custom TSS reference bed file. If you use a reference genome that is supported by CellOracle, you do not need to set this parameter. 
 
 
- 
celloracle.motif_analysis.integrate_tss_peak_with_cicero(tss_peak, cicero_connections)
- Process the output of cicero and return DNA peak information for motif analysis in celloracle. Please see the celloracle tutorial for more information. - Parameters
- tss_peak (pandas.DataFrame) – Dataframe of TSS information. Please use the “get_tss_info” function to get this dataframe. 
- cicero_connections (dataframe) – dataframe that stores the results of cicero analysis. 
 
- Returns
- DNA peaks in promoters/enhancers, annotated with their target genes. 
- Return type
- pandas.DataFrame 
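The idea behind this integration can be sketched without celloracle. The logic below is an assumption for illustration, not the library's implementation: peaks that overlap a TSS are linked to their gene with a score of 1.0, and any peak that cicero connects to a TSS peak inherits that gene with the co-accessibility score. The function `link_peaks_to_genes` and both input layouts are hypothetical.

```python
def link_peaks_to_genes(tss_peak_to_gene, connections):
    """Return sorted (peak_id, gene, coaccess) rows.

    tss_peak_to_gene: {peak_id: gene} for peaks that overlap a TSS.
    connections: iterable of (peak1, peak2, coaccess) tuples.
    Both layouts are illustrative assumptions.
    """
    best = {}
    # TSS peaks are linked to their own gene with the maximum score.
    for peak_id, gene in tss_peak_to_gene.items():
        best[(peak_id, gene)] = 1.0
    # A connection may list the TSS peak on either side, so check both.
    for peak1, peak2, coaccess in connections:
        for tss_peak, other in ((peak1, peak2), (peak2, peak1)):
            gene = tss_peak_to_gene.get(tss_peak)
            if gene is None:
                continue
            key = (other, gene)
            best[key] = max(best.get(key, 0.0), coaccess)
    return sorted((peak, gene, score) for (peak, gene), score in best.items())


# Made-up inputs for illustration only.
tss = {"chr1_100_600": "GeneA"}
connections = [
    ("chr1_100_600", "chr1_5000_5500", 0.8),
    ("chr1_9000_9500", "chr1_20000_20500", 0.9),
]
rows = link_peaks_to_genes(tss, connections)
```

In this made-up example the second connection is dropped because neither of its peaks overlaps a TSS.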
 
- 
celloracle.motif_analysis.is_genome_installed(ref_genome, genomes_dir=None)
- The celloracle motif_analysis module uses gimmemotifs and genomepy internally. Reference genome files must be installed locally to use gimmemotifs and genomepy. This function checks the installation status of the reference genome. - Parameters
- ref_genome (str) – Name of the reference genome, e.g., “mm10”, “hg19” 
- genomes_dir (str) – Installation directory of Genomepy reference genome data. 
 
 
- 
celloracle.motif_analysis.load_TFinfo(file_path)
- Load a TFinfo object that was saved as an hdf5 file. - Parameters
- file_path (str) – file path. 
- Returns
- Loaded TFinfo object. 
- Return type
- TFinfo 
 
- 
celloracle.motif_analysis.load_TFinfo_from_parquets(folder_path)
- Load a TFinfo object that was saved with the “save_as_parquet” function. - Parameters
- folder_path (str) – folder path 
- Returns
- Loaded TFinfo object. 
- Return type
- TFinfo 
 
- 
celloracle.motif_analysis.load_motifs(motifs_name, force_download=False)
- Load motifs from the celloracle motif database. - Parameters
- motifs_name (str) – Name of motifs. 
- Returns
- List of gimmemotifs.motif objects. 
- Return type
- list 
 
- 
celloracle.motif_analysis.make_TFinfo_from_scanned_file(path_to_raw_bed, path_to_scanned_result_bed, ref_genome)
- This function is currently unavailable. 
- 
celloracle.motif_analysis.peak2fasta(peak_ids, ref_genome, genomes_dir)
- Convert peak_ids into a fasta object. - Parameters
- peak_ids (str or list of str) – A peak_id, e.g. “chr5_0930303_9499409”, or a list of peak_ids, e.g. [“chr5_0930303_9499409”, “chr11_123445555_123445577”] 
- ref_genome (str) – Reference genome name. e.g. “mm9”, “mm10”, “hg19” etc 
- genomes_dir (str) – Installation directory of Genomepy reference genome data. 
 
- Returns
- DNA sequence in fasta format 
- Return type
- gimmemotifs fasta object 
 
- 
celloracle.motif_analysis.read_bed(bed_path)
- Load bed file and return as dataframe. - Parameters
- bed_path (str) – File path. 
- Returns
- Contents of the bed file as a dataframe. 
- Return type
- pandas.DataFrame 
 
- 
celloracle.motif_analysis.remove_zero_seq(fasta_object)
- Remove DNA sequences with zero length. 
- 
celloracle.motif_analysis.scan_dna_for_motifs(scanner_object, motifs_object, sequence_object, divide=100000, verbose=True, batch_size=None)
- This is a wrapper function that scans DNA sequences for motifs. - Parameters
- scanner_object (gimmemotifs.scanner) – Object that performs the motif scan. 
- motifs_object (gimmemotifs.motifs) – Object that stores motif data. 
- sequence_object (gimmemotifs.fasta) – Object that stores sequence data. 
 
- Returns
- Scan results stored in a dataframe. 
- Return type
- pandas.DataFrame