celloracle.motif_analysis module¶

The motif_analysis module implements transcription factor (TF) motif scanning.

Genomic activity information (ATAC-seq or ChIP-seq peaks) is extracted first. The DNA sequence of each peak is then subjected to a TF motif scan. Finally, we obtain a list of TFs that potentially bind to a specific gene.
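The workflow above can be sketched end to end using only the class and method signatures documented on this page. The helper name `run_motif_scan` is hypothetical, `peak_df` is assumed to have been prepared beforehand, and the score threshold is left to the caller because a suitable value depends on the data. Treat this as an illustrative outline, not a tested pipeline.

```python
def run_motif_scan(peak_df, ref_genome, threshold, output_path):
    """Hypothetical helper: scan peaks for TF motifs and return a TF dictionary.

    peak_df     : dataframe of peaks and target genes (see TFinfo.peak_df)
    ref_genome  : reference genome name, e.g. "mm10"
    threshold   : score cutoff chosen by the caller (data-dependent assumption)
    output_path : file path that ends with ".celloracle.tfinfo"
    """
    # Imported lazily so the sketch can be defined without celloracle installed.
    from celloracle import motif_analysis as ma

    # Build the TFinfo object from the peak dataframe.
    tfi = ma.TFinfo(peak_data_frame=peak_df, ref_genome=ref_genome)

    # Scan peak sequences; motifs=None falls back to the default motifs.
    tfi.scan(fpr=0.02, motifs=None, verbose=True)

    # Drop low-scoring motifs, then build the dataframe and dictionaries.
    tfi.filter_motifs_by_score(threshold=threshold, method="cumulative_score")
    tfi.make_TFinfo_dataframe_and_dictionary(verbose=True)

    # Save the object and return the target gene -> TFs dictionary.
    tfi.to_hdf5(file_path=output_path)
    return tfi.to_dictionary(dictionary_type="targetgene2TFs")
```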

class celloracle.motif_analysis.TFinfo(peak_data_frame, ref_genome, genomes_dir=None)¶

Bases: object

This is a custom class for motif analysis in celloracle. A TFinfo object performs a motif scan using the TF motif database in gimmemotifs and several functions of genomepy. Analysis results can be exported as a Python dictionary or a dataframe. These outputs (a Python dictionary or dataframe of TF binding information) are needed during GRN inference.

peak_df¶

Dataframe of DNA peaks and their target genes.

Type: pandas.DataFrame

all_target_gene¶

target genes.

Type: array of str

ref_genome¶

reference genome name that was used in DNA peak generation.

Type: str

scanned_df¶

Results of motif scan. Key is a peak name. Value is a dataframe of motif scan.

Type: dictionary

dic_targetgene2TFs¶

Final product of motif scan. Key is a target gene. Value is a list of regulatory candidate genes.

Type: dictionary

dic_peak2Targetgene¶

Key is a peak name. Value is a list of target genes.

Type: dictionary

dic_TF2targetgenes¶

Final product of motif scan. Key is a TF. Value is a list of potential target genes of the TF.

Type: dictionary

copy()¶: Return a deep copy of itself.

filter_motifs_by_score(threshold, method='cumulative_score')¶

Remove motifs with low binding scores.

Parameters: method (str) – thresholding method. Select either of [“indivisual_score”, “cumulative_score”]
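The two method names suggest thresholding either each motif hit on its own or the summed score of all hits of one motif within one peak. The sketch below illustrates that distinction on plain Python tuples. The record layout, the function names, and the exact aggregation are assumptions made for illustration and are not taken from the celloracle implementation.

```python
from collections import defaultdict

# Hypothetical motif hits: (peak_id, motif_id, score)
hits = [
    ("chr1_100_600", "motifA", 6.0),
    ("chr1_100_600", "motifA", 5.0),
    ("chr1_100_600", "motifB", 9.0),
    ("chr2_200_700", "motifA", 4.0),
]

def filter_individual(hits, threshold):
    """Keep each hit whose own score reaches the threshold."""
    return [h for h in hits if h[2] >= threshold]

def filter_cumulative(hits, threshold):
    """Keep hits whose (peak, motif) pair has a summed score reaching the threshold."""
    totals = defaultdict(float)
    for peak_id, motif_id, score in hits:
        totals[(peak_id, motif_id)] += score
    return [h for h in hits if totals[(h[0], h[1])] >= threshold]

individual = filter_individual(hits, threshold=8.0)
cumulative = filter_cumulative(hits, threshold=8.0)
```

With these toy values, the two weak motifA hits in the first peak fail the per-hit cutoff but pass the cumulative one, because their scores add up across the peak.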

filter_peaks(peaks_to_be_remained)¶

Filter peaks.

Parameters: peaks_to_be_remained (array of str) – list of peaks. Peaks that are NOT in the list will be removed.

make_TFinfo_dataframe_and_dictionary(verbose=True)¶

This is the final step of motif_analysis. Convert scanned results into a data frame and dictionaries.

Parameters: verbose (bool) – Whether to show a progress bar.

reset_dictionary_and_df()¶: Reset TF dictionary and TF dataframe. The following attributes will be erased: TF_onehot, dic_targetgene2TFs, dic_peak2Targetgene, dic_TF2targetgenes.

reset_filtering()¶: Reset filtering information. You can use this function to start over the filtering step with new conditions. The following attributes will be erased: TF_onehot, dic_targetgene2TFs, dic_peak2Targetgene, dic_TF2targetgenes.

save_as_parquet(folder_path=None)¶

Save itself. Some attributes are saved as parquet files.

Parameters: folder_path (str) – folder path

scan(background_length=200, fpr=0.02, n_cpus=-1, verbose=True, motifs=None, TF_evidence_level='direct_and_indirect', TF_formatting='auto', batch_size=None, divide=100000)¶

Scan DNA sequences searching for TF binding motifs.

Parameters

background_length (int) – background length. This is used for the calculation of the binding score.
fpr (float) – False positive rate for motif identification.
n_cpus (int) – number of CPUs for parallel calculation.
verbose (bool) – Whether to show a progress bar.
motifs (list) – A list of gimmemotifs motifs. Reverts to default_motifs() if None.
TF_evidence_level (str) – Select one of [“direct”, “direct_and_indirect”]. If “direct” is selected, TFs that have binding evidence are used. If “direct_and_indirect” is selected, TFs with binding evidence and inferred TFs are used. For more information, see the explanation of the Motif class in the gimmemotifs documentation (https://gimmemotifs.readthedocs.io/en/master/index.html)

set_motifs(motifs=None, TF_formatting='auto', verbose=True)¶

to_dataframe(verbose=True)¶

Return results as a dataframe. Rows are peak_ids and columns are TFs.

Parameters: verbose (bool) – Whether to show a progress bar.
Returns: TFinfo matrix.
Return type: pandas.DataFrame
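To picture a peak-by-TF matrix, the sketch below builds one from a small mapping using plain Python lists. The input mapping, the gene names, and the 0/1 encoding are assumptions for illustration (the attribute name TF_onehot hints at a one-hot layout, but the page does not state the cell values).

```python
# Hypothetical scan summary: peak_id -> set of TFs with a motif hit in that peak
peak2tfs = {
    "chr1_100_600": {"Gata1", "Tal1"},
    "chr2_200_700": {"Gata1"},
}

def to_binary_matrix(peak2tfs):
    """Return (row labels, column labels, rows) with 1 where a TF motif hits a peak."""
    tfs = sorted({tf for tf_set in peak2tfs.values() for tf in tf_set})
    peaks = sorted(peak2tfs)
    rows = [[1 if tf in peak2tfs[peak] else 0 for tf in tfs] for peak in peaks]
    return peaks, tfs, rows

peaks, tfs, rows = to_binary_matrix(peak2tfs)
```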

to_dictionary(dictionary_type='targetgene2TFs', verbose=True)¶

Return TF information as a python dictionary.

Parameters: dictionary_type (str) – Type of dictionary. Select from [“targetgene2TFs”, “TF2targetgenes”]. If you choose “targetgene2TFs”, it returns a dictionary in which a key is a target gene and a value is a list of regulatory candidate genes (TFs) of the target. If you choose “TF2targetgenes”, it returns a dictionary in which a key is a TF and a value is a list of potential target genes of the TF.
Returns: dictionary.
Return type: dictionary
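The two dictionary types hold the same relationships viewed from opposite directions. A minimal sketch of that relationship, using made-up gene names and a hypothetical helper `invert`, might look like this:

```python
from collections import defaultdict

# Hypothetical "targetgene2TFs"-style dictionary: target gene -> candidate TFs
targetgene2TFs = {
    "Gata2": ["Gata1", "Tal1"],
    "Klf1": ["Gata1"],
}

def invert(targetgene2TFs):
    """Build a "TF2targetgenes"-style dictionary: TF -> potential target genes."""
    tf2targets = defaultdict(list)
    for target, tfs in targetgene2TFs.items():
        for tf in tfs:
            tf2targets[tf].append(target)
    return dict(tf2targets)

TF2targetgenes = invert(targetgene2TFs)
```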

to_hdf5(file_path)¶

Save object as hdf5.

Parameters: file_path (str) – Path of the output file. The filename must end with ‘.celloracle.tfinfo’

celloracle.motif_analysis.check_peak_format(peaks_df, ref_genome, genomes_dir=None)¶

Check peak format.

Check chromosome name.
Check peak size (length) and remove short DNA sequences (< 5 bp).

Parameters

peaks_df (pandas.DataFrame) –
ref_genome (str) – Reference genome name. e.g. “mm9”, “mm10”, “hg19” etc
genomes_dir (str) – Installation directory of Genomepy reference genome data.

Returns

Peaks data after filtering.

Return type

pandas.DataFrame
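The two checks can be illustrated on peak_id strings of the form "chromosome_start_end". In the sketch below, the helper names, the set of known chromosomes, and the choice to drop (rather than report) failing peaks are assumptions; it works on plain strings instead of a dataframe.

```python
def parse_peak_id(peak_id):
    """Split "chr_start_end" into (chromosome, start, end).

    rsplit is used so chromosome names that contain underscores stay intact.
    """
    chrom, start, end = peak_id.rsplit("_", 2)
    return chrom, int(start), int(end)

def check_peaks(peak_ids, known_chromosomes, min_length=5):
    """Keep peaks on a known chromosome that are at least min_length bp long."""
    kept = []
    for peak_id in peak_ids:
        chrom, start, end = parse_peak_id(peak_id)
        if chrom not in known_chromosomes:
            continue  # chromosome name not found in the reference
        if end - start < min_length:
            continue  # sequence too short to scan
        kept.append(peak_id)
    return kept

example = check_peaks(
    ["chr1_100_600", "chr1_100_103", "chrZ_1_500"],
    known_chromosomes={"chr1"},
)
```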

celloracle.motif_analysis.get_tss_info(peak_str_list, ref_genome, verbose=True, custom_tss_file_path=None)¶

Get annotation of transcription start sites (TSS).

Parameters

peak_str_list (list of str) – list of peak_id. e.g., [“chr5_0930303_9499409”, “chr11_123445555_123445577”]
ref_genome (str) – reference genome name.
verbose (bool) – verbosity.
custom_tss_file_path (str) – File path to a custom TSS reference bed file. If you only use a reference genome that is supported by CellOracle, you don’t need to set this parameter.

celloracle.motif_analysis.integrate_tss_peak_with_cicero(tss_peak, cicero_connections)¶

Process the output of cicero and return DNA peak information for motif analysis in celloracle. Please see the celloracle tutorial for more information.

Parameters

tss_peak (pandas.DataFrame) – Dataframe of TSS information. Use the function “get_tss_info” to get this dataframe.
cicero_connections (pandas.DataFrame) – Dataframe that stores the results of the cicero analysis.

Returns

DNA peaks (promoters/enhancers) annotated with their target genes.

Return type

pandas.DataFrame

celloracle.motif_analysis.is_genome_installed(ref_genome, genomes_dir=None)¶

The celloracle motif_analysis module uses gimmemotifs and genomepy internally. Reference genome files must be installed locally to use gimmemotifs and genomepy. This function checks the installation status of the reference genome.

Parameters

ref_genome (str) – Name of the reference genome, e.g., “mm10”, “hg19”
genomes_dir (str) – Installation directory of Genomepy reference genome data.

celloracle.motif_analysis.load_TFinfo(file_path)¶

Load a TFinfo object that was saved as an hdf5 file.

Parameters: file_path (str) – file path.
Returns: Loaded TFinfo object.
Return type: TFinfo

celloracle.motif_analysis.load_TFinfo_from_parquets(folder_path)¶

Load a TFinfo object that was saved with the function “save_as_parquet”.

Parameters: folder_path (str) – folder path
Returns: Loaded TFinfo object.
Return type: TFinfo

celloracle.motif_analysis.load_motifs(motifs_name, force_download=False)¶

Load motifs from the celloracle motif database.

Parameters: motifs_name (str) – Name of motifs.
Returns: List of gimmemotifs.motif objects.
Return type: list

celloracle.motif_analysis.make_TFinfo_from_scanned_file(path_to_raw_bed, path_to_scanned_result_bed, ref_genome)¶: This function is currently unavailable.

celloracle.motif_analysis.peak2fasta(peak_ids, ref_genome, genomes_dir)¶

Convert peak_ids into a fasta object.

Parameters

peak_ids (str or list of str) – A single peak_id, e.g. “chr5_0930303_9499409”, or a list of peak_ids, e.g. [“chr5_0930303_9499409”, “chr11_123445555_123445577”]
ref_genome (str) – Reference genome name. e.g. “mm9”, “mm10”, “hg19” etc
genomes_dir (str) – Installation directory of Genomepy reference genome data.

Returns

DNA sequence in fasta format

Return type

gimmemotifs fasta object

celloracle.motif_analysis.read_bed(bed_path)¶

Load a bed file and return it as a dataframe.

Parameters: bed_path (str) – File path.
Returns: Bed file contents as a dataframe.
Return type: pandas.DataFrame
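For readers unfamiliar with the bed format, the sketch below parses the first three tab-separated columns (chromosome, start, end) and derives a peak_id in the "chromosome_start_end" style used elsewhere on this page. It uses plain Python records rather than a dataframe. The helper name, the record keys, and the header-skipping rule are assumptions, and this is not how read_bed itself is implemented.

```python
import csv
import io

def read_bed_records(handle):
    """Parse bed-formatted text into a list of dictionaries, one per peak."""
    records = []
    for row in csv.reader(handle, delimiter="\t"):
        # Skip blank lines and common header/comment lines.
        if not row or row[0].startswith(("#", "track", "browser")):
            continue
        chrom, start, end = row[0], int(row[1]), int(row[2])
        records.append({
            "chrom": chrom,
            "start": start,
            "end": end,
            "peak_id": f"{chrom}_{start}_{end}",
        })
    return records

# In-memory example standing in for a file opened with open(bed_path).
bed_text = "chr1\t100\t600\nchr2\t200\t700\n"
records = read_bed_records(io.StringIO(bed_text))
```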

celloracle.motif_analysis.remove_zero_seq(fasta_object)¶: Remove DNA sequences with zero length.

celloracle.motif_analysis.scan_dna_for_motifs(scanner_object, motifs_object, sequence_object, divide=100000, verbose=True, batch_size=None)¶

This is a wrapper function that scans DNA sequences for TF binding motifs.

Parameters

scanner_object (gimmemotifs.scanner) – Object that performs the motif scan.
motifs_object (gimmemotifs.motifs) – Object that stores motif data.
sequence_object (gimmemotifs.fasta) – Object that stores sequence data.

Returns

Scan results stored in a dataframe.

Return type

pandas.DataFrame