celloracle.motif_analysis module

The motif_analysis module implements transcription factor (TF) motif scanning.

Genomic activity information (ATAC-seq or ChIP-seq peaks) is extracted first. The DNA sequences of the peaks are then scanned for TF motifs. The result is a list of TFs that potentially bind to each target gene.
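The steps above map onto the TFinfo methods documented on this page. The following is a minimal sketch, not a definitive recipe. It assumes that `peaks` is a dataframe of peaks already annotated with target genes and that the reference genome is installed. The helper name `run_motif_scan` is hypothetical, and the default `threshold=10` is an illustrative value only.

```python
def run_motif_scan(peaks, ref_genome, threshold=10):
    """Scan annotated peaks for TF motifs and return the TFinfo dataframe.

    `run_motif_scan` is a hypothetical helper, and `threshold=10` is an
    illustrative value rather than a recommendation.
    """
    # Imported lazily so that defining this helper does not require celloracle.
    from celloracle import motif_analysis as ma

    # 1. Wrap the annotated peaks in a TFinfo object.
    tfi = ma.TFinfo(peak_data_frame=peaks, ref_genome=ref_genome)

    # 2. Scan the peak sequences for TF binding motifs (default motifs).
    tfi.scan(fpr=0.02, motifs=None, verbose=True)

    # 3. Drop low-scoring motifs, then build the dataframe and dictionaries.
    tfi.reset_filtering()
    tfi.filter_motifs_by_score(threshold=threshold)
    tfi.make_TFinfo_dataframe_and_dictionary(verbose=True)

    return tfi.to_dataframe()
```

Calling `run_motif_scan(peaks, "mm10")` would require celloracle, gimmemotifs, and an installed reference genome.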

class celloracle.motif_analysis.TFinfo(peak_data_frame, ref_genome, genomes_dir=None)

Bases: object

This is a custom class for motif analysis in celloracle. A TFinfo object performs a motif scan using the TF motif database in gimmemotifs and several functions of genomepy. Analysis results can be exported as a Python dictionary or a dataframe. These exports (a Python dictionary or a dataframe of TF binding information) are needed during GRN inference.

peak_df

Dataframe of DNA peaks and their target genes.

Type

pandas.DataFrame

all_target_gene

target genes.

Type

array of str

ref_genome

Name of the reference genome that was used to generate the DNA peaks.

Type

str

scanned_df

Results of motif scan. Key is a peak name. Value is a dataframe of motif scan.

Type

dictionary

dic_targetgene2TFs

Final product of motif scan. Key is a target gene. Value is a list of regulatory candidate genes.

Type

dictionary

dic_peak2Targetgene

Key is a peak name. Value is a list of target genes.

Type

dictionary

dic_TF2targetgenes

Final product of motif scan. Key is a TF. Value is a list of potential target genes of the TF.

Type

dictionary
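The two "final product" dictionaries, dic_targetgene2TFs and dic_TF2targetgenes, hold the same relationships viewed from opposite directions. The sketch below illustrates that relationship in plain Python by inverting one mapping into the other. It is not CellOracle's implementation; the function name `invert_targetgene2tfs` is hypothetical and the gene names are toy data.

```python
from collections import defaultdict


def invert_targetgene2tfs(targetgene2tfs):
    """Turn {target gene: [TFs]} into {TF: [target genes]}."""
    tf2targetgenes = defaultdict(list)
    for target, tfs in targetgene2tfs.items():
        for tf in tfs:
            tf2targetgenes[tf].append(target)
    return dict(tf2targetgenes)


# Toy data; the gene names are illustrative only.
targetgene2tfs = {
    "Gata1": ["Tal1", "Klf1"],
    "Spi1": ["Tal1"],
}
tf2targetgenes = invert_targetgene2tfs(targetgene2tfs)
print(tf2targetgenes)
```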

copy()

Deepcopy itself.

filter_motifs_by_score(threshold, method='cumulative_score')

Remove motifs with low binding scores.

Parameters

method (str) – thresholding method. Select either of [“indivisual_score”, “cumulative_score”]
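This page does not define the two thresholding methods. One plausible reading, assumed here and not confirmed by this page, is that "indivisual_score" compares each motif hit with the threshold, while "cumulative_score" compares the summed score of all hits of the same motif in the same peak. The sketch below illustrates that assumed difference on toy tuples; `filter_hits`, the peak names, and the motif names are hypothetical.

```python
from collections import defaultdict


def filter_hits(hits, threshold, method="cumulative_score"):
    """Filter (peak_id, motif_id, score) tuples under an ASSUMED interpretation."""
    if method == "indivisual_score":  # spelling follows the documented option
        # Keep every hit whose own score reaches the threshold.
        return [hit for hit in hits if hit[2] >= threshold]
    if method == "cumulative_score":
        # Sum the scores per (peak, motif) pair, then keep the hits of
        # every pair whose total reaches the threshold.
        totals = defaultdict(float)
        for peak_id, motif_id, score in hits:
            totals[(peak_id, motif_id)] += score
        return [hit for hit in hits if totals[(hit[0], hit[1])] >= threshold]
    raise ValueError(f"unknown method: {method}")


# Toy hits: two weak hits of motifA and one hit of motifB in the same peak.
hits = [
    ("chr1_100_600", "motifA", 6.0),
    ("chr1_100_600", "motifA", 5.0),
    ("chr1_100_600", "motifB", 7.0),
]
by_individual = filter_hits(hits, threshold=7, method="indivisual_score")
by_cumulative = filter_hits(hits, threshold=10, method="cumulative_score")
```

Under this reading, a motif with several weak hits in one peak can pass the cumulative filter even though none of its hits would pass individually.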

filter_peaks(peaks_to_be_remained)

Filter peaks.

Parameters

peaks_to_be_remained (array of str) – list of peaks. Peaks that are NOT in the list will be removed.

make_TFinfo_dataframe_and_dictionary(verbose=True)

This is the final step of motif_analysis. Convert scanned results into a data frame and dictionaries.

Parameters

verbose (bool) – Whether to show a progress bar.

reset_dictionary_and_df()

Reset TF dictionary and TF dataframe. The following attributes will be erased: TF_onehot, dic_targetgene2TFs, dic_peak2Targetgene, dic_TF2targetgenes.

reset_filtering()

Reset filtering information. You can use this function to start over the filtering step with new conditions. The following attributes will be erased: TF_onehot, dic_targetgene2TFs, dic_peak2Targetgene, dic_TF2targetgenes.

save_as_parquet(folder_path=None)

Save itself. Some attributes are saved as parquet files.

Parameters

folder_path (str) – folder path

scan(background_length=200, fpr=0.02, n_cpus=-1, verbose=True, motifs=None, TF_evidence_level='direct_and_indirect', TF_formatting='auto', batch_size=None, divide=100000)

Scan DNA sequences searching for TF binding motifs.

Parameters
  • background_length (int) – background length. This is used for the calculation of the binding score.

  • fpr (float) – False positive rate for motif identification.

  • n_cpus (int) – number of CPUs for parallel calculation.

  • verbose (bool) – Whether to show a progress bar.

  • motifs (list) – A list of gimmemotifs motifs. Reverts to default_motifs() if None.

  • TF_evidence_level (str) – Please select one from [“direct”, “direct_and_indirect”]. If “direct” is selected, only TFs that have binding evidence are used. If “direct_and_indirect” is selected, TFs with binding evidence and inferred TFs are used. For more information, please read the explanation of the Motif class in the gimmemotifs documentation (https://gimmemotifs.readthedocs.io/en/master/index.html)

set_motifs(motifs=None, TF_formatting='auto', verbose=True)

to_dataframe(verbose=True)

Return results as a dataframe. Rows are peak_id, and columns are TFs.

Parameters

verbose (bool) – Whether to show a progress bar.

Returns

TFinfo matrix.

Return type

pandas.DataFrame
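The layout of the TFinfo matrix (peaks as rows, TFs as columns) can be pictured with a small plain-Python sketch. It assumes the matrix is one-hot, which the TF_onehot attribute name suggests but this page does not state outright. The function `build_onehot_rows` and the toy peak and TF names are hypothetical.

```python
def build_onehot_rows(peak2tfs):
    """Return (sorted TF columns, {peak_id: [0/1 per TF column]})."""
    tfs = sorted({tf for tf_list in peak2tfs.values() for tf in tf_list})
    rows = {
        peak_id: [1 if tf in tf_list else 0 for tf in tfs]
        for peak_id, tf_list in peak2tfs.items()
    }
    return tfs, rows


# Toy data; peak and TF names are illustrative only.
peak2tfs = {
    "chr1_100_600": ["Klf1", "Tal1"],
    "chr2_50_400": ["Tal1"],
}
columns, rows = build_onehot_rows(peak2tfs)
```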

to_dictionary(dictionary_type='targetgene2TFs', verbose=True)

Return TF information as a python dictionary.

Parameters

dictionary_type (str) – Type of dictionary. Select from [“targetgene2TFs”, “TF2targetgenes”]. If you choose “targetgene2TFs”, it returns a dictionary in which a key is a target gene and a value is a list of regulatory candidate genes (TFs) of the target. If you choose “TF2targetgenes”, it returns a dictionary in which a key is a TF and a value is a list of potential target genes of the TF.

Returns

dictionary.

Return type

dictionary

to_hdf5(file_path)

Save object as hdf5.

Parameters

file_path (str) – file path to save file. Filename needs to end with ‘.celloracle.tfinfo’
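Because the filename has to end with ‘.celloracle.tfinfo’, it can be convenient to normalize a path before saving. The helper below is a hypothetical convenience, not part of CellOracle; it only manipulates strings and does not touch the file system.

```python
REQUIRED_SUFFIX = ".celloracle.tfinfo"


def ensure_tfinfo_suffix(file_path):
    """Return file_path unchanged if it has the required suffix, else append it."""
    if file_path.endswith(REQUIRED_SUFFIX):
        return file_path
    return file_path + REQUIRED_SUFFIX


save_path = ensure_tfinfo_suffix("test1")
```

The resulting `save_path` could then be passed as the `file_path` argument of to_hdf5.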

celloracle.motif_analysis.check_peak_format(peaks_df, ref_genome, genomes_dir=None)

Check peak format.
  1. Check chromosome name.

  2. Check peak size (length) and remove short DNA sequences (< 5 bp).

Parameters
  • peaks_df (pandas.DataFrame) –

  • ref_genome (str) – Reference genome name. e.g. “mm9”, “mm10”, “hg19” etc

  • genomes_dir (str) – Installation directory of Genomepy reference genome data.

Returns

Peaks data after filtering.

Return type

pandas.DataFrame
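The two checks can be illustrated on plain peak_id strings. This is a sketch of the idea, not CellOracle's implementation: it assumes peak_ids follow the "chromosome_start_end" pattern used in the examples on this page and that peak length is `end - start`. The function names and the `known_chromosomes` argument are hypothetical.

```python
def parse_peak_id(peak_id):
    """Split 'chrom_start_end' into (chrom, start, end)."""
    # rsplit keeps chromosome names that themselves contain underscores intact.
    chrom, start, end = peak_id.rsplit("_", 2)
    return chrom, int(start), int(end)


def filter_peaks_by_format(peak_ids, known_chromosomes, min_length=5):
    """Keep peaks on a known chromosome whose length is at least min_length bp."""
    kept = []
    for peak_id in peak_ids:
        chrom, start, end = parse_peak_id(peak_id)
        if chrom in known_chromosomes and (end - start) >= min_length:
            kept.append(peak_id)
    return kept


# Toy input: one valid peak, one too short, one on an unknown chromosome.
toy_peaks = ["chr1_100_600", "chr1_100_103", "chrUnknown_10_500"]
valid_peaks = filter_peaks_by_format(toy_peaks, known_chromosomes={"chr1", "chr2"})
```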

celloracle.motif_analysis.get_tss_info(peak_str_list, ref_genome, verbose=True, custom_tss_file_path=None)

Get annotation for transcription start sites (TSS).

Parameters
  • peak_str_list (list of str) – list of peak_id. e.g., [“chr5_0930303_9499409”, “chr11_123445555_123445577”]

  • ref_genome (str) – reference genome name.

  • verbose (bool) – verbosity.

  • custom_tss_file_path (str) – File path to a custom TSS reference bed file. If you only use reference genomes that are supported by CellOracle, you don’t need to set this parameter.

celloracle.motif_analysis.integrate_tss_peak_with_cicero(tss_peak, cicero_connections)

Process the output of cicero and return DNA peak information for motif analysis in celloracle. Please see the celloracle tutorial for more information.

Parameters
  • tss_peak (pandas.DataFrame) – Dataframe of TSS information. Please use the function “get_tss_info” to get this dataframe.

  • cicero_connections (dataframe) – dataframe that stores the results of cicero analysis.

Returns

DNA peaks (promoters/enhancers) annotated with their target genes.

Return type

pandas.DataFrame

celloracle.motif_analysis.is_genome_installed(ref_genome, genomes_dir=None)

The celloracle motif_analysis module uses gimmemotifs and genomepy internally. Reference genome files must be installed locally to use gimmemotifs and genomepy. This function checks the installation status of the reference genome.

Parameters
  • ref_genome (str) – Name of the reference genome, e.g., “mm10”, “hg19”

  • genomes_dir (str) – Installation directory of Genomepy reference genome data.
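A typical use of this check is to install a missing genome before scanning. The sketch below makes several assumptions not stated on this page: that the function returns a truthy value when the genome is present, that genomepy's `install_genome` is used for installation, and that "UCSC" is a suitable provider for the genome in question. The helper name `ensure_genome` is hypothetical.

```python
def ensure_genome(ref_genome, provider="UCSC", genomes_dir=None):
    """Install the reference genome with genomepy if CellOracle cannot find it."""
    # Imported lazily so that defining this helper needs neither package.
    from celloracle import motif_analysis as ma
    import genomepy

    # Assumption: the check returns a truthy value when the genome is present.
    if ma.is_genome_installed(ref_genome=ref_genome, genomes_dir=genomes_dir):
        return False  # nothing to do

    # Assumption: the provider hosts a genome under this name.
    genomepy.install_genome(name=ref_genome, provider=provider, genomes_dir=genomes_dir)
    return True  # an installation was triggered
```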

celloracle.motif_analysis.load_TFinfo(file_path)

Load a TFinfo object that was saved as an hdf5 file.

Parameters

file_path (str) – file path.

Returns

Loaded TFinfo object.

Return type

TFinfo

celloracle.motif_analysis.load_TFinfo_from_parquets(folder_path)

Load a TFinfo object that was saved with the function “save_as_parquet”.

Parameters

folder_path (str) – folder path

Returns

Loaded TFinfo object.

Return type

TFinfo

celloracle.motif_analysis.load_motifs(motifs_name, force_download=False)

Load motifs from celloracle motif database

Parameters

motifs_name (str) – Name of motifs.

Returns

List of gimmemotifs.motif objects.

Return type

list

celloracle.motif_analysis.make_TFinfo_from_scanned_file(path_to_raw_bed, path_to_scanned_result_bed, ref_genome)

This function is currently unavailable.

celloracle.motif_analysis.peak2fasta(peak_ids, ref_genome, genomes_dir)

Convert peak_id into fasta object.

Parameters
  • peak_ids (str or list of str) – A peak_id, e.g. “chr5_0930303_9499409”, or a list of peak_ids, e.g. [“chr5_0930303_9499409”, “chr11_123445555_123445577”]

  • ref_genome (str) – Reference genome name. e.g. “mm9”, “mm10”, “hg19” etc

  • genomes_dir (str) – Installation directory of Genomepy reference genome data.

Returns

DNA sequence in fasta format

Return type

gimmemotifs fasta object
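The conversion from peak_id to sequence can be pictured with a toy genome held in a dictionary. This sketch returns FASTA-formatted text rather than a gimmemotifs fasta object, and it assumes 0-based, half-open coordinates, which this page does not specify. The function `peaks_to_fasta_text` and the toy genome are hypothetical.

```python
def peaks_to_fasta_text(peak_ids, genome):
    """Render peaks as FASTA text from a {chromosome: sequence} dictionary."""
    if isinstance(peak_ids, str):  # accept a single peak_id as well as a list
        peak_ids = [peak_ids]
    records = []
    for peak_id in peak_ids:
        chrom, start, end = peak_id.rsplit("_", 2)
        # Assumption: 0-based, half-open coordinates.
        sequence = genome[chrom][int(start):int(end)]
        records.append(f">{peak_id}\n{sequence}")
    return "\n".join(records)


toy_genome = {"chr1": "ACGTACGTAC"}
fasta_text = peaks_to_fasta_text("chr1_2_6", toy_genome)
print(fasta_text)
```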

celloracle.motif_analysis.read_bed(bed_path)

Load bed file and return as dataframe.

Parameters

bed_path (str) – File path.

Returns

Bed file contents as a dataframe.

Return type

pandas.DataFrame

celloracle.motif_analysis.remove_zero_seq(fasta_object)

Remove DNA sequence with zero length
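The idea can be sketched on parallel lists of sequence ids and sequences. This is an illustration only: the real function takes a fasta object, and the helper name `drop_empty_sequences` is hypothetical.

```python
def drop_empty_sequences(ids, seqs):
    """Return (ids, seqs) without the entries whose sequence is empty."""
    kept = [(seq_id, seq) for seq_id, seq in zip(ids, seqs) if len(seq) > 0]
    kept_ids = [seq_id for seq_id, _ in kept]
    kept_seqs = [seq for _, seq in kept]
    return kept_ids, kept_seqs


# Toy input: the second peak has no sequence and is dropped.
ids, seqs = drop_empty_sequences(
    ["chr1_2_6", "chr1_7_7", "chr1_0_2"],
    ["GTAC", "", "AC"],
)
```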

celloracle.motif_analysis.scan_dna_for_motifs(scanner_object, motifs_object, sequence_object, divide=100000, verbose=True, batch_size=None)

This is a wrapper function to scan DNA sequences searching for TF binding motifs.

Parameters
  • scanner_object (gimmemotifs.scanner) – Object that performs the motif scan.

  • motifs_object (gimmemotifs.motifs) – Object that stores motif data.

  • sequence_object (gimmemotifs.fasta) – Object that stores sequence data.

Returns

Scan results stored in a dataframe.

Return type

pandas.DataFrame