MEALR combinatorial regulation analysis

Extract TRANSFAC® PWMs from combinatorial regulation analysis

This tool extracts TRANSFAC® PWMs from a result table generated by the MEALR combinatorial regulation analysis. The PWMs represent transcription factor binding specificities that constitute the combinatorial module predicted by the MEALR model.

Input parameters

Parameter                 Description
Input MEALR search table  Input table, a MEALR search result table
Probability cutoff        Cutoff for the observed match probability of a model
Accuracy cutoff           Cutoff for the model accuracy on test data
Importance cutoff         Cutoff for the logistic regression coefficients of PWM features
Output                    Output table

The probability and accuracy cutoffs select MEALR models from the input table based on the observed match probability and the model accuracy on test data, respectively. The importance cutoff is applied to the PWM feature coefficients of the MEALR model; raising this threshold restricts the output to the more important matrices.
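
The interplay of the three cutoffs can be illustrated with a minimal Python sketch. It is not the platform's implementation: the `MealrModel` class, its field names, the example values, and the use of inclusive comparisons (`>=`) are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class MealrModel:
    """Hypothetical, simplified stand-in for one row of a MEALR search result."""
    model_id: str
    probability: float   # observed match probability
    accuracy: float      # model accuracy on test data
    coefficients: dict   # PWM name -> logistic regression coefficient


def select_pwms(models, prob_cutoff, acc_cutoff, importance_cutoff):
    """Return {PWM name: [model ids]} for PWMs that pass all three cutoffs."""
    selected = {}
    for m in models:
        # Probability and accuracy cutoffs act on whole models.
        if m.probability < prob_cutoff or m.accuracy < acc_cutoff:
            continue
        # The importance cutoff acts on the individual PWM coefficients.
        for pwm, coef in m.coefficients.items():
            if coef >= importance_cutoff:
                selected.setdefault(pwm, []).append(m.model_id)
    return selected


# Made-up example data
models = [
    MealrModel("m1", 0.95, 0.90, {"PWM_A": 1.2, "PWM_B": 0.1}),
    MealrModel("m2", 0.40, 0.92, {"PWM_A": 0.8}),
    MealrModel("m3", 0.99, 0.85, {"PWM_B": 0.6, "PWM_C": 0.05}),
]
result = select_pwms(models, prob_cutoff=0.9, acc_cutoff=0.8, importance_cutoff=0.5)
print(result)
```

In this toy data set, model m2 is discarded by the probability cutoff, and a higher importance cutoff would remove further PWMs from the remaining models.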

Output

The output contains the TRANSFAC® PWMs extracted from MEALR models according to the specified cutoffs. This table can be used in several further analyses, e.g. to extract corresponding transcription factors using the tool or to create a profile for binding site predictions (Create profile from site model table) with MATCH™.

Output table column  Description
Max. probability     Highest match probability of models from which the PWM was extracted
Max. accuracy        Highest accuracy of models from which the PWM was extracted
Max. importance      Highest importance of the PWM in extracted models
Avg. probability     Average match probability of models from which the PWM was extracted
Avg. accuracy        Average accuracy of models from which the PWM was extracted
Avg. importance      Average importance of the PWM in extracted models
Min. probability     Lowest match probability of models from which the PWM was extracted
Min. accuracy        Lowest accuracy of models from which the PWM was extracted
Min. importance      Lowest importance of the PWM in extracted models
Cell sources         Cell sources of extracted models
Tissue sources       Tissue sources of extracted models
Factors              Transcription factors targeted in experiments of extracted models
Model ids            Ids of models from which the PWM was extracted
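
The Max., Avg. and Min. columns summarize, for each PWM, the models from which it was extracted. The following Python sketch shows one way such a summary could be computed. The record layout and the values are hypothetical, and the arithmetic mean is assumed for the Avg. columns.

```python
from statistics import mean

# Hypothetical records: (PWM name, model id, match probability, accuracy, importance)
records = [
    ("PWM_A", "m1", 0.95, 0.90, 1.2),
    ("PWM_A", "m3", 0.99, 0.80, 0.8),
    ("PWM_B", "m3", 0.99, 0.80, 0.6),
]


def summarize(records):
    """Group records by PWM and compute max, average and min of each measure."""
    grouped = {}
    for pwm, model_id, prob, acc, imp in records:
        grouped.setdefault(pwm, []).append((model_id, prob, acc, imp))

    summary = {}
    for pwm, rows in grouped.items():
        row = {"Model ids": [r[0] for r in rows]}
        for name, idx in (("probability", 1), ("accuracy", 2), ("importance", 3)):
            values = [r[idx] for r in rows]
            row[f"Max. {name}"] = max(values)
            row[f"Avg. {name}"] = mean(values)  # assumption: arithmetic mean
            row[f"Min. {name}"] = min(values)
        summary[pwm] = row
    return summary


summary = summarize(records)
for pwm, row in summary.items():
    print(pwm, row)
```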

Example analysis

Open the tool in the user interface.

  • Specify a result table of a MEALR search analysis (Input example).

  • Specify probability, accuracy and importance cutoffs.

  • Specify an output table. A result path is suggested automatically based on the specified input (Example result).

TRANSFAC® MEALR combinatorial regulation analysis

This analysis applies combinatorial regulatory models (CRMs) based on the MEALR affinity score [1] to classify or scan sequences for occurrences of combinations of transcription factor binding sites represented by TRANSFAC® PWMs. The models are taken from the MEALR library whose training data originate from the TRANSFAC® collection of high-throughput sequencing experiments.

Input parameters

Parameter              Description
Sequence track         Sequence track or collection to search
Sequence source        Sequence source associated with the sequence track, either a custom or a genomic sequence source
Cell sources           Focus on models from selected cell sources
Tissue sources         Focus on models from selected tissue sources
Classification mode    Classify entire sequence instead of scanning for hits
Scan mode              Scan mode, best hit or cutoff based
Step size              Step size for scan mode
Probability cutoff     Cutoff for the probability that a sequence matches the model
Model accuracy cutoff  Select models with test set accuracy equal to or better than the accuracy cutoff prior to search
Output folder          Output folder for analysis results

Classification and scan modes

The Classification mode evaluates input sequences as a whole, whereas the scan mode analyzes sequence windows separated by the given step size (sliding window). In scan mode, the Best hit method reports the best-scoring sequence window regardless of any cutoff, and the Cutoff method reports the best non-overlapping windows satisfying the specified cutoff.
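
The difference between the two scan methods can be sketched in Python as follows. This is an illustration under stated assumptions, not the MEALR implementation: the scoring function `toy_probability` (GC fraction) merely stands in for a model's match probability, the window length is taken to be a fixed value, the cutoff comparison is inclusive, and the greedy selection of non-overlapping windows is one plausible reading of "best non-overlapping windows". Coordinates in the sketch are zero-based and half-open.

```python
def toy_probability(window):
    """Hypothetical stand-in for a model's match probability: GC fraction."""
    return sum(base in "GC" for base in window) / len(window)


def scan(seq, window_len, step, score_fn):
    """Score sliding windows; returns (start, end, score) tuples."""
    return [
        (start, start + window_len, score_fn(seq[start:start + window_len]))
        for start in range(0, len(seq) - window_len + 1, step)
    ]


def best_hit(hits):
    """Best hit method: the top-scoring window, regardless of any cutoff."""
    return max(hits, key=lambda h: h[2])


def cutoff_hits(hits, cutoff):
    """Cutoff method (assumed greedy): accept windows from best to worst,
    skipping any window that overlaps an already accepted one."""
    accepted = []
    for start, end, score in sorted(hits, key=lambda h: h[2], reverse=True):
        if score < cutoff:
            break  # all remaining windows score below the cutoff
        if all(end <= a_start or start >= a_end for a_start, a_end, _ in accepted):
            accepted.append((start, end, score))
    return sorted(accepted)


seq = "AAAAGGGGAAAACCCCAAAA"
hits = scan(seq, window_len=4, step=2, score_fn=toy_probability)
print(best_hit(hits))
print(cutoff_hits(hits, cutoff=0.5))
```

A smaller step size yields more candidate windows and finer positioning of hits at the cost of more model evaluations.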

Note

The MEALR search applies sequence length limits. The minimum sequence length for scan and classification modes is 50 bases. The classification mode supports sequences of up to 5000 bases. Ideally, input sequences for classification mode should have lengths corresponding to genomic regions typically observed in ChIP-seq studies, such as 500 to 1000 bases, whereas input sequences for scanning should not be too short, e.g. ≥300 bases. We recommend taking the difference between model length and sequence length, both of which are reported in the output table, into account when assessing the reliability of predictions.
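
Input sequences can be checked against these limits before starting an analysis. The Python sketch below is a hypothetical pre-check; the function name and the returned messages are invented for illustration, and how the tool itself handles sequences outside the limits is not specified here.

```python
MIN_LENGTH = 50                    # minimum for scan and classification modes
MAX_CLASSIFICATION_LENGTH = 5000   # upper limit in classification mode


def check_length(seq, classification_mode):
    """Return a short verdict on whether a sequence is within the stated limits."""
    n = len(seq)
    if n < MIN_LENGTH:
        return "too short"
    if classification_mode and n > MAX_CLASSIFICATION_LENGTH:
        return "too long for classification mode"
    return "ok"


print(check_length("ACGT" * 10, classification_mode=True))     # 40 bases
print(check_length("ACGT" * 200, classification_mode=True))    # 800 bases
print(check_length("ACGT" * 2000, classification_mode=True))   # 8000 bases
print(check_length("ACGT" * 2000, classification_mode=False))  # 8000 bases, scan mode
```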

Cell and tissue sources can be selected to focus on the subset of CRMs that have been trained with data from the respective sources. Please note that selecting multiple cells and/or tissues gathers all CRMs that are associated with any one of the selected sources.
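
The "any one of" semantics correspond to a union over the selected sources rather than an intersection. A minimal Python sketch, using a made-up model library and hypothetical field names, and assuming that an empty selection means no filtering:

```python
# Made-up library entries for illustration only
library = [
    {"id": "m1", "cell": "HepG2", "tissue": "liver"},
    {"id": "m2", "cell": "K562", "tissue": "blood"},
    {"id": "m3", "cell": "MCF-7", "tissue": "breast"},
]


def select_models(models, cells=(), tissues=()):
    """Keep models associated with any one of the selected cells or tissues."""
    cells, tissues = set(cells), set(tissues)
    if not cells and not tissues:
        return list(models)  # assumption: no selection means no filtering
    return [m for m in models if m["cell"] in cells or m["tissue"] in tissues]


chosen = select_models(library, cells=["HepG2"], tissues=["blood"])
print([m["id"] for m in chosen])
```

Here the selection returns both the HepG2 model and the blood model, although neither of them matches both criteria.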

Output

The output folder contains a table and a sequence track with information about model hits. The output table contains sequence start and end points of hits, model ids, match probabilities, and other values as described below. For input sequences derived from genomic regions (rather than imported as custom sequences), the table additionally includes a sequence id generated for the region as well as the genomic sequence id and the start and end coordinates.

Output table column  Description
Sequence id          Sequence id of custom or genomic sequence
Interval site name   Sequence id constructed for genomic interval
Start                CRM region start (one-based)
End                  CRM region end (one-based)
Model id             MEALR model id
Factor gene          Gene symbol of the transcription factor analyzed in the source experiment that generated the training data
Cell source          Cell source of training data
Tissue source        Tissue source of training data
Model accuracy       Test set accuracy of MEALR model
Model type           Type of MEALR model (LR: logistic regression, WLR: LR with weighting of CRM region positions)
Model length         Length of CRM region
Sequence length      Length of analyzed sequence
Score                CRM score
Probability          CRM probability

Example analysis

Open the tool in the user interface.

  • Specify sequence(s) to analyze. This should be a track item with custom or genomic sequences (Input example). A sequence source is suggested automatically.

  • Select cell and tissue sources for filtering. There are over 400 cell types and more than 50 tissues to choose from. Please note that selecting multiple sources includes all models from the library that are associated with any one of the selected cells or tissues.

  • Choose Classification mode if scan mode is not desired.

  • In scan mode, specify a step size and select the Best hit or Cutoff method.

  • For the Cutoff method, specify a cutoff for model hits.

  • Specify a model accuracy cutoff.

  • Specify an output folder for results. The folder can already exist or be newly created by the workflow (Example result folder).

  1. Katie Lloyd, Stamatia Papoutsopoulou, Emily Smith, Philip Stegmaier, Francois Bergey, et al., The SysmedIBD Consortium; Using systems medicine to identify a therapeutic agent with potential for repurposing in inflammatory bowel disease. Dis Model Mech 1 November 2020; 13 (11): dmm044040.