MEALR combinatorial regulation analysis

Extract TRANSFAC® PWMs from combinatorial regulation analysis

This tool extracts TRANSFAC® PWMs from a result table generated by the MEALR combinatorial regulation analysis. The PWMs represent transcription factor binding specificities that constitute the combinatorial module predicted by the MEALR model.

Input parameters

Parameter	Description
Input MEALR search table	Input table, a MEALR search result table
Probability cutoff	Cutoff for the observed match probability of MEALR models
Accuracy cutoff	Model accuracy cutoff
Importance cutoff	Logistic regression coefficient cutoff
Output	Output table

The probability and accuracy cutoffs select MEALR models from the input table based on the observed match probability and the model accuracy on test data, respectively. The importance cutoff is applied to the PWM feature coefficients of the MEALR model. Increasing this threshold restricts the output to the more important matrices.
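The interplay of the three cutoffs can be sketched in Python as follows. This is a minimal illustration, not the tool's implementation: the record layout (`probability`, `accuracy`, `coefficients`), the function name `select_pwms` and the PWM names are hypothetical. It is also an assumption that values equal to a cutoff pass and that coefficients are compared as given rather than by absolute value.

```python
def select_pwms(models, prob_cutoff, acc_cutoff, importance_cutoff):
    """Return (model id, PWM, coefficient) triples that pass all three cutoffs."""
    selected = []
    for model in models:
        # Model-level filters: observed match probability and test-set accuracy.
        if model["probability"] < prob_cutoff or model["accuracy"] < acc_cutoff:
            continue
        # Feature-level filter: keep PWMs whose coefficient reaches the cutoff.
        for pwm, coefficient in model["coefficients"].items():
            if coefficient >= importance_cutoff:
                selected.append((model["id"], pwm, coefficient))
    return selected


# Hypothetical input: two models with made-up values.
models = [
    {"id": "m1", "probability": 0.9, "accuracy": 0.8,
     "coefficients": {"PWM_A": 1.5, "PWM_B": 0.2}},
    {"id": "m2", "probability": 0.4, "accuracy": 0.9,
     "coefficients": {"PWM_A": 2.0}},
]
hits = select_pwms(models, prob_cutoff=0.5, acc_cutoff=0.7, importance_cutoff=1.0)
print(hits)
```

In this toy input, model m2 is dropped by the probability cutoff and PWM_B of model m1 is dropped by the importance cutoff.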

Output

The output contains the TRANSFAC® PWMs extracted from MEALR models according to the specified cutoffs. This table can be used in several further analyses, e.g. to extract corresponding transcription factors using the tool or to create a profile for binding site predictions (Create profile from site model table) with MATCH™.

Output table column	Description
Max. probability	Highest match probability of the models from which the PWM was extracted
Max. accuracy	Highest accuracy of the models from which the PWM was extracted
Max. importance	Highest importance of the PWM in the extracted models
Avg. probability	Average match probability of the models from which the PWM was extracted
Avg. accuracy	Average accuracy of the models from which the PWM was extracted
Avg. importance	Average importance of the PWM in the extracted models
Min. probability	Lowest match probability of the models from which the PWM was extracted
Min. accuracy	Lowest accuracy of the models from which the PWM was extracted
Min. importance	Lowest importance of the PWM in the extracted models
Cell sources	Cell sources of extracted models
Tissue sources	Tissue sources of extracted models
Factors	Transcription factors targeted in experiments of extracted models
Model ids	Ids of models from which PWM was extracted
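
The per-PWM summary columns can be illustrated with a short Python sketch. It assumes hypothetical input records with the keys `model_id`, `pwm`, `probability`, `accuracy` and `importance`. The function name `summarize_pwms` and the output key names are likewise made up for illustration and do not reflect the tool's internal data structures.

```python
from collections import defaultdict
from statistics import mean


def summarize_pwms(records):
    """Group records by PWM and compute max, average and min per value column."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["pwm"]].append(record)

    summary = {}
    for pwm, recs in grouped.items():
        row = {}
        for field in ("probability", "accuracy", "importance"):
            values = [r[field] for r in recs]
            row["max_" + field] = max(values)
            row["avg_" + field] = mean(values)
            row["min_" + field] = min(values)
        # Ids of all models from which this PWM was extracted.
        row["model_ids"] = sorted({r["model_id"] for r in recs})
        summary[pwm] = row
    return summary


# Hypothetical records: PWM_A occurs in two models, PWM_B in one.
records = [
    {"model_id": "m1", "pwm": "PWM_A", "probability": 0.5, "accuracy": 0.75, "importance": 1.0},
    {"model_id": "m2", "pwm": "PWM_A", "probability": 1.0, "accuracy": 0.25, "importance": 2.0},
    {"model_id": "m1", "pwm": "PWM_B", "probability": 0.5, "accuracy": 0.75, "importance": 0.5},
]
summary = summarize_pwms(records)
print(summary["PWM_A"])
```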

Example analysis

Open the tool in the user interface.

Specify a result table of a MEALR search analysis (Input example).
Specify probability, accuracy and importance cutoffs.
Specify an output table. A result path is suggested automatically on the basis of specified input (Example result).

TRANSFAC® MEALR combinatorial regulation analysis

This analysis applies combinatorial regulatory models (CRMs) based on the MEALR affinity score [1] to classify or scan sequences for occurrences of combinations of transcription factor binding sites represented by TRANSFAC® PWMs. The models are taken from the MEALR library whose training data originate from the TRANSFAC® collection of high-throughput sequencing experiments.

Input parameters

Parameter	Description
Sequence track	Sequence track or collection to search
Sequence source	Sequence source associated with the sequence track. Either a custom or genomic sequence source
Cell sources	Focus on models from selected cell sources
Tissue sources	Focus on models from selected tissue sources
Classification mode	Classify entire sequence instead of scanning for hits
Scan mode	Scan mode, best hit or cutoff based
Step size	Step size for scanning mode
Probability cutoff	Cutoff for the probability that a sequence matches the model
Model accuracy cutoff	Select models with test set accuracy equal to or better than the accuracy cutoff prior to search
Output folder	Output folder for analysis results

Classification and scan modes

The classification mode evaluates each input sequence as a whole, whereas the scan mode analyzes sequence windows separated by the given step size (sliding window). In scan mode, the Best hit method reports the best-scoring sequence window regardless of any cutoff, and the Cutoff method reports the best non-overlapping windows that satisfy the specified cutoff.
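
The two scan methods can be sketched in Python as follows. This is an illustration under stated assumptions, not the MEALR implementation: the scoring function `gc_fraction` is a toy stand-in for the model probability, the fixed window width and all function names are hypothetical, and the greedy selection of non-overlapping windows is one plausible reading of "best non-overlapping windows". Coordinates in the sketch are zero-based and half-open, whereas the output table reports one-based positions.

```python
def windows(seq, width, step):
    """Yield (start, end) of sliding windows separated by the step size."""
    for start in range(0, len(seq) - width + 1, step):
        yield start, start + width


def best_hit(seq, width, step, score):
    """Best hit method: the single best-scoring window, ignoring any cutoff."""
    hits = [(score(seq[s:e]), s, e) for s, e in windows(seq, width, step)]
    return max(hits, key=lambda h: h[0], default=None)


def cutoff_hits(seq, width, step, score, cutoff):
    """Cutoff method: greedily keep the best non-overlapping windows >= cutoff."""
    hits = [(score(seq[s:e]), s, e) for s, e in windows(seq, width, step)]
    hits = [h for h in hits if h[0] >= cutoff]
    hits.sort(key=lambda h: -h[0])  # stable sort: ties keep left-to-right order
    chosen = []
    for sc, s, e in hits:
        if all(e <= cs or s >= ce for _, cs, ce in chosen):
            chosen.append((sc, s, e))
    return sorted(chosen, key=lambda h: h[1])


def gc_fraction(window):
    """Toy score (assumption): fraction of G or C characters in the window."""
    return sum(c in "GC" for c in window) / len(window)


seq = "AAAAGGGGAAAACCCC"
best = best_hit(seq, width=4, step=2, score=gc_fraction)
above = cutoff_hits(seq, width=4, step=2, score=gc_fraction, cutoff=0.5)
print(best)   # (1.0, 4, 8)
print(above)  # [(1.0, 4, 8), (1.0, 12, 16)]
```

The windows with score 0.5 also satisfy the cutoff, but each of them overlaps a better-scoring window and is therefore not reported.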

Note

The MEALR search applies sequence length limits. The minimum sequence length for both scan and classification modes is 50 bases. The classification mode supports sequences of up to 5000 bases. Ideally, input sequences for classification mode should have lengths corresponding to genomic regions typically observed in ChIP-seq studies, such as 500 - 1000 bases, whereas input sequences for scanning should not be too short, e.g. ≥300 bases. We recommend taking the difference between model length and sequence length, both of which are reported in the output table, into account when assessing the reliability of predictions.
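
When preparing input, the stated length limits can be checked up front, as in the following Python sketch. The constants come from the note above. The function name `length_ok` is hypothetical, and how the tool itself treats sequences outside the limits is not described here, so the sketch only reports whether a sequence fits.

```python
MIN_LENGTH = 50                    # minimum for scan and classification modes
MAX_CLASSIFICATION_LENGTH = 5000   # upper limit stated for classification mode


def length_ok(sequence, classification_mode):
    """Return True if the sequence length is within the stated limits."""
    n = len(sequence)
    if n < MIN_LENGTH:
        return False
    if classification_mode and n > MAX_CLASSIFICATION_LENGTH:
        return False
    # No upper limit is stated for scan mode, so none is applied here.
    return True


print(length_ok("A" * 49, classification_mode=True))     # False
print(length_ok("A" * 800, classification_mode=True))    # True
print(length_ok("A" * 6000, classification_mode=True))   # False
print(length_ok("A" * 6000, classification_mode=False))  # True
```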

Cell and tissue sources can be selected to focus on a subset of CRMs that have been trained with data from the respective sources. Please note that selecting multiple cells and/or tissues gathers all CRMs that are associated with any one of the selected sources.
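
The "any one of the selected sources" behavior corresponds to a union rather than an intersection, as the following Python sketch shows. The model records, the source names and the function name `select_models` are hypothetical. Treating an empty selection as "no filtering" is an assumption of this sketch.

```python
def select_models(models, cells=(), tissues=()):
    """Keep models associated with any selected cell source or tissue source."""
    if not cells and not tissues:
        return list(models)  # assumption: no selection means no filtering
    wanted_cells, wanted_tissues = set(cells), set(tissues)
    return [
        m for m in models
        if m["cell"] in wanted_cells or m["tissue"] in wanted_tissues
    ]


# Hypothetical library entries.
library = [
    {"id": "m1", "cell": "cell_A", "tissue": "tissue_X"},
    {"id": "m2", "cell": "cell_B", "tissue": "tissue_Y"},
    {"id": "m3", "cell": "cell_C", "tissue": "tissue_X"},
]
picked = select_models(library, cells=["cell_B"], tissues=["tissue_X"])
print([m["id"] for m in picked])  # ['m1', 'm2', 'm3']
```

Selecting cell_B together with tissue_X returns all three models, because each of them matches at least one of the selected sources.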

Output

The output folder contains a table and a sequence track with information about model hits. The output table contains the sequence start and end points of hits, model ids, match probabilities and other values as described below. For input sequences derived from genomic regions (instead of imported as custom sequences), the table additionally includes a sequence id generated for the region as well as the genomic sequence id and the start and end coordinates.

Output table column	Description
Sequence id	Sequence id of custom or genomic sequence
Interval site name	Sequence id constructed for genomic interval
Start	CRM region start (one-based)
End	CRM region end (one-based)
Model id	MEALR model id
Factor gene	Gene symbol of transcription factor analyzed in source experiment generating training data
Cell source	Cell source of training data
Tissue source	Tissue source of training data
Model accuracy	Test set accuracy of MEALR model
Model type	Type of MEALR model (LR: logistic regression, WLR: LR with weighting of CRM region positions)
Model length	Length of CRM region
Sequence length	Length of analyzed sequence
Score	CRM score
Probability	CRM probability

Example analysis

Open the tool in the user interface.

Specify sequence(s) to analyze. This should be a track item with custom or genomic sequences (Input example). A sequence source is suggested automatically.
Select cell and tissue sources for filtering. There are over 400 cell types and more than 50 tissues to choose from. Please note that multiple selection causes all models from the library to be considered that are associated with any one of the selected cells or tissues.
Choose Classification mode if scan mode is not desired.
In scan mode, specify a step size and select the Best hit or Cutoff method.
For the Cutoff method, specify a probability cutoff for model hits.
Specify a model accuracy cutoff.
Specify an output folder for results. The folder can already exist or be newly created by the workflow (Example result folder).

[1] Katie Lloyd, Stamatia Papoutsopoulou, Emily Smith, Philip Stegmaier, Francois Bergey, et al., The SysmedIBD Consortium. Using systems medicine to identify a therapeutic agent with potential for repurposing in inflammatory bowel disease. Dis Model Mech 1 November 2020; 13 (11): dmm044040.