MEALR combinatorial regulation analysis
Extract TRANSFAC® PWMs from combinatorial regulation analysis
This tool extracts TRANSFAC® PWMs from a result table generated by the MEALR combinatorial regulation analysis. The PWMs represent transcription factor binding specificities that constitute the combinatorial module predicted by the MEALR model.
Input parameters
Parameter | Description |
---|---|
Input MEALR search table | Input table, a MEALR search result table |
Probability cutoff | Match probability cutoff for model hits |
Accuracy cutoff | Model accuracy cutoff |
Importance cutoff | Logistic regression coefficient cutoff |
Output | Output table |
Output
The output contains the TRANSFAC® PWMs extracted from MEALR models according to the specified cutoffs. This table can be used in further analyses, e.g. to extract corresponding transcription factors using the tool or to create a profile for binding site predictions with MATCH™ (Create profile from site model table).
Output table column | Description |
---|---|
Max. probability | Highest match probability among the models from which the PWM was extracted |
Max. accuracy | Highest accuracy among the models from which the PWM was extracted |
Max. importance | Highest importance of the PWM in the extracted models |
Avg. probability | Average match probability of the models from which the PWM was extracted |
Avg. accuracy | Average accuracy of the models from which the PWM was extracted |
Avg. importance | Average importance of the PWM in the extracted models |
Min. probability | Lowest match probability among the models from which the PWM was extracted |
Min. accuracy | Lowest accuracy among the models from which the PWM was extracted |
Min. importance | Lowest importance of the PWM in the extracted models |
Cell sources | Cell sources of extracted models |
Tissue sources | Tissue sources of extracted models |
Factors | Transcription factors targeted in experiments of extracted models |
Model ids | Ids of models from which PWM was extracted |
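The aggregation behind the Max., Avg. and Min. columns can be illustrated with a minimal Python sketch. It is not the tool's implementation: the record layout (`model_id`, `probability`, `accuracy`, `pwms`), the function name `extract_pwms` and the assumption that a value passes when it is greater than or equal to its cutoff are all hypothetical and serve only to show how one PWM is summarized across several models.

```python
from collections import defaultdict
from statistics import mean


def extract_pwms(hits, prob_cutoff, acc_cutoff, imp_cutoff):
    """Summarize PWMs over hypothetical MEALR model hits that pass the cutoffs."""
    by_pwm = defaultdict(list)
    for hit in hits:
        # Assumption: a model hit passes when it reaches both cutoffs.
        if hit["probability"] < prob_cutoff or hit["accuracy"] < acc_cutoff:
            continue
        for pwm, importance in hit["pwms"].items():
            # Assumption: the importance (regression coefficient) is compared as-is.
            if importance >= imp_cutoff:
                by_pwm[pwm].append((hit, importance))

    rows = {}
    for pwm, entries in by_pwm.items():
        probs = [hit["probability"] for hit, _ in entries]
        accs = [hit["accuracy"] for hit, _ in entries]
        imps = [importance for _, importance in entries]
        rows[pwm] = {
            "Max. probability": max(probs),
            "Max. accuracy": max(accs),
            "Max. importance": max(imps),
            "Avg. probability": mean(probs),
            "Avg. accuracy": mean(accs),
            "Avg. importance": mean(imps),
            "Min. probability": min(probs),
            "Min. accuracy": min(accs),
            "Min. importance": min(imps),
            "Model ids": sorted({hit["model_id"] for hit, _ in entries}),
        }
    return rows


# Made-up example records; model ids and PWM names are placeholders.
example_hits = [
    {"model_id": "m1", "probability": 0.9, "accuracy": 0.8, "pwms": {"A": 2.0, "B": 0.1}},
    {"model_id": "m2", "probability": 0.7, "accuracy": 0.9, "pwms": {"A": 1.0}},
    {"model_id": "m3", "probability": 0.2, "accuracy": 0.95, "pwms": {"A": 5.0}},
]
example_rows = extract_pwms(example_hits, prob_cutoff=0.5, acc_cutoff=0.75, imp_cutoff=0.5)
```

In this made-up example, PWM "A" is summarized over models m1 and m2 only, because m3 falls below the probability cutoff and PWM "B" falls below the importance cutoff.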
Example analysis
Open the tool in the user interface.
Specify a result table of a MEALR search analysis (Input example).
Specify probability, accuracy and importance cutoffs.
Specify an output table. A result path is suggested automatically on the basis of specified input (Example result).
TRANSFAC® MEALR combinatorial regulation analysis
This analysis applies combinatorial regulatory models (CRMs) based on the MEALR affinity score [1] to classify or scan sequences for occurrences of combinations of transcription factor binding sites represented by TRANSFAC® PWMs. The models are taken from the MEALR library whose training data originate from the TRANSFAC® collection of high-throughput sequencing experiments.
Input parameters
Parameter | Description |
---|---|
Sequence track | Sequence track or collection to search |
Sequence source | Sequence source associated with the sequence track, either a custom or a genomic sequence source |
Cell sources | Focus on models from selected cell sources |
Tissue sources | Focus on models from selected tissue sources |
Classification mode | Classify entire sequence instead of scanning for hits |
Scan mode | Scan mode, best hit or cutoff based |
Step size | Step size for scanning mode |
Probability cutoff | Cutoff for the probability that a sequence matches the model |
Model accuracy cutoff | Prior to the search, select models whose test set accuracy is equal to or better than the accuracy cutoff |
Output folder | Output folder for analysis results |
Classification and scan modes
The Classification mode evaluates each input sequence as a whole, whereas the scan mode analyzes sequence windows separated by the given step size (sliding window). In scan mode, the Best hit method reports the best-scoring sequence window regardless of any cutoff, and the Cutoff method reports the best non-overlapping windows that satisfy the specified cutoff.
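The difference between the two scan methods can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the tool's algorithm: the function names are hypothetical, the window length is assumed to be fixed (for instance the model length), and the Cutoff method is assumed to pick windows greedily from the highest to the lowest probability. The probabilities in the example are made up; in practice they would come from a MEALR model.

```python
def windows(seq_len, window_len, step):
    """Yield one-based, inclusive (start, end) windows separated by `step`."""
    for offset in range(0, seq_len - window_len + 1, step):
        yield offset + 1, offset + window_len


def best_hit(scored):
    """Best hit: the top-scoring window, regardless of any cutoff."""
    return max(scored, key=lambda w: w[2], default=None)


def cutoff_hits(scored, cutoff):
    """Cutoff: best non-overlapping windows whose probability meets the cutoff.

    Assumption: windows are chosen greedily from best to worst.
    """
    kept = []
    for start, end, prob in sorted(scored, key=lambda w: w[2], reverse=True):
        if prob < cutoff:
            break  # all remaining windows score lower
        # Keep the window only if it shares no position with a kept window.
        if all(end < s or start > e for s, e, _ in kept):
            kept.append((start, end, prob))
    return sorted(kept)


# Window positions for a 12-base sequence, window length 4, step size 4.
positions = list(windows(12, 4, 4))

# Made-up (start, end, probability) triples for three scored windows.
scored = [(1, 4, 0.6), (3, 6, 0.9), (7, 10, 0.8)]
top = best_hit(scored)
hits = cutoff_hits(scored, 0.5)
```

Here the window at positions 1 to 4 meets the cutoff of 0.5 but is dropped because it overlaps the better window at positions 3 to 6.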
Note
The MEALR search applies sequence length limits. The minimal sequence length for scan and classification modes is 50 bases. The classification mode supports sequences of up to 5000 bases. Ideally, input sequences for classification mode should have lengths corresponding to genomic regions typically observed in ChIP-seq studies, such as 500 to 1000 bases, whereas input sequences for scanning should not be too short, e.g. ≥300 bases. When assessing the reliability of predictions, we recommend taking into account the differences between model length and sequence length, both of which are reported in the output table.
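The length rules in this note can be checked before submitting sequences. The following Python sketch is only an illustration: the function name `assess_length` and the labels "rejected", "suboptimal" and "ok" are hypothetical, and since the note gives no upper limit for scan mode, none is assumed here.

```python
MIN_LENGTH = 50                    # minimal length for scan and classification modes
MAX_CLASSIFICATION_LENGTH = 5000   # upper limit stated for classification mode


def assess_length(n_bases, classification_mode):
    """Return 'rejected', 'suboptimal' or 'ok' for a sequence of n_bases bases."""
    if n_bases < MIN_LENGTH:
        return "rejected"
    if classification_mode:
        if n_bases > MAX_CLASSIFICATION_LENGTH:
            return "rejected"
        # Ideal lengths resemble typical ChIP-seq regions (500 to 1000 bases).
        return "ok" if 500 <= n_bases <= 1000 else "suboptimal"
    # Scan mode: sequences should not be too short (at least 300 bases).
    # Assumption: no upper limit applies in scan mode.
    return "ok" if n_bases >= 300 else "suboptimal"
```

For example, a 750-base sequence is "ok" in classification mode, while a 120-base sequence is accepted in scan mode but flagged as "suboptimal".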
Cell and tissue sources can be selected to focus on the subset of CRMs that have been trained with data from the respective sources. Please note that selecting multiple cells and/or tissues gathers all CRMs that are associated with any one of the selected sources.
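The "any one of the selected sources" behavior corresponds to a union, not an intersection. A minimal Python sketch of such a pre-selection, combined with the model accuracy cutoff, might look as follows. The model records, their field names and the function `select_models` are hypothetical, the source names are placeholders, and treating an empty selection as "all models" is an assumption.

```python
def select_models(models, cells=(), tissues=(), accuracy_cutoff=0.0):
    """Keep models that reach the accuracy cutoff and match any selected source."""
    has_selection = bool(cells or tissues)
    chosen = []
    for model in models:
        # Test set accuracy must be equal to or better than the cutoff.
        if model["accuracy"] < accuracy_cutoff:
            continue
        # Union semantics: one matching cell OR tissue source is enough.
        # Assumption: without any selection, all models are considered.
        if has_selection and model["cell"] not in cells and model["tissue"] not in tissues:
            continue
        chosen.append(model)
    return chosen


# Made-up library entries for illustration only.
library = [
    {"id": "m1", "cell": "cell_A", "tissue": "tissue_X", "accuracy": 0.9},
    {"id": "m2", "cell": "cell_B", "tissue": "tissue_Y", "accuracy": 0.8},
    {"id": "m3", "cell": "cell_C", "tissue": "tissue_Y", "accuracy": 0.6},
]
picked = select_models(library, cells={"cell_A"}, tissues={"tissue_Y"}, accuracy_cutoff=0.7)
picked_ids = [model["id"] for model in picked]
```

In this example m1 is selected through its cell source and m2 through its tissue source, while m3 matches the tissue but is excluded by the accuracy cutoff.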
Output
The output folder contains a table and a sequence track with information about model hits. The output table contains the sequence start and end points of hits, model ids, match probabilities and further values as described below. For input sequences derived from genomic regions (instead of imported as custom sequences), the table additionally includes a sequence id generated for the region as well as the genomic sequence id and the start and end coordinates.
Output table column | Description |
---|---|
Sequence id | Sequence id of custom or genomic sequence |
Interval site name | Sequence id constructed for genomic interval |
Start | CRM region start (one-based) |
End | CRM region end (one-based) |
Model id | MEALR model id |
Factor gene | Gene symbol of transcription factor analyzed in source experiment generating training data |
Cell source | Cell source of training data |
Tissue source | Tissue source of training data |
Model accuracy | Test set accuracy of MEALR model |
Model type | Type of MEALR model (LR: logistic regression, WLR: LR with weighting of CRM region positions) |
Model length | Length of CRM region |
Sequence length | Length of analyzed sequence |
Score | CRM score |
Probability | CRM probability |
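Because Start and End are one-based, they need to be shifted when a hit region is cut out of a sequence in a zero-based language. The short Python sketch below shows the conversion. It assumes that End is inclusive, which the table does not state explicitly, and the function name `hit_subsequence` is hypothetical.

```python
def hit_subsequence(sequence, start, end):
    """Return the hit region for one-based Start and End values.

    Assumption: End is inclusive, so the region covers end - start + 1 bases.
    """
    if start < 1 or end < start or end > len(sequence):
        raise ValueError("Start/End outside of the sequence")
    # Python slices are zero-based and end-exclusive.
    return sequence[start - 1:end]


region = hit_subsequence("ACGTACGT", 3, 5)
```

Under this assumption, a hit with Start 3 and End 5 covers three bases.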
Example analysis
Open the tool in the user interface.
Specify sequence(s) to analyze. This should be a track item with custom or genomic sequences (Input example). A sequence source is suggested automatically.
Select cell and tissue sources for filtering. There are over 400 cell types and more than 50 tissues to choose from. Please note that when multiple cells or tissues are selected, all library models associated with any one of them are considered.
Choose Classification mode if scan mode is not desired.
If scan mode is used, specify a step size and select Best hit or Cutoff mode.
If Cutoff mode is used, specify a cutoff for model hits.
Specify a model accuracy cutoff.
Specify an output folder for results. The folder can already exist or be newly created by the workflow (Example result folder).