Before following this tutorial, install the TSSFinder command-line tools and pre-trained models.
In this tutorial you will train a new TSS predictor on the A. thaliana genome and make some predictions. It is also a good way to verify that TSSFinder is installed properly on your system.
This tutorial assumes that you are using OS X or Ubuntu Linux.
First, download the training_sets files. You can download them directly from your terminal:
$ curl -L -o training_sets.zip xxx
After that, extract it:
$ unzip training_sets.zip
It will generate the training_sets/athaliana directory, containing files in 4 formats:
train.fa, a FASTA file which contains the genomic sequence:
$ head athaliana/athaliana.tair10.fa
>1:AT1G05400_1
GTCACAAAAACTCTTCAGCCTTGTTAAGATCATCCAAAATCTCAGTCTTAACCTCATCA
GGAAGTGGTGCAGAAGGACCTGCCTTTGAGTAAAATTTCGCCAAAGCGCGGATTACATT
CTCCAGCTCCACGTATGATTCCTGAAACACATGAATATTTCAATCAATTTCAATTGATC
AACTTTCAATAAAAGAAGAAGATAAAGATTAAACAAACATACTTCTCCGGCAACAGTCT
TTTGTCCTCTCCAATTGCTCAGATAATCACGAATGGATTCTTTAGCAGCATCAGCACTT
CTTCTGAACTTAGCATTGTCTTTAGGGTTCTCTTTCAATGACTCCCTCATGGTTTTGAC
AACCTCTCTTGCAGATTTCAAGTAAGCTTTTGGTAACTCTTTGCCAGACTTTGTTCTCT
CATTAGGATCAAACAAAGTCTTGAAAGCACCAACAACTCCTTCTCCTTCGTCTAACTTT
TCCTCTGACGCCTGAGCTTGTAATCCAGAAAACTCAATTGTGGCCATGAACGAAGAAGC
ACCAACAACACATTTCAAGAAACCACGACGATCAAACAAAGATGATGTTTGGCCTTGAC
GAAGCTTTTCATCTATTTTAATTTGCAAAGTTTAAAGCAGTTAAATAAAACTTGTAGAT
AAGTTTAGAAATCGTGAAATCAATCCATCTCTTATAAAAAACTAAATCATTGAAGAAAC
ATGTTTGATTTCAACATTGGATGATATAAACAACATTGACATTAAAACAGAAAAGAGTT
TTCAAAAAAAAAAAAAAAAAAACACAGAAAAGAGATAAAGAAGAAGAAGAGTGATACTT
TGACATTGTGGCTTCGATTTCATATGGTGATGCACCAAAGTAGGAGAAAAGTTCATGAC
GGCTACAAGGAAACCCATGAATCTTTCTGTCTGATAACAACAGGAACCTCAGAATCAGA
TATAGATGAAAAAGAGGATGGTTCTAGTTGTTTGCGAAAATGGCGAATTTGCCCTTCAA
GTTGATGATAATAAAATAAAAAACGCTAGAAGGAGGGTTCGAACCTCCGACCTTGTGGT
TAACAGCCACACGCTCTAACCAGCTGAGCTATTCCAGCTACTTGTTATGTTATGTGTAT
TTATGTTTTATATATTGTCTTATCATTTATTAAATTGAGTTATTGAATTTCTTTTTAGT
CGAAACGGTGTTTTTTTTTAATTCACTTACAAAAAATAAGGATTGGTGCATGGGAAAGG
CCGATTCATAACAAATTGTAGGCCCATTAAAACGAAGCCCATTATCAACCCTATATATA
GTGCGGCGTCGGAGTTTTTATATTTTTCCAAACAGGAATTAAGAAACCCCTCTAGTCGA
CGATATCGAGATCGATCCAACCTTTACCAACGCTTTATATCTCTTCTAAGGTAAAATTT
GTCGCTCCTTATCTCTCTCACTGTTTCTTAGAAAATTCATCTTAATTATTATATAGATA
GAGAACCCTAAAAGTTCAAGGTTGTATTTGCGATCGAAACTATCAGCGGCCGTGTTCCG
AACCAGTTTTGAATTTCAATTTTTTTTTTTGAAGGTTAAAAGTCTTTGATTCGTGAATG
ATGATGGATCCGCCGCCTACGAAGGAGTTTGCTTTGGCCCCAGTATATGTGTATTGGGA
CATGAAGAGGTGTCCGGTTCCAGATGACTATGATGCTCGTCGGGTTGGTCCGTGTATTA
AACGGATTTTAAGGAAATCAGGCTACAATGGTCCTGTCACCATCACTGCTGTTGGCTCA
CTATCAAAGGTCCCTCGTGACATCCTTGAAGTGGTCTCTTCCACTGGAATCTCTCTTTA
TCACGAGGTCGCCAGTAAGCATCGATCTTCTCATCCTTATTCCCTTTTTATTTCCTTTA
TTAGAGAGAGAGAAAGAGTCTCTCTAAGTATTTAGTTTCTGAAACAGGTAGGAAATGCA
TGGTTTCGTGTTCCCTTGACCATGATGATTTTTAGCCACGTCCAGCTACTTTG
...
tss.bed, a BED file which describes all the TSSs of each gene:
$ head athaliana/athaliana.dataset1.tss.bed
1 1584074 1584075 AT1G05400_1 1 +
2 3496151 3496152 AT2G07739_1 1 -
3 20185206 20185207 AT3G54520_1 1 -
1 2339384 2339385 AT1G07600_1 1 -
5 18710592 18710593 AT5G46150_1 1 -
5 4056153 4056154 AT5G12850_1 1 +
4 10103987 10103988 AT4G18280_1 1 +
2 7341118 7341119 AT2G16930_1 1 +
4 14606004 14606005 AT4G29870_1 1 +
5 16385303 16385304 AT5G40890_1 1 -
...
start.bed, a BED file which describes all the start codons of each gene:
$ head athaliana/athaliana.dataset1.start.bed
1 1597842 1597843 AT1G05440_1 1 -
1 1766149 1766150 AT1G05840_1 1 -
1 1775654 1775655 AT1G05880_1 1 +
1 1823346 1823347 AT1G06010_1 1 +
1 1855962 1855963 AT1G06120_1 1 +
1 2015941 2015942 AT1G06580_1 1 -
1 2105232 2105233 AT1G06850_1 1 +
1 2174791 2174792 AT1G07090_1 1 -
1 2219378 2219379 AT1G07220_1 1 -
1 2222777 2222778 AT1G07230_1 1 -
...
tata-box.bed, a BED file which describes all the TATA-boxes of each gene:
$ head athaliana/athaliana.tata.bed
1 1597898 1597899 AT1G05440_1 1 -
1 1599879 1599880 AT1G05450_1 1 +
1 1605108 1605109 AT1G05460_1 1 -
1 1635711 1635712 AT1G05520_1 1 -
1 1637884 1637885 AT1G05530_1 1 -
1 1647180 1647181 AT1G05560_1 1 -
1 1659469 1659470 AT1G05570_1 1 -
1 1662216 1662217 AT1G05575_1 1 -
1 1664464 1664465 AT1G05577_1 1 -
1 1669320 1669321 AT1G05590_1 1 +
...
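Before training, it can help to confirm that the three BED files have the shape shown above: six columns, one-base-wide intervals, and a `+` or `-` strand. The following is a minimal sketch of such a check. `check_bed` is a hypothetical helper written for this tutorial, not part of TSSFinder, and the sample records are copied from the tss.bed listing above.

```shell
#!/bin/sh
# check_bed: hypothetical helper (not a TSSFinder tool) that reports
# malformed records in a BED file of single-base features.
check_bed() {
    awk '
        NF < 6 { printf "line %d: expected 6 columns, found %d\n", NR, NF; bad++; next }
        $3 - $2 != 1 { printf "line %d: interval is not one base wide\n", NR; bad++ }
        $6 != "+" && $6 != "-" { printf "line %d: unexpected strand %s\n", NR, $6; bad++ }
        { strand[$6]++ }
        END {
            printf "records=%d plus=%d minus=%d problems=%d\n", NR, strand["+"], strand["-"], bad
            exit (bad > 0)   # non-zero exit status when any problem was found
        }
    ' "$1"
}

# Demonstration on three records taken from the listing above.
sample=$(mktemp "${TMPDIR:-/tmp}/bed.XXXXXX")
cat > "$sample" <<'EOF'
1 1584074 1584075 AT1G05400_1 1 +
2 3496151 3496152 AT2G07739_1 1 -
3 20185206 20185207 AT3G54520_1 1 -
EOF
check_bed "$sample"
```

To check a real file, call the function on it directly, for example `check_bed athaliana/athaliana.dataset1.tss.bed`.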
Training command, an example of the full command line used to train TSSFinder. Add the `tssfinder` and `tssfinder/bin/cli` folders to your `PATH` before running it.
$ tssfinder-train --model athaliana/models/athaliana.1/ \
--start athaliana/athaliana.dataset1.start.bed \
--tata athaliana/athaliana.tata.bed \
--tss athaliana/athaliana.dataset1.tss.bed \
--genome athaliana/athaliana.tair10.fa
out_file.train, TSSFinder training output files: observation.myop-prom and transitions.myop-prom. These files store the main characteristics of the TSS signal and the core promoter region.
$ head athaliana/models/athaliana.1/observation.myop-prom
x[0,1]=16; BEGIN 4.72831
x[0,1]=16; CDS/INTRON 4.45971
x[0,1]=15; PROMOTER#2 -0.0977334
x[0,1]=15; TSS-3 -0.309565
x[0,1]=15; TSS-2 -0.0631545
x[0,1]=15; TSS-1 0.0579794
x[0,1]=15; TSS-0 0.134232
x[0,1]=15; TSS+1 0.162803
x[0,1]=15; TSS+2 0.147902
x[0,1]=15; TSS+3 0.146342
...
x[-3,0]=2;x[2,0]=0; TSS+3 -0.00913836
x[-3,0]=2;x[2,0]=0; 5'UTR -0.0516346
x[-3,0]=2;x[2,0]=0; TATA-3 0.0696071
x[-3,0]=2;x[2,0]=0; TATA-2 0.0918243
x[-3,0]=2;x[2,0]=0; TATA-1 0.0620057
x[-3,0]=2;x[2,0]=0; TATA-0 0.0783161
x[-3,0]=2;x[2,0]=0; TATA+1 0.00231838
x[-3,0]=2;x[2,0]=0; TATA+2 -0.0814799
x[-3,0]=2;x[2,0]=0; TATA+3 -0.0521071
x[-3,0]=2;x[2,0]=0; PROMOTER#1 -0.0646568
$ head athaliana/models/athaliana.1/transitions.myop-prom
BEGIN PROMOTER#2 3.99651
PROMOTER#2 PROMOTER#2 12.9566
PROMOTER#2 TSS-3 4.54967
PROMOTER#2 CDS/INTRON -0.572944
PROMOTER#2 TATA-3 5.3424
TSS-3 TSS-2 6.85829
TSS-2 TSS-1 6.75882
TSS-1 TSS-0 6.80063
TSS-0 TSS+1 6.92607
TSS+1 TSS+2 7.45872
...
TATA-3 TATA-2 5.67511
TATA-2 TATA-1 5.63469
TATA-1 TATA-0 5.48533
TATA-0 TATA+1 5.46865
TATA+1 TATA+2 5.86431
TATA+2 TATA+3 6.09575
TATA+3 TSS-3 -0.49982
TATA+3 PROMOTER#1 6.24643
PROMOTER#1 TSS-3 4.95724
PROMOTER#1 PROMOTER#1 9.70172
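The transitions file shown above has one line per pair of states, followed by a number. To inspect a trained model, one option is to list every transition leaving a given state, largest value first. The sketch below assumes the third column is a weight or score attached to the transition. The source does not state its meaning, and `transitions_from` is a hypothetical helper, not part of TSSFinder.

```shell
#!/bin/sh
# transitions_from: hypothetical helper (not a TSSFinder tool).
# $1 = source state name, $2 = transitions file.
# Prints "<target state> <value>" sorted by value, largest first.
transitions_from() {
    awk -v state="$1" '$1 == state { print $2, $3 }' "$2" |
        LC_ALL=C sort -k2,2 -n -r   # C locale so "." is read as the decimal point
}

# Demonstration on lines copied from the listing above.
sample=$(mktemp "${TMPDIR:-/tmp}/transitions.XXXXXX")
cat > "$sample" <<'EOF'
BEGIN PROMOTER#2 3.99651
PROMOTER#2 PROMOTER#2 12.9566
PROMOTER#2 TSS-3 4.54967
PROMOTER#2 CDS/INTRON -0.572944
PROMOTER#2 TATA-3 5.3424
TSS-3 TSS-2 6.85829
EOF
transitions_from 'PROMOTER#2' "$sample"
```

On a trained model, the call would be `transitions_from 'PROMOTER#2' athaliana/models/athaliana.1/transitions.myop-prom`.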
Prediction command, an example of the full command line used to make predictions with TSSFinder. Add the `tssfinder` and `tssfinder/bin/cli` folders to your `PATH` before running it.
$ tssfinder --model athaliana/models/athaliana.1/ \
--start athaliana/athaliana.dataset0.start.bed \
--genome athaliana/athaliana.tair10.fa \
--output athaliana/output_athaliana.model_0
Note: the output directory must be created before running the command (for example with mkdir -p).
out_file.prediction, TSSFinder prediction output files: out.tss.bed and out.tata.bed. These files store the predicted TSS and TATA-box positions for each gene.
$ head athaliana/output_athaliana.model_0/out.tss.bed
1 1597861 1597862 AT1G05440_1 1 -
1 1766210 1766211 AT1G05840_1 1 -
1 1775639 1775640 AT1G05880_1 1 +
1 1822638 1822639 AT1G06010_1 1 +
1 1855921 1855922 AT1G06120_1 1 +
1 2016002 2016003 AT1G06580_1 1 -
1 2105038 2105039 AT1G06850_1 1 +
1 2174821 2174822 AT1G07090_1 1 -
1 2219461 2219462 AT1G07220_1 1 -
1 2222823 2222824 AT1G07230_1 1 -
...
$ head athaliana/output_athaliana.model_0/out.tata.bed
1 1847847 1847848 AT1G06090_1 1 +
1 1851440 1851441 AT1G06100_1 1 +
1 2339417 2339418 AT1G07600_1 1 -
1 2524044 2524045 AT1G08090_1 1 +
1 2613276 2613277 AT1G08290_1 1 -
1 2630887 2630888 AT1G08340_1 1 +
1 2658685 2658686 AT1G08430_1 1 +
1 2909085 2909086 AT1G09030_1 1 -
1 3794288 3794289 AT1G11300_1 1 +
1 4003810 4003811 AT1G11870_1 1 +
...
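A quick way to look at the predictions is to compare each predicted TSS with the start codon of the same gene. The sketch below joins a start-codon BED file with out.tss.bed on the gene name (column 4) and prints the absolute distance in bases between the two positions. `tss_to_start_distance` is a hypothetical helper, not part of TSSFinder, and the sample records are copied from the start.bed and out.tss.bed listings above.

```shell
#!/bin/sh
# tss_to_start_distance: hypothetical helper (not a TSSFinder tool).
# $1 = start-codon BED file, $2 = predicted TSS BED file.
# Prints "<gene> <strand> <distance in bases>" for genes present in both files.
tss_to_start_distance() {
    awk '
        NR == FNR { start[$4] = $2; next }   # first file: start codon position per gene
        ($4 in start) {
            d = $2 - start[$4]
            if (d < 0) d = -d                # absolute distance, regardless of strand
            print $4, $6, d
        }
    ' "$1" "$2"
}

# Demonstration on records taken from the listings above.
tmpdir=$(mktemp -d "${TMPDIR:-/tmp}/tss.XXXXXX")
cat > "$tmpdir/start.bed" <<'EOF'
1 1597842 1597843 AT1G05440_1 1 -
1 1775654 1775655 AT1G05880_1 1 +
EOF
cat > "$tmpdir/out.tss.bed" <<'EOF'
1 1597861 1597862 AT1G05440_1 1 -
1 1775639 1775640 AT1G05880_1 1 +
EOF
tss_to_start_distance "$tmpdir/start.bed" "$tmpdir/out.tss.bed"
```

For the two sample genes this prints distances of 19 and 15 bases. On real data, pass the file given to `--start` and the out.tss.bed file from the output directory.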