Tutorials

  1. Quick Start

Quick Start

Before following this tutorial, install the TSSFinder command-line tools and pre-trained models.

In this tutorial you will train a new transcription start site (TSS) predictor on the A. thaliana genome and make some predictions. It is also a good way to verify that TSSFinder is installed properly on your system.

This tutorial assumes that you are using OS X or Ubuntu Linux.

First, download the training_sets files. You can download them directly from your terminal:

$ curl -L -o training_sets.zip xxx

After that, extract it:

$ unzip training_sets.zip

This generates the training_sets/athaliana directory. The paths in the commands below are relative to training_sets. The directory contains the following files:

  • train.fa, a FASTA file that contains the genomic sequences
$ head athaliana/athaliana.tair10.fa
>1:AT1G05400_1
GTCACAAAAACTCTTCAGCCTTGTTAAGATCATCCAAAATCTCAGTCTTAACCTCATCA
GGAAGTGGTGCAGAAGGACCTGCCTTTGAGTAAAATTTCGCCAAAGCGCGGATTACATT
CTCCAGCTCCACGTATGATTCCTGAAACACATGAATATTTCAATCAATTTCAATTGATC
AACTTTCAATAAAAGAAGAAGATAAAGATTAAACAAACATACTTCTCCGGCAACAGTCT
TTTGTCCTCTCCAATTGCTCAGATAATCACGAATGGATTCTTTAGCAGCATCAGCACTT
CTTCTGAACTTAGCATTGTCTTTAGGGTTCTCTTTCAATGACTCCCTCATGGTTTTGAC
AACCTCTCTTGCAGATTTCAAGTAAGCTTTTGGTAACTCTTTGCCAGACTTTGTTCTCT
CATTAGGATCAAACAAAGTCTTGAAAGCACCAACAACTCCTTCTCCTTCGTCTAACTTT
TCCTCTGACGCCTGAGCTTGTAATCCAGAAAACTCAATTGTGGCCATGAACGAAGAAGC
ACCAACAACACATTTCAAGAAACCACGACGATCAAACAAAGATGATGTTTGGCCTTGAC
GAAGCTTTTCATCTATTTTAATTTGCAAAGTTTAAAGCAGTTAAATAAAACTTGTAGAT
AAGTTTAGAAATCGTGAAATCAATCCATCTCTTATAAAAAACTAAATCATTGAAGAAAC
ATGTTTGATTTCAACATTGGATGATATAAACAACATTGACATTAAAACAGAAAAGAGTT
TTCAAAAAAAAAAAAAAAAAAACACAGAAAAGAGATAAAGAAGAAGAAGAGTGATACTT
TGACATTGTGGCTTCGATTTCATATGGTGATGCACCAAAGTAGGAGAAAAGTTCATGAC
GGCTACAAGGAAACCCATGAATCTTTCTGTCTGATAACAACAGGAACCTCAGAATCAGA
TATAGATGAAAAAGAGGATGGTTCTAGTTGTTTGCGAAAATGGCGAATTTGCCCTTCAA
GTTGATGATAATAAAATAAAAAACGCTAGAAGGAGGGTTCGAACCTCCGACCTTGTGGT
TAACAGCCACACGCTCTAACCAGCTGAGCTATTCCAGCTACTTGTTATGTTATGTGTAT
TTATGTTTTATATATTGTCTTATCATTTATTAAATTGAGTTATTGAATTTCTTTTTAGT
CGAAACGGTGTTTTTTTTTAATTCACTTACAAAAAATAAGGATTGGTGCATGGGAAAGG
CCGATTCATAACAAATTGTAGGCCCATTAAAACGAAGCCCATTATCAACCCTATATATA
GTGCGGCGTCGGAGTTTTTATATTTTTCCAAACAGGAATTAAGAAACCCCTCTAGTCGA
CGATATCGAGATCGATCCAACCTTTACCAACGCTTTATATCTCTTCTAAGGTAAAATTT
GTCGCTCCTTATCTCTCTCACTGTTTCTTAGAAAATTCATCTTAATTATTATATAGATA
GAGAACCCTAAAAGTTCAAGGTTGTATTTGCGATCGAAACTATCAGCGGCCGTGTTCCG
AACCAGTTTTGAATTTCAATTTTTTTTTTTGAAGGTTAAAAGTCTTTGATTCGTGAATG
ATGATGGATCCGCCGCCTACGAAGGAGTTTGCTTTGGCCCCAGTATATGTGTATTGGGA
CATGAAGAGGTGTCCGGTTCCAGATGACTATGATGCTCGTCGGGTTGGTCCGTGTATTA
AACGGATTTTAAGGAAATCAGGCTACAATGGTCCTGTCACCATCACTGCTGTTGGCTCA
CTATCAAAGGTCCCTCGTGACATCCTTGAAGTGGTCTCTTCCACTGGAATCTCTCTTTA
TCACGAGGTCGCCAGTAAGCATCGATCTTCTCATCCTTATTCCCTTTTTATTTCCTTTA
TTAGAGAGAGAGAAAGAGTCTCTCTAAGTATTTAGTTTCTGAAACAGGTAGGAAATGCA
TGGTTTCGTGTTCCCTTGACCATGATGATTTTTAGCCACGTCCAGCTACTTTG
...
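
For a quick overview of the FASTA file beyond what `head` shows, one option is to print the length of every record. The following is a minimal sketch using standard awk. The helper name `fasta_lengths` is hypothetical and is not part of TSSFinder.

```shell
# fasta_lengths FILE: hypothetical helper, not shipped with TSSFinder.
# Prints one line per FASTA record: the record name, a tab, and the
# total number of sequence characters in that record.
fasta_lengths() {
  awk '
    # A header line starts a new record; flush the previous one first.
    /^>/ {
      if (name != "") print name "\t" len
      name = substr($0, 2)
      len = 0
      next
    }
    # Any other line is sequence data belonging to the current record.
    { len += length($0) }
    # Flush the last record at end of input.
    END { if (name != "") print name "\t" len }
  ' "$1"
}
```

It could be used as `fasta_lengths athaliana/athaliana.tair10.fa | head`.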
  • tss.bed, a BED file that describes the TSS of each gene
$ head athaliana/athaliana.dataset1.tss.bed
1	1584074	1584075	AT1G05400_1	1	+
2	3496151	3496152	AT2G07739_1	1	-
3	20185206	20185207	AT3G54520_1	1	-
1	2339384	2339385	AT1G07600_1	1	-
5	18710592	18710593	AT5G46150_1	1	-
5	4056153	4056154	AT5G12850_1	1	+
4	10103987	10103988	AT4G18280_1	1	+
2	7341118	7341119	AT2G16930_1	1	+
4	14606004	14606005	AT4G29870_1	1	+
5	16385303	16385304	AT5G40890_1	1	-
...
  • start.bed, a BED file that describes the start codon of each gene
$ head athaliana/athaliana.dataset1.start.bed
1	1597842	1597843	AT1G05440_1	1	-
1	1766149	1766150	AT1G05840_1	1	-
1	1775654	1775655	AT1G05880_1	1	+
1	1823346	1823347	AT1G06010_1	1	+
1	1855962	1855963	AT1G06120_1	1	+
1	2015941	2015942	AT1G06580_1	1	-
1	2105232	2105233	AT1G06850_1	1	+
1	2174791	2174792	AT1G07090_1	1	-
1	2219378	2219379	AT1G07220_1	1	-
1	2222777	2222778	AT1G07230_1	1	-
...
  • tata-box.bed, a BED file that describes the TATA-box of each gene
$ head athaliana/athaliana.tata.bed
1	1597898	1597899	AT1G05440_1	1	-
1	1599879	1599880	AT1G05450_1	1	+
1	1605108	1605109	AT1G05460_1	1	-
1	1635711	1635712	AT1G05520_1	1	-
1	1637884	1637885	AT1G05530_1	1	-
1	1647180	1647181	AT1G05560_1	1	-
1	1659469	1659470	AT1G05570_1	1	-
1	1662216	1662217	AT1G05575_1	1	-
1	1664464	1664465	AT1G05577_1	1	-
1	1669320	1669321	AT1G05590_1	1	+
...
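
Before training, it can be worth checking that each BED file is well formed. The sketch below is not part of TSSFinder: the helper name `check_bed` is hypothetical, and the rules it enforces (six tab-separated columns, a one-base interval, a `+` or `-` strand) are assumptions inferred from the sample lines above, not a documented requirement.

```shell
# check_bed FILE: hypothetical helper, not shipped with TSSFinder.
# Prints every line that does not match the layout of the samples above
# and returns a non-zero status if at least one such line is found.
check_bed() {
  awk -F'\t' '
    # Assumed columns: chromosome, start, end, gene id, score, strand.
    NF != 6 {
      print FILENAME ":" NR ": expected 6 tab-separated columns"
      bad = 1
      next
    }
    # Assumption: every feature is a single base, so end = start + 1.
    $3 - $2 != 1 {
      print FILENAME ":" NR ": interval is not one base long"
      bad = 1
    }
    $6 != "+" && $6 != "-" {
      print FILENAME ":" NR ": unknown strand " $6
      bad = 1
    }
    END { exit bad + 0 }
  ' "$1"
}
```

For example, `check_bed athaliana/athaliana.tata.bed && echo ok` prints `ok` only when no line is reported.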
  • Training command, an example command line for training TSSFinder. Add the `tssfinder` and `tssfinder/bin/cli` folders to your `PATH` before running it.

$ tssfinder-train --model athaliana/models/athaliana.1/ \
                --start athaliana/athaliana.dataset1.start.bed \
                --tata athaliana/athaliana.tata.bed \
                --tss athaliana/athaliana.dataset1.tss.bed \
                --genome athaliana/athaliana.tair10.fa
  • out_file.train, the TSSFinder training output files: observation.myop-prom and transitions.myop-prom. These files store the main characteristics of the TSS signal and the core promoter region.
$ head athaliana/models/athaliana.1/observation.myop-prom
x[0,1]=16;	BEGIN	4.72831
x[0,1]=16;	CDS/INTRON	4.45971
x[0,1]=15;	PROMOTER#2	-0.0977334
x[0,1]=15;	TSS-3	-0.309565
x[0,1]=15;	TSS-2	-0.0631545
x[0,1]=15;	TSS-1	0.0579794
x[0,1]=15;	TSS-0	0.134232
x[0,1]=15;	TSS+1	0.162803
x[0,1]=15;	TSS+2	0.147902
x[0,1]=15;	TSS+3	0.146342
...
x[-3,0]=2;x[2,0]=0;	TSS+3	-0.00913836
x[-3,0]=2;x[2,0]=0;	5'UTR	-0.0516346
x[-3,0]=2;x[2,0]=0;	TATA-3	0.0696071
x[-3,0]=2;x[2,0]=0;	TATA-2	0.0918243
x[-3,0]=2;x[2,0]=0;	TATA-1	0.0620057
x[-3,0]=2;x[2,0]=0;	TATA-0	0.0783161
x[-3,0]=2;x[2,0]=0;	TATA+1	0.00231838
x[-3,0]=2;x[2,0]=0;	TATA+2	-0.0814799
x[-3,0]=2;x[2,0]=0;	TATA+3	-0.0521071
x[-3,0]=2;x[2,0]=0;	PROMOTER#1	-0.0646568

$ head athaliana/models/athaliana.1/transitions.myop-prom

BEGIN	PROMOTER#2	3.99651
PROMOTER#2	PROMOTER#2	12.9566
PROMOTER#2	TSS-3	4.54967
PROMOTER#2	CDS/INTRON	-0.572944
PROMOTER#2	TATA-3	5.3424
TSS-3	TSS-2	6.85829
TSS-2	TSS-1	6.75882
TSS-1	TSS-0	6.80063
TSS-0	TSS+1	6.92607
TSS+1	TSS+2	7.45872
...
TATA-3	TATA-2	5.67511
TATA-2	TATA-1	5.63469
TATA-1	TATA-0	5.48533
TATA-0	TATA+1	5.46865
TATA+1	TATA+2	5.86431
TATA+2	TATA+3	6.09575
TATA+3	TSS-3	-0.49982
TATA+3	PROMOTER#1	6.24643
PROMOTER#1	TSS-3	4.95724
PROMOTER#1	PROMOTER#1	9.70172
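
To explore the transitions file, it can help to list the transitions that leave a single state. The sketch below assumes the three tab-separated columns shown above (source state, target state, numeric value) and sorts by the third column without interpreting it. The helper name `transitions_from` is hypothetical and is not a TSSFinder command.

```shell
# transitions_from STATE FILE: hypothetical helper, not shipped with TSSFinder.
# Prints the target state and value of every transition leaving STATE,
# with the largest value first.
transitions_from() {
  # Select rows whose first column equals STATE, then sort numerically
  # (descending) on the value column. LC_ALL=C keeps "." as the decimal mark.
  awk -F'\t' -v state="$1" '$1 == state { print $2 "\t" $3 }' "$2" |
    LC_ALL=C sort -t "$(printf '\t')" -k2,2nr
}
```

For example, `transitions_from 'PROMOTER#2' athaliana/models/athaliana.1/transitions.myop-prom` lists the states reachable from `PROMOTER#2`.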

  • Prediction command, an example command line for running TSSFinder predictions. Add the `tssfinder` and `tssfinder/bin/cli` folders to your `PATH` before running it.

$ tssfinder --model athaliana/models/athaliana.1/ \
          --start athaliana/athaliana.dataset0.start.bed \
          --genome athaliana/athaliana.tair10.fa \
          --output athaliana/output_athaliana.model_0

Note: the output directory must be created before running this command.
  • out_file.prediction, the TSSFinder prediction output files: out.tss.bed and out.tata.bed. These BED files list the predicted TSS and TATA-box positions for each gene.
$ head athaliana/output_athaliana.model_0/out.tss.bed

1	1597861	1597862	AT1G05440_1	1	-
1	1766210	1766211	AT1G05840_1	1	-
1	1775639	1775640	AT1G05880_1	1	+
1	1822638	1822639	AT1G06010_1	1	+
1	1855921	1855922	AT1G06120_1	1	+
1	2016002	2016003	AT1G06580_1	1	-
1	2105038	2105039	AT1G06850_1	1	+
1	2174821	2174822	AT1G07090_1	1	-
1	2219461	2219462	AT1G07220_1	1	-
1	2222823	2222824	AT1G07230_1	1	-
...
$ head athaliana/output_athaliana.model_0/out.tata.bed

1	1847847	1847848	AT1G06090_1	1	+
1	1851440	1851441	AT1G06100_1	1	+
1	2339417	2339418	AT1G07600_1	1	-
1	2524044	2524045	AT1G08090_1	1	+
1	2613276	2613277	AT1G08290_1	1	-
1	2630887	2630888	AT1G08340_1	1	+
1	2658685	2658686	AT1G08430_1	1	+
1	2909085	2909086	AT1G09030_1	1	-
1	3794288	3794289	AT1G11300_1	1	+
1	4003810	4003811	AT1G11870_1	1	+
...
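
One way to sanity-check the predictions is to measure how far each predicted TSS lies from the start codon of the same gene. The sketch below joins the two BED files on the gene id in column 4 and reports a strand-aware distance. The helper name `tss_to_start_distance` is hypothetical, and the convention that a positive value means "TSS upstream of the start codon" is chosen here for illustration.

```shell
# tss_to_start_distance START_BED TSS_BED: hypothetical helper, not
# shipped with TSSFinder. For each gene found in both files, prints the
# gene id, the strand, and the distance in bases between the predicted
# TSS and the start codon (positive when the TSS is upstream).
tss_to_start_distance() {
  awk -F'\t' '
    # Skip blank or incomplete lines in either file.
    NF < 6 { next }
    # First file: remember the start codon coordinate of each gene.
    NR == FNR { start[$4] = $2; next }
    # Second file: upstream means a smaller coordinate on "+" and a
    # larger coordinate on "-".
    $4 in start {
      d = ($6 == "+") ? start[$4] - $2 : $2 - start[$4]
      print $4 "\t" $6 "\t" d
    }
  ' "$1" "$2"
}
```

With the files from this tutorial it could be called as `tss_to_start_distance athaliana/athaliana.dataset0.start.bed athaliana/output_athaliana.model_0/out.tss.bed`. Taking the sample lines shown above for AT1G05440_1 (start codon at 1597842, predicted TSS at 1597861, strand `-`), the reported distance would be 19.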