wordcount

 

Function

Counts words of a specified size in a DNA sequence

Description

Displays all the words of the specified length with the number of
times it occurs.

Usage

Here is a sample session with wordcount


% wordcount tembl:rnu68037 -wordsize=3 
Counts words of a specified size in a DNA sequence
Output file [rnu68037.wordcount]: 

Go to the input files for this example
Go to the output files for this example

Command line arguments

   Mandatory qualifiers:
  [-sequence]          sequence   Sequence USA
   -wordsize           integer    Word size
   -outfile            outfile    Output file name

   Optional qualifiers: (none)
   Advanced qualifiers: (none)
   General qualifiers:
  -help                boolean    Report command line options. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose


Mandatory qualifiers Allowed values Default
[-sequence]
(Parameter 1)
Sequence USA Readable sequence Required
-wordsize Word size Integer 2 or more 4
-outfile Output file name Output file <sequence>.wordcount
Optional qualifiers Allowed values Default
(none)
Advanced qualifiers Allowed values Default
(none)

Input file format

wordcount reads any sequence USA.

Input files for usage example

'tembl:rnu68037' is a sequence entry in the example nucleic acid database 'tembl'

Database entry: tembl:rnu68037

ID   RNU68037   standard; RNA; ROD; 1218 BP.
XX
AC   U68037;
XX
SV   U68037.1
XX
DT   23-SEP-1996 (Rel. 49, Created)
DT   04-MAR-2000 (Rel. 63, Last updated, Version 2)
XX
DE   Rattus norvegicus EP1 prostanoid receptor mRNA, complete cds.
XX
KW   .
XX
OS   Rattus norvegicus (Norway rat)
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia;
OC   Eutheria; Rodentia; Sciurognathi; Muridae; Murinae; Rattus.
XX
RN   [1]
RP   1-1218
RA   Abramovitz M., Boie Y.;
RT   "Cloning of the rat EP1 prostanoid receptor";
RL   Unpublished.
XX
RN   [2]
RP   1-1218
RA   Abramovitz M., Boie Y.;
RT   ;
RL   Submitted (26-AUG-1996) to the EMBL/GenBank/DDBJ databases.
RL   Biochemistry & Molecular Biology, Merck Frosst Center for Therapeutic
RL   Research, P. O. Box 1005, Pointe Claire - Dorval, Quebec H9R 4P8, Canada
XX
DR   SWISS-PROT; P70597; PE21_RAT.
XX
FH   Key             Location/Qualifiers
FH
FT   source          1..1218
FT                   /db_xref="taxon:10116"
FT                   /organism="Rattus norvegicus"
FT                   /strain="Sprague-Dawley"
FT   CDS             1..1218
FT                   /codon_start=1
FT                   /db_xref="SWISS-PROT:P70597"
FT                   /note="family 1 G-protein coupled receptor"
FT                   /product="EP1 prostanoid receptor"
FT                   /protein_id="AAB07735.1"
FT                   /translation="MSPYGLNLSLVDEATTCVTPRVPNTSVVLPTGGNGTSPALPIFSM
FT                   TLGAVSNVLALALLAQVAGRLRRRRSTATFLLFVASLLAIDLAGHVIPGALVLRLYTAG
FT                   RAPAGGACHFLGGCMVFFGLCPLLLGCGMAVERCVGVTQPLIHAARVSVARARLALALL
FT                   AAMALAVALLPLVHVGHYELQYPGTWCFISLGPPGGWRQALLAGLFAGLGLAALLAALV
FT                   CNTLSGLALLRARWRRRRSRRFRENAGPDDRRRWGSRGLRLASASSASSITSTTAALRS
FT                   SRGGGSARRVHAHDVEMVGQLVGIMVVSCICWSPLLVLVVLAIGGWNSNSLQRPLFLAV
FT                   RLASWNQILDPWVYILLRQAMLRQLLRLLPLRVSAKGGPTELSLTKSAWEASSLRSSRH
FT                   SGFSHL"
XX
SQ   Sequence 1218 BP; 162 A; 397 C; 387 G; 272 T; 0 other;
     atgagcccct acgggcttaa cctgagccta gtggatgagg caacaacgtg tgtaacaccc        60
     agggtcccca atacatctgt ggtgctgcca acaggcggta acggcacatc accagcgctg       120
     cctatcttct ccatgacgct gggtgctgtg tccaacgtgc tggcgctggc gctgctggcc       180
     caggttgcag gcagactgcg gcgccgccgc tcgactgcca ccttcctgtt gttcgtcgcc       240
     agcctgcttg ccatcgacct agcaggccat gtgatcccgg gcgccttggt gcttcgcctg       300
     tatactgcag gacgtgcgcc cgctggcggg gcctgtcatt tcctgggcgg ctgtatggtc       360
     ttctttggcc tgtgcccact tttgcttggc tgtggcatgg ccgtggagcg ctgcgtgggt       420
     gtcacgcagc cgctgatcca cgcggcgcgc gtgtccgtag cccgcgcacg cctggcacta       480
     gccctgctgg ccgccatggc tttggcagtg gcgctgctgc cactagtgca cgtgggtcac       540
     tacgagctac agtaccctgg cacttggtgt ttcattagcc ttgggcctcc tggaggttgg       600
     cgccaggcgt tgcttgcggg cctcttcgcc ggccttggcc tggctgcgct ccttgccgca       660
     ctagtgtgta atacgctcag cggcctggcg ctccttcgtg cccgctggag gcggcgtcgc       720
     tctcgacgtt tccgagagaa cgcaggtccc gatgatcgcc ggcgctgggg gtcccgtgga       780
     ctccgcttgg cctccgcctc gtctgcgtca tccatcactt caaccacagc tgccctccgc       840
     agctctcggg gaggcggctc cgcgcgcagg gttcacgcac acgacgtgga aatggtgggc       900
     cagctcgtgg gcatcatggt ggtgtcgtgc atctgctgga gccccctgct ggtattggtg       960
     gtgttggcca tcgggggctg gaactctaac tccctgcagc ggccgctctt tctggctgta      1020
     cgcctcgcgt cgtggaacca gatcctggac ccatgggtgt acatcctgct gcgccaggct      1080
     atgctgcgcc aacttcttcg cctcctaccc ctgagggtta gtgccaaggg tggtccaacg      1140
     gagctgagcc taaccaagag tgcctgggag gccagttcac tgcgtagctc ccggcacagt      1200
     ggcttcagcc acttgtga                                                    1218
//

Output file format

Output files for usage example

File: rnu68037.wordcount

ctg	54
tgg	53
gcc	53
ggc	51
cgc	47
gct	47
gtg	40
tgc	39
cct	38
gcg	36
cca	29
ggg	26
cag	25
ctt	25
tcc	25
ggt	24
ccc	24
ctc	23
tgt	23
gca	22
cgt	22
ccg	22
cac	22
agc	21
acg	19
ttg	19
cgg	19
tcg	18
ttc	17
cat	17
agg	17
act	16
gtc	16
gag	16
aac	15
gga	14
atc	14
tct	14
tca	13
cta	13
atg	12
gtt	11
acc	11
gta	11
aca	10
tac	10
tga	10
caa	10
gac	9
agt	9
tag	9
ttt	8
cga	7
gat	6
taa	6
tat	5
aga	5
gaa	4
aat	3
ata	3
tta	3
att	3
aag	2
aaa	1

The file simply consists of two columns, separated by spaces or TAB characters.

The first column consists of all the possible words of size wordsize. The second column consists of the count of those words in the input sequence.

Data files

None.

Notes

None.

References

None.

Warnings

None.

Diagnostic Error Messages

None.

Exit status

0 if successful.

Known bugs

None.

See also

Program nameDescription
bananaBending and curvature plot in B-DNA
btwistedCalculates the twisting in a B-DNA sequence
chaosCreate a chaos game representation plot for a sequence
compseqCounts the composition of dimer/trimer/etc words in a sequence
danCalculates DNA RNA/DNA melting temperature
freakResidue/base frequency table or plot
isochorePlots isochores in large DNA sequences
sirnaFinds siRNA duplexes in mRNA

Author(s)

This application was written by Ian Longden (il@sanger.ac.uk) Informatics Division, The Sanger Centre, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK.

History

Completed 27th November 1998.

Target users

This program is intended to be used by everyone and everything, from naive users to embedded scripts.

Comments