preg
Function
Regular expression search of a protein sequence
Description
This searches for matches of a regular expression to a protein sequence.
A regular expression is a way of specifying an ambiguous pattern to
search for. Regular expressions are commonly used in some computer
programming languages and may be more familiar to some users than to
others.
The following is a short guide to regular expressions in EMBOSS:
- ^
  Use this at the start of a pattern to insist that the pattern can only
  match at the start of a sequence (eg '^M' matches a methionine at
  the start of the sequence).
- $
  Use this at the end of a pattern to insist that the pattern can only
  match at the end of a sequence (eg 'R$' matches an arginine at
  the end of the sequence).
- ()
  Groups a pattern. This is commonly used with '|' (eg '(ACD)|(VWY)'
  matches either the first pattern 'ACD' or the second pattern 'VWY').
- |
  This is the OR operator to enable a match to be made to either one
  pattern OR another. There is no AND operator in this version of regular
  expressions.
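The anchors, grouping and alternation described above can be tried out with a short sketch. It uses Python's `re` module as a stand-in for the EMBOSS regular expression engine, which is an assumption: the two agree on these basic constructs but may differ elsewhere. The two sequences are invented for illustration.

```python
import re

# Hypothetical protein fragments, invented for illustration only.
seq_a = "MKVACDLR"
seq_b = "GMKVWYLK"

# '^M' matches only if the sequence starts with a methionine.
starts_with_m = [bool(re.search(r"^M", s)) for s in (seq_a, seq_b)]

# 'R$' matches only if the sequence ends with an arginine.
ends_with_r = [bool(re.search(r"R$", s)) for s in (seq_a, seq_b)]

# '(ACD)|(VWY)' matches either of the two grouped patterns.
alt_hits = [re.search(r"(ACD)|(VWY)", s).group(0) for s in (seq_a, seq_b)]

print(starts_with_m, ends_with_r, alt_hits)
```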
The following quantifier characters specify the number of times that
the character before (in this case 'x') matches:
- x?
  matches 0 or 1 times (ie, '' or 'x')
- x*
  matches 0 or more times (ie, '' or 'x' or 'xx' or 'xxx', etc)
- x+
  matches 1 or more times (ie, 'x' or 'xx' or 'xxx', etc)
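As a quick check of the three quantifiers, the sketch below tests each one against the strings '', 'x', 'xx' and 'xxx'. It assumes Python's `re` module behaves like the EMBOSS engine for these quantifiers; `matches_whole` is a hypothetical helper, not part of EMBOSS.

```python
import re

def matches_whole(pattern: str, text: str) -> bool:
    """Return True if the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None

candidates = ["", "x", "xx", "xxx"]

# 'x?' accepts zero or one 'x'.
optional = [s for s in candidates if matches_whole(r"x?", s)]

# 'x*' accepts any number of 'x', including none.
zero_or_more = [s for s in candidates if matches_whole(r"x*", s)]

# 'x+' needs at least one 'x'.
one_or_more = [s for s in candidates if matches_whole(r"x+", s)]

print(optional, zero_or_more, one_or_more)
```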
Quantifiers can follow any of the following types of character specification:
- x
  any literal character (eg 'A')
- \x
  the character after the backslash is used instead of its normal
  regular expression meaning. This is commonly used to turn off the
  special meaning of the characters '^$()|?*+[]-.'. It may be especially
  useful when searching for gap characters in a sequence (eg '\.' matches
  only a dot character '.')
- [xy]
  matches one of the characters 'x' or 'y'. You may have one or more
  characters in this set.
- [x-z]
  matches any one of the set of characters starting with 'x' and
  ending in 'z' in ASCII order (eg '[A-G]' matches any one of: 'A', 'B',
  'C', 'D', 'E', 'F', 'G')
- [^x-z]
  matches any one character that is NOT in the set of characters from
  'x' to 'z' in ASCII order (eg '[^A-G]' matches anything EXCEPT any one
  of: 'A', 'B', 'C', 'D', 'E', 'F', 'G')
- .
  the dot character matches any single character (eg: 'A.G' matches
  'AAG', 'AaG', 'AZG', 'A-G', 'A G', etc.)
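The character specifications above can be illustrated as follows. This is a sketch that again assumes Python's `re` module as a stand-in for the EMBOSS engine; `first_hit` is a hypothetical helper and the test strings are invented.

```python
import re
from typing import Optional

def first_hit(pattern: str, text: str) -> Optional[str]:
    """Return the first matched substring, or None if nothing matches."""
    m = re.search(pattern, text)
    return m.group(0) if m else None

in_range = first_hit(r"[A-G]", "XYZD")        # first character in A..G
not_in_range = first_hit(r"[^A-G]", "ABCQ")   # first character outside A..G
any_char = first_hit(r"A.G", "TTA-GTT")       # '.' accepts the gap '-'
escaped_miss = first_hit(r"A\.G", "TTA-GTT")  # '\.' accepts only a real dot
escaped_hit = first_hit(r"A\.G", "TTA.GTT")

print(in_range, not_in_range, any_char, escaped_miss, escaped_hit)
```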
Combining some of these features gives these examples from the PROSITE
patterns database:
'[STAGCN][RKH][LIVMAFY]$'
which is the 'Microbodies C-terminal targeting
signal'.
'LP.TG[STGAVDE]'
which is the 'Gram-positive cocci surface proteins
anchoring hexapeptide'.
Regular expressions are case-sensitive.
The pattern 'AAAA' will not match the sequence 'aaaa'.
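The two PROSITE patterns and the case-sensitivity rule can be checked with a small sketch. The sequences are hypothetical and Python's `re` module is assumed to be a reasonable stand-in for the EMBOSS engine for these patterns. Upper-casing the sequence before searching is shown only as one possible workaround, not as something preg does itself.

```python
import re

MICROBODY_SIGNAL = r"[STAGCN][RKH][LIVMAFY]$"
SURFACE_ANCHOR = r"LP.TG[STGAVDE]"

# Hypothetical sequences, invented for illustration.
seq_tail = "MAAGVSKL"    # ends in S-K-L
seq_wall = "KKLPETGEEN"  # contains L-P-E-T-G-E

m1 = re.search(MICROBODY_SIGNAL, seq_tail)
m2 = re.search(SURFACE_ANCHOR, seq_wall)

# Report 1-based start positions, as is usual for sequence coordinates.
signal_hit = (m1.start() + 1, m1.group(0))
anchor_hit = (m2.start() + 1, m2.group(0))

# Matching is case-sensitive: 'AAAA' does not match 'aaaa'.
lower_case_hit = re.search(r"AAAA", "aaaa")

# One workaround (an assumption): upper-case the sequence first.
upper_case_hit = re.search(r"AAAA", "aaaa".upper())

print(signal_hit, anchor_hit, lower_case_hit, bool(upper_case_hit))
```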
Usage
Here is a sample session with preg:
% preg
Regular expression search of a protein sequence
Input sequence(s): tsw:*_rat
Regular expression pattern: IA[QWF]A
Output file [100k_rat.preg]:
Command line arguments
Mandatory qualifiers:
[-sequence] seqall Sequence database USA
[-pattern] regexp Regular expression pattern
[-outfile] outfile Output file name
Optional qualifiers: (none)
Advanced qualifiers: (none)
General qualifiers:
-help boolean Report command line options. More
information on associated and general
qualifiers can be found with -help -verbose
Mandatory qualifiers | Allowed values | Default |
[-sequence] (Parameter 1): Sequence database USA | Readable sequence(s) | Required |
[-pattern] (Parameter 2): Regular expression pattern | Any regular expression pattern is accepted | Required |
[-outfile] (Parameter 3): Output file name | Output file | <sequence>.preg |
Optional qualifiers | Allowed values | Default |
(none) |
Advanced qualifiers | Allowed values | Default |
(none) |
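For embedded scripts, the three mandatory qualifiers listed above can be supplied on the command line rather than at the prompts. The sketch below builds such a command in Python. `build_preg_command` and `run_preg` are hypothetical helpers, and running the command assumes that preg is installed and that the 'tsw' database is configured. Passing the arguments as a list avoids shell interpretation of characters such as '*' and '[' in the USA and the pattern.

```python
import shutil
import subprocess
from typing import List

def build_preg_command(sequence: str, pattern: str, outfile: str) -> List[str]:
    """Build an argument list using the three mandatory qualifiers."""
    return ["preg", "-sequence", sequence, "-pattern", pattern, "-outfile", outfile]

def run_preg(cmd: List[str]) -> int:
    """Run preg if it is on the PATH and return its exit status."""
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError("preg was not found on the PATH")
    return subprocess.run(cmd).returncode

# Same values as the sample session.
cmd = build_preg_command("tsw:*_rat", "IA[QWF]A", "100k_rat.preg")
print(" ".join(cmd))
```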
Input file format
preg reads any protein sequence USA.
Input files for usage example
'tsw:*_rat' is a wildcard USA that selects every entry in the example
protein database 'tsw' whose name ends in '_rat'.
Output file format
Output files for usage example
File: 100k_rat.preg
preg search of tsw:*_rat with pattern IA[QWF]A
Matches in 100K_RAT
100K_RAT 390 IAQA
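A script that consumes this output might parse it along the following lines. The sketch assumes, based only on the example above, that each hit line holds three whitespace-separated fields: the sequence name, the match position and the matched text. `Hit` and `parse_preg_output` are hypothetical names.

```python
from typing import List, NamedTuple

class Hit(NamedTuple):
    name: str
    position: int
    match: str

def parse_preg_output(text: str) -> List[Hit]:
    """Collect hit lines, skipping the header and 'Matches in' lines."""
    hits = []
    for line in text.splitlines():
        fields = line.split()
        # Assumed hit layout: name, integer position, matched text.
        if len(fields) == 3 and fields[1].isdigit():
            hits.append(Hit(fields[0], int(fields[1]), fields[2]))
    return hits

example = """preg search of tsw:*_rat with pattern IA[QWF]A
Matches in 100K_RAT
100K_RAT 390 IAQA
"""

hits = parse_preg_output(example)
print(hits)
```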
Data files
None.
Notes
None.
References
None.
Warnings
Regular expressions are case-sensitive.
The pattern 'AAAA' will not match the sequence 'aaaa'.
Diagnostic Error Messages
None.
Exit status
It always exits with a status of 0.
Known bugs
None.
Program name | Description |
antigenic | Finds antigenic sites in proteins |
digest | Protein proteolytic enzyme or reagent cleavage digest |
fuzzpro | Protein pattern search |
fuzztran | Protein pattern search after translation |
helixturnhelix | Report nucleic acid binding motifs |
oddcomp | Finds protein sequence regions with a biased composition |
patmatdb | Search a protein sequence with a motif |
patmatmotifs | Search a PROSITE motif database with a protein sequence |
pepcoil | Predicts coiled coil regions |
pestfind | Finds PEST motifs as potential proteolytic cleavage sites |
pscan | Scans proteins using PRINTS |
sigcleave | Reports protein signal cleavage sites |
Other EMBOSS programs allow you to search for simple patterns and may be
easier for the user who has never used regular expressions before:
- fuzznuc - Nucleic acid pattern search
- fuzzpro - Protein pattern search
- fuzztran - Protein pattern search after translation
Author(s)
This application was written by Peter Rice (pmr@sanger.ac.uk) Informatics
Division, The Sanger Centre, Wellcome Trust Genome Campus, Hinxton,
Cambridge, CB10 1SA, UK.
History
Written (1999) - Peter Rice
Target users
This program is intended to be used by everyone and everything,
from naive users to embedded scripts.
Comments