Package UniGene
Parse UniGene flat files, such as the Hs.data file.
Here is an overview of the flat file format that this parser deals with:
Line types/qualifiers:

    ID           UniGene cluster ID
    TITLE        Title for the cluster
    GENE         Gene symbol
    CYTOBAND     Cytological band
    EXPRESS      Tissues of origin for ESTs in cluster
    RESTR_EXPR   Single tissue or development stage contributes
                 more than half the total EST frequency for this gene.
    GNM_TERMINUS Genomic confirmation of presence of a 3' terminus;
                 T if a non-templated polyA tail is found among
                 a cluster's sequences; else
                 I if templated As are found in genomic sequence or
                 S if a canonical polyA signal is found on
                 the genomic sequence
    GENE_ID      Entrez gene identifier associated with at least one
                 sequence in this cluster;
                 to be used instead of LocusLink.
    LOCUSLINK    LocusLink identifier associated with at least one
                 sequence in this cluster;
                 deprecated in favor of GENE_ID
    HOMOL        Homology
    CHROMOSOME   Chromosome. For plants, CHROMOSOME refers to mapping
                 on the Arabidopsis genome.
    STS          STS
        ACC=         GenBank/EMBL/DDBJ accession number of STS
                     [optional field]
        UNISTS=      Identifier in NCBI's UniSTS database
    TXMAP        Transcript map interval
        MARKER=      Marker found on at least one sequence in this
                     cluster
        RHPANEL=     Radiation Hybrid panel used to place marker
    PROTSIM      Protein similarity data for the sequence with
                 highest-scoring protein similarity in this cluster
        ORG=         Organism
        PROTGI=      Sequence GI of protein
        PROTID=      Sequence ID of protein
        PCT=         Percent alignment
        ALN=         Length of aligned region (aa)
    SCOUNT       Number of sequences in the cluster
    SEQUENCE     Sequence
        ACC=         GenBank/EMBL/DDBJ accession number of sequence
        NID=         Unique nucleotide sequence identifier (gi)
        PID=         Unique protein sequence identifier (used for
                     non-ESTs)
        CLONE=       Clone identifier (used for ESTs only)
        END=         End (5'/3') of clone insert read (used for
                     ESTs only)
        LID=         Library ID; see Hs.lib.info for library name
                     and tissue
        MGC=         5' CDS-completeness indicator; if present, the
                     clone associated with this sequence is believed
                     CDS-complete. A value greater than 511 is the gi
                     of the CDS-complete mRNA matched by the EST,
                     otherwise the value is an indicator of the
                     reliability of the test indicating CDS
                     completeness; higher values indicate more
                     reliable CDS-completeness predictions.
        SEQTYPE=     Description of the nucleotide sequence.
                     Possible values are mRNA, EST and HTC.
        TRACE=       The Trace ID of the EST sequence, as provided by
                     NCBI Trace Archive
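
The KEY= qualifiers listed under STS, TXMAP, PROTSIM and SEQUENCE can be collected into a dictionary, and the MGC rule above can be applied to the result. The following is a minimal sketch, not the package's own code: it assumes that the qualifiers on one line are separated by semicolons (the overview does not state the separator), and the names parse_qualifiers and describe_mgc, along with the sample identifiers, are hypothetical.

```python
def parse_qualifiers(value):
    """Split a string such as 'ACC=X; NID=Y' into {'ACC': 'X', 'NID': 'Y'}.

    Assumption: qualifiers are separated by ';'. Items without '=' are skipped.
    """
    fields = {}
    for item in value.split(";"):
        item = item.strip()
        if not item:
            continue
        key, sep, val = item.partition("=")
        if sep:
            fields[key.strip()] = val.strip()
    return fields


def describe_mgc(mgc):
    """Interpret an MGC= value following the rule in the overview:
    a value greater than 511 is a gi, otherwise a reliability indicator."""
    number = int(mgc)
    kind = "gi" if number > 511 else "reliability"
    return kind, number


# Made-up SEQUENCE value, for illustration only.
sample = "ACC=XX000001.1; NID=g12345; CLONE=IMAGE:1; END=5'; LID=42; MGC=6; SEQTYPE=EST"
seq = parse_qualifiers(sample)
mgc_kind, mgc_value = describe_mgc(seq["MGC"])
```

Keeping the qualifiers in a plain dictionary leaves optional fields, such as ACC= on STS lines, simply absent rather than set to a placeholder.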
UG_INDENT = 12
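
The value of UG_INDENT suggests that the line-type tag sits in a fixed-width column at the start of each line. The sketch below assumes exactly that: the first 12 characters hold the tag and the remainder holds the value. The helper split_line and the sample lines are hypothetical, and the package's own parser may split lines differently.

```python
UG_INDENT = 12  # width of the tag column, taken from the module variable


def split_line(line):
    """Return (tag, value) for one flat-file line, splitting at UG_INDENT."""
    tag = line[:UG_INDENT].rstrip()
    value = line[UG_INDENT:].strip()
    return tag, value


# Made-up lines, padded so that each value starts at column UG_INDENT.
lines = [
    "ID".ljust(UG_INDENT) + "Hs.0000",
    "TITLE".ljust(UG_INDENT) + "Example cluster title",
    "SCOUNT".ljust(UG_INDENT) + "2",
]
record = dict(split_line(line) for line in lines)
```

Splitting on a fixed column rather than on whitespace keeps a 12-character tag such as GNM_TERMINUS separate from a value that follows it without a space.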
StringTypes = (<type 'str'>, <type 'unicode'>)
__package__ = 'Bio.UniGene'
xml_support = 1