org.biojavax.bio.seq.io
Class INSDseqFormat

java.lang.Object
  extended by org.biojavax.bio.seq.io.RichSequenceFormat.BasicFormat
      extended by org.biojavax.bio.seq.io.INSDseqFormat
All Implemented Interfaces:
SequenceFormat, RichSequenceFormat

public class INSDseqFormat
extends RichSequenceFormat.BasicFormat

Format reader for INSDseq files. This version of INSDseq format generates and writes RichSequence objects. It is loosely based on code from the old, deprecated org.biojava.bio.seq.io.GenbankXmlFormat class. It understands http://www.insdc.org/files/documents/INSD_V1.4.dtd.

It does NOT understand the "sites" keyword in INSDReference_position and interprets it as an empty location instead, because there is no obvious way of representing the "sites" keyword in BioSQL.

Note also that the INSDInterval tags and their associated elements are not read, as they duplicate the information in the INSDFeature_location tag, which is already fully parsed. They are written on output, but there is no guarantee that the INSDInterval tags will exactly match the INSDFeature_location tag, as it is not possible to exactly reflect its contents using these.

Since:
1.5
Author:
Alan Li (code based on his work), Richard Holland, George Waldon

Nested Class Summary
static class INSDseqFormat.Terms
          Implements some INSDseq-specific terms.
 
Nested classes/interfaces inherited from interface org.biojavax.bio.seq.io.RichSequenceFormat
RichSequenceFormat.BasicFormat, RichSequenceFormat.HeaderlessFormat
 
Field Summary
protected static String ACC_VERSION_TAG
           
protected static String ACCESSION_TAG
           
protected static String AUTHOR_TAG
           
protected static String AUTHORS_GROUP_TAG
           
protected static String COMMENT_TAG
           
protected static String CONSORTIUM_TAG
           
protected static String CONTIG_TAG
           
protected static String CREATE_DATE_TAG
           
protected static String CREATE_REL_TAG
           
protected static String DATABASE_XREF_TAG
           
protected static Pattern dbxp
           
protected static String DEFINITION_TAG
           
protected static String DIVISION_TAG
           
protected static String FEATURE_ACCESSION_TAG
           
protected static String FEATURE_FROM_TAG
           
protected static String FEATURE_INTERBP_TAG
           
protected static String FEATURE_INTERVAL_TAG
           
protected static String FEATURE_INTERVALS_GROUP_TAG
           
protected static String FEATURE_ISCOMP_TAG
           
protected static String FEATURE_KEY_TAG
           
protected static String FEATURE_LOC_TAG
           
protected static String FEATURE_OPERATOR_TAG
           
protected static String FEATURE_PARTIAL3_TAG
           
protected static String FEATURE_PARTIAL5_TAG
           
protected static String FEATURE_POINT_TAG
           
protected static String FEATURE_TAG
           
protected static String FEATURE_TO_TAG
           
protected static String FEATUREQUAL_NAME_TAG
           
protected static String FEATUREQUAL_TAG
           
protected static String FEATUREQUAL_VALUE_TAG
           
protected static String FEATUREQUALS_GROUP_TAG
           
protected static String FEATURES_GROUP_TAG
           
static String INSDSEQ_FORMAT
          The name of this format
protected static String INSDSEQ_TAG
           
protected static String INSDSEQS_GROUP_TAG
           
protected static String JOURNAL_TAG
           
protected static String KEYWORD_TAG
           
protected static String KEYWORDS_GROUP_TAG
           
protected static String LENGTH_TAG
           
protected static String LOCUS_TAG
           
protected static String MOLTYPE_TAG
           
protected static String ORGANISM_TAG
           
protected static String OTHER_SEQID_TAG
           
protected static String OTHER_SEQIDS_GROUP_TAG
           
protected static String PUBMED_TAG
           
protected static String REFERENCE_LOCATION_TAG
           
protected static String REFERENCE_POSITION_TAG
           
protected static String REFERENCE_TAG
           
protected static String REFERENCES_GROUP_TAG
           
protected static String REMARK_TAG
           
protected static String SECONDARY_ACCESSION_TAG
           
protected static String SECONDARY_ACCESSIONS_GROUP_TAG
           
protected static String SEQUENCE_TAG
           
protected static String SOURCE_TAG
           
protected static String STRANDED_TAG
           
protected static String TAXONOMY_TAG
           
protected static String TITLE_TAG
           
protected static String TOPOLOGY_TAG
           
protected static String UPDATE_DATE_TAG
           
protected static String UPDATE_REL_TAG
           
protected static Pattern xmlSchema
           
protected static String XREF_DBNAME_TAG
           
protected static String XREF_ID_TAG
           
protected static String XREF_TAG
           
 
Constructor Summary
INSDseqFormat()
           
 
Method Summary
 void beginWriting()
          Informs the writer that we want to start writing.
 boolean canRead(BufferedInputStream stream)
          Check to see if a given stream is in our format.
 boolean canRead(File file)
          Check to see if a given file is in our format.
 void finishWriting()
          Informs the writer that we are done writing.
 String getDefaultFormat()
          getDefaultFormat returns the String identifier for the default sub-format written by a SequenceFormat implementation.
 SymbolTokenization guessSymbolTokenization(BufferedInputStream stream)
          On the assumption that the stream is readable by this format (not checked), attempt to guess which symbol tokenization we should use to read it.
 SymbolTokenization guessSymbolTokenization(File file)
          On the assumption that the file is readable by this format (not checked), attempt to guess which symbol tokenization we should use to read it.
 boolean readRichSequence(BufferedReader reader, SymbolTokenization symParser, RichSeqIOListener rlistener, Namespace ns)
          Reads a sequence from the given buffered reader using the given tokenizer to parse sequence symbols.
 boolean readSequence(BufferedReader reader, SymbolTokenization symParser, SeqIOListener listener)
          Read a sequence and pass data on to a SeqIOListener.
 void writeSequence(Sequence seq, Namespace ns)
          Writes a sequence out to the outputstream given by beginWriting() using the default format of the implementing class.
 void writeSequence(Sequence seq, PrintStream os)
          writeSequence writes a sequence to the specified PrintStream, using the default format.
 void writeSequence(Sequence seq, String format, PrintStream os)
          writeSequence writes a sequence to the specified PrintStream, using the specified format.
 
Methods inherited from class org.biojavax.bio.seq.io.RichSequenceFormat.BasicFormat
getElideComments, getElideFeatures, getElideReferences, getElideSymbols, getLineWidth, getPrintStream, setElideComments, setElideFeatures, setElideReferences, setElideSymbols, setLineWidth, setPrintStream
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

INSDSEQ_FORMAT

public static final String INSDSEQ_FORMAT
The name of this format

See Also:
Constant Field Values

INSDSEQS_GROUP_TAG

protected static final String INSDSEQS_GROUP_TAG
See Also:
Constant Field Values

INSDSEQ_TAG

protected static final String INSDSEQ_TAG
See Also:
Constant Field Values

LOCUS_TAG

protected static final String LOCUS_TAG
See Also:
Constant Field Values

LENGTH_TAG

protected static final String LENGTH_TAG
See Also:
Constant Field Values

TOPOLOGY_TAG

protected static final String TOPOLOGY_TAG
See Also:
Constant Field Values

STRANDED_TAG

protected static final String STRANDED_TAG
See Also:
Constant Field Values

MOLTYPE_TAG

protected static final String MOLTYPE_TAG
See Also:
Constant Field Values

DIVISION_TAG

protected static final String DIVISION_TAG
See Also:
Constant Field Values

UPDATE_DATE_TAG

protected static final String UPDATE_DATE_TAG
See Also:
Constant Field Values

CREATE_DATE_TAG

protected static final String CREATE_DATE_TAG
See Also:
Constant Field Values

UPDATE_REL_TAG

protected static final String UPDATE_REL_TAG
See Also:
Constant Field Values

CREATE_REL_TAG

protected static final String CREATE_REL_TAG
See Also:
Constant Field Values

DEFINITION_TAG

protected static final String DEFINITION_TAG
See Also:
Constant Field Values

DATABASE_XREF_TAG

protected static final String DATABASE_XREF_TAG
See Also:
Constant Field Values

XREF_TAG

protected static final String XREF_TAG
See Also:
Constant Field Values

ACCESSION_TAG

protected static final String ACCESSION_TAG
See Also:
Constant Field Values

ACC_VERSION_TAG

protected static final String ACC_VERSION_TAG
See Also:
Constant Field Values

SECONDARY_ACCESSIONS_GROUP_TAG

protected static final String SECONDARY_ACCESSIONS_GROUP_TAG
See Also:
Constant Field Values

SECONDARY_ACCESSION_TAG

protected static final String SECONDARY_ACCESSION_TAG
See Also:
Constant Field Values

OTHER_SEQIDS_GROUP_TAG

protected static final String OTHER_SEQIDS_GROUP_TAG
See Also:
Constant Field Values

OTHER_SEQID_TAG

protected static final String OTHER_SEQID_TAG
See Also:
Constant Field Values

KEYWORDS_GROUP_TAG

protected static final String KEYWORDS_GROUP_TAG
See Also:
Constant Field Values

KEYWORD_TAG

protected static final String KEYWORD_TAG
See Also:
Constant Field Values

SOURCE_TAG

protected static final String SOURCE_TAG
See Also:
Constant Field Values

ORGANISM_TAG

protected static final String ORGANISM_TAG
See Also:
Constant Field Values

TAXONOMY_TAG

protected static final String TAXONOMY_TAG
See Also:
Constant Field Values

REFERENCES_GROUP_TAG

protected static final String REFERENCES_GROUP_TAG
See Also:
Constant Field Values

REFERENCE_TAG

protected static final String REFERENCE_TAG
See Also:
Constant Field Values

REFERENCE_LOCATION_TAG

protected static final String REFERENCE_LOCATION_TAG
See Also:
Constant Field Values

REFERENCE_POSITION_TAG

protected static final String REFERENCE_POSITION_TAG
See Also:
Constant Field Values

TITLE_TAG

protected static final String TITLE_TAG
See Also:
Constant Field Values

JOURNAL_TAG

protected static final String JOURNAL_TAG
See Also:
Constant Field Values

PUBMED_TAG

protected static final String PUBMED_TAG
See Also:
Constant Field Values

XREF_DBNAME_TAG

protected static final String XREF_DBNAME_TAG
See Also:
Constant Field Values

XREF_ID_TAG

protected static final String XREF_ID_TAG
See Also:
Constant Field Values

REMARK_TAG

protected static final String REMARK_TAG
See Also:
Constant Field Values

AUTHORS_GROUP_TAG

protected static final String AUTHORS_GROUP_TAG
See Also:
Constant Field Values

AUTHOR_TAG

protected static final String AUTHOR_TAG
See Also:
Constant Field Values

CONSORTIUM_TAG

protected static final String CONSORTIUM_TAG
See Also:
Constant Field Values

COMMENT_TAG

protected static final String COMMENT_TAG
See Also:
Constant Field Values

FEATURES_GROUP_TAG

protected static final String FEATURES_GROUP_TAG
See Also:
Constant Field Values

FEATURE_TAG

protected static final String FEATURE_TAG
See Also:
Constant Field Values

FEATURE_KEY_TAG

protected static final String FEATURE_KEY_TAG
See Also:
Constant Field Values

FEATURE_LOC_TAG

protected static final String FEATURE_LOC_TAG
See Also:
Constant Field Values

FEATURE_INTERVALS_GROUP_TAG

protected static final String FEATURE_INTERVALS_GROUP_TAG
See Also:
Constant Field Values

FEATURE_INTERVAL_TAG

protected static final String FEATURE_INTERVAL_TAG
See Also:
Constant Field Values

FEATURE_FROM_TAG

protected static final String FEATURE_FROM_TAG
See Also:
Constant Field Values

FEATURE_TO_TAG

protected static final String FEATURE_TO_TAG
See Also:
Constant Field Values

FEATURE_POINT_TAG

protected static final String FEATURE_POINT_TAG
See Also:
Constant Field Values

FEATURE_ISCOMP_TAG

protected static final String FEATURE_ISCOMP_TAG
See Also:
Constant Field Values

FEATURE_INTERBP_TAG

protected static final String FEATURE_INTERBP_TAG
See Also:
Constant Field Values

FEATURE_ACCESSION_TAG

protected static final String FEATURE_ACCESSION_TAG
See Also:
Constant Field Values

FEATURE_OPERATOR_TAG

protected static final String FEATURE_OPERATOR_TAG
See Also:
Constant Field Values

FEATURE_PARTIAL5_TAG

protected static final String FEATURE_PARTIAL5_TAG
See Also:
Constant Field Values

FEATURE_PARTIAL3_TAG

protected static final String FEATURE_PARTIAL3_TAG
See Also:
Constant Field Values

FEATUREQUALS_GROUP_TAG

protected static final String FEATUREQUALS_GROUP_TAG
See Also:
Constant Field Values

FEATUREQUAL_TAG

protected static final String FEATUREQUAL_TAG
See Also:
Constant Field Values

FEATUREQUAL_NAME_TAG

protected static final String FEATUREQUAL_NAME_TAG
See Also:
Constant Field Values

FEATUREQUAL_VALUE_TAG

protected static final String FEATUREQUAL_VALUE_TAG
See Also:
Constant Field Values

SEQUENCE_TAG

protected static final String SEQUENCE_TAG
See Also:
Constant Field Values

CONTIG_TAG

protected static final String CONTIG_TAG
See Also:
Constant Field Values

dbxp

protected static final Pattern dbxp

xmlSchema

protected static final Pattern xmlSchema
Constructor Detail

INSDseqFormat

public INSDseqFormat()
Method Detail

canRead

public boolean canRead(File file)
                throws IOException
Check to see if a given file is in our format. Some formats may be able to determine this by filename, whilst others may have to open the file and read it to see what format it is in. A file is in INSDseq format if the second XML line contains the phrase "http://www.ebi.ac.uk/dtd/INSD_INSDSeq.dtd".

Specified by:
canRead in interface RichSequenceFormat
Overrides:
canRead in class RichSequenceFormat.BasicFormat
Parameters:
file - the File to check.
Returns:
true if the file is readable by this format, false if not.
Throws:
IOException - in case the file is inaccessible.

guessSymbolTokenization

public SymbolTokenization guessSymbolTokenization(File file)
                                           throws IOException
On the assumption that the file is readable by this format (not checked), attempt to guess which symbol tokenization we should use to read it. For formats that only accept one tokenization, just return it without checking the file. For formats that accept multiple tokenizations, it's up to you how you do it. This implementation always returns a DNA tokenizer.

Specified by:
guessSymbolTokenization in interface RichSequenceFormat
Overrides:
guessSymbolTokenization in class RichSequenceFormat.BasicFormat
Parameters:
file - the File object to guess the format of.
Returns:
a SymbolTokenization to read the file with.
Throws:
IOException - if the file is unrecognisable or inaccessible.

canRead

public boolean canRead(BufferedInputStream stream)
                throws IOException
Check to see if a given stream is in our format. A stream is in INSDseq format if the second XML line contains the phrase "http://www.ebi.ac.uk/dtd/INSD_INSDSeq.dtd".

Parameters:
stream - the BufferedInputStream to check.
Returns:
true if the stream is readable by this format, false if not.
Throws:
IOException - in case the stream is inaccessible.
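
The check described above can be sketched with plain JDK classes. This is not BioJava's implementation, only a minimal illustration under stated assumptions: the class name InsdseqSniffer, the method name looksLikeInsdseq, and the 2000-byte read-ahead limit are all hypothetical. The sketch uses mark/reset so that the stream is left at its original position for a later parse.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of BioJava.
final class InsdseqSniffer {
    // Phrase quoted in the canRead documentation.
    static final String DTD_PHRASE = "http://www.ebi.ac.uk/dtd/INSD_INSDSeq.dtd";
    // Assumed upper bound on the combined length of the first two lines.
    static final int READ_AHEAD = 2000;

    static boolean looksLikeInsdseq(BufferedInputStream stream) throws IOException {
        stream.mark(READ_AHEAD);
        try {
            // Read at most READ_AHEAD bytes so the mark stays valid.
            byte[] head = new byte[READ_AHEAD];
            int total = 0;
            while (total < head.length) {
                int n = stream.read(head, total, head.length - total);
                if (n < 0) {
                    break; // end of stream
                }
                total += n;
            }
            String text = new String(head, 0, total, StandardCharsets.ISO_8859_1);
            // Split into first line, second line, and the remainder.
            String[] lines = text.split("\r?\n", 3);
            return lines.length >= 2 && lines[1].contains(DTD_PHRASE);
        } finally {
            // Rewind so the caller can still parse from the start.
            stream.reset();
        }
    }
}
```

A file-based check could wrap a FileInputStream in a BufferedInputStream and call the same method.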

guessSymbolTokenization

public SymbolTokenization guessSymbolTokenization(BufferedInputStream stream)
                                           throws IOException
On the assumption that the stream is readable by this format (not checked), attempt to guess which symbol tokenization we should use to read it. For formats that only accept one tokenization, just return it without checking the stream. For formats that accept multiple tokenizations, it's up to you how you do it. This implementation always returns a DNA tokenizer.

Parameters:
stream - the BufferedInputStream object to guess the format of.
Returns:
a SymbolTokenization to read the stream with.
Throws:
IOException - if the stream is unrecognisable or inaccessible.

readSequence

public boolean readSequence(BufferedReader reader,
                            SymbolTokenization symParser,
                            SeqIOListener listener)
                     throws IllegalSymbolException,
                            IOException,
                            ParseException
Read a sequence and pass data on to a SeqIOListener.

Parameters:
reader - The stream of data to parse.
symParser - A SymbolTokenization defining a mapping from character data to Symbols.
listener - A listener to notify when data is extracted from the stream.
Returns:
a boolean indicating whether or not the stream contains any more sequences.
Throws:
IllegalSymbolException - if it is not possible to translate character data from the stream into valid BioJava symbols.
IOException - if an error occurs while reading from the stream.
ParseException - if the input cannot be parsed.

readRichSequence

public boolean readRichSequence(BufferedReader reader,
                                SymbolTokenization symParser,
                                RichSeqIOListener rlistener,
                                Namespace ns)
                         throws IllegalSymbolException,
                                IOException,
                                ParseException
Reads a sequence from the given buffered reader using the given tokenizer to parse sequence symbols. Events are passed to the listener, and the namespace used for sequences read is the one given. If the namespace is null, then the default namespace for the parser is used, which may depend on individual implementations of this interface.

Parameters:
reader - the input source
symParser - the tokenizer which understands the sequence being read
rlistener - the listener to send sequence events to
ns - the namespace to read sequences into.
Returns:
true if there is more to read after this, false otherwise.
Throws:
IllegalSymbolException - if the tokenizer couldn't understand one of the sequence symbols in the file.
IOException - if there was a read error.
ParseException - if the input cannot be parsed.
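
The boolean return value lends itself to a simple read loop: keep calling the read method until it reports that nothing more follows. The sketch below runs without BioJava by using a hypothetical single-method interface, RecordFormat, as a stand-in for readRichSequence, and a plain list as a stand-in for the listener. All names in it are assumptions made for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the "returns true if there is more to read" contract.
final class ReadLoopSketch {
    // Stand-in for a format's read method: parses one record, passes it to
    // the listener, and reports whether more input follows.
    interface RecordFormat {
        boolean readRecord(BufferedReader reader, List<String> listener) throws IOException;
    }

    // Toy format: one record per non-empty line.
    static final RecordFormat LINE_FORMAT = (reader, listener) -> {
        String line = reader.readLine();
        if (line != null && !line.isEmpty()) {
            listener.add(line);
        }
        // Peek one character to see whether anything is left.
        reader.mark(1);
        if (reader.read() < 0) {
            return false;
        }
        reader.reset();
        return true;
    };

    // Calls the format repeatedly until it reports the end of input.
    static List<String> readAll(RecordFormat format, BufferedReader reader) throws IOException {
        List<String> records = new ArrayList<>();
        boolean more;
        do {
            more = format.readRecord(reader, records);
        } while (more);
        return records;
    }
}
```

With the real class, the loop body would presumably call readRichSequence with a tokenizer, a listener, and a namespace in place of the toy method.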

beginWriting

public void beginWriting()
                  throws IOException
Informs the writer that we want to start writing. This will do any initialisation required, such as writing the opening tags of an XML file that groups sequences together.

Throws:
IOException - if writing fails.

finishWriting

public void finishWriting()
                   throws IOException
Informs the writer that we are done writing. This will do any finalisation required, such as writing the closing tags of an XML file that groups sequences together.

Throws:
IOException - if writing fails.
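
The begin/write/finish pattern can be illustrated with a small stand-alone writer. This is a sketch, not BioJava's code: the class GroupedXmlWriter, its writeRecord method, and the element names INSDSet and INSDSeq are assumptions made for illustration.

```java
import java.io.PrintStream;

// Hypothetical writer showing why grouped XML output needs explicit
// begin and finish calls around the per-sequence writes.
final class GroupedXmlWriter {
    // Element names are assumed for this sketch.
    private static final String GROUP_TAG = "INSDSet";
    private static final String RECORD_TAG = "INSDSeq";

    private final PrintStream out;

    GroupedXmlWriter(PrintStream out) {
        this.out = out;
    }

    // Writes the opening tag of the element that groups all records.
    void beginWriting() {
        out.print("<" + GROUP_TAG + ">\n");
    }

    // Writes one record; may be called any number of times.
    void writeRecord(String locus) {
        out.print("  <" + RECORD_TAG + ">" + escape(locus) + "</" + RECORD_TAG + ">\n");
    }

    // Writes the closing tag; without it the document is not well-formed.
    void finishWriting() {
        out.print("</" + GROUP_TAG + ">\n");
        out.flush();
    }

    // Minimal escaping of XML text content.
    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }
}
```

The real class follows the same call order: set a PrintStream with setPrintStream, call beginWriting(), call writeSequence(seq, ns) for each sequence, then call finishWriting().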

writeSequence

public void writeSequence(Sequence seq,
                          PrintStream os)
                   throws IOException
writeSequence writes a sequence to the specified PrintStream, using the default format.

Parameters:
seq - the sequence to write out.
os - the printstream to write to.
Throws:
IOException - if writing fails.

writeSequence

public void writeSequence(Sequence seq,
                          String format,
                          PrintStream os)
                   throws IOException
writeSequence writes a sequence to the specified PrintStream, using the specified format.

Parameters:
seq - a Sequence to write out.
format - a String indicating which sub-format of those available from a particular SequenceFormat implementation to use when writing.
os - a PrintStream object.
Throws:
IOException - if an error occurs.

writeSequence

public void writeSequence(Sequence seq,
                          Namespace ns)
                   throws IOException
Writes a sequence out to the output stream given by beginWriting() using the default format of the implementing class. If a namespace is given, sequences will be written with that namespace; otherwise they will be written with the default namespace of the implementing class (which is usually the namespace of the sequence itself). If you pass this method a sequence which is not a RichSequence, it will attempt to convert it using RichSequence.Tools.enrich(). This cannot guarantee a perfect conversion, so it is better to use RichSequences to start with. In this implementation the namespace is ignored, as INSDseq has no concept of it.

Parameters:
seq - the sequence to write
ns - the namespace to write it with
Throws:
IOException - in case it couldn't write something

getDefaultFormat

public String getDefaultFormat()
getDefaultFormat returns the String identifier for the default sub-format written by a SequenceFormat implementation.

Returns:
a String.


Copyright © 2012 BioJava. All Rights Reserved.