BioJava:CookBook:PDB:mmcif

How to parse mmCIF files using BioJava

What is mmCIF?

The Protein Data Bank (PDB) has been distributing its archival files as PDB files for a long time. The PDB file format is based on “punchcard”-style rules for storing data in a flat file. With the increasing complexity of the macromolecules that are being resolved experimentally, this file format can no longer represent some of the more complex structures. As such, the wwPDB recently announced the transition from PDB to mmCIF/PDBx as the principal deposition and dissemination file format.

The mmCIF file format has been around for some time (see 1, 2), and BioJava has supported it for several years. This tutorial provides a quick introduction to parsing mmCIF files with BioJava.

The basics

BioJava provides both an mmCIF parser and a data model, so PDB and mmCIF files are read into a biologically and chemically meaningful representation. If you don’t want to use that data model, you can still use BioJava’s file parsers on their own; more on that later. Let’s start with the most basic way of loading a protein structure.

Quick Installation

Before we start, here is one quick paragraph on how to get access to BioJava.

BioJava is open source and you can get the code from GitHub. However, it might be easier this way:

BioJava uses Maven as a build and distribution system. If you are new to Maven, take a look at the Getting Started with Maven guide.

As of version 4, BioJava is available in Maven Central. Thus you just need to include this in your pom.xml file:

        <dependencies>
                ...
                <dependency>
                        <groupId>org.biojava</groupId>
                        <artifactId>biojava-structure</artifactId>
                        <version>4.0.0</version>
                </dependency>
                <!-- other biojava jars as needed -->
        </dependencies>

If you run ‘mvn package’ on your project, the BioJava dependencies will be automatically downloaded and installed for you.

First steps

The simplest way to load a PDB file is by using the StructureIO class.

    Structure structure = StructureIO.getStructure("4HHB");
    // and let's print out how many atoms are in this structure
    System.out.println(StructureTools.getNrAtoms(structure));

BioJava automatically downloaded the PDB file for hemoglobin 4HHB and copied it into a temporary location. This demonstrates two things:

BioJava can automatically download and install files locally.
BioJava by default writes those files into a temporary location (the system temp directory, given by the Java system property “java.io.tmpdir”).

If you already have a local PDB installation, you can configure where BioJava should read the files from by setting the PDB_DIR system property:

    -DPDB_DIR=/wherever/you/want/

or by setting an environment variable:

    export PDB_DIR=/wherever/you/want/

Note that the layout of files in those directories will mimic the “divided” layout in the official PDB FTP repository.
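To make the directory lookup and the “divided” layout more concrete, here is a minimal sketch in plain Java that does not depend on BioJava. The class and method names (PdbDirSketch, resolveBaseDir, dividedRelativePath) are hypothetical and exist only for this illustration. The lookup order (system property, then environment variable, then the temp directory) is an assumption of the sketch, not a description of BioJava internals. The file names follow the naming used in the divided directories of the wwPDB archive, where entries are grouped by the middle two characters of the lowercase ID; any parent directories that BioJava may create beneath PDB_DIR are not shown.

```java
import java.util.Locale;
import java.util.Map;

public class PdbDirSketch {

    /**
     * Picks a base directory: the property value if set, otherwise the PDB_DIR
     * entry of the given environment map, otherwise the fallback.
     * NOTE: this precedence is an assumption made for the sketch.
     */
    public static String resolveBaseDir(String propertyValue, Map<String, String> env, String fallback) {
        if (propertyValue != null && !propertyValue.isEmpty()) {
            return propertyValue;
        }
        String fromEnv = env.get("PDB_DIR");
        if (fromEnv != null && !fromEnv.isEmpty()) {
            return fromEnv;
        }
        return fallback;
    }

    /**
     * Relative path of an entry inside a "divided" directory: a sub-directory named
     * after the middle two characters of the ID, then the compressed file.
     */
    public static String dividedRelativePath(String pdbId, boolean mmCif) {
        if (pdbId == null || pdbId.length() != 4) {
            throw new IllegalArgumentException("expected a 4-character PDB ID: " + pdbId);
        }
        String id = pdbId.toLowerCase(Locale.ROOT);
        String middle = id.substring(1, 3);
        String fileName = mmCif ? id + ".cif.gz" : "pdb" + id + ".ent.gz";
        return middle + "/" + fileName;
    }

    public static void main(String[] args) {
        String base = resolveBaseDir(
                System.getProperty("PDB_DIR"),
                System.getenv(),
                System.getProperty("java.io.tmpdir"));
        System.out.println(base);                               // depends on your setup
        System.out.println(dividedRelativePath("4HHB", false)); // hh/pdb4hhb.ent.gz
        System.out.println(dividedRelativePath("4HHB", true));  // hh/4hhb.cif.gz
    }
}
```

Running the class with -DPDB_DIR=/wherever/you/want/ prints that directory on the first line; without the property or the environment variable it prints the system temp directory.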

From PDB to mmCIF

By default, BioJava uses the PDB file format for parsing data. To switch to mmCIF, we can take control of the underlying AtomCache, which manages your PDB (and also SCOP and CATH) installations.

        AtomCache cache = new AtomCache();
            
        cache.setUseMmCif(true);
            
        // if you struggled to set the PDB_DIR property correctly in the previous step, 
        // you could set it manually like this:
        cache.setPath("/tmp/");
            
        StructureIO.setAtomCache(cache);
            
        Structure structure = StructureIO.getStructure("4HHB");
                    
        // and let's count how many chains are in this structure.
        System.out.println(structure.getChains().size());

As you can see, the AtomCache will again download the missing mmCIF file for 4HHB in the background.

Low level access

By default, the file content is loaded into the BioJava data structures. The parser contains a built-in event model, which allows you to populate your own custom data structures instead. To do this, you need to implement the MMcifConsumer interface.

@since 1.7

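    import java.io.BufferedReader;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.InputStreamReader;

    // Package names as assumed for BioJava 4; older releases used different packages.
    import org.biojava.nbio.structure.Structure;
    import org.biojava.nbio.structure.io.mmcif.MMcifParser;
    import org.biojava.nbio.structure.io.mmcif.SimpleMMcifConsumer;
    import org.biojava.nbio.structure.io.mmcif.SimpleMMcifParser;

    // The class name is arbitrary.
    public class ParseMmcifFile {

        public static void main(String[] args) {

            String fileName = args[0];

            MMcifParser parser = new SimpleMMcifParser();

            SimpleMMcifConsumer consumer = new SimpleMMcifConsumer();

            // The consumer builds up the BioJava structure object.
            // You could also hook in your own and build up your own data model.
            parser.addMMcifConsumer(consumer);

            // Open the file inside the try block: the FileInputStream constructor
            // can fail as well, and try-with-resources closes the stream for us.
            try (InputStream inStream = new FileInputStream(fileName)) {
                parser.parse(new BufferedReader(new InputStreamReader(inStream)));
            } catch (IOException e) {
                e.printStackTrace();
                return;
            }

            // Now get the protein structure.
            Structure cifStructure = consumer.getStructure();
        }
    }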

The parser operates similarly to an event-based XML parser: it triggers “events” as it reads the file. The SimpleMMcifConsumer listens for new categories being read from the file and then builds up the BioJava data model.

To re-use the parser for your own data model, just implement the MMcifConsumer interface and add your implementation to the SimpleMMcifParser.

        parser.addMMcifConsumer(myOwnConsumerImplementation);
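The event model itself can be illustrated without BioJava. The sketch below is a deliberately simplified, self-contained toy: the RowConsumer interface, the ToyCifEvents class and the ChainCounter consumer are hypothetical names invented for this example and are not the BioJava MMcifConsumer API. The toy parser only understands whitespace-separated loop_ blocks closed by a “#” line (no quoted or multi-line values), which is far less than real mmCIF requires. What it shows is the division of labour: the parser reports each row of a category as an event, and the consumer decides which data model to build, here a count of atom_site rows per chain.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ToyCifEvents {

    /** Hypothetical callback interface: called once per data row of a loop_ block. */
    public interface RowConsumer {
        void newRow(String category, List<String> fieldNames, List<String> values);
    }

    /** Toy parser: reports rows of whitespace-separated loop_ blocks, ignores everything else. */
    public static void parse(BufferedReader reader, RowConsumer consumer) throws IOException {
        boolean inLoop = false;
        String category = null;
        List<String> fields = new ArrayList<>();
        String line;
        while ((line = reader.readLine()) != null) {
            line = line.trim();
            if (line.isEmpty()) {
                continue;
            }
            if (line.startsWith("#")) {            // in this toy, '#' closes the current loop
                inLoop = false;
            } else if (line.equals("loop_")) {     // start collecting a new category
                inLoop = true;
                category = null;
                fields = new ArrayList<>();
            } else if (inLoop && line.startsWith("_")) {
                int dot = line.indexOf('.');       // header line of the form _category.field
                if (dot > 1) {
                    category = line.substring(1, dot);
                    fields.add(line.substring(dot + 1));
                }
            } else if (inLoop && category != null) {
                // data row: fire one event and let the consumer decide what to do with it
                consumer.newRow(category, fields, Arrays.asList(line.split("\\s+")));
            }
        }
    }

    /** A custom "data model": the number of atom_site rows seen for each chain. */
    public static class ChainCounter implements RowConsumer {
        public final Map<String, Integer> atomsPerChain = new LinkedHashMap<>();

        @Override
        public void newRow(String category, List<String> fieldNames, List<String> values) {
            if (!category.equals("atom_site")) {
                return;                            // not a category this consumer cares about
            }
            int col = fieldNames.indexOf("label_asym_id");
            if (col >= 0 && col < values.size()) {
                atomsPerChain.merge(values.get(col), 1, Integer::sum);
            }
        }
    }

    /** Tiny hand-written input in the simplified syntax the toy parser accepts. */
    public static final String SAMPLE = String.join("\n",
            "data_TOY",
            "#",
            "loop_",
            "_atom_site.group_PDB",
            "_atom_site.id",
            "_atom_site.label_atom_id",
            "_atom_site.label_asym_id",
            "ATOM 1 N A",
            "ATOM 2 CA A",
            "ATOM 3 N B",
            "#",
            "loop_",
            "_struct_asym.id",
            "_struct_asym.entity_id",
            "A 1",
            "B 1",
            "#");

    public static Map<String, Integer> countAtomsPerChain(String cifText) {
        ChainCounter counter = new ChainCounter();
        try {
            parse(new BufferedReader(new StringReader(cifText)), counter);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return counter.atomsPerChain;
    }

    public static void main(String[] args) {
        System.out.println(countAtomsPerChain(SAMPLE)); // {A=2, B=1}
    }
}
```

The parser never needs to know what ChainCounter does with a row, so a different consumer can build a completely different data model from the same events. This is the same separation that lets you plug your own MMcifConsumer implementation into the BioJava parser.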

For more info on how to work with the BioJava structure data model see <BioJava:CookBook:PDB:atoms>.

References

1. Westbrook (2000), PMID 10842738
2. Westbrook (2003), PMID 12647386