What is a PDB file?

Macromoltek, Inc.
Mar 29, 2019

Our Instagram and blog often make reference to PDB files. We briefly mentioned the Protein Data Bank (also abbreviated as PDB) in our previous blog post about docking, which may have generated some confusion. This blog post explains what PDB files are used for and how to read them. Note that there is another, unrelated PDB file format called Program Database, which was created by Microsoft to store debugging information. If that’s what you’re looking for, visit the article here.

Origin of the PDB Format

Pictured are Walter Hamilton, Helen Berman, and Tom Koetzle, who were all important contributors to the creation and care of the PDB (1).

PDB files are used to visualize the crystal structures of molecules in visualization and simulation software. X-ray crystallography is an area of study dating back to 1912. At the time, crystal structures were derived from X-ray diffraction patterns and mathematical equations. Our blog post on crystallography explains the process and its evolution in depth. We revisit the topic here because, during the 1960s, crystallographic data began to be converted into digital molecular structure data. The increasing size and amount of data, together with the development of the Brookhaven RAster Display (BRAD) console, generated the need for a standard format and a repository.

Walter Hamilton was the deputy chairman of the chemistry department at Brookhaven National Laboratory in Upton, New York, and a brilliant mathematical crystallographer. Dr. Hamilton met Edgar Meyer of Texas A&M University at the annual American Crystallographic Association (ACA) Meeting in 1968. At Hamilton’s behest, Meyer traveled to Brookhaven to work with him on the standardization of crystal structure data and the creation of a repository. Meyer recalled this time in his memoir submitted to the ACA website:

“Starting with crystallographic coordinates, my program DISPLAY drove a color television monitor to draw red/green 3D images of up to 256 atoms, which was crucial for the founding of the Protein Data Bank (PDB) at Brookhaven in 1971”

The file format Meyer created for DISPLAY was the foundation of the .PDB format, which is still used today, though with many additions and modifications. Meyer wowed the scientific community in September of 1971 when he remotely accessed the PDB using SEARCH, his link and query software. He queried for myoglobin and requested that all atoms within a chosen radius of its iron atom appear on DISPLAY. This is often described as the first use of networking in the life sciences.

Columns of a PDB File

The file Meyer retrieved for DISPLAY was a primitive form of the text below. This is what you’ll find if you open a PDB file in a text editor. The file may seem complicated at first, but upon closer inspection it’s quite simple, as long as you have a basic understanding of biochemistry and coordinate mapping.

How a PDB file appears in a text editor

At the most basic level, a PDB file is a list of atoms and their Cartesian (XYZ) coordinates. In principle, the same kind of list could describe the shape of any physical object. Additional information is included to provide more details about the atoms and to build atoms into amino acids, amino acids (residues) into proteins, and proteins into larger macromolecular assemblies. See our protein blog post for more details.

The first column is the record type (ATOM in most lines; see Other Keywords below). The second is the sequential atom number. Next is the atom name. Atoms are named not only by their element symbol, but also with a remoteness indicator code. This code reflects how far the atom sits from the backbone alpha carbon in the amino acid’s structure: A for alpha, B for beta, G for gamma, and so on. For example, CA is the alpha carbon and CB is the beta carbon. The residue name is in column 4, followed by the chain identifier in column 5 and the number of the residue the atom belongs to in column 6. The X, Y, and Z orthogonal coordinates come next, in that order. Coordinates are given in angstroms; one angstrom equals 0.1 nanometers. The next columns are occupancy and temperature factor (B-factor, B-value), which require greater explanation.
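To make the column layout concrete, here is a minimal Python sketch that reads one coordinate record. One detail worth knowing: the fields in a PDB file are defined by fixed character positions rather than by whitespace, so the sketch slices each line by position. The names `Atom`, `format_atom_line`, and `parse_atom_line` are hypothetical helpers made up for this post, and the sample atom's values are invented for illustration.

```python
from typing import NamedTuple


class Atom(NamedTuple):
    record: str       # "ATOM" or "HETATM"
    serial: int       # sequential atom number
    name: str         # atom name, e.g. "CA" for the alpha carbon
    res_name: str     # residue name, e.g. "MET"
    chain_id: str     # chain identifier, e.g. "A"
    res_seq: int      # residue number
    x: float          # coordinates in angstroms
    y: float
    z: float
    occupancy: float
    b_factor: float
    element: str


def format_atom_line(a: Atom) -> str:
    """Write one record using the fixed column widths of the PDB format."""
    # Common convention: names of one-letter elements start in column 14.
    if len(a.element) == 1 and len(a.name) < 4:
        name = f" {a.name:<3}"
    else:
        name = f"{a.name:<4}"
    return (
        f"{a.record:<6}{a.serial:>5} {name} {a.res_name:>3} {a.chain_id}"
        f"{a.res_seq:>4}    {a.x:8.3f}{a.y:8.3f}{a.z:8.3f}"
        f"{a.occupancy:6.2f}{a.b_factor:6.2f}          {a.element:>2}"
    )


def parse_atom_line(line: str) -> Atom:
    """Slice a record by character position (0-based, end-exclusive)."""
    return Atom(
        record=line[0:6].strip(),       # columns 1-6
        serial=int(line[6:11]),         # columns 7-11
        name=line[12:16].strip(),       # columns 13-16
        res_name=line[17:20].strip(),   # columns 18-20
        chain_id=line[21],              # column 22
        res_seq=int(line[22:26]),       # columns 23-26
        x=float(line[30:38]),           # columns 31-38
        y=float(line[38:46]),           # columns 39-46
        z=float(line[46:54]),           # columns 47-54
        occupancy=float(line[54:60]),   # columns 55-60
        b_factor=float(line[60:66]),    # columns 61-66
        element=line[76:78].strip(),    # columns 77-78
    )


# Invented example atom: the alpha carbon of a methionine in chain A.
sample = Atom("ATOM", 1, "CA", "MET", "A", 1,
              24.277, 8.374, -9.854, 1.00, 38.41, "C")
line = format_atom_line(sample)
print(line)
print(parse_atom_line(line))
```

Slicing by position rather than calling `split()` matters in practice: in large structures, neighboring fields can run together with no space between them.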

Occupancy
Organic crystals like proteins aren’t as static as inorganic crystals such as table salt. Proteins can adopt multiple conformations, or geometries, depending on their environment. For example, myoglobin is a protein that carries an iron ion which binds to oxygen. When this binding event occurs, the conformation of myoglobin changes. Conformational changes mean atom locations shift. The occupancy column gives the fraction of molecules in the crystal in which the atom is found at the listed position. That is, if an atom has been observed in 4 different positions with equal frequency, each position is listed with an occupancy of 0.25 instead of the default 1.00.
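As a quick sanity check on that idea, the occupancies of all the alternate positions recorded for one atom should add up to roughly 1.00. The sketch below assumes the records have already been parsed into tuples; the tuple layout, the function name `total_occupancy`, and the values are invented for illustration. In a real file, alternate positions of the same atom are told apart by a one-letter alternate location indicator stored in column 17.

```python
from collections import defaultdict

# Hypothetical parsed records:
# (chain, residue number, atom name, alternate location, occupancy)
records = [
    ("A", 42, "CB", "A", 0.25),
    ("A", 42, "CB", "B", 0.25),
    ("A", 42, "CB", "C", 0.25),
    ("A", 42, "CB", "D", 0.25),
    ("A", 43, "CA", " ", 1.00),  # single position, default occupancy
]


def total_occupancy(records):
    """Sum the occupancy over every alternate position of each atom."""
    totals = defaultdict(float)
    for chain, res_seq, name, _alt_loc, occupancy in records:
        totals[(chain, res_seq, name)] += occupancy
    return dict(totals)


for atom, total in total_occupancy(records).items():
    print(atom, f"{total:.2f}")
```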

B-factor
Observing the atoms of a macromolecular crystal can only be done indirectly; atoms are obviously too small to see! Observation is accomplished with experiments, such as X-ray diffraction, that use interactions with the atoms’ electrons to map the location of the atoms. Electrons are never static, and the atoms themselves vibrate and shift, so this approach doesn’t yield a single exact location for each atom. Each atom is therefore assigned a B-factor that describes how spread out its position is, which serves as a measure of confidence in its location. Lower B-factors correspond to higher confidence and vice versa. Atoms in the interior of a molecule tend to have lower B-factors because they are held in place by the surrounding atoms, while atoms at the surface have higher B-factors because they can move with a greater degree of freedom.
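To put numbers on this, crystallographers relate an isotropic B-factor to the mean-square displacement of an atom from its average position by B = 8π²⟨u²⟩, with B in square angstroms. The small sketch below converts a B-factor into a root-mean-square displacement; the function name `rms_displacement` and the example B-factors are made up for illustration.

```python
import math


def rms_displacement(b_factor: float) -> float:
    """RMS displacement in angstroms for an isotropic B-factor in angstroms squared.

    Rearranges B = 8 * pi^2 * <u^2> to give sqrt(<u^2>).
    """
    return math.sqrt(b_factor / (8 * math.pi ** 2))


# Illustrative B-factors: a well-ordered interior atom versus a mobile surface atom.
for b in (10.0, 40.0, 90.0):
    print(f"B = {b:5.1f} -> RMS displacement = {rms_displacement(b):.2f} angstroms")
```

Because the displacement grows with the square root of B, quadrupling the B-factor only doubles the estimated spread of the atom's position.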

The picture on the left is an interior residue of the myoglobin molecule while the right is a residue on the surface. The green mesh represents the B-values of each atom. Notice how those at the surface have much wider nets — indicating the electrons moved much more during mapping.

Other Keywords
Some lines begin with HETATM in place of the ATOM keyword. ATOM is used for atoms in standard protein and nucleic acid residues, while HETATM is used for everything else, such as small-molecule ligands, ions, and water.
The TER keyword marks the end of a chain. Remember from our What is a Protein blog that proteins have hierarchical structures. Each chain is a single polypeptide, and its three-dimensional fold is the tertiary structure. TER tells the display software that consecutive chains are not continuous, so the software won’t try to connect them. Additionally, the extra oxygen atom on the last residue of a chain is sometimes given the remoteness indicator XT (appearing as OXT) to denote the C-terminus.
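Here is a minimal sketch of how a program might use these keywords: it reads the record type from the first six characters of each line and the chain identifier from column 22, then tallies what it finds. The helpers `record_line` and `summarize` are hypothetical, and the sample lines are truncated, invented records with the coordinates left out for brevity.

```python
from collections import Counter


def record_line(record, serial, name, res_name, chain_id, res_seq):
    """Build a truncated, illustrative record (coordinates omitted)."""
    return f"{record:<6}{serial:>5} {name:<4} {res_name:>3} {chain_id}{res_seq:>4}"


def summarize(lines):
    """Count coordinate records per (chain, record type) and count TER records."""
    counts = Counter()
    ter_count = 0
    for line in lines:
        record = line[0:6].strip()           # record type lives in columns 1-6
        if record in ("ATOM", "HETATM"):
            counts[(line[21], record)] += 1  # chain identifier is column 22
        elif record == "TER":
            ter_count += 1
    return counts, ter_count


# Invented example: the start and end of a protein chain, then a ligand and a water.
lines = [
    record_line("ATOM", 1, "N", "MET", "A", 1),
    record_line("ATOM", 2, "CA", "MET", "A", 1),
    record_line("ATOM", 3, "OXT", "GLY", "A", 153),
    "TER",
    record_line("HETATM", 4, "FE", "HEM", "A", 154),
    record_line("HETATM", 5, "O", "HOH", "A", 201),
]

counts, ter_count = summarize(lines)
print(counts)
print("TER records:", ter_count)
```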

Use of PDB Files

Graphical software uses PDB files to render visual representations of microscopic molecules. That’s cool enough on its own! If you want more, though, there are many important uses for this information. PDB files are used to simulate docking so researchers can determine, for example, how two molecules interact. This has further implications, like how the interaction may be blocked if the two molecules are, say, a toxin and a protein on your cell’s surface! Medical scientists can examine surface proteins from cancer cells to determine if they have pockets where small molecule drugs may bind. PDB files and the software to read them are free and readily available to the public. Check out PyMOL and fetch a PDB file from the repository to see for yourself. You may learn something new.
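If you would rather start from a script, the sketch below downloads an entry as plain text using only the Python standard library. It assumes the RCSB download address `https://files.rcsb.org/download/<ID>.pdb` and the classic four-character entry ID; `pdb_url` and `fetch_pdb_text` are hypothetical helper names, and nothing is downloaded until `fetch_pdb_text` is actually called. In PyMOL itself, typing `fetch 1mbn` at the command line does the equivalent for the myoglobin entry 1MBN.

```python
from urllib.request import urlopen

# Assumed URL pattern for downloading an entry in PDB format from RCSB.
RCSB_DOWNLOAD = "https://files.rcsb.org/download/{pdb_id}.pdb"


def pdb_url(pdb_id: str) -> str:
    """Return the download URL for a classic four-character PDB ID."""
    pdb_id = pdb_id.strip().upper()
    if len(pdb_id) != 4 or not pdb_id.isalnum():
        raise ValueError(f"expected a four-character PDB ID, got {pdb_id!r}")
    return RCSB_DOWNLOAD.format(pdb_id=pdb_id)


def fetch_pdb_text(pdb_id: str, timeout: float = 30.0) -> str:
    """Download an entry and return its contents as text (needs network access)."""
    with urlopen(pdb_url(pdb_id), timeout=timeout) as response:
        return response.read().decode("utf-8")


print(pdb_url("1mbn"))
```

Calling `fetch_pdb_text("1mbn")` and printing the first few lines is an easy way to see the records described above in a real file.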

Links and Citations:
1. Photo of Hamilton, Berman, and Koetzle https://www.amercrystalassn.org/assets/History/Helen_M_Berman_Newsletter_2012.pdf

Looking for more information about Macromoltek, Inc? Visit our website at www.Macromoltek.com
Interested in molecular simulations, biological art, or learning more about molecules? Subscribe to our Twitter and Instagram!

Welcome to the Macromoltek blog! We're an Austin-based biotech firm focused on using computers to further the discovery and design of antibodies.