Department of Mathematics and Computer Science, Bioinformatics and Information Sciences Center, Western Kentucky University, Bowling Green, KY 42103, USA

Abstract

Background

A protein structure can be determined by solving a so-called distance geometry problem whenever a sufficient set of inter-atomic distances is available. However, the problem is intractable in general and has been proved to be NP-hard. An updated geometric build-up algorithm (UGB) has been developed recently that controls numerical errors and is efficient in protein structure determination for cases where only sparse exact distance data is available. In this paper, the UGB method is improved and revised with the aim of solving distance geometry problems more efficiently and effectively.

Methods

An efficient algorithm, called the revised updated geometric build-up algorithm (RUGB), is presented that builds up a protein structure from atomic distance data and provides an effective way of determining a protein structure from sparse exact distance data. In the algorithm, the condition for determining an unpositioned atom iteratively is relaxed (compared with the UGB algorithm), and data structure techniques are used to make the algorithm more efficient and effective. The algorithm is tested on a set of proteins selected randomly from the Protein Data Bank (PDB).

Results

We test a set of proteins selected randomly from the Protein Data Bank (PDB). We show that the numerical errors produced by the new RUGB algorithm are smaller than those of the UGB algorithm and that the RUGB algorithm has a significantly smaller runtime than the UGB algorithm.

Conclusions

The RUGB algorithm relaxes the condition for updating and incorporates a data structure for accessing the neighbours of an atom. The revisions result in an improvement over the UGB algorithm in two important areas: a reduction in the overall runtime and a decrease in the numerical error.

Introduction

Proteins are important bio-molecules in biological systems and activities. A protein is a polypeptide chain made of 20 different types of amino acids. An amino acid sequence determines the structure of the protein. Knowledge of the protein structure gives us insight into the function of the protein and its dynamics. Therefore, it is always important to have an accurate protein structure in the highest resolution available. The distances between many pairs of atoms in a protein can often be determined based on our knowledge of chemistry (for example, certain types of bond lengths and bond angles).

In an experimental setting there are two additional restrictions. First, often only a small subset of all pairwise distances may be available. Second, instead of a single distance, experiments might only yield a distance range for a pair of atoms (a lower bound and an upper bound of a distance). Several algorithms have been developed as solutions or approximate solutions to the molecular distance geometry problem (MDGP). These algorithms include singular value decomposition

In this paper we will only consider the MDGP in the case when exact distances are available. Furthermore, we concentrate on a particular class of algorithms that are computationally quite fast and will often suffice to solve the MDGP. These are the so-called geometric build-up algorithms (GB)

Here we will refer to a positioned atom as an atom with known coordinates in 3D space and an unpositioned atom as an atom where we do not know its 3D coordinates. It is well-known in geometry that in 3D an unpositioned point

A major problem in the geometric build-up procedure is numerical stability when a protein has a large number of atoms. Due to computational round-off or truncation, errors are introduced into the build-up coordinates, and the iterative nature of the algorithm can cause these errors to accumulate. This problem has been solved by using an updated geometric build-up (UGB) algorithm

Geometric build-up algorithms

Initially, four atoms that are not co-planar are selected such that all six inter-atomic distances between each pair of these four atoms are known. A set of coordinates for the four atoms is determined that satisfies the distances between them. We call atoms with fixed coordinates

The outline of the General Geometric Build-Up Algorithm for Solving MDGP [Dong and Wu 2002b]

**Definition 1.1** A set of points ^{3}

**Definition 1.2** A set of four points in ^{3}

**Definition 1.3** A point _{i}_{j}_{i}_{i,j}_{j}

**Theorem 1.1** Given a set of distances among four non-coplanar points, then the coordinates of the four points can be uniquely determined up to a rigid motion that is a combination of a translation, a rotation and possibly a reflection.

**Proof.** The distances between the four points define a tetrahedron and therefore this is obvious.

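**Theorem 1.2** If the coordinates of four non-coplanar atoms $x_{i}$, $i=1,2,3,4$, and the distances $d_{i,j}$, $i=1,2,3,4$, from these atoms to a fifth atom $x_{j}$ are given, then the coordinates of $x_{j}$ in $\mathbb{R}^{3}$ are uniquely determined.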

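**Proof.** While this theorem is geometrically obvious, we provide a short proof that gives insight into how the coordinates of the fifth atom are actually computed. Let $x_{i} = (u_{i}, v_{i}, w_{i})^{T}$, $i=1,2,3,4$, be the coordinates of the four positioned atoms and $x_{j} = (u_{j}, v_{j}, w_{j})^{T}$ the unknown coordinates of the fifth atom. Then $x_{j}$ satisfies

$$\|x_{i} - x_{j}\| = d_{i,j}, \quad i = 1,2,3,4.$$

Square the equations and expand their left-hand sides to obtain

$$\|x_{i}\|^{2} - 2x_{i}^{T}x_{j} + \|x_{j}\|^{2} = d_{i,j}^{2}, \quad i = 1,2,3,4.$$

Subtract the first equation from the rest to reduce the equations to the following three,

$$-2(x_{i} - x_{1})^{T}x_{j} = (d_{i,j}^{2} - d_{1,j}^{2}) - (\|x_{i}\|^{2} - \|x_{1}\|^{2}), \quad i = 2,3,4.$$

Define the matrix $A$ and the vector $b$ as

$$A = -2\begin{pmatrix} (x_{2} - x_{1})^{T} \\ (x_{3} - x_{1})^{T} \\ (x_{4} - x_{1})^{T} \end{pmatrix}, \qquad b = \begin{pmatrix} (d_{2,j}^{2} - d_{1,j}^{2}) - (\|x_{2}\|^{2} - \|x_{1}\|^{2}) \\ (d_{3,j}^{2} - d_{1,j}^{2}) - (\|x_{3}\|^{2} - \|x_{1}\|^{2}) \\ (d_{4,j}^{2} - d_{1,j}^{2}) - (\|x_{4}\|^{2} - \|x_{1}\|^{2}) \end{pmatrix}.$$

We can then write the above equations in the following matrix form.

$$A x_{j} = b.$$

Since $x_{1}$, $x_{2}$, $x_{3}$, $x_{4}$ are not in the same plane, the matrix $A$ is nonsingular, and $x_{j}$ is therefore uniquely determined in $\mathbb{R}^{3}$ as the solution of this linear system.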
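The linear system used in the proof can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the authors' implementation: the function name `locate_from_four` and the test coordinates are hypothetical, and the system is built by subtracting the first squared-distance equation from the other three, as the proof describes.

```python
import numpy as np

def locate_from_four(base, dists):
    """Position a fifth atom from four non-coplanar atoms (hypothetical helper).

    base  : (4, 3) array with the coordinates of the positioned atoms
    dists : length-4 sequence of exact distances from those atoms to the new atom
    """
    base = np.asarray(base, dtype=float)
    d2 = np.asarray(dists, dtype=float) ** 2
    norms2 = np.sum(base ** 2, axis=1)
    # Rows: -2 (x_i - x_1)^T for i = 2, 3, 4
    A = -2.0 * (base[1:] - base[0])
    # Right-hand side: (d_i^2 - d_1^2) - (|x_i|^2 - |x_1|^2)
    b = (d2[1:] - d2[0]) - (norms2[1:] - norms2[0])
    # A is nonsingular exactly when the four base atoms are not coplanar
    return np.linalg.solve(A, b)

# Illustrative data (assumed, not taken from the paper)
base = np.array([[0.0, 0.0, 0.0],
                 [1.0, 0.0, 0.0],
                 [0.0, 1.0, 0.0],
                 [0.0, 0.0, 1.0]])
target = np.array([0.3, -0.7, 1.2])
dists = np.linalg.norm(base - target, axis=1)
recovered = locate_from_four(base, dists)
```

Each new atom costs one 3 by 3 linear solve, which is why the build-up step itself is cheap once a suitable base has been found.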

Note that the above algorithm (given in the proof of Theorem 1.2) shows that the coordinates of _{j}^{3})^{3})

As shown in previous reports

An updated geometric build-up algorithm (UGB)

This algorithm incorporates the idea of re-computing the coordinates of the four atoms in a metric base to minimize the rounding error. In many cases, there exist many options to select a metric base of four atoms that can determine an unpositioned atom. In the updated geometric build-up algorithm, four non-coplanar atoms with original distances among them are preferred. The reason is that a metric base forms a tetrahedron _{5}

It is important to realize that the determination of the coordinates of the unpositioned atom is independent of the coordinates of other atoms obtained previously. After translation and rotation of the complete graph $K_{5}$

The outline of the updated geometric build-up algorithm for solving the molecular distance geometry problem with sparse exact distances [Wu and Wu]

There are two major steps in this algorithm. First, the positions of the four base atoms are recomputed based on Theorem 1.1. The new positions of the four base atoms are completely independent of their old positions, and this first step guarantees that the four base atoms form a tetrahedron in which the distances between the atoms are as accurate as possible. Second, the translation vector and rotation matrix need to be found for re-initializing. This second step requires techniques used in the computation of the Root Mean Square Deviation (RMSD).

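We explain the re-initialization step for a tetrahedron when all distances among four atoms are available. Let $(x_{i}, y_{i}, z_{i})$ be the coordinates of the $i^{th}$ atom and $d_{ij}$ the distance between the $i^{th}$ and $j^{th}$ atoms. The first atom is placed at the origin and the second atom on the x-axis:

$x_{1}=0, y_{1}=0, z_{1}=0$

$x_{2}=d_{21}, y_{2}=0, z_{2}=0$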
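The placement of all four atoms of a tetrahedron from its six distances can be sketched as follows. Only the coordinates of the first two atoms are stated in the text; the expressions for the third and fourth atoms below are obtained by intersecting spheres and are an assumption about the omitted steps. The function name `tetrahedron_from_distances` is hypothetical, and the positive root is chosen for the last coordinate, which fixes one of the two mirror images.

```python
import math

def tetrahedron_from_distances(d21, d31, d32, d41, d42, d43):
    """Return coordinates of four atoms realizing the six given distances."""
    x2 = d21
    # Atom 3 lies in the xy-plane
    x3 = (d31**2 - d32**2) / (2.0 * x2) + x2 / 2.0
    y3 = math.sqrt(max(d31**2 - x3**2, 0.0))
    # Atom 4: x from atoms 1 and 2, y from atoms 2 and 3, z from atom 1
    x4 = (d41**2 - d42**2) / (2.0 * x2) + x2 / 2.0
    y4 = (d42**2 - d43**2 - (x4 - x2)**2 + (x4 - x3)**2) / (2.0 * y3) + y3 / 2.0
    z4 = math.sqrt(max(d41**2 - x4**2 - y4**2, 0.0))  # positive root: one mirror image
    return [(0.0, 0.0, 0.0), (x2, 0.0, 0.0), (x3, y3, 0.0), (x4, y4, z4)]

# Illustrative check with a regular tetrahedron of edge length 1 (assumed data)
points = tetrahedron_from_distances(1.0, 1.0, 1.0, 1.0, 1.0, 1.0)
pair_distances = [math.dist(points[a], points[b])
                  for a in range(4) for b in range(a + 1, 4)]
```

Because these coordinates depend only on the original distances, recomputing a base this way discards any round-off error carried by its previously computed positions.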

We explain the standard RMSD steps for any two structures of embedded points with coordinate matrices

Now, _{1}_{1}_{1}_{1}_{F} is defined by _{i}_{i}_{1}^{T}X_{1}^{T}^{T}^{4})^{6})
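For readers who want a concrete picture of the superposition step, the following sketch uses the standard SVD-based (Kabsch) procedure to find a rotation and translation that best align one set of points with another. It is offered as an illustration of the RMSD technique in general, not as the exact procedure of the UGB algorithm; the names `superpose` and `rmsd`, the row-vector convention, and the `allow_reflection` switch are assumptions.

```python
import numpy as np

def superpose(X, Y, allow_reflection=False):
    """Find Q (3x3) and t (3,) minimizing the Frobenius norm of (X @ Q + t) - Y."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    xc, yc = X.mean(axis=0), Y.mean(axis=0)
    C = (X - xc).T @ (Y - yc)            # 3x3 cross-covariance of centred points
    U, _, Vt = np.linalg.svd(C)
    Q = U @ Vt
    if not allow_reflection and np.linalg.det(Q) < 0:
        U[:, -1] *= -1.0                 # flip the least significant axis
        Q = U @ Vt
    t = yc - xc @ Q
    return Q, t

def rmsd(X, Y):
    """Root mean square deviation between two equally sized point sets."""
    diff = np.asarray(X, dtype=float) - np.asarray(Y, dtype=float)
    return float(np.sqrt(np.mean(np.sum(diff ** 2, axis=1))))

# Illustrative data: Y is X rotated about the z-axis and shifted (assumed values)
X = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.5, 0.8, 0.0], [0.4, 0.3, 0.9]])
c, s = np.cos(0.5), np.sin(0.5)
R = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
Y = X @ R.T + np.array([1.0, -2.0, 0.5])
Q, t = superpose(X, Y)
aligned = X @ Q + t
```

In a re-initialization setting, `X` would hold the freshly recomputed base coordinates and `Y` their current positions in the partially built structure. A recomputed tetrahedron may be the mirror image of its current placement, in which case a reflection has to be allowed.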

In this paper, the UGB algorithm is improved by a revised updated geometric build-up algorithm (RUGB). This algorithm aims at reducing the computational complexity of the UGB algorithm. As we will show the RUGB algorithm also improves the numerical error over the performance of the UGB algorithm.

Methods

A revised updated geometric build-up algorithm (RUGB)

Although the updated geometric build-up algorithm (UGB) has been shown to control numerical errors, it requires searching in every iteration for four atoms with all distances among them to serve as a metric base. A revised updated geometric build-up algorithm is described in this paper. The algorithm is based on the regular updated geometric build-up algorithm and is modified by adding a new data structure and relaxing the condition on a metric base. The first modification is that, instead of requiring four metric base atoms with all distances among them, the algorithm requires three metric base atoms with all distances among them and one additional atom. The purpose of relaxing the condition is to cut down the time it takes to find a new metric base. The updating scheme can still be implemented with only three metric base atoms. However, using three atoms with all distances among them results in two possible sets of coordinates for the position of an undetermined atom. In order to distinguish the correct solution from the incorrect one, we use the distance to a fourth determined atom that is not coplanar with the first three base atoms. This strategy is also based on Theorem 1.2. The re-initialization and updating of the metric base of three atoms follows steps similar to those of the UGB algorithm introduced in the previous section, with three atoms rather than four being considered.
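The relaxed condition can be illustrated with a small sketch: three base atoms give two mirror-image candidates for the new atom, and the distance to a fourth positioned atom selects one of them. This is a generic trilateration sketch under stated assumptions, not the authors' code; the function names `two_candidates` and `position_atom` and the sample coordinates are hypothetical.

```python
import numpy as np

def two_candidates(p1, p2, p3, r1, r2, r3):
    """Return the two points at distances r1, r2, r3 from p1, p2, p3."""
    ex = (p2 - p1) / np.linalg.norm(p2 - p1)
    i = ex @ (p3 - p1)
    ey = (p3 - p1) - i * ex
    ey = ey / np.linalg.norm(ey)          # fails if the three atoms are collinear
    ez = np.cross(ex, ey)
    d = np.linalg.norm(p2 - p1)
    j = ey @ (p3 - p1)
    x = (r1**2 - r2**2 + d**2) / (2.0 * d)
    y = (r1**2 - r3**2 + i**2 + j**2 - 2.0 * i * x) / (2.0 * j)
    z = np.sqrt(max(r1**2 - x**2 - y**2, 0.0))
    in_plane = p1 + x * ex + y * ey
    return in_plane + z * ez, in_plane - z * ez

def position_atom(p1, p2, p3, r1, r2, r3, p4, r4):
    """Pick the candidate whose distance to the fourth atom p4 matches r4 best."""
    a, b = two_candidates(p1, p2, p3, r1, r2, r3)
    err_a = abs(np.linalg.norm(a - p4) - r4)
    err_b = abs(np.linalg.norm(b - p4) - r4)
    return a if err_a <= err_b else b

# Illustrative data (assumed): p4 is not coplanar with p1, p2, p3
p1, p2, p3 = np.array([0.0, 0.0, 0.0]), np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])
p4 = np.array([0.0, 0.0, 1.0])
target = np.array([0.3, 0.4, 0.5])
r1, r2, r3, r4 = (np.linalg.norm(target - p) for p in (p1, p2, p3, p4))
found = position_atom(p1, p2, p3, r1, r2, r3, p4, r4)
```

The fourth atom is only compared against the two candidates, so it needs a known distance to the new atom but not to the other three base atoms, which is what makes a usable base easier to find.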

A second modification is the creation of a data structure that makes it easy to access all of the neighbouring atoms given by the original distance matrix for any atom. Here we refer to the degree of an atom as the number of its neighbouring atoms and to $d_{max}$ as the maximum degree over all atoms.
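One simple way to realize such a neighbour data structure is an adjacency list keyed by atom index. The sketch below is an assumption about how this could look, not the authors' data structure; the helper names `build_neighbour_lists`, `max_degree` and `positioned_neighbours` and the toy distances are hypothetical.

```python
from collections import defaultdict

def build_neighbour_lists(distances):
    """Map each atom to a dict {neighbour: distance} from a sparse distance set.

    distances : dict mapping an atom pair (i, j) to the known distance
    """
    neighbours = defaultdict(dict)
    for (i, j), d in distances.items():
        neighbours[i][j] = d
        neighbours[j][i] = d
    return neighbours

def max_degree(neighbours):
    """Largest number of neighbours of any atom."""
    return max(len(nb) for nb in neighbours.values())

def positioned_neighbours(neighbours, atom, positioned):
    """Neighbours of `atom` that already have coordinates."""
    return [j for j in neighbours[atom] if j in positioned]

# Toy sparse distance set (assumed values, in angstroms)
sparse = {(0, 1): 1.5, (0, 2): 2.4, (1, 2): 1.5, (2, 3): 1.5}
nb = build_neighbour_lists(sparse)
degree = max_degree(nb)
candidates = positioned_neighbours(nb, 3, positioned={0, 1, 2})
```

With this layout, the search for base atoms of an unpositioned atom only scans that atom's own neighbours instead of a full row of the distance matrix.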

The size of $d_{max}$

| PDB Name | # Atoms | $d_{max}$ | $d_{max}/n$ |
|---|---|---|---|
| 1VII.pdb | 596 | 77 | 0.129195 |
| 1HIP.pdb | 617 | 37 | 0.059968 |
| 1ULR.pdb | 677 | 36 | 0.053176 |
| 1BOM.pdb | 700 | 69 | 0.098571 |
| 1AIK.pdb | 729 | 49 | 0.067215 |
| 1CEU.pdb | 854 | 65 | 0.076112 |
| 1KVX.pdb | 954 | 38 | 0.039832 |
| 1VMP.pdb | 1166 | 74 | 0.063465 |
| 1HSM.pdb | 1251 | 73 | 0.058353 |
| 1HAA.pdb | 1310 | 69 | 0.052672 |

The outline of the revised updated geometric build-up algorithm (RUGB) for solving the molecular distance geometry problem with sparse exact distances

**Theorem 3.1** Assume that any four initial metric base atoms can lead to the complete determination of a protein structure given a sparse set of distance data. Then the protein structure can be determined by the revised updated geometric build-up algorithm (RUGB) using $O(n^{2}d_{max}^{3})$ operations, where $n$ is the number of atoms and $d_{max}$ is the maximum degree of an atom.

**Proof.** For any unpositioned atom _{max}^{3})_{1}, x_{2}, x_{3}_{max})_{4}_{1}, x_{2}, x_{3}_{4}

If both a metric base of three atoms _{1}, x_{2}, x_{3}_{4}_{1}, x_{2}, x_{3}_{4}_{1}, x_{2}, x_{3}_{4}

Thus for an unpositioned atom _{max}^{3})_{max}^{3})^{2}d_{max}^{3})

Note that often in NMR structure determination, only distances less than 5Å can be obtained. Therefore, the typical distance matrix is sparse in realistic applications. However, the RUGB algorithm of Figure

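**Theorem 3.2** Given a sparse set of distance data for a protein, it takes at most $O(n^{3}d_{max}^{6})$ operations for the RUGB algorithm to determine the protein structure.

**Proof.** In a protein structure, there are at most $O(nd_{max}^{3})$ possible choices of four initial metric base atoms. By Theorem 3.1, each choice requires $O(n^{2}d_{max}^{3})$ operations, so the total cost is at most $O(nd_{max}^{3}) \cdot O(n^{2}d_{max}^{3}) = O(n^{3}d_{max}^{6})$.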

Results

We tested the RUGB algorithm on a set of proteins. We also compared the results with results generated by the GB algorithm and the UGB algorithm. The testing data was prepared in the following way. A set of proteins with their structures were downloaded from the protein structure database PDB

The Table

The numerical results of RUGB and UGB

| PDB Name | # Atoms | RUGB time (s) | UGB time (s) | RUGB error (Å) | UGB error (Å) |
|---|---|---|---|---|---|
| 2DX2.pdb | 174 | 3.5803 | 4.2447 | 2.31E-11 | 1.71E-08 |
| 1ID7.pdb | 189 | 3.0529 | 4.4187 | 8.62E-14 | 2.87E-12 |
| 1B5N.pdb | 332 | 8.1185 | 10.1274 | 1.93E-10 | 8.67E-08 |
| 1FW5.pdb | 332 | 6.9327 | 9.6053 | 1.65E-12 | 6.29E-08 |
| 1SOL.pdb | 353 | 8.318 | 13.5202 | 7.33E-13 | 5.72E-11 |
| 1JAV.pdb | 360 | 7.9572 | 11.4536 | 2.78E-12 | 1.50E-08 |
| 1MEQ.pdb | 405 | 8.7641 | 14.076 | 2.43E-12 | 1.20E-10 |
| 1AMB.pdb | 438 | 13.966 | 16.9998 | 7.11E-12 | 4.35E-07 |
| 1R7C.pdb | 532 | 13.3252 | 26.2002 | 8.62E-10 | 5.50E-08 |
| 1HLL.pdb | 540 | 13.0888 | 28.5319 | 2.83E-12 | 5.41E-07 |
| 1VII.pdb | 596 | 13.0338 | 24.7907 | 3.56E-10 | 2.28E-07 |
| 1HIP.pdb | 617 | 15.9565 | 35.5588 | 4.80E-10 | 5.45E-07 |
| 1ULR.pdb | 677 | 19.9154 | 127.6762 | 3.84E-10 | 5.43E-11 |
| 1BOM.pdb | 700 | 15.6276 | 37.5214 | 1.36E-09 | 3.16E-09 |
| 1AIK.pdb | 729 | 17.302 | 39.4843 | 9.19E-09 | 7.89E-09 |
| 1CEU.pdb | 854 | 21.3126 | 49.3975 | 3.15E-10 | 2.43E-09 |
| 1KVX.pdb | 954 | 27.6469 | 83.2725 | 7.21E-04 | 6.61E-04 |
| 1VMP.pdb | 1166 | 32.7741 | 95.3844 | 1.01E-06 | 5.57E-06 |
| 1HSM.pdb | 1251 | 37.8582 | 108.2448 | 5.88E-07 | 3.22E-07 |
| 1HAA.pdb | 1310 | 35.6037 | 129.6353 | 4.49E-10 | 8.25E-07 |

E-5 means 10^{-5} and E+5 means 10^{5}; others follow similarly

In Table

**Protein 3D structure determined by RUGB and the original protein 3D structure of 1HAA.** The left picture shows the 3D structure of 1HAA determined by RUGB and the right picture shows the original 3D structure of 1HAA.

This is surprising since the update regimes are very similar. A possible reason is the following: the RUGB algorithm uses only three base atoms to numerically determine an unpositioned atom, which yields two candidate solutions, and one additional atom to select the correct solution. This updating procedure involves less numerical calculation than the four-atom updating routine of the UGB algorithm, so the RUGB updating may produce a smaller numerical error.

The theoretical analysis (Theorems 3.1 and 3.2) discusses the upper bound of the run-time of RUGB. Clearly, the numerical data show that, on the proteins in our data set, the algorithm runs much faster than the theoretical worst-case analysis suggests. The run-time data is plotted in Figure ^{2}^{1.2}
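The empirical growth of the run-time can be estimated from the reported measurements with a log-log least-squares fit. The sketch below uses the RUGB run times and atom counts listed in the results table; the power-law model t = c·n^p and the use of `numpy.polyfit` are assumptions made for illustration and are not necessarily the fitting procedure behind the figure.

```python
import numpy as np

# Atom counts and RUGB run times (s) as reported in the results table
n_atoms = np.array([174, 189, 332, 332, 353, 360, 405, 438, 532, 540,
                    596, 617, 677, 700, 729, 854, 954, 1166, 1251, 1310],
                   dtype=float)
rugb_time = np.array([3.5803, 3.0529, 8.1185, 6.9327, 8.318, 7.9572, 8.7641,
                      13.966, 13.3252, 13.0888, 13.0338, 15.9565, 19.9154,
                      15.6276, 17.302, 21.3126, 27.6469, 32.7741, 37.8582,
                      35.6037])

# Fit log t = p * log n + log c, i.e. t = c * n**p
exponent, log_prefactor = np.polyfit(np.log(n_atoms), np.log(rugb_time), 1)
prefactor = float(np.exp(log_prefactor))

print(f"fitted exponent p = {exponent:.2f}, prefactor c = {prefactor:.4f}")
```

A fitted exponent well below 2 would be consistent with the observation that the measured run-time stays far under the worst-case bound.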

Plot of run-time of the UGB algorithm with the two best-fit functions

In Table

The Table

Numerical results of using RUGB and GB methods in protein structure determination

| PDB Name | Atoms | RUGB error (5 Å cut-off) | GB error1 (8 Å cut-off) | GB error2 (5 Å cut-off) |
|---|---|---|---|---|
| 2DX2 | 174 | 2.31E-11 | 7.81E-12 | 4.80E-05 |
| 1ID7 | 189 | 8.62E-14 | 1.94E-13 | 8.48E-08 |
| 1B5N | 332 | 1.93E-10 | 1.87E-07 | 4.31E+07 |
| 1FW5 | 332 | 1.65E-12 | 2.31E-08 | 1.55E+00 |
| 1SOL | 353 | 7.33E-13 | 1.58E-05 | 1.70E+04 |
| 1JAV | 360 | 2.78E-12 | 3.33E-03 | 4.97E+01 |
| 1MEQ | 405 | 2.43E-12 | 4.54E-08 | 2.21E+04 |
| 1AMB | 438 | 7.11E-12 | 3.01E-09 | 1.11E+00 |
| 1R7C | 532 | 8.62E-10 | 1.54E-2 | 6.07E+12 |
| 1HLL | 540 | 2.83E-12 | 2.04 | 1.83E+09 |
| 1VII | 596 | 3.56E-10 | 0.373 | 1.52E+05 |
| 1HIP | 617 | 4.80E-10 | 1.25E+5 | N.A. |
| 1ULR | 677 | 3.84E-10 | 3.20E+3 | 7.33E+09 |
| 1BOM | 700 | 1.36E-09 | 2.7E-2 | 1.68E+12 |
| 1AIK | 729 | 9.19E-09 | 26.9 | N.A. |
| 1CEU | 854 | 3.15E-10 | 5E-5 | 9.35E+09 |
| 1KVX | 954 | 7.21E-04 | 977.49 | 7.45E+30 |
| 1VMP | 1166 | 1.01E-06 | 2.78071E+13 | N.A. |
| 1HSM | 1251 | 5.88E-07 | 1857.809626 | 1.37E+15 |
| 1HAA | 1310 | 4.49E-10 | 83.15 | 6.62E+09 |

All errors are in Å.

N.A. means that the protein structure could not be determined because of a large numerical error.

E-5 means 10^{-5} and E+5 means 10^{5}; others follow similarly

For RUGB error and GB error2, the set of distances for each tested protein was prepared with a cut-off distance of 5 Å. For GB error1, the set of distances for each tested protein was prepared with a cut-off distance of 8 Å.

In Table

Using a 5 Å cut-off distance, the GB algorithm fails to produce a complete protein structure in some instances because the round-off error gets out of control. For the 8 Å cut-off distance, the given set of pairwise distances is much denser. This work verifies the importance of the updating that is used in both the RUGB and the UGB algorithms. Both algorithms can indeed determine a protein structure with high accuracy.
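The effect of the cut-off on the density of the distance data can be illustrated with a small sketch. The coordinates below are synthetic random points in a 20 Å box, not PDB data, and `sparse_distances` is a hypothetical helper that keeps only the pairs closer than the cut-off; it is meant as an assumption about how such test sets can be prepared, not as the authors' preparation script.

```python
import numpy as np

def sparse_distances(coords, cutoff):
    """Return {(i, j): distance} for all atom pairs i < j closer than `cutoff`."""
    coords = np.asarray(coords, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    rows, cols = np.triu_indices(len(coords), k=1)
    keep = dist[rows, cols] < cutoff
    return {(int(i), int(j)): float(dist[i, j])
            for i, j in zip(rows[keep], cols[keep])}

# Synthetic coordinates (assumed): 200 points drawn uniformly in a 20 A cube
rng = np.random.default_rng(0)
coords = rng.uniform(0.0, 20.0, size=(200, 3))

d5 = sparse_distances(coords, 5.0)
d8 = sparse_distances(coords, 8.0)
total_pairs = len(coords) * (len(coords) - 1) // 2
density5 = len(d5) / total_pairs
density8 = len(d8) / total_pairs
```

Every pair kept at 5 Å is also kept at 8 Å, so the larger cut-off can only add distances and therefore offers more candidate bases at each build-up step.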

Conclusions

A very accurate protein structure is essential to understand the function and dynamics of the protein in biological systems and activities. Applications of distance geometry in protein structure determination arise from the fact that pairwise distances of atoms in a protein can often be obtained from experiments or from our knowledge of chemistry. Hence a protein structure can be determined if there exists a solution to the distance geometry problem. However, the problem has been proved to be NP-complete. GB algorithms do not solve all distance geometry problems. In the cases where they do give a solution, GB algorithms can determine a protein structure efficiently and accurately. In the GB algorithm, the positions of atoms are determined iteratively and rely on the already determined positions of other atoms, which causes the accumulation of numerical errors. The strategy of updating allows us to control the size of the numerical errors. However, in the UGB algorithm updating requires implementing an expensive step that contributes up to ^{4})

The RUGB algorithm has been shown to control numerical errors and to be effective. However, this paper provides only theoretical studies of the method. Practical problems, such as NMR structure determination and protein structure prediction, generally provide distance ranges rather than exact distances. In the future, we will address the application of RUGB methods in these cases. Also, the theoretical results provide only an upper bound on the run-time when a sparse set of distances is given. More advanced methods should also be considered. Applications of knowledge in graph theory or other advanced data structures may improve the algorithm further and will be a topic of future research.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

RTD carried out the programming and numerical tests. CE participated in the design of the study and helped to draft the manuscript. DW participated in the design of the study and helped to draft the manuscript. All authors read and approved the final manuscript.

Acknowledgements

The authors would like to thank the National Institutes of Health (NIH) and National Center for Research Resources (NCRR) Grant P20 RR16481 (Kentucky Biomedical Research Infrastructure Network) and National Science Foundation (NSF) Kentucky EPSCoR Research Enhancement Grant (REG) for support.

This article has been published as part of