Title: NMR Peak Assignment: Better Algorithm
1NMR Peak Assignment: Better Algorithm
- Frederick Vizeacoumar
- Tommy Chu
2Agenda
- Introduction
- Problem Definition
- Previous Works
- Input Parameters
- Design
- Results
- HSQC and Survey on Dipolar Coupling
- Conclusion
5Introduction
- Nuclear Magnetic Resonance (NMR)
  - Uses a strong magnetic field to align nuclei (isotopes).
  - Radio-frequency radiation can then flip the aligned nuclear spins. When this spin transition occurs, the nuclei are said to be in resonance with the applied radiation.
6NMR measurement
- Chemical Shift
  - Measured in parts per million (ppm).
  - Electrons in the molecule have small magnetic fields.
  - When a strong magnetic field is applied, the electrons tend to oppose the applied field.
- NMR Spectrum
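As a small illustration of the ppm unit, the sketch below uses the standard definition of chemical shift as a frequency offset from a reference signal divided by the spectrometer frequency. The function name `chemical_shift_ppm` and the numbers are illustrative assumptions, not values from this project.

```python
def chemical_shift_ppm(signal_hz, reference_hz, spectrometer_mhz):
    """Return the chemical shift in ppm.

    An offset in Hz divided by the spectrometer frequency in MHz is
    already a ratio in parts per million.
    """
    return (signal_hz - reference_hz) / spectrometer_mhz


# Illustrative numbers only: a peak 1500 Hz away from the reference
# signal on a 500 MHz spectrometer.
shift = chemical_shift_ppm(1500.0, 0.0, 500.0)
print(f"{shift} ppm")
```

Because the offset is normalized by the spectrometer frequency, the same nucleus gives the same ppm value on instruments of different field strength.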
7NMR spectroscopy
- Used to study physical, chemical, and biological properties.
- Problem
  - The sequence is identified.
  - The complete structure is unknown.
  - The basic structure is known.
  - The structure corresponding to the individual amino acids (AAs) is unknown.
8Procedure to determine protein structure using NMR
- Three steps
  - Data generation
    - Involves relating resonance peaks to AAs and forming spin systems.
  - Data interpretation
    - Involves matching spin systems to amino acids, providing inter- and intra-AA distances and angles.
  - NMR structure calculation
    - Involves structure determination using Molecular Dynamics (MD) and Energy Minimization (EM).
10Project -- Data Interpretation
- Primary Goal
  - (Better) automated peak assignment.
- Steps of data interpretation
  - Map resonance peaks from different NMR spectra to the same residue.
  - Identify adjacency relationships.
  - Assign the segments to the protein sequence.
- Problems with existing algorithms
  - Low accuracy.
  - High time complexity.
11Peak Assignment
- The assignment procedure needs to use two crucial pieces of information
  - Different AA types have different distributions of spin systems.
  - The adjacency information (scalar coupling) between spin systems is obtained by identifying their common resonance frequencies.
- Enhancement techniques
  - HSQC
  - Dipolar coupling
13Previous Works
- [1] formulated the NMR assignment problem as a Constrained Bipartite Matching (CBM) problem.
- [1] proposed a naïve (two-layer) algorithm.
- [1,6] proved that D-string CBM is NP-hard.
- [6] proposed two approximation algorithms.
- [5] applied branch-and-bound techniques.
- [9] attempted to solve CBM using extensive search techniques from artificial intelligence.
14Bipartite Matching
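- $G = (U \cup V, E)$
  - $U$ is the sequence of AAs
  - $V$ is the set of spin systems
  - $U \cap V = \emptyset$
- A matching $M \subseteq E$ gives a subgraph $G' = (U \cup V, M) \subseteq G$ in which no two edges $m \in M$ share a vertex.
[Figure: bipartite graph linking the protein sequence (A, G, C, T) to the spin systems (1, 2, 3, 4)]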
16Bipartite Matching
- Perfect Matching
  - Every node must be covered by an edge of the matching.
- Weighted Bipartite Matching
  - Each edge is associated with a weight.
- Maximum (Perfect) Weighted Bipartite Matching
  - Maximize the total weight over all $m \in M$.
[Figure: bipartite graph linking the protein sequence (A, G, C, T) to the spin systems (1, 2, 3, 4)]
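The maximum (perfect) weighted matching described above can be illustrated by brute force on a toy instance with four residues and four spin systems. This is only a sketch: the edge weights are invented for the example, and trying every permutation is practical only for tiny inputs.

```python
from itertools import permutations

# Hypothetical edge weights: weights[residue][k] is the weight of the
# edge between that residue and spin system k + 1.
weights = {
    "A": [1, 2, 9, 3],
    "G": [2, 8, 1, 4],
    "C": [7, 1, 2, 3],
    "T": [1, 3, 2, 6],
}


def max_weight_perfect_matching(weights):
    """Try every one-to-one assignment and keep the heaviest one."""
    residues = list(weights)
    best_score, best_matching = float("-inf"), None
    for perm in permutations(range(len(residues))):
        score = sum(weights[r][perm[i]] for i, r in enumerate(residues))
        if score > best_score:
            best_score = score
            best_matching = {r: perm[i] + 1 for i, r in enumerate(residues)}
    return best_score, best_matching


score, matching = max_weight_perfect_matching(weights)
print(score, matching)
```

Each permutation is a perfect matching because every residue receives exactly one spin system and no spin system is used twice.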
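17Constrained Bipartite Matching
- For every segment in $V$
  - If $(u_i, v_j) \in M$, where $u_i \in U$ and $v_j \in V$,
  - then $(u_{i+1}, v_{j+1}) \in M$, so consecutive spin systems of a segment map to consecutive residues.
- D-string CBM
  - The problem restricted to segments of size at most $D$.
[Figure: bipartite graph linking the protein sequence (A, G, C, T) to the spin systems (1, 2, 3, 4)]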
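The segment constraint can be checked mechanically. The sketch below assumes a matching is stored as a dictionary from spin system name to sequence position and that each segment is an ordered list of spin system names; both representations and the name `is_feasible` are assumptions made for illustration.

```python
def is_feasible(matching, segments):
    """Check the CBM constraint on a candidate matching.

    matching: dict mapping spin system name -> sequence position.
    segments: list of segments, each an ordered list of spin system names.
    """
    positions = list(matching.values())
    # No two spin systems may share a sequence position.
    if len(positions) != len(set(positions)):
        return False
    for segment in segments:
        assigned = [matching.get(v) for v in segment]
        if all(p is None for p in assigned):
            continue  # an entirely unassigned segment is allowed
        if any(p is None for p in assigned):
            return False  # a segment must be assigned as a whole
        # Consecutive spin systems must sit on consecutive residues.
        if any(b != a + 1 for a, b in zip(assigned, assigned[1:])):
            return False
    return True


segments = [["v1", "v2"], ["v3"]]
print(is_feasible({"v1": 3, "v2": 4, "v3": 1}, segments))
print(is_feasible({"v1": 3, "v2": 5, "v3": 1}, segments))
```

The first call satisfies the constraint; the second does not, because v1 and v2 belong to one segment but are placed on non-adjacent positions.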
18Two-Layer Algorithm [1]
- First layer
  - Filter out unlikely assignments for long strings.
- Second layer
  - Try all possible combinations of assignments for the long strings and return the one with the maximum score.
19Approximation Algorithms [2]
- 2-approximation algorithm
  - Let $M$ be an optimal matching on $G$.
  - Every edge incident to a vertex conflicts with at most 2 edges in $M$.
- $3 \log D$ approximation algorithm
  - If the longest string in $V$ is at most four times as long as the shortest string,
  - then the greedy algorithm finds a solution whose weight is at least 1/6 of the optimal.
20Branch-and-Bound Algorithm
- A systematic method for solving optimization problems.
- Constructs a search tree and applies a carefully selected criterion to determine which node to expand next.
- Exponential time in the worst case.
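A generic branch-and-bound search over segment placements might look like the sketch below. It is not the algorithm of [5]; it only illustrates the search tree and the pruning criterion. The input layout (`placement_scores[s][p]` is the score of placing segment `s` with its first spin system at position `p`), the optimistic bound, and all numbers are assumptions.

```python
def branch_and_bound(placement_scores, lengths):
    """Place every segment without overlap, maximizing the total score."""
    n = len(lengths)
    # best_rest[s]: optimistic score still obtainable from segments s..n-1.
    best_rest = [0] * (n + 1)
    for s in range(n - 1, -1, -1):
        best_rest[s] = best_rest[s + 1] + max(placement_scores[s])
    best = {"score": float("-inf"), "starts": None}

    def expand(s, used, total, starts):
        # Prune: even the optimistic bound cannot beat the incumbent.
        if total + best_rest[s] <= best["score"]:
            return
        if s == n:
            best["score"], best["starts"] = total, tuple(starts)
            return
        # Visit the most promising child nodes first.
        order = sorted(enumerate(placement_scores[s]), key=lambda t: -t[1])
        for start, score in order:
            span = set(range(start, start + lengths[s]))
            if used & span:
                continue  # overlaps an already placed segment
            expand(s + 1, used | span, total + score, starts + [start])

    expand(0, set(), 0, [])
    return best["score"], best["starts"]


# Hypothetical instance: a sequence of 5 residues, segments of length 2 and 1.
scores = [[6, 2, 7, 1], [1, 5, 4, 3, 2]]
print(branch_and_bound(scores, [2, 1]))
```

The bound is cheap to compute but loose; in the worst case the search still visits an exponential number of nodes.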
21AUTOASSIGN/AUTOPEAK
- 5 major searches
  - Make strongest matches
  - Allow degenerate shifts
  - Extend assigned segments
  - Match weaker spin systems
  - Finish assignments
- Claims to have 98% accuracy.
- Tested only on single-stranded RNA sequences.
- Infeasible to compute for protein sequences.
23Input Parameters
- Protein Sequence
  - Sequence of positions and AAs
- Spin Systems
  - Segments of chemical shifts separated by commas
- Score Scheme
  - A table storing the score between an AA and a range of chemical shifts
24Example input for protein sequence
- 1 GLY
- 2 SER
- 3 VAL
- 4 GLU
- 5 GLN
- 6 ILE
- 7 SER
- 8 GLY
26Example input for spin system (three segments)
- 12.5
- 13.5 ,
- 5.35
- 6.4
- 7.21 ,
- 16.1
- 17.2
28Example input for score scheme
- 5
- GLY
- SER
- VAL
- GLU
- GLN
- 4
- 0 5
- 5 10
- 10 15
- 15 20
- 1 5 2 4
- 2 1 4 0
- 1 2 3 4
- 2 4 1 3
- 2 3 3 6
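To show how the three inputs fit together, the sketch below loads the example values and scores one segment against a stretch of the sequence. It rests on assumptions the slides do not state: row i of the score table belongs to the i-th listed amino acid, column j belongs to the j-th shift range, ranges are half-open, and a residue missing from the table (ILE here) makes a placement unscorable. The function names are hypothetical.

```python
# Values copied from the three example inputs above.
sequence = ["GLY", "SER", "VAL", "GLU", "GLN", "ILE", "SER", "GLY"]
segments = [[12.5, 13.5], [5.35, 6.4, 7.21], [16.1, 17.2]]
amino_acids = ["GLY", "SER", "VAL", "GLU", "GLN"]
shift_ranges = [(0, 5), (5, 10), (10, 15), (15, 20)]
score_table = [
    [1, 5, 2, 4],
    [2, 1, 4, 0],
    [1, 2, 3, 4],
    [2, 4, 1, 3],
    [2, 3, 3, 6],
]


def residue_score(aa, shift):
    """Score of one chemical shift against one amino acid, or None."""
    if aa not in amino_acids:
        return None  # assumption: residues absent from the table cannot be scored
    row = score_table[amino_acids.index(aa)]
    for column, (low, high) in enumerate(shift_ranges):
        if low <= shift < high:  # assumption: half-open ranges
            return row[column]
    return None


def placement_score(segment, start):
    """Sum of residue scores when the segment begins at 0-based index start."""
    if start + len(segment) > len(sequence):
        return None  # the segment would run past the end of the sequence
    total = 0
    for offset, shift in enumerate(segment):
        score = residue_score(sequence[start + offset], shift)
        if score is None:
            return None
        total += score
    return total


print(placement_score(segments[0], 0))
```

Placing the first segment on GLY-SER adds the GLY score for the 10 to 15 range to the SER score for the same range.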
30Implemented Design
- CBM two-layer approximation algorithm
  - As mentioned earlier, we pose peak assignment as a bipartite graph matching problem with some constraints.
  - The key idea is that multi-dimensional NMR spectra contain inter-residue peaks that convey connectivity information between residues.
  - We match the inter- and intra-residue peaks using their chemical shifts, which makes the connectivity information straightforward to obtain.
31CBM Two-Layer Approximation
- layerOne(sequence, spin, score, threshold)
  - for every segment $V_i$ in spin do
    - for every position $U_j$ in the sequence do
      - if score$(V_i, U_j) \ge$ threshold then
        - mark $U_j$ as a possible assignment position for $V_i$
- layerTwo(sequence, spin, score)
  - $s_{\max} \leftarrow -\infty$
  - for every legal combination of positions from layerOne do
    - calculate its score and call it $s_i$
    - if $s_i > s_{\max}$ then
      - $s_{\max} \leftarrow s_i$ and store the current positions as the final assignment
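The two layers above can be sketched in Python as follows. The input layout is an assumption: `placement_scores[s][p]` holds the precomputed score of placing segment `s` with its first spin system at position `p`, and a combination is treated as legal when no two segments overlap. The numbers are invented.

```python
from itertools import product


def layer_one(placement_scores, threshold):
    """Keep, for every segment, the start positions scoring at least threshold."""
    return [
        [p for p, score in enumerate(row) if score >= threshold]
        for row in placement_scores
    ]


def layer_two(placement_scores, lengths, candidates):
    """Try every combination of surviving positions and keep the best legal one."""
    best_score, best_starts = float("-inf"), None
    for combo in product(*candidates):
        used, legal = set(), True
        for start, length in zip(combo, lengths):
            span = set(range(start, start + length))
            if used & span:
                legal = False  # two segments would cover the same residue
                break
            used |= span
        if not legal:
            continue
        total = sum(placement_scores[s][p] for s, p in enumerate(combo))
        if total > best_score:
            best_score, best_starts = total, combo
    return best_score, best_starts


# Hypothetical instance: a sequence of 5 residues, segments of length 2 and 1.
scores = [[6, 2, 7, 1], [1, 5, 4, 3, 2]]
candidates = layer_one(scores, threshold=3)
print(candidates)
print(layer_two(scores, [2, 1], candidates))
```

If the threshold is too strict and a segment is left with no candidate position, `product` yields nothing and `layer_two` returns no assignment, which reflects how sensitive the method is to the threshold choice.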
32 2-Approximation Algorithm
- The key idea is to form a weighted bipartite graph and use it to find the matching.
- We use some restrictions and constraints to identify the leading innermost edge.
- We find an innermost edge, collect it together with all its conflicting edges, take the minimum weight in this set, subtract it from the edges in the set, and remove the edges whose weight becomes 0.
- The process is applied recursively to the resulting graph to obtain a feasible matching.
33 3 log D Approximation
- If there are $m$ different segments in the spin system, we group them into overlapping groups based on a formula over the segment lengths.
- Find the score for each group, and keep the feasible assignment with the maximum final score.
- This also uses a weighted bipartite graph.
- As explained before, this problem includes additional constraints ensuring that segments do not overlap.
34 3 log D Approximation
- 3logD_Approximation(Score U, Spin V)
  - $r \leftarrow l_{\max} / l_{\min}$, where $l_{\max}$ and $l_{\min}$ are the maximum and minimum lengths of the strings in $V$
  - group $V$ into $g = \max(2, \log_4 r)$ subsets $V_i$ such that a string $s$ belongs to $V_i$ when $4^{i-1} \le |s| / l_{\min} \le 4^i$
  - for every $i \in \{1, 2, \dots, g\}$
    - compute the set $E_i$ of edges of $G$ incident to strings in $V_i$
    - initialize $M_i \leftarrow \emptyset$
    - while $E_i \neq \emptyset$
      - find an edge $e \in E_i$ of maximum weight
      - add $e$ to $M_i$ and delete $e$ and all edges conflicting with $e$ from $E_i$
    - greedily extend $M_i$ to a maximal feasible matching of $G$
  - output the heaviest one among the $M_i$
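The grouping and greedy steps can be sketched as below. The representation is an assumption: an edge is a tuple `(weight, segment, start)`, two edges conflict when they use the same segment or overlapping positions, and a string whose length falls exactly on a group boundary is put in the lower group. The instance is invented.

```python
def conflicts(e, f, lengths):
    """Two edges conflict if they share a segment or overlap on the sequence."""
    _, s1, p1 = e
    _, s2, p2 = f
    if s1 == s2:
        return True
    return p1 < p2 + lengths[s2] and p2 < p1 + lengths[s1]


def greedy(edges, lengths, chosen=()):
    """Repeatedly take the heaviest edge that conflicts with nothing chosen."""
    chosen = list(chosen)
    for e in sorted(edges, reverse=True):
        if all(not conflicts(e, f, lengths) for f in chosen):
            chosen.append(e)
    return chosen


def group_index(length, lmin):
    """Smallest i >= 1 with length / lmin <= 4**i (boundary choice is an assumption)."""
    i = 1
    while length > lmin * 4 ** i:
        i += 1
    return i


def approx_3_log_d(edges, lengths):
    lmin = min(lengths)
    groups = {}
    for e in edges:
        groups.setdefault(group_index(lengths[e[1]], lmin), []).append(e)
    best = []
    for group_edges in groups.values():
        matching = greedy(group_edges, lengths)             # greedy inside the group
        matching = greedy(edges, lengths, chosen=matching)  # extend to a maximal matching
        if sum(w for w, _, _ in matching) > sum(w for w, _, _ in best):
            best = matching
    return best


# Hypothetical instance: segment 0 has length 1, segment 1 has length 5.
lengths = [1, 5]
edges = [(3, 0, 0), (9, 0, 2), (6, 0, 5), (10, 1, 0), (4, 1, 1)]
print(approx_3_log_d(edges, lengths))
```

Sorting by weight and skipping conflicting edges is equivalent to the loop in the pseudocode, which removes all edges conflicting with each chosen edge.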
35Improved Two-Layer Algorithm
- The threshold in the CBM algorithm is good at eliminating unlikely assignments, but its value has to be determined by trial and error.
- [6] considers the $3 \log D$ approximation better than the 2-approximation.
- In the $3 \log D$ approximation, the subsets of segments in the spin system are partitioned by the formula $4^{i-1} \le |s| / l_{\min} \le 4^i$. Does this always improve the score?
36Improved Two-Layer Algorithm
- layerOne(sequence U, spin V, score)
  - for every subset $V_i$ of the spin system set $V$ do
    - $E_i \leftarrow$ all edges incident from $V_i$ to $U$
    - $M_i \leftarrow \emptyset$
    - while $E_i \neq \emptyset$
      - find an edge $e \in E_i$ of maximum weight
      - add $e$ to $M_i$ and delete $e$ and all edges conflicting with $e$ from $E_i$
    - mark the positions in the sequence matched by $M_i$ as a possible assignment
37Improved Two-Layer Algorithm
- layerTwo(sequence, spin, score, possible assignment)
  - $s_{\max} \leftarrow -\infty$
  - for every possible assignment position in $U$ do
    - calculate the score and call it $s_i$
    - if $s_i > s_{\max}$ then
      - $s_{\max} \leftarrow s_i$ and store the current position as the final assignment
- If there are $m$ groups of spin segments, the total number of searches is $2^m - 1$, one per non-empty subset.
39Current Results
40Expected Results
42Interface with HSQC
- The main task in Heteronuclear Single Quantum Correlation (HSQC) is identifying the NH amide side chain.
- It is an experiment that yields data on the NHx groups directly attached to a proton.
- Every AA except proline produces one signal for its N-H amide group; depending on the pH and the observed chemical shift, the NH side chain is visible.
43HSQC cont
- Folded proteins or protein domains display a broad distribution of NMR frequencies, resulting in well spread-out peaks.
- Unfolded proteins do not exhibit this spread, resulting in overlapping frequencies.
- The HSQC technique adds ligand-induced signal shifts, which change the overlap.
- This process also involves a few calculations that result in better spin system values.
44Survey on Dipolar Coupling
- Whereas HSQC identifies the NH amide side chain, dipolar coupling identifies the Cα-Cβ side chain.
- Provides long-range information that is otherwise lacking in NMR experiments.
- This is also an experimental process for improving the spin systems, which are the main input in identifying the protein structure.
- The results of this experiment enable us to determine side-chain rotamer states using a rotamer prediction algorithm.
45Dipolar Coupling cont
- NMR solution structures are determined primarily using restraints derived from nuclear Overhauser effects (NOEs). These restraints only cover proton-proton distances of less than 5 Å, so NMR is not efficient for elongated molecules.
- Elongated molecules are present in the helical structure of the protein sequence.
- The local errors on elongated molecules tend to accumulate over the length, resulting in poor protein structure determination.
46Dipolar Coupling Cont
- The size of the dipolar coupling observed between two nuclei P and Q is given by
  $$D^{PQ}(\theta, \phi) = D_a^{PQ} \left[ (3\cos^2\theta - 1) + 1.5\, R \sin^2\theta \cos 2\phi \right]$$
- where $R$ is the rhombicity (shape of the molecule).
- The value obtained here helps to observe elongated molecules as well.
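The dipolar coupling formula above can be evaluated directly. The sketch below takes the angles in radians and the magnitude D_a in whatever unit the coupling is reported in; the function name and the sample values are assumptions chosen only to exercise the formula.

```python
import math


def dipolar_coupling(theta, phi, d_a, rhombicity):
    """D(theta, phi) = D_a * [(3 cos^2 theta - 1) + 1.5 R sin^2 theta cos 2 phi]."""
    axial = 3.0 * math.cos(theta) ** 2 - 1.0
    rhombic = 1.5 * rhombicity * math.sin(theta) ** 2 * math.cos(2.0 * phi)
    return d_a * (axial + rhombic)


# Illustrative values only: D_a = 10 and R = 0.2.
print(dipolar_coupling(0.0, 0.0, 10.0, 0.2))          # vector along the principal axis
print(dipolar_coupling(math.pi / 2, 0.0, 10.0, 0.2))  # vector perpendicular to it
```

At theta = 0 the rhombic term vanishes and the coupling is 2 D_a, while at theta = 90 degrees the coupling depends on phi through the rhombicity.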
48Conclusion
- A major part of the NMR technique is the peak assignment process.
- Our main goal of finding a better algorithm led us to think about improving the matching and the score scheme rather than the computational process of the algorithms.
- In the existing algorithm, if there are m segments, only a reduced number of subsets is considered before the matching is made. We thought it worthwhile to examine whether taking all possible subsets yields a better match score.
49Future Directions
- As in the CBM two-layer algorithm, if the NMR experiment could somehow help identify the threshold value, the number of checks in our improved version could be reduced using this threshold.
- By extracting the results of the HSQC experiment and developing algorithms for this data set, we can get better spin systems for the NH amide side chain and improve our assignment process.
- Using dipolar coupling data to identify the Cα-Cβ side chain would give us even better spin segments. This might lead to better protein structure determination.
50Acknowledgements
- Our Sincere thanks to
- Dr. Guohui Lin, Professor, U of A.
- Mr. Xiang Wan, Ph.D. Student, U of A.
- Mr. Jon McCall, Spectrum Research LLC.
- Dr. Gaetano T. Montelione, Rutgers Univ.
51References
- [1] Y. Xu, D. Xu, D. Kim, V. Olman, J. Razumovskaya, and T. Jiang. "Automated assignment of backbone NMR peaks using constrained bipartite matching", IEEE Computing in Science & Engineering, 4:50-62, 2002.
- [2] C. Bartels, T. Xia, M. Billeter, P. Güntert, and K. Wüthrich. "The program XEASY for computer-supported NMR spectral analysis of biological macromolecules", J. Biomol. NMR, 6:1-10, 1995.
- [3] K. P. Neidig, M. Geyer, A. Görler, C. Antz, R. Saffrich, W. Beneicke, and H. R. Kalbitzer. "AURELIA, a program for computer-aided analysis of multidimensional NMR spectra", J. Biomol. NMR, 6:255-270, 1995.
- [4] B. R. Brooks, R. E. Bruccoleri, B. D. Olafson, D. J. States, S. Swaminathan, and M. Karplus. "CHARMM: A Program for Macromolecular Energy, Minimization, and Dynamics Calculations", J. Comp. Chem., 4:187-217, 1983.
- [5] G. Lin, D. Xu, Z-Z. Chen, T. Jiang, J. Wen, and Y. Xu. "An Efficient Branch-and-Bound Algorithm for the Assignment of Protein Backbone NMR Peaks", in Proceedings of the IEEE Computer Society Bioinformatics Conference 2002 (CSB 2002), pp. 165-174.
- [6] Z-Z. Chen, T. Jiang, G. Lin, J. Wen, D. Xu, and Y. Xu. "Better Approximation Algorithms for NMR Spectral Peak Assignment", The Second Workshop on Algorithms in Bioinformatics (WABI), LNCS 2454, pp. 82-96, 2002.
- [7] F. Tian, H. Valafar, and J. H. Prestegard. "A dipolar coupling based strategy for simultaneous resonance assignment and structure determination of protein backbones", Journal of the American Chemical Society, 123:11791-11796, 2001.
- [8] R. Bar-Yehuda and S. Even. "A local-ratio theorem for approximating the weighted vertex cover problem", Annals of Discrete Mathematics, 25:27-46, 1985.
- [9] D. E. Zimmerman, C. A. Kulikowski, Y. Huang, W. Feng, M. Tashiro, S. Shimotakahara, C-Y. Chien, R. Powers, and G. T. Montelione. "Automated Analysis of Protein NMR Assignments Using Methods from Artificial Intelligence", J. Mol. Biol., 269:592-610, 1997.
- [10] J. J. Warren and P. B. Moore. "Application of dipolar coupling data to refinement of the solution structure of the Sarcin-Ricin loop RNA", Journal of Biomolecular NMR, 20:311-323, 2001.
- [11] M. Andrec, Y. Harano, M. P. Jacobson, R. A. Friesner, and R. M. Levy. "Complete Protein Structure Determination Using Backbone Residual Dipolar Couplings and Side chain Rotamer Prediction".