Research Paper: Proteins as the Database of Mankind



The Database of Mankind

            Proteins are integral building blocks of life. Their structure and function drive nearly every process in the human body. Although a vast amount is already known about proteins, science is still working to uncover the far larger body of knowledge that remains unknown. Because proteins are involved in almost every bodily function, identifying both their structure and their function is crucial to advancing the field of medicine. Countless diseases are caused by proteins that malfunction. If man can gain a better understanding of these proteins, humanity stands to benefit greatly.
            However, the pivotal problem in understanding proteins, and through them the workings of the human body, is not the process of classifying them but rather organizing and analyzing the immense quantities of data that have been amassed. In recent years the amount of protein information deposited in the Protein Data Bank has grown rapidly. By combining the skills of mathematicians and biologists, identifying the region of a protein responsible for its function becomes a distinct possibility. Although it may be impossible to define the functional region of every protein, science is making extraordinary strides toward that goal.
            A recent study conducted by Dr. Tracey Bray and published in a BioMed Central journal discusses the idea of using bioinformatics to predict the functional site of a protein. In recent years, structural genomics groups have made a substantial effort to define the structures of proteins, yet the functions of many of these proteins remain unknown. The study aims to shed light on the function of a protein by identifying the part of the protein that is most likely integral to that function. The most common method of determining the function of a protein is to compare its amino acid sequence with those of other proteins. Since functionally important residues are assumed to be evolutionarily conserved, function can be predicted by finding residues shared with proteins that have already been characterized. Although this approach works well when a characterized relative exists, it has an equally obvious drawback: it cannot help with proteins that have no characterized relatives. Similarly, residues that are conserved for structural rather than functional reasons will yield a false conclusion about function. Thus, another bioinformatic technique must be instituted.
            Newer structure-based methods fall into two categories: those that identify structural similarities with a known protein whose function has been characterized, and those that predict functional sites from geometric and electrostatic properties. To make use of these two techniques, databases and tools are constructed not only to record the structures and functions of known proteins, but also to predict the function of unknown proteins using algorithms. In the case of enzymes, it has been found that the active site usually lies in the largest surface cleft of the enzyme. With analysis from these tools, such sites can be identified with startling accuracy. Although structural approaches alone yield high accuracy, active sites are identified most precisely when structural and sequence conservation techniques are combined.
            Despite the accuracy of this combined technique, very few large-scale tools are available to those who need this information. Thus the strides made in implementing the SitesIdentify functional site prediction tool are crucial. Although similar databases and prediction software exist, the author stresses the importance not only of its accessibility but also of its intrinsically user-friendly qualities.
            SitesIdentify operates by placing a 2 Å grid over the protein structure and applying a uniform charge to each non-hydrogen atom. The electrostatic potential is then calculated using finite difference Poisson-Boltzmann calculations, and the location of the peak potential is found. This location, which is only a prediction and therefore carries some error, is defined as the centroid of the functional site. The second method combines the approach described above with sequence conservation techniques. Homologues are identified by submitting the sequence to PSI-BLAST, and a conservation score is then calculated for each position from its amino acid diversity, stereochemical diversity and gap occurrence. The peak potential is determined in the same manner as before, but the charge is placed on a single central atom of each amino acid and weighted by that residue's conservation score.
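            The grid search described above can be illustrated with a short sketch. It is not the SitesIdentify implementation: a plain sum of weight divided by distance stands in for the finite difference Poisson-Boltzmann calculation, and the function name peak_potential_point, its parameters and the min_dist cutoff are all hypothetical choices made for illustration.

```python
import itertools
import math


def grid_axis(lo, hi, spacing):
    """Return evenly spaced grid coordinates from lo up to at most hi."""
    steps = int(math.floor((hi - lo) / spacing + 1e-9))
    return [lo + i * spacing for i in range(steps + 1)]


def peak_potential_point(atoms, weights=None, spacing=2.0, min_dist=1.0):
    """Return (grid_point, potential) for the grid point with the highest potential.

    atoms   -- list of (x, y, z) coordinates of non-hydrogen atoms, in angstroms
    weights -- per-atom charge weights; None means a uniform charge of 1.0
    spacing -- grid spacing in angstroms (2.0 matches the 2 angstrom grid in the text)

    ASSUMPTION: the potential is a simple sum of weight / distance, used here
    as a stand-in for a real Poisson-Boltzmann solver.
    """
    if weights is None:
        weights = [1.0] * len(atoms)

    # Build the grid over the bounding box of the structure.
    axes = []
    for dim in range(3):
        lo = min(atom[dim] for atom in atoms)
        hi = max(atom[dim] for atom in atoms)
        axes.append(grid_axis(lo, hi, spacing))

    best_point, best_phi = None, -math.inf
    for point in itertools.product(*axes):
        phi = 0.0
        for atom, weight in zip(atoms, weights):
            # Clamp short distances so a grid point on top of an atom stays finite.
            r = max(math.dist(point, atom), min_dist)
            phi += weight / r
        if phi > best_phi:
            best_point, best_phi = point, phi
    return best_point, best_phi
```

            Passing weights=None mimics the uniform charge method, while passing one conservation-derived weight per atom mimics the combined method; in both cases the returned grid point plays the role of the predicted centroid.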
            When a submission is made, SitesIdentify follows one of two approaches. If the user selects the conservation approach, homologues are identified through the Conserved Residue Colouring (CRC) program, which runs on the sequences contained in the SEQRES records of the PDB file. From this, a conservation score is produced for every residue, and these scores help determine the location of the peak potential on the protein. When SitesIdentify fails to find any homologues, it switches to the method that uses charge-based calculations alone.
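            The conservation step and its fallback can be sketched as follows. The scoring formula here is an assumption: it uses only amino acid diversity (as Shannon entropy) and gap occurrence, leaves out stereochemical diversity, and is not the score computed by the CRC program. The function names column_conservation and residue_weights are hypothetical.

```python
import math
from collections import Counter


def column_conservation(column):
    """Score one alignment column in [0, 1]; 1.0 means fully conserved with no gaps.

    ASSUMPTION: the score is (1 - normalized Shannon entropy) * (1 - gap fraction),
    an illustrative stand-in for the tool's actual conservation score.
    """
    residues = [c for c in column if c != "-"]
    if not residues:
        return 0.0
    total = len(residues)
    counts = Counter(residues)
    entropy = -sum((n / total) * math.log(n / total) for n in counts.values())
    max_entropy = math.log(20)  # 20 standard amino acids
    gap_fraction = 1 - total / len(column)
    return (1 - entropy / max_entropy) * (1 - gap_fraction)


def residue_weights(query, homologues):
    """Return one weight per residue of the query sequence.

    homologues -- sequences already aligned to the query (same length, '-' for gaps)

    With no homologues the function falls back to uniform weights, mirroring
    the switch to the charge-based method described in the text.
    """
    if not homologues:
        return [1.0] * len(query)
    rows = [query] + list(homologues)
    return [column_conservation([row[i] for row in rows]) for i in range(len(query))]
```

            For example, residue_weights("ACD", []) yields uniform weights, whereas supplying aligned homologues lowers the weight of positions that vary or contain gaps.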
            When a user does not select the conservation method, the CRC program is skipped and SitesIdentify immediately calculates the location of the peak potential using the uniform charge-weighting method. Residues that fall within a user-supplied radius of the predicted centroid are then selected. In this way the possible functional site residues are highlighted and returned in a list.
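            The final selection step amounts to a distance filter, sketched below. The rule that a residue is selected when at least one of its atoms lies within the radius is an assumption, as are the function name residues_near_centroid and the residue labels in the usage example.

```python
import math


def residues_near_centroid(residue_atoms, centroid, radius):
    """Return the sorted IDs of residues with at least one atom within radius.

    residue_atoms -- dict mapping a residue ID to a list of (x, y, z) atom coordinates
    centroid      -- (x, y, z) of the predicted functional site centre
    radius        -- user-supplied cutoff, in the same units as the coordinates

    ASSUMPTION: one atom inside the sphere is enough to select the residue.
    """
    selected = []
    for residue_id, coords in residue_atoms.items():
        if any(math.dist(atom, centroid) <= radius for atom in coords):
            selected.append(residue_id)
    return sorted(selected)


# Hypothetical residues and coordinates, for illustration only.
example = {
    "HIS57": [(1.0, 0.0, 0.0), (2.0, 1.0, 0.0)],
    "SER195": [(0.0, 3.0, 0.0)],
    "GLY10": [(20.0, 20.0, 20.0)],
}
print(residues_near_centroid(example, (0.0, 0.0, 0.0), 5.0))
```

            Widening the radius returns a longer list of candidate residues, so the choice of radius trades coverage of the true site against the number of unrelated residues included.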
            There appear to be many reasons why SitesIdentify is an important and possibly integral technology for the future of classifying the structure and function of proteins. First of all, SitesIdentify is accessible to anyone without a license or an account. Not only can anyone upload a structure file, but SitesIdentify also informs the submitter whether or not a functional site could be identified. Furthermore, the average time to calculate the peak potential was only six minutes using the conservation method and only two minutes using charge-based calculations alone. SitesIdentify can also distinguish between enzymes and non-enzymes with precision: because enzymes have large, surface-accessible clefts, it can use this property to tell the two apart.
            The prospect of identifying the structure and function of proteins through SitesIdentify is promising, and the study leaves the reader with a feeling of hope. Even though SitesIdentify has obvious imperfections, since the peak potential is only a prediction of where the functional site lies, the system appears to perform better than similar tools. Overall, the article was relatively difficult to understand because of the technical language, which is a barrier to readers who are not familiar with bioinformatic practices. However, the underlying message was clear. With the use of SitesIdentify it is possible to predict the functional site of a protein, and from that there is real potential to improve health care for protein-related illnesses. If the mechanisms of proteins' functions are discovered, man can find ways to inhibit or stimulate these sites in order to correct the problems that exist in the affected patient. This possibility is tantalizing. Perhaps in the near future these protein-based diseases and disorders will fade into our past.

Frugal Fitness