Bioinformatics Programming for Bioavailability Analysis of Sequence Patterns in Public Genomic Databases

Changsu Dong; Roy Lee; Joseph Sayad; Konstantinos Krampis


Advancements in Bioequivalence & Bioavailability

Changsu Dong¹, Roy Lee¹, Joseph Sayad², and Konstantinos Krampis¹,²,³

¹ Belfer Research Building, Weill Cornell Medical College and Hunter College, USA

² Department of Biological Sciences, Hunter College, USA

³ Department of Physiology and Biophysics, Cornell University, USA

*Corresponding author: Konstantinos Krampis, Department of Physiology and Biophysics, Institute for Computational Biomedicine, Weill Cornell Medical College, Cornell University, New York, USA

Submission: April 19, 2018; Published: May 10, 2018

DOI: 10.31031/ABB.2018.01.000508

ISSN 2640-9275
Volume 1 Issue 2

Abstract

In this study we present novel bioinformatics software for analyzing the bioavailability of short amino acid peptides in proteins found across the four phylogenetic kingdoms of Archaea, Bacteria, Mammals and Plants. To assess the bioavailability of these peptides, we used a set of large-scale protein databases from the National Center for Biotechnology Information (NCBI) and the Basic Local Alignment Search Tool (BLAST+) search program. In our results we present the counts of peptide matches across the phylogenetic kingdoms, and in further detail for Gram-positive and Gram-negative bacteria. Our bioinformatics software is written in Python and is made available within this publication, free for academic and non-profit use.

Results

In order to understand the prevalence of short peptide patterns in proteins across different kingdoms (Archaea, Bacteria, Mammals, and Plants), we performed a set of pattern searches using sequence alignments against the databases of the National Center for Biotechnology Information (NCBI) [1]. Specifically, we used the Basic Local Alignment Search Tool (BLAST+) with a local copy of the NCBI databases, and searched the databases with peptide sequence patterns as queries through the BLASTP protein alignment subprogram. For our database searches, we used a set of peptide sequences, each six amino acids long, which were first described in [2], in addition to a set of twelve amino acid peptides described in [3]. According to these studies, these small peptides are factors in diseases ranging from Parkinson’s and Alzheimer’s to diabetes, mainly due to their role in the formation of oligomer aggregates leading to amyloid plaques.

The innovation of our study was the use of a powerful computer server in our laboratory, in combination with novel computer code written in Python (Methods section), to data mine the complete set of peptide matches returned by BLAST+ searches of the NCBI databases. These databases contain a complete, non-redundant collection of reference genome sequences representative of all major organisms and phylogenetic tree clades. Using the NCBI FTP site, we downloaded the Archaea, Bacteria, Mammals, and Plants reference databases (Table 1) in FASTA format [4]. Each kingdom had multiple FASTA files, one for each genome included in the kingdom, which were concatenated into a single file per kingdom. Following this, each single file was searched for identities to the peptides from the studies described above using the BLAST+ programs, which were run with the parameters described in the Methods section.
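The concatenation of the per-genome FASTA files into one file per kingdom can be sketched as follows. This is a minimal illustration and not the code used in the study: the function name `concatenate_fasta`, the directory layout and the `*.faa` pattern are assumptions made for the example.

```python
import glob
import os


def concatenate_fasta(input_dir, output_path, pattern="*.faa"):
    """Merge every FASTA file matching `pattern` in `input_dir` into `output_path`.

    Returns the number of files merged. All names here are hypothetical.
    """
    target = os.path.abspath(output_path)
    # Collect the inputs before the output file is created, and never merge
    # the output file into itself.
    paths = [p for p in sorted(glob.glob(os.path.join(input_dir, pattern)))
             if os.path.abspath(p) != target]
    with open(output_path, "wb") as merged:
        for path in paths:
            last_byte = b"\n"
            with open(path, "rb") as single:
                # Copy in 1 MB chunks so large genome files are not loaded whole.
                for chunk in iter(lambda: single.read(1 << 20), b""):
                    merged.write(chunk)
                    last_byte = chunk[-1:]
            # Make sure the next header starts on its own line.
            if last_byte != b"\n":
                merged.write(b"\n")
    return len(paths)
```

A call such as `concatenate_fasta("plants", "all_plantBlast.faa")` would then produce the single per-kingdom file, assuming a hypothetical `plants` directory that holds the downloaded genome files.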

Table 1: Protein databases from NCBI used in our study.

In our results from the BLAST+ searches of the different short peptides, we observed 3 matches in the Plants and Mammals phylogenetic kingdoms, and 5 matches for the Archaea (Figure 1). The most numerous matches were for the Bacteria (266 matches), which was expected given the large number of species within this phylogenetic kingdom. To further clarify the results, we separated the bacterial species into Gram-positive (88 hits) and Gram-negative (178 hits). More detailed numbers are also presented for the different phyla of the Bacteria in each Gram category. For clarity, the results were also visualized in tree format (Figure 1), showing the number of matches across the different phylogenetic clades.

Figure 1: Number of matches by BLAST+ searches across the different phylogenetic kingdoms.

Methods

We performed four separate BLAST+ searches in the Archaea, Bacteria, Plants and Mammals phylogenetic kingdoms (Table 1), using the BLASTP sub-program for peptide alignment to the database. Before performing BLASTP, we formatted the FASTA files (.faa) that were downloaded from NCBI, and produced files in the format required (.pal, .pni) for BLASTP database search. For this, we used the makeblastdb command, which takes a FASTA file as input and outputs a database ready for BLASTP:

makeblastdb -in all_plantBlast.faa -parse_seqids -dbtype prot

The exact BLASTP command parameters used are shown below, with the Plant database as an example (the .pal / .pni suffixes of the database files are not required in the command):

blastp -outfmt 5 -query peptide.fa -word_size 2 -matrix=PAM30 -db all_plantBlast -out Plant.xml -evalue 100000000000 -max_target_seqs 500 -qcov_hsp_perc 100
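For readers who prefer to drive the two commands above from Python, the following sketch builds them as argument lists for `subprocess`. It is an illustration under stated assumptions, not part of the published pipeline: the helper names are hypothetical, the `-out` option of makeblastdb is added here so that the database name matches the one passed to `-db`, and the matrix is passed as a separate argument (`-matrix PAM30`).

```python
import subprocess


def makeblastdb_command(fasta_path, db_name):
    """Argument list for building a protein BLAST database (hypothetical helper)."""
    return ["makeblastdb", "-in", fasta_path, "-parse_seqids",
            "-dbtype", "prot",
            "-out", db_name]  # assumption: name the database explicitly


def blastp_command(query_path, db_name, out_path):
    """Argument list reproducing the BLASTP parameters reported in the Methods."""
    return ["blastp",
            "-outfmt", "5",              # XML output
            "-query", query_path,
            "-word_size", "2",
            "-matrix", "PAM30",
            "-db", db_name,
            "-out", out_path,
            "-evalue", "100000000000",
            "-max_target_seqs", "500",
            "-qcov_hsp_perc", "100"]


def run(command):
    """Run one command; raises CalledProcessError if BLAST+ exits with an error."""
    subprocess.check_call(command)


# Example usage (requires BLAST+ to be installed and on the PATH):
# run(makeblastdb_command("all_plantBlast.faa", "all_plantBlast"))
# run(blastp_command("peptide.fa", "all_plantBlast", "Plant.xml"))
```

Passing the arguments as a list avoids shell quoting problems, and the same two helpers can be reused for each of the four kingdom databases.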

As seen in the command above, we set the BLASTP results to be written to the output file (“-out”) in Extensible Markup Language (XML) format (“-outfmt 5”). This format is standardized, which makes it easy to perform further analysis and data mining of the BLASTP results using a library such as xml.etree in Python 2.7, as shown in the “Bioinformatics Code” section below. The developed code reads the complete XML BLASTP output, then filters and counts the database matches whose descriptions contain any of the keywords related to the specific functionality of the peptides (the NucleotideKeyWords list in the code). Finally, the code prints the output, from which we counted the number of hits per kingdom presented in Figure 1 of the Results.
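The filtering and counting step described above can be illustrated with a compact sketch that looks elements up by tag name. This is not the published code: the function name, the shortened keyword list, the placeholder peptide `AAAAAA` and the hand-written sample XML are all assumptions made for the example, and the sample contains only the BLAST XML tags that the sketch reads.

```python
from xml.etree import ElementTree

# Hypothetical, shortened keyword list; the full list is in the published code.
KEYWORDS = ["ATP", "GTP", "HELICASE", "DNA BINDING"]

# Hand-written sample in the shape of BLAST XML output (not real results).
SAMPLE_XML = """<BlastOutput><BlastOutput_iterations><Iteration><Iteration_hits>
<Hit><Hit_id>ex1</Hit_id><Hit_def>ATP-dependent helicase</Hit_def>
  <Hit_hsps><Hsp><Hsp_hseq>AAAAAA</Hsp_hseq></Hsp></Hit_hsps></Hit>
<Hit><Hit_id>ex2</Hit_id><Hit_def>hypothetical protein</Hit_def>
  <Hit_hsps><Hsp><Hsp_hseq>AAAAAA</Hsp_hseq></Hsp></Hit_hsps></Hit>
<Hit><Hit_id>ex3</Hit_id><Hit_def>membrane protein</Hit_def>
  <Hit_hsps><Hsp><Hsp_hseq>AAAAAA</Hsp_hseq></Hsp></Hit_hsps></Hit>
<Hit><Hit_id>ex4</Hit_id><Hit_def>GTP-binding protein</Hit_def>
  <Hit_hsps><Hsp><Hsp_hseq>AAAAAG</Hsp_hseq></Hsp></Hit_hsps></Hit>
</Iteration_hits></Iteration></BlastOutput_iterations></BlastOutput>"""


def count_exact_hits(xml_text, peptide, keywords=KEYWORDS):
    """Return (exact_hits, exact_hits_matching_a_keyword) for one peptide."""
    root = ElementTree.fromstring(xml_text)
    total = 0
    with_keyword = 0
    for hit in root.iter("Hit"):
        definition = (hit.findtext("Hit_def") or "").upper()
        # Skip unannotated proteins, as the published code does.
        if "HYPOTHETICAL PROTEIN" in definition:
            continue
        # Only the first HSP of each hit is inspected.
        hsp = hit.find("Hit_hsps/Hsp")
        if hsp is None or hsp.findtext("Hsp_hseq") != peptide:
            continue
        total += 1
        if any(word in definition for word in keywords):
            with_keyword += 1
    return total, with_keyword


print(count_exact_hits(SAMPLE_XML, "AAAAAA"))
```

Looking elements up by name rather than by position keeps the parser working if the order of the XML child elements changes; for a real run, `ElementTree.parse(path).getroot()` would replace `ElementTree.fromstring`.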

Bioinformatics Code

from xml.etree import ElementTree

# This program parses XML output from BLAST
# Input:  1) XML input
# Output: 2) Output prints
#   2a) BLAST matches that are non-nucleotide binding
#   2b) BLAST matches that are nucleotide binding

# IDs of the exact-match hits; filled by blastExactMatch, read by createFastaFromRef
HitIDs = []

def blastExactMatch(fileName, hitSeq):
    global HitIDs
    root = ElementTree.parse(fileName).getroot()
    # list(element) returns the child elements (getchildren() was removed in Python 3.9).
    # The positional indexes below assume the standard BLAST XML layout.
    rootSub1 = list(root)
    iterations = list(rootSub1[8])               # BlastOutput_iterations
    blastOutputIterations = list(iterations[0])  # children of the first Iteration
    IterationHits = list(blastOutputIterations[4])  # Hit elements of Iteration_hits
    NucleotideKeyWords = ['SYNTHET', 'NUCLEOTID', 'ABC TRANSPORTER', 'PHOSPHO',
        'ADP', 'AMP', 'ATP', 'ATP BINDING', 'ATP-DEPENDENT', 'ATPASE', 'CAMP',
        'CDP', 'CGMP', 'CMP', 'COENZYME A', 'CTP', 'CTP BINDING', 'CTP-DEPENDENT',
        'DNA BINDING', 'DNA REPAIR', 'FAD', 'FADH 2', 'GDP', 'GMP', 'GTP',
        'GTP BINDING', 'GTP-DEPENDENT', 'GTPASE', 'HELICASE', 'NAD +', 'NADH',
        'NADP +', 'NADPH', 'NUCLEOTIDE BINDING', 'RNA BINDING', 'TRNA BINDING',
        'UDP', 'UMP', 'UTP', 'UTP BINDING', 'UTP-DEPENDENT']
    TotalFullHits = 0
    TotalFullHitsNucleotide = 0
    TotalAnyHits = 0
    HitIDs = []
    for hitDescr1 in IterationHits:
        TotalAnyHits = TotalAnyHits + 1
        hitDescr2 = list(hitDescr1)[5]   # Hit_hsps
        hitDescr3 = list(hitDescr2)[0]   # first Hsp of the hit
        Hsp_score = hitDescr3.findtext('Hsp_score')
        Hsp_qseq = hitDescr3.findtext('Hsp_qseq')
        Hsp_hseq = hitDescr3.findtext('Hsp_hseq')
        Hsp_midline = hitDescr3.findtext('Hsp_midline')
        Hsp_evalue = hitDescr3.findtext('Hsp_evalue')
        Hit_num = hitDescr1.findtext('Hit_num')
        Hit_id = hitDescr1.findtext('Hit_id')
        Hit_def = (hitDescr1.findtext('Hit_def')).upper()
        # Flag hits whose description contains any nucleotide-related keyword
        NucleotideProtein = "NO "
        if any(x in Hit_def for x in NucleotideKeyWords):
            NucleotideProtein = "YES"
        # Report only exact matches to the peptide in annotated proteins
        if Hsp_hseq == hitSeq and "HYPOTHETICAL PROTEIN" not in Hit_def:
            print("Hit Number:" + Hit_num)
            print("Hit ID:" + Hit_id)
            HitIDs.append(Hit_id)
            print("Hit Def:" + Hit_def)
            print("Nucleotide Binding Protein:" + NucleotideProtein)
            print("Hit Accession:" + hitDescr1.findtext('Hit_accession'))
            print("Hsp Midline:" + Hsp_midline)
            TotalFullHits = TotalFullHits + 1
            print("Hsp_score:" + Hsp_score)
            print("Hsp_qseq:" + Hsp_qseq)
            print("Hsp_hseq:" + Hsp_hseq)
            print("Hsp_evalue:" + Hsp_evalue)
            if NucleotideProtein == "YES":
                TotalFullHitsNucleotide = TotalFullHitsNucleotide + 1
            print("****************************************")
    print("Total Blast Full Hits:" + str(TotalFullHits))
    print("Blast Full Hits & Nucleotide:" + str(TotalFullHitsNucleotide))
    FullNonNuc = TotalFullHits - TotalFullHitsNucleotide
    print("Blast Full Hits & Non-Nucleotide:" + str(FullNonNuc))

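def appendToFile(textFile, strToAppend):
    # Append a string to the end of a text file
    with open(textFile, 'a') as file_object:
        file_object.write(strToAppend)

def deleteLastLine(textFile):
    # Rewrite the file without its last line
    with open(textFile) as file_object:
        lines = file_object.readlines()
    with open(textFile, 'w') as file_object:
        file_object.writelines(lines[:-1])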

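def seqCount(fastaInput):
    # Temporarily append a ">" sentinel so that the last record is terminated;
    # it is removed again at the end. Assumes the file ends with a newline.
    appendToFile(fastaInput, ">")
    with open(fastaInput) as file_object:
        TotalNumberAA = 0
        # Starts at -1 because the first header line passes through the loop
        # body once before any sequence has been read
        TotalNumberSeq = -1
        for line in file_object:
            lineFirstChar = line[0]
            seq = ''
            if line[0] == '>':
                try:
                    seqHeader = seqHeader2
                except NameError:
                    # First header in the file: seqHeader2 is not set yet
                    seqHeader = line
                    seqHeader2 = ''
                print(seqHeader)
            # Collect sequence lines until the next header (or the sentinel)
            while (lineFirstChar != '>'):
                seq = seq + line
                line = next(file_object)
                lineFirstChar = line[0]
                seqHeader2 = line
            TotalNumberSeq = TotalNumberSeq + 1
            seqHeader = seqHeader2
            if seq != '':
                print(seq)
                # Count residues only, not the line breaks inside the record
                ProteinLength = len(seq.replace('\n', '').replace('\r', ''))
                TotalNumberAA = TotalNumberAA + ProteinLength
                print(ProteinLength)
                print("*******************************************")
        print("TotalAAinDB:" + str(TotalNumberAA))
        print("TotalSeqinDB:" + str(TotalNumberSeq))
    deleteLastLine(fastaInput)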

def createFastaFromRef(filterFAA, refDBFAA):
    # Copy to filterFAA every record of refDBFAA whose header contains one of
    # the hit IDs collected in HitIDs
    with open(refDBFAA) as oldfile, open(filterFAA, 'w') as newfile:
        keep = False
        for line in oldfile:
            if line[:1] == ">":
                # New record: decide once per header whether to keep it
                keep = any(HitID in line for HitID in HitIDs)
            if keep:
                newfile.write(line)
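The seqCount function above temporarily modifies the input file by appending a sentinel and removing it afterwards. As an alternative, the same totals can be obtained in a single read-only pass. The sketch below is an illustration only, with a hypothetical function name, and is not part of the published code.

```python
def count_fasta(path):
    """Return (number_of_sequences, total_residues) for a FASTA file.

    The file is only read, never modified.
    """
    n_sequences = 0
    n_residues = 0
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # ignore blank lines
            if line.startswith(">"):
                n_sequences += 1  # each header starts a new record
            else:
                n_residues += len(line)  # sequence line without its line break
    return n_sequences, n_residues
```

Because every line is handled independently, this version also works for records that span several lines and for files that do not end with a newline.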

Acknowledgement

We thank all members of the Bioinformatics Core Infrastructures Lab for their useful feedback and suggestions during the preparation of the manuscript. This research was supported by the Center for Translational and Basic Research grant from National Institute on Minority Health and Health Disparities (G12 MD007599) and Weill Cornell Medical College-Clinical and Translational Science Center (2UL1TR000457-06).

References

© 2018 Konstantinos Krampis. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and build upon your work non-commercially.
