Sarvesh Datta Dixit and Shalini Singh*
Department of Chemistry, India
*Corresponding author: Shalini Singh, QSAR & Cheminformatics Laboratory, Department of Chemistry, Bareilly, India
Submission: May 05, 2022; Published: May 25, 2022
ISSN: 2576-9170 Volume 3 Issue 5
Substituted coumarins such as sulfocoumarins (1,2-benzoxathiine-2,2-dioxides) are an important class of potent and isoform-selective inhibitors of the tumor-associated carbonic anhydrase CA XII. The regression analyses were carried out using ordinary as well as Ridge multiple regression. A variety of statistics, namely (δ) statistics, Ridge regression and parameters derived therefrom, were used for modeling the CA XII activity. We have attempted to build QSAR models that explore the correlations between molecular descriptors calculated for a pool of 17 compounds and their experimental CA XII inhibitory activities. The quality of prediction is high (Coefficient of Variation = 0.0695, r2 = 0.9138, F = 31.7984, Adj R2 = 0.8850). The innovation of this work consists in not only exploring the structural attributes of bioactive molecules
Keywords: Carbonic anhydrase inhibitors; PRECLAV; Tumor-associated carbonic anhydrase XII; Sulfocoumarins; Dragon
Carbonic anhydrases (CAs) are a superfamily of metalloenzymes whose basic function is to catalyze the hydration of carbon dioxide and the reverse dehydration reaction. These enzymes have been classified into seven genetically distinct families designated by Greek letters: α-, β-, γ-, δ-, ζ-, η- and θ-CAs [1-3]. All carbonic anhydrases found in human beings belong to the α-class, and fifteen isoforms of this class have been discovered and characterized so far. Of these fifteen, only twelve are catalytically active: hCAs I to IV, VA, VB, VI, VII, IX and XII to XIV. The hCAs can be further divided into four subsets on the basis of their subcellular localization. Thus, hCA I, II, III, VII, VIII, X, XI and XIII are cytosolic proteins, hCA VA and VB are present in the mitochondria, hCA VI is secreted, and the membrane-associated isoforms comprise hCA IV, a glycosylphosphatidylinositol (GPI)-anchored protein, and the transmembrane isoforms hCA IX, XII and XIV. These enzymes are ubiquitous and are responsible for a wide range of physiological processes of vital importance in living beings, such as electrolyte secretion in a variety of tissues, biosynthetic reactions (gluconeogenesis, lipogenesis, ureagenesis), bone resorption and calcification. Dysregulated expression or abnormal activity of these enzymes can culminate in severe pathological conditions, including the growth of malignant tumours [4-7]. hCA IX and XII deserve special mention because they are used as markers of disease progression in hypoxic tumours, and their selective inhibition has the therapeutic effect of reducing both primary tumours and metastases.
The two ubiquitous isoforms, hCA I and II, are normally the off-target isoforms, as their inhibition along with that of hCA IX and XII leads to undesired effects. Isoform-specific CA inhibition has therefore attracted the attention of scientists during the past few years. Off-target inhibition is associated mainly with the use of the so-called classical 'sulphonamide' inhibitors. Many non-classical CA inhibitors have been developed and tested in recent years, and coumarins are one important group of such inhibitors [8]. Coumarins are widely distributed in nature and are endowed with a variety of pharmacological properties, including antioxidant, anti-tubercular, anti-bacterial, anti-coagulant, anti-fungal, antihypertensive, antihyperglycemic and enzyme-inhibitory activity. They have also been found to be selective and effective hCA IX and hCA XII inhibitors that act by occluding the entrance to the active site. Further, the five-membered 1,2,4-oxadiazole ring is a neutral heterocycle that acts as a substitute for amides and carboxylic acids and possesses excellent pharmacological activity.
Therefore, it was decided to combine both heterocycles, synthesize coumarin-1,2,4-oxadiazole hybrids with different substituents on the adjacent benzene ring, and test their efficiency towards selective inhibition of physiologically and pharmacologically relevant hCAs using acetazolamide as a standard. This work was carried out by Pavitra et al. [9]. They synthesized variously substituted coumarin-1,2,4-oxadiazole hybrids (a to q) and determined Ki (inhibition constant) values to assess their inhibitory potential towards different hCAs [9]. They drew the following conclusions.
A. The two cytosolic enzymes, hCA I and hCA II, were not inhibited by any of the synthesized hybrid derivatives, as evidenced by Ki values of 10000 nM or above for all of them.
B. The tumour-related isoform hCA IX was inhibited most strongly (Ki = 23.6 nM) by the derivative carrying a methoxy group in the para position of the phenyl ring located at the 5th position of the 1,2,4-oxadiazole.
C. The other tumour-related isoform, hCA XII, was best inhibited (Ki = 1.00) by the derivative carrying a tertiary butyl group in the para position of the phenyl ring located at the 5th position of the 1,2,4-oxadiazole.
Data set for analysis
Figure 1:
The inhibitory activity of 17 sulfocoumarins against hCA XII was obtained from the literature [9]. Compounds with a known structure and a known value of the inhibitory activity formed the calibration set used to develop the QSAR model; the data are presented in Table 1 and the structures are shown in Figure 1. Ki values, originally determined in nanomolar units, were converted to 'A' by means of the equation A = log(c/Ki), where c was taken as 100000 in order to obtain large values of 'A'. The inhibitory activity values 'A' of the molecules under study span a range from 3 to 5. Dragon was used to calculate several pharmacokinetic properties for each potential analog: Lipinski violations [10], flexibility, and several common measures of bioavailability.
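The conversion from Ki to 'A' described above can be sketched in a few lines of Python. This is a minimal illustration that assumes a base-10 logarithm and Ki expressed in nM; the function name and the example Ki values are hypothetical and are not taken from the data set.

```python
import math


def ki_to_activity(ki_nm: float, c: float = 100_000.0) -> float:
    """Convert an inhibition constant Ki (assumed to be in nM) to A = log10(c / Ki).

    The base-10 logarithm is an assumption; the text only writes 'log'.
    """
    if ki_nm <= 0:
        raise ValueError("Ki must be positive")
    return math.log10(c / ki_nm)


# Illustrative values only (not compounds from Table 1):
# a Ki of 1 nM maps to the top of the 3-5 range, a Ki of 100 nM to the bottom.
example_activities = [ki_to_activity(ki) for ki in (1.0, 10.0, 100.0)]
```

With c = 100000, stronger inhibitors (smaller Ki) receive larger values of 'A', which is why the scale is convenient for regression.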
Table 1: Values of the predictors used in the QSAR study for the calibration set and prediction set, CA XII activities (in μM and A = log10000/c), estimated activities, hat diagonal, standardized residual and |R Student| of the calibration set molecules 1-17.
Drug like indices
The drug-like indices are dummy variables equal to 1 when all the criteria of the consensus definition of a drug-like molecule are satisfied and 0 otherwise. They are filters used to extract good drug candidates from large collections of compounds. A drug-like score is a real value ranging from 0 to 1, calculated as the fraction of criteria satisfied. A score of 1 indicates that a compound is a good drug candidate, whereas a score of 0 indicates that a compound is unlikely to be a drug [11]. The index DLS_01 is a drug-like score based on Lipinski's rules. Finding new bioactive molecules is the most important aim of computational drug discovery by QSAR modeling.
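The difference between a drug-like index (all criteria met) and a drug-like score (fraction of criteria met) can be sketched as follows. The four thresholds are the commonly quoted Lipinski rule-of-five limits; the class and function names are hypothetical, and the exact criteria that Dragon uses for DLS_01 or DLS_cons are not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class Properties:
    """Hypothetical container for the four rule-of-five properties."""
    mol_weight: float
    log_p: float
    h_donors: int
    h_acceptors: int


def lipinski_criteria(p: Properties) -> list:
    """Return one boolean per rule-of-five criterion (assumed thresholds)."""
    return [
        p.mol_weight <= 500,
        p.log_p <= 5,
        p.h_donors <= 5,
        p.h_acceptors <= 10,
    ]


def drug_like_score(criteria: list) -> float:
    """Fraction of criteria satisfied, a real value between 0 and 1."""
    return sum(criteria) / len(criteria)


def drug_like_index(criteria: list) -> int:
    """Dummy variable: 1 only when every criterion is satisfied."""
    return 1 if all(criteria) else 0
```

A molecule that violates two of the four rules would receive a score of 0.5 and an index of 0.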
The minimum-energy geometry of each compound was determined using the conformational search capability of the HyperChem program [12]. Isomeric SMILES notation was used as program input in order to avoid any influence of 3D seed structures on conformational model generation. The minimum-energy conformations obtained by molecular mechanics calculations were further minimized by quantum chemical calculations: the semi-empirical PM6 method [13-15], as implemented in the MOPAC 2009 software [16], was used to optimize the geometry more thoroughly.
The energy-minimized structure was used to compute special molecular properties as well as physicochemical, electronic, constitutional and virtual fragmentation descriptors and whole-molecule quantum chemical (global) descriptors. The MOPAC [16] and PRECLAV [17] programs were used to calculate many descriptors for each molecule. These descriptors characterize the molecular structure and serve as the independent variables.
Chemometric tools
In the QSAR model, the dependent variable is the experimental information related to biological activity. The independent variables are the various calculated descriptors that characterize the molecular structure. The PRECLAV algorithm [17] was used for obtaining the parameters and for the statistical analysis, as reported earlier [18-28]. Regression analysis was performed with the NCSS software [29] using the maximum correlation method and the usual statistical rules of thumb. Here, CV is the coefficient of variation, R is the multiple correlation coefficient, R2A is the adjusted R2, and F is Fisher's statistic.
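As a rough illustration of the regression step, the sketch below fits a multiple linear regression by ordinary least squares with NumPy and reports R2, R and CV. It is not the NCSS or PRECLAV procedure; the function name is hypothetical, and CV is taken here as the root mean square error divided by the mean of the dependent variable, which is our assumption about how the reported coefficient of variation is defined.

```python
import numpy as np


def fit_mlr(X, y):
    """Ordinary least squares fit of y on the descriptor matrix X (n x k)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, k = X.shape
    design = np.column_stack([np.ones(n), X])      # add intercept column
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef
    sse = float(residuals @ residuals)             # unexplained variation
    sst = float(((y - y.mean()) ** 2).sum())       # total variation
    r2 = 1.0 - sse / sst
    mse = sse / (n - k - 1)                        # residual mean square
    return {
        "coef": coef,                              # intercept first
        "r2": r2,
        "R": float(np.sqrt(max(r2, 0.0))),
        "cv": float(np.sqrt(mse) / abs(y.mean())), # assumed CV definition
    }
```

Calling `fit_mlr` with a 17 x 4 descriptor matrix and the 17 activity values would return five coefficients (intercept plus four slopes) together with the summary statistics.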
Comment on R2A
Before proceeding further, it is necessary to comment on R2A [30], the adjusted R2. R2A is a measure of the percentage of explained variation in the dependent variable that takes into account the relationship between the number of compounds and the number of independent variables in the regression model. Whereas R2 never decreases when an independent variable is added, R2A decreases if the added variable does not reduce the unexplained variation enough to offset the loss of degrees of freedom. In other words, if a variable is added that does not contribute its fair share, R2A will decline.
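The relationship between R2, R2A and F can be checked numerically with the standard textbook formulas. The sketch below uses the values reported in this paper (r2 = 0.9138, 17 compounds, 4 descriptors); the function names are hypothetical.

```python
def adjusted_r2(r2: float, n: int, k: int) -> float:
    """Adjusted R2 for n observations and k independent variables."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


def f_statistic(r2: float, n: int, k: int) -> float:
    """Overall F statistic of the regression expressed through R2."""
    return (r2 / k) / ((1.0 - r2) / (n - k - 1))


# Values reported for the 4-descriptor model on 17 compounds.
r2_adj = adjusted_r2(0.9138, n=17, k=4)   # approximately 0.885
f_value = f_statistic(0.9138, n=17, k=4)  # approximately 31.8
```

Both results agree with the reported Adj R2 = 0.8850 and F = 31.7984 to within the rounding of r2.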
Variance Inflation Factor (VIF) and eigen values
We now discuss the variance inflation factors (VIF) and eigenvalues of the parameters involved in the model. These values are presented.
The VIF is defined as:
$$\mathrm{VIF}_i = \frac{1}{1 - R_i^{2}} \qquad (5)$$
where $R_i$ is the multiple correlation coefficient of the ith independent variable regressed on all other independent variables. Thus, a VIF is defined for each variable in the equation, not for the equation as a whole, and all VIF values should be less than 10. All VIF values for both models 5 and 9 are around 2, and thus much lower than 10, indicating that these models meet the statistical requirements and that there is no collinearity problem. The conclusions drawn from the VIF values are further confirmed by the respective correlation matrix, eigenvalues and Ridge statistics (Figure 2).
Figure 2: Correlation of observed vs. estimated Ki in the calibration set and validation set.
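The VIF definition in equation (5) translates directly into code: each descriptor is regressed on all the others and the resulting R2 is inserted into the formula. The following is a minimal NumPy sketch with a hypothetical function name, not the NCSS implementation.

```python
import numpy as np


def variance_inflation_factors(X):
    """Return VIF_i = 1 / (1 - R_i^2) for every column of X (n x k)."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    vifs = []
    for i in range(k):
        target = X[:, i]
        others = np.delete(X, i, axis=1)
        design = np.column_stack([np.ones(n), others])  # intercept + other descriptors
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        r2_i = 1.0 - (residuals @ residuals) / ((target - target.mean()) ** 2).sum()
        vifs.append(1.0 / (1.0 - r2_i))
    return np.array(vifs)
```

The same numbers can be read from the diagonal of the inverse correlation matrix of the descriptors, which is a convenient cross-check.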
Applicability of domain and detection of outliers
The predictive power of a model on a new data set is influenced by the chemical similarity between the calibration set and the prediction set [31]. A QSAR model can be used for screening new compounds only if its domain of application is defined [29-32]; the need to characterize the applicability domain of a model is also reflected in the OECD guidelines for QSAR model validation [32,33]. Only predictions for compounds that fall within this domain may be considered reliable. One simple approach to defining the applicability domain is the extent of extrapolation [34-36]. This is based on the calculation of the hat diagonal (leverage, hi) for each chemical whose activity is predicted by the QSAR model [37-39].
$$h_i = x_i \,(X^{T} X)^{-1}\, x_i^{T}$$
where xi is the descriptor row vector of the query molecule and X is the n x k matrix containing the k descriptor values for each of the n training molecules. A hat diagonal (leverage) value greater than 3(k + 1)/n (the leverage warning limit [27]) is considered large.
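The leverage calculation can be sketched with NumPy as shown below. The function names are hypothetical. An intercept column is added by default, which is our assumption and is consistent with the k + 1 in the warning limit; passing `add_intercept=False` applies the formula to the raw descriptor matrix.

```python
import numpy as np


def leverages(X_train, X_query=None, add_intercept=True):
    """Hat diagonal h_i = x_i (X^T X)^-1 x_i^T for each query row.

    If X_query is None, leverages of the training molecules are returned.
    """
    X_train = np.asarray(X_train, dtype=float)
    X_query = X_train if X_query is None else np.asarray(X_query, dtype=float)
    if add_intercept:  # assumption: the model contains an intercept term
        X_train = np.column_stack([np.ones(len(X_train)), X_train])
        X_query = np.column_stack([np.ones(len(X_query)), X_query])
    xtx_inv = np.linalg.pinv(X_train.T @ X_train)
    # Row-wise quadratic form x_i (X^T X)^-1 x_i^T
    return np.einsum("ij,jk,ik->i", X_query, xtx_inv, X_query)


def leverage_warning_limit(n: int, k: int) -> float:
    """Warning limit 3(k + 1)/n for n training molecules and k descriptors."""
    return 3.0 * (k + 1) / n
```

For the present model (n = 17, k = 4) the warning limit evaluates to 15/17, roughly 0.88.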
Outliers are observations that fit the regression model poorly. Outlying compounds should not be removed unless there is a proper reason for their removal. The variance of the observed residuals is not constant, which makes comparisons among them difficult. One solution is to standardize the residuals [38,39] by dividing them by their standard deviations, which provides a set of residuals with constant variance. |R Student| (the cross-validated, leave-one-out standardized residual) [40] is a standardized residual in which the impact of the observation itself has been removed from the mean square error. A molecule is defined as an outlier if its |R Student| > 2 [40]. To visualize the applicability domain of a developed QSAR model, the Williams plot was used, in which |R Student| is plotted against the leverage values (hi). This plot allows direct and simple graphical detection of both the response outliers and the structurally influential compounds in a model.
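A minimal sketch of the |R Student| calculation is given below. It uses the standard closed-form expression for externally studentized residuals, so no model has to be refitted for each left-out molecule; the function name is hypothetical and this is not the NCSS code.

```python
import numpy as np


def r_student(X, y):
    """Return (R Student residuals, leverages) for an OLS fit with intercept."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    design = np.column_stack([np.ones(n), X])
    p = design.shape[1]                              # number of fitted parameters
    hat = design @ np.linalg.pinv(design.T @ design) @ design.T
    h = np.diag(hat)                                 # leverages
    residuals = y - hat @ y
    sse = residuals @ residuals
    # Error variance estimated with observation i left out
    s2_loo = (sse - residuals ** 2 / (1.0 - h)) / (n - p - 1)
    return residuals / np.sqrt(s2_loo * (1.0 - h)), h


def flag_outliers(t, threshold=2.0):
    """Indices of molecules with |R Student| above the threshold."""
    return np.flatnonzero(np.abs(t) > threshold)
```

Plotting the absolute values returned here against the leverages gives the Williams plot described above.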
Using only the “significant” descriptors, PRECLAV computed ten thousand QSPR-type multilinear equations. The quality of the equations obtained is reflected by the value of the Q function and by the values of some usual statistical functions. During the NCSS MLR analysis, it was observed that the equation with the highest value of the R function is a 4-parameter model, and that this model also has the highest predictive power. It is as follows:
Dependent property: hCA XII inhibitory activity.
Number of molecules in calibration set: 17
Number of “significant” descriptors in presence of prediction set = 242
$$\log K_i\,(\mathrm{hCA\ XII}) = 84.5050 - 2.4191\,(0.3024)\,\mathrm{ifd} - 0.0149\,(0.0020)\,\mathrm{ban} - 10.4590\,(1.5848)\,\mathrm{SM5\_B(p)} + 5.7437\,(1.9428)\,\mathrm{DLS\_cons}$$
The quality of the correlation is described by the following statistical indices:
Coefficient of Variation = 0.0695, r2 = 0.9138, F = 31.7984, Adj R2 = 0.8850
where Se = standard error of values, r2 = squared Pearson correlation, F = Fisher function, and R2A (Adj R2) is the adjusted R2.
ifd = spherical shape index
ban = minimum aromaticity of aromatic chemical bonds
SM5_B(p) = spectral moment of order 5 from Burden matrix weighted by polarizability
DLS_cons = Dragon consensus drug-like score (Drug-like indices)
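Once the four descriptor values of a molecule are available from PRECLAV and Dragon, applying the reported equation is a simple weighted sum. The sketch below only encodes the published intercept and coefficients; the function name and the dictionary layout are hypothetical, and no descriptor values from the data set are reproduced.

```python
# Intercept and coefficients as reported in the 4-descriptor equation above.
INTERCEPT = 84.5050
COEFFICIENTS = {
    "ifd": -2.4191,       # spherical shape index
    "ban": -0.0149,       # minimum aromaticity of aromatic chemical bonds
    "SM5_B(p)": -10.4590, # spectral moment of order 5, Burden matrix, polarizability
    "DLS_cons": 5.7437,   # Dragon consensus drug-like score
}


def predict_log_ki(descriptors: dict) -> float:
    """Evaluate the reported linear model for one molecule."""
    missing = set(COEFFICIENTS) - set(descriptors)
    if missing:
        raise KeyError(f"missing descriptors: {sorted(missing)}")
    return INTERCEPT + sum(
        COEFFICIENTS[name] * descriptors[name] for name in COEFFICIENTS
    )
```

Because the equation is linear, raising DLS_cons by one unit at fixed values of the other descriptors raises the estimate by 5.7437, while raising ifd by one unit lowers it by 2.4191.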
The negative coefficients of ifd (spherical shape index) and ban (minimum aromaticity of aromatic chemical bonds), both computed by PRECLAV, show that an increase in the value of these descriptors decreases the activity. SM5_B(p), a Dragon descriptor (spectral moment of order 5 from Burden matrix weighted by polarizability), also has a negative coefficient, so an increase in its value likewise decreases the activity.
The computational alert is a filter that identifies compounds lying in a region of property space where the probability of useful oral activity is very low. A compound that fails the alert will likely be poorly bioavailable because of poor absorption or permeation. This alert index is a dummy variable taking the value 1 when two or more properties are out of range. The DLS index is a drug-like score based on Lipinski's rules.
External validation of the computation method
In this work, the molecules with rank 2, 7, 10 and 17 constituted the validation set and the remaining 13 molecules formed the reduced calibration set. The validation set of 4 molecules (about 24% of the database) captured all the features and spanned the activity range of the entire dataset. We may suppose that the reduced calibration set obtained in this way is a representative sample of the calibration set [19]. When a validation set is available, the most important tool is the correlation between the estimated and experimental values for the molecules in the validation set.
Hence, we can state that the estimated values for the molecules in the validation set are close to the experimental ones and that the model ranks these molecules in the same order as their actual CA XII activity values. This was confirmed by the plot of observed versus estimated values for the calibration set and the validation set (Figure 2).
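The split described above and the validation check can be sketched as follows. The function names are hypothetical, and any observed or estimated activity values passed to `squared_correlation` would have to come from Table 1; none are reproduced here.

```python
import numpy as np


def split_by_rank(n_molecules, validation_ranks):
    """Split ranks 1..n into (reduced calibration ranks, validation ranks)."""
    ranks = np.arange(1, n_molecules + 1)
    in_validation = np.isin(ranks, validation_ranks)
    return ranks[~in_validation], ranks[in_validation]


def squared_correlation(observed, estimated):
    """Squared Pearson correlation between observed and estimated activities."""
    r = np.corrcoef(observed, estimated)[0, 1]
    return float(r ** 2)


# Ranks used in this work for the validation set.
calibration_ranks, validation_ranks = split_by_rank(17, [2, 7, 10, 17])
```

A squared correlation close to 1 for the validation molecules, together with the same rank order of observed and estimated values, supports the external predictivity of the model.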
Applicability domain
Figure 3: Normal probability plot of residuals of observed activity, and |R Student| of observed activity vs. hat diagonal.
|R Student| of the observed inhibitory activity and the hat diagonal (leverage) are used to assess the applicability domain (AD). Table 1 shows the leverage values calculated for both the calibration set and the prediction set compounds. The Williams plot (Figure 3) shows the applicability domain of the model developed from the calibration set. Points with a leverage value higher than the warning limit are influential compounds rather than response outliers, and are therefore not considered outliers. The Williams plot shows that all molecules in the calibration set lie within the applicability domain of the developed model (Figure 3).
If the residuals are normally distributed, the data points of the normal probability plot fall along a straight line. Major deviations from this ideal picture reflect departures from normality. Stragglers at either end of the normal probability plot indicate outliers. Curvature at both ends of the plot indicates long or short distributional tails. Convex or concave curvature indicates a lack of symmetry. Gaps, plateaus or segmentation indicate clustering and may require a closer examination of the data or the model. Of course, use of this graphic tool with very small sample sizes is unwise. If the residuals are not normally distributed, the t-tests on regression coefficients, the F-tests and the interval estimates are not valid. This is a critical assumption to check (NCSS reference) (Table 2).
Table 2:
Figure 4: Ridge trace section.
The ridge trace (Figure 4) is the signature plot of this technique and is straightforward to read. It presents the standardized regression coefficients on the vertical axis and various values of k along the horizontal axis. Since the values of k span several orders of magnitude, a logarithmic scale is adopted along this axis. The points on the left vertical axis (the left ends of the lines) are the ordinary least squares regression values, which occur for k equal to zero. As k is increased, the values of the regression estimates change, often wildly at first. At some point, the coefficients seem to settle down and then gradually drift towards zero. The task of the ridge regression analyst is to determine at what value of k these coefficients reach their stable values. A vertical line is drawn at the value selected for reporting purposes. The analysis is typically run several times until an appropriate value of k is determined. In this example, the search is between 0.0001 and 0.1. The value selected on this graph is 0.066237, the value obtained from the analytic search. One might be inclined to use an even smaller value of k, such as 0.01: the smaller the value of k, the smaller the amount of bias included in the estimates (NCSS reference).
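The numbers behind a ridge trace can be generated with a few lines of NumPy. The sketch below works in correlation form, so that k is added to the diagonal of the descriptor correlation matrix and the coefficients are standardized; the function name is hypothetical and the details of the NCSS implementation are assumed rather than known.

```python
import numpy as np


def ridge_trace(X, y, k_values):
    """Standardized ridge coefficients for each k (one row per k value)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)  # standardized descriptors
    ys = (y - y.mean()) / y.std(ddof=1)                # standardized response
    R = Xs.T @ Xs / (n - 1)                            # descriptor correlation matrix
    r_xy = Xs.T @ ys / (n - 1)                         # descriptor-response correlations
    identity = np.eye(X.shape[1])
    return np.array([np.linalg.solve(R + k * identity, r_xy) for k in k_values])


# Search grid matching the range quoted in the text (0.0001 to 0.1, log scale).
k_grid = np.logspace(-4, -1, 20)
```

Plotting each column of the returned array against `k_grid` on a logarithmic axis reproduces the general shape of a ridge trace; k = 0 gives the ordinary least squares solution.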
This plot shows the impact of k on the variance inflation factors (Figure 5). Since the major goal of ridge regression is to remove the impact of multicollinearity, it is important to know at what point multicollinearity has been dealt with, and this plot shows it. The currently selected value of k is shown by a vertical line. Since the rule of thumb is that multicollinearity is not a problem once all VIFs are less than 10, we inspect the graph for this point. In this example, it appears that all VIFs are small enough once.
Figure 5: Ridge regression report.
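How the VIFs shrink as k grows can be illustrated with the standard textbook expression for ridge variance inflation factors, the diagonal of (R + kI)^-1 R (R + kI)^-1, where R is the descriptor correlation matrix. We assume, without having verified it, that NCSS uses an equivalent definition; the function name is hypothetical.

```python
import numpy as np


def ridge_vifs(X, k):
    """Ridge VIFs: diagonal of (R + kI)^-1 R (R + kI)^-1 for descriptor matrix X."""
    R = np.corrcoef(np.asarray(X, dtype=float), rowvar=False)
    inv = np.linalg.inv(R + k * np.eye(R.shape[0]))
    return np.diag(inv @ R @ inv)


def smallest_k_below(X, k_values, limit=10.0):
    """First k in k_values for which every VIF is below the limit, else None."""
    for k in k_values:
        if np.all(ridge_vifs(X, k) < limit):
            return float(k)
    return None
```

At k = 0 the expression reduces to the ordinary VIFs, and scanning an increasing grid of k values shows where all of them drop below the rule-of-thumb limit of 10.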
Statistically significant linear QSAR models were obtained for CA XII activity and can be used for data representation, data modeling and data prediction. The excellent correlation with the drug-like indices supports this model. The model shows that the polarizability of the compounds does not play a dominant role in the activity. Thus, an attempt has been made to design and develop novel QSAR models for CA XII activity that reduce trial and error and predict the biological activity before synthesis.
This article is dedicated to the memory of the late Prof. Padmakar V. Khadikar (1936-2012).
© 2022 Shalini Singh. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and build upon your work non-commercially.