Wen-Qiang Wang* and Pengjie Wang
Department of Civil Engineering, Queen’s University, Canada
*Corresponding author: Wen-Qiang Wang, Beaty Water Research Centre, Department of Civil Engineering, 69 Union Street, Queen’s University, Kingston, K7L 3N6, Canada
Submission: December 11, 2024; Published: February 11, 2025
ISSN: 2832-4463 Volume 4 Issue 3
Predicting cadmium (Cd(II)) adsorption in soils is critical for managing heavy metal contamination and mitigating its environmental risks. This study introduces a hybrid machine learning model that integrates Decision Trees (DT), Multi-Output Nonlinear Regression (MNLR), and Backpropagation Neural Networks (BPNN) to achieve accurate predictions of Cd(II) adsorption capacity. The model incorporates advanced data scaling techniques and feature expansion to effectively handle data heterogeneity and capture complex nonlinear relationships among soil properties, including Cation-Exchange Capacity (CEC), Organic Carbon Content (OC), clay content, pH, and soil-to-solution ratio. Sensitivity analysis identifies clay content as the most influential parameter, revealing its significant role in modulating adsorption behavior. The model demonstrates superior predictive performance, with an R² value of 0.898 and a substantial reduction in training loss, highlighting its potential for advancing environmental risk assessment and remediation strategies for contaminated soils. This work establishes a foundation for applying machine learning to optimize predictions in environmental science, offering insights into heavy metal adsorption and guiding the development of efficient remediation approaches.
Keywords: Machine learning; Heavy metal; Adsorption; Soil remediation; Decision tree; Neural network
Heavy metals in soil pose significant risks to human and animal health. While soil adsorption can limit the long-range transport of heavy metals, the retention of metal ions in local soils still substantially threatens nearby ecosystems [1,2]. Therefore, accurately estimating adsorption capacity is critical for assessing environmental risks and developing effective remediation strategies [1,3,4]. The adsorption capacity of heavy metals in soil depends on various factors, including pH, Cation-Exchange Capacity (CEC), clay content, and Organic Carbon Content (OC), all of which vary significantly across soil types [4,5]. As such, estimating the adsorption capacity of a specific soil is challenging [6,7]. Traditional batch experiments are precise but time-consuming [1-8], while empirical isotherms such as the Freundlich and Langmuir models require experimental data and predefined variables [9-11], limiting their applicability in heterogeneous soils [12]. Advanced models such as Surface Complexation Models (SCMs) rely on physical governing equations but are effective mainly for soils with well-defined surface functional groups [13].
Machine learning provides an innovative and powerful alternative that excels at uncovering hidden relationships without extensive prerequisites [14]. This case report introduces a novel machine learning model that uses soil properties to predict heavy metal adsorption in soils. Cadmium (Cd) is selected as a case study due to its toxicity, carcinogenicity, and bioaccumulative effects [15]. The case study assesses the effectiveness of machine learning in predicting Cd(II) adsorption while also identifying the key soil properties that significantly influence adsorption behavior. The results highlight the model’s potential as a promising tool for informing the redevelopment of brownfield sites contaminated by heavy metals.
Data pre-processing
The dataset in this study includes the adsorption capacity of Cd(II) and various soil properties. Key input variables, selected for their strong correlation with heavy metal adsorption, include CEC, OC, clay content, equilibrium heavy metal concentration, soil pH, solution pH, solution temperature, and soil-to-solution ratio. The dataset, compiled from open literature, consists of 1,093 soil samples from diverse Cd(II)-contaminated sites. Because these variables span a wide range of magnitudes and contain extreme values, data scaling was applied prior to model training to mitigate the impact of data heterogeneity. Kernel Density Estimation (KDE) with marginal histograms was used to confirm that scaling preserved the original data distribution. Additionally, data normalization was applied to smooth the variable distributions, enhancing the model’s ability to interpret the underlying relationships.
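As a minimal sketch of the distribution check described above, the following Python snippet min-max scales one variable and compares Gaussian kernel density estimates before and after scaling. The choice of min-max scaling, the normal-reference bandwidth rule, and the sample values are all assumptions made for illustration; the text does not specify which scaler or kernel was used.

```python
import math
import statistics

def min_max_scale(values):
    """Linearly rescale values to the [0, 1] interval."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def rule_of_thumb_bandwidth(values):
    """Normal-reference bandwidth: 1.06 * std * n^(-1/5) (an assumed choice)."""
    return 1.06 * statistics.stdev(values) * len(values) ** (-0.2)

def gaussian_kde(values, x, bandwidth=None):
    """Gaussian kernel density estimate of `values`, evaluated at point x."""
    h = bandwidth if bandwidth is not None else rule_of_thumb_bandwidth(values)
    norm = 1.0 / (len(values) * h * math.sqrt(2.0 * math.pi))
    return norm * sum(math.exp(-0.5 * ((x - v) / h) ** 2) for v in values)

# Hypothetical CEC values with one extreme sample (not data from the study).
cec = [4.2, 6.8, 7.5, 9.1, 11.3, 12.0, 14.6, 18.9, 23.4, 61.0]
cec_scaled = min_max_scale(cec)
span = max(cec) - min(cec)

# A linear rescaling compresses the x-axis by `span`, so the density of the
# scaled data should equal span * (density of the raw data) at matching points.
for raw, scaled in zip(cec, cec_scaled):
    d_raw = gaussian_kde(cec, raw)
    d_scaled = gaussian_kde(cec_scaled, scaled)
    print(f"{raw:6.1f} -> {scaled:.3f}   (scaled density) / (span * raw density) = "
          f"{d_scaled / (span * d_raw):.3f}")
```

A ratio close to 1 at every point indicates that the shape of the distribution is unchanged by the scaling step. A non-linear transform, such as a logarithm, would not pass this check and would need a different comparison.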
Framework of machine learning based model
The model applied in this study is a hybrid machine learning model adapted from previous work [16]. It begins with a Decision Tree (DT) that classifies soil types based on their inherent characteristics. The splitting algorithm of the applied DT is given by Equations (1)-(4).
where Sum_left and Sum_right represent the sums of all data points in the left and right leaves, whose values are denoted y_left and y_right (Equations (1) and (2)). The weights of each leaf, denoted w_r and w_l, are introduced in Equation (3) to adjust the dataset in each leaf. The decision tree algorithm dynamically adjusts these weights to minimise the impurity until the specified criterion is reached (Equation (4)). Equations (1) to (4) use a modified Friedman Mean Squared Error (MSE) as the splitting criterion [17], with a predefined maximum depth and a minimum of 50 samples per leaf to prevent overfitting. Next, a Multi-Output Non-Linear Regression (MNLR) is applied, using 6th-degree polynomial features to capture complex non-linear relationships. These features are expanded into 465 variables, which are combined with the original input values as a reference set and processed using a random forest regressor. Finally, a Backpropagation Neural Network (BPNN) is employed to further refine the predictions. The BPNN in this study consists of an input layer, two hidden layers, and an output layer, with weights updated by an optimization algorithm. Forward propagation in the BPNN employs the Leaky ReLU activation function, as shown in Equation (5), to introduce non-linearity between layers:
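In its standard form, Leaky ReLU applied to a weighted sum of inputs can be written as follows. The bias term b and the negative-side slope α (a small positive constant, often set to 0.01) are assumed symbols that are not defined in the text, and the value of α used in this study is not stated.

```latex
z = \sum_{i} w_i x_i + b, \qquad
f(z) =
\begin{cases}
z, & z > 0 \\
\alpha z, & z \le 0
\end{cases}
```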
where x_i represents the input value from the nodes of the previous layer and w_i is the corresponding weight in the current layer. The Leaky ReLU activation function enables the model to capture complex relationships in the data. The weights are then optimized using the Stochastic Gradient Descent (SGD) algorithm to minimize the Mean Absolute Error (MAE) between the model’s predictions and the ground truths over 10,000 training epochs [18]. The overall structure of the model is illustrated in Figure 1.
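The training procedure described above can be sketched in plain Python as a small network with two hidden layers, Leaky ReLU activations, and per-sample SGD updates on the absolute error. This is an illustrative sketch only: the layer widths, learning rate, negative-side slope, initialization, linear output node, and toy data are all assumptions, and the loop runs for far fewer epochs than the 10,000 used in the study.

```python
import random

ALPHA = 0.01  # assumed negative-side slope of Leaky ReLU

def leaky_relu(z):
    return z if z > 0 else ALPHA * z

def leaky_relu_grad(z):
    return 1.0 if z > 0 else ALPHA

def init_layer(n_in, n_out, rng):
    """Weights as n_out rows of n_in values, plus one bias per output node."""
    w = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return w, [0.0] * n_out

def forward(layers, x):
    """Return the pre-activations and activations of every layer."""
    acts, pre = [x], []
    for k, (w, b) in enumerate(layers):
        z = [sum(wi * ai for wi, ai in zip(row, acts[-1])) + bj
             for row, bj in zip(w, b)]
        pre.append(z)
        is_output = k == len(layers) - 1
        acts.append(z if is_output else [leaky_relu(v) for v in z])  # linear output
    return pre, acts

def sgd_step(layers, x, y, lr):
    """One SGD update on |prediction - y| for a single sample; returns that error."""
    pre, acts = forward(layers, x)
    err = acts[-1][0] - y
    delta = [1.0 if err > 0 else -1.0 if err < 0 else 0.0]  # subgradient of |err|
    for k in reversed(range(len(layers))):
        w, b = layers[k]
        # Gradient with respect to this layer's inputs, taken before w changes.
        grad_in = [sum(w[j][i] * delta[j] for j in range(len(w)))
                   for i in range(len(w[0]))]
        for j in range(len(w)):
            for i in range(len(w[0])):
                w[j][i] -= lr * delta[j] * acts[k][i]
            b[j] -= lr * delta[j]
        if k > 0:
            delta = [g * leaky_relu_grad(z) for g, z in zip(grad_in, pre[k - 1])]
    return abs(err)

def mean_abs_error(layers, data):
    return sum(abs(forward(layers, x)[1][-1][0] - y) for x, y in data) / len(data)

rng = random.Random(0)
layers = [init_layer(2, 4, rng), init_layer(4, 4, rng), init_layer(4, 1, rng)]
# Hypothetical scaled samples: (inputs, target adsorption capacity).
data = [([0.1, 0.9], 0.5), ([0.4, 0.2], 0.3), ([0.8, 0.5], 0.9), ([0.6, 0.7], 0.8)]

mae_before = mean_abs_error(layers, data)
for epoch in range(200):
    rng.shuffle(data)
    for x, y in data:
        sgd_step(layers, x, y, lr=0.01)
print(f"MAE before: {mae_before:.3f}, after: {mean_abs_error(layers, data):.3f}")
```

Because the MAE gradient is only a sign, each update has a fixed magnitude set by the learning rate. A constant learning rate therefore tends to oscillate near the optimum, and a decaying schedule may be preferable in practice.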
Figure 1: Schematic representation of the hybrid model. The inputs to the BPNN include both the expanded variables from the MNLR and the original input values.
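The first two stages of the schematic can be illustrated with a short Python sketch: a split search that scores candidate thresholds by leaf sums and weights, and a polynomial feature expansion. The score shown is the classical Friedman improvement with unit sample weights, not the modified criterion of Equations (1)-(4), and the function names and data are hypothetical. The feature count printed here applies only to the two-input toy case; the 465 variables reported above depend on the authors' configuration, which this sketch does not attempt to reproduce.

```python
from itertools import combinations_with_replacement
from math import comb, prod

def friedman_improvement(y_left, y_right):
    """Classical Friedman split score, written with leaf sums and weights."""
    w_l, w_r = len(y_left), len(y_right)  # unit sample weights assumed
    diff = w_r * sum(y_left) - w_l * sum(y_right)
    return diff * diff / (w_l * w_r * (w_l + w_r))

def best_split(x, y, min_samples_leaf=50):
    """Scan thresholds on one feature, keeping min_samples_leaf points per side."""
    pairs = sorted(zip(x, y))
    ys = [p[1] for p in pairs]
    best = (None, 0.0)
    for i in range(min_samples_leaf, len(pairs) - min_samples_leaf + 1):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between identical feature values
        score = friedman_improvement(ys[:i], ys[i:])
        if score > best[1]:
            best = ((pairs[i - 1][0] + pairs[i][0]) / 2.0, score)
    return best

def polynomial_expand(row, degree):
    """All monomials of the inputs with total degree 1..degree."""
    feats = []
    for d in range(1, degree + 1):
        for combo in combinations_with_replacement(range(len(row)), d):
            feats.append(prod(row[i] for i in combo))
    return feats

# Hypothetical single-feature data with a step change at x = 120.
x = [float(i) for i in range(200)]
y = [1.0 if xi < 120 else 3.0 for xi in x]
threshold, score = best_split(x, y, min_samples_leaf=50)
print("best threshold:", threshold, "score:", score)

# Two hypothetical scaled inputs expanded to 6th degree. The degree-1 terms
# are the original inputs themselves, so they remain available downstream.
row = [0.5, 2.0]
expanded = polynomial_expand(row, degree=6)
print(len(expanded), "features; closed form:", comb(len(row) + 6, 6) - 1)
```

For n inputs and maximum degree d, the number of monomials of degree 1 to d is C(n + d, d) - 1, so the expanded feature set grows very quickly with the number of inputs.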
Model performance
The DT-MNLR machine learning model demonstrated strong predictive performance for Cd(II) adsorption (Table S1, Supporting Information), with an R² of 0.898, an RMSE of 0.593, and a Nash-Sutcliffe Efficiency (NSE) of 0.896, indicating a high degree of accuracy in the model’s predictions. In addition, introducing a highly non-linear relationship and using the original reference set substantially reduced the training loss, a fivefold decrease from 1.25 to 0.25. The regression statistics reveal a strong linear correlation between the predicted and true values, highlighting the model’s ability to accurately estimate adsorption capacity (Figure 2).
Table 1: Model’s predictions of adsorption capacity vs. ground truths.
Figure 2: Prediction of hybrid model compared with true value.
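The three reported metrics can be computed as in the sketch below. The text does not state which definition of R² was used; the squared Pearson correlation is shown here as an assumption, since that definition can differ from NSE when predictions are biased, whereas an R² defined as one minus the ratio of residual to total sum of squares would be identical to NSE. The sample values are hypothetical and are not taken from the study.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def nse(y_true, y_pred):
    """Nash-Sutcliffe Efficiency: 1 - SS_res / SS_tot."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def pearson_r2(y_true, y_pred):
    """Squared Pearson correlation between observations and predictions."""
    n = len(y_true)
    mt, mp = sum(y_true) / n, sum(y_pred) / n
    cov = sum((t - mt) * (p - mp) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mt) ** 2 for t in y_true)
    var_p = sum((p - mp) ** 2 for p in y_pred)
    return cov * cov / (var_t * var_p)

# Hypothetical values: predictions perfectly correlated but biased by +0.5.
y_true = [1.0, 2.0, 3.0, 4.0, 5.0]
y_pred = [1.5, 2.5, 3.5, 4.5, 5.5]

print("RMSE:", rmse(y_true, y_pred))
print("NSE:", nse(y_true, y_pred))
print("Pearson R^2:", pearson_r2(y_true, y_pred))
```

In this toy case the correlation-based R² is insensitive to the constant bias while NSE penalizes it, which is one way the two metrics can diverge slightly on the same predictions.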
Further sensitivity analysis shows that clay content significantly influences parameter variability. In the scenario with 20% clay, the effect is more pronounced than in the scenario with 5% clay. Other parameters, such as OC and CEC, also exhibit noticeable variability under changes of ±10%. However, factors such as solution pH and soil-to-solution ratio display relatively minor sensitivity in both scenarios. Overall, higher clay content amplifies the impact of the various soil and solution parameters on the system. This result demonstrates that the significance of the variables influencing adsorption capacity depends on specific conditions.
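A one-at-a-time perturbation of this kind can be sketched as follows. The response function `toy_adsorption` is a hypothetical stand-in with an assumed clay interaction term, not the trained hybrid model, and the baseline values are invented for illustration; only the ±10% perturbation and the 5% and 20% clay scenarios come from the text.

```python
def toy_adsorption(p):
    """Hypothetical stand-in for the trained model; NOT the study's model."""
    return p["clay"] * (0.1 * p["cec"] + 0.3 * p["oc"]) + 0.2 * p["ph"]

def oat_sensitivity(model, baseline, delta=0.10):
    """Relative output change (%) when each input is moved by -delta and +delta."""
    base_out = model(baseline)
    result = {}
    for name in baseline:
        changes = []
        for sign in (-1.0, 1.0):
            perturbed = dict(baseline)  # copy so the baseline stays untouched
            perturbed[name] = baseline[name] * (1.0 + sign * delta)
            changes.append(100.0 * (model(perturbed) - base_out) / base_out)
        result[name] = tuple(changes)
    return result

# Two hypothetical baseline scenarios differing only in clay content (%).
low_clay = {"clay": 5.0, "cec": 10.0, "oc": 2.0, "ph": 6.0}
high_clay = dict(low_clay, clay=20.0)

sens_low = oat_sensitivity(toy_adsorption, low_clay)
sens_high = oat_sensitivity(toy_adsorption, high_clay)
for name in low_clay:
    lo_minus, lo_plus = sens_low[name]
    hi_minus, hi_plus = sens_high[name]
    print(f"{name:5s} 5% clay: {lo_minus:+.2f}/{lo_plus:+.2f}%   "
          f"20% clay: {hi_minus:+.2f}/{hi_plus:+.2f}%")
```

To apply the same procedure to a fitted model, `toy_adsorption` would be replaced by a function that wraps the model's prediction call for a single sample.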
This study demonstrates the effectiveness of the DT-MNLR hybrid machine learning model in predicting cadmium (Cd(II)) adsorption in soils with high accuracy, as evidenced by the high R² value and significantly reduced error metrics. By integrating advanced data preprocessing, decision trees for classification, multi-output nonlinear regression for feature expansion, and backpropagation neural networks for refinement, the model achieves a robust predictive capability that can address the complexities of heterogeneous soil environments. The findings underscore the critical role of clay content, alongside other factors such as OC and CEC, in influencing adsorption behavior under varying conditions. These insights are not only vital for optimizing cadmium contamination assessments but also serve as a benchmark for employing machine learning in environmental remediation strategies.
The implications of this work extend beyond cadmium to broader applications in heavy metal contamination management. Future research could explore the model’s adaptability to other pollutants, such as chromium (Cr(II)), copper (Cu(II)), and lead (Pb(II)), while incorporating additional processes like biological uptake and redox transformations to enhance predictive accuracy. By bridging machine learning with environmental science, this study paves the way for more efficient, data-driven solutions to soil contamination challenges.
© 2025 Wen-Qiang Wang. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and distribution and allows others to build upon the work non-commercially.