Meihua Yang, Zhou Shi et al.: Evaluation of Machine Learning Approaches to Predict Soil Organic Matter and pH Using vis-NIR Spectra

Publication date: 11 January 2019 Author: Core journals: Sensors

Evaluation of Machine Learning Approaches to Predict Soil Organic Matter and pH Using vis-NIR Spectra

1 Institute of Agricultural Remote Sensing and Information Technology Application, College of Environmental and Resource Sciences, Zhejiang University, Hangzhou 310058, China; 0015862@zju.edu.cn (M.Y.); xudongyun@zju.edu.cn (D.X.)

2 Department of Environmental Engineering, Yuzhang Normal University, Nanchang 330103, China

3 The Institute National de la Recherche Agronomique (INRA), Unité InfoSol, 45075 Orléans, France; songchao.chen@inra.fr

4 Department of Land Resource Management, Jiangxi University of Finance and Economics, Nanchang 330013, China; lihongyi1981@zju.edu.cn

5 Key Laboratory of Spectroscopy Sensing, Ministry of Agriculture, Hangzhou 310058, China

* Correspondence: shizhou@zju.edu.cn


 

Received: 8 October 2018; Accepted: 7 January 2019; Published: 11 January 2019




 

Abstract: Soil organic matter (SOM) and pH are essential soil fertility indicators of paddy soil in the middle-lower Yangtze Plain. Rapid, non-destructive and accurate determination of SOM and pH is vital to preventing soil degradation caused by inappropriate land management practices. Visible-near infrared (vis-NIR) spectroscopy with multivariate calibration can be used to effectively estimate soil properties. In this study, 523 soil samples were collected from paddy fields in the Yangtze Plain, China. Four machine learning approaches, namely partial least squares regression (PLSR), least squares-support vector machines (LS-SVM), extreme learning machines (ELM) and the Cubist regression model (Cubist), were compared in terms of prediction accuracy using the full vis-NIR bands and the bands reduced by the genetic algorithm (GA). The coefficient of determination (R²), root mean square error (RMSE), and ratio of performance to inter-quartile distance (RPIQ) were used to assess the prediction accuracy. The ELM with GA-reduced bands was the best model for SOM (R² = 0.81, RMSE = 5.17, RPIQ = 2.87) and pH (R² = 0.76, RMSE = 0.43, RPIQ = 2.15). The performance of the LS-SVM for pH prediction did not differ significantly between the model with GA (R² = 0.75, RMSE = 0.44, RPIQ = 2.08) and without GA (R² = 0.74, RMSE = 0.45, RPIQ = 2.07). Although only a slight improvement was observed when the ELM was used to predict SOM and pH with reduced bands (SOM: R² = 0.81, RMSE = 5.17, RPIQ = 2.87; pH: R² = 0.76, RMSE = 0.43, RPIQ = 2.15) compared with full bands (SOM: R² = 0.81, RMSE = 5.18, RPIQ = 2.83; pH: R² = 0.76, RMSE = 0.45, RPIQ = 2.07), the number of wavelengths was greatly reduced (SOM: 201 to 44; pH: 201 to 32). Thus, the ELM coupled with bands reduced by GA is recommended for predicting properties of paddy soil (SOM and pH) in the middle-lower Yangtze Plain.

 

Keywords: machine learning approaches; vis-NIR spectra; paddy soil; soil organic matter; pH


 

1.  Introduction

As a major soil type, paddy soils are widely distributed in China, with an area of about 30 million hm². This accounts for 29% of the cultivated land in China, especially in the Yangtze River Delta and southern China [1]. Therefore, it is of great importance to evaluate and monitor the quality of paddy soils. In southern China, soil organic matter (SOM) and pH are important components in soil quality assessment as the former is directly related to crop yield and the latter to food security. Therefore, rapid, accurate and non-destructive assessment of SOM and pH is vital to soil fertility evaluation and monitoring under conventional cropping systems in large areas [1]. However, conventional laboratory measurement of soil properties is time-consuming, tedious and cannot be applied to large areas.

Sensors 2019, 19, 263; doi:10.3390/s19020263; www.mdpi.com/journal/sensors

Visible near-infrared spectroscopy (vis-NIR) has become increasingly popular as an alternative to conventional laboratory analyses because it is rapid, non-destructive, cost-effective, does not require hazardous chemicals, and enables several soil properties to be simultaneously estimated from a single spectrum [2]. When vis-NIR radiation interacts with a soil sample, we can detect the overtones and combinations of fundamental molecular vibrations, such as O–H, C–H, N–H and C=O groups [3]. Vis-NIR has been used to predict soil chemical and physical properties, particularly for SOM, texture, and clay mineralogy [4].

However, soil vis-NIR spectra are largely nonspecific because of the overlapping absorption of soil constituents. Complex absorption patterns generated from soil constituents and quartz need to be mathematically extracted from the spectra [4]. Partial least squares regression (PLSR) is a commonly used linear model; however, there are many nonlinear relationships between spectral data and target soil characteristics in nature [5,6]. Therefore, some non-linear machine learning techniques, including artificial neural networks (ANN), support vector machine regression (SVMR), least squares-support vector machines (LS-SVM), random forest and the Cubist regression model (Cubist) have been used [7–13]. Moreover, soil is a complex mixture that consists of water, air, and organic and inorganic mineral matter of variable origins, so it is difficult for any single calibration technique to gain universal acceptance. Some researchers [9,10,14] have shown that machine learning techniques lead to satisfactory results for the prediction of soil organic carbon (SOC) and pH over large areas. Extreme learning machines (ELMs), an emerging machine learning technique put forward by Huang et al. [15], have been used extensively over the past several years because of their good generalization performance and extremely fast learning speed. In addition, most data sets contain a large number of spectral variables, so unrelated variables need to be eliminated; doing so provides insight into the important wavelengths related to soil properties and enables their use in the prediction of soil properties. Therefore, this study was conducted to investigate paddy fields in the Yangtze Plain, China.

The specific goals of this study were to: (i) explore the important wavelengths of vis-NIR in SOM and pH predictions; (ii) compare the performance of linear (PLSR) and non-linear (least square-support vector machines, LS-SVM; extreme learning machines, ELM; Cubist regression model, Cubist) models for predicting SOM and pH.

2.  Materials and Methods

 

2.1.   Study Area and Soil Sampling

The study was conducted in the middle-lower Yangtze Plain, which includes Jiangsu, Zhejiang, Jiangxi and Hunan. The parent material of the soils is alluvial deposits from the Yangtze River and its tributaries, and the main soil type of the study area is paddy soil, which is a kind of Anthrosol in Chinese Soil Taxonomy. We selected 57 paddy fields with an area of more than 0.6 km² and chose 8–10 sampling sites in each field, so that a total of 523 soil samples were ultimately collected (Figure 1). The soil type of the samples and the parent material can be seen in Table 1. These soil samples were air-dried, ground and sieved to less than 2 mm. Stones and plant residues were removed. Each sample was divided by the quartering method into two portions, one for laboratory chemical analysis and the other for spectral measurements. The SOM content was measured by the potassium dichromate volumetric method, and soil pH was determined in a slurry of soil and water at a ratio of 1:2.5 using an electronic pH meter.


 

 

Figure 1. Location of sampling sites.

Table 1. Basic properties of the soil samples in the middle-lower Yangtze Plain.

Texture       Sample Number   Crop                   Parent Material
Clay          83              Idle field, Silkworm   Acidic crystalline
Clay loam     220             Rice                   Alluvial deposit
Loam          120             Rice                   Alluvial deposit
Sandy loam    100             Grass, Idle field      Red sandstone

 

2.2.   Spectroscopic Measurement and Pre-Processing of Spectra

The vis-NIR spectra were measured with an ASD FieldSpec® Pro FR spectrometer (Analytical Spectral Devices Inc., Boulder, CO, USA) using a high-intensity contact probe with a spectral range of 350 to 2500 nm and a spectral resolution of 1 nm. Each sample was placed in a petri dish (10 cm diameter and 1.5 cm depth), after which the spectrometer was calibrated using a Spectralon® panel with 99% reflectance. We measured three replicates of spectra for each sample and each replicate consisted of 10 internal scans. The resulting 30 spectra were averaged into one to represent each sample.

A spectral range of 400 to 2400 nm was used because the spectra outside this range had a lot of noise. The reflectance spectra (R) were transformed to absorbance (A = log10(1/R)), then resampled to 10 nm to reduce dimensionality. The spectra of 523 samples were split into calibration (350 samples) and validation (173 samples) sets using the Kennard–Stone algorithm [16]. We used the continuum-removed spectrum [17] of the mean of all samples to help interpret the main absorption features in the spectra.
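The pre-processing chain described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the text does not say how the 10 nm resampling was performed, so the sketch simply keeps every tenth 1 nm band, and the function names (`to_absorbance`, `resample`, `kennard_stone`) are hypothetical. The Kennard–Stone routine follows the usual rule of starting from the two most distant samples and repeatedly adding the sample that lies farthest from its nearest already-selected neighbour.

```python
import math

def to_absorbance(reflectance):
    """Convert reflectance R (0 < R <= 1) to absorbance A = log10(1/R)."""
    return [math.log10(1.0 / r) for r in reflectance]

def resample(wavelengths, values, start=400, stop=2400, step=10):
    """Keep one band every `step` nm within [start, stop].

    Assumption: simple decimation; the source only says "resampled to 10 nm".
    """
    lookup = dict(zip(wavelengths, values))
    kept = list(range(start, stop + 1, step))
    return kept, [lookup[w] for w in kept]

def kennard_stone(samples, n_select):
    """Return indices of n_select (>= 2) samples chosen by Kennard-Stone."""
    def dist(a, b):
        return math.dist(samples[a], samples[b])

    n = len(samples)
    # Start with the two samples that are farthest apart.
    first, second = max(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda pair: dist(*pair),
    )
    selected = [first, second]
    remaining = [i for i in range(n) if i not in selected]
    while len(selected) < n_select:
        # Add the candidate whose nearest selected sample is farthest away.
        nxt = max(remaining, key=lambda c: min(dist(c, s) for s in selected))
        selected.append(nxt)
        remaining.remove(nxt)
    return selected
```

With 1 nm spectra from 350 to 2500 nm, `resample` returns 201 bands, which matches the number of full bands quoted in the abstract. The indices returned by `kennard_stone` would form the calibration set and the remaining samples the validation set.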

We used the genetic algorithm (GA) to select an optimal subset of spectral bands in the calibration process. Many studies [18–20] have demonstrated the importance of GA wavelength selection in the calibration step to avoid the selection of random correlation and irrelevant variables. Wavelength selection using GA can improve the robustness of multivariate calibrations without loss of prediction capacity, and furthermore, provides useful information about the chemical system [11]. We used the method of genetic algorithm (GA) and partial least squares regression (GA–PLS) for feature selection. In this study, the parameter values of the GA were set as follows based on Leardi et al. [21] and Shi et al. [22]: max generation of 100, population size of 64, mutation rate of 0.01 and replicate run of 10. The fitness function was determined according to the root mean square error of cross-validation (RMSECV) of the partial least squares. For this we used the PLS_Toolbox 8.5.1 (Eigenvector Research Inc., Wenatchee, WA, USA) in MatLab (R2016A, The MathWorks Inc., Natick, MA, USA).
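To make the idea of GA band selection concrete, the following Python sketch evolves binary band masks and scores each mask by its cross-validated RMSE. It is not the PLS_Toolbox GA–PLS routine used in the study: for brevity the fitness uses an ordinary least-squares fit as a stand-in for PLS, and the 5-fold interleaved cross-validation, tournament selection, single-point crossover and elitism are all assumptions chosen for illustration. The replicate runs are omitted. Only the default population size, number of generations and mutation rate are taken from the text.

```python
import numpy as np

def rmsecv(X, y, mask, n_folds=5):
    """Cross-validated RMSE of a least-squares fit on the selected bands."""
    if not mask.any():
        return np.inf                                  # empty subsets are invalid
    Xs = np.column_stack([np.ones(len(y)), X[:, mask]])
    folds = np.arange(len(y)) % n_folds                # interleaved fold labels
    sq_err = 0.0
    for k in range(n_folds):
        tr, te = folds != k, folds == k
        coef, *_ = np.linalg.lstsq(Xs[tr], y[tr], rcond=None)
        sq_err += np.sum((y[te] - Xs[te] @ coef) ** 2)
    return np.sqrt(sq_err / len(y))

def ga_select(X, y, pop_size=64, generations=100, mutation_rate=0.01, seed=0):
    """Return (best band mask, best RMSECV per generation)."""
    rng = np.random.default_rng(seed)
    n_bands = X.shape[1]
    pop = rng.random((pop_size, n_bands)) < 0.5        # random initial subsets
    history = []
    for _ in range(generations):
        fitness = np.array([rmsecv(X, y, ind) for ind in pop])
        order = np.argsort(fitness)
        pop, fitness = pop[order], fitness[order]      # best individual first
        history.append(fitness[0])
        children = [pop[0].copy()]                     # elitism: keep the best
        while len(children) < pop_size:
            # Tournament of two: the lower index wins because pop is sorted.
            a, b = (min(rng.integers(pop_size, size=2)) for _ in range(2))
            cut = rng.integers(1, n_bands)             # single-point crossover
            child = np.concatenate([pop[a][:cut], pop[b][cut:]])
            child ^= rng.random(n_bands) < mutation_rate   # bit-flip mutation
            children.append(child)
        pop = np.array(children)
    fitness = np.array([rmsecv(X, y, ind) for ind in pop])
    best = int(np.argmin(fitness))
    return pop[best], history + [fitness[best]]
```

Because the best mask is always carried over unchanged, the recorded RMSECV can never increase from one generation to the next. Replacing `rmsecv` with a PLS-based version would bring the sketch closer to GA–PLS.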


 

2.3.   Multivariate Regression Models

 

2.3.1.   Partial Least Squares Regression (PLSR)

Partial least squares regression (PLSR) is a linear regression model widely used in the quantitative analysis of diffuse reflectance spectra in soil [23]. This method uses a latent variable approach to model covariance structures in two projected spaces of the predicted and observed variables [24]. The optimum number of latent variables would be the number that yields the minimum prediction error sum of squares using cross-validation of the calibration set. For PLSR we used the R package ‘pls’ [25] of R 3.3.3 [26].
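As an illustration of how the number of latent variables can be chosen by the minimum prediction error sum of squares (PRESS), the sketch below implements single-response PLS (PLS1) with the NIPALS algorithm in NumPy. It is not the R 'pls' package used in the study; the function names and the 5-fold interleaved cross-validation are assumptions made for the example.

```python
import numpy as np

def pls1_fit(X, y, n_components):
    """Fit PLS1 by NIPALS; return (regression coefficients, intercept)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xr, yr = X - x_mean, y - y_mean            # centred working copies
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xr.T @ yr
        w = w / np.linalg.norm(w)              # weight vector
        t = Xr @ w                             # scores
        tt = t @ t
        p = Xr.T @ t / tt                      # X loadings
        c = yr @ t / tt                        # y loading
        Xr = Xr - np.outer(t, p)               # deflate X
        yr = yr - c * t                        # deflate y
        W.append(w)
        P.append(p)
        q.append(c)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    coef = W @ np.linalg.solve(P.T @ W, q)
    return coef, y_mean - x_mean @ coef

def press(X, y, n_components, n_folds=5):
    """Prediction error sum of squares from k-fold cross-validation."""
    folds = np.arange(len(y)) % n_folds        # interleaved fold labels
    total = 0.0
    for k in range(n_folds):
        train, test = folds != k, folds == k
        coef, b0 = pls1_fit(X[train], y[train], n_components)
        total += np.sum((y[test] - (X[test] @ coef + b0)) ** 2)
    return total

def best_n_components(X, y, max_components):
    """Number of latent variables with the smallest cross-validated PRESS."""
    scores = [press(X, y, a) for a in range(1, max_components + 1)]
    return int(np.argmin(scores)) + 1
```

Here `X` would hold the calibration spectra (samples by bands) and `y` the measured SOM or pH values; `max_components` must not exceed the rank of the centred training spectra.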

2.3.2.   Least Squares-Support Vector Machines (LS-SVM)

The LS-SVM method employs classification and regression analysis to solve linear and nonlinear multivariate problems [27]. The LS-SVM method uses linear equations instead of convex quadratic programming for classical SVMs. Additionally, the LS-SVM uses a Kernel function of Gaussian radial basis function (RBF). We used a systematic grid search method to optimize the parameters C and γ of the RBF. The optimal model parameters were determined by the lowest RMSE in the calibration set by leave-one-out cross-validation. This was done with MatLab.
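The statement that LS-SVM replaces quadratic programming with linear equations can be illustrated with a short NumPy sketch. In the standard LS-SVM regression formulation, the bias b and the dual weights α solve one linear system built from the kernel matrix. The sketch assumes that C is the regularization constant and that γ enters the RBF kernel as exp(−γ‖x − z‖²); the text does not define these parameters precisely, and the function names are hypothetical rather than those of the MatLab code used in the study.

```python
import numpy as np

def rbf_kernel(A, B, gamma):
    """K[i, j] = exp(-gamma * ||A[i] - B[j]||^2)."""
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def lssvm_fit(X, y, C, gamma):
    """Solve the LS-SVM system [[0, 1^T], [1, K + I/C]] [b; alpha] = [0; y]."""
    n = len(y)
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = rbf_kernel(X, X, gamma) + np.eye(n) / C
    rhs = np.concatenate(([0.0], y))
    sol = np.linalg.solve(A, rhs)
    return sol[0], sol[1:]                     # bias b, dual weights alpha

def lssvm_predict(X_train, b, alpha, gamma, X_new):
    """Predict with f(x) = sum_i alpha_i k(x, x_i) + b."""
    return rbf_kernel(X_new, X_train, gamma) @ alpha + b
```

A grid search as described above would call `lssvm_fit` for each (C, γ) pair and keep the pair with the lowest leave-one-out RMSE on the calibration set.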

2.3.3.   Cubist Regression Model

The Cubist model is a data mining technique and an extension of the M5 model tree developed by Quinlan [28]. Cubist is a rule-based regression where a model tree is first created and then reduced to a series of rules. These rules partition samples according to their spectra, and a unique linear model is then applied to predict the target variable. In addition, Cubist can utilise boosting (committees) and adjust its predictions using neighbours from within the training data set (neighbours). More details on Cubist and its implementation can be found in Viscarra Rossel and Webster, and in Minasny and McBratney [9,29]. The numbers of committees and neighbours were determined by the lowest RMSE in the calibration set by leave-one-out cross-validation. For Cubist we used the R package ‘Cubist’ [30] of R 3.3.3 [26].

2.3.4.   Extreme Learning Machine

The extreme learning machine (ELM) is a generalized single-hidden layer feedforward network (SLFN) with a weight and hidden layer threshold in the first layer that are randomly assigned and a weight in the output layer that is calculated directly by the least-squares method. The entire learning process is completed in one round, with no iterations required; therefore, this algorithm performs at extremely fast learning speed. The simplified scheme can be found in Figure 2.

Algorithm ELM: For $N$ distinct samples $(x_i, t_i)$, where $x_i = [x_{i1}, x_{i2}, \ldots, x_{in}]^T \in \mathbb{R}^n$, $x_i$ were the soil spectra, while $t_i$ were the observed values of either SOM or pH.

Given a hidden node number $\hat{N}$ and an activation function $g(x)$, the network output is defined as follows:

$$\sum_{j=1}^{\hat{N}} \beta_j g_j(x_i) = \sum_{j=1}^{\hat{N}} \beta_j\, g(w_j \cdot x_i + b_j) = o_i, \quad i = 1, 2, \ldots, N; \; j = 1, 2, \ldots, \hat{N} \tag{1}$$

where $w_j \in \mathbb{R}^n$ is the weight vector connecting the input nodes to the $j$th hidden node, $b_j \in \mathbb{R}$ is the threshold of the $j$th hidden node, and $\beta_j \in \mathbb{R}$ represents the weight connecting the $j$th hidden node and the output nodes. To approach the real results of the training data infinitely, the prediction result $o_i$ must be consistent with the real result $t_i$, in which case $\sum_{i=1}^{N} \lVert o_i - t_i \rVert = 0$. Under these conditions, Equation (1) can be expressed as $\sum_{j=1}^{\hat{N}} \beta_j\, g(w_j \cdot x_i + b_j) = t_i$, which is represented by a matrix:

$$H\beta = T \tag{2}$$

where:

$$H = \begin{bmatrix} g(w_1 \cdot x_1 + b_1) & \cdots & g(w_{\hat{N}} \cdot x_1 + b_{\hat{N}}) \\ \vdots & \ddots & \vdots \\ g(w_1 \cdot x_N + b_1) & \cdots & g(w_{\hat{N}} \cdot x_N + b_{\hat{N}}) \end{bmatrix}_{N \times \hat{N}}, \quad \beta = \begin{bmatrix} \beta_1 \\ \vdots \\ \beta_{\hat{N}} \end{bmatrix}_{\hat{N} \times 1} \quad \text{and} \quad T = \begin{bmatrix} t_1 \\ \vdots \\ t_N \end{bmatrix}_{N \times 1} \tag{3}$$

When the input weight $w_j \in \mathbb{R}^n$ and bias $b_j \in \mathbb{R}$ are randomly assigned, the output matrix $H$ of the hidden layer can be calculated by the ELM, after which the output weight $\beta$ is calculated by $\hat{\beta} = H^{\dagger} T$, where $H^{\dagger}$ is the Moore–Penrose generalized inverse of $H$.

The ELM regression ability varies significantly with the number of initial hidden neurons. To select the number of hidden neurons, we conducted an experiment by varying the number of hidden neurons from 1 to 120 in steps of 1. The optimum number of initial hidden neurons would be the number that produced the minimum prediction error sum of squares for cross-validation of the calibration set. This was done in MatLab.
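The ELM training procedure and the search over the number of hidden neurons can be sketched in NumPy as follows. This is a minimal illustration, not the MatLab code used in the study: the sigmoid activation, the uniform range of the random weights, the 5-fold interleaved cross-validation and the function names are all assumptions, since the text does not specify them.

```python
import numpy as np

def elm_fit(X, t, n_hidden, rng):
    """Train an ELM: random hidden layer, least-squares output weights."""
    W = rng.uniform(-1.0, 1.0, size=(X.shape[1], n_hidden))  # input weights w_j
    b = rng.uniform(-1.0, 1.0, size=n_hidden)                # thresholds b_j
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))                   # hidden output matrix H
    beta = np.linalg.pinv(H) @ t                             # beta = pinv(H) T
    return W, b, beta

def elm_predict(X, W, b, beta):
    """Evaluate the trained network: o = H(X) beta."""
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return H @ beta

def choose_hidden(X, t, candidates, n_folds=5, seed=0):
    """Pick the hidden-node count with the lowest cross-validated PRESS."""
    folds = np.arange(len(t)) % n_folds                      # interleaved folds
    best, best_press = None, np.inf
    for n_hidden in candidates:
        press = 0.0
        for k in range(n_folds):
            tr, te = folds != k, folds == k
            model = elm_fit(X[tr], t[tr], n_hidden, np.random.default_rng(seed))
            press += np.sum((t[te] - elm_predict(X[te], *model)) ** 2)
        if press < best_press:
            best, best_press = n_hidden, press
    return best
```

Calling `choose_hidden(X_cal, y_cal, range(1, 121))` on hypothetical calibration arrays `X_cal` and `y_cal` would mirror the search from 1 to 120 hidden neurons described above. No iterative training is involved: each fit is a single pseudo-inverse, which is why the method is fast.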


Figure 2. Structure of ELM (wj, weighting; bj, biases).

2.4. Model Evaluation

We compared the measured values vs. the predicted values of the cross-validation in the calibration and validation data sets using simple linear regression. The predictability of the different calibration models was evaluated by the coefficient of determination (R²), mean error (ME), root mean square error (RMSE) and ratio of performance to inter-quartile distance (RPIQ), which were suggested for assessing vis-NIR model performance by Bellon-Maurel et al. [31]. Generally, larger values of R² and RPIQ and smaller RMSE indicate better model performance.
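The four evaluation statistics can be computed as in the following Python sketch. The exact definitions are not spelled out in the text, so the sketch makes explicit assumptions: ME is the mean of predicted minus observed values, R² is computed as 1 − SS_res/SS_tot (a squared correlation from the regression of observed on predicted values would be an alternative), and the quartiles for RPIQ use linear interpolation over the observed values. The function name `evaluate` is hypothetical.

```python
import math
import statistics

def evaluate(observed, predicted):
    """Return R2, ME, RMSE and RPIQ for paired observed/predicted values."""
    n = len(observed)
    errors = [p - o for o, p in zip(observed, predicted)]
    me = sum(errors) / n                                   # mean error (bias)
    ss_res = sum(e * e for e in errors)
    rmse = math.sqrt(ss_res / n)
    mean_obs = sum(observed) / n
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    r2 = 1.0 - ss_res / ss_tot                             # assumed R2 definition
    # Quartiles of the observed values (linear interpolation, assumed).
    q1, _, q3 = statistics.quantiles(observed, n=4, method="inclusive")
    rpiq = (q3 - q1) / rmse                                # IQR relative to RMSE
    return {"R2": r2, "ME": me, "RMSE": rmse, "RPIQ": rpiq}
```

RPIQ divides the inter-quartile range of the observations by the RMSE, so it does not assume that the soil property is normally distributed.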

3.  Results

 

3.1.   Descriptive Statistics of the Soil Properties

Table 2 shows a summary of statistics describing the SOM and pH for the total, calibration, and validation datasets. The SOM ranged from 2.44 to 60.50 g kg⁻¹ and the pH ranged from 3.92 to 8.60 for the whole dataset. The coefficient of variation (CV) of SOM was high (>35%), whereas the CV for pH was moderate (around 15%) according to the Wilding classification standard [32]. These findings indicated that the soil properties had wide ranges and were spatially variable within the research area, which is favourable for obtaining high spectroscopic calibration accuracy and thus good predictive performance [33]. The similarity of summary statistics (e.g., mean, SD, and CV) from the calibration and validation sets showed that they were both able to represent the entire data set.
