Evaluation of Machine Learning Approaches to Predict Soil Organic Matter and pH Using vis-NIR Spectra

1 Institute of Agricultural Remote Sensing and Information Technology Application, College of Environmental and Resource Sciences, Zhejiang University, Hangzhou 310058, China; 0015862@zju.edu.cn (M.Y.); xudongyun@zju.edu.cn (D.X.)

2 Department of Environmental Engineering, Yuzhang Normal University, Nanchang 330103, China

3 The Institute National de la Recherche Agronomique (INRA), Unité InfoSol, 45075 Orléans, France; songchao.chen@inra.fr

4 Department of Land Resource Management, Jiangxi University of Finance and Economics, Nanchang 330013, China; lihongyi1981@zju.edu.cn

5 Key Laboratory of Spectroscopy Sensing, Ministry of Agriculture, Hangzhou 310058, China

* Correspondence: shizhou@zju.edu.cn

Received: 8 October 2018; Accepted: 7 January 2019; Published: 11 January 2019


Abstract: Soil organic matter (SOM) and pH are essential soil fertility indicators of paddy soil in the middle-lower Yangtze Plain. Rapid, non-destructive and accurate determination of SOM and pH is vital to preventing soil degradation caused by inappropriate land management practices. Visible-near infrared (vis-NIR) spectroscopy with multivariate calibration can be used to effectively estimate soil properties. In this study, 523 soil samples were collected from paddy fields in the Yangtze Plain, China. Four machine learning approaches, namely partial least squares regression (PLSR), least squares-support vector machines (LS-SVM), extreme learning machines (ELM) and the Cubist regression model (Cubist), were used to compare the prediction accuracy based on vis-NIR full bands and bands reduced using the genetic algorithm (GA). The coefficient of determination (R²), root mean square error (RMSE), and ratio of performance to inter-quartile distance (RPIQ) were used to assess the prediction accuracy. The ELM with GA-reduced bands was the best model for SOM (R² = 0.81, RMSE = 5.17, RPIQ = 2.87) and pH (R² = 0.76, RMSE = 0.43, RPIQ = 2.15). The performance of the LS-SVM for pH prediction did not differ significantly between the model with GA (R² = 0.75, RMSE = 0.44, RPIQ = 2.08) and without GA (R² = 0.74, RMSE = 0.45, RPIQ = 2.07). Although only a slight improvement was observed when the ELM was used for prediction of SOM and pH using reduced bands (SOM: R² = 0.81, RMSE = 5.17, RPIQ = 2.87; pH: R² = 0.76, RMSE = 0.43, RPIQ = 2.15) compared with full bands (SOM: R² = 0.81, RMSE = 5.18, RPIQ = 2.83; pH: R² = 0.76, RMSE = 0.45, RPIQ = 2.07), the number of wavelengths was greatly reduced (SOM: 201 to 44; pH: 201 to 32). Thus, the ELM coupled with bands reduced by GA is recommended for prediction of properties of paddy soil (SOM and pH) in the middle-lower Yangtze Plain.

Keywords: machine learning approaches; vis-NIR spectra; paddy soil; soil organic matter; pH

1. Introduction

As a major soil type, paddy soils are widely distributed in China, with an area of about 30 million hm². This accounts for 29% of the cultivated land in China, especially in the Yangtze River Delta and southern China [1]. Therefore, it is of great importance to evaluate and monitor the quality of paddy soils. In southern China, soil organic matter (SOM) and pH are important components in soil quality assessment as the former is directly related to crop yield and the latter to food security. Therefore, rapid, accurate and non-destructive assessment of SOM and pH is vital to soil fertility evaluation and monitoring under conventional cropping systems in large areas [1]. However, conventional laboratory measurement of soil properties is time-consuming, tedious and cannot be applied to large areas.

Sensors 2019, 19, 263; doi:10.3390/s19020263 www.mdpi.com/journal/sensors

Visible near-infrared spectroscopy (vis-NIR) has become increasingly popular as an alternative to conventional laboratory analyses because it is rapid, non-destructive, cost-effective, does not require hazardous chemicals, and enables several soil properties to be simultaneously estimated from a single spectrum [2]. When vis-NIR radiation interacts with a soil sample, we can detect the overtones and combinations of fundamental molecular vibrations, such as O–H, C–H, N–H and C=O groups [3]. Vis-NIR has been used to predict soil chemical and physical properties, particularly for SOM, texture, and clay mineralogy [4].

However, soil vis-NIR spectra are largely nonspecific because of the overlapping absorption of soil constituents. Complex absorption patterns generated from soil constituents and quartz need to be mathematically extracted from the spectra [4]. Partial least squares regression (PLSR) is a commonly used linear model; however, there are many nonlinear relationships between spectral data and target soil characteristics in nature [5,6]. Therefore, some non-linear machine learning techniques, including artificial neural networks (ANN), support vector machine regression (SVMR), least squares-support vector machines (LS-SVM), random forest and the Cubist regression model (Cubist) have been used [7–13]. Moreover, soil is a complex mixture that consists of water, air, and organic and inorganic mineral matter of variable origins, so it is difficult for any single calibration technique to achieve universal acceptance. Some researchers [9,10,14] have shown that machine learning techniques lead to satisfactory results for the prediction of soil organic carbon (SOC) and pH over large areas. Extreme learning machines (ELMs), the emergent machine learning technique put forward by Huang et al. [15], have been used extensively over the past several years because of their good generalization performance and extremely fast learning speed. In addition, because most data sets contain a large number of spectral variables, unrelated variables need to be eliminated to provide insight into the important wavelengths related to soil properties and to enable their use in the prediction of soil properties. Therefore, this study was conducted to investigate paddy fields in the Yangtze Plain, China.

The specific goals of this study were to: (i) explore the important wavelengths of vis-NIR in SOM and pH predictions; (ii) compare the performance of linear (PLSR) and non-linear (least squares-support vector machines, LS-SVM; extreme learning machines, ELM; Cubist regression model, Cubist) models for predicting SOM and pH.

2. Materials and Methods

2.1. Study Area and Soil Sampling

The study was conducted in the middle-lower Yangtze Plain, which includes Jiangsu, Zhejiang, Jiangxi and Hunan. The parent material of the soils is alluvial deposits from the Yangtze River and its tributaries, and the main soil type of the study area is paddy soil, which is a kind of Anthrosol in Chinese Soil Taxonomy. We selected 57 paddy fields with an area of more than 0.6 km², and 8–10 sampling sites were selected for each field, with a total of 523 soil samples ultimately being collected (Figure 1). The soil type of the samples and the parent material can be seen in Table 1. These soil samples were air-dried, ground and sieved to less than 2 mm. Stones and plant residues were removed. Each sample was divided by the quartering method into two portions, one for laboratory chemical analysis and the other for spectral measurements. The SOM content was measured by the potassium dichromate volumetric method, and soil pH was determined in a slurry of soil and water at a ratio of 1:2.5 using an electronic pH meter.

Figure 1. Location of sampling sites.

Table 1. Basic properties of the soil samples in the middle-lower Yangtze Plain.

Texture	Sample Number	Crop	Parent Material
Clay	83	Idle field, Silkworm	Acidic crystalline
Clay loam	220	Rice	Alluvial deposit
Loam	120	Rice	Alluvial deposit
Sandy loam	100	Grass, Idle field	Red sandstone

2.2. Spectroscopic Measurement and Pre-Processing of Spectra

The vis-NIR spectra were measured with an ASD FieldSpec® Pro FR spectrometer (Analytical Spectral Devices Inc., Boulder, CO, USA) using a high-intensity contact probe with a spectral range of 350 to 2500 nm and a spectral resolution of 1 nm. Each sample was placed in a petri dish (10 cm diameter and 1.5 cm depth), after which the spectrometer was calibrated using a Spectralon® panel with 99% reflectance. We measured three replicates of spectra for each sample and each replicate consisted of 10 internal scans. The resulting 30 spectra were averaged into one to represent each sample.

A spectral range of 400 to 2400 nm was used because the spectra outside this range were noisy. The reflectance spectra (R) were transformed to absorbance (A = log10(1/R)), then resampled to 10 nm to reduce dimensionality. The spectra of 523 samples were split into calibration (350 samples) and validation (173 samples) sets using the Kennard–Stone algorithm [16]. We used the continuum-removed spectrum [17] of the mean of all samples to help interpret the main absorption features in the spectra.
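The pre-processing chain described above can be sketched in Python with NumPy. This is a minimal illustration, not the authors' code: the function names are hypothetical, "resampled to 10 nm" is assumed to mean keeping every tenth band of the 1 nm grid (the original resampling method is not stated), and the Kennard–Stone routine uses Euclidean distances between spectra.

```python
import numpy as np

def to_absorbance(reflectance):
    """Apparent absorbance A = log10(1/R) for reflectance values 0 < R <= 1."""
    return np.log10(1.0 / np.asarray(reflectance, dtype=float))

def subsample_bands(wavelengths, spectra, start=400, stop=2400, step=10):
    """Keep the bands at start, start+step, ..., stop.

    Assumption: resampling is approximated by keeping every `step`-th band
    of an integer 1 nm grid; the paper does not state the exact method.
    """
    wavelengths = np.asarray(wavelengths)
    keep = ((wavelengths >= start) & (wavelengths <= stop)
            & ((wavelengths - start) % step == 0))
    return wavelengths[keep], np.asarray(spectra)[:, keep]

def kennard_stone(X, n_select):
    """Return the row indices of n_select samples chosen by Kennard-Stone."""
    X = np.asarray(X, dtype=float)
    if not 2 <= n_select <= len(X):
        raise ValueError("n_select must be between 2 and the number of samples")
    # Squared Euclidean distances between all pairs of samples.
    sq = (X ** 2).sum(axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    # Start with the two most distant samples.
    i, j = np.unravel_index(np.argmax(d2), d2.shape)
    selected = [int(i), int(j)]
    # Distance from every sample to its nearest already-selected sample.
    min_d2 = np.minimum(d2[i], d2[j])
    min_d2[selected] = -1.0
    while len(selected) < n_select:
        nxt = int(np.argmax(min_d2))   # farthest from the selected set
        selected.append(nxt)
        min_d2 = np.minimum(min_d2, d2[nxt])
        min_d2[selected] = -1.0
    return np.array(selected)
```

With a 350 to 2500 nm input grid, `subsample_bands` returns 201 bands between 400 and 2400 nm, and `kennard_stone(A, 350)` would give the calibration indices, the remaining samples forming the validation set.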

We used the genetic algorithm (GA) to select an optimal subset of spectral bands in the calibration process. Many studies [18–20] have demonstrated the importance of GA wavelength selection in the calibration step to avoid the selection of random correlations and irrelevant variables. Wavelength selection using GA can improve the robustness of multivariate calibrations without loss of prediction capacity, and furthermore, provides useful information about the chemical system [11]. We combined the GA with partial least squares regression (GA–PLS) for feature selection. In this study, the parameter values of the GA were set as follows based on Leardi et al. [21] and Shi et al. [22]: maximum number of generations of 100, population size of 64, mutation rate of 0.01 and 10 replicate runs. The fitness function was determined according to the root mean square error of cross-validation (RMSECV) of the partial least squares. For this we used the PLS_Toolbox 8.5.1 (Eigenvector Research Inc., Wenatchee, WA, USA) in MATLAB (R2016a, The MathWorks Inc., Natick, MA, USA).
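The idea of GA band selection with a cross-validated fitness can be illustrated with a deliberately simplified Python sketch. It is not the PLS_Toolbox GA–PLS routine: the fitness here is the cross-validated RMSE of a ridge regression used as a stand-in for PLS, the selection and crossover operators are generic choices, replicate runs are omitted, and all function names are hypothetical. Only the default generation count, population size and mutation rate follow the values quoted above.

```python
import numpy as np

def cv_rmse(X, y, mask, n_folds=5, ridge=1e-3):
    """K-fold cross-validated RMSE using only the bands where mask is True.

    Assumption: ridge regression stands in for the PLS model of GA-PLS.
    """
    if not mask.any():
        return np.inf                      # an empty band subset is invalid
    Xs = X[:, mask]
    folds = np.arange(len(y)) % n_folds    # interleaved fold labels
    sq_err = 0.0
    for k in range(n_folds):
        tr, te = folds != k, folds == k
        mu_x, mu_y = Xs[tr].mean(axis=0), y[tr].mean()
        A = Xs[tr] - mu_x
        coef = np.linalg.solve(A.T @ A + ridge * np.eye(A.shape[1]),
                               A.T @ (y[tr] - mu_y))
        pred = (Xs[te] - mu_x) @ coef + mu_y
        sq_err += ((pred - y[te]) ** 2).sum()
    return np.sqrt(sq_err / len(y))

def ga_select_bands(X, y, pop_size=64, n_generations=100,
                    mutation_rate=0.01, seed=0):
    """Return (best band mask, its fitness) found by a simple binary GA."""
    rng = np.random.default_rng(seed)
    n_bands = X.shape[1]
    pop = rng.random((pop_size, n_bands)) < 0.3       # random initial subsets
    best_mask, best_fit = None, np.inf
    for _ in range(n_generations):
        fit = np.array([cv_rmse(X, y, m) for m in pop])
        order = np.argsort(fit)
        if fit[order[0]] < best_fit:                  # remember the best ever
            best_fit, best_mask = fit[order[0]], pop[order[0]].copy()
        parents = pop[order[: pop_size // 2]]         # truncation selection
        a = rng.integers(0, len(parents), pop_size)
        b = rng.integers(0, len(parents), pop_size)
        cross = rng.random((pop_size, n_bands)) < 0.5  # uniform crossover
        children = np.where(cross, parents[a], parents[b])
        flip = rng.random((pop_size, n_bands)) < mutation_rate
        pop = children ^ flip                         # bit-flip mutation
    return best_mask, best_fit
```

Calling `ga_select_bands(A_cal, y_cal)` on calibration spectra `A_cal` and reference values `y_cal` (both hypothetical names) would return a Boolean mask over the bands; swapping `cv_rmse` for a PLS-based RMSECV would bring the sketch closer to the procedure described in the text.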

2.3. Multivariate Regression Models

2.3.1. Partial Least Squares Regression (PLSR)

Partial least squares regression (PLSR) is a linear regression model widely used in the quantitative analysis of diffuse reflectance spectra in soil [23]. This method uses a latent variable approach to model covariance structures in two projected spaces of the predicted and observed variables [24]. The optimum number of latent variables would be the number that yields the minimum prediction error sum of squares using cross-validation of the calibration set. For PLSR we used the R package ‘pls’ [25] of R 3.3.3 [26].
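As an illustration of choosing the number of latent variables by the minimum prediction error sum of squares (PRESS), the sketch below implements single-response PLS (PLS1) with the NIPALS algorithm in NumPy and evaluates PRESS by k-fold cross-validation. It is not the R ‘pls’ implementation used in the study; the function names, the fold layout and the default of 10 folds are assumptions made for the example.

```python
import numpy as np

def pls1_fit(X, y, n_components):
    """Fit PLS1 by NIPALS; returns (regression coefficients, x_mean, y_mean)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xk, yk = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xk.T @ yk                      # direction of maximal covariance
        norm = np.linalg.norm(w)
        if norm < 1e-12:                   # nothing left to explain
            break
        w = w / norm
        t = Xk @ w                         # scores
        tt = t @ t
        p = Xk.T @ t / tt                  # X loadings
        c = (yk @ t) / tt                  # y loading
        Xk = Xk - np.outer(t, p)           # deflate X and y
        yk = yk - c * t
        W.append(w); P.append(p); q.append(c)
    if not W:
        return np.zeros(X.shape[1]), x_mean, y_mean
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    coef = W @ np.linalg.solve(P.T @ W, q)
    return coef, x_mean, y_mean

def pls1_predict(X, coef, x_mean, y_mean):
    return (X - x_mean) @ coef + y_mean

def press_by_components(X, y, max_components, n_folds=10):
    """PRESS for 1..max_components latent variables (k-fold cross-validation)."""
    folds = np.arange(len(y)) % n_folds
    press = np.zeros(max_components)
    for k in range(n_folds):
        tr, te = folds != k, folds == k
        for a in range(1, max_components + 1):
            coef, xm, ym = pls1_fit(X[tr], y[tr], a)
            resid = pls1_predict(X[te], coef, xm, ym) - y[te]
            press[a - 1] += np.sum(resid ** 2)
    return press
```

The optimum number of latent variables is then `int(np.argmin(press)) + 1`, where `press` is the array returned by `press_by_components`.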

2.3.2. Least Squares-Support Vector Machines (LS-SVM)

The LS-SVM method employs classification and regression analysis to solve linear and nonlinear multivariate problems [27]. The LS-SVM method solves a set of linear equations instead of the convex quadratic programming problem of classical SVMs. Additionally, the LS-SVM uses a Gaussian radial basis function (RBF) kernel. We used a systematic grid search method to optimize the parameters C and γ of the RBF. The optimal model parameters were determined by the lowest RMSE in the calibration set by leave-one-out cross-validation. This was done with MATLAB.
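To make the "linear equations instead of quadratic programming" point concrete, here is a minimal NumPy sketch of LS-SVM regression in its standard dual form, where the bias and the support values are obtained from a single linear solve. It is not the MATLAB code used in the study. The names `reg` (regularization constant) and `sigma2` (kernel width) are chosen for the example and do not necessarily map one-to-one onto the C and γ of the text, and the kernel scaling exp(−d²/(2σ²)) is one common convention among several.

```python
import numpy as np

def rbf_kernel(A, B, sigma2):
    """Gaussian RBF kernel matrix; assumed form exp(-||a - b||^2 / (2 sigma2))."""
    d2 = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma2))

def lssvm_fit(X, y, reg, sigma2):
    """Solve the LS-SVM linear system for the bias b and support values alpha.

    [ 0   1^T         ] [ b     ]   [ 0 ]
    [ 1   K + I / reg ] [ alpha ] = [ y ]
    """
    n = len(y)
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = rbf_kernel(X, X, sigma2) + np.eye(n) / reg
    sol = np.linalg.solve(A, np.concatenate(([0.0], y)))
    return sol[0], sol[1:]

def lssvm_predict(X_new, X_train, b, alpha, sigma2):
    return rbf_kernel(X_new, X_train, sigma2) @ alpha + b
```

A grid search as described above would loop over candidate (`reg`, `sigma2`) pairs, refit with each left-out sample removed, and keep the pair with the lowest cross-validated RMSE.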

2.3.3. Cubist Regression Model

The Cubist model is a data mining technique and an extension of the M5 model tree developed by Quinlan [28]. Cubist is a rule-based regression where a model tree is first created and then reduced to a series of rules. These rules partition samples according to their spectra, and a unique linear model is then applied to predict the target variable. In addition, Cubist can utilise boosting (committees) and adjust its predictions using neighbours from within the training data set (neighbours). More details on Cubist and its implementation can be found in Viscarra Rossel and Webster, and Minasny and McBratney [9,29]. The committees and neighbours were determined by the lowest RMSE in the calibration set by leave-one-out cross-validation. For Cubist we used the R package ‘Cubist’ [30] of R 3.3.3 [26].

2.3.4. Extreme Learning Machine

The extreme learning machine (ELM) is a generalized single-hidden layer feedforward network (SLFN) in which the input weights and hidden layer thresholds are randomly assigned and the output layer weights are calculated directly by the least-squares method. The entire learning process is completed in one round, with no iterations required; therefore, this algorithm learns extremely fast. The simplified scheme can be found in Figure 2.

Algorithm ELM: For $N$ distinct samples $(x_i, t_i)$, where $x_i = [x_{i1}, x_{i2}, \ldots, x_{in}]^T \in \mathbb{R}^n$, the $x_i$ were soil spectra, while the $t_i$ were the observed values of either SOM or pH.

Given a hidden node number $\hat{N}$ and an activation function $g(x)$, the network output $o_i$ is defined as follows:

$$\sum_{j=1}^{\hat{N}} \beta_j\, g_j(x_i) = \sum_{j=1}^{\hat{N}} \beta_j\, g(w_j \cdot x_i + b_j) = o_i, \quad i = 1, 2, \ldots, N;\; j = 1, 2, \ldots, \hat{N} \tag{1}$$

where $w_j \in \mathbb{R}^n$ is the weight vector connecting the input nodes to the jth hidden node, $b_j \in \mathbb{R}$ is the threshold of the jth hidden node, and $\beta_j \in \mathbb{R}$ is the weight connecting the jth hidden node and the output node. To approximate the training data with zero error, the prediction result $o_i$ must be consistent with the real result $t_i$, in which case $\sum_{i=1}^{N} \lVert o_i - t_i \rVert = 0$. Under these conditions, Equation (1) can be expressed as $\sum_{j=1}^{\hat{N}} \beta_j\, g(w_j \cdot x_i + b_j) = t_i$, $i = 1, 2, \ldots, N$, which is represented by a matrix:

$$H\beta = T \tag{2}$$

where:

$$H = \begin{bmatrix} g(w_1 \cdot x_1 + b_1) & \cdots & g(w_{\hat{N}} \cdot x_1 + b_{\hat{N}}) \\ \vdots & \ddots & \vdots \\ g(w_1 \cdot x_N + b_1) & \cdots & g(w_{\hat{N}} \cdot x_N + b_{\hat{N}}) \end{bmatrix}_{N \times \hat{N}}, \quad \beta = \begin{bmatrix} \beta_1 \\ \vdots \\ \beta_{\hat{N}} \end{bmatrix}_{\hat{N} \times 1} \quad \text{and} \quad T = \begin{bmatrix} t_1 \\ \vdots \\ t_N \end{bmatrix} \tag{3}$$

When the input weights $w_j \in \mathbb{R}^n$ and biases $b_j \in \mathbb{R}$ are randomly assigned, the hidden layer output matrix $H$ can be calculated, after which the output weight $\beta$ is calculated by $\hat{\beta} = H^{\dagger} T$, where $H^{\dagger}$ is the Moore–Penrose generalized inverse of $H$.

The ELM regression ability varies significantly with the number of initial hidden neurons. To select the number of hidden neurons, we conducted an experiment by varying the number of hidden neurons from 1 to 120 in steps of 1. The optimum number of initial hidden neurons would be the number that produced the minimum prediction error sum of squares for cross-validation of the calibration set. This was done in MATLAB.
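The ELM training step of Equations (1)–(3) and the search over the number of hidden neurons can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the MATLAB code used in the study: the sigmoid activation, the uniform range of the random weights, the 5-fold cross-validation layout and all function names are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def elm_fit(X, t, n_hidden, seed=0):
    """Train an ELM: random hidden layer, least-squares output weights."""
    rng = np.random.default_rng(seed)
    W = rng.uniform(-1.0, 1.0, size=(X.shape[1], n_hidden))  # input weights w_j
    b = rng.uniform(-1.0, 1.0, size=n_hidden)                # thresholds b_j
    H = sigmoid(X @ W + b)                                   # N x n_hidden matrix H
    beta = np.linalg.pinv(H) @ t                             # beta = H^dagger T
    return W, b, beta

def elm_predict(X, W, b, beta):
    return sigmoid(X @ W + b) @ beta

def select_n_hidden(X, t, candidates, n_folds=5, seed=0):
    """Pick the hidden-node count with the lowest cross-validated PRESS."""
    folds = np.arange(len(t)) % n_folds
    press = []
    for n_hidden in candidates:
        total = 0.0
        for k in range(n_folds):
            tr, te = folds != k, folds == k
            W, b, beta = elm_fit(X[tr], t[tr], n_hidden, seed)
            total += np.sum((elm_predict(X[te], W, b, beta) - t[te]) ** 2)
        press.append(total)
    return candidates[int(np.argmin(press))], press
```

Passing `candidates=list(range(1, 121))` would reproduce the 1 to 120 search range described above. Because the hidden layer is random, results depend on `seed`, so averaging over several seeds may be worthwhile in practice.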

Figure 2. Structure of ELM (w_j, weights; b_j, biases).

2.4. Model Evaluation

We compared the measured values vs. the predicted values of the cross-validation in the calibration and validation data sets using simple linear regression. The predictability of the different calibration models was evaluated by the coefficient of determination (R²), mean error (ME), root mean square error (RMSE) and ratio of performance to inter-quartile distance (RPIQ), which were suggested for assessing vis-NIR model performance by Bellon-Maurel et al. [31]. Generally, larger values of R² and RPIQ and smaller RMSE indicate better model performance.
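The four statistics can be computed as in the following sketch. Two conventions here are assumptions, since the text does not spell them out: R² is taken as the squared Pearson correlation between measured and predicted values (consistent with a simple linear regression of one on the other), and ME is taken as the mean of predicted minus measured. RPIQ is the inter-quartile range of the measured values divided by the RMSE.

```python
import numpy as np

def evaluate(measured, predicted):
    """Return R2, ME, RMSE and RPIQ for measured vs. predicted values."""
    obs = np.asarray(measured, dtype=float)
    pred = np.asarray(predicted, dtype=float)
    resid = pred - obs                                  # assumed sign convention
    rmse = np.sqrt(np.mean(resid ** 2))
    q1, q3 = np.percentile(obs, [25, 75])               # quartiles of measured values
    return {
        "R2": np.corrcoef(obs, pred)[0, 1] ** 2,        # squared Pearson r
        "ME": resid.mean(),
        "RMSE": rmse,
        "RPIQ": (q3 - q1) / rmse,
    }
```

For example, measured values 1 to 5 with every prediction 0.5 too high give R² = 1, ME = 0.5, RMSE = 0.5 and RPIQ = 4, which shows that R² alone does not reveal a constant bias.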

3. Results

3.1. Descriptive Statistics of the Soil Properties

Table 2 shows summary statistics for SOM and pH in the total, calibration, and validation data sets. The SOM ranged from 2.44 to 60.50 g kg⁻¹ and the pH ranged from 3.92 to 8.60 for the whole data set. The coefficient of variation (CV) of SOM was high (>35%), whereas the CV for pH was moderate (around 15%) according to the Wilding classification [32]. These findings indicated that the soil properties had wide ranges and were spatially variable within the research area, which favours high spectroscopic calibration accuracy and thus good predictive performance [33]. The similarity of the summary statistics (e.g., mean, SD, and CV) of the calibration and validation sets showed that both were able to represent the entire data set.
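The CV classification used above amounts to a small calculation, sketched here. The 35% boundary for "high" is taken from the text; the 15% boundary between "low" and "moderate" is an assumption based on the commonly cited Wilding classes, and the use of the sample standard deviation (ddof = 1) is also an assumption.

```python
import numpy as np

def coefficient_of_variation(values):
    """CV in percent: sample standard deviation divided by the mean."""
    values = np.asarray(values, dtype=float)
    return 100.0 * values.std(ddof=1) / values.mean()

def wilding_class(cv_percent):
    """Variability class; the 15% lower boundary is an assumed threshold."""
    if cv_percent < 15.0:
        return "low"
    if cv_percent <= 35.0:
        return "moderate"
    return "high"
```

Applying `wilding_class(coefficient_of_variation(som))` to the SOM values (with `som` a hypothetical array name) would return the variability class of that property.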