How to calculate standard error of nonlinear regression

Here are the steps to estimate parameters α and n in Excel using nonlinear regression.

1. List the applied suction pressure as the independent variable in column A and measured soil water content (θ) as the dependent variable in column B (Figure 1).

Figure 1

Data input and initial value set for α and n in a spreadsheet.

2. Temporarily set the value of α to 0.1 and that of n to 1 in cells B19 and B20, respectively (Figure 1). It is important to choose appropriate initial values, because an obviously unreasonable initial value can lead the optimization to an unintended result; please refer to the related literature for initial value estimation (e.g., Delboy 1994). List the measured θr and θs in cells B21 and B22, respectively. The predicted θ values can then be calculated with the van Genuchten soil water retention curve model (Eq. 5) from the suction pressures and the parameter values. For example, the predicted θ in cell E2 (\( {\widehat{\theta}}_{E2} \)) is calculated by the following formula:

$$ {\widehat{\theta}}_{E2} = \$B\$21 + \left(\$B\$22 - \$B\$21\right) \times \left(1 + \left(\$B\$19 \times \$A2\right)^{\$B\$20}\right)^{-1 + 1/\$B\$20} $$

(1)

As Figure 1 shows, all the predicted θ values are 0.395 given the initial values.
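For readers who prefer to cross-check the spreadsheet in a script, here is a minimal Python sketch of the same prediction step. The suction and θ arrays below are placeholders for the data in columns A and B (the actual measurements are in Figure 1), and θr is set to an arbitrary illustrative value; θs = 0.395 is consistent with the observation that every prediction equals θs when n = 1.

```python
import numpy as np

def van_genuchten(h, alpha, n, theta_r, theta_s):
    """van Genuchten model: predicted water content for suction pressure h."""
    return theta_r + (theta_s - theta_r) * (1.0 + (alpha * h) ** n) ** (-1.0 + 1.0 / n)

# Placeholder data standing in for columns A (suction) and B (measured theta).
suction = np.array([1.0, 5.0, 10.0, 30.0, 100.0, 300.0])
theta_measured = np.array([0.39, 0.38, 0.35, 0.28, 0.20, 0.15])

# Initial trial values, as in cells B19 (alpha) and B20 (n); theta_r is illustrative.
theta_pred = van_genuchten(suction, alpha=0.1, n=1.0, theta_r=0.10, theta_s=0.395)
# With n = 1 the exponent is zero, so every prediction equals theta_s (0.395),
# matching the spreadsheet in Figure 1.
```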

3. Calculate the sum of squared residuals (SSE) in cell B23 using the Excel function SUMXMY2 by entering “=SUMXMY2(B2:B17, E2:E17)”. We obtain an SSE of 0.83005 for the given initial parameter values (Figure 1).
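Continuing the sketch above, the SSE is a one-liner; this is the NumPy counterpart of the SUMXMY2 call (with the placeholder data it will of course differ from 0.83005).

```python
# Sum of squared differences between measured (B2:B17) and predicted (E2:E17) theta,
# the NumPy counterpart of =SUMXMY2(B2:B17, E2:E17).
sse = np.sum((theta_measured - theta_pred) ** 2)
print(sse)
```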

4. The model attains the maximum likelihood when the SSE is minimized (assuming normally distributed errors), which is the principle of the least-squares fitting method. The Solver tool in Excel can be used to minimize the SSE. Solver is found under the Data menu in Excel; if it is not there, it has to be added through File > Options > Add-Ins > Solver Add-in. As Figure 2 shows, the “Set Objective” box holds the value to be optimized, which is the SSE in cell B23. Click “Min” to minimize the objective SSE by changing the values of α and n listed in the “By Changing Variable Cells” box.

Figure 2

Solver working screen in Excel 2010.

5. Solver then finds the minimum SSE (in cell B23) and the corresponding α (in cell B19) and n (in cell B20) values (Figure 3). The measured θ values agree well with the predicted θ values (Figure 4), indicating a good nonlinear curve fit.
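As a cross-check of the Solver step, a minimal sketch using scipy.optimize is shown below; it minimizes the same SSE over α and n from the same starting point, with θr and θs held at the illustrative values used above (in the spreadsheet they are the measured values in B21 and B22).

```python
from scipy.optimize import minimize

def sse_of(params, h, theta_obs, theta_r, theta_s):
    """SSE as a function of (alpha, n), mirroring cell B23."""
    alpha, n = params
    resid = theta_obs - van_genuchten(h, alpha, n, theta_r, theta_s)
    return np.sum(resid ** 2)

# Same starting point as the spreadsheet (alpha = 0.1, n = 1); Nelder-Mead is
# derivative-free, which is convenient because the model is flat at n = 1.
result = minimize(sse_of, x0=[0.1, 1.0],
                  args=(suction, theta_measured, 0.10, 0.395),
                  method="Nelder-Mead")
alpha_fit, n_fit = result.x
sse_min = result.fun
theta_pred_fit = van_genuchten(suction, alpha_fit, n_fit, 0.10, 0.395)
```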

Figure 3

A spreadsheet for estimating nonlinear regression coefficients α and n.

Figure 4

Measured and predicted soil water content versus soil suction pressure.

Stepwise application of the Monte Carlo method in estimating parameter uncertainties with 200 simulations is demonstrated below:

1. Resample θ using the Monte Carlo method in different columns. Taking cell L2 as an example, the simulated θ (θL2) is calculated by the following formula:

$$ \theta_{L2} = \$E2 + \mathrm{NORM.INV}\left(\mathrm{RAND}(),\, 0,\, \mathrm{SQRT}(SSE/df)\right) $$

(2)

where $E2 refers to the corresponding predicted θ, and SSE is the value calculated above, held in cell B23 in Figure 5. The degrees of freedom (df) equal 14 (16 data points minus 2 fitted parameters). The θ values in the other rows of column L are simulated in the same way. The simulated values of the new dependent variable θ are shown in cells L2:L17 (Figure 5). The same Monte Carlo simulation is repeated from column M to column HC, so a total of 200 sets of simulated θ are obtained (Figure 5).
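Continuing the sketch, the same resampling step can be written with NumPy's random generator; sse_min and theta_pred_fit come from the hypothetical fit above, and the noise standard deviation SQRT(SSE/df) follows Eq. (2).

```python
rng = np.random.default_rng(seed=0)    # fixed seed only for reproducibility

df = theta_measured.size - 2           # observations minus fitted parameters (14 for the paper's 16 points)
sigma = np.sqrt(sse_min / df)          # SQRT(SSE/df) in Eq. (2)

# 200 Monte Carlo replicates of theta: predicted value plus normal noise,
# the counterpart of E2 + NORM.INV(RAND(), 0, SQRT(SSE/df)) in columns L..HC.
theta_sim = theta_pred_fit + rng.normal(0.0, sigma, size=(200, theta_pred_fit.size))
```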

Figure 5

Resampling dependent variable θ using Monte Carlo method and initial values set (only the first 5 simulations are shown).

2. Use the same procedure as before to conduct the nonlinear regression for each new data set of θ. Note that the simulated θ values are recalculated whenever the sheet recalculates (RAND() is a volatile function), so the data would change during optimization and corrupt the fit. Therefore, we copy the simulated θ data to a new sheet by right-clicking, choosing “Paste Special”, and selecting “Values” in the “Paste Special” dialogue. For a better display, the predicted θ array, all the parameter values, and the corresponding SSE value are placed in the same column for each simulation (Figure 5). The parameters α and n are optimized independently for each simulation using the Solver tool. To reduce the optimization time, the initial values for all simulations are set to the fitted values obtained before, i.e., 0.07988 for α and 2.09920 for n (Figure 5). Because the maximum number of variables Solver can handle is 200, we can optimize 100 simulations at a time (100 simulations × 2 parameters = 200 variables) by minimizing the sum of their SSE values. For example, the parameters α and n for the first 100 simulations can be optimized by minimizing cell E19, which contains “=SUM(L18:DG18)” (Figure 5). Similarly, the parameters for the second 100 simulations can be optimized by minimizing cell E20, which contains “=SUM(DH18:HC18)”. We thereby obtain 200 values of each parameter (α and n), shown in cells L19 to HC20 (Figure 6). The frequency distributions of α and n are shown in Figure 7. Visually, both follow a normal distribution. However, the Shapiro-Wilk test shows that α does not conform to a normal distribution, whereas n does. This indicates that the fitted parameters are not necessarily normally distributed even if the dependent variable is, because of the nonlinear relations between them.
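A sketch of the same fitting loop in Python, reusing the hypothetical names above: each simulated data set is refitted independently (equivalent to minimizing the summed SSE, since each SSE depends only on its own parameters), and the Shapiro-Wilk test from scipy.stats checks normality of the resulting parameter samples.

```python
from scipy.stats import shapiro

alpha_mc, n_mc = [], []
for theta_i in theta_sim:
    res = minimize(sse_of, x0=[alpha_fit, n_fit],      # start from the original fit
                   args=(suction, theta_i, 0.10, 0.395),
                   method="Nelder-Mead")
    alpha_mc.append(res.x[0])
    n_mc.append(res.x[1])

alpha_mc = np.array(alpha_mc)
n_mc = np.array(n_mc)

# Shapiro-Wilk normality test for each fitted parameter (statistic, p-value).
print(shapiro(alpha_mc), shapiro(n_mc))
```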

Figure 6

Nonlinear regression fitting for resampled θ using Monte Carlo method (only the first 5 simulations are shown).

Figure 7

Frequency distribution of (a) α and (b) n obtained using Monte Carlo method with 200 simulations. The heights of bars indicate the number of parameter values in the equally spaced bins. The curve is the theoretical normal distribution.

3. Calculate the 95% confidence intervals of α and n from the 200 simulations. Copy all the fitted α or n values and paste them into a new sheet by right-clicking, choosing “Paste Special”, and selecting “Transpose”, so that the fitted values are listed in one column. Select the transposed data and rank them in ascending order using the Sort tool in the Data tab. Find the α and n values corresponding to the 2.5th and 97.5th percentiles, which are the lower and upper limits of the 95% confidence interval, respectively. The 95% confidence intervals are (0.0680, 0.0939) and (1.9185, 2.3689) for α and n, respectively (Table 1). The differences between the upper and lower limits are 0.0259 and 0.4504 for α and n, respectively.
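The percentile step translates directly in the sketch; note that np.percentile interpolates between ranked values, a minor variant of picking the ranked entries from the sorted column.

```python
# 2.5th and 97.5th percentiles of the 200 fitted values give the 95% confidence limits.
alpha_lo, alpha_hi = np.percentile(alpha_mc, [2.5, 97.5])
n_lo, n_hi = np.percentile(n_mc, [2.5, 97.5])
print((alpha_lo, alpha_hi), alpha_hi - alpha_lo)   # interval and its width for alpha
print((n_lo, n_hi), n_hi - n_lo)                   # interval and its width for n
```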

Table 1 Comparison of parameter uncertainties calculated by different methods

Stepwise application of the bootstrap method in estimating parameter uncertainties with 200 simulations is demonstrated as follows:

1. Resample θ using the bootstrap method in different columns (Figure 8). Taking cell L2 as an example, the simulated θ (θL2) can be calculated by the following formula:

Figure 8

Resampling dependent variable θ using bootstrap method and initial values set (only the first 5 simulations are shown).

$$ \theta_{L2} = \$E2 + \mathrm{INDEX}\left(\$C{:}\$C,\, \mathrm{INT}\left(\mathrm{RAND}() \times 16 + 2\right)\right) $$

(3)

where the function INDEX is used to randomly select a residual value from rows 2 to 17 of column C (the residuals are calculated by subtracting the predicted θ from the original θ), and INT(RAND()*16 + 2) returns a random row number between 2 and 17. The 2 on the right-hand side of Eq. (3) reflects that the data start in the second row. The θ values in the other rows of column L and in the other columns (M to HC) are simulated in the same way.
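Continuing the sketch, the bootstrap replicates can be generated by resampling the fitted residuals with replacement; this mirrors the INDEX/RAND construction in Eq. (3) and reuses the hypothetical names from the earlier blocks.

```python
# Residuals of the original fit (measured minus predicted), i.e. column C.
residuals = theta_measured - theta_pred_fit

# 200 bootstrap replicates: predicted theta plus residuals drawn with replacement,
# the counterpart of E2 + INDEX($C:$C, INT(RAND()*16 + 2)).
theta_boot = theta_pred_fit + rng.choice(residuals,
                                         size=(200, residuals.size),
                                         replace=True)
# The fitting loop and the percentile step are then identical to the Monte Carlo case.
```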

2. Similar to the Monte Carlo method, the parameters α and n for all simulations are fitted by minimizing the sum of every 100 SSE values using the Solver tool (Figures 8 and 9). We again obtain 200 values of each parameter (α and n), shown in cells L19 to HC20 (Figure 9). The frequency distributions of α and n are shown in Figure 10. They also visually follow a normal distribution. However, the Shapiro-Wilk test shows that α does not conform to a normal distribution, whereas n does.

Figure 9

Nonlinear regression fitting for resampled θ using bootstrap method (only the first 5 simulations are shown).

Figure 10

Frequency distribution of (a) α and (b) n obtained using bootstrap method with 200 simulations. The heights of bars indicate the number of parameter values in the equally spaced bins. The curve is the theoretical normal distribution.

3. Similar to the Monte Carlo method, the 95% confidence intervals of the two parameters are calculated. They are (0.0680, 0.0925) and (1.9172, 2.3356) for α and n, respectively (Table 1). The corresponding differences between the upper and lower limits are 0.0245 and 0.4183 for α and n, respectively.

Influence of the number of simulations on parameter uncertainty analysis

Data sets of fitted values for different numbers of simulations are obtained using the same method as demonstrated before (data not shown). According to the Shapiro-Wilk test, both parameters fail to follow a normal distribution across the different numbers of simulations, except in a few cases where the number of simulations is ≤ 400. This may indicate that the assumption of normality in the linearization method does not hold.

The lower limit, the upper limit, and their difference change slightly with the number of simulations, but they are almost constant beyond a certain number of simulations (Figure 11). Here, we determine the number of simulations required for each method from the change in the difference between the upper and lower limits. If the relative difference (RD%) between this difference at a given number of simulations and that at 2000 simulations is less than 5%, the tested number of simulations is taken as the required number. The RD% is calculated as

$$ \mathrm{RD}\% = \left|\frac{V_m - V_{2000}}{V_{2000}}\right| \times 100\% $$

(4)

where \( V_m \) and \( V_{2000} \) are the differences between the upper and lower limits at \( m \) and 2000 simulations, respectively.
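Eq. (4) is a one-liner in the sketch; the helper name below is chosen here only for illustration.

```python
def rd_percent(v_m, v_2000):
    """Relative difference (Eq. 4) between the interval width at m simulations
    and the width at 2000 simulations, in percent."""
    return abs((v_m - v_2000) / v_2000) * 100.0
```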

Figure 11

Influence of the number of simulations on the lower limit, the upper limit, and the difference between the upper and lower limits.

For the Monte Carlo method, the RD% is less than 5% for α and n when the number of simulations is ≥ 100 and ≥ 200, respectively. For the bootstrap method, the RD% is less than 5% for α and n when the number of simulations is ≥ 500 and ≥ 400, respectively. Therefore, 200 and 500 simulations are needed to produce reliable 95% confidence intervals of the parameters for the Monte Carlo and bootstrap methods, respectively. In this sense, the Monte Carlo method may be better than the bootstrap method. However, the optimal number of simulations may also differ with the specific situation; for example, Efron and Tibshirani (1993) stated that a minimum of approximately 1000 bootstrap re-samples is sufficient to obtain accurate confidence interval estimates. To obtain reliable confidence interval estimates, we suggest increasing the number of simulations by 100 at each step and accepting the results once the values stabilize over consecutive steps, as sketched below.
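A rough sketch of that stopping rule, reusing the hypothetical helpers from the earlier blocks; here the width of the 95% interval of α is compared between consecutive steps of 100 simulations, and the 5% threshold is borrowed from the criterion above purely for illustration.

```python
def mc_interval_width(n_sims):
    """Run n_sims Monte Carlo fits and return the 95%-interval width of alpha."""
    sims = theta_pred_fit + rng.normal(0.0, sigma, size=(n_sims, theta_pred_fit.size))
    fitted = [minimize(sse_of, x0=[alpha_fit, n_fit],
                       args=(suction, s, 0.10, 0.395),
                       method="Nelder-Mead").x[0]
              for s in sims]
    lo, hi = np.percentile(fitted, [2.5, 97.5])
    return hi - lo

prev_width, n_sims = mc_interval_width(100), 200
while True:
    width = mc_interval_width(n_sims)
    if abs((width - prev_width) / prev_width) * 100.0 < 5.0:
        break                                  # stabilized within 5% over consecutive steps
    prev_width, n_sims = width, n_sims + 100
print(n_sims)
```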

Comparison with parameter uncertainty approximated by the linear model

The values of α and n are estimated to be 0.0799 and 2.0988, respectively, by SigmaPlot 10.0 (Figure 12), exactly the same as the estimates obtained by nonlinear regression in Excel. Based on the linear model, the associated standard errors are estimated to be 0.0066 and 0.1138, respectively (Figure 12), so the 95% confidence intervals of α and n are (0.0670, 0.0928) and (1.8758, 2.3218), respectively. As Table 1 shows, the differences between the upper and lower limits given by the Monte Carlo and bootstrap methods are comparable, although the Monte Carlo method produces slightly greater uncertainty than the bootstrap method. This slight difference arises from the way the residuals are resampled: the Monte Carlo simulation generates residuals from a theoretical normal distribution, whereas the bootstrap method draws the residuals randomly with replacement and makes no assumption about the underlying distribution. Both are also comparable to the intervals approximated by the linear model in SigmaPlot. However, comparing the three methods, the lower and upper limits of α and n obtained by the Monte Carlo and bootstrap methods are slightly greater than those obtained under the linear assumption, except for the upper limit of α from the bootstrap method. Because the linearization method rests on the assumptions of normally distributed parameters and linearity in the vicinity of the estimated parameter values, and is more complicated to calculate, the Monte Carlo and bootstrap methods may be preferred for computing parameter uncertainties in spreadsheets. Furthermore, the Monte Carlo method may be preferred to the bootstrap method because it requires fewer simulations. However, if the number of measurements is too small to determine the probability distribution required by the Monte Carlo method, the bootstrap method may be superior.
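For comparison, linearization-based standard errors of the kind reported by SigmaPlot can be approximated in the same sketch with scipy's curve_fit, which returns the parameter covariance matrix; θr and θs are again held at the illustrative values, and the t-based interval uses df = observations minus 2.

```python
from scipy.optimize import curve_fit
from scipy.stats import t

def vg_alpha_n(h, alpha, n):
    """van Genuchten model with theta_r and theta_s held fixed (illustrative values)."""
    return van_genuchten(h, alpha, n, 0.10, 0.395)

# Linearization (asymptotic) standard errors from the covariance matrix.
popt, pcov = curve_fit(vg_alpha_n, suction, theta_measured, p0=[alpha_fit, n_fit])
se = np.sqrt(np.diag(pcov))

# 95% confidence intervals: estimate +/- t(0.975, df) * SE.
t_crit = t.ppf(0.975, df=theta_measured.size - 2)
ci_lower = popt - t_crit * se
ci_upper = popt + t_crit * se
print(list(zip(ci_lower, ci_upper)))
```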

Figure 12

Estimates of parameters (α and n) and associated standard errors using SigmaPlot 10.0.