The large capacity of modern computers has enabled the use of complex models to simulate and describe a growing number of natural and artificial processes. Accordingly, numerical models of environmental, hydrological and agrometeorological systems have increased in both number and complexity (Willmott et al. 2012). Evaluating model performance, i.e., comparing the estimates produced by a model with observed or otherwise reliable values, is therefore a fundamental step in the development and use of any model (Willmott et al. 1985; Willmott et al. 2012).
This validation process usually involves defining criteria based on mathematical measures of how well the modelled estimates reproduce the observed values (Willmott et al. 1985; Krause et al. 2005; Willmott et al. 2012). First, let us generalize the problem by saying that we are looking for an index that quantifies the agreement between two records, X and Y. Both data sets are measured in the same units and are meant to represent the same quantity; for most geospatial raster datasets, this means that both have the same spatial and temporal resolutions. An optimal agreement index should satisfy a number of desirable properties.

Given the magnitude of the errors used in the Monte Carlo simulations, as well as in the case study, our results suggest that the original version of the Willmott index may cause the user to mistakenly select a predictive model that generates wrong estimates. This finding is consistent with previous studies. Our results also suggest that the two newer versions of this index (modified and refined) overcome this problem, resulting in stricter evaluations of predictive models. They should therefore be preferred to the original version.
The concept of proximity (or “agreement”) is one that many mathematical formulations attempt to capture. Examples include commonly used indicators such as the Pearson product-moment correlation coefficient (r), the coefficient of determination (R2), and the root mean square error (RMSE). Intuitively, any deviation from equality can be regarded as disagreement between two data sets. Graphically, disagreement appears as the scatter of the data points around the 1:1 line. A set of indicators may therefore be needed to express the “distance” of the available data points from the 1:1 line. For example, the slope and intercept of a linear model fitted to the data, together with a measure of the dispersion around that line, can accurately represent the correspondence between two data sets. The development of numerical models has further stimulated the need for measures that could be used to calibrate them [1] or to evaluate their performance [2]. One disadvantage is that most of these indicators have to be considered simultaneously, making it difficult to compare the agreement across several data sets. An elegant graphical solution was proposed by Taylor [3] to visualize several complementary metrics at once, but it still does not quantify the agreement in a single index. Some authors [4] have suggested a fuzzy-logic framework for combining different measures into a single indicator meant to reflect how an expert perceives the quality of a model's performance, but this introduces some subjectivity in the definition of the membership functions. A single metric that behaves like r but also carries information on the magnitude of the deviations is of obvious interest to many users, as evidenced by the various agreement indices proposed in fields such as hydrology, climatology and remote sensing [5,6,7,8,9].
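As a concrete illustration of why several numbers must be read together, the classical indicators above can be computed side by side. A minimal Python sketch (the function name and the returned dictionary are our own, not from the original text):

```python
import numpy as np

def agreement_stats(x, y):
    """Classical agreement indicators: each captures one aspect of how
    the points (x_i, y_i) scatter around the 1:1 line, so several must
    be considered together to judge the match between two datasets."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    r = np.corrcoef(x, y)[0, 1]              # Pearson correlation
    slope, intercept = np.polyfit(x, y, 1)   # least-squares line fitted to the data
    rmse = np.sqrt(np.mean((x - y) ** 2))    # deviation from the 1:1 line
    return {"r": r, "R2": r ** 2, "RMSE": rmse,
            "slope": slope, "intercept": intercept}
```

A perfect match gives r = 1, slope = 1, intercept = 0 and RMSE = 0; note that a dataset can score r = 1 while lying far from the 1:1 line, which is why no single one of these indicators suffices on its own.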
the last term being proportional to the covariance between X and Y. One way to ensure this is to construct an index that explicitly contains this covariance term in the denominator, bounded so that it is always positive; note the similarity with equation (7) and with the simplified expression of the Mielke permutation index. As shown in Fig. 3, the resulting index is identical to that obtained for positively correlated vectors, while remaining at 0 if X and Y are negatively correlated. Willmott [6] proposed another index to assess model performance against measured observations, which can be generalized as follows:

In this study, we review various measures proposed in the literature that can be used to evaluate the agreement between data sets. We then test and compare their performance on synthetic datasets, highlighting some of their shortcomings. We explain why a permutation index originally proposed by Mielke [7] can, after a small modification, be considered the most appropriate, as it meets all the desired properties for such an index, including the fact that it can be interpreted in terms of the correlation coefficient r. We also propose a refined approach to examine separately the non-systematic and systematic contributions to the disagreement between datasets. Finally, we apply the available measures and the proposed index to two real-world comparative case studies: one concerning normalized difference vegetation index (NDVI) time series acquired by two different satellite missions over the same period, and the other concerning two gross primary production (GPP) time series estimated by different modelling approaches.

If we accept the use of an index based on the MSE rather than on the MAE, we argue that the correct metric should be a slightly modified version of the Mielke index. This reasoning stems from the idea that, for an index built on the structure of equation (3), the objective should be to define the denominator μ as the maximum value that the numerator δ can assume.
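The modified Mielke index that this argument leads to (named λ further below) can be sketched numerically. The exact form used here — λ = 1 − MSE/(σx² + σy² + (x̄ − ȳ)² + κ), with κ = 2|σxy| when the correlation is negative and κ = 0 otherwise — is our reconstruction from the surrounding discussion, and the function name is our own:

```python
import numpy as np

def lambda_index(x, y):
    """Modified Mielke agreement index: 1 minus the MSE over its supremum.

    kappa adds twice the absolute covariance to the denominator only when
    the correlation is negative, so that the index equals 0 for negatively
    correlated data while the denominator is not inflated otherwise.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    mse = np.mean((x - y) ** 2)
    cov = np.mean((x - x.mean()) * (y - y.mean()))  # population covariance
    kappa = 2.0 * abs(cov) if cov < 0 else 0.0
    denom = x.var() + y.var() + (x.mean() - y.mean()) ** 2 + kappa
    return 1.0 - mse / denom
```

Under this form, λ equals 1 for identical datasets, reduces to r when there is no additive or multiplicative bias, drops below r when bias is present, and is exactly 0 for negatively correlated data, consistent with the properties discussed in the text.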
It is important to find the smallest value that the denominator can assume (i.e., its supremum) to ensure an index with the maximum possible sensitivity. For indices based on the MSE, it can be demonstrated (see Additional Information) that the numerator can be rewritten accordingly. The resulting index has the additional desirable property that, when there is no additive or multiplicative bias, it assumes the value of the correlation coefficient r; if there is a bias, the index assumes a value lower than r by a multiplicative coefficient α, which can only take values between 0 and 1. Using equation (10), this can be demonstrated effectively (see Additional Information).

[Figure: Illustration of how the λ indices discussed in this article behave for datasets with different correlations and with systematic additive or multiplicative biases.]

The four terms of the denominator can be represented geometrically, as shown in the Additional Information. Following the explicit addition of the covariance term, the index is guaranteed to be zero if X and Y are negatively correlated, as shown in Fig. 3. However, doing so unnecessarily inflates the denominator when the correlation is positive, since the numerator is then always smaller owing to the negative sign preceding the covariance term in equation (10). To solve this problem, we propose to define the index, which we simply call λ, as follows:

All Monte Carlo simulations were performed with errors representing at least 30% of the mean of the process. This threshold was set on the assumption that models producing errors at or above it would be rejected without any need for an agreement measure: simple visual inspection of the estimates would already show that the predictive model performs poorly. The analysis of the Monte Carlo simulations must first take into account an important difference between dorig and the other two versions of the Willmott index. As can be seen in equations 1, 2 and 3, only the original index (equation 1) squares the errors (oi − pi). When the quadratic function is applied to large errors, it amplifies their influence on the sum of squared errors (Willmott 1982).
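This difference between the three Willmott indices can be made concrete in code. The sketch below assumes the standard formulations: the original d (Willmott 1982) with squared errors, the modified d1 (Willmott et al. 1985) with absolute errors, and the refined dr (Willmott et al. 2012); the function and variable names are our own:

```python
import numpy as np

def willmott_d(o, p):
    """Original index: squaring inflates the influence of large errors."""
    o, p = np.asarray(o, float), np.asarray(p, float)
    om = o.mean()
    return 1 - np.sum((p - o) ** 2) / np.sum((np.abs(p - om) + np.abs(o - om)) ** 2)

def willmott_d1(o, p):
    """Modified index: absolute errors instead of squared errors."""
    o, p = np.asarray(o, float), np.asarray(p, float)
    om = o.mean()
    return 1 - np.sum(np.abs(p - o)) / np.sum(np.abs(p - om) + np.abs(o - om))

def willmott_dr(o, p):
    """Refined index: bounded in [-1, 1] and stricter for poor models."""
    o, p = np.asarray(o, float), np.asarray(p, float)
    a = np.sum(np.abs(p - o))
    b = 2 * np.sum(np.abs(o - o.mean()))
    return 1 - a / b if a <= b else b / a - 1

# A predictor whose point-wise errors are roughly 30% of the observed mean
# still scores deceptively high on the original index.
obs = [1, 2, 3, 4, 5]
pred = [2, 1, 4, 3, 6]
print(willmott_d(obs, pred))   # ~0.889: looks like a good model
print(willmott_dr(obs, pred))  # ~0.583: the refined index is stricter
```

The example illustrates the point made above: with errors of about one third of the mean, the original index remains close to 0.9, while the modified and refined versions penalize the same model far more severely.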