Abstract
Supervised classification algorithms are frequently applied to identify landslide-prone areas (i.e. to produce landslide susceptibility maps) or to gain insight into the static causes of slope instability. However, the landslide inventory data used to train the underlying models is often affected by systematic spatial incompleteness (e.g. underrepresentation of movements in woodlands or in remote areas). Thus, the quantity to be modelled (i.e. the response variable) may not perfectly represent the spatial distribution of the phenomena of interest. The literature reveals that the effects of such landslide data biases are often ignored when interpreting data-driven landslide susceptibility models.
This research is based on landslide data from the province of South Tyrol (7400 km²) that systematically represents damage-causing events and ignores landslides far from infrastructure. The three models created (M1, M2, M3) represent different strategies for handling spatially biased landslide data. The goals were to show why geomorphic cause-effect relationships cannot always be deduced from models that exhibit apparently high predictive performance (M1), to evaluate the usefulness of a bias-correction approach under severe data bias conditions (M2) and to exploit the underlying data bias to map areas affected by damage-causing landslides (M3). The models were critically evaluated in terms of statistical associations, variable importance rankings, predictive performance and plausibility.
The presented research may offer an alternative perspective on how flaws in available landslide information can be considered in data-driven landslide modelling. It is demonstrated that under common landslide data bias conditions, the focus should lie not only on the actual geomorphic process (landslide susceptibility effects), but also on the respective landslide data context (landslide data collection effects). The findings showed that none of the three models was able to create a useful representation of landslide susceptibility, despite high calculated predictive performance. In most cases, geomorphic causation could not be deduced by interpreting the modelled relationships between landslide inventory data and the environmental factors. The final impact-oriented model (M3) enabled us to identify (temporally independent) damaging landslides with high accuracy. We conclude that despite the availability of increasingly flexible models and automated variable selection techniques, a thorough qualitative investigation of landslide data limitations will remain essential for meaningful spatial landslide models. An inference of geomorphic causation may be challenging under landslide data bias conditions, even though model performance indicators might suggest a high model quality.