Figure 4. An example of an identified active rock glacier (ID: wkl037). (a) shows the contrasting wrapped phases between the landform and the surrounding background. The ALOS-1 PALSAR image pair generating the interferogram was acquired on 14/11/2008 and 30/12/2008. (b) is the corresponding Google Earth image presenting the geomorphic characteristics of the mapped active rock glacier. The white arrow indicates the direction of movement, and the red dot marks the location of the reference point used for phase correction. This rock glacier is debris-mantled slope-connected.

3.2 Automated mapping of rock glaciers using deep learning

Deep learning refers to a family of machine-learning algorithms based on neural networks that are capable of learning functions mapping inputs to outputs (LeCun et al. 2015). It has proved powerful in semantic segmentation, where a convolutional neural network progressively extracts visual features at different levels from input images (Mottaghi et al. 2014), making it suitable for difficult mapping tasks such as delineating rock glaciers. Marcer (2020) first proposed a convolutional neural network to detect rock glaciers from orthoimages and suggested further development of this methodology. Robson et al. (2020) validated a semi-automatic method for detecting rock glaciers that combines advanced image processing techniques, including deep learning and object-based image analysis, yet their method has not been used to compile new inventories. Erharter et al. (2022) developed a framework based on the U-Net architecture to support the refinement of existing rock glacier inventories. Among the open-source deep learning architectures designed for semantic segmentation, we adopted DeepLabv3+ with an Xception71 backbone (termed DeepLabv3+Xception71 hereafter) as the framework for our automatic mapping method (Chen et al. 2018), because of its outstanding performance in past PASCAL VOC tests (the benchmark dataset for assessing semantic segmentation models, as detailed in Everingham et al. 2015) and in recent cryospheric remote sensing applications (Huang et al. 2020; Huang et al. 2021; Zhang et al. 2021a).
Development of the deep learning model for delineating rock glaciers can be divided into three major steps: (1) preparing input data, (2) training and validating the deep learning network, and (3) inferring and post-processing results. Figure 5 illustrates the workflow; each step is detailed below.

3.2.1 Preparing input data

The data preparation step aimed to produce a dataset of optical images and corresponding rock glacier label images to feed into the convolutional neural network. The input optical images were cloud-free (cloud cover < 5%) Sentinel-2 Level-2A products (spatial resolution ~10 m) covering the West Kunlun region, acquired during July and August of 2018. We pre-processed the images by extracting the visible red, green, and blue bands and converting them to 8-bit, so that the satellite images were in the same format as the training datasets used for pre-training the DeepLabv3+ network we adopted (Chen et al. 2018). To generate the label images, i.e., binary rasters in which pixel value 1 indicates rock glaciers and 0 indicates the background, we used the ESRI Shapefiles of the manually identified rock glaciers created in the InSAR-based mapping process to label the Sentinel-2 images. We removed 118 rock glacier samples from the training dataset because they are unrecognizable due to cloud cover or the relatively low resolution (10 m) of the Sentinel-2 images. In addition, we delineated 145 negative polygons, covering similar-looking landforms, such as debris-covered glaciers identified by GLIMS and solifluction slopes identified through our image interpretation, as well as environments where no rock glaciers occur, e.g., water bodies and villages. These negative polygons were used to produce negative label images, which constitute the input dataset along with the positive ones. More negative samples were included during the iterative training and validation process by adding incorrectly inferred examples to the negative training dataset for the next experiment. We extracted the positive polygons together with their surrounding background (a buffer of 1,500 m) from the optical images to provide environmental information and cropped these sub-images into image patches no larger than 480 × 480 pixels.
Finally, we split the whole dataset of input image patches by randomly selecting 90% of the data as the training set (2,007 image patches) and the remaining 10% as the validation set (223 image patches).
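The random 90/10 split described above can be sketched as follows. This is an illustrative reconstruction, not the authors' actual code; the function name and the use of patch indices are assumptions, though the counts match the reported dataset sizes (2,230 patches in total).

```python
# Minimal sketch of the 90/10 random train/validation split.
# A fixed seed is used here only to make the example reproducible.
import random

def split_dataset(patch_ids, train_fraction=0.9, seed=42):
    """Randomly split patch identifiers into training and validation sets."""
    ids = list(patch_ids)
    rng = random.Random(seed)
    rng.shuffle(ids)
    n_train = int(len(ids) * train_fraction)
    return ids[:n_train], ids[n_train:]

# 2,230 patches in total -> 2,007 training and 223 validation patches
train_set, val_set = split_dataset(range(2230))
print(len(train_set), len(val_set))  # 2007 223
```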

3.2.2 Training and validating the deep learning network

We then trained the DeepLabv3+Xception71 network with the initial hyper-parameters (e.g., learning rate, learning rate decay, batch size, number of iterations) suggested by Chen et al. (2018) and evaluated the model performance on the training and validation datasets. The evaluation was conducted throughout the training process by monitoring the Intersection over Union (IoU) value, defined as:
IoU = TP / (TP + FP + FN)
where TP (true positive), FP (false positive), and FN (false negative) are counted pixel-wise. The mean IoU, calculated by averaging the IoU of each class, is commonly used to indicate the accuracy of semantic segmentation models. Our network classified each pixel of the optical images into two classes: rock glacier and background. As the numbers of pixels in the two classes are imbalanced (the rock glacier class occupies only a small portion (~10%) of the image patches), we used only the IoU value of the rock glacier class to represent model performance. We set 0.80 as the threshold: whenever the IoU value of a trained model fell below it, we increased the size and diversity of the training dataset by performing image augmentation (e.g., blurring, rotation, flipping) on the positive samples and adding incorrectly inferred examples to the negative samples, then conducted a new experiment. We repeated this process until a model reached the target IoU value on the validation dataset, at which point we regarded the deep learning network as well trained. The IoU threshold of 0.80 was selected considering the validation mIoU (79.55%) of DeepLabv3+Xception71 on the Cityscapes validation dataset, as detailed in Chen et al. (2018).
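The pixel-wise IoU of the rock glacier class can be computed as sketched below. This is a generic illustration of the formula above, not the authors' evaluation code; the array names and the tiny example masks are hypothetical.

```python
# Per-class, pixel-wise IoU for a binary segmentation, following
# IoU = TP / (TP + FP + FN). Pixel value 1 marks rock glaciers,
# 0 marks the background.
import numpy as np

def class_iou(pred, truth, cls=1):
    """IoU of one class from predicted and reference label rasters."""
    tp = np.sum((pred == cls) & (truth == cls))  # true positives
    fp = np.sum((pred == cls) & (truth != cls))  # false positives
    fn = np.sum((pred != cls) & (truth == cls))  # false negatives
    denom = tp + fp + fn
    return tp / denom if denom else float("nan")

# Toy 2x3 label rasters (hypothetical): TP = 2, FP = 1, FN = 1
truth = np.array([[1, 1, 0],
                  [0, 1, 0]])
pred  = np.array([[1, 0, 0],
                  [0, 1, 1]])
print(class_iou(pred, truth))  # 2 / (2 + 1 + 1) = 0.5
```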

3.2.3 Inferring and post-processing results

We applied the trained model to map rock glaciers from the Sentinel-2 images covering the West Kunlun. The input data occupied ~0.6% of the total mapping area. To refine the inference results, we excluded predicted polygons smaller than 0.03 km2, given the limited spatial resolution of the Sentinel-2 images and the typical areal extent of rock glaciers. We then inspected each automatically delineated landform and modified the boundaries where necessary. Finally, we determined the same set of landform attributes as in the InSAR-based sub-dataset (Sect. 3.1) and compiled the outputs produced by the two methods into one inventory.
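The area-based filtering step can be sketched as follows, assuming each predicted polygon already carries a computed area attribute in km2. The record structure and function name are hypothetical; in practice the areas would come from the vectorized model predictions in a GIS environment.

```python
# Minimal sketch of excluding predicted polygons below the 0.03 km2
# threshold, which are unresolvable at the 10 m Sentinel-2 resolution.
MIN_AREA_KM2 = 0.03

def filter_predictions(polygons):
    """Keep only predicted polygons of at least MIN_AREA_KM2 in extent."""
    return [p for p in polygons if p["area_km2"] >= MIN_AREA_KM2]

# Hypothetical predictions: only the first exceeds the threshold
preds = [{"id": 1, "area_km2": 0.12},
         {"id": 2, "area_km2": 0.01}]
print([p["id"] for p in filter_predictions(preds)])  # [1]
```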