A simple CNN is better than a sophisticated segmentation model (U-Net) for human settlement mapping

Updated: May 20, 2020

It was a bright day when Chunping and I went to a shawarma shop for lunch. Chunping is a Ph.D. candidate at the Technical University of Munich, and during my visit she was often the first into the office and the last out. 'I am working on a deep learning method to map urban/non-urban extent,' she told me while stirring her chicken box. 'Oh...?' I was very surprised, as she had already worked a lot on Local Climate Zones, a detailed framework for urban mapping that provides functional profiles in two morphological dimensions: how much cities go up and how much they sprawl out. Why would she still work on a simple, binary urban map, which is an old story with many global products already at hand?

After I looked into this work and got involved in it, I found that it could be very useful for two groups of readers: (1) users of urban maps for sustainability research (e.g., biodiversity), and (2) developers of machine learning methods in the field of remote sensing. Because so many scientists use global products such as the Global Urban Footprint (GUF) and the Global Human Settlement Layer (GHSL) to investigate environmental change and its drivers and effects, it is very interesting to see a thorough evaluation of the base maps we rely upon. In general, the maps produced by non-deep-learning models are noisier and tend to omit roads and villages in the countryside. Such omissions mean that new development on previously undeveloped land goes unnoticed, so environmental degradation at its early stage might be neglected.

So, if deep learning approaches are really helpful, where should we start? Deep learning is developing rapidly, and ever more sophisticated methods keep arriving. One approach is semantic segmentation, which we are also using in another study on urban densification. Unlike a patch-based convolutional neural network (CNN), which makes a single prediction for the whole patch, a semantic segmentation model takes a patch as input and gives a pixel-wise output. Semantic segmentation models such as U-Net typically have an encoder-decoder architecture, which first downsamples the patch (as a traditional CNN does) and then upsamples it back to the original resolution. The architecture allows learning not only the intra-patch information but also the inter-patch features.
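To make the difference concrete, here is a rough sketch in plain Python that only counts outputs and traces spatial sizes. It is not the model from the paper: the function names, the 256-pixel image, the 32-pixel patches, and the depth of 3 are all assumptions chosen for illustration, and the down/up steps are assumed to halve and double the size.

```python
def patch_based_outputs(image_size, patch_size):
    """Predictions for a square image when each non-overlapping patch gets ONE label."""
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side


def segmentation_outputs(image_size, patch_size):
    """Predictions when each patch yields one label PER PIXEL."""
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side * patch_size * patch_size


def encoder_decoder_sizes(patch_size, depth):
    """Spatial size after each stage of a U-Net-like network.

    Assumes every encoder stage halves the size and every decoder
    stage doubles it, so the output matches the input resolution.
    """
    sizes = [patch_size]
    for _ in range(depth):      # encoder: downsample
        sizes.append(sizes[-1] // 2)
    for _ in range(depth):      # decoder: upsample
        sizes.append(sizes[-1] * 2)
    return sizes


if __name__ == "__main__":
    print(patch_based_outputs(256, 32))    # 64
    print(segmentation_outputs(256, 32))   # 65536
    print(encoder_decoder_sizes(32, 3))    # [32, 16, 8, 4, 8, 16, 32]
```

With these toy numbers, the patch-based approach gives 64 labels for the whole image, while the segmentation approach gives one label for each of the 65,536 pixels.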

Should we use a fancy model then? In this study, we tested whether a segmentation model is beneficial for mapping human settlement extent. The answer turned out to be negative. A traditional, shallow CNN with only one pooling layer reaches the goal: achieving high classification accuracy without lowering the resolution much. This finding reminds me that no single method is robust for every classification task. How complex is the task? What are the spatial, temporal, or even radiometric resolutions of the data used? Our choice of method stems from these two fundamental characteristics. Given that the classification is simple (i.e., built-up vs. non-built-up) and the Sentinel-2 imagery is at medium resolution (i.e., 20 m), it is not surprising that a simple CNN worked well.
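Why does the number of pooling layers matter for resolution? A back-of-the-envelope sketch, under assumptions that are mine and not stated above: each pooling layer has stride 2, there is no decoder to upsample afterwards, and the input grid is 10 m (the finest Sentinel-2 band resolution). The function names are hypothetical.

```python
def output_pixel_size(input_pixel_size_m, num_pooling_layers, stride=2):
    """Ground size (in metres) of one output pixel after repeated pooling."""
    return input_pixel_size_m * stride ** num_pooling_layers


def output_grid_width(input_pixels, num_pooling_layers, stride=2):
    """Width (in pixels) of the output map after repeated pooling."""
    return input_pixels // stride ** num_pooling_layers


if __name__ == "__main__":
    # One pooling layer on an assumed 10 m input grid
    print(output_pixel_size(10, 1))    # 20
    print(output_grid_width(128, 1))   # 64
    # Four pooling layers (a hypothetical deeper network without upsampling)
    print(output_pixel_size(10, 4))    # 160
    print(output_grid_width(128, 4))   # 8
```

Under these assumptions, a single pooling layer costs only a factor of two in resolution, whereas a deeper stack of pooling layers without a decoder would blur small villages and roads into much coarser cells.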

Another interesting finding is that the model transfers to subtropical cities even though it was trained on data from temperate regions. Attached is a case from Kaohsiung, Taiwan, which was left out of the final version. It is a bit creepy that the military airport alongside the Gaoping River is so well depicted.

The article is open access. :)

Qiu, C., Schmitt, M., Geiß, C., Chen, T. H. K., & Zhu, X. X. (2020). A framework for large-scale mapping of human settlement extent from Sentinel-2 images via fully convolutional neural networks. ISPRS Journal of Photogrammetry and Remote Sensing, 163, 152-170.

And the code for the proposed CNN model can be found here:
