EPISODE · Oct 3, 2025 · 5 MIN

Behaviour-Based Quality Assessment of OpenStreetMap Data in Data Scarce Area Using Unsupervised Machine Learning (sotm2025)

from Chaos Computer Club - recent audio-only feed · host Maruf Ahmed

This study introduces a behavior-based, unsupervised machine learning approach to assess the intrinsic quality of OpenStreetMap (OSM) data in Dhaka, an area that is both data-scarce and rapidly urbanizing. Using enriched contributor metadata and Principal Component Analysis (PCA), latent behavioral patterns were identified and contributors were segmented with KMeans and HDBSCAN. The silhouette score for the PCA-based KMeans clustering was 0.951. The results show that KMeans is more interpretable than HDBSCAN. This repeatable methodology provides a scalable, reference-free way to bring quality assurance of VGI datasets to the front line where authoritative data are limited or absent.

OpenStreetMap (OSM) is an important source of geospatial information in data-scarce urban areas, where official geospatial data are scarce, outdated, or not readily available. The growing need for current and accurate geospatial data in fast-urbanizing and under-surveyed regions makes OSM an essential resource. As one of the most representative examples of Volunteered Geographic Information (VGI), OSM offers a free, editable world map to which millions of people can contribute [1]. It is an essential component of urban analytics, transport planning, disaster risk reduction, and spatial modeling worldwide [2], [3], [4]. Although OSM is widely used, its data quality varies greatly across regions and contributor skill levels, and there is no unified, system-level quality assurance mechanism [5]. This heterogeneity poses risks for users who rely on the data for precision tasks (e.g., routing, land use modeling, and infrastructure design) [2], [6]. Traditional OSM quality assessments rely on extrinsic comparisons with satellite imagery or authoritative datasets, which are often unavailable in the very regions that need the data most [7], [8].
To overcome this challenge, this study proposes a reproducible, unsupervised machine learning framework that assesses OSM data quality intrinsically, based on contributor behavior metadata alone. Specifically, Dhaka, a data-scarce and fast-growing megacity in Bangladesh, is selected as the study area, under the hypothesis that distinct contributor behavioral patterns correlate with different levels of data reliability. This behavior-centric perspective builds on the insight that contributor frequency, recency, thematic focus, and spatial editing behavior can serve as meaningful proxies for feature quality [5], [9].

Roads and buildings for Dhaka were extracted from a .osm.pbf file with the Pyrosm library. An enriched feature vector was then created for each unique contributor, composed of (total_edits, edit_rate, active_days, spatial_extent, pct_road, pct_building, weekday_activity, days_since_last_edit). Principal Component Analysis (PCA) was applied for dimensionality reduction and shows that PC1 roughly represents overall mapping activity, PC2 corresponds to thematic focus (road versus building), and PC3 represents the geographical coverage of contributions. These observations are supported by a feature contribution heatmap (Figure 1(a)), which indicates that the behavioral features are interpretable and highly separable in the reduced component space. PCA also serves to reduce noise and prepare the data for clustering [10].

Next, KMeans clustering (with k = 4) and HDBSCAN, a density-based clustering method, were performed on the PCA-transformed feature set. The silhouette score of the KMeans model was 0.951, suggesting high cohesion within clusters and good separation between behavioral clusters. The PCA cluster scatterplot (Figure 1(c)) indicates four well-separated clusters, which fall into three behavioral groups: (1) most participants (Figure 1(b)) fall in cluster 0, which mainly comprises casual or one-time contributors who probably take part in sporadic mapathons or make large-scale imports; (2) clusters 1 and 2 consist of moderate to heavy contributors who are relatively stable, use richer semantic tagging, and whose edits are spatially distributed; (3) cluster 3 is a small group of “power users” characterized by high activity volume and wide geographical distribution.

HDBSCAN was also applied to the same dataset to assess its ability to separate clusters of varying density from noise. HDBSCAN found small, dense clusters and labeled a large percentage of contributors as noise. Although helpful for identifying anomalies and potential vandalism, HDBSCAN did not produce clusters for the main contributors as clear as those from KMeans, likely because of the extreme imbalance in contributor engagement. This comparison showed that KMeans offers better interpretability and cluster stability, and it is therefore preferred for behavioral segmentation of high-volume OSM datasets.

To further verify the clustering, changes in edit volume over time were investigated per cluster, and feature distributions were calculated per cluster. The contributor distribution bar chart (Figure 1(b)) shows that the participation structure in OSM is highly skewed, in line with previous VGI studies [11], [12]. Feature analysis showed that clusters associated with more recent, frequent, and thematically rich editing were also responsible for higher-quality contributions, consistent with prior work linking contributor experience to data quality [5], [9], [13].

A key contribution of this work is its extensible and repeatable approach. All data processing, feature engineering, PCA, and clustering were performed in Python (Colab) with open-source packages (scikit-learn, geopandas, pyrosm, matplotlib).
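The PCA and KMeans steps described above can be illustrated with a minimal scikit-learn sketch. It is not the study's code: the contributor table is synthetic (random numbers standing in for real OSM metadata), and the standardization step, the choice of three retained components, and the helper names `make_synthetic_contributors` and `segment_contributors` are assumptions made for illustration. Only the eight feature names, the use of PCA, and KMeans with k = 4 come from the description.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Feature names as listed in the description.
FEATURES = [
    "total_edits", "edit_rate", "active_days", "spatial_extent",
    "pct_road", "pct_building", "weekday_activity", "days_since_last_edit",
]


def make_synthetic_contributors(n=400, seed=0):
    """Hypothetical stand-in for the per-contributor table (not real OSM data)."""
    rng = np.random.default_rng(seed)
    # Heavy-tailed activity: many casual contributors, few very active ones.
    total_edits = rng.lognormal(mean=2.0, sigma=1.5, size=n)
    active_days = np.clip(total_edits / rng.uniform(5, 50, n), 1, None)
    edit_rate = total_edits / active_days
    spatial_extent = rng.gamma(2.0, 1.0, n) * np.log1p(total_edits)
    pct_road = rng.uniform(0, 1, n)
    pct_building = 1.0 - pct_road
    weekday_activity = rng.uniform(0, 1, n)
    days_since_last_edit = rng.uniform(0, 3000, n)
    return np.column_stack([
        total_edits, edit_rate, active_days, spatial_extent,
        pct_road, pct_building, weekday_activity, days_since_last_edit,
    ])


def segment_contributors(X, n_components=3, k=4, seed=0):
    """Standardize, project with PCA, cluster with KMeans, score with silhouette."""
    # Assumption: features are z-scored so no single scale dominates the PCA.
    X_scaled = StandardScaler().fit_transform(X)
    pca = PCA(n_components=n_components, random_state=seed)
    Z = pca.fit_transform(X_scaled)
    labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(Z)
    score = silhouette_score(Z, labels)
    return pca, labels, score


X = make_synthetic_contributors()
pca, labels, score = segment_contributors(X)

print("explained variance ratio:", pca.explained_variance_ratio_.round(3))
# The largest absolute loadings hint at what each component represents.
for i, comp in enumerate(pca.components_, start=1):
    top = sorted(zip(FEATURES, comp), key=lambda t: abs(t[1]), reverse=True)[:3]
    print(f"PC{i}:", ", ".join(f"{name}={w:+.2f}" for name, w in top))
print("cluster sizes:", np.bincount(labels, minlength=4))
print("silhouette score:", round(float(score), 3))
```

On synthetic data the silhouette score and loadings will differ from the reported values; the sketch only shows how the stages connect. Inspecting the loadings per component is one way to arrive at readings such as "PC1 reflects overall activity".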
This method does not need any external validation data, so it is particularly suited to developing countries and isolated locations where reference data are limited or unavailable [8]. The study contributes methodologically to three areas, namely geospatial data science, unsupervised machine learning, and VGI quality assurance, by showing how user behavior can be used to derive intrinsic data quality. It complements the literature on behavior-based contributor profiling, incorporates dimensionality reduction to aid interpretation of the results, and makes the case for local rather than centralized quality assessment, which appears feasible even in urban settings with complex mobility patterns. In practical terms, this work can help NGOs, local authorities, and the OSM community direct resources toward data validation and enrichment in areas where coverage comes primarily from lower-quality contribution clusters. It also enables hybrid quality models in which behavioral signals are augmented with selective extrinsic checks (such as anomaly detection or community verification). For example, contributors from Cluster 3 (power users) may be assigned higher trust weights in quality models, while edits from Cluster 0 may be flagged for further review or enrichment.

In conclusion, this study presents a new behavior-based quality assessment of OSM data built on unsupervised machine learning. The cluster- and PCA-driven design is transparent, interpretable, and fully reproducible. The model addresses the challenges of working in data-scarce urban areas and paves the way for behavior-driven VGI quality models in the context of urban resilience, infrastructure planning, and humanitarian mapping. Future studies will incorporate spatial error measures and apply this methodology to longitudinal OSM data to monitor how quality evolves.
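The trust-weight idea mentioned above (higher weights for Cluster 3, review flags for Cluster 0) could be sketched as follows. The numeric weights, the `Edit` record, the `score_edits` helper, and the handling of contributors with no cluster assignment are all hypothetical choices for illustration; the description does not specify any of them.

```python
from dataclasses import dataclass

# Hypothetical trust weights per behavioral cluster (illustrative values only).
TRUST_WEIGHTS = {0: 0.25, 1: 0.6, 2: 0.6, 3: 0.9}
# Clusters whose edits are queued for further review or enrichment.
REVIEW_CLUSTERS = {0}


@dataclass
class Edit:
    feature_id: str   # e.g. an OSM element reference such as "way/1"
    contributor: str  # contributor identifier


def score_edits(edits, contributor_cluster):
    """Return (feature_id, trust_weight, needs_review) for each edit.

    Assumption: contributors without a cluster label get weight 0.0
    and are always flagged for review.
    """
    results = []
    for edit in edits:
        cluster = contributor_cluster.get(edit.contributor)
        weight = TRUST_WEIGHTS.get(cluster, 0.0)
        needs_review = cluster is None or cluster in REVIEW_CLUSTERS
        results.append((edit.feature_id, weight, needs_review))
    return results


# Example with made-up contributors and cluster labels.
edits = [Edit("way/1", "alice"), Edit("way/2", "bob"), Edit("node/3", "carol")]
clusters = {"alice": 3, "bob": 0}
scored = score_edits(edits, clusters)
for feature_id, weight, needs_review in scored:
    print(feature_id, weight, "review" if needs_review else "ok")
```

In a hybrid model, the review flag would be the point at which selective extrinsic checks such as community verification are triggered.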
License: Creative Commons Attribution 3.0 Unported (https://creativecommons.org/licenses/by/3.0/)
About this event: https://2025.stateofthemap.org/sessions/GCAXF9/
