EPISODE · Oct 3, 2025 · 5 MIN

Behaviour-Based Quality Assessment of OpenStreetMap Data in Data Scarce Area Using Unsupervised Machine Learning (sotm2025)

from Chaos Computer Club - recent events feed (high quality) · host Maruf Ahmed

This study introduces a behavior-based, unsupervised machine learning approach to assess the intrinsic quality of OpenStreetMap (OSM) data in Dhaka, an area that is both data-scarce and rapidly urbanizing. Using enriched contributor metadata and Principal Component Analysis (PCA), latent behavioral patterns were identified and contributors were segmented with KMeans and HDBSCAN. The silhouette score for the PCA-based KMeans clustering was 0.951. The results show that KMeans is more interpretable than HDBSCAN. This repeatable methodology provides a scalable, reference-free way to bring quality assurance of VGI datasets to settings with limited or no authoritative data.
OpenStreetMap (OSM) is an important source of geospatial information in data-scarce urban areas, where official geospatial data are scarce, outdated, or not readily available. The growing need for current and accurate geospatial data in fast-urbanizing and under-surveyed regions makes OSM an essential resource. As one of the most representative examples of Volunteered Geographic Information (VGI), OSM offers a free, editable world map to which millions of people can contribute [1]. It is an essential component of urban analytics, transport planning, disaster risk reduction, and spatial modeling worldwide [2], [3], [4]. Although OSM is widely used, its data quality varies greatly across regions and contributor skill levels, and there is no unified, system-level quality assurance mechanism [5]. This heterogeneity poses risks for users who rely on the data for precision tasks (e.g., routing, land use modeling, and infrastructure design) [2], [6]. Traditional OSM quality assessments rely on extrinsic comparisons with satellite imagery or authoritative datasets, which are often unavailable in the very regions that need the data most [7], [8].
To overcome this challenge, a reproducible, unsupervised machine learning framework is proposed to assess OSM data quality intrinsically, based on contributor behavior metadata alone. Dhaka, a data-scarce and fast-growing megacity in Bangladesh, was selected as the study area, under the hypothesis that distinct contributor behavioral patterns correlate with different levels of data reliability. This behavior-centric perspective builds on the insight that contributor frequency, recency, thematic focus, and spatial editing behavior can serve as meaningful proxies for feature quality [5], [9].
Roads and buildings for Dhaka were extracted from a .osm.pbf file with the Pyrosm library. An enriched feature vector was then created for each unique contributor, composed of (total_edits, edit_rate, active_days, spatial_extent, pct_road, pct_building, weekday_activity, days_since_last_edit). Principal Component Analysis (PCA) was applied for dimensionality reduction. It shows that PC1 roughly represents overall mapping activity, PC2 corresponds to thematic focus (roads versus buildings), and PC3 represents the geographical coverage of contributions. These observations are supported by a feature contribution heatmap (Figure 1(a)), which indicates that the behavioral features are interpretable and highly separable in the reduced component space. PCA also reduces noise and prepares the data for clustering [10].
Next, KMeans clustering (with k = 4) and HDBSCAN, a density-based clustering algorithm, were run on the PCA-transformed feature set. The silhouette score of the KMeans model was 0.951, suggesting high cohesion within clusters and good separation between behavioral clusters. The PCA cluster scatterplot (Figure 1(c)) shows four separated clusters, which fall into three groups: (1) most participants (Figure 1(b)) fall in cluster 0, which mainly comprises casual or one-time contributors who probably take part in sporadic mapathons or make large-scale imports; (2) clusters 1 and 2 consist of moderate to heavy contributors who are relatively stable, use richer semantic tagging, and whose edits are spatially distributed; (3) cluster 3 is a small group of “power users” characterized by high activity volume and a wide geographical distribution.
HDBSCAN was also applied to the same dataset to assess its ability to separate clusters of varying density from noise. HDBSCAN found small, dense clusters and labeled a large percentage of contributors as noise. Although helpful for identifying anomalies and potential vandalism, HDBSCAN did not produce clusters as clear as those of KMeans for the main contributors, likely because of the extreme imbalance in contributor engagement. This comparison showed that KMeans offers better interpretability and cluster stability and is therefore preferred for behavioral segmentation of high-volume OSM datasets.
To further verify the clustering, changes in edit volume over time were examined per cluster, and feature distributions were calculated per cluster. The contributor distribution bar chart (Figure 1(b)) shows that the participation structure in OSM is highly skewed, in line with previous VGI studies [11], [12]. Feature analysis showed that clusters associated with more recent, frequent, and thematically rich editing were also responsible for higher-quality contributions, consistent with prior work linking contributor experience to data quality [5], [9], [13].
A key contribution of this work is its extensible and repeatable approach. All data processing, feature engineering, PCA, and clustering were performed in Python (Colab) with open-source packages (scikit-learn, geopandas, pyrosm, matplotlib).
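The PCA and KMeans steps described above can be sketched roughly as follows. This is a minimal illustration and not the study's own code: the contributor table is synthetic (the helper make_synthetic_contributors is hypothetical), the standardization step is an assumption the abstract does not mention, and the printed silhouette score will not reproduce the reported 0.951. Only the eight feature names, the three components, and k = 4 come from the text.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Feature names as listed in the abstract.
FEATURES = [
    "total_edits", "edit_rate", "active_days", "spatial_extent",
    "pct_road", "pct_building", "weekday_activity", "days_since_last_edit",
]


def make_synthetic_contributors(n=400, seed=0):
    """Hypothetical stand-in for the real per-contributor table (not study data)."""
    rng = np.random.default_rng(seed)
    total_edits = rng.lognormal(mean=2.0, sigma=1.5, size=n)  # heavy-tailed activity
    active_days = np.clip(np.round(total_edits / rng.uniform(1, 10, n)), 1, None)
    edit_rate = total_edits / active_days
    spatial_extent = rng.gamma(2.0, 1.0, n) * np.log1p(total_edits)
    pct_road = rng.uniform(0, 1, n)
    pct_building = 1.0 - pct_road
    weekday_activity = rng.uniform(0, 1, n)
    days_since_last_edit = rng.integers(0, 3000, n).astype(float)
    return np.column_stack([
        total_edits, edit_rate, active_days, spatial_extent,
        pct_road, pct_building, weekday_activity, days_since_last_edit,
    ])


def segment_contributors(X, n_components=3, k=4, seed=0):
    """Scale (assumed step), project with PCA, cluster with KMeans, score the result."""
    X_scaled = StandardScaler().fit_transform(X)
    pca = PCA(n_components=n_components, random_state=seed)
    Z = pca.fit_transform(X_scaled)
    labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(Z)
    score = silhouette_score(Z, labels)
    return Z, labels, score, pca


X = make_synthetic_contributors()
Z, labels, score, pca = segment_contributors(X)

print(f"silhouette score on synthetic data: {score:.3f}")
print("contributors per cluster:", np.bincount(labels))

# Loadings per component: the numbers behind a feature contribution heatmap.
for i, component in enumerate(pca.components_, start=1):
    top = sorted(zip(FEATURES, component), key=lambda t: abs(t[1]), reverse=True)[:3]
    print(f"PC{i}:", ", ".join(f"{name}={value:+.2f}" for name, value in top))
```

With real data, X would be replaced by the per-contributor feature table derived from the Pyrosm extract. Inspecting the loadings is one way to check whether PC1, PC2, and PC3 carry the activity, thematic, and spatial interpretations described above.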
This method needs no external validation databases, so it is particularly suited to developing countries and isolated locations where reference data are limited or unavailable [8]. Methodologically, the study contributes to three areas, namely geospatial data science, unsupervised machine learning, and VGI quality assurance, by showing how user behavior can be used to derive intrinsic data quality. It complements the literature on behavior-based contributor profiling, incorporates dimensionality reduction to ease the interpretation of results, and makes the case for local rather than centralized quality assessment, which seems feasible even in urban settings with complex mobility patterns. Practically, this work can help NGOs, local authorities, and the OSM community allocate resources toward data validation and enrichment in areas where coverage comes primarily from lower-quality contribution clusters. It also enables hybrid quality models in which behavior signals are augmented with selective extrinsic checks (such as anomaly detection or community verification). For example, contributors from Cluster 3 (power users) may be assigned higher trust weights in quality models, while edits from Cluster 0 may be flagged for further review or enrichment.
In conclusion, a new behaviour-based quality assessment of OSM is reported, built on unsupervised machine learning. This cluster- and PCA-driven design is transparent, interpretable, and fully reproducible. It addresses the challenges of working in data-scarce urban areas and paves the way for behavior-driven VGI quality models in the context of urban resilience, infrastructure planning, and humanitarian mapping. Future studies will incorporate spatial error measures and apply this methodology to longitudinal OSM data to monitor quality evolution.
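The trust-weighting idea mentioned above (higher weights for Cluster 3, review flags for Cluster 0) could look something like the sketch below. The numeric weights, the intermediate values for clusters 1 and 2, the review threshold, and the averaging rule are all illustrative assumptions; the abstract does not specify any of them.

```python
# Hypothetical trust weight per behavioral cluster. Only the ordering
# (Cluster 3 highest, Cluster 0 lowest) is suggested by the text.
TRUST_WEIGHTS = {0: 0.25, 1: 0.5, 2: 0.75, 3: 1.0}

# Hypothetical cut-off below which a map feature is queued for review.
REVIEW_THRESHOLD = 0.5


def feature_trust(editor_clusters):
    """Mean trust weight of the contributors who edited one map feature."""
    if not editor_clusters:
        return 0.0
    return sum(TRUST_WEIGHTS[c] for c in editor_clusters) / len(editor_clusters)


def needs_review(editor_clusters):
    """Flag a feature whose average contributor trust falls below the threshold."""
    return feature_trust(editor_clusters) < REVIEW_THRESHOLD


# Example: cluster IDs of the contributors who touched each (made-up) feature.
features = {
    "building/1": [0],
    "road/2": [3, 0],
    "road/3": [3],
}
for feature_id, clusters in features.items():
    print(feature_id, feature_trust(clusters), needs_review(clusters))
```

In a hybrid model, features flagged this way would be the candidates for the selective extrinsic checks, such as community verification.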
Creative Commons Attribution 3.0 Unported https://creativecommons.org/licenses/by/3.0/ about this event: https://2025.stateofthemap.org/sessions/GCAXF9/
