Python for Spatial Clustering in GIS
Spatial clustering is a fundamental technique in Geographic Information Systems (GIS) that groups spatial data points based on their geographic proximity and attribute similarity. Unlike traditional clustering methods that work solely with attribute data, spatial clustering considers both the spatial relationships between features and their characteristics, making it particularly valuable for geographic analysis.
This article explores how Python’s rich ecosystem of geospatial libraries enables sophisticated spatial clustering analysis, from basic point clustering to advanced regionalization techniques.
What is Spatial Clustering?
Spatial clustering identifies groups of geographically proximate objects that share similar characteristics. The key distinction from non-spatial clustering is the incorporation of spatial relationships—objects that are closer together in space are more likely to belong to the same cluster, even if their attributes are somewhat different.
Common applications include:
- Urban planning: Identifying similar neighborhoods
- Epidemiology: Detecting disease outbreak clusters
- Crime analysis: Finding crime hotspots
- Market analysis: Segmenting geographic markets
- Environmental science: Grouping similar ecological zones
Essential Python Libraries
Core Geospatial Libraries
GeoPandas: Built on pandas, it provides spatial data structures and operations for working with geometric data types. Documentation | GitHub
Shapely: Handles geometric objects and spatial operations like intersections, unions, and buffers. Documentation | GitHub
PySAL: A comprehensive library for spatial analysis, including numerous spatial clustering algorithms and spatial weights matrices. Documentation | GitHub
Scikit-learn: While primarily for machine learning, it includes clustering algorithms that can be adapted for spatial analysis. Documentation | GitHub
Folium: For interactive map visualization of clustering results. Documentation | GitHub
Specialized Spatial Clustering Libraries
DBSCAN: Available in scikit-learn, excellent for density-based spatial clustering. DBSCAN Paper
HDBSCAN: An extension of DBSCAN for hierarchical density-based clustering. Documentation | GitHub
Regionalization algorithms: Available in PySAL for creating spatially contiguous clusters. PySAL Regionalization
Additional Resources and References
- Spatial Analysis Methods and Practice: Cambridge University Press handbook
- Geographic Data Science with Python: Free online book covering spatial analysis
- Spatial Data Science: Comprehensive resource for spatial data science concepts
- OpenStreetMap: Source for real-world spatial data
- Natural Earth: Free vector and raster map data
- GDAL/OGR: Geospatial data abstraction library documentation
Types of Spatial Clustering
1. Point-based Clustering
This approach clusters individual point locations based on their coordinates and attributes. Common algorithms include:
K-means with spatial constraints: Traditional k-means modified to consider spatial proximity. Scikit-learn K-means
DBSCAN (Density-Based Spatial Clustering): Identifies clusters of varying shapes and automatically determines the number of clusters based on density. DBSCAN Algorithm | Original Paper
Mean Shift: Finds dense areas of data points and shifts cluster centers toward regions of higher density. Mean Shift Documentation
2. Area-based Clustering (Regionalization)
Regionalization creates spatially contiguous clusters from areal units like census tracts, counties, or administrative boundaries. Key methods include:
SKATER (Spatial K-means with Assortative Regionalization): Creates spatially contiguous clusters while minimizing within-cluster variance. PySAL SKATER
Max-p: Finds the maximum number of regions while maintaining minimum population thresholds. Max-p Documentation
AZP (Automated Zoning Procedure): Optimizes zone design based on specified criteria. AZP Implementation
3. Network-based Clustering
Clusters objects based on network relationships, such as road networks or social networks with geographic components.
Installation and Setup
Before starting, install the required libraries:
# Core libraries
pip install geopandas pandas numpy matplotlib seaborn scikit-learn
# Additional spatial libraries
pip install libpysal folium
# Optional advanced libraries
pip install hdbscan spopt contextily
# For Jupyter notebook users
pip install jupyterlab ipywidgets
# Alternative: Using conda (recommended for geospatial libraries)
conda install -c conda-forge geopandas libpysal folium scikit-learn
For more installation options, see:
# Essential imports for spatial clustering
import geopandas as gpd
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from shapely.geometry import Point
from sklearn.cluster import DBSCAN, KMeans
from sklearn.preprocessing import StandardScaler
import libpysal
from libpysal.weights import Queen, Rook, KNN
import folium
# For additional functionality
try:
import contextily as ctx # For basemaps
import hdbscan # For hierarchical DBSCAN
from spopt.region import Skater, MaxP # For regionalization
except ImportError:
print("Some optional libraries not installed. Install with: pip install contextily hdbscan spopt")
Useful Data Sources
- NYC Open Data: Real crime and demographic data
- US Census Bureau: Census tracts and demographic data
- World Bank Data: Global socioeconomic indicators
- OpenStreetMap: Global geographic data
- Natural Earth: Administrative boundaries and physical features
## Practical Implementation: Point-based Clustering
### Example: Clustering Crime Incidents
Let's walk through a complete example of clustering crime incidents in a city using DBSCAN:
```python
# Sample data creation (replace with your actual data loading)
def create_sample_crime_data():
np.random.seed(42)
# Create sample crime locations with some clustering
cluster1 = np.random.multivariate_normal([40.7128, -74.0060], [[0.001, 0], [0, 0.001]], 100)
cluster2 = np.random.multivariate_normal([40.7589, -73.9851], [[0.0005, 0], [0, 0.0005]], 80)
cluster3 = np.random.multivariate_normal([40.7282, -73.7949], [[0.0008, 0], [0, 0.0008]], 60)
coords = np.vstack([cluster1, cluster2, cluster3])
return pd.DataFrame({
'latitude': coords[:, 0],
'longitude': coords[:, 1],
'crime_type': np.random.choice(['theft', 'assault', 'burglary'], len(coords)),
'severity': np.random.randint(1, 6, len(coords))
})
# Load and prepare data
crime_data = create_sample_crime_data()
gdf = gpd.GeoDataFrame(
crime_data,
geometry=gpd.points_from_xy(crime_data.longitude, crime_data.latitude),
crs='EPSG:4326'
)
# Convert to projected CRS for distance calculations
gdf = gdf.to_crs('EPSG:3857') # Web Mercator
# Extract coordinates for clustering
coords = np.column_stack([gdf.geometry.x, gdf.geometry.y])
# Standardize coordinates (important for algorithms like k-means)
scaler = StandardScaler()
coords_scaled = scaler.fit_transform(coords)
# Apply DBSCAN clustering
# eps: maximum distance between points in the same cluster
# min_samples: minimum number of points required to form a cluster
dbscan = DBSCAN(eps=0.3, min_samples=5)
cluster_labels = dbscan.fit_predict(coords_scaled)
# Add cluster labels to GeoDataFrame
gdf['cluster'] = cluster_labels
gdf['is_noise'] = gdf['cluster'] == -1
# Convert back to WGS84 for visualization
gdf = gdf.to_crs('EPSG:4326')
print(f"Number of clusters found: {len(set(cluster_labels)) - (1 if -1 in cluster_labels else 0)}")
print(f"Number of noise points: {sum(cluster_labels == -1)}")
Visualizing Results
# Create a visualization of clusters
fig, ax = plt.subplots(1, 1, figsize=(12, 8))
# Plot noise points
noise_points = gdf[gdf['is_noise']]
if len(noise_points) > 0:
noise_points.plot(ax=ax, color='black', markersize=20, alpha=0.6, label='Noise')
# Plot clusters with different colors
clusters = gdf[~gdf['is_noise']]
if len(clusters) > 0:
clusters.plot(ax=ax, column='cluster', cmap='tab10', markersize=30, alpha=0.8, legend=True)
ax.set_title('Spatial Clustering of Crime Incidents using DBSCAN', fontsize=16)
ax.set_xlabel('Longitude')
ax.set_ylabel('Latitude')
plt.legend()
plt.show()
# Interactive map with Folium
def create_cluster_map(gdf):
# Calculate map center
center_lat = gdf.geometry.y.mean()
center_lon = gdf.geometry.x.mean()
# Create base map
m = folium.Map(location=[center_lat, center_lon], zoom_start=12)
# Define colors for clusters
colors = ['red', 'blue', 'green', 'purple', 'orange', 'darkred',
'lightred', 'beige', 'darkblue', 'darkgreen']
# Add points to map
for idx, row in gdf.iterrows():
color = 'black' if row['is_noise'] else colors[int(row['cluster']) % len(colors)]
popup_text = f"Cluster: {row['cluster']}<br>Crime: {row['crime_type']}<br>Severity: {row['severity']}"
folium.CircleMarker(
location=[row.geometry.y, row.geometry.x],
radius=8,
popup=popup_text,
color=color,
fill=True,
fillColor=color,
fillOpacity=0.6
).add_to(m)
return m
# Create and display interactive map
cluster_map = create_cluster_map(gdf)
Area-based Clustering: Regionalization
Regionalization is particularly useful when working with administrative units or census data:
# Example: Clustering census tracts based on demographic data
def perform_regionalization(gdf, attributes, n_clusters=5):
"""
Perform spatial regionalization using SKATER algorithm
"""
from libpysal import weights
from spopt.region import Skater
# Create spatial weights matrix
w = weights.Queen.from_dataframe(gdf)
# Standardize attributes
scaler = StandardScaler()
X = scaler.fit_transform(gdf[attributes])
# Apply SKATER algorithm
model = Skater(gdf, X, w, n_clusters)
model.solve()
# Add cluster labels to GeoDataFrame
gdf['region'] = model.labels_
return gdf, model
# Example usage with sample data
# (You would replace this with actual census tract data)
def create_sample_census_data():
# This is a simplified example - in practice, you'd load actual census shapefiles
from shapely.geometry import Polygon
# Create sample polygons and demographic data
polygons = []
demographics = []
for i in range(20):
# Create random polygons
x_base = (i % 5) * 0.01
y_base = (i // 5) * 0.01
poly = Polygon([
(x_base, y_base),
(x_base + 0.008, y_base),
(x_base + 0.008, y_base + 0.008),
(x_base, y_base + 0.008)
])
polygons.append(poly)
# Create correlated demographic data
pop_density = np.random.randint(1000, 10000)
median_income = np.random.randint(30000, 120000)
education_level = np.random.uniform(0.3, 0.95)
demographics.append({
'population_density': pop_density,
'median_income': median_income,
'education_pct': education_level,
'tract_id': f'tract_{i:03d}'
})
return gpd.GeoDataFrame(demographics, geometry=polygons, crs='EPSG:4326')
# Demonstrate regionalization
census_gdf = create_sample_census_data()
demographic_vars = ['population_density', 'median_income', 'education_pct']
# Note: This example requires additional libraries (spopt) that may need separate installation
# The concept demonstrates the approach even if not directly executable
Advanced Techniques
Hierarchical Spatial Clustering
from scipy.cluster.hierarchy import dendrogram, linkage, fcluster
from scipy.spatial.distance import pdist, squareform
def hierarchical_spatial_clustering(gdf, attributes, spatial_weight=0.5):
"""
Perform hierarchical clustering with spatial constraints
"""
# Extract coordinates and attributes
coords = np.column_stack([gdf.geometry.x, gdf.geometry.y])
attrs = gdf[attributes].values
# Standardize both spatial and attribute data
coords_scaled = StandardScaler().fit_transform(coords)
attrs_scaled = StandardScaler().fit_transform(attrs)
# Calculate combined distance matrix
spatial_dist = pdist(coords_scaled)
attr_dist = pdist(attrs_scaled)
# Combine spatial and attribute distances
combined_dist = spatial_weight * spatial_dist + (1 - spatial_weight) * attr_dist
# Perform hierarchical clustering
linkage_matrix = linkage(combined_dist, method='ward')
return linkage_matrix
# Example usage
# linkage_matrix = hierarchical_spatial_clustering(gdf, ['severity'], spatial_weight=0.7)
# clusters = fcluster(linkage_matrix, n_clusters, criterion='maxclust')
Spatial Clustering with Constraints
def constrained_spatial_clustering(gdf, attributes, constraint_field, max_constraint_value):
"""
Perform clustering with spatial and attribute constraints
"""
from sklearn.cluster import AgglomerativeClustering
# Create connectivity matrix based on spatial proximity
from sklearn.neighbors import kneighbors_graph
coords = np.column_stack([gdf.geometry.x, gdf.geometry.y])
connectivity = kneighbors_graph(coords, n_neighbors=6, include_self=False)
# Standardize attributes
X = StandardScaler().fit_transform(gdf[attributes])
# Apply constrained clustering
clustering = AgglomerativeClustering(
n_clusters=None,
connectivity=connectivity,
distance_threshold=0.5,
linkage='ward'
)
cluster_labels = clustering.fit_predict(X)
# Validate constraints
gdf_temp = gdf.copy()
gdf_temp['cluster'] = cluster_labels
# Check if constraint is satisfied
cluster_constraints = gdf_temp.groupby('cluster')[constraint_field].sum()
valid_clusters = cluster_constraints <= max_constraint_value
return cluster_labels, valid_clusters
Best Practices and Considerations
Data Preprocessing
Always preprocess your spatial data appropriately:
- Use projected coordinate systems for distance-based calculations
- Standardize attributes when combining spatial and non-spatial features
- Handle missing values and outliers
- Consider the scale and resolution of your spatial data
Algorithm Selection
Choose clustering algorithms based on your specific needs:
- DBSCAN: Best for irregular shapes and automatic cluster detection
- K-means: Fast but assumes spherical clusters
- Hierarchical methods: Good for exploring different cluster granularities
- Regionalization: Essential for maintaining spatial contiguity
Parameter Tuning
Key parameters to optimize:
- eps (DBSCAN): Distance threshold for point neighborhoods
- min_samples (DBSCAN): Minimum points required for a cluster
- n_clusters: Number of desired clusters (when known)
- spatial_weight: Balance between spatial and attribute similarity
Validation
Evaluate clustering results using:
- Silhouette analysis for cluster quality
- Spatial autocorrelation measures (Moran’s I)
- Domain-specific validation metrics
- Visual inspection of results
Performance Optimization
For large datasets, consider:
- Using spatial indexing (R-tree) for faster neighbor searches
- Implementing parallel processing for independent operations
- Sampling techniques for initial exploration
- Memory-efficient data structures for very large datasets
# Example: Optimized DBSCAN for large datasets
from sklearn.neighbors import BallTree
def optimized_spatial_dbscan(coords, eps, min_samples):
"""
Optimized DBSCAN using spatial indexing
"""
# Use BallTree for efficient neighbor searches
tree = BallTree(coords, metric='euclidean')
# Find neighbors for all points
neighbors = tree.query_radius(coords, r=eps)
# Apply DBSCAN logic with precomputed neighbors
# (This is a simplified example - full implementation would be more complex)
return tree, neighbors
Python’s ecosystem provides powerful tools for spatial clustering in GIS applications. The combination of libraries like GeoPandas, PySAL, and scikit-learn enables sophisticated analysis of spatial patterns and relationships. Success in spatial clustering depends on understanding your data’s characteristics, choosing appropriate algorithms, and carefully validating results.
The key to effective spatial clustering lies in balancing spatial proximity with attribute similarity, considering the specific requirements of your analysis, and leveraging Python’s rich geospatial libraries to implement robust, scalable solutions.
Whether you’re analyzing urban patterns, environmental phenomena, or social datasets, spatial clustering provides valuable insights into the geographic organization of complex systems and helps reveal hidden spatial structures in your data.