Practically every business makes use of location data: for billing and delivery, supply chains, prospects, and more. Does yours also make use of geospatial analytics? Don't worry if you have not started yet. It has never been more accessible, thanks to broadly adopted standards and a rich open-source and open-data community. The tutorial below gives you a kick-start using Python, the Geoapify Batch API, and just a few lines of code.
Why Python for geospatial analytics?
The Python data science community is undoubtedly one of the richest, with the R community likely its strongest competitor. Don't stop reading if your team prefers R: the code below is easy to follow along and translate.
Python comes with a long list of open-source packages for geospatial analytics. Most notable is GeoPandas, the workhorse of the geospatial ecosystem. It is the tool of choice for manipulating arrays of geometric objects: computing areas and distances, joining datasets by proximity, and more. A recent addition to this list is my geobatchpy Python package. It builds on the Geoapify API with geospatial analytics first and foremost in mind, and it ships with a command-line interface for the Geoapify Batch API.
The following tutorial shows you how to put these tools into action. Do you want to follow along with your own data? Don't forget to subscribe to Geoapify if you have not already. The free plan comes with 3k credits daily, including commercial use. Thanks to the discounts in Batch API pricing, this plan covers our tutorial for almost 2k addresses.
A recipe to start with geospatial analytics in Python
We executed the code in this tutorial using Python 3.9.13 with main dependencies `geobatchpy==0.2.1` and `geopandas==0.11.1`. This installation guide shows you how to set up a Conda virtual environment on your computer.
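If you start from scratch, a minimal setup could look like the following (the environment name is just an example):

conda create -n geo-tutorial python=3.9
conda activate geo-tutorial
pip install geobatchpy==0.2.1 geopandas==0.11.1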
Part 1 - prepare the address data
Our example consists of roughly one thousand address records of hospitals. But the code easily scales to hundreds of thousands of records. Again, only your Geoapify subscription dictates the limit.
import pandas as pd
df = pd.read_csv('hospitals-sample.csv', usecols=['Street', 'Housenumber', 'Postcode', 'City', 'Country'])
print(f'Total number of records: {df.shape[0]}')
print(df.head().to_markdown()) # prints first 5 rows
The console output gives you an idea of what our example data looks like:
|    | Street             | Housenumber | Postcode | City       | Country     |
|---:|:-------------------|:------------|:---------|:-----------|:------------|
|  0 | Jan Tooropstraat   | 164         | 1061AE   | Amsterdam  | Netherlands |
|  1 | Barbarastraße      | 1           | 45964    | Gladbeck   | Germany     |
|  2 | Am Gesundheitspark | nan         | 51375    | Leverkusen | Germany     |
|  3 | Rue de Tivoli      | nan         | 57000    | Metz       | France      |
|  4 | Propst-Sievert-Weg | nan         | 46325    | Borken     | Germany     |
The Geocoding API accepts structured input as well as free-text search. We encounter data quality issues in practically every real-world dataset, such as attributes ingested into the wrong columns, which makes free text the safer, more convenient choice. For that, we need to parse each row into a single string:
import numpy as np
def concat_columns(df: pd.DataFrame, sep: str = ' ', fill_value: str = np.nan) -> pd.Series:
    """Concatenates row-wise all columns of df and returns a series with the same index."""
    return (df.apply(lambda s: sep.join(s.dropna().astype(str).str.strip().replace('', np.nan).dropna()), axis=1)
            .replace('', fill_value))
df['AddressLine1'] = concat_columns(df=df[['Street', 'Housenumber']])
df['AddressLine2'] = concat_columns(df=df[['Postcode', 'City']])
addresses = concat_columns(df[['AddressLine1', 'AddressLine2', 'Country']], sep=', ', fill_value='')
print(addresses[:5].rename('Query').to_markdown())
Here is what the first five search queries look like:
|    | Query                                               |
|---:|:----------------------------------------------------|
|  0 | Jan Tooropstraat 164, 1061AE Amsterdam, Netherlands |
|  1 | Barbarastraße 1, 45964 Gladbeck, Germany            |
|  2 | Am Gesundheitspark, 51375 Leverkusen, Germany       |
|  3 | Rue de Tivoli, 57000 Metz, France                   |
|  4 | Propst-Sievert-Weg, 46325 Borken, Germany           |
Part 2 - batch geocode
You can choose between geobatchpy's Python client and the `geobatch` command-line interface to run batch jobs. We prefer the second option for larger inputs. That requires some preparation, which we again do in Python:
from geobatchpy.batch import parse_geocoding_inputs
from geobatchpy.utils import write_data_to_json_file
data = {
'api': '/v1/geocode/search', # This tells Geoapify which service to run.
'inputs': parse_geocoding_inputs(locations=addresses),
'params': {'limit': 1, 'format': 'geojson'}, # We recommend always choosing GeoJSON.
'batch_len': 200, # Results in math.ceil(1042 / 200) = 6 jobs in total.
'id': 'tutorial-batch-geocode' # Optional label for your own reference.
}
write_data_to_json_file(data=data, file_path='tutorial-geocode-input.json')
The last cell creates a `.json` file inside your active directory, which we use to submit jobs to the Geoapify servers on the command line:
geobatch submit tutorial-geocode-input.json tutorial-geocode-urls.json --api-key <your-key>
Tip: you can omit the `--api-key` option by setting the `GEOAPIFY_KEY` environment variable.
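For example, on Linux or macOS:

export GEOAPIFY_KEY=<your-key>
geobatch submit tutorial-geocode-input.json tutorial-geocode-urls.json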
The file created by the first command, `tutorial-geocode-urls.json`, is the input for the next:
geobatch receive tutorial-geocode-urls.json tutorial-geocode-results.json --api-key <your-key>
Processing on the Geoapify servers can take some time, depending on the size of your inputs, your subscription plan, and how busy the servers are. Pause here and continue once the servers have completed all jobs.
Remember, we chose GeoJSON as the output format. Next, we read the results and simplify them so that we end up with a list of GeoJSON-like Python dictionaries, one per address:
from geobatchpy.batch import simplify_batch_geocoding_results
from geobatchpy.utils import read_data_from_json_file
results = read_data_from_json_file('tutorial-geocode-results.json')['results']
results = simplify_batch_geocoding_results(results=results, input_format='geojson')
print(results[0])
We print the first of those to the console:
{
"type": "Feature",
"properties": {
"datasource": {
"sourcename": "openstreetmap",
"attribution": "© OpenStreetMap contributors",
"license": "Open Database License",
"url": "https://www.openstreetmap.org/copyright",
},
"housenumber": "164",
"street": "Jan Tooropstraat",
"suburb": "Bos en Lommer",
"city": "Amsterdam",
"state": "North Holland",
"country": "Netherlands",
"postcode": "1061AE",
"country_code": "nl",
"lon": 4.8391891,
"lat": 52.3712642,
"formatted": "Jan Tooropstraat 164, 1061 AE Amsterdam, Netherlands",
"address_line1": "Jan Tooropstraat 164",
"address_line2": "1061 AE Amsterdam, Netherlands",
"timezone": {
"name": "Europe/Amsterdam",
"name_alt": "Europe/Brussels",
"offset_STD": "+01:00",
"offset_STD_seconds": 3600,
"offset_DST": "+02:00",
"offset_DST_seconds": 7200,
"abbreviation_STD": "CET",
"abbreviation_DST": "CEST",
},
"result_type": "building",
"rank": {
"importance": 0.31100000000000005,
"popularity": 7.836896301654736,
"confidence": 1,
"confidence_city_level": 1,
"confidence_street_level": 1,
"match_type": "full_match",
},
"place_id": "5181a32e63545b1340597a96d695852f4a40f00103f9013b29891b02000000c00203",
},
"geometry": {"type": "Point", "coordinates": [4.8391891, 52.3712642]},
"bbox": [4.8391391, 52.3712142, 4.8392391, 52.3713142],
"query": {
"text": "Jan Tooropstraat 164, 1061AE Amsterdam, Netherlands",
"parsed": {
"housenumber": "164",
"street": "jan tooropstraat",
"postcode": "1061ae",
"city": "amsterdam",
"country": "netherlands",
"expected_type": "building",
},
},
}
And now you will see why GeoJSON is a perfect fit. It is a broadly supported standard for storing geographic objects along with their metadata, and GeoPandas is no exception: one of its methods parses the `geometry` into a Shapely geometric object, puts all `properties` into separate columns, and ignores the rest:
import geopandas as gpd
df_geocodes = gpd.GeoDataFrame.from_features(results, crs='EPSG:4326')  # 4326 = lon, lat coordinates
show_cols = ['geometry', 'name', 'street', 'housenumber', 'postcode', 'city', 'suburb', 'state', 'country', 'rank']
print(df_geocodes[show_cols].head().to_markdown())
We print the first five rows to the console:
|    | geometry | name | street | housenumber | postcode | city | suburb | state | country | rank |
|---:|:---------|:-----|:-------|:------------|:---------|:-----|:-------|:------|:--------|:-----|
|  0 | POINT (4.8391891 52.3712642) | nan | Jan Tooropstraat | 164 | 1061AE | Amsterdam | Bos en Lommer | North Holland | Netherlands | {'importance': 0.31100000000000005, 'popularity': 7.836896301654736, 'confidence': 1, 'confidence_city_level': 1, 'confidence_street_level': 1, 'match_type': 'full_match'} |
|  1 | POINT (6.987436 51.574441) | nan | Barbarastraße | 1 | 45964 | Gladbeck | nan | North Rhine-Westphalia | Germany | {'popularity': 6.574173044673481, 'confidence': 1, 'confidence_city_level': 1, 'confidence_street_level': 1, 'match_type': 'full_match'} |
|  2 | POINT (7.0350426 51.0299791) | Am Gesundheitspark | Am Gesundheitspark | nan | 51375 | Leverkusen | Schlebusch | North Rhine-Westphalia | Germany | {'importance': 0.3, 'popularity': 6.8695683348015155, 'confidence': 1, 'confidence_city_level': 1, 'confidence_street_level': 1, 'match_type': 'full_match'} |
|  3 | POINT (6.1923058 49.09872) | Rue de Tivoli | Rue de Tivoli | nan | 57000 | Metz | Plantières-Queuleu | Grand Est | France | {'importance': 0.4, 'popularity': 6.673758921453906, 'confidence': 1, 'confidence_city_level': 1, 'confidence_street_level': 1, 'match_type': 'full_match'} |
|  4 | POINT (6.8600285 51.8408674) | Propst-Sievert-Weg | Propst-Sievert-Weg | nan | 46325 | Borken | Borken | North Rhine-Westphalia | Germany | {'importance': 0.4, 'popularity': 5.540347061941222, 'confidence': 1, 'confidence_city_level': 1, 'confidence_street_level': 1, 'match_type': 'full_match'} |
Some properties are dictionaries, which we can flatten using `pandas.json_normalize`:
df_rank = pd.json_normalize(df_geocodes['rank'])
print(df_rank.head().to_markdown())
|    | importance | popularity | confidence | confidence_city_level | confidence_street_level | match_type |
|---:|-----------:|-----------:|-----------:|----------------------:|------------------------:|:-----------|
|  0 | 0.311      | 7.8369     | 1          | 1                     | 1                        | full_match |
|  1 | nan        | 6.57417    | 1          | 1                     | 1                        | full_match |
|  2 | 0.3        | 6.86957    | 1          | 1                     | 1                        | full_match |
|  3 | 0.4        | 6.67376    | 1          | 1                     | 1                        | full_match |
|  4 | 0.4        | 5.54035    | 1          | 1                     | 1                        | full_match |
Table `df_rank` is a good starting point to investigate data quality issues in your original address data. Check out this Towards Data Science publication if you want to dive deeper into this topic.
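For instance, a quick sanity check (a sketch using only the columns shown above) flags records that did not fully match or returned reduced confidence:

# Candidates for manual review: partial matches or confidence below 1.
suspects = df_rank[(df_rank['match_type'] != 'full_match') | (df_rank['confidence'] < 1)]
print(f'Records to review: {suspects.shape[0]} of {df_rank.shape[0]}')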
Part 3 - add building details
Geocodes, which are just points on a map, are not the only geometry we can get out of Geoapify's services. The Place Details API covers the shapes of buildings and more, and again, a batch version of this service is available at a discount. Requesting this service through the command line is almost identical to the previous example. The main differences we would like to stress are:
- The service accepts geocodes as input.
- Its output format is, by default, GeoJSON.
- We can choose from a long list of feature types.
The service responds with feature type `'details'` by default. However, we also request type `'building'` to illustrate how to deal with multiple feature types:
from geobatchpy.batch import parse_geocodes, simplify_batch_place_details_results
from geobatchpy.utils import write_data_to_json_file
geocodes = parse_geocodes(geocodes=[(item.x, item.y) for _, item in df_geocodes['geometry'].items()])
features = ','.join(['details', 'building'])
data = {
'api': '/v2/place-details', # See the Geoapify API docs for other batch APIs.
'inputs': geocodes,
'params': {'features': features},
'batch_len': 200,
'id': 'tutorial-batch-details'
}
write_data_to_json_file(data=data, file_path='tutorial-details-input.json')
The last cell creates a `.json` file inside your active directory. Next, we switch to the command line and submit jobs using this file as input:
geobatch submit tutorial-details-input.json tutorial-details-urls.json --api-key <your-key>
The output of the previous step is the input for the next:
geobatch receive tutorial-details-urls.json tutorial-details-results.json --api-key <your-key>
Pause here and continue after the servers have completed all your jobs. Again, this can take some time. Next, we simplify the results to a list of lists of Python dictionaries and print one example to the console:
results = read_data_from_json_file('tutorial-details-results.json')['results']
results = simplify_batch_place_details_results(results)
print(results[0])  # the list of features for the first location
[
{
"type": "Feature",
"properties": {
"feature_type": "details",
"wiki_and_media": {"wikidata": "Q3269552"},
"building": {"type": "hospital", "start_date": 1966},
"categories": ["building", "building.healthcare"],
"datasource": {
"sourcename": "openstreetmap",
"attribution": "© OpenStreetMap contributors",
"license": "Open Database Licence",
"url": "https://www.openstreetmap.org/copyright",
"raw": {
"osm_id": -1837609,
"ref:bag": 363100012074757,
"building": "hospital",
"osm_type": "r",
"wikidata": "Q3269552",
"start_date": 1966,
},
},
"place_id": "517c2f782daa5b1340594adc7e13872f4a40f00101f901290a1c0000000000",
},
"geometry": {
"type": "Polygon",
"coordinates": [
[
[4.8381998, 52.370803499],
...
[4.8404546, 52.370985099],
],
],
},
},
{
"type": "Feature",
"properties": {
"feature_type": "building",
"categories": ["building", "building.healthcare"],
"datasource": {
"sourcename": "openstreetmap",
"attribution": "© OpenStreetMap contributors",
"license": "Open Database Licence",
"url": "https://www.openstreetmap.org/copyright",
"raw": {
"osm_id": -1837609,
"ref:bag": 363100012074757,
"building": "hospital",
"osm_type": "r",
"wikidata": "Q3269552",
"start_date": 1966,
},
},
"wiki_and_media": {"wikidata": "Q3269552"},
"building": {"type": "hospital", "start_date": 1966},
"area": 15475,
"place_id": "517c2f782daa5b1340594adc7e13872f4a40f00101f901290a1c0000000000",
},
"geometry": {
"type": "Polygon",
"coordinates": [
[
[4.8381998, 52.370803499],
...
[4.8404546, 52.370985099],
],
],
},
},
]
Every element of the outer list corresponds to a location, i.e., a single geocode. The inner list consists of GeoJSON-like dictionaries, ideally one per requested feature type. But not all locations are buildings, and in some cases even the details are unavailable:
print('Number of feature types per location:')
print(pd.Series([len(res) for res in results]).value_counts().rename('Count').to_markdown())
Number of feature types per location:
|    |   Count |
|---:|--------:|
|  2 |     602 |
|  1 |     439 |
|  0 |       1 |
Since the results are again GeoJSON, we can use GeoPandas to bring the data into a tabular format, one row per location. This time it is a bit trickier, but still simple enough to do in a few lines of code:
from typing import List
def convert_place_details_results_to_dataframe(results: List[dict], index: pd.Index = None) -> pd.DataFrame:
    if index is None:
        index = pd.Index(range(len(results)))
    index_name = index.name if index.name is not None else 'index'
    # Parse every location into its own GeoDataFrame, tagged with the original index:
    df_details = pd.concat([gpd.GeoDataFrame.from_features(res).assign(**{index_name: idx})
                            for idx, res in zip(index, results)])
    # Split by feature type and prefix the column names accordingly:
    dfs_details = [df_split.set_index(index_name) for _, df_split in df_details.groupby('feature_type')]
    dfs_details = [df_split.rename(columns={col: df_split['feature_type'].iloc[0] + '.' + col
                                            for col in df_split.columns})
                   for df_split in dfs_details]
    df_details = pd.concat(dfs_details, axis=1)
    # Drop the redundant feature_type columns and any column that is entirely null:
    prop_null_is_1 = df_details.isnull().mean().eq(1.)
    cols_all_null = prop_null_is_1[prop_null_is_1].index.values
    return df_details.drop([col for col in df_details.columns
                            if col.endswith('feature_type') or col in cols_all_null], axis=1)
We apply the conversion function to our data and consolidate the two geometries into a single column for simplicity:
df_details = convert_place_details_results_to_dataframe(results=results, index=df.index)
df_details['geometry'] = df_details['building.geometry']
df_details.loc[df_details['geometry'].isnull(), 'geometry'] = df_details['details.geometry']
df_details = df_details.drop(['building.geometry', 'details.geometry'], axis=1)
df_details.geometry = df_details.geometry.set_crs('EPSG:4326')
`df_details` is a wide table with a geometry, many atomic attributes, and a few dictionary columns. If you prefer having dictionaries spread across multiple columns, use `pandas.json_normalize` (see the sketch after the next table).
show_cols = ['geometry', 'details.name', 'building.area', 'building.building']
# we manipulate the geometry column for pretty printing:
print(df_details
.assign(geometry=df_details.geometry.astype(str).str.slice(stop=20) + '...')
[show_cols].head().to_markdown())
| index | geometry | details.name | building.area | building.building |
|------:|:---------|:-------------|--------------:|:------------------|
| 0 | POLYGON ((4.8382 52.... | nan | 15475 | {'type': 'hospital', 'start_date': 1966} |
| 1 | POLYGON ((6.986498 5... | nan | 7006 | {'levels': 3, 'type': 'hospital', 'roof': {'levels': 1}} |
| 2 | LINESTRING (7.035218... | Am Gesundheitspark | nan | nan |
| 3 | POLYGON ((6.192371 4... | Rue de Tivoli | 130 | {'type': 'detached'} |
| 4 | LINESTRING (6.860072... | Propst-Sievert-Weg | nan | nan |
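For example, here is a small sketch that spreads the `building.building` dictionaries into their own columns; locations without building details carry NaN, so we substitute empty dictionaries first:

# NaN entries (locations without building details) become empty dictionaries.
building_dicts = df_details['building.building'].apply(lambda d: d if isinstance(d, dict) else {})
df_building = pd.json_normalize(building_dicts).set_index(df_details.index)
print(df_building.head().to_markdown())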
We could continue with the Isolines API and extend our addresses by reachability regions. But we stop the geospatial data preparation here and move on to some analytics.
Part 4 - manipulate geometries with GeoPandas
The first three parts were about data preparation, particularly of geometries. Next, we touch on what we can do with geometries beyond plotting them on a map. We start with the shapes of the buildings collected in `df_details`. Here is what the first building looks like according to our data:
df_details.geometry[0]
With GeoPandas, we can compute the occupied area and the total length of the boundaries. But first, we need to tell GeoPandas which coordinate reference system to use for these statistics. EPSG:3035 is a metric system and a good fit for European locations:
# crs=3035 is for the metric system, European locations
df_geometry_stats = pd.concat([df_details.geometry.to_crs(crs='EPSG:3035').area,
df_details.geometry.boundary.to_crs(crs='EPSG:3035').length], axis=1)
df_geometry_stats.columns = ['AreaSquareMeters', 'BoundaryLengthMeters']
print(df_geometry_stats.head().to_markdown())
| index | AreaSquareMeters | BoundaryLengthMeters |
|------:|-----------------:|---------------------:|
| 0 | 15501   | 1361.78 |
| 1 | 7016.58 | 697.259 |
| 2 | 0       | 0       |
| 3 | 130.552 | 46.5606 |
| 4 | 0       | 0       |
Expanding a building's shape to its convex hull may make even more sense, depending on the application. GeoPandas helps you with this and much more; check out the GeoPandas documentation to learn more.
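A minimal sketch that compares each footprint's area with the area of its convex hull, again in the metric CRS:

# Convex hulls smooth out concave footprints; areas are in square meters (EPSG:3035).
footprints = df_details.geometry.to_crs('EPSG:3035')
df_hulls = pd.concat([footprints.area.rename('Area'),
                      footprints.convex_hull.area.rename('ConvexHullArea')], axis=1)
print(df_hulls.head().to_markdown())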
Our second illustration is about joining two spatial datasets. Traditionally, we join tables by exact matches on keys, and most relational database engines also support pattern matching. GeoPandas additionally allows you to match by proximity, which we illustrate using a dataset of airports from opentraveldata, published under the CC BY 4.0 license.
from shapely.geometry import Point
df_airports = pd.read_csv('airports.csv')
df_airports = (df_airports.loc[df_airports.type.isin(['medium_airport', 'large_airport'])]
.rename(columns={'name': 'AirportName'}).reset_index(drop=True))
geometry = [Point(xy) for xy in zip(df_airports.longitude_deg, df_airports.latitude_deg)]
df_airports = gpd.GeoDataFrame(df_airports, geometry=geometry, crs='EPSG:4326')[['geometry', 'AirportName']]
print(df_airports.head().to_markdown())
|    | geometry | AirportName |
|---:|:---------|:------------|
|  0 | POINT (-158.617996216 59.2826004028) | Aleknagik / New Airport |
|  1 | POINT (69.80734 33.284605) | Khost International Airport |
|  2 | POINT (160.054993 -9.428) | Honiara International Airport |
|  3 | POINT (157.263 -8.32797) | Munda Airport |
|  4 | POINT (102.35224 32.53154) | Hongyuan Airport |
For every address in our original input, you may ask: which airport is closest, and how far away is it? The join below answers both, with distances in meters thanks to the metric CRS:
df_join = (df_geocodes.to_crs('EPSG:3035')
.sjoin_nearest(df_airports.to_crs('EPSG:3035'),
how='left', distance_col='DistanceToAirport')
.to_crs('EPSG:4326').rename(columns={'AirportName': 'ClosestAirport'})
.drop('index_right', axis=1))
show_cols = ['formatted', 'ClosestAirport', 'DistanceToAirport']
print(df_join[show_cols].head().to_markdown())
|    | formatted | ClosestAirport | DistanceToAirport |
|---:|:----------|:---------------|------------------:|
|  0 | Jan Tooropstraat 164, 1061 AE Amsterdam, Netherlands | Amsterdam Airport Schiphol | 8659.8 |
|  1 | Barbarastraße 1, 45964 Gladbeck, Germany | Düsseldorf Airport | 35221.7 |
|  2 | Am Gesundheitspark, 51375 Leverkusen, Germany | Cologne Bonn Airport | 19763.5 |
|  3 | Rue de Tivoli, 57000 Metz, France | Metz-Nancy-Lorraine Airport | 13671.2 |
|  4 | Propst-Sievert-Weg, 46325 Borken, Germany | Twente Airport | 48446.2 |
A brief outlook on advanced geospatial analytics
We illustrated how to enrich address data with geometries (points and shapes), how to manipulate them, and how to join them with other geospatial datasets. Such data manipulations typically make up 90%, if not more, of every data analytics problem we face. This last section gives an outlook on so-called "advanced" analytics problems using geospatial data.
The most widely used geospatial technique is spatial autocorrelation. Say, along with every one of your address records, you also observe a numeric attribute like household income. With spatial autocorrelation, you can study whether a household's income correlates with that of its neighbors. This PySAL tutorial shows you how to evaluate such questions with formal statistical testing. Again, it integrates well with what we did in this tutorial: it uses a GeoPandas table as input and computes spatial weights to quantify geographic proximity.
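As an illustration, here is a sketch of such a test using PySAL's `libpysal` and `esda` packages (both are extra installs, not part of our setup above; `income` stands for a hypothetical numeric array aligned with `df_geocodes`):

from libpysal.weights import KNN
from esda.moran import Moran

# Spatial weights from the 8 nearest neighbors, computed in a metric CRS.
w = KNN.from_dataframe(df_geocodes.to_crs('EPSG:3035'), k=8)
mi = Moran(income, w)  # income: hypothetical numeric attribute per location
print(mi.I, mi.p_sim)  # Moran's I statistic and permutation-based pseudo p-value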
Say you study customer churn with a classification model that predicts whether any given customer will stay with your business for a future period. Provided you also have your customers' addresses, have you considered integrating geospatial features into your model? As a first step, you can confirm their significance by studying the spatial autocorrelation of customer churn. If it is significant, incorporate average churn rates (or any other customer signals) weighted over space and time, e.g., the average churn rate in the neighborhood of every current customer over the last 12 months.
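A hypothetical sketch of such a feature with the tools from this tutorial, assuming `df_customers` is a GeoDataFrame with point geometries and a binary `churned` column:

# Average churn among all customers within 5 km of each customer
# (the customer itself included; buffers are in meters thanks to EPSG:3035).
pts = df_customers.to_crs('EPSG:3035')
buffers = pts[['geometry']].copy()
buffers['geometry'] = buffers.geometry.buffer(5_000)
pairs = gpd.sjoin(buffers, pts[['geometry', 'churned']], how='left')
df_customers['neighborhood_churn'] = pairs.groupby(level=0)['churned'].mean()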