While working on a recent project, I wanted to map our data for our presentation. Too late in the process I came across GeoPandas, and I wasn't able to get it working in time. Thankfully another team member was skilled in Tableau (thanks Abass!) and got the map done, but I still was interested in learning more about GeoPandas for future use. Before reading the rest of this post I'd recommend taking a look at the Introduction on the GeoPandas site, as that's what I started with.
GeoPandas is a python library for dealing with geographic and spatial data. Rather than using multiple libraries in order to import, clean, and plot this type of data, GeoPandas bundles all this functionality for a more seamless experience. In this tutorial, I'll show how to get a basic notebook up and running with GeoPandas and stitch a few of its more interesting features together.
Much of the functionality available in GeoPandas could be accomplished by employing CartoPy, Pandas, and GeoPy. Unsurprisingly, those last two are dependencies of GeoPandas (though I believe GeoPy is optional). As the name might imply, GeoPandas primarily is set up to use and mimic Pandas to deal with data, most exemplified by GeoSeries and GeoDataFrame being subclasses of the equivalently named Pandas structures. Effectively this gives all the power and versatility of dealing with data in Pandas, without needing to add in different conversions beforehand.
To start with we'll take this popular dataset of Covid cases from kaggle. There are entries for each country, on different days, for different variants. We'll use GeoPandas to plot just the Omicron cases for one day first, and then all cases the data set has. The accompanying repository for this post has the data in a csv file already, but kaggle has easy download instructions on their site.
import pandas as pd
import numpy as np
import geopandas
covid_cases = geopandas.read_file('covid-variants.csv')
covid_cases['date'] = pd.to_datetime(covid_cases['date'])
/home/thedefect/anaconda3/envs/geotest/lib/python3.9/site-packages/geopandas/geodataframe.py:600: RuntimeWarning: Sequential read of iterator was interrupted. Resetting iterator. This can negatively impact the performance. for feature in features_lst:
Because each entry in the data contains a value for total cases in a specific country on a specific day, we only need to select for one variant in order to have total cases available to use.
omicron_cases = covid_cases[covid_cases.variant == 'Omicron'] # Selecting just the Omicron variant.
single_day = omicron_cases[omicron_cases['date'] == '2021-12-27'] # Selecting just one day to make things a bit simpler
Now that we have the covid data we want, let's get some geographic data to plot. We'll use GeoPandas's geocoding ability. Geocoding is a way to get geographic point data by searching for a place name in a database like Open Street Map. This is accomplished with GeoPy, using very similar syntax, though with a few issues, as we shall see.
country_locations = geopandas.tools.geocode(single_day.location) # generates a list of points from location names
single_day.index.equals(country_locations.index) # Let's see if the indices match, just to be sure.
True
Now that we've loaded the country points, let's add it to our covid data frame and create a map. GeoPandas has a very nice .explore()
method on its GeoDateFrame
objects. This utilizes a library called follium which itself uses the leaflet.js library to create interactive maps.
All GeoDataFrames have an "active geometry", that's the column we filled with our point data, which is used when geometric operations are applied to the data frame.
single_day = single_day.assign(geometry=country_locations['geometry']) # Fill the Geometry column with our new point data
fixed_day = single_day.astype({'num_sequences': 'int64', 'num_sequences_total': 'int64'}) # Fixing datatypes so we can appropriately sort by them
one_day_omicron = fixed_day.sort_values(by='num_sequences').drop('date', axis=1)
one_day_omicron.explore(column='num_sequences', # We give the column we want map to our color gradient
tooltip='location', # hover text and popups can be set to display different values
popup='num_sequences',
marker_type='circle_marker', # We can control the type and size of map markers
marker_kwds={'radius':10})
Well our nice interactive map was created, but why are there extra points in the U.S.? Here we run into a limitation of geocoding straight from GeoPandas: if the initial values our Geocoder selects are wrong, it can be difficult to remedy that from GeoPandas alone. Below we'll quickly use GeoPy to fix those three lost points.
# The GeoPandas geocode call grabbed the wrong Morocco, Switzerland, and Georgia, so we have to manually correct.
from geopy.geocoders import Photon
from shapely.geometry import Point
photon_fixer = Photon() # The Geocoding object that will make our requests
# .geocode is returning a list of entries thanks to 'exactly_one=False' so we can index for the correct one.
Morocco = photon_fixer.geocode("Morocco", exactly_one=False, language='en')[0]
Switzerland = photon_fixer.geocode("Switzerland", exactly_one=False, language='en')[1]
Georgia = photon_fixer.geocode("Georgia", exactly_one=False, language='en')[2]
def fix_geometry(gdf, points):
"""Simple function to correct Geometry column with new points."""
for point in points:
gdf.loc[gdf.location == f'{point}'.split(', ')[0], ['geometry']] = Point(point.latitude, point.longitude)
fix_geometry(fixed_day, [Morocco, Switzerland, Georgia])
# And now we can plot again, correctly this time.
cases_sorted = fixed_day.sort_values(by='num_sequences').drop('date', axis=1)
cases_sorted.explore(column='num_sequences',
tooltip='location',
popup='num_sequences',
marker_type='circle_marker',
marker_kwds={'radius':10})
Now that we've figured that out, we can map total cases for each country in the data set. Although it's worth noting that our covid data seems to be a bit off somehow, more on that later.
# Same process as before, just not restricting on a single day.
fixed_sequence_type = omicron_cases.astype({'num_sequences_total': 'int64'})
total_country_cases = fixed_sequence_type.loc[:, ['location', 'num_sequences_total']].groupby('location').sum()
total_country_cases = total_country_cases.reset_index()
total_locations = geopandas.tools.geocode(total_country_cases.location)
total_country_cases = total_country_cases.assign(geometry=total_locations['geometry']) # Fill the Geometry column
fix_geometry(total_country_cases, [Morocco, Switzerland, Georgia]) # We can still fix these three points
total_country_cases.explore(column='num_sequences_total',
tooltip='location',
popup='num_sequences_total',
marker_type='circle_marker',
marker_kwds={'radius':10})
So not too bad so far, but there a still a few things we could improve.
Before we get to that though, let's talk about our covid data first. This dataset comes from kaggle, which gave a nice 10.0 score for usability, which is why I selected it. The individual day map could have just looked odd based on the reporting available that day, but looking at this full map it's clear something is off. This data was apparently generated through a webscraping script, but more than that is not clear to me. As it is, this shows the plotting and geocoding capabilities I wanted to discuss in this tutorial, but I would caution that these maps obviously aren't useful outside of that.
Moving on to what we can improve with our plots: it would be nice if we could adjust the size of each circle to reflect the number of cases. Technically this is possible, but it requires creating the follium map from scratch and looping through to set the marker size based on the case numbers. If we still want something a bit clearer to look at, we can take advantage of an included dataset in GeoPandas. naturalearth_lowres
will provide us with country shapes to map. This dataset originally comes from here, where more detailed data with different features can also be found, but naturalearth_lowres
will be fine for demonstration purposes here.
world = geopandas.read_file(geopandas.datasets.get_path('naturalearth_lowres'))
world.head()
pop_est | continent | name | iso_a3 | gdp_md_est | geometry | |
---|---|---|---|---|---|---|
0 | 920938 | Oceania | Fiji | FJI | 8374.0 | MULTIPOLYGON (((180.00000 -16.06713, 180.00000... |
1 | 53950935 | Africa | Tanzania | TZA | 150600.0 | POLYGON ((33.90371 -0.95000, 34.07262 -1.05982... |
2 | 603253 | Africa | W. Sahara | ESH | 906.5 | POLYGON ((-8.66559 27.65643, -8.66512 27.58948... |
3 | 35623680 | North America | Canada | CAN | 1674000.0 | MULTIPOLYGON (((-122.84000 49.00000, -122.9742... |
4 | 326625791 | North America | United States of America | USA | 18560000.0 | MULTIPOLYGON (((-122.84000 49.00000, -120.0000... |
Unfortunately for us, world['name']
and total_country_cases['location']
don't exactly match up. The natural earth dataset does come with an iso country code identifier though, which will give us a better chance of matching entries, if only we can convert our covid dataset to country codes. We can use a nice little module called pycountry
to do this.
import pycountry
mapping = {country.name: country.alpha_3 for country in pycountry.countries}
total_country_cases['country_codes'] = total_country_cases['location'].replace(to_replace=mapping)
# This shows how GeoPandas can store two different geometries for each entry and switch between them.
combined_df = total_country_cases.merge(world, left_on='country_codes', right_on='iso_a3')
combined_df = combined_df.set_geometry('geometry_y')
combined_df.head()
location | num_sequences_total | geometry_x | country_codes | pop_est | continent | name | iso_a3 | gdp_md_est | geometry_y | |
---|---|---|---|---|---|---|---|---|---|---|
0 | Angola | 1055 | POINT (17.56912 -11.87758) | AGO | 29310273 | Africa | Angola | AGO | 189000.0 | MULTIPOLYGON (((12.99552 -4.78110, 12.63161 -4... |
1 | Argentina | 8411 | POINT (-64.96728 -34.99650) | ARG | 44293293 | South America | Argentina | ARG | 879400.0 | MULTIPOLYGON (((-68.63401 -52.63637, -68.25000... |
2 | Australia | 47199 | POINT (134.75500 -24.77611) | AUS | 23232413 | Oceania | Australia | AUS | 1189000.0 | MULTIPOLYGON (((147.68926 -40.80826, 148.28907... |
3 | Austria | 12580 | POINT (14.12456 47.59397) | AUT | 8754413 | Europe | Austria | AUT | 416600.0 | POLYGON ((16.97967 48.12350, 16.90375 47.71487... |
4 | Bangladesh | 3700 | POINT (90.29344 24.47693) | BGD | 157826578 | Asia | Bangladesh | BGD | 628400.0 | POLYGON ((92.67272 22.04124, 92.65226 21.32405... |
It's worth noting here that we've stored two different geometries for each entry into one GeoDataFrame now. This is one of the really nice parts of working with GeoPandas, we can clean and add features to our data just as we would in a normal data frame, and all we have to do if we want to operate on different geometries is set a different column as our "active geometry".
combined_df.explore(column='num_sequences_total',
tooltip='location',
popup='num_sequences_total',
marker_type='circle_marker',
marker_kwds={'radius':10})
Clearly we still missed a few countries available to us in the covid data set, but we've shown the initial steps in matching up map shapes to other data. Any set with an iso label feature, and we'd have a very easy time displaying it.
The large amount of dependencies, and their specific version requirement,s has been a hurdle to using GeoPandas so far, though this is understandable when the main advantage of the library is packaging together smaller modules in to something more coherently usable. The conda dependency manager wasn't able to easily install GeoPandas into any of my previously existing environments, and running conda create -n temp-env geopandas
produced an environment with 119 packages. Obviously not the end of the world and many of them are packages that you would have up and running in your environment anyway, but it lends to dependency conflicts and long solves from the package manager. If you're only using GeoPandas to create visualizations, this is of course not as much of an issue, as you can just make a standalone environment, import your data, and save your map. Otherwise though it's probably better to make sure this is added in first.
More significantly, the geocoding ability is somewhat limited. .geocode
calls in GeoPy will accept keyword arguments to help you select an appropriate point. Unfortunately, the GeoPandas .geocode
method does not allow passing these arguments into the query, only parameters for the initialization of the Geocoder object. Since GeoPy is a needed dependency to use geocoding in GeoPandas, this isn't the end of the world, but in that case why bother with the GeoPandas version if it's more restrictive. In the above example, I had to use a separate GeoPy Geocoder to correct the locations for Georgia, Switzerland, and Morocco (using exactly_one = False
which was not able to be passed in the GeoPandas version), since the initial GeoPandas call confused those locations for ones in the US
If you have data that has geometries already associated with it, then GeoPandas provides a nice way to manipulate and display it. If you need to create geometry data for your project, it's a bit more difficult to manage. In that situation I'd view GeoPandas more as a way to get going and then transition to using the modules it's based on to come up with a final product. Notably though it was still possible to see that something was up with our covid data even when we couldn't display all of it, let alone display it in the best way.