Using GeoPandas to Plot and Manipulate Geographic Data

While working on a recent project, I wanted to map our data for our presentation. Too late in the process I came across GeoPandas, and I wasn't able to get it working in time. Thankfully another team member was skilled in Tableau (thanks Abass!) and got the map done, but I still was interested in learning more about GeoPandas for future use. Before reading the rest of this post I'd recommend taking a look at the Introduction on the GeoPandas site, as that's what I started with.

What is it and why?

GeoPandas is a Python library for dealing with geographic and spatial data. Rather than using multiple libraries to import, clean, and plot this type of data, GeoPandas bundles all of this functionality for a more seamless experience. In this tutorial, I'll show how to get a basic notebook up and running with GeoPandas and stitch a few of its more interesting features together.

Much of the functionality available in GeoPandas could be accomplished by combining Cartopy, Pandas, and GeoPy. Unsurprisingly, those last two are dependencies of GeoPandas (though I believe GeoPy is optional). As the name implies, GeoPandas is set up primarily to use and mimic Pandas, best exemplified by GeoSeries and GeoDataFrame being subclasses of the equivalently named Pandas structures. Effectively this gives you all the power and versatility of working with data in Pandas, without needing any conversions beforehand.

Importing Data

To start with, we'll take this popular dataset of Covid cases from Kaggle. There are entries for each country, on different days, for different variants. We'll use GeoPandas to plot just the Omicron cases for one day first, and then all the cases the dataset has. The accompanying repository for this post already has the data in a CSV file, but Kaggle has easy download instructions on their site.

Because each entry in the data contains a value for total cases in a specific country on a specific day, we only need to select for one variant in order to have total cases available to use.
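Here is a minimal sketch of that selection. The column names `date`, `variant`, `num_sequences`, and `num_sequences_total` are my assumptions about the CSV layout (only `location` is named later in this post), and the small inline frame with made-up numbers stands in for the real file:

```python
import pandas as pd

# Stand-in rows with made-up numbers; in practice this would come from
# pd.read_csv on the Kaggle file. Column names other than "location" are
# assumptions about the file layout.
covid = pd.DataFrame(
    {
        "location": ["Georgia", "Georgia", "Morocco", "Morocco"],
        "date": ["2021-12-27", "2021-12-27", "2021-12-27", "2021-12-27"],
        "variant": ["Delta", "Omicron", "Delta", "Omicron"],
        "num_sequences": [3, 7, 1, 4],
        "num_sequences_total": [10, 10, 5, 5],
    }
)
covid["date"] = pd.to_datetime(covid["date"])

# Every variant row for a given country and day repeats the same total,
# so keeping a single variant leaves one row per country per day.
omicron = covid[covid["variant"] == "Omicron"]

# One day's worth of Omicron rows, ready to be given point geometries.
one_day = omicron[omicron["date"] == pd.Timestamp("2021-12-27")].reset_index(drop=True)
```

Filtering on the variant first and the date second keeps the `omicron` frame around for the all-days map later on.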

Now that we have the Covid data we want, let's get some geographic data to plot. We'll use GeoPandas's geocoding ability. Geocoding is a way to get geographic point data by searching for a place name in a database like OpenStreetMap. GeoPandas does this through GeoPy, using very similar syntax, though with a few issues, as we shall see.
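A rough sketch of what the geocoding call can look like. The function name `geocode_countries` and the `user_agent` string are hypothetical, and choosing the Nominatim (OpenStreetMap) provider is my assumption; the call needs GeoPy installed and network access, so it is wrapped in a function rather than run here:

```python
from geopandas.tools import geocode


def geocode_countries(names):
    """Return a GeoDataFrame with one point geometry and address per place name.

    `names` is a list or Series of strings. Extra keyword arguments to
    geocode() are passed to the GeoPy geocoder's constructor, not to the
    individual queries.
    """
    return geocode(
        names,
        provider="nominatim",               # assumed provider (OpenStreetMap)
        user_agent="geopandas-covid-demo",  # hypothetical identifier required by Nominatim
    )


# Example (needs network access):
# country_points = geocode_countries(["Georgia", "Morocco", "Switzerland"])
```

The result keeps the index of the input, which makes it straightforward to line the points up with the rows they were looked up for.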

Now that we've loaded the country points, let's add them to our Covid data frame and create a map. GeoPandas has a very nice .explore() method on its GeoDataFrame objects. This uses a library called folium, which itself uses the Leaflet.js library to create interactive maps. All GeoDataFrames have an "active geometry", which is the column used when geometric operations are applied to the data frame. In our case that's the column we filled with our point data.
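As a sketch of that step, the snippet below attaches points to a small stand-in frame and wraps the .explore() call in a helper. The coordinates are approximate placeholders rather than geocoder output, `num_sequences_total` is an assumed column name, and `show_map` is a hypothetical helper:

```python
import geopandas as gpd
import pandas as pd

# Stand-in case counts; "num_sequences_total" is an assumed column name.
cases = pd.DataFrame(
    {"location": ["Georgia", "Morocco"], "num_sequences_total": [10, 5]}
)

# Approximate placeholder coordinates (longitude, latitude); in the post
# these come from the geocoder.
points = gpd.points_from_xy([43.5, -7.1], [42.0, 31.8], crs="EPSG:4326")

# Passing geometry= makes the point column the active geometry.
cases_gdf = gpd.GeoDataFrame(cases, geometry=points)


def show_map(gdf):
    """Return an interactive folium map of the active geometry.

    explore() needs the optional folium, matplotlib, and mapclassify packages.
    """
    return gdf.explore(tooltip=["location", "num_sequences_total"])


# In a notebook, show_map(cases_gdf) renders the interactive map.
```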

Well, our nice interactive map was created, but why are there extra points in the U.S.? Here we run into a limitation of geocoding straight from GeoPandas: if the initial values our geocoder selects are wrong, it can be difficult to remedy that from GeoPandas alone. Below we'll quickly use GeoPy to fix those three misplaced points.

Now that we've figured that out, we can map total cases for each country in the dataset. It's worth noting, though, that our Covid data seems to be a bit off somehow; more on that later.
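The per-country totals can be sketched with a plain Pandas groupby. The frame below is stand-in data with made-up numbers, `num_sequences_total` is an assumed column name, and summing the per-day totals is my reading of "total cases":

```python
import pandas as pd

# Stand-in Omicron rows across two days; the numbers are made up.
omicron = pd.DataFrame(
    {
        "location": ["Georgia", "Morocco", "Georgia", "Morocco"],
        "date": pd.to_datetime(
            ["2021-12-20", "2021-12-20", "2021-12-27", "2021-12-27"]
        ),
        "num_sequences_total": [8, 2, 10, 5],
    }
)

# One row per country, with the totals summed over every day in the data.
total_country_cases = omicron.groupby("location", as_index=False)[
    "num_sequences_total"
].sum()
```

Using `as_index=False` keeps `location` as an ordinary column, which is convenient when merging geometries onto the totals afterwards.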

So not too bad so far, but there are still a few things we could improve.

Before we get to that though, let's talk about our Covid data first. This dataset comes from Kaggle, which gave it a nice 10.0 score for usability, which is why I selected it. The individual day map could have just looked odd based on the reporting available that day, but looking at this full map it's clear something is off. This data was apparently generated through a web-scraping script, but more than that is not clear to me. As it is, this shows the plotting and geocoding capabilities I wanted to discuss in this tutorial, but I would caution that these maps obviously aren't useful outside of that.

Moving on to what we can improve with our plots: it would be nice if we could adjust the size of each circle to reflect the number of cases. Technically this is possible, but it requires creating the folium map from scratch and looping through the rows to set each marker size based on the case numbers. If we still want something a bit clearer to look at, we can take advantage of a dataset included in GeoPandas. naturalearth_lowres will provide us with country shapes to map. This dataset originally comes from Natural Earth, where more detailed data with different features can also be found, but naturalearth_lowres will be fine for demonstration purposes here.

Unfortunately for us, world['name'] and total_country_cases['location'] don't exactly match up. The Natural Earth dataset does come with an ISO country code identifier though, which will give us a better chance of matching entries, if only we can convert our Covid dataset to country codes. We can use a nice little module called pycountry to do this.

It's worth noting here that we've now stored two different geometries for each entry in one GeoDataFrame. This is one of the really nice parts of working with GeoPandas: we can clean and add features to our data just as we would in a normal data frame, and all we have to do if we want to operate on different geometries is set a different column as our "active geometry".
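To make the "active geometry" idea concrete, here is a toy sketch with two geometry columns. The column names `points` and `shapes` and the simple box shapes are made up for illustration:

```python
import geopandas as gpd
from shapely.geometry import Point, box

# Two geometry columns per row: a point and a polygon.
gdf = gpd.GeoDataFrame(
    {
        "location": ["A", "B"],
        "points": gpd.GeoSeries([Point(0.5, 0.5), Point(2.5, 2.5)]),
        "shapes": gpd.GeoSeries([box(0, 0, 1, 1), box(2, 2, 3, 3)]),
    },
    geometry="points",  # the points start out as the active geometry
)

# set_geometry returns a new GeoDataFrame whose active geometry is the
# polygon column; the original frame is left unchanged.
by_shape = gdf.set_geometry("shapes")
```

Anything geometric, such as plotting, .explore(), or area calculations, then runs against whichever column is active, while both columns stay in the frame.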

Clearly we still missed a few countries available to us in the Covid dataset, but we've shown the initial steps in matching up map shapes to other data. With any dataset that has an ISO code column, we'd have a very easy time displaying it.

Summing Up

Limitations

The large number of dependencies, and their specific version requirements, has been a hurdle to using GeoPandas so far, though this is understandable when the main advantage of the library is packaging together smaller modules into something more coherently usable. The conda dependency manager wasn't able to easily install GeoPandas into any of my previously existing environments, and running conda create -n temp-env geopandas produced an environment with 119 packages. That's obviously not the end of the world, and many of them are packages that you would have up and running in your environment anyway, but it leads to dependency conflicts and long solves from the package manager. If you're only using GeoPandas to create visualizations, this is of course not as much of an issue, as you can just make a standalone environment, import your data, and save your map. Otherwise though, it's probably better to make sure GeoPandas is installed first.

More significantly, the geocoding ability is somewhat limited. .geocode calls in GeoPy will accept keyword arguments to help you select an appropriate point. Unfortunately, the GeoPandas .geocode method does not allow passing these arguments into the query, only parameters for the initialization of the Geocoder object. Since GeoPy is a required dependency for geocoding in GeoPandas, this isn't the end of the world, but in that case why bother with the GeoPandas version if it's more restrictive? In the above example, I had to use a separate GeoPy Geocoder to correct the locations for Georgia, Switzerland, and Morocco (using exactly_one=False, which could not be passed in the GeoPandas version), since the initial GeoPandas call confused those locations for ones in the US.

What should I use it for then?

If you have data that has geometries already associated with it, then GeoPandas provides a nice way to manipulate and display it. If you need to create geometry data for your project, it's a bit more difficult to manage. In that situation I'd view GeoPandas more as a way to get going, and then transition to using the modules it's based on to come up with a final product. Notably though, it was still possible to see that something was up with our Covid data even when we couldn't display all of it, let alone display it in the best way.


Thanks!

This repo has everything used in this post, including an environment file to create a conda environment with the same packages I used.

Works Referenced