Prediction of air pollution over Bogota in C with LibCapy

A few weeks ago I completed the online course "AI for Good" on Coursera. The course by Robert Monarch was interesting and well done, with several real-world case studies and applied learning projects. These projects are provided as ready-to-use Jupyter notebooks. That's convenient but left me unsatisfied: I'd rather learn by coding it myself. So I decided to reproduce one of the projects from scratch, in C, using my LibCapy library. In this article I'll briefly describe how I did it and the results I obtained.

The project I chose is the one from the "Public Health" section of the course. It consists of predicting the air pollution over the city of Bogota during the year 2021. The data are provided as two CSV files, one for the measurements of the meteorological stations and another for their locations. To make them compatible with LibCapy I had to add a header describing the data format, as follows:

Other preprocessing of the data consisted of converting the encoding to ASCII as my library doesn't support others, and removing the station "BOS" from the second dataset as it never appears in the measurements of the first dataset.

The first task was to plot the density distribution of the pollutants. This is easily achieved with the plotAllDensityDistributionsGivenCatField method of the CapyGraphPlotter class and gives the expected results, as follows:

CapyGraphPlotter is a new class of the freshly released LibCapy v0.8.0. It contains the bare minimum of features needed to render the graphics for this project. The font for text rendering, handmade by myself of course, is recycled from an ancestor of LibCapy, and I was pretty happy to have it done and ready to reuse. It's not pretty but it does its job! Values along the axes are the feature I miss the most; however, as they are not essential for this project, I left them for a future version.

The next task was to plot the correlation graphs, which can be done with the plotAllCorrelationGraph method of CapyGraphPlotter. In the course, the correlation matrix and the correlation plots are shown as two separate graphs. As both the matrix and the grid of plots are triangular, I thought it would be nice to combine them into a single graphic. I had no idea what to put on the diagonal. If you have a suggestion for some data that would fit well there, let me know!

As in the course, I'm using the Pearson correlation coefficient here. But I've many times wondered whether there might be a better metric for correlation, so it was time to dig a bit in that direction. I found that there is also the distance correlation, which immediately looked much more appealing to me. I added it to LibCapy's features and rendered the correlation graphs again.

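The downside of distance correlation compared to the Pearson correlation coefficient is that it's awfully slow to compute: a direct implementation compares every pair of samples, so its cost grows quadratically with the number of samples, while Pearson's grows linearly. However, the results are really worth it. On the graph above we see much more clearly the strong correlation between NO and NOX, while the dubious correlations between other pollutants have faded away.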

The next task was to display a map of Bogota with the locations of the stations and a representation of their data. I added a new class, CapyGeoMap, to handle the geographic data, and I used the already existing CapyDisplay and CapyPen to add the date, the stations' locations, their names and their measurements with color encoding (an empty circle for missing data). For the background map, I had no data available and needed to find some. OpenStreetMap saved the day!

From there, the task was to predict the missing data in the station measurements. Two baseline models are used: the last known value, and an approximation based on the measurement of the nearest station. The listing below shows the results I obtained, similar to those of the course:
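Both baselines are simple enough to sketch in a few lines. The code below is an illustration under my own assumptions, not the code used for the results: missing measurements are encoded as NAN, station positions are plain planar coordinates, and the names Station, fillLastKnown and nearestStationValue are made up.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

// Baseline 1: replace each missing value (NAN) by the last known value
// of the same series. Leading missing values have no predecessor and
// stay NAN. Returns the number of values filled.
size_t fillLastKnown(double* values, size_t n) {
  size_t nbFilled = 0;
  double last = NAN;
  for (size_t i = 0; i < n; ++i) {
    if (isnan(values[i])) {
      if (!isnan(last)) {
        values[i] = last;
        ++nbFilled;
      }
    } else {
      last = values[i];
    }
  }
  return nbFilled;
}

// Planar position of a station (assumption: at city scale, treating
// coordinates as planar is good enough to rank distances).
typedef struct {
  double x;
  double y;
} Station;

// Baseline 2: measurement of the station nearest to station iStation
// among those with a valid (non NAN) value at the same timestamp.
// Returns NAN if no other station has a valid value.
double nearestStationValue(const Station* stations, const double* values,
                           size_t nbStation, size_t iStation) {
  assert(iStation < nbStation);
  double best = NAN;
  double bestDist = INFINITY;
  for (size_t i = 0; i < nbStation; ++i) {
    if (i == iStation || isnan(values[i])) continue;
    double dx = stations[i].x - stations[iStation].x;
    double dy = stations[i].y - stations[iStation].y;
    double dist = dx * dx + dy * dy; // squared distance preserves the ranking
    if (dist < bestDist) {
      bestDist = dist;
      best = values[i];
    }
  }
  return best;
}
```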

To improve on these baseline models, the course moves on to training a neural network to predict the missing values. LibCapy already had a CapyPredictor class (implementing SVM and neural networks, see these previous articles: 1, 2), however only categorical outputs were supported. I therefore added support for numerical outputs and trained a model similar to the one used in the course (two hidden layers of respectively 64 and 32 nodes, with ReLU activation).
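To picture the architecture, here is a minimal sketch of the forward pass of such a network: two ReLU hidden layers of 64 and 32 nodes, followed by a single linear output for the numerical prediction. It is not the CapyPredictor code; the Regressor structure, the row-major weight layout and the function names are assumptions made for this illustration, and training is left out entirely.

```c
#include <assert.h>
#include <stddef.h>

enum { NB_HIDDEN_1 = 64, NB_HIDDEN_2 = 32 };

// Fully connected layer:
//   out[j] = bias[j] + sum over i of weights[j * nbIn + i] * in[i]
// followed by ReLU (max(0, value)) when useRelu is non-zero.
void denseLayer(const double* in, size_t nbIn, const double* weights,
                const double* bias, double* out, size_t nbOut, int useRelu) {
  for (size_t j = 0; j < nbOut; ++j) {
    double sum = bias[j];
    for (size_t i = 0; i < nbIn; ++i) sum += weights[j * nbIn + i] * in[i];
    out[j] = (useRelu && sum < 0.0) ? 0.0 : sum;
  }
}

// Parameters of the network (hypothetical layout, weights stored row-major).
typedef struct {
  size_t nbInput;
  const double* w1; // NB_HIDDEN_1 x nbInput
  const double* b1; // NB_HIDDEN_1
  const double* w2; // NB_HIDDEN_2 x NB_HIDDEN_1
  const double* b2; // NB_HIDDEN_2
  const double* w3; // 1 x NB_HIDDEN_2
  const double* b3; // 1
} Regressor;

// Forward pass: input -> 64 (ReLU) -> 32 (ReLU) -> 1 (linear).
double regressorPredict(const Regressor* net, const double* input) {
  assert(net != NULL && input != NULL);
  double hidden1[NB_HIDDEN_1];
  double hidden2[NB_HIDDEN_2];
  double output = 0.0;
  denseLayer(input, net->nbInput, net->w1, net->b1, hidden1, NB_HIDDEN_1, 1);
  denseLayer(hidden1, NB_HIDDEN_1, net->w2, net->b2, hidden2, NB_HIDDEN_2, 1);
  denseLayer(hidden2, NB_HIDDEN_2, net->w3, net->b3, &output, 1, 0);
  return output;
}
```

The only structural difference from a classifier is the last layer: one node without activation, so the output can take any real value.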

The first results were disappointing. Looking for what could be improved in my implementation of neural networks and their training with backpropagation, I learnt about the importance of data normalization. I was already doing some kind of normalization, but it was a simple min-max normalization. On the graphs above we clearly see that the data are all concentrated toward the minimum, and I guess that's why min-max normalization is a poor choice here. I then extended my implementation to allow for other kinds of normalization, and after retraining my model with standardization (or z-score normalization) I obtained results on par with those of the course.

With a good prediction model it was now possible to fill in the missing measurements. To try something a bit different from the course, instead of interpolating the missing data for the other pollutants, I used the nearest neighbour method. I imagine that would be closer to what a final product would do: an app displaying the current pollutant levels, hence using only the current measurement at each station. That doesn't really matter here, especially as it is only a training project. In reality the stakeholders would have the last word about it.

A new map of the city with no missing data can then be generated.

The final step consisted of creating an animation of the air pollution over the whole city. I don't know how the values between the stations are interpolated in the course, and I wanted to take the opportunity to study the options for accomplishing such a task. I learnt in particular about what's called Kriging. It ended up being another rabbit hole, and as I was pressed for time by the release of LibCapy v0.8.0, I finally used a simple inverse square distance weighting method to generate the final result: a video of the pollution level over the city for the whole year 2021 (one day per second).
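Inverse square distance weighting fits in a dozen lines: the value at a point is the average of the station values, each weighted by one over its squared distance to the point. The sketch below is an illustration, not the code behind the video: the Sample structure, the planar coordinates, the NAN encoding of missing values and the function name are all assumptions.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

// One station measurement at a planar position (hypothetical layout).
typedef struct {
  double x;
  double y;
  double value; // NAN if the measurement is missing
} Sample;

// Inverse square distance weighting: weighted average of the valid
// samples, with weight 1 / d^2 where d is the distance from (x, y) to
// the sample. Returns the sample value itself when (x, y) coincides
// with a sample, and NAN when there is no valid sample.
double interpolateInverseSquareDistance(const Sample* samples,
                                        size_t nbSample, double x, double y) {
  assert(samples != NULL || nbSample == 0);
  double sumWeight = 0.0;
  double sumValue = 0.0;
  for (size_t i = 0; i < nbSample; ++i) {
    if (isnan(samples[i].value)) continue;
    double dx = samples[i].x - x;
    double dy = samples[i].y - y;
    double dist2 = dx * dx + dy * dy;
    if (dist2 == 0.0) return samples[i].value; // avoid dividing by zero
    double weight = 1.0 / dist2;
    sumWeight += weight;
    sumValue += weight * samples[i].value;
  }
  if (sumWeight == 0.0) return NAN;
  return sumValue / sumWeight;
}
```

Calling this function for every pixel of the map, once per day, gives one frame of the animation. Unlike Kriging, it needs no model of the spatial correlation, which is what makes it a quick fallback.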

As a simple visual check, one can see that the CSE station is the one recording the highest pollution level on average in both the video and the graphs.

I'm glad I've been able to reproduce the results of the course with nothing but my own library. It was the occasion to add a few more features, and LibCapy is growing steadily. It was also a good excuse and motivation to learn a few more things besides what I learnt during the course. I initially planned to include more code snippets in this article, but as I was rushing to synchronize v0.8.0 with the end of this project, the code I ended up with has some super dirty portions, so I'll refrain from sharing it. Sorry!