In this article I report my attempt at solving the Fish Welfare Initiative challenge "Remote-Sensing Water Quality with Satellites".
Problem definition
The goal of the challenge is to use satellite imagery to predict as many as possible of four water quality indicators (dissolved oxygen, pH, ammonia, chlorophyll) in aquaculture ponds in India.
The required accuracies of the regression models for each indicator are:
 | do (mg/L) | ph | am (mg/L) | ch (\(\mu\)g/L)
\(R^2\) | \(\ge0.66\) | \(\ge0.71\) | \(\ge0.66\) | \(\ge0.73\) |
RMSE | \(\le1.62\) | \(\le0.13\) | \(\le0.08\) | \(\le59\) |
MAE | \(\le1.07\) | \(\le0.11\) | \(\le0.06\) | \(\le42\) |
where \(R^2\) is the coefficient of determination, RMSE the root mean square error, and MAE the mean absolute error, defined as follows:
$$
\begin{array}{l}
MAE=\frac{\sum_{i\in[1,n]}|y_i-y'_i|}{n}\\
RMSE=\sqrt{\frac{\sum_{i\in[1,n]}(y_i-y'_i)^2}{n}}\\
R^2=1-\frac{\sum_{i\in[1,n]}(y_i-y'_i)^2}{\sum_{i\in[1,n]}(y_i-\hat{y})^2}\\
\end{array}
$$
where \(n\) is the number of predictions, \(y_i\) is the true value of the indicator \(y\) for prediction \(i\), \(y'_i\) is the predicted value of the indicator \(y\) for prediction \(i\), and \(\hat{y}=\frac{\sum_{i\in[1,n]}y_i}{n}\) is the mean of the true values.
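In numpy these definitions translate directly into code (a minimal sketch; `y_true` and `y_pred` are hypothetical arrays of true and predicted values):

```python
import numpy as np

def mae(y_true, y_pred):
    # Mean absolute error.
    return np.mean(np.abs(y_true - y_pred))

def rmse(y_true, y_pred):
    # Root mean square error.
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def r2(y_true, y_pred):
    # Coefficient of determination: 1 - SSE / total variance.
    sse = np.sum((y_true - y_pred) ** 2)
    sst = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - sse / sst
```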
If the predictor were a classifier, according to this document and this document, the OK/NG thresholds would be:
 | optimal range | required range
do (morning) | 4-5mg/L | 3-5mg/L |
do (evening) | 8-10mg/L | 8-12mg/L |
ph | 7-8 | 6.5-8.5 |
am | <0.15mg/L | <0.5mg/L |
ch | 100-150\(\mu\)g/L or 150-220\(\mu\)g/L | ?
FWI provides the following data:
This challenge is the second iteration on this problem. The first iteration didn't produce models meeting the accuracy requirements, hence this second one. As the efforts so far were unsuccessful, there are concerns about whether a prediction model is possible at all, but FWI still hopes it may be (cf this blog post).
It looks like a very difficult problem: several others have already tried and failed, and I have no particular experience with this kind of problem. My hopes of solving this challenge are extremely low, but at least it is a very interesting real-world problem to keep learning about remote sensing and prediction models, in continuity with the Omdena challenge about deforestation I did earlier this year (cf here) and the online course about air pollution prediction I did last year (cf here), and a good occasion to use and improve LibCapy along the way!
Related works
This webpage gathers all the articles from FWI related to satellite imagery.
FWI has shared the slides of a prior attempt to predict chlorophyll from NDCI calculated from Sentinel imagery. This attempt failed to reach the requirements. It used machine learning (the slides give no details) and a nearest-neighbour lookup approach (pixel NDCI value matched to the nearest in-situ measurement).
FWI has performed a study (slides) of the difference between measurements at the edge and measurements in the interior of the ponds. They found a difference that is significant in absolute value but not significant for the purpose of their program.
FWI has shared the final write-up of an apparently very successful attempt to create a classifier for "in-range/out-of-range indicators" using different data (not satellite imagery).
FWI has published a report about dissolved oxygen challenges in aquaculture systems. It shows the relation between fish welfare, dissolved oxygen, ammonia and chlorophyll, and also with other factors such as fish density and weather.
FWI has published a report about their first attempt to predict indicators from satellite imagery. Although the results from this first attempt were very good, they later couldn't be reproduced, which raised skepticism (cf. this blog post).
In this post FWI shares some internal thoughts about the project and discusses another possible approach: forecasting models.
In this article the authors summarize, from many studies, the challenges and benefits of remote sensing in water monitoring. The interesting information from the perspective of the FWI challenge is as follows. They report a formula (requiring calibration) to calculate chlorophyll concentration from spectral reflectance at 460nm, 490nm and 520nm. They indicate a correlation of dissolved oxygen with temperature and chlorophyll. And they confirm that satellite imagery plus ANN is an interesting approach, but with many challenges to overcome.
In this article the authors show that dissolved oxygen in coastal sea water can be accurately predicted from temperature and chlorophyll data from the MODIS and VIIRS datasets using regression models.
In this article the authors show that dissolved oxygen in rivers can be accurately predicted from bands B5 and B8 of the Sentinel-2 dataset using random forests or support vector machines.
In this article the authors show that dissolved oxygen in coastal sea water can be predicted with relatively good \(R^2\) from RGB bands of the Landsat-8 dataset using a polynomial of order 3.
In this article and this article the authors show that dissolved oxygen in lakes can be accurately predicted from temperature, clarity and chlorophyll data derived from the bands of the MODIS and Himawari-8 datasets, using extreme gradient boosting and a multimodal deep neural network.
In this article the author calculates the chlorophyll density in Indian lakes directly from the Sentinel-2 band ratio (B5+B6)/B4. In this article the authors use the same band ratio for the same purpose in Turkish coastal water. However, in this article the authors show that several band ratios can be used, and that the best one varies with the water body considered and should be decided on a case by case basis.
In this article the authors use yet another band ratio to predict chlorophyll and argue that the top-of-atmosphere (TOA) reflectance dataset should be preferred to the bottom-of-atmosphere (BOA) one for shallow water bodies.
Ponds coordinates and area
First, I prepare the coordinates data in a format compatible with LibCapy. For verification and better understanding I make a visualisation of the pond locations. I reuse the tool I developed during the Omdena challenge to create a composite RGB image of the two districts, and I plot a blue circle for each pond on that image. The center of each circle is supposed to match the center of the pond (GPS coordinates in the dataset) and its radius is such that the circle's area equals the pond area (provided with the coordinates). (Sentinel-2 imagery, 10m/px, click on images to enlarge)
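For reference, the radius comes from inverting the disc area formula; a sketch of the conversion from pond area in acres to a radius in Sentinel-2 pixels (function and constant names are mine):

```python
import math

ACRE_IN_M2 = 4046.86   # one acre in square meters
M_PER_PIXEL = 10.0     # Sentinel-2 RGB bands resolution

def pond_radius_px(area_acres):
    # Radius of the disc whose area equals the pond area,
    # expressed in Sentinel-2 pixels.
    area_m2 = area_acres * ACRE_IN_M2
    radius_m = math.sqrt(area_m2 / math.pi)
    return radius_m / M_PER_PIXEL

# Example: the smallest pond (0.13 acres, id 1_73_2) gives ~1.3px.
print(pond_radius_px(0.13))
```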
The range of coordinates for each region is as follows:
 | Longitude | Latitude
Eluru | 81.012177,81.271376 | 16.574134,16.743473 |
Nellore | 80.075645,80.151875 | 14.446507,14.529491 |
The locations and sizes of the circles seem to match the underlying ponds, hence my parsing of the coordinates and areas seems correct. However, there are also some ponds for which the center doesn't match a pond center in the imagery.
This paper suggests subpixel accuracy for Sentinel-2 image registration. One can then assume that when the pond center in the data doesn't match the pond center in the imagery, the measurement is to blame (GPS inaccuracy, or a measurement location not actually at the center of the pond despite being introduced as such). So I assume that the data coordinates do match the pond they refer to, and that the shift in the imagery comes from the measurement having been taken only 'roughly' at the center of the pond (an assumption reinforced by some images of ponds in FWI blog posts: 1, 2).
Another observation about the pond maps is that some ponds are extremely small relative to the Sentinel-2 spatial resolution (in particular in the Nellore district).
The smallest pond in the dataset has an area of 0.13 acres (id: 1_73_2). That's 526 square meters, or about 22x22m, or at most 2x2px in the Sentinel-2 imagery. We saw we can assume the single pixel at the data coordinates to be a measure of the correct pond, but using several pixels around those coordinates (with a Gaussian blur) would reduce possible noise and give a better representation of the pond as a whole. From that perspective the smallest ponds are too small, and the shift of the coordinates away from the real center is worrying (risk of using irrelevant pixels outside the pond).
The distribution of sizes is as follows:
However the smallest pond is the only one with a size equivalent to less than 3x3 pixels. I therefore decide to use a 3x3 pixel (30x30m) area around the given pond coordinates, apply a Gaussian blur (radius 1, strength 0.5), and use the value of the center pixel at coords (1,1), as sketched below.
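In OpenCV terms the extraction looks like this (my actual code uses LibCapy, and the blur parameters may not map exactly; `band`, `row` and `col` are hypothetical):

```python
import numpy as np
import cv2

def pond_value(band, row, col):
    # Extract the 3x3 pixel patch centered on the pond coordinates,
    # smooth it, and return the center pixel.
    patch = band[row - 1:row + 2, col - 1:col + 2].astype(np.float32)
    # 3x3 kernel (radius 1), sigma 0.5.
    blurred = cv2.GaussianBlur(patch, (3, 3), 0.5)
    return blurred[1, 1]
```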
Measurements data
Next I prepare the measurements dataset. It includes a lot of information, some of it irrelevant (e.g. who made the measurement) or unusable (e.g. behaviour indicators, in the sense that they can't be remotely acquired from a satellite). I choose to retain the following measurements: coord, date, temperature, and of course the 4 indicators to be predicted. I also initially kept the 'equipment' and 'treatment' fields, but later decided to ignore them.
I remove three samples which I consider to be anomalous:
This gives for each predicted indicator the following info:
indicator | samples quantity | value range | date range
do | 4805 | 0.0,27.1 | 2021-07-12 06:15:00,2024-10-31 17:18:00 |
ph | 4806 | 5.3,9.54 | 2021-07-12 06:15:00,2024-10-31 17:18:00 |
am | 2774 | 0.0,3.0 | 2021-07-29 06:20:00,2024-10-31 07:32:00 |
ch | 96 | 7.54,225.13 | 2023-04-25 06:48:00,2023-06-09 18:01:00 |
Note that chlorophyll has very few samples (over less than 2 months, in only 6 ponds of the Eluru district).
The correlation plots and distance correlations for all measures and indicators look like this:
The correlation plots and values show only low correlation between indicators and available measures.
The density distributions for each indicator look like this (bin size set to approximately match the MAE requirements):
Indicator values are all very concentrated. From the machine learning perspective we would like a flat density distribution (all possible values equally represented). Simply returning the median value would be sufficient to stay within the MAE requirement almost all the time for chlorophyll and most of the time for ammonia. This will make it more difficult to judge how good the prediction models are. Another thing to note: the dissolved oxygen density plot seems to have a bimodal shape, which can also be seen in the correlation plots.
Baseline models
The analysis of the data directly gives us the recipe for baseline models: simply return the median value for each indicator. Evaluating these baseline models gives the following results:
As predicted by the data analysis, these baseline models return apparently good values (low MAE and RMSE, not terribly far from the requirements, or even within them for chlorophyll), but of course have no predictive power (terrible \(R^2\)).
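For reference, a minimal sketch of such a baseline, reusing the hypothetical metric helpers from the problem definition:

```python
import numpy as np

def baseline_metrics(y):
    # Predict the median of the indicator for every sample and
    # evaluate with the metric helpers sketched earlier.
    y = np.asarray(y, dtype=float)
    pred = np.full_like(y, np.median(y))
    return mae(y, pred), rmse(y, pred), r2(y, pred)
```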
ANN models from in-situ data
As a first attempt to improve upon those dumb baseline models I try an artificial neural network with inputs 'coord', 'date' and 'temperature'. For 'coord' I use the GPS coordinates (longitude/latitude). For 'date' I use cos/sin encodings of the day in the year and the time in the day (see the sketch after the table below). This makes 7 numerical inputs, and the number of samples becomes as follows:
indicator | samples quantity |
do | 4800 |
ph | 4804 |
am | 2773 |
ch | 96 |
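A sketch of the cos/sin date encoding described above (the exact encoding I use may differ in details):

```python
import math
from datetime import datetime

def encode_date(dt):
    # Cyclical encodings: day of year and time of day are each mapped
    # onto the unit circle, so that Dec 31 stays close to Jan 1 and
    # 23:59 stays close to 00:00.
    day = 2.0 * math.pi * (dt.timetuple().tm_yday - 1) / 365.0
    tod = 2.0 * math.pi * (dt.hour * 3600 + dt.minute * 60 + dt.second) / 86400.0
    return (math.cos(day), math.sin(day), math.cos(tod), math.sin(tod))

# Example: a morning measurement on 2023-04-25 at 06:48.
print(encode_date(datetime(2023, 4, 25, 6, 48)))
```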
The correlation plots and distance correlations for all inputs and indicators look like this:
I use the following parameters to train the ANNs (the architecture is chosen to obtain good results over the whole dataset and consistent results between training and validation during cross validation):
architecture: | ANN, 1 layer of 5 nodes for chlorophyll, 2 layers of 10 nodes each for others, relu activation |
feature scaling: | standard normalisation |
loss function: | huber |
optimiser: | adam (0.9, 0.999) |
learning rate: | 0.001 |
training time: | 120s |
cross validation: | 10 fold |
If using only 'coord' and 'date', the results become as follows.
If using only 'coord' and 'temperature', the results become as follows.
If using only 'coord', the results become as follows.
Putting all these results together:
 | MAE | RMSE | \(R^2\)
baseline | 3.13 | 3.96 | -0.25 |
coord | 3.207 | 3.542 | 0.003 |
coord,date | 0.992 | 1.517 | 0.817 |
coord,temperature | 2.480 | 3.043 | 0.264 |
coord,date,temperature | 0.957 | 1.454 | 0.832 |
requirement | <1.07 | <1.62 | >0.66 |
For dissolved oxygen, the coordinates do not help the prediction. Temperature has a slightly stronger positive influence. The date, on the other hand, has a strong impact (as expected from the correlation plots), enough to bring the prediction within range for all three requirements. The best accuracy is obtained when using both date and temperature.
For chlorophyll, date is the only input having an influence, slightly improving the prediction, with MAE and RMSE always within requirements. However \(R^2\) is still far from its requirement.
For pH, date and temperature both improve the prediction, and the best accuracy is reached when using both. However it is insufficient to meet the requirements.
For ammonia, coordinates, date and temperature all slightly improve the prediction, and the best accuracy is obtained when using them all together. However it is still insufficient to meet the requirements.
For all indicators, the good stability in cross validation suggests they would perform with similar accuracy in practice.
Temperature data acquisition
The temperature seems to have a positive impact on the prediction of 3 out of 4 indicators. It would then be interesting to have a way to retrieve it automatically.
Satellite data like the MOD21A1D.061 dataset available on GEE provides a daily average temperature at 1km spatial resolution. The spatial resolution is probably enough (it sounds reasonable to me to expect the temperature to not vary much within 1km because the considered areas are not mountainous), but I'm not sure of how much impact using the average temperature over one day has.
Another solution is OpenWeatherMap, which provides an easy-to-use API, free up to a certain number of calls per day. After registration one can retrieve the temperature at a location (long/lat) and time (unix timestamp, beware of the timezone) with an API call similar to the following (replace 'xxx' with your API key):
curl --silent "https://api.openweathermap.org/data/3.0/onecall/timemachine?lat=16.63&lon=81.15&dt=1685586656&units=metric&appid=xxx" | jq '.data[0].temp'
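The same query in Python, with the timezone conversion that bit me elsewhere (a sketch; requires Python 3.9+ for zoneinfo and the requests package):

```python
from datetime import datetime
from zoneinfo import ZoneInfo
import requests

def air_temperature(lat, lon, local_dt, api_key):
    # Convert the local measurement time (Asia/Kolkata) to a unix
    # timestamp, which the API interprets as UTC.
    ts = int(local_dt.replace(tzinfo=ZoneInfo("Asia/Kolkata")).timestamp())
    url = "https://api.openweathermap.org/data/3.0/onecall/timemachine"
    params = {"lat": lat, "lon": lon, "dt": ts,
              "units": "metric", "appid": api_key}
    reply = requests.get(url, params=params, timeout=10).json()
    return reply["data"][0]["temp"]

# Example (same endpoint as the curl call above):
# air_temperature(16.63, 81.15, datetime(2023, 6, 1, 7, 20), "xxx")
```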
As far as I understand, the temperature measurement in the FWI dataset is the water temperature. OpenWeatherMap provides the air temperature, so a small difference is expected. The test below gives an idea of that difference, which seems reasonable. I should have done a more rigorous study of how the two temperatures are related, or maybe even made a predictor for that, but I haven't for lack of time. Training the ANN with API temperatures instead of FWI temperatures will also probably mitigate the influence of that difference.
Predictor as a web application
I already have enough to make a minimalist web app predicting the 4 indicators, so let's do it. Even if the results are not great and raise concerns, it is still possible to improve the predictors later, and I'll have the basis ready.
It lets the user input coordinates, date and time, and returns the temperature (automatically retrieved from OpenWeatherMap) and the predicted values for the 4 indicators (using the ANNs introduced in the previous paragraphs).
Satellite data
I now move on to the search for models using satellite imagery (instead of in-situ measurements) for remote monitoring of water quality. The documents about the challenge refer to Sentinel-2 imagery. I intend to use GEE to retrieve this imagery, because I have some experience working with this platform. Info about the Sentinel-2 dataset on GEE is available here.
The maximum available resolution of that imagery is sufficient to isolate at least one pixel corresponding to even the smallest pond (cf above). However the temporal resolution is only 5 days (not accounting for days unavailable due to cloud cover). To train a model with satellite inputs and in-situ measurement outputs it is also necessary to have matching sampling dates and times. It is even more important for this problem, as we've seen that the level of dissolved oxygen is strongly correlated with the hour of the day. Given such temporal resolution it is very unlikely that imagery samples exist matching in-situ samples within the same hour.
The harmonised Landsat and Sentinel dataset is a composite of Landsat and Sentinel imagery, which brings the temporal resolution to 2-3 days (not accounting for days unavailable due to cloud cover). That's better, but it also decreases the spatial resolution to 30x30m. Using the same criteria for filtering out too-small ponds (3x3 pixels, equivalent to 2.0 acres) would remove 31 ponds from the training data.
The Sentinel-3 dataset has a 2 day temporal resolution and 21 spectral bands, but only a 300m spatial resolution. It would be a great dataset for larger bodies of water, but not for the ponds considered here.
To estimate how much of a problem the temporal resolution is, I search for days having both satellite imagery and an in-situ measurement. As this is time and resource consuming, I do it first only for the pond with the maximum number of dissolved oxygen measurements: pond 1_1_1. And I do it twice, because stupid me initially forgot to convert from Indian time (Asia/Kolkata) to UTC.
About cloud cover, I choose to discard satellite images with a DN above 200 at the pond location, as a visual check on the RGB composite convinces me that it gives better results than relying on the CLOUD_COVER parameter. That could probably be improved with more rigorous methods, but that's a rabbit hole I don't have time to fall into. This article seems a good reference for later. A minimal sketch of this filter is given below.
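A sketch of that filter (testing the RGB bands is my assumption here; each image is taken as a dict of band name to 2D array):

```python
def usable_images(images, row, col, threshold=200):
    # Keep only the images whose DN at the pond location stays below
    # the threshold (bright pixels betray clouds).
    kept = []
    for img in images:
        if all(img[b][row, col] <= threshold for b in ("B4", "B3", "B2")):
            kept.append(img)
    return kept
```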
The results are as follows:
Composite RGB image from B4, B3, B2 bands, 500m wide centered on the pond coordinates:
The in-situ measurements have a (usable) satellite measurement on the same day 23 times (out of 161); the in-situ measurements occurred in the early morning or late afternoon while the satellite measurement occurred around noon, always at the same time. The RGB images give confidence that the data retrieval is working fine, and the cloud filtering looks OK. There is one arguable image, but I don't expect a perfect result: the center pixel (the only one used) is clear, and the image looks worse than it actually is due to normalisation. It looks like there may be enough data to work with once I run this over all the ponds, and I'll address the timing mismatch problem later. Good news!
Chlorophyll prediction from satellite data
From the related works, chlorophyll seems to be the "easiest" indicator to predict. Efficient methods have been found and documented by many others, in contexts very similar to the FWI challenge. If I can predict it correctly, it would also be a valuable input for the prediction of dissolved oxygen. I therefore decide to focus on that indicator first.
I scan through the Sentinel-2 imagery for matching days and cloud-free images. It gives me:
indicator | samples quantity | value range | date range |
ch | 9 | 69.26,166.20 | 2023-04-25 10:43:59,2023-05-25 10:43:59 |
That's so few samples that training an ANN looks hopeless. But the related works show that a regression on a well-chosen band ratio can give good predictions. So I decide to try that myself (it's the only option I have anyway).
I retrieve the bands B4, B3, B2, B8, B11, B5, B6, and calculate the indices NDVI, NDBI, EVI, NDCI, NDWI, MNDWI and NDTI (the choice comes from what I already had from the Omdena challenge, with additions influenced by the related works).
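For reference, the standard definitions of these indices from the Sentinel-2 bands (a sketch; my implementation may differ in minor details such as scaling):

```python
def indices(b2, b3, b4, b5, b6, b8, b11):
    # Standard index definitions (assumed; the exact formulas I used
    # may differ slightly). All inputs are reflectance values.
    return {
        "ndvi": (b8 - b4) / (b8 + b4),
        "ndbi": (b11 - b8) / (b11 + b8),
        "evi": 2.5 * (b8 - b4) / (b8 + 6.0 * b4 - 7.5 * b2 + 1.0),
        "ndci": (b5 - b4) / (b5 + b4),
        "ndwi": (b3 - b8) / (b3 + b8),
        "mndwi": (b3 - b11) / (b3 + b11),
        "ndti": (b4 - b3) / (b4 + b3),
    }
```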
About the time difference between the in-situ measure and the satellite data, well, there is not much I can do, so I decide to ignore it. Hopefully chlorophyll doesn't change much within a few hours; I'm no specialist at all but that sounds plausible to me. FWI also apparently takes only one measure per day, which suggests it doesn't.
As the related works suggest a case-by-case optimal ratio, I decide to perform an exhaustive search over all the combinations of the bands and indices I have. I consider all ratios of the form A/B or (A+B)/C. For the regression, also following the related works, I try for each ratio a polynomial of order 1, 2 and 3 (a sketch of the search is given below). About TOA and BOA, I tried both and couldn't find any difference.
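The search itself is simple enough to sketch (numpy assumed; my actual code is LibCapy based and differs in details):

```python
import itertools
import numpy as np

def best_ratio(features, y, order):
    # features: dict of name -> per-sample np.array.
    # Try every ratio A/B and (A+B)/C, fit a polynomial of the given
    # order, and keep the ratio with the best R^2.
    names = list(features)
    candidates = [((a,), b) for a, b in itertools.permutations(names, 2)]
    candidates += [((a, b), c)
                   for a, b in itertools.combinations(names, 2)
                   for c in names if c not in (a, b)]
    best = (None, None, -np.inf)
    for num, den in candidates:
        x = sum(features[n] for n in num) / features[den]
        coeffs = np.polyfit(x, y, order)
        pred = np.polyval(coeffs, x)
        r2 = 1.0 - np.sum((y - pred) ** 2) / np.sum((y - np.mean(y)) ** 2)
        if r2 > best[2]:
            best = ((num, den), coeffs, r2)
    return best
```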
This gives the following results:
order | ratio | polynomial | MAE | RMSE | \(R^2\)
1 | x=(evi+ndti)/ndbi | \(94.863720+0.536875x\) | 15.020041 | 20.477803 | 0.529485
2 | x=(red_edge_1+red_edge_2)/red | \(-1736.449638+1320.008671x-232.544017x^2\) | 13.875895 | 16.619855 | 0.690072
3 | x=(red+evi)/green | \(9593.149603-41891.410976x+60448.165645x^2-28514.787873x^3\) | 6.405667 | 7.051341 | 0.944211
To be honest, for the order 2 polynomial, landing right on the same ratio (B5+B6)/B4 as in the related works made me smile a bit. Then the result for order 3 got me worried: is that a real improvement, or just too good to be true? The plots of ratio vs value for orders 2 and 3 look like this:
I would really like to have more samples to confirm that result. If I relax the filtering on DN value to 500, it doubles the number of matching days and (B5+B6)/B4 is still the best ratio, but the measure becomes noisy and \(R^2\) drops significantly.
Using Sentinel-2 band names, the ratio x=(red_edge_1+red_edge_2)/red is \((B_5+B_6)/B_4\), and the ratio x=(red+evi)/green is \(x=(B_4+2.5(B_8-B_4)/(B_8+6B_4-7.5B_2+1))/B_3\).
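Applied as code, the order 2 predictor becomes the following (coefficients copied from the results table above; only meaningful on the narrow range it was fitted on):

```python
def chlorophyll_order2(b4, b5, b6):
    # Order-2 predictor from the results table, with x = (B5+B6)/B4.
    x = (b5 + b6) / b4
    return -1736.449638 + 1320.008671 * x - 232.544017 * x ** 2
```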
Updating chlorophyll in the web application
Let's integrate this chlorophyll predictor into the web application. The process flow becomes: the user inputs the location and date, the app retrieves the temperature from OpenWeatherMap, retrieves the satellite data from GEE, predicts the chlorophyll from the satellite data using the predictor of the previous paragraph, and predicts the 3 other indicators as before.
However the date chosen by the user is most likely not a date for which there is satellite data. So I decide to proceed as follows: the app automatically searches for the nearest date having satellite data within the 5 days before the user-selected date. If there is none (due to clouds) the prediction fails; if there is one, the retrieval of temperature and the prediction of indicators are done at the satellite date (which I display in the result). A sketch of this search follows.
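A minimal sketch of that search (`fetch` is a hypothetical callback returning the satellite data for a date, or None when there is no usable image):

```python
from datetime import timedelta

def nearest_satellite_date(user_date, fetch):
    # Search backward, up to 5 days, for a date with usable
    # satellite data (no pass or too cloudy -> None).
    for back in range(6):
        d = user_date - timedelta(days=back)
        data = fetch(d)
        if data is not None:
            return d, data
    return None, None
```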
And while I'm at it, as I now have the satellite data, I can add the satellite view of the pond with the result of prediction. Yahoo!
Dissolved oxygen prediction from satellite data and chlorophyll
Next I move to the prediction of dissolved oxygen. I now have the Sentinel-2 bands and indices, the OpenWeatherMap air temperature, and the predicted chlorophyll. After trying both, I decide to keep the order 2 and order 3 chlorophyll predictors in the training dataset and let the ANN decide which to use.
But here I have two new problems. First, dissolved oxygen varies a lot within a day, and for this reason FWI takes two measures (morning/afternoon); of course neither matches the satellite time. So I decide to linearly interpolate the in-situ measurements to the satellite time and use that instead. Second, the chlorophyll prediction returns some crazy values for inputs out of the training range. So I decide to clip the result of that prediction to [0,250] (based on the range of in-situ measurements). Sketches of both fixes are given below.
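Sketches of the two fixes (times expressed as hours since midnight; names are mine):

```python
def interpolate_to_satellite_time(t_morning, v_morning,
                                  t_evening, v_evening, t_sat):
    # Linear interpolation of the two daily in-situ measurements
    # to the satellite overpass time.
    w = (t_sat - t_morning) / (t_evening - t_morning)
    return v_morning + w * (v_evening - v_morning)

def clip_chlorophyll(value):
    # Clip the predicted chlorophyll to the range of in-situ
    # measurements to avoid extrapolation craziness.
    return min(max(value, 0.0), 250.0)
```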
Scanning through the Sentinel-2 dataset for days matching in-situ samples for dissolved oxygen gives me:
indicator | samples quantity | value range | date range |
do | 144 | [2.04,10.60] | [2021-11-16 10:44:00,2024-10-21 10:43:59] |
I was expecting more samples and that's a bit demotivating. At least it's enough to try to train an ANN. The range of dissolved oxygen values is also reduced to about a third of the original, which isn't great of course.
The correlation plot for the training dataset becomes:
Dissolved oxygen shows no particular correlation with any input value. Let's hope an ANN can still make something out of that. I use the following training parameters:
architecture: | ANN, 1 layer, 2 nodes, sigmoid activation |
feature scaling: | standard normalisation |
loss function: | huber |
optimiser: | adam (0.9, 0.999) |
learning rate: | 0.0005 |
training time: | 120s |
cross validation: | 10 fold |
The results are:
Comparing with previous results:
 | MAE | RMSE | \(R^2\)
baseline | 3.13 | 3.96 | -0.25 |
coord | 3.207 | 3.542 | 0.003 |
coord,date | 0.992 | 1.517 | 0.817 |
coord,temperature | 2.480 | 3.043 | 0.264 |
coord,date,temperature | 0.957 | 1.454 | 0.832 |
coord,date,temperature,bands,chlorophyll | 0.461 | 0.684 | 0.724 |
requirement | <1.07 | <1.62 | >0.66 |
The accuracy improves, the cross validation is OK, and \(R^2\) is still above the requirement. However, I don't have much trust in these results to be honest, there are just so few samples... As I'm reaching the end of the challenge and don't know what else to try, I'll keep it like this.
Ammonia and pH prediction from satellite data
Next I move to the two remaining indicators: ammonia and pH. I proceed exactly like for dissolved oxygen. The available samples for training with satellite data gives:
indicator | samples quantity | value range | date range |
ammonia | 122 | [0.00,1.06] | [2021-12-11 10:43:59,2024-10-21 10:43:59] |
pH | 144 | [7.22,8.62] | [2021-11-16 10:44:00,2024-10-21 10:43:59] |
Here again the reduction in value ranges and in the number of samples worries me. The correlation plots are:
No surprise here, no particular correlations are revealed for the indicators. I use the following training parameters for ANN:
architecture: | ANN, 1 layer and 2 nodes for ammonia, 1 layer with 2 nodes followed by 1 layer with 1 node for pH, sigmoid activation |
feature scaling: | standard normalisation |
loss function: | huber |
optimiser: | adam (0.9, 0.999) |
learning rate: | 0.0005 for ammonia, 0.001 for pH |
training time: | 60s |
cross validation: | 10 fold |
The results of training are:
Comparing with previous results:
 | MAE | RMSE | \(R^2\)
baseline | 0.10 | 0.17 | -0.04 |
coord | 0.092 | 0.151 | 0.194 |
coord,date | 0.083 | 0.126 | 0.435 |
coord,temperature | 0.087 | 0.141 | 0.292 |
coord,date,temperature | 0.080 | 0.127 | 0.427 |
coord,date,temperature,bands,chlorophyll | 0.047 | 0.067 | 0.842 |
requirement | <0.06 | <0.08 | >0.66 |
The results have improved, enough to bring ammonia within requirements, but still not enough for pH.
Integration of remaining indicators in the web application
Finally, I integrate the predictors for dissolved oxygen, ammonia and pH into the web application. I clip the resulting values to avoid crazy results (particularly expected due to the limited ranges in training): dissolved oxygen within [0,30], ammonia within [0,3], pH within [0,14]. And to improve the app a little, I add some visual indication of values out of the optimal/required ranges.
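A sketch of the clipping plus flagging logic (ranges passed as hypothetical tuples; the example uses the morning dissolved oxygen required range from the problem definition):

```python
def clip_and_flag(value, phys_range, req_range):
    # Clip to the physically plausible range, then flag the value
    # against the required range.
    value = min(max(value, phys_range[0]), phys_range[1])
    in_range = req_range[0] <= value <= req_range[1]
    return value, in_range

# Example: morning dissolved oxygen, clipped to [0,30],
# required range 3-5 mg/L.
print(clip_and_flag(31.2, (0.0, 30.0), (3.0, 5.0)))
```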
The whole app is made of one PHP file (~500 lines), one binary executable (~30kb) written in C (~460 lines, of which 3/4 were automatically generated), and one Python script (~1300 lines). No AI, 100% self-coded. Including the Python third-party packages (GEE, OpenCV, ...), which are the only external dependencies, it weighs ~270kb on disk and ~5MB in RAM. It could run on a simple Linode Nanode instance, no need for a virtual environment or container: just copy the app into the appropriate folder and set the correct access permissions. I haven't kept track of the time I've spent on this challenge; estimating from the git log it's equivalent to around 4 weeks full time (including the redaction of this article).
Conclusion
And that's where I end my participation in this challenge. I have a running web application which predicts the 4 indicators from satellite data and online data, 3 of them within the required accuracy. That's already a success for me, even if I'm absolutely not confident it would perform well on the tests that conclude the challenge, and more generally in real daily use.
I'm proud of the tools I've developed over the years with LibCapy, which allowed me to easily process the datasets, perform the regression for chlorophyll, and train ANNs on the other indicators. In particular the training step is really just running a command on a dataset and copy-pasting the source code it returns into the web app. Works like a charm! I've made some improvements to them along the way; next time will be even better.
Creating the web application was a matter of minutes thanks to my experience with all my previous "one single PHP file" apps (cf here). About satellite imagery, almost all the code I needed was ready since the Omdena challenge. Only the OpenWeatherMap API was completely new to me and required some learning. In the end most of my time was spent preparing data and searching how to solve the problem, which is a good sign I think.
The web application as it is now is barebones of course, but I believe there is no point in making a fancy app while I don't have a good prediction model to integrate into it! As long as it's enough to run the end-of-challenge tests, and to prove that the whole pipeline from data to prediction result runs, it's good enough for me. If I had better models, I would probably add a database of ponds (to avoid the user typing GPS coordinates), add a functionality to run the prediction on all ponds at once, improve the UI to display the whole area and see at a glance the ponds with/without problems (an interactive map?), and keep the results in the database to let the user check each pond's history. But in the end it's the user who has to decide what functionalities to add.
About the prediction models I've made, again, I don't think they have much practical value. From that point of view it is a failure. The reason is the small amount of data I had to train them on. For chlorophyll, finding by myself the ratio which several other researchers have found was a real pleasure, and brought some hope in the darkness. Maybe, on the range it has been trained on, my chlorophyll predictor isn't too bad? For the other indicators, it would take much more data to train and test on before I feel confident enough to judge their quality.
Training data is obviously the main problem of any machine learning project. On this particular one, besides the recurrent annoyance of cloud cover when working with satellite imagery, the size of the ponds relative to the spatial resolution and the speed of variation of dissolved oxygen relative to the temporal resolution are two extra drawbacks.
On top of that, there are several layers of noise. One is the mismatch in time, necessitating an interpolation between the morning and evening measures. The only other option would be to train two predictors: one that predicts the morning value from the observation at satellite time, and another that predicts the evening value from the same observation. That doesn't make sense to me. Predicting at most once per day seems the only option, in which case it doesn't fit the current FWI process, which has different requirements for morning and evening dissolved oxygen values.
Another source of noise is predicting from air temperature rather than water temperature. That may not be as bad as it looks, I don't know. I'd like to hear a specialist's point of view about it if I could.
Another source of noise is the variation of values across the pond, added to the fact that the satellite observation isn't at the measurement location. Averaging over the pond would help, but the size of the small ponds is problematic, and automatic averaging requires pond segmentation, which is another problem in itself. However I remember reading that FWI may already have the contour coordinates of the ponds, which would probably help.
Another source of noise is the quality of the measurements, as FWI itself admits, and the influence of applied treatments on indicator values. I have no idea how to deal with that. There is already so little data that filtering for "no treatment" plus "measurement with good equipment" would surely leave nothing to work with.
A last problem about the data: the density distribution, which shows a clear imbalance between values. It would take many more samples from the "extreme" cases to train predictors accurately over the whole ranges.
Future work
There are several ways to address these problems and obtain better results on my side. I would like to improve my knowledge about cloud filtering for satellite imagery; the article I've linked above seems a good resource to do so. Cloud cover is a recurrent problem on this kind of project, so getting better at dealing with it would be profitable for possible future projects using this kind of data.
I regret not asking FWI about the ponds without coordinates. They probably have them, and if I could use them it would certainly increase the number of training samples.
If I could have many more samples, I would like to double-check the order 3 polynomial I've found for chlorophyll. The near-perfect fit really intrigues me. With many more samples I would also like to try a per-pond approach: train one model for each pond. They all have their own properties and dynamics, I believe, due to different sizes, depths, fish populations, surrounding vegetation, and farmers' management methods. I would expect a model dedicated to a pond to be much more accurate.
I haven't used fish density because I can't retrieve it remotely, and I stayed within the challenge perimeter. However it is clearly an important factor. I wonder what the results would be if it were allowed as a user input to the prediction models.
On FWI's side, keep collecting data! In particular, focus on "problematic" ponds to balance the sample density over the whole range of values. Being consistent in the measurement method and equipment is also important, even if they already seem to be doing an amazing job about it.
I understand the logistic hurdle it represents, but having several samples at different locations and at the same time for the same pond would certainly help to match the satellite data (average over samples \(\leftrightarrow\) average over pixels). Or at least, collect one sample in the middle of the pond, with coordinates matching the pond coordinates in the dataset?
No doubt matching the satellite time is another logistic nightmare, and anyway collecting data every time possible can't hurt. In the end, nothing guarantees that the satellite approach is the right one, so data unusable here could be precious for another, ultimately more successful, approach. However, if the satellite approach is kept, there is a point to clarify: how to reconcile the fact that the satellite can produce only one prediction per day with the fact that the dissolved oxygen requirements are defined at two times per day.
About temperature, it seems to me easy to add the air temperature to the collected data. It would have helped me; it may help others?
Completely different approaches should also be explored. FWI has talked about forecasting models. It may work in itself, but I don't see how it would help their goal of avoiding in-situ data collection. Staying on the theme of remote monitoring, I wonder what results automated sensors installed in the ponds would give. This article looks like an interesting read on that subject. I'm also wondering if a classifier model (OK, low/high warning, low/high alert) isn't more appropriate than a regression one for their purpose.
Keep an eye on satellite news. While some assholes are destroying them, others keep building and successfully launching new ones, like Sentinel-5 this week. They'll increase the amount of available data and open new possibilities in terms of prediction models.
That's all folks!