In this article I report my attempt at solving the Fish Welfare Initiative challenge "Remote-Sensing Water Quality with Satellites".
Problem definition
The goal of the challenge is to use satellite imagery to predict as many as possible of four water quality indicators (dissolved oxygen, pH, ammonia, chlorophyll) in aquaculture ponds in India.
The required accuracies of the regression models for each indicator are:
 | do (mg/L) | ph | am (mg/L) | ch (\(\mu\)g/L)
\(R^2\) | \(\ge0.66\) | \(\ge0.71\) | \(\ge0.66\) | \(\ge0.73\) |
RMSE | \(\le1.62\) | \(\le0.13\) | \(\le0.08\) | \(\le59\) |
MAE | \(\le1.07\) | \(\le0.11\) | \(\le0.06\) | \(\le42\) |
where \(R^2\) is the coefficient of determination, RMSE the root mean square error, and MAE the mean absolute error, defined as follows:
$$
\begin{array}{l}
MAE=\frac{\sum_{i\in[1,n]}|y_i-y'_i|}{n}\\
RMSE=\sqrt{\frac{\sum_{i\in[1,n]}(y_i-y'_i)^2}{n}}\\
R^2=1-\frac{\sum_{i\in[1,n]}(y_i-y'_i)^2}{\sum_{i\in[1,n]}(y_i-\hat{y})^2}\\
\end{array}
$$
where \(n\) is the number of predictions, \(y_i\) is the true value of the indicator \(y\) for prediction \(i\), \(y'_i\) is the predicted value of the indicator \(y\) for prediction \(i\), and \(\hat{y}=\frac{\sum_{i\in[1,n]}y_i}{n}\) is the mean of the true values.
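In numpy these definitions translate directly into code (a minimal sketch; `y_true` and `y_pred` are hypothetical arrays of true and predicted values):

```python
import numpy as np

def mae(y_true, y_pred):
    # Mean absolute error.
    return np.mean(np.abs(y_true - y_pred))

def rmse(y_true, y_pred):
    # Root mean square error.
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def r2(y_true, y_pred):
    # Coefficient of determination: 1 - SSE / total variance.
    sse = np.sum((y_true - y_pred) ** 2)
    sst = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - sse / sst
```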
If the predictor were a classifier, according to this document and this document, the OK/NG thresholds would be:
 | optimal range | required range
do (morning) | 4-5mg/L | 3-5mg/L |
do (evening) | 8-10mg/L | 8-12mg/L |
ph | 7-8 | 6.5-8.5 |
am | <0.15mg/L | <0.5mg/L |
ch | 100-150\(\mu\)g/L or 150-220\(\mu\)g/L | ?
FWI provides the following data:
This challenge is the second iteration on this problem. The first iteration didn't produce models meeting the accuracy requirements, hence this second one. As the efforts so far were unsuccessful, there are concerns about whether a prediction model is possible at all, but FWI still hopes it may be (cf this blog post).
It looks like a very difficult problem: several others have already tried and failed, and I have no particular experience with this kind of problem. My hopes of solving this challenge are extremely low, but at least it is a very interesting real-world problem to keep learning about remote sensing and prediction models, in continuity with the Omdena challenge about deforestation I did earlier this year (cf here) and the online course about air pollution prediction I did last year (cf here), and a good occasion to use and improve LibCapy along the way!
Related works
This webpage gathers all the articles from FWI related to satellite imagery.
FWI has shared the slides of a prior attempt to predict chlorophyll from NDCI calculated from Sentinel imagery. This attempt failed to reach the requirements. It used machine learning (the slides give no details) and a nearest-neighbour lookup approach (pixel NDCI value matched to the nearest in-situ measurement).
FWI has performed a study (slides) of the difference between measurements at the edge and measurements in the interior of the ponds. They found a difference that is significant in absolute value but not significant for the purpose of their program.
FWI has shared the final write-up of an apparently very successful attempt to create a classifier for "in-range/out-of-range indicators" using different data (not satellite imagery).
FWI has published a report about dissolved oxygen challenges in aquaculture systems. It shows the relation between fish welfare, dissolved oxygen, ammonia and chlorophyll, and also with other factors such as fish density and weather.
FWI has published a report about their first attempt to predict indicators from satellite imagery. Although the results from this first attempt were very good, they later couldn't be reproduced, which raised skepticism (cf. this blog post).
In this post FWI shares some internal thoughts about the project and discusses another possible approach: forecasting models.
In this article the authors summarize, from many studies, the challenges and benefits of remote sensing in water monitoring. The interesting information from the perspective of the FWI challenge is as follows. They report a formula (requiring calibration) to calculate chlorophyll concentration from spectral reflectance at 460nm, 490nm and 520nm. They indicate a correlation of dissolved oxygen with temperature and chlorophyll. And they confirm that satellite imagery plus ANN is an interesting approach, but with many challenges to overcome.
In this article the authors show that dissolved oxygen in coastal sea water can be accurately predicted from temperature and chlorophyll data from the MODIS and VIIRS datasets using regression models.
In this article the authors show that dissolved oxygen in rivers can be accurately predicted from bands B5 and B8 of the Sentinel-2 dataset using random forests or support vector machines.
In this article the authors show that dissolved oxygen in coastal sea water can be predicted with relatively good \(R^2\) from RGB bands of the Landsat-8 dataset using a polynomial of order 3.
In this article and this article the authors show that dissolved oxygen in lakes can be accurately predicted from temperature, clarity and chlorophyll data derived from the bands of the MODIS and Himawari-8 datasets, using extreme gradient boosting and a multimodal deep neural network.
In this article the author calculates the chlorophyll density in Indian lakes directly from the Sentinel-2 band ratio (B5+B6)/B4. In this article the authors use the same band ratio for the same purpose in Turkish coastal water. However, in this article the authors show that several band ratios can be used, and that the best one varies with the water body considered and should be decided on a case by case basis.
In this article the authors use yet another band ratio to predict chlorophyll and argue that the top-of-atmosphere (TOA) reflectance dataset should be preferred to the bottom-of-atmosphere (BOA) one for shallow water bodies.
Ponds coordinates and area
First, I prepare the coordinates data in a format compatible with LibCapy. For verification and better understanding I make a visualisation of the pond locations. I reuse the tool I developed during the Omdena challenge to create a composite RGB image of the two districts, and I plot a blue circle for each pond on that image. The center of each circle is supposed to match the center of the pond (GPS coordinates in the dataset) and its radius is such that the circle's area equals the pond area (provided with the coordinates). (Sentinel-2 imagery, 10m/px, click on images to enlarge)
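For reference, the radius comes from inverting the disc area formula; a sketch of the conversion from pond area in acres to a radius in Sentinel-2 pixels (function and constant names are mine):

```python
import math

ACRE_IN_M2 = 4046.86   # one acre in square meters
M_PER_PIXEL = 10.0     # Sentinel-2 RGB bands resolution

def pond_radius_px(area_acres):
    # Radius of the disc whose area equals the pond area,
    # expressed in Sentinel-2 pixels.
    area_m2 = area_acres * ACRE_IN_M2
    radius_m = math.sqrt(area_m2 / math.pi)
    return radius_m / M_PER_PIXEL

# Example: the smallest pond (0.13 acres, id 1_73_2) gives ~1.3px.
print(pond_radius_px(0.13))
```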
The range of coordinates for each region is as follows:
 | Longitude | Latitude
Eluru | 81.012177,81.271376 | 16.574134,16.743473 |
Nellore | 80.075645,80.151875 | 14.446507,14.529491 |
The locations and sizes of the circles seem to match the underlying ponds, hence my parsing of the coordinates and areas seems correct. However, there are also some ponds for which the center doesn't match a pond center in the imagery.
This paper suggests subpixel accuracy for Sentinel-2 image registration. One can then assume that when the pond center in the data doesn't match the pond center in the imagery, the measurement is to blame (GPS inaccuracy, or a measurement location not actually at the center of the pond despite being introduced as such). So I assume that the data coordinates do match the pond they refer to, and that the shift in the imagery comes from the measurement having been taken only 'roughly' at the center of the pond (an assumption reinforced by some images of ponds in FWI blog posts: 1, 2).
Another observation about the pond maps is that some ponds are extremely small relative to the Sentinel-2 spatial resolution (in particular in the Nellore district).
The smallest pond in the dataset has an area of 0.13 acres (id: 1_73_2). That's 526 square meters, or about 22x22m, or at most 2x2px in the Sentinel-2 imagery. We saw we can assume the single pixel at the data coordinates to be a measure of the correct pond, but using several pixels around those coordinates (with a Gaussian blur) would reduce possible noise and give a better representation of the pond as a whole. From that perspective the smallest ponds are too small, and the shift of the coordinates away from the real center is worrying (risk of using irrelevant pixels outside the pond).
The distribution of sizes is as follows:
However the smallest pond is the only one with a size equivalent to less than 3x3 pixels. I therefore decide to use a 3x3 pixel (30x30m) area around the given pond coordinates, apply a Gaussian blur (radius 1, strength 0.5), and use the value of the center pixel at coords (1,1), as sketched below.
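In OpenCV terms the extraction looks like this (my actual code uses LibCapy, and the blur parameters may not map exactly; `band`, `row` and `col` are hypothetical):

```python
import numpy as np
import cv2

def pond_value(band, row, col):
    # Extract the 3x3 pixel patch centered on the pond coordinates,
    # smooth it, and return the center pixel.
    patch = band[row - 1:row + 2, col - 1:col + 2].astype(np.float32)
    # 3x3 kernel (radius 1), sigma 0.5.
    blurred = cv2.GaussianBlur(patch, (3, 3), 0.5)
    return blurred[1, 1]
```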
Measurements data
Next I prepare the measurements dataset. It includes a lot of information, some of it irrelevant (e.g. who made the measurement) or unusable (e.g. behaviour indicators, in the sense that they can't be remotely acquired from a satellite). I choose to retain the following measurements: coord, date, temperature, and of course the 4 indicators to be predicted. I also initially kept the 'equipment' and 'treatment' fields, but later decided to ignore them.
I remove three samples which I consider to be anomalous:
This gives for each predicted indicator the following info:
indicator | samples quantity | value range | date range
do | 4805 | 0.0,27.1 | 2021-07-12 06:15:00,2024-10-31 17:18:00 |
ph | 4806 | 5.3,9.54 | 2021-07-12 06:15:00,2024-10-31 17:18:00 |
am | 2774 | 0.0,3.0 | 2021-07-29 06:20:00,2024-10-31 07:32:00 |
ch | 96 | 7.54,225.13 | 2023-04-25 06:48:00,2023-06-09 18:01:00 |
Note that chlorophyll has very few samples (over less than 2 months, in only 6 ponds of the Eluru district).
The correlation plots and distance correlations for all measures and indicators look like this:
The correlation plots and values show only low correlation between indicators and available measures.
The density distributions for each indicator look like this (bin size set to approximately match the MAE requirements):
Indicator values are all very concentrated. From the machine learning perspective we would like a flat density distribution (all possible values equally represented). Simply returning the median value would be sufficient to stay within the MAE requirement almost all the time for chlorophyll and most of the time for ammonia. This will make it more difficult to judge how good the prediction models are. Another thing to note: the dissolved oxygen density plot seems to have a bimodal shape, which can also be seen in the correlation plots.
Baseline models
The analysis of the data directly gives us the recipe for baseline models: simply return the median value for each indicator. Evaluating these baseline models gives the following results:
As predicted by the data analysis, these baseline models return apparently good values (low MAE and RMSE, not terribly far from the requirements, or even within them for chlorophyll), but of course have no predictive power (terrible \(R^2\)).
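For reference, a minimal sketch of such a baseline, reusing the hypothetical metric helpers from the problem definition:

```python
import numpy as np

def baseline_metrics(y):
    # Predict the median of the indicator for every sample and
    # evaluate with the metric helpers sketched earlier.
    y = np.asarray(y, dtype=float)
    pred = np.full_like(y, np.median(y))
    return mae(y, pred), rmse(y, pred), r2(y, pred)
```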
ANN models from in-situ data
As a first attempt to improve upon those dumb baseline models I try an artificial neural network with inputs 'coord', 'date' and 'temperature'. For 'coord' I use the GPS coordinates (longitude/latitude). For 'date' I use cos/sin encodings of the day in the year and the time in the day (see the sketch after the table below). This makes 7 numerical inputs, and the number of samples becomes as follows:
indicator | samples quantity |
do | 4800 |
ph | 4804 |
am | 2773 |
ch | 96 |
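A sketch of the cos/sin date encoding described above (the exact encoding I use may differ in details):

```python
import math
from datetime import datetime

def encode_date(dt):
    # Cyclical encodings: day of year and time of day are each mapped
    # onto the unit circle, so that Dec 31 stays close to Jan 1 and
    # 23:59 stays close to 00:00.
    day = 2.0 * math.pi * (dt.timetuple().tm_yday - 1) / 365.0
    tod = 2.0 * math.pi * (dt.hour * 3600 + dt.minute * 60 + dt.second) / 86400.0
    return (math.cos(day), math.sin(day), math.cos(tod), math.sin(tod))

# Example: a morning measurement on 2023-04-25 at 06:48.
print(encode_date(datetime(2023, 4, 25, 6, 48)))
```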
The correlation plots and distance correlations for all inputs and indicators look like this:
I use the following parameters to train the ANNs (the architecture is chosen to obtain good results over the whole dataset and consistent results between training and validation during cross validation):
architecture: | ANN, 1 layer of 5 nodes for chlorophyll, 2 layers of 10 nodes each for others, relu activation |
feature scaling: | standard normalisation |
loss function: | huber |
optimiser: | adam (0.9, 0.999) |
learning rate: | 0.001 |
training time: | 120s |
cross validation: | 10 fold |
If using only 'coord' and 'date', the results become as follows.
If using only 'coord' and 'temperature', the results become as follows.
If using only 'coord', the results become as follows.
Putting all these results together:
 | MAE | RMSE | \(R^2\)
baseline | 3.13 | 3.96 | -0.25 |
coord | 3.207 | 3.542 | 0.003 |
coord,date | 0.992 | 1.517 | 0.817 |
coord,temperature | 2.480 | 3.043 | 0.264 |
coord,date,temperature | 0.957 | 1.454 | 0.832 |
requirement | <1.07 | <1.62 | >0.66 |
For dissolved oxygen, the coordinates do not help the prediction. Temperature has a slightly stronger positive influence. The date, on the other hand, has a strong impact (as expected from the correlation plots), enough to bring the prediction within range for all three requirements. The best accuracy is obtained when using both date and temperature.
For chlorophyll, date is the only input having an influence, slightly improving the prediction, with MAE and RMSE always within requirements. However \(R^2\) is still far from its requirement.
For pH, date and temperature both improve the prediction, and the best accuracy is reached when using both. However it is insufficient to meet the requirements.
For ammonia, coordinates, date and temperature all slightly improve the prediction, and the best accuracy is obtained when using them all together. However it is still insufficient to meet the requirements.
For all indicators, the good stability in cross validation suggests they would perform with similar accuracy in practice.
Temperature data acquisition
The temperature seems to have a positive impact on the prediction of 3 out of 4 indicators. It would then be interesting to have a way to retrieve it automatically.
Satellite data like the MOD21A1D.061 dataset available on GEE provides a daily average temperature at 1km spatial resolution. The spatial resolution is probably enough (it sounds reasonable to me to expect the temperature to not vary much within 1km because the considered areas are not mountainous), but I'm not sure of how much impact using the average temperature over one day has.
Another solution is OpenWeatherMap, which provides an easy-to-use API, free up to a certain number of calls per day. After registration one can retrieve the temperature at a location (long/lat) and time (unix timestamp, beware of the timezone) with an API call similar to the following (replace 'xxx' with your API key):
curl --silent "https://api.openweathermap.org/data/3.0/onecall/timemachine?lat=16.63&lon=81.15&dt=1685586656&units=metric&appid=xxx" | jq '.data[0].temp'
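The same query in Python, with the timezone conversion that bit me elsewhere (a sketch; requires Python 3.9+ for zoneinfo and the requests package):

```python
from datetime import datetime
from zoneinfo import ZoneInfo
import requests

def air_temperature(lat, lon, local_dt, api_key):
    # Convert the local measurement time (Asia/Kolkata) to a unix
    # timestamp, which the API interprets as UTC.
    ts = int(local_dt.replace(tzinfo=ZoneInfo("Asia/Kolkata")).timestamp())
    url = "https://api.openweathermap.org/data/3.0/onecall/timemachine"
    params = {"lat": lat, "lon": lon, "dt": ts,
              "units": "metric", "appid": api_key}
    reply = requests.get(url, params=params, timeout=10).json()
    return reply["data"][0]["temp"]

# Example (same endpoint as the curl call above):
# air_temperature(16.63, 81.15, datetime(2023, 6, 1, 7, 20), "xxx")
```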
As far as I understand, the temperature measurement in the FWI dataset is the water temperature. OpenWeatherMap provides the air temperature, so a small difference is expected. The test below gives an idea of that difference, which seems reasonable. I should have done a more rigorous study of how the two temperatures are related, or maybe even made a predictor for that, but I haven't for lack of time. Training the ANN with API temperatures instead of FWI temperatures will also probably mitigate the influence of that difference.
Predictor as a web application
I already have enough to make a minimalist web app predicting the 4 indicators, so let's do it. Even if the results are not great and raise concerns, it is still possible to improve the predictors later, and I'll have the basis ready.
It lets the user input coordinates, date and time, and returns the temperature (automatically retrieved from OpenWeatherMap) and the predicted values for the 4 indicators (using the ANNs introduced in the previous paragraphs).
Satellite data
I now move on to the search for models using satellite imagery (instead of in-situ measurements) for remote monitoring of water quality. The documents about the challenge refer to Sentinel-2 imagery. I intend to use GEE to retrieve this imagery, because I have some experience working with this platform. Info about the Sentinel-2 dataset on GEE is available here.
The maximum available resolution of that imagery is sufficient to isolate at least one pixel corresponding to even the smallest pond (cf above). However the temporal resolution is only 5 days (not accounting for days unavailable due to cloud cover). To train a model with satellite inputs and in-situ measurement outputs it is also necessary to have matching sampling dates and times. It is even more important for this problem, as we've seen that the level of dissolved oxygen is strongly correlated with the hour of the day. Given such temporal resolution it is very unlikely that imagery samples exist matching in-situ samples within the same hour.
The harmonised Landsat and Sentinel dataset is a composite of Landsat and Sentinel imagery, which brings the temporal resolution to 2-3 days (not accounting for days unavailable due to cloud cover). That's better, but it also decreases the spatial resolution to 30x30m. Using the same criteria for filtering out too-small ponds (3x3 pixels, equivalent to 2.0 acres) would remove 31 ponds from the training data.
The Sentinel-3 dataset has a 2 day temporal resolution and 21 spectral bands, but only a 300m spatial resolution. It would be a great dataset for larger bodies of water, but not for the ponds considered here.
To estimate how much of a problem the temporal resolution is, I search for days having both satellite imagery and an in-situ measurement. As this is time and resource consuming, I do it first only for the pond with the maximum number of dissolved oxygen measurements: pond 1_1_1. And I do it twice, because stupid me initially forgot to convert from Indian time (Asia/Kolkata) to UTC.
About cloud cover, I choose to discard satellite images with a DN above 200 at the pond location, as a visual check on the RGB composite convinces me that it gives better results than relying on the CLOUD_COVER parameter. That could probably be improved with more rigorous methods, but that's a rabbit hole I don't have time to fall into. This article seems a good reference for later. A minimal sketch of this filter is given below.
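A sketch of that filter (testing the RGB bands is my assumption here; each image is taken as a dict of band name to 2D array):

```python
def usable_images(images, row, col, threshold=200):
    # Keep only the images whose DN at the pond location stays below
    # the threshold (bright pixels betray clouds).
    kept = []
    for img in images:
        if all(img[b][row, col] <= threshold for b in ("B4", "B3", "B2")):
            kept.append(img)
    return kept
```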
The results are as follows:
Composite RGB image from B4, B3, B2 bands, 500m wide centered on the pond coordinates:
The in-situ measurements have a (usable) satellite measurement on the same day 23 times (out of 161); the in-situ measurements occurred in the early morning or late afternoon while the satellite measurement occurred around noon, always at the same time. The RGB images give confidence that the data retrieval is working fine, and the cloud filtering looks OK. There is one arguable image, but I don't expect a perfect result: the center pixel (the only one used) is clear, and the image looks worse than it actually is due to normalisation. It looks like there may be enough data to work with once I run this over all the ponds, and I'll address the timing mismatch problem later. Good news!
Chlorophyll prediction from satellite data
From the related works, chlorophyll seems to be the "easiest" indicator to predict. Efficient methods have been found and documented by many others, in contexts very similar to the FWI challenge. If I can predict it correctly, it would also be a valuable input for the prediction of dissolved oxygen. I therefore decide to focus on that indicator first.
I scan through the Sentinel-2 imagery for matching days and cloud-free images. It gives me:
indicator | samples quantity | value range | date range |
ch | 9 | 69.26,166.20 | 2023-04-25 10:43:59,2023-05-25 10:43:59 |
That's so few samples that training an ANN looks hopeless. But the related works show that a regression on a well-chosen band ratio can give good predictions. So I decide to try that myself (it's the only option I have anyway).
I retrieve the bands B4, B3, B2, B8, B11, B5, B6, and calculate the indices NDVI, NDBI, EVI, NDCI, NDWI, MNDWI and NDTI (the choice comes from what I already had from the Omdena challenge, with additions influenced by the related works).
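For reference, the standard definitions of these indices from the Sentinel-2 bands (a sketch; my implementation may differ in minor details such as scaling):

```python
def indices(b2, b3, b4, b5, b6, b8, b11):
    # Standard index definitions (assumed; the exact formulas I used
    # may differ slightly). All inputs are reflectance values.
    return {
        "ndvi": (b8 - b4) / (b8 + b4),
        "ndbi": (b11 - b8) / (b11 + b8),
        "evi": 2.5 * (b8 - b4) / (b8 + 6.0 * b4 - 7.5 * b2 + 1.0),
        "ndci": (b5 - b4) / (b5 + b4),
        "ndwi": (b3 - b8) / (b3 + b8),
        "mndwi": (b3 - b11) / (b3 + b11),
        "ndti": (b4 - b3) / (b4 + b3),
    }
```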
About the time difference between the in-situ measure and the satellite data, well, there is not much I can do, so I decide to ignore it. Hopefully chlorophyll doesn't change much within a few hours; I'm no specialist at all but that sounds plausible to me. FWI also apparently takes only one measure per day, which suggests it doesn't.
As the related works suggest a case-by-case optimal ratio, I decide to perform an exhaustive search over all the combinations of the bands and indices I have. I consider all ratios of the form A/B or (A+B)/C. For the regression, also following the related works, I try for each ratio a polynomial of order 1, 2 and 3 (a sketch of the search is given below). About TOA and BOA, I tried both and couldn't find any difference.
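The search itself is simple enough to sketch (numpy assumed; my actual code is LibCapy based and differs in details):

```python
import itertools
import numpy as np

def best_ratio(features, y, order):
    # features: dict of name -> per-sample np.array.
    # Try every ratio A/B and (A+B)/C, fit a polynomial of the given
    # order, and keep the ratio with the best R^2.
    names = list(features)
    candidates = [((a,), b) for a, b in itertools.permutations(names, 2)]
    candidates += [((a, b), c)
                   for a, b in itertools.combinations(names, 2)
                   for c in names if c not in (a, b)]
    best = (None, None, -np.inf)
    for num, den in candidates:
        x = sum(features[n] for n in num) / features[den]
        coeffs = np.polyfit(x, y, order)
        pred = np.polyval(coeffs, x)
        r2 = 1.0 - np.sum((y - pred) ** 2) / np.sum((y - np.mean(y)) ** 2)
        if r2 > best[2]:
            best = ((num, den), coeffs, r2)
    return best
```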
This gives the following results:
order | ratio | polynomial | MAE | RMSE | \(R^2\)
1 | x=(evi+ndti)/ndbi | \(94.863720+0.536875x\) | 15.020041 | 20.477803 | 0.529485
2 | x=(red_edge_1+red_edge_2)/red | \(-1736.449638+1320.008671x-232.544017x^2\) | 13.875895 | 16.619855 | 0.690072
3 | x=(red+evi)/green | \(9593.149603-41891.410976x+60448.165645x^2-28514.787873x^3\) | 6.405667 | 7.051341 | 0.944211
To be honest, for the order 2 polynomial, landing right on the same ratio (B5+B6)/B4 as in the related works made me smile a bit. Then the result for order 3 got me worried: is that a real improvement, or just too good to be true? The plots of ratio vs value for orders 2 and 3 look like this:
I would really like to have more samples to confirm that result. If I relax the filtering on DN value to 500, it doubles the number of matching days and (B5+B6)/B4 is still the best ratio, but the measure becomes noisy and \(R^2\) drops significantly.
Using Sentinel-2 band names, the ratio x=(red_edge_1+red_edge_2)/red is \((B_5+B_6)/B_4\), and the ratio x=(red+evi)/green is \(x=(B_4+2.5(B_8-B_4)/(B_8+6B_4-7.5B_2+1))/B_3\).
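Applied as code, the order 2 predictor becomes the following (coefficients copied from the results table above; only meaningful on the narrow range it was fitted on):

```python
def chlorophyll_order2(b4, b5, b6):
    # Order-2 predictor from the results table, with x = (B5+B6)/B4.
    x = (b5 + b6) / b4
    return -1736.449638 + 1320.008671 * x - 232.544017 * x ** 2
```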
Updating chlorophyll in the web application
Let's integrate this chlorophyll predictor into the web application. The process flow becomes: the user inputs the location and date, the app retrieves the temperature from OpenWeatherMap, retrieves the satellite data from GEE, predicts the chlorophyll from the satellite data using the predictor of the previous paragraph, and predicts the 3 other indicators as before.
However the date chosen by the user is most likely not a date for which there is satellite data. So I decide to proceed as follows: the app automatically searches for the nearest date having satellite data within the 5 days before the user-selected date. If there is none (due to clouds) the prediction fails; if there is one, the retrieval of temperature and the prediction of indicators are done at the satellite date (which I display in the result). A sketch of this search follows.
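A minimal sketch of that search (`fetch` is a hypothetical callback returning the satellite data for a date, or None when there is no usable image):

```python
from datetime import timedelta

def nearest_satellite_date(user_date, fetch):
    # Search backward, up to 5 days, for a date with usable
    # satellite data (no pass or too cloudy -> None).
    for back in range(6):
        d = user_date - timedelta(days=back)
        data = fetch(d)
        if data is not None:
            return d, data
    return None, None
```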
And while I'm at it, as I now have the satellite data, I can add the satellite view of the pond with the result of prediction. Yahoo!
Dissolved oxygen prediction from satellite data and chlorophyll
Next I move to the prediction of dissolved oxygen. I now have the Sentinel-2 bands and indices, the OpenWeatherMap air temperature, and the predicted chlorophyll. After trying both, I decide to keep the order 2 and order 3 chlorophyll predictors in the training dataset and let the ANN decide which to use.
But here I have two new problems. First, dissolved oxygen varies a lot within a day, and for this reason FWI takes two measures (morning/afternoon); of course neither matches the satellite time. So I decide to linearly interpolate the in-situ measurements to the satellite time and use that instead. Second, the chlorophyll prediction returns some crazy values for inputs out of the training range. So I decide to clip the result of that prediction to [0,250] (based on the range of in-situ measurements). Sketches of both fixes are given below.
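Sketches of the two fixes (times expressed as hours since midnight; names are mine):

```python
def interpolate_to_satellite_time(t_morning, v_morning,
                                  t_evening, v_evening, t_sat):
    # Linear interpolation of the two daily in-situ measurements
    # to the satellite overpass time.
    w = (t_sat - t_morning) / (t_evening - t_morning)
    return v_morning + w * (v_evening - v_morning)

def clip_chlorophyll(value):
    # Clip the predicted chlorophyll to the range of in-situ
    # measurements to avoid extrapolation craziness.
    return min(max(value, 0.0), 250.0)
```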
Scanning through the Sentinel-2 dataset for days matching in-situ samples for dissolved oxygen gives me:
indicator | samples quantity | value range | date range |
do | 144 | [2.04,10.60] | [2021-11-16 10:44:00,2024-10-21 10:43:59] |
I was expecting more samples and that's a bit demotivating. At least it's enough to try to train an ANN. The range of dissolved oxygen values is also reduced to about a third of the original, which isn't great of course.
The correlation plot for the training dataset becomes:
Dissolved oxygen shows no particular correlation with any input value. Let's hope an ANN can still make something out of that. I use the following training parameters:
architecture: | ANN, 1 layer, 2 nodes, sigmoid activation |
feature scaling: | standard normalisation |
loss function: | huber |
optimiser: | adam (0.9, 0.999) |
learning rate: | 0.0005 |
training time: | 120s |
cross validation: | 10 fold |
The results are:
Comparing with previous results:
 | MAE | RMSE | \(R^2\)
baseline | 3.13 | 3.96 | -0.25 |
coord | 3.207 | 3.542 | 0.003 |
coord,date | 0.992 | 1.517 | 0.817 |
coord,temperature | 2.480 | 3.043 | 0.264 |
coord,date,temperature | 0.957 | 1.454 | 0.832 |
coord,date,temperature,bands,chlorophyll | 0.461 | 0.684 | 0.724 |
requirement | <1.07 | <1.62 | >0.66 |
The accuracy improves, the cross validation is OK, and \(R^2\) is still above the requirement. However, I don't have much trust in these results to be honest, there are just so few samples... As I'm reaching the end of the challenge and don't know what else to try, I'll keep it like this.
Ammonia and pH prediction from satellite data
Next I move to the two remaining indicators: ammonia and pH. I proceed exactly like for dissolved oxygen. The available samples for training with satellite data gives:
indicator | samples quantity | value range | date range |
ammonia | 122 | [0.00,1.06] | [2021-12-11 10:43:59,2024-10-21 10:43:59] |
pH | 144 | [7.22,8.62] | [2021-11-16 10:44:00,2024-10-21 10:43:59] |
Here again the reduction in value ranges and in the number of samples worries me. The correlation plots are:
No surprise here, no particular correlations are revealed for the indicators. I use the following training parameters for ANN:
architecture: | ANN, 1 layer and 2 nodes for ammonia, 1 layer with 2 nodes followed by 1 layer with 1 node for pH, sigmoid activation |
feature scaling: | standard normalisation |
loss function: | huber |
optimiser: | adam (0.9, 0.999) |
learning rate: | 0.0005 for ammonia, 0.001 for pH |
training time: | 60s |
cross validation: | 10 fold |
The results of training are:
Comparing with previous results:
 | MAE | RMSE | \(R^2\)
baseline | 0.10 | 0.17 | -0.04 |
coord | 0.092 | 0.151 | 0.194 |
coord,date | 0.083 | 0.126 | 0.435 |
coord,temperature | 0.087 | 0.141 | 0.292 |
coord,date,temperature | 0.080 | 0.127 | 0.427 |
coord,date,temperature,bands,chlorophyll | 0.047 | 0.067 | 0.842 |
requirement | <0.06 | <0.08 | >0.66 |
The results have improved, enough to bring ammonia within requirements, but still not enough for pH.
Integration of remaining indicators in the web application
Finally, I integrate the predictors for dissolved oxygen, ammonia and pH into the web application. I clip the resulting values to avoid crazy results (particularly expected due to the limited ranges in training): dissolved oxygen within [0,30], ammonia within [0,3], pH within [0,14]. And to improve the app a little, I add some visual indication of values out of the optimal/required ranges.
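A sketch of the clipping plus flagging logic (ranges passed as hypothetical tuples; the example uses the morning dissolved oxygen required range from the problem definition):

```python
def clip_and_flag(value, phys_range, req_range):
    # Clip to the physically plausible range, then flag the value
    # against the required range.
    value = min(max(value, phys_range[0]), phys_range[1])
    in_range = req_range[0] <= value <= req_range[1]
    return value, in_range

# Example: morning dissolved oxygen, clipped to [0,30],
# required range 3-5 mg/L.
print(clip_and_flag(31.2, (0.0, 30.0), (3.0, 5.0)))
```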
The whole app is made of one PHP file (~500 lines), one binary executable (~30kb) written in C (~460 lines, of which 3/4 were automatically generated), and one Python script (~1300 lines). No AI, 100% self-coded. Including the Python third-party packages (GEE, OpenCV, ...), which are the only external dependencies, it weighs ~270kb on disk and ~5MB in RAM. It could run on a simple Linode Nanode instance, no need for a virtual environment or container: just copy the app into the appropriate folder and set the correct access permissions. I haven't kept track of the time I've spent on this challenge; estimating from the git log it's equivalent to around 4 weeks full time (including the redaction of this article).
Conclusion
And that's where I end my participation in this challenge. I have a running web application which predicts the 4 indicators from satellite data and online data, 3 of them within the required accuracy. That's already a success for me, even if I'm absolutely not confident it would perform well on the tests that conclude the challenge, and more generally in real daily use.
I'm proud of the tools I've developed over the years with LibCapy, which allowed me to easily process the datasets, perform the regression for chlorophyll, and train ANNs on the other indicators. In particular the training step is really just running a command on a dataset and copy-pasting the source code it returns into the web app. Works like a charm! I've made some improvements to them along the way; next time will be even better.
Creating the web application was a matter of minutes thanks to my experience with all my previous "one single PHP file" apps (cf here). About satellite imagery, almost all the code I needed was ready since the Omdena challenge. Only the OpenWeatherMap API was completely new to me and required some learning. In the end most of my time was spent preparing data and searching how to solve the problem, which is a good sign I think.
The web application as it is now is barebones of course, but I believe there is no point in making a fancy app while I don't have a good prediction model to integrate into it! As long as it's enough to run the end-of-challenge tests, and to prove that the whole pipeline from data to prediction result runs, it's good enough for me. If I had better models, I would probably add a database of ponds (to avoid the user typing GPS coordinates), add a functionality to run the prediction on all ponds at once, improve the UI to display the whole area and see at a glance the ponds with/without problems (an interactive map?), and keep the results in the database to let the user check each pond's history. But in the end it's the user who has to decide what functionalities to add.
About the prediction models I've made, again, I don't think they have much practical value. From that point of view it is a failure. The reason is the small amount of data I had to train them on. For chlorophyll, finding by myself the ratio which several other researchers have found was a real pleasure, and brought some hope in the darkness. Maybe, on the range it has been trained on, my chlorophyll predictor isn't too bad? For the other indicators, it would take much more data to train and test on before I feel confident enough to judge their quality.
Training data is obviously the main problem of any machine learning project. On this particular one, besides the recurrent annoyance of cloud cover when working with satellite imagery, the size of the ponds relative to the spatial resolution and the speed of variation of dissolved oxygen relative to the temporal resolution are two extra drawbacks.
On top of that, there are several layers of noise. One is the mismatch in time, necessitating an interpolation between the morning and evening measures. The only other option would be to train two predictors: one that predicts the morning value from the observation at satellite time, and another that predicts the evening value from the same observation. That doesn't make sense to me. Predicting at most once per day seems the only option, in which case it doesn't fit the current FWI process, which has different requirements for morning and evening dissolved oxygen values.
Another source of noise is predicting from air temperature rather than water temperature. That may not be as bad as it looks, I don't know. I'd like to hear a specialist's point of view about it if I could.
Another source of noise is the variation of values across the pond, added to the fact that the satellite observation isn't at the measurement location. Averaging over the pond would help, but the size of the small ponds is problematic, and automatic averaging requires pond segmentation, which is another problem in itself. However I remember reading that FWI may already have the contour coordinates of the ponds, which would probably help.
Another source of noise is the quality of the measurements, as FWI itself admits, and the influence of applied treatments on indicator values. I have no idea how to deal with that. There is already so little data that filtering for "no treatment" plus "measurement with good equipment" would surely leave nothing to work with.
A last problem about the data: the density distribution, which shows a clear imbalance between values. It would take many more samples from the "extreme" cases to train predictors accurately over the whole ranges.
Future work
There are several ways to address these problems and obtain better results on my side. I would like to improve my knowledge about cloud filtering for satellite imagery; the article I've linked above seems a good resource to do so. Cloud cover is a recurrent problem on this kind of project, so getting better at dealing with it would be profitable for possible future projects using this kind of data.
I regret not asking FWI about the ponds without coordinates. They probably have them, and if I could use them it would certainly increase the number of training samples.
If I could have many more samples, I would like to double-check the order 3 polynomial I've found for chlorophyll. The near-perfect fit really intrigues me. With many more samples I would also like to try a per-pond approach: train one model for each pond. They all have their own properties and dynamics, I believe, due to different sizes, depths, fish populations, surrounding vegetation, and farmers' management methods. I would expect a model dedicated to a pond to be much more accurate.
I haven't used fish density because I can't retrieve it remotely, and I stayed within the challenge perimeter. However it is clearly an important factor. I wonder what the results would be if it were allowed as a user input to the prediction models.
On FWI's side, keep collecting data! In particular, focus on "problematic" ponds to balance the sample density over the whole range of values. Being consistent in the measurement method and equipment is also important, even if they already seem to be doing an amazing job about it.
I understand the logistic hurdle it represents, but having several samples at different locations and at the same time for the same pond would certainly help to match the satellite data (average over samples \(\leftrightarrow\) average over pixels). Or at least, collect one sample in the middle of the pond, with coordinates matching the pond coordinates in the dataset?
No doubt matching the satellite time is another logistic nightmare, and anyway collecting data every time possible can't hurt. In the end, nothing guarantees that the satellite approach is the right one, so data unusable here could be precious for another, ultimately more successful, approach. However, if the satellite approach is kept, there is a point to clarify: how to reconcile the fact that the satellite can produce only one prediction per day with the fact that the dissolved oxygen requirements are defined at two times per day.
About temperature, it seems to me easy to add the air temperature to the collected data. It would have helped me; it may help others?
Completely different approaches should also be explored. FWI has talked about forecasting models. It may work in itself, but I don't see how it would help their goal of avoiding in-situ data collection. Staying on the theme of remote monitoring, I wonder what results automated sensors installed in the ponds would give. This article looks like an interesting read on that subject. I'm also wondering if a classifier model (OK, low/high warning, low/high alert) isn't more appropriate than a regression one for their purpose.
Keep an eye on satellite news. While some assholes are destroying them, others keep building and successfully launching new ones, like Sentinel-5 this week. They'll increase the amount of available data and open new possibilities in terms of prediction models.
That's all folks!