Focus measurement in digital photography

Have you ever wondered how your digital camera can automatically focus the lens correctly? Instead of you turning the focus ring to bring the focus plane to the correct distance, an often incredibly small motor does it for you and magically stops at the right position (sometimes...).

In this article I share my notes about how this is done, in particular how the focus level is measured with what's called "contrast detection" in passive systems, and how the underlying algorithms can also be used afterward to estimate the focus quality of an image as a post-processing step, or even be repurposed for some completely different goals.

Focus what?

The controller inside your camera needs to know when the image is correctly focused. It needs a measurement that takes the current image or incoming light as input and returns a value representing the focus level. Then it's just a matter of scanning the range of possible distances for the focus plane and finding which one maximises that value. Of course I assume here that the depth of field is shallow enough that we need to worry about it, not like a lens with nearly infinite depth of field.
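To make that scanning idea concrete, here is a minimal Python sketch of such a scan. The camera interface (move_focus, capture) is purely hypothetical, and focus_score stands for any of the metrics described later in this article.

    def autofocus_scan(camera, positions, focus_score):
        # Try every candidate focus distance and keep the one whose image
        # maximises the focus measurement.
        best_pos, best_score = None, float("-inf")
        for pos in positions:
            camera.move_focus(pos)    # move the focus plane (hypothetical API)
            img = camera.capture()    # grab a greyscale frame at this distance
            score = focus_score(img)  # measure the focus level on that frame
            if score > best_score:
                best_pos, best_score = pos, score
        camera.move_focus(best_pos)   # settle on the best position found
        return best_pos

A real camera would of course use a smarter search than an exhaustive scan, but the principle stays the same.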

"But...", says the savvy reader, "it doesn't make any sense! How does the camera knows what to focus on?". Indeed, what 'correctly focused' means is an extremely subjective question. Only the user knows which subject in the scene is the one which should be in focus. Simply, cameras choose a default position inside the image and consider that what happens to be there is what should be in focus, for the best and for the worst.

On my old digital cameras, a small area around the center pixel was the one used, and I had to go through a two-step method to take pictures correctly focused somewhere else: put the subject at the center of the image, half press the shutter button to activate the autofocus, then reorient the camera to compose the image as needed, and finally fully press the shutter button to take the picture. Not super handy, but already so much better than trying to focus manually using focus screens on my even older analog cameras.

Later, cameras added several default focus positions, displayed as small rectangles in the viewfinder, from which the user could select, or even pinpoint on a touch screen. Nowadays, cameras can also detect faces in the image and automatically decide to focus on them, or do even more advanced detection/selection magic.

'Active' and 'Passive' systems.

Once the problem of knowing what should be in focus is solved comes the problem of focus measurement. It can be performed in several ways. "Active" autofocus systems use a device such as a distance sensor to estimate the actual distance to the subject, from which it is quite easy to automatically calculate the correct position of the lenses. "Passive" autofocus systems analyse the incoming light on the sensor to estimate the correct focusing. These two systems are also sometimes used in combination to get the best of both worlds.

Passive systems are divided into two methods: phase detection and contrast detection. The phase detection method analyses the incoming light rays: the subject is in focus if, for a given position on the sensor, the light coming from one direction matches the light coming from another. This is a fast and efficient method, however it requires a dedicated optical device to split the flow of light, and it can only be done while taking the picture.

The contrast detection method, on the other hand, only needs to know the amount of light that has reached the sensor. Hence it can even be used after the picture has been taken, as a post-processing step to estimate the focus quality of an image from its pixel values. This method is based on the fact that a well focused image is expected to be sharper (at least around the edges of shapes and textures) than a blurry one. It is then possible to evaluate the focus from the contrast difference between neighbour pixels.

Referring to this paper, the contrast detection method can be implemented using many different metrics, grouped in four main classes (p.11-16): differential, statistical, correlative, spectral. Bad news, that means one has to worry about which one to choose. Good news, the comparison of these metrics in the same paper on p.65 suggests that almost all of them work just fine, so there is maybe not that much to worry about.

Below I'll look in more detail at some of those metrics, subjectively chosen according to how much effort it would take me to add them to LibCapy. Meanwhile, lazy Python programmers who don't care much about anything will be happy to know that an OpenCV blog post provides the code for a few metrics ready to copy-paste.

Examples of metrics for contrast detection

In what follows, \(n\) is the number of pixels and \(I_{x,y}\) is the light intensity at position (x,y) in the image, appropriately converted to greyscale.
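For reference, a possible greyscale conversion in Python/numpy (the exact weights matter little for the measures below; the BT.601 luma weights used here are just one common choice):

    import numpy as np

    def to_greyscale(rgb):
        # rgb: H x W x 3 float array; returns the intensity image I.
        # BT.601 luma weights, one common convention among others.
        return 0.299 * rgb[..., 0] + 0.587 * rgb[..., 1] + 0.114 * rgb[..., 2]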

Correlative measure: symmetric Vollath F4

This measure exploits the idea that a sharper image should have larger gradients between neighbour pixels. For each axis and direction, it sums over the image the intensity of each pixel multiplied by the intensity difference of its next two neighbours in that direction, and takes the absolute value of that sum. The formula is as follows: $$ \begin{array}{l} \frac{1}{n}\left(\left|\sum_{x,y}I_{x,y}\left(I_{x+1,y}-I_{x+2,y}\right)\right|+\right.\\ \left|\sum_{x,y}I_{x,y}\left(I_{x-1,y}-I_{x-2,y}\right)\right|+\\ \left|\sum_{x,y}I_{x,y}\left(I_{x,y+1}-I_{x,y+2}\right)\right|+\\ \left.\left|\sum_{x,y}I_{x,y}\left(I_{x,y-1}-I_{x,y-2}\right)\right|\right)\\ \end{array} $$
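As an illustration, here is a possible numpy implementation of this formula; it assumes the image is already a 2D greyscale float array and simply skips the borders where the shifted neighbours would fall outside the image:

    import numpy as np

    def vollath_f4_symmetric(img):
        # Symmetric Vollath F4 focus measure on a 2D greyscale array.
        I = np.asarray(img, dtype=np.float64)
        n = I.size
        # For each axis and direction, sum of I(x,y) times the difference of
        # its next two neighbours, then absolute value of that sum.
        s = abs(np.sum(I[:, :-2] * (I[:, 1:-1] - I[:, 2:])))    # +x direction
        s += abs(np.sum(I[:, 2:] * (I[:, 1:-1] - I[:, :-2])))   # -x direction
        s += abs(np.sum(I[:-2, :] * (I[1:-1, :] - I[2:, :])))   # +y direction
        s += abs(np.sum(I[2:, :] * (I[1:-1, :] - I[:-2, :])))   # -y direction
        return s / n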

Differential measure: Tenengrad

This measure also exploits the idea that a sharper image should have larger gradients between neighbour pixels, but it instead uses the squared norm, at each pixel, of the Sobel gradient filter applied to the image. The formula is as follows: $$ \frac{1}{n}\sum_{x,y}\left((S^{\bf{x}}_{x,y})^2+(S^{\bf{y}}_{x,y})^2\right) $$ where \(S^{\bf{x}}_{x,y}\) and \(S^{\bf{y}}_{x,y}\) are respectively the output of the Sobel filter at position (x,y) along the \(\bf{x}\) and \(\bf{y}\) axes.
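Again as an illustration, a possible numpy sketch; the Sobel responses are computed with shifted views of the image, on interior pixels only, so no extra dependency is needed:

    import numpy as np

    def tenengrad(img):
        # Tenengrad focus measure: mean squared norm of the Sobel gradient,
        # averaged over the interior pixels (border pixels are skipped).
        I = np.asarray(img, dtype=np.float64)
        # Horizontal Sobel response on interior pixels.
        sx = (I[:-2, 2:] + 2.0 * I[1:-1, 2:] + I[2:, 2:]
              - I[:-2, :-2] - 2.0 * I[1:-1, :-2] - I[2:, :-2])
        # Vertical Sobel response on interior pixels.
        sy = (I[2:, :-2] + 2.0 * I[2:, 1:-1] + I[2:, 2:]
              - I[:-2, :-2] - 2.0 * I[:-2, 1:-1] - I[:-2, 2:])
        return np.mean(sx ** 2 + sy ** 2)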

Statistical measure: normalized variance

This measure exploits the idea that blurred images have a relatively more uniform intensity, hence a lower variance in the pixel intensity values. It's simply the variance divided by the squared mean of the intensity values: $$ \frac{\sigma^2_{x,y}(I)}{\mu^2_{x,y}(I)} $$
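This one translates to a near one-liner in numpy; dividing by the squared mean makes the measure insensitive to a global change of brightness between frames:

    import numpy as np

    def normalized_variance(img):
        # Variance of the intensities divided by their squared mean.
        I = np.asarray(img, dtype=np.float64)
        return I.var() / (I.mean() ** 2)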

Spectral measure: normalized DFT Shannon entropy

This measure applies the Fourier transform to the image and works on the spectral image instead. It is convenient because it makes it simple to filter out the high frequencies, in other words to remove the noise, which has a bad influence on all contrast detection methods. The spectral image of a well focused image should have a higher Shannon entropy than that of a blurry one.

I couldn't make sense of this one and finally gave up on spectral measures because I had spent too much time trying to understand them. I note the formula here anyway: $$ -\frac{1}{4r^2}\sum_{(-r\lt x\lt r),(-r\lt y\lt r)}\frac{F_{x,y}(I)}{L_2(F(I))}\left|log_2\left(\frac{F_{x,y}(I)}{L_2(F(I))}\right)\right| $$ where, as far as I can understand, \(F(I)\) is the discrete Fourier transform of the image \(I\) (see also this previous post about the DFT), \(L_2(F(I))\) is the Euclidean norm of the spectral image (square root of the sum of the squared intensity values), and \(r\) is the radius of the low-pass filter. \(-r\lt x\lt r\) suggests negative coordinates in the spectral image; are those relative coordinates in the centered version of the spectral image? If you can help me understand, please email me!

Differential+statistical measure: Laplacian-variance

One can also combine measures from different classes to make a new one, like the Laplacian-variance measure. It uses the Laplacian to detect edges and the variance to measure how strong they are. A sharper image should have well detected edges, hence a higher variance in the intensity values of the edge map. The formula is as follows: $$ \sigma^2_{x,y}(D^2_{x,y}(I)) $$ where $$ \begin{array}{l} D^2_{x,y}(I) =\\ (I(x-1,y-1)+I(x+1,y-1))/4\\ +(I(x-1,y+1)+I(x+1,y+1))/4\\ +(I(x+1,y)+I(x-1,y))/2\\ +(I(x,y+1)+I(x,y-1))/2\\ -3I(x,y) \end{array} $$
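A possible numpy sketch of this combined measure, using the discrete Laplacian kernel from the formula above (corners weighted 1/4, edge neighbours 1/2, centre -3), evaluated on interior pixels only:

    import numpy as np

    def laplacian_variance(img):
        # Variance of the discrete Laplacian (edge map) of the image.
        I = np.asarray(img, dtype=np.float64)
        d2 = ((I[:-2, :-2] + I[:-2, 2:] + I[2:, :-2] + I[2:, 2:]) / 4.0       # corner neighbours
              + (I[1:-1, :-2] + I[1:-1, 2:] + I[:-2, 1:-1] + I[2:, 1:-1]) / 2.0  # edge neighbours
              - 3.0 * I[1:-1, 1:-1])                                          # centre pixel
        return d2.var()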

Test dataset

Next I want to see how these measures perform on a dataset of pictures. A synthetic one has the advantage of allowing me to control all parameters and to know exactly what the result should be, hence also to verify that my implementation is correct. Time to use again that good old POV-Ray!

The scene is made of three spheres in front of the camera at different distances. 100 frames are rendered with progressively increasing focus distance. The level of noise can be controlled with the BlurSample variable; I've used 100 (no noise) and 10 (noise). It can be rendered with the command: povray focus.pov +ODataset001/focus.png +H400 +W800 +A +KFI0 +KFF99 +KI-3.0 +KF3.0 +HR Declare=BlurSample=100.

Synthetic datasets are interesting but don't guarantee things would work the same in real life, hence I also took an afternoon to shoot some real pictures. The final dataset is made of 17 scenes and a total of 615 pictures. I've tried to choose scenes as varied as possible, within the limits of time and material. The camera was on a tripod, but not perfectly still. The scenes contain as few moving parts as possible, and the lighting conditions barely change between frames. Also, trying to scan the focus range regularly by hand was quite challenging! But in the end I think I was able to gather some useful data. I must mention that I also did a quick search on the web for already existing datasets, but haven't found anything except two (1, 2) that didn't satisfy me (only pairs of images).

If you'd like to perform tests yourself, my dataset is available here.

I introduce each scene in the dataset below with a short description, a video of all the images in the scene, and one example image of the scene (with the ROIs, if any, displayed as black or white rectangles; if there is no ROI, the whole image is considered).

Synthetic images. Three white spheres in front of the camera at different distances.

Results

Results are shown in the plots below. All results are mapped from [min,max] to [0,1] to be easily comparable. Grey areas correspond to frames correctly focused (according to myself) for the scene's ROIs (or the whole image if there is no ROI). The abscissa is the frame index, the ordinate is the focus score; the higher, the better focused.

Comments on results

On the synthetic dataset without noise (Dataset001), given a ROI, all metrics but VollathF4 on ROI 2 correctly detect the frame in focus. However they do it more or less efficiently. LaplacianVar has the clearest response, followed by Tenengrad, and then NormVar and VollathF4. If given the whole image, NormVar fails completely, VollathF4 and Tenengrad detect the first two spheres but not the third one, and LaplacianVar returns an impressively clear signal for all three spheres.

When we add noise to the synthetic dataset (Dataset002), the results are almost the same, just a bit noisier, as could be expected. VollathF4 now fails on ROI 2 and 3. There is a very interesting effect on LaplacianVar: near the frame in focus there is still a strong local maximum, but it's now spoiled by a strong response on frames far from the focus, to the point that NormVar and Tenengrad are now more reliable.

For real photographs, the responses are really good. Given a ROI, all measures detect correctly (in the sense that the maximum response matches the frame in focus), except for Dataset017 where the branches in the background of the ROI produce a stronger signal than the intended flower in the foreground, whatever the metric used. Overall, the order of quality of the response is quite clearly (from best to worst) LaplacianVar, Tenengrad, VollathF4, NormVar.

In conclusion, given the results on this dataset, I would recommend using LaplacianVar by default, and consider Tenengrad if strong noise is expected.

Repurposing the focus measurement

This article could end here, however all this gave me a few ideas: couldn't this focus measurement be repurposed? It would be a shame not to further exploit a signal as clear as the one returned by LaplacianVar...

One idea is that instead of giving a ROI and measuring the focus, one could do the opposite and get the ROI from the focus. Take an image, split it into small chunks, calculate the focus score for each chunk, and you get a heat map of the areas in the image that are in focus. One use I imagine is, after the user has set up an image capture device, instead of asking the user to also indicate the ROI in the software processing the captured images, it would be done automatically based on that heat map. Another application of such a heat map, which I've found in this paper, is to use it for automated focus quality assessment in medical imagery.
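A minimal sketch of that heat map idea, reusing the laplacian_variance sketch above (the chunk size is an arbitrary choice of mine):

    import numpy as np

    def focus_heat_map(img, chunk=32, focus_score=laplacian_variance):
        # Split the image into chunk x chunk tiles and score each one; the
        # resulting small 2D array is the focus heat map (higher = sharper).
        I = np.asarray(img, dtype=np.float64)
        rows, cols = I.shape[0] // chunk, I.shape[1] // chunk
        heat = np.zeros((rows, cols))
        for r in range(rows):
            for c in range(cols):
                tile = I[r * chunk:(r + 1) * chunk, c * chunk:(c + 1) * chunk]
                heat[r, c] = focus_score(tile)
        return heat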

Another idea: the focus measurement tells us when an object in the scene is in the focal plane. We can calculate the distance from the camera to the focal plane based on the lens properties, as well as the ray direction for each pixel. Hence we can estimate, for each pixel, the 3D coordinates of the corresponding location in the scene. Yes, that would be super rough, but filtering only those pixels with a good focus measurement response should give us a not-too-bad point cloud of edges. As a complement to other photogrammetric methods, or as a "better than nothing" solution, it may be interesting. Indeed, looking for that idea on the web led me to this paper which studies exactly the same idea.
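To fix ideas, here is a very rough sketch of that depth-from-focus idea under a simple pinhole camera assumption; the inputs (one focus heat map per frame of a focus scan, for instance from focus_heat_map above, the focal plane distance of each frame, and the horizontal field of view) and the thresholding are my own simplifications, not the method of the paper mentioned above:

    import numpy as np

    def depth_from_focus(heats, focus_distances, fov_x, threshold):
        # heats: array (nbFrames, rows, cols) of focus heat maps, one per frame
        # focus_distances: focal plane distance for each frame of the scan
        # fov_x: horizontal field of view of the camera, in radians
        heats = np.asarray(heats, dtype=np.float64)
        best = heats.argmax(axis=0)      # frame of best focus for each cell
        response = heats.max(axis=0)     # strength of that best response
        rows, cols = best.shape
        focal = (cols / 2.0) / np.tan(fov_x / 2.0)   # focal length, in cell units
        points = []
        for r in range(rows):
            for c in range(cols):
                if response[r, c] < threshold:
                    continue                          # keep only well focused cells
                z = focus_distances[best[r, c]]       # depth of the focal plane
                x = (c + 0.5 - cols / 2.0) * z / focal
                y = (r + 0.5 - rows / 2.0) * z / focal
                points.append((x, y, z))
        return np.array(points)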

A little more extravagant idea would be a system that tracks what the user is/was looking at and guesses what interests him/her in the scene based on what he/she is focusing on. A bit sci-fi, but clearly there is a lot more to explore around this concept of focus measurement!

2026-01-18
A comment, question, correction? A project we could work together on? Email me!
Learn more about me in my profile.

Copyright 2021-2026 Baillehache Pascal