A slight speed improvement on the DIANA algorithm

Recently I had to do some data clustering work for a client. It was an occasion to learn about the DIANA algorithm, introduced by Kaufman and Rousseeuw in this book and described in this Wikipedia page. In this article I'll present a slight speed improvement to the algorithm as it is described on Wikipedia.

I should clarify right away that I don't have access to the book and have read only the pages available on books.google. I may be missing something from the other pages; if someone wants to correct me, comments are always welcome!

The formal algorithm introduced in the Wikipedia page looked odd to me. First, the goal is to construct a hierarchical clustering (in other words, a tree structure); however, step 2.5 simply adds the new clusters resulting from the split to the initial set made of one cluster containing all elements. So it is an algorithm transforming a set of one cluster containing n elements into a set of n clusters containing one element each. How to get a tree out of this is only mentioned later in the article. In the available pages of the original book there also seems to be no explanation of how one creates the resulting tree; it seems only implied by figure 1, page 257.

Second, step 2.1 looks for the largest cluster among all clusters created so far. However, the splitting of a cluster depends only on that cluster, not on the others. Hence, the order of the splits doesn't influence the result at all, and we don't really care which cluster is the largest.

One explanation I can see for why it is presented that way is that the algorithm, as it was initially designed and intended to be used, sequentially output the current clusters at each step (guessing from the textual output in the book). Instead of displaying a tree, the sequence of splits itself was probably what made up the analysis.

If we assume instead that we have a way to display the final tree with its branching structure, and that we're only interested in building that resulting tree, not in the creation steps, I think we can make a speed optimisation. As the order doesn't matter any more, instead of searching for the largest cluster at each step, one can simply use whichever cluster is the most convenient. A particular choice is to traverse the resulting tree in a depth-first manner as it is constructed. Of the three steps of the algorithm ("search the largest cluster", "search the splitting element", "migrate elements"), the first one disappears, and one can expect a speed improvement.

Here comes the pseudocode:
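(What follows is a minimal C sketch of that depth-first variant; the 2D point elements, the Euclidean dissimilarity, and the binary tree layout are assumptions made for illustration, not necessarily the exact implementation benchmarked below.)

```c
/* Sketch of the depth-first DIANA variant. Compile with: cc diana.c -lm
   Memory management is omitted for brevity. */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

typedef struct { double x, y; } Point;

typedef struct Node {
  int* elems;          /* indices into the dataset */
  int nbElems;
  struct Node* left;   /* NULL for a leaf */
  struct Node* right;
} Node;

static Point* data;    /* the dataset, global for brevity */

static double Dissimilarity(int a, int b) {
  double dx = data[a].x - data[b].x, dy = data[a].y - data[b].y;
  return sqrt(dx * dx + dy * dy);
}

/* Average dissimilarity of elems[i] to the other members of elems. */
static double AvgDist(int const* elems, int nb, int i) {
  if (nb < 2) return 0.0;
  double sum = 0.0;
  for (int j = 0; j < nb; ++j)
    if (j != i) sum += Dissimilarity(elems[i], elems[j]);
  return sum / (double)(nb - 1);
}

/* Split one cluster and recurse depth-first on the two halves; there is
   no "search the largest cluster" step, as the order of the splits does
   not change the resulting tree. */
static Node* Split(int* elems, int nbElems) {
  Node* node = malloc(sizeof(Node));
  node->elems = elems;
  node->nbElems = nbElems;
  node->left = node->right = NULL;
  if (nbElems < 2) return node;
  /* The seed of the splinter group is the element with the largest
     average dissimilarity to the rest of the cluster. */
  int seed = 0;
  double seedAvg = AvgDist(elems, nbElems, 0);
  for (int i = 1; i < nbElems; ++i) {
    double avg = AvgDist(elems, nbElems, i);
    if (avg > seedAvg) { seedAvg = avg; seed = i; }
  }
  int* splinter = malloc(nbElems * sizeof(int));
  int* rest = malloc(nbElems * sizeof(int));
  int nbSplinter = 0, nbRest = 0;
  splinter[nbSplinter++] = elems[seed];
  for (int i = 0; i < nbElems; ++i)
    if (i != seed) rest[nbRest++] = elems[i];
  /* Migration: move, one at a time, the element closer on average to
     the splinter group than to the rest of the old cluster. */
  int moved = 1;
  while (moved && nbRest > 1) {
    moved = 0;
    int cand = -1;
    double candDiff = 0.0;
    for (int i = 0; i < nbRest; ++i) {
      double toSplinter = 0.0;
      for (int j = 0; j < nbSplinter; ++j)
        toSplinter += Dissimilarity(rest[i], splinter[j]);
      toSplinter /= (double)nbSplinter;
      double diff = AvgDist(rest, nbRest, i) - toSplinter;
      if (diff > candDiff) { candDiff = diff; cand = i; }
    }
    if (cand >= 0) {
      splinter[nbSplinter++] = rest[cand];
      rest[cand] = rest[--nbRest];
      moved = 1;
    }
  }
  /* Depth-first recursion on the two new clusters. */
  node->left = Split(splinter, nbSplinter);
  node->right = Split(rest, nbRest);
  return node;
}

int main(void) {
  enum { NB = 8 };
  data = malloc(NB * sizeof(Point));
  int* elems = malloc(NB * sizeof(int));
  for (int i = 0; i < NB; ++i) {
    data[i].x = (double)rand() / RAND_MAX;
    data[i].y = (double)rand() / RAND_MAX;
    elems[i] = i;
  }
  Node* root = Split(elems, NB);
  printf("root: %d elements\n", root->nbElems);
  return 0;
}
```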

Let's make some tests on datasets of various sizes, using random 2D vectors (uniformly distributed in \([0,1]^2\)) as elements, comparing the version above with the standard version. I've clustered the same set of 1000 datasets of various sizes (100, 200, 300, 400, 500 elements) and measured the average relative time of the 'fast' version compared to the 'standard' version. I've also checked that both versions return the same result tree and that it was correctly constructed, to ensure I hadn't completely messed up the implementation. The results are as follows (graph generated using gnuplot; average and \(3\sigma\) error bar for each dataset size; the lower, the bigger the speed improvement):
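(For reference, the measurement of one run can be sketched as below; DianaStandard, DianaFast and SameTree are hypothetical names standing in for the two implementations and the tree-equality check, and Node is the type from the sketch above.)

```c
/* Sketch of the relative-time measurement for one dataset.
   DianaStandard, DianaFast and SameTree are hypothetical stand-ins,
   declared here and to be provided elsewhere. */
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

Node* DianaStandard(int* elems, int nbElems);
Node* DianaFast(int* elems, int nbElems);
int SameTree(Node const* a, Node const* b);

double MeasureRatio(int const* elems, int nbElems) {
  /* Both versions get identical copies of the input. */
  int* a = malloc(nbElems * sizeof(int));
  int* b = malloc(nbElems * sizeof(int));
  memcpy(a, elems, nbElems * sizeof(int));
  memcpy(b, elems, nbElems * sizeof(int));
  clock_t t0 = clock();
  Node* treeStd = DianaStandard(a, nbElems);
  clock_t t1 = clock();
  Node* treeFast = DianaFast(b, nbElems);
  clock_t t2 = clock();
  /* Both versions must return the same tree. */
  assert(SameTree(treeStd, treeFast));
  /* A ratio below 1.0 means the 'fast' version was indeed faster. */
  return (double)(t2 - t1) / (double)(t1 - t0);
}
```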

On these tests the alternative version was indeed faster, but only by a very small margin. Moreover, as the dataset size increases, that margin shrinks. Profiling the code helps understand why. On the graph below (generated using gprof, gprof2dot, and dot), one can clearly see how the "migration" step (GetAverageDist) dwarfs the two other steps (FindLargestTree and GetDissimilarity). Looking back at the algorithm, I think the migration step is \(O(n^3)\) against \(O(n^2)\) for the others.
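A back-of-the-envelope count to make that plausible (for a cluster of \(m\) elements): finding the splitting element computes \(m\) average distances of cost \(O(m)\) each, i.e. \(O(m^2)\); each migration pass also recomputes \(O(m)\) average distances for \(O(m^2)\), but up to \(m-1\) elements can migrate one by one, so the migration step can reach \(O(m^3)\). The largest-cluster search, by contrast, only scans the list of current clusters, so removing it cannot touch the dominant term.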

So, in conclusion: yes, it's a speed improvement, but a very small one, to the point that it becomes insignificant for very large datasets. Bummer!

I couldn't leave my dear readers empty-handed after they've read this far, so as a consolation here is how I display trees with some nice ASCII art:
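(A minimal sketch of such a display; the exact layout characters are my own choice, and Node is the type from the sketch above.)

```c
/* Sketch of an ASCII tree display: recursive printing, accumulating
   the branch characters in a prefix string. */
#include <stdio.h>

void PrintTree(Node const* node, char const* prefix, int isLast) {
  printf("%s%s", prefix, isLast ? "`-" : "+-");
  if (node->left == NULL) {
    /* A leaf holds exactly one element: print its index. */
    printf("%d\n", node->elems[0]);
    return;
  }
  printf("+\n");
  /* Extend the prefix for the children: a vertical bar while the
     current branch is still open, blanks otherwise. */
  char child[1024];
  snprintf(child, sizeof(child), "%s%s", prefix, isLast ? "  " : "| ");
  PrintTree(node->left, child, 0);
  PrintTree(node->right, child, 1);
}

/* Usage: PrintTree(root, "", 1); */
```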

And to conclude with a real-world example, here is the hierarchical clustering of the famous iris dataset, obtained with the 'fast' version of the DIANA algorithm and the ASCII art printing introduced above:

2025-06-17
