A slight speed improvement on the DIANA algorithm

Recently I had to do some data clustering work for a client. It was an occasion to learn about the DIANA algorithm, introduced by Kaufman and Rousseeuw in this book and described in this Wikipedia page. In this article I'll present a slight speed improvement to the algorithm as it is described on Wikipedia.

I should clarify right away that I don't have access to the book and have read only the pages available on books.google. I may be missing something from the other pages; if someone wants to correct me, comments are always welcome!

The formal algorithm introduced in the Wikipedia page looked odd to me. First, the goal is to construct a hierarchical clustering (in other words, a tree structure); however, step 2.5 simply adds the new clusters resulting from the split to the initial set made of one cluster containing all elements. So it is an algorithm transforming a set of one cluster containing n elements into a set of n clusters containing one element each. How to get a tree out of this is only mentioned later in the article. In the available pages of the original book there also seems to be no explanation of how one creates the resulting tree; it seems only implied by figure 1, page 257.

Second, step 2.1 looks for the largest cluster among all clusters created so far. However, the splitting of a cluster depends only on that cluster, not on the others. Hence, the order of the splits doesn't influence the result at all, and we don't really care which cluster is the largest.

One explanation I can see for why it is presented that way is that the algorithm, as it was initially designed and intended to be used, sequentially output the current clusters at each step (guessing from the textual output in the book). Instead of displaying a tree, the sequence of splits itself was probably what made up the analysis.

If we assume instead that we have a way to display the final tree with its branching structure, and that we're only interested in building that resulting tree, not in the creation steps, I think we can make a speed optimisation. As the order doesn't matter any more, instead of searching for the largest cluster at each step, one can simply use whichever cluster is the most convenient. A particular choice is to traverse the resulting tree in a depth-first manner as it is constructed. Of the three steps of the algorithm ("search the largest cluster", "search the splitting element", "migrate elements"), the first one disappears, and one can expect a speed improvement.

Here comes the pseudocode:
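(What follows is a minimal C sketch of that depth-first variant; the 2D point elements, the Euclidean dissimilarity, and the binary tree layout are assumptions made for illustration, not necessarily the exact implementation benchmarked below.)

```c
/* Sketch of the depth-first DIANA variant. Compile with: cc diana.c -lm
   Memory management is omitted for brevity. */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

typedef struct { double x, y; } Point;

typedef struct Node {
  int* elems;          /* indices into the dataset */
  int nbElems;
  struct Node* left;   /* NULL for a leaf */
  struct Node* right;
} Node;

static Point* data;    /* the dataset, global for brevity */

static double Dissimilarity(int a, int b) {
  double dx = data[a].x - data[b].x, dy = data[a].y - data[b].y;
  return sqrt(dx * dx + dy * dy);
}

/* Average dissimilarity of elems[i] to the other members of elems. */
static double AvgDist(int const* elems, int nb, int i) {
  if (nb < 2) return 0.0;
  double sum = 0.0;
  for (int j = 0; j < nb; ++j)
    if (j != i) sum += Dissimilarity(elems[i], elems[j]);
  return sum / (double)(nb - 1);
}

/* Split one cluster and recurse depth-first on the two halves; there is
   no "search the largest cluster" step, as the order of the splits does
   not change the resulting tree. */
static Node* Split(int* elems, int nbElems) {
  Node* node = malloc(sizeof(Node));
  node->elems = elems;
  node->nbElems = nbElems;
  node->left = node->right = NULL;
  if (nbElems < 2) return node;
  /* The seed of the splinter group is the element with the largest
     average dissimilarity to the rest of the cluster. */
  int seed = 0;
  double seedAvg = AvgDist(elems, nbElems, 0);
  for (int i = 1; i < nbElems; ++i) {
    double avg = AvgDist(elems, nbElems, i);
    if (avg > seedAvg) { seedAvg = avg; seed = i; }
  }
  int* splinter = malloc(nbElems * sizeof(int));
  int* rest = malloc(nbElems * sizeof(int));
  int nbSplinter = 0, nbRest = 0;
  splinter[nbSplinter++] = elems[seed];
  for (int i = 0; i < nbElems; ++i)
    if (i != seed) rest[nbRest++] = elems[i];
  /* Migration: move, one at a time, the element closer on average to
     the splinter group than to the rest of the old cluster. */
  int moved = 1;
  while (moved && nbRest > 1) {
    moved = 0;
    int cand = -1;
    double candDiff = 0.0;
    for (int i = 0; i < nbRest; ++i) {
      double toSplinter = 0.0;
      for (int j = 0; j < nbSplinter; ++j)
        toSplinter += Dissimilarity(rest[i], splinter[j]);
      toSplinter /= (double)nbSplinter;
      double diff = AvgDist(rest, nbRest, i) - toSplinter;
      if (diff > candDiff) { candDiff = diff; cand = i; }
    }
    if (cand >= 0) {
      splinter[nbSplinter++] = rest[cand];
      rest[cand] = rest[--nbRest];
      moved = 1;
    }
  }
  /* Depth-first recursion on the two new clusters. */
  node->left = Split(splinter, nbSplinter);
  node->right = Split(rest, nbRest);
  return node;
}

int main(void) {
  enum { NB = 8 };
  data = malloc(NB * sizeof(Point));
  int* elems = malloc(NB * sizeof(int));
  for (int i = 0; i < NB; ++i) {
    data[i].x = (double)rand() / RAND_MAX;
    data[i].y = (double)rand() / RAND_MAX;
    elems[i] = i;
  }
  Node* root = Split(elems, NB);
  printf("root: %d elements\n", root->nbElems);
  return 0;
}
```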

Let's make some tests on datasets of various sizes, using random 2D vectors (uniformly distributed in \([0,1]^2\)) as elements, comparing the version above with the standard version. I've clustered the same set of 1000 datasets of various sizes (100, 200, 300, 400, 500 elements) and measured the average relative time of the 'fast' version compared to the 'standard' version. I've also checked that both versions return the same result tree and that it was correctly constructed, to ensure I hadn't completely messed up the implementation. The results are as follows (graph generated using gnuplot; average and \(3\sigma\) error bar for each dataset size; the lower, the bigger the speed improvement):
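(For reference, the measurement of one run can be sketched as below; DianaStandard, DianaFast and SameTree are hypothetical names standing in for the two implementations and the tree-equality check, and Node is the type from the sketch above.)

```c
/* Sketch of the relative-time measurement for one dataset.
   DianaStandard, DianaFast and SameTree are hypothetical stand-ins,
   declared here and to be provided elsewhere. */
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

Node* DianaStandard(int* elems, int nbElems);
Node* DianaFast(int* elems, int nbElems);
int SameTree(Node const* a, Node const* b);

double MeasureRatio(int const* elems, int nbElems) {
  /* Both versions get identical copies of the input. */
  int* a = malloc(nbElems * sizeof(int));
  int* b = malloc(nbElems * sizeof(int));
  memcpy(a, elems, nbElems * sizeof(int));
  memcpy(b, elems, nbElems * sizeof(int));
  clock_t t0 = clock();
  Node* treeStd = DianaStandard(a, nbElems);
  clock_t t1 = clock();
  Node* treeFast = DianaFast(b, nbElems);
  clock_t t2 = clock();
  /* Both versions must return the same tree. */
  assert(SameTree(treeStd, treeFast));
  /* A ratio below 1.0 means the 'fast' version was indeed faster. */
  return (double)(t2 - t1) / (double)(t1 - t0);
}
```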

On these tests the alternative version was indeed faster, but only by a very small margin. Moreover, as the dataset size increases, that margin shrinks. Profiling the code helps understand why. On the graph below (generated using gprof, gprof2dot, and dot), one can clearly see how the "migration" step (GetAverageDist) dwarfs the two other steps (FindLargestTree and GetDissimilarity). Looking back at the algorithm, I think the migration step is \(O(n^3)\) against \(O(n^2)\) for the others.
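A back-of-the-envelope count to make that plausible (for a cluster of \(m\) elements): finding the splitting element computes \(m\) average distances of cost \(O(m)\) each, i.e. \(O(m^2)\); each migration pass also recomputes \(O(m)\) average distances for \(O(m^2)\), but up to \(m-1\) elements can migrate one by one, so the migration step can reach \(O(m^3)\). The largest-cluster search, by contrast, only scans the list of current clusters, so removing it cannot touch the dominant term.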

So, in conclusion: yes, it's a speed improvement, but a very small one, to the point that it becomes insignificant for very large datasets. Bummer!

I couldn't leave my dear readers empty-handed after they've read this far, so as a consolation here is how I display trees with some nice ASCII art:
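(A minimal sketch of such a display; the exact layout characters are my own choice, and Node is the type from the sketch above.)

```c
/* Sketch of an ASCII tree display: recursive printing, accumulating
   the branch characters in a prefix string. */
#include <stdio.h>

void PrintTree(Node const* node, char const* prefix, int isLast) {
  printf("%s%s", prefix, isLast ? "`-" : "+-");
  if (node->left == NULL) {
    /* A leaf holds exactly one element: print its index. */
    printf("%d\n", node->elems[0]);
    return;
  }
  printf("+\n");
  /* Extend the prefix for the children: a vertical bar while the
     current branch is still open, blanks otherwise. */
  char child[1024];
  snprintf(child, sizeof(child), "%s%s", prefix, isLast ? "  " : "| ");
  PrintTree(node->left, child, 0);
  PrintTree(node->right, child, 1);
}

/* Usage: PrintTree(root, "", 1); */
```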

And to conclude with a real-world example, here is the hierarchical clustering of the famous iris dataset, obtained with the 'fast' version of the DIANA algorithm and the ASCII art printing introduced above:

2025-06-17
