Essays/Dendrite
The dendrite method described here differs from other cluster analysis techniques in that building a spanning tree of nodes is essential, and clusters are viewed as subtrees. In construction, when nodes are combined into clusters, new edges are still added between nodes (with least distance), not between clusters embracing the nodes in them. It is close to the Single Linkage (nearest neighbor) method.
The process of building a dendrite is the same as in Kruskal algorithm for Minimal Spanning Tree. Each step, adding another edge, modifies the set of clusters from n single-node clusters to one n-node cluster, in n-1 steps.
In the original fine implementation of Kruskal algorithm by Roger Hui, f is the forest key, marking all nodes of every tree with the index of the first node in it.
The almost identical verb mstc is modified
- to accumulate the keys in each iteration into a cluster generations table, and append to the resulting edges list;
- and to guarantee that cluster is identified by the lowest index node.
mstc=: 4 : 0 z=. 0 2$f=. i.k=. x.[w=. 0$~0,x. for_e. y. do. if. ~:/j=. |.^:(>/) e{f do. z=. z,e if. 1=k=.<:k do. z;w,x.#0 return. end. w=.w,f=. ({.j) (f I.@:= {:j)} f end. end. assert. 0 [ 'graph is not connected' )
This gives the steps of cluster building iterations. In each step, a new edge is added and with it some two trees merge. The iteration goes just after a forest of signle-node trees to a single tree of all nodes.
e=: _2]\1 3 5 6 3 4 3 6 4 6 3 5 2 5 1 4 0 1 2 3 0 2 0 3 7 mstc e +---+-------------+ |1 3|0 1 2 1 4 5 6| |5 6|0 1 2 1 4 5 5| |3 4|0 1 2 1 1 5 5| |3 6|0 1 2 1 1 1 1| |2 5|0 2 2 2 2 2 2| |0 1|0 0 0 0 0 0 0| +---+-------------+
In a key cluster diagram the iterations look like this
boxclust=: (</. i.@#)"1 boxclust 1{::7 mstc e +-------------+-----------+-+---+---+-+ |0 |1 3 |2|4 |5 |6| +-------------+-----------+-+---+---+-+ |0 |1 3 |2|4 |5 6| | +-------------+-----------+-+---+---+-+ |0 |1 3 4 |2|5 6| | | +-------------+-----------+-+---+---+-+ |0 |1 3 4 5 6 |2| | | | +-------------+-----------+-+---+---+-+ |0 |1 2 3 4 5 6| | | | | +-------------+-----------+-+---+---+-+ |0 1 2 3 4 5 6| | | | | | +-------------+-----------+-+---+---+-+
The following will convert a data set given in a coordinate list to a complete graph with edges sorted by Euclidean distances between the adjacent nodes.
dist=: +/&.:*:@:-"1/~ sharma=: _2]\5 5 6 6 15 14 16 15 25 20 30 19 dist sharma 0 1.41421 13.4536 14.8661 25 28.6531 1.41421 0 12.0416 13.4536 23.6008 27.2947 13.4536 12.0416 0 1.41421 11.6619 15.8114 14.8661 13.4536 1.41421 0 10.2956 14.5602 25 23.6008 11.6619 10.2956 0 5.09902 28.6531 27.2947 15.8114 14.5602 5.09902 0 od2=: 2&# #: i.@*: NB. 2 coord odometer edgesort=: ((] /: ({~ <"1)) (#~ </"1)) od2@# |:edgesort dist sharma 0 2 4 3 2 1 0 1 3 0 2 1 0 1 0 1 3 5 4 4 2 2 3 5 3 5 4 4 5 5
Let us compare the Dendrite method with one of the methods by Brian Schott.
]D=: boxclust 1{:: (# mstc edgesort) dist sharma +-----------+-------+---+-+-+ |0 1 |2 |3 |4|5| +-----------+-------+---+-+-+ |0 1 |2 3 |4 |5| | +-----------+-------+---+-+-+ |0 1 |2 3 |4 5| | | +-----------+-------+---+-+-+ |0 1 |2 3 4 5| | | | +-----------+-------+---+-+-+ |0 1 2 3 4 5| | | | | +-----------+-------+---+-+-+ D -: WM cluster sharma 1
Next of Kin is such order of nodes in cluster generations,
that at every step of Kruskal process merging always happens
between two adjacent trees.
To obtain next of kin order, we employ cumulative key ordering. The result has the same shape as cluster generations, with rows shifted down by one: adding a first row with the permutation defining the next of kin; and removing the last trivial row.
nok=: 3 : 0 z=. (i.{:$y.),}:y. for_i. i.&.<:#y. do. z=. z /:"1 i{z end. )
In such order, nodes do not change places between generations. Thus it is possible to obtain a hierarchical (icicle) diagram with linkage distances.
trimbox=: {.@(];.0~ _2 _2&({.!.1)@(-&2)@{:@$)@": iceview=: (# {. }.) ([: trimbox </.)"1 {. linkage=: {~ <"1 icicle=: {.@] , <@:iceview@nok@(1&{::)@] , (<@,.@linkage 0&{::) (icicle (# mstc edgesort)) dist 4|.sharma +---+-----------+-------+ |2 3|0|1|4|5|2 3|1.41421| |4 5|0|1|4 5|2 3|1.41421| |0 1|0 1|4 5|2 3|5.09902| |0 5|0 1 4 5|2 3|10.2956| |3 4|0 1 4 5 2 3|12.0416| +---+-----------+-------+
Silhouette diagram is an image of the next of kin cluster generations, color-coded with node indices, which also identify growing clusters.
require 'viewmat' silnok=: 3 : 'nok 1{:: (# mstc edgesort) dist y.' silview=: viewmat@silnok
Compare the left figure with the icicle diagram above.
Nub Spanning Forest
Nub Spanning Forest (NSF) is a forest obtained after the Kruskal process, where an edge is added only if it brings along a free-standing node.
nsf=: #~ _2&(+./\ ~:)@, nsf edgesort dist sharma 0 1 2 3 4 5
Number of edges in NSF is between >.n%2 (all nodes in <.n%2 pairs) and n-1 (a single tree). The fewer edges, the better Single Linkage method is suited for that type of data set.
See also
- Applications, some interesting examples
- Essays/Minimal Spanning Tree
- User:Brian Schott/code/cluster
- Electronic Statistics Textbook. StatSoft, Inc.
Contributed by Oleg Kobchenko