User:Brian Schott/code/nonhier
Statistical cluster analysis can be done on data sets using a nonhierarchical approach in which the user designates n data items as cluster seeds, and then the clustering program iteratively revises the cluster seed values by finding the center of each cluster for which the points around each seed are the closest to their seed. The clusters' centers at each iteration becomes their new seed. Typically the number of iterations is very small. The simple method described is often referred to as kmeans and has the advantage over hierarchical cluster methods that very large sample sizes can be accommodated.
To help find starting seeds which likely partition the observations well, another iterative algorithm can be constructed to find starting seeds. The algorithm shown here is similar to the one used in SAS, but is different. The SAS algorithm always assumes the first n observations are the preseeds and considers each next observation in the data set as a candidate to replace an existing seed if that candidate is farther away from all of the existing seeds than the closest two existing seeds are from one another.
The algorithm here iteratively considers all the observations at every iteration, until no viable replacement candidate exists. At each iteration all of the nonseed observations are candidates for replacing one of the two existing closest seeds. For there to exist a viable replacement, at least one of the nonseed observations must be further away from all of the existing seeds than the closest two existing seeds are from one another. If that condition exists, then the replacement seed is the nonseed that is furthest from all of the existing seeds. This algorithm as implemented here has not been found in the literature, but seems to be similar to the tradtional one and in my opinion better suits J, and seems to produce decent results in a small amount of testing. I would be happy to get suggestions for implementing the traditional algorithm.
The methodology employed here is very basic and is inspired by the statistical analysis package developed by SAS in North Carolina. However, all errors are completely mine. An (extremely simple) example data set is included in the listing: sharma is data taken from a multivariate stats text by Sharma. A slightly more substantial, well-behaved and widely studied data set is also provided here: Fisher's iris data. The rightmost column of the iris data is the known cluster ID for the observation, and is not used in the cluster analysis, but is used for validation of results.
I have little experience with nonhierarchical clustering. Please let me know if you have any recommendations for this material.
download a script: File:Nonhiercluster.ijs
First some utility verbs are defined.
mean =: +/ % # length =: +/&.:*: dist=:length@:-"1 sut =: </~@i. NB. strictly upper triangular matrix isut =: I.@,@sut NB. indices of sut matrix CP=: {@(,&<) seedpairs =: isut@# {&, CP/~ iomin =: i.<./ NB. index of (IO) minimum iomax =: i.>./ NB. index of (IO) maximum newseedQ =: *./@:< NB. x<y everywhere?
Two data sets are provided.
sharma =: _2]\5 5 6 6 15 14 16 15 25 20 30 19 iris =: 150 5 $; ". each <;._2 noun define 50 33 14 02 1 64 28 56 22 3 65 28 46 15 2 67 31 56 24 3 63 28 51 15 3 46 34 14 03 1 69 31 51 23 3 62 22 45 15 2 59 32 48 18 2 46 36 10 02 1 61 30 46 14 2 60 27 51 16 2 65 30 52 20 3 56 25 39 11 2 65 30 55 18 3 58 27 51 19 3 68 32 59 23 3 51 33 17 05 1 57 28 45 13 2 62 34 54 23 3 77 38 67 22 3 63 33 47 16 2 67 33 57 25 3 76 30 66 21 3 49 25 45 17 3 55 35 13 02 1 67 30 52 23 3 70 32 47 14 2 64 32 45 15 2 61 28 40 13 2 48 31 16 02 1 59 30 51 18 3 55 24 38 11 2 63 25 50 19 3 64 32 53 23 3 52 34 14 02 1 49 36 14 01 1 54 30 45 15 2 79 38 64 20 3 44 32 13 02 1 67 33 57 21 3 50 35 16 06 1 58 26 40 12 2 44 30 13 02 1 77 28 67 20 3 63 27 49 18 3 47 32 16 02 1 55 26 44 12 2 50 23 33 10 2 72 32 60 18 3 48 30 14 03 1 51 38 16 02 1 61 30 49 18 3 48 34 19 02 1 50 30 16 02 1 50 32 12 02 1 61 26 56 14 3 64 28 56 21 3 43 30 11 01 1 58 40 12 02 1 51 38 19 04 1 67 31 44 14 2 62 28 48 18 3 49 30 14 02 1 51 35 14 02 1 56 30 45 15 2 58 27 41 10 2 50 34 16 04 1 46 32 14 02 1 60 29 45 15 2 57 26 35 10 2 57 44 15 04 1 50 36 14 02 1 77 30 61 23 3 63 34 56 24 3 58 27 51 19 3 57 29 42 13 2 72 30 58 16 3 54 34 15 04 1 52 41 15 01 1 71 30 59 21 3 64 31 55 18 3 60 30 48 18 3 63 29 56 18 3 49 24 33 10 2 56 27 42 13 2 57 30 42 12 2 55 42 14 02 1 49 31 15 02 1 77 26 69 23 3 60 22 50 15 3 54 39 17 04 1 66 29 46 13 2 52 27 39 14 2 60 34 45 16 2 50 34 15 02 1 44 29 14 02 1 50 20 35 10 2 55 24 37 10 2 58 27 39 12 2 47 32 13 02 1 46 31 15 02 1 69 32 57 23 3 62 29 43 13 2 74 28 61 19 3 59 30 42 15 2 51 34 15 02 1 50 35 13 03 1 56 28 49 20 3 60 22 40 10 2 73 29 63 18 3 67 25 58 18 3 49 31 15 01 1 67 31 47 15 2 63 23 44 13 2 54 37 15 02 1 56 30 41 13 2 63 25 49 15 2 61 28 47 12 2 64 29 43 13 2 51 25 30 11 2 57 28 41 13 2 65 30 58 22 3 69 31 54 21 3 54 39 13 04 1 51 35 14 03 1 72 36 61 25 3 65 32 51 20 3 61 29 47 14 2 56 29 36 13 2 69 31 49 15 2 64 27 53 19 3 68 30 55 21 3 55 25 40 13 2 48 34 16 02 1 48 30 14 01 1 45 23 13 03 1 57 25 50 20 3 57 38 17 03 1 51 38 15 03 1 55 23 40 13 2 66 30 44 14 2 68 28 48 14 2 54 34 17 02 1 51 37 15 04 1 52 35 15 02 1 58 28 51 24 3 67 30 50 17 2 63 33 60 25 3 53 37 15 02 1 )
The verb reviseseeds is presented first and you can experiment by trying examples like the following for which the y argument is the indices of the planned seeds.
seedrevisiondemo =: noun define sharma reviseseeds 0 1 2 sharma reviseseeds 5 4 3 2 (}:"1 iris) reviseseeds 35 41 70 35 41 70 {iris 35 89 70 {iris (}:"1 iris) reviseseeds 0 1 2 0 1 2 {iris 0 1 84 {iris ) reviseseeds =: dyad define data =. x seeds =. y notseeds =. seeds -.~ i.@#data minseedpair =: isut@# iomin@:{&,dist"1 2 &({&data)/~ closeseeds =. (minseedpair pick seedpairs)seeds NB. distances between seeds and nonseeds (objects) s2o =. seeds dist"1 _ &({&data) notseeds closepairdist =. dist &({&data)/closeseeds NB. distance between closest seeds while. +./ closepairdist newseedQ"0 1 |:s2o NB. if this is true ... do. NB. then newseed is the object with the greatest minimum dist to another seed newseed =. notseeds{~ iomax"0 1 <./s2o oldseed =. closeseeds {~ iomin newseed dist"1 _&({&data) closeseeds iooldseed =. seeds i. oldseed smoutput 'revision: ',": seeds =. newseed iooldseed}seeds notseeds =. seeds -.~ i.@#data closeseeds =. (minseedpair pick seedpairs)seeds NB. distances between seeds and nonseeds (objects) s2o =. seeds dist"1 _ &({&data) notseeds closepairdist =. dist &({&data)/closeseeds NB. distance between closest seeds end. seeds )
Nonhierarchical clustering is very simple in J as seen below.
demo =: noun define irs =: }:"1 iris irs&([mean/.~ [:iomin"1 dist"1 _)^:(i. 5)irs{~ irs reviseseeds 35 41 70 irs&([mean/.~ [:iomin"1 dist"1 _)^:(i. 5)irs{~ irs reviseseeds 0 1 2 )
Contributed by Brian Schott <<DateTime>>