User:Brian Schott/code/cluster
< User:Brian Schott | code
Jump to navigation
Jump to search
Statistical cluster analysis can be done on small data sets using a hierarchical approach which starts with all observations as single clusters, and combining two "close" clusters iteratively at each step. The methodology employed here is very basic and is inspired by the statistical analysis package developed by SAS in North Carolina. However, all errors are completely mine. Only one (extremely simple) example data set is included in the listing: sharma is data taken from a multivariate stats text by Sharma. Nonhierarchical cluster analysis can handle large data sets, but I have not much experience with such. Please let me know if you have any recommendations for this material.
NB. clusterbasic.ijs NB. 12/17/5 NB. revising clusoverall.ijs NB. revising cluster.ijs NB. also starting from scratch NB. usage t =. <method> cluster <data> NB. eg: t =. WM cluster sharma NB. then: t show sharma NB. shows cluster history NB. or: sharma show~3{."1]_3{t NB. shows history for 3 clusters NB. and: R2 NB. show Rsquared history NB. and: nTies NB. show ties history NB. and: r NB. show cluster pointer history NB. methods available: NB. WM:Ward's NB. CL:Complete linkage NB. CM:Centroid NB. AL:Average linkage NB. SL:Single linkage (nearest neighbor) show =: [ {&.> [: < ] NB. to see cluster results starting at maxclus maxclus =: 10 NB. sets the final number of clusters for which detailed statistics NB. can be sent to an output file or printed cluster =: dyad define init y method =. x dists =: dists__method currclus while. <:#currclus do. nTies =: nTies ,<:#ndx =. (locmins # i.@#) lfm dists curr =: >{:tmp =. ,tmp0 =. ndx&{"1 lfm"2 (<"0 ,: cp) dists if. 1&<#ndx NB. optionally deal with ties do. dists2 =. (>{:tmp0) dist__method currclus ndx2 =. tmp0 FAR +/"1 dists2 curr =: >{:tmp =. ndx2{"1 tmp0 end. cluskey =: ;(>{: tmp) combine <"0 i. #clusptrs clusptrs =: cluskey <@;/. pclusptrs =. clusptrs currclus =: cluskey <@;/. pcurrclus =: currclus if. maxclus>:#currclus do. R =. Rs >each currclus R2 =: R2,-.R r =: r, clusptrs end. dists =: curr distsch curr dist__method pcurrclus end. r ) init =: monad define clusptrs =: <"0 i. N =: #obs =. ;/y clusno =: N # N pts =: ({:"1 y) </. obs currclus =: clusptrs </. obs nTies =: i. 0 SS =: +/diag SSCP&:> obs R2 =: i. 0 NB. allr_cluster_ =: i. 10 0 r =: 0 0$a: ) distsch =: dyad define t =. x new =. 0 t} y fix =. (_1{ t)} NB. this _1 must coordinate with combine dists =. new&fix &.|: new&fix dists drop =. (<<<0{ t)&{ NB. this 0 must coordinate with combine drop"1 drop dists ) length =: sum &. (*:"_) Edstnc =: length@:- Edstnc =: ss@:- NB. more efficient nrmlz =: ] % length cells =: +/~ @:(#&>) NB. table of cluster sizes NB. routines for strictly lower triangular matrices to lists and back expand =: #^:_1 cslt =: &.(lfm :. mfl) NB. create slt from full matrix slt =: >/~@i. NB. strictly lower triangular matrix lfm =: slt@# #&, ] NB. list from matrix fsl =: <:@fms { los@>: NB. find string length needed fms =: ndxle >:@i. 1: NB. find matrix size los =: +/\@ i. NB. list of sizes ndxle =: (<:los)~ NB. index <: list of sizes mfl =: (],])@fms@# $,@slt@fms@#expand fsl@#{.] NB. slt from list NB. routines for describing cluster homogeneity, etc. indices =: ]each cslt@(CP&:(i.@#)/~) spr =: (-`+)/@:(+/@ diag @SSCP&>)each cslt@(ap&:>each/~) ns =:(,&#)each/~ ap =: <@,,; diag =: (<0 1)&|: NB. cluster distances singlelinkage =: <./@,&> completelinkage =: >./@,&> averagelinkage =: avg@,&> centroid =: ctr each@(> each) NB. ways to deal with ties FAR =: locmax@] FAR =: locclose FAR =: sas min =: <. max =: >. locmins =: = min/ v_locmins =: ]<: >:@%&100@[ * min/@] locmin =: i. min/ locmax =: i. max/ min_numb =: ({.,#)@(#~ locmins) mins =: &v_locmins sas =: dyad define t =. x( >@{:@[{<./&>@])clusptrs NB. min ID for each potential cluster record tt =. {.@:/:@:(/:~"1) t NB. index of smallest of sorted of each cluster's 2 IDs ) locclose =: dyad define t =. x locmin y ) incrQ =: 2&(</\)&(,&0) frst0 =: i.&0 transpose =: |: sel =:[: ; [ {.&:>each ([: - [) {each ] NB. measuring changes in clustering coefficients dif =: 2: -~/\] pd =: dif % }: NB. measuring sums of squares ss =: sum @: *: sum =: +/ sqrt =: %: n =: # avg =: sum%n ctr =: avg"2 NB. centroid mnc =: ] -"1 ctr mp =: sum . * sscp =: transpose mp ] transpose =: |: SSCP =: sscp@mnc df =: <: @: n df =: +/@:(#&>)-1: NB. added 2/4/2 dfw =: +/@:>@:(#each) - # dfw =: +/@:(#&>) - # NB. added 2/4/2 SSCPw =: sum@:>@:(SSCP each) SSw =: SSCPw%dfw SSt =: SSCPt%dfw NB. added 2/4/2 SSb =: SSCPb%dfw NB. added 2/4/2 SSCPt =: SSCP@;@,. SSCPb =: SSCPt - SSCPw Rs =: sum@:((<0 1)&|:@SSCPb)%sum@:((<0 1)&|:@SSCPt) det =: -/ . * CP =: {@(,&<) NB. Catalog of dyadic values cp =: CP&(i.@#)~ NB. Catalog of a array's indices freqcount=: (\: {:"1)@(~. ,. #/.~) combine =: (nit <@:# foc)`({:@[)`]} foc =: {.@[{.@>@{] NB. first atom of object to be copied nit =: ({:@[#@>@{]) NB. number of atoms in target object stddev =: ss@mnc sqrt @ % df variance =: ss@mnc % df std =: mnc %"1 stddev S =: SSCP % df R =: SSCP@std % df NB. load <path,'iris.data' NB. load <path,'wine.data' NB. load <path,'glass.data' sharma =: transpose 2 6$5 6 15 16 25 30 5 6 14 15 20 19 NB. ************************************************** NB. ************************************************** coclass 'method' create =: verb define 0 ) destroy =: codestroy cocurrent 'base' WM =: conew 'method' CL =: conew 'method' CM =: conew 'method' AL =: conew 'method' SL =: conew 'method' dists__WM =: 3 :'> -:@:-:@:Edstnc__ each/~ centroid__ y' NB. **HALF** of distance squared dist__WM =: dyad define t=. x Nj =. n__"2 &>@(> each) y NkNl =. t{Nj Nkl =. +/|: NkNl NjNkl =. NkNl +/ Nj NjNm =. Nkl +/ Nj DjDkl =. t{dists__ Dkl =. (<"1 t){dists__ ((+/"2 NjNkl*DjDkl)-Nj*/~Dkl)%NjNm ) dists__CL =: 3 :'completelinkage__ Edstnc__"1/&:> each/~y' dist__CL =: dyad define t=. x >./"2 t{ dists__ ) dists__SL =: 3 :'singlelinkage__ Edstnc__"1/&:> each/~y' dist__SL =: dyad define t=. x <./"2 t{ dists__ ) dists__CM =: 3 :'> Edstnc__ each/~ centroid__ y' dist__CM =: dyad define t=. x Nj =. n__"2 &>@(> each) y NkNl =. t{Nj Nkl =. +/|: NkNl DjDkl =. t{dists__ Dkl =. (<"1 t){dists__ DklNkNl =. */Dkl ,|:NkNl ((+/"2 NkNl*DjDkl)% Nkl)- (DklNkNl%*:Nkl) ) dists__AL =: 3 :'> Edstnc__"1 each/~ centroid__ y' dist__AL =: dyad define t=. x Nj =. n__"2 &>@(> each) y NkNl =. t{Nj Nkl =. +/|: NkNl DjDkl =. t{dists__ Dkl =. (<"1 t){dists__ (+/"2 NkNl*DjDkl)% Nkl )
Contributed by Brian Schott