Scripts/Forum Time Series
Purpose
J's plot package and data analysis features are well suited to exploring trends in Jsoftware.com forum messages with statistical time series analysis. Through a few examples, we suggest one possible approach that is simple and transparent.
Data acquisition
The JForum data are available from the following links: programming, general, beta and chat.
The links below give the full URLs to the first month of data available for each forum organized in "thread" order. Similar links exist organized by "author" and "date" order.
Other data sources, such as the one at Gmane, have been mentioned in our JForum.
As demonstrated in this forum post, it can be quite easy to collect the required web-based data using the verb httpget (Scripts/HTTP Get). The example session below shows that httpget can be used to confirm that 357 messages were posted to the Programming forum in October, 2007.
   load '~user/httpget.ijs'
   require 'regex'
   A=: httpget'http://www.jsoftware.com/pipermail/programming/2007-October/thread.html'
   'Messages:[^0-9]*([0-9]+)' (,.@:{:@rxmatch ];.0 ]) A
357
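To see how the extraction step works without touching the network, the same pattern can be tried on a short literal string. This is only a sketch: the sample text is made up for illustration and is not the real page source, and the names sample, pat, grp, count and n are hypothetical.

```j
require 'regex'

NB. Made-up sample text, NOT the real page source.
sample =: '<b>Messages:</b> 357<p>'
pat    =: 'Messages:[^0-9]*([0-9]+)'

NB. rxmatch returns one (start,length) row per match group;
NB. the last row belongs to the parenthesised digit group.
grp   =: {: pat rxmatch sample      NB. start and length of the digits
count =: (,. grp) ];.0 sample       NB. cut that substring out: '357'
n     =: ". count                   NB. convert the text to the number 357
```

The tacit phrase used in the session above performs the same three steps (match, take the last row, cut the substring) in a single fork.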
First we create a verb yearmonth which constructs the unique portion of the web data links: "2007-October" in the example above was such a unique string. [{{#file: "timeseries.ijs"}} Download script: timeseries.ijs ]
load '~user/httpget.ijs'
require 'regex'

months =: <;._2 ]0 : 0
January
February
March
April
May
June
July
August
September
October
November
December
)

NB.* year v
NB. monad triple: start year after 2000
NB.               start month (eg. April = 4)
NB.               number of months
NB. year 3 4 15 is start at 2003-April and contain 15 months
year =: (+ 2000+12<.@%~])`(+ <:@i.)/

NB. month v
NB. monad triple: same inputs as year triple
month =: <:@(1&{) 12&|@+ i.@{:

yearmonth =: ,each/@(;/@('-',.~":)@,.@year,:months{~month)
NB. yearmonth 3 4 15

«readdata»
«datasets»
«movingaverage»
«my3»
«pdutilities»
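The index arithmetic behind year and month can be followed on a small case. The sketch below is not the page's definitions, just the same calculation spelled out step by step for a run of 5 months starting in October 2005; the names idx, mthidx and yrs are hypothetical.

```j
NB. October is month 10, so its zero-based index is 9.
idx    =: 9 + i. 5              NB. 9 10 11 12 13
mthidx =: 12 | idx              NB. 9 10 11 0 1  -> Oct Nov Dec Jan Feb
yrs    =: 2005 + <. idx % 12    NB. 2005 2005 2005 2006 2006
```

The residue modulo 12 selects the month name, and the integer quotient carries the year forward whenever the run crosses December.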
With the yearmonth string generator in hand, readdata calls httpget inside a for. loop to automate data collection for the Programming forum. The process is rather slow: it took about 3 minutes for the 34 pages accessed in the example. [{{#file: "readdata"}} Download script: readdata ]
urlhead =: <'http://www.jsoftware.com/pipermail/programming/'
urltail =: <'/thread.html'

readdata =: monad define
result =. i. 0                        NB. accumulate counts in a local name
y =. ;"1 urlhead,.y,.urltail          NB. one full URL per row
for_x. y do.
  temp =. httpget x
  temp =. 'Messages:[^0-9]*([0-9]+)' (,.@:{:@rxmatch ];.0 ]) temp
  result =. result, ". temp
end.
result
)

   $messages =: readdata yearmonth 5 10 34
By the time others access the message data above, the most recent month may have changed. To preserve the data, we reproduce it below, along with the size data set used later, as two nouns. [{{#file: "datasets"}} Download script: datasets ]
messages =: ;<@".;._2 (0 : 0)
37 261 335 381 295 576 273 270 172 291 226 285
450 380 372 374 380 456 479 570 357 366 415 234
357 293 371 323 279 477 336 244 290 113
)

size =: ;<@".;._2 (0 : 0) NB. in KBs
14 86 142 136 113 266 100 109 70 94 90 112
174 129 157 137 164 201 219 230 148 149 150 83
143 114 165 130 117 181 149 100 124 27
)
Time series data analysis
To smooth out the effect of seasons from time series data, a simple approach is to compute a centered moving average (CMA) over a multiple of the length of a year. In the case of monthly data, a 12-month (or 24-month or 36-month) CMA is appropriate; the more months in the CMA, the more months are lost from the two ends of the data (6, 12, or 18 months from each end). A CMA is required instead of a plain moving average (MA) because 12, 24, and 36 are even numbers; without the centering, the MAs would fall between months and would not align correctly with the original months.
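The centering step is easiest to see on a tiny series. The sketch below uses a window of 4 and the arithmetic mean rather than a 12-month window, purely to keep the numbers small; the series and the names mean, ma and cma are made up for illustration.

```j
mean =: +/ % #                  NB. arithmetic mean, defined locally
y    =: 1 2 3 4 5 6 7 8         NB. made-up linear series
ma   =: 4 mean\ y               NB. 2.5 3.5 4.5 5.5 6.5 (falls between items)
cma  =: -: (}: ma) + }. ma      NB. 3 4 5 6 (aligned with items 2..5 of y)
```

Averaging each pair of neighbouring MAs moves the result back onto the original positions. For this linear series the CMA reproduces the original values exactly, and two items are lost from each end, that is, half the window length.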
The MA itself produces the desired smoothing. Traditionally MAs are computed with the statistical mean, but the median, used here, is more robust: a single extreme month has little effect on it. The J script library stats contains verb definitions for both mean and median. [{{#file: "movingaverage"}} Download script: movingaverage ]
load'stats'
MA  =: &(median\)                          NB. moving average
CMA =: 1 : 'm MA@}: -:@+ m MA@}.'          NB. centered moving average
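The robustness of the median can be checked on a handful of values. In this sketch mean and med are defined locally so that the block runs without the stats script; med is a hypothetical stand-in for the library's median, not its actual definition, and the two short lists are made up.

```j
mean =: +/ % #
NB. med: average of the one or two middle items of the sorted list
NB. (hypothetical stand-in for the library verb median).
med  =: -:@(+/)@((<. , >.)@(-:@<:@#) { /:~)

clean  =: 270 273 285 291 295
spiked =: 270 273 285 291 576     NB. one unusually busy month

mean clean       NB. 282.8
mean spiked      NB. 339
med clean        NB. 285
med spiked       NB. 285
```

One extreme month moves the mean by more than 50 messages but leaves the median unchanged.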
While exercising the moving average verbs, we notice that CMA reduces the number of monthly observations from 34 to 22 in this example.
   #messages
34
   #12 CMA messages NB. notice 12 are lost
22
   _6]\ 12 CMA messages NB. _6]\ is for display only
 283.5  290.5    293    293 312.25 331.5
352.25    375    377    377  378.5   380
 378.5    375 372.75  370.5    365 361.5
359.25 351.75    338 318.75      0     0
Plot data
To get a quick view of these results, try the next plots. They are all crude renderings, but we can see quite a bit from them. From the first, we see that the first and last months are low and unrepresentative, so the later plots drop them. The third plot is noticeably shorter and smoother, suggesting a growth period and perhaps a decline. The last plot shows the original data and the smoothed data together, enabling us to see them in perspective, but to produce this plot in this manner we lose even the 12-period x ruler.
   load 'plot'
   'xtic 12' plot messages
   'xtic 12' plot }.}:messages
   'xtic 12' plot (6+i.20);12 CMA }.}: messages
   pd bind 'show' @pd"1 ]((i.32);}.}:messages),:(6+i.20);12 CMA }.}: messages
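The x positions 6+i.20 used for the smoothed series follow from the window length. A small sketch of that arithmetic, with hypothetical names n, w, ncma and xs:

```j
n    =: 32                  NB. months left after dropping the first and last
w    =: 12                  NB. CMA window length
ncma =: n - w               NB. 20 smoothed values remain
xs   =: (-: w) + i. ncma    NB. 6 7 8 ... 25
```

The first smoothed value is centered on item 6 of the trimmed series and the last on item 25, so half a window is lost at each end.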
Some utilities will permit more refined plots.
For example, here we prepare to better label the plot's x axis with the verb my3. fmt is defined here to force the year number to be 2 digits padded in front with a 0, if necessary. If more space is needed and there is only room for 1 digit, you may redefine it: fmt =: ": [{{#file: "my3"}} Download script: my3 ]
MTH3=: _3 ]\ 'JanFebMarAprMayJunJulAugSepOctNovDec'
fmt=: [: , ('r<0>2.0')&(8!:2)
monthyear =: 1 : '[: (;:^:_1) 0 12 <@((,~&fmt) {&m)/@#: ]'
my3 =: MTH3 monthyear
Experiment by exercising the verb my3.
   my3 14 92 202
Mar01 Sep07 Nov16
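The argument to my3 is a month count in which each year contributes 12. As a sketch of the encoding (the names ym and n are hypothetical), antibase splits such a count into a year and a zero-based month, and base builds one:

```j
ym =: 0 12 #: 14 92 202     NB. rows of (year, zero-based month):
                            NB.   1  2  -> Mar01
                            NB.   7  8  -> Sep07
                            NB.  16 10  -> Nov16
n  =: 12 #. 5 10            NB. 70, i.e. year 05, month index 10: Nov05
```

So a start value such as 70 stands for November 2005.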
Next are some utilities that bundle pd commands. [{{#file: "pdutilities"}} Download script: pdutilities ]
pdsetup =: monad define
pd 'reset'
pd 'type line,marker'
pd 'graphbackcolor mediumgray'
pd 'gridcolor 230 230 230'
pd 'axes 1 0;axiscolor slategray'
pd 'color red,blue,green'
pd 'pensize 2;markersize 1.5'
pd 'xtic 12'
'title subtitle' =. y
pd 'title ',title
pd 'subtitle ',subtitle
)

pdxlabel =: dyad define
startmonth =. x
data =. y                                  NB. original data before CMA
xticpos=. ((#data)$0 0 1)#i. #data         NB. 0 0 1 is configurable
pd 'xticpos ',":xticpos
pd 'xlabel ',my3 startmonth + xticpos
)

pdshow =: pd bind 'show' @pd"1
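The tick positions in pdxlabel come from a repeating 0/1 mask. A sketch on a made-up series length of 10 shows what the configurable pattern does; the names n, mask, xticpos3 and xticpos2 are hypothetical.

```j
n        =: 10                    NB. pretend the series has 10 months
mask     =: n $ 0 0 1             NB. 0 0 1 0 0 1 0 0 1 0
xticpos3 =: mask # i. n           NB. 2 5 8      (label every third month)
xticpos2 =: (n $ 0 1) # i. n      NB. 1 3 5 7 9  (label every second month)
```

Changing 0 0 1 to another pattern therefore changes how densely the x axis is labelled.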
Now we show how the pd utilities can be used on two of the earlier plot examples.
   pdsetup 'Monthly message count';''
   70 pdxlabel }.}: messages
   pdshow }.}: messages

   pdsetup 'Monthly message count';'and smoothed'
   70 pdxlabel }.}: messages
   pdshow((i.32);}.}:messages) ,: (6+i.20);12 CMA }.}: messages
The next plot uncovers an interesting trend in the average size of Programming messages.
   pdsetup 'Monthly KBs/message ';'and smoothed'
   70 pdxlabel }.}: messages
   pdshow((i.32);}.}:size%messages) ,: (6+i.20);12 CMA }.}: size%messages
This page was contributed by Brian Schott, but incredible contributions were made by Oleg Kobchenko and Ric Sherlock. Also, special thanks to Raul Miller.