NYCJUG/2011-10-11/SmoothOperators
Smooth Operators
Smooth a jagged curve by repeated averaging:
meanSmooth=: 3 : 0 3 meanSmooth y : assert. 1=2|x y=. (({.y)$~-:<:|x),y,({:y)$~-:<:|x NB. Weight end-points to move less. if. 0<x do. x mean\y else. (|x)([: mean {. , {:)\y end. NB. Neg. window averages only endpoints. )
[Here's a question: looking at the line which weights the end-points, is the following tacit version of that line any better?]
y=. (-:<:|x)([|.] ,~ [# [: ({:,{.)])y NB. Weight end-points to move less.
Test by noising up a sine wave:
pp=. (1 o. i:5j99)++/<:+:3 100?@$0 'title Smoothing Example;pensize 2;keystyle fat;key Original "Noised Original" "3&meanSmooth^:10" "5&meanSmooth^:10" "9&meanSmooth^:10";keypos rti' plot (1 o. i:5j99),pp,(3 meanSmooth^:10] pp),(5 meanSmooth^:10]pp),:9 meanSmooth^:10]pp
An Example Use
Say we have some timing data on runs of a model on different input files arranged in order of increasing file size. We see from the graph that lines/second increases for larger files though it appears to be asymptotic. If we have repeated timings on the same set of files so we can average out noise, we’d like to identify those that are persistently faster or slower than they ought to be.
The trick here lies in determining what the speed ought to be since the expected value varies according to the number of lines per file as seen in the graph below. It would be useful to have some sort of local average of speeds to provide a basis for comparison, i.e. we need a smoother line than the one based on the actual measurements.
Once we have this smoothed line, we can identify the most significant outliers above or below this line to tell us which instances are anomalously slow or fast.
Here’s how we might do it for a vector of averaged lines/second called “base”.
sm1=. 5&meanSmooth"1^:20 base NB. Smoothed basis to identify outliers. ord=. (0 1)-.~\:|difs=. base-sm1 NB. Don’t consider first two points. 'loo hio'=. (10{.ord#~_1=*ord{difs);10{.ord#~1=*ord{difs
We want to consider low outliers (faster) separately from high (slower) ones. Let’s plot all the results together to ensure it looks like we expect it to. We’re also excluding the first two points since they tend to be outliers only because of the initial steep slope of the curve.
showOutliers=: 3 : 0 'base sm0 sm1 loo hio'=. y pd 'title Two Smoothers and Some Outliers;pensize 2;keystyle fat' pd ';key Base "3 Smoother^:10" "5 Smoother^:20"' pd base,sm0,:sm1 [ pd 'type line' pd 'type point;pensize 4' pd loo j. loo{base [ pd hio j. hio{base pd 'show' ) sm0=. 3&meanSmooth"1^:10 base showOutliers base;sm0;sm1;hio;loo
Note that this graph has been post-processed to add axis labels, include the legend entries for the outliers, and to label the X-axis with the actual number of lines per file.
Possible Extension: Linear Interpolation
One interesting variation on this idea of smoothing is to apply a different underlying function than “mean” as we’ve done above. An interesting variant is to use a simple linear interpolation on groups of points rather than taking their average. Here we see one example of this applied to our timing data though, in this case, we’ve applied the linear interpolation to the smoothed curve rather than to the original one.
To do this, we need to generalize our “smoother” to be an adverb:
generalSmooth=: 1 : 0 3 mean generalSmooth y : assert. 1=2|x y=. (({.y)$~-:<:|x),y,({:y)$~-:<:|x NB. Weight end-points to move less. ;,x u\y )
The illustration here shows in detail how the linear interpolator works in practice.
Some of the verbs and adverbs we used in the preceding:
linterpEnds=: 3 : '(0{y)+((<:#y)%~-/_1 0{y)*i.#y' gmSmooth=: 4 : '(#y){.x linterpEnds generalSmooth y' lintSmooth=: 4 : '(0{x)gmSmooth (1{x)meanSmooth^:(2{x)"1 y'
The usefulness of this verb may be more apparent with a different kind of series as shown here.
Here we see how a piecewise linear overlay to the curves gives us a way to quantify and compare one aspect of them – whether they are rising are falling and the steepness of the change in a given period.