Doc/Articles/Play161
At Play With J ... Table of Contents ... Previous Chapter ... Next Chapter
21. New Big Deal
. By Eugene McDonnell, with a major contribution from Roger Hui. First published in Vector, 16, 1, (July 1999), 113-119.
Chris Burke was displeased, to begin with. He had tried J's deal with a small left argument and a very large right argument, and this took a perceptibly long time. He tried it with a larger argument and was told he had run out of space. He tried it with a still larger argument, and was told he had exceeded deal's limits. He wrote to Roger Hui about these distressing circumstances, showing him several examples:
(time=:6!:2) '1 ? 1e7' NB. this takes too long 0.93 1 ? 1e8 |out of memory | 1 ?100000000 1 ? 1e9 |limit error | 1 ?1000000000
Roger's first thought was how to get around this limitation. He said that if the left argument is much smaller than the right Chris would be better off just doing ?x$y ; because deal allocates a boolean vector of length y to compute unique answers in the result of x?y. Thus a very large y would give the results Chris noted. He forwarded Chris's message to me, copying Chris, with the message "Perhaps Eugene can comment."
I checked the literature on my shelf but couldn't find any worthwhile "selection without replacement" algorithms. I then remembered the very early days of APL\360, before deal was made a primitive -- some time in 1966, perhaps. I had written a defined function to perform deal, using an algorithm that was quite fast. I remembered it vaguely, and thought that with I might be able to do better with J. I wrote one, tried it out on a dozen or so cases, then communicated it to Chris and Roger:
deal =: dyad define NB. experimental deal NB. for small x and very large y count =: 0 t =. i.0 k =. 1.1 NB. adjust as you see fit NB. maybe make it a function of y u =. >. k * x whilst. (# t) < x do. t =. ~. ? u # y count =: count + 1 end. x {. t )
As you can see, I was uncertain about the factor 1.1. I had put a counter in to see how often more than one execution of the whilst section was needed. In the few dozen cases I tried, there were none. Happily, the timings showed a large improvement over current deal for Chris's cases: the cases that were slow were much faster, and the range of the right argument was significantly extended.
Here are the timings I experienced:
time '1 ? 1e7' 4.07 100 time '1 deal 1e7' 0 1000 time '1 deal 1e7' 0.00044 1000 time '1 deal 1e8' 0.00044 1000 time '1 deal 1e9' 0.00044 1000 time '100 deal 1e7' 0.00082 count 1 1000 time '100 deal 1e8' 0.00076 count 1 1000 time '100 deal 1e9' 0.00083 count 1 ts=: 6!:2 , 7!:2@] NB. time and space 100 ts '1 deal 1e7' 0 2240 100 ts '1 deal 1e8' 0 2240 100 ts '1 deal 1e9' 0.0005 2240 100 ts '100 deal 1e7' 0.0005 4288 100 ts '100 deal 1e8' 0.0005 4288 100 ts '100 deal 1e9' 0.0005 4288
Roger thought this was neat. He implemented my algorithm, invoked when x<0.01*y. He rewrote my algorithm, simplifying it:
deal =: dyad define u =. >. 1.1 * x while. x># t=. ~. ? u # y do. end. x {. t )
His C implementation looked like this:
static A bigdeal(m,n)I m,n;{A t,x,y; RZ(x=sc((I)floor(1.11*m))); RZ(y=sc(n)); do{RZ(t=nub(roll(reshape(x,y))));}while(m>AN(t)); R vec(INT,m,AV(t)); } /* E.E. McDonnell circa 1966, small m and large n */
But he worried that this would run into the birthday problem, which gets its name from its most celebrated instance, that the odds are in your favor if you bet that in a group of 23 or more people at least two of them will have the same birthday. The greater the number of people, of course, the higher the probability that this will occur. With more than 365 people, the probability is certain, or 1, since the pigeon hole principle dictates that if you have x pigeonholes and y pigeons, with x less than y, at least one of the pigeonholes will have more than one pigeon in it. What Roger wanted to know was, what is the value of x as a function of y where the probability of a duplicate appearing was 0.5. If he had this information, he could make the multiplier more accurate. Perhaps 1.11 wasn't good enough.
I wrote the following:
Hui_test =: dyad define tests =: 0 successes =: 0 whilst. 1000 > tests =. >: tests do. successes =: successes + *./~:?x$y end. <. 100 * 0.001 * successes )
I ran this test for y in all of the hundreds, thousands, ten thousands, hundred thousands, millions, and 10,000,000 and 20,000,000. To make it easier to digest I'll only show the results for y=:10^2 3 4 5 6 7. The other results are consistent with these.
y x 100 12 1000 37 10000 116 100000 370 1000000 1180 10000000 3740
I thought this looked roughly like a square root relationship so tried a few maneuvers:
%:y 10 31.6228 100 316.228 1000 3162.28 %:1.4*y 11.8322 37.4166 118.322 374.166 1183.22 3741.66
This was quite a good fit, and shows:
y = (x^2) % 1.4 x = %: 1.4 * y
to a close approximation. I communicated these results to Roger. He studied this and was able to apply some more theory to it, and wrote back to me: In choosing m items from a universe of n items, the probability of all the m items distinct is (*/n-i.m) % (n^m). The numerator is the number of ways of choosing m distinct items; the denominator is the number of ways of choosing m items. Thus:
(*/n-i.m) % (n^m) (a) (*/n-i.m) % (*/m$n) */ (n-i.m) % (m$n) */ (n-i.m) % n f=: [: */ ] %~ i.@[ -~ ] 22 23 f"0 ] 365 NB. the classic birthday problem 0.524305 0.492703
The first m, for which m f n is less than 0.5, is:
1 + (0.5 > */\ n %~ n - i.n) i. 1 f1=: >: @ (i.&1) @ (0.5&>) @ (*/\) @ ([ %~ [ - i.) f1 365 23 n=: , (,10^1 2 3 4 5)*/1 2 4 8 m=: f1"0 n n,.m ,. %: 1.4*n 10 5 3.74166 20 6 5.2915 40 8 7.48331 80 11 10.583 100 13 11.8322 200 17 16.7332 400 24 23.6643 800 34 33.4664 1000 38 37.4166 2000 53 52.915 4000 75 74.8331 8000 106 105.83 10000 119 118.322 20000 167 167.332 40000 236 236.643 80000 334 334.664 100000 373 374.166 200000 527 529.15 400000 745 748.331 800000 1054 1058.3
For extremely large values of n, Roger saw that the function f1 is impractical to compute, and he concluded that a method based on root finding on the original function f, using my approximation of m=: %: 1.4*n as an initial guess, would be more suitable.
1e7 3724 1e8 11775 1e9 37234 2e9 52656
The end result was that Roger found that the rule I had originally suggested of switching to the "roll" algorithm for m?n when m<0.01*n does run into the birthday problem, and he replaced it with the more accurate rule I had found following his suggestion. There the matter rests, with one small postscript. As I studied Roger's mathematics, particularly the phrase (*/n-i.m) in line (a) above, I recalled J's "falling factorial" function. The J Dictionary defines this as follows:
The fit conjunction applies to ^ to yield a stope defined as follows: x^!.k n is */x + k*i. n. In particular, ^!._1 is the falling factorial function.
Let me elaborate on this. Think of the caret (^) as being defined in the first instance as denoting a function of three arguments: a base, a count, and a step. Then caret (base, count, step) is the product over base + step * integers count. For example,
'base count step' =. 3 5 7 i. count 0 1 2 3 4 step * i. count 0 7 14 21 28 base + step * i. count 3 10 17 24 31 */ base + step * i. count 379440 caret =: monad define NB. general caret function NB. y is a list of three values: NB. base NB. count NB. step 'base count step' =. y */ base + step * i. count ) caret(3 5 7) 379440
This generality hides the fact that there are really just three variations of significant interest, steps having values _1, 0, and 1. These three provide falling factorials, steady factorials (powers) and rising factorials. Each of these has to do with a product over count number of values, beginning with base , and continuing with values increasing by step . See what results come with a base of 5 and a count of 3, with each of the three significant step sizes:
caret 5 3 _1 60 5 * 4 * 3 60 caret 5 3 0 125 5 * 5 * 5 125 5 ^ 3 125 caret 5 3 1 210 5 * 6 * 7 210
The case with step zero is the default case of caret, and is the power function. We can use the falling and steady factorial cases to write a more compact version of Hui's function f:
probdupes =: dyad define NB. Probability of no duplicates when drawing with NB. replacement from among the first count integers base =. x count =. y (base ^!._1 count) % (base ^!.0 count) ) NB. simplified version of probdupe pd =: ^!._1 % ^ NB. version exploiting family resemblances pdx =: [: %/ ^!._1 0 365 probdupes 23 0.492703 365 pd 23 0.492703 365 pdx 23 0.492703
These functions fail when the values of the falling factorial get very large, causing its value to exceed the maximum real value, and thus to be represented by machine infinity. When this happens, the steady factorial (power) value will already be machine infinity, and the quotient of two infinities is Not a Number, or NaN. To get around this problem, use extended integers as arguments, and extended precision inverse on the result. When the result is smaller than the smallest nonzero number, a result of zero is forced:
365 pd 200 NB. result is NaN error, from _ % _ |NaN error: pd | 365 pd 200 x:^:_1 [ 365x pd 200x NB. result of zero is forced 0
Vector Vol.16 No.1