NYCJUG/2010-02-09/LargeDatasetsForParallelProgramTesting
Generating Large Datasets For Parallel Program Testing
The following, part of a larger effort to use J to take advantage of the increasingly common multi-core architectures, outlines some datasets that can be generated at whatever size is needed to test parallel programming problems ranging from the simple to the complex. These examples are intended to be fairly simple but sufficiently realistic to suggest actual problems worth working on. We want to work with datasets that can be arbitrarily large, as this is the sort of problem for which parallelization ought to provide the most value.
Also, these specific, simple examples are arranged in order of increasing complexity to typify fundamentally different types of problems and to provide concrete, replicable examples we can use to benchmark specific proposals for parallelization. It's too easy to make useless generalizations - the real test is in specific applications. Remember, no job is too difficult for someone who doesn't have to do it himself.
Look here for the code used and developed below. Also, there is work on spinning off threads from J as well as some code for "mutex" - mutual exclusion of processes.
Case 0: Present-Value Cashflows Under Different Interest-rate Scenarios
First, we'll generate some small datasets, just one hundred cashflows and interest rate scenarios, to show how it might be done.
We generate cashflows from about $1,000 to $10,000 per month for 30 years, ordered to group similar cashflows in the same series.
   cfs=. |:/:~"1]1000+100%~<.900000*360 100?@$0   NB. 100 series of 360 monthly cashflows, ~$1,000-$10,000
We can take a look at the distribution of these cashflows and see that they are fine for our purposes - the distributions don't seem highly unusual.
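One quick way to check this at the console is to summarize the values; the sketch below assumes the usus verb (minimum, maximum, mean, standard deviation, from the stats scripts) that is also used further down this page, and shows no output since the numbers vary from run to run.

   usus ,cfs           NB. Min, max, mean, standard deviation of all cashflows
   usus"1 ] 3{.cfs     NB. The same summary for the first few series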
Interest Rate Generation
Now let's generate some pseudo-random interest rates and see if they look reasonable. We'll generate these rate series based on random walks, then adjust them to fit into a reasonable range.
   irp=. +/\1000%~<:+:100 360?@$0               NB. Rates change randomly
   irp=. (3 : '(100%~>:?0)+y-<./y')"1 irp       NB. Eliminate negative rates
   irp=. (3 : 'y*(0.10+10%~?0)%>./y')"1 irp     NB. Max rate 10-20%
   irp=. irp/:*/"1 >:irp                        NB. Order for neatness
Looking at how these are distributed, we see that the rates look plausible, if slightly high by recent historical standards.
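A quick sanity check on this construction might look like the following sketch; the exact values differ on each run since the rates are random, but every rate should be positive and below the 10-20% per-path cap.

   usus ,irp           NB. Min, max, mean, standard deviation of all rates
   (#~ 0&>:) ,irp      NB. Any non-positive rates left? (should be empty)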
Now that we have our two datasets, we'll calculate the present values for all cashflows for each interest rate path.
   pvs=. +/&>(<"1 cfs)%&.>/<"1 */\"1 >:irp
   usus ,pvs                    NB. Min, max, mean, standard deviation
7232.5742 142270.46 51913.099 27130.893
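To see what this expression is doing: it builds a 100-by-100 table of present values, one for each combination of cashflow series and rate path, where each entry discounts a series by the running product of (1+rate) along a path. The equivalent calculation for a single series and a single path is sketched below.

   cf=. 0{cfs              NB. One cashflow series (360 months)
   ir=. 0{irp              NB. One interest-rate path
   +/ cf % */\ >: ir       NB. Sum of discounted cashflows; should match (<0 0){pvs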
We can see the distribution of these present values here.
Case 1: Sort Many Records by Date, Movie, or User
Inspired by the Netflix Prize, we'll create some random movie rating records, each consisting of a date, a movie number, a user number, and a rating from 0 to 9. The task to accomplish with this data will be to write different versions of the same records, sorted by each of these attributes.
Unlike the previous case in which each calculation is completely independent of every other one, this case raises difficulties of co-ordinating the outputs and resolving the possibility of collisions.
First, we'll generate some arbitrary records.
   3{.recdmur=. (100#.todate 70476+?1e5$6264),.1e5 3?@$20000 1e6 10
19930509 19884 235453 9
20091003 19361 166043 9
20050706 12626 723593 3
   whereDefined 'todate'
C:\Program Files\j602\system\main\dates.ijs
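For readability, here is the same generation broken into named steps; the intermediate names days, dates, and mur are only for exposition and, being random, this produces a different sample than the session above.

   days=. 70476+?1e5$6264          NB. Random day numbers spanning roughly 17 years
   dates=. 100#.todate days        NB. Day number -> (year,month,day) -> numeric YYYYMMDD
   mur=. 1e5 3?@$20000 1e6 10      NB. Movie (0-19999), user (0-999999), rating (0-9)
   recdmur=. dates,.mur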
Next, we'll show what this set of records looks like sorted by date.
   3{.recdmur/:0{"1 recdmur      NB. Sorted by date
19921216 11337 147335 5
19921216  6658 942808 3
19921216 16905 608632 1
Next, we'll sort them by the movie number.
   3{.recdmur/:1{"1 recdmur      NB. Sorted by movie number
20080918 0 925980 3
19950825 0  72029 1
19970502 0 908915 9
Finally, by the user number.
   3{.recdmur/:2{"1 recdmur      NB. Sorted by user ID
20000918  3262 12 5
20030606  4339 21 0
20010121 17163 26 2
Obviously, this number of records is small enough that the sorting can be done very simply. The challenge arises when we have many records, say a hundred million, and we want to write the sorted records to files, say one file per unique value of the sort key.
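As a point of reference, here is a naive, purely serial sketch of writing one file per key value. The name wrByKey and the "key_" file-name prefix are hypothetical, and the writeTSVFl utility it relies on is defined in the "Storing Large Datasets" section below; a parallel version would split this loop across processes and co-ordinate any writes that land on the same file.

wrByKey=: 4 : 0
   NB. x: column number of the sort key; y: numeric table of records.
   NB. Writes one tab-delimited file per distinct key value; returns the file count.
   keys=. ~. x{"1 y
   for_k. keys do.
      (":&.>(k=x{"1 y)#y) writeTSVFl 'key_',(":k),'.tsv'
   end.
   #keys
)
NB.EG 0 wrByKey recdmur    NB. One file per distinct date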
This is a good parallel-programming exercise because it deals with the slowness of I/O relative to in-memory processing - a genuine practical concern. As mentioned above, it also raises the complication of separate parallel processes having to co-ordinate with each other to resolve collisions caused by multiple attempts to write to the same file simultaneously; this, too, reflects a real problem.
Case 2: De-speckle an Image
See the description of this problem referenced here and a proposed algorithm here. The added complexity of a problem like this comes from the likelihood that we would parallelize it by breaking an image into (at least) two-dimensional tiles. However, this raises the difficulty of working across the edges of the tiles, which may limit the scalability of such a solution.
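For concreteness, here is one way the tiling itself might be done in J, using the subarray cut (;._3); the image below is just a random stand-in and the 100-by-100 tile size is an arbitrary choice.

   img=. ?1000 1000$256                   NB. A stand-in 1000x1000 8-bit "image"
   tiles=. (100 100,:100 100)<;._3 img    NB. Non-overlapping 100x100 tiles: a 10-by-10 array of boxes

De-speckling a pixel near a tile boundary needs neighbors from adjacent tiles, so either the tiles must overlap (a movement smaller than the tile size in the left argument above) or the processes must exchange edge rows and columns.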
Storing Large Datasets as Tab-Delimited Files
Let's use a simple convention of storing pieces of a large array as separate files consisting of tab-delimited strings representing numbers. Though this is potentially very costly in terms of processing time, it's simple, and the cost should net out in any comparison of speed-ups achieved through parallelism.
Here are some examples of generating sets of files for both of our example cases. The definitions of the functions can be found here.
First, let's write out 10 pairs of cashflow and interest-rate files and see how long this takes.
   6!:2 '1e4 wrCFIRFls^:10]0'     NB. Write 10 file sets w/10,000 records each
209.46692
Where
wrCFIRFls=: 4 : 0
   (":&.>genCFs x) writeTSVFl '.tsv',~'CF0_',":y
   (":&.>genIRs x) writeTSVFl '.tsv',~'IR0_',":y
   >:y
)
NB.EG 1e4 wrCFIRFls^:10]0      NB. Write 10 file sets w/10,000 records each

genCFs=: 13 : '|:/:~"1]1000+100%~<.900000*(360,y)?@$0'
NB.EG cf0=. genCFs 1e4         NB. 10,000 30-year cashflows
and
writeTSVFl=: 4 : '(enc2TSV x) fwrite y'
enc2TSV=: 13 : ';(LF,~[:}:[:; TAB,&.>~])&.><"1 y'
Now let's time writing out 10 sets of movie-rating files.
   6!:2 '1e6 wrDMURFl^:10]0'      NB. Make 10 sets of 1 million records each
72.577966
Where
wrDMURFl=: 4 : '>:y[(":&.>genDMURRecs x) writeTSVFl ''.tsv'',~''DMUR0_'',":y'
NB.EG 1e6 wrDMURFl^:10]0       NB. Make 10 sets of 1 million records each

genDMURRecs=: 3 : '(100#.todate 70476+?y$6264),.(y,3)?@$20000 1e6 10'
NB.EG dmur0=. genDMURRecs 1e6
where todate is defined in dates.
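Any worker process will also need to read these files back. Here is a sketch of an inverse; readTSVFl is a hypothetical name, not part of the code linked above, and it simply reverses the enc2TSV convention of TAB-separated fields on LF-terminated lines.

readTSVFl=: 3 : 0
   NB. Read a tab-delimited numeric file back into a numeric table.
   txt=. fread y                  NB. Whole file as one string
   txt=. ' ' (I. txt=TAB)}txt     NB. Turn tabs into spaces for numeric parsing
   >0&".&.><;._2 txt              NB. Parse each LF-terminated line, then assemble rows
)
NB.EG cf0=. readTSVFl 'CF0_0.tsv'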
Once we have these sets of files, we can attempt to parallelize work on them.
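For example, the per-file unit of work for Case 1 might look like the sketch below; the name sortOneDMURFl and the "_bydate" output suffix are hypothetical, and readTSVFl is the reader sketched just above. Since each call touches only its own pair of files, the calls for different arguments are independent and are the natural pieces to hand to separate processes.

sortOneDMURFl=: 3 : 0
   NB. Read one movie-rating file, sort its records by date, and write the result.
   recs=. readTSVFl 'DMUR0_',(":y),'.tsv'
   (":&.>recs/:0{"1 recs) writeTSVFl 'DMUR0_',(":y),'_bydate.tsv'
)
NB.EG sortOneDMURFl&.> i.10    NB. Serial version; these ten calls could run in parallel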