NYCJUG/2016-02-09
Beginner's Regatta
Here we look at an example of solving the Traveling Salesman Problem using a genetic algorithm.
Traveling Salesman Problem Using Genetic Algorithms
This, from this site, demonstrates solving the Traveling Salesman Problem using a genetic algorithm. It implements the solution in C# but provides a good explanation of the method and the issues involved so we can use the article as guidance for writing our own version.
Problem Statement
From the article:
I have developed a solution to the Traveling Salesman Problem (TSP) using a Genetic Algorithm (GA). In the Traveling Salesman Problem, the goal is to find the shortest distance between N different cities. The path that the salesman takes is called a tour.
Testing every possibility for an N city tour would be N! math additions. A 30 city tour would have to measure the total distance of be 2.65 X 1032 different tours. Assuming a trillion additions per second, this would take 252,333,390,232,297 years. Adding one more city would cause the time to increase by a factor of 31. Obviously, this is an impossible solution.
A genetic algorithm can be used to find a solution is much less time. Although it might not find the best solution, it can find a near perfect solution for a 100 city tour in less than a minute. There are a couple of basic steps to solving the traveling salesman problem using a GA.
Basic Algorithm
The article outlines these steps:
- First, create a group of many random tours in what is called a population. This algorithm uses a greedy initial population that gives preference to linking cities that are close to each other.
- Second, pick 2 of the better (shorter) tours parents in the population and combine them to make 2 new child tours. Hopefully, these children tour will be better than either parent.
- A small percentage of the time, the child tours are mutated. This is done to prevent all tours in the population from looking identical.
- The new child tours are inserted into the population replacing two of the longer tours. The size of the population remains the same.
- New children tours are repeatedly created until the desired goal is reached.
As the name implies, Genetic Algorithms mimic nature and evolution using the principles of Survival of the Fittest.
Difficulties with Crossover
We are next introduced to an issues unique to TSP with genetic algorithms: the problem with the crossover step, the fourth on above where a part of one tour is inserted into another tour. This raises the problem that since each path has exactly one instance of each node, steps must be taken to deal with duplicates and missing elements when inserting an arbitrary piece of one path into another.
This author deals with the problem by using a doubly-linked-list with a link to both the prior and subsequent node in a path. How exactly this helps is not explained but is presumably available in the code.
Parameters
In this article, there are six parameters that control how the GA works. These are
- Population size - how many tours to look at at once.
- Neighborhood size - number of tours to choose from the population at each iteration.
- Mutation rate - a percentage determining how likely it is for a mutation to occur.
- Number of nearby cities - the number of cities in a grouping of cities that are considered close to each other.
- Nearby city odds - the chance that a given city will choose the next step in the path from one of the nearby cities rather than from the population in general.
- Random seed - allow for setting the starting point of the random number generation; supposed to be handy for debugging.
- City list - list of cities and locations
Preliminary J Work Implementing TSP Using GA
We don't have a complete solution here but show some preliminary work in J to work toward an implementation of this method. We did a more complete job on this for a later meeting.
Here’s some code:
NB.* geneticTSP.ijs: look at using genetic algos to solve the NB. Traveling Salesman Problem. disBtwPts=: 13 : '%:+/*:x-y' disBtwPts=: [: %: [: +/ [: *: - NB.* mat2col2Rxy: 2-col mat -> R assignments mat2col2Rxy=: 3 : 0 assert. 2={:$y NB. 2-columns assert. 2=#$y NB. 2-dimensional str=. 'xx<- ',1|.')c(',}:;',',~&.>j2n &.> 0{"1 y str,'; yy<- ',1|.')c(',}:;',',~&.>j2n &.> 1{"1 y ) J2Rassign2rowMat=: 3 : 0 smoutput ('xx<- c('),')',~}.;',',&.>j2n &.> 0{"1 y smoutput ('yy<- c('),')',~}.;',',&.>j2n &.> 1{"1 y ) NB.* circuit: repeat 1st point at end -> complete circuit. circuit=: ],{. eg=: 0 : 0 |:locs=. 6 2?@$0 0.375026 0.821357 0.843802 0.151142 0.188003 0.994509 0.448847 0.00569082 0.136437 0.294528 0.603056 0.184087 ) egShowPath=: 0 : 0 pd 'reset' pd 'title TSP Example: Shortest Path' pd <"1 |:locs [ pd 'type point;pensize 2' pd <"1 |:circuit locs{~ord [ pd 'type line;itemcolor green' dis=. +/2 disBtwPts/\circuit locs{~ord pd 'text 100 100 Total Distance=',":dis pd 'show' )
Here’s some exploration of how we might use this:
load 'plot' pd 'reset' |:circuit locs=. 6 2?@$0 0.375026 0.821357 0.843802 0.151142 0.188003 0.994509 0.375026 0.448847 0.00569082 0.136437 0.294528 0.603056 0.184087 0.448847 $locs 6 2 !#locs 720 alldiss=. +/&>2 disBtwPts/\&.> circuit &.> <"2 locs{~(i.!#locs) A. i.#locs alldiss i. <./alldiss 84 ]ord=. 84 A. i.#locs 0 4 3 1 2 5 ]dis=. +/2 disBtwPts/\circuit locs{~ord 2.24734 pd 'reset' pd <"1 |:locs [ pd 'type point;pensize 2' pd 'title TSP Example: Shortest Path' pd <"1 |:circuit locs{~ord [ pd 'type line;itemcolor green' pd 'text 100 100 Total Distance=', ":dis pd 'show'
Expanding on this:
|:locs=. 10 2?@$0 0.432281 0.981256 0.865989 0.175182 0.422217 0.188374 0.323206 0.348423 0.876925 0.421128 0.497782 0.18366 0.432289 0.617645 0.646875 0.10098 0.537415 0.345876 0.870853 0.217829 ord=. i.#locs pd 'reset' pd 'title TSP Example: Initial Path' pd <"1 |:locs [ pd 'type point;pensize 2' pd <"1 |:circuit locs{~ord [ pd 'type line;itemcolor green' dis=. +/2 disBtwPts/\circuit locs{~ord pd 'text 100 100 Total Distance=',":dis pd 'show'
!#locs NB. Possible number of paths 3.6288e6 $paths=. (10?!#locs) A. i.#locs NB. Some random permutations. 10 10 +/&>2 disBtwPts/\&.>circuit &.> <"2 locs{~paths 5.04213 4.41731 4.68363 5.36805 5.00227 5.45471 4.61821 5.26628 4.29059 5.18575 /:+/&>2 disBtwPts/\&.>circuit &.> <"2 locs{~paths 8 1 6 2 4 0 9 7 3 pd 'reset' pd 'title TSP Example: First Best (Random) Guess' pd <"1 |:locs [ pd 'type point;pensize 2' pd <"1 |:circuit locs{~8{paths [ pd 'type line;itemcolor green' dis=. +/2 disBtwPts/\circuit locs{~8{paths pd 'text 100 100 Total Distance=',":dis pd 'show'
$best=. paths{~(<.0.5+-:#paths){./:+/&>2 disBtwPts/\&.>circuit &.> <"2 locs{~paths 5 10 +/&>2 disBtwPts/\&.>circuit &.> <"2 locs{~best 4.29059 4.41731 4.61821 4.68363 5.00227 best 8 4 6 9 0 1 2 5 7 3 5 0 9 2 8 1 6 4 7 3 3 6 0 1 9 5 8 4 2 7 4 5 6 0 1 7 3 9 2 8 3 9 4 6 1 8 7 5 2 0 A."1 best 3093843 1851015 1290908 1633018 1429439 best i."(1 0) 0 4 1 2 3 9 (best i."(1 0) 0)|."(0 1) best 0 1 2 5 7 3 8 4 6 9 0 9 2 8 1 6 4 7 3 5 0 1 9 5 8 4 2 7 3 6 0 1 7 3 9 2 8 4 5 6 0 3 9 4 6 1 8 7 5 2 A."1 (best i."(1 0) 0)|."(0 1) best 1812 332002 38092 26538 117743 best=. (best i."(1 0) 0)|."(0 1) best NB. Start at same point for each path. ]ixs=. A."1 best 1812 332002 38092 26538 117743 i:5 _5 _4 _3 _2 _1 0 1 2 3 4 5 $ixs +/ i: 5 5 11 _1 A. i.3 2 1 0 newixs=. ~.,ixs +/ i: 5 3{.newixs 1807 1808 1809 (3{.newixs) A. i.#locs 0 1 2 5 7 3 6 4 9 8 0 1 2 5 7 3 6 8 4 9 0 1 2 5 7 3 6 8 9 4 diss=. +/&>2 disBtwPts/\&.>circuit &.> <"2 locs{~newixs A. i.#locs (<./,>./) diss 4.03417 5.54674 $paths=. locs{~newixs A. i.#locs 55 10 2 $paths=. <"2 locs{~newixs A. i.#locs 55 $best=. paths{~(<.0.5+-:#paths){./:+/&>2 disBtwPts/\&.>circuit &.> paths 28 diss{~(#best){./:diss 4.03417 4.13188 4.18143 4.20547 4.23095 4.29059 4.29463 4.29489 4.30649 4.35064 4.36072 4.39746 4.40792 4.40901 4.41017 4.41731 4.43094 4.44933 4.45013 4.46523 4.49365 4.51075 4.52487 4.56752 4.57172 4.58078 4.58707 4.61821 ]newixs=. newixs{~5{.\:diss 26537 26536 26533 26535 26534 newixs A. i.#locs 0 1 7 3 9 2 6 8 5 4 0 1 7 3 9 2 6 8 4 5 0 1 7 3 9 2 6 4 8 5 0 1 7 3 9 2 6 5 8 4 0 1 7 3 9 2 6 5 4 8 bp=. {.newixs pd <"1 |:circuit locs{~bp [ pd 'type line;itemcolor blue' pd 'reset' pd 'title TSP Example: Next Best (Permutation Neighbors) Guess' pd <"1 |:locs [ pd 'type point;pensize 2' pd <"1 |:circuit locs{~bp [ pd 'type line;itemcolor green' dis=. +/2 disBtwPts/\circuit locs{~bp pd 'text 100 100 Total Distance=',":dis bp=. 4{newixs pd <"1 |:circuit locs{~bp [ pd 'type line;itemcolor red;pensize 3' dis=. +/2 disBtwPts/\circuit locs{~bp pd 'textcolor red;text 100 200 Total Distance=',":dis pd 'show' pd <"1 |:circuit locs{~bp [ pd 'type line;itemcolor red;pensize 5' pd 'textcolor red;text 100 200 Total Distance=',":dis pd 'show'
Odds'n'Sods
We looked at an article on "Corpora and Vector Spaces" and implementing this in Python.
Corpora and Vector Spaces
We read about Genesim, a Python library for analyzing bodies ("corpora") of plain text by building vector spaces of the words in a given corpus. A vector space represents a word as a high-dimensional vector of numbers.
The complete article is included in the Materials section below.
About the Entry
An explanation of the winner of the 2015 "Underhanded C Contest": we looked at the rather simple bit of C code for comparing pairs of gamma-ray spectra as well as at a more general explanation of the results of that contest in which there was discussion of "NaN poisoning attacks". These attacks rely on the undefined behavior of "NaN" (Not a Number) which has representation in many programming languages, including J.
The complete articles are included in the Materials section below.
Got to be a Better Way
Here we looked at some code in Javascript for doing financial analysis. This perhaps falls under the heading of things for which a particular language is not well-suited.
Similarly, we examined some Python code for calculating means, medians, and quantile creation.
We compared the J code for doing this with the Python code.
J Code for Mean and Median
mean=: (+/ % #) :wmean median=: -:@(+/)@((<. , >.)@midpt { /:~) midpt=: -:@<:@# wmean=: +/@[ %~ +/@:*
Python Code for Mean and Median (with graphing)
Here's some Python code, with wrapping for plotting the resulting quantiles, for calculating these values.
# percentile breaker-upper import pandas as pd from pandas import read_csv, DataFrame import matplotlib.pyplot as plt plt.rcdefaults() import numpy as np import matplotlib.pyplot as plt path_to_file = input("path:") #path_to_file = 'test_data.csv' # remove trailing space if exists, as happens when getting the file path # by dragging an icon from the desktop on a Mac. path_to_file = path_to_file.strip() # read the file into a DataFrame dframe = read_csv(path_to_file) list_headers = list(dframe.columns.values) len_data = len(dframe.index) print('length of data: ',len_data) # print a list of the headers, with an index number to select for x in range(0,len(list_headers)): print('column #',x,' ',list_headers[x]) col_sort = int(input('enter column to sort by: ')) col_meanmed = int(input('enter mean/median column: ')) tiles = int(input("enter number of quantiles to use: ")) # sort the data dframe = dframe.sort_values([list_headers[col_sort]],ascending=True) # handle uneven number of rows by assigning remainder to first and last quantiles. # if remainder is odd, the first quantile gets the extra one. int_tile = int(len_data/tiles) modrem = len_data % tiles offset = int((modrem/2) + (modrem % 2)) stuff_begin = int_tile + offset stuff_end = len_data - int(modrem/2) - int_tile # this is the computed quantile mean and median data tile_data =[] # the first and last quantiles are handled outside the loop, since they # are possibly of differing row lengths to the inside quantiles. new_slice=slice(0,stuff_begin) temp_mean = dframe.iloc[new_slice,col_meanmed] print('\n\nquantile data:\n\nmean/median 0 :\t',DataFrame.mean(temp_mean),'\t',DataFrame.median(temp_mean)) tile_data.append([0,DataFrame.mean(temp_mean),DataFrame.median(temp_mean)]) for x in range(1,tiles-1): new_slice=slice((int_tile * x)+ offset, (int_tile * (x+1)) + offset) temp_mean = dframe.iloc[new_slice,col_meanmed] print('mean/median',x,':\t',DataFrame.mean(temp_mean),'\t',DataFrame.median(temp_mean)) tile_data.append([x,DataFrame.mean(temp_mean),DataFrame.median(temp_mean)]) new_slice = slice(stuff_end,len_data) temp_mean = dframe.iloc[new_slice,col_meanmed] print('mean/median',x+1,' :\t',DataFrame.mean(temp_mean),'\t',DataFrame.median(temp_mean)) tile_data.append([x+1,DataFrame.mean(temp_mean),DataFrame.median(temp_mean)]) # matplotlib seems to want tuples instead of other data types. # adding 1 to each element in the quantile row, so that quantiles don't start # at zero. row_t = tuple([row[0]+1 for row in tile_data]) row_mn = tuple([row[1] for row in tile_data]) row_md = tuple([row[2] for row in tile_data]) print(""" making bar chart.... """) # graphing stuff. Copied and pasted from samples, with much head-scratching involved. n_groups = len(row_t) title_graph = str(row_t[len(row_t)-1]) + ' quantiles, sorted by "' + str(list_headers[col_sort]) + '", data: "' + str(list_headers[col_meanmed]) + '"' fig, ax = plt.subplots() index = np.arange(n_groups) bar_width = 0.35 opacity = 1 rects1 = plt.bar(index, row_mn, bar_width, alpha=opacity, color='b', label='Mean') rects2 = plt.bar(index + bar_width, row_md, bar_width, alpha=opacity, color='r', label='Median') plt.xlabel('quantiles') plt.ylabel('value') plt.title(title_graph) plt.xticks(index + bar_width, row_t) plt.legend() plt.show()
Miscellaneous
We looked at an example of a crossword using regular expressions as clues.
Here is a partially solved one to illustrate how this works.
Materials
- File:2015 Winner of the Underhanded C Contest.pdf
- File:Corpora and Vector Spaces.pdf
- File:Easy Mean and Median Quantile Creation in Python.pdf
- File:ExplanationOf2015WinnerOfUnderhandedCContest one.pdf
- File:Introduction to Javascript for Financial Analysts.pdf
- File:Traveling Salesman Problem Using Genetic Algorithms.pdf