Machine Learning Homework’s

Taking CS445 has been interesting. I’ve noticed quite a few gaps in my understanding of statistics, which makes sense given my focus on more theoretical or structure driven maths. We’re half way through the term and we’ve covered a few types of learning mechanisms as well as contrasted them. I look forward to learning about more of them and eventually ensemble methods.

Keep in mind most of these homework’s were programmed in a day…

Homework 1 : Perceptrons : For a database of grey-scale images of handwritten digits: mnist. Use a perceptron to classify them. The accuracy stumbled around 80% and overall wasn’t impressive, which was exactly what was expected of a perceptron. Source

Homework 2 : Neural Network : Given the same problem we instead applied an NN with a single hidden layer. The accuracy of this one got near perfect on the training set (possibly over-fitting) and around 95% on the test set. The implementation, if you poke around it, isn’t crazy fast but allows for as many layers and nodes as you like. Source

Homework 3 : SVM : For this one we created a spam filter based on SVM_light and spambase data set. We also played around with weighted feature selection vs random feature selection. Overall this one was more wrestling with the oddities of SVM_light and does not reflect understanding of the internals of an SVM. Oh well. (will post source when this has been due for more than a week.)

The n-clique problem

The n-clique problem, is finding all of the n sized cliques for a given graph. More generally, given a metric space and a set of elements. We can build a graph based on a distance threshold. So for a metrizable space X, a set S \subset X, and a graph G. We let nodes of G correspond to elements of S. Let d(s_n,s_m)< \epsilon \Rightarrow (v_n,v_m)\in E(G).  For some epsilon.

It should be noted that we can divide the problem into connected subgraphs if the graph is not connected, this is check-able in linear time and as such is computationally free relative to what we’re about to deal with. This also will greatly reduce the expense at lower thresholds.

There are several ways to “check” for n-cliques, though most are brute force searches and are incredibly computationally expensive.

The simplest, is to check every x\subset V(G) of n size. For a k=|S| there are k \choose n possible subsets. Which simplifies to factorial time. We also have to verify each subset: \forall v_n,v_m \in x,  (v_n,v_m)\in E(G). Which has n \choose 2 possible checks. So our total complexity is a double factorial: k!*n! …

Assuming you have all (n-1)-cliques, you can count the number of (n-1)-cliques contained in each possible n-clique, which becomes (k-1)!*k! This also requires k! memory.

We can also go by edge addition, for a subgraph of G, the addition of an edge in G will create cliques. \forall (a,b)\in V(G) added to a proto-graph H, the set of new n-cliques C_n is… C_2 = \{\{x,y\}\subset V(G) | xy \in E(G) \}\\ \forall n>2 \\C_n = \{A\cup B | a\in A\in C_{n-1}, b\in B\in C_{n-1}, |A\cap B|=n-2\}.

If we can create a hierarchy of containment for cliques this might be fairly efficient.

 

Road Generation For City Layouts

This is a step in a larger project for detailed city generation within Unreal Engine. It encompasses the generation of road patterns within a boundary following a rule set.

Tiers: Roads are generated in tiers (normally highway, major, minor). The first tier starts at a random position on the boundary. All other tiers use trace seeds generated by the previous tier.

Tracing: The road is generated by greedily tracing the major and minor gradients of a continuous function on the x,y plane describing terrain height. At an interval a road trace marks its current location and places seeds for tracing the gradient (perpendicular to the current road) in either direction. Spacing between roads is managed by casting an intersection test perpendicular to the current trace and aborting the trace if another parallel road is to close. Road traces stop either when intersecting a parallel road (which they then merge with) or after a certain length. In either case a seed is placed at the end to continue the road later.

Hills: If the gradient is over a threshold, it skews the traces so as to create a sweeping hill climbing pattern. It also adjusts the threshold for parallel to maintain clean spacing.

Because the vertex count is so high (hundreds of thousands to millions) it uses several kd-trees with nearest neighbor and range search.

Currently the project takes ~30 seconds to generate a layout. Calculating the complexity of this is hard as its so dependent on spacing rules. The complexity of a trace however is log(n)*k where k is the trace iteration count.

Next in this project is to simplify the graph by merging paths. Then find the borders of all planar regions in the graph. Unless I decide to implement water/bridges… So many options

Bak-Sneppen

baksnep_start

Bak-Sneppen, A simplistic model of competitive evolution…

Definition of the Bak-Sneppen model

This problem was brought to my attention by another student working on developing it for a research problem. I thought I’d mock up a parallel version just to play around. Its not necessarily appropriate, but it was the flavor of the month for my hobbies.

The system works as follows:

  1. Create a random environment of unsigned 32-bit integers.
  2. For each element in the environment, give it an index of the element directly less than and greater than or equal to.
  3. Move this structures onto the GPU along with…
  4. A 3 * batchSize list random integers
  5. Now each thread checks if it has a valid less than index, if not, it and its neighbors pop integers off the random list, overwrite their values, record their indexes, and remove themselves from the structure.
  6. Each thread not removed checks if the removed elements are greater than or equal to them, and less than their greater value, if so they re-insert those elements above them.
  7. It does this batchSize times
  8. Then once a batch is done, the GPU starts over

Meanwhile

  • The CPU is updating an array with new random values, and swapping pointers to keep random values fresh.
  • The CPU is copying the recorded changed indices to keep a record of what elements “died”

All together it ends up being pretty darn fast.

baksnep_end

Hamr Online

preview_b

Hamr Online Map Generator project

This is a project that fell by the way-side when school started back up, as well as my departure from using the source engine in many projects. Quite a bit of work went into it, and one day I think I’d like to revisit it.

The source is included in the download if anyone is interested in poking it. It works fairly well. Though its polygon triangulation can be buggy with more complex shapes.

SDPS_Brute

seednum

This is a small console application built to solve the puzzles presented by a fellow student’s mobile math game. It gives you access to 7 basic arithmetic functions and a single digit. The goal is to reach a target value, using only those tools. The secondary goal is to complete the task with the fewest instances of the given digit.

I first solved a few puzzles by hand. With advice by the student (who had played the game more than any sane man should) I started to develop a basic strategy for how to quickly solve puzzles. The functions given are +, -, *, /, ^, square root, and factorial. Two of these take in a single value, square root and factorial. The other five take in two.

For n digits:

I decided to only allow for factorial and square root to be used on the given digit, and build up a set of constants from the 5 possible combinations of these two. Being x, sqrt(x), x!, sqrt(x!), and (sqrt(x))!. I throw out any of the ones resulting in non-integers. So we have a list of n choices from these constants.

Next is the function application and order of operations. I create a binary tree of n-1 digits, initialized so that it is only left descending. This tree represents the order of operations. It uses a solver function recursively downward that ends in the root reaching the statement value. I also create a list of n-1 s choices from the functions between any 2 digits.

We then iterate through every combination of functions and digits, stopping if we reach the target value. If not, we move the left-most leaf of the tree directly right of the leaf directly right of it, and reset the lists to iterate again.

This method is entirely unwieldy, and STILL does not cover the entire search space.

It has a worst case step count of… (25^n)/5 * (2(n-1) choose (n-1))

Dear god why did I do this. Source

Thoughts:

  • If a solving tree member returns a non-integer, quit the solution process and iterate.
  • Some function combinations are useless. Scan for these ( find …*(x/x) structures)
  • Perhaps use an exploratory method using factor similarity as a metric for shortest path.
  • Tree members ALSO need a state, as a factorial/ square root could also be applied to a statement fragment. This is a bit of a problem though as factorial could be applied recursively…
  • Half the order permutations are identical, so after the symmetric case I can stop