K-fold cross-validation

Source code

When working with AI, you usually have to train your implementation on a data set.

This raises a well-known problem: if you train your machine on the whole set, it will fit that data set perfectly, but you have no way of knowing whether it will do a good job on data outside that set.

For that reason you usually create two groups: the first one is used for training, and the second one for testing.

A first approach to creating these two groups is to take the training data from the beginning of the set up to some index, and to use everything after that index as testing data.

The problem with this simple technique is that either group may not be representative of the original data, so both the training and the testing could be misleading.
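To make that pitfall concrete, here is a minimal sketch. It assumes a hypothetical data set of 10 samples that happens to be stored sorted by class label; the names `countLabel` and `sortedLabels` are made up for illustration.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: counts how many labels in [begin, end) equal `label`.
int countLabel(const std::vector<int>& labels, std::size_t begin,
               std::size_t end, int label)
{
  int count = 0;
  for (std::size_t i = begin; i < end; i++)
    if (labels[i] == label)
      count++;
  return count;
}

// Hypothetical data set: 10 samples stored sorted by class label.
std::vector<int> sortedLabels()
{
  return {0, 0, 0, 0, 0, 0, 1, 1, 1, 1};
}
```

With a split index of 7, the training part (samples 0 to 6) holds six samples of class 0 and only one of class 1, while the testing part (samples 7 to 9) holds nothing but class 1.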

A better solution is, for example, K-fold cross-validation, where you randomly divide the data into K balanced boxes (folds).

Once you have the K boxes, you iterate from 1 to K, and on each step you use box(i) for testing while all the other boxes are used for training, so every sample is used for testing exactly once. This gives you a much better way to train and test on your data.

There are many implementations of K-fold cross-validation for languages such as MATLAB, but it was hard for me to find one for C++, so I wrote a simple implementation that I think could be useful for someone else. Here it is.

The following function is just a C++ implementation of the Knuth shuffle described in the Wikipedia article of that name:

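#include <cstdlib>  // rand()

// Shuffles array[0 .. size-1] in place.
// Call srand() once beforehand, otherwise every run produces the same order.
void shuffleArray(int* array, int size)
{
  int n = size;
  while (n > 1)
  {
    // 0 <= k < n.
    int k = rand() % n;

    // n is now the last pertinent index;
    n--;

    // swap array[n] with array[k]
    int temp = array[n];
    array[n] = array[k];
    array[k] = temp;
  }
}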

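The next one is the simple function that returns the array of indices used to arrange the cross-validation. Entry i holds the box (0 to k-1) that sample i belongs to. The caller owns the returned array and must release it with delete[]: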

int* kfold( int size, int k )
{
  int* indices = new int[ size ];

  for (int i = 0; i < size; i++ )
    indices[ i ] = i%k;

  shuffleArray( indices, size );

  return indices;
}
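As an alternative, the same idea can be sketched with the C++11 standard library. This is not the original implementation: it swaps the hand-written shuffle for `std::shuffle` with a seeded `std::mt19937`, and returns a `std::vector` so nothing has to be freed by hand. The name `kfoldVector` and the `seed` parameter are assumptions made for this example.

```cpp
#include <algorithm>
#include <random>
#include <vector>

// Assigns each of `size` samples to one of `k` folds (labels 0 .. k-1).
// Fold sizes differ by at most one. Assumes size >= 0 and k > 0.
// The same `seed` always yields the same assignment on a given platform.
std::vector<int> kfoldVector(int size, int k, unsigned seed)
{
  std::vector<int> indices(size);
  for (int i = 0; i < size; i++)
    indices[i] = i % k;

  // Standard-library shuffle driven by a Mersenne Twister engine.
  std::mt19937 generator(seed);
  std::shuffle(indices.begin(), indices.end(), generator);

  return indices;
}
```

Calling `kfoldVector(20, 4, 42u)` would give 20 labels with exactly five samples in each of the four folds; the particular order depends on the seed.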

I've also included a little project for VS2003 that you can run to see the result using size=20 and k=4:

2 2 1 0 1 1 1 3 0 1 0 2 3 2 0 3 2 3 3 0
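One way to consume such an assignment is sketched here: for a given fold, the samples carrying that label become the testing set and all the others become the training set. The `Split` struct and the functions `makeSplit` and `sampleAssignment` are hypothetical names introduced for this example; `sampleAssignment` simply hard-codes the output printed above.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical container for one train/test split of sample indices.
struct Split
{
  std::vector<int> train;
  std::vector<int> test;
};

// Builds the split for one fold: samples labelled `fold` go to testing,
// all other samples go to training.
Split makeSplit(const std::vector<int>& foldOf, int fold)
{
  Split split;
  for (std::size_t i = 0; i < foldOf.size(); i++)
  {
    if (foldOf[i] == fold)
      split.test.push_back(static_cast<int>(i));
    else
      split.train.push_back(static_cast<int>(i));
  }
  return split;
}

// The fold assignment printed above (size=20, k=4).
std::vector<int> sampleAssignment()
{
  return {2, 2, 1, 0, 1, 1, 1, 3, 0, 1, 0, 2, 3, 2, 0, 3, 2, 3, 3, 0};
}
```

Looping `fold` from 0 to 3 and calling `makeSplit(sampleAssignment(), fold)` gives 5 testing samples and 15 training samples on every step. For fold 0, the testing samples are 3, 8, 10, 14 and 19.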