CPS: User Manual

Context

The aim of this program is to place media along an axis according to the judgment of human workers. However, since crowdsourcing is used, there is no reason to assume the workers are qualified. Furthermore, if the classification has to be done by humans, the criteria are most likely very subjective: it would definitely NOT be a good idea to ask workers to simply give, say, a mark between one and ten. Averaging several marks may not be the solution either, because each question is paid for: it could be very costly without guaranteeing a significant improvement in the overall quality of the results.

The ideas behind the splitsort

On the use of comparisons rather than marks

Workers cannot be asked to answer a complex or overly subjective question, such as giving a medium a mark between one and ten: we could not be sure of the quality of the results, and doing so would require you to pay more for each question. Therefore, CPS only asks them to compare two media: "which one is the most 'your axis'?".

From comparisons to a coordinate

All right, workers answer comparisons, which is more reliable. But how do we use them to give each medium a coordinate? The trick is to assume that media follow a given distribution along 'your axis' (here, a uniform one, but this can be changed by rewriting the rank2value() method). We sort the media using a quicksort and then convert each rank into a value according to the distribution they are supposed to follow.
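As a minimal sketch of the rank-to-value step, assume ranks start at 0 and the uniform distribution covers [-1, 1] (the interval used in the example later in this manual). Each rank can then be mapped to the midpoint of its slice of the interval. The function below only borrows the name rank2value(); its signature and the midpoint convention are assumptions, not the actual CPS code.

```python
def rank2value(rank, count, low=-1.0, high=1.0):
    """Map a 0-based rank among `count` sorted media to a coordinate.

    Hypothetical sketch: assumes a uniform distribution on [low, high] and
    places each medium at the midpoint of its 1/count slice of the interval.
    """
    if count <= 0 or not 0 <= rank < count:
        raise ValueError("rank must satisfy 0 <= rank < count")
    return low + (high - low) * (rank + 0.5) / count


# Four sorted media spread evenly over [-1, 1].
coordinates = [rank2value(rank, 4) for rank in range(4)]
```

Supporting another distribution would amount to replacing the linear formula with the inverse cumulative distribution function of that distribution, evaluated at (rank + 0.5) / count.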

On the separation of the comparisons from the sort

To further increase the quality of the results, CPS submits the results of the comparisons to a program, Get Another Label, written by P. Ipeirotis (an American researcher specialised in crowdsourcing, more info on his blog) and his students. Based on confusion matrices, it increases the quality of the results.

However, for this to be effective, it must be sent as many comparisons at a time as possible. That is why comparisons are gathered before they are sent to CrowdFlower. This also limits the number of uploads.

The part that performs the quicksort and the part that compares the media are split, hence the name splitsort.

A scheme of the splitsort principle

To make all this clearer, let's take a look at an example. Even though it is rather... interesting to claim that six media follow any kind of probability distribution, we will say they do for the sake of the example. So, we have six media to place along 'your axis': M1, M2, M3, M4, M5 and M6. If you have never heard of quicksort before, you may want to do your homework before going any further. Now, release the splitsort!

First iteration

First, just like a "normal" quicksort, we choose a pivot. CPS always takes the first element of the array to sort. Here, it is M1. Comparisons are then generated: in order to continue, we need to compare (M1 and M2), (M1 and M3), (M1 and M4), (M1 and M5) and (M1 and M6). These comparisons are then sent to a crowd that performs them. The results are then retrieved and processed using Get Another Label.

We now have good results, so we can finish our quicksort iteration. As it turns out, (M1 > M3,M4,M6) and (M1 < M2,M5), so we split the initial array into three: one that contains the smallest media (M3,M4,M6), one that contains the pivot (M1) and one that contains the greatest (M2,M5).
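The first iteration can be sketched as follows. This is only an illustration: the function names and the dictionary used to hold the processed answers are assumptions, and the answers are typed in by hand to match the example instead of coming from a real crowd.

```python
media = ["M1", "M2", "M3", "M4", "M5", "M6"]


def pivot_comparisons(items):
    """Return the pivot (first element) and the pairs to send to the crowd."""
    pivot = items[0]
    return pivot, [(pivot, other) for other in items[1:]]


def partition(items, pivot, pivot_is_greater):
    """Split items around the pivot using the processed crowd answers.

    pivot_is_greater maps each (pivot, other) pair to True when the pivot
    ranks higher than `other` along the axis.
    """
    rest = [m for m in items if m != pivot]
    smaller = [m for m in rest if pivot_is_greater[(pivot, m)]]
    greater = [m for m in rest if not pivot_is_greater[(pivot, m)]]
    return smaller, [pivot], greater


pivot, pairs = pivot_comparisons(media)

# Hand-written stand-in for the processed answers of the example.
answers = {
    ("M1", "M2"): False,
    ("M1", "M3"): True,
    ("M1", "M4"): True,
    ("M1", "M5"): False,
    ("M1", "M6"): True,
}

smaller, middle, greater = partition(media, pivot, answers)
```

With these answers, smaller holds M3, M4 and M6, middle holds M1, and greater holds M2 and M5, as in the text.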

a scheme representing the first iteration of a splitsort

Second iteration

We are not done yet: another iteration must be performed. Again, we choose the first element of each array as the pivot, so the pivots are M3 and M2. We then go through the same steps as in the first iteration, but there is one thing you must notice: all comparisons, whether for the first or the second array, are submitted at the same time! At the end of this iteration, the media are sorted.

Placing media

Now that we are done with sorting, it is time to place the media along 'your axis'. Assuming they follow a uniform distribution within [-1,1], the result will look like this:

the result : media are sorted along <your axis>

Tada! You have your results and are therefore most likely very happy.

Principle of the splitsort