Several types of measures have been developed to quantify the similarity between two spike trains. They have been used extensively to classify neuronal responses according to stimuli and to validate mathematical models that predict spike times. Here we analyze the existing similarity measures in the light of trial-to-trial variability. With only a small set of spike trains, it is often impossible to discriminate correctly between different generative processes. In particular, we find that many measures cannot discriminate appropriately between processes that differ in overall firing intensity or in the amount of spike-timing jitter. We find that some of the existing measures can be modified by taking into account the variance of the measure across spike trains from the same set. Doing so removes a sample bias and makes it possible to discriminate correctly in all cases. Finally, we demonstrate that, without sample bias compensation, the similarity between real neurons and spiking neuron models with low stochasticity will be overestimated.
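To illustrate how a sample bias of this kind can arise, the following minimal sketch compares two estimates of a normalized inner-product similarity between two sets of binned spike trains. The function names, the binned representation, and the specific similarity formula are assumptions chosen for illustration and are not necessarily the measures analyzed here. The sketch relies only on a standard statistical fact: the squared norm of a trial average overestimates the squared norm of the true mean by the trial-to-trial variance divided by the number of trials, and averaging inner products over pairs of distinct trials avoids that term.

```python
# Hypothetical sketch: each spike train is a list of bin counts, and a "set"
# is a list of such trials recorded under the same condition.

def dot(u, v):
    """Inner product of two binned spike trains."""
    return sum(x * y for x, y in zip(u, v))

def mean_vector(trials):
    """Trial-averaged response (a PSTH-like vector)."""
    n = len(trials)
    return [sum(col) / n for col in zip(*trials)]

def naive_sq_norm(trials):
    """Squared norm of the trial average; inflated by variance / n for small n."""
    m = mean_vector(trials)
    return dot(m, m)

def pairwise_sq_norm(trials):
    """Average inner product over pairs of distinct trials (needs n >= 2)."""
    n = len(trials)
    total = [sum(col) for col in zip(*trials)]
    # sum over i != j of <x_i, x_j> = ||sum_i x_i||^2 - sum_i ||x_i||^2
    cross = dot(total, total) - sum(dot(t, t) for t in trials)
    return cross / (n * (n - 1))

def similarity(a, b, sq_norm):
    """Normalized inner product between two sets, given a norm estimator."""
    return 2.0 * dot(mean_vector(a), mean_vector(b)) / (sq_norm(a) + sq_norm(b))

# Toy data: set_a varies from trial to trial, set_b is perfectly reliable.
set_a = [[1, 1, 0], [1, 0, 1]]
set_b = [[1, 0, 0], [1, 0, 0]]
print(similarity(set_a, set_b, naive_sq_norm))
print(similarity(set_a, set_b, pairwise_sq_norm))
```

In this toy example the naive norm of the variable set is larger than its pairwise estimate, so the naive similarity is pulled down by variability alone. A set with little trial-to-trial variability, such as a nearly deterministic model, carries almost no such penalty in its own norm.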