
Testing a new site design
The web team at AcmeContent have been hard at work developing a new site to encourage visitors to stick around for an extended period of time. They've used all the latest techniques and, as a result, we're pretty confident that the site will show a marked improvement in dwell time.
Rather than launching it to all users at once, AcmeContent would like to test the site on a small sample of visitors first. We've educated them about sample bias, and as a result, the web team diverts a random 5 percent of the site traffic to the new site for one day. The result is provided to us as a single text file containing all the day's traffic. Each row shows the dwell time for a visitor, together with a value of either "0" if they used the original site design or "1" if they saw the new (and hopefully improved) site.
Performing a z-test
When testing with confidence intervals previously, we had a single population mean to compare against.
With z-testing, we have the option of comparing two samples. The people who saw the new site were randomized, and the data for both groups was collected on the same day to rule out other time-dependent factors.
Since we have two samples, we also have two standard errors. The z-test is performed against the pooled standard error, which is simply the square root of the sum of each sample's variance divided by its sample size. This is the same as the result we would get if we took the standard error of the samples combined:

$$\mathrm{SE}_{ab} = \sqrt{\frac{\sigma_a^2}{n_a} + \frac{\sigma_b^2}{n_b}}$$

Here, $\sigma_a^2$ is the variance of sample a and $\sigma_b^2$ is the variance of sample b. $n_a$ and $n_b$ are the sample sizes of a and b, respectively. The pooled standard error can be calculated in Clojure like this:
(defn pooled-standard-error [a b]
  ;; sum each sample's variance over its size, then take the square root
  (i/sqrt (+ (/ (i/sq (standard-deviation a)) (count a))
             (/ (i/sq (standard-deviation b)) (count b)))))
To determine if the difference we're seeing is unexpectedly large, we can divide the observed difference between the means by the pooled standard error. This quantity is given the variable name z:

$$z = \frac{\bar{x}_a - \bar{x}_b}{\mathrm{SE}_{ab}}$$
Using our pooled-standard-error function, the z-statistic can be calculated like this:
(defn z-stat [a b]
  (-> (- (mean a) (mean b))
      (/ (pooled-standard-error a b))))
The ratio z captures how much the means differ relative to the amount we would expect given the standard error. The z-statistic therefore tells us how many standard errors apart the means are. Since the standard error has a normal probability distribution, we can associate this difference with a probability by looking up the z-statistic in the normal CDF:
(defn z-test [a b]
  ;; look up the z-statistic in the normal CDF to obtain a p-value
  (s/cdf-normal (z-stat a b)))
The following example uses the z-test to compare the performance of the two sites. We do this by grouping the rows by site, returning a map that indexes the site to the collection of rows for the site. We call map-vals with (partial map :dwell-time) to convert the collection of rows into a collection of dwell times. map-vals is a function defined in Medley (https://github.com/weavejester/medley), a library of lightweight utility functions:
(defn ex-2-14 []
  (let [data (->> (load-data "new-site.tsv")
                  (:rows)
                  (group-by :site)
                  (map-vals (partial map :dwell-time)))
        a (get data 0)
        b (get data 1)]
    (println "a n:" (count a))
    (println "b n:" (count b))
    (println "z-stat: " (z-stat a b))
    (println "p-value:" (z-test a b))))

;; a n: 284
;; b n: 16
;; z-stat: -1.6467438180091214
;; p-value: 0.049805356789022426
Setting a significance level of 5 percent is much like setting a confidence interval of 95 percent. In essence, we're looking to see if the observed difference falls outside the 95 percent confidence interval. If it does, we can claim to have found a result that's significant at the 5 percent level.
This code returns a value of 0.0498, equating to 4.98 percent. As it is just less than our significance threshold of 5 percent, we can claim to have found something significant.
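To make the comparison against the threshold explicit, we could wrap it in a small predicate. This is just an illustrative sketch; significant? is a hypothetical helper, not a function defined elsewhere in this chapter:

(defn significant? [p-value alpha]
  ;; true when the p-value falls below the chosen significance level
  (< p-value alpha))

;; using the samples from ex-2-14:
;; (significant? (z-test a b) 0.05) => true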
Let's remind ourselves of the null and alternative hypotheses:
- H0: The dwell time for the new site is no different from the dwell time of the existing site
- H1: The dwell time is greater for the new site compared to the existing site
Our alternative hypothesis is that the dwell time is greater for the new site.
We are ready to claim statistical significance and conclude that the dwell time is greater for the new site than for the existing site, but we have a problem. With a smaller sample, there is increased uncertainty that the sample standard deviation matches the population standard deviation. Our new site sample has only 16 visitors, as shown in the output of the previous example. Samples as small as this invalidate the assumption that the standard error is normally distributed.
Fortunately, there is a statistical test and an associated distribution which models the increased uncertainty of standard errors for smaller sample sizes.
Student's t-distribution
The t-distribution was popularized by William Sealy Gosset, a chemist working for the Guinness Brewery in Ireland, who incorporated it into his analysis of stout. Gosset published under the pen name "Student", which is why it is known as Student's t-distribution.
While the normal distribution is completely described by two parameters (the mean and standard deviation), the t-distribution is described by only one parameter, called the degrees of freedom. The larger the degrees of freedom, the more closely the t-distribution resembles a normal distribution with a mean of zero and a standard deviation of one. As the degrees of freedom decreases, the distribution becomes wider, with tails that are fatter than those of the normal distribution.

[Chart: t-distribution density curves for several degrees of freedom, plotted against the standard normal distribution]
The earlier chart shows how the t-distribution varies with respect to the normal distribution for different degrees of freedom. Fatter tails for smaller sample sizes correspond to an increased chance of observing larger deviations from the mean.
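We can see this convergence numerically with Incanter's pdf-t and pdf-normal functions. This is a quick REPL sketch, assuming incanter.stats is aliased as s as in the earlier examples; the outputs shown are approximate:

;; density at x = 2.0 for t-distributions of increasing degrees of freedom,
;; compared with the standard normal density
(s/pdf-t 2.0 :df 2)  ;; => approximately 0.068
(s/pdf-t 2.0 :df 30) ;; => approximately 0.057
(s/pdf-normal 2.0)   ;; => approximately 0.054

Notice how the density two standard deviations from the mean shrinks towards the normal density as the degrees of freedom increases.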
Degrees of freedom
The degrees of freedom, often abbreviated to df, is closely related to the sample size. It is a useful statistic and an intuitive property of a sample that can be demonstrated simply by example.
If you were told that the mean of two values is 10 and that one of the values is 8, you would not need any additional information to be able to infer that the other value is 12. In other words, for a sample size of two and a given mean, one of the values is constrained if the other is known.
If instead you're told that the mean of three values is 10 and the first value is also 10, you would not be able to deduce what the remaining two values are. Since there are an infinite number of sets of three numbers beginning with 10 whose mean is 10, the second value must also be specified before you can infer the value of the third.
For any set of three numbers, the constraint is simple: you can freely pick the first two numbers, but the final number is constrained. The degrees of freedom can thus be generalized in the following way: for any single sample, the degrees of freedom is one less than the sample size.
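This constraint is easy to express in code: given the mean of n values and all but one of those values, the remaining value is fully determined. The following is a small illustrative sketch; infer-last is a hypothetical helper, not a function used elsewhere in this chapter:

(defn infer-last
  "Given the mean of n values and the first (dec n) of them,
  returns the one remaining value consistent with that mean."
  [n mean-value known-values]
  (- (* n mean-value) (reduce + known-values)))

(infer-last 2 10 [8])    ;; => 12
(infer-last 3 10 [10 7]) ;; => 13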
When comparing two samples of data, the degrees of freedom is two less than the sum of the sample sizes, which is the same as the sum of their individual degrees of freedom.
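Expressed in Clojure, the two-sample degrees of freedom could be written as follows. This is a sketch; the name pooled-degrees-of-freedom is our own choice here:

(defn pooled-degrees-of-freedom [a b]
  ;; one degree of freedom is lost for each of the two sample means
  (+ (dec (count a))
     (dec (count b))))

;; for our samples of 284 and 16 visitors:
;; (pooled-degrees-of-freedom a b) => 298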