Clojure for Data Science
Visualizing different populations
Let's remove the weekday filter and plot the daily mean dwell time for both weekdays and weekends:
(defn ex-2-12 []
  (let [means (->> (load-data "dwell-times.tsv")
                   (with-parsed-date)
                   (mean-dwell-times-by-date)
                   (i/$ :dwell-time))]
    (-> (c/histogram means
                     :x-label "Daily mean dwell time unfiltered (s)"
                     :nbins 20)
        (i/view))))
The code generates the following histogram:
[Figure: Visualizing different populations. Histogram of the unfiltered daily mean dwell time (s).]
The distribution is no longer normal. In fact, it is bimodal: there are two peaks. The second, smaller peak corresponds to the newly added weekend data. It is lower both because there are fewer weekend days than weekdays and because the weekend daily means have a larger standard error, which spreads them over a wider range.
Note
In general, distributions with more than one peak are referred to as multimodal. They can be an indicator that two or more normal distributions have been combined, and therefore, that two or more populations may have been combined. A classic example of bimodality is the distribution of people's heights, since the modal height for men is larger than that for women.
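To make the idea of combined populations concrete, here is a minimal sketch that mixes draws from two normal distributions and prints a rough text histogram. It deliberately avoids Incanter so that it runs on plain Clojure. The function names, the seed, and the population parameters (500 values around 90 and 200 values around 120) are all assumptions chosen for illustration; they are not taken from the dwell-time data.

```clojure
(import '(java.util Random))

(defn normal-samples
  "Returns a vector of n draws from a normal distribution
   with the given mean and standard deviation."
  [^Random rng n mean sd]
  (vec (repeatedly n #(+ mean (* sd (.nextGaussian rng))))))

(defn bin-counts
  "Groups values into bins of the given width and returns a
   sorted map from each bin's lower edge to its count."
  [width xs]
  (into (sorted-map)
        (frequencies
         (map #(* width (long (Math/floor (/ % width)))) xs))))

(defn text-histogram
  "Prints one row per bin, with one * for every `scale` values."
  [scale counts]
  (doseq [[edge n] counts]
    (println (format "%5d | %s"
                     edge
                     (apply str (repeat (quot n scale) "*"))))))

;; Hypothetical populations, not the book's data: a larger, tighter
;; group and a smaller, more spread-out group.
(def rng (Random. 42))

(def mixed
  (concat (normal-samples rng 500 90.0 3.0)
          (normal-samples rng 200 120.0 6.0)))

(text-histogram 5 (bin-counts 5 mixed))
```

Because the second group is both smaller and more widely spread, its peak should appear lower than the first, which mirrors the explanation given for the weekend peak.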
The weekend data has different characteristics than the weekday data. We should make sure that we're comparing like with like. Let's filter our original dataset just to weekends:
(defn ex-2-13 []
  (let [weekend-times (->> (load-data "dwell-times.tsv")
                           (with-parsed-date)
                           (i/$where {:date {:$fn p/weekend?}})
                           (i/$ :dwell-time))]
    (println "n:      " (count weekend-times))
    (println "Mean:   " (s/mean weekend-times))
    (println "Median: " (s/median weekend-times))
    (println "SD:     " (s/sd weekend-times))
    (println "SE:     " (standard-error weekend-times))))

;; n:       5860
;; Mean:    117.78686006825939
;; Median:  81.0
;; SD:      120.65234077179436
;; SE:      1.5759770362547665
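The `standard-error` function called above is not defined in this section. As a rough sketch, assuming the usual definition of the standard error as the standard deviation divided by the square root of the sample size, it could look like the following. The names `mean`, `sample-sd`, and `standard-error-sketch` are hypothetical, and the choice to divide by n - 1 in the standard deviation is an assumption, so the last digits may differ slightly from the printed output.

```clojure
(defn mean
  "Arithmetic mean of a collection of numbers."
  [xs]
  (/ (reduce + xs) (double (count xs))))

(defn sample-sd
  "Sample standard deviation (divides by n - 1)."
  [xs]
  (let [m  (mean xs)
        ss (reduce + (map #(let [d (- % m)] (* d d)) xs))]
    (Math/sqrt (/ ss (dec (count xs))))))

(defn standard-error-sketch
  "Standard error of the mean: SD / sqrt(n).
   An assumed definition, not necessarily the source's own."
  [xs]
  (/ (sample-sd xs) (Math/sqrt (count xs))))

(println (standard-error-sketch [2 4 4 4 5 5 7 9]))
```

Applied to the summary statistics above, the same formula gives roughly 120.65 / sqrt(5860), or about 1.58 seconds.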
The grand mean value at weekends (based on 6 months of data) is 117.8s, which falls within the 95 percent confidence interval of the marketing sample. In other words, although 130s is a high mean dwell time, even for a weekend, the difference is not so big that it couldn't simply be attributed to chance variation within the sample.
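The check described here can be sketched as a small function that builds an approximate 95 percent confidence interval and tests whether a candidate population mean falls inside it. This is an illustration under stated assumptions: it uses the normal critical value 1.96 rather than a t-distribution quantile, and while 130 is the marketing sample mean quoted in the text, the standard deviation (120) and sample size (300) are hypothetical placeholders, since the sample's details are not given in this section.

```clojure
(defn confidence-interval-95
  "Approximate 95% confidence interval for a population mean,
   using the normal critical value 1.96. Returns [lower upper]."
  [sample-mean sample-sd n]
  (let [se     (/ sample-sd (Math/sqrt n))
        margin (* 1.96 se)]
    [(- sample-mean margin) (+ sample-mean margin)]))

(defn within-interval?
  "True when x lies inside the closed interval [lower upper]."
  [[lower upper] x]
  (<= lower x upper))

;; 130.0 comes from the text; 120.0 and 300 are hypothetical values.
(def marketing-ci (confidence-interval-95 130.0 120.0 300))

(println "95% CI:" marketing-ci)
(println "Contains 117.8?" (within-interval? marketing-ci 117.8))
```

With these placeholder values the interval contains 117.8, but that outcome depends entirely on the assumed standard deviation and sample size. For small samples, a t-distribution quantile would be the more appropriate multiplier.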
The approach we have just taken to establish a genuine difference in populations (between the visitors to our site on weekends compared to the visitors during the week) is not the way statistical testing would conventionally proceed. A more usual approach is to begin with a theory, and then to test that theory against the data. The statistical method defines a rigorous approach for this called hypothesis testing.