﻿ Standard deviation of combined data - 开发者网 - DeveloperSite.cn

# Standard deviation of combined data

Keywords：r

Question:

I have a data set with mean values, standard deviations and n. One of the variables has an equal sample size, while the sample size for the other one varies.

``````dat <- data.frame(variable = c(rep("x", 2), rep("y", 3)), replicate = c(1,2,1,2,3),
mean = c(3.4, 2.5, 6.5, 5.7, 5.1), sd = c(1.2, 0.7, 2.4, 4.0, 3.5),
n = c(3,3,5,4,6))
``````

I need to combine `x` and `y` variables and am trying to find a code-sparing way to calculate combined standard deviation for instance using by `aggregate` function. The equation for combined standard deviation is following:

And for unequal sample sizes (same source):

My combined data frame should look like this:

``````variable    mean    sd
x           2.95    sd_x
y           5.76    sd_y
``````

How to make a function in R that calculates the combined standard deviation? Or alternatively, if there is a package designed for this, it counts as an answer too =)

Rudmin (2010) states that exact variance of pooled data set is the mean of the variances plus the variance of the means. flodel has already provided an answer and function that gives similar values to Rudmin's statement. Using Rudmin's data set and flodel's function based on Wikipedia:

``````df <- data.frame(mean = c(30.66667, 31.14286, 40.33333), variance = c(8.555555, 13.26531, 1.555555), n = c(6,7,3))

grand.sd   <- function(S, M, N) {sqrt(weighted.mean(S^2 + M^2, N) -
weighted.mean(M, N)^2)}

grand.sd(sqrt(df\$variance), df\$mean, df\$n)^2

#[1] 22.83983 = Dp variance in Rudmin (2010).
``````

However this solution gives slightly different values compared to the function 5.38 from Headrick (2010) (unless there is a mistake somewhere):

``````dat <- data.frame(variable = c(rep("x", 2), rep("y", 3)), replicate = c(1,2,1,2,3),
mean = c(3.4, 2.5, 6.5, 5.7, 5.1), sd = c(1.2, 0.7, 2.4, 4.0, 3.5),
n = c(3,3,5,4,6))

x <- subset(dat, variable == "x")

((x\$n[1]^2)*(x\$sd[1]^2)+
(x\$n[2]^2)*(x\$sd[2]^2)-
(x\$n[2])*(x\$sd[1]^2) -
(x\$n[2])*(x\$sd[2]^2) -
(x\$n[1])*(x\$sd[1]^2) -
(x\$n[1])*(x\$sd[2]^2) +
(x\$n[1])*(x\$n[2])*(x\$sd[1]^2) +
(x\$n[1])*(x\$n[2])*(x\$sd[2]^2) +
(x\$n[1])*(x\$n[2])*(x\$mean[1] - x\$mean[2])^2)/
((x\$n[1] + x\$n[2] - 1)*(x\$n[1] + x\$n[2]))

#[1] 1.015

grand.sd(x\$sd, x\$mean, x\$n)^2

#[1] 1.1675
``````

To answer my own question, the desired `data.frame` would be acquired followingly:

``````library(plyr)
ddply(dat, c("variable"), function(dat) c(mean=with(dat,weighted.mean(mean, n)),  sd = with(dat, grand.sd(sd, mean, n))))

variable     mean       sd
1        x 2.950000 1.080509
2        y 5.726667 3.382793
``````