Let us consider the estimation of sample size for a cross-sectional study.
In order to estimate the required sample size, we need to know the following:
p: The prevalence of the condition/ health state. If the prevalence is 32%, it may be either used as such (32%), or in its decimal form (0.32).
q: i. When p is in percentage terms: (100-p)
ii. When p is in decimal terms: (1-p)
d (or l): The precision of the estimate, expressed in the same terms as p (percentage or decimal). This could either be the relative precision, or the absolute precision. This will be discussed later in this post.
Za [Z alpha]: The value of z from the probability tables. If the values are normally distributed, then 95% of the values will fall within approximately 2 (more precisely, 1.96) standard errors of the mean. The value of z corresponding to 95% confidence is therefore 1.96 (from the standard normal variate tables).
The formula for estimating sample size is given as:
N= [(Za)^2 * p * q] / d^2
where the symbol ^ means ‘to the power of’ and * means ‘multiplied by’; that is, “Z-alpha squared into pq; upon d-square”
Substituting the value of Za, we get:
N= [(1.96)^2 * p * q] / d^2
We can round off the value of Za (1.96) to 2, to obtain:
N= [(2)^2 * p * q] / d^2
or, N= 4pq/ d^2 that is, “4 pq by d-square”
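The formula can be checked numerically. The following is a minimal Python sketch, not part of the original method description: the function name sample_size and its arguments are illustrative assumptions, p and d are taken in percentage terms, and the result is rounded up because a fraction of a subject cannot be recruited. The values p = 20 and d = 4 are those used in the worked example that follows.

```python
import math

def sample_size(p, d, z=1.96):
    """Sample size N = z^2 * p * q / d^2, with p and d in percentage terms.

    The name and signature are illustrative, not taken from any library.
    """
    q = 100 - p                       # complement of p, in percent
    n = (z ** 2) * p * q / (d ** 2)
    # Remove floating-point noise, then round up to a whole subject.
    return math.ceil(round(n, 6))

# p = 20%, d = 4: compare Za = 1.96 with the rounded value Za = 2
print(sample_size(20, 4))         # Za = 1.96
print(sample_size(20, 4, z=2))    # Za = 2, i.e. N = 4pq/d^2
```

With these inputs, rounding Za from 1.96 to 2 raises the sample size from 385 to 400, so the simplified formula errs slightly on the side of a larger sample.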
Example:
I wish to conduct a cross-sectional study on awareness of Hepatitis B among school children. A literature search reveals that other investigators have reported knowledge to range from 5% to 20% among students of grades 6 through 8. What should the size of my sample be?
The formula requires us to input the value of d (precision). If the absolute precision is known, there is no problem. However, often we can only input a relative precision. Where do we get the value of relative precision from?
Typically, relative precision is taken as a proportion of ‘p’. The maximum permissible limit is 20% of ‘p’.
In the above example, if ‘p’ is 20%, then ‘d’ will be (20/100)*20= 0.2*20= 4 {Taking a relative precision of 20%}.
This means that the estimate is expected to lie within ‘d’ on either side of ‘p’ {+/- 4%: 16% to 24%}, with 95% confidence.
That is, by taking a relative precision of 20% of ‘p’, the study will be able to estimate the awareness level to within 4 percentage points if the actual prevalence is close to 20%. If the actual prevalence is much lower than assumed, however, the same sample yields a margin that is large relative to the true value, so the study will not achieve the intended relative precision.
Therefore, the larger the value of ‘p’ (prevalence), the larger the absolute value of ‘d’, keeping the relative precision fixed (say, at 20% of ‘p’). If the prevalence is 50%, ‘d’ (20% of ‘p’) would then be 0.2*50= 10 (as compared to ‘d’ = 4 when ‘p’ = 20%).
The reverse is also true: the smaller the value of ‘p’, the smaller the value of ‘d’. A smaller ‘d’ implies a larger sample size. Therefore, the choice of ‘p’ is crucial.
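The conversion from relative precision to ‘d’ described above can be sketched in a few lines of Python. This is only an illustration: the function name absolute_precision and its default of 20% are assumptions made for the example, and p is taken in percentage terms.

```python
def absolute_precision(p, relative_pct=20):
    """Convert a relative precision (a percentage of p) into an absolute d.

    p is in percentage terms; the function name is illustrative.
    """
    return relative_pct * p / 100

# The same relative precision gives a different d for each prevalence.
for p in (5, 20, 50):
    d = absolute_precision(p)
    print(f"p = {p}%  ->  d = {d}")
```

Because ‘d’ appears squared in the denominator of the formula, the small ‘d’ that comes with a small ‘p’ is what drives the sample size up.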
We can now input the values in the formula to obtain the sample size:
For the calculation we will take ‘d’ as 4. This yields:
N= (4*20*80)/ (4*4)
= 400
With this sample size, the estimate is expected to fall within 4 percentage points of the true prevalence (16-24%) if the prevalence is about 20%.
If we took ‘p’= 5, then the sample size would be:
N= (4*5*95)/(1*1) [‘d’= 0.2*5= 1]
= 1900
With this sample size, the estimate is expected to fall within 1 percentage point of the true prevalence (4-6%) if the prevalence is about 5%.
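To see what is at stake in choosing between these two values, the formula N= 4pq/ d^2 can be rearranged to give the precision obtained from a fixed sample: d = 2 * sqrt(pq/N). The sketch below is an illustration under that rearrangement only; the function name achieved_precision is an assumption, and p and d are in percentage terms.

```python
import math

def achieved_precision(p, n, z=2):
    """Rearranged formula: d = z * sqrt(p * q / n), with p and d in percent.

    The function name is illustrative.
    """
    q = 100 - p
    return z * math.sqrt(p * q / n)

planned_n = 400  # sample sized for p = 20% with d = 4

# What happens if the true prevalence is 20% (as assumed) or only 5%?
for true_p in (20, 5):
    d = achieved_precision(true_p, planned_n)
    relative = 100 * d / true_p
    print(f"true p = {true_p}%: d = {d:.2f}, i.e. {relative:.0f}% of p")
```

If the sample of 400 is used but the true prevalence turns out to be 5%, ‘d’ is roughly 2.2, which is more than 40% of ‘p’ and well beyond the 20% limit on relative precision.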
So should I take ‘p’= 20% or ‘p’=5%?
That depends upon:
1. The location of the original study: if you are planning to conduct the study in an urban area, use the prevalence reported by studies conducted in urban areas, and vice versa.
2. The available resources (time, manpower, money, etc.). Aim for the largest feasible sample size. The size should be adequate to yield 80% power. Do not unnecessarily increase the sample size unless the intention is to obtain greater power. If so, please mention the same in the methodology section.
3. The results of your pilot study. If you have conducted a pilot study, the prevalence obtained from that study should be taken as ‘p’. This will be much more accurate than any other external value.
Note 1: If you have multiple objectives, you must calculate the required sample size for each objective, then choose the largest sample size thus obtained. This will ensure adequate power for all objectives, else the study will lack power for one or more objectives. That is, you may not be able to detect a significant result where it actually exists because you failed to include enough subjects to detect it.
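The procedure in Note 1 can be sketched as follows. The objective names and their (p, d) pairs are hypothetical placeholders chosen only to illustrate the step of taking the largest value; the helper sample_size is likewise an assumption of this example, using N= 4pq/ d^2 with p and d in percentage terms.

```python
import math

def sample_size(p, d, z=2):
    """N = z^2 * p * q / d^2, rounded up; p and d in percentage terms."""
    q = 100 - p
    return math.ceil(round(z ** 2 * p * q / d ** 2, 6))

# Hypothetical objectives: (assumed prevalence in %, absolute precision d)
objectives = {
    "objective A": (20, 4),
    "objective B": (5, 1),
}

# Compute the sample size for each objective, then keep the largest.
sizes = {name: sample_size(p, d) for name, (p, d) in objectives.items()}
required_n = max(sizes.values())
print(sizes)
print("Sample size to use:", required_n)
```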
Note 2: It is advisable to mention a range rather than a single value for sample size. This is standard practice in the west, but not in India. A range may be obtained by calculating the sample size for different values of ‘p’.
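One way to obtain such a range is to repeat the calculation over the prevalence values reported in the literature. The sketch below assumes the 5% to 20% range from the example and a relative precision of 20% of ‘p’; the intermediate values 10 and 15 and the function name sample_size_relative are illustrative choices, not from the original example.

```python
import math

def sample_size_relative(p, relative_pct=20, z=2):
    """N = z^2 * p * q / d^2 with d set to relative_pct percent of p.

    p is in percentage terms; the function name is illustrative.
    """
    d = relative_pct * p / 100
    q = 100 - p
    return math.ceil(round(z ** 2 * p * q / d ** 2, 6))

candidate_p = (5, 10, 15, 20)   # prevalence values spanning the reported range
sizes = [sample_size_relative(p) for p in candidate_p]

for p, n in zip(candidate_p, sizes):
    print(f"p = {p}%: N = {n}")
print(f"Sample size range: {min(sizes)} to {max(sizes)}")
```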
Dear Dr Roopesh, I am conducting a cross-sectional study assessing biomedical waste management and disposal practices among selected hospitals in Port Harcourt, Nigeria. I am looking for a formula to use in calculating my sample size. Thanks
Dear Tonia,
The formula for cross-sectional studies is the same as that mentioned in the article.
You will have to substitute the values of p and q, and decide on the relative precision desired, to compute the sample size.
Regards,
Dr. Roopesh
Thanks Dr, I used 30% as my prevalence. I hope that is acceptable?
Dear Tonia,
30% is okay if you have literature citing that prevalence.
Regards,
Dr. Roopesh
Thanks Dr, please what is the name of the formula, or of its author?
I am doing a cross-sectional study checking serum endothelin-1 levels in heart failure patients and their correlates, comparing the characteristics of patients with elevated levels with those of patients with normal levels of endothelin-1. Please, can I use this formula to calculate the sample size?
Dear Ekikere Marcel,
If your study is cross-sectional, then you can use the formula mentioned in the article.
Regards,
Dr. Roopesh