Sample size calculation: Cross-sectional studies

Let us consider the estimation of sample size for a cross-sectional study.

In order to estimate the required sample size, we need to know the following:

p: The prevalence of the condition/health state. If the prevalence is 32%, it may be used either as such (32%) or in its decimal form (0.32). Whichever form is chosen, q and d must be expressed in the same form.

q: The complement of p:

i. When p is in percentage terms: (100-p)

ii. When p is in decimal terms: (1-p)

d (or l): The precision of the estimate. This could either be the relative precision, or the absolute precision. This will be discussed later in this post.

Za [Z alpha]: The value of z from the probability tables for the chosen confidence level. If the values are normally distributed, then 95% of the values will fall within approximately 2 standard deviations of the mean. The exact value of z corresponding to this is 1.96 (from the standard normal variate tables).

The formula for estimating sample size is given as:

N= [(Za)^2 * p * q] / d^2

where the symbol ^ means ‘to the power of’ and * means ‘multiplied by’; that is, “Z-alpha squared into pq; upon d-square”

Substituting the value of Za, we get:

N= [(1.96)^2 * p * q] / d^2

We can round off the value of Za (1.96) to 2, to obtain:

N= [(2)^2 * p * q] / d^2

or, N= 4pq/ d^2 that is, “4 pq by d-square”
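The formula above can be sketched in Python as follows. This is a minimal illustration, not a definitive implementation: the function name `sample_size` and its parameter names are assumptions made for this example, p and d are both taken in percentage terms, and the result is rounded up to the next whole subject. The precision of 5 used in the calls is a hypothetical value chosen only to show the arithmetic.

```python
import math


def sample_size(p, d, z=1.96):
    """Return N = z^2 * p * q / d^2, rounded up to a whole number.

    p: anticipated prevalence in percent (e.g. 32 for 32%)
    d: absolute precision in percentage points
    z: standard normal value for the confidence level (1.96 for 95%)
    """
    if not 0 < p < 100:
        raise ValueError("p must lie strictly between 0 and 100")
    if d <= 0:
        raise ValueError("d must be positive")
    q = 100 - p
    return math.ceil(z ** 2 * p * q / d ** 2)


# Prevalence of 32% with a hypothetical absolute precision of 5.
print(sample_size(32, 5))       # using the exact Za of 1.96
print(sample_size(32, 5, z=2))  # using Za rounded to 2, i.e. N = 4pq/d^2
```

Rounding Za up to 2 always gives a slightly larger (more conservative) sample size than 1.96.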

Example:

I wish to conduct a cross-sectional study on awareness of Hepatitis B among school children. A literature search reveals that other investigators have reported knowledge to range from 5% to 20% among students of grades 6 through 8. What should the size of my sample be?

The formula requires us to input the value of d (precision). If the absolute precision is known, there is no problem. However, often we can only input a relative precision. Where do we get the value of relative precision from?

Typically, relative precision is taken as a proportion of ‘p’. The maximum permissible limit is 20% of ‘p’.

In the above example, if ‘p’ is 20%, then ‘d’ will be (20/100)*20= 0.2*20= 4 {Taking a relative precision of 20%}.

This means that the estimate is expected to fall within ‘d’ on either side of ‘p’ {+/- 4%: 16% to 24%}.

That is, by taking a relative precision of 20% of ‘p’, the study will be able to estimate the true awareness level to within 4 percentage points (with 95% confidence) if the actual prevalence is close to 20%. If the actual prevalence is much lower than this, however, the same sample size will not estimate it with the same relative precision.

Therefore, the larger the value of ‘p’ (prevalence), the larger the value of ‘d’ (absolute precision), keeping the relative precision fixed (say, at 20% of ‘p’). If the prevalence is 50%, ‘d’ (20% of ‘p’) would then be 0.2*50= 10 (as compared to ‘d’ = 4 when ‘p’ = 20%).

The reverse is also true: the smaller the value of ‘p’, the smaller the value of ‘d’. A smaller ‘d’ implies a larger sample size. Therefore, the choice of ‘p’ is crucial.

We can now input the values in the formula to obtain the sample size:

For the calculation we will take ‘d’ as 4. This yields:

N= (4*20*80)/ (4*4) [‘p’= 20, ‘q’= 80, ‘d’= 4]

= 400

This sample size will enable us to estimate the prevalence to within 4 percentage points (16% to 24%) if the true prevalence is about 20%.

If we took ‘p’= 5, then the sample size would be:

N= (4*5*95)/(1*1) [‘d’= 0.2*5= 1]

= 1900

This sample size will enable us to estimate the prevalence to within 1 percentage point (4% to 6%) if the true prevalence is about 5%.
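The two calculations above can be checked with a short Python sketch. The helper name `n_for` is hypothetical; it takes ‘d’ as 20% of ‘p’ as in the text, and also shows what the unrounded Za of 1.96 would give for comparison.

```python
import math


def n_for(p, relative_precision=0.20, z=2.0):
    """Sample size when d is taken as a fraction of p (p in percent)."""
    d = relative_precision * p  # absolute precision in percentage points
    q = 100 - p
    # round() guards against tiny floating-point excess before rounding up
    return math.ceil(round(z ** 2 * p * q / d ** 2, 6))


for p in (20, 5):
    print(f"p = {p}%: N = {n_for(p)} with Za = 2, N = {n_for(p, z=1.96)} with Za = 1.96")
```

With Za = 2 this reproduces 400 and 1900; with Za = 1.96 the figures are somewhat smaller.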

So should I take ‘p’= 20% or ‘p’=5%?

That depends upon:

1. The location of the original study: if you are planning to conduct the study in an urban area, use the prevalence reported by studies conducted in urban areas, and likewise for rural areas.

2. The available resources (time, manpower, money, etc.). Aim for the largest feasible sample size. The size should be adequate to yield 80% power. Do not unnecessarily increase the sample size unless the intention is to obtain greater power. If so, please mention the same in the methodology section.

3. The results of your pilot study. If you have conducted a pilot study, the prevalence obtained from that study should be taken as ‘p’. This will be much more accurate than any other external value.

Note 1: If you have multiple objectives, you must calculate the required sample size for each objective, then choose the largest sample size thus obtained. This will ensure adequate power for all objectives, else the study will lack power for one or more objectives. That is, you may not be able to detect a significant result where it actually exists because you failed to include enough subjects to detect it.
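As a rough Python sketch of Note 1, one could compute the sample size for each objective and keep the largest. The objective names and prevalence values below are hypothetical placeholders, not figures from any study; ‘d’ is assumed to be 20% of ‘p’ and Za is rounded to 2, as in the worked example.

```python
import math


def sample_size(p, d, z=2.0):
    """N = z^2 * p * q / d^2 with p and d in percentage terms, rounded up."""
    q = 100 - p
    return math.ceil(round(z ** 2 * p * q / d ** 2, 6))


# Hypothetical objectives with assumed prevalences (percent).
objectives = {"objective A": 20, "objective B": 35, "objective C": 50}

# One sample size per objective, taking d as 20% of each p.
sizes = {name: sample_size(p, d=0.2 * p) for name, p in objectives.items()}
required = max(sizes.values())

for name, n in sizes.items():
    print(f"{name}: N = {n}")
print(f"Sample size for the study: {required}")
```

The objective with the lowest prevalence drives the final figure here, because a smaller ‘p’ gives a smaller ‘d’.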

Note 2: It is advisable to mention a range rather than a single value for sample size. This is standard practice in the west, but not in India. A range may be obtained by calculating the sample size for different values of ‘p’.
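One way to obtain such a range is sketched below in Python, using the 5% to 20% awareness range from the example. The intermediate values of ‘p’ and the variable names are assumptions made for illustration; ‘d’ is again taken as 20% of ‘p’ with Za rounded to 2.

```python
import math


def sample_size(p, d, z=2.0):
    """N = z^2 * p * q / d^2 with p and d in percentage terms, rounded up."""
    q = 100 - p
    return math.ceil(round(z ** 2 * p * q / d ** 2, 6))


# Candidate prevalences (percent) spanning the range reported in the literature.
candidate_p = [5, 10, 15, 20]
table = {p: sample_size(p, d=0.2 * p) for p in candidate_p}

for p, n in table.items():
    print(f"p = {p}%: N = {n}")
print(f"Sample size range: {min(table.values())} to {max(table.values())}")
```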

282 thoughts on “Sample size calculation: Cross-sectional studies”

Tonia March 8, 2023 at 4:47 PM

Dear Dr Roopesh, please I am conducting a cross sectional study on assessment of biomedical waste management and disposal practices among selected hospitals in Port Harcourt Nigeria. I am looking for a formula to use in calculating my sample size. Thanks

1. drroopesh (Post author) March 10, 2023 at 6:47 PM
  
  Dear Tonia,
  
  The formula for cross-sectional studies is the same as that mentioned in the article.
  You will have to substitute the values of p and q, and decide on the desired relative precision, to compute the sample size.
  
  Regards,
  Dr. Roopesh
  
  1. Tonia March 19, 2023 at 6:34 PM
    
    Thanks Dr, I used 30% as my prevalence hope it is good?
    
    1. drroopesh (Post author) March 25, 2023 at 6:27 AM
      
      Dear Tonia,
      
      30% is okay if you have literature citing that prevalence.
      
      Regards,
      Dr. Roopesh
      
  2. Anonymous October 11, 2023 at 3:32 PM
    
    Dear Dr Roopesh ,
    How does one calculate a sample size for an unknown prevalence of a condition, especially if it is a pilot study?
    
    1. drroopesh (Post author) October 13, 2023 at 6:31 AM
      
      Dear Anonymous,
      
      For a pilot study one typically surveys up to 20 or 30 individuals. Their responses are not included in the main study later, though.
      Where the prevalence is unknown, one enquires with local practitioners and the general public to guesstimate the prevalence, in addition to reviewing the literature for clues about the same. One would often be able to obtain a possible range of values (from x% to y%, for instance). Next, one estimates sample sizes for the lower value and the higher value to determine the feasibility of conducting a study with the estimated values (it is desirable to use the higher sample size estimate), and finalizes the sample size.
      
      I hope this helps.
      Regards,
      
      Dr. Roopesh
      
Tonia March 13, 2023 at 1:58 AM

Thanks Dr, please what is the formula’s or author’s names?

ekikere marcel April 25, 2023 at 10:10 PM

I am doing a cross sectional study checking serum endothelin1 levels in heart failure patients and its correlates , comparing characteristics of patients with elevated levels with those with normal levels of endothelin1, please can i use this formula to calculate sample size?

1. drroopesh (Post author) April 28, 2023 at 7:03 PM
  
  Dear Ekikere Marcel,
  
  If your study is cross-sectional, then you can use the formula mentioned in the article.
  
  Regards,
  Dr. Roopesh
  
Tabe Glorias June 16, 2023 at 1:08 PM

Hi Dr, I’m doing a research project titled the effects of worker’s incentives on employee performance in higher institutions in Buea, Cameroon and I’m using the cross sectional sampling technique. I’m confused about how to calculate my sample size from a population of 120 people.

1. drroopesh (Post author) June 16, 2023 at 3:32 PM
  
  Dear Tabe Glorias,
  
  What type of study are you planning to conduct (qualitative or quantitative)? A qualitative study may be more appropriate from what you have written. Alternatively, you could simply analyze some metric(s) of employee performance using routinely collected data.
  
  Do let me know.
  Regards,
  Dr. Roopesh
  
Hassan June 18, 2023 at 12:55 AM

Dear Dr. Roopesh

I am planning to conduct a cross sectional study of cardiovascular health behaviors and associated factors among coronary artery disease patients. However, there are no prior studies assessing the prevalence of CAD or these variables in my country, despite a very thorough literature review and approaching different health sectors.
In this case, how can I calculate my sample size? Can I use the prevalence in other countries in the same region?

1. drroopesh (Post author) June 24, 2023 at 6:13 AM
  
  Dear Hassan,
  
  Yes, you can definitely use the prevalence of a place with similar population and social profile.
  
  Dr. Roopesh
  
Dr Ginsau July 2, 2023 at 1:58 AM

Dear Dr Roopesh

I will be conducting a cross-sectional study involving three hospitals in a metropolis. My total sample size is 160 for the whole metropolis using above formula. Is there a formula I can use to calculate the sample size for each hospital?
Thank you.

1. drroopesh (Post author) July 7, 2023 at 11:43 AM
  
  Dear Dr. Ginsau,
  
  You could use the formula provided to calculate overall sample size, then use stratified random sampling (with each hospital constituting one stratum) to determine the proportion of the overall sample size that must be obtained from individual hospitals. Within each stratum you will have to apply an appropriate sampling method to obtain the required sub-sample.
  
  I hope this helps.
  Regards,
  Dr. Roopesh
  
Marissa July 10, 2023 at 10:49 PM

Hi Dr. Roopesh,

I am currently conducting a superiority trial evaluating the effect 4 drugs have on improving hemoglobin levels in anemic patients, to determine which one is best. I was wondering how to go about conducting a sample size calculation for this. Particularly, what prevalence should I be searching the literature for? Should it be the prevalence of anemia?

Best,
Shrey

1. drroopesh (Post author) July 14, 2023 at 11:37 AM
  
  Dear Marissa,
  
  The requirements of a clinical trial with four arms are very different from one with two arms. As analysis will be complex, it is best to consult a statistician experienced with such designs beforehand. The effect size will be needed to estimate sample size here. Therefore, you should search the literature for trials that will help determine effect size (the magnitude of differences between drugs). You will also have to set the superiority margin a priori.
  
  I hope this helps.
  
  Regards,
  Dr. Roopesh
  
Sophie Carrard July 14, 2023 at 12:26 PM

Dear Dr. Roopesh,

I am planning to carry out a cross sectional survey about the use of technologies in a specific population. Which parameters should I take? Thanks for your answer and kind regards, Sophie

1. drroopesh (Post author) July 15, 2023 at 9:24 AM
  
  Dear Sophie,
  
  It depends on the research gap, study population, and your research question (which in turn will influence the objectives). If the study population is students, then factors influencing academic performance may be important. The use of technology among elderly for specific needs may require inclusion of different parameters. Technology aided healthcare service delivery may warrant inclusion of other parameters. In essence, the choice of parameters is dictated by their influence on the outcome of interest.
  
  I hope this helps.
  Regards,
  Dr. Roopesh
  
Lelisa August 8, 2023 at 1:04 AM

Dear Dr. Roopesh, I’m studying the five-year prevalence and associated factors of HBV among pregnant women in Kelem Walaga Zone (Oromia), Ethiopia. How can I determine my sample size? Which value of P is permissible for this study?

1. drroopesh (Post author) August 10, 2023 at 1:57 PM
  
  Dear Lelisa,
  
  Any value of P is permissible as long as there is adequate justification for the same. The value could be obtained from a pilot study, prior research, etc.
  Please also read the following:
  https://bmcmedresmethodol.biomedcentral.com/articles/10.1186/s12874-022-01694-7
  I hope the above helps.
  
  Regards,
  Dr. Roopesh
  
Mak August 19, 2023 at 3:35 AM

Dear Dr. Roopesh,
I am conducting a cross-sectional study on the prevalence of patient, diagnostic, and treatment delay and associated factors among breast cancer patients. How can I calculate the sample size, and which of the 3 p values should I take to calculate it? As for data analysis, can it be done with logistic regression?

1. drroopesh (Post author) August 25, 2023 at 6:47 AM
  
  Dear Mak,
  
  Calculate sample size using each of the three prevalence values in turn, then choose the largest sample size as the sample size for the study. This way the study will be adequately powered for each of your three objectives.
  
  Logistic regression is used when you are dealing with a single categorical outcome variable and want to investigate the influence of one or more independent variables on the outcome variable. If your variables of interest fit this requirement, you can definitely perform logistic regression. Before performing logistic regression, however, it is important to perform univariate (counts and frequencies/ descriptive statistics) and bivariate analyses (t-test, chi-square test, etc.) to understand the data and discover patterns/relationships.
  
  Hope this helps.
  Regards,
  Dr. Roopesh
  
2. Anonymous September 11, 2023 at 7:09 PM
  
  Hi Dr Roopesh,
  I am conducting a study to ascertain which screening method is more acceptable: Breast Self Examination or Clinical Breast Examination. How do I incorporate a 10% difference between the groups in my sample size calculation? Also, do I have to calculate a different sample size for BSE and CBE?
  
  1. drroopesh (Post author) September 15, 2023 at 8:21 AM
    
    Dear Anonymous,
    
    You may want to use the sample size calculator with Epi Info for this- it allows users to input prevalence for both comparison groups.
    Although separate sample size calculation is not required for BSE and CBE (the difference is accounted for in the calculation as mentioned above), you must perform separate sample size calculation for each objective and choose the largest feasible sample size estimate. This will ensure there is adequate power for each objective.
    
    I hope this helps.
    Regards,
    
    Dr. Roopesh
    
Frank Gondwe September 20, 2023 at 4:01 AM

Hi Dr Roopesh,
I am doing a study to determine the proportion of pregnant mothers who received TTV in a particular location. The national proportion of pregnant mothers who received TTV is 23% and the national prevalence of pregnant women is 12%. Which of the two should I use to determine the sample size, and which formula should I use?

1. drroopesh (Post author) September 22, 2023 at 8:24 PM
  
  Dear Frank,
  
  You must use the prevalence of TTV in pregnant women to estimate sample size.
  
  Regards,
  Dr. Roopesh
  
Anonymous February 10, 2024 at 12:04 AM

Hi Dr,
I want to carry out a validation study of cervical cancer biomarkers in urine samples of patients, with healthy volunteers as controls, in my community and from a teaching hospital. The disease has a prevalence of 13.6 as reported. Please, how do I calculate my sample size? Thank you in advance.

1. drroopesh (Post author) February 10, 2024 at 5:44 AM
  
  Dear Anonymous,
  
  If you are planning to conduct a validation study using a Case-Control study design, I would recommend using the sample size formula for case-control studies instead of cross-sectional studies. Alternatively, you could use software like G*Power to estimate sample size based on the main statistical analysis you intend to perform. However, you will need to supply the effect size (the [anticipated] magnitude of difference between the two groups).
  
  Regards,
  Dr. Roopesh
  
Anonymous March 15, 2024 at 5:10 PM

Hi Dr. Roopesh,

I am planning to conduct a cross sectional survey study to evaluate women’s response to the dense breast notification in terms of (1) awareness of their breast density, (2) attained knowledge on breast density, (3) cancer worry, and (4) how each of breast density awareness, attained knowledge and cancer worry would impact on women’s intentions to be screened. How do I calculate the sample size needed?
