How to calculate Sample Size with Epi Info 7: Cross-Sectional studies

I receive many queries about sample size calculation. Recently, someone asked a question that involved calculating sample size with Epi Info 7. I believe many more readers would benefit from a public response, hence this article.

Background Information:

Epi Info™ is a public-domain suite of software tools developed by the United States’ Centers for Disease Control and Prevention (CDC) for use by public health professionals and researchers. The latest version of Epi Info is Epi Info 7.

It provides for easy data entry form and database construction, a customized data entry experience, and data analyses with epidemiologic statistics, maps, and graphs for public health professionals who may lack an information technology background. It also includes a tool for sample size calculation.

Tutorial

A. Getting and installing Epi Info 7

Epi Info™ can be downloaded from here.

The CDC has produced several tutorial videos for Epi Info 7 that can be viewed here.

If you want to view the video providing instructions on downloading the software, you may do so here.

B. Sample size calculation 

B.1. Background information

The investigator wishes to determine the prevalence of Non-Alcoholic Fatty Liver Disease (NAFLD) among persons with Coronary Artery Disease (CAD).

Therefore, the study population is patients with CAD.

Since one intends to determine the prevalence of NAFLD among CAD patients, the outcome of interest is NAFLD.

Similarly, the exposure is CAD (those who have CAD are ‘exposed’).

By extension, those without CAD (general population) would constitute the ‘unexposed’ group.

As the investigator wishes to determine the prevalence of NAFLD, the appropriate study design is a cross-sectional study (cross-sectional studies are also called ‘prevalence’ studies).

B.2. Launching Cross-Sectional Study within the StatCalc tool of Epi Info 7

Step 1: Launch Epi Info 7 (please watch video above on downloading and installing Epi Info 7)

Step 2: Select StatCalc from the menu of options (shown in red)

1. Epi Info 7 main window showing the StatCalc tool

Step 3: Select Cross Sectional Study from the options (shown in red)

2. Epi Info 7 StatCalc window with cross sectional study selected

B.3. Requirements for calculating sample size using Epi Info 7 (cross-sectional studies)

To calculate sample size using Epi Info 7, one needs to provide the following information (shown in red):

3. StatCalc window for Unmatched Cohort and Cross Sectional Studies showing details that need to be supplied

Confidence level: usually set at 95%

Power: usually set at 80%

Ratio of unexposed to exposed: depends upon the outcome of interest and study population

% outcome in unexposed group: the proportion of unexposed people with the outcome of interest (in this case, the proportion of general population with NAFLD)

% outcome in exposed group: the proportion of exposed people with the outcome of interest (in this case, the proportion of CAD patients with NAFLD)

The values of Odds Ratio and Risk ratio will be populated automatically based on the other values supplied.
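To illustrate how these two derived fields relate to the percentages supplied, the following minimal sketch computes a risk ratio and an odds ratio from the two outcome percentages using the standard textbook definitions. The function name `ratios_from_percentages` is hypothetical, and this is not Epi Info's own code; the inputs (9% and 69.2%) are the values used later in this tutorial.

```python
def ratios_from_percentages(pct_exposed, pct_unexposed):
    """Return (risk_ratio, odds_ratio) from the % outcome in each group.

    Hypothetical helper using the textbook definitions; it only
    illustrates how the two fields depend on the percentages entered.
    """
    p1 = pct_exposed / 100.0    # proportion with outcome among exposed
    p0 = pct_unexposed / 100.0  # proportion with outcome among unexposed
    if not (0.0 < p0 < 1.0 and 0.0 < p1 < 1.0):
        raise ValueError("percentages must be strictly between 0 and 100")
    risk_ratio = p1 / p0
    odds_ratio = (p1 / (1.0 - p1)) / (p0 / (1.0 - p0))
    return risk_ratio, odds_ratio


rr, odds = ratios_from_percentages(pct_exposed=69.2, pct_unexposed=9.0)
print(f"Risk ratio: {rr:.2f}, odds ratio: {odds:.2f}")
```

Because both ratios depend only on the two percentages, changing either percentage changes both fields at once.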

B.4. Obtaining the values for sample size calculation

Although we already know what values to supply for confidence level and power, other values are unknown. These need to be determined from literature.

In this case, we need to determine two values:

  1. the proportion of NAFLD in the general population
  2. the proportion of NAFLD among patients with CAD

The proportion of NAFLD in the general population is reported to be between 5% and 30%.

(Source shown: Indian Heart Journal, 2014, NAFLD in the general population)

The proportion of NAFLD among CAD patients is reported to be between 69.2% and 80.4%.

NAFLD in CAD article Choi et al

The above study is from South Korea, so we must try to obtain literature from India for better estimation.

NAFLD in CAD article Aligarh

B.5. Performing the sample size calculation

Having obtained all the information required, we can now proceed with sample size estimation.

Step 1: Selecting the desired confidence level.

The default value is 99.9%, but this may inflate the estimate. Therefore, we click on the drop-down menu and choose 95% instead.

4. StatCalc CS Studies Confidence Interval level selection

Step 2: Supplying the desired power.

Typically, the power is kept at 80%. Increasing the value will increase the sample size.

5. StatCalc CS Studies Power value

Step 3: Supplying the ratio of unexposed to exposed individuals.

Here, one must provide a single value, not a ratio (1, not 1:1). If the number of unexposed will be less than the number of exposed, the value will be less than one. In the present example, the ratio is approximately 30:70 (the study population will consist of CAD patients; among them, those without NAFLD would be around 30%, while those with NAFLD would be around 70%). Performing the calculation (30/70), one obtains approximately 0.43, which is rounded to 0.4 and supplied in the appropriate cell.

6. StatCalc CS Studies Ratio Unexposed to Exposed value 0.4

Step 4: Supplying the percentage of outcome among unexposed. 

We already know from literature that this value lies between 5% and 30% (NAFLD in general population). Since 5% is rather low, we will use 9% instead.

7. StatCalc CS Studies Percent outcome in unexposed group value

Once this value has been entered, the remaining cells are automatically populated. The default value for percentage outcome in exposed group is 0%, and the values in the grid reflect sample size estimates based on that default. Since the actual percentage outcome in the exposed group is not zero, we will ignore the output for now.

Step 5: Supplying the percentage outcome in exposed group

We enter 69.2%, the lower of the values reported for NAFLD among CAD patients.

8. StatCalc CS Studies Percent outcome in Exposed group

After supplying the value for percentage outcome in exposed group, we can now examine the output in the adjoining grid.

The first column provides estimates based on the approach described by Kelsey et al. According to this approach, the total sample size required is 27 subjects.

The second column provides estimates based on the approach described by Fleiss et al. They described two approaches: one without continuity correction, and another with continuity correction. The second column provides estimates without continuity correction. Here, the total sample size is 23 subjects.

The third column provides estimates with continuity correction, and gives a total of 30 subjects for the study.

Details of the approaches may be found here.
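For readers who want to check the three columns by hand, here is a minimal sketch of the Kelsey and Fleiss formulas as they are commonly documented for unmatched cohort and cross-sectional designs. It assumes StatCalc uses these same formulas and rounds each group size up separately; the function name `sample_sizes` and its parameters are hypothetical.

```python
import math
from statistics import NormalDist


def sample_sizes(conf_level, power, ratio, pct_unexposed, pct_exposed):
    """Total sample size (exposed + unexposed) by three approaches.

    Assumption: formulas as commonly documented for Kelsey and Fleiss;
    this is a sketch, not Epi Info's source code.
    """
    z_alpha = NormalDist().inv_cdf(1 - (1 - conf_level) / 2)  # two-sided
    z_beta = NormalDist().inv_cdf(power)
    p1 = pct_exposed / 100.0    # outcome proportion in exposed
    p2 = pct_unexposed / 100.0  # outcome proportion in unexposed
    r = ratio                   # unexposed : exposed, as a single number
    diff = abs(p1 - p2)
    p_bar = (p1 + r * p2) / (r + 1)  # weighted average proportion
    q_bar = 1 - p_bar

    # Exposed-group size under each approach (before rounding).
    n_kelsey = (z_alpha + z_beta) ** 2 * p_bar * q_bar * (r + 1) / (r * diff ** 2)
    n_fleiss = (
        z_alpha * math.sqrt((r + 1) * p_bar * q_bar)
        + z_beta * math.sqrt(r * p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2 / (r * diff ** 2)
    n_fleiss_cc = n_fleiss / 4 * (
        1 + math.sqrt(1 + 2 * (r + 1) / (n_fleiss * r * diff))
    ) ** 2

    def total(n_exposed):
        # Round each group up separately, then add them.
        return math.ceil(n_exposed) + math.ceil(r * n_exposed)

    return {
        "kelsey": total(n_kelsey),
        "fleiss": total(n_fleiss),
        "fleiss_cc": total(n_fleiss_cc),
    }


result = sample_sizes(conf_level=0.95, power=0.80, ratio=0.4,
                      pct_unexposed=9.0, pct_exposed=69.2)
print(result)
```

With the inputs used in this tutorial, the sketch gives totals of 27, 23 and 30, in line with the three columns described above.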

Note: With the chosen inputs, these are the smallest estimates obtained in this tutorial; the adjustments below all increase them.

Step 6: Refining the estimate

This requires one to manipulate the values to obtain a reasonable estimate. While we will not alter the values obtained from literature, we can increase the others.

First, we will sequentially increase the value for confidence level, and see how that alters the estimate:

9. StatCalc CS Studies 99 Percent Confidence level

When the confidence level is increased from 95% to 99%, the estimate increases to a maximum of 42 subjects (above).

10. StatCalc CS Studies 99.9 Percent Confidence level

When the confidence level is increased to 99.9%, the maximum estimate is 59 (above).

11. StatCalc CS Studies 99.99 Percent Confidence level

With the confidence level at 99.99%, the maximum estimate touches 76 subjects (above).
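The effect of the confidence level can also be sketched in code. The example below assumes that the "maximum estimate" quoted for each screenshot is the Fleiss estimate with continuity correction, and that StatCalc uses the commonly documented formula; the helper name `fleiss_cc_total` is hypothetical.

```python
import math
from statistics import NormalDist


def fleiss_cc_total(conf_level, power, ratio, pct_unexposed, pct_exposed):
    """Total sample size, Fleiss with continuity correction (a sketch)."""
    z_alpha = NormalDist().inv_cdf(1 - (1 - conf_level) / 2)  # two-sided
    z_beta = NormalDist().inv_cdf(power)
    p1, p2, r = pct_exposed / 100.0, pct_unexposed / 100.0, ratio
    diff = abs(p1 - p2)
    p_bar = (p1 + r * p2) / (r + 1)
    # Exposed-group size without continuity correction.
    n = (
        z_alpha * math.sqrt((r + 1) * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(r * p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2 / (r * diff ** 2)
    # Apply the continuity correction to the exposed-group size.
    n_cc = n / 4 * (1 + math.sqrt(1 + 2 * (r + 1) / (n * r * diff))) ** 2
    # Round each group up separately, then add them.
    return math.ceil(n_cc) + math.ceil(r * n_cc)


levels = [0.95, 0.99, 0.999, 0.9999]
totals = [fleiss_cc_total(cl, 0.80, 0.4, 9.0, 69.2) for cl in levels]
for cl, total in zip(levels, totals):
    print(f"{cl:.2%} confidence -> {total} subjects")
```

Under these assumptions the totals are 30, 42, 59 and 76, which agrees with the maximum estimates reported above for each confidence level.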

What would happen if the ratio of unexposed to exposed is reduced further (even fewer unexposed compared to exposed)?

12. StatCalc CS Studies Ratio Unexposed to Exposed value 0.3

The maximum estimate now touches 86 subjects (above).

Is this the maximum possible sample size estimate for the study? Perhaps not.

Remember, the prevalence of NAFLD in the general population ranges from 5% to 30%. So far we have used a value near the lower end of this range (9%), but not the upper end. Let us see what happens when the higher value (30%) is supplied.

First, we will set the confidence level to 95% and the percentage of outcome in unexposed to 30%.

13. StatCalc CS Studies 30 Percent outcome in Unexposed group

As can be seen above, the maximum estimate is 72 subjects.

Let us increase the confidence level value to 99%, keeping everything else the same.

14. StatCalc CS Studies 99 Percent CI 30 Percent outcome in unexposed group

Now the maximum estimate is 101 (above).

What if the confidence level value were to be increased further? Let us see what happens when the value is increased to 99.9%.

15. StatCalc CS Studies 99.9 Percent CI 30 Percent outcome in Unexposed group

As can be seen above, all estimates are in excess of 130, with the maximum being 142.

If the confidence level were increased to 99.99%, the sample size would increase correspondingly.

16. StatCalc CS Studies 99.99 Percent CI 30 Percent outcome in Unexposed group

If the value of power were increased from 80% to 90%, the estimated sample size would increase further.

What is the maximum possible sample size with the available data?

17. StatCalc CS Studies 99.99 Percent CI 99.99 Percent Power 30 Percent outcome in Unexposed

As can be seen above, the estimates are now around 500 subjects.

The final sample size chosen should be the largest feasible value, considering available resources (time, materials, manpower, money).

Key messages:

The estimation of sample size is informed by existing literature. A thorough review of literature should be performed before determining the values for use in calculation.

Increasing the confidence level increases the estimated sample size.

Increasing power will increase the sample size.

If the difference in values of percentage outcome in exposed and unexposed is small, the sample size will increase, and vice versa (lower sample size estimate when the values were 9% and 69.2%, as compared to 30% and 69.2%).

Larger sample sizes are preferred as they have greater power to detect a difference when it exists. 
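The key messages about power and about the difference between the two percentages can be checked with a short sketch. It uses the commonly documented Fleiss formula with continuity correction, on the assumption that StatCalc's third column follows it; the helper name `fleiss_cc_total` is hypothetical.

```python
import math
from statistics import NormalDist


def fleiss_cc_total(conf_level, power, ratio, pct_unexposed, pct_exposed):
    """Total sample size, Fleiss with continuity correction (a sketch)."""
    z_alpha = NormalDist().inv_cdf(1 - (1 - conf_level) / 2)  # two-sided
    z_beta = NormalDist().inv_cdf(power)
    p1, p2, r = pct_exposed / 100.0, pct_unexposed / 100.0, ratio
    diff = abs(p1 - p2)
    p_bar = (p1 + r * p2) / (r + 1)
    n = (
        z_alpha * math.sqrt((r + 1) * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(r * p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2 / (r * diff ** 2)
    n_cc = n / 4 * (1 + math.sqrt(1 + 2 * (r + 1) / (n * r * diff))) ** 2
    return math.ceil(n_cc) + math.ceil(r * n_cc)


# A smaller difference between the two percentages needs more subjects.
wide_gap = fleiss_cc_total(0.95, 0.80, 0.4, pct_unexposed=9.0, pct_exposed=69.2)
narrow_gap = fleiss_cc_total(0.95, 0.80, 0.4, pct_unexposed=30.0, pct_exposed=69.2)

# Higher power needs more subjects, everything else unchanged.
power_90 = fleiss_cc_total(0.95, 0.90, 0.4, pct_unexposed=9.0, pct_exposed=69.2)

print(f"9% vs 69.2%:  {wide_gap} subjects")
print(f"30% vs 69.2%: {narrow_gap} subjects")
print(f"9% vs 69.2% at 90% power: {power_90} subjects")
```

At 95% confidence and 80% power, the sketch gives 30 subjects for 9% versus 69.2% and 72 subjects for 30% versus 69.2%, the same maximum estimates seen in the tutorial.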

Useful Links:

Link to Epi Info user guide:

https://www.cdc.gov/epiinfo/support/userguide.html

Link to previous article on sample size calculation for cross sectional studies:

https://communitymedicine4asses.com/2014/05/11/sample-size-calculation-cross-sectional-studies/

 


13 thoughts on “How to calculate Sample Size with Epi Info 7: Cross-Sectional studies”

  1. Mary Ogunleye

    I cannot fathom how to calculate my sample size of pig farms in Lagos State. I am working on a comparative study on biosecurity and its effect, based on ASF in pigs, with the prevalence of ASF in Lagos reported as 13% in a previous study. How do I calculate the sample size of farms to which questionnaires will be administered? I do not have information on exposed and unexposed groups.

  2. Pingback: Sample Size Calculation: The Essentials (Part 1) | communitymedicine4all

  3. Pingback: Sample Size Calculation: The Essentials (Part 2) | communitymedicine4all

  4. beyene

    What is the problem if we leave the outcome in exposed blank, since the application calculates it automatically when we enter the outcome in unexposed?

    1. drroopesh Post author

      Dear Beyene,

      The application estimates the outcome in exposed and supplies a value. However, the observed proportion of unexposed developing outcome may be different from what the application estimates. In such situations it is useful to enter the value ourselves. If the estimated value corresponds with the observed value, it should be fine not to supply a value.

      Regards,
      Dr. Roopesh

      1. Beyene

        Dear Dr Roopesh
        First of all, thank you for your prompt and polite response. I am confused by the calculation in this scenario: when I enter the outcome in unexposed and the other required values without entering the outcome in exposed, I get a sample size I would like to proceed with, but if I enter the outcome in exposed, the sample size is extremely small. What should I do?

  5. Hadi

    Hi, Dr Roopesh. Can you help me to calculate sample size for our research?
    Title of our research is Knowledge, Perception and Preventive Practices of colorectal cancer and their associated factors among adults in our area of study.
    Given the total population is 235 adults. How do we calculate the sample size?
    Previously, we used the prevalence frequency via OpenEpi. But our supervisor advised us to use odds ratio or risk factors instead.
