Philip Morris

Quantitative Evaluation of Multiplicity in Epidemiology and Public Health Research

Date: 19980401/P
Length: 5 pages
2063633673-2063633677

Fields

Author
Ottenbacher, K.J.
Type
PSCI, PUBLICATION SCIENTIFIC
BIBL, BIBLIOGRAPHY
FOOT, FOOTNOTES
Area
CARCHMAN,RICHARD/OFFICE
Litigation
Iwoh/Produced
Characteristic
EXTR, EXTRA
MARG, MARGINALIA
Site
R530
Named Organization
Univ of Tx
Society for Epidemiology Research
Bureau of Maternal + Child Health
Hhs, Dept of Health and Human Services
Mcj
Author (Organization)
Am J Epidemiol
American Journal of Epidemiology
Johns Hopkins Univ
Univ of Tx
Named Person
Ottenbacher, K.J.
Master ID
2063633486/4072
Related Documents:
Date Loaded
07 Jun 1999

Page 1: 2063633673
Volume 147, Number 7, April 1, 1998

American Journal of Epidemiology
Copyright © 1998 by The Johns Hopkins University School of Hygiene and Public Health
Sponsored by the Society for Epidemiologic Research

ORIGINAL CONTRIBUTIONS

A BRIEF ORIGINAL CONTRIBUTION

Quantitative Evaluation of Multiplicity in Epidemiology and Public Health Research

Kenneth J. Ottenbacher

Epidemiologic and public health researchers frequently include several dependent variables, repeated assessments, or subgroup analyses in their investigations. These factors result in multiple tests of statistical significance and may produce type 1 experimental errors. This study examined the type 1 error rate in a sample of public health and epidemiologic research. A total of 173 articles chosen at random from 1996 issues of the American Journal of Public Health and the American Journal of Epidemiology were examined to determine the incidence of type 1 errors. Three different methods of computing type 1 error rates were used: experiment-wise error rate, error rate per experiment, and percent error rate. The results indicate a type 1 error rate substantially higher than the traditionally assumed level of 5% (p < 0.05). No practical or statistically significant difference was found between type 1 error rates across the two journals. Methods to determine and correct type 1 errors should be reported in epidemiologic and public health research investigations that include multiple statistical tests. Am J Epidemiol 1998;147:615-19.

bias (epidemiology); probability; research design; significance tests

Levin noted recently, "Multiple comparisons are a very common feature--and, indeed, very often a necessity--in epidemiologic and public health research" (1, p. 628).
He went on to discuss various procedures used to protect against type 1 errors, including the commonly used Bonferroni method and a procedure developed by Holm (2). Aickin and Gensler (3) argue that the Holm-adjusted p value should be routinely used to reduce the type 1 error rate in studies involving multiple statistical tests.

Problems involving multiple statistical testing of hypotheses in health care and medical research arise for the following reasons: 1) the repeated analysis of accumulating data; 2) the use of multiple dependent measures; and 3) the analysis of data from subgroups (4). All three of these practices are common in public health and epidemiologic research. For example, Godfrey (5) demonstrated that researchers frequently present and analyze means from several groups within the same study. She found that the most common method of statistically comparing several means involved the use of multiple t tests. Godfrey correctly argued that the use of univariate statistical procedures to analyze the results of studies containing multiple contrasts was inappropriate. Her analysis revealed that of 50 articles examined from the New England Journal of Medicine, a majority (54 percent) used improper univariate statistical procedures to analyze differences between subgroup means.

Received for publication February 10, 1997, and accepted for publication October 10, 1997.
Abbreviations: EP, error rate per experiment; EW, experiment-wise error rate; PE, percent error rate.
From the University of Texas Medical Branch at Galveston, Galveston, TX.
Reprint requests to Dr. Kenneth J. Ottenbacher, SAHS, Rm. 4.202, University of Texas Medical Branch, 301 University Blvd., Galveston, TX 77755-1028.
Page 2: 2063633674
The use of several dependent variables in the analysis of data from a single sample also results in multiple statistical tests being reported. The complex nature of epidemiologic and public health research has led investigators to routinely include multiple dependent variables in their investigations (6). An epidemiologic researcher may be interested in the effect of a particular intervention on dependent variables such as weight, blood pressure, hematocrit, and serum cholesterol values in a sample of patients. As the number of dependent variables increases, so does the number of statistical tests. When this occurs, the researcher may obtain positive results on the basis of sampling error (7). Numerous clinical researchers have suggested that multiple hypothesis testing without adjusting for inflated type 1 error rates is a common problem in medical and public health research (8-10).

The purposes of this investigation were: 1) to examine the extent of the multiple testing in epidemiologic and public health research, and 2) to determine the prevalence of type 1 errors in a sample of published research.

METHODS

Five issues of both the American Journal of Public Health and the American Journal of Epidemiology were randomly selected from the journal issues published in 1996. Each individual article was examined to determine the experiment-wise error rate, the error rate per experiment, and the percent error rate (see descriptions of error rates below). All articles that reported tests of statistical significance were included in the investigation. Articles that summarized the results of previously published research and articles that did not report statistical significance tests were not included in the analysis.

Experiment-wise error

The overall experiment-wise error rate (EW) is the probability of making at least one type 1 error for the collection of tests performed in the investigation.
The experiment-wise error rate can never be smaller than the error rate per comparison. The relation of per-comparison and experiment-wise error rates depends on the degree of statistical dependence of the tests. For totally independent tests, the experiment-wise error rate is equal to $1 - (1 - \alpha)^c$, where $c$ is the number of independent tests and $\alpha$ is the error rate per test (traditionally 0.05 or 0.01). From this equation, it is apparent that the experiment-wise error rate increases rapidly with the number of hypotheses statistically examined. For example, in a study for which five statistical tests are conducted at the 0.05 level of significance, the EW is $1 - (1 - 0.05)^5$, or 0.23.

Error rate per experiment

The error rate per experiment (EP) is the expected number of type 1 errors in a particular group of statistical significance tests and is computed using the formula $EP = c\alpha$, where $c$ represents the number of comparisons, and $\alpha$ is the significance level and remains constant across all tests. For example, given 20 independent statistical comparisons at the 0.05 significance level, $EP = 20(0.05) = 1$. This means that at the 0.05 level we would expect one type 1 error in 20 tests of statistical significance. It is important to note that the error rate per experiment (EP) is an expected value, while the experiment-wise error rate (EW), as defined above, is a probability. The experiment-wise error rate for 20 comparisons at the 0.05 significance level is $1 - (1 - 0.05)^{20}$, or 0.64, indicating that the probability of at least one type 1 error occurring among these tests reported as significant at the 0.05 level is 0.64.
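The two formulas above can be checked with a short Python sketch. The function names are illustrative and do not come from the article, and the EW calculation assumes totally independent tests, as the text does.

```python
def experiment_wise_error_rate(c: int, alpha: float = 0.05) -> float:
    """EW: probability of at least one type 1 error among c independent tests."""
    return 1.0 - (1.0 - alpha) ** c


def error_rate_per_experiment(c: int, alpha: float = 0.05) -> float:
    """EP: expected number of type 1 errors among c tests, each run at level alpha."""
    return c * alpha


if __name__ == "__main__":
    # Reproduce the worked examples in the text (5 and 20 tests at alpha = 0.05).
    for c in (1, 5, 20):
        ew = experiment_wise_error_rate(c)
        ep = error_rate_per_experiment(c)
        print(f"c={c:2d}  EW={ew:.2f}  EP={ep:.2f}")
```

EW is a probability and stays below 1, while EP is an expected count and grows without bound as the number of tests increases.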
Percent error rate

The formula for computing the percent error rate (PE) is $PE = 100c\alpha/M$, where $c$ is the total number of comparisons, $\alpha$ is the alpha level for a set of comparisons, and $M$ is the number of statistical tests with p values less than the designated alpha level. The percent error rate reflects the proportion of results labeled as statistically significant that are likely to be chance results. As the ratio approaches 1.00 (100 percent), it indicates that the number of tests found to be statistically significant approximates the number of tests one would expect to find to be significant purely by chance. As the ratio decreases and approaches the individual alpha level for a set of comparisons, it reflects the percent of results that are attributable to chance. The percent of results likely to be caused by non-chance factors is equal to 100 - PE. For example, if 1 out of 20 comparisons evaluated at the 0.05 level is statistically significant, then $PE = 100(20)(0.05)/1 = 100$ percent, suggesting that the number of tests found to be significant, that is 1, is the number expected by chance. On the other hand, if 4 out of 20 comparisons conducted at the 0.05 significance level are found to be statistically significant, then $PE = 100(20)(0.05)/4 = 25$ percent, indicating that about 25 percent of the results are expected as the result of chance, while the remaining 75 percent (three tests) are likely to be due to non-chance factors.

Am J Epidemiol Vol. 147, No. 7, 1998
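As a minimal sketch, the percent error rate defined above can be computed as follows. The function name and the decision to raise an error when no test is significant are assumptions made for this illustration and are not part of the article.

```python
def percent_error_rate(c: int, m: int, alpha: float = 0.05) -> float:
    """PE = 100 * c * alpha / M.

    c: total number of comparisons
    m: number of comparisons found statistically significant at level alpha
    """
    if m <= 0:
        # Assumption for this sketch: PE is treated as undefined when M = 0.
        raise ValueError("PE is undefined when no test is significant")
    return 100.0 * c * alpha / m


if __name__ == "__main__":
    # The two worked examples from the text: 1 of 20 and 4 of 20 significant.
    for m in (1, 4):
        pe = percent_error_rate(20, m)
        print(f"M={m}: PE={pe:.0f} percent, non-chance={100 - pe:.0f} percent")
```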
Page 3: 2063633675
Rating process

The reporting style in some of the articles made the determination of the exact number of statistical tests conducted and the number found statistically significant a difficult task. Two independent raters with research degrees (PhDs) reviewed all articles and identified both the total number of tests conducted and the number reported as statistically significant. When the two raters did not agree, a third rater reviewed the article in question and the value agreed upon by at least two raters was used in the analysis. In spite of the high agreement between the raters (see below), the results reported in this investigation should be viewed as approximations of the various error rates rather than as exact values. A post hoc analysis is necessarily somewhat arbitrary in determining the number of tests conducted because the actual number cannot be precisely determined without direct access to the original data.

The relation between error rates per comparison and error rates per experiment is complex with dependent tests, a condition which may be assumed to always hold to some degree when multiple statistical tests are conducted using subjects from the same sample. Strahan (11) has argued that, although it may be difficult to estimate the exact experiment-wise error rate due to correlation among the variables, it should be clear that it is greater than 5 percent. When discussing the impact of non-independence on error rates, it is important to distinguish types of non-independence that may exist. Ryan (12) originally identified the following four instances where non-independence may occur. The first includes all those situations where several groups or subgroups are statistically compared within the context of one study. The second case is referred to as "multiple tests with intercorrelated variables." This most commonly occurs when researchers compute multiple correlation coefficients for a single sample.
The third instance of multiple testing is the use of multiple factors in the analysis of variance. The F ratios obtained from a factorial analysis of variance may not be independent if a common error estimate is used across the tests. Similar problems arise if other statistical procedures such as multiple t tests are used to analyze data in what is essentially a factorial design. The final type of multiple testing situation is what Ryan (12) referred to as "replicated tests of a single hypothesis." This classification includes studies for which several different methods of assessing the same dependent variable are employed.

The situations described by Ryan (12) are not mutually exclusive. They do serve, however, to make it clear that interdependence between multiple statistical tests is complex and produced by numerous factors. Although the lack of independence may influence error rates, Ryan argues that it is not the main problem in interpreting error rates. He states that "The error rate per comparison and per experiment are completely unaffected by independence or lack of it. The only important factor in these rates is the number of comparisons to be made. Only the experiment-wise error rate is affected by lack of independence" (12, p. 34). In the case of the experiment-wise error rate, the more highly related the tests, the closer the experiment-wise error rate is to the error rate specified for an individual comparison.

In this examination, multivariate statistical tests that included procedures to control for type 1 error rates were considered as a single statistical test. This included analysis of variance (ANOVA) involving tests of interaction and accompanying post hoc procedures using Scheffe, Tukey, Duncan, Newman-Keuls, or other appropriate methods of post hoc analysis. Each ANOVA, including the post hoc analysis, was counted as one statistical procedure.
RESULTS

The 71 articles in five issues of volume 86 of the American Journal of Public Health and the 102 articles in five issues of volume 141 of the American Journal of Epidemiology contained sufficient statistical information to be included in the analysis. The interrater agreement for all information coded from each of the articles was examined using the intraclass correlation coefficient (ICC) (13). The ICC values for all recorded information ranged from 0.91 to 1.00.

Descriptive information for the experiment-wise error rate, the error rate per experiment, and the percent error rate for the articles published in the two journals appears in table 1. A comparison of the values for different error rates illustrates that the experiment-wise error rate (EW) and the percent error rate (PE) have an easier interpretation than the error rate per experiment (EP), since EW and PE are essentially bounded while EP has no upper limit.

The tabled values indicate that the EW in many articles is high, revealing a likelihood of type 1 errors in the reports. This is not surprising given the stochastic nature of the quantitative analysis of public health research. The prospect that many articles which report large numbers of statistical significance tests also report occasional type 1 errors does not seem alarming. What is of more concern is the percent error rate. The average individual alpha level used in a given study provides a lower bound for the percent error rate. Thus, for most of the investigations included in the analysis, 5 percent is the lowest value PE can achieve given the 0.05 significance level. Yet, in many of the
Page 4: 2063633676
studies, the PE indicated that approximately 20 percent or more of the findings may be erroneous. The mean PE for the studies in the American Journal of Public Health was 19.16 percent, while the mean PE for articles in the American Journal of Epidemiology was 18.73 percent (table 1). In a majority of the 173 articles (n = 156), the error rate per experiment (EP) was greater than 5 percent.

The analysis also suggests that the percent error rate provides information not specifically contained in the EW and EP. The correlation between EW and EP rates for the articles included in table 1 was r = 0.47. The correlation of PE with EP was r = 0.41 and the correlation of PE with EW was r = 0.32.

TABLE 1. Type 1 error rates for random articles published in the American Journal of Public Health and the American Journal of Epidemiology, 1996

Journal                                             No. of     Experiment-wise    Error rate per    Percent
                                                    articles   error rate         experiment        error rate
                                                               Mean     SD*       Mean     SD       Mean     SD
Am J Public Health, Vol. 86 (nos. 3, 4, 7, 9, 12)   71         0.68     0.24      0.90     0.57     19.16    9.01
Am J Epidemiol, Vol. 141 (nos. 2, 5, 6, 9, 10)      102        0.70     0.29      0.87     0.51     18.73    9.32

* SD, standard deviation.

DISCUSSION AND CONCLUSIONS

The problem of multiple hypothesis testing has implications regarding the interpretation and implementation of epidemiologic research. For example, more than a decade ago the Food and Drug Administration refused to approve sulfinpyrazone (Anturane®, CIBA, Summit, New Jersey) as a medication to reduce mortality in the first 6 months following myocardial infarction (14). The refusal was based in part on the results of a clinical trial that included the repeated analysis of accumulated data. No procedure was used to control for the effect of multiplicity and the validity of the results was open to question.
The probability of obtaining statistically significant results from two independent tests that address the same research question can be obtained by multiplying the individual probabilities that each test will produce a significant result. For p = 0.05, the probability that both tests will be statistically significant is 0.05 × 0.05 = 0.0025. The probability that neither result will be significant is 0.95 × 0.95 = 0.9025. The probability that at least one of the two test results will be statistically significant is 1 - 0.9025, or 0.0975. Thus, the probability of incorrectly deciding that the members of either one or both pairs of means are unequal using just two tests is nearly twice the probability of making the same error for a single test (0.0975 vs. 0.05). If we add a third comparison, the probability that none of the three tests will be significant is 0.95 × 0.95 × 0.95 = 0.8574, so the probability that at least one test will be significant is about 14 percent, or nearly three times the 0.05 level. As the number of independent statistical tests increases, the probability becomes much larger than 0.05, the original alpha (see table 1).

In trials where multiple dependent variables are used, the obvious solution to control or reduce experiment-wise error is to use some form of multivariate analysis. Multivariate procedures such as Hotelling's $T^2$, discriminant function analysis, and logistic regression offer viable alternatives to traditional univariate approaches when multiple dependent variables are present. These procedures have been described by public health and epidemiologic researchers and are beyond the scope of this paper (15, 16).

In some instances, the best solution may be to reduce the per comparison significance level to a more stringent criterion. The Bonferroni adjustment provides a widely advocated procedure to achieve this goal.
The Bonferroni inequality involves dividing the alpha level desired for the overall family of statistical tests (usually 0.05) by the number of statistical comparisons to be conducted. If two groups are compared on five separate dependent measures, each statistical comparison would be evaluated at 0.05/5 = 0.01. The Bonferroni method controls the type 1 error rate for each decision and maintains the selected alpha level (e.g., 0.05) for all the tests conducted in the investigation. The limitation of the Bonferroni method is that as the probability of making a type 1 error is decreased, the chance of committing a type 2 error is increased. Silverstein (17) demonstrated that when more than a small number of comparisons (say, five to eight) are included in a study, the Bonferroni procedure results in a dramatic loss in statistical power. Benjamini and Hochberg (18) have recently described alternatives to the Bonferroni adjustment that do not result in substantial reduction in statistical sensitivity. The Bonferroni and other p value adjustment methods, however, are viewed as too conservative by some
Page 5: 2063633677
investigators (17). Levin noted that researchers who are reluctant to use conservative correction methods such as the Bonferroni adjustment "will want to explore some newer techniques... in which less stringent but still interesting criteria replace the familywise error rate criterion" (1, p. 629). Procedures such as the percent error rate do not directly control type 1 error, but they do provide the investigator (and reader) with valuable information concerning the possible presence of a type 1 error in a family of statistical tests.

Determining the experiment-wise error rate for a "family" of statistical procedures can be a complex task. In this study, the statistical test was the unit of analysis and no distinction was made among statistical procedures within a study versus those between studies. Statistical tests conducted within a study generally use data from the same sample and are, therefore, assumed to be more related than statistical tests from different investigations (or samples). It is possible, however, that two different samples may be included in one research report, or that a single research article might include the results of more than one investigation. An argument could be made that the family-wise error rate should be determined based on statistical tests that address the same research question across multiple investigations, or even across the lifetime of an investigator working in a particular area (12). The individual statistical test was the unit for determining the experiment-wise error rate in this study. Other units are possible, for example, the study sample, the research report, the research question, or even the investigator. How the different units of analysis affect the experiment-wise error rate for a "family" of statistical tests is a question that can only be answered by additional research.
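The Bonferroni adjustment described earlier, and the sequentially rejective procedure of Holm (2) mentioned in the introduction, can be sketched in Python as follows. This is an illustration only: the function names are hypothetical, the p values are invented for the example and are not data from the article, and the Holm function follows the standard step-down description of that procedure rather than anything specified in this paper.

```python
from typing import List, Sequence


def bonferroni_threshold(alpha: float, c: int) -> float:
    """Per-comparison significance level: family alpha divided by number of tests."""
    return alpha / c


def bonferroni_reject(p_values: Sequence[float], alpha: float = 0.05) -> List[bool]:
    """Reject each hypothesis whose p value is at or below alpha / c."""
    threshold = bonferroni_threshold(alpha, len(p_values))
    return [p <= threshold for p in p_values]


def holm_reject(p_values: Sequence[float], alpha: float = 0.05) -> List[bool]:
    """Holm step-down procedure; returns decisions in the original input order."""
    c = len(p_values)
    order = sorted(range(c), key=lambda i: p_values[i])  # indices, smallest p first
    reject = [False] * c
    for rank, i in enumerate(order):
        # The k-th smallest p value (k = rank + 1) is compared with alpha / (c - k + 1).
        if p_values[i] <= alpha / (c - rank):
            reject[i] = True
        else:
            break  # once one test fails, all larger p values are retained
    return reject


if __name__ == "__main__":
    p_values = [0.001, 0.012, 0.03, 0.04, 0.2]  # hypothetical p values
    print("Bonferroni:", bonferroni_reject(p_values))
    print("Holm:      ", holm_reject(p_values))
```

With these illustrative p values, the Bonferroni rule rejects only the first hypothesis, while the Holm procedure also rejects the second. This reflects the loss of power with the Bonferroni adjustment that the text describes.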
Technical or statistical solutions to the problem of multiplicity in epidemiologic research should not obscure a more fundamental scientific principle. There is a continuing need in health-related research to formulate concise research questions and hypotheses before the collection and analysis of data. Statistical hypothesis testing is necessarily an empirical compromise between claiming too much and suggesting too little. Public health and epidemiologic researchers must prospectively define research questions and hypotheses as succinctly as possible and interpret the results using an alpha level appropriate to the extent of multiple testing. Knowledge of experiment-wise error procedures can help achieve this goal.

ACKNOWLEDGMENTS

This research was partially supported by grant no. MCJ-360646-010 from the US Department of Health and Human Services, Bureau of Maternal and Child Health.

REFERENCES

1. Levin B. Annotation: on Holm, Simes, and Hochberg multiple test procedures. (Comment). Am J Public Health 1996;86:628-9.
2. Holm S. A simple sequentially rejective multiple test procedure. Scand J Stat 1979;6:65-70.
3. Aickin M, Gensler H. Adjusting for multiple testing when reporting research results: the Bonferroni vs. Holm methods. Am J Public Health 1996;86:726-8.
4. Ware JH, Mosteller F, Ingelfinger JA. P-values. In: Bailar JC III, Mosteller F, eds. Medical uses of statistics. Waltham, MA: NEJM Books, 1986:179-~77.
5. Godfrey K. Statistics in practice. Comparing the means of several groups. N Engl J Med 1985;313:1450-6.
6. Tukey JW. Some thoughts on clinical trials, especially problems of multiplicity. Science 1977;198:679-84.
7. Savitz DA, Olshan AF. Multiple comparisons and related issues in the interpretation of epidemiologic data. Am J Epidemiol 1995;142:904-8.
8. Abt K. Problems of repeated significance testing. Control Clin Trials 1981;1:377-81.
9. Cupples LA, Heeren T, Schatzkin A, et al. Multiple testing of hypotheses in comparing two groups. Ann Intern Med 1984;100:122-9.
10. Thomas DC, Siemiatycki J, Dewar R, et al. The problem of multiple inference in studies designed to generate hypotheses. Am J Epidemiol 1985;122:1080-95.
11. Strahan RF. Multivariate analysis and problems of type I error. J Couns Psych 1982;29:175-9.
12. Ryan TA. Multiple comparisons in psychological research. Psych Bull 1959;56:26-47.
13. Shrout PE, Fleiss JL. Intraclass correlations: uses in assessing rater reliability. Psych Bull 1979;86:420-8.
14. Anonymous. Sulfinpyrazone in the prevention of sudden death after myocardial infarction. The Anturane Reinfarction Trial Research Group. N Engl J Med 1980;302:250-6.
15. Bray JH, Maxwell SE. Multivariate analysis of variance. Beverly Hills, CA: Sage Publications, 1985.
16. Altman DG. Practical statistics for medical research. New York: Chapman & Hall, 1991.
17. Silverstein AB. Power lost and statistical power regained. The Bonferroni procedure in exploratory research. Educ Psych Meas 1986;46:303-7.
18. Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc [B] 1995;57:289-300.