Originally written in February 2019.
Even within scientific and technical communities, one of the most common misconceptions is that math is purely objective. This fallacy greatly influences the way that we analyze data, most notably in clinical trials that directly affect the health of their subjects. While numbers in and of themselves hold no intrinsic bias, the way that we use and interpret those numbers makes them subjective. By nature, humans are biased, and when we press numbers into the service of our motivations, we make those numbers biased as well. Specifically, with statistics, we employ numbers to either support or disprove an assumption, which creates an initial foundation of bias (Lewis). And when we apply statistics to clinical trials, in order to test our hypotheses we must eliminate other variables and choose certain pieces of data to focus on, only amplifying the bias. In order to extract useful and ethical information from our analysis of clinical trials, which affects the health and medical treatment of people, we must be cognizant of this bias and provide context for the inference and implementation of our results instead of treating the numbers as pure, irrefutable facts. Rather than focusing on obvious ethical concerns, like informed consent of participants and falsification of data, which are already addressed in standard ethical guidelines for statistics and clinical trials, we will look at how sample size and the choice between competing methods like p-values and confidence intervals affect the way that people interpret and make decisions based on the resulting analysis of clinical trials.
One of the fundamental ideas taught in every introductory statistics course is that a large sample size, or number of participants in a study, is necessary to produce good, usable data. This idea is so ingrained in the scientific community that it is nearly impossible to publish a study in a reputable journal, regardless of all other parameters and results of the study, if the sample size is considered too small (Bacchetti). However, some statisticians advocate for smaller sample sizes, especially when the study in question is a clinical trial that can affect the well-being or daily lives of its subjects. For support, some of these statisticians cite the burden that clinical trials place on the participants by restricting certain aspects of their lives to control confounding variables and ensure consistent collection of data (Bacchetti). This burden is heightened in many clinical trials when the participants are patients who are undergoing an experimental treatment, or occasionally no treatment in the control group, potentially placing the patients’ health at risk. The argument for shrinking the sample is that every participant bears this burden regardless of the study’s size; if the sample is smaller, each participant contributes more value to the study and fewer people must endure the burden at all.
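To make the stakes of the sample-size requirement concrete, here is a minimal sketch of the textbook calculation reviewers typically have in mind, written in Python with SciPy. The effect size, standard deviation, significance level, and power below are hypothetical illustrations, not figures from any study cited here.

```python
# A minimal sketch (hypothetical numbers) of a standard sample-size calculation
# for a two-arm trial comparing means, using a normal approximation.
from scipy.stats import norm

def n_per_group(delta, sigma, alpha=0.05, power=0.80):
    """Participants needed per arm to detect a true difference `delta`
    with the given significance level and power (two-sided z-test)."""
    z_alpha = norm.ppf(1 - alpha / 2)   # critical value for the significance threshold
    z_beta = norm.ppf(power)            # critical value for the desired power
    return 2 * ((z_alpha + z_beta) * sigma / delta) ** 2

# Halving the detectable difference roughly quadruples the required enrollment.
print(round(n_per_group(delta=5, sigma=10)))    # ~63 participants per arm
print(round(n_per_group(delta=2.5, sigma=10)))  # ~251 participants per arm
```

Because the required enrollment grows with the square of the ratio between the spread and the effect we want to detect, demanding certainty about small effects quickly translates into hundreds of additional participants, each of whom must bear the burden of the trial.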
Sample size can also predetermine the bias of a study by limiting which groups can conduct a study based on their available resources. Studies are most often limited by funding, and the larger the sample size for the study, the more funding is required (Lewis). By necessitating a large sample size for acceptance by the scientific community, we restrict research and trials to the few groups that can afford them: some sponsored by government agencies, but most sponsored by industry, companies that already have skin in the game. While scientists may genuinely want to further scientific knowledge, at the end of the day they answer to the sponsors who provide the funding. In the case of government agencies, this often means that studies are chosen for their potential for high-impact results in short time periods, because shorter trials require less money and “shocking” results create widely shared headlines that are likely to incentivize more government spending in the scientific community (Lewis). This could lead scientists and investigators to shorten studies that may have produced more valuable data over a longer study period and to choose statistical methods that provide the most headline-worthy results. Industry-funded studies, meanwhile, are usually conducted to verify the safety, efficacy, and need for a certain product, creating a clear initial bias in how the study will be conducted and analyzed (Lewis). This common bias highlights a glaring gap in the American Statistical Association’s ethical guidelines, which discuss responsibility to science and to the sponsor in the same category, ignoring the conflict in motivations between scientific discovery and industry or government (Advanced Solutions International, Inc.). While large sample sizes often provide more certainty that an observed trend is not merely a random fluke, if studies with smaller sample sizes were more acceptable I believe we would have more research in the service of furthering scientific knowledge instead of research focused on the marketability of government agencies and commercial products.
Many studies published in highly respected scientific journals analyze the collected data with one of the most basic statistical methods: a hypothesis test and the calculation of a single number, the p-value, to determine the significance or insignificance of the results. The p-value is a function of the mean difference in results between the experimental and control groups, the standard deviation of those results, and the sample size. Scientists and investigators have invested immeasurable time and money into developing equipment and methods to collect and analyze continuous data, but after all of this work, most clinical studies base the importance of the results entirely on a single, arbitrary, dichotomous pair: “significant” and “insignificant,” separated by the cutoff p = 0.05 (Ebramzadeh; Bacchetti). Why 0.05? There is actually no scientific basis for this number. Fisher, a statistician from the early and mid-20th century, popularized the use of 0.05 when he published Statistical Methods for Research Workers in 1925 (Dahiru). While this number was more of a suggestion or guideline in the early 20th century, it has become so entrenched in basic statistical education that most respected scientific journals will not publish studies that find p-values greater than 0.05, labeling the findings as insignificant. If a study finds a difference in mortality rates between a standard drug and an experimental drug with a p-value of 0.1, is that difference truly insignificant? Most statisticians would agree that it is not; we cannot base the significance of data and results purely on the p-value. Another flaw of the p-value is that it is largely dependent on sample size. For example, if a study finds a significant difference with a p-value of 0.04, a similar set of data from a smaller sample can produce a much larger p-value, classifying the same apparent effect as insignificant (Ebramzadeh). Essentially, investigators can manipulate the likelihood that results will be significant solely by changing the sample size. The tobacco industry took advantage of this flaw to produce studies that found “insignificant” correlations between smoking and mortality rates or lung disease, masking the true deadly effects of smoking cigarettes for decades.
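To see how strongly the verdict depends on enrollment, here is a short illustration in Python (SciPy assumed); the means, standard deviation, and sample sizes are invented for the example, not taken from any study discussed here.

```python
# Hypothetical two-arm trial: control mean 50, treatment mean 45,
# standard deviation 15 in both arms -- the same observed effect every time.
from scipy.stats import ttest_ind_from_stats

for n in (20, 80, 320):
    result = ttest_ind_from_stats(mean1=50, std1=15, nobs1=n,
                                  mean2=45, std2=15, nobs2=n)
    print(f"n = {n:3d} per arm -> p = {result.pvalue:.3g}")

# Approximate output:
#   n =  20 per arm -> p ≈ 0.3      ("insignificant")
#   n =  80 per arm -> p ≈ 0.037    ("significant")
#   n = 320 per arm -> p ≈ 3e-05
```

The observed difference and spread never change; only the number of participants does, yet the label flips from “insignificant” to “significant.”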
However, there are other methods of analyzing and determining the importance of data that can lead to very different conclusions. For example, one can use confidence intervals, which estimate how a study’s results would carry over to the broader population and show the range of values consistent with the data rather than a single verdict. Suppose a study finds a 30% improvement in the health of subjects on a new experimental drug. That sounds like a large effect, but depending on the sample size and standard deviation, a hypothesis test could still label it insignificant. Analyzed with a confidence interval instead, we would have an idea of whether a similar improvement could be expected across an entire population and how much the results might vary. Confidence intervals, in contrast to p-values, provide more digestible and applicable information for the community that reads and implements the findings of the published study.
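As a minimal sketch of that contrast, the snippet below computes a simple Wald (normal-approximation) confidence interval for an improvement rate; the counts are hypothetical, and the Wald interval is only one of several reasonable choices.

```python
# Hypothetical counts; Wald interval used purely for illustration.
from math import sqrt
from scipy.stats import norm

def proportion_ci(successes, n, confidence=0.95):
    """Wald (normal-approximation) confidence interval for an improvement rate."""
    p_hat = successes / n
    z = norm.ppf(1 - (1 - confidence) / 2)
    margin = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

# The same 30% observed improvement rate at two hypothetical sample sizes:
print(proportion_ci(6, 20))    # ≈ (0.10, 0.50): wide interval, weak evidence
print(proportion_ci(30, 100))  # ≈ (0.21, 0.39): narrower, far more informative
```

At the smaller sample size the interval is too wide to support a strong claim, but it still communicates what the data can and cannot say, rather than collapsing everything into “insignificant.”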
Regardless of the results of statistical analysis of a study, one should always fully publish the data, calculations, and any other relevant information. There is often data collected in clinical trials that is relevant for directing further research even if it is not applicable to the immediate focus of the study. For example, perhaps with a new treatment we find no significant difference in mortality rates as compared to the standard treatment, but we find that a large number of the subjects given the new treatment require a mechanical ventilator after treatment (Lewis). This is seemingly unrelated to mortality rates, but the information should be published so that it can be referenced by scientists who could potentially find it useful. While the ASA ethics guidelines state that all data and calculations should be transparent, they do not mention notable information gathered that does not directly relate to the study’s primary question (Advanced Solutions International, Inc.). Furthermore, when stating results and discussion at the end of a clinical trial, it is not enough to simply state the numbers found and trends in the data, which is often what we see in published studies. Studies should provide context for the results that have been found and the conclusions that have been drawn. Just because a number is statistically significant does not mean it is clinically significant. For example, a clinical trial may find a significant difference in the life-span of a new joint replacement (Ebramzadeh). However, if that difference is two months, is that significant enough to implement in practice? Joint replacements must last for the life of the patient, so a two-month extension does not make a meaningful difference in the patient’s life, while adopting the new design would still require retraining surgeons and rewriting standard practice documents (Ebramzadeh).
Determining clinical significance must also include some level of risk-benefit analysis. When we choose a strict significance threshold like p = 0.05, there is a higher chance of a Type II error: failing to detect a difference that really exists (a short calculation after the mountaineer discussion below puts numbers on this trade-off). This can have drastic impacts if we decide that a new treatment is not beneficial when, in reality, it is: if the analysis does not accurately represent the significance of the treatment, then many patients who might experience tangible benefits will never be given it (Ebramzadeh). Furthermore, we cannot look solely at the initial hypothesis for the study. If the clinical trial is focused on mortality differences for a new treatment and there aren’t any, we should not discount the surrounding data that shows a concrete decrease in negative side effects. The purpose of furthering scientific and medical knowledge is not only to save lives, but to improve the quality of lives. Finally, we must employ risk-benefit analysis when implementing study results, deciding whether the potential harm to the community is worth the potential benefits that a new drug or treatment might bring. The mountaineer example gives a strong analogy:
A mountaineer can either jump across a glacial crevasse and risk falling to his death, or he can walk five miles around the crevasse to get to the same place. The mountaineer is an excellent jumper: a test of his ability to jump twice the width of the crevasse yields a p-value of .001. Essentially, it is nearly certain that he would make the jump, but he chooses to walk around instead of risking his life. The context of the risk is essential to the decision: if we only had the p-value and did not know the severe penalty of failure, then we would certainly say he should jump (Ebramzadeh).
Similarly, in the clinical world, suppose a study has a p-value of .1 but is consistent with many “significant” past studies and with widely accepted theory. Should we really throw that study out? Conversely, take a study that has a p-value of .001 but defies current theory and every past study. If we accepted it, medical practice would be completely changed, textbooks rewritten, and equipment redesigned. Should we really accept that study as fact because of its p-value (Ebramzadeh)?
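To put rough numbers on the Type II error trade-off raised earlier, here is a sketch in Python (SciPy assumed) using a normal approximation; the effect size, standard deviation, and sample size are hypothetical.

```python
# A rough sketch (hypothetical numbers, normal approximation) of the trade-off:
# tightening the significance threshold raises the chance of missing a real benefit.
from math import sqrt
from scipy.stats import norm

def type_ii_error(delta, sigma, n_per_arm, alpha):
    """Approximate chance of missing a true mean difference `delta`
    in a two-arm comparison (two-sided z-test, normal approximation)."""
    se = sigma * sqrt(2 / n_per_arm)
    z_crit = norm.ppf(1 - alpha / 2)
    power = norm.cdf(delta / se - z_crit)
    return 1 - power

# A real but modest benefit: delta = 4, sigma = 15, 50 patients per arm.
for alpha in (0.10, 0.05, 0.01):
    beta = type_ii_error(delta=4, sigma=15, n_per_arm=50, alpha=alpha)
    print(f"threshold {alpha:.2f} -> chance of missing the benefit ≈ {beta:.0%}")
```

With a real but modest benefit and a mid-sized trial, tightening the threshold from 0.10 to 0.01 raises the chance of missing the benefit from roughly 60% to roughly 90%; a dichotomous verdict hides this risk entirely.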
It is important that clinical trials are run by a cross-disciplinary team. This means that scientists with limited knowledge of statistics should not be analyzing data without guidance, and statisticians unfamiliar with the medical world should not contextualize or make inferences from the data without input from medical professionals or scientists (Gelfond). Rather, all communities that play a role in the execution and implementation of a study should be represented on the team of investigators running it. Additionally, the ASA should update its guidelines to address the importance of context and the flaws of dichotomous analysis, as well as the ethical gray area between the opposing motivations of the scientific research community and commercial industry. But such suggestions are merely a patch over a deeper issue. When teaching applied math like statistics, a core part of the education should be understanding the subtleties and potential impacts of seemingly unambiguous numbers. Statistics courses should eliminate the use of p = 0.05 as the sole judge of significance and instead teach students to think critically and contextualize the data. Not only would changing the way we understand statistical analysis transform the scientific community and its publications for the better, it would also produce a new generation of professionals better equipped to handle the ambiguity inherent in many scientific problems.
Works Cited
Advanced Solutions International, Inc. “Ethical Guidelines for Statistical Practice.” Ethical Guidelines for Statistical Practice, Apr. 2018, http://www.amstat.org/ASA/Your-Career/Ethical-Guidelines-for-Statistical-Practice.aspx.
Bacchetti, Peter, et al. “Ethics and Sample Size.” American Journal of Epidemiology, vol. 161, no. 2, Oxford University Press, 15 Jan. 2005, academic.oup.com/aje/article/161/2/105/256528.
Dahiru, Tukur. “P-Value, a True Test of Statistical Significance? A Cautionary Note.” The National Center for Biotechnology Information, U.S. National Library of Medicine, June 2008, http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4111019/.
Ebramzadeh, E, et al. “Challenging the Validity of Conclusions Based on P-Values Alone: a Critique of Contemporary Clinical Research Design and Methods.” The National Center for Biotechnology Information, U.S. National Library of Medicine, 1994, http://www.ncbi.nlm.nih.gov/pubmed/9097190.
Gelfond, Jonathan A. L., et al. “Principles for the Ethical Analysis of Clinical and Translational Research.” Statistics in Medicine, vol. 30, no. 23, 2011, pp. 2785–2792, doi:10.1002/sim.4282.
Lewis, Roger J. “Ethical Issues in the Statistical Analysis of Clinical Research Data.” SAEM, Department of Emergency Medicine, Harbor-UCLA Medical Center, http://www.saem.org/docs/default-source/saem-documents/research/ethical_issues_in_stat_analysis_of_clinical_research_data_lewis.pdf?sfvrsn=cc09fbb5_4.
