The Feynman technique is a strategy for mastering a skill or a body of information by explaining it in concise, simple language. Richard Feynman (1918-1988), the famed physicist and Nobel laureate, developed the technique while in graduate school; to paraphrase his biographer James Gleick, it is a process of disassembly and reassembly that creates a path to understanding. Gleick's 1992 biography, *Genius: The Life and Science of Richard Feynman*, is the primary source for descriptions of the technique.

Based on Feynman, my approach to studying looks like this: restate a concept in my own, simpler words; identify the gaps that the restatement exposes; and return to the source material until the explanation holds together.

The essence of Feynman's approach can be seen in his lectures from the California Institute of Technology (available in print or freely online at __feynmanlectures.caltech.edu__). In *Lecture One*, we find a variation of his well-known quotation about particles, and it is a fitting example of the clarity and brevity that characterize his technique.

The Feynman Technique (which is based on rephrasing, simplification, and recall) also works well with Francis P. Robinson's SQ3R reading-comprehension technique (first published in his 1946 book, *Effective Study*). SQ3R is the approach I use when I'm tackling an article or book chapter; Feynman's is the technique I use for specific concepts. The steps of SQ3R are:

**Survey** (__skim__ chapter headings, figures, tables, diagrams);

**Question** (complete practice problems; ask yourself *What is this about? How does it help me?*);

**Read** actively and __re-phrase__ (using prior knowledge from S and questions from Q);

**Recite** what you learned to activate retrieval benefits (recall S and Q);

**Review** what you learned (write and repeat your thoughts; use flashcards); embrace __spaced repetition__.

The Feynman Technique applies especially to Robinson's stages of *read*, *recite*, and *review*.

**Variance and the Assumption of Homogeneity of Variance**

The distribution of a set of data is described in three ways: by its shape (for instance, unimodal or bimodal; lepto-, meso-, or platykurtic), by its central tendency (the mean, median, and mode), and by its variability (the spread or dispersion of scores across the data set, as indicated by such measures as variance, standard deviation, and range) (Vogt & Johnson, 2015). Variance, in particular, is the average squared deviation from the mean, denoted σ², where σ is the standard deviation of a population. Because it is built from deviations from the mean, variance reflects how scores are spread about the mean.
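As a quick illustration of that definition (the scores below are invented), the population variance can be computed directly as the mean of the squared deviations from the mean; a minimal Python sketch:

```python
from statistics import fmean

def population_variance(scores):
    """Sigma squared: the average squared deviation from the population mean."""
    mu = fmean(scores)  # the population mean
    return fmean([(x - mu) ** 2 for x in scores])  # mean of squared deviations

scores = [2, 4, 4, 4, 5, 5, 7, 9]  # invented data; the mean is 5
print(population_variance(scores))  # 4.0
```

The standard library's `statistics.pvariance` computes the same quantity; spelling it out makes the "average squared deviation" definition visible.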

Homogeneity of variance is a statistical assumption used for both t and F tests, especially in connection with analysis of variance, commonly known as the ANOVA procedure. The related term, homoskedasticity, is typically used in connection with correlations and regressions and refers to homogeneity of variance in arrays. The concepts are related, but the terms should be distinguished (Vogt & Johnson, 2015). Homogeneity of variance posits that multiple samples taken from a population will be similar in their measured behaviors or responses (i.e., the data they produce). Although the variances of multiple samples taken from a given population will not be identical, they should be relatively similar. Consider, for example, an education researcher investigating literacy among Grade 3 students in a series of independent and public schools in a district. Before comparing the schools and applying higher-order statistical procedures (which assume homogeneity), the researcher must first determine whether the variances of the samples are relatively similar; otherwise, the results will not be credible. Testing for homogeneity in this way differs from hypothesis testing but is closely related to it and, as such, may be referred to as the minor hypothesis (DeMoulin & Kritsonis, 2013).

For hypothesis testing with multiple samples (a t-test or ANOVA), the assumptions are (1) that scores are independent, (2) that scores are normally distributed, and (3) that score variances are homogeneous (Vogt & Johnson, 2015). Independence is verified through random selection; normal distribution through data description and plotting; and homogeneity of variance through a test statistic, such as an F test. In all cases, these assumptions refer to the population as a whole, although samples may be used to verify them. Should a data set not satisfy all three assumptions, a parametric test (i.e., one whose findings are generalizable to a population) may give misleading results.

**Impact of violating Homogeneity of Variance on the Validity of Results**

Parametric statistical procedures, by definition, are concerned with populations. A parameter is a value that remains constant in an experiment or calculation and describes the whole of the system being measured; that whole is also known as a population. By comparison, a statistic is a value that describes and interprets a part of the whole, otherwise known as a sample. Given that homogeneity of variance is one of the assumptions of parametric statistics, it follows that when the assumption is violated, the validity of the calculation diminishes. Because statistics as a discipline is rooted in probabilities, some level of uncertainty is implicit. Research necessarily involves some possibility of error, usually expressed as a percentage and, by convention, falling between one and five percent, though sometimes as high as ten percent. In a scenario where scores are verified as independent and the distribution is normal, should the observed value of an F test statistic fall just outside the rejection region by a small margin, moving forward with a parametric procedure could seem logical. Because statistics is probabilistic, tolerating a very small violation of the assumption of homogeneity of variance might seem reasonable.

A challenge, in my view, arises from the fact that statistical decisions do not admit of degrees: a result is either significant or not significant, based upon the pre-established alpha of the research design. Whether an observed value falls inside the rejection region by 0.01 or by 100.01, the finding is significant; not barely significant, not highly significant, just significant. The same holds for non-significant findings. Whether the practical effect of using data with relatively different variances in a multi-sample hypothesis test is large or small, the fundamental premise stands: statistical validity rests on fixed decision rules, not gradations. Furthermore, as the difference in variance between two samples increases, so does the likelihood of rejecting the null hypothesis when the null hypothesis is, in fact, true, otherwise known as a Type I error or false-positive result.

**Resolving the two viewpoints**

A violation of homogeneity of variance will necessarily degrade the results of a parametric procedure. As such, I would argue that debating whether some small difference in variance is acceptable misses the mark. Instead, the approach I would take is that when homogeneity of variance is assumed, that assumption must be met. When the assumption cannot be met, I would look for an alternative: Levene's test can be used to check the equality of variances, and if they differ, Welch's t-test, which does not assume equal variances, can replace the standard t-test (more generally, most parametric procedures have non-parametric or robust counterparts). Rather than force data into a calculation that they do not support, my preference is to consider an alternative that the data do support. Given that statistics is already a probabilistic field of inquiry, the goal should be to decrease, not increase, the likelihood of error.
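To make the alternative concrete, here is a minimal pure-Python sketch of Welch's t statistic and its Welch-Satterthwaite degrees of freedom (the group scores are invented; in practice, library routines such as SciPy's `levene` and `ttest_ind(equal_var=False)` would supply the p-values):

```python
from math import sqrt
from statistics import fmean, variance

def welch_t(a, b):
    """Welch's t statistic and degrees of freedom; unlike the pooled
    Student's t-test, it does NOT assume equal group variances."""
    m1, m2 = fmean(a), fmean(b)
    v1, v2 = variance(a), variance(b)    # sample variances (n - 1 denominator)
    se1, se2 = v1 / len(a), v2 / len(b)  # squared standard errors per group
    t = (m1 - m2) / sqrt(se1 + se2)
    # Welch-Satterthwaite approximation of the degrees of freedom
    df = (se1 + se2) ** 2 / (se1 ** 2 / (len(a) - 1) + se2 ** 2 / (len(b) - 1))
    return t, df

group_a = [12, 15, 14, 10, 13, 14]  # invented scores: similar means ...
group_b = [22, 8, 30, 5, 16, 9]     # ... but a much larger spread
t, df = welch_t(group_a, group_b)
```

The design point is that each group's variance enters the standard error separately, so unequal spreads never have to be pooled into a single estimate.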

**References**:

DeMoulin, D. F., & Kritsonis, W. A. (2013). *A statistical journey: Taming of the skew!* (2nd ed.). The AlexisAustin Group.

Vogt, W. P., & Johnson, R. B. (2015). *The SAGE dictionary of statistics and methodology* (5th ed.). SAGE Publications, Inc.

**Sampling Techniques for the Study of an Educational Bond Referendum**

In designing a quantitative research project in which statistical analysis will provide the basis for inferences about the data collected, a reliable and valid method of sample selection is vital. As part of the initial design process, a researcher determines the independent and dependent variables of the study (that is, the experimental and response variables) and then identifies a target population. From that target population, observation and experimentation will yield data that, through the application of appropriate statistical procedures, will lead either to rejection of the null hypothesis or to a failure to reject it. For this reason, while every element in the design of a research study contributes to its success, none of them can compensate for the damage done by bad data obtained from erroneous or improper sample selection (Scott & Morrison, 2006). The purpose of this post is to explore random sampling as a foundational component of the design process in quantitative research, within the context of a hypothetical study that would infer the likelihood of the passage of an educational bond referendum.

**Population Identification and Sampling**

A population is some whole and distinct group of items or subjects from which a sample is drawn (DeMoulin & Kritsonis, 2013). The population pool may be as large as a city or even a nation or as small as a classroom of students. It may be composed of people or items, though in the social sciences the former almost universally constitutes the population of interest (Eldredge et al., 2014). The deciding factor for identifying a population is typically a characteristic the items or subjects have in common and a researcher's interest in studying that characteristic. In addition to identification, researchers must further determine the accessibility of any target population. The accessible population may be the same as the target population, but it may not be. In the theoretical education bond study for this paper, the target population is all eligible members of a voting district; this is also the accessible population (available through census and voter registration records). Suppose, however, that one were investigating student demographics and academic performance within a district, but one or more of its school principals declined to participate. The result would be an accessible population smaller than, and different from, the target population of all students in the district (DeMoulin & Kritsonis, 2013).

Because the study of a whole population can be prohibitive to a researcher’s work, either because of funding and resource limitations or simple impracticality, data are typically collected from samples (Vogt & Johnson, 2015). A sample is a subset of a population. For a quantitative study, where a researcher will apply statistical analysis to the data, this sample is assumed to be random. True-random samples are representative of a population and allow for inferences to be generalized from the sample to the population. Those inferences should not be viewed as truth claims, however; instead, they are probabilistic descriptions, correlations, and explanations of phenomena observed within the sample. For a sample to be true-random, the method of selection must meet two criteria. First, that each item or subject within the sample has an equal probability of being drawn; and second, that each item or subject is drawn independently and not affected by the drawing of any other one (Suhonen et al., 2015). When created in this way, a sample will be representative of the population, and a researcher may subject the empirical data collected from it to parametric statistical procedures.

In practice, to create the actual sample from a population, a researcher first needs a sampling frame. This is a list, typically in the form of a database, that contains all members of the target population. Even a general-purpose spreadsheet like Excel can assign identification numbers at random to all items or subjects within a population list. The researcher need only specify the required sample size (*n*) and use the software to apply random number generation to the set based on the total population (*N*). If a sample size of 500 is required, one simply draws the randomly assigned numbers one through five hundred from the total population database. Note, however, that just as there is an important distinction between the target population and the accessible population, which determines the true *N* of a study, there is a similarly important distinction between sample selection and the actual sample number, *n*. The actual sample number is the number of items or subjects from which data have successfully been collected. Even if, in the educational bond study, a true-random sample of 500 subjects were drawn, should only 430 of them provide data for the study, the actual sample would be *n* = 430 and not *n* = 500.
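The frame-then-draw process above can be sketched in a few lines of Python (the voter IDs and sizes are hypothetical); `random.sample` draws without replacement and gives every member of the frame an equal probability of selection:

```python
import random

def draw_sample(frame, n, seed=None):
    """Simple random sample of size n from a sampling frame: every member
    has an equal chance of selection (sampling without replacement)."""
    rng = random.Random(seed)  # seeded generator for a reproducible draw
    return rng.sample(frame, n)

frame = list(range(1, 5001))                 # hypothetical frame: N = 5000 voter IDs
selected = draw_sample(frame, 500, seed=42)  # n = 500 invited subjects
# If only 430 of the 500 return usable data, the actual sample is n = 430.
```

The distinction in the comment mirrors the one in the text: `len(selected)` is the selected sample, not necessarily the achieved *n*.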

**Generating a Stratified Random Sample**

Stratified sampling is a method of random sample selection that first divides a population into smaller subsets referred to as strata (Scott & Morrison, 2006). Within each stratum, a researcher draws an individual random sample. The strata, which a researcher identifies in advance, create sub-categories for the data that allow statistical inferences within those specific categories. These sub-categories express within-group homogeneity and between-group heterogeneity and are identified to increase the precision of sample representation. Such sampling may also contribute to the study's internal validity by identifying and deliberately controlling for potential confounding variables (Vogt & Johnson, 2015). In the social sciences, these categories typically align with demographics (age, race, gender, sex, SES, etc.) or psychological features (facets of child development, personality, and mental health). Random number generation, as described above using Excel or a similar software package, is again used in creating the sample, though this process assumes that the categories used in the strata are available to the researcher in database form in advance.
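A minimal sketch of the idea, with hypothetical voters tagged by parental status (the stratum labels and sizes are invented for illustration):

```python
import random

def stratified_sample(population, stratum_of, n_per_stratum, seed=None):
    """Group the population into strata, then draw an independent
    random sample of the same size within each stratum."""
    rng = random.Random(seed)
    strata = {}
    for member in population:
        strata.setdefault(stratum_of(member), []).append(member)
    return {label: rng.sample(members, n_per_stratum)
            for label, members in strata.items()}

# Hypothetical frame: (voter_id, parental_status) pairs
voters = [(i, "parent" if i % 3 == 0 else "non-parent") for i in range(300)]
sample = stratified_sample(voters, lambda v: v[1], n_per_stratum=20, seed=7)
# Each stratum contributes exactly 20 subjects, so within-group comparisons
# (parent vs. non-parent support for the bond) become possible.
```

Sampling within each stratum separately is what produces the within-group homogeneity the text describes.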

To study the question of whether a population is likely to support the passage of an educational bond, stratification related to demographics could inform decision-making for canvassing and other informational campaigns. Data collected on the basis of categories such as gender, age, and parental status (which is to say, whether respondents have children, and whether those children attend a public school in the voting district) would enhance understanding not only of whether the bond might pass, but also of which categories of individuals are likely to support it. Parental status is a particularly useful category in this instance, in my view, because I suspect that the likelihood of supporting a bond will correlate with whether households have children and whether those children attend public schools in the voting district. Knowledge of correlations between categories of data would allow door-to-door canvassers and informational campaigns (local television commercials and brochures) to target specific types of individuals with likely known viewpoints. It may become clear from the study, for instance, that individuals without children currently attending a public school in the district are unlikely to support the bond's passage. Such information is valuable in generating an argument either for or against the bond measure, depending on who the study's information users are. Whoever they are, a stratified sample will allow canvassers and campaign communications to target their arguments more effectively to their audiences.

**Generating a Cluster Random Sample**

Cluster sampling, like stratified sampling, is a random form of sample selection in which a researcher divides a population into smaller groupings referred to as clusters (Scott & Morrison, 2006). Individual samples are randomly drawn from each cluster. Like stratified sampling, cluster sampling is a probabilistic and random method, but it expresses between-group homogeneity and within-group heterogeneity. As discussed above, the purpose of stratified sampling is to identify categories like demographics explicitly and organize sample selection using population groups defined by these categories; hence the within-group homogeneity. Cluster sampling, by comparison, does not use researcher-defined categories, but instead divides a population by natural features like geography. Within a voting district, clusters are inherently created by street intersections and neighborhoods. For this reason, there is within-group heterogeneity (that is, demographic traits like age and parental status have not been identified), but between-group homogeneity (that is, each cluster is equally a neighborhood). The primary reasons a researcher would use cluster sampling are to reduce the overall cost of a study and increase efficiency. Once again, population lists are a prerequisite for random selection of the sample using a statistical software package.
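A sketch of the cluster logic, with invented neighborhood clusters: whole clusters are drawn at random first, and only their members are then surveyed:

```python
import random

def cluster_sample(clusters, k, seed=None):
    """Stage 1: randomly select k whole clusters (e.g., neighborhoods).
    Stage 2: keep every member of the chosen clusters for surveying."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), k)  # sorted keys for reproducibility
    return {name: clusters[name] for name in chosen}

# Hypothetical neighborhoods within the voting district
neighborhoods = {
    "Oakwood":   ["v01", "v02", "v03"],
    "Riverside": ["v04", "v05"],
    "Hillcrest": ["v06", "v07", "v08", "v09"],
}
picked = cluster_sample(neighborhoods, k=2, seed=3)
# Canvassers now visit only two neighborhoods, reducing travel cost.
```

The cost saving the text mentions is visible here: randomness is spent on choosing clusters, not on scattering subjects across the whole district.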

**Conclusion**

Fundamental to the success of a quantitative research study is the method used for sample selection. The data for a study come from the observations of and treatments applied to a sample. Should that sample not have been selected correctly, the data coming from it will be unreliable and invalid. Such data cannot produce results that allow meaningful inferences about a phenomenon of interest to a researcher and cannot be generalized to a population. In other words, if the sample is poor, the data will be poor, and the study will produce erroneous or, at best, low-quality results.

The process of sample selection is a necessary early step in the design of a quantitative research study. To apply parametric statistical procedures to data and make generalizations from the sample to the population requires that the sample selection be true-random and, thus, representative. This means that each item or subject within the sample has an equal probability of selection and that each one is also independent of the others. In the social sciences, stratified and cluster sampling are common methods of sample selection that allow a researcher, based on constraints of time and funding, either to increase representation while decreasing potential confounders (as with stratification) or to reduce costs while improving efficiency (as with clustering). Because there is no remedy in the design of a research study for the invalid and unreliable results that bad data produce, a systematic approach to sampling should include the following steps:

1. Identify a target population that addresses the variables of the research question.

2. Decide on the needed sample size, which, in connection with effect size and p-value, will determine the power of the study as a whole. (Notably, sample size is the most commonly manipulated component of a study's power, as effect size and p-value are typically designated in advance and used to determine the required sample size.)

3. Based on the chosen sampling method (for instance, simple random, random clustering, or random stratification), use a statistical software package to generate the random sample from a population database.

4. Proceed with data collection.

**References**

DeMoulin, D. F., & Kritsonis, W. A. (2013). *A statistical journey: Taming of the skew!* (2nd ed.). The AlexisAustin Group.

Eldredge, J. D., Weagel, E. F., & Kroth, P. J. (2014). Defining and identifying members of a research study population: CTSA-affiliated faculty members. *Hypothesis*, 26(1), 5–11.

Scott, D., & Morrison, M. (2006). *Key ideas in educational research*. Continuum.

Suhonen, R., Stolt, M., Katajisto, J., & Leino-Kilpi, H. (2015). Review of sampling, sample and data collection procedures in nursing research - An example of research on ethical climate as perceived by nurses. *Scandinavian Journal of Caring Sciences*, 29(4), 843–858.

Vogt, W. P., & Johnson, R. B. (2015). *The SAGE dictionary of statistics and methodology* (5th ed.). SAGE Publications, Inc.

Quantitative research studies prioritize the collection and analysis of numerical data in a scientific process of observation and reasoning used for hypothesis testing (Cohen et al., 2011). This was the goal for a research team of three scholars, Wong-Ratcliff, Powell, and Holland, who in 2010 investigated the effects of the Reading First program on grade-one students in rural school districts in Louisiana. Reading First (R.F.) is a federal education program described within the Elementary and Secondary Education Act (ESEA) that provides funding to Title I schools for literacy improvement. In receiving this funding, though, schools are required to comply with scientifically based reading research (SBRR) practices. This requirement has led to comparisons of gains in student literacy between R.F. schools and non-RF schools, where literacy instructional practices are not mandated. Following the U.S. Department of Education's publication in 2008 of the Reading First Impact Study, which showed no significant difference in literacy gains between R.F. and non-RF schools, smaller-scale studies have provided additional analysis of the problem. The purpose of this post is to explore one such smaller-scale study and examine the quantitative design its authors used to draw conclusions about the efficacy of Reading First in rural Louisiana.

**Research Overview: Purpose, Sampling, Instrumentation, Definitions, Variables**

Because R.F. schools implement SBRR as part of their program and funding compliance, the authors set the framework for their study from findings of the National Reading Panel or NRP (National Institute of Child Health and Human Development, 2000). Based upon a review of more than 100,000 research studies, the NRP identified five vital areas of reading instruction. These areas include:

1. phonemic awareness;

2. phonics;

3. fluency;

4. vocabulary; and

5. text comprehension.

In addition, the NRP also characterized effective literacy instruction based upon four pillars. Those pillars are:

1. The use of valid, reliable assessments.

2. The alignment of instruction and materials.

3. The alignment of literacy instruction to professional development

programming.

4. The presence of instructional leadership and coaching.

Based on the research-informed framework of the NRP, the study's authors elected to use DIBELS instrumentation to collect and categorize their data. DIBELS, a University of Oregon literacy assessment whose name stands for Dynamic Indicators of Basic Early Literacy Skills, includes a series of subtests (letter naming fluency, phoneme segmentation fluency, nonsense word fluency, and oral reading fluency) that align with the vital areas of instruction identified in the NRP report. Sampling for use with the DIBELS instrumentation was based on non-probability convenience. Wong-Ratcliff and her colleagues identified students in grade one at five different schools in rural Louisiana, three of which were R.F. schools (N = 130), the remaining two being non-RF schools (N = 153). A matched sampling method ensured that participants shared the same demographic characteristics (specifically geography, SES, and ethnic diversity). As a design control, the authors allowed only a single difference between the two groups of schools; namely, participation (or not) in the federal Reading First program.

**Research Hypothesis, Research Design, and Data Collection**

In their study, Wong-Ratcliff and her colleagues used a quasi-experimental design. While their goal was to establish and support a potential cause-and-effect relationship between the study variables, that is, between participation in R.F. programming (the independent variable) and literacy gains (the dependent variable), the use of non-probability convenience sampling necessarily defines the study's design as quasi-experimental rather than experimental. The authors examined the hypothesis that there is no statistically significant difference between the mean gains in literacy for students at R.F. schools when compared to students at non-RF schools. They collected data at two separate points in the school year using the DIBELS subtests, which they administered during fall and spring benchmark testing.

**Validity, Reliability, Bias, and Control**

DIBELS, currently in its eighth edition, was in its sixth edition at the time of Wong-Ratcliff et al.'s study. DIBELS and the particular subtests the authors used in their data collection are broadly administered by schools across the United States. As an assessment instrument, it is known to be both reliable (consistent across administrations) and valid (accurately measuring literacy skills in K-8 student populations). Both validity and reliability have been reported in numerous peer-reviewed research studies, including Hoffman (2009) and Schilling (2007). In Wong-Ratcliff et al.'s study, no conditions of research were adapted or changed during the course of the investigation, which lends independent support to the validity and reliability of their work.

DIBELS is not without controversy, however, as several researchers who originally participated in creating the instrument also served in a consulting capacity to the U.S. Department of Education during the development of the Reading First initiative. This conflict of interest was outlined by Kathleen Manzo in an article for Education Week in 2005. While there is no apparent conflict of interest between the current study's authors (Wong-Ratcliff, Powell, and Holland) and their use of DIBELS (which they contextualize as a function of alignment with the NRP report), I would like to have seen direct discussion of the DIBELS conflict of interest problem as part of the study's introduction and disclosures.

**Brief Overview of Data Analysis**

The authors used three statistical procedures in their data analysis. First, to examine the significance of the difference in mean literacy scores between R.F. and non-RF schools at fall benchmarking, they performed an analysis of variance, or ANOVA. This showed markedly significant results (p < .001), though the effect sizes were small (between .096 and .140). As a result, the authors concluded that the difference in baseline literacy scores was meaningful (the three R.F. schools had an overall higher level of literacy than the two non-RF schools at the start of the study). Second, following spring benchmarking, the authors used a univariate analysis of covariance, or ANCOVA (i.e., an ANOVA that incorporates regression on a covariate), to examine the variables while accounting for those baseline differences. The ANCOVA showed that mean literacy gains as measured by the DIBELS subtests were not significantly different for students at the R.F. schools in comparison with students at the non-RF schools. In other words, while at the outset of the study the R.F. students' baseline was higher (incidentally, not demonstrably because of SBRR practices), the net gains achieved by both groups were the same. Students at non-RF schools, where SBRR was not mandated, did not see fewer literacy gains than those at the R.F. schools. Finally, the authors used correlated t-tests to analyze the DIBELS subtests for reliability between their use in the current study and results from the prior year's administration.
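The ANOVA at the heart of the fall-benchmark comparison rests on a simple ratio: the F statistic divides between-group variability by within-group variability. A minimal pure-Python sketch of the one-way F statistic (the scores below are invented, not the study's data):

```python
from statistics import fmean

def one_way_anova_f(*groups):
    """F = mean square between groups / mean square within groups.
    A large F suggests the group means differ more than chance predicts."""
    scores = [x for g in groups for x in g]
    grand = fmean(scores)                 # grand mean over all observations
    k, n = len(groups), len(scores)
    ss_between = sum(len(g) * (fmean(g) - grand) ** 2 for g in groups)
    ss_within = sum((x - fmean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

rf_school = [41, 44, 46, 49]  # invented fall benchmark scores
non_rf = [38, 40, 42, 43]
print(one_way_anova_f(rf_school, non_rf))
```

In practice the observed F is compared against the F distribution with (k - 1, n - k) degrees of freedom to obtain the p-value; that lookup is what a statistics package supplies.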

**Study Results, Conclusions, and Limitations**

Data collected and analyzed during the fall and spring benchmarking periods did not show statistically significant differences in mean literacy gains between R.F. and non-RF schools. The authors observed that students in the three R.F. schools had a higher baseline literacy proficiency than those at the non-RF schools (though in all cases the effect sizes were small, between .096 and .140). Following benchmark testing in the spring, however, there was no statistically significant difference between R.F. and non-RF schools in the overall gains in reading skills observed among students. Results from the study, as the authors predicted, were consistent with the Reading First Impact Study (U.S. Dept. of Education, 2008) and similar small-scale studies on the effects of Reading First programming. Extended instructional time for reading, together with the effective use of para-professionals and reading interventionists, were factors at both the R.F. and non-RF schools in this study. Such factors, the authors concluded, are better predictors of literacy than participation in R.F. programming and the use of mandated SBRR instructional practices.

**Conclusion**

The quantitative study that Wong-Ratcliff and her colleagues undertook, investigating whether R.F. programming is correlated with increased mean gains in literacy measures when compared to non-RF schools, showed no statistical difference between the two. While their method was scientific, the authors' use of non-probability convenience sampling necessarily means the study was quasi-experimental. Quantitative designs such as this one, when used in educational research, give voice to numbers and test explanations for phenomena. Such studies are valuable not only for creating a greater understanding of social processes (such as literacy improvement), but also for lending support to important decision-making processes. In this case, schools in Louisiana have access to more information about the role their participation in Reading First may have on student literacy. The choice to participate in Reading First comes with restrictions on instructional practice and demands on the already limited time of classroom teachers and school administrators. What quantitative research studies such as this one provide are additional data points: statistically validated descriptions of phenomena and support in a decision-making process. While this study does not answer the question of whether a school in Louisiana should seek a grant award through Reading First, the research design and statistical procedures the authors followed mean it can and should inform the broader decision-making process of literacy programming.

**References**

Cohen, L., Manion, L., & Morrison, K. (2011). *Research methods in education* (7th ed.). Routledge.

Hoffman, A. R. (2009). Using DIBELS: A survey of purposes and practices. *Reading Psychology*, 30, 1–16.

Manzo, K. K. (2005). National clout of DIBELS test draws scrutiny. *Education Week*, 25(5), 1–12.

National Institute of Child Health and Human Development, NIH, DHHS. (2000). *Report of the National Reading Panel: Teaching children to read: Reports of the subgroups* (00-4754). U.S. Government Printing Office.

Schilling, S. G. (2007). Are fluency measures accurate predictors of reading achievement? *The Elementary School Journal*, 107(5), 429–447.

Wong-Ratcliff, M., Powell, S., & Holland, G. (2010). Effects of the Reading First program on acquisition of early literacy skills. *National Forum of Applied Educational Research Journal*, 23(3).

**The Use of Bloom's Taxonomy in Standards-based Planning and Outcomes**

While backward design provides a framework for creating a standards-based curriculum, where instruction and assessment are aligned to the broader goals of the student learning experience, Bloom's taxonomy provides a model for creating effective lesson plans and assessment prompts. Put simply, backward design is *what* the teacher will do in the classroom, while Bloom's taxonomy is *how* it will be done (Anderson et al., 2001).

Between 1949 and 1953, throughout a series of conference meetings of the American Psychological Association, Benjamin S. Bloom chaired a panel of experts examining the classification of educational goals as a means of improving university student assessment practices. The panel's proceedings were published in multiple volumes, the most famous of which is the 1956 *Taxonomy of Educational Objectives*. Bloom's prominent role as chair of that panel and lead author of the proceedings quickly gave rise to the eponym by which the framework is universally known to this day: Bloom's Taxonomy (Anderson et al., 2001). The taxonomy is organized hierarchically across three domains (cognitive, affective, and psychomotor), though the knowledge-based cognitive domain is undoubtedly the best known of the three. Notably, Bloom had little involvement in developing the hierarchy that describes the affective domain, and the psychomotor domain was not fully articulated until much later (see, for instance, Elizabeth Simpson's 1972 publication, *Educational Objectives in the Psychomotor Domain*).

The cognitive domain of Bloom's taxonomy organizes learning objectives within six levels: knowledge, comprehension, application, analysis, synthesis, and evaluation. The first three levels are strictly hierarchical in terms of learning difficulty, while the final three largely parallel one another – though they do describe and measure different cognitive abilities. In 2001, a revision of the taxonomy was published by a team of scholars, with input from Bloom prior to his death in 1999. The work carried the same title as the original 1956 proceedings, but was subtitled *A Revision of Bloom's Taxonomy of Educational Objectives* (Anderson et al., 2001). In this revision, the well-known pyramid model was reshaped to use verbs rather than nouns in its hierarchy, and the evaluation/evaluate category was shifted down one level (see figure 1).

**Figure 1.**

Bloom's Taxonomy in 1956 (nouns) and 2001 (verbs) (based on Anderson et al., 2001)

Bloom's taxonomy is useful in a standards-based curriculum because it supports the alignment of classroom objectives to the standards that have been established and to the assessments that will subsequently measure them. The cognitive levels within the taxonomy provide a model for targeted instruction and assessment. Take, for instance, the Colorado Academic Standard for Grade 8 world languages, W.L. N.M. 3.1: "Summarize information gathered from target language resources connected to other content areas" (Colorado Department of Education, n.d.). As it should be, the standard is a broadly stated, forward-looking goal. To meet it, both classroom instruction and assessments should engage abilities across the cognitive domain. The teacher should measure levels of achievement over time and provide actionable feedback to individual students about progress, increasingly drawing on higher-order cognitive abilities. While students initially engage the *remember*, *understand*, and *apply* levels of the taxonomy, as those skills and knowledge become intuitive, greater emphasis can be placed on the *analyze*, *evaluate*, and *create* levels. In this way, the backward design of Wiggins and McTighe (2005), which provides a natural framework for standards-based education, is built into a curriculum with day-to-day classroom objectives, ongoing formative assessments, and periodic summative assessments, all of which are in alignment with one another and with the outcomes of the course. If backward design provides the map for a curricular journey, Bloom's taxonomy fuels the vehicle of instruction and assessment that transports students along the path.

**References**:

Anderson, L. W., Krathwohl, D. R., & Bloom, B. S. (2001). *A taxonomy for learning, teaching, and assessing: A revision of Bloom's Taxonomy of educational objectives* (Complete ed.). Longman.

Colorado Department of Education (n.d.). *World languages academic standards*. https://www.cde.state.co.us/coworldlanguages/statestandards

Simpson, E. (1972). *The classification of educational objectives in the psychomotor domain*. Gryphon House.

Wiggins, G., & McTighe, J. (2005). *Understanding by design* (2nd ed.). Pearson. (Original work published 1998).

**Overview of Hypothesis Testing**

Statistics can be broadly divided into two categories: descriptive and inferential. Descriptive statistics use calculations and sorting methods to organize raw data and create meaningful information from it. Mean, median, and mode are perhaps the best-known examples of descriptive statistics. Inferential statistics, by comparison, allow for calculated conclusions to be drawn about relationships among data and between a study's dependent and independent variables (known respectively as the measured and explanatory or observed and manipulated variables) (Vogt & Johnson, 2015). The hypothesis test is one example of an inferential statistic. It is used to express the probabilistic significance of the relationship between a study's variables. Importantly, because a hypothesis test is probabilistic, it is not truly conclusive, which is to say, a hypothesis test is never certain nor free from the possibility of error (DeMoulin & Kritsonis, 2013). The test does, however, provide a statistical basis for a conclusion that the relationship between variables is not simply the result of mere chance.
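The descriptive side of this distinction can be made concrete with a short sketch. The example below, a minimal illustration using Python's standard `statistics` module and hypothetical test scores, computes the measures of central tendency and variability named above:

```python
# Descriptive statistics summarize raw data into meaningful information.
import statistics

scores = [72, 85, 85, 90, 61, 78, 85, 94]  # hypothetical raw test scores

mean_score = statistics.mean(scores)      # central tendency: mean
median_score = statistics.median(scores)  # central tendency: median
mode_score = statistics.mode(scores)      # central tendency: mode
spread = statistics.stdev(scores)         # variability: sample standard deviation
```

Inferential statistics would then go a step further, using such summaries to draw probabilistic conclusions about the population from which the scores were drawn.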

In the design of a quantitative study, a researcher identifies independent and dependent variables based upon a phenomenon of interest and a related research question. These variables are the basis of at least two hypotheses or explanations that a researcher will investigate. These hypotheses are expressed as a null and an alternative hypothesis, written mathematically as H0 and Ha. The H0 or null hypothesis is an explanation of a phenomenon that is the current state of thinking on the subject. It is, as it were, the nullifiable explanation that a researcher hopes to reject on the basis of empirical data collected during an experiment in which the independent variable receives some treatment (i.e., a cause) and impacts the dependent variable, which then undergoes a change (i.e., an effect) (Scott & Morrison, 2006). The Ha or alternative hypothesis is the belief a researcher holds as a possibly justifiable and more reasonable explanation for a phenomenon.

The data a researcher collects from an experiment are organized first into descriptive statistics (like the mean) and then subjected to inferential hypothesis testing; should the statistical calculations warrant rejecting the H0, the Ha will replace it as the better-justified explanation. This is sometimes referred to as the promotion of the alternative hypothesis (Vogt & Johnson, 2015). Again, because probability is the mathematical basis of statistics, hypothesis testing neither proves nor disproves either the null or the alternative hypothesis. Research is, in part, a process of identifying and organizing evidence in support of a claim. That evidence provides support for one explanation over another, but does not prove one explanation to be true or certain. The purpose of this post is to explore research design and the generation of a null and alternative hypothesis in the context of a hypothetical educational study on two areas of needed change (attendance and graduation rates) within my local school district. Hypothesis testing, while foundational to quantitative research, depends intrinsically on the initial identification of a null and alternative hypothesis. By the end of this post, you should understand what a hypothesis is in quantitative research and how it interacts with the other elements of research design.

**Identifying H0 and Ha**

Colorado’s Department of Education (CDE) provides detailed statistics about stakeholders who participate in the state's publicly-funded system of education. These statistics are largely descriptive in nature and furnish information in the form of aggregated, statewide data, as well as district and school-level data. The CDE website allows easy access to this data for public informational purposes and school accountability, as well as for research purposes. The General Assembly of the state of Colorado mandates the collection and public sharing of this educational data.

To create a research context for this paper, I used the CDE's Education Statistics website (__www.cde.state.co.us/cdereval__) with graduation and attendance as my subjects of interest. Specifically, the question I had in mind was whether there is a relationship between these two areas of education in my school district and how I might develop a quantitative study to investigate that relationship with a view to improving both. My belief was that there is a correlation between attendance and high school graduation. My further belief was that, while both are areas for improvement in the state of Colorado, improving attendance would also improve graduation results.

In a recent study by the U.S. Department of Education, attendance was listed among the most important predictors of student academic success and graduation (Faria et al., 2017). Given the importance of attendance to a successful school experience, collecting data on the factors that impact chronic absenteeism (which the CDE defines as missing 15 or more days of school in an academic year) would be a warranted pre-study step. Similar studies, like one conducted in the Sacramento City Unified School District by UC Davis, indicated that physical health, caregiver discretion, and transportation are the primary reasons leading to absenteeism (Erbstein, 2014). What I would propose in this hypothetical research study for the school district in central Colorado where I work is data collection to establish the primary reasons for absenteeism and an experiment to investigate whether directly addressing one of those reasons and improving student attendance also results in a statistically significant increase in graduation rates.

Bear in mind, the purpose of this post is to demonstrate in the abstract how to establish a hypothesis for a research study - it will be far more vague than an actual study would need to be.

**Quantitative Research Design to Address Change**

The research question: By using resources to address one of the primary factors that lead to chronic absenteeism in a school district in central Colorado, is there a significant, correlated increase in student high school graduation rates in that district?

The dependent variable: High school graduation rates in the given school district. This is the measured or observed variable of the study.

The independent variable: a factor of chronic school absenteeism (to be identified through pre-study data collection, but predicted to be either caregiver discretion or transportation). This is the explanatory and manipulated variable.

Study design: Quantitative, quasi-experimental, and longitudinal. (An experimental design requires true-random selection and assignment of the subjects that are then exposed to an experimental treatment; this study will use convenience sampling).

Sample selection: convenience-type, based upon a known population of students designated as chronically absent (missing 15 or more days in a single school year). Given the sample of students, it would not be appropriate to create a randomly assigned group that receives the experimental treatment and one that does not. Because the treatment is believed to provide a significant advantage to students in outcomes that could have life-long social and economic implications, the use of a true-random method of assignment in this research design is not ethical (Israel & Hay, 2011). To address the research question, it would be necessary to measure the impact of the change in a factor of chronic absenteeism (the independent variable) over multiple school years in order to measure any change in graduation rates (the dependent variable). As a result, this hypothetical study is longitudinal in its design. A longitudinal design will gather multiple observations over multiple school years about graduation rates for students identified as chronically absent and receiving resources to address that absenteeism. This design differs from a cross-sectional design, which provides only a snapshot of the possible relationship between variables of interest at one moment in time.

Null hypothesis: addressing the identified factor impacting attendance produces no statistically significant change in student graduation rates.

Alternative hypothesis: By addressing an identified factor impacting attendance, there will be a statistically significant increase in high school graduation rates. Based upon CDE statistics from 2020, the current graduation rate for students in the district where the study would be conducted is 81.9% (Colorado Department of Education, n.d.). A two-tailed test would be used for the study’s hypothesis testing. As such, the null and alternative hypotheses can be written mathematically as H0: μ = 81.9% and Ha: μ ≠ 81.9%.
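As a hedged illustration only, the two-tailed test implied by H0: μ = 81.9% and Ha: μ ≠ 81.9% can be sketched in Python. The sample size and observed rate below are hypothetical placeholders, not CDE data:

```python
# Minimal sketch of a two-tailed, one-sample z-test for a proportion.
from math import sqrt
from statistics import NormalDist

p0 = 0.819     # graduation rate under H0 (CDE, 2020)
n = 200        # hypothetical sample size
p_hat = 0.87   # hypothetical graduation rate observed after treatment
alpha = 0.05   # two-tailed significance level

se = sqrt(p0 * (1 - p0) / n)                  # standard error under H0
z = (p_hat - p0) / se                          # test statistic
p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-tailed p-value

reject_h0 = p_value < alpha
```

In this hypothetical run the p-value exceeds 0.05, so the test fails to reject H0; a different sample could, of course, yield the opposite result.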

**Conclusion**

Hypothesis testing is a tool of inferential statistics used in quantitative research to draw conclusions about data and the significance of relationships between variables under investigation. As part of the initial design phase of a research project, the identification of the variables to be used in the study and articulation of a null and alternative hypothesis create the foundation for subsequent experimentation, data collection, analysis, and conclusions. These, the elements of a research study, are interconnected to such an extent that when any one of them is not rigorously planned and executed, the validity and reliability of the study as a whole are diminished or lost. Because the null and alternative hypotheses are extensions of the research question that drives the project in the first place, they are established at the outset of the design phase. Sampling and data collection depend upon the identification of the dependent and independent variables, whose relationship is the basis of the null and alternative hypothesis for the study. The analysis that stems from data collection is similarly a function of the results of the statistical testing of those hypotheses. And finally, whatever the end results of the hypothesis testing suggest, the conclusions a researcher draws for the study are a direct expression of them – of either rejection of the null hypothesis or a failure to reject it. In this way, not only is hypothesis testing foundational to quantitative research, but it is also arguably the cornerstone of a project's very success.

**References**

Colorado Department of Education (n.d.). *Statewide general statistics*. https://www.cde.state.co.us/cdereval/general

DeMoulin, D. F., & Kritsonis, W. A. (2013). *A statistical journey: Taming of the skew!* (2nd ed.). The AlexisAustin Group.

Erbstein, N. (2014). Factors influencing school attendance for chronically absent students in the Sacramento City Unified School District. *Brief Series*. UC Davis.

Faria, A.-M., Sorensen, N., Heppen, J., Bowdon, J., Taylor, S., Eisner, R., & Foster, S. (2017). *Getting students on track for graduation: Impacts of the Early Warning Intervention and Monitoring System after one year* (REL 2017–272). U.S. Department of Education, Institute of Education Sciences, National Center for Education Evaluation and Regional Assistance, Regional Educational Laboratory Midwest. http://ies.ed.gov/ncee/edlabs

Israel, M., & Hay, L. (2011). *Research ethics for social scientists*. Sage.

Scott, D., & Morrison, M. (2006). *Key ideas in educational research*. Continuum.

Vogt, W. P., & Johnson, R. B. (2015). *The SAGE dictionary of statistics and methodology* (5th ed.). SAGE Publications, Inc.

Data of the kind typically used in quantitative research fall within a symmetrical, normally distributed curve. Examples of such distributions are the z-distribution (the standard normal distribution, based on the standard deviation of a population) and the t-distribution (based on the standard deviation of a sample). The tails are the regions at either end of these distributions. Not all distributions have two tails, however, as is the case with certain asymmetrical types like the F-distribution (associated with the analysis of variance and regression) (Vogt & Johnson, 2015).

Both one- and two-tailed tests generate a test statistic that compares a sample mean with a population mean; that statistic is then evaluated against an established critical region or regions. The purpose of the comparison is inferential and allows the researcher to determine whether data from experimental observations support a claim (DeMoulin & Kritsonis, 2013). In other words, one- and two-tailed tests are differing approaches to hypothesis testing as a researcher attempts to explain a phenomenon of interest either by rejecting or failing to reject a null hypothesis (H0). In general, in a research study, the researcher hopes to reject the H0 in favor of the alternative hypothesis or Ha. This is because the H0 represents the current state of knowledge about a phenomenon, and the Ha is the claim a researcher believes to explain it better, more accurately, and more justifiably.

Because not all data distributions are two-tailed, not all research designs will require the designation of a one-tailed versus a two-tailed test. With any statistical procedure planned for a research study with normally distributed data, however, determining which type of test will be used should happen in the initial design phase of the project (Scott & Morrison, 2006).

Assuming normally distributed data, a one-tailed test examines the critical region on only one side of a symmetrical distribution, so the statistical procedure employed can suggest whether a sample mean is higher *or* lower than a population mean (but not both). For this reason, a one-tailed test is also referred to as a directional test (Vogt & Johnson, 2015). The researcher must decide which of the two relationships or tails to consider, because a one-tailed test evaluates the relationship of a sample test statistic in one direction only and gives no consideration to the opposite tail of the distribution. Figure 1 below shows how this relationship is expressed through the H0 and Ha at the beginning of the research project.

Given the level of significance or alpha that a researcher is willing to accept (usually 0.05 or 5%, but as high as 0.10 and as low as 0.01 or lower), the value of a one-tailed test is its increased power. Because the researcher is examining only one direction of a possible relationship, the significance value is not divided. The power of the one-tailed test is thus greater than the power of the two-tailed test.

A two-tailed test establishes a critical region at either end of a normal distribution – it is, for this reason, often described as a non-directional test (Vogt & Johnson, 2015). As with the one-tailed test, a researcher determines the significance value (alpha level) in advance of the test (again, typically 0.05 or 5%) and then applies the appropriate statistical procedure (for example, a one-sample z-test or t-test). Because both ends of the distribution are to be examined, the alpha is divided between the two tails (0.025 in each). This divided alpha also means the power of the two-tailed test is lower than that of the one-tailed test. Should the test statistic fall within the critical region at either end of the distribution, the H0 will be rejected.

One-tailed hypothesis tests can show a sample mean is higher or lower than the population mean. They consider whether a sample test statistic (a critical value) falls within the critical region of one side of a distribution only. If a tested sample falls into that critical region, the researcher rejects the H0 for the Ha (DeMoulin & Kritsonis, 2013). By comparison, a two-tailed test examines a range of values and considers whether an effect is evident at either of the two ends of a normal distribution. Using a one-tailed test, a researcher can only infer whether the test statistic falls within the rejection region that is either greater than or less than the calculated critical value. A two-tailed test, while less powerful, uses a range of values that include both sides (i.e., both the positive and negative relationships) in a probability distribution.
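The contrast can be sketched numerically. Assuming normally distributed data and a hypothetical test statistic, the one-tailed p-value is exactly half the two-tailed p-value:

```python
# Sketch comparing one- and two-tailed p-values for the same statistic.
from statistics import NormalDist

z = 1.80  # hypothetical test statistic

p_one_tailed = 1 - NormalDist().cdf(z)             # upper tail only (directional)
p_two_tailed = 2 * (1 - NormalDist().cdf(abs(z)))  # both tails (non-directional)

# At alpha = 0.05, this hypothetical z is significant under the
# one-tailed test but not under the two-tailed test.
```

This is the power difference described above: the same evidence that rejects H0 in a directional test can fail to reject it once the alpha is split between the two tails.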

A one-tailed test has more power than a two-tailed test because the entire alpha is applied to one relationship in a data distribution, but this increased power comes at a cost. The one-tailed test, by its nature, disregards half of the distribution. The cost of this added power is so great that, unless a research basis exists specifically to use a one-tailed test, the two-tailed test is the default approach for hypothesis testing (Cohen et al., 2018). The one-tailed test is appropriate in experimental scenarios only when the researcher needs to understand just one side of a relationship, or where ignoring the other side of the relationship would not be unethical. Take, for example, a study of the efficacy of a novel literacy program being piloted in a school district. Should the research question be limited simply to whether the new program is significantly less effective than the program currently in use, a one-tailed test would be appropriate. Because, however, the one-tailed test is directional, it would *not* allow any inference as to whether the new program is significantly more effective. Moreover, a one-tailed test should not be chosen merely to achieve significance, nor should such a test be applied to data after a pre-determined two-tailed statistical test has failed. Validity (in this case, the relevance and accuracy of the statistical procedure to the data) and reliability (the consistency and replicability of the procedure) depend upon appropriate use of one- and two-tailed tests (Scott & Morrison, 2006).

**Figure 1. **Comparison of the null and alternative hypothesis in a one-tailed and two-tailed test.

**References**:

Cohen, L., Manion, L., & Morrison, K. (2018). *Research methods in education* (8th ed.). Routledge.

DeMoulin, D. F., & Kritsonis, W. A. (2013). *A statistical journey: Taming of the skew!* (2nd ed.). The AlexisAustin Group.

Scott, D., & Morrison, M. (2006). *Key ideas in educational research*. Continuum.

Vogt, W. P., & Johnson, R. B. (2015). *The SAGE dictionary of statistics and methodology* (5th ed.). SAGE Publications, Inc.

In research, a population represents the entirety of a group of items or subjects that are of interest to a researcher. Because populations can be too large for study or inaccessible as a whole to a researcher, a smaller subset of that population, known as a sample, is often used for observation and experimentation (Vogt & Johnson, 2015). The use of the data that a researcher collects from the sample is dependent upon the process used to select it. Generally, the selection process is referred to as either probabilistic (e.g., random sampling) or non-probabilistic (e.g., convenience sampling). When a value describes a population, it is known as a parameter; when it describes a sample, it is known as a statistic. Similarly, when the number of subjects or items reflects a population, the abbreviation used is 'N' (uppercase), and when the number reflects a sample, the abbreviation used is 'n' (lowercase).

A sample is considered both random and representative of a population when each item or subject has an equal probability of selection and when the probability for selecting any item or subject is independent. When sample selection meets these criteria, and the data are numerical with either interval or ratio strength, a researcher can validly and reliably apply parametric statistical procedures and generalize results to a population (Scott & Morrison, 2006).

Random assignment, while similar to random selection, represents a subsequent process in a research design. While random selection is the method a researcher uses to create a sample in the first place, random assignment is the method of assigning individual items or subjects from a sample to the experimental treatment or control groups of interest (Scott & Morrison, 2006). Just as sample selection falls into the general categories of probabilistic and non-probabilistic, assignment also takes one of two forms, being either simple or matched (Vogt & Johnson, 2015). With simple assignment, the items or subjects of the sample are independently distributed among treatment groups; with matched assignment, they are paired on the basis of traits or attributes held in common (like gender or age). Matched assignment still assumes random selection, but it controls variables that might otherwise become confounders.
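The difference between the two forms of assignment can be shown in a minimal sketch; the subjects, their ages, and the group sizes below are hypothetical:

```python
# Sketch of simple vs. matched random assignment.
import random

random.seed(0)  # fixed seed for a reproducible illustration

subjects = [
    {"id": 1, "age": 14}, {"id": 2, "age": 14},
    {"id": 3, "age": 15}, {"id": 4, "age": 15},
    {"id": 5, "age": 16}, {"id": 6, "age": 16},
]

# Simple assignment: shuffle independently, then split in half.
pool = subjects[:]
random.shuffle(pool)
simple_treatment, simple_control = pool[:3], pool[3:]

# Matched assignment: pair subjects on a shared trait (here, age),
# then randomly send one member of each pair to each group.
matched_treatment, matched_control = [], []
by_age = {}
for s in subjects:
    by_age.setdefault(s["age"], []).append(s)
for pair in by_age.values():
    random.shuffle(pair)
    matched_treatment.append(pair[0])
    matched_control.append(pair[1])
```

Note how the matched approach guarantees that every age appears once in each group, controlling a variable that simple assignment leaves to chance.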

In sum, random selection creates a sample that represents a population and whose data can be analyzed statistically and generalized to a population. Random assignment distributes the items or subjects of that sample among the experimental treatments a researcher is studying. That said, the answer to the question of whether one is more important than the other, in my view, is dependent upon the needs of the research design. Moreover, both random selection and random assignment can be necessary preconditions of validity and reliability. In such a case, to prioritize one over the other would result in a research study of little or no value.

True experimental design requires that both the sample selection and assignment processes are random (Vogt & Johnson, 2015). If the goal is to generalize to a population, random selection and random assignment are arguably of equal importance. When one or the other is not random, the research design becomes quasi-experimental, and the researcher is no longer able to generalize results to a population (Scott & Morrison, 2006). This is not to say that such research cannot contribute to a growing body of evidence in support of a theory; only that, from a statistical perspective, a true random sample and true random assignment are stringent assumptions of the generalizability of data from a sample to a population. Moreover, a completely randomized design is often not possible or even necessarily preferable. Variables that neither the independent (i.e., explanatory) nor dependent (i.e., response) variables capture may have a confounding effect on the study, and a researcher can control such lurking variables (often demographic in nature) through a matched approach to assignment. Peetsma et al.'s 2001 paper on inclusion in education is one such example. The study used a matched pairs design to increase internal validity and align its results to the factors of mainstream versus special education in psychological development. Similarly, in the sample selection process, constraints related to target population accessibility may mean that non-probabilistic, convenience sampling is required. Consider Wong-Ratcliff et al.'s 2010 paper on the effects of the federal Reading First program on the development of literacy as an example. Those authors, working with rural school populations in Louisiana, used a non-probabilistic convenience design to identify grade one students at five schools, three where Reading First programming was in place and two where it was not.

The question of whether random selection or random assignment is more important to the generalizability of results for a study may belie the complexity of research design more broadly. The accessibility of the target population and the need to generate a representative sample can be independent factors for consideration as a researcher undertakes to design a study. A quasi-experimental design, where one or both of the elements of sample selection and assignment may not be true-random, will still allow forms of statistical analysis and hypothesis testing. In situations where the researcher has prioritized generalizability, maintaining a true-random sample and true-random assignment should be of equal concern.

**References**:

Peetsma, T., Vergeer, M., Roeleveld, J., & Karsten, S. (2001). Inclusion in education: Comparing pupils' development in special and regular education. *Educational Review*, 53(2), 125-135.

Scott, D., & Morrison, M. (2006). *Key ideas in educational research*. Continuum.

Vogt, W. P., & Johnson, R. B. (2015). *The SAGE dictionary of statistics and methodology* (5th ed.). SAGE Publications, Inc.

Wong-Ratcliff, M., Powell, S., & Holland, G. (2010). Effects of the Reading First program on acquisition of early literacy skills. *National Forum of Applied Educational Research Journal*, 23(3).

In designing a quantitative research project where statistical analysis will provide the basis for inferences made about the data collected, a reliable and valid method of sample selection is vital. As part of the initial design process, a researcher will determine the independent and dependent variables of the study (that is, the experimental and response variables) and then identify a target population. From that target population, observations and experimentation will yield data that, through the application of appropriate statistical procedures, will either warrant rejecting the null hypothesis or fail to do so. For this reason, while every element in the design of a research study contributes to its success, none of them can compensate for the damage done by bad data obtained from erroneous or improper sample selection (Scott & Morrison, 2006). With that said, let's explore random sampling as a foundational component of the design process in quantitative research within the context of a hypothetical study that would infer the likelihood of the passage of an educational bond referendum.

A population is some whole and distinct group of items or subjects from which a sample is drawn (DeMoulin & Kritsonis, 2013). The population pool may be as large as a city or even a nation or as small as a classroom of students. It may be composed of people or items, though in the social sciences, the former almost universally constitutes the populations of interest (Eldredge et al., 2014). The deciding factor for identifying a population is typically a characteristic the items or subjects have in common and the concern a researcher has for studying that characteristic. In addition to identification, researchers must further determine the accessibility of any target population. While the accessible population may be the same as the target population, it may not be. In the theoretical education bond study for this paper, the target population is all eligible members of a voting district; this is also the accessible population (available through census and voter registration records). Suppose, however, that one were investigating student demographics and academic performance within a district, but one or more of its school principals declined to participate. The result would be an accessible population smaller and different from the target population of all students in a given district (DeMoulin & Kritsonis, 2013).

Because the study of a whole population can be prohibitive to a researcher’s work, either because of funding and resource limitations or simple impracticality, data are typically collected from samples (Vogt & Johnson, 2015). A sample is a subset of a population. For a quantitative study, where a researcher will apply statistical analysis to the data, this sample is assumed to be random. True-random samples are representative of a population and allow for inferences to be generalized from the sample to the population. Those inferences should not be viewed as truth claims, however; instead, they are probabilistic descriptions, correlations, and explanations of phenomena observed within the sample. For a sample to be true-random, the method of selection must meet two criteria. First, that each item or subject within the sample has an equal probability of being drawn; and second, that each item or subject is drawn independently and not affected by the drawing of any other one (Suhonen et al., 2015). When created in this way, a sample will be representative of the population, and a researcher may subject the empirical data collected from it to parametric statistical procedures.

In practice, to create the actual sample from a population, a researcher needs first to create a sampling frame. This is a list, typically in the form of a database, that contains all members of the target population. Even a basic spreadsheet application like Excel can assign identification numbers at random to all items or subjects within a population list. The researcher need only provide the required sample size (*n*) and use the software to apply random number generation to the set based on the total population (*N*). If a sample size of 500 is required, one simply draws the randomly assigned numbers one through five hundred from the total population database. Note, however, that just as there is an important distinction between the target population and accessible population that will determine the true number or *N* of a study, there is similarly an important distinction between sample selection and the actual sample number or *n*. The actual sample number is the number of items or subjects from which data will have successfully been collected. Even if, in the educational bond study, a true-random sample of 500 subjects were drawn, should only 430 of them provide data for the study, the actual sample would be *n* = 430 and not *n* = 500.
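The same frame-and-draw process can be sketched outside of a spreadsheet. Below is a minimal Python sketch; the voter IDs and population size are hypothetical, invented purely for illustration:

```python
import random

def draw_simple_random_sample(frame, n, seed=None):
    """Draw n members from a sampling frame without replacement.

    Each member of the frame has an equal probability of selection, and
    each draw is independent of the others -- the two criteria for a
    true-random sample.
    """
    rng = random.Random(seed)
    return rng.sample(frame, n)

# Hypothetical sampling frame: IDs for the whole accessible population.
population = [f"voter-{i:05d}" for i in range(12_000)]  # N = 12,000
sample = draw_simple_random_sample(population, n=500, seed=42)
# len(sample) == 500; the actual n may be smaller once non-responses
# are removed (e.g., n = 430 if only 430 subjects provide data).
```

The selection of 500 here corresponds to drawing the randomly assigned numbers one through five hundred from the database, as described above.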

**Generating a Stratified Random Sample**

Stratified sampling is a method of random sample selection that first divides a population into smaller subsets referred to as strata (Scott & Morrison, 2006). Within each stratum, a researcher can draw an individual, random sample. The strata, which a researcher identifies in advance, create sub-categories for the data that allow statistical inferences within those specific categories. These sub-categories express within-group homogeneity and between-group heterogeneity and are identified to increase the precision of sample representation. Such sampling may also contribute to the study's internal validity by identifying and deliberately controlling for potential confounding variables (Vogt & Johnson, 2015). In the social sciences, these categories typically align with demographics (age, race, gender, sex, SES, etc.) or psychological features (facets of child development, personality, and mental health). Random number generation as described above, using Excel or a similar software package, is again used in the creation of a sample, though this process assumes that the categories used in the strata are available to the researcher in database form in advance.
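A stratified draw amounts to grouping the frame by a researcher-defined category and then sampling independently within each group. The sketch below is hypothetical: the parental-status stratum and the per-stratum sample sizes are invented for illustration:

```python
import random
from collections import defaultdict

def stratified_sample(frame, stratum_of, n_per_stratum, seed=None):
    """Draw an independent simple random sample within each stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for member in frame:
        # Group the frame by the researcher-defined category
        # (within-group homogeneity).
        strata[stratum_of(member)].append(member)
    return {name: rng.sample(members, n_per_stratum)
            for name, members in strata.items()}

# Hypothetical frame: (voter_id, parental_status) pairs.
frame = [(i, "public-school parent" if i % 3 == 0 else "no children enrolled")
         for i in range(9_000)]
sample = stratified_sample(frame, stratum_of=lambda m: m[1],
                           n_per_stratum=250, seed=7)
```

Because each stratum is sampled separately, inferences can later be made within each category as well as across the whole sample.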

To study the question of whether a population is likely or not to support the passage of an educational bond, stratification related to demographics could inform decision-making for canvassing and other informational campaigns. Data collected on the basis of categories such as gender, age, and parental status (which is to say, whether or not the respondents have children, and whether those children attend a public school in the voting district) would enhance understanding not only of whether the bond might pass, but also of which categories of individuals are likely to support it and which are not. Parental status is a particularly useful category in this instance, in my view, because I suspect that the likelihood of supporting or not supporting a bond will correlate with whether households have children and whether those children attend public schools in the voting district. Knowledge of correlations between categories of data would allow door-to-door canvassers and informational campaigns (local television commercials and brochures) to target specific types of individuals with likely known viewpoints. It may become clear from the study that individuals without children currently attending a public school in the district, for instance, are unlikely to support the bond's passage. Such information is valuable in generating an argument either for or against the bond measure, depending on who the study's information users are. Whoever the information users, though, a stratified sample will allow canvassers and campaign communications to target their arguments more effectively to their audiences.

**Generating a Cluster Random Sample**

Cluster sampling, like stratified sampling, is a random form of sample selection in which a researcher divides a population into smaller groupings referred to as clusters (Scott & Morrison, 2006). Individual samples are randomly drawn from each cluster. Like stratified sampling, cluster sampling is a probabilistic and random method, but it expresses between-group homogeneity and within-group heterogeneity. As discussed above, the purpose of stratified sampling is to identify categories like demographics explicitly and organize sample selection using population groups defined by these categories; hence, the within-group homogeneity. Cluster sampling, by comparison, does not use researcher-defined categories, but instead divides a population by natural features like geography. Within a voting district, clusters are inherently created by street intersections and neighborhoods. For this reason, there is within-group heterogeneity (that is, demographic traits like age and parental status have not been identified), but between-group homogeneity (that is, each cluster is equally a neighborhood). The primary reasons that a researcher would use cluster sampling are to reduce the overall cost of a study and increase efficiency. Once again, population lists are a prerequisite for random selection of the sample using a statistical software package.
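One common variant, one-stage cluster sampling, selects whole clusters at random and surveys every member of each chosen cluster. The sketch below is hypothetical; the neighborhood names, cluster count, and cluster sizes are invented for illustration:

```python
import random

def one_stage_cluster_sample(clusters, n_clusters, seed=None):
    """Select whole clusters at random; every member of a chosen
    cluster joins the sample.

    The cluster (e.g., a neighborhood), not the individual, is the
    unit of selection, which is what cuts travel and administration
    costs.
    """
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: clusters[name] for name in chosen}

# Hypothetical clusters: neighborhoods mapped to resident voter IDs.
clusters = {f"neighborhood-{k:02d}": [f"voter-{k:02d}-{i}" for i in range(40)]
            for k in range(25)}
sample = one_stage_cluster_sample(clusters, n_clusters=5, seed=3)
```

A two-stage variant would instead draw a further random sample within each chosen cluster, trading some precision for even lower cost.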

**Conclusion**

Fundamental to the success of a quantitative research study is the method used for sample selection. The data for a study come from the observations of and treatments applied to a sample. Should that sample not have been selected correctly, the data coming from it will be unreliable and invalid. Such data cannot produce results that allow meaningful inferences about a phenomenon of interest to a researcher and cannot be generalized to a population. In other words, if the sample is poor, the data will be poor, and the study will produce erroneous or, at best, low-quality results.

The process of sample selection is a necessary early step in the design of a quantitative research study. To apply parametric statistical procedures to data and make generalizations from the sample to the population requires that the sample selection be true-random and, thus, representative. This means that each item or subject within the population has an equal probability of selection and that each one is also independent of the others. In the social sciences, stratified and cluster sampling are common methods of sample selection that allow a researcher, based on constraints of time and funding, either to increase representation while decreasing potential confounders (as with stratification) or to reduce costs while improving efficiency (as with clustering). Because there is no remedy in the design of a research study for the invalid and unreliable results that bad data produce, a systematic approach to sampling should include the following steps:

Identification of a target population to address the variables of the research question.

A decision on the needed sample size which, in connection with effect size and *p*-value, will determine the power of the study as a whole. (Notably, sample size is the most commonly manipulated component of a study's power, as effect size and *p*-value are typically designated in advance and used to determine the required sample size.)

Based upon the identified sampling method (for instance, simple random, random clustering, or random stratification), a statistical software package is used to generate the random sample from a population database.

The researcher proceeds with data collection.
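The sample-size step above can be sketched for the common case of estimating a proportion, such as the share of voters supporting the bond. The 95% confidence level (z = 1.96) and five-point margin of error below are illustrative choices, not prescriptions:

```python
import math

def sample_size_for_proportion(z, margin, p=0.5):
    """Minimum n to estimate a population proportion within +/- margin.

    p = 0.5 is the conservative worst case when the true proportion
    (e.g., of bond supporters) is unknown in advance.
    """
    return math.ceil(z ** 2 * p * (1 - p) / margin ** 2)

# 95% confidence (z = 1.96) with a +/- 5-point margin of error:
n_required = sample_size_for_proportion(z=1.96, margin=0.05)  # 385
```

Tightening the margin to three points raises the requirement past one thousand, which is one reason sample size is the component of power most often adjusted in practice.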

I believe that whatever the program of study and college or university in which a doctoral student finds themselves, the responsibility for becoming a capable and effective researcher ultimately lies with the student. While professors and mentors play an enormous role in providing guidance, sharing relevant experience, and modeling expectations, the yeoman's work falls to the student. Ensuring work is at a level appropriate to a doctoral program and eventually to peer review and publication, as Moussa-Inaty (2015) has argued, comes from a careful reading of the scholarship in a given field, a willingness to accept and make adjustments for the constructive criticism of peers and mentors and, most importantly, continual practice. The more I read and reflect on what I have read, the more I learn about writing well and developing my own style. In short, I think students learn to work at a doctoral level by observing other researchers with a goal of emulating, adapting, and integrating best practices. This takes me to the often-posed question of quantitative versus qualitative research.

**Quantitative Research: Characteristics, Advantages, Limitations**

In education, quantitative research represents an approach to investigation and knowledge creation that prioritizes the processes and assumptions of the scientific method (Scott & Morrison, 2006). Fundamentally, this means that the hypotheses or explanations articulated within a quantitative research design must be testable, measurable, and falsifiable (Popper, 2004). Characteristics of quantitative research include efforts by the researcher to remain objective, the use of a pre-established and fixed research design, the use of large, random, and representative samples, the application of statistical procedures to empirical data, and an orientation towards predicted outcomes. The advantages of quantitative research designs include the ability to generalize results from a sample to a population, the ability to calculate factors such as reliability and validity, and a systematic focus on specific, testable, and confirmable hypotheses and correlations (Hoy, 2010). Adler (1996, cited in Frels & Onwuegbuzie, 2013) characterized quantitative research by the kinds of questions it can address: namely, the *who*, *where*, *how many*, and *how much* of research investigation. It follows from this that answering the questions of *how* and *why* constitutes the primary limitation of quantitative design.

**Qualitative Research: Characteristics, Advantages, Limitations**

Qualitative research, by comparison, approaches investigation and knowledge creation through experience, participation, description, and interpretation (Scott & Morrison, 2006). Where in a quantitative study the researcher can be pictured standing apart from the subject of interest and observing dispassionately, in a qualitative study the researcher stands with the subject of interest and shares in the experience. While quantitative research is concerned with the specifics of a phenomenon, qualitative research is concerned more broadly with its context. A qualitative design aims to allow the researcher to describe and interpret a phenomenon with limited if any controls placed on it (Lichtman, 2010). Because the data are typically nominal or ordinal in their strength, generalization from the sample variables to a larger population is rarely undertaken. Qualitative data express themselves as words and not numbers. The rich data produced from a study's multi-collection techniques, even with a relatively small sample, often outweigh the time-intensive and high-cost requirements that accompany qualitative research (Lichtman, 2010). Again citing Adler (1996), while answering the questions *how many* and *how much* are not strengths of qualitative research, it does thrive in the question realms of *how* and *why*.

**Similarities: Why the Line between the Two Designs Can be Unclear**

From the perspective of paradigms (that is, the set of beliefs, methodologies, and assumptions adopted by a researcher), there appears to be a clear division between qualitative and quantitative approaches to research (Arghode, 2012). However, that division can be overstated. As Venkatesh et al. (2013) have pointed out, both methods can and do involve taking measurements, using numerical data, and employing statistical procedures. Moreover, the descriptive and exploratory emphasis that researchers often attribute to qualitative methods is equally possible and valid in a quantitative study. Arguably, as well, both methods are inductive, which is to say, both methods typically rely on a set of observations or measurements and apply the logic that the future will resemble the past (Hume, 1993). While Popper (1959) was successful in shifting scientific inquiry away from verificationism (Ayer, 1952) and ushering in a postpositivist era, the argument that science is strictly deductive or even hypothetico-deductive has not held up to scrutiny, as Staddon (2018) outlines in a detailed discussion of the matter. In the final analysis, while quantitative and qualitative methods easily fall into categories like hard and soft (Platt, 1964) or scientific and humanistic (Snow, 2013), such a division may not only misrepresent the reality of similarities between the two methods, it may also create a division between two cultures which, in the words of C.P. Snow (2013), only handicaps the ability of either method to address important problems.

Table 1: Qualitative and Quantitative Research Comparison (based on Arghode, 2012; Lichtman, 2010; Hoy, 2010)

We learn by observing, adapting, and integrating the processes of others. In education and the social sciences, knowledge and justified belief, whether they be an aspirational knowledge of truth (what Russell (1997) characterized as *idealism*), or a descriptive form of knowledge (what Russell (1997) termed *knowledge of things*), are the result of a process of refinement, experimentation, testing, and evaluation. While quantitative methods of research and investigation offer positivist/postpositivist (i.e., empirical and measured) objective understanding of phenomena with a view to potential cause and effect relationships, qualitative methods offer phenomenological (i.e., descriptive) understanding of subjective and participatory experience (Arghode, 2012). Because the realms of inquiry of the two methods of research differ, the knowledge that comes from them will also differ. As Popper (2004) has famously stated, “The game of science is, in principle, without end. He who decides one day that scientific statements do not call for any further test, and that they can be regarded as finally verified, retires from the game.” Given that the construction and refinement of knowledge are not *per se* finite, the greater the understanding that a researcher has of the variety of methods allowing the generation of new knowledge and improved understanding, the more effective that researcher can be. For this reason, the question of quantitative versus qualitative approaches to research is a red herring, at least when researchers understand the limitations of each.

**References:**

Adler, L. (1996). Qualitative research of legal issues. In D. Shimmel (Ed.), *Research that makes a difference: Complementary methods for examining legal issues in education*. National Organization on Legal Problems of Education.

Arghode, V. (2012). Qualitative and quantitative research: Paradigmatic differences. *Global Education Journal*, *2012*(4), 155–163.

Ayer, A. J. (1952). *Language, truth, and logic*. Dover. (Original work published 1936)

Frels, R. K., & Onwuegbuzie, A. J. (2013). Administering quantitative instruments with qualitative interviews: A mixed research approach. *Journal of Counseling and Development*, *91*, 184–194.

Hoy, W. K. (2010). *Quantitative research in education: A primer*. Sage.

Hume, D. (1993). *An enquiry concerning human understanding: With Hume's abstract of A treatise of human nature and A letter from a gentleman to his friend in Edinburgh*. Hackett. (Original work published 1748)

Lichtman, M. (2010). *Qualitative research in education: A user's guide*. Sage.

Moussa-Inaty, J. (2015). Reflective writing through the use of guided questions. *International Journal of Teaching and Learning in Higher Education*, *27*(1), 104–113.

Platt, J. R. (1964). Strong inference. *Science*, *146*(3642), 347–353.

Popper, K. (2004). *The logic of scientific discovery*. Routledge Classics. (Original work published 1959)

Russell, B. (1997). *The problems of philosophy* (2nd ed.). Oxford University Press. (Original work published 1912)

Scott, D., & Morrison, M. (2006). *Key ideas in educational research*. Continuum.

Snow, C. P. (2013). *The two cultures and the scientific revolution*. Martino Fine Books. (Original work published 1959)

Staddon, J. (2018). *Scientific method: How science works, fails to work, and pretends to work*. Routledge.

Venkatesh, V., Brown, S. A., & Bala, H. (2013). Bridging the qualitative-quantitative divide: Guidelines for conducting mixed methods research in information systems. *MIS Quarterly*, *37*(1), 21–54.

A research team of three scholars, Wong-Ratcliff, Powell, and Holland, investigated the effects of the Reading First program on students in grade one in rural school districts in Louisiana in 2010. Reading First (R.F.) is a federal education program described within the Elementary and Secondary Education Act (ESEA) that provides funding to Title I schools for literacy improvement. In receiving this funding, though, schools are required to comply with scientifically based reading research (SBRR) practices. This requirement has led to comparisons of gains in student literacy between R.F. and non-RF schools, where literacy instructional practices are not mandated. Following the U.S. Department of Education's publication in 2008 of the Reading First Impact Study, which showed no significant difference in literacy gains between R.F. and non-RF schools, smaller-scale studies have provided additional analysis of the problem. The purpose of this paper is to explore one such smaller-scale study and examine the quantitative design its authors used to draw conclusions about the efficacy of Reading First in rural Louisiana.

Because R.F. schools implement SBRR as part of their program and funding compliance, the authors set the framework for their study from findings of the National Reading Panel or NRP (National Institute of Child Health and Human Development, 2000). Based upon a review of more than 100,000 research studies, the NRP identified five vital areas of reading instruction: 1. phonemic awareness; 2. phonics; 3. fluency; 4. vocabulary; and 5. text comprehension. In addition, the NRP also characterized effective literacy instruction based upon four pillars. Those pillars are:

The use of valid, reliable assessments.

The alignment of instruction and materials.

The alignment of literacy instruction to professional development programming.

The presence of instructional leadership and coaching.

Based on the research-informed framework of the NRP, the study's authors elected to use DIBELS instrumentation to collect and categorize their data. DIBELS, which stands for Dynamic Indicators of Basic Early Literacy Skills, includes a series of subtests (letter naming fluency, phoneme segmentation fluency, nonsense word fluency, and oral reading fluency) that align with the vital areas of instruction that the NRP report identified. Sampling for use with the DIBELS instrumentation was based on non-probability convenience. Wong-Ratcliff and her colleagues identified students in grade one at five different schools in rural Louisiana, three of which were R.F. schools (*n* = 130), the remaining two being non-RF schools (*n* = 153). A matched sampling method ensured that participants shared the same demographic characteristics (specifically geography, SES, and diversity of ethnicity). As a design control, the authors introduced only a single difference between the two groups; namely, school participation (or not) in the federal Reading First program.

In their study, Wong-Ratcliff and her colleagues used a quasi-experimental design. While their goal was to establish and support a potential cause-and-effect relationship between the study variables – between participation in R.F. programming (the independent variable) and literacy gains (the dependent variable) – the use of non-probability, convenience sampling necessarily defines the study's design as quasi-experimental rather than experimental. The authors examined the hypothesis that there is no statistically significant difference between the mean gains in literacy for students at R.F. schools when compared to students at non-RF schools. They collected data for their investigations at two separate points in the school year using the DIBELS subtests, which they administered during fall and spring benchmark testing.

DIBELS, currently in its eighth edition, was in its sixth edition at the time of Wong-Ratcliff et al.'s study. DIBELS and the particular subtests the authors used in their data collection are broadly administered by schools across the United States. As an assessment instrument, it is known to be both reliable (consistent across administrations) and valid (accurately measuring literacy skills in K-8 student populations). Both validity and reliability have been reported in numerous peer-reviewed research studies, including Hoffman (2009) and Shilling (2007). In Wong-Ratcliff et al.'s study, no conditions of research were adapted or changed during the course of the investigation, which lends independent support to the validity and reliability of their work.

DIBELS is not without controversy, however, as several researchers who originally participated in creating the instrument also served in a consulting capacity to the U.S. Department of Education during the development of the Reading First initiative. This conflict of interest was outlined by Kathleen Manzo in an article for Education Week in 2005. While there is no apparent conflict of interest between the current study's authors (Wong-Ratcliff, Powell, and Holland) and their use of DIBELS (which they contextualize as a function of alignment with the NPR report), I would like to have seen direct discussion of the DIBELS conflict of interest problem as part of the study's introduction and disclosures.

The authors used three statistical procedures in their data analysis. First, to examine the significance of the comparison of means for literacy gains between R.F. and non-RF schools in fall benchmarking, they performed an Analysis of Variance or ANOVA test. This showed highly significant results (p < .001), though the effect size was small (between .096 and .140). As a result, the authors concluded that the difference in baseline literacy scores was meaningful. (The three R.F. schools had an overall higher level of literacy than the two non-RF schools at the start of the study.) Second, following spring benchmarking, the authors used a univariate analysis of covariance or ANCOVA test (i.e., an ANOVA with regression) to compare the groups while adjusting for their baseline differences. The ANCOVA showed that mean literacy gains as measured by the DIBELS subtests were not significantly different for students at the R.F. schools in comparison with students at the non-RF schools. In other words, while at the outset of the study, the R.F. students' baseline was higher (incidentally, not demonstrably because of SBRR practices), the net gains achieved by both groups were the same. Students at non-RF schools, where SBRR was not mandated, did not see fewer literacy gains than those at the R.F. schools. Finally, the authors used correlated t-tests to analyze the DIBELS subtests for reliability between their use in the current study and results from the prior year's administration.
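As a point of reference for the first of these procedures, a one-way ANOVA's F statistic (along with an eta-squared effect size, the kind of small value reported above) can be computed directly. The sketch below uses invented scores, not the study's data:

```python
def one_way_anova_f(groups):
    """F statistic and eta-squared effect size for a one-way ANOVA."""
    observations = [x for g in groups for x in g]
    grand_mean = sum(observations) / len(observations)
    group_means = [sum(g) / len(g) for g in groups]
    # Between-group and within-group sums of squares.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)
    df_between = len(groups) - 1
    df_within = len(observations) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    eta_squared = ss_between / (ss_between + ss_within)  # effect size
    return f_stat, eta_squared

# Invented fall benchmark scores (not the study's data):
rf_scores = [31, 35, 38, 40, 36]      # hypothetical R.F. group
non_rf_scores = [27, 30, 33, 29, 31]  # hypothetical non-RF group
f_stat, eta_sq = one_way_anova_f([rf_scores, non_rf_scores])
```

A large F with a small eta-squared mirrors the study's pattern: a statistically significant baseline difference that nonetheless explains little of the overall variance.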

The results of data collection and analysis from the fall and spring benchmarking periods in R.F. and non-RF schools did not show statistically significant differences of means in literacy measures between the fall and spring testing windows. The authors observed that students in the three R.F. schools had a higher baseline literacy proficiency than those at the non-RF schools (though in all cases, the effect sizes were small – between .096 and .140). Following benchmark testing in the spring, however, there was no statistically significant difference between R.F. and non-RF schools in the overall gains in reading skills observed among students. Results from the study, as the authors predicted, were consistent with the Reading First Impact Study (U.S. Dept. of Education, 2008) and similar small-scale studies on the effects of Reading First programming. Extended instructional time for reading together with the effective use of para-professionals and reading interventionists were factors at both the R.F. and non-RF schools in this study. Such factors, the authors concluded, are superior predictors of literacy over participation in R.F. programming and the use of mandated SBRR instructional practices. While the question of whether a school in Louisiana should or should not seek a grant award through Reading First is not one that this study answers, the research design and statistical procedures the authors followed mean it can and should serve to inform the broader decision-making process of literacy programming.

**Induction** is the process of moving from specific observations to a probabilistic theory.

**Deduction** is the process of moving from a theory to observations and confirmation.

Understanding the difference between inductive and deductive reasoning is critical for those new to the process of academic research, particularly in the social sciences. The subject is one of long history and debate (Hume (1739; 1748), Kant (1781), Russell (1948), Popper (1959), and Quine (1970) have each contributed substantial arguments) and can easily be misunderstood when viewed in the context of related and important ideas like objectivity, justification, and truth. The idea, for instance, that science should be objective, which is to say outside of the influence of perspectives and value judgements, presupposes a deductive approach to reasoning. And yet, the process of using specific observations to support the probability of a conclusion that can be generalized to a population necessitates prior knowledge and a theoretical background. Science often advances through an inductive process and, for this reason, its absolute objectivity is limited. To invoke Hume, this is not a matter of critique, but rather a matter of fact. Induction may have limitations in reaching and justifying the truth of a claim, but it is, nonetheless, an essential tool for creating new knowledge and understanding.

Like the distinction between independent and dependent variables [__click here for an overview__], the distinction between induction and deduction is fundamental to understanding the process of research. The table below is intended to provide a high-level overview of induction and deduction, while offering a basic explanation of the concepts themselves.

**References: **

Hume, D. (1739). *A treatise of human nature*. Oxford University Press.

Hume, D. (1748). *An enquiry concerning human understanding*. Oxford University Press.

Kant, I. (1781). *Critique of pure reason* [*Kritik der reinen Vernunft*] (P. Guyer & A. W. Wood, Eds.). Cambridge University Press, 1998.

Popper, K. (1959). *The logic of scientific discovery* [*Logik der Forschung*]. Hutchinson.

Quine, W. V. O. (1970). Natural kinds. In N. Rescher et al. (Eds.), *Essays in honor of Carl G. Hempel* (pp. 41–56). D. Reidel.

Russell, B. (1948). *Human knowledge: Its scope and limits*. Simon and Schuster.

The distinction between dependent and independent variables is a necessary one in research, but there is an extraordinary amount of variation in the language used to signify each type of variable. The table below is not exhaustive, but does capture the breadth of the terminology and provide a helpful means of understanding this key concept in research.

**Defining the Confidence Interval of a Data-set**

A *confidence interval* is a range of values that provides an estimate within which some parameter (e.g., the mean or μ) is likely to fall, based upon a level of probability (Cohen et al., 2011). The probability that a value will fall within this range, expressed as a percentage, is known as the *confidence level*. By convention, the confidence level is set at either 95% or 99%, a value obtained using the formula 100(1 - α), which converts the alpha or risk from a decimal to a percentage value. The relationship between the confidence level and the confidence interval is such that as the confidence level increases, the confidence interval widens.

In a normally distributed data set where either a Z-test or t-test is being performed to determine a relationship of significance between an observed value and the mean, the confidence interval generates a lower and upper limit for the estimated mean. Within any confidence interval, the critical value is a two-tailed value, regardless of whether a one-tailed or two-tailed test has been performed (DeMoulin & Kritsonis, 2013).

Confidence levels are widely used in the media when reporting, for instance, the likelihood that a political candidate will be elected to office. One might read that, based upon a poll, it appears a candidate will garner 75% of the vote. That poll would likely be reported with a confidence level of 95%, plus or minus 3% (the confidence interval). This gives confidence limits such that, 95% of the time, the candidate would be expected to garner between 72 and 78 percent of the votes (example based on Vogt & Johnson, 2015).
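A poll's margin of error can be reproduced from the normal approximation for a sample proportion. The sketch below uses a hypothetical poll of 1,000 respondents, 750 of whom back the candidate; both figures are invented for illustration:

```python
import math

def proportion_margin_of_error(p_hat, n, z=1.96):
    """Normal-approximation margin of error for a sample proportion
    (z = 1.96 corresponds to a 95% confidence level)."""
    return z * math.sqrt(p_hat * (1 - p_hat) / n)

# Hypothetical poll: 750 of 1,000 respondents back the candidate.
p_hat = 750 / 1_000
margin = proportion_margin_of_error(p_hat, 1_000)
interval = (p_hat - margin, p_hat + margin)  # roughly 0.72 to 0.78
```

Larger samples shrink the margin by the square root of n, which is why halving a reported margin of error requires roughly quadrupling the poll's size.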

**Viability and Usefulness of Constructing a Confidence Interval for a Data-set**

Based upon the alpha or risk-level, a confidence level is established before data analysis and hypothesis testing take place (Cohen et al., 2011). This confidence level will directly drive the confidence interval calculated as part of a hypothesis test. For this reason, when used correctly, confidence intervals provide an additional descriptive statistic for a data-set that supports the inference being made from that set. While not confirmatory, confidence intervals do contribute to quantitative reliability.

That said, confidence intervals provide an estimated range and, together with the integrated confidence level (e.g., 90, 95, or 99 percent), these sample statistics can impart a sense of certainty to naïve readers where true certainty does not exist. With a typical confidence level of 95%, one can expect an unknown population parameter to fall within a confidence interval 19 out of 20 times. Attributing near certainty to 19 events does not, of course, limit the consequences that may come from an inference that is contradicted by event 20. Recognizing this limitation is what one might call the art of statistical inference. While statistical analysis attempts to make predictions that manage variability and uncertainty, it cannot completely control either of these phenomena (Spiegelhalter, 2019).
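The "19 out of 20" expectation can be checked by simulation. In the sketch below, repeated samples are drawn from a known population (the parameters μ = 100, σ = 15, and the sample size are invented for illustration), and the proportion of z-intervals that capture the true mean is tallied:

```python
import math
import random

def z_interval_covers(mu, sigma, n, z, rng):
    """Draw one sample of size n and report whether its z-interval
    (known-sigma case) captures the true mean mu."""
    sample = [rng.gauss(mu, sigma) for _ in range(n)]
    sample_mean = sum(sample) / n
    half_width = z * sigma / math.sqrt(n)
    return sample_mean - half_width <= mu <= sample_mean + half_width

rng = random.Random(0)   # fixed seed for reproducibility
trials = 2_000
hits = sum(z_interval_covers(mu=100, sigma=15, n=30, z=1.96, rng=rng)
           for _ in range(trials))
coverage = hits / trials  # close to 0.95 (about 19 in 20), though no
                          # single interval carries any guarantee
```

The long-run coverage hovers near 95%, but nothing in the simulation says which individual interval is "event 20", which is precisely the limitation described above.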

**References**:

Cohen, L., Manion, L., & Morrison, K. (2011). *Research methods in education* (7th ed.). Routledge.

DeMoulin, D. F., & Kritsonis, W. A. (2013). *A statistical journey: Taming of the skew!* (2nd ed.). The AlexisAustin Group.

Spiegelhalter, D. (2019). *The art of statistics: How to learn from data*. Basic Books.

Vogt, W. P., & Johnson, R. B. (2015). *The SAGE dictionary of statistics and methodology* (5th ed.). SAGE Publications, Inc.

What I had intended to convey in my __earlier post__ is that failing to reject the null (as opposed to accepting it) is the result of data collection and hypothesis testing. Reaching the point where it is possible to fail to reject the null hypothesis is a function of a research project in which a null hypothesis (a reflection of the current state of research) has been stated and an alternative hypothesis (the claim a researcher is investigating) has been tested. Quantitative research hypothesis testing is binary in its result: either one rejects the null or fails to reject it. Rejecting the null means data support the alternative hypothesis to replace the null; failing to reject it means data do not support the alternative and, thus, the null remains in place.
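The binary outcome just described can be sketched as a simple decision rule. A two-tailed z-test is assumed purely for illustration; note that the function's two possible return values are "reject" and "fail to reject", with no "accept" branch at all:

```python
import math

def two_tailed_decision(z_stat, alpha=0.05):
    """Binary outcome of a two-tailed z-test: 'reject the null' or
    'fail to reject the null' -- never 'accept the null'."""
    # Two-tailed p-value from the standard normal CDF.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z_stat) / math.sqrt(2))))
    if p_value < alpha:
        return "reject the null", p_value
    return "fail to reject the null", p_value

decision_a = two_tailed_decision(2.5)  # p ~ .012 -> reject the null
decision_b = two_tailed_decision(1.0)  # p ~ .317 -> fail to reject the null
```

Failing to reject leaves the null standing by default; it is not a positive finding in the null's favor, just as a not-guilty verdict is not proof of innocence.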

The analogy I used to support my argument was that of a criminal court proceeding. In a criminal court, the case is like a research study and, as a result of evidence (i.e., data), one of two possible outcomes is reached: either the jury rejects the null hypothesis or it fails to reject the null. In this analogy, the null states that the defendant is presumed innocent. When the jury rejects the null, the person goes to jail; when the jury fails to reject the null, the person goes free.

Any person who is never charged with a crime (i.e., with an alternative hypothesis) is presumed innocent and sees no interaction between a null hypothesis (their presumed innocence) and an alternative hypothesis (that they have committed a crime). When a person is charged with a crime, and a court case ensues, the null hypothesis that the person is presumed innocent interacts with the alternative hypothesis that the person has committed a crime. It's for this reason that the court case becomes like a research study. Without a court case, the null hypothesis continues to exist and can be accepted by anyone without any effort – without a trial and the work of collecting data to reach a conclusion. The null in this form is an unexamined state of being. For a court case to proceed, there must be an alternative hypothesis, and the end result must be either that the jury rejects the null hypothesis (i.e., the person is found guilty) or fails to reject it (i.e., the person is found not guilty).

My argument in the context of research is that this analogy offers an explanation of why a researcher either rejects a null or fails to reject it and, more to the point, explains why the convention is for the researcher to say that he or she has failed to reject the null rather than saying that he or she accepts the null. In a court, failing to reject is a finding of *not-guilty,* rejecting is a finding of *guilty,* and either finding results from the presentation of evidence. In a research study, failing to reject means the null hypothesis remains and the data do not support replacing it with the alternative hypothesis. By comparison, rejecting the null means the alternative hypothesis offers a more accurate description of some phenomenon than the null does. This alternative hypothesis is then promoted and becomes the accepted null until a new alternative claim is investigated and new data allow the researcher once again either to reject or fail to reject the null.

This argument-by-analogy is the essence of my claim that failing to reject the null hypothesis is *not* tantamount to accepting the null. Moreover, returning to the criminal court analogy, when a person is found not-guilty, they rejoin the accepted state of being of all other null hypotheses stating that a person is presumed innocent. Only when an alternative hypothesis is made claiming that the person is guilty will the work once again begin to gather data and either reject or fail to reject the null. Hence, accepting the null requires no effort, while failing to reject the null is the result of data collection and analysis. They are not two ways of saying the same thing.

One final point bears on the larger epistemological issue known as the replication crisis (Ioannidis, 2005a; 2005b). In the event that the null hypothesis is rejected and the person is found guilty, the alternative hypothesis is then promoted to the new null hypothesis for that person. Happily, the court analogy continues to work here: any appeals case would be a replication study that attempts to reexamine whether the null hypothesis previously rejected should again be rejected.

In addition to my response above, I also tested my thinking with the direct questions the reader posed and have included those below.

**What is the alternative hypothesis in a trial? **

The alternative hypothesis is that the person has committed a crime and is guilty.

**[Are the Null and the Alternative] the defense attorney’s argument and the prosecutor’s argument? **

The defense attorney and the prosecutor are sources of data. The court-case analogy is imperfect here because the prosecutor is responsible for the alternative hypothesis (that the person has committed a crime), but the jurors decide whether to reject or fail to reject the null hypothesis. The defense attorney works to support the null hypothesis. In a research study, the researcher is, of course, responsible for the alternative hypothesis, as well as the collection and presentation of all data.

**Are the jurors determining whether they reject one lawyer’s argument and not the other? **

In the strictest of terms, the jurors are simply deciding whether they reject or fail to reject the null hypothesis (which is the presumption of innocence). They can make this determination because an alternative hypothesis has been offered and a case has been presented with evidence to sway them in making a decision either to reject or fail to reject. Going back to the problem of accepting the null versus failing to reject it, in my view, accepting simply means the current state of affairs exists without examination while failing to reject indicates a process has been undertaken and data do not support the rejection of the null.

**Is rejecting one lawyer’s argument failing to reject the other lawyer’s argument? **

The two lawyers together present a pool of evidence, which is tantamount to data in a study. I would argue that it is more helpful to see the case in terms of the binary choice between rejecting or failing to reject the null hypothesis, rather than accepting the arguments of one lawyer over the other. While I acknowledge that the ideas are inextricably intertwined, from the perspective of propositional logic, it is, perhaps, helpful to focus on the final choice of rejecting versus failing to reject.

**So in other words, I may not accept either lawyer’s argument (hypothesis) but I do reject one of the arguments requiring me to accept the other’s argument?**

I think, in this instance, it is sound to say that the legal arguments are the data allowing the jury either to reject the null hypothesis or to fail to reject it. The hypotheses are that the defendant is presumed innocent (null hypothesis) or guilty (alternative hypothesis). The null hypothesis indicates that there is no difference between the defendant and others (all of whom are presumed innocent), and the alternative hypothesis indicates that there is a difference (namely, that the defendant is guilty).

**References**:

Ioannidis, J.P.A. (2005a). Why most published research findings are false. *PLOS Medicine*, 2(8), 696-701.

Ioannidis, J.P.A. (2005b). Contradicted and initially stronger effects in highly cited clinical research. *Journal of the American Medical Association*, 294(2), 218-228.

**Null and Alternative Hypotheses**

A hypothesis is a claim or prediction about the relationship between variables in a study – between the independent or *experimental* variable that the researcher is manipulating and the dependent or *measured* variable that the researcher is testing. In quantitative research, such a claim or prediction is typically expressed in a dual structure based upon a null hypothesis (H0) and its alternative (Ha) (Vogt & Johnson, 2015).

The **Null Hypothesis** (H0) is a broadly accepted phenomenon or value for a given parameter. In general, the H0, as the focus of a researcher’s area of interest, is demonstrated through a literature review. The H0 is an accepted conjecture but, as the name suggests, while assumed to be true, it is *nullifiable* (which is to say, falsifiable or refutable). (For additional background, the history of the H0 as a refutable claim is connected to the work of the Austrian-British philosopher of science Karl Popper, 1934; 1959.) The alternative hypothesis (Ha) is the researcher’s hypothesis – the claim that they will test with a view to invalidating the study’s null hypothesis. The Ha drives the collection of data, and those data, upon analysis, will ultimately allow the researcher one of two conclusions: either *reject* the null hypothesis or *fail to reject* the null hypothesis. Given that the Ha is the researcher’s conjecture and the de facto purpose of the study, the desirable outcome is to reject the null hypothesis and thus promote the alternative hypothesis that is under investigation (Cohen et al., 2011).

The dual structure of the H0-Ha allows for statistical testing and decision making, the purpose of which is to limit the role of chance in explaining the difference between the variables in the study. A necessary assumption in this process is that truly random and representative samples have been drawn. The H0 and Ha also stand in a relationship such that they are mathematical opposites. As a result, should data indicate that the H0 is no longer supported, the researcher is able to reject that hypothesis with a level of quantifiable certainty. However, by the same logic, in the event that the data continue to support the established H0, the researcher must then fail to reject it. It should be noted, as well, that any research study neither proves nor disproves the null hypothesis in either rejecting or failing to reject it. Instead, the study offers data that may support the Ha and may or may not allow the researcher to reject the null hypothesis as a consequence (DeMoulin & Kritsonis, 2013).
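The reject / fail-to-reject decision can be sketched numerically. The figures below (a null mean of 100, a sample mean of 104, a known sigma of 15, n = 36) are hypothetical, and the one-sample z-test stands in for whatever test a given study actually uses:

```python
import math

def z_test_p_value(sample_mean, mu0, sigma, n):
    """Two-tailed p-value for a one-sample z-test (sigma assumed known)."""
    z = (sample_mean - mu0) / (sigma / math.sqrt(n))
    # Upper-tail area of the standard normal, via the error function.
    p_one_tail = 0.5 * (1 - math.erf(abs(z) / math.sqrt(2)))
    return 2 * p_one_tail

alpha = 0.05
# Hypothetical numbers: H0 says the population mean is 100.
p = z_test_p_value(sample_mean=104.0, mu0=100.0, sigma=15.0, n=36)
decision = "reject H0" if p < alpha else "fail to reject H0"
print(f"p = {p:.4f} -> {decision}")  # p ~ 0.11, so we fail to reject H0
```

With these numbers the data do not clear the bar, so the null remains in place; nothing about that outcome proves the null true.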

**Accepting and Rejecting the Null Hypothesis: A Legal Analogy**

In a criminal court of law, there is a universally held null hypothesis: namely, that a defendant is presumed innocent. The role of the prosecutor, like the role of a researcher, is to present data (i.e. evidence) that indicate rejection of the null hypothesis (i.e., a finding of *guilty*). For this reason, there are only two possible outcomes in a criminal proceeding: either the defendant is guilty (i.e., the jury rejects the H0) or the defendant is not-guilty (i.e., the jury fails to reject the H0). This is much the same in a quantitative research study where a researcher either rejects the null hypothesis or fails to reject it. And, just as when an individual is found guilty in a court proceeding, that outcome (the Ha) should not be construed as definitively proved. When a researcher finds statistically significant support to reject the null hypothesis, the alternative hypothesis is simply promoted as the better of the two claims. Like the null hypothesis before it, the alternative hypothesis now comes to occupy a nullifiable position.

**Accepting the H0 and Failing to Reject the H0 are not One-and-the-same**

Following the analogy of a criminal court proceeding above, the H0 that a defendant is presumed innocent is a statement that there is no statistically significant relationship between the defendant (the dependent variable being tested by the prosecutor) and the evidence for the crime (the independent variable that the prosecutor manipulates to make a case). In a criminal trial, as in a research study, data are collected as evidence to serve and support an alternative hypothesis (Ha). Without evidence, the H0 that the defendant is innocent remains in place and unchanged because the fundamental assumption is that the H0 is true.

The absence of data that would support the Ha (i.e., the rejection of the H0) does not in itself establish that the H0 is true. A defendant is presumed innocent at the start of a trial because this status is a default null hypothesis. The goal of the trial is not to accept that preexisting assumption, but rather to determine whether there *is or is not* evidence to reject the null presumption of innocence. For this reason, simply accepting the H0 is not one-and-the-same as failing to reject it. In the absence of any evidence or argument, a jury can accept the null hypothesis that the defendant is presumed innocent - nothing is proved or disproved. At the end of a trial, however, in failing to reject the null hypothesis the jury has concluded that the evidence presented does not sufficiently support the Ha (i.e., guilty), and so the H0 (i.e., not guilty) stands. The same is true in the context of research, where failing to reject the H0 is the result of data analysis and thus provides a methodologically sound basis for its continued recognition, while simply accepting the H0 offers nothing other than a re-statement of an already assumed truth.

From the perspective that the purpose of research is to provide support (not proof) for an alternative hypothesis with the desirable outcome that this hypothesis (Ha) is able to supplant the existing hypothesis (H0) (Cohen and Morrison, 2011), it strikes me as logical that accepting the H0 offers nothing to the current state of research, while failing to reject it demands a body of evidence (viz. data). When a researcher fails to reject the null hypothesis, the null remains and, I think, therein lies the real crux of this matter. Namely, that the null hypothesis is, by its very nature, nullifiable, but the data do not yet allow rejection of it, nor the promotion of an alternative hypothesis to replace it.

Nothing is required from a researcher to accept a null hypothesis – it already exists, and the claim has already been accepted. To reject it, however, requires the work of study design, data collection, analysis, and conclusions. When the research process does not support an alternative hypothesis (i.e., the research hypothesis, which is the reason for conducting the study), it then fails to reject the null hypothesis. For this reason, I’d argue, the interaction of the null and alternative hypotheses in research is essential either to reject or to fail to reject the null hypothesis, whereas accepting the null requires none of that work.

**References**:

Cohen, L., Manion, L., & Morrison, K. (2011). *Research methods in education* (7th ed.). Routledge.

DeMoulin, D.F., & Kritsonis, W.A. (2013). *A statistical journey: Taming of the skew!* (2nd ed.). The AlexisAustin Group.

Popper, K. (1934/1959). *Logik der Forschung* [*The logic of scientific discovery*]. Basic Books.

Vogt, W.P. & Johnson, R.B. (2015). *The SAGE dictionary of statistics and methodology* (5th ed.). SAGE Publications, Inc.

**Transformation of Raw Data: X-Values**

Statistics is concerned with the description of variance as a means of understanding the relationship that one data point (for instance, a single student's score) might have to a larger set of data (for instance, the scores of all students enrolled in Algebra II in a given school district). Because raw scores do not typically convey usable information on their own, they are often transformed algebraically into standardized values to facilitate comparison and inference. Consider, for instance, a student who scores 49/100 on a standardized Algebra II math assessment. That data point, even as a percentage (49%), fails to convey complex information like relative performance. Even if 49% represents an established failing grade, without a sense of how that score compares to others in the group, it has little meaning. The relative comparison is what matters. If the assessment’s mean score were 13%, the student with a score of 49 would be an outlier of high performance, even though he or she may not have achieved a traditional passing grade. When additional data are also included (for instance, the total number of students who completed the assessment (N), the mean score (μ), and the standard deviation (σ)), a statistical analysis is possible that will allow teachers and school officials to make decisions about individual student needs or to develop specific action plans.

Z-scores, T-scores, and Stanines are descriptive statistics that serve a comparative purpose. They transform raw data into numerical measurements that allow specific comparisons of individual scores to a larger group of scores. These comparisons facilitate the identification of patterns that can inform decision making to improve future performance or outcomes (Woolfolk, 2018).

**The Use of Z-Scores:**

As a descriptive statistic, Z-scores pinpoint precisely where an individual raw score lies in relation to a group within a normal distribution where standard deviation serves as the unit of measurement. For this reason, such scores are also referred to as *standard scores* or *normal scores*. The z-score is the transformation of a raw data point into a standard deviation unit that indicates how far above or below the mean that raw score is. Z-scores assume that the raw data fall into a normal distribution and, thus, can allow that data point to be represented within a standard normal distribution (SND). In general, a Z-score is used only when the sample size is above 30, and the standard deviation of the population is known (Bluman, 2001).

Z-scores are powerful descriptive statistics because they can be derived for any normal distribution (where the mean, median, and mode have the same value). Take the hypothetical math assessment mentioned above. Suppose that 649 students within a certain school district completed it, that the mean score was 13, and that the standard deviation was 3. With those additional data and statistics, the z-score of a student whose raw score is 15/100 becomes 0.67: the student is above the mean by about two-thirds of a standard deviation. Because the raw score has been converted to a z-score, it is also possible to use a standard normal table to ascertain the proportion of students falling between the mean and this student and the proportion who outperformed him or her (as determined by the area between Z and the mean and the area beyond Z). These values are 0.2486 and 0.2514, respectively (DeMoulin & Kritsonis, 2013). Now rich comparison is possible: this student outperformed approximately 75% of all students who completed the assessment and was outperformed by only approximately 25% of them. The graph below illustrates distribution areas as percentages based upon three standard deviations above and below the mean in a standard normal distribution (Galarnyk, 2018).

(Image Credit: Galarnyk, 2018)
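The z-score arithmetic in the example above can be reproduced in a few lines of Python. This is a sketch using only the standard library; the computed areas differ slightly from the table values in the text (0.2486 and 0.2514) because tables round z = 2/3 to 0.67:

```python
import math

def z_score(x, mu, sigma):
    """Standard score: distance from the mean in standard-deviation units."""
    return (x - mu) / sigma

def normal_cdf(z):
    """Cumulative area under the standard normal curve up to z."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# The Algebra II example: raw score 15, mean 13, standard deviation 3.
z = z_score(15, 13, 3)   # 2/3, which tables round to 0.67
below = normal_cdf(z)    # proportion of scores below this student
between = below - 0.5    # area between the mean and z (~0.2475)
beyond = 1 - below       # area beyond z (~0.2525)
print(f"z = {z:.2f}, below = {below:.4f}, beyond = {beyond:.4f}")
```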

**The Use of T-Scores:**

In a standard normal distribution (SND), a z-score may have a negative value (in fact, any raw score less than the mean will give a negative z-score). In addition, the average or mean of an SND is 0, and together with a negative Z-score, these properties can be misleading for individuals interpreting a Z-score without prior knowledge of statistics. Following the Algebra II example above, a student whose raw score on the hypothetical assessment is 2 would have a Z-score of -3.67, and the average student score would be zero. For this reason, IQ scores, SAT scores, and T-scores are all examples of Z-scores that have been transformed into a different value to create a statistic that is, perhaps, less easily misinterpreted (DeMoulin & Kritsonis, 2013).

T-scores measure the size of the difference relative to the average value in a data set without using negative numbers. The mean is set at 50, the standard deviation is set at 10, and the range is 20 to 80 (from minimum to maximum) (DeMoulin & Kritsonis, 2013). As noted above, the T-score has the advantage, relative to a Z-score, of removing negative values from the description and setting the mean at an intuitive value of 50.

The T-score for the example above, a student with a raw score of 15 and a z-score of 0.67, would be 56.7, based on the formula T = 10z + 50. With 50 as the standardized average, a score of 56.7 is easily readable as a slightly above average score.
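The rescaling is a one-line transformation; a minimal sketch:

```python
def t_score(z):
    """Linear rescaling of a z-score: the mean becomes 50, the SD becomes 10."""
    return 10 * z + 50

print(f"{t_score(0.67):.1f}")   # 56.7, the running example
print(f"{t_score(-1.0):.1f}")   # 40.0 -- a negative z still maps to a positive T
```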

**The Use of Stanines:**

Just as the T-score is an evolution of the standardized Z-score that attempts to represent a descriptive statistic more intuitively, the stanine (or standard nine value) attempts to represent a z-score on a simplified scale with a range of 1 to 9. On this scale, 1-2-3 is below average, 4-5-6 is average, and 7-8-9 is above average. Each value on the stanine scale represents 0.5 standard deviations, with 1 and 9 representing the tails on either end of the normal distribution. Because it is an odd-number scale, the middle value, 5, corresponds to the mean. Stanines are frequently used in education because of their readability. I have shared before that I work as a GT Coordinator. A common test used in the initial screening process for gifted and talented identification, the Cognitive Abilities Test (CogAT), developed by Riverside Insights, reports student results on a stanine scale on the rationale that it is understandable with limited explanation (Lohman, 2011; Warne, 2015).
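One common convention for converting a z-score to a stanine (double the z-score, add 5, round, and clamp to the 1-9 range; publishers' conventions vary slightly) can be sketched as:

```python
def stanine(z):
    """Map a z-score onto the standard-nine scale (one common convention)."""
    return max(1, min(9, round(2 * z + 5)))

print(stanine(0.67))   # 6 -- the running example lands in "high average"
print(stanine(0.0))    # 5 -- the mean sits at the middle value
print(stanine(-3.0))   # 1 -- extreme scores clamp into the tails
```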

**References**:

Bluman, A.G. (2001). *Elementary statistics: A step by step approach* (4th ed.). McGraw Hill.

DeMoulin, D.F., & Kritsonis, W.A. (2013). *A statistical journey: Taming of the skew!* (2nd ed.). The AlexisAustin Group.

Galarnyk, M. (2018). Normal distribution [Graph]. Toward Data Science.com. https://towardsdatascience.com/understanding-the-68-95-99-7-rule-for-a-normal-distribution-b7b7cbf760c2.

Lohman, D.F. (2011). Cognitive abilities test, form 7 (CogAT7). Riverside Publishing.

Warne, R.T. (2015). Test review: Cognitive abilities test, form 7 (CogAT7). Journal of Psychoeducational Assessment, 33(2), 188-192.

Woolfolk, A. (2018). *Educational psychology* (13th ed.). Pearson.

Combinations (nCr) and permutations (nPr) are a foundational element of probability. For a series of items, a combination represents the number of possible arrangements where the order of any arrangement does not matter to the solution. A salad is a great example of a **combination**: it's made up of a series of items (leaves, fruits, oil, vinegar), and the order in which those items appear does not matter. As a quick example, let's make a salad and toss in our ingredients two at a time. Given 6 ingredients (n), choosing them 2 at a time (r), nCr allows 15 different possible combinations.
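The salad count can be checked with Python's standard library (a quick sketch of the nCr computation):

```python
import math

# Six salad ingredients (n), tossed in two at a time (r), order irrelevant.
print(math.comb(6, 2))  # 15

# The same count written out with factorials: n! / (r! * (n - r)!)
n, r = 6, 2
print(math.factorial(n) // (math.factorial(r) * math.factorial(n - r)))  # 15
```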

By comparison, a **permutation **is a computation of arrangements where the order does matter. ATM passcodes and combination locks are examples of permutations. In both cases, there is a distinct set, the order of items in the set matters, and repetition may or may not be permitted. When **repetitions **are permitted, as in a combination lock or password, the formula is n to the power of r. A 100-number dial with a three-number combination (the sort of lock one might find on a school locker) has 1 million permutations – 100 x 100 x 100.

When a permutation does not allow for repetition, the nPr formula is used. Like the nCr formula, it relies on factorials. Consider, by way of example, a game of pool, where a ball is not replaced once it is knocked into a pocket. How many permutations are there for the 15 numbered balls? In each game, all 15 balls are pocketed in some order (15 items taken 15 at a time, without repetition), which gives the astounding number 1,307,674,368,000 – over 1.3 trillion possible permutations.
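Both permutation counts from the examples above can be verified directly (a sketch using the standard library):

```python
import math

# With repetition: a 100-number dial with three positions -> n ** r.
print(100 ** 3)  # 1000000

# Without repetition: all 15 pool balls pocketed in order -> 15!
print(math.perm(15, 15))   # 1307674368000
print(math.factorial(15))  # the same value, since nPr = n! when r = n
```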

[For reference, the formulas are: combinations, nCr = n! / (r!(n − r)!); permutations without repetition, nPr = n! / (n − r)!.]

As noted, and as clearly seen in the formulas, factorials are essential in combinations and permutations without repetition. But, for me, the real question now is how 0! can be equal to one (1). It struck me as unintuitive, but there is a pattern at work that explains what’s happening. Let’s look.

The factorial of a positive integer (n!) is the product of all positive integers that are less than or equal to n. From 1 to 5, the factorials look like this:

1! = 1

2! = 2 x 1 = 2

3! = 3 x 2 x 1 = 6

4! = 4 x 3 x 2 x 1 = 24

5! = 5 x 4 x 3 x 2 x 1 = 120

As you'll see, it may not be intuitive, but follow the pattern established by the factorials above and it becomes clear why 0! = 1. Here's the pattern in reverse with some added explanation to clarify the process:

5! is 120 ( 5 x 4 x 3 x 2 x 1 = 120)

Notice that 120 ÷ 5 = 24 (which is 4!)

Notice that 24 ÷ 4 = 6 (which is 3!)

Notice that 6 ÷ 3 = 2 (which is 2!)

Notice that 2 ÷ 1 = 1 (which is 1!)

**Thus 1 ÷ 1 = 1 (which is 0!)**

A factorial result (e.g., 4! = 24) divided by its number (in this case 4) gives the next factorial down in the sequence (in this case 6, which is 3!).

Here's the same pattern, in brief, leading to 0!:

2! = 2

2 ÷ 2 = 1 (or 1!)

1 ÷ 1 = 1 (or 0!)

Stated generally, the pattern is that n! equals (n + 1)! divided by (n + 1). Applying it one step below 1! gives 0! = 1! ÷ 1 = 1 ÷ 1 = 1.

**Thus, 0! = 1**.
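The descending division pattern above can be confirmed in a few lines (a sketch; `math.factorial` gives the conventional value of 0! directly):

```python
import math

# Walk the pattern down: dividing n! by n yields (n - 1)!.
value = math.factorial(5)   # 120
for n in range(5, 0, -1):
    value //= n             # 120 -> 24 -> 6 -> 2 -> 1
print(value)                # 1, the value the pattern assigns to 0!
print(math.factorial(0))    # 1
```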

**Interpretations of Probability**

Probability is the branch of mathematics concerned with the analysis and management of uncertainty. Because of the inextricable link between statistics and probability, between data variability and uncertainty, the study of statistics that is so crucial to education research is also a study of probability. Not unlike statistics, which has differing methodologies (based on the approaches of Neyman-Pearson, Fisher, and Bayes), probability admits a number of differing interpretations. Each interpretation constitutes a distinct view and maintains a special set of assumptions. Of the various interpretations of probability, the three most common are:

1. **Classical/Analytic View**: classical probability is defined in terms of the analysis of possible outcomes. As described by Laplace (1814), a pioneer in the field, the probability of an event A is the number of outcomes favorable to the event (f) divided by the total number of possible outcomes (N), or P(A) = f/N. In this formula, each possible outcome is considered equally likely (known as the principle of indifference). By way of example, from one trial of a coin flip to another, the probability of the event does not change.

2. **Frequentist/Objective View**: nearly a half-century after Laplace and Bernoulli first described the analytic view of probability, John Venn (a statistician famous in education for the diagram that bears his name) presented an alternative to classical probability. Venn defined probability in terms of past performance and argued that the assumption of symmetry at the heart of the classical view was too strong, since it did not allow information based upon prior experience to exert influence (Venn, 1866, cited in Stigler, 1986). In the frequentist view, an event is one subset of all possible outcomes within a sample space. The relative frequency of event occurrences observed across repetitions of the same trial (e.g., flipping a coin) serves as the measure of probability. Unlike in classical probability, one event is not indifferent to the next. By way of example, as the sun rises from one day to the next repeatedly over multiple trials, the relative likelihood (probability) that the sun will rise again increases.

3. **Subjective View**: unlike the frequentist view, in which measures of observed events across multiple trials objectively define the probability of an event, in the subjective view personal belief in the likelihood of an outcome impacts probability (Billingsley, 2012). While both the analytic/classical view and the frequentist view rely on chance, the subjective view relies on credence and expectation. The origins of subjective probability can be traced to the work of Thomas Bayes in the mid-1700s, but Bruno de Finetti in the mid-1900s offered what is now the classic description. He argued that probability is a subjective analysis of the likelihood of an event and exists only in the minds of individuals, rather than in any objective sense of the term. De Finetti's (1974) treatise on the subject, in fact, opens with the bold statement that [objective] *probability does not exist* - probability can only be defined in terms of individual belief and expectation.
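The classical and frequentist views can be contrasted in a short simulation. This sketch uses a fair coin flip; the observed frequency will vary from run to run, while the classical answer is fixed by counting outcomes:

```python
import random
from fractions import Fraction

# Classical/analytic: favorable outcomes over possible outcomes.
# For a fair coin, P(heads) = 1/2 by the principle of indifference.
classical = Fraction(1, 2)

# Frequentist/objective: relative frequency over many repeated trials.
random.seed(42)
trials = 10_000
heads = sum(random.random() < 0.5 for _ in range(trials))
frequentist = heads / trials

print(classical, frequentist)  # 1/2 and a relative frequency near 0.5
```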

**Probability and Decision-making in Day-to-day Activity**

Three differing interpretations of probability allow for three different understandings of the same probabilistic issues. First, consider the weather. In parts of the United States where inclement weather (like snowfall, ice storms, or hurricanes) can result in school closure, the probability forecast of the weather plays a vital role in the decision to open or close a school. In general, weather forecasts are frequentist, rather than classical: the chance that, for instance, it will snow heavily tomorrow in the part of Colorado in which I live is not determined by a principle of indifference, but rather by an index of inter-related observations and statistics coupled with probability occurrence calculations. Local meteorologists make an expert judgment and, for teachers and students alike, decisions are made based on those judgments about how to dress, for instance, and whether to enjoy hot tea or iced tea during the commute to the school. The complex probability calculations that a meteorologist makes lead me to make my own set of calculations that are one-part frequentist (based on past circumstances) and one-part subjective (based on credence and expectation).

In day-to-day life, I suspect the subjective view, with its emphasis on credence and personal expectation, is the reality of probability for most. What is clear, though, is that no single one of these interpretations is universal.

Variability is a quantitative measure of the distribution (spread or clustering) of the values in a data set. Variability is small when there are only slight differences between the scores or values of a given data set; variability is large when the differences between the scores are also large (DeMoulin & Kritsonis, 2013).

As the measurement of variability is the foundation of statistical inquiry (Spiegelhalter, 2019), it is vital that a researcher appreciate the strengths and limitations of the summary statistics used to describe variability. While range and standard deviation are two widely used statistics in summarizing variability within a sample, one is far more reliable than the other. (Simply for the sake of completeness, canonically the four measures of spread used in statistics are range, quartiles, variance, and standard deviation.)

Range denotes the distance from the largest to the smallest value in a sample distribution. Mathematically, it is the difference between the upper real limit of the largest value and the lower real limit of the smallest value (DeMoulin & Kritsonis, 2013). Much like each of the measures of central tendency, range has its limitations. While it does provide a measure of variability, range is sensitive to extreme values and relies on only two values (the minimum and the maximum) for its summary. As such, if there are outliers in the sample, the range will suggest a high degree of spread even though the majority of the data may be tightly clustered. For this reason, as a descriptive statistic describing variability, range is an unreliable measure.

Standard deviation is perhaps the most commonly used statistic to describe variability and is by far more reliable than range. Standard deviation takes the mean of the sample distribution as its point of reference and gives a value that represents the average distance of each data point from that mean. When the value of the standard deviation is low, the majority of scores in the sample are clustered close to the mean; when it is high, the scores are more widely spread out from the mean.
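The contrast between range and standard deviation can be illustrated with a small hypothetical data set (the scores below are invented for this sketch):

```python
import statistics

# Nine tightly clustered scores, then the same set plus one outlier.
clustered = [48, 49, 50, 50, 51, 52, 50, 49, 51]
with_outlier = clustered + [95]

for data in (clustered, with_outlier):
    data_range = max(data) - min(data)
    stdev = statistics.pstdev(data)  # population standard deviation
    print(f"range = {data_range:2d}, stdev = {stdev:.2f}")
```

A single outlier drives the range from 4 to 47, even though nine of the ten scores still sit within four points of one another; the standard deviation also grows, but it at least reflects every score rather than only the two extremes.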

The question remains: Does range have value as a statistic? Of course, if only as a means of summarizing the minimal and maximal limits of a sample. Should it be used on its own as a descriptive statistic, without other statistics that give a more complete picture of the data? Probably not.

**References**:

Cohen, L., Manion, L., & Morrison, K. (2011). *Research methods in education* (7th ed.). Routledge.

DeMoulin, D.F., & Kritsonis, W.A. (2013). *A statistical journey: Taming of the skew!* (2nd ed.). The AlexisAustin Group.

Spiegelhalter, D. (2019). *The art of statistics: How to learn from data*. Basic Books.