Discussion Methods for Fairness in Testing
Beyond Bias: Methods for Fairness in Testing
In the unit readings from your Psychological Testing and Assessment text, you read about misconceptions regarding test bias and test fairness, two terms that are often incorrectly considered synonymous. While questions regarding test bias have been addressed through technical means, issues with test fairness are tied to values. The text attempts to define test fairness in a psychometric context and provides eight techniques for preventing or remedying adverse impact on one or another group (see page 209). One of these techniques is differential cutoffs. Furthermore, you were introduced to a variety of methods for setting cut scores. These methods have been based on either classical test theory (CTT) or item response theory (IRT).
For this discussion, synthesize the information you learned about these two theories and respective methods. In your post:
Determine which one is preferable for responding to questions about a test’s fairness.
Identify at least two advantages and two disadvantages in using each theory, citing appropriate American Educational Research Association (AERA) standards from your readings.
Defend your preference in terms of the methods used within each theory and how they apply to concepts of fairness across groups. Essentially, how does it best address test fairness?
Describe how advances in technology are improving the process of test development and inclusion of appropriate items.
Respond to the posts of at least two other learners.
This activity will help you achieve the following learning components:
Describe the characteristics of fair test items and procedures.
Define roles of technology in testing.
Apply writing and citations skills appropriate for doctoral-level learners.
Test fairness in language assessment has now been discussed by researchers for over a decade. Various definitions and formulations have been offered: the Standards from APA, AERA, NCME (1999) for general educational measurement and assessment, and Kunnan (1997, 2000, 2004) for language assessment. However, the analytical methods that can help bring about fair tests and testing practice have not been clearly articulated. In this article, I present a brief overview of the conceptual framework of Kunnan’s Test Fairness Framework and statistical analyses of test results that could be used to analyze a few of the test fairness qualities. Due to space limitations, qualitative methods (such as content analyses, conversational analysis, think-aloud reports) that are also useful for this purpose will not be discussed.
2 – Conceptual overview of The Test Fairness framework
In earlier writings (Kunnan 2000, 2004), I presented an ethics-inspired rationale for my Test Fairness Framework (TFF) with a set of principles and sub-principles. The principles use a mixed deontological system which combines both the utilitarian and deontological systems. Frankena suggests reconciling the two types of theories by accepting the notion of rules and principles from the deontological system but without its rigidity, and by using the consequential or teleological aspect of utilitarianism but without the idea of measuring goodness, alleviating pain, or bringing about the greatest balance of good over evil. Thus, two general principles of justice and beneficence and sub-principles are articulated as follows:
Principle 1: The Principle of Justice: A test ought to be fair to all test takers, that is, there is a presumption of treating every person with equal respect.
Sub-principle 1: A test ought to have comparable construct validity in terms of its test-score interpretation for all test takers.
Sub-principle 2: A test ought not to be biased against any test taker groups, in particular by assessing construct-irrelevant matters.
Principle 2: The Principle of Beneficence: A test ought to bring about good in society, that is, it should not be harmful or detrimental to society.
Sub-principle 1: A test ought to promote good in society by providing test-score information and social impacts that are beneficial to society.
Sub-principle 2: A test ought not to inflict harm by providing test-score information or social impacts that are inaccurate or misleading.
The TFF views fairness in terms of the whole system of a testing practice, not just the test itself. Therefore, multiple facets of fairness are implicated, including multiple test uses (for intended and unintended purposes), multiple stakeholders in the testing process (test takers, test users, teachers and employers), and multiple steps in the test development process (test design, development, administration and use). Thus the TFF has five main qualities: validity, absence of bias, access, administration, and social consequences. Figure 1 presents the TFF within the circle of tests and testing practice, where validity is at the center of the framework and the other qualities, although having their distinct roles, overlap validity. This is translated into Table 1, which presents the TFF as a linear list with the main quality and the main focus of each of the qualities.
Table 1. Test Fairness Framework
Here is a series of short descriptions for each of the test qualities presented in Table 1.
1. Validity: Validity of a test score interpretation can be used as part of the TFF when the following evidence is collected.
a. Content representativeness or coverage evidence: This type of evidence (sometimes simply described as content validity) refers to the adequacy with which the test items, tasks, topics, and language dialect represent the test domain.
b. Construct or theory-based validity evidence: This type of evidence (sometimes described as construct validity) refers to the adequacy with which the test items, tasks, topics, and language dialect represent the construct or theory or underlying trait that is measured in a test.
c. Criterion-related validity evidence: This type of evidence (sometimes described as criterion validity) refers to whether the test scores under consideration meet criterion variables such as school or college grades and on-the-job ratings or some other relevant variable.
d. Reliability: This type of evidence refers to the reliability or consistency of test scores in terms of consistency of scores among different testing occasions (described as stability evidence), among two or more different forms of a test (alternate form evidence), among two or more raters (inter-rater evidence), and in the way test items measuring a construct function (internal consistency evidence).
2. Absence of Bias: Absence of bias in a test can be used as part of the TFF when the following evidence is collected.
a. Content or language: This type of bias refers to content or language or dialect that is offensive or biased to test takers from different backgrounds. Examples include content or language stereotypes of group members and overt or implied slurs or insults (based on gender, race and ethnicity, religion, age, native language, national origin and sexual orientation); or choice of dialect that is biased to test takers.
b. Disparate impact: This type of bias refers to different performances and resulting outcomes by test takers from different group memberships. Such group differences (as defined by salient test taker characteristics such as gender, race and ethnicity, religion, age, native language, national origin and sexual orientation) on test tasks and sub-tests should be examined for Differential Item/Test Functioning (DIF/DTF). In addition, a differential validity analysis should be conducted in order to examine whether a test predicts success better for one group than for another.
c. Standard setting: In terms of standard setting, test scores should be examined in terms of the criterion measure and selection decisions. Test developers and score users need to be confident that the appropriate measure and statistically sound and unbiased selection models are in use. These analyses should indicate to test developers and score users that group differences are related to the abilities that are being assessed and not to construct-irrelevant factors.
3. Access: Access of a test can be used as part of the TFF when the following evidence is collected.
a. Educational access: This refers to whether a test is accessible to test takers in terms of opportunity to learn the content and to become familiar with the types of tasks and cognitive demands.
b. Financial access: This refers to whether a test is financially affordable to test takers.
c. Geographical access: This refers to whether a test site is accessible in terms of distance to test takers.
d. Personal access: This refers to whether a test offers certified test takers with physical and learning disabilities appropriate test accommodations. The 1999 Standards and the Code (1988) call for accommodation so that test takers who are disabled are not denied access to tests that can be offered without compromising the construct being measured.
e. Conditions or equipment access: This refers to whether test takers are familiar with test-taking equipment (such as computers), procedures (such as reading a map) and conditions (such as using planning time).
4. Administration: Administration of a test can be used as part of the TFF when the following evidence is collected:
a. Physical conditions: This refers to appropriate conditions for test administration such as optimum light, temperature and facilities as relevant for administering tests.
b. Uniformity: This refers to uniformity in test administration exactly as required so that there is uniformity and consistency across test sites and equivalent forms, and that test manuals or instructions specify such requirements. Examples include uniformity in test length, materials and any other conditions (for example, planning or no-planning time for oral and written responses) so that test takers (except those receiving accommodations due to disability) receive the test under the same conditions.
c. Test security: This refers to issues of breach of security of test materials or test administration. Examples include fraud, misrepresentation, cheating, and plagiarism.
5. Social consequences: The social consequences of a test can be used as part of the test fairness framework when evidence regarding the following is collected:
a. Washback: This refers to the effect of a test on instructional practices, such as teaching, materials, learning, test taking strategies, etc.
b. Remedies: This refers to remedies offered to test takers to reverse the detrimental consequences of a test such as re-scoring and re-evaluation of test responses, and legal remedies for high-stakes tests. The key fairness questions here are whether the social consequences of a test and/or the testing practices are able to contribute to societal equity or not and whether there are any pernicious effects due to a particular test or testing program.
In summary, the TFF is best served when evidence from the five test fairness qualities (validity, absence of bias, access, administration and social consequences) is collected together and used in a defensible argument. Finally, it is expected that the TFF can be used in a unified manner so that a fairness argument, like the validity argument proposed by Kane (1992, 2006) and Bachman (2005), can be used in defending tests as fair tests.
3 – Statistical analyses for the Test Fairness Framework
Statistical analyses of test results should be conducted with the purpose of collecting evidence for a test fairness argument. In this article, I will limit myself to describing statistical analyses of test results for a few test fairness qualities. I will focus on a few qualities under Validity and a few qualities under Absence of bias.
3.1 – Validity: Construct-based validity evidence
a – Explaining the constructs being assessed
One of the key questions regarding a test is its composition or structure: whether the test structure (or language abilities as operationalized in the test) is unitary or divisible (or multicomponential). This can be examined through Exploratory Factor Analysis (EFA), which seeks to identify hypothetical factors that account for the patterns of correlations that are observed in test scores (from individual items or tasks). EFA is usually performed on the tetrachoric correlations (for item level data) or Pearson product-moment correlations (for composite test section data) of test scores.
There are a number of steps in conducting EFA. First, the suitability of the correlation matrix for factoring has to be established, either by Bartlett’s test of sphericity, which tests the null hypothesis that all correlations to be examined are zero, or by the Kaiser-Meyer-Olkin measure, which is an indicator of the strength of the relationships among the variables in the matrix. Second, an appropriate extraction method needs to be chosen from among the many choices: principal components, principal axis factoring, alpha factoring, and unweighted least squares. Third, the number of factors to be extracted has to be determined: the initial decision can be made after scrutinizing the eigenvalues obtained from the initial extraction using the criteria of substantive importance and the scree test. Solutions with several numbers of factors should then be extracted, typically one factor above and one factor under the initial estimate. Fourth, rotated factor structures should be extracted: oblique factor solutions should be examined (assuming that the factors are likely to have a strong positive or negative correlation) or, if the inter-factor correlations are small, orthogonal rotated factor solutions should be examined. Fifth, the final determination regarding the number of factors and the best solution should be based on two criteria: simplicity (in terms of factor loadings for salient loadings) and interpretability (by evaluating the extent to which salient factor loadings correspond to the items or composite section test scores). Thus, this type of analysis is called exploratory factor analysis (see Kunnan, 1992, for example).
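The third step, inspecting eigenvalues before deciding how many factors to extract, can be illustrated with a small Python sketch. It is only a rough illustration under stated assumptions: the section scores are invented, the hand-written Jacobi routine stands in for a statistics package, and the "eigenvalue greater than 1" threshold is a common rule of thumb that is not part of the procedure described above. The function names are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson product-moment correlation of two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlation_matrix(columns):
    """Correlation matrix for a list of score columns (one column per section)."""
    return [[pearson(ci, cj) for cj in columns] for ci in columns]

def eigenvalues_symmetric(matrix, sweeps=50, tol=1e-12):
    """Eigenvalues of a symmetric matrix via Jacobi rotations, largest first."""
    a = [row[:] for row in matrix]
    n = len(a)
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if abs(a[p][q]) < 1e-15:
                    continue
                # Rotation angle that zeroes the (p, q) off-diagonal entry.
                theta = 0.5 * math.atan2(2 * a[p][q], a[q][q] - a[p][p])
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # rotate columns p and q
                    akp, akq = a[k][p], a[k][q]
                    a[k][p], a[k][q] = c * akp - s * akq, s * akp + c * akq
                for k in range(n):  # rotate rows p and q
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * apk - s * aqk, s * apk + c * aqk
    return sorted((a[i][i] for i in range(n)), reverse=True)

def candidate_factor_counts(eigenvalues, threshold=1.0):
    """Initial estimate k (assumed rule: eigenvalue > threshold), plus k-1 and k+1."""
    k = max(1, sum(1 for ev in eigenvalues if ev > threshold))
    return [m for m in (k - 1, k, k + 1) if 1 <= m <= len(eigenvalues)]

# Invented composite section scores for eight test takers.
sections = {
    "reading":   [12, 15, 11, 18, 14, 20, 9, 16],
    "listening": [10, 14, 12, 17, 13, 19, 8, 15],
    "grammar":   [22, 18, 25, 20, 27, 19, 24, 21],
    "writing":   [20, 17, 24, 21, 26, 18, 23, 22],
}
r_matrix = correlation_matrix(list(sections.values()))
evs = eigenvalues_symmetric(r_matrix)
for i, ev in enumerate(evs, start=1):
    print(f"factor {i}: eigenvalue {ev:.3f}, share of variance {ev / len(evs):.1%}")
print("numbers of factors to try:", candidate_factor_counts(evs))
```

The printed eigenvalues are the values one would plot for a scree test; the extraction, rotation, and interpretation steps would still be carried out in a dedicated package.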
From this analysis, it is possible to discover whether there is a one-factor structure (meaning unitary) or a structure with two or more factors (meaning divisibility or multicomponentiality) and whether the factors are correlated (related) or not. Further, if the factor structure is examined across test taker groups (for example, native language, gender, academic major, instructional types, etc.), it is possible to determine if the factor structure for the different groups is invariant or not.
Confirmatory Factor Analysis (CFA) is a procedure in which the factor structure is specified in advance and the data available in the form of test scores is used to evaluate the proposed structure. In other words, CFA is used to test a theory or a hypothesis about the factor structure of a test, such as a one-factor, a two-factor or a multi-factor structure. Examples of CFA include studies that attempt to examine multiple trait and method measures as in multitrait-multimethod studies. This type of study provides a framework for examining many validation issues: (1) whether measures of the same trait using different methods receive higher correlations than measures of different traits using different methods, (2) whether measures of the same trait using different methods receive higher correlations than measures of different traits using the same method, and (3) whether there is a convergence of test scores for the same trait measure across methods. Measurement models in structural equation modeling are similar to CFA and they can be used to conduct similar investigations.
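The three multitrait-multimethod comparisons can be made concrete with a short Python sketch. This is a descriptive check on a table of correlations, not a CFA; the correlations are invented, the function names are hypothetical, and the strict "smallest convergent correlation exceeds the largest competing correlation" rule is a simplifying assumption.

```python
from statistics import mean

def classify_pairs(correlations):
    """Sort correlations by whether the two measures share a trait and/or a method."""
    groups = {
        "same_trait_diff_method": [],   # convergent evidence
        "diff_trait_diff_method": [],
        "diff_trait_same_method": [],
    }
    for ((trait1, method1), (trait2, method2)), r in correlations.items():
        if trait1 == trait2 and method1 != method2:
            groups["same_trait_diff_method"].append(r)
        elif trait1 != trait2 and method1 != method2:
            groups["diff_trait_diff_method"].append(r)
        elif trait1 != trait2 and method1 == method2:
            groups["diff_trait_same_method"].append(r)
    return groups

def mtmm_checks(correlations):
    """Apply comparisons (1)-(3) from the text to a set of pairwise correlations."""
    g = classify_pairs(correlations)
    convergent = g["same_trait_diff_method"]
    return {
        "check_1": min(convergent) > max(g["diff_trait_diff_method"]),
        "check_2": min(convergent) > max(g["diff_trait_same_method"]),
        "mean_convergent": mean(convergent),  # input for judging comparison (3)
    }

# Invented correlations: two traits, each measured by two methods.
example = {
    (("reading", "multiple_choice"), ("reading", "self_rating")): 0.62,
    (("listening", "multiple_choice"), ("listening", "self_rating")): 0.58,
    (("reading", "multiple_choice"), ("listening", "multiple_choice")): 0.41,
    (("reading", "self_rating"), ("listening", "self_rating")): 0.45,
    (("reading", "multiple_choice"), ("listening", "self_rating")): 0.25,
    (("reading", "self_rating"), ("listening", "multiple_choice")): 0.22,
}
print(mtmm_checks(example))
```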
b – Explaining test performance
One area of interest that test researchers and even classroom teachers may have is how to explain test taker performance, particularly varied performance. In other words, the question is whether (high, mid or low) test performance is related to personal attributes (such as use of learning strategies, styles and achievement motivation), institutional factors (such as quality and years of prior instruction) and social factors (such as the availability of the L2 in the social environment as in ESL contexts). The statistical procedure that is best suited for this analysis would be structural equation modeling (SEM) where independent factors or constructs (such as the personal attributes, institutional and social factors of interest) can be posited to have influences on test performance. An analysis of this kind can uncover relationships that could lead to explanations regarding varied test performance in a test taking sample. As this topic is quite technical, I will not discuss this further except to suggest two readings (Kunnan, 1998, for an introduction to SEM, and Kunnan, 1995, for a worked illustration).
3.2 – Absence of Bias: Disparate impact
a – Identifying differences among test taker groups
When different test taker groups take a test, some groups may perform better than others on some variables. These differences may be along group membership lines. These group memberships may be self-reported through a questionnaire (examples, gender, race and ethnicity, age, native language, second language learning, etc.), assigned by a test designer (examples, accommodations for test takers with disability, planning conditions, computer use, etc.) or assigned by a researcher in an experimental research setting (examples, experimental group vs. control group; treatment 1 group vs. treatment 2 group, etc.).
In all such cases, if there is an interest in testing the hypothesis that there is no difference between test taker groups on one or more variables, the first step is to examine test scores to see if there are mean score differences on variables of interest. This can be done by examining descriptive statistics and graphical representations (histograms and frequency polygons). If there are score differences between groups on variables of interest, the next step would be to test whether the differences on those variables are statistically significant.
The next step is to consider the number of groups and variables of interest, as there are three ways to proceed to test the hypothesis that there is no difference between groups: (1) If two test taking sample groups are normally distributed and the groups are independent of each other (examples, by gender, race and ethnicity, native language, accommodations, planning groups, etc.), then an independent or uncorrelated t-test is appropriate; (2) If two test taking sample groups are normally distributed but the groups are paired groups as in the case of single group experimental studies, then a dependent or correlated t-test is appropriate; (3) If there are more than two groups on a single variable or between two variables for a single group, the t-test is not appropriate as the probability of making a type I error increases as the number of paired comparisons increases. In such cases, the appropriate procedure for examining mean score differences is the analysis of variance (ANOVA).
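As a rough illustration of the three cases, the following Python sketch computes the test statistics with the standard library only. The scores are invented and the function names are hypothetical; the independent t-test shown is the pooled-variance form (an assumption, since the text does not specify one), and the sketch stops at the statistic and degrees of freedom, so p-values would have to come from tables or a statistics package.

```python
import math
from statistics import mean, variance

def independent_t(a, b):
    """Case (1): pooled-variance t statistic for two independent groups."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    t = (mean(a) - mean(b)) / math.sqrt(pooled * (1 / na + 1 / nb))
    return t, na + nb - 2  # statistic, degrees of freedom

def paired_t(first, second):
    """Case (2): dependent t statistic for paired scores from the same test takers."""
    diffs = [y - x for x, y in zip(first, second)]
    n = len(diffs)
    t = mean(diffs) / math.sqrt(variance(diffs) / n)
    return t, n - 1

def one_way_anova(*groups):
    """Case (3): F statistic for several independent groups on one variable."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum(sum((x - mean(g)) ** 2 for x in g) for g in groups)
    df_between, df_within = k - 1, n_total - k
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within

# Invented scores for three native-language groups.
group_a = [12.0, 15.0, 11.0, 18.0, 14.0]
group_b = [10.0, 13.0, 9.0, 12.0, 11.0]
group_c = [16.0, 17.0, 15.0, 19.0, 18.0]

print("independent t:", independent_t(group_a, group_b))
print("paired t:", paired_t([10.0, 12.0, 14.0], [11.0, 14.0, 17.0]))
print("ANOVA F:", one_way_anova(group_a, group_b, group_c))
```

With exactly two groups, the ANOVA F equals the square of the pooled t statistic, which is one way to sanity-check an implementation.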
Finally, when test taking groups have large sample sizes, the t-test or ANOVA results can indicate statistical significance even when there are small mean score differences. Therefore, in such contexts, a statistic that is not affected by sample size has to be used. This statistic is called the effect size statistic, or Cohen’s d; it provides a standardized mean difference measure (Bachman, 2004).
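The contrast between the t statistic and the effect size can be sketched in a few lines of Python. This is a minimal illustration with invented scores and hypothetical function names, using the pooled standard deviation as the standardizer: the same standardized difference d produces a larger and larger t as the groups grow.

```python
import math
from statistics import mean, variance

def cohens_d(a, b):
    """Standardized mean difference using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_sd = math.sqrt(
        ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    )
    return (mean(a) - mean(b)) / pooled_sd

def t_from_d(d, na, nb):
    """Pooled-variance t statistic implied by a given d and group sizes."""
    return d * math.sqrt(na * nb / (na + nb))

# Invented scores: the groups differ by one point on average.
base_a = [52.0, 55.0, 49.0, 58.0, 51.0]
base_b = [51.0, 54.0, 48.0, 57.0, 50.0]
d = cohens_d(base_a, base_b)

for n in (5, 50, 500, 5000):
    # Hold d fixed and ask what t would be with n test takers per group.
    print(f"n per group = {n:5d}  d = {d:.2f}  t = {t_from_d(d, n, n):.2f}")
```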
b – Identifying Differential Item Functioning
Test score differences among test taker groups can be used to examine a test for fairness. Although test fairness may have been considered in the test design, development, administration, and scoring procedures, many test designers discover problems of test bias too late in the test design-development-administration-scoring cycle. One approach to this problem, therefore, has been to examine test scores from a pilot group, or, if the test has already been launched, to examine test scores from a large sample of test takers and detect items that function differently for different test taking groups and to investigate the source of this difference. This is to make sure that test and test items or tasks are fair and not biased against or in favor of a particular group. Test taker groups of interest may be based on gender (female, male), race and ethnicity, first language, age, test preparation, years of study of the language tested, and so on.
In the U.S., there are professional standards documents that urge testing agencies or test developers to collect test score data from different groups for examination. For example, the Standards for Educational and Psychological Testing (AERA, APA, & NCME, 1999) have 12 standards for fairness. Similar standards and codes such as the Code of Fair Testing Practices in Education (1988, 2004, Code for short) have suggested DIF as a useful way of examining tests for their fairness.
Approaches: For more than two decades, the focus of DIF/test bias analysis was on the concept of relative item difficulty for different test taking groups. The idea was to conduct post-test administration (post hoc) studies to examine the performance of test takers with similar ability (as measured by the total score) from different subgroups with the expectation that there would be comparable individual item difficulty for the subgroups as the test takers are matched in terms of overall ability. In cases where items performed or functioned differently for subgroups, such items were to be flagged and examined for potential content or response format bias.
A more recent approach has focused on the concept that the general cause of DIF is the presence of multidimensionality in items displaying DIF (Ackerman, 1992; Shealy and Stout, 1993). Roussos and Stout (2004), expanding on this, stated that “such items measure at least one secondary dimension in addition to the primary dimension that the item is intended to measure” (p. 108). Therefore, as DIF methods are based on comparable test takers matched with respect to the primary dimension or construct the test item is measuring, a large DIF value could mean the test item is measuring additional dimensions differently across the reference and the focal groups. The additional dimension could be either an intended secondary dimension, called an auxiliary or benign dimension, or an unintended secondary dimension, called an adverse or nuisance dimension, that has crept into the test item.
Zumbo (2007) argues that the third generation of DIF research conceives “of DIF as occurring because of some characteristic of the test item and/or testing situation that is not relevant to the underlying ability of interest (and hence the test purpose). By adding ‘or testing situation’ to the possible reasons for DIF that have dominated the first two generations of DIF (including the multidimensional model) one greatly expands DIF praxis and theorizing to matters beyond the test structure (and hence multidimensionality) itself; hence moving beyond the multi-dimensional model of DIF” (p. 229). He suggests as an example the work of Muthen (1985, 1988, 1989) that “allows the researcher to focus on sociological, structural, community and contextual variables as explanatory sources of DIF” (p. 229).
Methodology and analyses: Three approaches are currently in use: (1) from Classical Test Theory, the Mantel-Haenszel procedure, the standard mean difference procedure, and the logistic regression procedure using standard statistical software such as SPSS; (2) confirmatory factor analysis models using specialized software such as MPLUS (Muthen and Muthen, 2002) and the SIB test (Stout and Roussos, 1996); and (3) from Item Response Theory, parametric models.
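Of the procedures listed, the Mantel-Haenszel procedure is the simplest to sketch. The Python example below is a minimal illustration under stated assumptions: the response counts are invented, test takers are matched on total score, and the function and group labels are hypothetical. It computes the Mantel-Haenszel common odds ratio across score levels and its transformation to the delta scale (-2.35 times the natural log of the odds ratio), where values near zero suggest little DIF and negative values suggest the item favors the reference group. The accompanying significance test is omitted.

```python
import math
from collections import defaultdict

def mantel_haenszel(records):
    """Common odds ratio and MH D-DIF for one item.

    records: iterable of (group, matching_score, correct), where group is
    'ref' or 'focal' and matching_score is the total test score.
    """
    # For each score level keep [correct, incorrect] counts per group.
    strata = defaultdict(lambda: {"ref": [0, 0], "focal": [0, 0]})
    for group, score, correct in records:
        strata[score][group][0 if correct else 1] += 1

    numerator = denominator = 0.0
    for cells in strata.values():
        ref_right, ref_wrong = cells["ref"]
        focal_right, focal_wrong = cells["focal"]
        n = ref_right + ref_wrong + focal_right + focal_wrong
        numerator += ref_right * focal_wrong / n
        denominator += ref_wrong * focal_right / n
    if numerator == 0 or denominator == 0:
        raise ValueError("odds ratio undefined for these counts")
    alpha = numerator / denominator
    return alpha, -2.35 * math.log(alpha)

def expand_counts(score, ref_right, ref_wrong, focal_right, focal_wrong):
    """Helper: turn a 2x2 table at one score level into individual records."""
    return (
        [("ref", score, True)] * ref_right
        + [("ref", score, False)] * ref_wrong
        + [("focal", score, True)] * focal_right
        + [("focal", score, False)] * focal_wrong
    )

# Invented counts at three total-score levels.
records = (
    expand_counts(10, 20, 30, 10, 40)
    + expand_counts(20, 35, 15, 25, 25)
    + expand_counts(30, 45, 5, 40, 10)
)
alpha, d_dif = mantel_haenszel(records)
print(f"MH odds ratio = {alpha:.2f}, MH D-DIF = {d_dif:.2f}")
```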
c – Setting standards
When test takers receive scores for their test performance, they are typically accompanied by other classifications such as “pass” or “fail,” or categories or levels of performance (such as “needs improvement,” “basic,” “proficient,” “advanced” or “unqualified,” “qualified” and so on). These classifications are based on standards that have been set either in terms of prespecified percentages of “pass” or “fail” (such as a 5% or 10% pass rate) or in terms of performance and content standards. While both types of standard setting procedures need to be defended, performance and content standards are more complex because the procedure involves many steps and a clear research design. Hambleton and Pitoniak (2006) outlined typical steps in setting performance standards: (1) selecting a standard-setting method, (2) choosing a panel of judges, (3) preparing descriptors of performance categories, (4) training judges to use the method, (5) collecting test item ratings, (6) compiling ratings from judges, and (7) compiling validity evidence for the standard setting.
Further, language test designers may be interested in whether standard setting policies should be compensatory, conjunctive or a combination of both. In a compensatory model, test takers’ total scores are used for standard-setting and, therefore, they could perform better in one section of the test (for example, grammar) and make up for a low score in another section (for example, speaking). In a conjunctive model, test takers have to attain a minimum standard (and score) for each and every section of the test. Thus, this model uses a multiple-cutoff standard setting procedure. In some contexts, a combination model might work best: a minimum standard setting for all the sections on the test (conjunctive) and an overall test score (compensatory). The choice of method and model along with appropriate choice of judges will play a significant role in the standard-setting process and thus, the utilization or decision-making based on test scores.
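The difference between the three models is easy to see in code. The following Python sketch uses invented section scores and cut scores, and the function names are hypothetical; it only shows how the same score profile can pass under one model and fail under another.

```python
def compensatory(scores, total_cut):
    """Pass if the total score reaches the cut; sections can offset each other."""
    return sum(scores.values()) >= total_cut

def conjunctive(scores, section_cuts):
    """Pass only if every section reaches its own minimum score."""
    return all(scores[section] >= cut for section, cut in section_cuts.items())

def combined(scores, section_cuts, total_cut):
    """Pass only if both the section minimums and the overall cut are met."""
    return conjunctive(scores, section_cuts) and compensatory(scores, total_cut)

# Invented profile: strong grammar, weak speaking.
scores = {"grammar": 28, "reading": 22, "speaking": 12}
section_cuts = {"grammar": 15, "reading": 15, "speaking": 15}
total_cut = 60

print("compensatory:", compensatory(scores, total_cut))
print("conjunctive: ", conjunctive(scores, section_cuts))
print("combined:    ", combined(scores, section_cuts, total_cut))
```

Here the high grammar score makes up for the low speaking score under the compensatory model, while the conjunctive and combined models reject the same profile because speaking falls below its section minimum.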
4 – Software for statistical analysis
There are many software programs that can be used for analyzing test results. For entering, organizing, and cleaning data, Microsoft Excel or a similar spreadsheet is an easy-to-use option. However, for anything beyond basic statistics, a statistics package should be used. One of the most common is IBM SPSS Statistics; this software can be used for most statistical procedures, including exploratory factor analysis. To perform confirmatory factor analyses and structural equation modeling, the best options are EQS, AMOS, and MPLUS. For DIF analyses, specialized software such as BILOG and MULTILOG needs to be used.
5 – Conclusion
In this short article I attempted to provide a conceptual understanding of the Test Fairness Framework and a few statistical analyses of test results that can help collect evidence in support of arguments for a fair test. Only fundamental statistical procedures best used in standard test analysis contexts were presented. Variations in test constructs, research design, data collection, and research questions might need approaches not discussed here.