Understanding Differential Item Functioning and Item Bias in Psychological Instruments

For a psychological test or instrument to function as intended, its items should measure respondents’ performance fairly across groups of respondents, such as males and females. In the psychometric literature, the concept of differential item functioning (DIF) was introduced to describe differential group performance on an item when the groups are equated at the same level of ability or latent trait. This article introduces the concept of DIF and distinguishes it clearly from item bias and from simple group performance differences.


Introduction
Since the civil rights era of the 1960s in the United States, inequity has been a critical social issue, and educational and psychological testing is no exception. The use of testing as a sorting mechanism [1] has raised equity concerns for many people, and for the testing enterprise in particular. Academic research on group differences, together with public awareness of them, has led to scrutiny of whether educational and psychological tests disadvantage minority groups.
A well-known incident involving bias and group differences is the "Golden Rule" settlement of 1984. In 1976, the Golden Rule Insurance Company filed a lawsuit against the Illinois Department of Insurance and Educational Testing Service, charging racial bias in the Illinois insurance licensing exams. The eight-year-old suit ended in an out-of-court settlement. The gist of the settlement was the elimination of any item showing a different proportion correct across the compared groups, that is, a different proportion of yes/correct answers on the item, which is called the "item p-value" or "marginal item proportion-correct" (for details see, e.g., [2]).
Even before the Golden Rule settlement, there were claims in the academic community that some tests (e.g., IQ tests) were biased against minority groups. Some researchers examined item p-values and considered an item biased if its p-value differed substantially between the compared groups (e.g., a white majority group vs. a black minority group). This approach is consistent with the remedy adopted in the Golden Rule settlement. However, it is flawed because the marginal item proportion-correct does not distinguish a true group difference from true bias. Many academic researchers have pointed out this drawback of the Golden Rule procedure. For example, Gregory R. Anrig, the president of Educational Testing Service, announced that the Golden Rule settlement was "an error of judgment" (see also [3] for the side effects of implementing the Golden Rule procedure). One could ask: "Is it right to make group differences negligible by manipulating the test items (by excluding and revising items) if there is actually a real group difference, possibly created by past or present social inequity?"
Technically, the major drawback of the marginal proportion-correct approach is that it confounds group difference with real bias. The marginal probability of a correct response is affected both by the population distribution, which relates to the group mean difference, and by the item response function, which relates to item bias. That is, the marginal probability (the observed proportion correct or incorrect) can be written as
$$p(x) = \int P(\theta)^{x}\, Q(\theta)^{1-x}\, dF(\theta),$$
where p(x) is the marginal probability of x = 1 (Yes/correct) or x = 0 (No/incorrect), θ is the person latent trait (or ability), P(θ) is the item response function, Q(θ) = 1 - P(θ), and F(θ) is the distribution of θ. This expression shows that the person latent trait/ability and the item characteristics are confounded in the observed proportion of x. (A similar equation can be written for Likert-style or graded response items, showing that the observed marginal score depends on both the item response function and the latent trait distribution.) If we see a large difference in the proportion correct between two groups, we cannot conclude that the item is really biased. The large difference could be due to a real ability difference between the two groups, to a bias factor disadvantaging one group on the item, or to both, which is probably the case in many real-world applications.
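The confounding described above can be illustrated numerically. The following is a minimal sketch, not a DIF detection method: it assumes a logistic item response function with hypothetical discrimination `a` and difficulty `b`, a normal latent trait distribution with unit variance, and a simple midpoint rule for the integral. None of these choices come from the article; they are illustrative assumptions only.

```python
import math


def irf(theta, a=1.0, b=0.0):
    """Item response function P(theta); a logistic form is assumed here."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def normal_pdf(theta, mean, sd=1.0):
    """Density of the assumed normal latent trait distribution F(theta)."""
    z = (theta - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def marginal_p_correct(mean, b, a=1.0, lo=-8.0, hi=8.0, steps=4000):
    """Approximate the marginal proportion correct, the integral of
    P(theta) dF(theta), with the midpoint rule on [lo, hi]."""
    width = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        theta = lo + (i + 0.5) * width
        total += irf(theta, a, b) * normal_pdf(theta, mean) * width
    return total


# Case 1: the item behaves identically in both groups (same b),
# but the focal group has a lower mean ability (real group difference).
impact_ref = marginal_p_correct(mean=0.0, b=0.0)
impact_focal = marginal_p_correct(mean=-1.0, b=0.0)

# Case 2: both groups have the same ability distribution,
# but the item is harder for the focal group (item-level disadvantage).
dif_ref = marginal_p_correct(mean=0.0, b=0.0)
dif_focal = marginal_p_correct(mean=0.0, b=1.0)

print("Case 1 (ability difference only):", impact_ref, impact_focal)
print("Case 2 (item difference only):   ", dif_ref, dif_focal)
```

Under these assumptions the two cases produce the same pair of marginal proportions (up to numerical error), because shifting the ability mean down by one unit and shifting the item difficulty up by one unit have the same effect on the integral. The marginal proportions alone therefore cannot tell a real ability difference apart from an item-level disadvantage.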
In subsequent years, the definition of bias and the methodology for detecting it have been refined. Because of the social connotations of the word "bias", Holland & Thayer [4] suggested in 1988 the alternative term "Differential Item Functioning" (DIF), which has since replaced "bias", at least in academia. The complexity of the usage of these terms has been a source of confusion in communication between the technical measurement community and the public [5]. DIF is a neutral term indicating the magnitude of the advantage or disadvantage that an item presents to a group, which is usually estimated through statistical analysis. In recent years, identifying DIF items and classifying some (or all) of those DIF items as biased have come to be regarded as separate tasks. The former is a statistical matter, while the latter goes beyond statistics and includes interpreting the identified DIF in the context of social justice.
Formally, for a given item, no DIF can be expressed as
$$E(X \mid G = 1, \theta) = E(X \mid G = 0, \theta) \quad \text{for all } \theta,$$
where E is the expectation operator, X is a categorical ordinal item response (e.g., X = 1 (strongly disagree), 2 (disagree), 3 (agree), or 4 (strongly agree) on a 4-option Likert-style item), G is a group indicator (e.g., 1 = Female and 0 = Male; 1 = African American and 0 = White), and θ is the person latent trait/ability. Sometimes no DIF is expressed using an observed variable Z, a proxy for θ, in place of θ. In words, the above definition states that there is no DIF if the expected item score for one group and the expected item score for the other group are the same when the latent trait/ability scores are equated. Again, DIF concerns a conditional comparison between the two groups at the same trait/ability level, not a marginal comparison. Readers who would like to know more about methods of DIF detection are referred to [10].
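The difference between a marginal and a conditional comparison can be sketched with a small hand-made data set. Everything in the example below is hypothetical: the records, the two strata of the observed matching variable Z (1 = low, 2 = high), and the helper names. It only illustrates the definition and is not one of the formal DIF detection procedures.

```python
from collections import defaultdict


def mean(values):
    return sum(values) / len(values)


def marginal_means(records):
    """Mean item score per group, ignoring the matching variable Z."""
    by_group = defaultdict(list)
    for group, z, x in records:
        by_group[group].append(x)
    return {g: mean(xs) for g, xs in by_group.items()}


def conditional_means(records):
    """Mean item score per group within each stratum of Z."""
    cells = defaultdict(list)
    for group, z, x in records:
        cells[(z, group)].append(x)
    strata = sorted({z for _, z, _ in records})
    return {
        z: {g: mean(cells[(z, g)]) for g in (0, 1) if (z, g) in cells}
        for z in strata
    }


# Hypothetical records: (G, Z, X) with G = 0 reference group, G = 1 focal group,
# Z = observed proxy for the latent trait, X = response on a 4-point Likert item.
records = (
    [(0, 1, x) for x in (1, 2)]
    + [(0, 2, x) for x in (3, 4, 3, 4)]
    + [(1, 1, x) for x in (1, 2, 1, 2)]
    + [(1, 2, x) for x in (3, 4)]
)

marginal = marginal_means(records)
conditional = conditional_means(records)
# Focal-minus-reference gap in expected item score within each stratum of Z.
conditional_gaps = {z: m[1] - m[0] for z, m in conditional.items()}

print("Marginal means by group:", marginal)
print("Conditional means by Z and group:", conditional)
print("Conditional gaps (focal - reference):", conditional_gaps)
```

In this toy data the marginal means differ (about 2.83 for the reference group versus about 2.17 for the focal group) only because more focal respondents fall in the low stratum. Within each stratum of Z the expected item scores are identical, so the item shows a group difference but no DIF in the sense of the definition above.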
From the standpoint of test validity, DIF and its detection are important because the existence of DIF calls the fairness of testing into question. Although a test constructed without DIF cannot undo past inequalities, it can reveal inequalities that may have been created by past and existing inequity, thereby giving people a chance to reflect on the source of such differences.