Evaluating a new proposal for detecting data falsification in surveys


A recent paper [1] proposed a new method for detecting data falsification in surveys, called the maximum percent match statistic. For each respondent, the statistic is the maximum percentage of questions on which that respondent's answers match those of any other respondent in the dataset. The authors argue that valid survey data should have few respondents who match another respondent on more than 85% of questions. Based on this metric, the authors conclude that 1 in 5 publicly available international surveys contain data that is likely falsified. To evaluate this claim, we tested the sensitivity of the measure to variations in survey characteristics using simulations on synthetic and survey data; evaluations of high-quality domestic and international surveys with little risk of falsification; and regression analysis on 411 of Pew Research Center's international surveys. We find that the presence of high matches in a survey is extremely sensitive to natural, benign survey characteristics, such as the number of questions or the number of response options. Our analysis indicates that the proposed metric is prone to generating false positives, suggesting falsification when, in fact, there is none. Thus, we find that the claim of widespread likely falsification based on this measure is not supported.
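
To make the definition concrete, the following is a minimal sketch of how the maximum percent match statistic could be computed. It is not the implementation used in [1] or in our analysis. It assumes responses are stored as a respondents-by-questions integer array with no missing values. The names `max_percent_match` and `share_flagged`, the toy data, and the simulation settings are hypothetical and chosen only for illustration. The second function draws independent, uniformly random answers, so any high matches it produces arise by chance alone.

```python
import numpy as np

def max_percent_match(responses: np.ndarray) -> np.ndarray:
    """For each respondent (row), return the largest share of questions
    (columns) answered identically to any other respondent."""
    # match[i, j] = fraction of questions on which respondents i and j agree
    match = (responses[:, None, :] == responses[None, :, :]).mean(axis=2)
    np.fill_diagonal(match, -1.0)  # exclude each respondent's match with themselves
    return match.max(axis=1)

def share_flagged(n_resp, n_questions, n_options, threshold=0.85, seed=0):
    """Share of respondents above the threshold when every answer is an
    independent uniform random draw (an assumption; no falsification)."""
    rng = np.random.default_rng(seed)
    data = rng.integers(0, n_options, size=(n_resp, n_questions))
    return float((max_percent_match(data) > threshold).mean())

# Toy data: 4 respondents, 5 questions (hypothetical values)
toy = np.array([
    [1, 2, 3, 1, 2],
    [1, 2, 3, 1, 1],
    [3, 3, 1, 2, 2],
    [1, 2, 3, 1, 2],
])
scores = max_percent_match(toy)   # respondents 0 and 3 are exact duplicates
flagged = scores > 0.85           # the 85% threshold described in [1]

# Compare a short survey with few options to a long one with more options
short_survey = share_flagged(n_resp=200, n_questions=5, n_options=2)
long_survey = share_flagged(n_resp=200, n_questions=50, n_options=5)
```

In the toy data, the first and last respondents exceed the 85% threshold because their answers are identical, while the second respondent matches them on 4 of 5 questions (80%) and is not flagged. Comparing `short_survey` with `long_survey` gives a rough sense of how the number of questions and response options alone can drive the share of high matches, even in data generated without any duplication.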



