8/10/2018
Many of us have found a notable number of responses with duplicated GPS coordinates (not IP addresses) in our recent MTurk data. Circumstantial evidence and reports suggest that responses from these duplicates are random and unreliable. The following procedure is proposed to test the evidentiary value of responses from duplicates and to determine whether they are truly random or unreliable. I am inviting open comments on this procedure. I will revise the procedure based on the comments and distribute the survey around next Friday. The overall format is similar to what I described here.
Here are the four tests that I am proposing:
1. Testing the reliability of known scales. Researchers will pick a well-validated scale from their surveys with known reliability. I recommend picking a scale in which about 50% of the items are reverse-coded. Researchers will report the reliability of the scale for duplicates and non-duplicates separately.
2. Testing the distribution of known measures. Researchers will pick a measure used in their data that has a known or expected skewed distribution (the more skewed the better). Researchers will report the distribution of the measure for duplicates and non-duplicates separately.
3. Testing relationships between variables that are known to correlate. Researchers will pick two variables used in their data that are known to correlate. For this task, I recommend looking for two measures that are known to be strongly correlated. If possible, I also recommend picking two measures in which about 50% of the items are reverse-coded and that appear on different pages of the survey.
4. Comparing the frequency of suspicious key words. It appears that duplicates tend to enter phrases that include the words “good,” “nice,” and “very” regardless of what is asked (see footnote 1). Therefore, researchers who have open-ended questions in their survey can count how many instances of “good,” “nice,” and “very” come from duplicates and how many come from non-duplicates. I am most interested in focusing on “good” and “nice,” given their representativeness.
For tests 1-3, I also intend to collect analyses of a random subset of non-duplicates whose N is comparable to that of the duplicates. This is done so that the non-duplicates do not have an unfair advantage because of a higher N.
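As a rough illustration of test 1 with the size-matched subset, here is a minimal Python sketch that computes Cronbach's alpha (one common reliability measure; the procedure above does not prescribe a specific one) for duplicates, all non-duplicates, and a random subset of non-duplicates with the same N as the duplicates. The function names, the 1-5 response scale, the positions of the reverse-coded items, and the example data are all assumptions made up for illustration.

```python
import random
from statistics import variance


def reverse_code(value, low=1, high=5):
    """Flip a reverse-coded item on a low..high response scale."""
    return low + high - value


def score_rows(raw_rows, reverse_items, low=1, high=5):
    """Reverse-code the items at the given column indexes for every respondent."""
    return [
        [reverse_code(v, low, high) if i in reverse_items else v
         for i, v in enumerate(row)]
        for row in raw_rows
    ]


def cronbach_alpha(rows):
    """Cronbach's alpha; each row holds one respondent's scored items."""
    k = len(rows[0])
    item_vars = [variance([row[i] for row in rows]) for i in range(k)]
    total_var = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


def size_matched_subset(rows, n, seed=0):
    """Random subset of n respondents, seeded so the draw can be reproduced."""
    return random.Random(seed).sample(rows, n)


# Hypothetical data: 4 items on a 1-5 scale, items at index 2 and 3 reverse-coded.
REVERSE_ITEMS = {2, 3}
duplicates = score_rows(
    [[3, 3, 3, 4], [4, 3, 4, 3], [3, 4, 3, 3]], REVERSE_ITEMS)
non_duplicates = score_rows(
    [[5, 4, 1, 2], [4, 4, 2, 1], [2, 1, 4, 5],
     [1, 2, 5, 4], [3, 3, 3, 3], [4, 5, 2, 2]], REVERSE_ITEMS)
matched = size_matched_subset(non_duplicates, len(duplicates))

for label, rows in (("duplicates", duplicates),
                    ("non-duplicates", non_duplicates),
                    ("matched non-duplicates", matched)):
    print(f"{label}: n={len(rows)}, alpha={cronbach_alpha(rows):.2f}")
```

Fixing the seed keeps the subset reproducible when the analysis is shared. Note that alpha is undefined when the total scores have zero variance, which this sketch does not guard against.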
I think we would have strong evidence that responses from GPS duplicates have limited evidentiary value if a preponderance of the analyses show a large discrepancy between responses from duplicates and non-duplicates regarding 1. the reliability of known scales, 2. the distributions or central tendency of known measures, 3. the correlations between variables that are known to correlate, and 4. the frequency of suspicious key words.
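For criteria 2 and 3, one way to put the discrepancy into numbers is to report the mean and skewness of a measure, and the Pearson correlation between two measures, separately for each group. The sketch below is only an illustration under assumptions: the function names are hypothetical, the skewness is the simple moment-based (population) version, and all data values are invented.

```python
from math import sqrt


def mean(xs):
    return sum(xs) / len(xs)


def skewness(xs):
    """Moment-based skewness: third central moment over variance ** 1.5."""
    m = mean(xs)
    m2 = sum((x - m) ** 2 for x in xs) / len(xs)
    m3 = sum((x - m) ** 3 for x in xs) / len(xs)
    return m3 / m2 ** 1.5


def pearson_r(xs, ys):
    """Pearson correlation between two equally long lists of scores."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(sxx * syy)


# Hypothetical 0-100 ratings on a measure expected to be strongly skewed.
skewed_measure = {
    "duplicates": [50, 48, 55, 52, 45, 50, 60, 47],
    "non-duplicates": [0, 0, 5, 0, 10, 0, 50, 3],
}
# Hypothetical scores on two scales that are expected to correlate.
scale_pairs = {
    "duplicates": ([3, 4, 2, 3, 4, 3, 2, 4], [3, 2, 3, 4, 3, 2, 4, 3]),
    "non-duplicates": ([1, 2, 4, 5, 3, 4, 2, 5], [2, 2, 4, 4, 3, 5, 1, 5]),
}

for group in ("duplicates", "non-duplicates"):
    values = skewed_measure[group]
    a, b = scale_pairs[group]
    print(f"{group}: mean={mean(values):.1f}, "
          f"skew={skewness(values):.2f}, r={pearson_r(a, b):.2f}")
```

Both `skewness` and `pearson_r` divide by a variance term, so they fail on a group whose scores are all identical; a real analysis would need to handle that case.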
Regarding responses to open-ended questions:
I included the raw responses to the open-ended question in my own survey (which is still collecting data) here. Among the 282 responses from duplicates, “good” appears 75 times (26.6%), “nice” appears 59 times (21%), and “very” appears 19 times (6.7%). In contrast, among the 296 responses from non-duplicates, “good” shows up only 2 times, “nice” shows up only 1 time, and “very” does not show up at all. I think these three responses (two that contain "good" and one that contains "nice") are random responses that the GPS method missed. I looked at their response patterns and found that their feelings toward the KKK and the Nazi party (from 0 to 100) are all close to 50, which comports with the pattern seen in the responses flagged using the GPS method. As I mentioned earlier, the average feeling toward the KKK and the Nazi party is between 8 and 9 among typical non-duplicates.
I noticed that the same word does not usually appear more than once in the same open-ended response, so simply counting the total occurrences of a word seems to be a reliable way to count how many participants used that word. My open-ended question is an invitation for comments on the survey.
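The counting described above can be sketched in Python as follows. This is a minimal example, not the script used for the numbers in this post: the function name, the tokenization rule (lowercased whole words), and the sample responses are assumptions. It reports both total occurrences and the number of responses containing each word, so the two can be compared directly.

```python
import re

KEYWORDS = ("good", "nice", "very")


def keyword_counts(responses, keywords=KEYWORDS):
    """Count total occurrences and responses containing each keyword.

    Matching is case-insensitive and on whole words only, so "goodness"
    is not counted as "good".
    """
    counts = {w: {"occurrences": 0, "responses": 0} for w in keywords}
    for text in responses:
        tokens = re.findall(r"[a-z']+", text.lower())
        for w in keywords:
            n = tokens.count(w)
            counts[w]["occurrences"] += n
            if n:
                counts[w]["responses"] += 1
    return counts


# Hypothetical open-ended responses, for illustration only.
sample = ["Good survey", "very very nice", "No comments", "NICE"]
for word, c in keyword_counts(sample).items():
    share = 100 * c["responses"] / len(sample)
    print(f"{word}: {c['occurrences']} occurrences "
          f"in {c['responses']} responses ({share:.1f}%)")
```

If the two counts diverge noticeably for a word, the share of participants should be based on the per-response count rather than on total occurrences.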