Evidence that A Large Amount of Low Quality Responses on MTurk Can Be Detected with Repeated GPS Coordinates
Bai, H. (2018) Evidence that A Large Amount of Low Quality Responses on MTurk Can Be Detected with Repeated GPS Coordinates. Retrieved from: https://www.maxhuibai.com/blog/evidence-that-responses-from-repeating-gps-are-random
In the past day or two, I discovered that a large number of responses in my latest MTurk survey appear to be random. A large portion of these random responses have repeating GPS locations. A relatively large number of social psychology researchers also seem to have noticed a drop in data quality in their MTurk data and have detected concerning patterns using the GPS method (see the related discussion threads on Facebook PsychMAP).
What is being done now
I am hoping to organize an effort to determine the scale and the nature of this problem, and I need help from as many people and labs as possible. If you have found potential contamination (see below), I hope you can fill out this survey https://umn.qualtrics.com/jfe/form/SV_3jWIfYENfFUQ1ff. I will post an update and summary of whatever I have on Wednesday, August 15, 2018. The survey will remain open after that.
How to find out if your data is contaminated
One simple way to quickly determine if your data is potentially contaminated is to search for “88639831” in your data (the last digit can differ depending on rounding and the program used). These are the digits after the decimal point in the latitude of one GPS location that was seen in multiple surveys. If you have it in your data, you might want to look at how many participants have repeating (identical) GPS coordinates and either analyze their data separately or exclude them. Responses with this GPS location, as well as any other repeating GPS locations, appear to be random.
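The check described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the column names `LocationLatitude` and `LocationLongitude` are what I assume your survey export uses, the helper names are hypothetical, and rounding to 4 decimals is an arbitrary choice to absorb the rounding differences mentioned above.

```python
from collections import Counter

# Digits reported in the post, minus the last one, which may differ by rounding.
KNOWN_FRAGMENT = "8863983"

def has_known_fragment(latitude, fragment=KNOWN_FRAGMENT):
    """True if the latitude, read as text, contains the reported digits."""
    return fragment in str(latitude)

def flag_repeated_gps(rows, lat_key="LocationLatitude",
                      lon_key="LocationLongitude", decimals=4):
    """Return one boolean per row: True if its rounded (lat, lon) pair
    occurs more than once in the data. Column names are assumptions."""
    pairs = [(round(float(r[lat_key]), decimals),
              round(float(r[lon_key]), decimals)) for r in rows]
    counts = Counter(pairs)
    return [counts[p] > 1 for p in pairs]
```

Here `rows` can be any list of dictionaries, for example the rows produced by `csv.DictReader` on an exported data file. The resulting flags can then be used to split the sample into repeaters and non-repeaters.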
What is being determined right now:
Below are some items that researchers seem to care about the most. I included some very preliminary conclusions from circumstantial evidence and observations.
The scale of the impact (i.e., how many surveys have been affected for how long).
So far it seems that at least about 90 studies from around the world (mostly North America and Europe) have been affected by the issue, and the contamination can be dated back to as early as March 2018. I am hoping to organize an effort to better determine the scale and the nature of this problem, and as mentioned above, I need help from as many people and labs as possible. If you have found potential contamination in your data, I hope you can fill out this survey https://umn.qualtrics.com/jfe/form/SV_3jWIfYENfFUQ1ff
Other characteristics of the repeaters
Many people have reported that the repeaters tend to respond nonsensically to open-ended questions. In my own surveys, I have seen many responses like “good”, “GOOD”, “NICE!”, etc. The repeaters do not appear to take much longer or shorter to fill out my survey than non-repeaters.
What are some effective ways to counter it?
Based on my own studies and a few others, it looks so far like IP addresses cannot tell us much. The repeaters can pass most basic types of attention checks (selecting a particular response or typing certain phrases), but not more sophisticated ones. Some suggest that some of them are able to get past CAPTCHA, but this is still uncertain at the moment.
Evidence that responses from repeating GPS are random
Here is some preliminary evidence from my own data. I think this evidence suggests that the repeaters, whatever they are, are not giving meaningful responses. I am using a survey that is still collecting data (see footnote 1). It has N=578, and 282 (48.8%) of the responses have duplicated (repeating) GPS coordinates. Below are some analyses done separately for repeaters and non-repeaters. With more people joining this effort, I hope this conclusion can be tested with more confidence. I have seen some analyses from other scholars that have yielded similar results (e.g., timryan.web.unc.edu/2018/08/12/data-contamination-on-mturk/)
The reliability of known scales
I tested the reliability of the racial identification scale (see below). For non-repeaters it is .87; for repeaters it is .11.
I also tested the reliability of the symbolic threat scale (see below). For non-repeaters it is .81; for repeaters it is -.01.
Note that these scales include reverse-coded items. If I test the reliability of the racial identification scale without reversing the items that should be reversed, it is -.84 for non-repeaters and .58 for repeaters. After additional review of individual responses, my impression is that repeaters tend to straight-line their responses on the same side of a page, regardless of whether items are reverse coded. Therefore, data from the repeaters do not appear to be very reliable.
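For readers who want to run the same kind of check, here is a minimal sketch. It assumes that "reliability" means Cronbach's alpha, which the analysis above does not state explicitly, and the function names are hypothetical. It reverses items on a 1 to 7 scale and computes alpha from the item variances and the variance of the summed score.

```python
from statistics import variance

def reverse_code(scores, low=1, high=7):
    """Reverse-code a list of scores on a low..high scale (1..7 assumed)."""
    return [low + high - s for s in scores]

def cronbach_alpha(items):
    """Cronbach's alpha (an assumed choice of reliability measure).
    `items` is a list of columns, one list of scores per item,
    all in the same respondent order."""
    k = len(items)
    totals = [sum(vals) for vals in zip(*items)]   # summed score per respondent
    item_var = sum(variance(col) for col in items)  # sum of item variances
    return k / (k - 1) * (1 - item_var / variance(totals))
```

Running `cronbach_alpha` once with the reverse-coded items passed through `reverse_code` and once without, separately for repeaters and non-repeaters, gives the comparison described above. Alpha can be negative when items correlate negatively, which is what you would expect for careful respondents if the reversal step is skipped.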
The distribution of measures with known/expected distribution
I asked my participants about their feelings toward the KKK and the Nazi party on a scale from 0=most unfavorable to 100=most favorable, with 50 as the midpoint. For non-repeaters the average ratings are 8.90 and 8.21, but for repeaters they are 60.82 and 60.02. It seems to me more likely that the repeaters are giving random responses than that they have truly found something likable about the KKK and the Nazi party.
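A comparison like this one is easy to reproduce. The sketch below is only an illustration: `group_means` is a hypothetical helper, and it assumes you already have one repeater flag per response.

```python
from statistics import mean

def group_means(scores, is_repeater):
    """Average a 0-100 rating separately for repeaters and non-repeaters.
    `is_repeater` holds one boolean per score, in the same order."""
    rep = [s for s, flag in zip(scores, is_repeater) if flag]
    non = [s for s, flag in zip(scores, is_repeater) if not flag]
    return {"repeaters": mean(rep), "non_repeaters": mean(non)}
```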
Measures that are known/expected to be correlated
I correlated political ideology and party identification with feelings toward liberal and conservative politicians. I have four politicians who vary by ideology and religion; religion doesn’t really matter here, so I combined them by ideology.
The correlations of feeling toward liberal politicians with ideology and party identification are significant for non-repeaters, r=-.59*** and r=-.57***, but they are not significant at all for repeaters, r=-.13, p=.14 and r=.04, p=.65.
The correlations of feeling toward conservative politicians with ideology and party identification are significant for non-repeaters, r=.53*** and r=.48***. For repeaters, ideology is significant, r=.25**, but party ID is not, r=.10, p=.21.
Therefore, the predictive power of known variables does not appear to hold up for repeaters as well as it does for non-repeaters.
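The split correlations can be sketched as follows. This is a rough illustration with hypothetical function names. It computes a plain Pearson r for each group and does not compute p-values, so significance would still need to be checked in your usual statistics package.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def r_by_group(x, y, is_repeater):
    """Pearson r computed separately for repeaters and non-repeaters.
    `is_repeater` holds one boolean per response, in the same order."""
    out = {}
    for label, wanted in (("repeaters", True), ("non_repeaters", False)):
        xs = [a for a, f in zip(x, is_repeater) if f == wanted]
        ys = [b for b, f in zip(y, is_repeater) if f == wanted]
        out[label] = pearson_r(xs, ys)
    return out
```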
Below are the measures described in my study
Racial identification scale
(1=strongly disagree; 7=strongly agree)
ID_1r Overall, my race has very little to do with how I feel about myself.(reverse coded)
ID_2 My race is an important reflection of who I am.
ID_3r My race is unimportant to my sense of what kind of a person I am. (reverse coded)
ID_4 In general, my race is an important part of my self-image.
Symbolic threat scale
(1=strongly disagree; 7=strongly agree)
SYT1 The values and beliefs of other ethnic groups regarding moral issues are not compatible with the values and beliefs of my ethnic group.
SYT2 The growth of other ethnic groups is undermining American culture.
SYT3r The values and beliefs of other ethnic groups regarding work are compatible with the values and beliefs of my ethnic group. (reverse coded)
How would you describe your ideological preference in general?
1=very liberal; 7=very conservative
How would you describe your political party preference?
1=Strong Democrat ; 7=Strong Republican
Footnote 1: I preregistered the hypotheses, the analysis plan, and all that stuff. I also planned to share the data, so I don’t think data peeking is an issue here.