Social data is not transparent

Comment posted on Southern Fried Science (David Shiffman), post of 10 March 2014, “5 things we discussed in my #scio14 ‘social media as a scientific research tool’ session.”

> it can be inexpensive (even free) and simple to get the data you need.

It may not be as simple as it appears. Take the example of Twitter, probably the most-used and most-studied social data source. Most collection tools rely on either Twitter's Search API or its Streaming API, both of which have known incompleteness and sample bias. So, for example, a collection of “all” tweets using a given hashtag, made with those tools, will likely not include every tweet actually sent with that hashtag. It is also hard to know what proportion of tweets was missed, or whether the missing tweets follow any pattern.

The only data source for which Twitter even claims completeness is the full “firehose” data, available only by arrangement with Twitter or one of its data partners, such as Gnip. Even with this data, there are questions about how its completeness or neutrality might be assessed or verified. The scrupulous path, I think, is to assume there isn’t really any “raw” or self-evidently neutral data from any source as complex and mediated as Twitter; there are just data artifacts, which have to be critically interpreted.
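
To make the completeness question concrete, here is a minimal sketch in Python of what a coverage check might look like, assuming one had a trusted reference collection to compare against (which is exactly what is usually missing). The function name `coverage_by_hour`, the data layout, and the tweet IDs are all made up for illustration; nothing here calls a real Twitter API.

```python
from collections import Counter

def coverage_by_hour(reference, collected_ids):
    """Return the fraction of reference tweets that were collected, per hour.

    reference: iterable of (tweet_id, hour) pairs from a hypothetical
               complete collection.
    collected_ids: set of tweet IDs obtained by the collection tool.
    """
    total = Counter()
    seen = Counter()
    for tweet_id, hour in reference:
        total[hour] += 1
        if tweet_id in collected_ids:
            seen[hour] += 1
    return {hour: seen[hour] / total[hour] for hour in sorted(total)}

# Made-up data: six tweets with a hashtag, two sent in hour 0 and four in hour 1.
reference = [(1, 0), (2, 0), (3, 1), (4, 1), (5, 1), (6, 1)]
# The hypothetical collection tool captured only three of them.
collected = {1, 2, 3}

coverage = coverage_by_hour(reference, collected)
overall = len(collected & {tid for tid, _ in reference}) / len(reference)
```

In this toy case the overall figure says half the tweets were collected, but the per-hour breakdown shows the loss is concentrated in the busier hour. An aggregate estimate alone would hide that pattern, and without a reference set neither number can be computed at all.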


Tim McCormick
Conversary, Palo Alto
@tmccormick tjm.org

Note: I am posting the comment here because, as quite often happens, I wrote the comment, submitted it (after logging in, with a Twitter account in this case), and nothing appeared, with no information to say whether or how it might be posted. Site-specific comment systems are almost all broken, from a commenter’s standpoint.