Wednesday, March 14, 2012

"The idea of capturing the intelligence of the readership -- that's a joke."

Nick Denton, the founder of Gawker Media, made that remark on CNN.
Of course it is, just as raw gold in a gold mine is dust!

I am currently analyzing comments from jezebel.com, the most commented-on of the Gawker Media sites, to see whether we can predict comment quality using machine learning. It is surprisingly simple to translate most of the commenting guidelines into features that a machine learning algorithm could use to assess comment quality. Some features are harder to compute than others, though: for example, how relevant a comment is to the original post, or whether a comment is funny.
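To give a rough idea of what translating a guideline into a feature can look like, here is a minimal Python sketch. The function name `comment_features` and the four features it computes (word count, share of capital letters, exclamation marks, presence of a link) are assumptions chosen for illustration; they are not the actual feature set used in this analysis.

```python
import re


def comment_features(text):
    """Turn one comment into a dict of simple numeric features.

    All features here are hypothetical examples of how a guideline
    such as "don't shout" or "don't spam links" might be encoded.
    """
    words = text.split()
    letters = [c for c in text if c.isalpha()]
    upper = sum(1 for c in letters if c.isupper())
    return {
        # Very short comments often add little to the discussion.
        "word_count": len(words),
        # Share of letters that are upper case ("shouting").
        "caps_ratio": upper / len(letters) if letters else 0.0,
        # Number of exclamation marks in the comment.
        "exclamations": text.count("!"),
        # 1 if the comment contains an http(s) link, else 0.
        "has_link": int(bool(re.search(r"https?://", text))),
    }
```

A dict like this, computed for each comment, could then be fed to whatever classifier you prefer. Features such as relevance to the post or humor would need much more than a few string operations.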

I made this graph with 2000 comments from Jezebel to show the effect of timing on crowd rating. The x-axis shows the time between the original post and the comment, and the y-axis shows the number of comments made within that time interval. For example, 100 comments in the 20-30 bin means that 100 comments were made 20 to 30 minutes after the post was published.
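The binning behind a graph like this can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `bin_by_delay`, the 10-minute bin width, and the half-open intervals (a comment at exactly 30 minutes lands in the 30-40 bin) are my choices here, not details taken from the original analysis.

```python
from collections import Counter
from datetime import datetime


def bin_by_delay(post_time, comment_times, bin_minutes=10):
    """Count comments per time interval after the post.

    Returns a dict mapping (start, end) minute intervals to counts,
    where each interval is half-open: start <= delay < end.
    """
    counts = Counter()
    for t in comment_times:
        # Delay between the post and this comment, in minutes.
        delay_min = (t - post_time).total_seconds() / 60
        # Lower edge of the bin this delay falls into.
        start = int(delay_min // bin_minutes) * bin_minutes
        counts[(start, start + bin_minutes)] += 1
    return dict(sorted(counts.items()))


# Made-up timestamps, for illustration only.
post = datetime(2012, 3, 14, 12, 0)
comments = [
    datetime(2012, 3, 14, 12, 5),
    datetime(2012, 3, 14, 12, 22),
    datetime(2012, 3, 14, 12, 29),
    datetime(2012, 3, 14, 12, 30),
]
bins = bin_by_delay(post, comments)
```

Running the same count over only the promoted comments would give the distribution of highly rated comments over time.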

On Jezebel, the best comments are "promoted" by trusted users. The graph shows that most highly rated comments were made within 2 hours of the post. So if you are late with your awesome comment, there will hardly be any crowd left to judge it.

I found that the two things online communities value most in comment assessment are timing and reputation. If you are late making your comment, then no matter how good it is, there won't be a crowd left to read and rate it. If you have a reputation for making good comments, or lots of friends in the community, your comment that says "wow!!!" would be "liked" by 100 people. The comment rating process seems democratic, but it's very biased.

Automated comment filtering has been researched on many communities, yet I don't think any online community actually uses it. Does anybody know of a community that uses some kind of machine-based moderation?
