Monday, May 28, 2012

Want to change your writing style? Do not use Google Translate

Every time we give our authorship recognition talk, someone says, "Just run your text through a machine translator; it would change your style!"
The same comment was made at IEEE S&P after the authorship recognition talk ("On the Feasibility of Internet-Scale Author Identification").

One of my lab mates experimented with the effect of translation on authorship recognition and found that translation cannot be used to change writing style.
The reason is that a good translator keeps the style intact, while a bad translator changes the style so much that it distorts the original meaning completely.
A good writing style anonymizer should change the style while keeping the meaning of the text intact.

The paper on the effect of translation is still under submission.
Mike mentioned this briefly in his talk at CCC in 2009.

Authorship recognition of multi-authored documents

There is very little work on this topic.
The most recent work is by Moshe Koppel and his team: "Unsupervised Decomposition of a Document into Authorial Components." In this paper they analyzed the Bible as a multi-authored document and decomposed its chapters into two sets: chapters written by Jeremiah and chapters written by Ezekiel. They used synonym usage to distinguish the two authors.

The authors said this approach could be applied iteratively to more than two authors, but you need to know the number of authors in advance.
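
To make the synonym idea concrete, here is a toy sketch in Python. It is not the procedure from the paper, only an illustration under my own assumptions: the synonym groups in SYNONYM_SETS, the function names, and the simple "seed with the two least similar chapters" clustering are all hypothetical. Each chapter becomes a binary vector recording which synonyms it uses, and chapters are then split into two groups by cosine similarity.

```python
from itertools import combinations

# Hypothetical synonym groups; each tuple lists interchangeable words.
SYNONYM_SETS = [
    ("big", "large"),
    ("start", "begin"),
    ("help", "assist"),
]

def synonym_vector(text):
    """Binary vector: 1 if the chapter uses a given synonym, else 0."""
    words = set(text.lower().split())
    return [1 if w in words else 0 for group in SYNONYM_SETS for w in group]

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = (sum(a * a for a in u) ** 0.5) * (sum(b * b for b in v) ** 0.5)
    return dot / norm if norm else 0.0

def split_two_authors(chapters):
    """Assign each chapter to group 0 or 1 based on its synonym choices."""
    vectors = [synonym_vector(c) for c in chapters]
    # Seed the two groups with the pair of chapters that are least similar.
    i, j = min(
        combinations(range(len(vectors)), 2),
        key=lambda p: cosine(vectors[p[0]], vectors[p[1]]),
    )
    # Put every chapter in the group of the seed it resembles more.
    return [
        0 if cosine(v, vectors[i]) >= cosine(v, vectors[j]) else 1
        for v in vectors
    ]

chapters = [
    "the big dog will start to help",
    "a large cat may begin to assist",
    "we start with a big plan",
    "they begin a large project",
]
print(split_two_authors(chapters))  # prints [0, 1, 0, 1]
```

The sketch assumes exactly two authors and needs at least two chapters. A real system would need a proper synonym resource and many more features than three word pairs.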

There are a couple of limitations to this work: they experimented with the Hebrew version of the Bible, there were only two authors, and they decomposed the text chapter-wise.