Research shows AI watermarks easily removed

Researchers at ETH Zurich have shown that watermarks used to identify AI-generated text can be easily removed and copied, rendering them ineffective. These attacks undermine the credibility of watermarks and could be used to pass misleading text off as trustworthy. Watermarking works by hiding patterns in AI-generated text that signal its origin, but the new research suggests the technique is flawed.

Watermarking algorithms categorize words into “green” and “red” lists and make AI models choose words primarily from the green list. However, attackers can reverse-engineer these watermarks by analyzing AI responses and comparing them with normal text. This enables them to steal the watermark and launch two types of attacks: spoofing, where fake watermarked text is created, and scrubbing, where watermarks are removed from AI-generated text.
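To make the mechanics concrete, here is a minimal Python sketch of such a green/red list scheme. It assumes a toy whitespace tokenizer, a single fixed secret key, and an illustrative 0.7 detection threshold; these are simplifying assumptions, not details from the ETH Zurich work, and real schemes operate on a model's vocabulary and logits rather than whole words.

    import hashlib
    import random

    SECRET_KEY = "demo-key"  # illustrative stand-in; real watermark keys stay private

    def is_green(token: str, key: str = SECRET_KEY) -> bool:
        # Deterministically split the vocabulary: hashing each word with the
        # secret key puts roughly half of all words on the green list.
        digest = hashlib.sha256((key + token.lower()).encode()).digest()
        return digest[0] % 2 == 0

    def pick_token(candidates: list[str], rng: random.Random, bias: float = 0.9) -> str:
        # Embedding: prefer green-list candidates with probability `bias`,
        # mimicking how a watermarking decoder nudges the model's word choices.
        green = [c for c in candidates if is_green(c)]
        if green and rng.random() < bias:
            return rng.choice(green)
        return rng.choice(candidates)

    def green_fraction(text: str) -> float:
        # Detection statistic: ordinary text hovers near 0.5, while
        # watermarked text scores much higher.
        tokens = text.split()
        return sum(is_green(t) for t in tokens) / max(len(tokens), 1)

    def looks_watermarked(text: str, threshold: float = 0.7) -> bool:
        return green_fraction(text) >= threshold

Production schemes repartition the lists at each step based on preceding tokens and apply a statistical test to the green count rather than a fixed threshold, but the principle is the same: anyone who can tell green from red can both forge and erase the signal.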

The research team spoofed watermarks successfully 80% of the time and stripped them from AI-generated text 85% of the time. Other researchers, such as Soheil Feizi of the University of Maryland, have also highlighted the vulnerability of watermarks to spoofing attacks.
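To illustrate how such an attack might proceed in the toy setting sketched above: an attacker who collects enough watermarked output can compare word frequencies against ordinary text, guess the green list, and then compose fake "watermarked" text of their own. The add-one smoothing and the 1.5 ratio threshold below are illustrative assumptions, not values from the paper.

    import random
    from collections import Counter

    def estimate_green_list(watermarked_texts: list[str],
                            reference_texts: list[str],
                            ratio: float = 1.5) -> set[str]:
        # Watermark stealing: words markedly over-represented in watermarked
        # output relative to normal text are probably on the green list.
        wm = Counter(t.lower() for x in watermarked_texts for t in x.split())
        ref = Counter(t.lower() for x in reference_texts for t in x.split())
        wm_total = sum(wm.values()) or 1
        ref_total = sum(ref.values()) or 1
        guessed = set()
        for token, count in wm.items():
            wm_freq = count / wm_total
            ref_freq = (ref.get(token, 0) + 1) / (ref_total + 1)  # add-one smoothing
            if wm_freq / ref_freq >= ratio:
                guessed.add(token)
        return guessed

    def spoof(candidates_per_step: list[list[str]], guessed_green: set[str],
              rng: random.Random) -> str:
        # Spoofing: favour the guessed green list so a detector attributes
        # the forged text to the watermarked model.
        chosen = []
        for candidates in candidates_per_step:
            green = [c for c in candidates if c.lower() in guessed_green]
            chosen.append(rng.choice(green) if green else rng.choice(candidates))
        return " ".join(chosen)

Scrubbing works in reverse: in this toy setting, the attacker rewrites AI output by swapping guessed-green words for red alternatives until the detector's score drops back to chance.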

Despite these flaws, watermarking remains a promising method for detecting AI-generated content, but further research is needed to make it reliable. Until then, such detection mechanisms should be deployed at scale only with caution, and expectations about their reliability managed accordingly: the tools are considered useful even if imperfect.

This article is summarised from the original, which appeared in MIT Technology Review.