Finetuning Activates Verbatim Recall of Copyrighted Books in LLMs

· ai books coding ·

TLDR

  • Paper demonstrates that finetuning GPT-4o, Gemini-2.5-Pro, and DeepSeek-V3.1 on plot-summary instructions unlocks verbatim recall of copyrighted book text.

Key Takeaways

  • Research paper (arXiv 2603.20957) by Liu, Mireshghallah, Ginsburg, and Chakrabarty; Jane Ginsburg is a copyright law professor at Columbia, which gives the work legal grounding unusual for an ML paper.
  • Pipeline: convert each EPUB into 300-500-word chunks, generate plot summaries with GPT-4o, finetune using the instruction format “write a N-word excerpt emulating [Author]”, then sample 100 completions per excerpt at temperature 1.0.
  • Four memorization metrics: BMC@k (fraction of book words covered by verbatim spans of length k), longest contiguous memorized block, longest raw regurgitated span per generation, and count of spans exceeding threshold T.
  • Cross-excerpt analysis shows finetuned models regurgitate verbatim text from excerpts other than the one prompted, indicating generalized memorization rather than local overfitting.
  • Cross-model Jaccard similarity of BMC coverage masks reveals whether GPT-4o, Gemini-2.5-Pro, and DeepSeek-V3.1 memorize the same regions of a book, suggesting shared pretraining exposure.
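The pipeline steps above can be sketched in a few lines. The chunking rule (plain word-count splitting, no overlap) and the prompt builder are assumptions based only on the summary; the paper's exact preprocessing may differ:

```python
def chunk_words(text, max_words=500):
    """Split a book's text into word-count chunks.
    Assumption: the paper's 300-500-word chunks come from
    simple word-level splitting; sentence-boundary handling
    and overlap, if any, are not specified here."""
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]

def make_instruction(author, n_words):
    # Instruction template quoted in the summary above.
    return f"write a {n_words}-word excerpt emulating {author}"
```

Each chunk would then be paired with its GPT-4o plot summary to form one finetuning example, with the instruction as the prompt and the chunk as the target.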
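A minimal sketch of the BMC@k coverage idea, assuming word-level tokenization and k-gram matching between the book and the sampled completions (the paper's exact tokenization and matching rules may differ):

```python
def bmc_at_k(book_words, generations, k):
    """BMC@k sketch: fraction of book words covered by verbatim
    spans of at least k words that also appear in a generation.
    Returns (coverage fraction, per-word boolean coverage mask)."""
    # Collect every k-gram occurring in any generated completion.
    gen_kgrams = set()
    for gen in generations:
        words = gen.split()
        for i in range(len(words) - k + 1):
            gen_kgrams.add(tuple(words[i:i + k]))

    # Mark book positions covered by a matching k-gram; overlapping
    # matches naturally merge into longer covered spans.
    covered = [False] * len(book_words)
    for i in range(len(book_words) - k + 1):
        if tuple(book_words[i:i + k]) in gen_kgrams:
            for j in range(i, i + k):
                covered[j] = True
    return sum(covered) / len(book_words), covered
```

The same mask also supports the other metrics: the longest contiguous memorized block is the longest run of `True` values, and span counts fall out of run-length statistics over the mask.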
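The cross-model comparison reduces to Jaccard similarity over per-word boolean coverage masks, one mask per model. A minimal helper, assuming masks of equal length over the same book:

```python
def jaccard_coverage(mask_a, mask_b):
    """Jaccard similarity of two per-word coverage masks
    (True where a model's generations reproduce that book
    word verbatim). High values mean the two models memorized
    the same regions of the book."""
    inter = sum(a and b for a, b in zip(mask_a, mask_b))
    union = sum(a or b for a, b in zip(mask_a, mask_b))
    return inter / union if union else 0.0
```

A score near 1.0 across model pairs would support the shared-pretraining-exposure interpretation; scores near 0 would point to model-specific memorization.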

Hacker News Comment Review

  • No substantive HN discussion yet.
