Paper demonstrates that finetuning GPT-4o, Gemini-2.5-Pro, and DeepSeek-V3.1 on plot-summary instructions unlocks verbatim recall of copyrighted book text.
Key Takeaways
Research paper (arXiv 2603.20957) by Liu, Mireshghallah, Ginsburg, and Chakrabarty; Jane Ginsburg is a copyright law professor at Columbia, which grounds the paper's legal analysis beyond what a typical ML paper offers.
Pipeline: split each EPUB into 300-500-word chunks, generate a plot summary for each chunk with GPT-4o, finetune on instruction pairs of the form "write an N-word excerpt emulating [Author]", then sample 100 completions per excerpt at temperature 1.0.
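A minimal sketch of the data-construction step, assuming sentence-boundary chunking and JSON-style instruction pairs; the helper names and the exact prompt template are illustrative, not taken from the paper's code:

```python
import re

def chunk_words(text: str, min_words: int = 300, max_words: int = 500) -> list[str]:
    """Greedily pack sentences into chunks of roughly 300-500 words."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    chunks, current, count = [], [], 0
    for sent in sentences:
        n = len(sent.split())
        if count + n > max_words and count >= min_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sent)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks

def build_example(summary: str, excerpt: str, author: str) -> dict:
    """One finetuning pair: plot summary in, verbatim book excerpt out.
    Field names and template wording are assumptions, not the paper's."""
    n_words = len(excerpt.split())
    return {
        "instruction": f"Write a {n_words}-word excerpt emulating {author}. "
                       f"Plot summary: {summary}",
        "response": excerpt,
    }
```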
Four memorization metrics: BMC@k (fraction of book words covered by verbatim spans of at least k words shared with a generation), longest contiguous memorized block, longest raw regurgitated span per generation, and the count of spans exceeding a length threshold T.
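The k-gram coverage behind BMC@k and the longest-block metric can be sketched as below; the paper's exact definitions (e.g. how the 100 samples per excerpt are aggregated) may differ:

```python
def coverage_mask(book_words: list[str], gen_words: list[str], k: int) -> list[bool]:
    """Mark each book word covered by a length-k span appearing verbatim in the generation."""
    gen_kgrams = {tuple(gen_words[i : i + k]) for i in range(len(gen_words) - k + 1)}
    mask = [False] * len(book_words)
    for i in range(len(book_words) - k + 1):
        if tuple(book_words[i : i + k]) in gen_kgrams:
            mask[i : i + k] = [True] * k
    return mask

def bmc_at_k(book_words: list[str], gen_words: list[str], k: int) -> float:
    """BMC@k: fraction of book words covered by verbatim spans of length >= k."""
    mask = coverage_mask(book_words, gen_words, k)
    return sum(mask) / max(len(mask), 1)

def longest_block(mask: list[bool]) -> int:
    """Length of the longest contiguous run of covered (memorized) words."""
    best = cur = 0
    for covered in mask:
        cur = cur + 1 if covered else 0
        best = max(best, cur)
    return best
```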
Cross-excerpt analysis shows finetuned models regurgitate verbatim text from excerpts other than the one prompted, indicating generalized memorization rather than local overfitting to the prompted passage.
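A simplified proxy for that analysis, assuming one generation string per prompted excerpt (e.g. the 100 samples concatenated) and k-gram overlap as the coverage measure; off-diagonal mass in the matrix is the cross-excerpt signal:

```python
def kgram_set(words: list[str], k: int) -> set[tuple[str, ...]]:
    return {tuple(words[i : i + k]) for i in range(len(words) - k + 1)}

def cross_excerpt_coverage(excerpts: list[str], generations: list[str],
                           k: int = 10) -> list[list[float]]:
    """coverage[i][j]: fraction of excerpt j's k-grams reproduced verbatim in
    generations prompted with excerpt i's summary. Large off-diagonal values
    mean the model regurgitates text it was never prompted toward."""
    exc = [kgram_set(e.split(), k) for e in excerpts]
    gen = [kgram_set(g.split(), k) for g in generations]
    n = len(excerpts)
    return [[len(exc[j] & gen[i]) / max(len(exc[j]), 1) for j in range(n)]
            for i in range(n)]
```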
Cross-model Jaccard similarity of BMC coverage masks reveals whether GPT-4o, Gemini-2.5-Pro, and DeepSeek-V3.1 memorize the same regions of a book; high overlap suggests shared pretraining exposure.
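Given per-word coverage masks over the same book for two models (as in the BMC@k sketch above), the Jaccard comparison is direct:

```python
def jaccard(mask_a: list[bool], mask_b: list[bool]) -> float:
    """Jaccard similarity of two boolean coverage masks over the same book:
    |intersection| / |union| of the covered word positions."""
    inter = sum(a and b for a, b in zip(mask_a, mask_b))
    union = sum(a or b for a, b in zip(mask_a, mask_b))
    return inter / union if union else 0.0
```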