This repository contains all code for reproducing experiments from the paper Data Mixture Inference: What do BPE Tokenizers Reveal about their Training Data? Given a BPE tokenizer, our attack infers ...
Note: Some of the code here is old and was written when I was learning C++. It might be possible that code is not safe or making wrong assumptions. Please use with caution. Pull requests are always ...