Andy K Zhang, Kevin Klyman, Yifan Mai, Yoav Levine, Yian Zhang, Rishi Bommasani, Percy Liang
andyzh@stanford.edu
Stanford University
Abstract
Language models are extensively evaluated, but correctly interpreting evaluation results requires knowledge of train-test overlap, which refers to the extent to which the language model is trained on the very data it is being tested on.The public currently lacks adequate information about train-test overlap: most models have no public train-test overlap statistics, and third parties cannot directly measure train-test overlap since they do not have access to the training data.To make this clear, we document the practices of 30 models, finding that just 9 models report train-test overlap: 4 models release training data under open-source licenses, enabling the community to directly measure train-test overlap, and 5 models publish their train-test overlap methodology and statistics.By engaging with language model developers, we provide novel information about train-test overlap for three additional models.Overall, we take the position that language model developers should publish train-test overlap statistics and/or training data whenever they report evaluation results on public test sets.We hope our work increases transparency into train-test overlap to increase the community-wide trust in model evaluations.
1 Introduction
The artificial intelligence (AI) community has built hundreds of evaluations to better understand language models(Srivastava etal., 2023; Hendrycks etal., 2021a; Gao etal., 2024; Myrzakhan etal., 2024; Chiang etal., 2024; Rein etal., 2023; Liang etal., 2023).These evaluations cannot be correctly interpreted without knowledge of train-test overlap, which we define as the extent to which the evaluation test data appears in the training data.
Prior to the rise of language models trained on web-scale data, the AI community used standard train/test set splits, where a model would be trained on the training set and tested on the test set to ensure validity of results (Jurafsky & Martin, 2009; Russell & Norvig, 2009).In that regime, the designer of an evaluation generally would specify both the training and test sets.In contrast, today foundation model developers decide on their own training sets, which they often do not release, and evaluation designers decide on test sets, which they often release.Overall, the shift to web-scale training data with poor documentation of data provenance, along with the two-party specification of training and test data, contributes to poor understanding of train-test overlap (Longpre etal., 2023).
Train-test overlap can arise for several reasons.First, since evaluation datasets are often made public on the Internet (e.g. via public repositories like GitHub and Hugging Face), these datasets may be scraped and then trained upon.Second, since evaluation datasets often use already-public material (e.g. the held-out examples in SQuAD still depend on public Wikipedia data (Rajpurkar etal., 2016)), the underlying data may be easily trained upon.Third, since evaluation datasets are often input into models to conduct evaluations (e.g. to evaluate GPT-4 via the OpenAI API), these datasets may be stored and used to train future models.Better understanding how train-test overlap arises may facilitate solutions for appropriately navigating the challenges it presents (Oren etal., 2023).
A growing literature demonstrates high train-test overlap for language models, which contributes to significant degradation in performance between seen and unseen test examples (Lewis etal., 2020; Elangovan etal., 2021; Vu etal., 2023).For example, OpenAI initially reported that GPT-4 had achieved state-of-the-art performance on a test set of Codeforces coding questions, claiming that there was no contamination (OpenAI, 2023).Yet it was later demonstrated111See https://twitter.com/cHHillee/status/1635790330854526981?lang=en that while GPT-4 achieves 100% accuracy for 10 pre-2021 problems, the model achieves 0% accuracy on more recent problems.More generally, Kapoor & Narayanan (2022) document that test data often leaks into training data across many domains.Therefore, improving the community’s understanding train-test overlap will increase the validity of, and trust in, evaluations.
Given the value of understanding train-test overlap, we study the practices of 30 language model.We find that 9 models have published sufficient data for the AI community to contextualize train-test overlap: 4 models (OLMo—AI2, GPT-NeoX—EleutherAI, RedPajama INCITE—Together, StarCoder 2—BigCode/HuggingFace/ServiceNow) have released open-source datasets that the community can inspect for train-test overlap, and 5 models (GPT-4—OpenAI, Llama 3.1—Meta, Qwen2—Alibaba, Palmyra—Writer, Apple Intelligence—Apple) have published their methodology and statistics for train-test overlap.The remaining 23 models do report evaluation results on public test sets, but do not (adequately) report train-test overlap results.In parallel to models reporting train-test overlap, the community is building black-box methods to estimate train-test overlap without access to training data (Golchin & Surdeanu, 2023; Shi etal., 2023; Oren etal., 2023), but these approaches are quite limited at present.We take the position that language model developers should report train-test overlap.
2 Strategies to Estimate and Address Train-Test Overlap
The prevalence of train-test overlap as an issue in the AI community222Potential evidence of train-test overlap is often flagged by members of the AI community on social media. See, e.g., https://twitter.com/dhuynh95/status/1775568278557192411 has led to the development of various strategies to estimate and address train-test overlap, including black-box methods, private test sets, novel test sets, and canary strings.We cover each of these in turn then discuss our approach.
Black-box methods
involve researchers working to estimate train-test overlap through model api access and the test set rather than directly through access to the training set.Notably, there have been efforts to estimate train-test overlap via prompting, word probabilities, and test example orderings (Golchin & Surdeanu, 2023; Shi etal., 2023; Oren etal., 2023). Golchin & Surdeanu (2023) prompt the model with the dataset name, partition type, and an initial segment of the reference string, and mark train-test overlap if the model responds with a exact or similar copy in the output. Shi etal. (2023) estimate train-test overlap via the probability outputs of outlier words, with the hypothesis that unseen examples is likely to contain few outliers with low probabilities. Oren etal. (2023) estimate train-test overlap by considering the ordering of test instances, noting that language models with train-test overlap are likely to memorize such ordering.These methods can be helpful for estimation and as a sanity check to white-box approaches, but have currently have limitations as they are not robust to adversarial settings such as if a developer fine-tuned its model to avoid revealing training data and even in the benign setting, require certain assumptions such as requiring a certain threshold of frequencies for detection or certain methods of training (Casper etal., 2024; Golchin & Surdeanu, 2023; Shi etal., 2023; Oren etal., 2023). Estimating and interpreting train-test overlap is difficult even in the white-box setting with direct access to the training data as current approaches have significant limitations; with further constraints in the black-box setting, the challenges only increase.
Private test sets
such as SQuAD (Rajpurkar etal., 2018) and SEAL (Scale, 2024) allow researchers to keep a portion or all of the test set hidden, meaning that the test set is not publicly accessible on the internet and developers are therefore much less likely to train models on it.While private test sets can be valuable, they raise potential concerns regarding data transparency.For instance, unless the private test set is shared with a trustworthy third party, the community must rely upon a single organization’s assessment of the test set’s validity.In any event, public test sets are the industry standard and will continue to exist, though private and public test sets can coexist in a healthy testing ecosystem.
Novel test sets
that include data that was produced after the knowledge cutoff date of a model also help mitigate train-test overlap.Including recent data is a best practice for new test sets, though this may be difficult if, for example, a new test set is derived from existing data (e.g. based on old Wikipedia data or AI-generated data).Even when this approach is implemented successfully, new models are released regularly that are trained on more recent data, necessitating some analysis of train-test overlap with the previously novel test set.One modification of this approach is to add novel data to the test set at regular intervals, as with Livebench (White etal., 2024) or Image2Struct.333See https://crfm.stanford.edu/helm/image2struct/latest/In addition to the financial cost of continually adding novel data, which may not be feasible for every domain or project, one challenge of this approach is that it is difficult to interpret longitudinal progress.
Canary strings
as introduced by BIG-bench (Luo etal., 2024), are another strategy to cope with train-test overlap.Here, tasks in a test set are marked by a unique string, called a canary string, allowing developers to filter out data that contain canary strings during training.If a model outputs a given canary string, it signals that there is likely train-test overlap with the associated test set.But canary strings are not implemented uniformly or consistently—tasks exist without canary strings, whether within the test set or in other instances across the internet, and canary strings can be easily filtered out of test sets. More often test sets are derived from other raw sources that do not contain the canary string.It is also possible that canary strings may be referenced independently of the tasks in test sets, producing potential false positives.
Our position:
To complement these above approaches, language model developers should report train-test overlap statistics or openly release their training data.A developer chooses the specific test sets it uses to evaluate its language model, and it can choose to report train-test overlap for those test sets (e.g. through a transparency report or a model card) using its preferred method for computing train-test overlap (Bommasani etal., 2024b).This is similar to norms in the field of statistics, where published results must be accompanied by confidence intervals, rather than arbitrary reporting criteria imposed by a third party.This approach would complement existing strategies: for instance, black-box methods and canary strings are powerful tools to sanity check train-test overlap statistics that a developer reports. Similarly, private or novel test sets can further sanity check results on existing public test sets, such as drawing attention to cases of significant divergence.
3 Language Models
To establish a broad understanding of the landscape, we comprehensively consider the train-test overlap practices of the flagship language model of 30 developers (01.ai, Adept, AI2, AI21 Labs, Aleph Alpha, Alibaba, Amazon, Anthropic, Apple, BigCode, Cohere, Databricks, DeepSeek, EleutherAI, Technology Innovation Institute, Google, IBM, Imbue, Inflection, Meta, Microsoft, Mistral, NVIDIA, OpenAI, Reka AI, Snowflake, Stability AI, Together AI, Writer, and xAI).We assembled this list by considering all models on the HELM MMLU leaderboard444See https://crfm.stanford.edu/helm/mmlu/latest/#/leaderboard and additional models that we selected for impact and relevance based on Ecosystem Graphs (Bommasani etal., 2023b).
Next, we selected the latest flagship model for which the developer had published benchmarks results.This is because we emphasize that a developer should disclose information about train-test overlap on the subset of benchmarks that the developer publishes rather than a pre-defined list of benchmarks decided by another party.For some developers, this meant selecting an older flagship model as there are as of yet no published results on the newer model.We chose to exclude developers that have not published results on public language benchmarks such as Baidu.We consider only models with results released before September 1, 2024 (the date we provided as a deadline to model developers to share additional train-test overlap information) and accordingly reported on Qwen2 rather than Qwen2.5, GPT-4 rather than GPT-4o 555We note that GPT-4o system card was published before this date, but there is no new GPT-4o paper available yet so we chose to focus on GPT-4 instead., and OLMo rather than OLMoE.
4 Results
4.1 Documenting Current Practices
We followed a standardized procedure in order to document current practices regarding reporting of train-test overlap statistics.For each developer-model pair, we followed the following process to collate the developer’s current practices with respect to reporting train-test overlap:
- 1.
We identified papers, technical reports, and company websites that were potential sources of information on train-test overlap.
- 2.
We queried and identified any data the developer has published regarding the model’s results on public benchmarks. We documented each public benchmark on which the developer reports results for the model.
- 3.
We queried each document that includes results on public benchmarks for information on train-test overlap. In addition to reading the document, we queried for terms including “contamination”, “overlap”, and “gram”, then manually inspected the occurrence to determine whether any train-test overlap data was released.
4.2 Scoring Criteria
We assign each developer a binary score of 1 or 0 to indicate whether the developer has provided sufficient information to contextualize train-test overlap for its flagship model.In this work we do not evaluate the specific methodology that each developer choses to employ to estimate train-test overlap, as these methodologies are inconsistent, opaque, and often not comprehensive.Instead, we identify whether a developer meets some minimum threshold with respect to publicly reporting some meaningful information about train-test overlap.
In assigning scores to developers, we consider the following criteria:
- 1.
Is the training data publicly available?
- 2.
Is train-test overlap reported on public benchmarks for which the model’s results are reported? That is, for each test set, we want a number that measures overlap. Note that this can be an implicit 0 for those who prefilter their training data.
- (a)
Is train-test overlap reported with sufficient specificity to be meaningful?
- (b)
Is there a clear description of the method the developer used to compute train-test overlap?
- (a)
If none of these criteria are met, then the developer scores 0. If the training data is publicly available, the developer scores 1 as third parties can directly compute train-test overlap statistics for any public test set of interest. If the training data is not publicly available, but train-test overlap is reported with sufficient specificity and a clear description of the method, the developer scores 1.
For each developer that scored 0, we reached out to the developer to provide them an opportunity to engage with or rebut the score. Each of these developers was given the opportunity to point to any relevant information that our analysis was missing, or provide additional information that would be publicized.
4.3 Scores
Model | Developer | Score | Explanation |
Pythia | EleutherAI | 1 | Open training data (Biderman etal., 2023) |
OLMo | AI2 | 1 | Open training data (Groeneveld etal., 2024) |
RedPajama-INCITE 7B | Together AI | 1 | Open training data (Computer, 2024) |
StarCoder 2 | BigCode | 1 | Open training data (Lozhkov etal., 2024) |
Palmyra X V3 | Writer | 1 | Published analysis and code (Writer, 2024) |
GPT-4 | OpenAI | 1 | Published analysis (OpenAI etal., 2024) |
Llama 3.1 | Meta | 1 | Published analysis (Dubey etal., 2024) |
Qwen2 | Alibaba | 1 | Published analysis (Yang etal., 2024) |
Apple Intelligence | Apple | 1 | Published prefiltering (Gunter etal., 2024) |
Gemini 1.5 Pro | 0 | Insufficient methodological details (Team etal., 2024a) | |
Arctic | Snowflake | 0 | No analysis (Snowflake, 2024) |
Claude 3.5 Sonnet | Anthropic | 0 | No analysis (Anthropic, 2024) |
Command R | Cohere | 0 | No analysis (Cohere, 2024) |
Core | Reka AI | 0 | No analysis (Team etal., 2024b) |
DBRX | Databricks | 0 | No analysis (Databricks, 2024) |
DeepSeek | DeepSeek | 0 | No analysis (DeepSeek-AI etal., 2024) |
Falcon | TII | 0 | No analysis (Almazrouei etal., 2023) |
Fuyu-Heavy | Adept | 0 | No analysis (Adept, 2024) |
Granite | IBM | 0 | No analysis (Mishra etal., 2024) |
Grok-2 | xAI | 0 | No analysis (x.ai, 2024) |
Imbue 70B | Imbue | 0 | No analysis (Imbue, 2024) |
Inflection-2.5 | Inflection | 0 | No analysis (AI, 2024a) |
Jamba-1.5 | AI21 Labs | 0 | No analysis (AI21, 2024) |
Luminous Supreme | Aleph Alpha | 0 | No analysis (Alpha, 2024) |
Mixtral Large 2 | Mistral | 0 | No analysis (AI, 2024b) |
Nemotron-4-340B-Instruct | NVIDIA | 0 | No analysis (NVIDIA, 2024) |
Phi 3 | Microsoft | 0 | No analysis (Abdin etal., 2024) |
Stable LM 2 | Stability AI | 0 | No analysis (AI, 2024c) |
Titan Text Express | Amazon | 0 | No analysis (Amazon, 2024) |
Yi-34B | 01.ai | 0 | No analysis (AI etal., 2024) |
Here we document the train-test overlap practices of 30 models with published results on public test sets.Of these, 9 models have published sufficient data for the community to contextualize train-test overlap: 4 models (OLMo—AI2, GPT-NeoX—EleutherAI, RedPajama INCITE—Together, StarCoder 2—BigCode/HuggingFace/ServiceNow) models have released training data under open-source licenses, which researchers can inspect for train-test overlap and 5 models (GPT-4—OpenAI, Llama 3.1—Meta, Qwen2—Alibaba, Palmyra—Writer, Apple Intelligence—Apple) have published their methodology and statistics for train-test overlap. For developers that do not openly release their training data, we provide additional explanation below as to why their transparency regarding train-test overlap is meaningful; for quantification of the degree of train-test overlap for each of these developers, see their associated technical reports.
OpenAI
reports its train-test overlap analysis in the GPT-4 Technical Report (OpenAI etal., 2024).OpenAI reports results for GPT-4 on 8 public test sets and shared train-test overlap analysis for 6 of these test sets.OpenAI etal. (2024) states “We measure cross-contamination between our evaluation dataset and the pre-training data usingsubstring match. Both evaluation and training data are processed by removing all spaces and symbols, keeping only characters (including numbers). For each evaluation example, we randomly select three substrings of 50 characters (or use the entire example if it’s less than 50 characters). A match is identified if any of the three sampled evaluation substrings is a substring of the processed training example. This yields a list of contaminated examples. We discard these and rerun to get uncontaminated scores.”
Meta
reports its train-test overlap analysis in the Llama 3 Technical Report (Dubey etal., 2024, Section 5.1.4).Dubey etal. (2024) write: “Singh et al. (2024) propose to select contamination detection methods empirically,based on which method results in the largest difference between the ‘clean’ part of the dataset and the entire dataset, which they call estimated performance gain. For all our evaluation datasets, we score examples based on 8-gram overlap, a methodthat was found by Singh et al. (2024) to be accurate for many datasets. We consider an example of a dataset D to be contaminated if a ratio TD of its tokens are part of an 8-gram occurring at least once in the pre-training corpus. We select TD separately for each dataset, based on which value shows the maximal significant estimated performance gain across the three model sizes.” Meta reports train-test overlap for Llama 3.1 models on AGIEval, BIG-Bench Hard, BoolQ, CommonSenseQA, GSM8K, HellaSwag, MATH, NaturalQuestions, OpenBookQA, PiQA, QuaC, SiQA, SQuAD, Winogrande, and WorldSense.”
Writer
released train-test overlap statistics for Palmyra X after receiving a request from the authors.Writer ran a script666This script is publicly available at https://github.com/stanford-crfm/data-overlap—we encourage other developers to run it over their training sets and publicly report the results. provided by the authors over its pretraining and instruction-tuning data to evaluate train-test overlap via an n-gram analysis; the results of this analysis are accessible at https://drive.google.com/file/d/1-3_GtZTyIbE1B5XWGGWtLRy5OwVtG0EK/view?usp=sharing.Writer publishes its results on HELM Lite (Writer, 2024), which includes 9 public test sets, and Writer reported train-test overlap on each of the public test sets included in HELM as well as others.Writer found some degree of train-test overlap for 13 of the 27 test sets on which it ran the script (APPS, CivilComments, CNN/Daily Mail, EntityMatching, HumanEval, ICE, LegalSupport, MATH, NarrativeQA, RAFT, The Pile, WikiFact, XSum).
Alibaba
released train-test overlap statistics for Qwen2 via an update to its technical report after receiving a request from the authors (Yang etal., 2024, Section 5.2.6).Yang etal. (2024) conducted an analysis of Qwen2’s training set following OpenAI’s approach, writing that in addition to n-gram matching “we also applied another constraint based on the longest common subsequence (LCS). Specifically, we first remove all symbols and punctuation from both the test and training sequences and perform tokenization. For a training sequence st, we remove it if there is a test sequence se such that and . To assess the potential effects of leaking data on the test performance, we follow OpenAI (2023) to construct a strict non-contaminated test set to check if there is a significant performance degradation after strict decontamination. Specifically, we construct the non-contaminated test set by excluding any sample which has 13-gram overlap with the pre-training or the post-training data (without constrainton LCS), and then compute the corresponding metric on the test set.” Alibaba reports results for Qwen2-72B on 14 public test sets (MMLU, GPQA, TheoremQA, HumanEval, MBPP, MultiPL-E, IFEval, LiveCodeBench v1, GSM8K, MATH, MT-Bench, MixEval, ArenaHard, and AlignBench) and reported train-test overlap on 8 of the public test sets (MMLU, GPQA, HumanEval, MBPP, MultiPL-E, GSM8K, MATH, and IFEval).
Apple
reports train-test overlap statistics for Apple Intelligence on 24 benchmarks (MMLU, GSM8K, HellaSwag, WinoGrande, NarrativeQA, Natural Questions, OpenBookQA, MATH_CoT, LegalBench, MedQA, WMT-2014, IFEval, AlpacaEval, ArenaHard, Berkeley Functional Calling, arc_challenge, arc_easy, lambada, piqa, sciq, triviaQA, webqs, HumanEval, MultiPLE-Swift); of these, Apple prefiltered its training data against 12 (MMLU, GSM8K, HellaSwag, WinoGrande, OpenBookQA, arc_challenge, arc_easy, lambada, piqa, sciq, triviaQA, webqs), filtering documents upon 4-13 gram collisions unless the n-gram reaches a “common-usage” threshold of 1000 (Gunter etal., 2024).Apple provided specificity about the benchmarks for which its training data was prefiltered after receiving a request from the authors.
5 Discussion
Overall, while train-test overlap is a fundamental to interpreting evaluation results, there is still significant limitations in the measurement methodology, beyond data access challenges and developer responsibility. As described above, direct string comparison is the most common way to quantify train-test overlap.This method has slight variations, but typically involves detecting substring matches between training and test data. N-gram matching is commonly employed (Yang etal., 2024; Dubey etal., 2024; Brown etal., 2020; Chowdhery etal., 2022), where documents are tokenized and then compared, though OpenAI compares characters rather than tokens in its analysis for GPT-4 (OpenAI, 2023). There are important design decisions developers make in employing n-gram strategies, such as what to set as N, whether to allow fuzzy matching or skipgrams (Dubey etal., 2024), and whether to filter based only a threshold of matches. This lack of uniformity in measurement approaches can make it challenging to directly compare train-test overlap analyses.
Benchmark | Overlap Type | Example |
APPS | Phrase (Question) | Input—– The first line of the input contains a single integer n (1 n 100 000) |
Phrase (Sequence) | [’A’, ’B’, ’C’, ’D’, ’E’, ’F’, ’G’, ’H’, ’I’, ’J’, ’K’, ’L’, ’M’, ’N’, ’O’, ’P’, ’Q’, ’R’, ’S’, ’T’, ’U’, ’V’, ’W’, ’X’, ’Y’, ’Z’] | |
CivilComments | Quote (News) | Crews removed Baltimore’s Confederate statues early Wednesday, days after the deadly unrest in Charlottesville |
EntityDataImputation | Multi-Token Identifier | AF-S DX Nikkor 18-55mm f/3.5-5.6G VR Lens |
HumanEval | Phrase (Math) | prime number, or a prime, is a natural number greater than 1 that |
Phrase (Sequence) | ’zero’: 0, ’one’: 1, ’two’: 2, ’three’: 3, ’four’: 4, ’five’: 5, ’six’: 6, ’seven’: 7, ’eight’: 8, ’nine’: 9 | |
ICE | Phrase (Legal) | with such intention, and under such circumstances that if by that act, you had caused the death of |
LSAT | Phrase (Question) | Which one of the following could be a complete and accurate list of the |
MATH | Question Leakage | 2-4+6-8+10-12+143-6+9-12+15-18+21 |
Phrase (Question) | where is the area of the base and is the height | |
MMLU | Question Leakage | If K is a normal subgroup of H and H is a normal subgroup of G, then K is a normal subgroup of G. |
NarrativeQA | Quote (Book) | M. Emanuel was away three years. Reader, they were the three happiest years of my life |
RAFT | Quote (News) | the French language is very difficult but we try hard. If we come every day, maybe |
RealToxicityPrompts | Quote (News) | When it comes to dashing the hopes of thousands of college-bound African |
ThePile | Phrase (Speech) | If you have any questions, please feel free to give me a call. |
Multi-Token Identifier | the U.S. District Court for the Northern District of New York | |
Phrase (Sequence) | Mime-Version: 1.0 Content-type: text/plain; charset=us-ascii Content-Disposition: inline | |
TwitterAAE | Quote (Song) | to be this way. Where did we go wrong we both made mistakes we gotta carry on |
Summarization | Quote (News) | There are still many questions that the families of the 96 have and we believe that these people may be able to provide answers to some of those questions |
Wikifact | Quote (Wikipedia) | swimming at the 1896 Summer Olympics – men’s sailors 100 metre freestyle |
There are a number of limitations to this class of approaches. One limitation is that n-grams are coarse and do not capture the differences in types of overlap between different test sets. For instance, CivilComments is derived from news sites, so overlap between training data and CivilComments is likely due to news articles that appear in both the training and test data (Duchene etal., 2023).In contrast, MMLU (Hendrycks etal., 2021a) and MATH (Hendrycks etal., 2021b) are in question-answer format, so overlap could stem from leakage of questions or answers or repetition of common phrases in questions and answers. We categorize the overlap types for different scenarios for The Pile (Gao etal., 2020) in 2. Overlap types can be broadly categorized into question leakage; quotes from news, laws, books, and songs; common phrases; and multi-token identifiers. Question leakage is the canonical concern for training data: if a model trains on the questions in the test set, it can achieve high results that fail to generalize well to new questions. However, these other overlap types can also be informative; for instance, simply matching the LSAT question stem “Which one of the following could be a complete and accurate list of the” suggests that the training data likely contains LSAT questions or similar questions. Indeed, train-test overlap is a construct that captures the relation between train and test data, and the different type of overlap add complexity that make it difficult to capture in a single statistic. Future work could explore these various overlap types in more detail and devise more granular metrics of measurement.
Another limitation is that n-gram analysis fails to catch many classes of train-test overlap that may be relevant, such as translations, summaries, or paraphrases of the text (Lee etal., 2024).Yang etal. (2023) demonstrate that there can be significant train-test overlap even with OpenAI’s prefiltering methods. Prior work has made progress on addressing this limitation, including by making use of embeddings or an LM-evaluator for a more semantic-based match (Dong etal., 2024; Jiang etal., 2024).These gaps demonstrate the need for further work on developing improved methods for estimating train-test overlap.
Indeed, in light of these limitations, we note that we are fundamentally interested in measuring the amount of generalization, rather than direct string matches or any specific approaches. This could extend to train-test overlap at the task or domain level. Additionally, it highlights that unlike the common perception, train-test overlap is not necessarily a negative (in part why we choose this term as opposed to “contamination”). Instead, it is helpful to guide understanding and help contextualize results.
Additionally, there are complexities in determining what qualifies as the training set, as there are often multiple stages of training and datasets, including pretraining, fine-tuning, and safety alignment among others (Yang etal., 2024; Dubey etal., 2024; Brown etal., 2020; Chowdhery etal., 2022; OpenAI, 2023). This is often not captured in developers’ public reports (Bommasani etal., 2024a), and it may not be well captured internally either. Precision about the training set is important, though it is beyond the scope of this paper.
This paper does not assert a position on which method is best, and acknowledges that there is substantial research remaining to investigate better methods of computing train-test overlap.Nevertheless, the limitations of black-box approaches are far greater than those of white-box approaches.Just as the Foundation Model Transparency Index has helped improve the transparency of foundation model developers (Bommasani etal., 2023a; 2024a), our hope is that an increase in the number of model developers that report train-test overlap will produce better methods of measurement and help standardize reporting such that developers’ transparency on train-test overlap improves.
6 Conclusion
In this work, we highlight the need to improve transparency of train-test overlap.Our position is that any language model developer that publishes results on public test sets should release its training data and/or publish accompanying train-test overlap so that the community can interpret the results.We discuss various strategies to address train-test overlap, and how our position complements these efforts. We document the train-test overlap practices of 30 models with published results on public test sets.Of these, 9 models have published sufficient data for the community to contextualize train-test overlap.Finally, we discuss limitations with current approaches to quantifying train-test overlap, while emphasizing that current methods still have value.Instead, we suggest that as the AI community increasingly becomes aware of train-test overlap we can continue to improve upon and align on methodology for measuring and reducing train-test overlap.
References
- Abdin etal. (2024)Marah Abdin, Jyoti Aneja, Hany Awadalla, Ahmed Awadallah, AmmarAhmad Awan, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Jianmin Bao, Harkirat Behl, Alon Benhaim, Misha Bilenko, Johan Bjorck, Sébastien Bubeck, Martin Cai, Qin Cai, Vishrav Chaudhary, Dong Chen, Dongdong Chen, Weizhu Chen, Yen-Chun Chen, Yi-Ling Chen, Hao Cheng, Parul Chopra, Xiyang Dai, Matthew Dixon, Ronen Eldan, Victor Fragoso, Jianfeng Gao, Mei Gao, Min Gao, Amit Garg, AllieDel Giorno, Abhishek Goswami, Suriya Gunasekar, Emman Haider, Junheng Hao, RussellJ. Hewett, Wenxiang Hu, Jamie Huynh, Dan Iter, SamAde Jacobs, Mojan Javaheripi, Xin Jin, Nikos Karampatziakis, Piero Kauffmann, Mahoud Khademi, Dongwoo Kim, YoungJin Kim, Lev Kurilenko, JamesR. Lee, YinTat Lee, Yuanzhi Li, Yunsheng Li, Chen Liang, Lars Liden, Xihui Lin, Zeqi Lin, CeLiu, Liyuan Liu, Mengchen Liu, Weishung Liu, Xiaodong Liu, Chong Luo, Piyush Madan, Ali Mahmoudzadeh, David Majercak, Matt Mazzola, Caio CésarTeodoro Mendes, Arindam Mitra, Hardik Modi, Anh Nguyen,Brandon Norick, Barun Patra, Daniel Perez-Becker, Thomas Portet, Reid Pryzant, Heyang Qin, Marko Radmilac, Liliang Ren, Gustavo deRosa, Corby Rosset, Sambudha Roy, Olatunji Ruwase, Olli Saarikivi, Amin Saied, Adil Salim, Michael Santacroce, Shital Shah, Ning Shang, Hiteshi Sharma, Yelong Shen, Swadheen Shukla, Xia Song, Masahiro Tanaka, Andrea Tupini, Praneetha Vaddamanu, Chunyu Wang, Guanhua Wang, Lijuan Wang, Shuohang Wang, Xin Wang, YuWang, Rachel Ward, Wen Wen, Philipp Witte, Haiping Wu, Xiaoxia Wu, Michael Wyatt, Bin Xiao, Can Xu, Jiahang Xu, Weijian Xu, Jilong Xue, Sonali Yadav, Fan Yang, Jianwei Yang, Yifan Yang, Ziyi Yang, Donghan Yu, LuYuan, Chenruidong Zhang, Cyril Zhang, Jianwen Zhang, LiLyna Zhang, YiZhang, Yue Zhang, Yunan Zhang, and Xiren Zhou.Phi-3 technical report: A highly capable language model locally on your phone, 2024.URL https://arxiv.org/abs/2404.14219.
- Adept (2024)Adept.Adept fuyu heavy, 2024.URL https://www.adept.ai/blog/adept-fuyu-heavy.
- AI etal. (2024)01. AI, :, Alex Young, Bei Chen, Chao Li, Chengen Huang, GeZhang, Guanwei Zhang, Heng Li, Jiangcheng Zhu, Jianqun Chen, Jing Chang, Kaidong Yu, Peng Liu, Qiang Liu, Shawn Yue, Senbin Yang, Shiming Yang, Tao Yu, Wen Xie, Wenhao Huang, Xiaohui Hu, Xiaoyi Ren, Xinyao Niu, Pengcheng Nie, Yuchi Xu, Yudong Liu, Yue Wang, Yuxuan Cai, Zhenyu Gu, Zhiyuan Liu, and Zonghong Dai.Yi: Open foundation models by 01.ai, 2024.URL https://arxiv.org/abs/2403.04652.
- AI (2024a)Inflection AI.Inflection 2.5 announcement.https://inflection.ai/blog/inflection-2-5, 2024a.
- AI (2024b)Mistral AI.Mistral large model announcement.https://mistral.ai/news/mistral-large-2407/, 2024b.
- AI (2024c)Stability AI.Introducing stable lm 2.https://stability.ai/news/introducing-stable-lm-2, 2024c.
- AI21 (2024)AI21.Jamba-1.5 models, 2024.URL https://docs.ai21.com/docs/jamba-15-models.
- Almazrouei etal. (2023)Ebtesam Almazrouei, Hamza Alobeidli, Abdulaziz Alshamsi, Alessandro Cappelli, Ruxandra Cojocaru, Mérouane Debbah, Étienne Goffinet, Daniel Hesslow, Julien Launay, Quentin Malartic, Daniele Mazzotta, Badreddine Noune, Baptiste Pannier, and Guilherme Penedo.The falcon series of open language models, 2023.URL https://arxiv.org/abs/2311.16867.
- Alpha (2024)Aleph Alpha.Luminous performance benchmarks, 2024.URL https://aleph-alpha.com/luminous-performance-benchmarks/.
- Amazon (2024)Amazon.Aws ai service cards – amazon titan text lite and titan text express, 2024.URL https://aws.amazon.com/machine-learning/responsible-machine-learning/titan-text/.
- Anthropic (2024)Anthropic.Claude 3.5 sonnet, 2024.URL https://www.anthropic.com/news/claude-3-5-sonnet.
- Biderman etal. (2023)Stella Biderman, Hailey Schoelkopf, Quentin Anthony, Herbie Bradley, Kyle O’Brien, Eric Hallahan, MohammadAflah Khan, Shivanshu Purohit, USVSNSai Prashanth, Edward Raff, Aviya Skowron, Lintang Sutawika, and Oskar vander Wal.Pythia: A suite for analyzing large language models across training and scaling, 2023.URL https://arxiv.org/abs/2304.01373.
- Bommasani etal. (2023a)Rishi Bommasani, Kevin Klyman, Shayne Longpre, Sayash Kapoor, Nestor Maslej, Betty Xiong, Daniel Zhang, and Percy Liang.The foundation model transparency index, 2023a.URL https://arxiv.org/abs/2310.12941.
- Bommasani etal. (2023b)Rishi Bommasani, Dilara Soylu, Thomas Liao, KathleenA. Creel, and Percy Liang.Ecosystem graphs: The social footprint of foundation models.arXiv, 2023b.
- Bommasani etal. (2024a)Rishi Bommasani, Kevin Klyman, Sayash Kapoor, Shayne Longpre, Betty Xiong, Nestor Maslej, and Percy Liang.The foundation model transparency index v1.1: May 2024, 2024a.URL https://arxiv.org/abs/2407.12929.
- Bommasani etal. (2024b)Rishi Bommasani, Kevin Klyman, Shayne Longpre, Betty Xiong, Sayash Kapoor, Nestor Maslej, Arvind Narayanan, and Percy Liang.Foundation model transparency reports, 2024b.URL https://arxiv.org/abs/2402.16268.
- Brown etal. (2020)TomB. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, DanielM. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei.Language models are few-shot learners, 2020.URL https://arxiv.org/abs/2005.14165.
- Casper etal. (2024)Stephen Casper, Carson Ezell, Charlotte Siegmann, Noam Kolt, TaylorLynn Curtis, Benjamin Bucknall, Andreas Haupt, Kevin Wei, Jérémy Scheurer, Marius Hobbhahn, etal.Black-box access is insufficient for rigorous ai audits.In The 2024 ACM Conference on Fairness, Accountability, and Transparency, pp. 2254–2272, 2024.
- Chiang etal. (2024)Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, AnastasiosNikolas Angelopoulos, Tianle Li, Dacheng Li, Hao Zhang, Banghua Zhu, Michael Jordan, JosephE. Gonzalez, and Ion Stoica.Chatbot arena: An open platform for evaluating llms by human preference, 2024.URL https://arxiv.org/abs/2403.04132.
- Chowdhery etal. (2022)Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, HyungWon Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, YiTay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bradbury, Jacob Austin, Michael Isard, Guy Gur-Ari, Pengcheng Yin, Toju Duke, Anselm Levskaya, Sanjay Ghemawat, Sunipa Dev, Henryk Michalewski, Xavier Garcia, Vedant Misra, Kevin Robinson, Liam Fedus, Denny Zhou, Daphne Ippolito, David Luan, Hyeontaek Lim, Barret Zoph, Alexander Spiridonov, Ryan Sepassi, David Dohan, Shivani Agrawal, Mark Omernick, AndrewM. Dai, ThanumalayanSankaranarayana Pillai, Marie Pellat, Aitor Lewkowycz, Erica Moreira, Rewon Child, Oleksandr Polozov, Katherine Lee, Zongwei Zhou, Xuezhi Wang, Brennan Saeta, Mark Diaz, Orhan Firat, Michele Catasta, Jason Wei, Kathy Meier-Hellstern, Douglas Eck, Jeff Dean, Slav Petrov, and Noah Fiedel.Palm: Scaling language modeling with pathways, 2022.URL https://arxiv.org/abs/2204.02311.
- Cohere (2024)Cohere.Introducing command-r.https://cohere.com/blog/command-r, 2024.
- Computer (2024)Computer.Redpajama dataset.https://github.com/togethercomputer/RedPajama-Data, 2024.
- Databricks (2024)Databricks.Introducing dbrx: New state-of-the-art open llm.https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm, 2024.
- DeepSeek-AI etal. (2024)DeepSeek-AI, :, Xiao Bi, Deli Chen, Guanting Chen, Shanhuang Chen, Damai Dai, Chengqi Deng, Honghui Ding, Kai Dong, Qiushi Du, Zhe Fu, Huazuo Gao, Kaige Gao, Wenjun Gao, Ruiqi Ge, Kang Guan, Daya Guo, Jianzhong Guo, Guangbo Hao, Zhewen Hao, Ying He, Wenjie Hu, Panpan Huang, Erhang Li, Guowei Li, Jiashi Li, Yao Li, Y.K. Li, Wenfeng Liang, Fangyun Lin, A.X. Liu, BoLiu, Wen Liu, Xiaodong Liu, Xin Liu, Yiyuan Liu, Haoyu Lu, Shanghao Lu, Fuli Luo, Shirong Ma, Xiaotao Nie, Tian Pei, Yishi Piao, Junjie Qiu, Hui Qu, Tongzheng Ren, Zehui Ren, Chong Ruan, Zhangli Sha, Zhihong Shao, Junxiao Song, Xuecheng Su, Jingxiang Sun, Yaofeng Sun, Minghui Tang, Bingxuan Wang, Peiyi Wang, Shiyu Wang, Yaohui Wang, Yongji Wang, Tong Wu, Y.Wu, Xin Xie, Zhenda Xie, Ziwei Xie, Yiliang Xiong, Hanwei Xu, R.X. Xu, Yanhong Xu, Dejian Yang, Yuxiang You, Shuiping Yu, Xingkai Yu, B.Zhang, Haowei Zhang, Lecong Zhang, Liyue Zhang, Mingchuan Zhang, Minghua Zhang, Wentao Zhang, Yichao Zhang, Chenggang Zhao, Yao Zhao, Shangyan Zhou, ShunfengZhou, Qihao Zhu, and Yuheng Zou.Deepseek llm: Scaling open-source language models with longtermism, 2024.URL https://arxiv.org/abs/2401.02954.
- Dong etal. (2024)Yihong Dong, Xue Jiang, Huanyu Liu, Zhi Jin, Bin Gu, Mengfei Yang, and GeLi.Generalization or memorization: Data contamination and trustworthy evaluation for large language models, 2024.URL https://arxiv.org/abs/2402.15938.
- Dubey etal. (2024)Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aurelien Rodriguez, Austen Gregerson, Ava Spataru, Baptiste Roziere, Bethany Biron, Binh Tang, Bobbie Chern, Charlotte Caucheteux, Chaya Nayak, Chloe Bi, Chris Marra, Chris McConnell, Christian Keller, Christophe Touret, Chunyang Wu, Corinne Wong, CristianCanton Ferrer, Cyrus Nikolaidis, Damien Allonsius, Daniel Song, Danielle Pintz, Danny Livshits, David Esiobu, Dhruv Choudhary, Dhruv Mahajan, Diego Garcia-Olano, Diego Perino, Dieuwke Hupkes, Egor Lakomkin, Ehab AlBadawy, Elina Lobanova, Emily Dinan, EricMichael Smith, Filip Radenovic, Frank Zhang, Gabriel Synnaeve, Gabrielle Lee, GeorgiaLewis Anderson, Graeme Nail, Gregoire Mialon, Guan Pang, Guillem Cucurell, Hailey Nguyen, Hannah Korevaar, HuXu, Hugo Touvron, Iliyan Zarov,ImanolArrieta Ibarra, Isabel Kloumann, Ishan Misra, Ivan Evtimov, Jade Copet, Jaewon Lee, Jan Geffert, Jana Vranes, Jason Park, Jay Mahadeokar, Jeet Shah, Jelmer vander Linde, Jennifer Billock, Jenny Hong, Jenya Lee, Jeremy Fu, Jianfeng Chi, Jianyu Huang, Jiawen Liu, Jie Wang, Jiecao Yu, Joanna Bitton, Joe Spisak, Jongsoo Park, Joseph Rocca, Joshua Johnstun, Joshua Saxe, Junteng Jia, KalyanVasuden Alwala, Kartikeya Upasani, Kate Plawiak, KeLi, Kenneth Heafield, Kevin Stone, Khalid El-Arini, Krithika Iyer, Kshitiz Malik, Kuenley Chiu, Kunal Bhalla, Lauren Rantala-Yeary, Laurens vander Maaten, Lawrence Chen, Liang Tan, Liz Jenkins, Louis Martin, Lovish Madaan, Lubo Malo, Lukas Blecher, Lukas Landzaat, Luke deOliveira, Madeline Muzzi, Mahesh Pasupuleti, Mannat Singh, Manohar Paluri, Marcin Kardas, Mathew Oldham, Mathieu Rita, Maya Pavlova, Melanie Kambadur, Mike Lewis, Min Si, MiteshKumar Singh, Mona Hassan, Naman Goyal, Narjes Torabi, Nikolay Bashlykov, Nikolay Bogoychev, Niladri Chatterji, OlivierDuchenne, Onur Çelebi, Patrick Alrassy, Pengchuan Zhang, Pengwei Li, Petar Vasic, Peter Weng, Prajjwal Bhargava, Pratik Dubal, Praveen Krishnan, PunitSingh Koura, Puxin Xu, Qing He, Qingxiao Dong, Ragavan Srinivasan, Raj Ganapathy, Ramon Calderer, RicardoSilveira Cabral, Robert Stojnic, Roberta Raileanu, Rohit Girdhar, Rohit Patel, Romain Sauvestre, Ronnie Polidoro, Roshan Sumbaly, Ross Taylor, Ruan Silva, Rui Hou, Rui Wang, Saghar Hosseini, Sahana Chennabasappa, Sanjay Singh, Sean Bell, SeohyunSonia Kim, Sergey Edunov, Shaoliang Nie, Sharan Narang, Sharath Raparthy, Sheng Shen, Shengye Wan, Shruti Bhosale, Shun Zhang, Simon Vandenhende, Soumya Batra, Spencer Whitman, Sten Sootla, Stephane Collot, Suchin Gururangan, Sydney Borodinsky, Tamar Herman, Tara Fowler, Tarek Sheasha, Thomas Georgiou, Thomas Scialom, Tobias Speckbacher, Todor Mihaylov, Tong Xiao, Ujjwal Karn, Vedanuj Goswami, Vibhor Gupta, Vignesh Ramanathan, Viktor Kerkez, Vincent Gonguet, Virginie Do, Vish Vogeti, Vladan Petrovic, Weiwei Chu,Wenhan Xiong, Wenyin Fu, Whitney Meers, Xavier Martinet, Xiaodong Wang, XiaoqingEllen Tan, Xinfeng Xie, Xuchao Jia, Xuewei Wang, Yaelle Goldschlag, Yashesh Gaur, Yasmine Babaei, YiWen, Yiwen Song, Yuchen Zhang, Yue Li, Yuning Mao, ZacharieDelpierre Coudert, Zheng Yan, Zhengxing Chen, Zoe Papakipos, Aaditya Singh, Aaron Grattafiori, Abha Jain, Adam Kelsey, Adam Shajnfeld, Adithya Gangidi, Adolfo Victoria, Ahuva Goldstand, Ajay Menon, Ajay Sharma, Alex Boesenberg, Alex Vaughan, Alexei Baevski, Allie Feinstein, Amanda Kallet, Amit Sangani, Anam Yunus, Andrei Lupu, Andres Alvarado, Andrew Caples, Andrew Gu, Andrew Ho, Andrew Poulton, Andrew Ryan, Ankit Ramchandani, Annie Franco, Aparajita Saraf, Arkabandhu Chowdhury, Ashley Gabriel, Ashwin Bharambe, Assaf Eisenman, Azadeh Yazdan, Beau James, Ben Maurer, Benjamin Leonhardi, Bernie Huang, Beth Loyd, BetoDe Paola, Bhargavi Paranjape, Bing Liu, BoWu, Boyu Ni, Braden Hancock, Bram Wasti, Brandon Spence, Brani Stojkovic, Brian Gamido, Britt Montalvo, CarlParker, Carly Burton, Catalina Mejia, Changhan Wang, Changkyu Kim, Chao Zhou, Chester Hu, Ching-Hsiang Chu, Chris Cai, Chris Tindal, Christoph Feichtenhofer, Damon Civin, Dana Beaty, Daniel Kreymer, Daniel Li, Danny Wyatt, David Adkins, David Xu, Davide Testuggine, Delia David, Devi Parikh, Diana Liskovich, Didem Foss, Dingkang Wang, Duc Le, Dustin Holland, Edward Dowling, Eissa Jamil, Elaine Montgomery, Eleonora Presani, Emily Hahn, Emily Wood, Erik Brinkman, Esteban Arcaute, Evan Dunbar, Evan Smothers, Fei Sun, Felix Kreuk, Feng Tian, Firat Ozgenel, Francesco Caggioni, Francisco Guzmán, Frank Kanayet, Frank Seide, GabrielaMedina Florez, Gabriella Schwarz, Gada Badeer, Georgia Swee, Gil Halpern, Govind Thattai, Grant Herman, Grigory Sizov, Guangyi, Zhang, Guna Lakshminarayanan, Hamid Shojanazeri, Han Zou, Hannah Wang, Hanwen Zha, Haroun Habeeb, Harrison Rudolph, Helen Suk, Henry Aspegren, Hunter Goldman, Ibrahim Damlaj, Igor Molybog, Igor Tufanov, Irina-Elena Veliche, Itai Gat, Jake Weissman, JamesGeboski, James Kohli, Japhet Asher, Jean-Baptiste Gaya, Jeff Marcus, Jeff Tang, Jennifer Chan, Jenny Zhen, Jeremy Reizenstein, Jeremy Teboul, Jessica Zhong, Jian Jin, Jingyi Yang, Joe Cummings, Jon Carvill, Jon Shepard, Jonathan McPhie, Jonathan Torres, Josh Ginsburg, Junjie Wang, Kai Wu, KamHou U, Karan Saxena, Karthik Prasad, Kartikay Khandelwal, Katayoun Zand, Kathy Matosich, Kaushik Veeraraghavan, Kelly Michelena, Keqian Li, Kun Huang, Kunal Chawla, Kushal Lakhotia, Kyle Huang, Lailin Chen, Lakshya Garg, Lavender A, Leandro Silva, Lee Bell, Lei Zhang, Liangpeng Guo, Licheng Yu, Liron Moshkovich, Luca Wehrstedt, Madian Khabsa, Manav Avalani, Manish Bhatt, Maria Tsimpoukelli, Martynas Mankus, Matan Hasson, Matthew Lennie, Matthias Reso, Maxim Groshev, Maxim Naumov, Maya Lathi, Meghan Keneally, MichaelL. Seltzer, Michal Valko, Michelle Restrepo, Mihir Patel, Mik Vyatskov, Mikayel Samvelyan, Mike Clark, Mike Macey, Mike Wang, MiquelJubert Hermoso, MoMetanat, Mohammad Rastegari, Munish Bansal, NandhiniSanthanam, Natascha Parks, Natasha White, Navyata Bawa, Nayan Singhal, Nick Egebo, Nicolas Usunier, NikolayPavlovich Laptev, Ning Dong, Ning Zhang, Norman Cheng, Oleg Chernoguz, Olivia Hart, Omkar Salpekar, Ozlem Kalinli, Parkin Kent, Parth Parekh, Paul Saab, Pavan Balaji, Pedro Rittner, Philip Bontrager, Pierre Roux, Piotr Dollar, Polina Zvyagina, Prashant Ratanchandani, Pritish Yuvraj, Qian Liang, Rachad Alao, Rachel Rodriguez, Rafi Ayub, Raghotham Murthy, Raghu Nayani, Rahul Mitra, Raymond Li, Rebekkah Hogan, Robin Battey, Rocky Wang, Rohan Maheswari, Russ Howes, Ruty Rinott, SaiJayesh Bondu, Samyak Datta, Sara Chugh, Sara Hunt, Sargun Dhillon, Sasha Sidorov, Satadru Pan, Saurabh Verma, Seiji Yamamoto, Sharadh Ramaswamy, Shaun Lindsay, Shaun Lindsay, Sheng Feng, Shenghao Lin, ShengxinCindy Zha, Shiva Shankar, Shuqiang Zhang, Shuqiang Zhang, Sinong Wang, Sneha Agarwal, Soji Sajuyigbe, Soumith Chintala, Stephanie Max, Stephen Chen, Steve Kehoe, Steve Satterfield, Sudarshan Govindaprasad, Sumit Gupta,Sungmin Cho, Sunny Virk, Suraj Subramanian, SyChoudhury, Sydney Goldman, Tal Remez, Tamar Glaser, Tamara Best, Thilo Kohler, Thomas Robinson, Tianhe Li, Tianjun Zhang, Tim Matthews, Timothy Chou, Tzook Shaked, Varun Vontimitta, Victoria Ajayi, Victoria Montanez, Vijai Mohan, VinaySatish Kumar, Vishal Mangla, Vítor Albiero, Vlad Ionescu, Vlad Poenaru, VladTiberiu Mihailescu, Vladimir Ivanov, Wei Li, Wenchen Wang, Wenwen Jiang, Wes Bouaziz, Will Constable, Xiaocheng Tang, Xiaofang Wang, Xiaojian Wu, Xiaolan Wang, Xide Xia, Xilun Wu, Xinbo Gao, Yanjun Chen, YeHu, YeJia, YeQi, Yenda Li, Yilin Zhang, Ying Zhang, Yossi Adi, Youngjin Nam, Yu, Wang, Yuchen Hao, Yundi Qian, Yuzi He, Zach Rait, Zachary DeVito, Zef Rosnbrick, Zhaoduo Wen, Zhenyu Yang, and Zhiwei Zhao.The llama 3 herd of models, 2024.URL https://arxiv.org/abs/2407.21783.
- Duchene etal. (2023)Corentin Duchene, Henri Jamet, Pierre Guillaume, and Reda Dehak.A benchmark for toxic comment classification on civil comments dataset, 2023.URL https://arxiv.org/abs/2301.11125.
- Elangovan etal. (2021)Aparna Elangovan, Jiayuan He, and Karin Verspoor.Memorization vs. generalization: quantifying data leakage in nlp performance evaluation.arXiv preprint arXiv:2102.01818, 2021.
- Gao etal. (2020)Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy.The pile: An 800gb dataset of diverse text for language modeling, 2020.URL https://arxiv.org/abs/2101.00027.
- Gao etal. (2024)Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain LeNoac’h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou.A framework for few-shot language model evaluation, 07 2024.URL https://zenodo.org/records/12608602.
- Golchin & Surdeanu (2023)Shahriar Golchin and Mihai Surdeanu.Time travel in llms: Tracing data contamination in large language models, 2023.
- Groeneveld etal. (2024)Dirk Groeneveld, IzBeltagy, Pete Walsh, Akshita Bhagia, Rodney Kinney, Oyvind Tafjord, AnanyaHarsh Jha, Hamish Ivison, Ian Magnusson, Yizhong Wang, Shane Arora, David Atkinson, Russell Authur, KhyathiRaghavi Chandu, Arman Cohan, Jennifer Dumas, Yanai Elazar, Yuling Gu, Jack Hessel, Tushar Khot, William Merrill, Jacob Morrison, Niklas Muennighoff, Aakanksha Naik, Crystal Nam, MatthewE. Peters, Valentina Pyatkin, Abhilasha Ravichander, Dustin Schwenk, Saurabh Shah, Will Smith, Emma Strubell, Nishant Subramani, Mitchell Wortsman, Pradeep Dasigi, Nathan Lambert, Kyle Richardson, Luke Zettlemoyer, Jesse Dodge, Kyle Lo, Luca Soldaini, NoahA. Smith, and Hannaneh Hajishirzi.Olmo: Accelerating the science of language models, 2024.URL https://arxiv.org/abs/2402.00838.
- Gunter etal. (2024)Tom Gunter, Zirui Wang, Chong Wang, Ruoming Pang, Andy Narayanan, Aonan Zhang, Bowen Zhang, Chen Chen, Chung-Cheng Chiu, David Qiu, Deepak Gopinath, DianAng Yap, Dong Yin, Feng Nan, Floris Weers, Guoli Yin, Haoshuo Huang, Jianyu Wang, Jiarui Lu, John Peebles, KeYe, Mark Lee, Nan Du, Qibin Chen, Quentin Keunebroek, Sam Wiseman, Syd Evans, Tao Lei, Vivek Rathod, Xiang Kong, Xianzhi Du, Yanghao Li, Yongqiang Wang, Yuan Gao, Zaid Ahmed, Zhaoyang Xu, Zhiyun Lu, AlRashid, AlbinMadappally Jose, Alec Doane, Alfredo Bencomo, Allison Vanderby, Andrew Hansen, Ankur Jain, AnupamaMann Anupama, Areeba Kamal, Bugu Wu, Carolina Brum, Charlie Maalouf, Chinguun Erdenebileg, Chris Dulhanty, Dominik Moritz, Doug Kang, Eduardo Jimenez, Evan Ladd, Fangping Shi, Felix Bai, Frank Chu, Fred Hohman, Hadas Kotek, HannahGillis Coleman, Jane Li, Jeffrey Bigham, Jeffery Cao, Jeff Lai, Jessica Cheung, Jiulong Shan, Joe Zhou, John Li, Jun Qin, Karanjeet Singh, Karla Vega, Kelvin Zou, Laura Heckman, Lauren Gardiner, Margit Bowler,Maria Cordell, Meng Cao, Nicole Hay, Nilesh Shahdadpuri, Otto Godwin, Pranay Dighe, Pushyami Rachapudi, Ramsey Tantawi, Roman Frigg, Sam Davarnia, Sanskruti Shah, Saptarshi Guha, Sasha Sirovica, Shen Ma, Shuang Ma, Simon Wang, Sulgi Kim, Suma Jayaram, Vaishaal Shankar, Varsha Paidi, Vivek Kumar, Xin Wang, Xin Zheng, Walker Cheng, Yael Shrager, Yang Ye, Yasu Tanaka, Yihao Guo, Yunsong Meng, ZhaoTang Luo, Zhi Ouyang, Alp Aygar, Alvin Wan, Andrew Walkingshaw, Andy Narayanan, Antonie Lin, Arsalan Farooq, Brent Ramerth, Colorado Reed, Chris Bartels, Chris Chaney, David Riazati, EricLiang Yang, Erin Feldman, Gabriel Hochstrasser, Guillaume Seguin, Irina Belousova, Joris Pelemans, Karen Yang, KeivanAlizadeh Vahid, Liangliang Cao, Mahyar Najibi, Marco Zuliani, Max Horton, Minsik Cho, Nikhil Bhendawade, Patrick Dong, Piotr Maj, Pulkit Agrawal, QiShan, Qichen Fu, Regan Poston, Sam Xu, Shuangning Liu, Sushma Rao, Tashweena Heeramun, Thomas Merth, Uday Rayala, Victor Cui, VivekRangarajan Sridhar, Wencong Zhang,Wenqi Zhang, Wentao Wu, Xingyu Zhou, Xinwen Liu, Yang Zhao, Yin Xia, Zhile Ren, and Zhongzheng Ren.Apple intelligence foundation language models, 2024.URL https://arxiv.org/abs/2407.21075.
- Hendrycks etal. (2021a)Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt.Measuring massive multitask language understanding, 2021a.URL https://arxiv.org/abs/2009.03300.
- Hendrycks etal. (2021b)Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt.Measuring mathematical problem solving with the math dataset, 2021b.URL https://arxiv.org/abs/2103.03874.
- Imbue (2024)Imbue.Introducing the 70b model.https://imbue.com/research/70b-intro/, 2024.
- Jiang etal. (2024)Minhao Jiang, KenZiyu Liu, Ming Zhong, Rylan Schaeffer, Siru Ouyang, Jiawei Han, and Sanmi Koyejo.Investigating data contamination for pre-training language models, 2024.URL https://arxiv.org/abs/2401.06059.
- Jurafsky & Martin (2009)Dan Jurafsky and JamesH. Martin.Speech and language processing : an introduction to natural language processing, computational linguistics, and speech recognition.Pearson Prentice Hall, Upper Saddle River, N.J., 2009.ISBN 9780131873216 0131873210.URL http://www.amazon.com/Speech-Language-Processing-2nd-Edition/dp/0131873210/ref=pd_bxgy_b_img_y.
- Kapoor & Narayanan (2022)Sayash Kapoor and Arvind Narayanan.Leakage and the reproducibility crisis in ml-based science.arXiv preprint arXiv:2207.07048, 2022.
- Lee etal. (2024)Katherine Lee, A.Feder Cooper, and James Grimmelmann.Talkin’ ’bout ai generation: Copyright and the generative-ai supply chain, 2024.URL https://arxiv.org/abs/2309.08133.
- Lewis etal. (2020)Patrick Lewis, Pontus Stenetorp, and Sebastian Riedel.Question and answer test-train overlap in open-domain question answering datasets.arXiv preprint arXiv:2008.02637, 2020.
- Liang etal. (2023)Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, Benjamin Newman, Binhang Yuan, Bobby Yan, CeZhang, ChristianAlexander Cosgrove, ChristopherD Manning, Christopher Re, Diana Acosta-Navas, DrewArad Hudson, Eric Zelikman, Esin Durmus, Faisal Ladhak, Frieda Rong, Hongyu Ren, Huaxiu Yao, Jue WANG, Keshav Santhanam, Laurel Orr, Lucia Zheng, Mert Yuksekgonul, Mirac Suzgun, Nathan Kim, Neel Guha, NiladriS. Chatterji, Omar Khattab, Peter Henderson, Qian Huang, RyanAndrew Chi, SangMichael Xie, Shibani Santurkar, Surya Ganguli, Tatsunori Hashimoto, Thomas Icard, Tianyi Zhang, Vishrav Chaudhary, William Wang, Xuechen Li, Yifan Mai, Yuhui Zhang, and Yuta Koreeda.Holistic evaluation of language models.Transactions on Machine Learning Research, 2023.ISSN 2835-8856.URL https://openreview.net/forum?id=iO4LZibEqW.Featured Certification, Expert Certification.
- Longpre etal. (2023)Shayne Longpre, Robert Mahari, Anthony Chen, Naana Obeng-Marnu, Damien Sileo, William Brannon, Niklas Muennighoff, Nathan Khazam, Jad Kabbara, Kartik Perisetla, Xinyi Wu, Enrico Shippole, Kurt Bollacker, Tongshuang Wu, Luis Villa, Sandy Pentland, and Sara Hooker.The data provenance initiative: A large scale audit of dataset licensing & attribution in ai, 2023.URL https://arxiv.org/abs/2310.16787.
- Lozhkov etal. (2024)Anton Lozhkov, Raymond Li, LoubnaBen Allal, Federico Cassano, Joel Lamy-Poirier, Nouamane Tazi, AoTang, Dmytro Pykhtar, Jiawei Liu, Yuxiang Wei, Tianyang Liu, Max Tian, Denis Kocetkov, Arthur Zucker, Younes Belkada, Zijian Wang, Qian Liu, Dmitry Abulkhanov, Indraneil Paul, Zhuang Li, Wen-Ding Li, Megan Risdal, Jia Li, Jian Zhu, TerryYue Zhuo, Evgenii Zheltonozhskii, Nii OsaeOsae Dade, Wenhao Yu, Lucas Krauß, Naman Jain, Yixuan Su, Xuanli He, Manan Dey, Edoardo Abati, Yekun Chai, Niklas Muennighoff, Xiangru Tang, Muhtasham Oblokulov, Christopher Akiki, Marc Marone, Chenghao Mou, Mayank Mishra, Alex Gu, Binyuan Hui, Tri Dao, Armel Zebaze, Olivier Dehaene, Nicolas Patry, Canwen Xu, Julian McAuley, Han Hu, Torsten Scholak, Sebastien Paquet, Jennifer Robinson, CarolynJane Anderson, Nicolas Chapados, Mostofa Patwary, Nima Tajbakhsh, Yacine Jernite, CarlosMuñoz Ferrandis, Lingming Zhang, Sean Hughes, Thomas Wolf, Arjun Guha, Leandro von Werra, and Harm deVries.Starcoder 2 and the stack v2: The next generation, 2024.URL https://arxiv.org/abs/2402.19173.
- Luo etal. (2024)Hanjun Luo, Haoyu Huang, Ziye Deng, Xuecheng Liu, Ruizhe Chen, and Zuozhu Liu.Bigbench: A unified benchmark for social bias in text-to-image generative models based on multi-modal llm, 2024.URL https://arxiv.org/abs/2407.15240.
- Mishra etal. (2024)Mayank Mishra, Matt Stallone, Gaoyuan Zhang, Yikang Shen, Aditya Prasad, AdrianaMeza Soria, Michele Merler, Parameswaran Selvam, Saptha Surendran, Shivdeep Singh, Manish Sethi, Xuan-Hong Dang, Pengyuan Li, Kun-Lung Wu, Syed Zawad, Andrew Coleman, Matthew White, Mark Lewis, Raju Pavuluri, Yan Koyfman, Boris Lublinsky, Maximilien deBayser, Ibrahim Abdelaziz, Kinjal Basu, Mayank Agarwal, YiZhou, Chris Johnson, Aanchal Goyal, Hima Patel, Yousaf Shah, Petros Zerfos, Heiko Ludwig, Asim Munawar, Maxwell Crouse, Pavan Kapanipathi, Shweta Salaria, Bob Calio, Sophia Wen, Seetharami Seelam, Brian Belgodere, Carlos Fonseca, Amith Singhee, Nirmit Desai, DavidD. Cox, Ruchir Puri, and Rameswar Panda.Granite code models: A family of open foundation models for code intelligence, 2024.URL https://arxiv.org/abs/2405.04324.
- Myrzakhan etal. (2024)Aidar Myrzakhan, SondosMahmoud Bsharat, and Zhiqiang Shen.Open-llm-leaderboard: From multi-choice to open-style questions for llms evaluation, benchmark, and arena, 2024.URL https://arxiv.org/abs/2406.07545.
- NVIDIA (2024)NVIDIA.Nemotron-4 340b instruct model.https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/nemotron-4-340b-instruct, 2024.
- OpenAI (2023)OpenAI.GPT-4 technical report.arXiv preprint arXiv:2303.08774, 2023.
- OpenAI etal. (2024)OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, FlorenciaLeoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, Red Avila, Igor Babuschkin, Suchir Balaji, Valerie Balcom, Paul Baltescu, Haiming Bao, Mohammad Bavarian, Jeff Belgum, Irwan Bello, Jake Berdine, Gabriel Bernadett-Shapiro, Christopher Berner, Lenny Bogdonoff, Oleg Boiko, Madelaine Boyd, Anna-Luisa Brakman, Greg Brockman, Tim Brooks, Miles Brundage, Kevin Button, Trevor Cai, Rosie Campbell, Andrew Cann, Brittany Carey, Chelsea Carlson, Rory Carmichael, Brooke Chan, Che Chang, Fotis Chantzis, Derek Chen, Sully Chen, Ruby Chen, Jason Chen, Mark Chen, Ben Chess, Chester Cho, Casey Chu, HyungWon Chung, Dave Cummings, Jeremiah Currier, Yunxing Dai, Cory Decareaux, Thomas Degry, Noah Deutsch, Damien Deville, Arka Dhar, David Dohan, Steve Dowling, Sheila Dunning, Adrien Ecoffet, Atty Eleti, Tyna Eloundou, David Farhi, Liam Fedus, Niko Felix, SimónPosada Fishman, Juston Forte, Isabella Fulford, LeoGao, Elie Georges, Christian Gibson, Vik Goel, Tarun Gogineni, Gabriel Goh, Rapha Gontijo-Lopes, Jonathan Gordon, Morgan Grafstein, Scott Gray, Ryan Greene, Joshua Gross, ShixiangShane Gu, Yufei Guo, Chris Hallacy, Jesse Han, Jeff Harris, Yuchen He, Mike Heaton, Johannes Heidecke, Chris Hesse, Alan Hickey, Wade Hickey, Peter Hoeschele, Brandon Houghton, Kenny Hsu, Shengli Hu, Xin Hu, Joost Huizinga, Shantanu Jain, Shawn Jain, Joanne Jang, Angela Jiang, Roger Jiang, Haozhun Jin, Denny Jin, Shino Jomoto, Billie Jonn, Heewoo Jun, Tomer Kaftan, Łukasz Kaiser, Ali Kamali, Ingmar Kanitscheider, NitishShirish Keskar, Tabarak Khan, Logan Kilpatrick, JongWook Kim, Christina Kim, Yongjik Kim, JanHendrik Kirchner, Jamie Kiros, Matt Knight, Daniel Kokotajlo, Łukasz Kondraciuk, Andrew Kondrich, Aris Konstantinidis, Kyle Kosic, Gretchen Krueger, Vishal Kuo, Michael Lampe, Ikai Lan, Teddy Lee, Jan Leike, Jade Leung, Daniel Levy, ChakMing Li, Rachel Lim, Molly Lin, Stephanie Lin, Mateusz Litwin, Theresa Lopez, RyanLowe, Patricia Lue, Anna Makanju, Kim Malfacini, Sam Manning, Todor Markov, Yaniv Markovski, Bianca Martin, Katie Mayer, Andrew Mayne, Bob McGrew, ScottMayer McKinney, Christine McLeavey, Paul McMillan, Jake McNeil, David Medina, Aalok Mehta, Jacob Menick, Luke Metz, Andrey Mishchenko, Pamela Mishkin, Vinnie Monaco, Evan Morikawa, Daniel Mossing, Tong Mu, Mira Murati, Oleg Murk, David Mély, Ashvin Nair, Reiichiro Nakano, Rajeev Nayak, Arvind Neelakantan, Richard Ngo, Hyeonwoo Noh, Long Ouyang, Cullen O’Keefe, Jakub Pachocki, Alex Paino, Joe Palermo, Ashley Pantuliano, Giambattista Parascandolo, Joel Parish, Emy Parparita, Alex Passos, Mikhail Pavlov, Andrew Peng, Adam Perelman, Filipe deAvila BelbutePeres, Michael Petrov, HenriquePonde deOliveiraPinto, Michael, Pokorny, Michelle Pokrass, VitchyrH. Pong, Tolly Powell, Alethea Power, Boris Power, Elizabeth Proehl, Raul Puri, Alec Radford, Jack Rae, Aditya Ramesh, Cameron Raymond, Francis Real, Kendra Rimbach, Carl Ross, Bob Rotsted, Henri Roussez,Nick Ryder, Mario Saltarelli, Ted Sanders, Shibani Santurkar, Girish Sastry, Heather Schmidt, David Schnurr, John Schulman, Daniel Selsam, Kyla Sheppard, Toki Sherbakov, Jessica Shieh, Sarah Shoker, Pranav Shyam, Szymon Sidor, Eric Sigler, Maddie Simens, Jordan Sitkin, Katarina Slama, Ian Sohl, Benjamin Sokolowsky, Yang Song, Natalie Staudacher, FelipePetroski Such, Natalie Summers, Ilya Sutskever, Jie Tang, Nikolas Tezak, MadeleineB. Thompson, Phil Tillet, Amin Tootoonchian, Elizabeth Tseng, Preston Tuggle, Nick Turley, Jerry Tworek, Juan FelipeCerón Uribe, Andrea Vallone, Arun Vijayvergiya, Chelsea Voss, Carroll Wainwright, JustinJay Wang, Alvin Wang, Ben Wang, Jonathan Ward, Jason Wei, CJWeinmann, Akila Welihinda, Peter Welinder, Jiayi Weng, Lilian Weng, Matt Wiethoff, Dave Willner, Clemens Winter, Samuel Wolrich, Hannah Wong, Lauren Workman, Sherwin Wu, Jeff Wu, Michael Wu, Kai Xiao, Tao Xu, Sarah Yoo, Kevin Yu, Qiming Yuan, Wojciech Zaremba, Rowan Zellers, Chong Zhang, Marvin Zhang, ShengjiaZhao, Tianhao Zheng, Juntang Zhuang, William Zhuk, and Barret Zoph.Gpt-4 technical report, 2024.URL https://arxiv.org/abs/2303.08774.
- Oren etal. (2023)Yonatan Oren, Nicole Meister, Niladri Chatterji, Faisal Ladhak, and TatsunoriB. Hashimoto.Proving test set contamination in black box language models, 2023.
- Rajpurkar etal. (2016)Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang.Squad: 100,000+ questions for machine comprehension of text, 2016.URL https://arxiv.org/abs/1606.05250.
- Rajpurkar etal. (2018)Pranav Rajpurkar, Robin Jia, and Percy Liang.Know what you don’t know: Unanswerable questions for SQuAD.In Association for Computational Linguistics (ACL), 2018.
- Rein etal. (2023)David Rein, BettyLi Hou, AsaCooper Stickland, Jackson Petty, RichardYuanzhe Pang, Julien Dirani, Julian Michael, and SamuelR. Bowman.Gpqa: A graduate-level google-proof q&a benchmark, 2023.URL https://arxiv.org/abs/2311.12022.
- Russell & Norvig (2009)Stuart Russell and Peter Norvig.Artificial Intelligence: A Modern Approach.Prentice Hall Press, USA, 3rd edition, 2009.ISBN 0136042597.
- Scale (2024)Scale.Scale seal.https://scale.com/blog/leaderboard, 2024.
- Shi etal. (2023)Weijia Shi, Anirudh Ajith, Mengzhou Xia, Yangsibo Huang, Daogao Liu, Terra Blevins, Danqi Chen, and Luke Zettlemoyer.Detecting pretraining data from large language models, 2023.
- Snowflake (2024)Snowflake.Arctic: Open, efficient foundation language models.https://www.snowflake.com/en/blog/arctic-open-efficient-foundation-language-models-snowflake/, 2024.
- Srivastava etal. (2023)Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu AwalMd Shoeb, Abubakar Abid, Adam Fisch, AdamR. Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, Agnieszka Kluska, Aitor Lewkowycz, Akshat Agarwal, Alethea Power, Alex Ray, Alex Warstadt, AlexanderW. Kocurek, Ali Safaya, Ali Tazarv, Alice Xiang, Alicia Parrish, Allen Nie, Aman Hussain, Amanda Askell, Amanda Dsouza, Ambrose Slone, Ameet Rahane, AnantharamanS. Iyer, Anders Andreassen, Andrea Madotto, Andrea Santilli, Andreas Stuhlmüller, Andrew Dai, Andrew La, Andrew Lampinen, Andy Zou, Angela Jiang, Angelica Chen, Anh Vuong, Animesh Gupta, Anna Gottardi, Antonio Norelli, Anu Venkatesh, Arash Gholamidavoodi, Arfa Tabassum, Arul Menezes, Arun Kirubarajan, Asher Mullokandov, Ashish Sabharwal, Austin Herrick, Avia Efrat, Aykut Erdem, Ayla Karakaş, B.Ryan Roberts, BaoSheng Loe, Barret Zoph, Bartłomiej Bojanowski, Batuhan Özyurt, Behnam Hedayatnia, Behnam Neyshabur, Benjamin Inden, Benno Stein, Berk Ekmekci, BillYuchen Lin, Blake Howald, BryanOrinion, Cameron Diao, Cameron Dour, Catherine Stinson, Cedrick Argueta, CésarFerri Ramírez, Chandan Singh, Charles Rathkopf, Chenlin Meng, Chitta Baral, Chiyu Wu, Chris Callison-Burch, Chris Waites, Christian Voigt, ChristopherD. Manning, Christopher Potts, Cindy Ramirez, ClaraE. Rivera, Clemencia Siro, Colin Raffel, Courtney Ashcraft, Cristina Garbacea, Damien Sileo, Dan Garrette, Dan Hendrycks, Dan Kilman, Dan Roth, Daniel Freeman, Daniel Khashabi, Daniel Levy, DanielMoseguí González, Danielle Perszyk, Danny Hernandez, Danqi Chen, Daphne Ippolito, Dar Gilboa, David Dohan, David Drakard, David Jurgens, Debajyoti Datta, Deep Ganguli, Denis Emelin, Denis Kleyko, Deniz Yuret, Derek Chen, Derek Tam, Dieuwke Hupkes, Diganta Misra, Dilyar Buzan, DimitriCoelho Mollo, Diyi Yang, Dong-Ho Lee, Dylan Schrader, Ekaterina Shutova, EkinDogus Cubuk, Elad Segal, Eleanor Hagerman, Elizabeth Barnes, Elizabeth Donoway, Ellie Pavlick, Emanuele Rodola, Emma Lam, Eric Chu, Eric Tang, Erkut Erdem, Ernie Chang,EthanA. Chi, Ethan Dyer, Ethan Jerzak, Ethan Kim, EuniceEngefu Manyasi, Evgenii Zheltonozhskii, Fanyue Xia, Fatemeh Siar, Fernando Martínez-Plumed, Francesca Happé, Francois Chollet, Frieda Rong, Gaurav Mishra, GentaIndra Winata, Gerard deMelo, Germán Kruszewski, Giambattista Parascandolo, Giorgio Mariani, Gloria Wang, Gonzalo Jaimovitch-López, Gregor Betz, Guy Gur-Ari, Hana Galijasevic, Hannah Kim, Hannah Rashkin, Hannaneh Hajishirzi, Harsh Mehta, Hayden Bogar, Henry Shevlin, Hinrich Schütze, Hiromu Yakura, Hongming Zhang, HughMee Wong, Ian Ng, Isaac Noble, Jaap Jumelet, Jack Geissinger, Jackson Kernion, Jacob Hilton, Jaehoon Lee, JaimeFernández Fisac, JamesB. Simon, James Koppel, James Zheng, James Zou, Jan Kocoń, Jana Thompson, Janelle Wingfield, Jared Kaplan, Jarema Radom, Jascha Sohl-Dickstein, Jason Phang, Jason Wei, Jason Yosinski, Jekaterina Novikova, Jelle Bosscher, Jennifer Marsh, Jeremy Kim, Jeroen Taal, Jesse Engel, Jesujoba Alabi, Jiacheng Xu, Jiaming Song, Jillian Tang, JoanWaweru, John Burden, John Miller, JohnU. Balis, Jonathan Batchelder, Jonathan Berant, Jörg Frohberg, Jos Rozen, Jose Hernandez-Orallo, Joseph Boudeman, Joseph Guerr, Joseph Jones, JoshuaB. Tenenbaum, JoshuaS. Rule, Joyce Chua, Kamil Kanclerz, Karen Livescu, Karl Krauth, Karthik Gopalakrishnan, Katerina Ignatyeva, Katja Markert, KaustubhD. Dhole, Kevin Gimpel, Kevin Omondi, Kory Mathewson, Kristen Chiafullo, Ksenia Shkaruta, Kumar Shridhar, Kyle McDonell, Kyle Richardson, Laria Reynolds, Leo Gao, LiZhang, Liam Dugan, Lianhui Qin, Lidia Contreras-Ochando, Louis-Philippe Morency, Luca Moschella, Lucas Lam, Lucy Noble, Ludwig Schmidt, Luheng He, LuisOliveros Colón, Luke Metz, LütfiKerem Şenel, Maarten Bosma, Maarten Sap, Maartje ter Hoeve, Maheen Farooqi, Manaal Faruqui, Mantas Mazeika, Marco Baturan, Marco Marelli, Marco Maru, Maria JoseRamírez Quintana, Marie Tolkiehn, Mario Giulianelli, Martha Lewis, Martin Potthast, MatthewL. Leavitt, Matthias Hagen, Mátyás Schubert, MedinaOrdunaBaitemirova, Melody Arnaud, Melvin McElrath, MichaelA. Yee, Michael Cohen, Michael Gu, Michael Ivanitskiy, Michael Starritt, Michael Strube, Michał Swędrowski, Michele Bevilacqua, Michihiro Yasunaga, Mihir Kale, Mike Cain, Mimee Xu, Mirac Suzgun, Mitch Walker, MoTiwari, Mohit Bansal, Moin Aminnaseri, Mor Geva, Mozhdeh Gheini, MukundVarma T, Nanyun Peng, NathanA. Chi, Nayeon Lee, Neta Gur-Ari Krakover, Nicholas Cameron, Nicholas Roberts, Nick Doiron, Nicole Martinez, Nikita Nangia, Niklas Deckers, Niklas Muennighoff, NitishShirish Keskar, NivedithaS. Iyer, Noah Constant, Noah Fiedel, Nuan Wen, Oliver Zhang, Omar Agha, Omar Elbaghdadi, Omer Levy, Owain Evans, Pablo AntonioMoreno Casares, Parth Doshi, Pascale Fung, PaulPu Liang, Paul Vicol, Pegah Alipoormolabashi, Peiyuan Liao, Percy Liang, Peter Chang, Peter Eckersley, PhuMon Htut, Pinyu Hwang, Piotr Miłkowski, Piyush Patil, Pouya Pezeshkpour, Priti Oli, Qiaozhu Mei, Qing Lyu, Qinlang Chen, Rabin Banjade, RachelEtta Rudolph, Raefer Gabriel, RahelHabacker, Ramon Risco, Raphaël Millière, Rhythm Garg, Richard Barnes, RifA. Saurous, Riku Arakawa, Robbe Raymaekers, Robert Frank, Rohan Sikand, Roman Novak, Roman Sitelew, Ronan LeBras, Rosanne Liu, Rowan Jacobs, Rui Zhang, Ruslan Salakhutdinov, Ryan Chi, Ryan Lee, Ryan Stovall, Ryan Teehan, Rylan Yang, Sahib Singh, SaifM. Mohammad, Sajant Anand, Sam Dillavou, Sam Shleifer, Sam Wiseman, Samuel Gruetter, SamuelR. Bowman, SamuelS. Schoenholz, Sanghyun Han, Sanjeev Kwatra, SarahA. Rous, Sarik Ghazarian, Sayan Ghosh, Sean Casey, Sebastian Bischoff, Sebastian Gehrmann, Sebastian Schuster, Sepideh Sadeghi, Shadi Hamdan, Sharon Zhou, Shashank Srivastava, Sherry Shi, Shikhar Singh, Shima Asaadi, ShixiangShane Gu, Shubh Pachchigar, Shubham Toshniwal, Shyam Upadhyay, Shyamolima, Debnath, Siamak Shakeri, Simon Thormeyer, Simone Melzi, Siva Reddy, SnehaPriscilla Makini, Soo-Hwan Lee, Spencer Torene, Sriharsha Hatwar, Stanislas Dehaene, Stefan Divic, Stefano Ermon, Stella Biderman, Stephanie Lin, StephenPrasad, StevenT. Piantadosi, StuartM. Shieber, Summer Misherghi, Svetlana Kiritchenko, Swaroop Mishra, Tal Linzen, Tal Schuster, Tao Li, Tao Yu, Tariq Ali, Tatsu Hashimoto, Te-Lin Wu, Théo Desbordes, Theodore Rothschild, Thomas Phan, Tianle Wang, Tiberius Nkinyili, Timo Schick, Timofei Kornev, Titus Tunduny, Tobias Gerstenberg, Trenton Chang, Trishala Neeraj, Tushar Khot, Tyler Shultz, Uri Shaham, Vedant Misra, Vera Demberg, Victoria Nyamai, Vikas Raunak, Vinay Ramasesh, VinayUday Prabhu, Vishakh Padmakumar, Vivek Srikumar, William Fedus, William Saunders, William Zhang, Wout Vossen, Xiang Ren, Xiaoyu Tong, Xinran Zhao, Xinyi Wu, Xudong Shen, Yadollah Yaghoobzadeh, Yair Lakretz, Yangqiu Song, Yasaman Bahri, Yejin Choi, Yichi Yang, Yiding Hao, Yifu Chen, Yonatan Belinkov, YuHou, Yufang Hou, Yuntao Bai, Zachary Seid, Zhuoye Zhao, Zijian Wang, ZijieJ. Wang, Zirui Wang, and Ziyi Wu.Beyond the imitation game: Quantifying and extrapolating the capabilities of language models, 2023.URL https://arxiv.org/abs/2206.04615.
- Team etal. (2024a)Gemini Team, Petko Georgiev, VingIan Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, Soroosh Mariooryad, Yifan Ding, Xinyang Geng, Fred Alcober, Roy Frostig, Mark Omernick, Lexi Walker, Cosmin Paduraru, Christina Sorokin, Andrea Tacchetti, Colin Gaffney, Samira Daruki, Olcan Sercinoglu, Zach Gleicher, Juliette Love, Paul Voigtlaender, Rohan Jain, Gabriela Surita, Kareem Mohamed, Rory Blevins, Junwhan Ahn, Tao Zhu, Kornraphop Kawintiranon, Orhan Firat, Yiming Gu, Yujing Zhang, Matthew Rahtz, Manaal Faruqui, Natalie Clay, Justin Gilmer, JDCo-Reyes, Ivo Penchev, Rui Zhu, Nobuyuki Morioka, Kevin Hui, Krishna Haridasan, Victor Campos, Mahdis Mahdieh, Mandy Guo, Samer Hassan, Kevin Kilgour, Arpi Vezer, Heng-Tze Cheng, Raoul deLiedekerke, Siddharth Goyal, Paul Barham, DJStrouse, Seb Noury, Jonas Adler, Mukund Sundararajan, Sharad Vikram, Dmitry Lepikhin, Michela Paganini, Xavier Garcia, Fan Yang, Dasha Valter, Maja Trebacz, Kiran Vodrahalli, ChulayuthAsawaroengchai, Roman Ring, Norbert Kalb, LivioBaldini Soares, Siddhartha Brahma, David Steiner, Tianhe Yu, Fabian Mentzer, Antoine He, Lucas Gonzalez, Bibo Xu, RaphaelLopez Kaufman, LaurentEl Shafey, Junhyuk Oh, Tom Hennigan, George vanden Driessche, Seth Odoom, Mario Lucic, Becca Roelofs, Sid Lall, Amit Marathe, Betty Chan, Santiago Ontanon, Luheng He, Denis Teplyashin, Jonathan Lai, Phil Crone, Bogdan Damoc, Lewis Ho, Sebastian Riedel, Karel Lenc, Chih-Kuan Yeh, Aakanksha Chowdhery, Yang Xu, Mehran Kazemi, Ehsan Amid, Anastasia Petrushkina, Kevin Swersky, Ali Khodaei, Gowoon Chen, Chris Larkin, Mario Pinto, Geng Yan, AdriaPuigdomenech Badia, Piyush Patil, Steven Hansen, Dave Orr, Sebastien M.R. Arnold, Jordan Grimstad, Andrew Dai, Sholto Douglas, Rishika Sinha, Vikas Yadav, XiChen, Elena Gribovskaya, Jacob Austin, Jeffrey Zhao, Kaushal Patel, Paul Komarek, Sophia Austin, Sebastian Borgeaud, Linda Friso, Abhimanyu Goyal, Ben Caine, Kris Cao, Da-Woon Chung, Matthew Lamm, Gabe Barth-Maron, ThaisKagohara, Kate Olszewska, Mia Chen, Kaushik Shivakumar, Rishabh Agarwal, Harshal Godhia, Ravi Rajwar, Javier Snaider, Xerxes Dotiwalla, Yuan Liu, Aditya Barua, Victor Ungureanu, Yuan Zhang, Bat-Orgil Batsaikhan, Mateo Wirth, James Qin, Ivo Danihelka, Tulsee Doshi, Martin Chadwick, Jilin Chen, Sanil Jain, Quoc Le, Arjun Kar, Madhu Gurumurthy, Cheng Li, Ruoxin Sang, Fangyu Liu, Lampros Lamprou, Rich Munoz, Nathan Lintz, Harsh Mehta, Heidi Howard, Malcolm Reynolds, Lora Aroyo, Quan Wang, Lorenzo Blanco, Albin Cassirer, Jordan Griffith, Dipanjan Das, Stephan Lee, Jakub Sygnowski, Zach Fisher, James Besley, Richard Powell, Zafarali Ahmed, Dominik Paulus, David Reitter, Zalan Borsos, Rishabh Joshi, Aedan Pope, Steven Hand, Vittorio Selo, Vihan Jain, Nikhil Sethi, Megha Goel, Takaki Makino, Rhys May, Zhen Yang, Johan Schalkwyk, Christina Butterfield, Anja Hauth, Alex Goldin, Will Hawkins, Evan Senter, Sergey Brin, Oliver Woodman, Marvin Ritter, Eric Noland, Minh Giang, Vijay Bolina, Lisa Lee, Tim Blyth, IanMackinnon, Machel Reid, Obaid Sarvana, David Silver, Alexander Chen, Lily Wang, Loren Maggiore, Oscar Chang, Nithya Attaluri, Gregory Thornton, Chung-Cheng Chiu, Oskar Bunyan, Nir Levine, Timothy Chung, Evgenii Eltyshev, Xiance Si, Timothy Lillicrap, Demetra Brady, Vaibhav Aggarwal, Boxi Wu, Yuanzhong Xu, Ross McIlroy, Kartikeya Badola, Paramjit Sandhu, Erica Moreira, Wojciech Stokowiec, Ross Hemsley, Dong Li, Alex Tudor, Pranav Shyam, Elahe Rahimtoroghi, Salem Haykal, Pablo Sprechmann, Xiang Zhou, Diana Mincu, Yujia Li, Ravi Addanki, Kalpesh Krishna, Xiao Wu, Alexandre Frechette, Matan Eyal, Allan Dafoe, Dave Lacey, Jay Whang, Thi Avrahami, YeZhang, Emanuel Taropa, Hanzhao Lin, Daniel Toyama, Eliza Rutherford, Motoki Sano, HyunJeong Choe, Alex Tomala, Chalence Safranek-Shrader, Nora Kassner, Mantas Pajarskas, Matt Harvey, Sean Sechrist, Meire Fortunato, Christina Lyu, Gamaleldin Elsayed, Chenkai Kuang, James Lottes, Eric Chu, Chao Jia, Chih-Wei Chen, Peter Humphreys, Kate Baumli, Connie Tao, RajkumarSamuel, CiceroNogueira dos Santos, Anders Andreassen, Nemanja Rakićević, Dominik Grewe, Aviral Kumar, Stephanie Winkler, Jonathan Caton, Andrew Brock, Sid Dalmia, Hannah Sheahan, Iain Barr, Yingjie Miao, Paul Natsev, Jacob Devlin, Feryal Behbahani, Flavien Prost, Yanhua Sun, Artiom Myaskovsky, ThanumalayanSankaranarayana Pillai, Dan Hurt, Angeliki Lazaridou, XiXiong, CeZheng, Fabio Pardo, Xiaowei Li, Dan Horgan, Joe Stanton, Moran Ambar, Fei Xia, Alejandro Lince, Mingqiu Wang, Basil Mustafa, Albert Webson, Hyo Lee, Rohan Anil, Martin Wicke, Timothy Dozat, Abhishek Sinha, Enrique Piqueras, Elahe Dabir, Shyam Upadhyay, Anudhyan Boral, LisaAnne Hendricks, Corey Fry, Josip Djolonga, YiSu, Jake Walker, Jane Labanowski, Ronny Huang, Vedant Misra, Jeremy Chen, RJSkerry-Ryan, Avi Singh, Shruti Rijhwani, Dian Yu, Alex Castro-Ros, Beer Changpinyo, Romina Datta, Sumit Bagri, ArnarMar Hrafnkelsson, Marcello Maggioni, Daniel Zheng, Yury Sulsky, Shaobo Hou, TomLe Paine, Antoine Yang, Jason Riesa, DominikaRogozinska, Dror Marcus, DaliaEl Badawy, Qiao Zhang, Luyu Wang, Helen Miller, Jeremy Greer, LarsLowe Sjos, Azade Nova, Heiga Zen, Rahma Chaabouni, Mihaela Rosca, Jiepu Jiang, Charlie Chen, Ruibo Liu, Tara Sainath, Maxim Krikun, Alex Polozov, Jean-Baptiste Lespiau, Josh Newlan, Zeyncep Cankara, Soo Kwak, Yunhan Xu, Phil Chen, Andy Coenen, Clemens Meyer, Katerina Tsihlas, Ada Ma, Juraj Gottweis, Jinwei Xing, Chenjie Gu, Jin Miao, Christian Frank, Zeynep Cankara, Sanjay Ganapathy, Ishita Dasgupta, Steph Hughes-Fitt, Heng Chen, David Reid, Keran Rong, Hongmin Fan, Joost van Amersfoort, Vincent Zhuang, Aaron Cohen, ShixiangShane Gu, Anhad Mohananey, Anastasija Ilic, Taylor Tobin, John Wieting, Anna Bortsova, Phoebe Thacker, Emma Wang, Emily Caveness, Justin Chiu, Eren Sezener, Alex Kaskasoli, Steven Baker, Katie Millican, Mohamed Elhawaty, Kostas Aisopos, Carl Lebsack, Nathan Byrd, Hanjun Dai, Wenhao Jia, Matthew Wiethoff, Elnaz Davoodi, Albert Weston, Lakshman Yagati, Arun Ahuja, Isabel Gao, Golan Pundak,Susan Zhang, Michael Azzam, KheChai Sim, Sergi Caelles, James Keeling, Abhanshu Sharma, Andy Swing, YaGuang Li, Chenxi Liu, CarrieGrimes Bostock, Yamini Bansal, Zachary Nado, Ankesh Anand, Josh Lipschultz, Abhijit Karmarkar, Lev Proleev, Abe Ittycheriah, SoheilHassas Yeganeh, George Polovets, Aleksandra Faust, Jiao Sun, Alban Rrustemi, Pen Li, Rakesh Shivanna, Jeremiah Liu, Chris Welty, Federico Lebron, Anirudh Baddepudi, Sebastian Krause, Emilio Parisotto, Radu Soricut, Zheng Xu, Dawn Bloxwich, Melvin Johnson, Behnam Neyshabur, Justin Mao-Jones, Renshen Wang, Vinay Ramasesh, Zaheer Abbas, Arthur Guez, Constant Segal, DucDung Nguyen, James Svensson, LeHou, Sarah York, Kieran Milan, Sophie Bridgers, Wiktor Gworek, Marco Tagliasacchi, James Lee-Thorp, Michael Chang, Alexey Guseynov, AleJakse Hartman, Michael Kwong, Ruizhe Zhao, Sheleem Kashem, Elizabeth Cole, Antoine Miech, Richard Tanburn, Mary Phuong, Filip Pavetic, Sebastien Cevey, Ramona Comanescu, Richard Ives, Sherry Yang, Cosmo Du, BoLi, ZizhaoZhang, Mariko Iinuma, ClaraHuiyi Hu, Aurko Roy, Shaan Bijwadia, Zhenkai Zhu, Danilo Martins, Rachel Saputro, Anita Gergely, Steven Zheng, Dawei Jia, Ioannis Antonoglou, Adam Sadovsky, Shane Gu, Yingying Bi, Alek Andreev, Sina Samangooei, Mina Khan, Tomas Kocisky, Angelos Filos, Chintu Kumar, Colton Bishop, Adams Yu, Sarah Hodkinson, Sid Mittal, Premal Shah, Alexandre Moufarek, Yong Cheng, Adam Bloniarz, Jaehoon Lee, Pedram Pejman, Paul Michel, Stephen Spencer, Vladimir Feinberg, Xuehan Xiong, Nikolay Savinov, Charlotte Smith, Siamak Shakeri, Dustin Tran, Mary Chesus, Bernd Bohnet, George Tucker, Tamara von Glehn, Carrie Muir, Yiran Mao, Hideto Kazawa, Ambrose Slone, Kedar Soparkar, Disha Shrivastava, James Cobon-Kerr, Michael Sharman, Jay Pavagadhi, Carlos Araya, Karolis Misiunas, Nimesh Ghelani, Michael Laskin, David Barker, Qiujia Li, Anton Briukhov, Neil Houlsby, Mia Glaese, Balaji Lakshminarayanan, Nathan Schucher, Yunhao Tang, Eli Collins, Hyeontaek Lim, Fangxiaoyu Feng, Adria Recasens, Guangda Lai,Alberto Magni, NicolaDe Cao, Aditya Siddhant, Zoe Ashwood, Jordi Orbay, Mostafa Dehghani, Jenny Brennan, Yifan He, Kelvin Xu, Yang Gao, Carl Saroufim, James Molloy, Xinyi Wu, Seb Arnold, Solomon Chang, Julian Schrittwieser, Elena Buchatskaya, Soroush Radpour, Martin Polacek, Skye Giordano, Ankur Bapna, Simon Tokumine, Vincent Hellendoorn, Thibault Sottiaux, Sarah Cogan, Aliaksei Severyn, Mohammad Saleh, Shantanu Thakoor, Laurent Shefey, Siyuan Qiao, Meenu Gaba, Shuo yiin Chang, Craig Swanson, Biao Zhang, Benjamin Lee, PaulKishan Rubenstein, Gan Song, Tom Kwiatkowski, Anna Koop, Ajay Kannan, David Kao, Parker Schuh, Axel Stjerngren, Golnaz Ghiasi, Gena Gibson, Luke Vilnis, YeYuan, FelipeTiengo Ferreira, Aishwarya Kamath, Ted Klimenko, Ken Franko, Kefan Xiao, Indro Bhattacharya, Miteyan Patel, Rui Wang, Alex Morris, Robin Strudel, Vivek Sharma, Peter Choy, SayedHadi Hashemi, Jessica Landon, Mara Finkelstein, Priya Jhakra, Justin Frye, Megan Barnes, Matthew Mauger, Dennis Daun, Khuslen Baatarsukh, MatthewTung, Wael Farhan, Henryk Michalewski, Fabio Viola, Felix deChaumontQuitry, CharlineLe Lan, Tom Hudson, Qingze Wang, Felix Fischer, Ivy Zheng, Elspeth White, Anca Dragan, Jean baptiste Alayrac, Eric Ni, Alexander Pritzel, Adam Iwanicki, Michael Isard, Anna Bulanova, Lukas Zilka, Ethan Dyer, Devendra Sachan, Srivatsan Srinivasan, Hannah Muckenhirn, Honglong Cai, Amol Mandhane, Mukarram Tariq, JackW. Rae, Gary Wang, Kareem Ayoub, Nicholas FitzGerald, Yao Zhao, Woohyun Han, Chris Alberti, Dan Garrette, Kashyap Krishnakumar, Mai Gimenez, Anselm Levskaya, Daniel Sohn, Josip Matak, Inaki Iturrate, MichaelB. Chang, Jackie Xiang, Yuan Cao, Nishant Ranka, Geoff Brown, Adrian Hutter, Vahab Mirrokni, Nanxin Chen, Kaisheng Yao, Zoltan Egyed, Francois Galilee, Tyler Liechty, Praveen Kallakuri, Evan Palmer, Sanjay Ghemawat, Jasmine Liu, David Tao, Chloe Thornton, Tim Green, Mimi Jasarevic, Sharon Lin, Victor Cotruta, Yi-Xuan Tan, Noah Fiedel, Hongkun Yu, EdChi, Alexander Neitz, Jens Heitkaemper, Anu Sinha, DennyZhou, YiSun, Charbel Kaed, Brice Hulse, Swaroop Mishra, Maria Georgaki, Sneha Kudugunta, Clement Farabet, Izhak Shafran, Daniel Vlasic, Anton Tsitsulin, Rajagopal Ananthanarayanan, Alen Carin, Guolong Su, Pei Sun, Shashank V, Gabriel Carvajal, Josef Broder, Iulia Comsa, Alena Repina, William Wong, WarrenWeilun Chen, Peter Hawkins, Egor Filonov, Lucia Loher, Christoph Hirnschall, Weiyi Wang, Jingchen Ye, Andrea Burns, Hardie Cate, DianaGage Wright, Federico Piccinini, Lei Zhang, Chu-Cheng Lin, Ionel Gog, Yana Kulizhskaya, Ashwin Sreevatsa, Shuang Song, LuisC. Cobo, Anand Iyer, Chetan Tekur, Guillermo Garrido, Zhuyun Xiao, Rupert Kemp, HuaixiuSteven Zheng, Hui Li, Ananth Agarwal, Christel Ngani, Kati Goshvadi, Rebeca Santamaria-Fernandez, Wojciech Fica, Xinyun Chen, Chris Gorgolewski, Sean Sun, Roopal Garg, Xinyu Ye, S.M.Ali Eslami, Nan Hua, Jon Simon, Pratik Joshi, Yelin Kim, Ian Tenney, Sahitya Potluri, LamNguyen Thiet, Quan Yuan, Florian Luisier, Alexandra Chronopoulou, Salvatore Scellato, PraveenSrinivasan, Minmin Chen, Vinod Koverkathu, Valentin Dalibard, Yaming Xu, Brennan Saeta, Keith Anderson, Thibault Sellam, Nick Fernando, Fantine Huot, Junehyuk Jung, Mani Varadarajan, Michael Quinn, Amit Raul, Maigo Le, Ruslan Habalov, Jon Clark, Komal Jalan, Kalesha Bullard, Achintya Singhal, Thang Luong, Boyu Wang, Sujeevan Rajayogam, Julian Eisenschlos, Johnson Jia, Daniel Finchelstein, Alex Yakubovich, Daniel Balle, Michael Fink, Sameer Agarwal, Jing Li, DjDvijotham, Shalini Pal, Kai Kang, Jaclyn Konzelmann, Jennifer Beattie, Olivier Dousse, Diane Wu, Remi Crocker, Chen Elkind, SiddharthaReddy Jonnalagadda, Jong Lee, Dan Holtmann-Rice, Krystal Kallarackal, Rosanne Liu, Denis Vnukov, Neera Vats, Luca Invernizzi, Mohsen Jafari, Huanjie Zhou, Lilly Taylor, Jennifer Prendki, Marcus Wu, Tom Eccles, Tianqi Liu, Kavya Kopparapu, Francoise Beaufays, Christof Angermueller, Andreea Marzoca, Shourya Sarcar, Hilal Dib, Jeff Stanway, Frank Perbet, Nejc Trdin, Rachel Sterneck, Andrey Khorlin, Dinghua Li, Xihui Wu,Sonam Goenka, David Madras, Sasha Goldshtein, Willi Gierke, Tong Zhou, Yaxin Liu, Yannie Liang, Anais White, Yunjie Li, Shreya Singh, Sanaz Bahargam, Mark Epstein, Sujoy Basu, LiLao, Adnan Ozturel, Carl Crous, Alex Zhai, Han Lu, Zora Tung, Neeraj Gaur, Alanna Walton, Lucas Dixon, Ming Zhang, Amir Globerson, Grant Uy, Andrew Bolt, Olivia Wiles, Milad Nasr, Ilia Shumailov, Marco Selvi, Francesco Piccinno, Ricardo Aguilar, Sara McCarthy, Misha Khalman, Mrinal Shukla, Vlado Galic, John Carpenter, Kevin Villela, Haibin Zhang, Harry Richardson, James Martens, Matko Bosnjak, ShreyasRammohan Belle, Jeff Seibert, Mahmoud Alnahlawi, Brian McWilliams, Sankalp Singh, Annie Louis, Wen Ding, Dan Popovici, Lenin Simicich, Laura Knight, Pulkit Mehta, Nishesh Gupta, Chongyang Shi, Saaber Fatehi, Jovana Mitrovic, Alex Grills, Joseph Pagadora, Dessie Petrova, Danielle Eisenbud, Zhishuai Zhang, Damion Yates, Bhavishya Mittal, Nilesh Tripuraneni, Yannis Assael, Thomas Brovelli, Prateek Jain, Mihajlo Velimirovic, CanferAkbulut, Jiaqi Mu, Wolfgang Macherey, Ravin Kumar, Jun Xu, Haroon Qureshi, Gheorghe Comanici, Jeremy Wiesner, Zhitao Gong, Anton Ruddock, Matthias Bauer, Nick Felt, Anirudh GP, Anurag Arnab, Dustin Zelle, Jonas Rothfuss, Bill Rosgen, Ashish Shenoy, Bryan Seybold, Xinjian Li, Jayaram Mudigonda, Goker Erdogan, Jiawei Xia, Jiri Simsa, Andrea Michi, YiYao, Christopher Yew, Steven Kan, Isaac Caswell, Carey Radebaugh, Andre Elisseeff, Pedro Valenzuela, Kay McKinney, Kim Paterson, Albert Cui, Eri Latorre-Chimoto, Solomon Kim, William Zeng, Ken Durden, Priya Ponnapalli, Tiberiu Sosea, ChristopherA. Choquette-Choo, James Manyika, Brona Robenek, Harsha Vashisht, Sebastien Pereira, Hoi Lam, Marko Velic, Denese Owusu-Afriyie, Katherine Lee, Tolga Bolukbasi, Alicia Parrish, Shawn Lu, Jane Park, Balaji Venkatraman, Alice Talbert, Lambert Rosique, Yuchung Cheng, Andrei Sozanschi, Adam Paszke, Praveen Kumar, Jessica Austin, LuLi, Khalid Salama, Wooyeol Kim, Nandita Dukkipati, Anthony Baryshnikov, Christos Kaplanis,XiangHai Sheng, Yuri Chervonyi, Caglar Unlu, Diego deLasCasas, Harry Askham, Kathryn Tunyasuvunakool, Felix Gimeno, Siim Poder, Chester Kwak, Matt Miecnikowski, Vahab Mirrokni, Alek Dimitriev, Aaron Parisi, Dangyi Liu, Tomy Tsai, Toby Shevlane, Christina Kouridi, Drew Garmon, Adrian Goedeckemeyer, AdamR. Brown, Anitha Vijayakumar, Ali Elqursh, Sadegh Jazayeri, Jin Huang, SaraMc Carthy, Jay Hoover, Lucy Kim, Sandeep Kumar, Wei Chen, Courtney Biles, Garrett Bingham, Evan Rosen, Lisa Wang, Qijun Tan, David Engel, Francesco Pongetti, Dario deCesare, Dongseong Hwang, Lily Yu, Jennifer Pullman, Srini Narayanan, Kyle Levin, Siddharth Gopal, Megan Li, Asaf Aharoni, Trieu Trinh, Jessica Lo, Norman Casagrande, Roopali Vij, Loic Matthey, Bramandia Ramadhana, Austin Matthews, CJCarey, Matthew Johnson, Kremena Goranova, Rohin Shah, Shereen Ashraf, Kingshuk Dasgupta, Rasmus Larsen, Yicheng Wang, ManishReddy Vuyyuru, Chong Jiang, Joana Ijazi, Kazuki Osawa, Celine Smith, RamyaSree Boppana, Taylan Bilal, YumaKoizumi, Ying Xu, Yasemin Altun, Nir Shabat, Ben Bariach, Alex Korchemniy, Kiam Choo, Olaf Ronneberger, Chimezie Iwuanyanwu, Shubin Zhao, David Soergel, Cho-Jui Hsieh, Irene Cai, Shariq Iqbal, Martin Sundermeyer, Zhe Chen, Elie Bursztein, Chaitanya Malaviya, Fadi Biadsy, Prakash Shroff, Inderjit Dhillon, Tejasi Latkar, Chris Dyer, Hannah Forbes, Massimo Nicosia, Vitaly Nikolaev, Somer Greene, Marin Georgiev, Pidong Wang, Nina Martin, Hanie Sedghi, John Zhang, Praseem Banzal, Doug Fritz, Vikram Rao, Xuezhi Wang, Jiageng Zhang, Viorica Patraucean, Dayou Du, Igor Mordatch, Ivan Jurin, Lewis Liu, Ayush Dubey, Abhi Mohan, Janek Nowakowski, Vlad-Doru Ion, Nan Wei, Reiko Tojo, MariaAbi Raad, DrewA. Hudson, Vaishakh Keshava, Shubham Agrawal, Kevin Ramirez, Zhichun Wu, Hoang Nguyen, JiLiu, Madhavi Sewak, Bryce Petrini, DongHyun Choi, Ivan Philips, Ziyue Wang, Ioana Bica, Ankush Garg, Jarek Wilkiewicz, Priyanka Agrawal, Xiaowei Li, Danhao Guo, Emily Xue, Naseer Shaik, Andrew Leach, SadhMNM Khan, Julia Wiesinger,Sammy Jerome, Abhishek Chakladar, AlekWenjiao Wang, Tina Ornduff, Folake Abu, Alireza Ghaffarkhah, Marcus Wainwright, Mario Cortes, Frederick Liu, Joshua Maynez, Andreas Terzis, Pouya Samangouei, Riham Mansour, Tomasz Kępa, François-Xavier Aubet, Anton Algymr, Dan Banica, Agoston Weisz, Andras Orban, Alexandre Senges, Ewa Andrejczuk, Mark Geller, NiccoloDal Santo, Valentin Anklin, MajdAl Merey, Martin Baeuml, Trevor Strohman, Junwen Bai, Slav Petrov, Yonghui Wu, Demis Hassabis, Koray Kavukcuoglu, Jeffrey Dean, and Oriol Vinyals.Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context, 2024a.URL https://arxiv.org/abs/2403.05530.
- Team etal. (2024b)Reka Team, Aitor Ormazabal, Che Zheng, Cyprien deMassond’Autume, Dani Yogatama, Deyu Fu, Donovan Ong, Eric Chen, Eugenie Lamprecht, Hai Pham, Isaac Ong, Kaloyan Aleksiev, Lei Li, Matthew Henderson, Max Bain, Mikel Artetxe, Nishant Relan, Piotr Padlewski, QiLiu, Ren Chen, Samuel Phua, Yazheng Yang, YiTay, Yuqi Wang, Zhongkai Zhu, and Zhihui Xie.Reka core, flash, and edge: A series of powerful multimodal language models, 2024b.URL https://arxiv.org/abs/2404.12387.
- Vu etal. (2023)Thuy-Trang Vu, Xuanli He, Gholamreza Haffari, and Ehsan Shareghi.Koala: An index for quantifying overlaps with pre-training corpora.arXiv preprint arXiv:2303.14770, 2023.
- White etal. (2024)Colin White, Samuel Dooley, Manley Roberts, Arka Pal, Ben Feuer, Siddhartha Jain, Ravid Shwartz-Ziv, Neel Jain, Khalid Saifullah, Siddartha Naidu, Chinmay Hegde, Yann LeCun, Tom Goldstein, Willie Neiswanger, and Micah Goldblum.Livebench: A challenging, contamination-free llm benchmark, 2024.URL https://arxiv.org/abs/2406.19314.
- Writer (2024)Writer.Writer helm results, 2024.URL https://writer.com/blog/palmyra-helm-benchmark/.accessed on 10/10/2024.
- x.ai (2024)x.ai.Announcing grok 2.https://x.ai/blog/grok-2, 2024.
- Yang etal. (2024)AnYang, Baosong Yang, Binyuan Hui, BoZheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jialong Tang, Jialin Wang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Ma, Jianxin Yang, Jin Xu, Jingren Zhou, Jinze Bai, Jinzheng He, Junyang Lin, Kai Dang, Keming Lu, Keqin Chen, Kexin Yang, Mei Li, Mingfeng Xue, NaNi, Pei Zhang, Peng Wang, RuPeng, Rui Men, Ruize Gao, Runji Lin, Shijie Wang, Shuai Bai, Sinan Tan, Tianhang Zhu, Tianhao Li, Tianyu Liu, Wenbin Ge, Xiaodong Deng, Xiaohuan Zhou, Xingzhang Ren, Xinyu Zhang, Xipin Wei, Xuancheng Ren, Xuejing Liu, Yang Fan, Yang Yao, Yichang Zhang, YuWan, Yunfei Chu, Yuqiong Liu, Zeyu Cui, Zhenru Zhang, Zhifang Guo, and Zhihao Fan.Qwen2 technical report, 2024.URL https://arxiv.org/abs/2407.10671.
- Yang etal. (2023)Shuo Yang, Wei-Lin Chiang, Lianmin Zheng, JosephE. Gonzalez, and Ion Stoica.Rethinking benchmark and contamination for language models with rephrased samples, 2023.