
Skywork releases the Skywork/Skywork-Reward-Preference-80K-v0.2 dataset for reward models and preference pairs

五号数据雷达 · Open-Source Data Market · 2024-10-12 13:20

Skywork/Skywork-Reward-Preference-80K-v0.2 is a dataset released by Skywork. It was first published on HuggingFace on 2024-10-12 and targets reward models and preference pairs.

The dataset card that Skywork published for Skywork/Skywork-Reward-Preference-80K-v0.2 reads as follows.

---
dataset_info:
  features:
  - name: chosen
    list:
    - name: content
      dtype: string
    - name: role
      dtype: string
  - name: rejected
    list:
    - name: content
      dtype: string
    - name: role
      dtype: string
  - name: source
    dtype: string
  splits:
  - name: train
    num_bytes: 415622390
    num_examples: 77016
  download_size: 209172624
  dataset_size: 415622390
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---

# Skywork Reward Preference 80K

> IMPORTANT:
> This dataset is the decontaminated version of [Skywork-Reward-Preference-80K-v0.1](https://huggingface.co/datasets/Skywork/Skywork-Reward-Preference-80K-v0.1). We removed 4,957 pairs from the [magpie-ultra-v0.1](https://huggingface.co/datasets/argilla/magpie-ultra-v0.1) subset that have a significant n-gram overlap with the evaluation prompts in [RewardBench](https://huggingface.co/datasets/allenai/reward-bench). You can find the set of removed pairs [here](https://huggingface.co/datasets/chrisliu298/Skywork-Reward-Preference-80K-v0.1-Contaminated). For more information, see [this GitHub gist](https://gist.github.com/natolambert/1aed306000c13e0e8c5bc17c1a5dd300).
>
> **If your task involves evaluation on [RewardBench](https://huggingface.co/datasets/allenai/reward-bench), we strongly encourage you to use v0.2 instead of v0.1 of the dataset.**
>
> We will soon release our new version of the reward models!

Skywork Reward Preference 80K is a subset of 80K preference pairs, sourced from publicly available data. This subset is used to train [**Skywork-Reward-Gemma-2-27B**](https://huggingface.co/Skywork/Skywork-Reward-Gemma-2-27B) and [**Skywork-Reward-Llama-3.1-8B**](https://huggingface.co/Skywork/Skywork-Reward-Llama-3.1-8B).
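The decontamination note above mentions removing pairs whose prompts have significant n-gram overlap with evaluation prompts. The card does not give the n-gram size, the tokenization, or the overlap threshold, so the following is only a minimal sketch under assumed values (`n=7`, whitespace tokens, `threshold=0.5`); the function names are hypothetical and this is not the authors' actual procedure. It assumes each record follows the schema above and that the first message in `chosen` is the user prompt.

```python
from typing import Dict, Iterable, List, Set, Tuple

NGram = Tuple[str, ...]


def word_ngrams(text: str, n: int) -> Set[NGram]:
    """Return the set of lowercase, whitespace-tokenized word n-grams in `text`."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def is_contaminated(prompt: str, eval_ngrams: Set[NGram],
                    n: int = 7, threshold: float = 0.5) -> bool:
    """Flag a prompt when the share of its n-grams seen in the eval set
    reaches `threshold`. Both `n` and `threshold` are assumed values."""
    grams = word_ngrams(prompt, n)
    if not grams:
        return False  # prompt shorter than n tokens: nothing to compare
    return len(grams & eval_ngrams) / len(grams) >= threshold


def decontaminate(pairs: Iterable[Dict], eval_prompts: Iterable[str],
                  n: int = 7, threshold: float = 0.5) -> List[Dict]:
    """Keep only the preference pairs whose prompt is not flagged."""
    eval_ngrams: Set[NGram] = set()
    for prompt in eval_prompts:
        eval_ngrams |= word_ngrams(prompt, n)
    kept = []
    for pair in pairs:
        # Assumption: the first turn of `chosen` holds the user prompt.
        prompt = pair["chosen"][0]["content"]
        if not is_contaminated(prompt, eval_ngrams, n, threshold):
            kept.append(pair)
    return kept
```

A smaller `n` or a lower threshold removes more pairs at the cost of more false positives, so both would need tuning against the actual evaluation set.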
## Data Mixture

We carefully curate the [Skywork Reward Data Collection](https://huggingface.co/collections/Skywork/skywork-reward-data-collection-66d7fda6a5098dc77035336d) (1) to include high-quality preference pairs and (2) to target specific capability and knowledge domains. The curated training dataset consists of approximately 80K samples, subsampled from multiple publicly available data sources, including:

1. [HelpSteer2](https://huggingface.co/datasets/nvidia/HelpSteer2)
2. [OffsetBias](https://huggingface.co/datasets/NCSOFT/offsetbias)
3. [WildGuard (adversarial)](https://huggingface.co/allenai/wildguard)
4. Magpie DPO series: [Ultra](https://huggingface.co/datasets/argilla/magpie-ultra-v0.1), [Pro (Llama-3.1)](https://huggingface.co/datasets/Magpie-Align/Magpie-Llama-3.1-Pro-DPO-100K-v0.1), [Pro](https://huggingface.co/datasets/Magpie-Align/Magpie-Pro-DPO-100K-v0.1), [Air](https://huggingface.co/datasets/Magpie-Align/Magpie-Air-DPO-100K-v0.1).

**Disclaimer: We made no modifications to the original datasets listed above, other than subsampling the datasets to create the Skywork Reward Data Collection.**

During dataset curation, we adopt several tricks to achieve both performance improvement and a balance between each domain, without compromising the overall performance:

1. We select top samples from math, code, and other categories in the combined Magpie dataset independently, based on the average ArmoRM score provided with the dataset. We subtract 0.1 and 0.05 from the ArmoRM average scores in the Magpie-Air subset and the Magpie-Pro subset, respectively, to prioritize Magpie-Ultra and Magpie-Pro-Llama-3.1 samples.
2. Instead of including all preference pairs in WildGuard, we first train a reward model (RM) on three other data sources. We then (1) use this RM to score the chosen and rejected responses for all samples in WildGuard and (2) select only samples where the chosen response's RM score is greater than the rejected response's RM score.
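The WildGuard filtering step (item 2 above) can be sketched as follows. This is an illustration only: the `score` callable stands in for the trained reward model, whose interface the card does not describe, and `toy_score` is a hypothetical placeholder so that the example runs without any model.

```python
from typing import Any, Callable, Dict, List

Message = Dict[str, str]   # {"role": ..., "content": ...}
Pair = Dict[str, Any]      # {"chosen": [...], "rejected": [...], "source": ...}


def filter_by_rm_agreement(
    pairs: List[Pair],
    score: Callable[[List[Message]], float],
) -> List[Pair]:
    """Keep only pairs where the RM scores `chosen` strictly above `rejected`."""
    kept = []
    for pair in pairs:
        if score(pair["chosen"]) > score(pair["rejected"]):
            kept.append(pair)
    return kept


def toy_score(conversation: List[Message]) -> float:
    """Hypothetical stand-in for a reward model: scores the final reply
    by its length. A real RM would score the full conversation."""
    return float(len(conversation[-1]["content"]))
```

With a real reward model, `score` would wrap a forward pass over the formatted conversation; the strict inequality means ties are dropped along with disagreements.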
We observe that this approach largely preserves the original performance of Chat, Chat Hard, and Reasoning while improving Safety. For both models, we use the 27B model to score the WildGuard samples.
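Returning to the first curation step (the Magpie score offsets and per-category selection), a minimal sketch might look like the following. The offsets 0.1 and 0.05 come from the text; the field names `subset`, `category`, and `armorm_avg`, the subset labels, and the per-category budgets are all hypothetical, since the card does not state how many samples were kept per category.

```python
from collections import defaultdict
from typing import Any, Dict, List

# Offsets from the text; the subset label strings are assumed.
SCORE_OFFSETS = {"magpie-air": 0.1, "magpie-pro": 0.05}


def select_top(samples: List[Dict[str, Any]],
               k_per_category: Dict[str, int]) -> List[Dict[str, Any]]:
    """Rank samples within each category by offset-adjusted ArmoRM average
    score and keep the top k for that category."""
    by_category = defaultdict(list)
    for sample in samples:
        # Penalize Magpie-Air and Magpie-Pro so that Magpie-Ultra and
        # Magpie-Pro-Llama-3.1 samples are prioritized.
        adjusted = sample["armorm_avg"] - SCORE_OFFSETS.get(sample["subset"], 0.0)
        by_category[sample["category"]].append((adjusted, sample))

    selected = []
    for category, scored in by_category.items():
        scored.sort(key=lambda item: item[0], reverse=True)
        budget = k_per_category.get(category, 0)  # categories without a budget are skipped
        selected.extend(sample for _, sample in scored[:budget])
    return selected
```

Selecting within each category independently, rather than from one global ranking, keeps a high-scoring category from crowding out math or code samples.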


About Skywork: Skywork is the organization on Hugging Face that publishes this dataset along with the Skywork-Reward-Gemma-2-27B and Skywork-Reward-Llama-3.1-8B reward models.

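About HuggingFace: Hugging Face is a collaborative platform for the machine learning community, focused on creating, discovering, and collaborating on models, datasets, and applications. The platform supports multiple data types, including text, image, video, audio, and 3D, and offers open-source tools along with paid compute and enterprise solutions.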
