The dataset released here by the Beijing Academy of Artificial Intelligence (BAAI) is CCI 2.0. To address the scarcity of high-quality safety datasets in Chinese, BAAI open-sourced the CCI (Chinese Corpora Internet) dataset on November 29, 2023. Building on that foundation, the team expanded the data sources, adopted stricter data cleaning methods, and completed construction of the CCI 2.0 dataset. The dataset is composed of high-quality, reliable Internet data from trusted sources. It has undergone strict cleaning and deduplication, with targeted detection and filtering for content quality and safety. The data processing rules include: (1) rule-based filtering: safety filtering based on keywords, spam filtering, etc.; (2) model-based filtering: removal of low-quality content using a trained classification model; (3) deduplication: both within and across datasets. The released CCI 2.0 corpus is 501 GB in size.
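The three processing stages named above (rule-based filtering, model-based filtering, deduplication) can be illustrated with a minimal Python sketch. This is not BAAI's actual pipeline: the keyword list, the `toy_quality_score` function (a crude stand-in for a trained classifier), the `min_score` threshold, and the exact-match hashing strategy are all assumptions chosen only to make the example runnable.

```python
import hashlib
import re
from typing import Callable, Iterable, List

# Hypothetical blocklist; the real CCI 2.0 keyword lists are not given in the source.
BLOCKED_KEYWORDS = {"casino", "spam-link"}


def passes_rules(text: str) -> bool:
    """Rule-based filtering: reject empty text or text containing a blocked keyword."""
    lowered = text.lower()
    return bool(text.strip()) and not any(k in lowered for k in BLOCKED_KEYWORDS)


def toy_quality_score(text: str) -> float:
    """Stand-in for a trained quality classifier (assumption):
    the fraction of distinct whitespace-separated tokens."""
    tokens = text.split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def fingerprint(text: str) -> str:
    """SHA-256 of whitespace-normalized, lowercased text, for exact deduplication."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def clean_corpus(
    docs: Iterable[str],
    score: Callable[[str], float] = toy_quality_score,
    min_score: float = 0.5,  # hypothetical threshold
) -> List[str]:
    """Apply the three stages in order and return the surviving documents."""
    seen = set()
    kept = []
    for doc in docs:
        if not passes_rules(doc):      # stage 1: rule-based filtering
            continue
        if score(doc) < min_score:     # stage 2: model-based filtering
            continue
        fp = fingerprint(doc)          # stage 3: deduplication
        if fp in seen:
            continue
        seen.add(fp)
        kept.append(doc)
    return kept
```

Sharing one `seen` set across several source datasets would give the "across datasets" deduplication mentioned above. A production pipeline would more likely use near-duplicate detection than exact hashing, but the source does not say which method was used.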
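About the Beijing Academy of Artificial Intelligence: BAAI (智源研究院) is a leading artificial intelligence research institution in China, dedicated to advancing fundamental AI theory and applied innovation.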
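About Hugging Face: Hugging Face is a collaborative platform for the machine learning community, focused on creating, discovering, and collaborating on models, datasets, and applications. The platform supports many data types, including text, image, video, audio, and 3D data, and offers open-source tools as well as paid compute and enterprise solutions.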




