
Beijing Academy of Artificial Intelligence releases the CCI 2.0 dataset for natural language processing and network security

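五号数据雷达, Open Source Data Market, 2024-12-13 07:12
CCI 2.0 is a dataset released by Beijing Academy of Artificial Intelligence. It was first published on HuggingFace on 2024-04-26 and is intended for use in natural language processing and network security.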

The dataset card for CCI 2.0, the dataset released here by Beijing Academy of Artificial Intelligence, describes it as follows. To address the scarcity of high-quality safety datasets in Chinese, we open-sourced the CCI (Chinese Corpora Internet) dataset on November 29, 2023. Building on this foundation, we continued to expand the data sources, adopted stricter data cleaning methods, and completed the construction of the CCI 2.0 dataset. The dataset is composed of high-quality, reliable Internet data from trusted sources. It has undergone strict data cleaning and deduplication, with targeted detection and filtering for content quality and safety. The data processing rules include rule-based filtering (keyword-based safety filtering, spam filtering, etc.), model-based filtering (a trained classification model filters out low-quality content), and deduplication (both within and across datasets). The released CCI 2.0 corpus is 501GB in size.
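The three processing steps named above (rule-based filtering, model-based filtering, deduplication) can be illustrated with a minimal Python sketch. This is not the pipeline used to build CCI 2.0, whose keyword lists, classifier, and deduplication method are not given here. The names `BLOCKED_KEYWORDS`, `passes_rules`, `length_score`, and `clean` are hypothetical, the quality "model" is a simple length heuristic standing in for a trained classifier, and deduplication is exact matching on whitespace-normalized text only.

```python
import hashlib
from typing import Callable, Iterable, Iterator

# Hypothetical keyword list for illustration; the real CCI 2.0 rules are not published here.
BLOCKED_KEYWORDS = {"spam-offer", "click here to win"}


def passes_rules(text: str) -> bool:
    """Rule-based filter: reject blank text and text containing a blocked keyword."""
    lowered = text.lower()
    return bool(lowered.strip()) and not any(k in lowered for k in BLOCKED_KEYWORDS)


def length_score(text: str) -> float:
    """Stand-in for a trained quality classifier (assumption): longer text scores higher, capped at 1.0."""
    return min(len(text) / 20.0, 1.0)


def clean(
    docs: Iterable[str],
    score: Callable[[str], float] = length_score,
    threshold: float = 0.5,
) -> Iterator[str]:
    """Yield documents that pass the rules, meet the quality threshold, and are not exact duplicates."""
    seen = set()
    for text in docs:
        if not passes_rules(text):
            continue  # step 1: rule-based filtering
        if score(text) < threshold:
            continue  # step 2: model-based filtering (heuristic stand-in)
        # step 3: exact deduplication on whitespace-normalized text
        digest = hashlib.sha256(" ".join(text.split()).encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        yield text
```

Passing a different `score` callable, for example one that wraps a real classifier, changes the second step without touching the other two. A production pipeline at this scale would more likely use near-duplicate detection than exact hashing, but that choice is outside what the dataset description states.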


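About Beijing Academy of Artificial Intelligence: the Beijing Academy of Artificial Intelligence (BAAI, 智源研究院) is one of China's leading artificial intelligence research institutions, dedicated to advancing foundational AI theory and applied innovation.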

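About HuggingFace: Hugging Face is a collaborative platform for the machine learning community, focused on creating, discovering, and collaborating on models, datasets, and applications. The platform supports multiple data types, including text, images, video, audio, and 3D data, and offers open-source tools alongside paid compute and enterprise solutions.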
