Releases: ConardLi/easy-dataset
[1.7.2] 2026-02-25
If GitHub downloads are slow, a cloud drive mirror is available: https://pan.quark.cn/s/194b7eedf16e
Thanks to @AdamPlatin123 for the contribution.
🔧 Fixes
- Fixed the issue where custom evaluation question Prompts failed to take effect #676
- Fixed the race condition where old filter conditions were still used for requests after reset, ensuring stable display of task and progress status
- Fixed freezing/unresponsiveness when switching between multi-turn dialogue and evaluation features
⚡ Optimizations
1. Model Provider Display & Selection Experience
- Added/connected icon resources for multiple model providers, unified the provider logo mapping tool to standardize visual display
- Adjusted configuration logic of related pages and interfaces (covering model settings, project model configuration, session list display, etc.), optimizing provider selection experience
2. Page Performance & Response Speed
- Resolved freezing/unresponsiveness when switching between multi-turn dialogue and evaluation features, improved response speed for page switching and high-frequency filtering scenarios
- Added request cancellation mechanism (AbortController) on the frontend to avoid page lag caused by accumulated old requests during rapid switching
3. Lighter Interfaces & Data Requests
- Optimized the multi-turn dialogue list API: paginated queries no longer return large fields such as rawMessages, significantly reducing list response size
- Optimized the batch-task "select all" logic: only IDs are queried when fetching the records to operate on, avoiding pulling entire records and reducing data transfer
- Optimized evaluation dataset interface:
- Concurrently queries list and statistical data when statistics are needed to improve loading efficiency;
- Only pulls list data by default for filtering/pagination scenarios to reduce repeated query pressure
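The request-cancellation mechanism described above can be sketched roughly as follows. This is a minimal illustration of the AbortController pattern, not the project's actual code; the endpoint and function names are hypothetical.

```javascript
// Sketch of frontend request cancellation with AbortController.
// Endpoint and names are hypothetical, not easy-dataset's real code.
let currentController = null;

async function fetchDialogueList(fetchImpl, page) {
  // Abort the previous in-flight request before issuing a new one,
  // so a stale response can never overwrite fresher list data.
  if (currentController) currentController.abort();
  currentController = new AbortController();
  try {
    return await fetchImpl(`/api/dialogue-list?page=${page}`, {
      signal: currentController.signal,
    });
  } catch (err) {
    if (err.name === 'AbortError') return null; // cancelled: ignore
    throw err;
  }
}
```

Keeping one controller per list view means rapid filter or page switches leave at most one request pending, which is what prevents the "old requests pile up" lag.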
[1.7.1] 2026-01-24
🔧 Fixes
- Fixed the synchronization issue between text block full selection status and filter conditions
- Fixed the problem where model API call failures were incorrectly displayed as successful in asynchronous tasks
- Fixed the filter box experience issues in the dataset management module (e.g., invalid conditions, style anomalies)
⚡ Optimizations
- Added batch deletion support for text blocks to improve efficiency in multi-block scenarios
- Increased the maximum upload file size limit from 50MB to 300MB, enabling processing of larger documents or images
- Exposed image answer generation prompts in custom prompt configurations for fine-grained control of VQA generation logic #658
- Optimized export speed for extremely large datasets to reduce waiting time #667
[1.7.0] 2026-01-12
In v1.7.0, we focus on addressing the core pain points of model evaluation: difficult test set construction, lack of quantitative metrics, and the gap between automation and manual testing. This release introduces a brand-new Evaluation Module, creating a closed loop from Test Set Generation to Automated Scoring and Human Blind Testing.
Here are the detailed updates:
🎉 Core Highlights
- One-Stop Evaluation Loop: Support automatic test question generation from raw documents, one-click multi-model automated scoring, and visualized comparison reports.
- LMArena Mode Integration: Built-in "Chatbot Arena" style human blind testing to quantify subjective "feelings" into actionable data.
- Multi-Dimensional Question Support: Covers 5 core question types, satisfying evaluation needs ranging from fact-checking to logical reasoning.
🚀 New Features
1. Intelligent Test Set Generation
Stop worrying about the lack of QA pairs. You can now quickly build high-quality evaluation datasets through multiple methods:
- Document Extraction: Automatically slice and extract questions from PDF/Markdown/Docx domain literature.
- Dataset Variants: Generate variants based on existing training sets (e.g., converting multiple-choice questions to true/false) to expand testing diversity.
- 5 Core Question Types:
  - ✅ True/False: Detect model hallucinations.
  - ✅ Single/Multiple Choice: Test knowledge extraction and discrimination.
  - ✅ Short Answer: Test the ability to capture and express core knowledge concisely.
  - ✅ Open-Ended: Evaluate long-text reasoning and summarization skills.
- Ratio Configuration: Support custom distribution ratios for different question types during the generation task.
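As a rough illustration of how a ratio configuration can allocate a question budget across types (a hypothetical sketch; the field names and rounding policy are assumptions, not the project's implementation):

```javascript
// Hypothetical sketch: distribute a total question count across types
// according to configured ratios. Each share is floored, and leftover
// questions from rounding go to the largest-ratio types first.
function allocateQuestions(total, ratios) {
  const entries = Object.entries(ratios);
  const counts = {};
  let assigned = 0;
  for (const [type, r] of entries) {
    counts[type] = Math.floor(total * r);
    assigned += counts[type];
  }
  // Hand out the remainder in descending ratio order.
  const byRatio = entries.slice().sort((a, b) => b[1] - a[1]);
  for (let i = 0; assigned < total; i++, assigned++) {
    counts[byRatio[i % byRatio.length][0]] += 1;
  }
  return counts;
}
```

Flooring each share and then distributing the remainder keeps the total exact even when the ratios don't divide the budget evenly.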
2. Automated Evaluation Tasks (Auto-Evaluation)
Conduct concurrent assessments on multiple models just like an "exam". The system supports two grading modes:
- Rule-Based Scoring (Objective): For Choice and True/False questions. No LLM is required; the system uses direct code-level comparison for zero cost and zero error.
- Teacher Model Scoring (LLM-as-a-Judge): For Short Answer and Open-Ended questions. Configure a "Teacher Model" to score and comment on answers with customizable criteria (Prompts).
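Rule-based scoring for objective questions amounts to direct code-level comparison, which a sketch like the following illustrates (the question shape and type names are assumptions, not the project's schema):

```javascript
// Hypothetical sketch of rule-based (objective) scoring: choice and
// true/false answers are compared directly in code, with no LLM call.
function scoreObjective(question, modelAnswer) {
  const norm = (s) => s.toUpperCase().replace(/[^A-Z]/g, '');
  const sorted = (s) => [...norm(s)].sort().join('');
  if (question.type === 'true_false' || question.type === 'single_choice') {
    return norm(modelAnswer) === norm(question.answer) ? 1 : 0;
  }
  if (question.type === 'multiple_choice') {
    // Order-insensitive: "BD" and "D,B" count as the same selection.
    return sorted(modelAnswer) === sorted(question.answer) ? 1 : 0;
  }
  return null; // short-answer / open-ended go to the teacher model
}
```

Normalizing case and punctuation before comparing is what makes this path "zero cost, zero error" for formats the rules cover; anything subjective falls through to LLM-as-a-Judge.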
3. Human Blind Test Arena
Return to the authentic "Side-by-Side" evaluation experience:
- Anonymous Battle: Model names are hidden. Answers are displayed side-by-side (supports streaming output).
- Intuitive Voting: Simply click "Left is better", "Right is better", or "Tie".
- Win Rate Statistics: Automatically generate win-rate comparison charts to eliminate brand bias.
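A win-rate tally over blind votes takes only a few lines; the sketch below is hypothetical (the vote shape is assumed). Note that a tie here counts as a battle with no win for either side; arena-style leaderboards often use Elo-style ratings instead:

```javascript
// Hypothetical arena tally: each vote records the two anonymous sides
// and the verdict ('a', 'b', or 'tie').
function winRates(votes) {
  const stats = {}; // model -> { wins, battles }
  const bump = (m, won) => {
    stats[m] = stats[m] || { wins: 0, battles: 0 };
    stats[m].battles += 1;
    if (won) stats[m].wins += 1;
  };
  for (const { a, b, winner } of votes) {
    bump(a, winner === 'a');
    bump(b, winner === 'b');
  }
  const rates = {};
  for (const [m, s] of Object.entries(stats)) {
    rates[m] = s.wins / s.battles;
  }
  return rates;
}
```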
4. Data & Ecosystem
- Multi-Format Import/Export: Fully supports JSON, XLS, and XLSX formats for test set management.
- Built-in Domain Question Banks: Pre-loaded standard test sets across multiple disciplines and domains, ready to use out of the box.
- Fully Open Prompts: Complete access to configure prompts for the evaluation system (Question Generation, Answer Extraction, Scoring Standards), allowing for high customization.
[1.6.2] 2025-12-29
🔧 Fixes
- Fixed the error when deleting imported data #645
- Fixed the invalid max_completion_tokens setting for some OpenAI models #623
- Fixed the issue where some Alibaba Cloud Bailian vision models failed to recognize images #622
- Fixed the homepage card's failure to count image dataset quantities #611 #607
- Fixed non-standard image design on macOS #630
- Fixed the question management list failing to load because the label tree query API blocked under large label volumes #629
- Fixed extremely slow label construction in automatic distillation tasks under large label volumes #629
✨ New Features
- Supported exporting questions individually #644
- Supported batch-selecting files and adding custom GA #643
- Supported selecting all and batch deleting uploaded files #636
- Added "exclude keyword" filtering in question management to quickly filter out unwanted questions #613
- Added a Token statistics panel (accessible from the statistics icon at the top right of the homepage) #133
⚡ Optimizations
- Optimized the way the project selection dialog closes
[1.6.1] 2025-11-22
🔧 Fixes
- Pagination settings reset to default after returning from dataset details (#594)
→ Fixed the issue where pagination settings (page number, items per page, etc.) reset to default when returning to the dataset list from the details page, keeping pagination state consistent.
- Domain tree view and question list bugs (#598)
→ Resolved the inability to delete questions in the domain tree view, incorrect display of unclassified questions, and abnormal pagination state in question list query conditions.
⚡ Optimizations
- Menu and component style adaptation
→ The menu automatically collapses into the left sidebar when width is insufficient; the model selection box defaults to an icon and expands on hover, improving narrow-screen compatibility.
- Toast notification optimization (#595)
→ Adjusted the default position of Toast notifications to reduce obstruction; shortened the default display duration to 1 second, minimizing interference with operations.
✨ New Features/Support
- Multilingual support expansion
→ Added Turkish support to serve users in more regions.
- Image import enhancement (#590)
→ Supports importing images via compressed archives, solving the inability to select local image paths directly in Docker container environments.
- Image management improvement
→ Added select-all and multi-select delete in the image management list view, improving batch image management efficiency.
[1.6.0] 2025-10-30
🔧 Fixes
- Fixed ModelId update error when saving models
→ Corrected abnormal synchronization of the ModelId field during model configuration saving to ensure unique model identification.
- Fixed issues with batch dataset evaluation (#576)
→ Added the ability to interrupt batch evaluation tasks, supporting manual termination of ongoing evaluations; optimized evaluation algorithms to improve batch processing speed.
- Fixed input interruption caused by dataset shortcuts (#578)
→ Adjusted shortcut trigger logic to avoid conflicts with text input, ensuring input is not accidentally interrupted.
- Fixed export failure when selecting large numbers of datasets (#578)
→ Optimized the export task sharding mechanism to resolve memory overflow and connection timeouts caused by excessive data volume.
- Fixed ineffective balanced export (#561)
→ Corrected sample distribution calculation errors in balanced export logic to ensure data of different categories are exported according to preset ratios.
- Fixed errors when calling Qwen3 models via Alibaba Cloud Bailian (#412, #482)
→ Adapted to the Qwen3 model interface protocol and corrected request parameter formats and authentication logic to ensure normal calls.
⚡ Optimizations
- Improved stability of multi-turn dialogue dataset parsing
→ Enhanced compatibility parsing for multi-turn dialogue formats (e.g., ShareGPT) to reduce parsing failures caused by format variations.
- Asynchronous execution of single text block operations (#530, #494)
→ Changed "generate questions for a single text block" and "AI intelligent dataset optimization" into background asynchronous tasks that do not block other front-end operations.
- Enhanced text block filtering (#541)
→ Supports filtering text blocks by keyword search and word count range (e.g., 100-500 words) to quickly locate target text.
- Model configuration supports Top parameter control (#517)
→ Added Top parameter (e.g., Top-K/Top-P) settings on the model configuration page to adjust the diversity and determinism of generated content.
- Filter by text block name (#275)
→ Question lists and dataset lists support filtering by associated text block (file) name, improving cross-module data lookup efficiency.
✨ New Features
- Fully automated dataset distillation background tasks (#432, #492, #495, #496)
→ Supports full-process automation from triggering distillation to dataset generation via background asynchronous tasks, with real-time progress tracking and no manual intervention.
- Support for renaming distillation labels (#422)
→ Allows custom naming of labels generated during distillation to suit label management needs in different scenarios.
- Generate Visual Question Answering (VQA) datasets (#130, #483, #537)
→ Supports uploading image files and automatically generating image-related questions and answers to build VQA datasets suitable for vision-language model training.
- Question template function
→ Enables creating multiple custom question types (e.g., "describe image content", "analyze text opinions") and applying them to all images or text blocks to batch-generate corresponding questions, improving standardization and scenario fit.
[1.5.1] 2025-10-19
🔧 Fixes
- Inaccurate domain tree revision when deleting files
→ Further optimized domain tree update logic after file deletion to ensure only nodes strongly associated with deleted files are removed, avoiding incorrect deletions or residual invalid nodes.
- Question status remains "answered" after deleting answers (#572)
→ Fixed the issue where questions still showed the "answer generated" status after their answers were deleted, ensuring status consistency with actual data.
- Dataset management filtering bugs (#571, #569, #568)
→ Resolved invalid filter combinations, results not updating, and unresponsive tag filtering, improving the stability of the filtering features.
- Inaccurate field recognition during Alpaca/ShareGPT import (#549, #564)
→ Optimized field mapping logic for these formats, fixing misrecognition of core fields like instruction/input/conversation to ensure complete data import.
⚡ Optimizations
- Support exporting only selected datasets (#570)
→ Added an "export only selected items" option during dataset export, allowing manual selection of specific datasets and improving batch operation flexibility.
- Dataset confirmation and editing improvements (#542)
  - Added an "undo confirmation" function: the confirmed status of a dataset can be reverted at any time, avoiding irreversible effects of misoperations.
  - Enabled direct question editing on the dataset details page, removing the need to navigate to a separate page and simplifying the modification workflow.
[1.5.0] 2025-09-29
⚠️ Breaking Change
- Custom prompts configured in versions prior to 1.5.0 will become invalid; core prompts must be reconfigured after upgrading to 1.5.0.
✨ New Features
- Full Core Prompt Customization
→ All core prompts in Easy Dataset (e.g., question generation, answer production, data cleaning) are now configurable, so future adjustments need no code changes and can adapt to diverse scenarios.
- AI Dataset Quality Evaluation (#546)
→ Added automatic dataset quality evaluation, supporting:
  - Instant evaluation of a single dataset (covering relevance, accuracy, completeness, etc.);
  - Asynchronous batch evaluation of multiple datasets (processed via background tasks, with evaluation reports available).
- Multi-turn Dialogue SFT Dataset Generation (#504)
→ Supports generating multi-turn dialogue SFT datasets through two methods:
  - Extracting multi-turn Q&A from literature content;
  - Distilling multi-turn dialogue data directly from large models.
- GPT OSS Multilingual-Thinking Dataset Export (#560)
→ Added export support for the GPT OSS Multilingual-Thinking format, adapting to multilingual model training scenarios.
- Custom Delimiter Chunking (#559)
→ Supports text splitting by custom delimiters (e.g., line breaks, specific symbols). Delimiters are automatically discarded, and split text blocks are not restricted by the preset chunk size, preserving complete semantic units.
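The delimiter behavior described above (delimiter discarded, chunks not re-split to a preset size) can be illustrated with a small sketch; this is hypothetical, not the project's actual chunker:

```javascript
// Hypothetical sketch of delimiter-based chunking: split on a
// user-supplied delimiter, discard the delimiter itself, and keep
// each resulting semantic unit whole regardless of its length.
function chunkByDelimiter(text, delimiter) {
  return text
    .split(delimiter)
    .map((chunk) => chunk.trim())
    .filter((chunk) => chunk.length > 0); // drop empty fragments
}
```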
⚡ Optimizations
- Improved Stability of Structured Model Output
→ Added more tolerant parsing logic to reduce format anomalies in model outputs (e.g., JSON parsing failures, missing fields), improving the stability of structured data generation.
- Markdown Display Style Optimization
→ Optimized Markdown rendering styles on dataset detail pages and the custom prompt editor, improving readability (e.g., adjusted fonts, line spacing, code block highlighting).
🔧 Fixes
- Context Overflow Due to Oversized Literature Catalogs
→ Optimized literature catalog processing to automatically truncate or segment overly large catalogs, avoiding model context length limits.
- Unexpected Content Introduced During Data Cleaning (#504, #529)
→ Fixed issues where irrelevant content or chain-of-thought information was accidentally introduced during data cleaning, ensuring the purity of cleaned text.
- Inaccurate Domain Tree Revision When Deleting Files
→ Corrected the domain tree node update logic after file deletion, ensuring only nodes related to deleted files are removed and avoiding incorrect deletions or residual invalid nodes.
[1.4.0] 2025-08-31
✨ New Features
- Support for Local MinerU Deployment (#200, #245)
→ Allows configuring a local MinerU service URL in task settings, enabling integration with locally deployed MinerU tools.
- Enhanced Dataset Management (#81)
→ Added dataset rating, custom tags, and notes, with support for filtering on these attributes.
- Literature Content Cleaning (#516)
→ Supports preprocessing and cleaning original literature content to improve subsequent dataset quality; allows custom data cleaning prompts for different scenarios.
- Extended Dataset Export Options
- Expanded Literature Format Support (#205)
→ Added support for uploading and analyzing .epub documents, broadening literature processing scope.
- Dataset Import Function (#498)
→ Supports importing existing datasets from local files for quick reuse of external data resources.
⚡ Optimizations
- Dataset Pagination Improvement (#497)
→ Automatically preserves the selected state of Markdown tags across pagination to avoid repeated operations.
- Dataset List Filter Enhancement (#275)
→ Added a "distilled dataset" filter to quickly locate specific data types.
🔧 Fixes
- Large Dataset Export Issue (#502)
→ Fixed freezing when exporting large-scale datasets; added a batch export mechanism to improve stability.
- Cross-Project Question Conflicts (#509)
→ Resolved conflict anomalies in question DIFF comparisons between projects, ensuring cross-project data consistency.
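A batch export mechanism of the kind mentioned in the fix above can be sketched as a generator that yields fixed-size slices instead of materializing the entire dataset at once (illustrative only; the names and batch size are assumptions, not the project's code):

```javascript
// Hypothetical sketch of batch export: stream records in fixed-size
// batches so memory use stays bounded for very large datasets.
function* batches(items, size) {
  for (let i = 0; i < items.length; i += size) {
    yield items.slice(i, i + size);
  }
}
```

Each batch can then be serialized and flushed to the output file before the next one is fetched, which avoids the memory spikes that caused the original freeze.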
[1.3.9] 2025-07-03
⚡ Optimizations
-
- Docker Deployment Configuration (#442)
  - Optimized docker-compose.yml to add database folder mounting for data persistence
  - Added documentation for Docker data migration, guiding cross-environment data transfer
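For reference, the kind of mount described might look like the fragment below. The service name, image, port, and container path here are assumptions, not copied from the repository, so check the project's actual docker-compose.yml and README before using it:

```yaml
services:
  easy-dataset:
    image: conardli/easy-dataset   # assumed image name
    ports:
      - "1717:1717"                # assumed port
    volumes:
      # Mount a host folder over the container's database directory so
      # data survives container recreation (container path is assumed).
      - ./local-db:/app/local-db
```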
- Literature Processing UI Enhancement
  - Enabled drag-and-drop file upload to the literature processing area
  - Refactored the button layout to reduce space usage and improve mobile adaptability
- Dataset Details Query Performance (#419)
  - Optimized database query indexes, improving dataset details page loading speed by ~30%
  - Implemented lazy loading of paginated data to reduce memory usage with large datasets
- Dataset Filtering Enhancement (#429)
  - Added new filtering dimensions:
    - Chain-of-thought generation status (generated/ungenerated)
    - Keyword filtering on question/answer content
    - Tag classification filtering (supporting multi-tag combination queries)
- Document Search Function (#426)
  - Supports full-text search of document content, enabling quick keyword-based location of relevant literature