
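feat(rss): add guid-based deduplication, strengthen empty-title protection and translation quality, add Markdown export, bump to v6.7.0

- RSS storage adds a guid field; deduplication priority is now guid > url
- Multi-level empty-title fallback in the parser and rendering layer (summary → URL → feed name)
- Translation prompt requires numbering order to be preserved; empty translations do not overwrite the original title
- Prevent URL-format titles from overwriting meaningful existing titles
- HTML report adds Markdown export (#1121)
- Fix the MCP badge version in the README (v4.0.2 → v4.0.4)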
sansan 3 days ago
parent
commit
b6152fe0cd

+ 16 - 9
README-EN.md

@@ -11,8 +11,8 @@ Deploy in <strong>30 seconds</strong> — Say goodbye to endless scrolling, only
 [![GitHub Stars](https://img.shields.io/github/stars/sansan0/TrendRadar?style=flat-square&logo=github&color=yellow)](https://github.com/sansan0/TrendRadar/stargazers)
 [![GitHub Forks](https://img.shields.io/github/forks/sansan0/TrendRadar?style=flat-square&logo=github&color=blue)](https://github.com/sansan0/TrendRadar/network/members)
 [![License](https://img.shields.io/badge/license-GPL--3.0-blue.svg?style=flat-square)](LICENSE)
-[![Version](https://img.shields.io/badge/version-v6.6.2-blue.svg)](https://github.com/sansan0/TrendRadar)
-[![MCP](https://img.shields.io/badge/MCP-v4.0.2-green.svg)](https://github.com/sansan0/TrendRadar)
+[![Version](https://img.shields.io/badge/version-v6.7.0-blue.svg)](https://github.com/sansan0/TrendRadar)
+[![MCP](https://img.shields.io/badge/MCP-v4.0.4-green.svg)](https://github.com/sansan0/TrendRadar)
 [![RSS](https://img.shields.io/badge/RSS-Feed_Support-orange.svg?style=flat-square&logo=rss&logoColor=white)](https://github.com/sansan0/TrendRadar)
 [![AI Translation](https://img.shields.io/badge/AI-Multi--Language-purple.svg?style=flat-square)](https://github.com/sansan0/TrendRadar)
 
@@ -193,14 +193,12 @@ This contributes to the sustainable maintenance of the project and the growth of
 - **Tip**: Check [Changelog] to understand specific [Features]
 
 
-### 2026/03/28 - v6.6.0
+### 2026/05/15 - v6.7.0
 
-- **HTML Report Browser Enhancement**: Open the HTML report in a browser to unlock widescreen layout, Tab navigation for keyword groups and standalone sections, real-time title search, and more — email clients still show the original narrow layout with zero regression
-- **Dark Mode**: One-click toggle for dark theme with automatic preference persistence, ideal for nighttime reading
-- **One-Click Copy**: Hover over a news number to copy the title and link instantly for quick sharing
-- **Export Optimization**: Full-page and segmented screenshots merged into a dropdown export button; screenshots auto-revert to clean layout
-- **Keyboard Shortcuts**: `W` widescreen toggle, `D` dark mode, `/` search, `?` view all shortcuts
-- **Reading Progress Bar**: Real-time reading progress displayed at the top of the page
+- **Markdown Export**: New Markdown option in the report export dropdown — generate structured text with clickable links, perfect for LLM processing and cross-platform sharing ([#1121](https://github.com/sansan0/TrendRadar/issues/1121))
+- **RSS GUID Deduplication**: RSS storage now supports GUID field with priority order guid > url, preventing duplicate entries caused by URL changes for the same article
+- **Empty Title Protection**: Full-chain fallback logic across parser, renderer, and translation backfill ensures items without titles still display properly
+- **Translation Quality Enhancement**: Translation prompt now enforces numbered-item ordering preservation; empty translation results no longer overwrite original titles
 
 ### 2026/02/09 - mcp-v4.0.0
 
@@ -214,6 +212,15 @@ This contributes to the sustainable maintenance of the project and the growth of
 <details>
 <summary>👉 Click to expand: <strong>Historical Updates</strong></summary>
 
+### 2026/03/28 - v6.6.0
+
+- **HTML Report Browser Enhancement**: Open the HTML report in a browser to unlock widescreen layout, Tab navigation for keyword groups and standalone sections, real-time title search, and more — email clients still show the original narrow layout with zero regression
+- **Dark Mode**: One-click toggle for dark theme with automatic preference persistence, ideal for nighttime reading
+- **One-Click Copy**: Hover over a news number to copy the title and link instantly for quick sharing
+- **Export Optimization**: Full-page and segmented screenshots merged into a dropdown export button; screenshots auto-revert to clean layout
+- **Keyboard Shortcuts**: `W` widescreen toggle, `D` dark mode, `/` search, `?` view all shortcuts
+- **Reading Progress Bar**: Real-time reading progress displayed at the top of the page
+
 ### 2026/03/12 - v6.5.0
 
 - **AI Smart News Filtering**: No more manual keyword setup! Describe your interests in everyday language in `ai_interests.txt` (e.g., "I want AI and renewable energy news"), and AI automatically extracts tags, scores every headline, and only pushes what truly matters to you. If AI filtering encounters issues, it auto-falls back to keyword matching — push delivery never stops

+ 16 - 9
README.md

@@ -12,8 +12,8 @@
 [![GitHub Stars](https://img.shields.io/github/stars/sansan0/TrendRadar?style=flat-square&logo=github&color=yellow)](https://github.com/sansan0/TrendRadar/stargazers)
 [![GitHub Forks](https://img.shields.io/github/forks/sansan0/TrendRadar?style=flat-square&logo=github&color=blue)](https://github.com/sansan0/TrendRadar/network/members)
 [![License](https://img.shields.io/badge/license-GPL--3.0-blue.svg?style=flat-square)](LICENSE)
-[![Version](https://img.shields.io/badge/version-v6.6.2-blue.svg)](https://github.com/sansan0/TrendRadar)
-[![MCP](https://img.shields.io/badge/MCP-v4.0.2-green.svg)](https://github.com/sansan0/TrendRadar)
+[![Version](https://img.shields.io/badge/version-v6.7.0-blue.svg)](https://github.com/sansan0/TrendRadar)
+[![MCP](https://img.shields.io/badge/MCP-v4.0.4-green.svg)](https://github.com/sansan0/TrendRadar)
 [![RSS](https://img.shields.io/badge/RSS-订阅源支持-orange.svg?style=flat-square&logo=rss&logoColor=white)](https://github.com/sansan0/TrendRadar)
 [![AI翻译](https://img.shields.io/badge/AI-多语言推送-purple.svg?style=flat-square)](https://github.com/sansan0/TrendRadar)
 
@@ -241,14 +241,12 @@
 - **提示**:建议查看【历史更新】,明确具体的【功能内容】
 
 
-### 2026/03/28 - v6.6.0
+### 2026/05/15 - v6.7.0
 
-- **HTML 报告浏览器增强**:在浏览器中打开报告可自动切换宽屏布局,关键词分组和独立展区均支持 Tab 快速切换,搜索框实时过滤新闻标题,邮件客户端仍显示原始窄屏布局,零回归
-- **暗色模式**:一键切换深色主题,自动记住偏好,适合夜间阅读
-- **一键复制新闻**:鼠标悬停新闻序号即可复制标题和链接,方便快速分享
-- **导出优化**:整页截图和分段截图合并为下拉式导出按钮,截图时自动还原干净布局
-- **快捷键系统**:支持 `W` 宽屏切换、`D` 暗色模式、`/` 搜索、`?` 查看快捷键提示
-- **阅读进度条**:页面顶部实时显示阅读进度
+- **Markdown 导出**:报告导出下拉菜单新增 Markdown 格式,一键生成带链接的结构化文本,方便 LLM 二次加工和跨平台分享([#1121](https://github.com/sansan0/TrendRadar/issues/1121))
+- **RSS guid 去重**:RSS 存储新增 guid 字段,去重优先级改为 guid > url,解决同一文章因 URL 变化导致重复入库的问题
+- **空标题防护**:解析器、渲染层、翻译回填全链路增加空标题兜底逻辑,确保无标题条目也能正常显示
+- **翻译质量增强**:翻译提示词要求保留编号顺序,空翻译结果不再覆盖原始标题
 
 ### 2026/02/09 - mcp-v4.0.0
 
@@ -262,6 +260,15 @@
 <details>
 <summary>👉 点击展开:<strong>历史更新</strong></summary>
 
+### 2026/03/28 - v6.6.0
+
+- **HTML 报告浏览器增强**:在浏览器中打开报告可自动切换宽屏布局,关键词分组和独立展区均支持 Tab 快速切换,搜索框实时过滤新闻标题,邮件客户端仍显示原始窄屏布局,零回归
+- **暗色模式**:一键切换深色主题,自动记住偏好,适合夜间阅读
+- **一键复制新闻**:鼠标悬停新闻序号即可复制标题和链接,方便快速分享
+- **导出优化**:整页截图和分段截图合并为下拉式导出按钮,截图时自动还原干净布局
+- **快捷键系统**:支持 `W` 宽屏切换、`D` 暗色模式、`/` 搜索、`?` 查看快捷键提示
+- **阅读进度条**:页面顶部实时显示阅读进度
+
 ### 2026/03/12 - v6.5.0
 
 - **AI 智能筛选系统**:不用再手动设关键词!在 `ai_interests.txt` 里用日常语言写下你关注的方向(如"我想看 AI 和新能源相关新闻"),AI 会自动提取标签并对每条新闻打分,只推送真正和你相关的内容。万一 AI 筛选出了问题,会自动切回关键词匹配,推送不中断

+ 1 - 1
config/ai_translation_prompt.txt

@@ -18,7 +18,7 @@
 1. 准确传达原文含义,不要遗漏关键信息。
 2. 保持新闻标题的吸引力,但不要做标题党。
 3. 专有名词(人名、地名、机构名)若有通用译名请使用通用译名,否则保留原文或在括号内备注。
-4. 输出格式必须严格遵循要求,不要输出任何多余的解释性文字。
+4. 输出格式必须严格遵循要求,不要输出任何多余的解释性文字。如果输入包含编号(如 [1]、[2]、[3]...),**必须**在输出中保留完全相同的编号和顺序,每条编号对应一条翻译结果,不得跳过、合并或增加任何编号条目。
 5. ⚠️重点:输入可能包含混合语言列表。请务必逐行检查每一条内容。如果某条内容不是 {target_language},**必须**将其翻译为 {target_language}。严禁保留非 {target_language} 的原文(除非是纯专有名词)。即使列表中 99% 已经是目标语言,也绝对不能忽略剩下的 1%。
 6. 格式严格限制:输出结果中**只允许包含目标语言**的文本。绝对禁止“原文 + 译文”的形式。如果进行了翻译,直接用译文替换原文,不要在后面括号备注原文,也不要保留原文。
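
The numbering rule added to the prompt (rule 4) only helps if the caller can check that the reply kept every number. A minimal sketch of such a check follows, assuming the model answers with one `[n] text` entry per line. `parse_numbered_reply` is a hypothetical helper, not the project's `_parse_batch_response`, and it checks that each number from 1 to `expected` appears exactly once rather than enforcing line order.

```python
import re
from typing import Dict, List, Optional

# Hypothetical helper, not part of the commit. Assumes one "[n] text" entry per line.
_NUMBERED = re.compile(r"^\[(\d+)\]\s*(.*)$")


def parse_numbered_reply(reply: str, expected: int) -> Optional[List[str]]:
    """Return translations ordered by number, or None if any number is missing or repeated."""
    found: Dict[int, str] = {}
    for line in reply.splitlines():
        match = _NUMBERED.match(line.strip())
        if not match:
            continue  # ignore lines that carry no "[n]" prefix
        number = int(match.group(1))
        if number in found:
            return None  # the same number appeared twice
        found[number] = match.group(2).strip()
    if sorted(found) != list(range(1, expected + 1)):
        return None  # skipped, merged, or extra entries
    return [found[i] for i in range(1, expected + 1)]
```

A caller could treat a `None` result as a failed batch and keep the original titles.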
 

+ 1 - 1
pyproject.toml

@@ -1,6 +1,6 @@
 [project]
 name = "trendradar"
-version = "6.6.2"
+version = "6.7.0"
 description = "TrendRadar - 热点新闻聚合与分析工具"
 requires-python = ">=3.12"
 dependencies = [

+ 1 - 1
trendradar/__init__.py

@@ -9,5 +9,5 @@ TrendRadar - 热点新闻聚合与分析工具
 
 from trendradar.context import AppContext
 
-__version__ = "6.6.2"
+__version__ = "6.7.0"
 __all__ = ["AppContext", "__version__"]

+ 9 - 4
trendradar/ai/translator.py

@@ -187,11 +187,16 @@ class AITranslator:
             translated_texts, raw_parsed_count = self._parse_batch_response(response, len(non_empty_texts))
             batch_result.parsed_count = raw_parsed_count
 
-            # Fill in results
+            # Fill in results (skip empty translations so an empty string never overwrites the original title)
             for idx, translated in zip(non_empty_indices, translated_texts):
-                batch_result.results[idx].translated_text = translated
-                batch_result.results[idx].success = True
-                batch_result.success_count += 1
+                if translated and translated.strip():
+                    batch_result.results[idx].translated_text = translated
+                    batch_result.results[idx].success = True
+                    batch_result.success_count += 1
+                else:
+                    batch_result.results[idx].translated_text = batch_result.results[idx].original_text
+                    batch_result.results[idx].success = True
+                    batch_result.success_count += 1
 
         except Exception as e:
             error_msg = f"批量翻译失败: {type(e).__name__}: {str(e)[:100]}"
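
The fill loop in this hunk can be read as "use the translation when it has visible text, otherwise fall back to the original". A minimal standalone sketch of that behaviour follows. `TranslationResult` and `fill_results` are hypothetical stand-ins for the project's result objects, and the two branches of the diff are merged here because both mark the item as successful.

```python
from dataclasses import dataclass


@dataclass
class TranslationResult:
    # Hypothetical stand-in for the project's per-item result object.
    original_text: str
    translated_text: str = ""
    success: bool = False


def fill_results(results, indices, translated_texts):
    """Copy translations into results, keeping the original text when one is blank."""
    filled = 0
    for idx, translated in zip(indices, translated_texts):
        result = results[idx]
        if translated and translated.strip():
            result.translated_text = translated
        else:
            # Blank or whitespace-only translation: fall back to the original.
            result.translated_text = result.original_text
        result.success = True
        filled += 1
    return filled
```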

+ 1 - 0
trendradar/crawler/rss/fetcher.py

@@ -157,6 +157,7 @@ class RSSFetcher:
                     feed_id=feed.id,
                     feed_name=feed.name,
                     url=parsed.url,
+                    guid=parsed.guid or "",
                     published_at=parsed.published_at or "",
                     summary=parsed.summary or "",
                     author=parsed.author or "",

+ 21 - 8
trendradar/crawler/rss/parser.py

@@ -125,20 +125,20 @@ class RSSParser:
 
     def _parse_json_feed_item(self, item_data: Dict[str, Any]) -> Optional[ParsedRSSItem]:
         """Parse a single JSON Feed item"""
-        # Title: prefer title, otherwise use the first 100 characters of content_text
+        url = item_data.get("url", "") or item_data.get("external_url", "")
+
         title = item_data.get("title", "")
         if not title:
             content_text = item_data.get("content_text", "")
             if content_text:
-                title = content_text[:100] + ("..." if len(content_text) > 100 else "")
+                title = content_text[:20] + ("..." if len(content_text) > 20 else "")
 
         title = self._clean_text(title)
+        if not title and url:
+            title = url
         if not title:
             return None
 
-        # URL
-        url = item_data.get("url", "") or item_data.get("external_url", "")
-
         # Publication time (ISO 8601 format)
         published_at = None
         date_str = item_data.get("date_published") or item_data.get("date_modified")
@@ -216,12 +216,9 @@ class RSSParser:
     def _parse_entry(self, entry: Any) -> Optional[ParsedRSSItem]:
         """Parse a single entry"""
         title = self._clean_text(entry.get("title", ""))
-        if not title:
-            return None
 
         url = entry.get("link", "")
         if not url:
-            # Try to get it from links
             links = entry.get("links", [])
             for link in links:
                 if link.get("rel") == "alternate" or link.get("type", "").startswith("text/html"):
@@ -230,6 +227,22 @@ class RSSParser:
             if not url and links:
                 url = links[0].get("href", "")
 
+        if not title:
+            raw_summary = entry.get("summary") or entry.get("description", "")
+            if not raw_summary:
+                content = entry.get("content", [])
+                if content and isinstance(content, list):
+                    raw_summary = content[0].get("value", "")
+            if raw_summary:
+                title = self._clean_text(raw_summary)
+                if len(title) > 20:
+                    title = title[:20] + "..."
+            if not title and url:
+                title = url
+
+        if not title:
+            return None
+
         published_at = self._parse_date(entry)
         summary = self._parse_summary(entry)
         author = self._parse_author(entry)
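
Both parser hunks apply the same fallback order: an explicit title, then a summary truncated to 20 characters, then the URL, and only then is the item dropped. A minimal sketch of that chain follows. `fallback_title` is a hypothetical function, and plain `strip()` stands in for the parser's `_clean_text`, whose exact behaviour is not shown in this diff.

```python
from typing import Optional


def fallback_title(title: str, summary: str, url: str, limit: int = 20) -> Optional[str]:
    """Pick a display title: title, else truncated summary, else URL, else None."""
    title = (title or "").strip()
    if title:
        return title
    summary = (summary or "").strip()
    if summary:
        # Same truncation rule as the diff: first `limit` characters plus "...".
        return summary[:limit] + ("..." if len(summary) > limit else "")
    return url or None
```

Returning `None` corresponds to the parser skipping an entry that has no title, summary, or link.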

+ 3 - 1
trendradar/notification/dispatcher.py

@@ -193,10 +193,12 @@ class NotificationDispatcher:
             if unchanged_count > 0:
                 print(f"[翻译][DEBUG] (另有 {unchanged_count} 条未变化,已省略)")
 
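-        # Backfill translation results
+        # Backfill translation results (replace only when the translated text is non-empty, so an empty translation never overwrites the original title)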
         for i, (loc_type, idx1, idx2) in enumerate(title_locations):
             if i < len(result.results) and result.results[i].success:
                 translated = result.results[i].translated_text
+                if not translated or not translated.strip():
+                    continue
                 if loc_type == "stats":
                     report_data["stats"][idx1]["titles"][idx2]["title"] = translated
                 elif loc_type == "new_titles":

+ 2 - 0
trendradar/report/formatter.py

@@ -50,6 +50,8 @@ def format_title_for_platform(
 
     link_url = title_data["mobile_url"] or title_data["url"]
     cleaned_title = clean_title(title_data["title"])
+    if not cleaned_title:
+        cleaned_title = link_url or title_data["url"] or ""
 
     # Get the keyword tag (used in platform mode)
     keyword = title_data.get("matched_keyword", "") if show_keyword else ""

+ 159 - 1
trendradar/report/html.py

@@ -86,7 +86,7 @@ def render_html_content(
                 padding: 32px 24px;
                 text-align: center;
                 position: relative;
-                overflow: hidden;
+                overflow: visible;
             }
 
             .header-watermark {
@@ -1254,6 +1254,7 @@ def render_html_content(
                         <div class="save-dropdown-menu">
                             <button class="save-dropdown-item" onclick="saveAsImage()"><svg class="dropdown-icon" viewBox="0 0 16 16" fill="none" stroke="currentColor" stroke-width="1.5"><rect x="2" y="2" width="12" height="12" rx="2"/><circle cx="8" cy="7.5" r="2.5"/><path d="M12 4h.01"/></svg>整页截图</button>
                             <button class="save-dropdown-item" onclick="saveAsMultipleImages()"><svg class="dropdown-icon" viewBox="0 0 16 16" fill="none" stroke="currentColor" stroke-width="1.5"><rect x="1" y="4" width="10" height="10" rx="1.5"/><path d="M5 4V2.5A1.5 1.5 0 016.5 1h7A1.5 1.5 0 0115 2.5v7a1.5 1.5 0 01-1.5 1.5H12"/></svg>分段截图</button>
+                            <button class="save-dropdown-item" onclick="saveAsMarkdown()"><svg class="dropdown-icon" viewBox="0 0 16 16" fill="none" stroke="currentColor" stroke-width="1.5"><path d="M2.5 2h11A1.5 1.5 0 0115 3.5v9a1.5 1.5 0 01-1.5 1.5h-11A1.5 1.5 0 011 12.5v-9A1.5 1.5 0 012.5 2z"/><path d="M4 11V5l2.5 3L9 5v6"/><path d="M11.5 8v3m0 0l-1.5-2m1.5 2l1.5-2"/></svg>Markdown</button>
                         </div>
                     </div>
                 </div>
@@ -2522,6 +2523,163 @@ def render_html_content(
                 }
             }
 
+            function saveAsMarkdown() {
+                var lines = [];
+                var now = new Date();
+                var dateStr = now.getFullYear() + '-' + String(now.getMonth() + 1).padStart(2, '0') + '-' + String(now.getDate()).padStart(2, '0');
+                var timeStr = String(now.getHours()).padStart(2, '0') + ':' + String(now.getMinutes()).padStart(2, '0');
+
+                // Title
+                var headerTitle = document.querySelector('.header-title');
+                lines.push('# ' + (headerTitle ? headerTitle.textContent.trim() : 'TrendRadar'));
+                lines.push('');
+
+                // Report metadata
+                var infoItems = document.querySelectorAll('.header-info .info-item');
+                if (infoItems.length) {
+                    infoItems.forEach(function(item) {
+                        var label = item.querySelector('.info-label');
+                        var value = item.querySelector('.info-value');
+                        if (label && value) {
+                            lines.push('- **' + label.textContent.trim() + '**: ' + value.textContent.trim());
+                        }
+                    });
+                    lines.push('');
+                }
+
+                // Shared helper that extracts a news-item
+                function extractItem(item, idx) {
+                    var titleEl = item.querySelector('.news-title a');
+                    var titleText = '';
+                    var url = '';
+                    if (titleEl) {
+                        titleText = titleEl.textContent.trim();
+                        url = titleEl.href || '';
+                    } else {
+                        var titleDiv = item.querySelector('.news-title') || item.querySelector('.new-item-title');
+                        if (titleDiv) titleText = titleDiv.textContent.trim();
+                    }
+                    if (!titleText) return '';
+
+                    var meta = [];
+                    var rank = item.querySelector('.rank-num, .new-item-rank');
+                    if (rank && rank.textContent.trim() && rank.textContent.trim() !== '?') meta.push('#' + rank.textContent.trim());
+                    var source = item.querySelector('.source-name');
+                    if (source) meta.push(source.textContent.trim());
+                    var keyword = item.querySelector('.keyword-tag');
+                    if (keyword) meta.push(keyword.textContent.trim());
+                    var time = item.querySelector('.time-info');
+                    if (time) meta.push(time.textContent.trim());
+                    var count = item.querySelector('.count-info');
+                    if (count) meta.push(count.textContent.trim());
+
+                    var line = idx + '. ';
+                    if (url) {
+                        line += '[' + titleText.replace(/[[\]]/g, '') + '](' + url + ')';
+                    } else {
+                        line += titleText;
+                    }
+                    if (meta.length) line += '  `' + meta.join(' | ') + '`';
+                    return line;
+                }
+
+                // Hot keyword section
+                var wordGroups = document.querySelectorAll('.hotlist-section > .word-group');
+                if (wordGroups.length) {
+                    lines.push('## 热点新闻');
+                    lines.push('');
+                    wordGroups.forEach(function(group) {
+                        var wordName = group.querySelector('.word-name');
+                        var wordCount = group.querySelector('.word-count');
+                        if (wordName) {
+                            lines.push('### ' + wordName.textContent.trim() + (wordCount ? ' (' + wordCount.textContent.trim() + ')' : ''));
+                            lines.push('');
+                        }
+                        var items = group.querySelectorAll('.news-item');
+                        items.forEach(function(item, i) {
+                            var line = extractItem(item, i + 1);
+                            if (line) lines.push(line);
+                        });
+                        lines.push('');
+                    });
+                }
+
+                // Newly added hot topics section
+                var newSection = document.querySelector('.new-section');
+                if (newSection) {
+                    var newTitle = newSection.querySelector('.new-section-title');
+                    lines.push('## ' + (newTitle ? newTitle.textContent.trim() : '本次新增热点'));
+                    lines.push('');
+                    var sourceGroups = newSection.querySelectorAll('.new-source-group');
+                    sourceGroups.forEach(function(sg) {
+                        var srcTitle = sg.querySelector('.new-source-title');
+                        if (srcTitle) {
+                            lines.push('### ' + srcTitle.textContent.trim());
+                            lines.push('');
+                        }
+                        var items = sg.querySelectorAll('.new-item');
+                        items.forEach(function(item, i) {
+                            var line = extractItem(item, i + 1);
+                            if (line) lines.push(line);
+                        });
+                        lines.push('');
+                    });
+                }
+
+                // Standalone display section (hot-list platforms + RSS)
+                var standaloneSection = document.querySelector('.standalone-section');
+                if (standaloneSection) {
+                    var standaloneTitle = standaloneSection.querySelector('.standalone-section-title');
+                    lines.push('## ' + (standaloneTitle ? standaloneTitle.textContent.trim() : '独立展示区'));
+                    lines.push('');
+                    var groups = standaloneSection.querySelectorAll('.standalone-group');
+                    groups.forEach(function(group) {
+                        var name = group.querySelector('.standalone-name');
+                        var cnt = group.querySelector('.standalone-count');
+                        if (name) {
+                            lines.push('### ' + name.textContent.trim() + (cnt ? ' (' + cnt.textContent.trim() + ')' : ''));
+                            lines.push('');
+                        }
+                        var items = group.querySelectorAll('.news-item');
+                        items.forEach(function(item, i) {
+                            var line = extractItem(item, i + 1);
+                            if (line) lines.push(line);
+                        });
+                        lines.push('');
+                    });
+                }
+
+                // Error section
+                var errorSection = document.querySelector('.error-section');
+                if (errorSection) {
+                    var errorItems = errorSection.querySelectorAll('.error-item');
+                    if (errorItems.length) {
+                        lines.push('## 抓取异常');
+                        lines.push('');
+                        errorItems.forEach(function(item) {
+                            lines.push('- ' + item.textContent.trim());
+                        });
+                        lines.push('');
+                    }
+                }
+
+                // Footer
+                lines.push('---');
+                lines.push('*Generated by TrendRadar*');
+
+                // Download
+                var md = lines.join('\n');
+                var blob = new Blob([md], { type: 'text/markdown;charset=utf-8' });
+                var link = document.createElement('a');
+                var filename = 'TrendRadar_' + dateStr + '_' + timeStr.replace(':', '') + '.md';
+                link.download = filename;
+                link.href = URL.createObjectURL(blob);
+                document.body.appendChild(link);
+                link.click();
+                document.body.removeChild(link);
+                URL.revokeObjectURL(link.href);
+            }
+
             document.addEventListener('DOMContentLoaded', function() {
                 window.scrollTo(0, 0);
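
The line format produced by `extractItem` in `saveAsMarkdown` is easier to see outside the DOM code. The sketch below restates it in Python for illustration only: the real export runs as JavaScript in the browser, and `markdown_item_line` is a hypothetical name.

```python
import re


def markdown_item_line(index: int, title: str, url: str = "", meta=()) -> str:
    """Build one numbered Markdown entry in the same shape as extractItem."""
    title = title.strip()
    if not title:
        return ""  # items without a title are skipped
    line = f"{index}. "
    if url:
        # Square brackets would break the link syntax, so drop them from the link text.
        line += "[" + re.sub(r"[\[\]]", "", title) + "](" + url + ")"
    else:
        line += title
    meta = list(meta)
    if meta:
        # Rank, source, keyword, time and count are joined into one inline code span.
        line += "  `" + " | ".join(meta) + "`"
    return line
```

For example, `markdown_item_line(1, "Hello [live]", "https://example.com", ["#3", "Weibo"])` yields a numbered link followed by the metadata in backticks.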
 

+ 4 - 1
trendradar/report/rss_html.py

@@ -347,7 +347,10 @@ def render_rss_html_content(
                     </div>"""
 
         for item in items:
-            escaped_title = html_escape(item.get("title", ""))
+            raw_title = item.get("title", "")
+            if not raw_title or not raw_title.strip():
+                raw_title = item.get("url", "") or item.get("feed_name", "")
+            escaped_title = html_escape(raw_title)
             url = item.get("url", "")
             published_at = item.get("published_at", "")
             author = item.get("author", "")

+ 1 - 0
trendradar/storage/base.py

@@ -75,6 +75,7 @@ class RSSItem:
     feed_id: str                        # RSS feed ID (e.g. "hacker-news")
     feed_name: str = ""                 # RSS feed name (used at runtime)
     url: str = ""                       # Article link
+    guid: str = ""                      # GUID/ID (RSS guid or Atom id)
     published_at: str = ""              # RSS publication time (ISO format)
     summary: str = ""                   # Summary/description
     author: str = ""                    # Author

+ 5 - 0
trendradar/storage/rss_schema.sql

@@ -26,6 +26,7 @@ CREATE TABLE IF NOT EXISTS rss_items (
     title TEXT NOT NULL,                      -- Title
     feed_id TEXT NOT NULL,                    -- Owning RSS feed
     url TEXT NOT NULL,                        -- Article link
+    guid TEXT DEFAULT '',                     -- GUID/ID (RSS guid or Atom id)
     published_at TEXT,                        -- RSS publication time (ISO format)
     summary TEXT,                             -- Summary/description
     author TEXT,                              -- Author
@@ -98,5 +99,9 @@ CREATE INDEX IF NOT EXISTS idx_rss_title ON rss_items(title);
 CREATE UNIQUE INDEX IF NOT EXISTS idx_rss_url_feed
     ON rss_items(url, feed_id);
 
+-- Partial unique index on GUID + feed_id (when guid is non-empty, deduplicate by guid first)
+CREATE UNIQUE INDEX IF NOT EXISTS idx_rss_guid_feed
+    ON rss_items(guid, feed_id) WHERE guid != '';
+
 -- Crawl status index
 CREATE INDEX IF NOT EXISTS idx_rss_crawl_status_record ON rss_crawl_status(crawl_record_id);
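
The `WHERE guid != ''` clause makes `idx_rss_guid_feed` a partial unique index: rows with an empty guid are left out of the index and never collide, while a repeated non-empty guid within one feed is rejected. A minimal sketch against an in-memory database follows. The table is trimmed to the columns needed for the demonstration and is not the project's full schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE rss_items (
        id INTEGER PRIMARY KEY,
        feed_id TEXT NOT NULL,
        url TEXT NOT NULL,
        guid TEXT DEFAULT ''
    );
    CREATE UNIQUE INDEX idx_rss_guid_feed
        ON rss_items(guid, feed_id) WHERE guid != '';
""")

insert = "INSERT INTO rss_items (feed_id, url, guid) VALUES (?, ?, ?)"
conn.execute(insert, ("feed-a", "https://example.com/1", ""))
conn.execute(insert, ("feed-a", "https://example.com/2", ""))  # empty guid: not indexed
conn.execute(insert, ("feed-a", "https://example.com/3", "g-1"))

try:
    # Same guid in the same feed under a different URL: the index rejects it.
    conn.execute(insert, ("feed-a", "https://example.com/3-moved", "g-1"))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM rss_items").fetchone()[0]
```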

+ 65 - 49
trendradar/storage/sqlite_mixin.py

@@ -96,8 +96,22 @@ class SQLiteStorageMixin:
                 with open(ai_filter_schema, "r", encoding="utf-8") as f:
                     conn.executescript(f.read())
 
+        if db_type == "rss":
+            self._migrate_rss_schema(conn)
+
         conn.commit()
 
+    def _migrate_rss_schema(self, conn: sqlite3.Connection) -> None:
+        """Migrate the rss_items table schema (add the guid column to existing databases)"""
+        cursor = conn.execute("PRAGMA table_info(rss_items)")
+        columns = {row[1] for row in cursor.fetchall()}
+        if "guid" not in columns:
+            conn.execute("ALTER TABLE rss_items ADD COLUMN guid TEXT DEFAULT ''")
+            conn.execute("""
+                CREATE UNIQUE INDEX IF NOT EXISTS idx_rss_guid_feed
+                ON rss_items(guid, feed_id) WHERE guid != ''
+            """)
+
     # ========================================
     # News data storage
     # ========================================
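
The migration in this hunk follows a common SQLite pattern: inspect `PRAGMA table_info`, add the column only if it is missing, then create the index. A minimal sketch follows, run against a throwaway legacy table. `add_guid_column` is a hypothetical free function standing in for `_migrate_rss_schema`, and the legacy table has only the columns needed here.

```python
import sqlite3


def add_guid_column(conn: sqlite3.Connection) -> bool:
    """Add rss_items.guid if it is missing; return True when a migration ran."""
    columns = {row[1] for row in conn.execute("PRAGMA table_info(rss_items)")}
    if "guid" in columns:
        return False
    conn.execute("ALTER TABLE rss_items ADD COLUMN guid TEXT DEFAULT ''")
    conn.execute(
        "CREATE UNIQUE INDEX IF NOT EXISTS idx_rss_guid_feed "
        "ON rss_items(guid, feed_id) WHERE guid != ''"
    )
    return True


# Simulate a database created before the guid column existed.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rss_items (id INTEGER PRIMARY KEY, feed_id TEXT, url TEXT)")
conn.execute("INSERT INTO rss_items (feed_id, url) VALUES ('feed-a', 'https://example.com/1')")

first_run = add_guid_column(conn)    # adds the column and the index
second_run = add_guid_column(conn)   # no-op: the column already exists
legacy_guid = conn.execute("SELECT guid FROM rss_items").fetchone()[0]
```

Rows that predate the migration read back the column default, an empty string, so they stay outside the partial index.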
@@ -156,14 +170,19 @@ class SQLiteStorageMixin:
                                 # Already exists; update the record
                                 existing_id, existing_title = existing
 
+                                update_title = item.title
+                                if (update_title and update_title.strip().startswith(("http://", "https://", "//"))
+                                        and existing_title and not existing_title.strip().startswith(("http://", "https://", "//"))):
+                                    update_title = existing_title
+
                                 # Check whether the title changed
-                                if existing_title != item.title:
+                                if existing_title != update_title:
                                     # Record the title change
                                     cursor.execute("""
                                         INSERT INTO title_changes
                                         (news_item_id, old_title, new_title, changed_at)
                                         VALUES (?, ?, ?, ?)
-                                    """, (existing_id, existing_title, item.title, now_str))
+                                    """, (existing_id, existing_title, update_title, now_str))
                                     title_changed_count += 1
 
                                 # Record rank history
@@ -183,7 +202,7 @@ class SQLiteStorageMixin:
                                         crawl_count = crawl_count + 1,
                                         updated_at = ?
                                     WHERE id = ?
-                                """, (item.title, item.rank, item.mobile_url,
+                                """, (update_title, item.rank, item.mobile_url,
                                       data.crawl_time, now_str, existing_id))
                                 updated_count += 1
                             else:
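
The title guard added above keeps a stored headline when the incoming title is only a URL, which can happen once the parser falls back to the link. A minimal sketch of the rule follows. `looks_like_url` and `choose_title` are hypothetical helper names; the prefixes are the ones used in the diff.

```python
_URL_PREFIXES = ("http://", "https://", "//")


def looks_like_url(text: str) -> bool:
    """True when the text is non-empty and starts with a URL-style prefix."""
    return bool(text) and text.strip().startswith(_URL_PREFIXES)


def choose_title(existing_title: str, incoming_title: str) -> str:
    """Keep a meaningful stored title when the incoming one is only a URL."""
    if (looks_like_url(incoming_title)
            and existing_title
            and not looks_like_url(existing_title)):
        return existing_title
    return incoming_title
```

When both titles are URLs, or the stored title is empty, the incoming value still wins, which matches the condition in the hunk.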
@@ -818,65 +837,62 @@ class SQLiteStorageMixin:
             for feed_id, rss_list in data.items.items():
                 for item in rss_list:
                     try:
-                        # Check whether it already exists (by URL + feed_id)
-                        if item.url:
+                        item_guid = getattr(item, "guid", "") or ""
+                        existing = None
+
+                        # Deduplication priority: guid > url
+                        if item_guid:
+                            cursor.execute("""
+                                SELECT id, title FROM rss_items
+                                WHERE guid = ? AND feed_id = ?
+                            """, (item_guid, feed_id))
+                            existing = cursor.fetchone()
+
+                        if not existing and item.url:
                             cursor.execute("""
                                 SELECT id, title FROM rss_items
                                 WHERE url = ? AND feed_id = ?
                             """, (item.url, feed_id))
                             existing = cursor.fetchone()
 
-                            if existing:
-                                # Already exists; update the record
-                                existing_id = existing[0]
-                                cursor.execute("""
-                                    UPDATE rss_items SET
-                                        title = ?,
-                                        published_at = ?,
-                                        summary = ?,
-                                        author = ?,
-                                        last_crawl_time = ?,
-                                        crawl_count = crawl_count + 1,
-                                        updated_at = ?
-                                    WHERE id = ?
-                                """, (item.title, item.published_at, item.summary,
-                                      item.author, data.crawl_time, now_str, existing_id))
-                                updated_count += 1
-                            else:
-                                # Not found; insert a new record (ON CONFLICT as a fallback for concurrent/race scenarios)
-                                cursor.execute("""
-                                    INSERT INTO rss_items
-                                    (title, feed_id, url, published_at, summary, author,
-                                     first_crawl_time, last_crawl_time, crawl_count,
-                                     created_at, updated_at)
-                                    VALUES (?, ?, ?, ?, ?, ?, ?, ?, 1, ?, ?)
-                                    ON CONFLICT(url, feed_id) DO UPDATE SET
-                                        title = excluded.title,
-                                        published_at = excluded.published_at,
-                                        summary = excluded.summary,
-                                        author = excluded.author,
-                                        last_crawl_time = excluded.last_crawl_time,
-                                        crawl_count = crawl_count + 1,
-                                        updated_at = excluded.updated_at
-                                """, (item.title, feed_id, item.url, item.published_at,
-                                      item.summary, item.author, data.crawl_time,
-                                      data.crawl_time, now_str, now_str))
-                                new_count += 1
-                        else:
-                            # URL is empty; handle duplicates with try-except
+                        if existing:
+                            existing_id = existing[0]
+                            existing_title = existing[1]
+                            update_title = item.title
+                            if (update_title and update_title.strip().startswith(("http://", "https://", "//"))
+                                    and existing_title and not existing_title.strip().startswith(("http://", "https://", "//"))):
+                                update_title = existing_title
+                            cursor.execute("""
+                                UPDATE rss_items SET
+                                    title = ?,
+                                    url = CASE WHEN ? != '' THEN ? ELSE url END,
+                                    guid = CASE WHEN ? != '' THEN ? ELSE guid END,
+                                    published_at = ?,
+                                    summary = ?,
+                                    author = ?,
+                                    last_crawl_time = ?,
+                                    crawl_count = crawl_count + 1,
+                                    updated_at = ?
+                                WHERE id = ?
+                            """, (update_title,
+                                  item.url, item.url,
+                                  item_guid, item_guid,
+                                  item.published_at, item.summary,
+                                  item.author, data.crawl_time, now_str, existing_id))
+                            updated_count += 1
+                        elif item.url or item_guid:
                             try:
                                 cursor.execute("""
                                     INSERT INTO rss_items
-                                    (title, feed_id, url, published_at, summary, author,
+                                    (title, feed_id, url, guid, published_at, summary, author,
                                      first_crawl_time, last_crawl_time, crawl_count,
                                      created_at, updated_at)
-                                    VALUES (?, ?, ?, ?, ?, ?, ?, ?, 1, ?, ?)
-                                """, (item.title, feed_id, "", item.published_at,
-                                      item.summary, item.author, data.crawl_time,
-                                      data.crawl_time, now_str, now_str))
+                                    VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, 1, ?, ?)
+                                """, (item.title, feed_id, item.url, item_guid,
+                                      item.published_at, item.summary, item.author,
+                                      data.crawl_time, data.crawl_time, now_str, now_str))
                                 new_count += 1
                             except sqlite3.IntegrityError:
-                                # Duplicate empty-URL entry; ignore
                                 pass
 
                     except sqlite3.Error as e:
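
The rewritten save loop looks up an existing row by guid first and falls back to the URL only when no guid match exists, always within the same feed. A minimal sketch of that lookup follows, assuming a trimmed `rss_items` table. `find_existing` is a hypothetical helper, not a method of `SQLiteStorageMixin`.

```python
import sqlite3
from typing import Optional, Tuple


def find_existing(conn: sqlite3.Connection, feed_id: str,
                  guid: str, url: str) -> Optional[Tuple[int, str]]:
    """Look up a stored item by guid first, then by url, within one feed."""
    row = None
    if guid:
        row = conn.execute(
            "SELECT id, title FROM rss_items WHERE guid = ? AND feed_id = ?",
            (guid, feed_id),
        ).fetchone()
    if row is None and url:
        row = conn.execute(
            "SELECT id, title FROM rss_items WHERE url = ? AND feed_id = ?",
            (url, feed_id),
        ).fetchone()
    return row


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE rss_items (id INTEGER PRIMARY KEY, title TEXT, "
    "feed_id TEXT, url TEXT, guid TEXT DEFAULT '')"
)
conn.execute(
    "INSERT INTO rss_items (title, feed_id, url, guid) "
    "VALUES ('Post', 'feed-a', 'https://example.com/old', 'g-1')"
)

moved = find_existing(conn, "feed-a", "g-1", "https://example.com/new")   # matched by guid
by_url = find_existing(conn, "feed-a", "", "https://example.com/old")     # matched by url
missing = find_existing(conn, "feed-b", "g-1", "https://example.com/old") # other feed
```

The `moved` case is the one the commit targets: the article's URL changed, but the guid still identifies the stored row, so it is updated instead of inserted again.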

+ 1 - 1
uv.lock

@@ -1996,7 +1996,7 @@ wheels = [
 
 [[package]]
 name = "trendradar"
-version = "6.6.2"
+version = "6.7.0"
 source = { editable = "." }
 dependencies = [
     { name = "boto3" },

+ 1 - 1
version

@@ -1 +1 @@
-6.6.2
+6.7.0