feat: regex fallback for verification code extraction without Workers AI (#1048)

feat: add regex fallback for verification code extraction without Workers AI

When AI email extraction is enabled but no Workers AI binding is available,
fall back to a built-in, zero-dependency regex extractor so self-hosted
deployments without Workers AI still surface verification codes in Telegram
notifications and webhooks.

- Add worker/src/email/extract_code.ts: rule-based multilingual
  (English / Chinese / Japanese / Korean) verification-code extractor with
  year and YYYYMMDD date rejection to avoid false positives.
- ai_extract.ts: share the allowlist check and content parsing across both
  paths, extract a saveExtractMetadata helper, and use the regex fallback
  when env.AI is absent.
- Reuse the existing aiExtractResult pipeline (auth_code type), so Telegram
  and webhook output need no changes.
- Update bilingual CHANGELOG and AI-extract feature docs.
Gene Dai
2026-06-02 15:35:43 +08:00
parent bf786947e3
commit b7718100c5
6 changed files with 158 additions and 26 deletions


@@ -43,6 +43,18 @@ Or add in Cloudflare Dashboard Worker settings:
- **Variable name**: `AI`
- **Type**: Workers AI
## Fallback Without a Workers AI Binding
If `ENABLE_AI_EMAIL_EXTRACT` is enabled but **no Workers AI binding is configured** (e.g. a self-hosted deployment without Workers AI), the system automatically falls back to a built-in **regex verification-code extractor**:
- Extracts **verification codes** (`auth_code`) only; links are not extracted (link extraction requires AI)
- Zero dependencies, zero cost, runs locally inside the Worker
- Supports common verification-code formats in English, Chinese, Japanese and Korean
- Rejects years (e.g. `2026`) and `YYYYMMDD` dates to reduce false positives
- Results are written to `metadata` and reuse the same Telegram / webhook placeholders (`aiExtractType` is `auth_code` in this case)
When a Workers AI binding is configured, AI extraction is still preferred (recognizing both codes and links) and this fallback does not apply.
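The rule-based approach described in this section can be sketched roughly as follows. This is a minimal illustration, not the contents of `worker/src/email/extract_code.ts`: the keyword list, the 4 to 8 digit length range, the 1900 to 2099 year window, and the names `extractCode`, `looksLikeYear`, and `looksLikeDate` are all assumptions made for the example.

```typescript
// Hypothetical sketch of a keyword-anchored verification-code extractor.
// None of these names or thresholds are taken from the actual commit.

// Phrases that commonly accompany a code (English / Chinese / Japanese / Korean).
const KEYWORDS =
  /verification code|security code|passcode|otp|验证码|校验码|認証コード|確認コード|인증번호|인증 코드/i;

// A 4-digit value in 1900..2099 is treated as a year, not a code.
function looksLikeYear(s: string): boolean {
  if (s.length !== 4) return false;
  const n = Number(s);
  return n >= 1900 && n <= 2099;
}

// An 8-digit value that reads as a plausible YYYYMMDD date is rejected.
function looksLikeDate(s: string): boolean {
  if (s.length !== 8) return false;
  const year = Number(s.slice(0, 4));
  const month = Number(s.slice(4, 6));
  const day = Number(s.slice(6, 8));
  return (
    year >= 1900 && year <= 2099 &&
    month >= 1 && month <= 12 &&
    day >= 1 && day <= 31
  );
}

// Returns the digit run closest to the first keyword, or null if none qualifies.
function extractCode(text: string): string | null {
  const keyword = KEYWORDS.exec(text);
  if (keyword === null) return null; // no keyword, so do not guess

  // 4 to 8 digits that are not part of a longer digit run.
  const candidate = /(^|\D)(\d{4,8})(?!\d)/g;
  let best: string | null = null;
  let bestDistance = Infinity;
  let m: RegExpExecArray | null;
  while ((m = candidate.exec(text)) !== null) {
    const value = m[2];
    if (looksLikeYear(value) || looksLikeDate(value)) continue;
    const start = m.index + m[1].length; // skip the leading non-digit
    const distance = Math.abs(start - keyword.index);
    if (distance < bestDistance) {
      best = value;
      bestDistance = distance;
    }
  }
  return best;
}
```

Requiring a nearby keyword and picking the closest candidate is one plausible way to keep order numbers and footers from being reported as codes; the real extractor may use different rules.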
## Address Allowlist (Optional)
To control costs and resource usage, you can configure an address allowlist in the Admin console's **AI Extract Settings** page:


@@ -43,6 +43,18 @@ binding = "AI"
- **Variable name**: `AI`
- **Type**: Workers AI
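## Regex Fallback Without a Workers AI Binding
If `ENABLE_AI_EMAIL_EXTRACT` is enabled but **no Workers AI binding is configured** (for example, a self-hosted deployment where Workers AI has not been activated), the system automatically falls back to the built-in **regex verification-code extractor**:
- Extracts **verification codes** (`auth_code`) only; links are not extracted (link extraction depends on AI)
- Zero dependencies, zero cost, runs locally inside the Worker
- Supports common verification-code formats in Chinese, English, Japanese, and Korean
- Automatically excludes years (e.g. `2026`) and `YYYYMMDD` dates to reduce false positives
- Results are likewise written to `metadata` and reuse the Telegram push and webhook placeholders (`aiExtractType` is `auth_code` in this case)
When a Workers AI binding is configured, AI extraction is still preferred (it recognizes both verification codes and various kinds of links) and is unaffected by this fallback.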
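The precedence between the two paths can be pictured with a small sketch. It assumes the feature flag may arrive as either a boolean or the string `"true"`, and the names `ExtractEnv`, `ExtractPath`, and `chooseExtractPath` are hypothetical; the actual logic in `ai_extract.ts` may be organized differently.

```typescript
// Hypothetical shape of the Worker environment; only the two fields
// mentioned in the docs are modelled here.
interface ExtractEnv {
  ENABLE_AI_EMAIL_EXTRACT?: boolean | string;
  AI?: unknown; // Workers AI binding; absent when no binding is configured
}

type ExtractPath = "disabled" | "ai" | "regex";

// Assumption: the flag counts as enabled when it is true or the string "true".
function isEnabled(flag: boolean | string | undefined): boolean {
  return flag === true || flag === "true";
}

// Decide which extraction path handles an incoming email.
function chooseExtractPath(env: ExtractEnv): ExtractPath {
  if (!isEnabled(env.ENABLE_AI_EMAIL_EXTRACT)) return "disabled";
  // AI is preferred whenever the binding exists; regex is only the fallback.
  return env.AI ? "ai" : "regex";
}
```

With this shape, a deployment that sets `ENABLE_AI_EMAIL_EXTRACT` but has no `AI` binding lands on the regex path, while adding the binding later switches it back to AI extraction without any other change.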
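## Address Allowlist (Optional)
To control cost and resource usage, you can configure an address allowlist on the **AI Extract Settings** page of the Admin console: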