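Initial commit: HBR article crawler project

- Scrapy crawler framework that crawls HBR Traditional Chinese articles
- Flask web application providing an article search interface
- SQL Server database integration
- Automated scheduling and email notification

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
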
AI爬蟲設計.txt
@@ -0,0 +1,142 @@
1. Conclusion (ready to adopt)

Schedule: every Monday at 08:00 Asia/Taipei.

Flow: scrapy crawl hbr produces hbr_articles.csv → the file is sent through Gmail SMTP (an App Password is required).

Recipient: kaeruzak@gmail.com; the subject includes the date and hbr_articles.csv is attached.

GitHub Actions (no machine of your own required; the recommended approach).

The mail script send_mail.py and the scheduling configuration are provided below and can be pasted in as-is.

2. Mail script (shared)

File name: send_mail.py (place it in the project root, next to hbr_articles.csv, or pass the CSV path as an argument)

# send_mail.py
import os, smtplib, sys, mimetypes
from datetime import datetime, timezone, timedelta
from email.message import EmailMessage

# Read settings from environment variables (set them in crontab or GitHub Actions secrets)
GMAIL_USER = os.environ.get("GMAIL_USERNAME")      # e.g. yourname@gmail.com
GMAIL_PASS = os.environ.get("GMAIL_APP_PASSWORD")  # 16-character App Password (not the regular login password)
# "or" also covers an empty string, which is what an unset GitHub secret expands to
TO = os.environ.get("MAIL_TO") or "kaeruzak@gmail.com"

# Argument: CSV path (defaults to ./hbr_articles.csv)
csv_path = sys.argv[1] if len(sys.argv) > 1 else "hbr_articles.csv"
if not os.path.exists(csv_path):
    print(f"[WARN] CSV not found: {csv_path}")
    # Optional: exit here, or send a "no file today" notice instead
    sys.exit(0)

# Stop with a clear message rather than failing inside smtp.login()
if not GMAIL_USER or not GMAIL_PASS:
    sys.exit("[ERROR] GMAIL_USERNAME / GMAIL_APP_PASSWORD is not set")

# Build the date string in Taipei time (UTC+8)
tz = timezone(timedelta(hours=8))
date_str = datetime.now(tz).strftime("%Y-%m-%d %H:%M")

# Compose the mail
msg = EmailMessage()
msg["Subject"] = f"[HBRTW weekly crawl] Article list CSV - {date_str}"
msg["From"] = GMAIL_USER
msg["To"] = TO
msg.set_content(f"""Hello,
Attached is this week's roundup of the latest/popular HBR Taiwan articles (CSV).
Generated at: {date_str} (Asia/Taipei)
If you would like to change the schedule or add a cloud upload, just reply to this mail.
""")

# Attach the CSV
ctype, encoding = mimetypes.guess_type(csv_path)
if ctype is None or encoding is not None:
    # Unknown or compressed file: fall back to a generic binary type
    ctype = "application/octet-stream"
maintype, subtype = ctype.split("/", 1)
with open(csv_path, "rb") as f:
    msg.add_attachment(f.read(),
                       maintype=maintype,
                       subtype=subtype,
                       filename=os.path.basename(csv_path))

# Send
with smtplib.SMTP_SSL("smtp.gmail.com", 465) as smtp:
    smtp.login(GMAIL_USER, GMAIL_PASS)
    smtp.send_message(msg)
    print("[OK] Mail sent to", TO)

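Security: use a Gmail App Password (generated after 2-Step Verification is enabled), not your regular password.
How to get one: Google Account → Security → 2-Step Verification → App passwords → choose "Mail" and name the device anything you like → copy the 16-character password into the Secrets / environment variables described below.
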
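The attachment step in send_mail.py guesses the MIME type from the file name and falls back to a generic binary type when the guess fails or the file is compressed. The sketch below isolates that rule so it can be tried without sending anything; the helper name attachment_type and the sample file names are illustrative and not part of the project.

```python
import mimetypes


def attachment_type(path):
    """Return (maintype, subtype) for an attachment, using the same fallback as send_mail.py."""
    ctype, encoding = mimetypes.guess_type(path)
    # Unknown extension, or a compressed file such as .csv.gz: treat it as opaque bytes.
    if ctype is None or encoding is not None:
        ctype = "application/octet-stream"
    return tuple(ctype.split("/", 1))


# Hypothetical file names, used only to show the three cases.
for name in ("hbr_articles.csv", "hbr_articles.csv.gz", "hbr_articles.unknownext"):
    print(name, "->", attachment_type(name))
```

The exact type reported for a plain .csv file can depend on the platform's MIME table, which is why the script does not hard-code it.
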
4. Option B: GitHub Actions (recommended)

In the GitHub repository, go to Settings → Secrets and variables → Actions → New repository secret and add:

GMAIL_USERNAME: your Gmail address

GMAIL_APP_PASSWORD: the 16-character App Password

MAIL_TO: kaeruzak@gmail.com

Create .github/workflows/weekly.yml:

name: weekly-crawl
on:
  schedule:
    - cron: "0 0 * * 1"   # Monday 00:00 UTC ≈ 08:00 Taipei
  workflow_dispatch: {}
jobs:
  crawl-and-mail:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install scrapy
      - name: Run crawler
        run: scrapy crawl hbr
      - name: Send mail with CSV
        env:
          GMAIL_USERNAME: ${{ secrets.GMAIL_USERNAME }}
          GMAIL_APP_PASSWORD: ${{ secrets.GMAIL_APP_PASSWORD }}
          MAIL_TO: ${{ secrets.MAIL_TO }}
        run: |
          python send_mail.py hbr_articles.csv
      - name: Upload CSV as artifact (optional)
        uses: actions/upload-artifact@v4
        with:
          name: hbr_articles_csv
          path: hbr_articles.csv

Advantages: no server to maintain; credentials stay isolated in Secrets; when a run fails, the log is available in the Actions UI.

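GitHub Actions evaluates schedule cron expressions in UTC, so the Monday 08:00 Asia/Taipei target has to be shifted by eight hours. A minimal sketch of that conversion follows; the function name taipei_to_utc_cron is made up for illustration, and the fixed UTC+8 offset matches the one used in send_mail.py.

```python
from datetime import datetime, timedelta, timezone

TAIPEI = timezone(timedelta(hours=8))  # fixed offset, same assumption as send_mail.py


def taipei_to_utc_cron(weekday, hour, minute=0):
    """Convert a weekly Taipei time to a UTC cron expression.

    weekday uses cron numbering: 0 = Sunday ... 6 = Saturday.
    """
    # 2024-01-07 is a Sunday, so adding `weekday` days lands on the requested day.
    local = datetime(2024, 1, 7, hour, minute, tzinfo=TAIPEI) + timedelta(days=weekday)
    utc = local.astimezone(timezone.utc)
    cron_dow = (utc.weekday() + 1) % 7  # Python: Monday = 0; cron: Sunday = 0
    return f"{utc.minute} {utc.hour} * * {cron_dow}"


print(taipei_to_utc_cron(1, 8))  # Monday 08:00 Taipei -> 0 0 * * 1
```

Any Taipei time earlier than 08:00 rolls back to the previous UTC day; Monday 07:00, for example, becomes "0 23 * * 0" (Sunday 23:00 UTC).
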
5. Quick verification (manual)

Run everything once by hand first, on your own machine or in any other environment:

# 1) Run the crawler first
scrapy crawl hbr

# 2) Set the environment variables (current terminal session only)
export GMAIL_USERNAME='yourname@gmail.com'
export GMAIL_APP_PASSWORD='xxxxxxxxxxxxxxxx'
export MAIL_TO='kaeruzak@gmail.com'

# 3) Send the mail
python send_mail.py hbr_articles.csv

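If the mail arrives, the crawl and mail steps work, and the scheduled run should behave the same way once the same values are stored as Secrets.
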
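Before step 3, a small pre-flight check can show whether the variables and the CSV are in place. This is only a sketch: the helper missing_settings is hypothetical and is not part of send_mail.py.

```python
import os

# The two variables send_mail.py cannot work without; MAIL_TO has a default.
REQUIRED = ("GMAIL_USERNAME", "GMAIL_APP_PASSWORD")


def missing_settings(env, csv_path="hbr_articles.csv"):
    """Return a list of readable problems; an empty list means ready to send."""
    problems = [f"environment variable {name} is not set"
                for name in REQUIRED if not env.get(name)]
    if not os.path.isfile(csv_path):
        problems.append(f"CSV file not found: {csv_path}")
    return problems


if __name__ == "__main__":
    for problem in missing_settings(os.environ):
        print("[WARN]", problem)
```

Passing the environment in as an argument keeps the check easy to exercise with a plain dictionary.
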
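6. Exceptions and robustness recommendations

robots.txt / paywall: the existing settings already comply with robots.txt; paywalled articles are only flagged with is_paywalled=1, and their body text is not fetched.

Site redesign: if parsing errors appear, check the CSS selectors in spiders/hbr.py first.

Empty weeks: by default, the mail is skipped when the CSV is not found (this can be changed to send a "no new file this week" notice).

Multiple environments: to also upload to cloud storage (S3/Drive) or add Slack/Teams notifications, add one more step.
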
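The "empty weeks" alternative mentioned above could look roughly like the sketch below: the message is built either with the CSV attached or as a plain notice. The function name build_weekly_message, the notice wording, and the hard-coded text/csv type are assumptions made for this example; send_mail.py itself exits without sending.

```python
import os
from email.message import EmailMessage


def build_weekly_message(csv_path, sender, recipient, date_str):
    """Return a mail with the CSV attached, or a plain notice if the file is missing or empty."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    if os.path.isfile(csv_path) and os.path.getsize(csv_path) > 0:
        msg["Subject"] = f"[HBRTW weekly crawl] Article list CSV - {date_str}"
        msg.set_content("This week's article list is attached.")
        with open(csv_path, "rb") as f:
            msg.add_attachment(f.read(), maintype="text", subtype="csv",
                               filename=os.path.basename(csv_path))
    else:
        msg["Subject"] = f"[HBRTW weekly crawl] No new file this week - {date_str}"
        msg.set_content(f"No CSV was found at {csv_path}, so nothing is attached.")
    return msg
```

The returned message can be handed to smtp.send_message() exactly as in send_mail.py, so only the message-building part of the script would change.
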