Initial commit: HBR article crawler project

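- Scrapy crawler framework that scrapes HBR Traditional Chinese articles
- Flask web application that provides an article search interface
- SQL Server database integration
- Automated scheduling and email notification

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>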
2025-12-03 17:19:56 +08:00
commit f524713cb6
35 changed files with 6719 additions and 0 deletions
--- a/啟動說明.md
+++ b/啟動說明.md
@@ -0,0 +1,132 @@
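+# HBR Crawler System - Startup Guide
+
+## Main Entry Point
+
+### 🚀 `run_crawler.py` - **Main launcher script (recommended)**
+
+This is the main launcher script that ties all features together. It runs the following steps in order:
+1. Run the Scrapy crawler
+2. Check that the CSV file was produced
+3. Send the email (if Gmail is configured)
+
+**Usage**: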
+```bash
+python run_crawler.py
+```
+
+**When to use**:
+- Running the crawler manually
+- Scheduled jobs (crontab)
+- Automated workflows
+
+---
+
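+## Other Python Files
+
+### 📧 `send_mail.py` - Email sending script
+
+Only sends the email; it does not run the crawler.
+
+**Usage**:
+```bash
+python send_mail.py [csv_file_path]
+```
+
+**What it does**:
+- Reads the CSV file
+- Sends the email through Gmail SMTP (if configured)
+- If Gmail is not configured, skips sending and prints a notice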
+
+---
+
+### 🧪 `test_db_connection.py` - Database connection test
+
+Tests the database connection and creates the table schema.
+
+**Usage**:
+```bash
+python test_db_connection.py
+```
+
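+**What it does**:
+- Tests the database connection
+- Creates the `HBR_scraper` database (if it does not exist)
+- Creates the table schema
+- Verifies that the tables were created successfully
+
+**Recommendation**: run it once before first use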
+
+---
+
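+### 🕷️ `hbr_crawler/` - Scrapy crawler project
+
+This is the core Scrapy crawler code. It contains:
+- `spiders/hbr.py` - the spider itself
+- `pipelines.py` - item pipelines (CSV export, database storage)
+- `items.py` - item (data structure) definitions
+- `settings.py` - crawler settings
+- `database.py` - database connection module
+
+**Running Scrapy directly**: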
+```bash
+cd hbr_crawler
+scrapy crawl hbr
+```
+
+**Note**: prefer `run_crawler.py` over running the Scrapy command directly, because `run_crawler.py` also performs the CSV check and email steps.
+
+---
+
+## Quick Start
+
+### 1. First-time setup
+
+```bash
+# Install dependencies
+pip install -r requirements.txt
+
+# Test the database connection (creates the database and tables)
+python test_db_connection.py
+```
+
+### 2. Run the crawler
+
+```bash
+# Option 1: use the main launcher script (recommended)
+python run_crawler.py
+
+# Option 2: run the Scrapy command directly
+cd hbr_crawler
+scrapy crawl hbr
+```
+
+### 3. Scheduling (crontab)
+
+```bash
+# Run every day at 08:00
+0 8 * * * cd /path/to/project && /usr/bin/python3 run_crawler.py >> logs/cron.log 2>&1
+```
+
+---
+
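+## File Reference Table
+
+| File | Function | Runs standalone? | Purpose |
+|------|----------|------------------|---------|
+| `run_crawler.py` | Ties all features together | ✅ Yes | **Main launcher script** |
+| `send_mail.py` | Sends the email | ✅ Yes | Email delivery |
+| `test_db_connection.py` | Tests the database | ✅ Yes | Database setup |
+| `hbr_crawler/spiders/hbr.py` | Spider core | ❌ No (runs through Scrapy) | Crawling logic |
+| `hbr_crawler/pipelines.py` | Data processing | ❌ No (runs through Scrapy) | Data processing |
+| `hbr_crawler/database.py` | Database module | ❌ No (imported by other modules) | Database connection |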
+
+---
+
+## Recommended Workflow
+
+1. **First-time setup**: run `test_db_connection.py`
+2. **Day-to-day use**: run `run_crawler.py`
+3. **Email only**: run `send_mail.py`
+4. **Scheduled jobs**: add `run_crawler.py` to crontab
+
+
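The three steps the guide lists for `run_crawler.py` (run the crawler, check that the CSV exists, send mail only when Gmail is configured) could be sketched roughly as follows. This is an illustration under stated assumptions, not the project's actual script: the `GMAIL_USERNAME` and `GMAIL_APP_PASSWORD` variable names, the status strings, and every function name are hypothetical. Only `scrapy crawl hbr`, the `hbr_crawler` directory, and the `send_mail.py [csv_file_path]` invocation come from the guide.

```python
import os
import subprocess
import sys
from pathlib import Path


def gmail_configured(env=None):
    """Return True when both (hypothetical) Gmail variables are set and non-empty."""
    env = os.environ if env is None else env
    return bool(env.get("GMAIL_USERNAME")) and bool(env.get("GMAIL_APP_PASSWORD"))


def run_pipeline(csv_path, crawl, send_mail, env=None):
    """Run crawl -> CSV check -> optional mail and return a short status string.

    `crawl` returns an exit code; `send_mail` receives the CSV path.
    Both are passed in so the control flow can be exercised without Scrapy.
    """
    if crawl() != 0:
        return "crawl_failed"
    if not Path(csv_path).is_file():
        return "csv_missing"
    if not gmail_configured(env):
        # Mirrors the guide: mail is skipped, not treated as an error.
        return "mail_skipped"
    send_mail(csv_path)
    return "mail_sent"


def scrapy_crawl():
    """Run `scrapy crawl hbr` inside hbr_crawler/ and return its exit code."""
    return subprocess.run(["scrapy", "crawl", "hbr"], cwd="hbr_crawler").returncode


def send_mail_script(csv_path):
    """Delegate to send_mail.py, passing the CSV path as its only argument."""
    subprocess.run([sys.executable, "send_mail.py", str(csv_path)], check=True)
```

A real entry point would call `run_pipeline(csv_path, scrapy_crawl, send_mail_script)` with whatever CSV path the export pipeline actually writes (not stated in the guide) and exit non-zero on `crawl_failed` or `csv_missing`, so that a cron job's log shows the failure.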