feat: enhance web_fetch fallbacks

2026-01-17 00:00:15 +00:00
parent a84000c6d9
commit c54c665f97
11 changed files with 802 additions and 27 deletions
--- a/docs/tools/firecrawl.md
+++ b/docs/tools/firecrawl.md
@@ -0,0 +1,58 @@
+---
+summary: "Firecrawl fallback for web_fetch (anti-bot + cached extraction)"
+read_when:
+  - You want Firecrawl-backed web extraction
+  - You need a Firecrawl API key
+  - You want anti-bot extraction for web_fetch
+---
+
+# Firecrawl
+
+Clawdbot can use **Firecrawl** as a fallback extractor for `web_fetch`. It is a hosted
+content extraction service that supports bot circumvention and caching, which helps
+with JS-heavy sites or pages that block plain HTTP fetches.
+
+## Get an API key
+
+1) Create a Firecrawl account and generate an API key.
+2) Store it in config or set `FIRECRAWL_API_KEY` in the gateway environment.
+
+## Configure Firecrawl
+
+```json5
+{
+  tools: {
+    web: {
+      fetch: {
+        firecrawl: {
+          apiKey: "FIRECRAWL_API_KEY_HERE",
+          baseUrl: "https://api.firecrawl.dev",
+          onlyMainContent: true,
+          maxAgeMs: 172800000,
+          timeoutSeconds: 60
+        }
+      }
+    }
+  }
+}
+```
+
+Notes:
+- `firecrawl.enabled` defaults to true when an API key is present.
+- `maxAgeMs` sets the maximum age (in ms) of a cached result that Firecrawl may return. The default is 172800000 (2 days).
+
+## Stealth / bot circumvention
+
+Firecrawl exposes a **proxy mode** parameter for bot circumvention (`basic`, `stealth`, or `auto`).
+Clawdbot always uses `proxy: "auto"` plus `storeInCache: true` for Firecrawl requests.
+If `proxy` is omitted, Firecrawl defaults to `auto`. In `auto` mode, Firecrawl retries with stealth
+proxies if a basic attempt fails, which may use more credits than basic-only scraping.
+
+## How `web_fetch` uses Firecrawl
+
+`web_fetch` extraction order:
+1) Readability (local)
+2) Firecrawl (if configured)
+3) Basic HTML cleanup (last fallback)
+
+See [Web tools](/tools/web) for the full web tool setup.
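The extraction order listed in `firecrawl.md` amounts to a first-success fallback chain. A minimal sketch of that idea follows, assuming each extractor is a synchronous function that returns text or `null`. The names `Extractor`, `extractWithFallbacks`, and `basicHtmlCleanup` are hypothetical and are not taken from the Clawdbot codebase, and the tag-stripping regex is only a rough stand-in for real HTML cleanup.

```typescript
// Hypothetical types for illustration; not the actual Clawdbot implementation.
type Extractor = {
  name: string;
  enabled: boolean;
  // Returns extracted text, or null when this extractor fails.
  run: (html: string) => string | null;
};

// Try each enabled extractor in order and return the first non-empty result.
function extractWithFallbacks(
  html: string,
  chain: Extractor[],
): { source: string; text: string } {
  for (const step of chain) {
    if (!step.enabled) continue; // e.g. Firecrawl without an API key
    const text = step.run(html);
    if (text !== null && text.trim() !== "") {
      return { source: step.name, text };
    }
  }
  throw new Error("all extractors failed");
}

// Rough stand-in for "basic HTML cleanup": strip tags, collapse whitespace.
function basicHtmlCleanup(html: string): string | null {
  const text = html.replace(/<[^>]*>/g, " ").replace(/\s+/g, " ").trim();
  return text === "" ? null : text;
}
```

With a chain of `readability`, `firecrawl`, and `basic-html-cleanup` entries, a page that Readability cannot parse falls through to Firecrawl when it is enabled, and to the cleanup step otherwise.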
--- a/docs/tools/index.md
+++ b/docs/tools/index.md
@@ -215,6 +215,7 @@ Notes:
 - Responses are cached (default 15 min).
 - For JS-heavy sites, prefer the browser tool.
 - See [Web tools](/tools/web) for setup.
+- See [Firecrawl](/tools/firecrawl) for the optional anti-bot fallback.

 ### `browser`
 Control the dedicated clawd browser.
--- a/docs/tools/web.md
+++ b/docs/tools/web.md
@@ -104,6 +104,7 @@ Fetch a URL and extract readable content.
 ### Requirements

 - `tools.web.fetch.enabled` must not be `false` (default: enabled)
+- Optional Firecrawl fallback: set `tools.web.fetch.firecrawl.apiKey` or `FIRECRAWL_API_KEY`.

 ### Config

@@ -116,8 +117,16 @@ Fetch a URL and extract readable content.
        maxChars: 50000,
        timeoutSeconds: 30,
        cacheTtlMinutes: 15,
-        userAgent: "clawdbot/2026.1.15",
-        readability: true
+        userAgent: "Mozilla/5.0 (Macintosh; Intel Mac OS X 14_7_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
+        readability: true,
+        firecrawl: {
+          enabled: true,
+          apiKey: "FIRECRAWL_API_KEY_HERE", // optional if FIRECRAWL_API_KEY is set
+          baseUrl: "https://api.firecrawl.dev",
+          onlyMainContent: true,
+          maxAgeMs: 86400000, // ms (1 day)
+          timeoutSeconds: 60
+        }
      }
    }
  }
@@ -131,8 +140,11 @@ Fetch a URL and extract readable content.
 - `maxChars` (truncate long pages)

 Notes:
-- `web_fetch` uses Readability (main-content extraction) by default and falls back to basic HTML cleanup if it fails.
+- `web_fetch` uses Readability (main-content extraction) first, then Firecrawl (if configured). If both fail, the tool returns an error.
+- Firecrawl requests use bot-circumvention mode and cache results by default.
+- `web_fetch` sends a Chrome-like User-Agent and `Accept-Language` by default; override `userAgent` if needed.
 - `web_fetch` is best-effort extraction; some sites will need the browser tool.
+- See [Firecrawl](/tools/firecrawl) for key setup and service details.
 - Responses are cached (default 15 minutes) to reduce repeated fetches.
 - If you use tool profiles/allowlists, add `web_search`/`web_fetch` or `group:web`.
 - If the Brave key is missing, `web_search` returns a short setup hint with a docs link.
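The Firecrawl settings documented in this change (API key from config or `FIRECRAWL_API_KEY`, `enabled` defaulting to true when a key is present, `maxAgeMs` defaulting to 2 days) could be resolved roughly as sketched here. `FirecrawlConfig`, `ResolvedFirecrawl`, and `resolveFirecrawl` are hypothetical names. Two behaviours are assumptions the docs do not state: the config value takes precedence over the environment variable, and the fallback stays off when no key is available.

```typescript
// Hypothetical shape mirroring the documented `tools.web.fetch.firecrawl` keys.
type FirecrawlConfig = {
  enabled?: boolean;
  apiKey?: string;
  maxAgeMs?: number;
};

type ResolvedFirecrawl = {
  enabled: boolean;
  apiKey: string | undefined;
  maxAgeMs: number;
};

const DEFAULT_MAX_AGE_MS = 2 * 24 * 60 * 60 * 1000; // 2 days = 172800000 ms

function resolveFirecrawl(
  config: FirecrawlConfig,
  env: Record<string, string | undefined>,
): ResolvedFirecrawl {
  // Assumption: an explicit config key wins over the environment variable.
  const apiKey = config.apiKey || env["FIRECRAWL_API_KEY"] || undefined;
  const hasKey = apiKey !== undefined;
  // Documented: `enabled` defaults to true when an API key is present.
  // Assumption: without a key the fallback is off regardless of `enabled`.
  const enabled = hasKey && (config.enabled ?? true);
  return { enabled, apiKey, maxAgeMs: config.maxAgeMs ?? DEFAULT_MAX_AGE_MS };
}
```

Passing the environment as a plain object, rather than reading `process.env` inside the function, keeps the sketch easy to test.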