feat: enhance web_fetch fallbacks
This commit is contained in:
58
docs/tools/firecrawl.md
Normal file
58
docs/tools/firecrawl.md
Normal file
@@ -0,0 +1,58 @@
|
||||
---
|
||||
summary: "Firecrawl fallback for web_fetch (anti-bot + cached extraction)"
|
||||
read_when:
|
||||
- You want Firecrawl-backed web extraction
|
||||
- You need a Firecrawl API key
|
||||
- You want anti-bot extraction for web_fetch
|
||||
---
|
||||
|
||||
# Firecrawl
|
||||
|
||||
Clawdbot can use **Firecrawl** as a fallback extractor for `web_fetch`. It is a hosted
|
||||
content extraction service that supports bot circumvention and caching, which helps
|
||||
with JS-heavy sites or pages that block plain HTTP fetches.
|
||||
|
||||
## Get an API key
|
||||
|
||||
1) Create a Firecrawl account and generate an API key.
|
||||
2) Store it in config or set `FIRECRAWL_API_KEY` in the gateway environment.
|
||||
|
||||
## Configure Firecrawl
|
||||
|
||||
```json5
|
||||
{
|
||||
tools: {
|
||||
web: {
|
||||
fetch: {
|
||||
firecrawl: {
|
||||
apiKey: "FIRECRAWL_API_KEY_HERE",
|
||||
baseUrl: "https://api.firecrawl.dev",
|
||||
onlyMainContent: true,
|
||||
maxAgeMs: 172800000,
|
||||
timeoutSeconds: 60
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
Notes:
|
||||
- `firecrawl.enabled` defaults to true when an API key is present.
|
||||
- `maxAgeMs` controls how old cached results can be (ms). Default is 2 days.
|
||||
|
||||
## Stealth / bot circumvention
|
||||
|
||||
Firecrawl exposes a **proxy mode** parameter for bot circumvention (`basic`, `stealth`, or `auto`).
|
||||
Clawdbot always uses `proxy: "auto"` plus `storeInCache: true` for Firecrawl requests.
|
||||
If proxy is omitted, Firecrawl defaults to `auto`. `auto` retries with stealth proxies if a basic attempt fails, which may use more credits
|
||||
than basic-only scraping.
|
||||
|
||||
## How `web_fetch` uses Firecrawl
|
||||
|
||||
`web_fetch` extraction order:
|
||||
1) Readability (local)
|
||||
2) Firecrawl (if configured)
|
||||
3) Basic HTML cleanup (last fallback)
|
||||
|
||||
See [Web tools](/tools/web) for the full web tool setup.
|
||||
@@ -215,6 +215,7 @@ Notes:
|
||||
- Responses are cached (default 15 min).
|
||||
- For JS-heavy sites, prefer the browser tool.
|
||||
- See [Web tools](/tools/web) for setup.
|
||||
- See [Firecrawl](/tools/firecrawl) for the optional anti-bot fallback.
|
||||
|
||||
### `browser`
|
||||
Control the dedicated clawd browser.
|
||||
|
||||
@@ -104,6 +104,7 @@ Fetch a URL and extract readable content.
|
||||
### Requirements
|
||||
|
||||
- `tools.web.fetch.enabled` must not be `false` (default: enabled)
|
||||
- Optional Firecrawl fallback: set `tools.web.fetch.firecrawl.apiKey` or `FIRECRAWL_API_KEY`.
|
||||
|
||||
### Config
|
||||
|
||||
@@ -116,8 +117,16 @@ Fetch a URL and extract readable content.
|
||||
maxChars: 50000,
|
||||
timeoutSeconds: 30,
|
||||
cacheTtlMinutes: 15,
|
||||
userAgent: "clawdbot/2026.1.15",
|
||||
readability: true
|
||||
userAgent: "Mozilla/5.0 (Macintosh; Intel Mac OS X 14_7_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
|
||||
readability: true,
|
||||
firecrawl: {
|
||||
enabled: true,
|
||||
apiKey: "FIRECRAWL_API_KEY_HERE", // optional if FIRECRAWL_API_KEY is set
|
||||
baseUrl: "https://api.firecrawl.dev",
|
||||
onlyMainContent: true,
|
||||
maxAgeMs: 86400000, // ms (1 day)
|
||||
timeoutSeconds: 60
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
@@ -131,8 +140,11 @@ Fetch a URL and extract readable content.
|
||||
- `maxChars` (truncate long pages)
|
||||
|
||||
Notes:
|
||||
- `web_fetch` uses Readability (main-content extraction) by default and falls back to basic HTML cleanup if it fails.
|
||||
- `web_fetch` uses Readability (main-content extraction) first, then Firecrawl (if configured). If both fail, the tool returns an error.
|
||||
- Firecrawl requests use bot-circumvention mode and cache results by default.
|
||||
- `web_fetch` sends a Chrome-like User-Agent and `Accept-Language` by default; override `userAgent` if needed.
|
||||
- `web_fetch` is best-effort extraction; some sites will need the browser tool.
|
||||
- See [Firecrawl](/tools/firecrawl) for key setup and service details.
|
||||
- Responses are cached (default 15 minutes) to reduce repeated fetches.
|
||||
- If you use tool profiles/allowlists, add `web_search`/`web_fetch` or `group:web`.
|
||||
- If the Brave key is missing, `web_search` returns a short setup hint with a docs link.
|
||||
|
||||
Reference in New Issue
Block a user