ArcFetch

Overview

ArcFetch is a URL-to-Markdown tool I built because every AI workflow I had eventually needed “give me the readable text from this page,” and every existing option either choked on JavaScript-rendered sites, returned a wall of nav and ads, or silently handed back a paywall stub. It tries a plain HTTP fetch with Mozilla Readability first, retries with a stealthed Playwright browser when that fails, and rejects boilerplate, paywalls, and login walls before returning anything.

How It Works

Four stages in order: fetch, extract, quality-gate, output. Plain HTTP runs first because it’s fast and free; if it returns blank or low-quality content, ArcFetch retries with Playwright in stealth mode. The quality gates score each result 0–100 and reject boilerplate, paywalls, login walls, and error pages before anything is saved.
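The fetch-then-fallback flow described above can be sketched roughly as follows. This is an illustration only, not ArcFetch's actual code: the names `scoreContent`, `fetchWithFallback`, and `Fetcher`, the 60-point threshold, and the scoring heuristic are all assumptions, and the two fetchers are passed in as plain functions so the sketch runs without a network or a browser.

```typescript
// Hypothetical sketch: names, threshold, and heuristic are assumptions,
// not taken from ArcFetch itself.
type Fetcher = (url: string) => Promise<string>;

interface FetchResult {
  markdown: string;
  score: number;
  via: "http" | "browser";
}

// Toy quality gate: a length signal minus penalties for wall-like phrases.
function scoreContent(markdown: string): number {
  const text = markdown.trim();
  if (text.length === 0) return 0;

  // Length signal: 300 or more words earns the full 100 points.
  const words = text.split(/\s+/).length;
  let score = Math.min(100, Math.round((words / 300) * 100));

  // Each paywall, login-wall, or error-page phrase costs 60 points.
  const wallPatterns = [
    /subscribe to (continue|keep) reading/i,
    /(sign|log) in to (continue|view|read)/i,
    /\b404\b|page not found|access denied/i,
  ];
  for (const pattern of wallPatterns) {
    if (pattern.test(text)) score -= 60;
  }
  return Math.max(0, Math.min(100, score));
}

// Try the cheap fetcher first; use the browser fetcher only when the cheap
// one throws or its content scores below the threshold.
async function fetchWithFallback(
  url: string,
  httpFetch: Fetcher,
  browserFetch: Fetcher,
  minScore = 60,
): Promise<FetchResult> {
  try {
    const markdown = await httpFetch(url);
    const score = scoreContent(markdown);
    if (score >= minScore) return { markdown, score, via: "http" };
  } catch {
    // A network or parse failure is treated like a low score: fall through.
  }

  const markdown = await browserFetch(url);
  const score = scoreContent(markdown);
  if (score < minScore) {
    throw new Error(`Rejected ${url}: quality score ${score} is below ${minScore}`);
  }
  return { markdown, score, via: "browser" };
}
```

Injecting the fetchers keeps the expensive browser path out of the picture until it is actually needed, which mirrors the "fast and free first" ordering. In a real tool the second fetcher would wrap a Playwright session.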

Features

HTTP first, automatic Playwright fallback for JS-heavy sites
Quality scoring (0–100) with detection for boilerplate, login walls, paywalls, and error pages
Anti-bot escape hatches: stealth plugin, viewport / timezone / locale rotation, realistic headers
Markdown output via Mozilla Readability + Turndown (typically 90–95% smaller than the raw HTML)
Cache-to-temp workflow: stash a fetch in a temp folder, promote to docs/ once you’ve checked it
Link extraction so you can batch-fetch every link on a page you’ve already cached
Available as both a CLI and an MCP server (6 tools)
Output as plain text, JSON, file path, or summary
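As one illustration of the link-extraction feature, here is a minimal sketch of pulling fetchable links out of Markdown that has already been cached. The function name `extractLinks` and the regex-based approach are assumptions made for this example; ArcFetch may well extract links differently (for instance from the HTML before conversion).

```typescript
// Hypothetical sketch: extractLinks is not an ArcFetch API.
// Collects unique http(s) links from Markdown, resolved against the page URL.
function extractLinks(markdown: string, baseUrl: string): string[] {
  // Group 1: optional "!" (marks an image), group 2: link text, group 3: target.
  const linkPattern = /(!?)\[([^\]]*)\]\(([^)\s]+)[^)]*\)/g;
  const seen = new Set<string>();

  let match: RegExpExecArray | null;
  while ((match = linkPattern.exec(markdown)) !== null) {
    if (match[1] === "!") continue; // skip images

    try {
      const url = new URL(match[3], baseUrl); // resolves relative links
      if (url.protocol !== "http:" && url.protocol !== "https:") continue;
      url.hash = ""; // the same page with a different anchor is one fetch
      seen.add(url.href);
    } catch {
      // Ignore targets that cannot be parsed as URLs.
    }
  }
  return Array.from(seen);
}

const cached =
  "See [docs](/docs/intro#setup), [home](https://example.com/), " +
  "![logo](/logo.png), [mail](mailto:a@b.c), and [docs again](/docs/intro).";

console.log(extractLinks(cached, "https://example.com/blog/post"));
// → [ "https://example.com/docs/intro", "https://example.com/" ]
```

Deduplicating and dropping fragments before batching avoids fetching the same page twice when a cached document links to several anchors within it.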

Technology Stack

TypeScript on Bun
Mozilla Readability for content extraction
Playwright (loaded only when the fallback fires)
Turndown for HTML-to-Markdown conversion
