Firecrawl Migration Deep Dive
Current State
!npm list puppeteer playwright cheerio 2>/dev/null | grep -E "puppeteer|playwright|cheerio" || echo 'No scraping libs found'
Overview
Migrate from custom scraping (Puppeteer, Playwright, Cheerio) or competing APIs to Firecrawl. Firecrawl eliminates browser management, anti-bot handling, and JS rendering infrastructure. This skill shows equivalent code for common scraping patterns.
Migration Comparison
| Feature | Puppeteer/Playwright | Cheerio | Firecrawl | |---------|---------------------|---------|-----------| | JS rendering | Manual browser | No | Automatic | | Anti-bot bypass | DIY (stealth plugin) | No | Built-in | | Output format | Raw HTML | Parsed HTML | Markdown/JSON/HTML | | Infrastructure | Browser instances | None | API call | | Concurrent scraping | Manage browser pool | Simple | Managed by Firecrawl | | Cost model | Compute (CPU/RAM) | Free | Credits per page |
Instructions
Step 1: Replace Puppeteer Single-Page Scrape
// BEFORE: Puppeteer (20+ lines, browser management)
import puppeteer from "puppeteer";
async function scrapePuppeteer(url: string) {
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto(url, { waitUntil: "networkidle2" });
const html = await page.content();
const title = await page.title();
await browser.close();
return { html, title };
}
// AFTER: Firecrawl (5 lines, no browser needed)
import FirecrawlApp from "@mendable/firecrawl-js";
const firecrawl = new FirecrawlApp({ apiKey: process.env.FIRECRAWL_API_KEY! });
async function scrapeFirecrawl(url: string) {
const result = await firecrawl.scrapeUrl(url, {
formats: ["markdown"],
onlyMainContent: true,
waitFor: 2000,
});
return { markdown: result.markdown, title: result.metadata?.title };
}
Step 2: Replace Cheerio HTML Parsing
// BEFORE: fetch + cheerio (manual parsing)
import * as cheerio from "cheerio";
async function scrapeCheerio(url: string) {
const html = await fetch(url).then(r => r.text());
const $ = cheerio.load(html);
return {
title: $("h1").first().text(),
content: $("main").text(),
links: $("a").map((_, el) => $(el).attr("href")).get(),
};
}
// AFTER: Firecrawl with extract (LLM-powered, no CSS selectors)
async function extractFirecrawl(url: string) {
const result = await firecrawl.scrapeUrl(url, {
formats: ["extract", "links"],
extract: {
schema: {
type: "object",
properties: {
title: { type: "string" },
content: { type: "string" },
},
},
},
});
return {
title: result.extract?.title,
content: result.extract?.content,
links: result.links,
};
}
Step 3: Replace Crawl Pipeline
// BEFORE: Playwright crawler (100+ lines, queue, browser pool)
// - launch browser pool
// - manage visited URLs set
// - extract links, enqueue
// - handle errors per page
// - close browsers on exit
// AFTER: Firecrawl crawl (10 lines)
async function crawlSite(baseUrl: string) {
const result = await firecrawl.crawlUrl(baseUrl, {
limit: 100,
maxDepth: 3,
includePaths: ["/docs/*", "/api/*"],
excludePaths: ["/blog/*"],
scrapeOptions: {
formats: ["markdown"],
onlyMainContent: true,
},
});
return result.data?.map(page => ({
url: page.metadata?.sourceURL,
title: page.metadata?.title,
content: page.markdown,
}));
}
Step 4: Gradual Migration with Adapter Pattern
// Adapter interface for gradual migration
interface ScrapeAdapter {
scrape(url: string): Promise<{ title: string; content: string }>;
crawl(url: string, maxPages: number): Promise<Array<{ url: string; content: string }>>;
}
class FirecrawlAdapter implements ScrapeAdapter {
private client: FirecrawlApp;
constructor() {
this.client = new FirecrawlApp({ apiKey: process.env.FIRECRAWL_API_KEY! });
}
async scrape(url: string) {
const result = await this.client.scrapeUrl(url, {
formats: ["markdown"],
onlyMainContent: true,
});
return {
title: result.metadata?.title || "",
content: result.markdown || "",
};
}
async crawl(url: string, maxPages: number) {
const result = await this.client.crawlUrl(url, {
limit: maxPages,
scrapeOptions: { formats: ["markdown"], onlyMainContent: true },
});
return (result.data || []).map(page => ({
url: page.metadata?.sourceURL || url,
content: page.markdown || "",
}));
}
}
// Feature flag controlled migration
function getScrapeAdapter(): ScrapeAdapter {
if (process.env.USE_FIRECRAWL === "true") {
return new FirecrawlAdapter();
}
return new LegacyPuppeteerAdapter();
}
Step 5: Remove Old Dependencies
set -euo pipefail
# After migration is complete and verified
npm uninstall puppeteer puppeteer-core
npm uninstall playwright @playwright/test
npm uninstall cheerio
# Remove browser downloads
npx playwright uninstall --all 2>/dev/null || true
# Verify no lingering references
grep -r "puppeteer\|playwright\|cheerio" src/ --include="*.ts" || echo "Clean!"
Migration Checklist
- [ ] Install
@mendable/firecrawl-js - [ ] Create adapter layer wrapping Firecrawl
- [ ] Replace single-page scrapes with
scrapeUrl - [ ] Replace crawl loops with
crawlUrl - [ ] Replace HTML parsing with
extractor markdown - [ ] Feature flag to switch between old and new
- [ ] Run both in parallel, compare outputs
- [ ] Remove old scraping dependencies
- [ ] Delete browser management code
Error Handling
| Issue | Cause | Solution |
|-------|-------|----------|
| Different output format | Puppeteer returns HTML, Firecrawl markdown | Adjust downstream consumers |
| Missing CSS selector data | Firecrawl doesn't use selectors | Use extract with JSON schema |
| Higher latency for single pages | API call vs local browser | Acceptable trade-off for zero infra |
| Content differences | Different JS wait timing | Tune waitFor parameter |
Resources
Next Steps
For advanced troubleshooting, see firecrawl-advanced-troubleshooting.