Firecrawl Architecture Variants Skill

Firecrawl Architecture Variants

Overview

Three deployment architectures for Firecrawl at different scales: on-demand scraping for simple use cases, scheduled crawl pipelines for content monitoring, and real-time ingestion pipelines for AI/RAG applications. Choose based on volume, latency requirements, and cost budget.

Decision Matrix

| Factor | On-Demand | Scheduled Pipeline | Real-Time Pipeline | |--------|-----------|-------------------|-------------------| | Volume | < 500/day | 500-10K/day | 10K+/day | | Latency | Sync (2-10s) | Async (hours) | Async (minutes) | | Use Case | Single page lookup | Site monitoring | Knowledge base, RAG | | Credit Control | Per-request | Per-crawl budget | Credit pipeline | | Complexity | Low | Medium | High |

Instructions

Architecture 1: On-Demand Scraping

User Request → Backend API → firecrawl.scrapeUrl → Clean Content → Response

Best for: chatbots, content preview, single-page extraction.

import FirecrawlApp from "@mendable/firecrawl-js";

const firecrawl = new FirecrawlApp({
  apiKey: process.env.FIRECRAWL_API_KEY!,
});

// Simple API endpoint
app.post("/api/scrape", async (req, res) => {
  const { url } = req.body;

  const result = await firecrawl.scrapeUrl(url, {
    formats: ["markdown"],
    onlyMainContent: true,
    waitFor: 3000,
  });

  res.json({
    title: result.metadata?.title,
    content: result.markdown,
    url: result.metadata?.sourceURL,
  });
});

// With LLM extraction
app.post("/api/extract", async (req, res) => {
  const { url, schema } = req.body;

  const result = await firecrawl.scrapeUrl(url, {
    formats: ["extract"],
    extract: { schema },
  });

  res.json({ data: result.extract });
});

Architecture 2: Scheduled Crawl Pipeline

Scheduler (cron) → Crawl Queue → firecrawl.asyncCrawlUrl → Result Store
                                                                  │
                                                                  ▼
                                                        Content Processor → Search Index

Best for: documentation monitoring, content indexing, competitive analysis.

import cron from "node-cron";

interface CrawlTarget {
  id: string;
  url: string;
  maxPages: number;
  paths?: string[];
  schedule: string; // cron expression
}

const targets: CrawlTarget[] = [
  { id: "docs", url: "https://docs.example.com", maxPages: 100, paths: ["/docs/*"], schedule: "0 2 * * *" },
  { id: "blog", url: "https://blog.example.com", maxPages: 50, schedule: "0 4 * * 1" },
];

// Schedule crawls
for (const target of targets) {
  cron.schedule(target.schedule, async () => {
    console.log(`Starting scheduled crawl: ${target.id}`);
    const job = await firecrawl.asyncCrawlUrl(target.url, {
      limit: target.maxPages,
      includePaths: target.paths,
      scrapeOptions: { formats: ["markdown"], onlyMainContent: true },
    });
    await db.saveCrawlJob({ targetId: target.id, jobId: job.id, startedAt: new Date() });
  });
}

// Separate worker polls for results
async function processPendingCrawls() {
  const pending = await db.getPendingCrawlJobs();
  for (const job of pending) {
    const status = await firecrawl.checkCrawlStatus(job.jobId);
    if (status.status === "completed") {
      await indexPages(job.targetId, status.data || []);
      await db.markComplete(job.id, status.data?.length || 0);
      console.log(`Crawl ${job.targetId} complete: ${status.data?.length} pages indexed`);
    }
  }
}
setInterval(processPendingCrawls, 30000);

Architecture 3: Real-Time Content Pipeline

URL Sources → Priority Queue → Firecrawl Workers → Content Validation
                                                          │
                                                          ▼
                                                   Vector DB + Search Index
                                                          │
                                                          ▼
                                                    RAG / AI Pipeline

Best for: AI training data, knowledge base, enterprise content platform.

import PQueue from "p-queue";

class ContentPipeline {
  private queue: PQueue;
  private firecrawl: FirecrawlApp;
  private creditBudget: number;
  private creditsUsed = 0;

  constructor(concurrency = 5, dailyBudget = 10000) {
    this.queue = new PQueue({ concurrency, interval: 1000, intervalCap: 10 });
    this.firecrawl = new FirecrawlApp({ apiKey: process.env.FIRECRAWL_API_KEY! });
    this.creditBudget = dailyBudget;
  }

  async ingest(urls: string[]) {
    if (this.creditsUsed + urls.length > this.creditBudget) {
      throw new Error("Daily credit budget exceeded");
    }

    // Use batch scrape for efficiency
    const result = await this.queue.add(() =>
      this.firecrawl.batchScrapeUrls(urls, {
        formats: ["markdown"],
        onlyMainContent: true,
      })
    );

    this.creditsUsed += urls.length;

    // Validate and process
    const pages = (result?.data || []).filter(page => {
      const md = page.markdown || "";
      return md.length > 100 && !/captcha|access denied/i.test(md);
    });

    // Store in vector DB
    for (const page of pages) {
      await vectorStore.upsert({
        id: page.metadata?.sourceURL,
        content: page.markdown,
        metadata: { title: page.metadata?.title, url: page.metadata?.sourceURL },
      });
    }

    return { ingested: pages.length, rejected: urls.length - pages.length };
  }

  async discover(siteUrl: string, pathFilter: string) {
    const map = await this.firecrawl.mapUrl(siteUrl);
    return (map.links || []).filter(url => url.includes(pathFilter));
  }
}

// Usage
const pipeline = new ContentPipeline(5, 10000);
const urls = await pipeline.discover("https://docs.example.com", "/api/");
const result = await pipeline.ingest(urls.slice(0, 100));
console.log(`Ingested ${result.ingested} pages into vector store`);

Choosing Your Architecture

Need real-time, user-facing response?
├── YES → On-Demand (Architecture 1)
└── NO → How many pages/day?
    ├── < 500 → On-Demand with caching
    ├── 500-10K → Scheduled Pipeline (Architecture 2)
    └── 10K+ → Real-Time Pipeline (Architecture 3)

Error Handling

| Issue | Cause | Solution | |-------|-------|----------| | Slow on-demand response | JS-heavy target page | Add caching layer, reduce waitFor | | Stale indexed content | Crawl schedule too infrequent | Increase frequency for critical sources | | Credit overrun | Pipeline ingesting too aggressively | Implement daily budget with hard cap | | Duplicate content | Re-crawling same pages | Deduplicate by content hash before indexing |

Resources

Next Steps

For common pitfalls, see firecrawl-known-pitfalls.

Agent Skills: Firecrawl Architecture Variants

Install this agent skill to your local

Skill Files