Apify Core Workflow A — Build & Deploy a Scraper
Overview
End-to-end workflow: define an input schema, build a Crawlee-based Actor, extract structured data, store results in datasets, test locally, and deploy to the Apify platform. This is the primary money-path workflow for Apify.
Prerequisites
- `npm install apify crawlee` in your project
- `npm install -g apify-cli` and `apify login` completed
- Familiarity with `apify-sdk-patterns`
Instructions
Step 1: Define Input Schema
Create .actor/INPUT_SCHEMA.json:
```json
{
    "title": "E-Commerce Scraper",
    "type": "object",
    "schemaVersion": 1,
    "properties": {
        "startUrls": {
            "title": "Start URLs",
            "type": "array",
            "description": "Product listing page URLs to scrape",
            "editor": "requestListSources",
            "prefill": [{ "url": "https://example-store.com/products" }]
        },
        "maxItems": {
            "title": "Max items",
            "type": "integer",
            "description": "Maximum number of products to scrape",
            "default": 100,
            "minimum": 1,
            "maximum": 10000
        },
        "proxyConfig": {
            "title": "Proxy configuration",
            "type": "object",
            "description": "Select proxy to use",
            "editor": "proxy",
            "default": { "useApifyProxy": true }
        }
    },
    "required": ["startUrls"]
}
```
Step 2: Build the Actor with Router Pattern
```typescript
// src/main.ts
import { Actor } from 'apify';
import { CheerioCrawler, createCheerioRouter, Dataset, log } from 'crawlee';

interface ProductInput {
    startUrls: { url: string }[];
    maxItems?: number;
    proxyConfig?: { useApifyProxy: boolean; groups?: string[] };
}

interface Product {
    url: string;
    name: string;
    price: number | null;
    currency: string;
    description: string;
    imageUrl: string | null;
    inStock: boolean;
    scrapedAt: string;
}

const router = createCheerioRouter();

// LISTING pages — extract product links
router.addDefaultHandler(async ({ request, $, enqueueLinks, log }) => {
    log.info(`Listing page: ${request.url}`);
    await enqueueLinks({
        selector: 'a.product-card',
        label: 'PRODUCT',
    });
    // Handle pagination. Leave these requests unlabeled so the router
    // sends the next listing page back to this default handler.
    await enqueueLinks({
        selector: 'a.next-page',
    });
});

// PRODUCT detail pages — extract structured data
router.addHandler('PRODUCT', async ({ request, $, log }) => {
    log.info(`Product page: ${request.url}`);
    const product: Product = {
        url: request.url,
        name: $('h1.product-title').text().trim(),
        price: parseFloat($('.price').text().replace(/[^0-9.]/g, '')) || null,
        currency: $('.currency').text().trim() || 'USD',
        description: $('div.description').text().trim(),
        imageUrl: $('img.product-image').attr('src') || null,
        inStock: !$('.out-of-stock').length,
        scrapedAt: new Date().toISOString(),
    };
    await Actor.pushData(product);
});

// Entry point
await Actor.main(async () => {
    const input = await Actor.getInput<ProductInput>();
    if (!input?.startUrls?.length) throw new Error('startUrls required');

    const proxyConfiguration = input.proxyConfig?.useApifyProxy
        ? await Actor.createProxyConfiguration({
              groups: input.proxyConfig.groups,
          })
        : undefined;

    const crawler = new CheerioCrawler({
        requestHandler: router,
        proxyConfiguration,
        // Caps total requests (listing + detail pages), not just scraped items
        maxRequestsPerCrawl: input.maxItems ?? 100,
        maxConcurrency: 10,
        requestHandlerTimeoutSecs: 60,
        async failedRequestHandler({ request }, error) {
            log.error(`Failed: ${request.url} — ${error.message}`);
            await Actor.pushData({
                url: request.url,
                error: error.message,
                '#isFailed': true,
            });
        },
    });

    await crawler.run(input.startUrls.map((s) => s.url));

    // Save run summary to key-value store
    const dataset = await Dataset.open();
    const info = await dataset.getInfo();
    await Actor.setValue('SUMMARY', {
        itemCount: info?.itemCount ?? 0,
        finishedAt: new Date().toISOString(),
        startUrls: input.startUrls.map((s) => s.url),
    });
    log.info(`Done. Scraped ${info?.itemCount ?? 0} products.`);
});
```
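The inline price extraction in the PRODUCT handler works, but `parseFloat(...) || null` also maps a legitimate price of `0` to `null`, and the regex assumes US-style formatting. Pulling it into a small helper makes those assumptions explicit; `parsePrice` is an illustrative name, not a Crawlee or Apify API:

```typescript
// Illustrative price parser. Assumes US-style formatting ("$1,299.99");
// European decimal commas ("12,50") would be misread and need a locale-aware
// variant. Returns null when no digits are present, instead of NaN.
function parsePrice(raw: string): number | null {
    const cleaned = raw.replace(/[^0-9.]/g, ''); // strip currency symbols and thousands separators
    if (cleaned === '') return null;
    const value = parseFloat(cleaned);
    return Number.isNaN(value) ? null : value;
}
```

In the handler this would replace the inline expression: `price: parsePrice($('.price').text())`.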
Step 3: Configure Dockerfile
```dockerfile
# .actor/Dockerfile
FROM apify/actor-node:20 AS builder
COPY package*.json ./
RUN npm ci --include=dev --audit=false
COPY . .
RUN npm run build

FROM apify/actor-node:20
COPY package*.json ./
RUN npm ci --omit=dev --audit=false
COPY --from=builder /usr/src/app/dist ./dist
COPY .actor .actor
CMD ["npm", "start"]
```
Step 4: Test Locally
```bash
# Create test input
mkdir -p storage/key_value_stores/default
echo '{"startUrls":[{"url":"https://example.com"}],"maxItems":5}' \
    > storage/key_value_stores/default/INPUT.json

# Run locally
apify run

# Check results
ls storage/datasets/default/
cat storage/key_value_stores/default/SUMMARY.json
```
Step 5: Deploy to Apify Platform
```bash
# Push to Apify (creates the Actor if it doesn't exist)
apify push

# Or push to a specific Actor
apify push username/my-actor

# Run on the platform
apify actors call username/my-actor
```
Step 6: Retrieve Results Programmatically
```typescript
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

// Run the deployed Actor and wait for it to finish
const run = await client.actor('username/my-actor').call({
    startUrls: [{ url: 'https://target-store.com/products' }],
    maxItems: 500,
});

// Get results
const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(`Scraped ${items.length} products`);

// Download as CSV
const csv = await client.dataset(run.defaultDatasetId).downloadItems('csv');
```
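`listItems()` also accepts `offset` and `limit` options, which matters once a dataset is too large to pull in a single call. One approach is to precompute the page windows and fetch them in a loop; `pageWindows` below is an illustrative helper, not part of `apify-client`:

```typescript
// Compute (offset, limit) windows for paging through a dataset of `total`
// items in chunks of `pageSize`. The last window is truncated to fit.
function pageWindows(total: number, pageSize: number): { offset: number; limit: number }[] {
    const pages: { offset: number; limit: number }[] = [];
    for (let offset = 0; offset < total; offset += pageSize) {
        pages.push({ offset, limit: Math.min(pageSize, total - offset) });
    }
    return pages;
}
```

Each window can then be fetched with `client.dataset(run.defaultDatasetId).listItems({ offset, limit })`; the total item count is available from the dataset's metadata via `client.dataset(run.defaultDatasetId).get()`.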
Output
- Deployable Actor with typed input schema
- Router-based crawler handling listing + detail pages
- Structured product data in default dataset
- Run summary in default key-value store
- Failed requests tracked with error messages
Error Handling
| Error | Cause | Solution |
|-------|-------|----------|
| Actor build failed | Dockerfile/deps issue | Check build logs on platform |
| Selector returns empty | Page structure changed | Update CSS selectors |
| maxRequestsPerCrawl hit | Too many pages enqueued | Increase limit or filter URLs |
| Proxy errors | Anti-bot blocking | Switch to residential proxy |
| TIMED-OUT status | Actor exceeded timeout | Increase timeout or reduce scope |
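Several of the failures above (proxy errors, timeouts) are transient, so when calling the Actor programmatically the call can be wrapped in a simple client-side retry with exponential backoff. A sketch, assuming the wrapped function throws on failure (`withRetries` and `backoffMs` are illustrative names, not SDK functions):

```typescript
// Exponential backoff delay: doubles per attempt, capped at capMs.
// Jitter is omitted for clarity.
function backoffMs(attempt: number, baseMs = 1000, capMs = 30000): number {
    return Math.min(baseMs * 2 ** attempt, capMs);
}

// Retry a promise-returning function up to maxAttempts times,
// sleeping with exponential backoff between attempts.
async function withRetries<T>(fn: () => Promise<T>, maxAttempts = 3): Promise<T> {
    let lastError: unknown;
    for (let attempt = 0; attempt < maxAttempts; attempt++) {
        try {
            return await fn();
        } catch (err) {
            lastError = err;
            await new Promise((r) => setTimeout(r, backoffMs(attempt)));
        }
    }
    throw lastError;
}
```

Usage would look like `await withRetries(() => client.actor('username/my-actor').call(input))`. Retrying is only worthwhile for transient errors; a changed page structure will fail every attempt.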
Next Steps
For dataset/KV store management, see apify-core-workflow-b.