# Apify Documentation

> The entire content of Apify documentation is available in a single Markdown file at https://docs.apify.com/llms-full.txt


## Apify API

- [Apify API](https://docs.apify.com/api.md)
- [Apify API](https://docs.apify.com/api/v2.md): The Apify API (version 2) provides programmatic access to the [Apify platform](https://docs.apify.com).
- [Abort build](https://docs.apify.com/api/v2/act-build-abort-post.md): **[DEPRECATED]** This endpoint has been deprecated and may be replaced or removed in future versions of the API.
- [Get default build](https://docs.apify.com/api/v2/act-build-default-get.md): Get the default build for an Actor.
- [Get build](https://docs.apify.com/api/v2/act-build-get.md): **[DEPRECATED]** API endpoints related to builds of the Actor were moved under the new namespace [`actor-builds`](#/reference/actor-builds).
- [Get list of builds](https://docs.apify.com/api/v2/act-builds-get.md): Gets the list of builds of a specific Actor.
- [Build Actor](https://docs.apify.com/api/v2/act-builds-post.md): Builds an Actor.
- [Delete Actor](https://docs.apify.com/api/v2/act-delete.md): Deletes an Actor.
- [Get Actor](https://docs.apify.com/api/v2/act-get.md): Gets an object that contains all the details about a specific Actor.
- [Get OpenAPI definition](https://docs.apify.com/api/v2/act-openapi-json-get.md): Get the OpenAPI definition for Actor builds.
- [Update Actor](https://docs.apify.com/api/v2/act-put.md): Updates settings of an Actor using values specified by an Actor object passed as JSON in the PUT payload.
- [Abort run](https://docs.apify.com/api/v2/act-run-abort-post.md): **[DEPRECATED]** This endpoint has been deprecated and may be replaced or removed in future versions of the API.
- [Get run](https://docs.apify.com/api/v2/act-run-get.md): **[DEPRECATED]** This endpoint has been deprecated and may be replaced or removed in future versions of the API.
- [Metamorph run](https://docs.apify.com/api/v2/act-run-metamorph-post.md): **[DEPRECATED]** This endpoint has been deprecated and may be replaced or removed in future versions of the API.
- [Resurrect run](https://docs.apify.com/api/v2/act-run-resurrect-post.md): **[DEPRECATED]** API endpoints related to runs of the Actor were moved under the new namespace [`actor-runs`](#/reference/actor-runs). Resurrects a finished Actor run and returns an object that contains all the details about the resurrected run.
- [Without input](https://docs.apify.com/api/v2/act-run-sync-get.md): Runs a specific Actor and returns its output.
- [Run Actor synchronously without input and get dataset items](https://docs.apify.com/api/v2/act-run-sync-get-dataset-items-get.md): Runs a specific Actor and returns its dataset items.
- [Run Actor synchronously with input and get dataset items](https://docs.apify.com/api/v2/act-run-sync-get-dataset-items-post.md): Runs a specific Actor and returns its dataset items.
- [Run Actor synchronously with input and return output](https://docs.apify.com/api/v2/act-run-sync-post.md): Runs a specific Actor and returns its output.
- [Get list of runs](https://docs.apify.com/api/v2/act-runs-get.md): Gets the list of runs of a specific Actor.
- [Get last run](https://docs.apify.com/api/v2/act-runs-last-get.md): This is not a single endpoint, but an entire group of endpoints that lets you retrieve and manage the last run of a given Actor or any of its default storages.
- [Run Actor](https://docs.apify.com/api/v2/act-runs-post.md): Runs an Actor and immediately returns without waiting for the run to finish.
- [Delete version](https://docs.apify.com/api/v2/act-version-delete.md): Deletes a specific version of Actor's source code.
- [Delete environment variable](https://docs.apify.com/api/v2/act-version-env-var-delete.md): Deletes a specific environment variable.
- [Get environment variable](https://docs.apify.com/api/v2/act-version-env-var-get.md): Gets an [EnvVar object](#/reference/actors/environment-variable-object) that contains all the details about a specific environment variable of an Actor.
- [Update environment variable](https://docs.apify.com/api/v2/act-version-env-var-put.md): Updates an Actor environment variable using values specified by an [EnvVar object](#/reference/actors/environment-variable-object) passed as JSON in the PUT payload.
- [Get list of environment variables](https://docs.apify.com/api/v2/act-version-env-vars-get.md): Gets the list of environment variables for a specific version of an Actor.
- [Create environment variable](https://docs.apify.com/api/v2/act-version-env-vars-post.md): Creates an environment variable of an Actor using values specified in an [EnvVar object](#/reference/actors/environment-variable-object) passed as JSON in the POST payload.
- [Get version](https://docs.apify.com/api/v2/act-version-get.md): Gets a [Version object](#/reference/actors/version-object) that contains all the details about a specific version of an Actor.
- [Update version](https://docs.apify.com/api/v2/act-version-put.md): Updates an Actor version using values specified by a [Version object](#/reference/actors/version-object) passed as JSON in the PUT payload.
- [Get list of versions](https://docs.apify.com/api/v2/act-versions-get.md): Gets the list of versions of a specific Actor.
- [Create version](https://docs.apify.com/api/v2/act-versions-post.md): Creates a version of an Actor using values specified in a [Version object](#/reference/actors/version-object) passed as JSON in the POST payload.
- [Get list of webhooks](https://docs.apify.com/api/v2/act-webhooks-get.md): Gets the list of webhooks of a specific Actor.
- [Abort build](https://docs.apify.com/api/v2/actor-build-abort-post.md): Aborts an Actor build and returns an object that contains all the details about the build.
- [Delete build](https://docs.apify.com/api/v2/actor-build-delete.md): Delete the build.
- [Get build](https://docs.apify.com/api/v2/actor-build-get.md): Gets an object that contains all the details about a specific build of an Actor.
- [Get log](https://docs.apify.com/api/v2/actor-build-log-get.md): Check out [Logs](#/reference/logs) for full reference.
- [Get OpenAPI definition](https://docs.apify.com/api/v2/actor-build-openapi-json-get.md): Get the OpenAPI definition for Actor builds.
- [Actor builds - Introduction](https://docs.apify.com/api/v2/actor-builds.md): The API endpoints described in this section enable you to manage and delete Apify Actor builds.
- [Get user builds list](https://docs.apify.com/api/v2/actor-builds-get.md): Gets a list of all builds for a user.
- [Abort run](https://docs.apify.com/api/v2/actor-run-abort-post.md): Aborts an Actor run and returns an object that contains all the details about the run.
- [Delete run](https://docs.apify.com/api/v2/actor-run-delete.md): Delete the run.
- [Get run](https://docs.apify.com/api/v2/actor-run-get.md): This is not a single endpoint, but an entire group of endpoints that lets you retrieve the run or any of its default storages.
- [Metamorph run](https://docs.apify.com/api/v2/actor-run-metamorph-post.md): Transforms an Actor run into a run of another Actor with a new input.
- [Update status message](https://docs.apify.com/api/v2/actor-run-put.md): You can set a single status message on your run that will be displayed in the Apify Console UI.
- [Reboot run](https://docs.apify.com/api/v2/actor-run-reboot-post.md): Reboots an Actor run and returns an object that contains all the details about the rebooted run.
- [Actor runs - Introduction](https://docs.apify.com/api/v2/actor-runs.md): The API endpoints described in this section enable you to manage and delete Apify Actor runs.
- [Get user runs list](https://docs.apify.com/api/v2/actor-runs-get.md): Gets a list of all runs for a user.
- [Delete task](https://docs.apify.com/api/v2/actor-task-delete.md): Delete the task specified through the `actorTaskId` parameter.
- [Get task](https://docs.apify.com/api/v2/actor-task-get.md): Get an object that contains all the details about a task.
- [Get task input](https://docs.apify.com/api/v2/actor-task-input-get.md): Returns the input of a given task.
- [Update task input](https://docs.apify.com/api/v2/actor-task-input-put.md): Updates the input of a task using values specified by an object passed as JSON in the PUT payload.
- [Update task](https://docs.apify.com/api/v2/actor-task-put.md): Update settings of a task using values specified by an object passed as JSON in the PUT payload.
- [Run task synchronously](https://docs.apify.com/api/v2/actor-task-run-sync-get.md): Run a specific task and return its output.
- [Run task synchronously and get dataset items](https://docs.apify.com/api/v2/actor-task-run-sync-get-dataset-items-get.md): Run a specific task and return its dataset items.
- [Run task synchronously and get dataset items](https://docs.apify.com/api/v2/actor-task-run-sync-get-dataset-items-post.md): Runs an Actor task and synchronously returns its dataset items.
- [Run task synchronously](https://docs.apify.com/api/v2/actor-task-run-sync-post.md): Runs an Actor task and synchronously returns its output.
- [Get list of task runs](https://docs.apify.com/api/v2/actor-task-runs-get.md): Get a list of runs of a specific task.
- [Get last run](https://docs.apify.com/api/v2/actor-task-runs-last-get.md): This is not a single endpoint, but an entire group of endpoints that lets you retrieve and manage the last run of a given Actor task or any of its default storages.
- [Run task](https://docs.apify.com/api/v2/actor-task-runs-post.md): Runs an Actor task and immediately returns without waiting for the run to finish.
- [Get list of webhooks](https://docs.apify.com/api/v2/actor-task-webhooks-get.md): Gets the list of webhooks of a specific Actor task.
- [Actor tasks - Introduction](https://docs.apify.com/api/v2/actor-tasks.md): The API endpoints described in this section enable you to create, manage, delete, and run Apify Actor tasks.
- [Get list of tasks](https://docs.apify.com/api/v2/actor-tasks-get.md): Gets the complete list of tasks that a user has created or used.
- [Create task](https://docs.apify.com/api/v2/actor-tasks-post.md): Create a new task with settings specified by the object passed as JSON in the POST payload.
- [Actors - Introduction](https://docs.apify.com/api/v2/actors.md): The API endpoints in this section allow you to manage Apify Actors.
- [Actor builds - Introduction](https://docs.apify.com/api/v2/actors-actor-builds.md): The API endpoints in this section allow you to manage your Apify Actors builds.
- [Actor runs - Introduction](https://docs.apify.com/api/v2/actors-actor-runs.md): The API endpoints in this section allow you to manage your Apify Actors runs.
- [Actor versions - Introduction](https://docs.apify.com/api/v2/actors-actor-versions.md): The API endpoints in this section allow you to manage your Apify Actors versions.
- [Webhook collection - Introduction](https://docs.apify.com/api/v2/actors-webhook-collection.md): The API endpoint in this section allows you to get a list of webhooks of a specific Actor.
- [Get list of Actors](https://docs.apify.com/api/v2/acts-get.md): Gets the list of all Actors that the user created or used.
- [Create Actor](https://docs.apify.com/api/v2/acts-post.md): Creates a new Actor with settings specified in an Actor object passed as JSON in the POST payload.
- [Delete dataset](https://docs.apify.com/api/v2/dataset-delete.md): Deletes a specific dataset.
- [Get dataset](https://docs.apify.com/api/v2/dataset-get.md): Returns dataset object for given dataset ID.
- [Get items](https://docs.apify.com/api/v2/dataset-items-get.md): Returns data stored in the dataset in a desired format.
- [Store items](https://docs.apify.com/api/v2/dataset-items-post.md): Appends an item or an array of items to the end of the dataset.
- [Update dataset](https://docs.apify.com/api/v2/dataset-put.md): Updates a dataset's name using a value specified by a JSON object passed in the PUT payload.
- [Get dataset statistics](https://docs.apify.com/api/v2/dataset-statistics-get.md): Returns statistics for given dataset.
- [Get list of datasets](https://docs.apify.com/api/v2/datasets-get.md): Lists all of a user's datasets.
- [Create dataset](https://docs.apify.com/api/v2/datasets-post.md): Creates a dataset and returns its object.
- [Getting started with Apify API](https://docs.apify.com/api/v2/getting-started.md): The Apify API provides programmatic access to the [Apify platform](https://docs.apify.com).
- [Delete store](https://docs.apify.com/api/v2/key-value-store-delete.md): Deletes a key-value store.
- [Get store](https://docs.apify.com/api/v2/key-value-store-get.md): Gets an object that contains all the details about a specific key-value store.
- [Get list of keys](https://docs.apify.com/api/v2/key-value-store-keys-get.md): Returns a list of objects describing keys of a given key-value store, as well as some information about the values (e.g.
- [Update store](https://docs.apify.com/api/v2/key-value-store-put.md): Updates a key-value store's name using a value specified by a JSON object passed in the PUT payload.
- [Delete record](https://docs.apify.com/api/v2/key-value-store-record-delete.md): Removes a record specified by a key from the key-value store.
- [Get record](https://docs.apify.com/api/v2/key-value-store-record-get.md): Gets a value stored in the key-value store under a specific key.
- [Check if a record exists](https://docs.apify.com/api/v2/key-value-store-record-head.md): Check if a value is stored in the key-value store under a specific key.
- [Store record](https://docs.apify.com/api/v2/key-value-store-record-put.md): Stores a value under a specific key to the key-value store.
- [Get list of key-value stores](https://docs.apify.com/api/v2/key-value-stores-get.md): Gets the list of key-value stores owned by the user.
- [Create key-value store](https://docs.apify.com/api/v2/key-value-stores-post.md): Creates a key-value store and returns its object.
- [Get log](https://docs.apify.com/api/v2/log-get.md): Retrieves logs for a specific Actor build or run.
- [Logs - Introduction](https://docs.apify.com/api/v2/logs.md): The API endpoints described in this section are used to download the logs generated by Actor builds and runs.
- [Charge events in run](https://docs.apify.com/api/v2/post-charge-run.md): Charge for events in the run of your [pay per event Actor](https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-event).
- [Resurrect run](https://docs.apify.com/api/v2/post-resurrect-run.md): Resurrects a finished Actor run and returns an object that contains all the details about the resurrected run.
- [Delete request queue](https://docs.apify.com/api/v2/request-queue-delete.md): Deletes given queue.
- [Get request queue](https://docs.apify.com/api/v2/request-queue-get.md): Returns queue object for given queue ID.
- [Get head](https://docs.apify.com/api/v2/request-queue-head-get.md): Returns given number of first requests from the queue.
- [Get head and lock](https://docs.apify.com/api/v2/request-queue-head-lock-post.md): Returns the given number of first requests from the queue and locks them for the given time.
- [Update request queue](https://docs.apify.com/api/v2/request-queue-put.md): Updates a request queue's name using a value specified by a JSON object passed in the PUT payload.
- [Delete request](https://docs.apify.com/api/v2/request-queue-request-delete.md): Deletes given request from queue.
- [Get request](https://docs.apify.com/api/v2/request-queue-request-get.md): Returns request from queue.
- [Delete request lock](https://docs.apify.com/api/v2/request-queue-request-lock-delete.md): Deletes a request lock.
- [Prolong request lock](https://docs.apify.com/api/v2/request-queue-request-lock-put.md): Prolongs request lock.
- [Update request](https://docs.apify.com/api/v2/request-queue-request-put.md): Updates a request in a queue.
- [Delete requests](https://docs.apify.com/api/v2/request-queue-requests-batch-delete.md): Batch-deletes given requests from the queue.
- [Add requests](https://docs.apify.com/api/v2/request-queue-requests-batch-post.md): Adds requests to the queue in batch.
- [List requests](https://docs.apify.com/api/v2/request-queue-requests-get.md): Returns a list of requests.
- [Add request](https://docs.apify.com/api/v2/request-queue-requests-post.md): Adds request to the queue.
- [Unlock requests](https://docs.apify.com/api/v2/request-queue-requests-unlock-post.md): Unlocks requests in the queue that are currently locked by the client.
- [Get list of request queues](https://docs.apify.com/api/v2/request-queues-get.md): Lists all of a user's request queues.
- [Create request queue](https://docs.apify.com/api/v2/request-queues-post.md): Creates a request queue and returns its object.
- [Delete schedule](https://docs.apify.com/api/v2/schedule-delete.md): Deletes a schedule.
- [Get schedule](https://docs.apify.com/api/v2/schedule-get.md): Gets the schedule object with all details.
- [Get schedule log](https://docs.apify.com/api/v2/schedule-log-get.md): Gets the schedule log as a JSON array containing information about up to 1000 invocations of the schedule.
- [Update schedule](https://docs.apify.com/api/v2/schedule-put.md): Updates a schedule using values specified by a schedule object passed as JSON in the PUT payload.
- [Schedules - Introduction](https://docs.apify.com/api/v2/schedules.md): This section describes API endpoints for managing schedules.
- [Get list of schedules](https://docs.apify.com/api/v2/schedules-get.md): Gets the list of schedules that the user created.
- [Create schedule](https://docs.apify.com/api/v2/schedules-post.md): Creates a new schedule with settings provided by the schedule object passed as JSON in the payload.
- [Datasets - Introduction](https://docs.apify.com/api/v2/storage-datasets.md): This section describes API endpoints to manage Datasets.
- [Key-value stores - Introduction](https://docs.apify.com/api/v2/storage-key-value-stores.md): This section describes API endpoints to manage Key-value stores.
- [Request queues - Introduction](https://docs.apify.com/api/v2/storage-request-queues.md): This section describes API endpoints to create, manage, and delete request queues.
- [Requests - Introduction](https://docs.apify.com/api/v2/storage-request-queues-requests.md): This section describes API endpoints to create, manage, and delete requests within request queues.
- [Requests locks - Introduction](https://docs.apify.com/api/v2/storage-request-queues-requests-locks.md): This section describes API endpoints to create, manage, and delete request locks within request queues.
- [Store - Introduction](https://docs.apify.com/api/v2/store.md): [Apify Store](https://apify.com/store) is home to thousands of public Actors available to the Apify community.
- [Get list of Actors in store](https://docs.apify.com/api/v2/store-get.md): Gets the list of public Actors in Apify Store.
- [Get public user data](https://docs.apify.com/api/v2/user-get.md): Returns public information about a specific user account, similar to what can be seen on public profile pages (e.g.
- [Users - Introduction](https://docs.apify.com/api/v2/users.md): The API endpoints described in this section return information about user accounts.
- [Get private user data](https://docs.apify.com/api/v2/users-me-get.md): Returns information about the current user account, including both public and private information.
- [Get limits](https://docs.apify.com/api/v2/users-me-limits-get.md): Returns a complete summary of your account's limits.
- [Update limits](https://docs.apify.com/api/v2/users-me-limits-put.md): Updates the account's limits manageable on your account's [Limits page](https://console.apify.com/billing#/limits).
- [Get monthly usage](https://docs.apify.com/api/v2/users-me-usage-monthly-get.md): Returns a complete summary of your usage for the current usage cycle, an overall sum, as well as a daily breakdown of usage.
- [Delete webhook](https://docs.apify.com/api/v2/webhook-delete.md): Deletes a webhook.
- [Get webhook dispatch](https://docs.apify.com/api/v2/webhook-dispatch-get.md): Gets webhook dispatch object with all details.
- [Get list of webhook dispatches](https://docs.apify.com/api/v2/webhook-dispatches-get.md): Gets the list of webhook dispatches that the user has.
- [Get webhook](https://docs.apify.com/api/v2/webhook-get.md): Gets webhook object with all details.
- [Update webhook](https://docs.apify.com/api/v2/webhook-put.md): Updates a webhook using values specified by a webhook object passed as JSON in the PUT payload.
- [Test webhook](https://docs.apify.com/api/v2/webhook-test-post.md): Tests a webhook.
- [Get collection](https://docs.apify.com/api/v2/webhook-webhook-dispatches-get.md): Gets a given webhook's list of dispatches.
- [Get list of webhooks](https://docs.apify.com/api/v2/webhooks-get.md): Gets the list of webhooks that the user created.
- [Create webhook](https://docs.apify.com/api/v2/webhooks-post.md): Creates a new webhook with settings provided by the webhook object passed as JSON in the payload.
- [Webhook dispatches - Introduction](https://docs.apify.com/api/v2/webhooks-webhook-dispatches.md): This section describes API endpoints to get webhook dispatches.
- [Webhooks - Introduction](https://docs.apify.com/api/v2/webhooks-webhooks.md): This section describes API endpoints to manage webhooks.

## open-source

- [Apify open source](https://docs.apify.com/open-source.md)

## sdk

- [Apify SDK](https://docs.apify.com/sdk.md)

## search

- [Search the documentation](https://docs.apify.com/search.md)

## Apify academy

- [Web Scraping Academy](https://docs.apify.com/academy.md): Learn everything about web scraping and automation with our free courses that will turn you into an expert scraper developer.
- [Actor marketing playbook](https://docs.apify.com/academy/actor-marketing-playbook.md): Learn how to optimize and monetize your Actors on Apify Store by sharing them with other platform users. [Apify Store](https://apify.com/store) is a marketplace featuring thousands of ready-made automation tools called Actors.
- [Actor description & SEO description](https://docs.apify.com/academy/actor-marketing-playbook/actor-basics/actor-description.md): Learn about Actor description and meta description.
- [Actors and emojis](https://docs.apify.com/academy/actor-marketing-playbook/actor-basics/actors-and-emojis.md): Using emojis in Actors is a science on its own.
- [How to create an Actor README](https://docs.apify.com/academy/actor-marketing-playbook/actor-basics/how-to-create-an-actor-readme.md): Learn how to write a comprehensive README to help users better navigate, understand and run public Actors in Apify Store.
- [Importance of Actor URL](https://docs.apify.com/academy/actor-marketing-playbook/actor-basics/importance-of-actor-url.md): Actor URL (or technical name, as we call it) is the page URL of the Actor shown on the web.
- [Name your Actor](https://docs.apify.com/academy/actor-marketing-playbook/actor-basics/name-your-actor.md): Apify's standards for Actor naming.
- [Emails to Actor users](https://docs.apify.com/academy/actor-marketing-playbook/interact-with-users/emails-to-actor-users.md): Getting users is one thing, but keeping them is another.
- [Handle Actor issues](https://docs.apify.com/academy/actor-marketing-playbook/interact-with-users/issues-tab.md): Once you publish your Actor in Apify Store, it opens the door to new users, feedback, and… issue reports.
- [Your Apify Store bio](https://docs.apify.com/academy/actor-marketing-playbook/interact-with-users/your-store-bio.md): To help our community showcase their talents and projects, we introduced public profile pages for developers.
- [Actor bundles](https://docs.apify.com/academy/actor-marketing-playbook/product-optimization/actor-bundles.md): Learn what an Actor bundle is, explore existing examples, and discover how to promote them.
- [How to create a great input schema](https://docs.apify.com/academy/actor-marketing-playbook/product-optimization/how-to-create-a-great-input-schema.md): Optimizing your input schema.
- [Affiliates](https://docs.apify.com/academy/actor-marketing-playbook/promote-your-actor/affiliates.md): The Apify Affiliate Program offers you a way to earn recurring commissions while helping others discover automation and web scraping solutions.
- [Blogs and blog resources](https://docs.apify.com/academy/actor-marketing-playbook/promote-your-actor/blogs-and-blog-resources.md): Blogs remain a powerful tool for promoting your Actors and establishing authority in the field.
- [Marketing checklist](https://docs.apify.com/academy/actor-marketing-playbook/promote-your-actor/checklist.md): You're a developer, not a marketer.
- [Parasite SEO](https://docs.apify.com/academy/actor-marketing-playbook/promote-your-actor/parasite-seo.md): Do you want to attract more users to your Actors?
- [Product Hunt](https://docs.apify.com/academy/actor-marketing-playbook/promote-your-actor/product-hunt.md): Product Hunt is one of the best platforms for introducing new tools, especially in the tech community.
- [SEO](https://docs.apify.com/academy/actor-marketing-playbook/promote-your-actor/seo.md): SEO means optimizing your content to rank high for your target queries in search engines such as Google, Bing, etc.
- [Social media](https://docs.apify.com/academy/actor-marketing-playbook/promote-your-actor/social-media.md): Social media is a powerful way to connect with your Actor users and potential users.
- [Video tutorials](https://docs.apify.com/academy/actor-marketing-playbook/promote-your-actor/video-tutorials.md): Videos and live streams are powerful tools for connecting with users and potential users, especially when promoting your Actors.
- [Webinars](https://docs.apify.com/academy/actor-marketing-playbook/promote-your-actor/webinars.md): Webinars and live streams are a fantastic way to connect with your audience, showcase your Actor's capabilities, and gather feedback from users.
- [How Actor monetization works](https://docs.apify.com/academy/actor-marketing-playbook/store-basics/how-actor-monetization-works.md): You can turn your web scrapers into a source of income by publishing them on Apify Store.
- [How Apify Store works](https://docs.apify.com/academy/actor-marketing-playbook/store-basics/how-store-works.md): Of the thousands of Actors on the [Apify Store](https://apify.com/store) marketplace, most were created by developers just like you.
- [How to build Actors](https://docs.apify.com/academy/actor-marketing-playbook/store-basics/how-to-build-actors.md): At Apify, we try to make building web scraping and automation straightforward.
- [Wrap open-source as an Actor](https://docs.apify.com/academy/actorization.md): Apify is a cloud platform with a [marketplace](https://apify.com/store) of 6,000+ web scraping and automation tools called _Actors_.
- [Advanced web scraping](https://docs.apify.com/academy/advanced-web-scraping.md): In the [Web scraping basics for JavaScript devs](/academy/web-scraping-for-beginners) course, we have learned the necessary basics required to create a scraper.
- [Crawling sitemaps](https://docs.apify.com/academy/advanced-web-scraping/crawling/crawling-sitemaps.md): In the previous lesson, we learned about the utility (and dangers) of crawling sitemaps.
- [Scraping websites with search](https://docs.apify.com/academy/advanced-web-scraping/crawling/crawling-with-search.md): In this lesson, we will start with a simpler example of scraping HTML-based websites with limited pagination.
- [Sitemaps vs search](https://docs.apify.com/academy/advanced-web-scraping/crawling/sitemaps-vs-search.md): The core crawling problem comes down to ensuring that we reliably find all detail pages on the target website or inside its categories.
- [Tips and tricks for robustness](https://docs.apify.com/academy/advanced-web-scraping/tips-and-tricks-robustness.md): Learn how to make your automated processes more effective.
- [AI agent tutorial](https://docs.apify.com/academy/ai/ai-agents.md): In this section of the Apify Academy, we show you how to build an AI agent with the CrewAI Python framework.
- [Anti-scraping protections](https://docs.apify.com/academy/anti-scraping.md): Understand the various anti-scraping measures different sites use to prevent bots from accessing them, and how to appear more human to fix these issues. If at any point in time you've strayed away from the Academy's demo content, and into the Wild West by writing some scrapers of your own, you may have been hit with anti-scraping measures.
- [Anti-scraping mitigation](https://docs.apify.com/academy/anti-scraping/mitigation.md): After learning about the various anti-scraping techniques websites use, learn how to mitigate them with a few different techniques. In the [techniques](../techniques/index.md) section of this course, you learned about multiple methods websites use to prevent bots from accessing their content.
- [Bypassing Cloudflare browser check](https://docs.apify.com/academy/anti-scraping/mitigation/cloudflare-challenge.md.md): Learn how to bypass the Cloudflare browser challenge with Crawlee. If you find yourself stuck, there are a few strategies that you can employ.
- [Generating fingerprints](https://docs.apify.com/academy/anti-scraping/mitigation/generating-fingerprints.md): Learn how to use two super handy npm libraries to generate fingerprints and inject them into a Playwright or Puppeteer page. In [**Crawlee**](https://crawlee.dev), you can use [**FingerprintOptions**](https://crawlee.dev/api/browser-pool/interface/FingerprintOptions) on a crawler to automatically generate fingerprints.
- [Proxies](https://docs.apify.com/academy/anti-scraping/mitigation/proxies.md): # Proxies {#about-proxies} **Learn all about proxies, how they work, and how they can be leveraged in a scraper to avoid blocking and other anti-scraping tactics.** --- A proxy server provides a gateway between users and the internet, to be more specific in our case - between the crawler and the target website.
- [Using proxies](https://docs.apify.com/academy/anti-scraping/mitigation/using-proxies.md): # Using proxies {#using-proxies} **Learn how to use and automagically rotate proxies in your scrapers by using Crawlee, and a bit about how to obtain pools of proxies.** --- In the [**Web scraping basics for JavaScript devs**](../../scraping_basics_javascript/crawling/pro_scraping.md) course, we learned about the power of Crawlee, and how it can streamline the development process of web crawlers.
- [Anti-scraping techniques](https://docs.apify.com/academy/anti-scraping/techniques.md): # Anti-scraping techniques {#anti-scraping-techniques} **Understand the various common (and obscure) anti-scraping techniques used by websites to prevent bots from accessing their content.** --- In this section, we'll be discussing some of the most common (as well as some obscure) anti-scraping techniques used by websites to detect and block/limit bots from accessing their content.
- [Browser challenges](https://docs.apify.com/academy/anti-scraping/techniques/browser-challenges.md): # Browser challenges {#fingerprinting} > Learn how to navigate browser challenges like Cloudflare's to effectively scrape data from protected websites.
- [Captchas](https://docs.apify.com/academy/anti-scraping/techniques/captchas.md): # Captchas {#captchas} **Learn about the reasons a bot might be presented a captcha, the best ways to avoid captchas in the first place, and how to programmatically solve them.** --- In general, a website will present a user (or scraper) a captcha for 2 main reasons: 1.
- [Fingerprinting](https://docs.apify.com/academy/anti-scraping/techniques/fingerprinting.md): Understand browser fingerprinting, an advanced technique used by websites to track user data and even block bots from accessing them. Browser fingerprinting is a method that some websites use to collect information about a browser's type and version, as well as the operating system being used, any active plugins, the time zone and language of the machine, the screen resolution, and various other active settings.
- [Firewalls](https://docs.apify.com/academy/anti-scraping/techniques/firewalls.md): Understand what web-application firewalls are, how they work, and the various common techniques for avoiding them altogether. A web-application firewall (or **WAF**) is a tool for website admins which allows them to set various access rules for their visitors.
- [Geolocation](https://docs.apify.com/academy/anti-scraping/techniques/geolocation.md): Learn about the geolocation techniques used to determine where requests are coming from, and a bit about how to avoid being blocked based on geolocation. Geolocation is yet another way websites can detect and block access or show limited data.
- [Rate-limiting](https://docs.apify.com/academy/anti-scraping/techniques/rate-limiting.md): Learn about rate-limiting, a common tactic used by websites to avoid a large and non-human rate of requests coming from a single IP address. When crawling a website, a web scraping bot will typically send many more requests from a single IP address than a human user could generate over the same period.
- [Using Apify API](https://docs.apify.com/academy/api.md): A collection of various tutorials explaining how to interact with the Apify platform programmatically using its API. This section explains how you can run [Apify Actors](/platform/actors) using Apify's [API](/api/v2), retrieve their results, and integrate them into your own product and workflows.
- [API scraping](https://docs.apify.com/academy/api-scraping.md): Learn all about how the professionals scrape various types of APIs with various configurations, parameters, and requirements. API scraping is locating a website's API endpoints and fetching the desired data directly from its API, as opposed to parsing the data from its rendered HTML pages.
- [General API scraping](https://docs.apify.com/academy/api-scraping/general-api-scraping.md): Learn the benefits and drawbacks of API scraping, how to locate an API, how to utilize its features, and how to work around common roadblocks. This section will teach you everything you should know about API scraping before moving into the next sections in the **API Scraping** module.
- [Dealing with headers, cookies, and tokens](https://docs.apify.com/academy/api-scraping/general-api-scraping/cookies-headers-tokens.md): Learn about how some APIs require certain cookies, headers, and/or tokens to be present in a request in order for data to be received. Unfortunately, most APIs will require a valid cookie to be included in the `cookie` field within a request's headers in order to be authorized.
- [Handling pagination](https://docs.apify.com/academy/api-scraping/general-api-scraping/handling-pagination.md): Learn about the three most popular API pagination techniques and how to handle each of them when scraping an API with pagination. When scraping large APIs, you'll quickly realize that most APIs limit the number of results they respond with.
- [Locating API endpoints](https://docs.apify.com/academy/api-scraping/general-api-scraping/locating-and-learning.md): Learn how to effectively locate a website's API endpoints, and learn how to use them to get the data you want faster and more reliably. In order to retrieve a website's API endpoints, as well as other data about them, the **Network** tab within Chrome's (or another browser's) DevTools can be used.
- [GraphQL scraping](https://docs.apify.com/academy/api-scraping/graphql-scraping.md): Dig into the topic of scraping APIs which use the latest and greatest API technology: GraphQL.
- [Custom queries](https://docs.apify.com/academy/api-scraping/graphql-scraping/custom-queries.md): Learn how to write custom GraphQL queries, how to pass input values into GraphQL requests as variables, and how to retrieve and output the data from a scraper. Sometimes, the queries found in the **Network** tab aren't good enough for your use case.
- [Introspection](https://docs.apify.com/academy/api-scraping/graphql-scraping/introspection.md): Understand what introspection is, and how it can help you understand a GraphQL API to take advantage of the features it has to offer before writing any code. [Introspection](https://graphql.org/learn/introspection/) is when you make a query to the target GraphQL API requesting information about its schema.
- [Modifying variables](https://docs.apify.com/academy/api-scraping/graphql-scraping/modifying-variables.md): Learn how to modify the variables of a JSON format GraphQL query to use the API without needing to write any GraphQL language or create custom queries. In the introduction of this course, we searched for the term **test** on the [Cheddar](https://www.cheddar.com/) website and discovered a request to their GraphQL API.
- [How to retry failed requests](https://docs.apify.com/academy/api/retry-failed-requests.md): Learn how to re-scrape only failed requests in your run. Requests of a scraper can fail for many reasons.
- [Run Actor and retrieve data via API](https://docs.apify.com/academy/api/run-actor-and-retrieve-data-via-api.md): Learn how to run an Actor/task via the Apify API, wait for the job to finish, and retrieve its output data.
- [Tutorials on Apify Actors](https://docs.apify.com/academy/apify-actors.md): Learn how to deploy your API project to the Apify platform. This tutorial shows you how to add your existing RapidAPI project to Apify, giving you access to managed hosting, data storage, and a broader user base through Apify Store while maintaining your RapidAPI presence.
- [Adding your RapidAPI project to Apify](https://docs.apify.com/academy/apify-actors/adding-rapidapi-project.md): If you've published an API project on [RapidAPI](https://rapidapi.com/), you can expand your project's visibility by listing it on Apify Store.
- [Introduction to the Apify platform](https://docs.apify.com/academy/apify-platform.md): Learn all about the Apify platform, all of the tools it offers, and how it can improve your overall development experience. The [Apify platform](https://apify.com) was built to serve large-scale and high-performance web scraping and automation needs.
- [Using ready-made Apify scrapers](https://docs.apify.com/academy/apify-scrapers.md): Discover Apify's ready-made web scraping and automation tools.
- [Scraping with Cheerio Scraper](https://docs.apify.com/academy/apify-scrapers/cheerio-scraper.md): This scraping tutorial will go into the nitty-gritty details of extracting data from **https://apify.com/store** using **Cheerio Scraper** ([apify/cheerio-scraper](https://apify.com/apify/cheerio-scraper)).
- [Getting started with Apify scrapers](https://docs.apify.com/academy/apify-scrapers/getting-started.md): Welcome to the getting started tutorial!
- [Scraping with Puppeteer Scraper](https://docs.apify.com/academy/apify-scrapers/puppeteer-scraper.md): This scraping tutorial will go into the nitty-gritty details of extracting data from **https://apify.com/store** using **Puppeteer Scraper** ([apify/puppeteer-scraper](https://apify.com/apify/puppeteer-scraper)).
- [Scraping with Web Scraper](https://docs.apify.com/academy/apify-scrapers/web-scraper.md): This scraping tutorial will go into the nitty-gritty details of extracting data from **https://apify.com/store** using **Web Scraper** ([apify/web-scraper](https://apify.com/apify/web-scraper)).
- [Validate your Actor idea](https://docs.apify.com/academy/build-and-publish/actor-ideas/actor-validation.md): Before investing time into building an Actor, validate that people actually need it.
- [Find ideas for new Actors](https://docs.apify.com/academy/build-and-publish/actor-ideas/find-actor-ideas.md): Learn what kinds of software tools are suitable to be packaged and published as Actors on Apify, and where you can find inspiration for what to build.
- [Why publish Actors on Apify](https://docs.apify.com/academy/build-and-publish/why.md): Publishing Actors on Apify Store transforms your web scraping and automation code into revenue-generating products without the overhead of traditional SaaS development.
- [Concepts 🤔](https://docs.apify.com/academy/concepts.md): Learn about some common yet tricky concepts and terms that are used frequently within the academy, as well as in the world of scraper development. You'll see some terms and concepts frequently repeated throughout various courses in the academy.
- [CSS selectors](https://docs.apify.com/academy/concepts/css-selectors.md): CSS selectors are patterns used to select [HTML elements](./html_elements.md) on a web page.
- [Dynamic pages and single-page applications (SPAs)](https://docs.apify.com/academy/concepts/dynamic-pages.md): Understand what makes a page dynamic, and how a page being dynamic might change your approach when writing a scraper for it. Oftentimes, web pages load additional information dynamically, long after their main body is loaded in the browser.
- [HTML elements](https://docs.apify.com/academy/concepts/html-elements.md): An HTML element is a building block of an HTML document.
- [HTTP cookies](https://docs.apify.com/academy/concepts/http-cookies.md): Learn a bit about what cookies are, and how they are utilized in scrapers to appear logged-in, view specific data, or even avoid blocking. HTTP cookies are small pieces of data sent by the server to the user's web browser, which are typically stored by the browser and used to send later requests to the same server.
- [HTTP headers](https://docs.apify.com/academy/concepts/http-headers.md): Understand what HTTP headers are, what they're used for, and three of the biggest differences between HTTP/1.1 and HTTP/2 headers. [HTTP headers](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers) let the client and the server pass additional information with an HTTP request or response.
- [Querying elements](https://docs.apify.com/academy/concepts/querying-css-selectors.md): `document.querySelector()` and `document.querySelectorAll()` are JavaScript functions that allow you to select elements on a web page using [CSS selectors](./css_selectors.md).
- [What is robotic process automation (RPA)?](https://docs.apify.com/academy/concepts/robotic-process-automation.md)
- [Deploying your code to Apify](https://docs.apify.com/academy/deploying-your-code.md): In this course, learn how to take an existing project of yours and deploy it to the Apify platform as an Actor. This section will discuss how to use your newfound knowledge of the Apify platform and Actors from the [**Getting started**](../getting_started/index.md) section to deploy your existing project's code to the Apify platform as an Actor.
- [Creating dataset schema](https://docs.apify.com/academy/deploying-your-code/dataset-schema.md): Learn how to generate an appealing Overview table interface to preview your Actor results in real time on the Apify platform. The dataset schema generates an interface that enables users to instantly preview their Actor results in real time.
- [Publishing your Actor](https://docs.apify.com/academy/deploying-your-code/deploying.md): Push local code to the platform, or create a new Actor on the console and integrate it with a Git repository to optionally automatically rebuild any new changes. Once you've **actorified** your code, there are two ways to deploy it to the Apify platform.
- [Creating Actor Dockerfile](https://docs.apify.com/academy/deploying-your-code/docker-file.md): Understand how to write a Dockerfile (Docker image blueprint) for your project so that it can be run within a Docker container on the Apify platform. The **Dockerfile** is a file which gives the Apify platform (or Docker, more specifically) instructions on how to create an environment for your code to run in.
- [How to write Actor input schema](https://docs.apify.com/academy/deploying-your-code/input-schema.md): Learn how to generate a user interface on the platform for your Actor's input with a single file, the INPUT_SCHEMA.json file. Though writing an [input schema](/platform/actors/development/actor-definition/input-schema) for an Actor is not a required step, it is most definitely an ideal one.
- [Managing Actor inputs and outputs](https://docs.apify.com/academy/deploying-your-code/inputs-outputs.md): Learn to accept input into your Actor, do something with it, and then return output.
- [Expert scraping with Apify](https://docs.apify.com/academy/expert-scraping-with-apify.md): After learning the basics of Actors and Apify, learn to develop pro-level scrapers on the Apify platform with this advanced course. This course will teach you the nitty-gritty of what it takes to build pro-level scrapers with Apify.
- [Webhooks & advanced Actor overview](https://docs.apify.com/academy/expert-scraping-with-apify/actors-webhooks.md): Learn more advanced details about Actors, how they work, and the default configurations they can take.
- [Apify API & client](https://docs.apify.com/academy/expert-scraping-with-apify/apify-api-and-client.md): Gain an in-depth understanding of the two main ways of programmatically interacting with the Apify platform: through the API, and through a client. You can use one of the two main ways to programmatically interact with the Apify platform: by directly using [Apify's RESTful API](/api/v2), or by using the [JavaScript](/api/client/js) and [Python](/api/client/python) API clients.
- [Bypassing anti-scraping methods](https://docs.apify.com/academy/expert-scraping-with-apify/bypassing-anti-scraping.md): Learn about bypassing anti-scraping methods using proxies and proxy/session rotation together with Crawlee and the Apify SDK. Effectively bypassing anti-scraping software is one of the most crucial, but also one of the most difficult skills to master.
- [Managing source code](https://docs.apify.com/academy/expert-scraping-with-apify/managing-source-code.md): Learn how to manage your Actor's source code more efficiently by integrating it with a GitHub repository.
- [Migrations & maintaining state](https://docs.apify.com/academy/expert-scraping-with-apify/migrations-maintaining-state.md): Learn about what Actor migrations are and how to handle them properly so that the state is not lost and runs can safely be resurrected. We already know that Actors are Docker containers that can be run on any server.
- [Saving useful run statistics](https://docs.apify.com/academy/expert-scraping-with-apify/saving-useful-stats.md): Understand how to save statistics about an Actor's run, what types of statistics you can save, and why you might want to save them for a large-scale scraper. Using Crawlee and the Apify SDK, we are now able to collect and format data coming directly from websites and save it into a Key-Value store or Dataset.
- [Solutions](https://docs.apify.com/academy/expert-scraping-with-apify/solutions.md): View all of the solutions for all of the activities and tasks of this course.
- [Handling migrations](https://docs.apify.com/academy/expert-scraping-with-apify/solutions/handling-migrations.md): Get real-world experience of maintaining a stateful object stored in memory, which will be persisted through migrations and even graceful aborts. Let's first head into our **demo-actor** and create a new file named **asinTracker.js** in the **src** folder.
- [Integrating webhooks](https://docs.apify.com/academy/expert-scraping-with-apify/solutions/integrating-webhooks.md): Learn how to integrate webhooks into your Actors.
- [Managing source](https://docs.apify.com/academy/expert-scraping-with-apify/solutions/managing-source.md): View in-depth answers for all three of the quiz questions that were provided in the corresponding lesson about managing source code. In the lesson corresponding to this solution, we discussed an extremely important topic: source code management.
- [Rotating proxies/sessions](https://docs.apify.com/academy/expert-scraping-with-apify/solutions/rotating-proxies.md): Learn firsthand how to rotate proxies and sessions in order to avoid the majority of the most common anti-scraping protections. If you take a look at our current code for the Amazon scraping Actor, you might notice this snippet: `const proxyConfiguration = await Actor.createProxyConfiguration({ groups: ['RESIDENTIAL'], });` We didn't provide much explanation for this initially, as it was not directly relevant to the lesson at hand.
- [Saving run stats](https://docs.apify.com/academy/expert-scraping-with-apify/solutions/saving-stats.md): Implement the saving of general statistics about an Actor's run, as well as adding request-specific statistics to dataset items. The code in this solution will be similar to what we already did in the **Handling migrations** solution; however, we'll be storing and logging different data.
- [Using the Apify API & JavaScript client](https://docs.apify.com/academy/expert-scraping-with-apify/solutions/using-api-and-client.md): Learn how to interact with the Apify API directly through the well-documented RESTful routes, or by using the proprietary Apify JavaScript client. Since we need to create another Actor, we'll once again use the `apify create` command and start from an empty template.
- [Using storage & creating tasks](https://docs.apify.com/academy/expert-scraping-with-apify/solutions/using-storage-creating-tasks.md): Quiz answers 📝 **Q: What is the relationship between Actors and tasks?** **A:** Tasks are pre-configured runs of Actors.
- [Tasks & storage](https://docs.apify.com/academy/expert-scraping-with-apify/tasks-and-storage.md): Understand how to save the configurations for Actors with Actor tasks.
- [Monetizing your Actor](https://docs.apify.com/academy/get-most-of-actors/monetizing-your-actor.md): Learn how you can monetize your web scraping and automation projects by publishing Actors to users in Apify Store. When you publish your Actor on the Apify platform, you have the option to make it a _Paid Actor_ and earn revenue from users who benefit from your tool.
- [Getting started](https://docs.apify.com/academy/getting-started.md): Get started with the Apify platform by creating an account and learning about the Apify Console, which is where all Apify Actors are born! Your gateway to the Apify platform is your Apify account.
- [Actors](https://docs.apify.com/academy/getting-started/actors.md): What is an Actor?
- [The Apify API](https://docs.apify.com/academy/getting-started/apify-api.md): Learn how to use the Apify API to programmatically call your Actors, retrieve data stored on the platform, view Actor logs, and more! [Apify's API](/api/v2) is your ticket to the Apify platform without even needing to access the [Apify Console](https://console.apify.com?asrc=developers_portal) web-interface.
- [Apify client](https://docs.apify.com/academy/getting-started/apify-client.md): Interact with the Apify API in your code by using the apify-client package, which is available for both JavaScript and Python. Now that you've gotten your toes wet with interacting with the Apify API through raw HTTP requests, you're ready to become familiar with the **Apify client**, which is a package available for both JavaScript and Python that allows you to interact with the API in your code without explicitly needing to make any GET or POST requests.
- [Creating Actors](https://docs.apify.com/academy/getting-started/creating-actors.md): This lesson offers hands-on experience in building and running Actors in Apify Console using a template.
- [Inputs & outputs](https://docs.apify.com/academy/getting-started/inputs-outputs.md): Create an Actor from scratch which takes an input, processes that input, and then outputs a result that can be used elsewhere. Actors, like any other programs, take inputs and generate outputs.
- [Why a glossary?](https://docs.apify.com/academy/glossary.md)
- [Scraping with Node.js](https://docs.apify.com/academy/node-js.md): A collection of various Node.js tutorials on scraping sitemaps, optimizing your scrapers, using popular Node.js web scraping libraries, and more. This section contains various web-scraping or web-scraping related tutorials for Node.js.
- [How to add external libraries to Web Scraper](https://docs.apify.com/academy/node-js/add-external-libraries-web-scraper.md): Sometimes you need to use some extra JavaScript in your [Web Scraper](https://apify.com/apify/web-scraper) page functions.
- [How to analyze and fix errors when scraping a website](https://docs.apify.com/academy/node-js/analyzing-pages-and-fixing-errors.md): Learn how to deal with random crashes in your web-scraping and automation jobs.
- [Apify's free Google SERP API](https://docs.apify.com/academy/node-js/apify-free-google-serp-api.md): Do you need to regularly grab SERP data about your target keywords?
- [Avoid EACCES error in Actor builds with a custom Dockerfile](https://docs.apify.com/academy/node-js/avoid-eacces-error-in-actor-builds.md): Sometimes when building an Actor using a custom Dockerfile, you might receive errors like: `Missing write access to ...`
- [Block requests in Puppeteer](https://docs.apify.com/academy/node-js/block-requests-puppeteer.md): Improve performance: use `blockRequests`. Unfortunately, in the recent version of Puppeteer, request interception disables the native cache and slows down the Actor significantly.
- [How to optimize Puppeteer by caching responses](https://docs.apify.com/academy/node-js/caching-responses-in-puppeteer.md): Learn why it is important for performance to cache responses in memory when intercepting requests in Puppeteer and how to implement it in your code. In the latest version of Puppeteer, the request-interception function inconveniently disables the native cache and significantly slows down the crawler.
- [How to choose the right scraper for the job](https://docs.apify.com/academy/node-js/choosing-the-right-scraper.md): Learn basic web scraping concepts to help you analyze a website and choose the best scraper for your particular use case. There are two main ways to proceed with building your crawler.
- [How to scrape from dynamic pages](https://docs.apify.com/academy/node-js/dealing-with-dynamic-pages.md): Learn about dynamic pages and dynamic content.
- [Running code in a browser console](https://docs.apify.com/academy/node-js/debugging-web-scraper.md): A lot of beginners struggle through trial and error while scraping a simple site.
- [Filter out blocked proxies using sessions](https://docs.apify.com/academy/node-js/filter-blocked-requests-using-sessions.md): This article explains how the problem was solved before the [SessionPool](/sdk/js/docs/api/session-pool) class was added into [Apify SDK](/sdk/js/).
- [BasicCrawler](https://docs.apify.com/academy/node-js/handle-blocked-requests-puppeteer.md): One of the main defense mechanisms websites use to ensure they are not scraped by bots is allowing only a limited number of requests from a specific IP address.
- [How to fix 'Target closed' error in Puppeteer and Playwright](https://docs.apify.com/academy/node-js/how_to_fix_target-closed.md): Learn about common causes for the 'Target closed' error in browser automation and what you can do to fix it. The `Target closed` error happens when you try to access the `page` object (or some of its parent objects like the `browser`), but the underlying browser tab has already been closed.
- [How to save screenshots from Puppeteer](https://docs.apify.com/academy/node-js/how-to-save-screenshots-puppeteer.md): A good way to debug your Puppeteer crawler in Apify Actors is to save a screenshot of a browser window to the Apify key-value store.
- [How to scrape hidden JavaScript objects in HTML](https://docs.apify.com/academy/node-js/js-in-html.md): Learn about "hidden" data found within the JavaScript of certain pages, which can increase the scraper reliability and improve your development experience. Depending on the technology the target website is using, the data to be collected can be found not only within HTML elements, but also in a JSON format within `<script>` tags in the DOM.
- [Scrape website in parallel with multiple Actor runs](https://docs.apify.com/academy/node-js/multiple-runs-scrape.md): Learn how to run multiple instances of an Actor to scrape a website faster.
- [How to optimize and speed up your web scraper](https://docs.apify.com/academy/node-js/optimizing-scrapers.md): We all want our scrapers to run as cost-effectively as possible.
- [Enqueuing start pages for all keywords](https://docs.apify.com/academy/node-js/processing-multiple-pages-web-scraper.md): Sometimes you need to process the same URL several times, but each time with a different setup.
- [Request labels and how to pass data to other requests](https://docs.apify.com/academy/node-js/request-labels-in-apify-actors.md): Are you trying to use Actors for the first time and don't know how to deal with the request label or how to pass data to the request?
- [How to scrape from sitemaps](https://docs.apify.com/academy/node-js/scraping-from-sitemaps.md): Processing sitemaps automatically with Crawlee: Crawlee allows you to scrape sitemaps with ease.
- [How to scrape sites with a shadow DOM](https://docs.apify.com/academy/node-js/scraping-shadow-doms.md): The shadow DOM enables isolation of web components, but causes problems for those building web scrapers.
- [Scraping a list of URLs from a Google Sheets document](https://docs.apify.com/academy/node-js/scraping-urls-list-from-google-sheets.md): You can export URLs from [Google Sheets](https://workspace.google.com/products/sheets/) such as [this one](https://docs.google.com/spreadsheets/d/1-2mUcRAiBbCTVA5KcpFdEYWflLMLp9DDU3iJutvES4w) directly into an [Actor](/platform/actors)'s Start URLs field.
- [Downloading the file to memory](https://docs.apify.com/academy/node-js/submitting-form-with-file-attachment.md): When doing web automation with Apify, it can sometimes be necessary to submit an HTML form with a file attachment.
- [Submitting forms on .ASPX pages](https://docs.apify.com/academy/node-js/submitting-forms-on-aspx-pages.md): Apify users sometimes need to submit a form on pages created with ASP.NET (URL typically ends with .aspx).
- [Using man-in-the-middle proxy to intercept requests in Puppeteer](https://docs.apify.com/academy/node-js/using-proxy-to-intercept-requests-puppeteer.md): Sometimes you may need to intercept (or maybe block) requests in headless Chrome / Puppeteer, but `page.setRequestInterception()` is not 100% reliable when the request is started in a new window.
- [Waiting for dynamic content](https://docs.apify.com/academy/node-js/waiting-for-dynamic-content.md): Use these helper functions to wait for data: `page.waitFor` in [Puppeteer](https://pptr.dev/) (or Puppeteer Scraper ([apify/puppeteer-scraper](https://apify.com/apify/puppeteer-scraper))).
- [When to use Puppeteer Scraper](https://docs.apify.com/academy/node-js/when-to-use-puppeteer-scraper.md): You may have read in the [Web Scraper](https://apify.com/apify/web-scraper) readme or somewhere else at Apify that [Puppeteer Scraper](https://apify.com/apify/puppeteer-scraper) is more powerful and gives you more control over the browser, enabling you to do almost anything.
- [How to use Apify from PHP](https://docs.apify.com/academy/php/use-apify-from-php.md): Apify's [RESTful API](https://docs.apify.com/api/v2#) allows you to use the platform from basically anywhere.
- [Puppeteer & Playwright course](https://docs.apify.com/academy/puppeteer-playwright.md): Learn in-depth how to use two of the most popular Node.js libraries for controlling a headless browser: Puppeteer and Playwright. [Puppeteer](https://pptr.dev/) and [Playwright](https://playwright.dev/) are libraries that allow you to automate browsing.
- [Browser](https://docs.apify.com/academy/puppeteer-playwright/browser.md): Understand what the Browser object is in Puppeteer/Playwright, how to create one, and a bit about how to interact with one. In order to automate a browser in Playwright or Puppeteer, we need to open one up programmatically.
- [Creating multiple browser contexts](https://docs.apify.com/academy/puppeteer-playwright/browser-contexts.md): Learn what a browser context is, how to create one, how to emulate devices, and how to use browser contexts to automate multiple sessions at one time. A [**BrowserContext**](https://playwright.dev/docs/api/class-browsercontext) is an isolated incognito session within a **Browser** instance.
- [Common use cases](https://docs.apify.com/academy/puppeteer-playwright/common-use-cases.md): Learn about some of the most common use cases of Playwright and Puppeteer, and how to handle these use cases when you run into them. You can do just about anything with a headless browser, but there are some extremely common use cases that are important to understand and be prepared for.
- [Downloading files](https://docs.apify.com/academy/puppeteer-playwright/common-use-cases/downloading-files.md): Learn how to automatically download and save files to the disk using two of the most popular web automation libraries, Puppeteer and Playwright. Downloading a file using Puppeteer can be tricky.
- [Logging into a website](https://docs.apify.com/academy/puppeteer-playwright/common-use-cases/logging-into-a-website.md): Understand the "login flow": logging into a website, then maintaining a logged-in status within different browser contexts for an efficient automation process. Whether it's auto-renewing a service, automatically sending a message on an interval, or automatically cancelling a Netflix subscription, one of the most popular things headless browsers are used for is automating things within a user's account on a certain website.
- [Paginating through results](https://docs.apify.com/academy/puppeteer-playwright/common-use-cases/paginating-through-results.md): Learn how to paginate through results on websites that use either pagination based on page numbers or dynamic lazy loading. If you're trying to [collect data](../executing_scripts/extracting_data.md) on a website that has millions, thousands, or even hundreds of results, it is very likely that they are paginating their results to reduce strain on their back-end as well as on the users loading and rendering the content.
- [Scraping iFrames](https://docs.apify.com/academy/puppeteer-playwright/common-use-cases/scraping-iframes.md): Extracting data from iFrames can be frustrating.
- [Submitting a form with a file attachment](https://docs.apify.com/academy/puppeteer-playwright/common-use-cases/submitting-a-form-with-a-file-attachment.md): Understand how to download a file, attach it to a form using a headless browser in Playwright or Puppeteer, then submit the form. We can use Puppeteer or Playwright to simulate submitting the same way a human-operated browser would.
- [Executing scripts](https://docs.apify.com/academy/puppeteer-playwright/executing-scripts.md): Understand the two different contexts which your code can be run in, and how to run custom scripts in the context of the browser. An important concept to understand when dealing with headless browsers is the **context** in which your code is being run.
- [Extracting data](https://docs.apify.com/academy/puppeteer-playwright/executing-scripts/collecting-data.md): Learn how to extract data from a page with evaluate functions, then how to parse it by using a second library called Cheerio. Now that we know how to execute scripts on a page, we're ready to learn a bit about [data extraction](../../scraping_basics_javascript/data_extraction/index.md).
- [Injecting code](https://docs.apify.com/academy/puppeteer-playwright/executing-scripts/injecting-code.md): Learn how to inject scripts prior to a page's load (pre-injecting), as well as how to expose functions to be run at a later time on the page. In the previous lesson, we learned how to execute code on the page using `page.evaluate()`, and though this fits the majority of use cases, there are still some more unusual cases.
- [Opening a page](https://docs.apify.com/academy/puppeteer-playwright/page.md): Learn how to create and open a Page with a Browser, and how to use it to visit and programmatically interact with a website. When you open up your regular browser and visit a website, you open up a new page (or tab) before entering the URL in the search bar and hitting the **Enter** key.
- [Interacting with a page](https://docs.apify.com/academy/puppeteer-playwright/page/interacting-with-a-page.md): Learn how to programmatically do actions on a page such as clicking, typing, and pressing keys.
- [Page methods](https://docs.apify.com/academy/puppeteer-playwright/page/page-methods.md): Understand that the Page object has many different methods to offer, and learn how to use two of them to capture a page's title and take a screenshot. Other than having methods for interacting with a page and waiting for events and elements, the **Page** object also supports various methods for doing other things, such as [reloading](https://pptr.dev/api/puppeteer.page.reload), [screenshotting](https://playwright.dev/docs/api/class-page#page-screenshot), [changing headers](https://playwright.dev/docs/api/class-page#page-set-extra-http-headers), and extracting the [page's content](https://pptr.dev/api/puppeteer.page.content).
- [Waiting for elements and events](https://docs.apify.com/academy/puppeteer-playwright/page/waiting.md): Learn the importance of waiting for content and events before running interaction or extraction code, as well as the best practices for doing so. In a perfect world, every piece of content served on a website would be loaded instantaneously.
- [Using proxies](https://docs.apify.com/academy/puppeteer-playwright/proxies.md): Understand how to use proxies in your Puppeteer and Playwright requests, as well as a couple of the most common use cases for proxies. [Proxies](../anti_scraping/mitigation/proxies.md) are a great way of appearing as if you are making requests from a different location.
- [Reading & intercepting requests](https://docs.apify.com/academy/puppeteer-playwright/reading-intercepting-requests.md): You can use DevTools, but did you know that you can do all the same stuff (plus more) programmatically?
- [Scraping with Python](https://docs.apify.com/academy/python.md): A collection of various Python tutorials to aid you in your journey to becoming a master web scraping and automation developer. This section contains various web scraping and scraping-related tutorials for Python.
- [How to process data in Python using Pandas](https://docs.apify.com/academy/python/process-data-using-python.md): Learn how to process the resulting data of a web scraper in Python using the Pandas library, and how to visualize the processed data using Matplotlib. In the [previous tutorial](/academy/python/scrape-data-python), we learned how to scrape data from the web in Python using the [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/) library.
- [How to scrape data in Python using Beautiful Soup](https://docs.apify.com/academy/python/scrape-data-python.md): Learn how to create a Python Actor and use Python libraries to scrape, process and visualize data extracted from the web. Web scraping is not limited to the JavaScript world.
- [Run a web server on the Apify platform](https://docs.apify.com/academy/running-a-web-server.md): A web server running in an Actor can act as a communication channel with the outside world.
- [Web scraping basics for JavaScript devs](https://docs.apify.com/academy/scraping-basics-javascript2.md): Learn how to use JavaScript to extract information from websites in this practical course, starting from the absolute basics. In this course we'll use JavaScript to create an application for watching prices.
- [Crawling websites with Node.js](https://docs.apify.com/academy/scraping-basics-javascript2/crawling.md): In this lesson, we'll follow links to individual product pages.
- [Extracting data from a web page with browser DevTools](https://docs.apify.com/academy/scraping-basics-javascript2/devtools-extracting-data.md): In this lesson we'll use the browser tools for developers to manually extract product data from an e-commerce website. In our pursuit to scrape products from the [Sales page](https://warehouse-theme-metal.myshopify.com/collections/sales), we've been able to locate parent elements containing relevant data.
- [Inspecting web pages with browser DevTools](https://docs.apify.com/academy/scraping-basics-javascript2/devtools-inspecting.md): In this lesson we'll use the browser tools for developers to inspect and manipulate the structure of a website. A browser is the most complete tool for navigating websites.
- [Locating HTML elements on a web page with browser DevTools](https://docs.apify.com/academy/scraping-basics-javascript2/devtools-locating-elements.md): In this lesson we'll use the browser tools for developers to manually find products on an e-commerce website. Inspecting Wikipedia and tweaking its subtitle is fun, but let's shift gears and focus on building an app to track prices on an e-commerce site.
- [Downloading HTML with Node.js](https://docs.apify.com/academy/scraping-basics-javascript2/downloading-html.md): In this lesson we'll start building a Node.js application for watching prices.
- [Extracting data from HTML with Node.js](https://docs.apify.com/academy/scraping-basics-javascript2/extracting-data.md): In this lesson we'll finish extracting product data from the downloaded HTML.
- [Using a scraping framework with Node.js](https://docs.apify.com/academy/scraping-basics-javascript2/framework.md): In this lesson, we'll rework our application for watching prices so that it builds on top of a scraping framework.
- [Getting links from HTML with Node.js](https://docs.apify.com/academy/scraping-basics-javascript2/getting-links.md): In this lesson, we'll locate and extract links to individual product pages.
- [Locating HTML elements with Node.js](https://docs.apify.com/academy/scraping-basics-javascript2/locating-elements.md): In this lesson we'll locate product data in the downloaded HTML.
- [Parsing HTML with Node.js](https://docs.apify.com/academy/scraping-basics-javascript2/parsing-html.md): In this lesson we'll look for products in the downloaded HTML.
- [Using a scraping platform with Node.js](https://docs.apify.com/academy/scraping-basics-javascript2/platform.md): In this lesson, we'll deploy our application to a scraping platform that automatically runs it daily.
- [Saving data with Node.js](https://docs.apify.com/academy/scraping-basics-javascript2/saving-data.md): In this lesson, we'll save the data we scraped in popular formats, such as CSV or JSON.
- [Scraping product variants with Node.js](https://docs.apify.com/academy/scraping-basics-javascript2/scraping-variants.md): In this lesson, we'll scrape the product detail pages to represent each product variant as a separate item in our dataset. We'll need to figure out how to extract variants from the product detail page, and then change how we add items to the data list so we can add multiple items after scraping one product URL.
- [Web scraping basics for Python devs](https://docs.apify.com/academy/scraping-basics-python.md): Learn how to use Python to extract information from websites in this practical course, starting from the absolute basics. In this course we'll use Python to create an application for watching prices.
- [Crawling websites with Python](https://docs.apify.com/academy/scraping-basics-python/crawling.md): In this lesson, we'll follow links to individual product pages.
- [Extracting data from a web page with browser DevTools](https://docs.apify.com/academy/scraping-basics-python/devtools-extracting-data.md): In this lesson we'll use the browser tools for developers to manually extract product data from an e-commerce website. In our pursuit to scrape products from the [Sales page](https://warehouse-theme-metal.myshopify.com/collections/sales), we've been able to locate parent elements containing relevant data.
- [Inspecting web pages with browser DevTools](https://docs.apify.com/academy/scraping-basics-python/devtools-inspecting.md): In this lesson we'll use the browser tools for developers to inspect and manipulate the structure of a website. A browser is the most complete tool for navigating websites.
- [Locating HTML elements on a web page with browser DevTools](https://docs.apify.com/academy/scraping-basics-python/devtools-locating-elements.md): In this lesson we'll use the browser tools for developers to manually find products on an e-commerce website. Inspecting Wikipedia and tweaking its subtitle is fun, but let's shift gears and focus on building an app to track prices on an e-commerce site.
- [Downloading HTML with Python](https://docs.apify.com/academy/scraping-basics-python/downloading-html.md): In this lesson we'll start building a Python application for watching prices.
- [Extracting data from HTML with Python](https://docs.apify.com/academy/scraping-basics-python/extracting-data.md): In this lesson we'll finish extracting product data from the downloaded HTML.
- [Using a scraping framework with Python](https://docs.apify.com/academy/scraping-basics-python/framework.md): In this lesson, we'll rework our application for watching prices so that it builds on top of a scraping framework.
- [Getting links from HTML with Python](https://docs.apify.com/academy/scraping-basics-python/getting-links.md): In this lesson, we'll locate and extract links to individual product pages.
- [Locating HTML elements with Python](https://docs.apify.com/academy/scraping-basics-python/locating-elements.md): In this lesson we'll locate product data in the downloaded HTML.
- [Parsing HTML with Python](https://docs.apify.com/academy/scraping-basics-python/parsing-html.md): In this lesson we'll look for products in the downloaded HTML.
- [Using a scraping platform with Python](https://docs.apify.com/academy/scraping-basics-python/platform.md): In this lesson, we'll deploy our application to a scraping platform that automatically runs it daily.
- [Saving data with Python](https://docs.apify.com/academy/scraping-basics-python/saving-data.md): In this lesson, we'll save the data we scraped in popular formats, such as CSV or JSON.
- [Scraping product variants with Python](https://docs.apify.com/academy/scraping-basics-python/scraping-variants.md): In this lesson, we'll scrape the product detail pages to represent each product variant as a separate item in our dataset. We'll need to figure out how to extract variants from the product detail page, and then change how we add items to the data list so we can add multiple items after scraping one product URL.
- [Tools 🔧](https://docs.apify.com/academy/tools.md): Discover a variety of tools that can be used to enhance the scraper development process, or even unlock doors to new scraping possibilities. Here at Apify, we've found many tools, some quite popular and well-known and some niche, which can aid any developer in their scraper development process.
- [The Apify CLI](https://docs.apify.com/academy/tools/apify-cli.md): Learn about, install, and log into the Apify CLI, your best friend for interacting with the Apify platform via your terminal. The [Apify CLI](/cli) helps you create, develop, build and run Apify Actors, and manage the Apify cloud platform from any computer.
- [What's EditThisCookie?](https://docs.apify.com/academy/tools/edit-this-cookie.md)
- [What is Insomnia](https://docs.apify.com/academy/tools/insomnia.md): Learn about Insomnia, a valuable tool for testing requests and proxies when building scalable web scrapers. Despite its name, the [Insomnia](https://insomnia.rest/download) desktop application has absolutely nothing to do with having a lack of sleep.
- [What is ModHeader?](https://docs.apify.com/academy/tools/modheader.md)
- [What is Postman?](https://docs.apify.com/academy/tools/postman.md)
- [What's Proxyman?](https://docs.apify.com/academy/tools/proxyman.md)
- [Quick JavaScript Switcher](https://docs.apify.com/academy/tools/quick-javascript-switcher.md): Discover a handy tool for disabling JavaScript on a certain page to determine how it should be scraped.
- [What is SwitchyOmega?](https://docs.apify.com/academy/tools/switchyomega.md)
- [User-Agent Switcher](https://docs.apify.com/academy/tools/user-agent-switcher.md): Learn how to switch your User-Agent header to different values in order to monitor how a certain site responds to the changes. **User-Agent Switcher** is a Chrome extension that allows you to quickly change your **User-Agent** and see how a certain website would behave with different user agents.
- [Tutorials 📚](https://docs.apify.com/academy/tutorials.md): Learn about various specific topics related to web scraping and web automation with the Apify Academy tutorial lessons! In web scraping, there are a whole lot of niche cases that you will run into.
- [Web scraping basics for JavaScript devs](https://docs.apify.com/academy/web-scraping-for-beginners.md): Learn how to develop web scrapers with this comprehensive and practical course.
- [Best practices when writing scrapers](https://docs.apify.com/academy/web-scraping-for-beginners/best-practices.md): Understand the standards and best practices that we here at Apify abide by to write readable, scalable, and maintainable code. Every developer has their own style, which evolves as they grow and learn.
- [Challenge](https://docs.apify.com/academy/web-scraping-for-beginners/challenge.md): Test your knowledge acquired in the previous sections of this course by building an Amazon scraper using Crawlee's CheerioCrawler! Before moving on to the other courses in the academy, we recommend following along with this section, as it combines everything you've learned in the previous lessons into one cohesive project that helps you prove to yourself that you've thoroughly understood the material.
- [Initialization & setting up](https://docs.apify.com/academy/web-scraping-for-beginners/challenge/initializing-and-setting-up.md): When you extract links from a web page, you often end up with a lot of irrelevant URLs.
- [Modularity](https://docs.apify.com/academy/web-scraping-for-beginners/challenge/modularity.md): Before you build your first web scraper with Crawlee, it is important to understand the concept of modularity in programming. Now that we've gotten our first request going, the first challenge is going to be selecting all of the resulting products on the page.
- [Scraping Amazon](https://docs.apify.com/academy/web-scraping-for-beginners/challenge/scraping-amazon.md): Build your first web scraper with Crawlee.
- [Basics of crawling](https://docs.apify.com/academy/web-scraping-for-beginners/crawling.md): Learn how to crawl the web with your scraper.
- [Exporting data](https://docs.apify.com/academy/web-scraping-for-beginners/crawling/exporting-data.md): Learn how to export the data you scraped using Crawlee to CSV or JSON.
- [Filtering links](https://docs.apify.com/academy/web-scraping-for-beginners/crawling/filtering-links.md): When you extract links from a web page, you often end up with a lot of irrelevant URLs.
- [Finding links](https://docs.apify.com/academy/web-scraping-for-beginners/crawling/finding-links.md): Learn what a link looks like in HTML and how to find and extract their URLs when web scraping using both DevTools and Node.js. Many kinds of links exist on the internet, and we'll cover all the types in the advanced Academy courses.
- [Your first crawl](https://docs.apify.com/academy/web-scraping-for-beginners/crawling/first-crawl.md): Learn how to crawl the web using Node.js, Cheerio and an HTTP client.
- [Headless browsers](https://docs.apify.com/academy/web-scraping-for-beginners/crawling/headless-browser.md): Learn how to scrape the web with a headless browser using only a few lines of code.
- [Professional scraping 👷](https://docs.apify.com/academy/web-scraping-for-beginners/crawling/pro-scraping.md): Learn how to build scrapers quicker and get better and more robust results by using Crawlee, an open-source library for scraping in Node.js. While it's definitely an interesting exercise to do all the programming manually, and we hope you enjoyed it, it's neither the most effective, nor the most efficient way of scraping websites.
- [Recap of data extraction basics](https://docs.apify.com/academy/web-scraping-for-beginners/crawling/recap-extraction-basics.md): Review our e-commerce website scraper and refresh our memory about its code and the programming techniques we used to extract and save the data. We finished off the [first section](../data_extraction/index.md) of the _Web scraping basics for JavaScript devs_ course by creating a web scraper in Node.js.
- [Relative URLs](https://docs.apify.com/academy/web-scraping-for-beginners/crawling/relative-urls.md): Learn about absolute and relative URLs used on web pages and how to work with them when parsing HTML with Cheerio in your scraper. You might have noticed in the previous lesson that while printing URLs to the DevTools console, they would always show in full length, like `https://warehouse-theme-metal.myshopify.com/products/denon-ah-c720-in-ear-headphones`. But in the Elements tab, when checking the `href` attributes, the URLs would look like `/products/denon-ah-c720-in-ear-headphones`. What's up with that?
- [Scraping data](https://docs.apify.com/academy/web-scraping-for-beginners/crawling/scraping-the-data.md): Learn how to add data extraction logic to your crawler, which will allow you to extract data from all the websites you crawled. At the [very beginning of this course](../index.md), we learned that the term web scraping usually means a combined process of data extraction and crawling.
- [Basics of data extraction](https://docs.apify.com/academy/web-scraping-for-beginners/data-extraction.md): Learn about HTML, CSS, and JavaScript, the basic building blocks of a website, and how to use them in web scraping and data extraction. Every web scraping project starts with some detective work.
- [Starting with browser DevTools](https://docs.apify.com/academy/web-scraping-for-beginners/data-extraction/browser-devtools.md): Learn about browser DevTools, a valuable tool in the world of web scraping, and how you can use them to extract data from a website. Even though DevTools stands for developer tools, everyone can use them to inspect a website.
- [Prepare your computer for programming](https://docs.apify.com/academy/web-scraping-for-beginners/data-extraction/computer-preparation.md): Set up your computer to be able to code scrapers with Node.js and JavaScript.
- [Extracting data with DevTools](https://docs.apify.com/academy/web-scraping-for-beginners/data-extraction/devtools-continued.md): Continue learning how to extract data from a website using browser DevTools, CSS selectors, and JavaScript via the DevTools console. In the previous parts of the DevTools tutorial, we were able to extract information about a single product from the Sales collection of the [Warehouse store](https://warehouse-theme-metal.myshopify.com/collections/sales).
- [Extracting data with Node.js](https://docs.apify.com/academy/web-scraping-for-beginners/data-extraction/node-continued.md): Continue learning how to create a web scraper with Node.js and Cheerio.
- [Scraping with Node.js](https://docs.apify.com/academy/web-scraping-for-beginners/data-extraction/node-js-scraper.md): Learn how to use JavaScript and Node.js to create a web scraper, plus take advantage of the Cheerio and Got-scraping libraries to make your job easier. Finally, we have everything ready to start scraping!
- [Setting up your project](https://docs.apify.com/academy/web-scraping-for-beginners/data-extraction/project-setup.md): Create a new project with npm and Node.js.
- [Saving results to CSV](https://docs.apify.com/academy/web-scraping-for-beginners/data-extraction/save-to-csv.md): Learn how to save the results of your scraper's collected data to a CSV file that can be opened in Excel, Google Sheets, or any other spreadsheet program. In the last lesson, we were able to extract data about all the on-sale products from [Warehouse Store](https://warehouse-theme-metal.myshopify.com/collections/sales).
- [Finding elements with DevTools](https://docs.apify.com/academy/web-scraping-for-beginners/data-extraction/using-devtools.md): Learn how to use browser DevTools, CSS selectors, and JavaScript via the DevTools console to extract data from a website. With the knowledge of the basics of DevTools we can finally try doing something more practical: extracting data from a website.
- [Introduction](https://docs.apify.com/academy/web-scraping-for-beginners/introduction.md): Start learning about web scraping, web crawling, data extraction, and popular tools to start developing your own scraper. Web scraping or crawling?

## Legal documents

- [Apify Legal](https://docs.apify.com/legal.md): Company details (Impressum): **Apify Technologies s.r.o.** Registered seat: Vodickova 704/36, 110 00 Prague 1, Czech Republic. VAT ID: CZ04788290 (EU), GB373153700 (UK). Company ID: 04788290. Czech limited liability company registered in the [Commercial Register](https://or.justice.cz/ias/ui/rejstrik-firma.vysledky?subjektId=924944&typ=PLATNY) kept by the Municipal Court of Prague, File No.: C 253224. Represented by managing director Jan Čurn. IBAN: CZ0355000000000027434378. SWIFT / BIC: RZBCCZPP. Contacts: General: hello@apify.com; Legal team contact: legal@apify.com; Privacy team contact: privacy@apify.com; Apify Trust Center: https://trust.apify.com/. Trademarks: "APIFY" is a word trademark registered with USPTO (4517178), EUIPO (011628377), UKIPO (UK00911628377), and DPMA (3020120477984).
- [Apify Acceptable Use Policy](https://docs.apify.com/legal/acceptable-use-policy.md): Apify Technologies s.r.o., with its registered seat at Vodičkova 704/36, Nové Město, 110 00 Prague 1, Czech Republic, Company ID No.: 04788290, registered in the Commercial Register kept by the Municipal Court of Prague, File No.: C 253224 (hereinafter referred to as “**we**” or “**Apify**”), is committed to making sure that the Platform and the Website are being used only for legitimate and legal purposes.
- [Apify Affiliate Program Terms and Conditions](https://docs.apify.com/legal/affiliate-program-terms-and-conditions.md): Effective date: May 14, 2024. Latest version effective from: July 5, 2025. **Apify Technologies s.r.o.**, with its registered seat at Vodičkova 704/36, 110 00 Prague 1, Czech Republic, Company reg.
- [Apify Candidate Referral Program](https://docs.apify.com/legal/candidate-referral-program-terms.md): Last Updated: April 14, 2025. Apify Technologies s.r.o., as the announcer (“**Apify**”), is constantly looking for new employees and prefers to recruit people based on credible references. Therefore, Apify is announcing this public candidate referral program.
- [Apify $1M Challenge Terms and Conditions](https://docs.apify.com/legal/challenge-terms-and-conditions.md): Effective date: November 3, 2025. Apify Technologies s.r.o., a company registered in the Czech Republic, with its registered office at Vodičkova 704/36, 110 00 Prague 1, Czech Republic, Company ID No.: 04788290 ("**Apify**", "**we**", "**us**") offers you (also referred to as "**participant**") the opportunity to enroll in the "Apify \$1M Challenge" ("**Challenge**"), which is subject to the following "Apify \$1M Challenge Terms and Conditions" ("**Challenge Terms**").
- [Apify Community Code of Conduct](https://docs.apify.com/legal/community-code-of-conduct.md): Effective Date: August 18, 2025. Apify community is intended to be a place for further collaboration, support, and brainstorming.
- [Apify Cookie Policy](https://docs.apify.com/legal/cookie-policy.md): **Apify Technologies s.r.o.**, with its registered seat at Vodičkova 704/36, 110 00 Prague 1, Czech Republic, Company reg.
- [Apify Data Processing Addendum](https://docs.apify.com/legal/data-processing-addendum.md): Last Updated: January 13, 2025. If you wish to execute this DPA, continue [here](https://eform.pandadoc.com/?eform=5344745e-5f8e-44eb-bcbd-1a2f45dbd692) and follow instructions in the PandaDoc form.
- [Apify Event Terms and Conditions](https://docs.apify.com/legal/event-terms-and-conditions.md): Effective date: November 3, 2025. These Event Terms and Conditions ("**Terms**") apply to all Events organized or co-organized by Apify Technologies s.r.o., a company registered in the Czech Republic, with its registered office at Vodičkova 704/36, 110 00 Prague 1, Czech Republic, Company ID No.: 04788290 ("**Apify**", "**we**", "**us**"), whether in-person, hybrid, or online ("**Events**").
- [Apify Open Source Fair Share Program Terms and Conditions](https://docs.apify.com/legal/fair-share-program-terms-and-conditions.md): You are reading terms and conditions that are no longer effective.
- [Apify GDPR Information](https://docs.apify.com/legal/gdpr-information.md): The European Union (“**EU**”) General Data Protection Regulation (“**GDPR**”) replaces the 1995 EU Data Protection Directive.
- [Apify General Terms and Conditions](https://docs.apify.com/legal/general-terms-and-conditions.md): Effective date: May 14, 2024. Apify Technologies s.r.o., with its registered seat at Vodičkova 704/36, 110 00 Prague 1, Czech Republic, Company reg.
- [Apify General Terms and Conditions October 2022](https://docs.apify.com/legal/old/general-terms-and-conditions-october-2022.md): You are reading terms and conditions that are no longer effective.
- [Apify Store Publishing Terms and Conditions December 2022](https://docs.apify.com/legal/old/store-publishing-terms-and-conditions-december-2022.md): You are reading terms and conditions that are no longer effective.
- [Apify Privacy Policy](https://docs.apify.com/legal/privacy-policy.md): Last Updated: February 10, 2025. Welcome to the Apify Privacy Policy!
- [Apify Store Publishing Terms and Conditions](https://docs.apify.com/legal/store-publishing-terms-and-conditions.md): Last updated: February 26, 2025. Apify Technologies s.r.o., with its registered seat at Vodičkova 704/36, 110 00 Prague 1, Czech Republic, Company reg.
- [Apify Whistleblowing Policy](https://docs.apify.com/legal/whistleblowing-policy.md): # Apify Whistleblowing Policy [Czech-language version below] Last updated: April 14, 2025 At Apify, we are committed to upholding the highest standards of integrity, ethics, and accountability.

## Platform documentation

- [Apify platform](https://docs.apify.com/platform.md): > **Apify** is a cloud platform that helps you build reliable web scrapers, fast, and automate anything you can do manually in a web browser.
- [Actors](https://docs.apify.com/platform/actors.md): **Learn how to develop, run and share serverless cloud programs.**
- [Actor development](https://docs.apify.com/platform/actors/development.md): **Read about the technical part of building Apify Actors.**
- [Actor definition](https://docs.apify.com/platform/actors/development/actor-definition.md): **Learn how to turn your arbitrary code into an Actor simply by adding an Actor definition directory.** --- A single isolated Actor consists of source code and various settings.
- [actor.json](https://docs.apify.com/platform/actors/development/actor-definition/actor-json.md): **Learn how to write the main Actor configuration in the `.actor/actor.json` file.** --- Your main Actor configuration is in the `.actor/actor.json` file at the root of your Actor's directory.
- [Dataset schema specification](https://docs.apify.com/platform/actors/development/actor-definition/dataset-schema.md): **Learn how to define and present your dataset schema in a user-friendly output UI.** --- The dataset schema defines the structure and representation of data produced by an Actor, both in the API and the visual user interface.
- [Dataset validation](https://docs.apify.com/platform/actors/development/actor-definition/dataset-schema/validation.md): **Specify the dataset schema within the Actors so you can add monitoring and validation at the field level.** --- To define a schema for a default dataset of an Actor run, you need to set `fields` property in the dataset schema.
- [Dockerfile](https://docs.apify.com/platform/actors/development/actor-definition/dockerfile.md): **Learn about the available Docker images you can use as a base for your Apify Actors.**
- [Actor input schema](https://docs.apify.com/platform/actors/development/actor-definition/input-schema.md): **Learn how to define and validate a schema for your Actor's input with code examples.**
- [Secret input](https://docs.apify.com/platform/actors/development/actor-definition/input-schema/secret-input.md): **Learn about making some Actor input fields secret and encrypted.**
- [Actor input schema specification](https://docs.apify.com/platform/actors/development/actor-definition/input-schema/specification/v1.md): **Learn how to define and validate a schema for your Actor's input with code examples.**
- [Key-value store schema specification](https://docs.apify.com/platform/actors/development/actor-definition/key-value-store-schema.md): **Learn how to define and present your key-value store schema to organize records into collections.** --- The key‑value store schema organizes keys into logical groups called collections, which can be used to filter and categorize data both in the API and the visual user interface.
- [Actor output schema](https://docs.apify.com/platform/actors/development/actor-definition/output-schema.md): **Learn how to define and present the output of your Actor.** --- The Actor output schema builds upon the schemas for the [dataset](/platform/actors/development/actor-definition/dataset-schema) and [key-value store](/platform/actors/development/actor-definition/key-value-store-schema).
- [Source code](https://docs.apify.com/platform/actors/development/actor-definition/source-code.md): **Learn about the Actor's source code placement and its structure.** --- The Apify Actor's source code placement is defined by its [Dockerfile](./docker.md).
- [Automated tests for Actors](https://docs.apify.com/platform/actors/development/automated-tests.md): **Learn how to automate ongoing testing and make sure your Actors perform over time.**
- [Builds and runs](https://docs.apify.com/platform/actors/development/builds-and-runs.md): **Learn about Actor builds and runs, their lifecycle, versioning, and other properties.** --- Actor **builds** and **runs** are fundamental concepts within the Apify platform.
- [Builds](https://docs.apify.com/platform/actors/development/builds-and-runs/builds.md): **Learn about Actor build numbers, versioning, and how to use a specific Actor version in runs.**
- [Runs](https://docs.apify.com/platform/actors/development/builds-and-runs/runs.md): **Learn about Actor runs, how to start them, and how to manage them.** --- When you start an Actor, you create a run.
- [State persistence](https://docs.apify.com/platform/actors/development/builds-and-runs/state-persistence.md): **Learn how to maintain an Actor's state to prevent data loss during unexpected restarts.**
- [Deployment](https://docs.apify.com/platform/actors/development/deployment.md): **Learn how to deploy your Actors to the Apify platform and build them.** --- Deploying an Actor involves uploading your [source code](/platform/actors/development/actor-definition) and [building](/platform/actors/development/builds-and-runs/builds) it on the Apify platform.
- [Continuous integration for Actors](https://docs.apify.com/platform/actors/development/deployment/continuous-integration.md): **Learn how to set up automated builds, deploys, and testing for your Actors.** --- Automating your Actor development process can save time and reduce errors, especially for projects with multiple Actors or frequent updates.
- [Source types](https://docs.apify.com/platform/actors/development/deployment/source-types.md): **Learn about Apify Actor source types and how to deploy an Actor from GitHub using CLI or Gist.** --- This section explains the various source types available for Apify Actors and how to deploy an Actor from GitHub using CLI or Gist.
- [Performance](https://docs.apify.com/platform/actors/development/performance.md): **Learn how to get the maximum value out of your Actors, minimize costs, and maximize results.** --- ## Optimization Tips This guide provides tips to help you maximize the performance of your Actors, minimize costs, and achieve optimal results.
- [Programming interface](https://docs.apify.com/platform/actors/development/programming-interface.md): **Learn about the programming interface of Apify Actors, important commands and features provided by the Apify SDK, and how to use them in your Actors.** --- This chapter will guide you through all the commands you need to build your first Actor.
- [Basic commands](https://docs.apify.com/platform/actors/development/programming-interface/basic-commands.md): **Learn how to use basic commands of the Apify SDK for both JavaScript and Python.** --- This page covers essential commands for the Apify SDK in JavaScript & Python.
- [Container web server](https://docs.apify.com/platform/actors/development/programming-interface/container-web-server.md): **Learn about how to run a web server inside your Actor to enable communication with the outside world through both UI and API.** --- Each Actor run is assigned a unique URL (e.g.
- [Actor environment variables](https://docs.apify.com/platform/actors/development/programming-interface/environment-variables.md): **Learn how to provide your Actor with context that determines its behavior through a plethora of pre-defined environment variables set by the Apify platform.** --- ## How to use environment variables in an Actor You can set up environment variables for your Actor in two ways: in `actor.json` or in Apify Console. Your local `.actor/actor.json` file overrides variables set in Apify Console.
- [Metamorph](https://docs.apify.com/platform/actors/development/programming-interface/metamorph.md): **The metamorph operation transforms an Actor run into the run of another Actor with a new input.** --- ## Transform Actor runs Metamorph is a powerful operation that transforms an Actor run into the run of another Actor with a new input.
- [Standby mode](https://docs.apify.com/platform/actors/development/programming-interface/standby.md): **Use Actors as an API server for fast response times.** --- Traditional Actors are designed to run a single task and then stop.
- [Status messages](https://docs.apify.com/platform/actors/development/programming-interface/status-messages.md): **Learn how to use custom status messages to inform users about an Actor's progress.** --- Each Actor run has a status, represented by the `status` field.
- [System events in Apify Actors](https://docs.apify.com/platform/actors/development/programming-interface/system-events.md): **Learn about system events sent to your Actor and how to benefit from them.** --- ## Understand system events Apify's system notifies Actors about various events, such as migration to another server, abort operations triggered by another Actor, and CPU overload. These events help you manage your Actor's behavior and resources effectively.
- [Quick start](https://docs.apify.com/platform/actors/development/quick-start.md): **Create your first Actor using the Apify Web IDE or locally in your IDE.** --- Before you start building your own Actor, try out a couple of existing Actors from [Apify Store](https://apify.com/store).
- [Build with AI](https://docs.apify.com/platform/actors/development/quick-start/build-with-ai.md): **Use pre-built prompts, reference Apify docs via llms.txt, and follow best practices to build Actors efficiently with AI coding assistants.** --- You will learn several approaches to building Apify Actors with the help of AI coding assistants.
- [Local development](https://docs.apify.com/platform/actors/development/quick-start/locally.md): **Create your first Actor locally on your machine, deploy it to the Apify platform, and run it in the cloud.** --- ## What you'll learn This guide walks you through the full lifecycle of an Actor.
- [Web IDE](https://docs.apify.com/platform/actors/development/quick-start/web-ide.md): **Create your first Actor using the web IDE in Apify Console.** --- ## What you'll learn This guide walks you through the full lifecycle of an Actor.
- [Publishing and monetization](https://docs.apify.com/platform/actors/publishing.md): **Apify provides a platform for developing, publishing, and monetizing web automation solutions called Actors.**
- [Monetize your Actor](https://docs.apify.com/platform/actors/publishing/monetize.md): **Learn how you can monetize your web scraping and automation projects by publishing Actors to users in Apify Store.** --- Apify Store allows you to monetize your web scraping, automation and AI Agent projects by publishing them as paid Actors.
- [Pay per event](https://docs.apify.com/platform/actors/publishing/monetize/pay-per-event.md): **Learn how to monetize your Actor with pay-per-event (PPE) pricing, charging users for specific actions like Actor starts, dataset items, or API calls, and understand how to set profitable, transparent event-based pricing.** --- The PPE pricing model offers a flexible monetization option for Actors on Apify Store.
- [Pay per result](https://docs.apify.com/platform/actors/publishing/monetize/pay-per-result.md): **Learn how to monetize your Actor with pay-per-result (PPR) pricing, charging users based on the number of results produced and stored in the dataset, and understand how to set profitable, transparent result-based pricing.** --- In this model, you set a price per 1,000 results.
- [Pricing and costs](https://docs.apify.com/platform/actors/publishing/monetize/pricing-and-costs.md): **Learn how to set Actor pricing and calculate your costs, including platform usage rates, discount tiers, and profit formulas for PPE and PPR monetization models.** --- ## Computing your costs for PPE and PPR Actors For both PPE and PPR Actors, profit is computed using the formula `(0.8 * revenue) - costs`.
- [Rental pricing model](https://docs.apify.com/platform/actors/publishing/monetize/rental.md): **Learn how to monetize your Actor with the rental pricing model, offering users a free trial and a flat monthly fee, and understand how profit is calculated and the limitations of this approach.** --- With the rental model, you can specify a free trial period and a monthly rental price.
- [Publish your Actor](https://docs.apify.com/platform/actors/publishing/publish.md): **Prepare your Actor for Apify Store with a description and README file, and learn how to make your Actor available to the public.** --- Before making your Actor public, it's important to ensure your Actor has a clear **Description** and comprehensive **README** section.
- [Actor quality score](https://docs.apify.com/platform/actors/publishing/quality-score.md): The Actor quality score is a metric that evaluates your Actor's performance across multiple dimensions, including reliability, ease of use, popularity, and other quality indicators.
- [Actor status badge](https://docs.apify.com/platform/actors/publishing/status-badge.md): The Actor status badge can be embedded in the README or documentation to show users the current status and usage of your Actor on the Apify platform.
- [Automated testing](https://docs.apify.com/platform/actors/publishing/test.md): **Apify has a QA system that regularly runs automated tests to ensure that all Actors in the store are functional.** --- ### Why we test We want to make sure that all Actors in Apify Store are top-notch, or at least as top-notch as they can be.
- [Running Actors](https://docs.apify.com/platform/actors/running.md): **In this section, you learn how to run Apify Actors using Apify Console or programmatically.**
- [Actors in Store](https://docs.apify.com/platform/actors/running/actors-in-store.md): **[Apify Store](https://apify.com/store) is home to thousands of public Actors available to the Apify community.**
- [Input and output](https://docs.apify.com/platform/actors/running/input-and-output.md): **Configure your Actor's input parameters using Apify Console, locally or via API.**
- [Runs and builds](https://docs.apify.com/platform/actors/running/runs-and-builds.md): **Learn about Actor builds and runs, their lifecycle, sharing, and data retention policy.** --- ## Builds An Actor is a combination of source code and various settings in a Docker container.
- [Standby mode](https://docs.apify.com/platform/actors/running/standby.md): **Use Actors in lightweight Standby mode for fast API responses.** --- Traditional Actors are designed to run a single job and then stop.
- [Actor tasks](https://docs.apify.com/platform/actors/running/tasks.md): **Create and save reusable configurations of Apify Actors tailored to specific use cases.** --- Actor tasks let you create multiple reusable configurations of a single Actor, adapted for specific use cases.
- [Usage and resources](https://docs.apify.com/platform/actors/running/usage-and-resources.md): **Learn about your Actors' memory and processing power requirements, their relationship with Docker resources, minimum requirements for different use cases, and their impact on cost.** --- ## Resources [Actors](../index.mdx) run in [Docker containers](https://www.docker.com/resources/what-container/), which have a [limited amount of resources](https://phoenixnap.com/kb/docker-memory-and-cpu-limit) (memory, CPU, disk size, etc.).
- [Collaboration](https://docs.apify.com/platform/collaboration.md): **Learn how to collaborate with other users and manage permissions for organizations or private resources such as Actors, Actor runs, and storages.** --- Apify was built from the ground up as a collaborative platform.
- [Access rights](https://docs.apify.com/platform/collaboration/access-rights.md): **Manage permissions for your private resources such as Actors, Actor runs, and storages.**
- [General resource access](https://docs.apify.com/platform/collaboration/general-resource-access.md): Some resources, like storages, Actor runs or Actor builds, can be shared simply by sending their unique resource ID or Console link and the recipient can then view the data in Console or fetch it via API without needing an API token.
- [List of permissions](https://docs.apify.com/platform/collaboration/list-of-permissions.md): **Learn about the access rights you can grant to other users.**
- [Organization account](https://docs.apify.com/platform/collaboration/organization-account.md): **Create a specialized account for your organization to encourage collaboration and manage permissions.**
- [Using the organization account](https://docs.apify.com/platform/collaboration/organization-account/how-to-use.md): **Learn to use and manage your organization account using the Apify Console or API.**
- [Setup](https://docs.apify.com/platform/collaboration/organization-account/setup.md): **Configure your organization account by inviting new members and assigning their roles.**
- [Apify Console](https://docs.apify.com/platform/console.md): **Learn about Apify Console's easy account creation and user-friendly homepage for efficient web scraping management.** --- ## Sign-up To use Apify Console, you first need to create an account.
- [Billing](https://docs.apify.com/platform/console/billing.md): **The Billing page is the central place for all information regarding your invoices, billing information regarding usage in the current billing cycle, historical usage, subscriptions & limits.** --- ## Current period The **Current period** tab is a comprehensive resource for understanding your platform usage during the ongoing billing cycle.
- [Account settings](https://docs.apify.com/platform/console/settings.md): **Learn how to manage your Apify account, configure integrations, create and manage organizations, and set notification preferences in the Settings tab.** --- ## Account By clicking the **Settings** tab on the side menu, you will be presented with an Account page where you can view & edit various settings regarding your account, such as account email, username, profile information, theme, login information, session information, and account deletion. The **Login & Privacy** tab (**Security & Privacy** for organization accounts) contains sensitive settings.
- [Apify Store](https://docs.apify.com/platform/console/store.md): **Explore Apify Store, browse and select Actors, search by criteria, sort by relevance, and adjust settings for immediate or future runs.** --- Apify Store is a place where you can explore a variety of Actors, created and maintained either by Apify or by our community members.
- [Two-factor authentication setup](https://docs.apify.com/platform/console/two-factor-authentication.md): **Learn about Apify Console's account two-factor authentication process and how to set it up.** --- If you use your email and password to sign in to Apify Console, you can enable two-factor authentication for your account.
- [Integrations](https://docs.apify.com/platform/integrations.md): **Learn how to integrate the Apify platform with other services, your systems, data pipelines, and other web automation workflows.** --- > The whole is greater than the sum of its parts.
- [What are Actor integrations?](https://docs.apify.com/platform/integrations/actors.md): **Learn how to integrate with other Actors and tasks.** --- You can check out a catalogue of our Integration Actors within [Apify Store](https://apify.com/store/categories/integrations).
- [Integrating Actors via API](https://docs.apify.com/platform/integrations/actors/integrating-actors-via-api.md): **Learn how to integrate with other Actors and tasks using the Apify API.** --- You can integrate Actors via API using the [Create webhook](/api/v2/webhooks-post) endpoint.
- [Creating integration Actors](https://docs.apify.com/platform/integrations/actors/integration-ready-actors.md): **Learn how to create Actors that are ready to be integrated with other Actors and tasks.** --- Any Actor can be used in integrations.
- [Agno Integration](https://docs.apify.com/platform/integrations/agno.md): **Integrate Apify with Agno to power AI agents with web scraping, automation, and data insights.** --- ## What is Agno?
- [Airbyte integration](https://docs.apify.com/platform/integrations/airbyte.md): **Learn how to integrate your Apify datasets with Airbyte.** --- Airbyte is an open-source data integration platform that allows you to move your data between different sources and destinations using pre-built connectors, which are maintained either by Airbyte itself or by its community.
- [Airtable integration](https://docs.apify.com/platform/integrations/airtable.md): **Learn how to integrate your Apify Actors with Airtable.**
- [API integration](https://docs.apify.com/platform/integrations/api.md): **Learn how to integrate with Apify using the REST API.** --- All aspects of the Apify platform can be controlled via a REST API, which is described in detail in the [**API Reference**](/api/v2).
- [Amazon Bedrock integrations](https://docs.apify.com/platform/integrations/aws_bedrock.md): **Learn how to integrate Apify with Amazon Bedrock Agents to provide web data for AI agents.** --- [Amazon Bedrock](https://aws.amazon.com/bedrock/) is a fully managed service that provides access to large language models (LLMs), allowing users to create and manage retrieval-augmented generation (RAG) pipelines, and create AI agents to plan and perform actions.
- [Bubble integration](https://docs.apify.com/platform/integrations/bubble.md): **Learn how to integrate your Apify Actors with Bubble for automated workflows and notifications.** --- [Bubble](https://bubble.io/) is a no-code platform that allows you to build web applications without writing code.
- [🤖🚀 CrewAI integration](https://docs.apify.com/platform/integrations/crewai.md): **Learn how to build AI Agents with Apify and CrewAI.** --- ## What is CrewAI [CrewAI](https://www.crewai.com/) is an open-source Python framework designed to orchestrate autonomous, role-playing AI agents that collaborate as a "crew" to tackle complex tasks.
- [Google Drive integration](https://docs.apify.com/platform/integrations/drive.md): **Learn how to integrate your Apify Actors with Google Drive.**
- [Flowise integration](https://docs.apify.com/platform/integrations/flowise.md): **Learn how to integrate Apify with Flowise.** --- ## What is Flowise?
- [GitHub integration](https://docs.apify.com/platform/integrations/github.md): **Learn how to integrate your Apify Actors with GitHub.**
- [Gmail integration](https://docs.apify.com/platform/integrations/gmail.md): **Learn how to integrate your Apify Actors with Gmail.**
- [Gumloop integration](https://docs.apify.com/platform/integrations/gumloop.md): With the Gumloop Apify integration you can retrieve key data for your AI-powered workflows in a flash.
- [Gumloop - Instagram Actor integration](https://docs.apify.com/platform/integrations/gumloop/instagram.md): Get Instagram profile posts, details, stories, reels, post comments and hashtags, users, and tagged posts in Gumloop.
- [Gumloop - Google Maps Actor integration](https://docs.apify.com/platform/integrations/gumloop/maps.md): Search, extract, and enrich business data from Google Maps in Gumloop.
- [Gumloop - TikTok Actor integration](https://docs.apify.com/platform/integrations/gumloop/tiktok.md): Get TikTok hashtag videos, profile videos, followers, video details, and search results in Gumloop.
- [Gumloop - YouTube Actor integration](https://docs.apify.com/platform/integrations/gumloop/youtube.md): Get YouTube search results, video details, channel videos, playlists, and channel metadata in Gumloop.
- [Haystack integration](https://docs.apify.com/platform/integrations/haystack.md): **Learn how to integrate Apify with Haystack to work with web data in the Haystack ecosystem.** --- [Haystack](https://haystack.deepset.ai/) is an open source framework for building production-ready LLM applications, agents, advanced retrieval-augmented generation pipelines, and state-of-the-art search systems that work intelligently over large document collections.
- [IFTTT integration](https://docs.apify.com/platform/integrations/ifttt.md): **Connect Apify Actors with IFTTT to automate workflows using Actor run events, data queries, and task actions.** --- [IFTTT](https://ifttt.com) is a service that helps you create automated workflows called Applets.
- [Integrate with Apify](https://docs.apify.com/platform/integrations/integrate.md): If you are building a service and your users could benefit from integrating with Apify or vice versa, we would love to hear from you!
- [Keboola integration](https://docs.apify.com/platform/integrations/keboola.md): **Integrate your Apify Actors with Keboola, a cloud-based data integration platform that consolidates data from various sources into a centralized storage.** --- With Apify integration for [Keboola](https://www.keboola.com/), you can extract data from various sources using your Apify Actors and load it into Keboola for further processing, transformation, and integration with other platforms.
- [🦜🔗 LangChain integration](https://docs.apify.com/platform/integrations/langchain.md): **Learn how to integrate Apify with LangChain, in order to feed vector databases and LLMs with data crawled from the web.** --- > For more information on LangChain visit its [documentation](https://python.langchain.com/docs/).
- [Langflow integration](https://docs.apify.com/platform/integrations/langflow.md): **Learn how to integrate Apify with Langflow to run complex AI agent workflows.** --- ## What is Langflow [Langflow](https://langflow.org/) is a low-code, visual tool that enables developers to build powerful AI agents and workflows that can use any API, models, or databases.
- [🦜🔘➡️ LangGraph integration](https://docs.apify.com/platform/integrations/langgraph.md): **Learn how to build AI Agents with Apify and LangGraph.** --- ## What is LangGraph [LangGraph](https://www.langchain.com/langgraph) is a framework designed for constructing stateful, multi-agent applications with Large Language Models (LLMs), allowing developers to build complex AI agent workflows that can leverage tools, APIs, and databases.
- [Lindy integration](https://docs.apify.com/platform/integrations/lindy.md): **Learn how to integrate your Apify Actors with Lindy.** --- [Lindy](https://www.lindy.ai/) is an AI-powered automation platform that lets you create intelligent workflows and automate complex tasks.
- [LlamaIndex integration](https://docs.apify.com/platform/integrations/llama-index.md): **Learn how to integrate Apify with LlamaIndex to feed vector databases and LLMs with data crawled from the web.** --- > For more information on LlamaIndex, visit its [documentation](https://docs.llamaindex.ai/en/stable/).
- [Make integration](https://docs.apify.com/platform/integrations/make.md): **Learn how to integrate your Apify Actors with Make.** --- [Make](https://www.make.com/) _(formerly Integromat)_ allows you to create scenarios where you can integrate various services (modules) to automate and centralize jobs.
- [Make - AI crawling Actor integration](https://docs.apify.com/platform/integrations/make/ai-crawling.md): ## Apify Scraper for AI Crawling Apify Scraper for AI Crawling from [Apify](https://apify.com/) lets you extract text content from websites to feed AI models, LLM applications, vector databases, or Retrieval Augmented Generation (RAG) pipelines.
- [Make - Amazon Actor integration](https://docs.apify.com/platform/integrations/make/amazon.md): ## Apify Scraper for Amazon Data The Amazon Scraper module from [Apify](https://apify.com) allows you to extract product, search, or category data from Amazon.
- [Make - Facebook Actor integration](https://docs.apify.com/platform/integrations/make/facebook.md): ## Apify Scraper for Facebook Data The Facebook Scraper modules from [Apify](https://apify.com/) allow you to extract posts, comments, and profile data from Facebook.
- [Make - Instagram Actor integration](https://docs.apify.com/platform/integrations/make/instagram.md): **Learn about Instagram scraper modules.**
- [Make - LLMs Actor integration](https://docs.apify.com/platform/integrations/make/llm.md): ## Apify Scraper for LLMs Apify Scraper for LLMs from [Apify](https://apify.com) is a web browsing module for OpenAI Assistants, RAG pipelines, and AI agents.
- [Make - Google Maps Leads Actor integration](https://docs.apify.com/platform/integrations/make/maps.md): ## Apify Scraper for Google Maps Leads The Google Maps Leads Scraper modules from [apify.com](http://apify.com/) allow you to extract valuable business lead data from Google Maps, including contact information, email addresses, social media profiles, business websites, phone numbers, and detailed location data.
- [Make - Google Search Actor integration](https://docs.apify.com/platform/integrations/make/search.md): ## Apify Scraper for Google Search The Google Search modules from [Apify](https://apify.com) allow you to crawl Google Search Results Pages (SERPs) and extract data from those web pages in structured formats such as JSON, XML, CSV, or Excel.
- [Make - TikTok Actor integration](https://docs.apify.com/platform/integrations/make/tiktok.md): ## Apify Scraper for TikTok Data The TikTok Scraper modules from [Apify](https://apify.com) allow you to extract hashtag, comments, and profile data from TikTok.
- [Make - YouTube Actor integration](https://docs.apify.com/platform/integrations/make/youtube.md): ## Apify Scraper for YouTube Data The YouTube Scraper module from [apify.com](https://apify.com) allows you to extract channel, video, streams, shorts, and search data from YouTube.
- [Mastra MCP integration](https://docs.apify.com/platform/integrations/mastra.md): **Learn how to build AI agents with Mastra and Apify Actors MCP Server.** --- ## What is Mastra [Mastra](https://mastra.ai) is an open-source TypeScript framework for building AI applications efficiently.
- [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md): The _Apify Model Context Protocol (MCP) Server_ enables AI applications to connect to Apify's extensive library of Actors.
- [Milvus integration](https://docs.apify.com/platform/integrations/milvus.md): Learn how to integrate Apify with Milvus (Zilliz) to save data scraped from websites into the Milvus vector database. [Milvus](https://milvus.io/) is an open-source vector database optimized for performing similarity searches on large datasets of high-dimensional vectors.
- [n8n integration](https://docs.apify.com/platform/integrations/n8n.md): Connect Apify with n8n to automate workflows by running Actors, extracting structured data, and responding to Actor or task events. [n8n](https://n8n.io/) is an open source, fair-code licensed tool for workflow automation.
- [n8n - Website Content Crawler by Apify](https://docs.apify.com/platform/integrations/n8n/website-content-crawler.md): Website Content Crawler from [Apify](https://apify.com/apify/website-content-crawler) lets you extract text content from websites to feed AI models, LLM applications, vector databases, or Retrieval Augmented Generation (RAG) pipelines.
- [OpenAI Assistants integration](https://docs.apify.com/platform/integrations/openai-assistants.md): Learn how to integrate Apify with OpenAI Assistants to provide real-time search data and save it to OpenAI Vector Store. [OpenAI Assistants API](https://platform.openai.com/docs/assistants/overview) allows you to build your own AI applications such as chatbots, virtual assistants, and more.
- [Pinecone integration](https://docs.apify.com/platform/integrations/pinecone.md): Learn how to integrate Apify with Pinecone to feed data crawled from the web into the Pinecone vector database. [Pinecone](https://www.pinecone.io) is a managed vector database that allows users to store and query dense vectors for AI applications such as recommendation systems, semantic search, and retrieval augmented generation (RAG).
- [Qdrant integration](https://docs.apify.com/platform/integrations/qdrant.md): Learn how to integrate Apify with Qdrant to transfer crawled data into the Qdrant vector database. [Qdrant](https://qdrant.tech) is a high-performance managed vector database that allows users to store and query dense vectors for next-generation AI applications such as recommendation systems, semantic search, and retrieval augmented generation (RAG).
- [Slack integration](https://docs.apify.com/platform/integrations/slack.md): Learn how to integrate your Apify Actors with Slack.
- [Telegram integration through Zapier](https://docs.apify.com/platform/integrations/telegram.md): Learn how to integrate your Apify Actors with Telegram through Zapier. With [Apify integration for Zapier](https://zapier.com/apps/apify/integrations), you can connect your Apify Actors to Slack, Trello, Google Sheets, Dropbox, Salesforce, and loads more.
- [🔺 Vercel AI SDK integration](https://docs.apify.com/platform/integrations/vercel-ai-sdk.md): Learn how to integrate Apify Actors as tools for AI with Vercel AI SDK. [Vercel AI SDK](https://ai-sdk.dev/) is the TypeScript toolkit designed to help developers build AI-powered applications and agents with React, Next.js, Vue, Svelte, Node.js, and more.
- [Webhook integration](https://docs.apify.com/platform/integrations/webhooks.md): Learn how to integrate multiple Apify Actors or external systems with your Actor or task run.
- [Webhook actions](https://docs.apify.com/platform/integrations/webhooks/actions.md): Send notifications when specific events occur in your Actor/task run or build.
- [Ad-hoc webhooks](https://docs.apify.com/platform/integrations/webhooks/ad-hoc-webhooks.md): Set up one-time webhooks for Actor runs initiated through the Apify API or from the Actor's code.
- [Events types for webhooks](https://docs.apify.com/platform/integrations/webhooks/events.md): Specify the types of events that trigger a webhook in an Actor or task run.
- [Zapier integration](https://docs.apify.com/platform/integrations/zapier.md): Learn how to integrate your Apify Actors with Zapier. With [Apify integration for Zapier](https://zapier.com/apps/apify/integrations), you can connect your Apify Actors to Slack, Trello, Google Sheets, Dropbox, Salesforce, and loads more.
- [Limits](https://docs.apify.com/platform/limits.md): Learn the Apify platform's resource capability and limitations such as max memory, disk size, and number of Actors and tasks per user. The tables below demonstrate the Apify platform's default resource limits.
- [Monitoring](https://docs.apify.com/platform/monitoring.md): Learn how to continuously make sure that your Actors and tasks perform as expected and retrieve correct results.
- [Proxy](https://docs.apify.com/platform/proxy.md): Learn to anonymously access websites in scraping/automation jobs.
- [Datacenter proxy](https://docs.apify.com/platform/proxy/datacenter-proxy.md): Learn how to reduce blocking when web scraping using IP address rotation.
- [Google SERP proxy](https://docs.apify.com/platform/proxy/google-serp-proxy.md): Learn how to collect search results from Google Search-powered tools.
- [Residential proxy](https://docs.apify.com/platform/proxy/residential-proxy.md): Achieve a higher level of anonymity using IP addresses from human users.
- [Proxy usage](https://docs.apify.com/platform/proxy/usage.md): Learn how to configure and use Apify Proxy.
- [Using your own proxies](https://docs.apify.com/platform/proxy/using-your-own-proxies.md): Learn how to use your own proxies while using the Apify platform. In addition to our proxies, you can use your own both in Apify Console and SDK.
- [Schedules](https://docs.apify.com/platform/schedules.md): Learn how to automatically start your Actor and task runs and the basics of cron expressions.
- [Security](https://docs.apify.com/platform/security.md): Learn more about Apify's security practices and data protection measures that are used to protect your Actors, their data, and the Apify platform in general. The Apify platform is SOC 2 Type II compliant.
- [Storage](https://docs.apify.com/platform/storage.md): Store anything from images and key-value pairs to structured output data.
- [Dataset](https://docs.apify.com/platform/storage/dataset.md): Store and export web scraping, crawling or data processing job results.
- [Key-value store](https://docs.apify.com/platform/storage/key-value-store.md): Store anything from Actor or task run results, JSON documents, or images.
- [Request queue](https://docs.apify.com/platform/storage/request-queue.md): Queue URLs for an Actor to visit in its run.
- [Storage usage](https://docs.apify.com/platform/storage/usage.md): Learn how to effectively use Apify's storage options.


---

# Full Documentation Content



https://docs.apify.com

https://docs.apify.com/academy

https://docs.apify.com/platform

https://docs.apify.com/api

* https://docs.apify.com/api/v2
* https://docs.apify.com/api/client/js/
* https://docs.apify.com/api/client/python/

https://docs.apify.com/sdk

* https://docs.apify.com/sdk/js/
* https://docs.apify.com/sdk/python/

https://docs.apify.com/cli/

https://docs.apify.com/open-source

* https://crawlee.dev
* https://github.com/apify/got-scraping
* https://github.com/apify/fingerprint-suite
* https://github.com/apify
* https://whitepaper.actor

[Chat on Discord](https://discord.com/invite/jyEM2PRvMU)

https://console.apify.com

# Apify API

The Apify API provides programmatic access to the [Apify platform](https://docs.apify.com/).

## API reference

The Apify API allows developers to interact programmatically with apps using HTTP requests. The Apify API is built around [REST](https://en.wikipedia.org/wiki/REST).

The API has predictable resource-oriented URLs, returns JSON-encoded responses, and uses standard HTTP response codes, authentication, and verbs.

https://docs.apify.com/api/v2.md

cURL


```
# Prepare Actor input and run it synchronously.
# Replace <YOUR_API_TOKEN> with your Apify API token.
echo '{ "searchStringsArray": ["Apify"] }' |
curl -X POST -d @- \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer <YOUR_API_TOKEN>' \
  -L 'https://api.apify.com/v2/acts/compass~crawler-google-places/run-sync-get-dataset-items'
```
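
The same synchronous run can be issued from any HTTP client. As a minimal sketch using only Python's standard library, the helper below builds the equivalent request. `build_run_sync_request` is a hypothetical helper name and the token value is a placeholder; the endpoint and headers are copied from the cURL example. Nothing is sent until the request is passed to `urllib.request.urlopen`.

```python
import json
import urllib.request

API_BASE = "https://api.apify.com/v2"  # base URL taken from the cURL example above


def build_run_sync_request(actor_id: str, token: str, actor_input: dict) -> urllib.request.Request:
    """Build (but do not send) a POST request that runs an Actor synchronously
    and returns its dataset items. Hypothetical helper for illustration."""
    url = f"{API_BASE}/acts/{actor_id}/run-sync-get-dataset-items"
    return urllib.request.Request(
        url,
        data=json.dumps(actor_input).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


request = build_run_sync_request(
    "compass~crawler-google-places",
    "MY-APIFY-TOKEN",  # placeholder, replace with a real token
    {"searchStringsArray": ["Apify"]},
)

# To actually send it (requires network access and a valid token):
# with urllib.request.urlopen(request) as response:
#     items = json.load(response)
```

Keeping request construction separate from sending makes it easy to inspect the URL, headers, and body before any credits are spent on a run.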


## API client

Official libraries for interacting with the Apify API.

##### ![](/img/javascript-40x40.svg) JavaScript Client

##### ![](/img/python-40x40.svg) Python Client

### JavaScript API client

The official library to interact with the Apify API from a web browser, Node.js, JavaScript, or TypeScript applications.

https://github.com/apify/apify-client-js

https://docs.apify.com/api/client/js/docs

https://docs.apify.com/api/client/js/reference


```
npm install apify-client
```



```
// Easily run Actors, await them to finish using the convenient .call() method, and retrieve results from the resulting dataset.
// Top-level await requires an ES module (for example, a .mjs file or "type": "module" in package.json).
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({
    token: 'MY-APIFY-TOKEN',
});

// Starts an Actor and waits for it to finish.
const { defaultDatasetId } = await client.actor('john-doe/my-cool-actor').call();

// Fetches results from the Actor's dataset.
const { items } = await client.dataset(defaultDatasetId).listItems();
```
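
The Python client follows the same run-then-fetch pattern. The sketch below is a hedged illustration, not the library's reference usage: it assumes the `apify-client` package exposes `ApifyClient`, `actor(...).call(...)`, and `dataset(...).list_items()` as described in its published documentation, and that the run record is a dictionary. `run_actor_and_fetch_items` is a hypothetical helper name. The client is passed in as a parameter, so the helper itself has no import-time dependency.

```python
from typing import Any, Optional


def run_actor_and_fetch_items(client: Any, actor_id: str, run_input: Optional[dict] = None) -> list:
    """Start an Actor, wait for it to finish, and return its default dataset items.

    `client` is assumed to behave like apify_client.ApifyClient (an assumption).
    """
    # Starts the Actor and blocks until the run finishes.
    run = client.actor(actor_id).call(run_input=run_input)
    # The run record is assumed to carry the ID of its default dataset.
    dataset_id = run["defaultDatasetId"]
    # Fetches the items stored in that dataset.
    return client.dataset(dataset_id).list_items().items


# Usage (requires `pip install apify-client` and a valid token):
# from apify_client import ApifyClient
# client = ApifyClient("MY-APIFY-TOKEN")
# items = run_actor_and_fetch_items(client, "john-doe/my-cool-actor")
```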


## Related articles

https://blog.apify.com/web-scraping-with-client-side-vanilla-javascript/

https://blog.apify.com/apify-python-api-client/

https://blog.apify.com/api-for-dummies/

Learn

* https://docs.apify.com/academy
* https://docs.apify.com/platform

API

* https://docs.apify.com/api/v2
* https://docs.apify.com/api/client/js/
* https://docs.apify.com/api/client/python/

SDK

* https://docs.apify.com/sdk/js/
* https://docs.apify.com/sdk/python/

Other

* https://docs.apify.com/cli/
* https://docs.apify.com/open-source

More

* https://crawlee.dev
* https://github.com/apify
* https://discord.com/invite/jyEM2PRvMU
* https://trust.apify.com

https://apify.com


---



# Apify open source

Open-source tools and libraries created and maintained by Apify experts to help you with web scraping, browser automation, and proxy management.

## Crawlee

Crawlee is a fully open-source web scraping and browser automation library that helps you build reliable crawlers.

### https://crawlee.dev/

### https://crawlee.dev/python/

## Other

### https://github.com/apify/fingerprint-suite

Generate and inject browser fingerprints to avoid detection and improve scraper stealth.

### https://github.com/apify/got-scraping

A powerful extension for sending browser-like requests and blending in with web traffic.

### https://github.com/apify/proxy-chain

A Node.js proxy server with support for SSL, authentication, upstream proxy chaining, custom HTTP responses, and traffic statistics.

## Actor templates

Actor templates help you quickly set up your web scraping projects. Save development time and get immediate access to all the features of the Apify platform.

https://apify.com/templates


---



# Apify SDK

The Apify SDK is a toolkit for building Actors—serverless microservices running (not only) on the Apify platform. Apify comes with first-class support for JavaScript/TypeScript and Python, but you can run any containerized code on the Apify platform.

![](/img/javascript-40x40.svg)

## SDK for JavaScript

Toolkit for building Actors—serverless microservices running (not only) on the Apify platform.

https://github.com/apify/apify-sdk-js

https://docs.apify.com/sdk/js/docs/guides/apify-platform

https://docs.apify.com/sdk/js/reference


```
npx apify-cli create my-crawler
```



```
// The Apify SDK makes it easy to initialize the Actor on the platform with the Actor.init() method,
// and to save the scraped data from your Actors to a dataset by simply using the Actor.pushData() method.

import { Actor } from 'apify';
import { PlaywrightCrawler } from 'crawlee';

await Actor.init();
const crawler = new PlaywrightCrawler({
    async requestHandler({ request, page, enqueueLinks }) {
        const title = await page.title();
        console.log(`Title of ${request.loadedUrl} is '${title}'`);
        await Actor.pushData({ title, url: request.loadedUrl });
        await enqueueLinks();
    }
});
await crawler.run(['https://crawlee.dev']);
await Actor.exit();
```


![](/img/python-40x40.svg)

## SDK for Python

The Apify SDK for Python is the official library for creating Apify Actors in Python. It provides useful features like Actor lifecycle management, local storage emulation, and Actor event handling.

https://github.com/apify/apify-sdk-python

https://docs.apify.com/sdk/python/docs/overview/introduction

https://docs.apify.com/sdk/python/reference


```
apify create my-python-actor
```



```
# The Apify SDK makes it easy to read the Actor input with the Actor.get_input() method,
# and to save the scraped data from your Actors to a dataset by simply using the Actor.push_data() method.

import asyncio

from apify import Actor
from bs4 import BeautifulSoup
import requests

async def main():
    async with Actor:
        actor_input = await Actor.get_input()
        # requests is a blocking client; that is acceptable for a single request in a short example.
        response = requests.get(actor_input['url'])
        soup = BeautifulSoup(response.content, 'html.parser')
        await Actor.push_data({ 'url': actor_input['url'], 'title': soup.title.string })

# Run the coroutine when the file is executed directly.
if __name__ == '__main__':
    asyncio.run(main())
```
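
The example above assumes the Actor input always contains a `url` key. A small guard, sketched below with the hypothetical helper `require_url`, fails early with a readable message when the input is missing or malformed. It could be called on the value returned by `Actor.get_input()` before any request is made.

```python
from typing import Optional


def require_url(actor_input: Optional[dict]) -> str:
    """Return the 'url' field of the Actor input or raise ValueError.

    Hypothetical helper; the 'url' key matches the example above.
    """
    if not isinstance(actor_input, dict):
        raise ValueError("Actor input is missing or is not a JSON object.")
    url = actor_input.get("url")
    if not isinstance(url, str) or not url.startswith(("http://", "https://")):
        raise ValueError("Actor input must contain an http(s) 'url' string.")
    return url
```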



---




# Web Scraping Academy

Learn everything about web scraping and automation with our free courses that will turn you into an expert scraper developer.

## Beginner courses

* https://docs.apify.com/academy/web-scraping-for-beginners.md
* https://docs.apify.com/academy/scraping-basics-python.md
* https://docs.apify.com/academy/apify-platform.md

## Advanced web scraping courses

* https://docs.apify.com/academy/api-scraping.md
* https://docs.apify.com/academy/anti-scraping.md
* https://docs.apify.com/academy/expert-scraping-with-apify.md


---

# Actor marketing playbook

**Learn how to optimize and monetize your Actors on Apify Store by sharing them with other platform users.**

***



[Apify Store](https://apify.com/store) is a marketplace featuring thousands of ready-made automation tools called Actors. As a developer, you can publish your own Actors and generate revenue through our partner program for Actor developers (https://apify.com/partners/actor-developers).

To help you succeed, we've created a comprehensive Actor marketing playbook. You'll learn how to:

* Optimize your Actor's visibility on Apify Store
* Create compelling descriptions and documentation
* Build your developer brand
* Promote your work to potential customers
* Analyze performance metrics
* Engage with the Apify community

## Apify Store basics

* https://docs.apify.com/academy/actor-marketing-playbook/store-basics/how-store-works.md
* https://docs.apify.com/academy/actor-marketing-playbook/store-basics/how-to-build-actors.md
* https://docs.apify.com/academy/actor-marketing-playbook/store-basics/how-actor-monetization-works.md

## Actor basics

* https://docs.apify.com/academy/actor-marketing-playbook/actor-basics/name-your-actor.md
* https://docs.apify.com/academy/actor-marketing-playbook/actor-basics/importance-of-actor-url.md
* https://docs.apify.com/academy/actor-marketing-playbook/actor-basics/actor-description.md
* https://docs.apify.com/academy/actor-marketing-playbook/actor-basics/how-to-create-an-actor-readme.md
* https://docs.apify.com/academy/actor-marketing-playbook/actor-basics/actors-and-emojis.md

## Promoting your Actor

* https://docs.apify.com/academy/actor-marketing-playbook/promote-your-actor/seo.md
* https://docs.apify.com/academy/actor-marketing-playbook/promote-your-actor/social-media.md
* https://docs.apify.com/academy/actor-marketing-playbook/promote-your-actor/parasite-seo.md
* https://docs.apify.com/academy/actor-marketing-playbook/promote-your-actor/product-hunt.md
* https://docs.apify.com/academy/actor-marketing-playbook/promote-your-actor/blogs-and-blog-resources.md
* https://docs.apify.com/academy/actor-marketing-playbook/promote-your-actor/video-tutorials.md
* https://docs.apify.com/academy/actor-marketing-playbook/promote-your-actor/webinars.md

## Interacting with users

* https://docs.apify.com/academy/actor-marketing-playbook/interact-with-users/emails-to-actor-users.md
* https://docs.apify.com/academy/actor-marketing-playbook/interact-with-users/issues-tab.md
* https://docs.apify.com/academy/actor-marketing-playbook/interact-with-users/your-store-bio.md

## Product optimization

* https://docs.apify.com/academy/actor-marketing-playbook/product-optimization/how-to-create-a-great-input-schema.md
* https://docs.apify.com/academy/actor-marketing-playbook/product-optimization/actor-bundles.md



Ready to grow your presence on the Apify platform? Check out our guide to [publishing Actors](https://docs.apify.com/platform/actors/publishing.md).


---

# Actor description & SEO description

Learn about the Actor description and the SEO (meta) description: where to set them, and best practices for both content and length.

***

## What is an Actor description?

First impressions are important, especially when it comes to tools. Actor descriptions are the first connection potential users have with your Actor. You can set two kinds of descriptions: *regular description* (in Apify Store) and *SEO description* (on Google search), along with their respective names: regular name and SEO name.

**Tip:** You can change descriptions and names as many times as you want.

## Regular description vs. SEO description

|                    | Actor description & name | SEO description & name |
| ------------------ | ------------------------ | ---------------------- |
| Name length        | 40-50 characters         | 40-50 characters       |
| Description length | 300 characters           | 145-155 characters     |
| Visibility         | Visible on Store         | Visible on Google      |
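
The limits in the table can be checked mechanically before you paste text into Apify Console. The sketch below is an illustration only: `check_display_texts` is a hypothetical helper, the numbers are copied from the table, and it treats the upper bound of each range as the limit (an assumption, since the table gives recommended lengths rather than hard rules).

```python
# Upper bounds taken from the table above (characters).
MAX_LENGTHS = {
    "name": 50,
    "description": 300,
    "seo_name": 50,
    "seo_description": 155,
}


def check_display_texts(texts: dict) -> dict:
    """Return {field: overflow} for every field longer than its recommended maximum."""
    problems = {}
    for field, limit in MAX_LENGTHS.items():
        value = texts.get(field, "")
        if len(value) > limit:
            problems[field] = len(value) - limit
    return problems


example = {
    "name": "Airbnb Scraper",
    "seo_name": "Airbnb Data Scraper",
    "description": "Scrape whole cities or extract data from hundreds of Airbnb rentals in seconds.",
    "seo_description": "x" * 160,  # deliberately too long for the demonstration
}
print(check_display_texts(example))  # {'seo_description': 5}
```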

### Description & Actor name

Actor description is what users see on the Actor's web page in Apify Store, along with the Actor's name and URL. When creating an Actor description, a “warm” visitor experience is prioritized (more on that later).

![actor name & description](/assets/images/actor-description-name-bea8b2060a01d4c5d190cb2445a9a6c6.png)

Actor description is also present in Apify Console and across Apify Store.

![actor description in store](/assets/images/actor-description-store-bda4a42f8f8a0ca572e2fca5ce79d4b1.png)

### SEO description & SEO name

Actor SEO description is a tool description visible on Google. It is shorter and SEO-optimized (keywords matter here). When creating the SEO description, a “cold” visitor experience is prioritized.

![seo description](/assets/images/seo_description-12e904f852b518923f228bd2ef68a534.png)

A potential user usually encounters these descriptions in a fixed order: the SEO description first, the regular description second.

### Is there any benefit in the description and meta description being different?

Different descriptions let you target different stages of user acquisition and help make sure the acquisition actually takes place.

*SEO description (and SEO name)* is targeting a “cold” potential user who knows nothing about your tool yet and just came across it on Google search. They’re searching to solve a problem or use case. The goal of the meta description is to convince that visitor to click on your tool's page among other similar search results on Google. While it's shorter, SEO description is also the space to search-engine-optimize your language to the max to attract the most matching search intent.

*Description (and name)* is targeting a “warm” potential user who is already curious about your tool. They have clicked on the tool's page and have a few seconds to understand how complex the tool is and what it can do for them. Here you can forget SEO optimization and speak directly to the user. The regular description also has a longer character limit, which means you can expand on your Actor’s features.

Learn more about search intent here: https://docs.apify.com/academy/actor-marketing-playbook/promote-your-actor/seo.md

## Where can Actor descriptions be set?

Both descriptions can be found and edited in the rightmost tab: **Publication → Display information**. This has to be done separately for each Actor.

**Note:** Setting the SEO description and SEO name is optional. If not set, the description will just be duplicated.

![changing seo name](/assets/images/changing__SEO_name-b739468d580f0dfd5aa0c82cb477f581.png)

![changing actor name and seo name](/assets/images/changing_Actor_name_and_SEO_name-fd56498d2164f1715ff6495538f5690d.png)

The Actor description specifically can also be quick-edited in a pop-up on the Actor's page in Apify Console. Open the **Actor's page**, click **…** in the top right corner, and choose ✎ **Edit name or description**. Then update the description and click **Save**.

![changing actor description](/assets/images/change_Actor_description-703d7e6db0ba521eed798cd719c25a2a.png)

## Tips and recommendations on how to write descriptions

When writing a description, less is more. You only have a few seconds to capture attention and communicate what your Actor can do. To make the most of that time, follow these guidelines used by Apify (these apply to both types of descriptions):

### Use variations and experiment 🔄

* *SEO name vs. regular name*:

  * name: Airbnb Scraper
  * SEO name: Airbnb Data Scraper

* *Keywords on the web page*:

  Include variations, e.g. Airbnb API, Airbnb data, Airbnb data scraper, Airbnb rentals, Airbnb listings

  * No-code scraping tool to extract Airbnb data: host info, prices, dates, location, and reviews.
  * Scrape Airbnb listings without the official Airbnb API!

* *Scraping/automation process variations*:

  Use terms, e.g. crawl, crawler, scraping tool, finder, scraper, data extraction tool, extract data, get data

  * Scrape XYZ data, scraped data, data scraper, data crawler.

### Choose how to start your sentences 📝

* *Noun-first (descriptive)*:
  
  * Data extraction tool to extract Airbnb data: host info, prices, dates, location, and reviews.
* *Imperative-first (motivating)*:
  
  * Try a free web scraping tool to extract Airbnb data: host info, prices, dates, location, and reviews.

### Keep it short and SEO-focused ✂️

* *Be concise and direct*: clearly state what your Actor does. Avoid unnecessary fluff and boilerplate text.

  * ✅ Scrapes job listings from Indeed and gathers...
  * ❌ This Actor scrapes job listings from Indeed in order to gather...

* *Optimize for search engines*: include popular keywords related to your Actor’s functionality that users might search for.

  * ✅ This Indeed scraper helps you collect job data efficiently. Use the tool to gather...
  * ❌ This tool will search through job listings on Indeed and offers you...

### List the data your Actor works with 📝

* Data extraction tool to extract Airbnb data: host info, prices, dates, location, and reviews.
* Get hashtags, usernames, mentions, URLs, comments, images, likes, locations without the official Instagram API.

### Use keywords or the language of the target website 🗣️

* Extract data from hundreds of Airbnb home rentals in seconds.
* Extract data from chosen tik-toks. Just add a TikTok URL and get TikTok video and profile data: URLs, numbers of shares, followers, hashtags, hearts, video, and music metadata.
* Scrape Booking with this hotels scraper and get data about accommodation on Booking.com.

### Highlight your strong suits 🌟

* Ease of use, no coding, user-friendly:

  * Easy scraping tool to extract Airbnb data.

* Fast and scalable:

  * Scrape whole cities or extract data from hundreds of Airbnb rentals in seconds.

* Free (only if a trial run fits within the $5 of free credits):

  * Try a free scraping tool to extract Airbnb data: host info, prices, dates, location, and reviews.
  * Extract host information, locations, availability, stars, reviews, images, and host/guest details for free.

* Available platform features (various formats, API, integrations, scheduling):

  * Export scraped data in formats like HTML, JSON, and Excel.

* Additional tips:

  * Avoid ending lists with etc.
  * Consider adding relevant emojis for visual appeal.

### Break it down 🔠

Descriptions typically fit into 2-3 sentences. Don't try to jam everything into one.

Examples:

1. Scrape whole cities or extract data from hundreds of Airbnb rentals in seconds.
2. Extract host information, addresses, locations, prices, availability, stars, reviews, images, and host/guest details.
3. Export scraped data, run the scraper via API, schedule and monitor runs, or integrate with other tools.

## FAQ

#### Can the Actor's meta description and description be the same?

Yes, they can, as long as the shared text fits the shorter SEO length (under 150 characters). But they can also be different; there's no harm in that.

#### How different can description and meta description be?

They can be vastly different and target different angles of your Actor. You can experiment by setting up different SEO descriptions for a period of time and seeing if the click-through rate rises.

#### I set a custom SEO description but Google doesn't show it

Sometimes Google picks up a part of the README as the SEO description. It's heavily dependent on the search query. What you see on Google might look different from the SEO description you set. It's all a part of how Google customizes search results.


---

# Actors and emojis

Using emojis in Actors is a science of its own. Learn how emojis enhance the user experience in Actors by grabbing attention, simplifying navigation, and making information clearer.

## On the use of emojis in Actors

We started using emojis in Actors for several reasons. First, tech today often uses emojis to make things look more user-friendly. Second, people don’t read as much as we’d like. You only have a few seconds to grab their attention, and text alone can feel overwhelming. Third, we don’t have many opportunities or space to explain things about Actors, and we want to avoid users needing to open extra tabs or pages. Clarity should come instantly, so we turned to emojis.

When evaluating a new tool, those first 5 seconds are critical. That’s why we use emojis extensively with our Actors. They’re part of the Actor SEO title and description to help the tool stand out in Google search results, although Google doesn't always display them. In READMEs, they serve as shortcuts to different sections and help users quickly understand the type of data they’ll get. In complex input schemas, we rely on emojis to guide users and help them navigate the tool more efficiently.

## Emoji science

Believe it or not, there’s a science to emoji usage. When we use emojis in Actors and related content, we tap into the brain's iconic and working memory. Iconic memory holds information for less than a second - this is unconscious processing, where attributes like color, size, and location are instantly recognized. This part is where emojis guide the person's attention in the sea of text. They signify that something important is here. Emojis help with that immediate first impression and create a sense of clarity.

After that, the brain shifts to working memory, where it combines information into visual chunks. Since we can only hold about 3-4 chunks at once, emojis help reinforce key points, thus reducing cognitive load. Consistent emoji use across the Actor ecosystem ensures users can quickly connect information without getting overwhelmed.

As an example of this whole process, first, the user notices the emojis used in the field titles (pre-attentive processing). They learn to associate the emojis with those titles (attentive processing). Later, when they encounter the same emojis in a README section, they’ll make the connection, making it easier to navigate without drowning in a sea of text.

## Caveats to emojis

1. Don't overuse them, and don’t rely on emojis for critical information. Emojis should support the text, not replace key explanations or instructions. They're a crutch for concise copywriting, not a universal solution.
2. Use them consistently. Choose one and stick with it across all content: descriptions, parts of input schema, mentions in README, blog posts, etc.
3. Some emojis have multiple meanings, so choose the safest one. It could be general internet knowledge or cultural differences, so make sure the ones you choose won’t confuse or offend users in other markets.
4. Some emojis don’t render well on Windows or older devices. Try to choose ones that display correctly on Mac, Windows, and mobile platforms. Besides, emoji-heavy content can be harder for screen readers and accessibility tools to interpret. Make sure the information is still clear without the emojis.
5. It's okay not to use them.


---

# How to create an Actor README

**Learn how to write a comprehensive README to help users better navigate, understand and run public Actors in Apify Store.**

***

## What's a README in the Apify sense?

At Apify, when we talk about a README, we don’t mean a guide mainly aimed at developers that explains what a project is, how to set it up, or how to contribute to it. At least, not in its traditional sense.

You could argue our notion of README is closer to this https://docs.github.com/en/repositories/managing-your-repositorys-settings-and-features/customizing-your-repository/about-readmes:

README files typically include information on:

* What the project does
* Why the project is useful
* How users can get started with the project
* Where users can get help with your project

We mean all of this and even more. At Apify, when we talk about READMEs, we refer to the public Actor detail page on Apify Store. Specifically, its first tab. The README exists in the same form both on the web and in Console. What is it for then?

Before we dive in, a little disclaimer: you don't need your Apify README to fulfill all its purposes. Technically, you could even publish an Actor with just a single word in the README. But you'd be missing out if you did that.

Your Actor’s README has at least four functions:

1. *SEO* - If your README is well-structured and includes important keywords — both in headings and across the text — it has a high chance of being noticed and promoted by Google. Organic search brings the most motivated type of potential users. If you win this game, you've won most of the SEO game.
2. *First impression* - Your README is one of the first points of contact with a potential user. If you come across as convincing, clear, and reassuring, it could be the factor that makes a user try your Actor for their task.
3. *Extended instruction* - The README is also the space that explains specific complex input settings, for example, special formatting of the input, anything coding-related, or extended functions. Of course, you could put all of that in a blog post as well, but the README should be the user's first point of contact.
4. *Support* - Your users come back to the README when they face issues. Use it to link to tutorials, describe common troubleshooting techniques, share tricks, or warn users about known bugs.

## README elements theory

These are the most important elements of the README. This structure is also not to be followed to a “t”. Of course, what you want to say to your potential users and how you want to promote your Actor will differ case by case. These are just the most common practices we have for our Actor READMEs. Beware that the headings are written with SEO in mind, which is why you see certain keywords repeated over and over.

Aim for sections 1–6 below and try to include at least 300 words. You can move the sections around to some extent if it makes sense, e.g. 3 might come after 6. Consider using emojis as bullet points or otherwise trying to break up the text.

### Intro and features

What is \[Actor]?

* explain in two or three sentences what the Actor does and the easiest way to try it. Mention briefly what kind of data it can extract and any other tangible goal the tool can achieve. Describe the input in one sentence. Highlight the most important words in bold.

What can this \[Actor] do?

* list the main features of this tool. List multiple ways of input if applicable. List platform advantages. If it's a bundle, mention the steps that the Actor will do for you. Mention specific obstacles this tool is able to overcome, and say upfront how many results you can get for free.

Remember the Apify platform!

Your Actor + the Apify platform. They come as a package. Don't forget to flaunt all the advantages that the platform gives to your solution.

Imagine if there was a solution that is identical to yours but without the platform advantages such as monitoring, access to API, scheduling, possibility of integrations, proxy rotation. Now, if that tool suddenly gained all those advantages it would surely make a selling point out of it. This is how you should be thinking about your tool — as a solution boosted by the Apify platform. Don't ever forget that advantage.

What data can \[Actor] extract?

What data can you extract from \[target website]

* Create a table that represents the main data points that the Actor can extract. You don't have to list every single one, just list the most understandable and relatable ones.

Depending on the complexity of your Actor, you might include one or all three of these sections. It will also depend on what your Actor does. If your Actor has simple input but does a lot of steps for the user under the hood (like a bundle would), you might like to include the "What can this Actor do?" section. If your Actor extracts data, it makes sense to include a section with a table.

### Tutorial section

This could be a simple listed step-by-step section or a paragraph with a link to a tutorial on a blog.

A step-by-step section is reassuring for the user, and it can be a section optimized for Google.

How do I use \[Actor] to scrape website data?

### Pricing

How much will it cost to scrape \[target site]?

How much will scraping \[target site] cost?

Is scraping \[target site] free?

How much does it cost to extract \[target site] data?

Web scraping can be very unpredictable because there are a lot of elements involved in order for the process to be successful: the complexity of the website, proxies, cookies, etc. This is why it's important to set the pricing and scraping volume expectations for your users.

You might think the part above the Actor detail page already indicates pricing. But this paragraph can still be useful. First of all, cost-related questions can show up in Google, if they are SEO optimized. Second, you can use this space to inform and reassure the user about the pricing, give more details about it, or entice them with the promise of very scalable scraping.

* If it's a consumption pricing model (only consumed CUs), you can use this space to set expectations and explain what it means to pay for Compute Units. Similarly, if it's a rental Actor, you can also use this paragraph to set expectations. Talk about the average amount of data that can be scraped per given price. Make it easy for users to imagine how much they will pay for a given dataset. This will also make it easier for them to compare your solution with others on the market price-wise and value-wise.
* If it's price per result, you can extrapolate how many results a user can get on a free plan and also entice them with a larger plan and how many thousands of results they can get with that.
* If it's a bundle that consists of a couple of Actors that are priced differently, you can use this section to talk about the difference between all the Actors involved and how that will affect the final price of a run.
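For the price-per-result case, the extrapolation can be done with simple arithmetic. The sketch below is only an illustration: the function name and the price of $2.50 per 1,000 results are assumptions made up for this example, so substitute your Actor's real pricing.

```python
from decimal import Decimal


def results_for_budget(budget_usd, price_per_1000_usd):
    """Return how many whole results a budget covers at a price per 1,000 results."""
    # Decimal built from strings avoids floating-point rounding surprises.
    budget = Decimal(str(budget_usd))
    price = Decimal(str(price_per_1000_usd))
    if price <= 0:
        raise ValueError("price_per_1000_usd must be positive")
    # int() truncates, so a partially paid-for result is not counted.
    return int(budget * 1000 / price)


# Hypothetical numbers: $5 of free credits at an assumed $2.50 per 1,000 results.
free_results = results_for_budget(5, "2.50")
print(f"Free credits cover about {free_results} results")
```

A figure produced this way can go straight into the pricing paragraph, which makes it easier for users to compare your Actor with other solutions.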

In any case, on top of setting expectations and reassuring users, this paragraph can get into Google. If somebody is Googling "How much does it cost to scrape \[website]", they might come across this part of your README and it will lead them from Google search directly to your Actor's detail page. You don't want to miss that opportunity.

![readme example](/assets/images/readme-7f2dd6436cb16cefbbfcc9c83e10bb98.png)

### Input and output examples

This is what people click on the most in the table of contents of the README. After they are done scrolling through the first part of the README, users are interested in how difficult the input is, what it looks like, and what kind of information they can expect.

**Input**: often a screenshot of the input schema. This is also a way for people to see the platform even before they create an account.

**Output**: can be shown as a screenshot if your output schema looks like something you would want to promote to users. You can also just include a JSON example containing a few objects. Even better if there's continuity between the input example and output example.

If your datasets come out too complex and you want to save your users some scrolling, you can also show multiple output examples: one for reviews, one for contact details, one for ads, etc.
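One way to prepare a short JSON output example is to trim a real dataset down to a few items and a few relatable fields. This is a minimal sketch: the helper name, the sample items, and the field names are all hypothetical and not taken from any specific Actor.

```python
import json


def build_output_example(items, fields, max_items=2):
    """Keep only the chosen fields from the first few dataset items."""
    trimmed = [
        {field: item[field] for field in fields if field in item}
        for item in items[:max_items]
    ]
    return json.dumps(trimmed, indent=2, ensure_ascii=False)


# Hypothetical dataset items, standing in for a real run's output.
sample_items = [
    {"title": "Cozy loft", "price": 120, "stars": 4.8, "internalId": "a1"},
    {"title": "Sea view flat", "price": 95, "stars": 4.6, "internalId": "b2"},
    {"title": "City studio", "price": 70, "stars": 4.2, "internalId": "c3"},
]

print(build_output_example(sample_items, ["title", "price", "stars"]))
```

Calling the helper once per data type (reviews, contact details, ads) gives you the separate output examples mentioned above.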

### Other Actors

Don't forget to promote your other Actors. While our system for Actor recommendation works (you can see related Actors at the bottom of the README), it only works within the same category or for similar names. It won't recommend a completely different Actor from the same creator. Make sure to interconnect your work by taking the initiative yourself. You can mention your other Actors in a list or as a table.

### FAQ, disclaimers, and support

The FAQ is a section where you can keep all the secondary questions that might still come up.

Here are just a few things we usually push to the FAQ section.

* disclaimers and legality
* comparison table between your Actor and similar solutions
* information about the official API and how the scraper is a stand-in for it (SEO)
* questions brought up by the users
* tips on how best to use the Actor
* troubleshooting and mentioning known bugs
* mentioning the Issues tab and highlighting that you're open for feedback and collecting feedback
* mentioning being open to creating a custom solution based on the current one and showing a way to contact you
* interlinking
* mentioning the possibility of transferring data using an API — API tab
* possibility for integrations
* use cases for the data scraped, success stories exemplifying the use of data

## Format of the README

### Markdown

The README has to be written in Markdown. The most important elements are H2 and H3 headings, links to pages, links to images, and tables. For specific formatting, you can try using basic HTML. That will also work. CSS won’t.

### HTML use

You can mix HTML and Markdown freely. The Actor README will render both on the Apify platform. That gives you more freedom to use HTML when needed. Remember, don't try CSS.

### Tone of the README

Apify Store has many Actors in its stock, and it's only growing. The advantage of an Actor is that an Actor can be anything, as versatile or complex as possible. From a single URL type of input to complex features that give customized control over the input parameters to the user. There are Actors that are intended for users who aren't familiar with coding and don't have any experience with it. Ideally, the README should reflect the level of skill one should need to use the Actor.

The tone of the README should make it immediately obvious who the tool is aimed at. If your tool's input includes glob patterns or looking for selectors, it should be immediately visible from the README. Before the user even tries the tool. Trying to simplify this information using simple words with ChatGPT can be misleading to the user. You will attract the wrong audience, and they will end up churning or asking you too many questions.

And vice versa. If your target audience is people with little to no coding skills, who just prefer point-and-click solutions, this should be visible from the README. Speak in regular terms, avoid code blocks or complex information at the beginning unless it's absolutely necessary. This means that, when people land on your Actor detail page, they will have their expectations set from the get-go.

### Length of a README

When working on improving a README, we regularly look at heatmaps that show us where our website visitors spend most of their time. From our experience, most first-time visitors don't scroll past the first 25% of a README. That means that the first quarter of the README is where you want to focus most of your attention if you're trying to persuade the page visitor to try your Actor.

From the point of view of acquisition, the first few sections should make it immediately obvious what the tool is about, how hard it is to use, and who it is created for. This is why, in Apify's READMEs, you can see our first few paragraphs are built in such a way as to explain these things and reassure the visitors that anyone can use these tools.

From the point of view of retention, it doesn't mean you can't have long or complex READMEs or not care for the information beyond the 25% mark. Since the README is also intended to be used as a backup when something goes wrong or the user needs more guidance, your users will come back to it multiple times.

### Images and videos

As for screenshots and GIFs, put them on some sort of image hosting. Your own GitHub repository would be best because you have full control over it. Name the images with SEO in mind and keep them compressed but still of good enough quality. You don't want an image or GIF to take too long to load.

One trick is not only to add images but also to make them clickable. For some reason, people like clicking on images, at least they try to when we look at the heatmaps. You can lead the screenshot clicks towards a signup page, which is possible with Markdown.

If your screenshot seems too big or occupies too much space, smaller size images are possible by using HTML.

To embed a YouTube video, all you have to do is include its URL. No further formatting is needed, the thumbnail will render itself on the README page.
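The image tricks above rely on standard Markdown and HTML syntax. The following sketch builds both snippets as strings so you can paste the output into a README. The helper names and the `example.com` URLs are placeholders, and it assumes the README renderer honors the standard HTML `width` attribute.

```python
def clickable_image(alt_text, image_url, link_url):
    """Markdown for an image that links somewhere when clicked."""
    return f"[![{alt_text}]({image_url})]({link_url})"


def resized_image(alt_text, image_url, width_px):
    """Plain HTML image tag with an explicit width (an attribute, not CSS)."""
    return f'<img src="{image_url}" alt="{alt_text}" width="{width_px}">'


# Placeholder URLs for illustration only.
print(clickable_image(
    "Airbnb scraper input",
    "https://example.com/airbnb-scraper-input.png",
    "https://example.com/sign-up",
))
print(resized_image(
    "Airbnb scraper output",
    "https://example.com/airbnb-scraper-output.png",
    400,
))
```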

Try Carbon for code

If you want to add snippets of code anywhere in your README, you can use https://github.com/carbon-app/carbon.

If you need quick Markdown guidance, check out https://www.markdownguide.org/cheat-sheet/

## README and SEO

Your README is your landing page.

If there were only one thing to remember about READMEs on Apify Store, it would be this. A README on Apify Store is not just dry instructions on how to use your Actor. It has much more potential than that.

In the eyes of Google, your Actor's detail page, aka README, is a full-fledged landing page containing all the most important information to be found and understood by users.

Of course, that all only counts if your README is both well formatted and contains keywords. We'll talk about that part later on.

What makes a good README?

A good README has to be a balance between what you want your page visitors to know, your users to turn to when they run into trouble, and Google to register when it's indexing pages and considering which one deserves to be put up higher.

### Table of contents

The H1 of your page is the Actor name, so you don't have to set that up. Don't add more H1s. README headings should be H2 or H3. H2 headings will make up the table of contents on the right. If you don't want the table to be too crowded, keep the H2s to the basics and push all the longer phrases and questions to H3s. H3s will stay hidden in the accordion in the default state until the visitor hovers their cursor over it. H4 headings can also be included, of course, but they won't show up as a part of the table of contents.
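To preview how your headings would fall into the table of contents before publishing, you can scan the README locally. This is a rough sketch, not how Apify builds the table of contents: the function names and the sample README are made up, and it only recognizes `#`-style headings outside fenced code blocks.

```python
import re

FENCE = "`" * 3  # Markdown code-fence marker
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def heading_outline(readme_text):
    """Return (level, title) pairs for #-style headings outside code fences."""
    outline = []
    in_fence = False
    for line in readme_text.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence  # toggle on every fence line
            continue
        if in_fence:
            continue
        match = HEADING_RE.match(line)
        if match:
            outline.append((len(match.group(1)), match.group(2)))
    return outline


def check_headings(readme_text):
    """Group headings by the roles described above."""
    outline = heading_outline(readme_text)
    return {
        "h1_to_remove": [title for level, title in outline if level == 1],
        "toc_h2": [title for level, title in outline if level == 2],
        "toc_h3": [title for level, title in outline if level == 3],
    }


# Made-up README used only to demonstrate the check.
sample_readme = "\n".join([
    "# My Scraper",
    "## Features",
    "### How much does it cost to scrape the site?",
    "#### Fine print",
])
report = check_headings(sample_readme)
print(report)
```

Anything listed under `h1_to_remove` duplicates the Actor name heading, and the H4 in the sample is left out of the report because it would not appear in the table of contents.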

### Keyword opportunities

Do SEO research for keywords and see how they can fit organically into the text. Priority for H2s and H3s, then the regular text. Add new keyword-heavy paragraphs if you see an opportunity.

The easiest keywords to include are, for example:

* API, as in Instagram API
* data, as in extract Instagram data
* Python, as in extract data in Python
* scrape, as in how to scrape X
* scraping, as in scraping X

Now, could every H2 just say exactly what it is about, without SEO? Of course. You don't have to optimize your H2s and H3s, and are free to call them simply Features, How it works, Pricing, Support, etc. or not even to have many H2s at all and keep it all as one page.

However, the H2s and H3s are what sometimes get into the Google Search results. If you're familiar with the People Also Ask section, that's the best place to match your H2s. They can also get highlighted in the Sitelinks of Google Search Results.

Any part of your README can make it onto Google pages. The intro sentence describing what your Actor is about, a video, a random question. Each one can become a good candidate for those prime Google pages. That's why it's important to structure and write your README with SEO in mind.

### Importance of including a video

If your page has a video, it has a better chance of ranking higher in Google.

## README and input schema

The README should serve as a fallback for your users if something isn't immediately obvious in the input schema. There's also only so much space in the input schema and the tooltips, so naturally, if you want to provide more details about something, e.g. input, formatting, or expectations, you should put it in the README and refer to it from the relevant place in the input schema.

Learn about https://docs.apify.com/academy/actor-marketing-playbook/product-optimization/how-to-create-a-great-input-schema.md

## Readme elements template

1. What does (Actor name) do?

   

   * in 1–2 sentences describe what the Actor does and what it does not do
   * consider adding keywords like API, e.g. Instagram API
   * always have a link to the target website in this section

2. Why use (Actor name)? or Why scrape (target site)?

   

   * How it can be beneficial for the user
   * Business use cases
   * Link to a success story, a business use case, or a blog post.

3. How to scrape (target site)

   

   * Link to "How to…" blogs, if one exists (or suggest one if it doesn't)
   * Add a video tutorial or gif from an ideal Actor run.

Embedding YouTube videos

For better user experience, Apify Console automatically renders every YouTube URL as an embedded video player. Simply add a separate line with the URL of your YouTube video.

* Consider adding a short numbered tutorial, as Google will sometimes pick these up as rich snippets. Remember that this might be in search results, so you can repeat the name of the Actor and give a link, e.g.

4. Is it legal to scrape (target site)?

   * This can be used as a boilerplate text for the legal section, but you should use your own judgment and also customize it with the site name.

   > Our scrapers are ethical and do not extract any private user data, such as email addresses, gender, or location. They only extract what the user has chosen to share publicly. We therefore believe that our scrapers, when used for ethical purposes by Apify users, are safe. However, you should be aware that your results could contain personal data. Personal data is protected by the GDPR in the European Union and by other regulations around the world. You should not scrape personal data unless you have a legitimate reason to do so. If you're unsure whether your reason is legitimate, consult your lawyers. You can also read our blog post on the legality of web scraping.

5. Input

   * Each Actor detail page has an input tab, so you just need to refer to that. If you like, you can add a screenshot showing the user what the input fields will look like.
   * This is an example of how to refer to the input tab:

   > Twitter Scraper has the following input options. Click on the input tab for more information.

6. Output

   * Mention "You can download the dataset extracted by (Actor name) in various formats such as JSON, HTML, CSV, or Excel."
   * Add a simplified JSON dataset example, like here https://apify.com/compass/crawler-google-places#output-example

7. Tips or Advanced options section

   * Share any tips on how to best run the Actor, such as how to limit compute unit usage, get more accurate results, or improve speed.

If you want some general tips on how to make a GitHub README that stands out, check out these guides. Not everything in there will be suitable for an Apify Actor README, so you should cherry-pick what you like and use your imagination.

## Resources

https://towardsdatascience.com/build-a-stunning-readme-for-your-github-profile-9b80434fe5d7

https://yushi95.medium.com/how-to-create-a-beautiful-readme-for-your-github-profile-36957caa711c


---

# Importance of Actor URL

**Actor URL (or technical name, as we call it), is the page URL of the Actor shown on the web. When you're creating an Actor, you can set the URL yourself along with the Actor name. Here are best practices on how to do it well.**

![actor url example](/assets/images/what-is-actor-url-7560efc6bb6906af078c2cef44100b93.png)

***

## Why is Actor URL so important?

The Actor URL plays a crucial role in SEO. Google doesn't just read the Actor's name or README; it also analyzes the URL. *The URL is one of the first signals to Google about the content of your page*, whether it's a product listing, a tool, a blog post, a landing page for a specific offering, or something else entirely. Therefore, it's important to know how to use this shorthand to your advantage and clearly communicate to Google what your page offers.

Choose the URL carefully

This part of the manual is only applicable to new Actors. *Once set, existing Actor URLs shouldn't change*.

## How to choose a URL

The right naming can propel or hinder the success of the Actor on Google Search. Just as naming your Actor is important, so is choosing its URL. The only difference is, once set, the URL is intended to be permanent (more on this https://docs.apify.com/academy/actor-marketing-playbook/actor-basics/importance-of-actor-url.md). What's the formula for the best Actor URL?

### Brainstorming

What does your Actor do? Does it scrape, find, extract, automate, connect? Think of these when you are looking for a name. You might already have a code name in mind, but it’s essential to ensure it stands out and is distinct from similar names—both on Google and on Apify Store.

### Matching URL and name

The easiest way is to make sure the Actor name and the technical name match. As in TikTok Scraper (tiktok-scraper) or Facebook Data Extractor (facebook-data-extractor). But they can also be different.
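If you want the technical name to mirror the Actor name, the conversion can be sketched as a small slug function. The helper name is hypothetical, and the assumption that a technical name uses only lowercase letters, digits, and hyphens is ours; check what Apify Console actually accepts.

```python
import re


def technical_name(actor_name):
    """Turn a human-readable Actor name into a hyphenated, lowercase slug."""
    slug = actor_name.lower()
    # Collapse every run of other characters into a single hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", slug)
    return slug.strip("-")


print(technical_name("TikTok Scraper"))
print(technical_name("Facebook Data Extractor"))
```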

### SEO

The name should reflect not only what the Actor does (or what website it targets), but also what words people use when they search for it. This is why it's also important to do SEO research to see which keywords work best for the topic. Ideally, the URL should include a keyword that has low complexity (low competition) but high traffic (high demand).

Learn more about SEO research and the best tools for it here: https://docs.apify.com/academy/actor-marketing-playbook/promote-your-actor/seo.md

### Inspiration in Apify Store

Explore Store URLs of similar Actors. But avoid naming your Actor too similarly to what already exists, because of these two reasons:

1. There’s evidence that new URLs that are similar to existing ones can have drastically different levels of success. The first URL might thrive while a similar one published later struggles to gain traction. For example, *onedev/pentagon-scraper* was published first and has almost 100x more traction than *justanotherdev/pentagon-scraper*. It will be very hard for the latter to beat the former. The reason for this is that Google operates on a "first come, first served" basis, and once a page is established, it is very hard to make Google pay attention to new pages with a similar name.
2. As Apify Store is growing, it's important to differentiate yourself from the competition. A different URL is just one more way to do that. If a person is doing research on Store, they will be less likely to get confused between two tools with the same name.

### Length of URL

Ideally, keep it under four words. As in, *Facebook Data Extractor* (*facebook-data-extractor*), not (*facebook-data-meta-online-extractor-light*). If the name is long and you're trying to match it with your URL, keep only the most essential words for the URL.

### Variations

It can be a long-tail keyword with the tool type in it: scraper, finder, extractor. But you can also consider keywords that include terms like API, data, and even variations of the website name. Check out what keywords competitors outside of Apify Store are using for similar tools.

### Nouns and adjectives

One last tip on this topic is to *avoid adjectives and verbs*. Your page is about a tool, so keep it to nouns. Anything regarding what the tool does (scrape, automate, import) and what it's like (fast, light, best) can be expressed in the Actor's name, not the Actor's URL. Adding an adjective or verb like that either does nothing for SEO or might even damage the SEO chances of the page.
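The length and wording tips can be turned into a quick self-check before you commit to a URL. This is only a sketch: the function name is hypothetical, the flagged-word list contains just the examples mentioned above, and "under four words" is read as at most three.

```python
# Example verbs and adjectives taken from the tips above; extend as needed.
FLAGGED_WORDS = {"scrape", "automate", "import", "fast", "light", "best"}
MAX_WORDS = 3  # "under four words"


def review_technical_name(name):
    """Return a list of problems found in a hyphen-separated technical name."""
    words = [word for word in name.lower().split("-") if word]
    problems = []
    if len(words) > MAX_WORDS:
        problems.append(f"too long: {len(words)} words, aim for {MAX_WORDS} or fewer")
    flagged = [word for word in words if word in FLAGGED_WORDS]
    if flagged:
        problems.append("verbs or adjectives to drop: " + ", ".join(flagged))
    return problems


print(review_technical_name("facebook-data-extractor"))
print(review_technical_name("facebook-data-meta-online-extractor-light"))
```

An empty list means the name passes both checks. It says nothing about keyword demand, which still needs SEO research.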

## Why you shouldn’t change your Actor URL

Don't change the URL

There's only one rule about Actor URL: don't change the URL. The Actor's name, however, can be changed without any problems.

Once set, the page URL should not be changed, for two important reasons:

* Google dislikes changes to URLs. Once your Actor has built up keyword associations and familiarity with Google, regaining that standing after a URL change can be challenging. You will have to start from scratch.
* Current integrations will break for your Actor's users. This is essential for maintaining functionality.

If you absolutely have to change the URL, you will have to communicate that fact to your users.

💡 Learn more about the easiest ways to communicate with your users: \[Emails to Actor users]

## How and where to set the Actor URL

In Console. Open the **Actor's page**, then click on **…** in the top right corner, and choose ✎ **Edit name or description**. Then set the URL in the **Unique name** ✎ field and click **Save**.

![set actor url in console](/assets/images/how-and-where-to-set-the-actor-url-console-18c354804a82c1ab93f59d39cabfcc97.png)

![set the actor url](/assets/images/how-and-where-to-set-the-actor-url-5f4f6293d3389f468863c78d086c97ee.png)

## FAQ

#### Can Actor URL be different from Actor name?

Yes. While they can be the same, they don’t have to be. For the best user experience, keeping them identical is recommended, but you can experiment with the Actor's name. Just avoid changing the Actor URL.

#### Can I change a very fresh Actor URL?

Yes, but act quickly. It takes Google a few days to start recognizing your page. For this reason, if you really have to, *it is best to change the Actor's URL in the first few days*, before you build a steady user base and rapport with Google.

#### How long does it take Google to pick up on the new URL?

Google reindexes Apify web pages almost every day. It usually takes 3 to 7 days for it to pick up a new URL, but it can also happen within a day.

#### Can I use the identical technical name as this other Actor?

Yes, you can. But it will most likely lower your chances of being noticed by Google.

#### Does changing my Apify account name affect the Actor URL?

Yes. If you're changing from *justanotherdev/pentagon-scraper* to *dev/pentagon-scraper*, it counts as a new page. Essentially, the consequences are the same as after changing the technical name of the Actor.


---

# Name your Actor

**Apify's standards for Actor naming. Learn how to choose the right name for scraping and automation Actors and how to optimize your Actor for search engines.**

***

Naming your Actor can be tricky, especially after you’ve worked hard on it. To help people find your Actor and make it stand out, we’ve set some naming guidelines. These will help your Actor rank better on Google and keep things consistent on https://apify.com/store.

Ideally, you should choose a name that clearly shows what your Actor does and includes keywords people might use to search for it.

## Parts of Actor naming

Your Actor's name consists of four parts: actual name, SEO name, URL, and GitHub repository name.

* Actor name (name shown in Apify Store), e.g. *Booking Scraper*.

* Actor SEO name (name shown on Google Search, optional), e.g. *Booking.com Hotel Data Scraper*.

  * If the SEO name is not set, the Actor name will be the default name shown on Google.

* Actor URL (technical name), e.g. *booking-scraper*.

  * More on it on the https://docs.apify.com/academy/actor-marketing-playbook/actor-basics/importance-of-actor-url.md page.

* GitHub repository name (best to keep it similar to the other ones, for convenience), e.g. *actor-booking-scraper*.

## Actor name

The Actor name is the human-readable name of your Actor. It is the most important real estate from an SEO standpoint. It should exactly match the most likely search query that potential users of your Actor will use. At the same time, it should give your Actor a clear name for people who will use it every day.

tip

Your Actor's name should be *40-50 characters* long. You can change your Actor name freely in Apify Console.

### Actor name vs. SEO name

There's an option to step away from your Actor's name for the sake of search engine optimization — the Actor SEO name. The Actor name and Actor SEO name serve different purposes:

* *Actor name*: this is the name visible in Apify Store and Console. It should be easy for users to understand and quickly show what your Actor does. It’s about attracting users who browse the Store.

  ![actor name example](/assets/images/actor-name-68e32093948ca0b704dda5e5672bf4d2.png)

* *Actor SEO name*: this is the name that appears in search engine results. It should include keywords people might search for to find your Actor. It’s about improving visibility on search engines and encouraging users to click on your link.

  ![actor seo name example](/assets/images/actor-seo-name-1a71276bdf8a0d33b3be5d33ba264288.png)

For example:

* *Actor name*: YouTube Scraper
* *Actor SEO name*: YouTube data extraction tool for video analysis

Here, the SEO name uses extra keywords to help people find it through search engines, while the Actor name is simpler and easier for users to understand and find on Apify Store.

💡 When creating the SEO name, focus on using relevant keywords that potential users might search for. It should still match what your Actor does. More about SEO name and description: \[Actor description and SEO description]

### Actor name vs. technical name

The Actor name and technical name (or URL) have different uses:

* *Actor name*: this is the name users see on Apify Store and Console. It’s designed to be user-friendly and should make the Actor's purpose clear to anyone browsing or searching for it.
* *Technical name*: this is a simplified, URL-friendly version used in technical contexts like API calls and scripts. This name should be concise and easily readable. Once set, it should not be changed as it can affect existing integrations and cause broken links.

For example:

* *Actor name*: Google Search Scraper
* *Technical name*: google-search-scraper

The Actor name is user-friendly and descriptive, while the technical name is a clean, URL-compatible version. Note that the technical name does not include spaces or special characters to ensure it functions properly in technical contexts.

important

This is important for SEO! Once set, the technical name should not be changed. Make sure you finalize this name early in development. More on why here: \[Importance of Actor URL]

## Best practices for naming

### Brainstorming

What does your Actor do? Does it scrape, find, extract, automate, connect, or upload? When choosing a name, ensure it stands out and is distinct from similar names both on Google and on Apify Store.

* *Use nouns and variations*: use nouns like "scraper", "extractor", "downloader", "checker", or "API" to describe what your Actor does. You can also include terms like "data" or variations of the website name.
* *Include key features*: mention unique features or benefits to highlight what sets your Actor apart.
* *Check for uniqueness*: ensure your name isn’t too similar to existing Actors to avoid confusion and help with SEO.

### Match name and URL

The simplest approach is to make all names match. For example, TikTok Ads Scraper (tiktok-ads-scraper) or Facebook Data Extractor (facebook-data-extractor). However, variations are acceptable.

### Name length

Keep the name concise, ideally fewer than four words. For instance, Facebook Data Extractor is preferable to Facebook Meta Data Extractor Light.

### Check Apify Store for inspiration

Look at the names of similar Actors on Apify Store, but avoid naming your Actor too similarly. By choosing a unique name, you can stand out from the competition. This will also reduce confusion and help users easily distinguish your Actor.

### Keep SEO in mind

Even though you can set a different variation for the SEO name specifically, consider doing a bit of research when setting the regular name as well. The name should reflect what the Actor does and the keywords people use when searching for it. If the keywords you find sound too robotic, save them for the SEO name. But if they sound like something you'd search for, they're good candidates for a name.

You can also check the keywords competitors use for similar tools outside Apify Store.

### Occasionally experiment

You can test and refine your SEO assumptions by occasionally changing the SEO name. This allows you to track how changes to names affect search rankings and user engagement. Changing the regular name is not forbidden but still less desirable since it can confuse your existing users and also affect SEO.

## Naming examples

### Scraping Actors

✅:

* Technical name (Actor's name in Apify Console, https://console.apify.com/): `${domain}-scraper`, e.g. youtube-scraper.
* Actor name: `${Domain} Scraper`, e.g. YouTube Scraper.
* Name of the GitHub repository: `actor-${domain}-scraper`, e.g. actor-youtube-scraper.

❌:

* Technical name: `the-scraper-of-${domain}`, e.g. the-scraper-of-youtube.
* Actor name: `The Scraper of ${Domain}`, e.g. The Scraper of YouTube.
* GitHub repository: `actor-the-scraper-of-${domain}`, e.g. actor-the-scraper-of-youtube.

If your Actor only caters to a specific service on a domain (and you don't plan on extending it), add the service to the Actor's name.

For example:

* Technical name: `${domain}-${service}-scraper`, e.g. google-search-scraper.
* Actor name: `${Domain} ${Service} Scraper`, e.g. Google Search Scraper (https://apify.com/apify/google-search-scraper).
* GitHub repository: `actor-${domain}-${service}-scraper`, e.g. actor-google-search-scraper.
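The naming patterns above are mechanical enough to script. Below is a minimal Python sketch that derives all three names from a domain and an optional service. The `slugify` and `build_names` helpers are hypothetical illustrations, not part of any Apify tooling.

```python
import re


def slugify(text: str) -> str:
    """Lowercase the text and join its alphanumeric runs with hyphens."""
    return "-".join(re.findall(r"[a-z0-9]+", text.lower()))


def build_names(domain: str, service: str = "") -> dict:
    """Apply the `${domain}-${service}-scraper` pattern from this guide.

    `domain` and `service` are display names, e.g. "Google" and "Search".
    This helper is hypothetical and only mirrors the examples above.
    """
    parts = [part for part in (domain, service) if part]
    technical = slugify(" ".join(parts + ["scraper"]))
    return {
        "technical_name": technical,
        "actor_name": " ".join(parts + ["Scraper"]),
        "github_repository": f"actor-{technical}",
    }


print(build_names("YouTube"))
print(build_names("Google", "Search"))
```

Deriving the technical name and the repository name from the same slug keeps all three names aligned, which is the simplest approach recommended earlier in this guide.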

### Non-scraping Actors

Naming for non-scraping Actors is more liberal. Being creative and considering SEO and user experience are good places to start. Think about what your users will type into a search engine when looking for your Actor. What is your Actor's function?

Below are examples for the https://apify.com/lukaskrivka/google-sheets Actor.

✅:

* Technical name: google-sheets.
* Actor name: Google Sheets Import & Export.
* GitHub repository: actor-google-sheets.

❌:

* Technical name: import-to-and-export-from-google-sheets.
* Actor name: Actor for Importing to and Exporting from Google Sheets.
* GitHub repository: actor-for-import-and-export-google-sheets.

Renaming your Actor

You may rename your Actor freely, except when it comes to the Actor URL. Remember to read https://docs.apify.com/academy/actor-marketing-playbook/actor-basics/importance-of-actor-url.md to find out why!


---

# Emails to Actor users

**Getting users is one thing, but keeping them is another. While emailing your users might not seem like a typical marketing task, any seasoned marketer will tell you it’s essential. It’s much easier to keep your current users happy and engaged than to find new ones. This guide will help you understand when and how to email your users effectively.**

***

## Whom and where to email

You can email the audience of a specific Actor directly from Apify Console. Go to **Messaging > Emails > Compose new**. From there, select the Actor whose users you want to email, write a subject line, and craft your message. An automatic signature will be added to the end of your email.

## How to write a good email

Emails can include text, formatting, images, GIFs, and links. Here are four main rules for crafting effective emails:

1. Don’t email users without a clear purpose.
2. Keep your message concise and friendly.
3. Make the subject line direct and to the point. Consider adding an emoji to give users a hint about the email’s content.
4. Use formatting to your advantage. Console emails support Markdown, so use bold, italics, and lists to highlight important details.

Additional tips:

* Show, don’t tell — use screenshots with arrows to illustrate your points.
* If you’re asking users to take action, include a direct link to what you're referring to.
* Provide alternatives if it suits the situation.
* Always send a preview to yourself before sending the email to all your users.

## When to email users

Our general policy is to avoid spamming users with unnecessary emails. We contact them only if there's a valid reason. Here are the most common good reasons to contact users of an Actor:

### 1. Introducing a new feature of the Actor

For example: a new filter, faster scraping, changes to the input or output schema, or a new Integration.

> ✉️ 🏙️ Introducing Deep city search for Tripadvisor scrapers
>
> Hi,
>
> Tired of Tripadvisor's 3000 hotels-per-search limit? We've got your back. Say hello to our latest baked-in feature: Deep city search. Now, to get all results from a country-wide search you need to just set Max search results above 3000, and watch the magic happen.
>
> A bit of context: while Tripadvisor never limited the search for restaurants or attractions, hotel search was a different case; it always capped at 3000. Our smart search is designed to overcome that limit by including every city within your chosen location. We scrape hotels from each one, ensuring no hidden gems slip through the cracks. This feature is available for https://console.apify.com/actors/dbEyMBriog95Fv8CW/console and https://console.apify.com/actors/qx7G70MC4WBE273SM/console.
>
> Get ready for an unbeatable hotel-hunting experience. Give it a spin, and let us know what you think!

Introduce and explain the features, add a screenshot of a feature if it will show in the input schema, and ask for feedback.

### 2. Actor adapting to the changes of the website it scrapes

A common situation in web scraping that's out of your control.

> ✉️ 📣 Output changes for Facebook Ads Scraper
>
> Hi,
>
> We've got some news regarding your favorite Actor – https://console.apify.com/actors/JJghSZmShuco4j9gJ/console. Recently, Facebook Ads have changed their data format. To keep our Actor running smoothly, we'll be adapting to these changes by slightly tweaking the Actor Output. Don't worry; it's a breeze! Some of the output data might just appear under new titles.
>
> This change will take place on October 10; please make sure to remap your integrations accordingly.
>
> Need a hand or have questions? Our support team is just one friendly message away.

Inform users about the reason for the changes and how they affect both the users and the Actor, and give them the date when the change takes effect.

### 3. Actor changing its payment model (from rental to pay-per-result, for example)

Email 1 (before the change, warning about deprecation).

> ✉️ 🛎 Changes to Booking Scraper
>
> Hi,
>
> We’ve got news regarding the Booking scraper you have been using. This change will happen in two steps:
>
> 1. On September 22, we will deprecate it, i.e., new users will not be able to find it in Store. You will still be able to use it though.
> 2. At the end of October, we will unpublish this Actor, and from that point on, you will not be able to use it anymore.
>
> Please use this time to change your integrations to our new https://apify.com/voyager/booking-scraper.
>
> That’s it! If you have any questions or need more information, don’t hesitate to reach out.

Warn the users about the deprecation and future unpublishing, add extra information about related Actors if applicable, and give them the steps and the date when the change takes effect.

Email 2 (after the change, warning about unpublishing)

> ✉️ **📢 Deprecated Booking Scraper will stop working as announced 📢**
>
> Hi,
>
> Just a heads-up: today, the deprecated https://console.apify.com/actors/5T5NTHWpvetjeRo3i/console you have been using will be completely unpublished as announced, and you will not be able to use it anymore.
>
> If you want to continue to scrape Booking.com, make sure to switch to the https://apify.com/voyager/booking-scraper.
>
> For any assistance or questions, don't hesitate to reach out to our support team.

Remind users to switch to the Actor with a new model.

### 4. After a major issue

Actor downtime, performance issues, Actor directly influenced by platform hiccups.

> ✉️ **🛠️ Update on Google Maps Scraper: fixed and ready to go**
>
> Hi,
>
> We've got a quick update on the Google Maps Scraper for you. If you've been running the Actor this week, you might have noticed some hiccups — scraping was failing for certain places, causing retries and overall slowness.
>
> We apologize for any inconvenience this may have caused you. The **good news is those performance issues are now resolved**. Feel free to resurrect any affected runs using the "latest" build, should work like a charm now.
>
> Need a hand or have questions? Feel free to reply to this email.

Apologize to users and/or let them know you're working on it or that everything is fixed now. This approach helps maintain trust and reassures users that you're addressing the situation.

tip

It might be an obvious tip, but if you're not great at emails, just write a short draft and ask ChatGPT to polish it. Play with the style until you find the one that suits you. You can even create templates for each situation. If ChatGPT is being too wordy, you can ask it to write at a 9th or 10th-grade level, and it will use simpler words and sentences.

## Emails vs. newsletters

While sending an email is usually a quick way to address immediate needs or support for your users, newsletters can be a great way to keep everyone in the loop on a regular basis. Instead of reaching out every time something small happens, newsletters let you bundle updates together.

Unless it's urgent, it’s better to wait until you have 2 or 3 pieces of news and share them all at once. Even if those updates span across different Actors, it’s perfectly fine to send one newsletter to all relevant users.

Here are a few things you can include in your newsletter:

* updates or new features for your Actors or Actor-to-Actor Integrations
* an invitation to a live webinar or tutorial session
* asking your users to upvote your Actor, leave a review or a star
* a quick feedback request after introducing new features
* spotlighting a helpful blog post or guide you wrote or found
* sharing success stories or use cases from other users
* announcing a promotion or a limited-time discount
* links to your latest YouTube videos or tutorials

Newsletters are a great way to keep your users engaged without overwhelming them. Plus, it's an opportunity to build a more personal connection by showing them you’re actively working to improve the tools they rely on.

## Emailing a separate user

There may be times when you need to reach out to a specific user, whether it’s to address a unique situation, ask a question that doesn’t fit the public forum of the **Issues tab**, or explore a collaboration opportunity. While there isn’t a quick way to do this through Apify Console just yet, you can ensure users can contact you by **adding your email or other contact info to your Store bio**. This makes it easy for them to reach out directly.

✍🏻 Learn best practices for using your Store bio to connect with your users: https://docs.apify.com/academy/actor-marketing-playbook/interact-with-users/your-store-bio.md.


---

# Handle Actor issues

**Once you publish your Actor in Apify Store, it opens the door to new users, feedback, and… issue reports. Users can create issues and add comments after trying your Actor. But why is this space so important?**

***

## What is the Issues tab?

The Issues tab is a dedicated section on your Actor’s page where signed-in users can report problems, share feedback, ask questions, and have conversations with you. You can manage each issue thread individually, and the whole thread is visible to everyone. The tab is divided into three categories: **Open**, **Closed**, and **All**, and it shows how long each response has been there. While only signed-in users can post and reply, all visitors can see the interactions, giving your page a transparent and welcoming vibe.

Keep active

On the web, your average 🕑 **Response time** is calculated and shown in your Actor Metrics. The purpose of this metric is to make it easy for potential users to see how active you are and how well-maintained the Actor is.

You can view all the issues related to your Actors by going to **Actors** > **Issues** (https://console.apify.com/actors?tab=issues) in Apify Console. Users can get automatic updates on their reported issues or subscribe to issues they are interested in, so they stay informed about any responses. When users report an issue, they’re encouraged to share their run, which helps you get the full context and solve the problem more efficiently. Note that shared runs aren’t visible on the public Actor page.

## What is the Issues tab for?

The tab is a series of conversations between you and your users. There are existing systems like GitHub for that. Why create a separate system like an Issues tab? Since the Issues tab exists both in private space (Console) and public space (Actor's page on the web), it can fulfill two different sets of purposes.

### Issues tab in Apify Console

Originally, the Issues tab was only available in Apify Console, and its main goals were:

* Convenience: a single space to hold the communication between you and your users.
* Unity and efficiency: make sure multiple users don't submit the same issue through multiple channels or multiple times.
* Transparency: make sure users have their issues addressed publicly and professionally. You can’t delete issues, you can only close them, so there's a clear record of what's been resolved and how.
* Quality of service and innovation: make sure the Actor gets fixed and continuously improved, and users get the quality scraping services they pay for.

### Issues tab on the web

Now that the Issues tab is public and on the web, it also serves other goals:

* Credibility: new users can check how active and reliable you are by looking at the issues and your average 🕑 **Response time** even before trying your Actor. It also sets expectations for when to expect a response from you.
* Collaboration: developers can learn from each other’s support styles, which motivates everyone to maintain good interactions and keep up good quality work.
* SEO boost: every issue now generates its own URL, potentially driving more keyword traffic to your Actor's page.

## Example of a well-managed Issues tab

Check out how the team behind the **Apollo.io leads scraper** manages their Issues tab (https://apify.com/curious_coder/apollo-io-scraper/issues/open) for a great example of professional responses and quick problem-solving.

Note that this Actor is a rental, so users expect a high-quality service.

![issues tab example](/assets/images/issues-tab-example-f6201ae99bc15f12f5e04c19857711fa.png)

warning

Once your Actor is public, you’re required to have an Issues tab.

## SEO for the Issues tab

Yes, you read that right! The public Issues tab can boost your search engine visibility. Each issue now has its own URL, which means every report could help your Actor rank for relevant keywords.

When we made the tab public, we took inspiration from StackOverflow’s SEO strategy. Even though StackOverflow started as a Q\&A forum, its strong SEO has been key to its success. Similarly, your Actor’s Issues tab can help bring in more traffic, with each question and answer potentially generating more visibility. This makes it easier for users to find solutions quickly.

## Tips for handling Actor issues

1. *Don’t stay silent*

   Respond quickly, even if it’s just a short note. If an issue takes weeks to resolve, keep the user in the loop. A quick update prevents frustration and shows the user (and others following it) that you’re actively working on solving the issue.

2. *Encourage search to avoid duplication*

   Save time by encouraging users to search for existing issues before submitting new ones. If a similar issue exists, they can follow that thread for updates instead of creating a new one.

3. *Encourage reporters to be specific*

   The more context, the better! Ask users to share details about their run, which helps you diagnose issues faster. If needed, remind them that runs are shared privately, so sensitive data won’t be exposed.

4. *Use screenshots and links*

   The same goes for your side. Screenshots and links to specific runs make your answers much clearer. It’s easier to walk the user through a solution if they can see what you’re referencing.

5. *Structure issue reporting*

   As you get more experienced, you’ll notice common types of issues: bugs, feature requests, questions, reports, misc. Sorting incoming issues into these categories lets you prioritize and respond faster.

6. *Have ready answers for common categories*

   Once you recognize recurring types of issues, have pre-prepared responses. For example, if it’s a bug report, you might already have a troubleshooting guide you can link to, or if it’s a feature request, you can share a rough development timeline.

7. *Be polite and precise*

   Politeness goes a long way! Make sure your responses are respectful and straight to the point. It helps to keep things professional, even if the issue seems minor.

https://rewind.com/blog/best-practices-for-using-github-issues/


---

# Your Apify Store bio

## Your Apify Store bio and Store “README”

To help our community showcase their talents and projects, we introduced public profile pages for developers. On a dedicated page, you can showcase contact info, a summary of important Actor metrics (like total users, response time, and success rates), and all of your public Actors. We took inspiration from freelance platforms.

This space is all about helping you shine and promote your tools and skills. Here’s how you can use it to your advantage:

* Share your contact email, website, GitHub, X (Twitter), LinkedIn, or Discord handles.
* Summarize what you’ve been doing in Apify Store, your main skills, big achievements, and any relevant experience.
* Offer more ways for people to connect with you, such as links for booking a meeting, discounts, a subscription option for your email newsletter, or your YouTube channel or blog.
  
  * You can even add a Linktree to keep things neat.
* Highlight your other tools on different platforms.
* Get creative by adding banners and GIFs to give your profile some personality.

Everything is neatly available under a single URL, making it easy to share.

Need some inspiration? Check out examples of how others are using their Store bio and README. You can set yours up by heading to **Settings > Account > Profile.**

https://apify.com/anchor

https://apify.com/jupri

https://apify.com/apidojo

https://apify.com/curious_coder

https://apify.com/epctex

https://apify.com/microworlds


---

# Actor bundles

**Learn what an Actor bundle is, explore existing examples, and discover how to promote them.**

***

## What is an Actor bundle?

If an Actor is an example of web automation software, what is an Actor bundle? An Actor bundle is basically a chain of multiple Actors unified by a common use case. Bundles can include both scrapers and automation tools, and they are usually designed to achieve an overarching goal related to scraping or automation.

The concept of an Actor bundle originated from frequent customer requests for comprehensive tools. For example, someone would ask for a Twitter scraper that also performs additional tasks, or for a way to find all profiles of the same public figure across multiple social media platforms without needing to use each platform separately.

For example, consider a bundle that scrapes company reviews from multiple platforms, such as Glassdoor, LinkedIn, and Indeed. Typically, you would need to use several different scrapers and then consolidate the results. But this bundle would do it all in one run, once provided with the name of the company. Or consider a bundle that scrapes all posts and comments of a given profile, and then produces a sentiment score for each scraped comment.

The main advantage of an Actor bundle is its ease of use. The user inputs a keyword or a URL, and the Actor triggers all the necessary Actors sequentially to achieve the desired result. The user is not expected to use each Actor separately and then process and filter the results themselves.
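The sequential flow described above can be sketched in a few lines of Python. This is a hypothetical outline, not how any existing bundle is implemented: `run_bundle`, the `query` input key, the Actor names, and the `call_actor` callable are all assumptions. In a real bundle, `call_actor` would wrap a call to the Apify API; here it is stubbed so the sketch runs offline.

```python
from typing import Any, Callable, Dict, List

# Hypothetical signature: takes an Actor name and its input, returns its output items.
CallActor = Callable[[str, Dict[str, Any]], List[Dict[str, Any]]]


def run_bundle(keyword: str, actor_names: List[str], call_actor: CallActor) -> List[Dict[str, Any]]:
    """Run each Actor in order with the same keyword and merge the results."""
    merged: List[Dict[str, Any]] = []
    for actor_name in actor_names:
        # The "query" input key is an assumption; real Actors define their own input.
        items = call_actor(actor_name, {"query": keyword})
        # Tag every item with its source so the user does not have to sort results out.
        merged.extend({**item, "source": actor_name} for item in items)
    return merged


def fake_call_actor(actor_name: str, run_input: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Offline stand-in for a real Actor call; returns one made-up item."""
    return [{"text": f"review of {run_input['query']}"}]


results = run_bundle(
    "Acme Corp",
    ["glassdoor-reviews", "indeed-reviews"],  # placeholder names, not real Actor IDs
    fake_call_actor,
)
print(results)
```

Passing the Actor call in as a function keeps the orchestration logic separate from the platform client, so the same loop can be tested with a stub before any paid runs are triggered.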

### Examples of bundles

🔍 https://apify.com/tri_angle/social-media-finder searches for profiles on 13 social media sites given just a (nick)name.

🍝 https://apify.com/tri_angle/restaurant-review-aggregator gets restaurant reviews from Google Maps, DoorDash, Uber Eats, Yelp, Tripadvisor, and Facebook in one place.

🤔 https://apify.com/tri_angle/social-media-sentiment-analysis-tool not only collects comments from Facebook, Instagram, and TikTok but also performs sentiment analysis on them. It unites post scrapers, comments scrapers and a text analysis tool.

🦾 https://apify.com/tri_angle/wcc-pinecone-integration scrapes a website and stores the data in a Pinecone database to build and improve your own AI chatbot assistant.

🤖 https://apify.com/tri_angle/pinecone-gpt-chatbot combines OpenAI's GPT models with Pinecone's vector database, which simplifies creating a GPT chatbot.

As you can see, they vary in complexity and range.

***

## Caveats

### Pricing model

Since bundles are still relatively experimental, profitability is not guaranteed and will depend heavily on the complexity of the bundle.

However, if you have a solid idea for a bundle, don’t hesitate to reach out. Prepare your case, write to our support team, and we’ll help determine if it’s worth it.

### Specifics of bundle promotion

First of all, when playing with the idea of creating a bundle, always check the keyword potential. Sometimes, there are true keyword gems just waiting to be discovered, with high search volume and little competition.

However, bundles may face the challenge of being "top-of-the-funnel" solutions. People might not search for them directly because they don't have a specific keyword in mind. For instance, someone is more likely to search for an Instagram comment scraper than imagine a bundle that scrapes comments from 10 different platforms, including Instagram.

Additionally, Google tends to favor tools with rather focused descriptions. If your tool offers multiple functions, it can send mixed signals that may conflict with each other rather than accumulate.

Sometimes, even though a bundle can be a very innovative tool product-wise, it can be hard to market from an SEO perspective and match the search intent.

In such cases, you may need to try different marketing and promotion strategies. Once you’ve exhausted every angle of SEO research, be prepared to explore non-organic marketing channels like Product Hunt, email campaigns, community engagement, Reddit, other social media, your existing customer base, word-of-mouth promotion, etc.

Remember, bundles originated as customized solutions for specific use cases - they were not primarily designed to be easily found.

This is also an opportunity to tell a story rather than just presenting a tool. Consider writing a blog post about how you created this tool, recording a video, or hosting a live webinar. If you go this route, it’s important to emphasize how the tool was created and what a technical feat it represents.

That said, don’t abandon SEO entirely. You can still capture some SEO value by referencing the bundle in the READMEs of the individual Actors that comprise it. For example, if a bundle collects reviews from multiple platforms, potential users are likely to search for review scrapers for each specific platform—Google Maps reviews scraper, Tripadvisor reviews scraper, Booking reviews scraper, etc. These keywords may not lead directly to your review scraping bundle, but they can guide users to the individual scrapers, where you can then present the bundle as a more comprehensive solution.

***

## Resources

Learn more about Actor Bundles: https://blog.apify.com/apify-power-actors/


---

# How to create a great input schema

**Optimizing your input schema. Learn to design and refine your input schema with best practices for a better user experience.**

***

## What is an input schema

You've succeeded. Your user has:

1. Found your Actor on Google.
2. Explored the Actor's landing page.
3. Decided to try it.
4. Created an Apify account.

Now they’re on your Actor's page in Apify Console. The SEO fight is over. What’s next?

Your user is finally one-on-one with your Actor — specifically, its input schema. This is the moment when they try your Actor and decide whether to stick with it. The input schema is your representative here, and you want it to work in your favor.

Technically, the input schema is a `JSON` object with various field types supported by the Apify platform, designed to simplify the use of the Actor. Based on the input schema you define, the Apify platform automatically generates a *user interface* for your Actor.
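As a rough illustration, here is a minimal input schema built as a Python dictionary and serialized to JSON. The field names (`startUrls`, `maxItems`), texts, and URLs are made up for this sketch, and the HTML link in the description is an assumption about what renders in the form. Check the input schema specification for the authoritative list of supported properties.

```python
import json

# A hypothetical schema for an imaginary scraper; all texts and URLs are placeholders.
input_schema = {
    "title": "Example Scraper input",
    "type": "object",
    "schemaVersion": 1,
    # Shown at the top of the form; basic HTML such as links is assumed to render here.
    "description": (
        "Paste one or more URLs and click Start. "
        '<a href="https://example.com/guide" target="_blank">Read the guide</a>.'
    ),
    "properties": {
        "startUrls": {
            "title": "Start URLs",
            "type": "array",
            "editor": "requestListSources",
            "description": "Pages to scrape.",
            # Prefill shows the expected format instead of describing it.
            "prefill": [{"url": "https://example.com/products"}],
        },
        "maxItems": {
            "title": "Max items",
            "type": "integer",
            "description": "Stop after this many results.",
            "minimum": 1,
            # A low prefilled number keeps the first run cheap and fast.
            "prefill": 10,
        },
    },
    "required": ["startUrls"],
}

# In an Actor project, this JSON typically lives in `.actor/input_schema.json`.
schema_json = json.dumps(input_schema, indent=2)
print(schema_json)
```

Each entry under `properties` becomes one field in the generated interface, so the titles, descriptions, and prefills you write here are exactly what the user sees.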

Of course, you can create an Actor without setting up an elaborate input schema. If your Actor is designed for users who don't need a good interface (e.g. they’ll use a JSON object and call it via API), you can skip this guide. But most users engage with Actors in Manual mode, aka the Actor interface. If your Actor is complex or you’re targeting regular users who need an intuitive interface, it's essential to consider their experience.

In this article, *we’ll refer to the input schema as the user interface* of your Actor and focus exclusively on it.

Understand input schemas

To fully understand the recommendations in this article, you’ll first need to familiarize yourself with the input schema specification at https://docs.apify.com/platform/actors/development/actor-definition/input-schema. This context is essential to make good use of the insights shared here.

## The importance of a good input schema

Facing the Apify platform for the first time can feel intimidating, and a user takes only a few seconds to judge how easy your Actor is to use.

If something goes wrong or is unclear with the input, an ideal user will first turn to the tooltips in the input schema. Next, they might check the README or tutorials, and finally, they’ll reach out to you through the **Issues** tab. However, many users won’t go through all these steps — they may simply get overwhelmed and abandon the tool altogether.

A well-designed input schema is all about managing user expectations, reducing cognitive load, and preventing frustration. Ideally, a good input schema, as your first line of interaction, should:

* Make the tool as easy to use as possible
* Reduce the user’s cognitive load and make them feel confident about using and paying for it
* Give users enough information and control to figure things out on their own
* Save you time on support by providing clear guidance
* Prevent incorrect or harmful tool usage, like overcharges or scraping personal information by default

### Reasons to rework an input schema

* Your Actor is complex and has many input fields
* Your Actor offers multiple ways to set up input (by URL, search, profile, etc.)
* You’re adding new features to your Actor
* Certain uses of the Actor have caveats that need to be communicated immediately
* Users frequently ask questions about specific fields

👀 Input schema can be formatted using basic HTML.

## Most important elements of the input schema

You can see the full list of elements and their technical characteristics (titles, tooltips, toggles, prefills, etc.) in https://docs.apify.com/academy/deploying-your-code/input-schema. That's not what this guide is about. It's not enough to just create an input schema. Ideally, you should place and word its elements to the user's advantage: to reduce their cognitive load and make getting to know and using your tool as smooth as possible.

Unfortunately, when it comes to UX, there's only so much you can achieve armed with HTML alone. Here are the best elements to focus on, along with some best practices for using them effectively:

* **`description` at the top**

  * As the first thing users see, the description needs to provide crucial information and a sense of reassurance if things go wrong. Key points to mention: the easiest way to try the Actor, links to a guide, and any disclaimers or other similar Actors to try.

    ![Input schema description example](/assets/images/description-sshot-4a31a900bc58209d44032f409cf8eed6.png)

  * Descriptions can include multiple paragraphs. If you're adding a link, it's best to add the `target="_blank"` attribute so the link opens in a new tab and your user doesn't lose the original Actor page when clicking.

* **`title` of the field (regular bold text)**

  * This is the default way to name a field.

  * Keep it brief. The user’s flow should be 1. title → 2. tooltip → 3. link in the tooltip. Ideally, the title alone should provide enough clarity. However, avoid overloading the title with too much information. Instead, make the title as concise as possible, expand details in the tooltip, and include a link in the tooltip for full instructions.

    ![Input schema input example](/assets/images/title-sshot-59c5431c3d78f35f398c1c55d930b806.png)

* **`prefill`, the default input**

  * This is your chance to show rather than tell:

    * Keep the **prefilled number** low. Set it to 0 if it's irrelevant for a default run.
    * Make the **prefilled text** example simple and easy to remember.
    * If your Actor accepts various URL formats, add a few different **prefilled URLs** to show that possibility.
    * Use the **prefilled date** format that the user is expected to follow. This way, they can learn the correct format without needing to check the tooltip.
    * There’s also a type of field that looks like a prefill but isn’t — usually a `default` field. It’s not counted as actual input but serves as a mock input to show users what to type or paste. It is gray and disappears after clicking on it. Use this to your advantage.

* **toggle**

  * The toggle is a boolean field. A boolean field represents a yes/no choice.

  * How would you word this toggle: **Skip closed places** or **Scrape open places only**? And should the toggle be enabled or disabled by default?

    ![Input schema toggle example](/assets/images/toggle-sshot-b27af75e3ef46c83a61ef2bad6670a56.png)

    * Consider this when choosing how to word the toggle and which choice to set as the default. If you make it more complex than needed (e.g. by using negation as the ‘yes’ choice), you increase your user's cognitive load. A default run might also return far less, or far more, data than they need.
    * In our example, we assume the default user wants to scrape all places but still wants the option to filter out closed ones. However, they have to make that choice consciously, so we keep the toggle disabled by default. If the toggle were enabled by default, users might not notice it and think the tool isn't working properly when it returns fewer results than expected.

* **sections or `sectionCaption` (BIG bold text) and `sectionDescription`**

  * A section looks like a wrapped toggle list.

    ![Input schema sections example](/assets/images/sections-sshot-fc6cbd06170d0a33c1c9ab909bd8d6d1.png)

  * It is useful to section off non-default ways of input or extra features. If your tool is complex, don't leave all fields in the first section. Just group them by topic and section them off (see the screenshot above ⬆️)

    * You can add a description to every section. Use `sectionDescription` only if you need to provide extra information about the section (see the screenshot below ⬇️).
    * Sometimes `sectionDescription` is used as a space for disclaimers so the user is informed of the risks from the outset instead of having to click on the tooltip.

    ![Input schema section description example](/assets/images/section-description-sshot-3f2616cb044875c2841e131fe408554c.png)

* **tooltips or `description` to the title**

  * To see the tooltip's text, the user needs to click on the `?` icon.

  * This is your space to explain the title and what's going to happen in that field: any terminology, referrals to other fields of the tool, examples that don't fit the prefill, or caveats can be detailed here. Using HTML, you can add links, line breaks, code, and other regular formatting here. Use this space to add links to relevant guides, video tutorials, screenshots, issues, or readme parts if needed.

  * Wording in titles vs. tooltips. Titles are usually nouns. They have a neutral tone and simply state what content the field accepts (**Usernames**).

    * Tooltips to those titles usually start with verbs in the imperative that tell the user what to do (*Add, enter, use*).
    * This division is not set in stone, but tooltips use the imperative because a user who clicks on the tooltip is most likely looking for clarification or instructions on what to do.

    ![Input schema tooltips example](/assets/images/tooltips-sshot-956de479172bfe492e0e8b98a06e6e01.png)

* **emojis (visual component)**

  * Use them to attract attention or as visual shortcuts. Use emojis consistently to invoke a user's iconic memory. The visual language should match across the whole input schema (and README) so the user can understand what section or field is referred to without reading the whole title.
    
    * Don't overload the schema with emojis. They attract attention, so you need to use them sparingly.

**Tip:** Read more on the use of emojis: https://docs.apify.com/academy/actor-marketing-playbook/actor-basics/actors-and-emojis.md
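To make the elements above more concrete, here is a minimal sketch of an input schema, written as a Python dictionary and serialized to JSON. The Actor, the field names (`searchTerms`, `maxPlaces`, `skipClosedPlaces`), the texts, and the guide URL are all hypothetical. The keys are meant to follow the input schema specification linked in Resources, so check them against the specification before relying on this.

```python
import json

# Hypothetical input schema for an imaginary places scraper.
# All field names, texts, and URLs are made up for illustration.
input_schema = {
    "title": "Places scraper input",
    "type": "object",
    "schemaVersion": 1,
    # Top description: the easiest way to try the Actor plus a link to a guide.
    "description": (
        "Enter a search term and click <b>Start</b> to try the scraper. "
        "Need help? Read the "
        "<a href='https://example.com/guide' target='_blank'>step-by-step guide</a>."
    ),
    "properties": {
        "searchTerms": {
            "title": "🔍 Search terms",  # short noun phrase
            "type": "array",
            "editor": "stringList",
            # Tooltip: imperative verb plus an example.
            "description": "Enter one or more search terms, such as <code>pizza</code>.",
            "prefill": ["pizza"],  # simple, easy-to-remember example
        },
        "maxPlaces": {
            "title": "Number of places",
            "type": "integer",
            "editor": "number",
            "description": "Set how many places to extract per search term.",
            "prefill": 10,  # a low number keeps the first run fast and cheap
            "minimum": 1,
        },
        "skipClosedPlaces": {
            # A new section starts at the field that carries sectionCaption.
            "sectionCaption": "⚙️ Filters",
            "sectionDescription": "Optional filters. Leave as is for a first run.",
            "title": "Skip closed places",
            "type": "boolean",
            "editor": "checkbox",
            "description": "Enable to leave out places marked as closed.",
            "default": False,  # off by default so a default run returns everything
        },
    },
    "required": ["searchTerms"],
}

schema_json = json.dumps(input_schema, indent=2, ensure_ascii=False)
print(schema_json)
```

The toggle is worded positively and left off by default, so a first run returns all places and filtering stays a conscious choice.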

## Example of an improved input schema

1. A well-used `description` space. The description briefly introduces possible scraping options, visual language (sections represented by emojis), the easiest way to try the tool, and a link to a tutorial in case of issues. The description isn't too long, uses different formatting, and looks reassuring.
2. The main section is introduced and visually separated from the rest. This is the space for the user to try the first run before they can discover the other options.
3. The title says right away that this field refers to multiple other fields, not only the first section.
4. `prefill` is a small number (so in case users run the tool with default settings, it doesn't take too long and isn't expensive for them) and uses the language of the target website (not results or posts, *videos*).
5. The tooltip expands with more details and refers to other sections it's applicable to using matching emojis.
6. Section names are short. Sections are grouped by content type.
7. More technical parameters lack emojis. They are formatted this way to attract less attention and visually inform the user that this section is the most optional to set.
8. Visual language is unified across the whole input schema. Emojis are used as a shortcut for the user to understand what section or field is referred to without actually reading the whole title.

![Input schema example](/assets/images/improved-input-schema-example-193dcc1c44cbcc8db6016ced168d8dc5.png)

### Example of a worse input schema

The version above was the improved input schema. Here's what this tool's input schema looked like before:

1. Brief and dry description, with little value for the user, easy to miss. Most likely, the user already knows this info because what this Actor does is described in the Actor SEO description, description, and README.
2. The field title is wordy and reads a bit techie: it uses terminology that's not the most accurate for the target website (*posts*) and limiting terms (*max*). The field is applicable for scraping by hashtags (field above) and by profile (section below). Easy detail to miss.
3. The prefilled number is too high. If the user runs the Actor with default settings, they might spend a lot of money, and it will take some time. Users often just leave if an Actor takes a long time to complete on the first try.
4. The tooltip simply reiterates what is said in the title. Could've been avoided if the language of the title wasn't so complex.
5. Merging two possible input types into one (profiles and URLs) can cause confusion. Verbose, reminds the user about an unrelated field (hashtags).
6. This section refers to profiles but is separate, so the user has to make extra effort to scrape profiles. They have to move across 3 sections: use Max posts from section 1, Profiles input from section 2, and Date sorting filters from section 3.
7. The proxy and browser section invites users to explore it even though it's not needed for a default run. It's more technical to set up and can give the impression that you need to know how to configure it for the tool to work.

![Input schema example](/assets/images/worse-input-schema-f6354139a96611112dbeb1f9882ab2e9.png)

## Best practices

1. Keep it short. Don’t rely too much on text - most users prefer to read as little as possible.
2. Use formatting (bold, italic, underline), links, and line breaks to highlight key points.
3. Use specific terminology (e.g., posts, images, tweets) from the target website instead of generic terms like "results" or "pages."
4. Group related items for clarity and ease of use.
5. Use emojis as shortcuts and visual anchors to guide attention.
6. Avoid technical jargon — keep the language simple.
7. Minimize cognitive load wherever possible.
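Some of these practices can be checked mechanically. The sketch below is a hypothetical helper, not an Apify tool: it scans a schema given as a Python dictionary and warns about long titles, missing tooltips, and high prefilled numbers. The function name and the thresholds are assumptions chosen for illustration.

```python
def lint_input_schema(schema, max_title_words=5, max_prefill_number=50):
    """Return a list of warnings for an input schema given as a dict.

    The thresholds are arbitrary illustrations, not official Apify limits.
    """
    warnings = []
    for name, field in schema.get("properties", {}).items():
        title = field.get("title", "")
        if len(title.split()) > max_title_words:
            warnings.append(f"{name}: title is long; move details to the tooltip")
        if not field.get("description"):
            warnings.append(f"{name}: missing tooltip (description)")
        prefill = field.get("prefill")
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(prefill, int) and not isinstance(prefill, bool):
            if prefill > max_prefill_number:
                warnings.append(
                    f"{name}: prefilled number {prefill} may make the first run slow or costly"
                )
    return warnings


# Hypothetical schema fragment with one problematic and one clean field.
example = {
    "properties": {
        "resultsLimit": {
            "title": "Maximum number of posts to scrape per hashtag or profile",
            "type": "integer",
            "prefill": 1000,
        },
        "usernames": {
            "title": "Usernames",
            "type": "array",
            "description": "Add one or more usernames.",
            "prefill": ["apify"],
        },
    }
}

for warning in lint_input_schema(example):
    print(warning)
```

Here the first field triggers all three warnings and the second triggers none. A check like this only catches surface issues; wording and grouping still need a human eye.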

## Signs and tools for improving input schema

* *User feedback*. If they're asking obvious things, complaining, or consistently making silly mistakes with input, take notes. Feedback from users can help you understand their experience and identify areas for improvement.
* *High churn rates*. If your users are trying your tool but quickly abandon it, this is a sign they are having difficulties with your schema.
* *Input Schema Viewer*. Write your base schema in any code editor, then copy the file and put it into https://console.apify.com/actors/UHTe5Bcb4OUEkeahZ/source. This tool should help you visualize your Input Schema before you add it to your Actor and build it. Seeing how your edits look in Apify Console right away will make the process of editing the fields in code easier.

## Resources

* Basics of input schema: https://docs.apify.com/academy/deploying-your-code/input-schema
* Specifications of input schema: https://docs.apify.com/platform/actors/development/actor-definition/input-schema


---

# Affiliates

The Apify Affiliate Program offers you a way to earn recurring commissions while helping others discover automation and web scraping solutions. Whether you promote Apify Store or refer customers to Apify's professional services, you can monetize your expertise and network.

The program rewards collaboration with up to 30% recurring commission and up to $2,500 per customer for professional services referrals. With no time limits on commissions, transparent tracking, and flexible payout options, it's built for long-term partnerships.

***

## How the program works

The Apify Affiliate Program lets you promote three main offerings:

1. *Apify Store*: recommend Actors from the marketplace that help businesses automate lead generation, pricing intelligence, content aggregation, and more.
2. *Apify platform*: promote the platform's features, including scheduling, monitoring, data export options, proxies, and integrations.
3. *Professional services*: refer customers who need custom web scraping solutions to Apify's Professional Services team and earn up to $2,500 per closed deal.

### Commission structure

* *20% commission* for the first 3 months of each customer's subscription
* *30% commission* from month 4 onwards for as long as they remain customers
* *Up to $2,500 per customer* for professional services referrals
* *No time limits on commissions* - you earn as long as your referrals stay active
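As a rough illustration of how the two rates combine, the sketch below estimates the commission from one referred subscription. It assumes the commission is a percentage of a constant monthly spend, which is a simplification; the function name and the $100 figure are hypothetical, and actual payouts depend on the program terms.

```python
def affiliate_commission(monthly_spend, months):
    """Estimate total commission for one referred subscription.

    Assumes a constant monthly spend and applies the rates listed above:
    20% for months 1-3 and 30% from month 4 onwards.
    """
    total = 0.0
    for month in range(1, months + 1):
        rate_percent = 20 if month <= 3 else 30
        total += monthly_spend * rate_percent / 100
    return round(total, 2)


# A hypothetical customer paying $100 per month for 12 months:
# 3 months at 20% ($60) plus 9 months at 30% ($270).
print(affiliate_commission(100, 12))  # 330.0
```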

### Free trial advantage

Apify offers a $5 free trial that renews monthly, giving your referrals time to test the tools before subscribing. This increases conversion rates and helps you earn more consistent commissions.

***

## How to succeed as an affiliate

### Use word of mouth

Tell clients, business contacts, or colleagues how Apify solves their lead generation, data collection, and automation challenges. Personal recommendations carry weight, especially when you can speak to real use cases.

### Create educational content

Use your platform to demonstrate value:

* Blog posts: write tutorials, case studies, or problem-solving guides that feature Apify tools
* Video content: record demos, walkthroughs, or quick tips showing how Actors work
* Podcasts: discuss automation workflows and mention specific Actors that solve common problems
* Social media: share favorite Actors, tools, or workflows with your audience

### Engage your community

If you run a forum, Discord server, or an online group, position Apify as a resource for solving automation and data collection challenges. Answer questions and recommend relevant Actors when they fit the problem.

### Teach and inspire

If you teach AI automation, engineering, marketing, or lead generation, include Apify in your curriculum. Show students how to use Actors in webinars, online courses, or workshops.

***

## Benefits beyond commissions

### Exclusive perks for top performers

High-performing affiliates and their referrals can access:

* Exclusive discounts on platform usage
* Free prepaid credits
* Early access to new tools and features

### Co-marketing opportunities

Collaborate with Apify on:

* Joint marketing campaigns
* Workshops and webinars
* Partner success stories
* Industry events and conferences

These opportunities help you build visibility and strengthen relationships with your audience.

***

## Payment and tracking

### Transparent dashboard

Track referrals in real-time through a dashboard that shows:

* Active referrals
* Commission earnings
* Conversion rates
* Payment history

### Payment options

Receive payouts via:

* Bank transfer
* PayPal

You'll receive your first payment within 30 days of your first successful referral.

***

## Best practices for affiliate success

1. *Know your audience*: understand their pain points and recommend solutions that genuinely help them. Tailor your messaging to their technical level and needs.
2. *Be authentic*: promote tools you've used or understand. Personal experience builds trust and credibility.
3. *Provide context*: explain how Apify solves specific problems rather than just listing features. Use real examples and use cases.
4. *Follow up*: engage with people who click your links. Answer questions and provide additional resources to help them get started.
5. *Combine strategies*: use multiple channels to promote Apify. Cross-reference blog posts in videos, mention tutorials in newsletters, and share content on social media.
6. *Track what works*: monitor which content and channels drive the most conversions, then double down on what performs best.

***

## Getting started

To join the Apify Affiliate Program:

1. Sign up at https://apify.com/partners/affiliate
2. Access your unique tracking links and promotional materials
3. Start sharing with your network
4. Monitor your referrals and earnings through the dashboard

Maximize your impact

Combine affiliate promotion with other marketing strategies covered in this guide, including SEO, social media, blogs, and video tutorials. The more touchpoints you create, the higher your conversion potential.


---

# Blogs and blog resources

**Blogs remain a powerful tool for promoting your Actors and establishing authority in the field. With social media, SEO, and other platforms, you might wonder if blogging is still relevant. The answer is a big yes. Writing blog posts can help you engage your users, share expertise, and drive organic traffic to your Actor.**

## Why blogs still matter

1. SEO. Blog posts are great for boosting your Actor’s search engine ranking. Well-written content with relevant keywords can attract users searching for web scraping or automation solutions. For example, a blog about “how to scrape social media profiles” could drive people to your Actor who might not find it on Google otherwise.
2. Establishing authority. When you write thoughtful, well-researched blog posts, you position yourself as an expert in your niche. This builds trust and makes it more likely users will adopt your Actors.
3. Long-form content. Blogs give you the space to explain the value of your Actor in-depth. This is especially useful for complex tools that need more context than what can fit into a README or product description.
4. Driving traffic. Blog posts can be shared across social media, linked in webinars, and included in your Actor’s README. This creates multiple avenues for potential users to discover your Actor.

## Good topics for blog posts

1. Problem-solving guides. Write about the specific problems your Actor solves. For example, if you’ve created an Actor that scrapes e-commerce reviews, write a post titled "How to automate e-commerce review scraping in 5 minutes". Focus on the pain points your tool alleviates.
2. Actor use cases. Show real-world examples of how your Actor can be applied. These can be case studies or hypothetical scenarios like "Using web scraping to track competitor pricing."
3. Tutorials and step-by-step guides. Tutorials showing how to use your Actor or similar tools are always helpful. Step-by-step guides make it easier for beginners to start using your Actor with minimal hassle.
4. Trends. If you’ve noticed emerging trends in web scraping or automation, write about them. Tie your Actor into these trends to highlight its relevance.
5. Feature announcements or updates. Have you recently added new features to your Actor? Write a blog post explaining how these features work and what makes them valuable.

🪄 These days, blog posts always need to be written with SEO in mind. Yeah, it's annoying to use keywords, but think of it this way: even the most interesting customer story or the most amazing programming insights won't have the impact you want if nobody can find them. Do try to optimize your posts with relevant keywords and phrases (across text, structure, and even images) to ensure they reach your target audience.

***

## Factors to consider when writing a blog

1. Audience. Know your target audience. Are they developers, small business owners, or data analysts? Tailor your writing to match their technical level and needs.
2. SEO. Incorporate relevant keywords naturally throughout your post. Don’t overstuff your content, but make sure it ranks for search queries like "web scraping tools", "automation solutions", or "how to scrape LinkedIn profiles". Remember to include keywords in H2 and H3 headings.
3. Clarity and simplicity. Avoid jargon, especially if your target audience includes non-technical users. Use simple language to explain how your Actor works and why it’s beneficial.
4. Visuals. Include screenshots, GIFs, or even videos to demonstrate your Actor’s functionality. Visual content makes your blog more engaging and easier to follow.
5. Call to action (CTA). Always end your blog with a clear CTA. Whether it’s "try our Actor today" or "download the demo", guide your readers to the next step.
6. Engage with comments. If readers leave comments or questions, engage with them. Answer their queries and use the feedback to improve both your blog and Actor.

***

## Best places to publish blogs

There are a variety of platforms where you can publish your blog posts to reach the right audience:

1. http://dev.to/: It's a developer-friendly platform where technical content gets a lot of visibility, and a great place to publish how-to guides, tutorials, and technical breakdowns of your Actor.
2. Medium: Allows you to reach a broader, less technical audience. It’s also good for writing about general topics like automation trends or how to improve data scraping practices.
3. ScrapeDiary: Run by Apify, http://scrapediary.com is a blog specifically geared toward Apify community devs and web scraping topics. Publishing here is a great way to reach users already interested in scraping and automation. Contact us if you want to publish a blog post there.
4. Personal blogs or company websites. If you have your own blog or a company site, post there. It’s the most direct way to control your content and engage your established audience.

***

## Not-so-obvious SEO tips for blog posts

Everybody knows you should include keywords wherever it looks natural. Some people know the structure of the blog post should be hierarchical and follow an H1 - H2 - H3 - H4 structure with only one H1. Here are some less obvious SEO tips for writing a blog post that can help boost its visibility and ranking potential:

### 1. Keep URL length concise and strategic

Optimal length. Keep your URL short and descriptive. URLs of 50-60 characters tend to perform best, so aim for 3-4 words. Avoid unnecessary words like "and", "of", or long prepositions.

Include keywords. Ensure your primary keyword is naturally integrated into the URL. This signals relevance to both users and search engines.

Avoid dates. Don’t include dates or numbers in the URL to keep the content evergreen, as dates can make the post seem outdated over time.
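The three URL tips above can be sketched as a small slug generator. This is an illustrative helper, not a feature of any blogging platform: the function name, the stop-word list, and the four-word limit are assumptions you would adjust for your own posts.

```python
import re

# Small, hand-picked stop-word list for illustration; extend as needed.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "for", "on", "with", "how"}


def make_slug(title, max_words=4):
    """Build a short URL slug: lowercase, no stop words, no numbers or dates."""
    # Keep only runs of letters, which also drops years and other digits.
    words = re.findall(r"[a-z]+", title.lower())
    kept = [word for word in words if word not in STOP_WORDS]
    return "-".join(kept[:max_words])


print(make_slug("How to Scrape Social Media Profiles in 2024"))
# scrape-social-media-profiles
```

The resulting slug keeps the primary keyword phrase and stays evergreen because the year never makes it into the URL.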

### 2. Feature a video at the top of the post

Engagement boost. Videos significantly increase the time users spend on a page, positively influencing SEO rankings. Blog posts with videos in them generally do better SEO-wise.

Thumbnail optimization. Use an optimized thumbnail with a clear title and engaging image to increase click-through rates.

### 3. Alt text for images with a keyword focus

Descriptive alt text. Include a short, descriptive alt text for every image with one or two keywords where it makes sense. This also improves accessibility.

Optimize file names. Name your images with SEO-friendly keywords before uploading (e.g., "web-scraping-tools.png" rather than "IMG12345\_screenshot1.png"). This helps search engines understand the content of your images.

File format and size. Use web-optimized formats like WebP or compressed JPEGs/PNGs to ensure fast page loading, which is a key SEO factor.

Lazy loading images. Use lazy loading to only load images when the user scrolls to them, reducing initial page load times, which can help your SEO ranking.
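As a small sketch of the image tips above, the helper below renders an `<img>` tag with a descriptive file name, alt text, and the standard HTML `loading="lazy"` attribute. The function name and the `/images/` path are hypothetical; most blogging platforms generate this markup for you, so treat it as an illustration of what the output should contain.

```python
from html import escape


def image_tag(file_name, alt_text):
    """Render an <img> tag with alt text and native lazy loading."""
    return (
        f'<img src="/images/{escape(file_name, quote=True)}" '
        f'alt="{escape(alt_text, quote=True)}" loading="lazy">'
    )


print(image_tag("web-scraping-tools.webp", "Comparison of web scraping tools"))
# <img src="/images/web-scraping-tools.webp" alt="Comparison of web scraping tools" loading="lazy">
```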

### 4. Interlinking for better user experience and SEO

Internal links. Use contextual links to other relevant blog posts or product pages on your site. This not only helps with SEO but also keeps users engaged longer on your site, reducing bounce rates.

Anchor text. When linking internally, use keyword-rich anchor text that describes what users will find on the linked page.

Content depth. By interlinking, you can show Google that your site has a strong internal structure and is a hub of related, authoritative content.

### 5. Target the 'People Also Ask' section of Google results with an FAQ

Answer common questions. Including an FAQ section that answers questions people search for can help you rank in the "People Also Ask" section of Google. Research questions that come up in this feature related to your topic and address them in your content.

Provide clear, concise answers to the FAQs, typically between 40-60 words, since these match the format used in "People Also Ask".

Don't bother using FAQ schema. Google now shows FAQ rich results only for well-known, authoritative government and health websites.

### 6. Optimize for readability and structure

Short paragraphs and subheadings. Make your blog post easy to scan by using short paragraphs and meaningful subheadings that contain keywords.

Bullet points and lists. Include bullet points and numbered lists to break up content and make it more digestible. Search engines prioritize well-structured content.

Readability tools. Use tools like Hemingway Editor or Grammarly to improve readability. Content that is easy to read tends to rank higher, as it keeps readers engaged.

## Referring to blogs in your Actor’s ecosystem

To drive traffic to your blog and keep users engaged, reference your blog posts across various touchpoints:

1. README. Add links to your blog posts in your Actor’s README. If you’ve written a tutorial or feature guide, include it under a "Further reading" section.
2. Input schema. Use your input schema to link to blog posts. For instance, if a certain field in your Actor has complex configurations, add a link to a blog post that explains how to use it.
3. YouTube videos. If you’ve created tutorial videos about your Actor, link them in your blog and vice versa. Cross-promoting these assets will increase your overall engagement.
4. Webinars and live streams. Mention your blog posts during webinars, especially if you’re covering a topic that’s closely related. Include the links in follow-up emails after the event.
5. Social media. Share your blog posts on Twitter, LinkedIn, or other social platforms. Include snippets or key takeaways to entice users to click through.

🔄 Remember, you can always turn your blog into a video and vice versa. You can also use parts of blog posts for social media promotion.

## Additional tips for blog success

1. Consistency. Regular posting helps build an audience and makes sure you keep at it. Try to stick to a consistent schedule, whether it’s weekly, bi-weekly, or monthly. As Woody Allen said, “80 percent of success is showing up”.
2. Guest blogging. Reach out to other blogs or platforms like http://dev.to/ for guest blogging opportunities. This helps you tap into new audiences.
3. Repurpose content. Once you’ve written a blog post, repurpose it. Turn it into a YouTube video, break it down into social media posts, or use it as the base for a webinar.
4. Monitor performance. Use analytics to track how your blog is performing. Are people reading it? Is it driving traffic to your Actor? What keywords is it ranking for? Who are your competitors? Use this data to refine your content strategy.


---

# Marketing checklist

You're a developer, not a marketer. You built something awesome, and now you need people to find it. This checklist breaks down the marketing process into simple, actionable steps.

Complete many tasks using AI prompts that generate content in minutes. Each completed task brings you closer to your goals.

Tag Apify for broader reach

Tag @apify when you share content on X.com (Twitter) or LinkedIn to potentially reach thousands of additional users through Apify's social channels.

***



## Social media promotion

### Share on Twitter/X with a demo

Twitter's developer community is active and engaged. A well-crafted tweet with a video demo can reach thousands of potential users.

Create a 30-60 second demo video or gif showing your Actor in action. Include relevant hashtags like #webscraping, #API, #automation, and #buildinpublic.

### Share on LinkedIn with a demo

LinkedIn reaches professionals, decision-makers, and business users with purchasing power. The platform's algorithm favors native video content, giving it 5x more reach than link posts.

Create a 30-90 second demo video showing your Actor delivering business value. Upload the video directly to LinkedIn (native videos perform better than YouTube links). Focus your post on the business problem solved, not technical features. Use 3-5 relevant hashtags like #BusinessAutomation, #Productivity, #DataIntelligence, #Efficiency, or #MarketResearch.

### Post in relevant Discord and Slack communities

Developer communities on Discord and Slack are where your target users spend time. These platforms enable deeper conversations and direct feedback.

Join communities related to data for AI, web scraping, automation, data science, or your specific niche. Share your Actor in relevant channels, but always check the community rules first. Consider Apify Discord, web scraping communities, automation groups, and data engineering Slacks.

***

## Video content creation

### Create a tutorial video or walkthrough

Video content ranks well on YouTube and Google. It's perfect for developers who prefer visual learning. Videos get embedded and shared, multiplying your reach.

Record a 5-10 minute screen recording showing your Actor in action. Use Loom, OBS, or your computer's built-in recorder. Distribute your video across multiple channels:

* YouTube
* LinkedIn
* Twitter/X
* Your README and articles

**Video structure:**

1. **Introduction (30-45 seconds)** - Greet viewers, explain the problem you're solving, what they'll learn, and time estimate
2. **Outcome preview (30-45 seconds)** - Show the result first, preview the final output
3. **Step-by-step walkthrough (4-7 minutes)** - Navigate to the Actor, set up configuration, show optional features, run the Actor, review results, export the data
4. **Pro tips (30-60 seconds)** - Share 2-3 quick tips you've learned
5. **Wrap up (30-45 seconds)** - Recap, call-to-action, engagement prompt

**Recording tips:** Close unnecessary tabs, use a clean browser profile, speak clearly at a moderate pace, pause briefly between steps for easier editing, and use your natural voice.

### Create short-form videos (TikTok, YouTube Shorts, Instagram Reels)

Short-form video is one of the fastest-growing content formats with incredible organic reach. Even accounts with zero followers can get thousands of views. These videos showcase your Actor's value in 15-60 seconds and appear in AI-generated answers and search results.

Focus on the "wow factor": show the problem (manual work taking forever) versus the solution (your Actor doing it in seconds). Use trending sounds when possible, add text overlays explaining what's happening (most people watch without sound), and include a clear call-to-action at the end. Post the same video across all three platforms to maximize reach.

Best practices for short-form videos

* Hook viewers in the first 3 seconds (show the result or problem immediately)
* Keep it fast-paced
* Add captions and text overlays (essential for silent viewing)
* Record in portrait mode (9:16 aspect ratio)
* End with a clear next step: "Link in bio" or "Search \[Actor Name] on Apify"

***

## Launch and community engagement

### Create a Product Hunt launch

Product Hunt drives significant traffic and visibility. A successful launch brings hundreds of users and valuable feedback.

Create a Product Hunt listing for your Actor. Schedule it for a weekday morning (Tuesday through Thursday works best). Prepare assets: logo, screenshots, and demo video.

Learn more in the Product Hunt guide: https://docs.apify.com/academy/actor-marketing-playbook/promote-your-actor/product-hunt.md

### Submit to Hacker News

Hacker News drives significant developer traffic and has high domain authority. A front-page post brings thousands of visitors and generates discussions that lead to improvements and feature ideas.

Submit your "How I Built This" post, tutorial, or Actor launch with a descriptive title. Post between 8 and 10 AM EST on weekdays for best results. Engage authentically in comments. The HN community values substance over promotion.

### Promote your Actor on Reddit

Reddit ranks highly for almost all keywords and topics. You can get your product mentioned in LLMs by engaging in popular threads.

1. Search `site:reddit.com [ACTOR NAME]` in Google
2. Find relevant Reddit threads
3. Comment authentically and mention your product naturally without being salesy

Craft comments that genuinely address the thread topic, naturally mention your Actor as a solution, and add real value to the conversation. Use casual Reddit tone, not corporate speak.

### Cross-post to relevant subreddits

Original posts in relevant subreddits (r/webdev, r/datascience, r/SideProject, r/programming, r/automation) drive significant traffic when done thoughtfully.

Write a Reddit-native post that explains the problem, your solution, and invites feedback. Use titles like "I built X to solve Y" instead of "Check out my new tool." Follow subreddit self-promotion rules (many require you to be an active community member first). Share both challenges and successes to foster authentic engagement.

### Answer Stack Overflow questions

Stack Overflow answers rank well in search and are frequently referenced by AI systems. Providing helpful answers that mention your Actor creates lasting SEO value.

Search for questions related to your Actor's use case (e.g., "web scraping", "API integration"). Provide genuinely helpful answers that solve the problem, and mention your Actor as one potential solution.

### Contribute to Quora discussions

Quora answers rank well in Google and are often featured in AI-generated answers. People actively seek solutions to problems on this platform.

1. Search `site:quora.com [ACTOR NAME]` or related keywords in Google
2. Find relevant Quora threads
3. Write comprehensive, helpful answers and mention your product naturally without being salesy

Write 300-500 word answers that open with a direct response, provide context, offer 2-3 different approaches, mention your Actor as one option, and include personal experience. Use subheadings for readability. Keep tone expert but approachable.

***

## Content marketing

### Write a technical "How I built this" blog post

Developers love reading about other developers' journeys. This positions you as an expert, builds trust, and naturally promotes your Actor while providing educational value. It's great for SEO and getting indexed by AI search engines.

Share your development process, challenges you faced, and how you solved them. Post on dev.to, Medium, Hashnode, or your personal blog.

Learn more in the guide to blogs and blog resources: https://docs.apify.com/academy/actor-marketing-playbook/promote-your-actor/blogs-and-blog-resources.md

### Create a "Best X" article for Medium

Medium has excellent SEO and a massive audience of professionals and developers. Publishing on Medium and submitting to relevant publications like "Better Programming" or "The Startup" can reach thousands of readers. Medium articles frequently appear in Google search results and AI-generated answers.

Write a comprehensive "Best \[CATEGORY]" roundup article (1,800-2,500 words) featuring 6-8 solutions with your Actor prominently positioned. Create a Medium account if you don't have one, and publish it. Use all 5 available tags strategically (e.g., "web scraping", "APIs", "automation", "developer tools", "\[your specific niche]"). Submit your article to relevant Medium publications to multiply your reach by 10x or more.

Write in first person with a conversational yet professional tone. Include pros and cons for each solution, add a comparison table, and share your genuine perspective.

### Create a "Best X" article for dev.to

dev.to is the go-to platform for developers seeking tools and tutorials. It has a highly engaged community and strong domain authority, so your articles rank well in search engines. The community actively comments and shares, boosting visibility. dev.to content is frequently referenced by AI tools.

Write a developer-focused "Best \[CATEGORY] for Developers" article (1,500-2,000 words) featuring 6-8 solutions. Create a dev.to account if needed and publish your article. Add relevant tags (up to 4 tags, e.g., #webdev, #api, #productivity, #tools). Engage with comments. The dev.to community values authentic interaction.

Write like you're advising a fellow developer: casual and helpful. Be genuinely objective about all tools, include code examples or API snippets where relevant, and use markdown formatting with H2/H3 headers.

### Create a "Best X" article for Hashnode

Hashnode is a rapidly growing developer blogging platform with excellent SEO and a clean reading experience. It's perfect for technical content with features built for developers (code highlighting, series, custom domains). Articles rank well in search results and are frequently discovered by developers. High-quality content often gets featured on Hashnode's homepage, dramatically increasing visibility.

Write a technical "Best \[CATEGORY] for \[SPECIFIC USE CASE]: A Developer's Guide" article (1,500-2,000 words). Create a Hashnode account if you don't have one (you can use a custom domain). Publish your article and add it to relevant Hashnode tags and communities.

Include a TL;DR section at the top, use proper heading hierarchy for auto-generated table of contents, and add code examples with proper syntax highlighting. Write with technical authority but remain accessible.

### Create a "Best X" article for LinkedIn

LinkedIn reaches a professional, business-oriented audience including decision-makers, CTOs, product managers, and team leads who have budget and purchasing authority. LinkedIn articles have strong SEO and are shared within professional networks, multiplying your reach. Content on LinkedIn is frequently indexed by AI systems.

Write a business-focused "Best \[CATEGORY] for \[BUSINESS OUTCOME]" article (1,200-1,800 words) featuring 5-7 solutions. Publish it as a LinkedIn Article (use the "Write article" feature, not just a post). After publishing, share the article link in a regular LinkedIn post with a compelling intro to drive traffic.

Use a professional, authoritative but accessible tone. Focus on business impact like time savings, cost efficiency, ROI, and productivity gains rather than technical features. Include comparison tables with business-relevant metrics.

### Create a "How to use \[Actor]" tutorial for dev.to

dev.to is the platform for developer tutorials. It has a massive, engaged community that actively searches for and shares how-to content. Tutorials rank exceptionally well in Google and are frequently referenced by AI systems when developers ask "how to" questions.

Write a step-by-step tutorial (1,200-2,000 words) showing developers how to use your Actor to achieve a specific outcome. Create a dev.to account if you don't have one, then publish your article with up to 4 relevant tags (e.g., #tutorial, #webdev, #api, #automation).

Structure: Introduction with hook, prerequisites, what they'll achieve, step-by-step guide (access the Actor, configure inputs, run it, view results, download data), understanding results, pro tips, troubleshooting, and next steps. Write like you're helping a friend get started.

### Create a "How to use \[Actor]" tutorial for Hashnode

Hashnode is perfect for comprehensive technical tutorials with a clean reading experience. It has excellent SEO, strong domain authority, and a growing developer community. The platform is built for technical writing with great code formatting and features like table of contents auto-generation.

Write a comprehensive "Complete Guide: How to \[ACHIEVE OUTCOME] Using \[YOUR ACTOR NAME]" tutorial (1,800-2,500 words). Sign up for Hashnode if you haven't already. Publish your article and add it to relevant tags.

Include a TL;DR section, detailed step-by-step walkthrough with screenshots, API integration examples with code blocks, advanced usage patterns, troubleshooting guide, and best practices. Write with technical authority, but be thorough and maintain clarity.

### Create a "How to use \[Actor]" tutorial for Medium

Medium reaches a broader, less technical audience. It's perfect for tutorials that appeal to marketers, entrepreneurs, product managers, no-code users, or less technical users. Medium's strong SEO means your tutorial can rank for years. Submitting to publications like "Better Programming", "The Startup", or "UX Collective" can reach tens of thousands of readers.

Write an accessible, engaging tutorial "How I \[ACHIEVED OUTCOME] in Minutes Using \[YOUR ACTOR] (Step-by-Step)" (1,500-2,200 words). Create or log into your Medium account, then publish the article. Use all 5 available tags strategically.

Take a story-driven approach with personal context. Write in first person, use simple jargon-free language, and make readers feel "I can do this too." Focus on the outcome and value, not technical complexity.

### Create a "How to use \[Actor]" tutorial for LinkedIn

LinkedIn tutorials reach professionals, decision-makers, and business users who value productivity and efficiency. LinkedIn articles have strong SEO and professional credibility. They're perfect for tutorials focused on business outcomes, time-saving, or solving professional challenges.

Write a professional "How to \[ACHIEVE BUSINESS OUTCOME] in \[TIME] Using \[YOUR ACTOR]: A Professional Guide" tutorial (1,400-2,000 words). Publish it as a LinkedIn Article using the "Write article" feature. After publishing, share the article in a regular LinkedIn post with an engaging business-focused intro.

Use professional, consultative tone focused on business value. Emphasize time savings, efficiency, and ROI. Include sections on business case, measuring success, professional best practices, and real-world business applications. Address common professional questions about security, cost, reliability, and team adoption.

***

## GitHub and developer resources

### Create a GitHub repository with examples

GitHub repos rank well in search and are developer-friendly. A repo with usage examples, tutorials, or integration guides makes it easier for others to adopt and reference your Actor.

Create a GitHub repo with code examples, integration guides, or sample projects using your Actor. Include a comprehensive README with use cases, code snippets, and links to your Actor.

Your README should include: project title with badges, short description, key features, quick start guide, installation and setup instructions, usage examples with code snippets, use cases section, configuration options, common questions and troubleshooting, links to Apify Store and documentation, contributing guidelines, and license.

***

## Quick wins

Simple actions you can take right now with minimal effort but immediate impact:

* Share your launch on your personal social media accounts (Twitter, LinkedIn, Facebook)
* Post about your new Actor on your personal website or blog
* Ask friends and colleagues to share
* Update your email signature to mention your Actor
* Add the Actor to your portfolio if you're a freelancer on Upwork or Fiverr

### Create a content hub

Create a free Notion page or GitHub README that lists all your Actors and content with links. Share this hub in your Actor description, social profiles, and email signature. This becomes your content portfolio and makes it easy for people to find all your work.


---

# Parasite SEO

**Do you want to attract more users to your Actors? Consider parasite SEO, a non-conventional method of ranking that leverages third-party sites.**

***

Here’s a full definition, from Authority Hacker:

> Parasite SEO involves publishing a quality piece of content on an established, high-authority external site to rank on search engines. This gives you the benefit of the host’s high traffic, boosting your chances for leads and successful conversions. These high DR websites have a lot of authority and trust in the eyes of Google

As you can see, you’re leveraging the existing authority of a third-party site where you can publish content promoting your Actors, and the content should rank better and faster as you publish it on an established site.

You can do parasite SEO for free, but you can also pay for guest posts on high-authority sites to post your articles promoting the Actors.

Let’s keep things simple and practical for this guide, so you can start immediately. We will cover only the free options, which should give you enough exposure to get started.

If you want to learn more, we recommend the following reading about parasite SEO:

* https://www.authorityhacker.com/parasite-seo/
* https://ahrefs.com/blog/parasite-seo/

In this guide, we will cover the following sites that you can use for parasite SEO for free:

* Medium
* LinkedIn
* Reddit
* Quora

## Medium

You probably know Medium (https://medium.com/). But you might not know that Google quite likes Medium, and you have a good chance of ranking high in Google with articles you publish there.

1. You need a Medium account. It’s free and easy to create.
2. Now, you need to do keyword research. Go to https://ahrefs.com/keyword-generator/?country=us, enter your main keyword (e.g. Airbnb scraper), and check what keyword has the highest search volume.
3. Search for that keyword in Google. Use incognito mode and a US VPN if you can. Analyze the results and check what type of content you need to create. Is it a how-to guide on how to create an Airbnb scraper? Or is it a list of the best Airbnb scrapers? Or perhaps it’s a review or just a landing page.
4. Now, you should have a good idea of the article you have to write. Write the article and try to mimic the structure of the first results.
5. Once you’re done with the article, don’t forget to include a few calls to action linking to your Actor on Apify Store. Don’t be too pushy, but mention all the benefits of your Actor.
6. Publish the article. Make sure your title and URL have the main keyword and that the main keyword is also in the first paragraph of the article. Also, try to use relevant tags for your Actor.

## LinkedIn Pulse

LinkedIn Pulse is similar to Medium, so we won’t go into too much detail. The entire process is the same as with Medium; the way you publish the article differs.

Follow this guide to publish your article on LinkedIn Pulse: https://www.linkedin.com/pulse/how-publish-content-linkedin-pulse-hamza-sarfraz/

## Reddit

1. You must have a Reddit account to use to comment in relevant subreddits.
2. Go to Google and perform this search: `site:reddit.com [KEYWORD]`, where you replace `[KEYWORD]` with the main topic of your Actor.
3. Now, list relevant Reddit threads that Google gives you. For an Airbnb scraper, this might be a good thread: https://www.reddit.com/r/webscraping/comments/m650ol/has_anybody_have_an_latest_airbnb_scraper_code/
4. To prioritize threads from the list, you can check the traffic they get from Google in https://ahrefs.com/traffic-checker. Just paste the URL, and the tool will give you traffic estimation. You can use this number to prioritize your list. If the volume exceeds 10, it usually has some traffic potential.
5. Now, the last step is to craft a helpful comment that also promotes your Actor. Try to do that subtly. People on Reddit usually don’t like people who promote their stuff, but you should be fine if you’re being genuinely helpful.
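The search from step 2 can also be generated for several topics at once. The following is a minimal sketch, not an official tool: the helper name `site_search_url` and the sample topics are made up for illustration, and it assumes the standard `https://www.google.com/search?q=...` URL format.

```python
from urllib.parse import urlencode


def site_search_url(site: str, topic: str) -> str:
    """Build a Google search URL restricted to a single site (hypothetical helper)."""
    query = f"site:{site} {topic}"
    # urlencode percent-encodes the colon and turns spaces into "+"
    return "https://www.google.com/search?" + urlencode({"q": query})


# Sample topics are placeholders; replace them with your Actor's main topics.
for topic in ("airbnb scraper", "airbnb data export"):
    print(site_search_url("reddit.com", topic))
```

Open each printed URL in a browser and collect the threads worth commenting on. The same helper works for any other site you pass as the first argument.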

## Quora

Quora is similar to Reddit, so again we won’t go into too much detail. The entire process is the same. You just have to use a different search phrase in Google, which is `site:quora.com [KEYWORD]`, with `[KEYWORD]` replaced by the main topic of your Actor.


---

# Product Hunt

Product Hunt is one of the best platforms for introducing new tools, especially in the tech community. It attracts a crowd of early adopters, startup enthusiasts, and developers eager to discover the latest innovations. Even Apify (https://www.producthunt.com/products/apify) was on Product Hunt.

If you're looking to build awareness and generate short-term traffic, Product Hunt can be a powerful tool in your marketing strategy. It's a chance to attract a wide audience, including developers, startups, and businesses looking for automation. If your Actor solves a common problem, automates a tedious process, or enhances productivity, it's a perfect candidate for Product Hunt.

Product Hunt is also great for tools with a strong visual component or demo potential. If you can show the value of your Actor in action, you’re more likely to grab attention and drive engagement.

***

## How to promote your Actor on Product Hunt

### Create a compelling launch

Launching your Actor on Product Hunt requires thoughtful planning. Start by creating a product page that clearly explains what your Actor does and why it’s valuable. You’ll need:

* *A catchy tagline*. Keep it short and to the point. Think of something that captures your Actor's essence in just a few words.
* *Eye-catching visuals*. Screenshots, GIFs, or short videos that demonstrate your Actor in action are essential. Show users what they’ll get, how it works, and why it’s awesome.
* *Concise description*. Write a brief description of what your Actor does, who it’s for, and the problem it solves. Use plain language to appeal to a wide audience, even if they aren’t developers.
* *Demo video*. A short video that shows how your Actor works in a real-life scenario will resonate with potential users.

Once your page is set up, you’ll need to choose the right day to launch. Product Hunt is most active on weekdays, with Tuesday and Wednesday being the most popular launch days. Avoid launching on weekends or holidays when traffic is lower.

### Build momentum before launch

Start building awareness before your launch day. This is where your social media channels and community engagement come into play. Share teasers about your upcoming Product Hunt launch on Twitter (X), Discord, LinkedIn, and even Stack Overflow, where other developers might take an interest. Highlight key features or the problems your Actor solves.

If you have a mailing list, give your subscribers a heads-up about your launch date. Encourage them to visit Product Hunt and support your launch by upvoting and commenting. This pre-launch activity helps create early momentum on launch day.

### Timing your launch

The timing of your Product Hunt launch matters a lot. Since Product Hunt operates on a daily ranking system, getting in early gives your product more time to gain votes. Aim to launch between 12:01 AM and 2:00 AM PST, as this will give your product a full day to collect upvotes.
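If you or your collaborators are in another time zone, it helps to convert that launch window first. This is a rough sketch under stated assumptions: it treats PST as a fixed UTC-8 offset (during daylight saving time, Pacific Time is UTC-7), and both the helper name `launch_window_utc` and the example date are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumption: PST as a fixed UTC-8 offset; adjust to UTC-7 during daylight saving time.
PST = timezone(timedelta(hours=-8), name="PST")


def launch_window_utc(year: int, month: int, day: int) -> tuple[datetime, datetime]:
    """Return the 12:01 AM to 2:00 AM PST window for a given date, expressed in UTC."""
    start = datetime(year, month, day, 0, 1, tzinfo=PST)
    end = datetime(year, month, day, 2, 0, tzinfo=PST)
    return start.astimezone(timezone.utc), end.astimezone(timezone.utc)


# Example date only; pick your own launch day.
start, end = launch_window_utc(2025, 1, 14)
print(start.isoformat(), "to", end.isoformat())
```

From UTC you can convert to your own local time with `astimezone()` and a `timezone` for your offset.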

Once you’ve launched, be ready to engage with the community throughout the day. Respond to comments, answer questions, and thank users for their support. Product Hunt users appreciate creators who are active and communicative, and this can help drive more visibility for your Actor.

### Engage with your audience

The first few hours after your launch are crucial for gaining traction. Engage with users who comment on your product page, answer any questions, and address any concerns they might have. The more interaction you generate, the more likely you are to climb the daily rankings.

Be transparent and friendly in your responses. If users point out potential improvements or bugs, acknowledge them and make a commitment to improve your Actor. Product Hunt users are often open to giving feedback, and this can help you iterate on your product quickly.

If possible, have team members or collaborators available to help respond to comments. The more responsive and helpful you are, the better the overall experience will be for users checking out your Actor.

Leverage Apify

You can also give a shoutout to Apify. This way, the Apify community on Product Hunt will also be notified about your Actor: https://www.producthunt.com/stories/introducing-shoutouts

## Expectations and results

Launching on Product Hunt can provide a massive spike in short-term traffic and visibility. However, it’s important to manage your expectations. Not every launch will result in hundreds of upvotes or immediate sales. Here’s what you can realistically expect:

* *Short-term traffic boost*. Your Actor might see a surge in visitors, especially on the day of the launch. If your Actor resonates with users, this traffic may extend for a few more days.
* *Potential long-term benefits*. While the short-term traffic is exciting, the long-term value lies in the relationships you build with early users. Some of them may convert into paying customers or become advocates for your Actor.
* *SEO boost*. Product Hunt is a high-authority site with a Domain Rating (DR) of 91 (https://help.ahrefs.com/en/articles/1409408-what-is-domain-rating-dr). Having your product listed can provide an SEO boost and help your Actor's page rank higher in search engines.
* *User feedback*. Product Hunt is a great place to gather feedback. Users may point out bugs, request features, or suggest improvements.

## Tricks for a successful launch

1. *Leverage your network*. Ask friends, colleagues, and early users to support your launch. Ask the Apify community. Ask your users. Encourage them to upvote, comment, and share your product on social media.
2. *Prepare for feedback*. Product Hunt users can be critical, but this is an opportunity to gather valuable insights. Be open to suggestions and use them to improve your Actor.
3. *Use a consistent brand voice*. Make sure your messaging is consistent across all platforms when you're responding to comments and promoting your launch on social media.
4. *Offer a special launch deal*. Incentivize users to try your Actor by offering a discount or exclusive access for Product Hunt users. This can drive early adoption and build momentum.

## Caveats to Product Hunt promotion

* *Not every Actor is a good fit*. Product Hunt is best for tools with broad appeal or innovative features. If your Actor is highly specialized or niche, it may not perform as well.
* *High competition*. Product Hunt is a popular platform, and your Actor will be competing with many other launches. A strong marketing strategy is essential to stand out.
* *Short-term focus*. While the traffic spike is great, Product Hunt tends to focus on short-term visibility. To maintain long-term growth, you’ll need to continue promoting your Actor through other channels.


---

# SEO

SEO means optimizing your content to rank high for your target queries in search engines such as Google, Bing, etc. SEO is a great way to get more users for your Actors. It’s also free, and it can bring you traffic for years. This guide will give you a simple framework to rank better for your targeted queries.

## Search intent

Matching the search intent of potential users is super important when creating your Actor's README. The information you include should directly address the problems or needs that led users to search for a solution like yours. For example:

* *User goals*: What are users trying to accomplish?
* *Pain points*: What challenges are they facing?
* *Specific use cases*: How might they use your Actor?

Make sure your README demonstrates how your Actor aligns with the search intent. This alignment helps users quickly recognize your Actor's value and helps Google understand your Actor and rank you better.

*Example:*

Let’s say you want to create a “YouTube Hashtag Scraper” Actor. After you search for “YouTube Hashtag Scraper”, you see that most people searching for it want to extract hashtags from YouTube videos, not download videos using a certain hashtag.

## Keyword research

Keyword research is a very important part of your SEO success. Without that, you won’t know which keywords you should target with your Actor, and you might be leaving traffic on the table by not targeting all the angles or targeting the wrong one.

We will do keyword research with free tools, but if you want to take this seriously, we highly recommend https://ahrefs.com/.

### Google autocomplete suggestions

Start by typing your Actor's main function or purpose into Google. As you type, Google will suggest popular search terms. These suggestions are based on common user queries and can provide insight into what your potential users are searching for.

*Example:*

Let's say you've created an Actor for scraping product reviews. Type "product review scraper" into Google and note the suggestions:

* product review scraper free
* product review scraper amazon
* product review scraper python
* product review scraper api

These suggestions reveal potential features or use cases to highlight in your README.

### Alphabet soup method

This technique is similar to the previous one, but it involves adding each letter of the alphabet after your main keyword to discover more specific and long-tail keywords.

*Example*:

Continue with "product review scraper" and add each letter of the alphabet:

* product review scraper a (autocomplete might suggest "api")
* product review scraper b (might suggest "best")
* product review scraper c (might suggest "chrome extension")

...and so on through the alphabet.
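The list of queries to type can be generated up front. This is a small illustrative sketch; the function name `alphabet_soup` is made up, and it only builds the query strings. You still enter them in Google yourself and note the autocomplete suggestions.

```python
import string


def alphabet_soup(seed: str) -> list[str]:
    """Return the seed keyword followed by each letter from a to z."""
    return [f"{seed} {letter}" for letter in string.ascii_lowercase]


queries = alphabet_soup("product review scraper")
for query in queries:
    print(query)
```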

### People Also Ask

Search for your Actor's main function or purpose on Google. Scroll down to find the "People Also Ask" section, which contains related questions.

*Example*:

For a "product review scraper" Actor:

* How do I scrape product reviews?
* Is it legal to scrape product reviews?
* What is the best tool for scraping reviews?
* How can I automate product review collection?

Now, you can expand the “People Also Ask” questions. Click on each question to reveal the answer and generate more related questions you can use in your README.

### Google Keyword Planner

Another way to collect more keywords is to use the official Google Keyword Planner. Go to https://ads.google.com/home/tools/keyword-planner/ and open the tool. You need a Google Ads account, so just create one for free if you don’t have one already.

After you’re in the tool, click on “Discover new keywords”, make sure you’re in the “Start with keywords” tab, enter your Actor's main function or purpose, and then select the United States as the region and English as the language. Click “Get results” to see keywords related to your Actor.

Write them down.

### Ahrefs Keyword Generator

Go to https://ahrefs.com/keyword-generator, enter your Actor's main function or purpose, and click “Find keywords.” You should see a list of keywords related to your Actor.

Write them down.

## What to do with the keywords

First, remove any duplicates that you might have on your list. You can use an online tool such as https://dedupelist.com/ for that.

After that, we need to get search volumes for your keywords. Put all your keywords in a spreadsheet, with one column being the keyword and the second one being the search volume.

Go to https://backlinko.com/tools/keyword, enter the keyword, and write down the search volume. You will also see other related keywords, so you might as well write them down if you don’t have them on your list yet.

At the end, you should have a list of keywords together with their search volumes. Use it to prioritize the keywords, name your Actor, choose its URL, and so on.
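If you prefer a script to a spreadsheet, the deduplication and prioritization steps can be sketched as follows. The function names are hypothetical and the search volumes are made-up placeholders; fill in the numbers you looked up yourself.

```python
def dedupe_keywords(keywords: list[str]) -> list[str]:
    """Drop duplicates, ignoring case and extra whitespace; keep first-seen order."""
    seen: set[str] = set()
    unique: list[str] = []
    for keyword in keywords:
        normalized = " ".join(keyword.lower().split())
        if normalized and normalized not in seen:
            seen.add(normalized)
            unique.append(normalized)
    return unique


def prioritize(volumes: dict[str, int]) -> list[tuple[str, int]]:
    """Sort (keyword, search volume) pairs from highest to lowest volume."""
    return sorted(volumes.items(), key=lambda item: item[1], reverse=True)


raw = ["Product Review Scraper", "product review  scraper", "product review scraper api"]
keywords = dedupe_keywords(raw)

# Made-up volumes for illustration only; replace them with real lookups.
volumes = {"product review scraper": 90, "product review scraper api": 40}
for keyword, volume in prioritize(volumes):
    print(f"{keyword}\t{volume}")
```

Normalizing case and whitespace before comparing catches near-duplicates that come from copying keywords out of different tools.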

### Headings

If it makes sense, consider using keywords with the biggest search volume and the most relevant for your Actor as H2 headings in your README.

Put the most relevant keyword at the beginning of the heading when possible. Also, remember to use a clear hierarchy. The main features are H2, sub-features are H3, etc.

### Content

When putting keywords in your Actor’s README, it's important to maintain a natural, informative tone. Your primary goal should be to create valuable, easily understandable content for your users.

Aim to use your most important keyword in the first paragraph of your README. This helps both search engines and users quickly understand what your Actor does. But avoid forcing keywords where they don't fit naturally.

In your content, use the keywords you gathered earlier wherever they fit naturally.

If there are relevant questions in your keyword list, you can always cover them within an “FAQ” section of your Actor.

Remember that while including keywords is important, always prioritize readability and user experience. Your content should flow naturally and provide real value to the reader.

## Learn more about SEO

If you want to learn more about SEO, these two free courses will get you started:

* https://ahrefs.com/academy/seo-training-course by Ahrefs
* https://www.semrush.com/academy/courses/seo/ by Semrush

The Ahrefs YouTube channel (https://www.youtube.com/@AhrefsCom/featured) is also a great resource. You can start with this video: https://www.youtube.com/watch?v=xsVTqzratPs


---

# Social media

**Social media is a powerful way to connect with your Actor users and potential users. Whether your tool focuses on web scraping or automation, social platforms can help you showcase its features, answer user questions, and grow your audience. This guide will show you how to use social media effectively, what to share, and how to avoid common mistakes along the way.**

Now, before we start listing social media platforms, it might be important to acknowledge something.

Developers are notorious for not using social media that much. Or they use social media exclusively in the context of their own interests: that won’t find them new users, but rather colleagues or collaborators.

That's a good start, and maybe it's enough. A developer who can also “do” social media is a unicorn. These are super rare. And if you want to really promote your Actor, you'll need to become that unicorn. Before we start, you need to understand the benefits of this activity.

***

## Why be active on social media

Engaging with your users on social media offers a lot of benefits beyond just promoting your Actor. Let’s look at some of the main reasons why being active online can be a game-changer for your Actor’s success:

1. Social platforms make it easy to gather real-time feedback and also provide support in real-time. You can quickly learn what users love, what they struggle with, and what features they’d like to see. This can guide your Actor’s future development. It also allows you to build trust and credibility with your audience.
2. Shot in the dark: social media exposes your Actor to new users who might not find you through search engines alone. A shared post or retweet can dramatically expand your reach, helping you grow your user base.
3. Consistent activity on social platforms creates more backlinks to your Actor’s page, which can improve its search engine ranking and drive organic traffic.

## Where to engage: Choosing the right platforms

Choosing the right platforms is key to reaching your target audience. Here's a breakdown of the best places for developers to promote their web scraping and automation tools:

* *Discord*: We started with an easy one. Create a community around your Actor to engage with users directly. Offering quick support and discussing the features of your Actor in a real-time chat setting can lead to deeper user engagement.

  Use Apify's Discord

  You can also promote your tools through Apify's Discord server: https://discord.com/invite/crawlee-apify-801163717915574323

* *Twitter (X)*: Good for short updates, feature announcements, and quick interactions with users. The tech community on Twitter is very active, which makes it a great spot for sharing tips and getting noticed.

* *Reddit*: In theory, subreddits like r/webscraping, r/automation, and r/programming allow you to share expertise, engage in discussions, and present your Actor as a solution. However, in reality, you have to be quite careful with promotion there. Be very mindful of subreddit rules to avoid spamming or over-promoting. For Reddit, personal stories on how you built the tool + a roadblock you might be facing right now are the safest formula. If a tool is already finished and perfected, it will be treated as promotional content. But if you're asking for advice - now that's a community activity.

* *TikTok*: Might not be an obvious choice, but that’s where most young people spend time. They discuss a myriad of topics, laugh at the newest memes, and create trends that take weeks to get to Reels and Shorts. If you want to create educational, fun, short video content (and be among the first to talk about web scraping), this is your place for experiments and taking algorithm guesses.

* *YouTube*: Ideal for tutorials and demos. A visual walk-through of how to use your Actor can attract users who prefer watching videos to reading tutorials or READMEs. It's also good for Shorts and short, funny content.

* *StackOverflow*: While not a traditional social media platform, StackOverflow is a great space to answer technical questions and demonstrate your expertise. Offering help related to web scraping or automation can build credibility, and you can subtly mention your Actor if it directly solves the issue (as long as it adheres to community guidelines).

* *LinkedIn*: If your Actor solves problems for professionals or automates business tasks, LinkedIn is the place to explain how your tool provides value to an industry or business.

***

## Best practices for promoting your Actor on social media

Now that you know where to engage and why it’s important, here are some best practices to help you make the most of social media:

1. *Offer value beyond promotion*: If you look around, you'll see that the golden rule of social media these days is to educate and entertain. Focus on sharing useful information related to your Actor. Post tips on automation, web scraping techniques, or industry insights that can help your audience. When you do promote your Actor, users will see it as part of a valuable exchange, not just an ad. Besides, constantly posting promotional content turns anybody off.
2. *Post consistently*: The most important rule for social media is to show up. Whether it’s a weekly post about new features or daily tips for using your Actor more effectively, maintaining a regular posting schedule keeps your audience connected.
3. *Visuals matter*: Screenshots, GIFs, and short videos can explain more than text ever could. Show users how your Actor works, the results it scrapes, or how automation saves time.
4. *Widen your reach*: Web scraping is a niche topic. Find ways to talk about it more widely. If you stumble upon ways to relate it to wider topics (news, science, research, even politics and art), use them. Or you can go more technical and talk about the various libraries and languages you can use to build scrapers.
5. *Use relevant hashtags*: Hashtags like #webscraping, #automation, #programming, and #IT help you reach a wider audience on platforms like Twitter and TikTok. Stick to a few relevant hashtags per post to avoid clutter.
6. *Engage actively*: Social media is a two-way street. Reply to comments, thank users for sharing your content, create stitches, and answer questions. Building relationships with your users helps foster loyalty and builds a sense of community around your Actor.
7. *Use polls and Q\&As*: Interactive content like polls or Q\&A sessions can drive engagement. Ask users what features they’d like to see next or run a live Q\&A to answer questions about using your Actor. These tools encourage participation and provide valuable insights.
8. *Collaborate with other creators*.

## Caveats to social media engagement

1. *Over-promotion*: Constantly pushing your Actor without offering value can turn users away. Balance your promotional content with educational posts, interesting links, or insights into the development process. Users are more likely to engage when they feel like they’re learning something, rather than just being sold to.
2. *Handling negative feedback*: Social media is a public forum, and not all feedback will be positive. Be prepared to address user concerns or criticism professionally. Responding kindly (or funnily) to criticism shows you’re committed to improving your tool and addressing users' needs.
3. *Managing multiple platforms*: Social media management can be time-consuming, especially if you’re active on multiple platforms. Focus on one or two platforms that matter most to your audience instead of spreading yourself too thin.
4. *Algorithm changes*: Social media platforms often tweak their algorithms, which can impact your content’s visibility. Stay updated on these changes, and adjust your strategy accordingly. If a post doesn’t perform well, experiment with different formats (videos, visuals, polls) to see what resonates with your audience.
5. *Privacy and compliance*: Be mindful when sharing user data or results, especially if your Actor handles sensitive information. Make sure your posts comply with privacy laws and don’t inadvertently expose any personal data.

## For inspiration

It's sometimes hard to think of a good reason to scream into the void that is social media. Here are 23 scenarios where you might use social media to promote your Actor or your work:

1. *Funny interaction with a user*: Share a humorous tweet or post about a quirky question or feedback from a user that highlights your Actor’s unique features.
2. *Roadblock story*: Post about a challenging bug you encountered while developing your Actor and how you solved it, including a screenshot or snippet of code.
3. *Success story*: Share a post detailing how a user’s feedback led to a new feature in your Actor and thank them for their suggestion.
4. *Tutorial video*: Create and share a short video demonstrating how to use a specific feature of your Actor effectively.
5. *Before-and-after example*: Post a visual comparison showing the impact of your Actor’s automation on a task or process.
6. *Feature announcement*: Announce a new feature or update in your Actor with a brief description and a call-to-action for users to try it out.
7. *User testimonial*: Share a positive review or testimonial from a user who benefited from your Actor, including their quote and a link to your tool.
8. *Live Q\&A*: Host a live Q\&A session on a platform like Twitter or Reddit, answering questions about your Actor and its capabilities.
9. *Behind-the-scenes look*: Post a behind-the-scenes photo or video of your development process or team working on your Actor.
10. *Debugging tip*: Share a tip or trick related to debugging or troubleshooting common issues with web scraping or automation.
11. *Integration highlight*: Post about how your Actor integrates with other popular tools or platforms, showcasing its versatility. Don't forget to tag them.
12. *Case study*: Share a case study or success story showing how a business or individual used your Actor to achieve specific results.
13. *Commentary on a news piece*: Offer your perspective on a recent news story related to technology, scraping, or automation. If possible, explain how it relates to your Actor.
14. *User-generated content*: Share content created by your users, such as screenshots or examples of how they’re using your Actor.
15. *Memes*: Post a relevant meme about the challenges of web scraping or automation.
16. *Milestone celebration*: Announce and celebrate reaching a milestone, such as a certain number of users or downloads for your Actor.
17. *Quick tip*: Share a short, useful tip or hack related to using your Actor more efficiently.
18. *Throwback post*: Share a throwback post about the early development stages of your Actor, including any challenges or milestones you achieved.
19. *Collaboration announcement*: Announce a new collaboration with another developer or tool, explaining how it enhances your Actor’s functionality.
20. *Community shout-out*: Give a shout-out to a user or community member who has been particularly supportive or helpful.
21. *Demo invitation*: Invite your followers to a live demo or webinar where you’ll showcase your Actor and answer questions.
22. *Feedback request*: Ask your audience for feedback on a recent update or feature release, and encourage them to share their thoughts.
23. *Book or resource recommendation*: Share a recommendation for a book or resource that helped you in developing your Actor, and explain its relevance.


---

# Video tutorials

**Videos and live streams are powerful tools for connecting with users and potential users, especially when promoting your Actors. You can use them to demonstrate functionality, provide tutorials, or engage with your audience in real time.**

***

## Why videos and live streams matter

1. *Visual engagement*. Videos allow you to show rather than just tell. Demonstrating how your Actor works or solves a problem makes the content more engaging and easier to understand. For complex tools, visual explanations can be much more effective than text alone.
2. *Enhanced communication*. Live streams offer a unique opportunity for direct interaction. You can answer questions, address concerns, and gather immediate feedback from your audience, creating a more dynamic and personal connection.
3. *Increased reach*. Platforms like YouTube and TikTok have massive user bases, giving you access to a broad audience. Videos can also be shared across various social media channels, extending your reach even further.

Learn more about the rules of live streams in our next section: https://docs.apify.com/academy/actor-marketing-playbook/promote-your-actor/webinars.md

## Optimizing videos for SEO

1. *Keywords and titles*. Use relevant keywords in your video titles and descriptions. For instance, if your Actor is a web scraping tool, include terms like “web scraping tutorial” or “how to use web scraping tools” to help users find your content.
2. *Engaging thumbnails*. Create eye-catching thumbnails that accurately represent the content of your video. Thumbnails are often the first thing users see, so make sure they are visually appealing and relevant.
3. *Transcriptions and captions*. Adding transcripts and captions to your videos improves accessibility and can enhance SEO. They allow search engines to index your content more effectively and help users who prefer reading or have hearing impairments.

## YouTube vs. TikTok

1. *YouTube*. YouTube is an excellent platform for longer, detailed videos. Create a channel dedicated to your Actors and regularly upload content such as tutorials, feature walkthroughs, and industry insights. Utilize YouTube’s SEO features by optimizing video descriptions, tags, and titles with relevant keywords. Engage with your audience through comments and encourage them to subscribe for updates. Collaborating with other YouTubers or influencers in the tech space can also help grow your channel.
2. *TikTok*. TikTok is ideal for short, engaging videos. Use it to share quick tips, demo snippets, or behind-the-scenes content about your Actors. The platform’s algorithm favors high engagement, so create catchy content that encourages viewers to interact. Use trending hashtags and participate in challenges relevant to your niche to increase visibility. Consistency is key, so post regularly and monitor which types of content resonate most with your audience.

## Growing your channels

1. *Regular content*. Consistently upload content to keep your audience engaged and attract new viewers. Create a content calendar to plan and maintain a regular posting schedule.
2. *Cross-promotion*. Share your videos across your social media channels, blogs, and newsletters. This cross-promotion helps drive traffic to your videos and increases your reach.
3. *Engage with your audience*. Respond to comments and feedback on your videos. Engaging with viewers builds a community around your content and encourages ongoing interaction.
4. *Analyze performance*. Use analytics tools provided by YouTube and TikTok to track the performance of your videos. Monitor metrics like watch time, engagement rates, and viewer demographics to refine your content strategy.

***

## Where to mention videos across your Actor ecosystem

1. *README*: include links to your videos in your Actor’s README file. For example, if you have a tutorial video, mention it in a "How to scrape X" or "Resources" section to guide users.
2. *Input schema*: if your Actor’s input schema includes complex fields, link to a video that explains how to configure these fields. This can be especially helpful for users who prefer visual guides.
3. *Social media*: share your videos on platforms like Twitter, LinkedIn, and Facebook. Use engaging snippets or highlights to attract users to watch the full video.
4. *Blog posts*: embed videos in your blog posts for a richer user experience. If you write a tutorial or feature update, include a video to provide additional context.
5. *Webinars and live streams*: mention your videos during webinars or live streams. If you’re covering a topic related to a video you’ve posted, refer to it as a supplemental resource.


---

# Webinars

Webinars and live streams are a fantastic way to connect with your audience, showcase your Actor's capabilities, and gather feedback from users. Though the term webinar might sound outdated these days, the concept of a live video tutorial is alive and well in the world of marketing and promotion.

Whether you're introducing a new feature, answering questions, or walking through a common use case, a live event can create more personal engagement, boost user trust, and open the door for valuable two-way communication.

But how do you get started? Here's a friendly guide on where to host, how to prepare, and what to do before, during, and after your webinar.

***

## Why host a live stream?

Here are a few reasons why live streams are ideal for promoting your Actor:

* *Demo*. You can show your Actor in action and highlight its most powerful features. You can tell a story about how you built it. You can also show how your Actor interacts with other tools and platforms and what its best uses are. A live demo lets users see immediately how your tool solves their problems.
* *Building trust and rapport*. Interacting directly with your users builds trust and rapport. Even just showing up with your face and voice gives your users a chance to meet you and get a feel for the team behind the Actor.
* *Live Q\&A*. Users often have questions that can be hard to fully address in documentation, README, or tutorials. A live session allows for Q\&A, so you can explain complex features and demonstrate how to overcome common issues.
* *Tutorial or training*. If you don't have time for complex graphics, this is an easy replacement for a video tutorial until you do. Remember that some platforms (YouTube) give the option of publishing the webinar after it's over, so you can later reuse it in other content or as a guide. Also, if you’ve noticed users struggling with particular features, a webinar is a great way to teach them directly.

Webinars help build a community around your Actor and turn one-time users into loyal advocates.

## Where to host your webinar or live stream

It all goes back to where you have or would like to have your audience and whether you want to have the webinar available on the web later.

1. Social media:

   

   1. *YouTube*: ideal for reaching a broad audience. It’s free and easy to set up. You can also make recordings available for future viewing.
   2. *TikTok*: same, ideal for reaching a broad audience, free and easy to set up. However, live video will disappear once the broadcast has ended. TikTok does allow you to save your livestreams. You won't be able to republish them to the platform (we assume your live stream will be longer than 10 minutes). But you can later re-upload it elsewhere.
   3. *Twitch*: Known for gaming, Twitch has become a space for tech demos, coding live streams, and webinars. If your target audience enjoys an interactive and casual format, Twitch might be a good fit.
   4. *LinkedIn*: If your audience is more professional, LinkedIn Live could be a good fit to present your Actor there. Once a stream is complete, it will remain on the feed of your LinkedIn Page or profile as a video that was ‘previously recorded live’.
   5. *Facebook*: Not recommended.

2. General platforms:
   
   1. *Zoom* or *Google Meet*: More personal, these are great for smaller webinars where you might want closer interaction. They also give you control over who attends.

Pick a platform where your users are most likely to hang out. If your audience is primarily tech-savvy, YouTube or Twitch could work. If your Actor serves businesses, LinkedIn might be the best spot.

## Webinar/live stream prep

### Promote your webinar and get your users

* Send an email blast. If you have an email list of users or potential users, send them a friendly invite. Include details about what you’ll cover and how they can benefit from attending.

* Social media promotion on Twitter (X), LinkedIn, or other platforms. Highlight what people will learn and any special features you’ll be demonstrating. Do it a few times - 2 weeks before the webinar, 1 week before, a day before, and the day of. Don't forget to announce on Apify’s Discord. These are places where your potential audience is likely hanging out. Let them know you’re hosting an event and what they can expect.
* Use every piece of real estate on Apify Store and Actor pages. Add a banner or notification to your Actor’s page (top of the README): This can be a great way to notify people who are already looking at your Actor. A simple “join us for a live demo on DATE” message works well. Add something like that to your Store bio and its README. Mention it at the top description of your Actor's input schema.

Use UTM tags

When creating a link to share to the webinar, you can add different UTM tags for different places where you will insert the link. That way you can later learn which space brought the most webinar sign-ups.

* Collaborate with other developers. If you can team up with someone in the Apify community, you’ll double your reach. Cross-promotion can bring in users from both sides.
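
To illustrate the UTM tip above, here is a minimal sketch of how tagged links could be generated for each promotion channel. The webinar URL, the campaign name, and the helper `add_utm` are hypothetical; `utm_source`, `utm_medium`, and `utm_campaign` are the commonly used UTM parameter names, and your analytics tool may expect others.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def add_utm(url, source, medium, campaign):
    """Return `url` with UTM parameters appended to its query string."""
    parts = urlsplit(url)
    # Keep any query parameters the link already has.
    query = parse_qsl(parts.query, keep_blank_values=True)
    query += [
        ("utm_source", source),
        ("utm_medium", medium),
        ("utm_campaign", campaign),
    ]
    return urlunsplit(parts._replace(query=urlencode(query)))


# Hypothetical sign-up page and channels, for illustration only.
WEBINAR_URL = "https://example.com/webinar"
CHANNELS = {
    "discord": "community",
    "linkedin": "social",
    "readme": "actor-page",
    "newsletter": "email",
}

tagged_links = {
    source: add_utm(WEBINAR_URL, source, medium, "actor-demo")
    for source, medium in CHANNELS.items()
}

for source, link in tagged_links.items():
    print(f"{source}: {link}")
```

Using one distinct `utm_source` per placement is what lets you compare sign-ups by channel afterwards.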

***

### Plan the content

Think carefully about what you’ll cover. Focus on what’s most relevant for your audience:

* *Decide on your content*. What will you cover? A demo? A deep dive into Actor configurations? Create a flow and timeline to keep yourself organized.
* *Prepare visuals*. Slides, product demos, and examples are helpful to explain complex ideas clearly.
* *Feature highlights*. Demonstrate the key features of your Actor. Walk users through common use cases and be ready to show live examples.
* *Input schema*. If your Actor has a complex input schema, spend time explaining how to use it effectively. Highlight tips that will save users time and frustration. You can incorporate your knowledge from the issues tab.
* *Q\&A session*. Leave time for questions at the end. Make sure to keep this flexible, as it’s often where users will engage the most.

Don't forget to add an intro with an agenda and an outro with your contact details.

Consider timezones

When thinking of when to run the webinar, focus on the timezone of the majority of your users.

### Prepare technically

Test your setup before going live. Here’s what to focus on:

* *Stable internet connection*. This one’s obvious but essential. Test your stream quality ahead of time.
* *Test the Actor live*. If you're demoing your Actor, ensure it works smoothly. Avoid running scripts that take too long or have potential bugs during the live session.
* *Audio quality*. People are far more likely to tolerate a blurry video than bad audio. Use a good-quality microphone to ensure you’re heard clearly.
* *Screen sharing*. If you’re doing a live demo, make sure you know how to seamlessly switch between windows and share your screen effectively.
* *Backup plan*. Have a backup plan in case something goes wrong. This could be as simple as a recorded version of your presentation to share if things go south during the live session.
* *Make it interactive*. Consider using polls or a live Q\&A session to keep the audience engaged. Maybe have a support person assisting with that side of things while you're speaking.

## Best practices during the live stream

When the time comes, here’s how to make the most of your webinar or live stream:

* *Start with an introduction*. Begin with a brief introduction of yourself, the Actor you’re showcasing, and what attendees can expect to learn. This sets expectations and gives context. It's also best if you have a slide that shows the agenda.
* *Try to stay on time*. Stick to the agenda. Users appreciate when events run on schedule.
* *Show a live demo*. Walk through a live demo of your Actor. Show it solving a problem from start to finish.
* *Explain as you go*. Be mindful that some people might be unfamiliar with technical terms or processes. Try to explain things simply and offer helpful tips as you demonstrate but don't go off on a tangent.
* *Invite questions and engage your audience*. Encourage users to ask questions throughout the session. This creates a more conversational tone and helps you address their concerns in real time. You can also ask a simple question or poll to get the chat going. Try to direct the Q\&A into one place so you don't have to switch tabs. Throughout the presentation, pause for questions and make sure you're addressing any confusion in real time.
* *Wrap up with a clear call to action*. Whether it’s to try your Actor, leave a review, or sign up for a future live, finish with a clear CTA. Let them know the next step to take.

These practices work for a simple tutorial walkthrough as well as for a code-along session.

## After the live session

Once your live session wraps up, there are still ways to benefit from it:

* *Make it public and share the recording*. Not everyone who wanted to attend will have been able to make it. Send a recording to all attendees whose emails you have and make it publicly available on your channels (emails, README, social media, etc.). Upload the recorded session to YouTube and your Actor’s documentation. If it's on YouTube, you can also ask Apify's video team to add it to their Community playlist. Make it easy for people to revisit the content or share it with others.
* *Follow up with attendees, thank them, and ask for feedback*. Send a follow-up email thanking people for attending. Include a link to the recording, additional resources, and ways to get in touch if they have more questions. Share any special offers or discount codes if relevant. If you don’t have the attendees' emails, include a link in your newsletter and publish it on your channels. Ask for feedback on what they liked and what could be improved. This can guide your next webinar or help fine-tune your Actor.
* *Answer lingering questions*. If any questions didn’t get answered live, take the time to address them in the follow-up email.
* *Create a blog post or article*. Summarize the key points of your webinar in a written format. This can boost your SEO and help users find answers in the future.
* *Review your performance*. Analyze the data from your webinar, if available. How many people attended? Which platform brought the most sign-ups? How many questions did you receive? Were there any technical difficulties? This helps refine your approach for future events.
* *Share snippets from the webinar or interesting takeaways on social media*. Encourage people to watch the recording and let them know when you’ll be hosting another event.


---

# How Actor monetization works

**You can turn your web scrapers into a source of income by publishing them on Apify Store. Learn how it's done and what monetization options you have.**

***

## Monetizing your Actor

Monetizing your Actor on the Apify platform involves several key steps:

1. *Development*: create and refine your Actor.
2. *Testing*: ensure your Actor works reliably.
3. *Publication & monetization*: publish your Actor and set up its monetization model.
4. *Promotion*: attract users to your Actor.

***

## Monetization models

### Pay-per-event pricing model

![pay per event model example](/assets/images/ppe-model-0e4ba61669f4bffb1fe144b4d225e3c2.png)

* *How it works*: you charge users based on specific events triggered programmatically by your Actor's code. You earn 80% of the revenue minus platform usage costs.

* *Profit calculation*: `profit = (0.8 * revenue) - platform usage costs`

* *Event cost example*: you set the following events for your Actor:

  

  * `Actor start per 1 GB of memory` at $0.005
  * `Pages scraped` at $0.002
  * `Page opened with residential proxy` at $0.002 - this is on top of `Pages scraped`
  * `Page opened with a browser` at $0.002 - this is on top of `Pages scraped`

* *Example*:

  

  * User A:

    

    * Started the Actor 10 times = $0.05
    * Scraped 1,000 pages = $2.00
    * 500 of those were scraped using residential proxy = $1.00
    * 300 of those were scraped using browser = $0.60
    * This comes up to $3.65 of total revenue

  * User B:

    

    * Started the Actor 5 times = $0.025
    * Scraped 500 pages = $1.00
    * 200 of those were scraped using residential proxy = $0.40
    * 100 of those were scraped using browser = $0.20
    * This comes up to $1.625 of total revenue

  * That means if platform usage costs are $0.365 for user A and $0.162 for user B, your profit is `(0.8 * $5.275) - $0.527 = $3.693`
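
As a worked check, the sketch below applies the stated formula `profit = (0.8 * revenue) - platform usage costs` to the two users in this example. The event keys and function names are illustrative and not part of any Apify API, and the Actor start price assumes each run uses 1 GB of memory, as the example implies.

```python
from decimal import Decimal

# Event prices from the example above, in USD per event.
# The dictionary keys are illustrative names, not Apify identifiers.
EVENT_PRICES = {
    "actor_start": Decimal("0.005"),  # assumes 1 GB of memory per start
    "page_scraped": Decimal("0.002"),
    "residential_proxy_page": Decimal("0.002"),
    "browser_page": Decimal("0.002"),
}
REVENUE_SHARE = Decimal("0.8")


def revenue(event_counts):
    """Total amount charged to one user for the events they triggered."""
    return sum(
        (EVENT_PRICES[name] * count for name, count in event_counts.items()),
        Decimal("0"),
    )


def profit(total_revenue, platform_costs):
    """profit = (0.8 * revenue) - platform usage costs"""
    return REVENUE_SHARE * total_revenue - platform_costs


user_a = {"actor_start": 10, "page_scraped": 1000,
          "residential_proxy_page": 500, "browser_page": 300}
user_b = {"actor_start": 5, "page_scraped": 500,
          "residential_proxy_page": 200, "browser_page": 100}

revenue_a = revenue(user_a)  # $3.65
revenue_b = revenue(user_b)  # $1.625
platform_costs = Decimal("0.365") + Decimal("0.162")
total_profit = profit(revenue_a + revenue_b, platform_costs)

print(f"Revenue: ${revenue_a + revenue_b}, profit: ${total_profit}")
```

`Decimal` is used instead of floats so that small per-event prices add up exactly.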

Pay-per-event details

If you want more details about PPE pricing, refer to our https://docs.apify.com/platform/actors/publishing/monetize/pay-per-event.md.

### Pay-per-result pricing model

![pay per result model example](/assets/images/ppr-model-c7cd05e9f4a2a973bb8101fed2eaab67.png)

* *How it works*: you charge users based on the number of results your Actor generates. You earn 80% of the revenue minus platform usage costs.

* *Profit calculation*: `profit = (0.8 * revenue) - platform usage costs`

* *Cost breakdown*:

  

  * Compute unit: $0.3 per CU
  * Residential proxies: $13 per GB
  * SERPs proxy: $3 per 1,000 SERPs
  * Data transfer (external): $0.20 per GB
  * Dataset storage: $1 per 1,000 GB-hours

* *Example*: you set a price of $1 per 1,000 results. Two users generate 50,000 and 20,000 results, paying $50 and $20, respectively. If the platform usage costs are $5 and $2, your profit is $49.
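
The same arithmetic can be written as a short sketch. The function names are illustrative; only the price, result counts, platform costs, and the 80% share come from the example above.

```python
from decimal import Decimal

PRICE_PER_1000_RESULTS = Decimal("1")  # USD, from the example above
REVENUE_SHARE = Decimal("0.8")


def ppr_revenue(results):
    """Amount a user pays for the given number of results."""
    return PRICE_PER_1000_RESULTS * results / 1000


def ppr_profit(result_counts, platform_costs):
    """profit = (0.8 * revenue) - platform usage costs"""
    total_revenue = sum((ppr_revenue(n) for n in result_counts), Decimal("0"))
    return REVENUE_SHARE * total_revenue - sum(platform_costs, Decimal("0"))


# Two users with 50,000 and 20,000 results; usage costs of $5 and $2.
example_profit = ppr_profit([50_000, 20_000], [Decimal("5"), Decimal("2")])
print(f"Profit: ${example_profit}")
```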

Pay-per-result details

If you want more details about PPR pricing, refer to our https://docs.apify.com/platform/actors/publishing/monetize/pay-per-result.md.

### Rental pricing model

![rental model example](/assets/images/rental-model-727e0b838b54bbd57b7e6095cddd90a7.png)

* *How it works*: you offer a free trial period and set a monthly fee. Users on Apify paid plans can continue using the Actor after the trial. You earn 80% of the monthly rental fees.

* *Example*: you set a 7-day free trial and $30/month rental. If 3 users start using your Actor:

  

  * 1st user on a paid plan pays $30 after the trial (you earn $24).
  * 2nd user starts their trial but pays next month.
  * 3rd user on a free plan finishes the trial without upgrading to a paid plan and can’t use the Actor further.
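
A minimal sketch of the rental payout, using the 80% share and the $30 monthly fee from the example. The second-month figure assumes that the first user keeps renting and the second user converts after the trial, which the example does not guarantee.

```python
from decimal import Decimal

DEVELOPER_SHARE = Decimal("0.8")  # you earn 80% of the monthly rental fees


def rental_earnings(monthly_fee, paying_users):
    """Developer earnings for one month of rentals."""
    return DEVELOPER_SHARE * monthly_fee * paying_users


monthly_fee = Decimal("30")

# Month 1: only the 1st user pays after the trial.
first_month = rental_earnings(monthly_fee, paying_users=1)

# Month 2 (assumption): the 1st user continues and the 2nd user starts paying.
second_month = rental_earnings(monthly_fee, paying_users=2)

print(f"Month 1: ${first_month}, month 2: ${second_month}")
```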

Rental pricing details

If you want more details about rental pricing, refer to our https://docs.apify.com/platform/actors/publishing/monetize/rental.md.

## Setting up monetization

1. *Go to your Actor page*: navigate to the **Publication** tab and open the **Monetization** section.
2. *Fill in billing details*: set up your payment details for payouts.
3. *Choose your pricing model*: use the monetization wizard to select your model and set fees.

### Changing monetization

Adjustments to monetization settings take 14 days to take effect and can be made once per month.

### Tracking and promotion

* *Track profit*: review payout invoices and statistics in Apify Console (**Monitoring** tab).
* *Promote your Actor*: optimize your Actor’s description for SEO, share on social media, and consider creating tutorials or articles to attract users.

## Marketing tips for defining the price for your Actor

It's up to you to set the pricing, of course. It can be as high or low as you wish; you can even make your Actor free. But if you're generally aiming for a successful, popular Actor, here are a few directions:

### Do market research outside Apify Store

The easiest way to understand your tool's value is to look around. Are there similar tools on the market? What do they offer, and how much do they charge? What added value does your tool provide compared to theirs? What features can your tool borrow from theirs for the future?

Try competitor tools yourself (to assess the value and the quality they provide), check their SEO (to see how much traffic they get), and note ballpark figures. Think about what your Actor can do that competitors might be missing.

Also, remember that your Actor is a package deal with the Apify platform. All the platform's features automatically transfer onto your Actor and its value. Scheduling, monitoring runs, ways of exporting data, proxies, and integrations can all add value to your Actor (on top of its own functionalities). Be sure to factor this into your tool's value proposition and communicate that to the potential user.

### Do research in Apify Store

Apify Store is like any other marketplace, so take a look at your competition there. Are you the first in your lane, or are there other similar tools? What makes yours stand out? Remember, your README is your first impression — communicate your tool's benefits clearly and offer something unique. Competing with other developers is great, but collaborations can drive even better results 😉

Learn more about what makes a good readme here: https://docs.apify.com/academy/actor-marketing-playbook/actor-basics/how-to-create-an-actor-readme.md

### Rental, pay-per-result (PPR), or pay-per-event (PPE)

Rental pricing allows you to charge a monthly fee for your Actor and users cover their own compute usage.

Pay-per-result (PPR) charges users based on the number of items your Actor adds to the dataset. This model works best when each dataset item represents clear value to the user - like scraped product listings, extracted contact information, or processed documents.

Pay-per-event (PPE) gives you the most flexibility and growth potential. You can charge for any meaningful event your Actor performs (for example, page scraped, browser page opened, or an external API call). This makes costs predictable for users, lets you model value precisely, and is fully compatible with AI and MCP-based integrations.

Additional benefits

Actors that implement PPE receive additional benefits, including increased visibility in Apify Store and enhanced discoverability.

To estimate pricing, run a few test runs and review the statistics in the Actor https://console.apify.com/actors?tab=analytics tab.

### Adapt when needed

Don’t be afraid to experiment with pricing, especially at the start. You can monitor your results in the dashboard and adjust if necessary.

Keep an eye on SEO as well. If you monitor the volume of the keywords your Actor is targeting as well as how well your Actor's page is ranking for those keywords, you can estimate the number of people who actually end up trying your tool (aka conversion rate). If your keywords are getting volume but conversions are lower than expected, it might point to a few issues: your pricing, a verbose README, or complex input. If users are bouncing right away, it makes sense to check out your pricing and your closest competitors to see where adjustments might help.

### Summary & a basic plan

Pick a pricing model, run some tests, and calculate your preliminary costs (**Analytics** tab in Console).

Then check your costs against similar solutions in the Store and the market (try Google search or other marketplaces), and set a price that gives you some margin.

It’s also normal to adjust pricing as you get more demand. For context, most prices on Apify Store range between $1-10 per 1,000 results.
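
One way to turn measured costs into a starting price is to rearrange the profit formula from the pay-per-result section, `profit = (0.8 * revenue) - platform usage costs`. The sketch below does that per 1,000 results. The cost and target profit figures are hypothetical placeholders; replace them with your own numbers from test runs.

```python
from decimal import Decimal

REVENUE_SHARE = Decimal("0.8")


def break_even_price(cost_per_1000):
    """Price per 1,000 results at which profit is exactly zero."""
    return cost_per_1000 / REVENUE_SHARE


def price_for_target_profit(cost_per_1000, target_profit_per_1000):
    """Rearranged from: profit = 0.8 * price - cost."""
    return (cost_per_1000 + target_profit_per_1000) / REVENUE_SHARE


# Hypothetical numbers, for illustration only.
measured_cost = Decimal("0.40")  # platform usage cost per 1,000 results
target_profit = Decimal("0.40")  # desired profit per 1,000 results

floor_price = break_even_price(measured_cost)
suggested_price = price_for_target_profit(measured_cost, target_profit)

print(f"Break-even: ${floor_price}, suggested: ${suggested_price} per 1,000 results")
```

Any price below the break-even value loses money on every run, so it works as a floor when you compare your price against competitors.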

Example of useful pricing estimates from the **Analytics** tab:

![example of pricing estimates in analytics tab](/assets/images/analytisc-example-e5005177826fdce533bedec8beb29b4e.png)

Use emails!

📫 Don't forget to set an email sequence to warn and remind your users about pricing changes. Learn more about emailing your users here: \[Emails to Actor users]

## Resources

* Learn about https://apify.com/partners/actor-developers
* Detailed guide to https://docs.apify.com/academy/get-most-of-actors/monetizing-your-actor
* Guide to https://docs.apify.com/platform/actors/publishing
* Watch our webinar on how to https://www.youtube.com/watch?v=4nxStxC1BJM
* Read a blog post from our CEO on the https://blog.apify.com/make-regular-passive-income-developing-web-automation-actors-b0392278d085/
* Learn about the https://apify.com/pricing/creator-plan, which allows you to create and freely test your own Actors for $1


---

# How Apify Store works

**Most of the thousands of Actors on the Apify Store marketplace (https://apify.com/store) were created by developers just like you. Let's get acquainted with the concept of Apify Store and what it takes to publish an Actor there.**

***

## What are Actors (and why they're called that)?

Actors (https://apify.com/actors) are serverless cloud applications that run on the Apify platform, capable of performing various computing tasks on the web, such as crawling websites or sending automated emails. They are developed by independent developers all over the world, and *you can be one of them*.

The term "Actor" is used because, like human actors, these programs follow a script. This naming convention unifies both web scraping and web automation solutions, including AI agents, under a single term. Actors can range in complexity and function, targeting different websites or performing multiple tasks, which makes the umbrella term very useful.

## What is Apify Store?

Apify Store (https://apify.com/store) is a public library of Actors that is constantly growing and evolving. It's basically a publicly visible (and searchable) part of the Apify platform. Of the thousands of Actors currently available, most are created and maintained by the community. Actors that consistently perform well remain on Apify Store, while those reported as malfunctioning or under maintenance are eventually removed. This keeps the tools in our ecosystem reliable, effective, and competitive.

### Types of Actors

* *Web scraping Actors*: for instance, https://apify.com/apidojo/twitter-user-scraper extracts data from Twitter.
* *Automation Actors*: for example, https://apify.com/jakubbalada/content-checker monitors website content for changes and emails you once a change occurs.
* *Bundles*: chains of multiple Actors united by a common data point or target website. For example, https://apify.com/tri_angle/restaurant-review-aggregator can scrape reviews from six platforms at once.

Learn more about bundles here: https://docs.apify.com/academy/actor-marketing-playbook/product-optimization/actor-bundles.md

## Public and private Actors

Actors on Apify Store can be public or private:

* *Private Actors*: these are only accessible to you in Apify Console. You can use them without exposing them to the web. However, you can still share the results they produce.
* *Public Actors*: these are available to everyone on Apify Store. You can choose to make them free or set a price. By publishing your web scrapers and automation solutions, you can attract users and generate income.

## How Actor monetization works (briefly)

You can monetize your Actors using four different pricing models:

* Pay for usage: charge based on how much the Actor is used.
* Pay per result: the price is based on the number of results produced, with the first few free.
* Pay per event: the price is based on specific events triggered by the Actor.
* Monthly billing: set a fixed monthly rental rate for using the Actor.

For detailed information on which pricing model might work for your Actor, refer to https://docs.apify.com/academy/actor-marketing-playbook/store-basics/how-actor-monetization-works.md.

## Actor ownership on Store

Actors are either created and maintained by Apify or by members of the community:

* *Maintained by Apify*: created and supported by the Apify team.
* *Maintained by Community*: created and managed by independent developers from the community.

To see who maintains an Actor, check the upper-right corner of the Actor's page.

When it comes to managing Actors on Apify, it’s important that every potential community developer understands the differences between Apify-maintained and Community-maintained Actors. Here’s what you need to know to navigate the platform effectively and ensure your work stands out.

### Community-maintained Actors

✨ *Features and functionality*: offers a broader range of use cases and features, often tailored to specific needs. Great for exploring unique or niche applications.

🧑‍💻 *Ownership*: created and maintained by independent developers like you.

🛠 *Maintenance*: you’re responsible for all updates, bug fixes, and ongoing maintenance. Apify hosts your Actor but does not manage its code.

👷‍♀️ *Reliability and testing*: it’s up to you to ensure your Actor’s reliability and performance.

☝️ *Support and Issues*: Apify does not provide direct support for Community-maintained Actors. You must manage issues through the Issues tab, where you handle user queries and problems yourself.

✍️ *Documentation*: you’re responsible for creating and maintaining documentation for your Actor. Make sure it’s clear and helpful for users.

Test your Actor!

For the best results, make sure your Actor is well-documented and thoroughly tested. Engage with users through the Issues tab to address any problems promptly. By maintaining high standards and being proactive, you’ll enhance your Actor’s reputation and usability in Apify Store.

## Importance of Actor testing and reliability

It's essential to test your Actors and make sure they work as intended. That's why Apify tests Actors on its side, just as you should on yours.

Apify runs automated tests daily to ensure all Actors on Apify Store are functional and reliable. These tests check *if an Actor can successfully run with its default input within 5 minutes*. If an Actor fails for three consecutive days, it’s labeled under maintenance, and the developer is notified. Continuous failures for another 28 days lead to deprecation.
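The thresholds above can be read as a small state function. This is only an interpretation of the paragraph, not Apify's actual implementation: the status names and the `actor_health` helper are made up, and it assumes deprecation happens after 3 + 28 consecutive failed days.

```python
MAINTENANCE_AFTER_DAYS = 3       # consecutive failed days before "under maintenance"
DEPRECATION_AFTER_DAYS = 3 + 28  # assumption: 28 more failed days lead to deprecation


def actor_health(consecutive_failed_days: int) -> str:
    """Map a streak of failed daily test runs to a (hypothetical) status label."""
    if consecutive_failed_days >= DEPRECATION_AFTER_DAYS:
        return "deprecated"
    if consecutive_failed_days >= MAINTENANCE_AFTER_DAYS:
        return "under maintenance"
    return "healthy"


for days in (0, 3, 31):
    print(days, actor_health(days))
```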

To restore an Actor's health, developers should fix and rebuild it. The testing system will automatically recognize the changes within 24 hours. If your Actor requires longer run times or authentication, contact support to explain why it should be excluded from tests. For more control, you can implement your own tests using the Actor Testing tool available on Apify Store.

### Actor metrics and reliability score

On the right panel of each Actor on Store, you can see a list of Actor metrics.

Actor metrics such as the number of monthly users, star ratings, success rates, response times, creation dates, and recent modifications collectively offer insights into its reliability. Basically, they serve as a *shorthand for potential users to assess your Actor's reliability* before even trying it out.

A high number of monthly users indicates widespread trust and effective performance, while a high star rating reflects user satisfaction. A success rate nearing 100% demonstrates consistent performance. Short response times show a commitment to addressing issues promptly; the quicker the response, the better. A recent creation date suggests modern features and ongoing development, while recent modifications point to active maintenance and continuous improvements. Together, these metrics provide a comprehensive view of an Actor’s reliability and quality.

### Reporting Issues in Actors

Each Actor has an **Issues** tab in Apify Console and on the web. Here, users can open an issue (ticket) and engage in discussions with the Actor's creator, platform admins, and other users. The tab is ideal for asking questions, requesting new features, or providing feedback.

Since the **Issues** tab is public, the level of activity — or lack thereof — can be observed by potential users and may serve as an indicator of the Actor's reliability. A well-maintained Issues tab with prompt responses suggests an active and dependable Actor.

Learn more about how to handle the https://docs.apify.com/academy/actor-marketing-playbook/interact-with-users/issues-tab.md

## Resources

* Best practices on setting up https://docs.apify.com/platform/actors/publishing/test
* What are Apify-maintained and https://help.apify.com/en/articles/6999799-what-are-apify-maintained-and-community-maintained-actors? On ownership, maintenance, features, and support
* Step-by-step guide on how to https://docs.apify.com/platform/actors/publishing
* Watch our webinar on how to https://www.youtube.com/watch?v=4nxStxC1BJM
* Detailed https://docs.apify.com/platform/actors/running/actors-in-store for Actors in Store


---

# How to build Actors

At Apify, we try to make building web scraping and automation straightforward. You can customize our universal scrapers with JavaScript for quick tweaks, use our code templates for rapid setup in JavaScript, TypeScript, or Python, or build from scratch using our JavaScript and Python SDKs or Crawlee libraries for Node.js and Python for ultimate flexibility and control. This guide offers a quick overview of our tools to help you find the right fit for your needs.

## Three ways to build Actors

1. https://apify.com/scrapers/universal-web-scrapers — customize our boilerplate tools to your needs with a bit of JavaScript and setup.

2. https://apify.com/templates for web scraping projects — for a quick project setup to save you development time (includes JavaScript, TypeScript, and Python templates).

3. Open-source libraries and SDKs

   

   1. https://docs.apify.com/sdk/js/ & https://docs.apify.com/sdk/python/ — for creating your own solution from scratch on the Apify platform using our free development kits. Involves more coding but offers infinite flexibility.
   2. https://crawlee.dev/ and https://crawlee.dev/python — for creating your own solutions from scratch using our free web automation libraries. Involves even more coding but offers infinite flexibility. There’s also no need to host these on the platform.

## Universal scrapers & what are they for

Universal scrapers (https://apify.com/scrapers/universal-web-scrapers) were built to provide an intuitive UI plus configuration that will help you start extracting data as quickly as possible. Usually, you just provide a page function (https://docs.apify.com/tutorials/apify-scrapers/getting-started#the-page-function) and set up one or two parameters, and you're good to go.

Since scraping and automation come in various forms, we decided to build not just one, but *six* scrapers. This way, you can always pick the right tool for the job. Let's take a look at each particular tool and its advantages and disadvantages.

| Scraper                  | Technology                                               | Advantages                                                                                                  | Disadvantages                                                                                          | Best for                                        |
| ------------------------ | -------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------ | ----------------------------------------------- |
| 🌐 Web Scraper           | Headless Chrome Browser                                  | Simple, fully JavaScript-rendered pages                                                                     | Executes only client-side JavaScript                                                                   | Websites with heavy client-side JavaScript      |
| 👐 Puppeteer Scraper     | Headless Chrome Browser                                  | Powerful Puppeteer functions, executes both server-side and client-side JavaScript                          | More complex                                                                                           | Advanced scraping with client/server-side JS    |
| 🎭 Playwright Scraper    | Cross-browser support with Playwright library            | Cross-browser support, executes both server-side and client-side JavaScript                                 | More complex                                                                                           | Cross-browser scraping with advanced features   |
| 🍩 Cheerio Scraper       | HTTP requests + Cheerio parser (JQuery-like for servers) | Simple, fast, cost-effective                                                                                | Pages may not be fully rendered (lacks JavaScript rendering), executes only server-side JavaScript     | High-speed, cost-effective scraping             |
| ⚠️ JSDOM Scraper         | JSDOM library (Browser-like DOM API)                     | Handles client-side JavaScript; faster than full-browser solutions; ideal for light scripting               | Not for heavy dynamic JavaScript, executes server-side code only, depends on pre-installed NPM modules | Speedy scraping with light client-side JS       |
| 🍲 BeautifulSoup Scraper | Python-based, HTTP requests + BeautifulSoup parser       | Python-based, supports recursive crawling and URL lists                                                     | No full-featured web browser, not suitable for dynamic JavaScript-rendered pages                       | Python users needing simple, recursive crawling |

### How do I choose the right universal web scraper to start with?

🎯 Decision points:

* Use 🌐 https://apify.com/apify/web-scraper if you need simplicity with full browser capabilities and client-side JavaScript rendering.
* Use 🍩 https://apify.com/apify/cheerio-scraper for fast, cost-effective scraping of static pages with simple server-side JavaScript execution.
* Use 🎭 https://apify.com/apify/playwright-scraper when cross-browser compatibility is crucial.
* Use 👐 https://apify.com/apify/puppeteer-scraper for advanced, powerful scraping where you need both client-side and server-side JavaScript handling.
* Use ⚠️ https://apify.com/apify/jsdom-scraper for lightweight, speedy scraping with minimal client-side JavaScript requirements.
* Use 🍲 https://apify.com/apify/beautifulsoup-scraper for Python-based scraping, especially with recursive crawling and processing URL lists.

To make it easier, here's a short questionnaire that guides you on selecting the best scraper based on your specific use case:

Questionnaire

1. Is the website content rendered with a lot of client-side JavaScript?

   

   * Yes:

     

     * Do you need full browser capabilities?

       

       * Yes: use Web Scraper or Playwright Scraper
       * No, but I still want advanced features: use Puppeteer Scraper

   * No:

     

     * Do you prioritize speed and cost-effectiveness?

       

       * Yes: use Cheerio Scraper
       * No: use JSDOM Scraper

2. Do you need cross-browser support for scraping?

   

   * Yes: use Playwright Scraper
   * No: continue to the next step.

3. Is your preferred scripting language Python?

   

   * Yes: use BeautifulSoup Scraper
   * No: continue to the next step.

4. Are you dealing with static pages or lightweight client-side JavaScript?

   

   * Static pages: use Cheerio Scraper or BeautifulSoup Scraper

   * Light client-side JavaScript:

     

     * Do you want a balance between speed and client-side JavaScript handling?

       

       * Yes: use JSDOM Scraper
       * No: use Web Scraper or Puppeteer Scraper

5. Do you need to support recursive crawling or process lists of URLs?

   

   * Yes, and I prefer Python: use BeautifulSoup Scraper
   * Yes, and I prefer JavaScript: use Web Scraper or Cheerio Scraper
   * No: choose based on other criteria above.

This should help you navigate through the options and choose the right scraper based on the website’s complexity, your scripting language preference, and your need for speed or advanced features.
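If you prefer code to questionnaires, the decision points above can be condensed into a small function. This is a simplified sketch: the `choose_scraper` helper, its parameter names, and the order in which the checks are applied are assumptions made for illustration, not an official selection algorithm.

```python
def choose_scraper(
    needs_cross_browser: bool = False,
    heavy_client_side_js: bool = False,
    needs_advanced_control: bool = False,
    light_client_side_js: bool = False,
    prefers_python: bool = False,
) -> str:
    """Return the name of a universal scraper for the given requirements."""
    if needs_cross_browser:
        return "Playwright Scraper"
    if heavy_client_side_js:
        # Full browser needed: Puppeteer for advanced control, Web Scraper for simplicity.
        return "Puppeteer Scraper" if needs_advanced_control else "Web Scraper"
    if light_client_side_js:
        return "JSDOM Scraper"
    # Static pages: plain HTTP requests plus an HTML parser are enough.
    return "BeautifulSoup Scraper" if prefers_python else "Cheerio Scraper"


print(choose_scraper(heavy_client_side_js=True))
print(choose_scraper(prefers_python=True))
```

Treat the result as a starting point; the table and questionnaire above cover trade-offs that a handful of booleans can't capture.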

📚 Resources:

* How to use https://www.youtube.com/watch?v=5kcaHAuGxmY to scrape any website
* How to use https://www.youtube.com/watch?v=1KqLLuIW6MA to scrape the web
* Learn about our $1/month https://apify.com/pricing/creator-plan that encourages devs to build Actors based on universal scrapers

## Web scraping code templates

Similar to our universal scrapers, our https://apify.com/templates also provide a quick start for developing web scrapers, automation scripts, and testing tools. Built on popular libraries like BeautifulSoup for Python or Playwright for JavaScript, they save time on setup, allowing you to focus on customization. Though they require more coding than universal scrapers, they're ideal for those who want a flexible foundation while still needing room to tailor their solutions.

| Code template  | Supported libraries                                   | Purpose                                    | Pros                                                                                         | Cons                                                                                                       |
| -------------- | ----------------------------------------------------- | ------------------------------------------ | -------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------- |
| 🐍 Python      | Requests, BeautifulSoup, Scrapy, Selenium, Playwright | Creating scrapers, automation, testing tools | Simplifies setup; supports major Python libraries                                          | Requires more manual coding (than universal scrapers); may be restrictive for complex tasks              |
| ☕️ JavaScript | Playwright, Selenium, Cheerio, Cypress, LangChain     | Creating scrapers, automation, testing tools | Eases development with pre-set configurations; flexibility with JavaScript and TypeScript | Requires more manual coding (than universal scrapers); may be restrictive for tasks needing full control |

📚 Resources:

* https://www.youtube.com/watch?v=u-i-Korzf8w using a web scraper template.

## Toolkits and libraries

### Apify JavaScript and Python SDKs

The Apify SDKs for JavaScript (https://docs.apify.com/sdk/js/) and Python (https://docs.apify.com/sdk/python/) are designed for developers who want to interact directly with the Apify platform. They allow you to perform tasks like saving data in Apify datasets, running Apify Actors, and accessing the key-value store. Ideal for those who are familiar with JavaScript or Python, the SDKs provide the tools needed to develop software specifically on the Apify platform, offering complete freedom and flexibility within each language's ecosystem.

* *Best for*: interacting with the Apify platform (e.g., saving data, running Actors, etc)
* *Pros*: full control over platform-specific operations, integrates seamlessly with Apify services
* *Cons*: requires writing boilerplate code, higher complexity with more room for errors

### Crawlee

https://crawlee.dev/ (for both Node.js and https://crawlee.dev/python) is a powerful web scraping library that focuses on tasks like extracting data from web pages, automating browser interactions, and managing complex scraping workflows. Unlike the Apify SDK, Crawlee does not require the Apify platform and can be used independently for web scraping tasks. It handles complex operations like concurrency management, auto-scaling, and request queuing, allowing you to concentrate on the actual scraping tasks.

* *Best for*: web scraping and automation (e.g., scraping paragraphs, automating clicks)
* *Pros*: full flexibility in web scraping tasks, does not require the Apify platform, leverages the JavaScript ecosystem
* *Cons*: requires more setup and coding, higher chance of mistakes with complex operations

### Combining Apify SDK and Crawlee

While these tools are distinct, they can be combined. For example, you can use Crawlee to scrape data from a page and then use the Apify SDK to save that data in an Apify dataset. This integration allows developers to make use of the strengths of both tools while working within the Apify ecosystem.

📚 Resources:

* Introduction to https://www.youtube.com/watch?v=g1Ll9OlFwEQ
* Crawlee https://crawlee.dev/blog
* Webinar on scraping with https://www.youtube.com/watch?v=iAk1mb3v5iI: how to create scrapers in JavaScript and TypeScript
* Step-by-step video guide: https://www.youtube.com/watch?v=yTRHomGg9uQ in Node.js with Crawlee
* Webinar on how to use https://www.youtube.com/watch?v=ip8Ii0eLfRY
* Introduction to Apify's https://www.youtube.com/watch?v=C8DmvJQS3jk

## Code templates vs. universal scrapers vs. libraries

Basically, the choice here depends on how much flexibility you need and how much coding you're willing to do. More flexibility → more coding.

Universal scrapers (https://apify.com/scrapers/universal-web-scrapers) are simple to set up but are less flexible and configurable. Our libraries, such as Crawlee (https://crawlee.dev/), on the other hand, enable the development of a standard Node.js (https://nodejs.org/) or Python application, so be prepared to write a little more code. The reward for that is almost infinite flexibility.

Code templates (https://apify.com/templates) are sort of a middle ground between scrapers and libraries. But since they are built on libraries, they still lean toward more coding rather than less. They only give you starter code to begin with. Please take this into account when choosing the way to build your scraper, and if in doubt, just ask us, and we'll help you out.

## Switching sides: How to transfer an existing solution from another platform

You can also take advantage of the Apify platform's features without having to modify your existing scraping or automation solutions.

### Integrating Scrapy spiders

The Apify platform fully supports Scrapy spiders. By running Scrapy in the cloud (https://apify.com/run-scrapy-in-cloud), you can take advantage of features like scheduling, monitoring, scaling, and API access, all without needing to modify your original spider. This process is made easy with the Apify CLI (https://docs.apify.com/cli/), which allows you to convert your Scrapy spider into an Apify Actor with just a few commands. Once deployed, your spider can run in the cloud, offering a reliable and scalable solution for your web scraping needs.

Additionally, you can monetize your spiders by https://apify.com/partners/actor-developers on Apify Store, potentially earning passive income from your work while benefiting from the platform’s extensive features.

### ScrapingBee, ScrapingAnt, ScraperAPI

To make the transition from these platforms easier, we've also created SuperScraper API (https://apify.com/apify/super-scraper-api). It is an open-source REST API designed for scraping websites by simply passing a URL and receiving the rendered HTML content in return. This service functions as a cost-effective alternative to other scraping services like ScrapingBee, ScrapingAnt, and ScraperAPI. It supports dynamic content rendering with a headless browser, can use various proxies to avoid blocking, and offers features such as capturing screenshots of web pages. It is ideal for large-scale scraping tasks due to its scalable nature.

To use SuperScraper API, you can deploy it with an Apify API token and access it via HTTP requests. The API supports multiple parameters for fine-tuning your scraping tasks, including options for rendering JavaScript, waiting for specific elements, and handling cookies and proxies. It also allows for custom data extraction rules and JavaScript execution on the scraped pages. Pricing is based on actual usage, which can be cheaper or more expensive than competitors, depending on the configuration.

📚 Resources:

* https://docs.apify.com/cli/docs/integrating-scrapy
* Scrapy monitoring: how to https://blog.apify.com/scrapy-monitoring-spidermon/
* Run ScrapingBee, ScraperAPI, and ScrapingAnt on Apify — https://www.youtube.com/watch?v=YKs-I-2K1Rg

## General resources

* Creating your Actor: https://docs.apify.com/academy/getting-started/creating-actors
* Use it, build it or buy it? https://help.apify.com/en/articles/3024655-choosing-the-right-solution
* How to programmatically retrieve data with the https://www.youtube.com/watch?v=ViYYDHSBAKM&t=0s
* Improved way to https://www.youtube.com/watch?v=8QJetr-BYdQ
* Webinar on https://www.youtube.com/watch?v=4nxStxC1BJM on Apify Store
* 6 things you should know before buying or https://blog.apify.com/6-things-to-know-about-web-scraping/
* For a comprehensive guide on creating your first Actor, visit the https://docs.apify.com/academy.


---

# Wrap open-source as an Actor

Apify is a cloud platform with a https://apify.com/store of 6,000+ web scraping and automation tools called *Actors*. These tools are used for extracting data from social media, search engines, maps, e-commerce sites, travel portals, and general websites.

Most Actors are developed by a global creator community, and some are developed by Apify. We have 18k monthly active users/developers on the platform (growing 138% YoY). Last month, we paid out $170k to creators (growing 118% YoY), and in total, over the program's history, we paid out almost $2M to them.

## What are Actors

Under the hood, Actors are programs packaged as Docker images that accept a well-defined JSON input, perform an action, and optionally produce a well-defined JSON output. This makes it easy to auto-generate user interfaces for Actors and integrate them with one another or with external systems. For example, we have user-friendly integrations with Zapier, Make, LangChain, MCP, and OpenAPI, as well as SDKs for TypeScript/Python, a CLI, and more.

Actors are a new way to build reusable serverless micro-apps that are easy to develop, share, integrate, and build upon—and, importantly, monetize. While Actors are our invention, we’re in the process of making them an open standard. Learn more at https://whitepaper.actor/.

While most Actors on our marketplace are web scrapers or crawlers, there are ever more Actors for other use cases including data processing, web automation, API backend, or https://apify.com/store/categories/agents. In fact, any piece of software that accepts input, performs a job, and can run in Docker, can be *Actorized* simply by adding an `.actor` directory to it with a couple of JSON files.

## Why Actorize

By publishing your service or project on Apify Store (https://apify.com/store), you benefit from:

1. *Expanded reach*: Your tool instantly becomes available to Apify's user community and connects with popular automation platforms like https://www.make.com, https://n8n.io/, and https://zapier.com/.
2. *Multiple monetization paths*: Choose from flexible pricing models (monthly subscriptions, pay-per-result, or pay-per-event).
3. *AI integration*: Your Actor can serve as a tool for AI agents through Apify's MCP (Model Context Protocol) server, creating new use cases and opportunities while you earn 80% of all revenues.

Open-Source Benefits

For open-source developers, Actorization adds value without extra costs:

* Host your code in the cloud for easy user trials (no local installs needed).
* Avoid managing cloud infrastructure—users cover the costs.
* Earn income through https://apify.com/partners/open-source-fair-share via GitHub Sponsors or direct payouts.
* Publish and monetize 10x faster than building a micro-SaaS, with Apify handling infra, billing, and access to 700,000+ monthly visitors and 70,000 signups.

For example, IBM’s Docling (https://github.com/docling-project/docling) merged our pull request that Actorized their open-source GitHub repo (24k stars) and added the Apify Actor badge to the README:

![Docling Apify badge](/assets/images/docling-apify-badge-3b6ad8beefffa23d0ffcc9bc92d593bb.png)

### Example Actorized projects

You can Actorize various projects, from open-source libraries through existing SaaS services to MCP servers:

| Name            | Type                   | Source                                      | Actor                                                |
| --------------- | ---------------------- | ------------------------------------------- | ---------------------------------------------------- |
| Parsera         | SaaS service           | https://parsera.org/                        | https://apify.com/parsera-labs/parsera               |
| Monolith        | Open source library    | https://github.com/Y2Z/monolith             | https://apify.com/snshn/monolith                     |
| Crawl4AI        | Open source library    | https://github.com/unclecode/crawl4ai       | https://apify.com/janbuchar/crawl4ai                 |
| Docling         | Open source library    | https://github.com/docling-project/docling  | https://apify.com/vancura/docling/source-code        |
| Playwright MCP  | Open source MCP server | https://github.com/microsoft/playwright-mcp | https://apify.com/jiri.spilka/playwright-mcp-server  |
| Browserbase MCP | SaaS MCP server        | https://www.browserbase.com/                | https://apify.com/mcp-servers/browserbase-mcp-server |

### What projects are suitable for Actorization

Use these criteria to decide if your project is a good candidate for Actorization:

1. *Is it self-contained?* Does the project work non-interactively, with a well-defined, preferably structured input and output format? Positive examples include various data processing utilities, web scrapers and other automation scripts. Negative examples are GUI applications or applications that run indefinitely. If you want to run HTTP APIs on Apify, you can do so using https://docs.apify.com/platform/actors/development/programming-interface/standby.md.
2. *Can the state be stored in Apify storages?* If the application has state that can be stored in a small number of files, it can utilize the key-value store (https://docs.apify.com/platform/storage/key-value-store.md); if it processes records, they can be stored in Apify’s request queue (https://docs.apify.com/platform/storage/request-queue.md). If the output consists of one or many similar JSON objects, it can utilize a dataset (https://docs.apify.com/platform/storage/dataset.md).
3. *Can it be containerized?* The project needs to be able to run in a Docker container. Apify currently does not support GPU workloads. External services (e.g., databases) need to be managed by the developer.
4. *Can it use Apify tooling?* JavaScript/TypeScript applications and Python applications can be Actorized with the help of the Apify SDK (https://docs.apify.com/sdk.md), which makes it easy for your code to interact with the Apify platform. Applications that can be run using just the CLI can also be Actorized using the Apify CLI (https://docs.apify.com/cli) by writing a simple shell script that retrieves user input using the CLI, then runs your application and sends the results back to Apify (also using the CLI). If your application is implemented differently, you can still call the Apify API (https://docs.apify.com/api/v2.md) directly - it’s just HTTP, and pretty much every language has support for that, but the implementation is less straightforward.

## Actorization guide

This guide outlines the steps to convert your application into an Apify https://docs.apify.com/platform/actors.md. Follow the documentation links for detailed information - this guide provides an overview rather than exhaustive instructions.

### 1. Add Actor metadata - the `.actor` folder

The Apify platform requires your Actor repository to have a `.actor` folder at the root level, which contains the metadata needed to build and run the Actor.

For existing projects, you can add the `.actor` folder using the https://docs.apify.com/cli/docs/reference#apify-init-actorname.

If you're starting a new project, we strongly advise starting with a https://apify.com/templates using the https://docs.apify.com/cli/docs/reference#apify-create-actorname, based on your use case:

* https://apify.com/templates/ts-empty

* https://apify.com/templates/python-empty

* https://apify.com/templates/cli-start

* https://apify.com/templates/python-mcp-server

* … and many others. For a comprehensive list, check out https://apify.com/templates.

  Quick Start for beginners

  For a step-by-step introduction to creating your first Actor (including tech stack choices and development paths), see https://docs.apify.com/platform/actors/development/quick-start.md.

The newly created `.actor` folder contains an `actor.json` file, which is the manifest of the Actor. See https://docs.apify.com/platform/actors/development/actor-definition/actor-json.md for more details.

You must also make sure your Actor has a Dockerfile and that it installs everything needed to successfully run your application. Check out https://docs.apify.com/platform/actors/development/actor-definition/dockerfile.md by Apify. If you don't want to use these, you are free to use any image as the base of your Actor.

When launching the Actor, the Apify platform will simply run your Docker image. This means that a) you need to configure the `ENTRYPOINT` and `CMD` directives so that it launches your application and b) you can test your image locally using Docker.

These steps are the bare minimum you need to run your code on Apify. The rest of the guide will help you flesh it out better.

### 2. Define input and output

Most Actors accept an input and produce an output. As part of Actorization, you need to define the input and output structure of your application.

For detailed information, read the docs for https://docs.apify.com/platform/actors/development/actor-definition/input-schema.md, https://docs.apify.com/platform/actors/development/actor-definition/dataset-schema.md, and general https://docs.apify.com/platform/storage.md.

#### Design guidelines

1. If your application has some arguments or options, those should be part of the input defined by the input schema.
2. If there is a configuration file or if your application is configured with environment variables, those should also be part of the input. Ideally, nested structures should be “unpacked”, i.e., try not to accept deeply nested structures in your input. Start with fewer input options and expand later.
3. If the output is a single file, you’ll probably want your Actor to output a single dataset item that contains a public URL to the output file stored in the Apify key-value store.
4. If the output has a table-like structure or a series of JSON-serializable objects, you should output each row or object as a separate dataset item.
5. If the output is a single key-value record, your Actor should return a single dataset item.

### 3. Handle state persistence (optional)

If your application performs a number of well-defined subtasks, the https://docs.apify.com/platform/storage/request-queue.md lets you pause and resume execution on job restart. This is important for long-running jobs that might be migrated between servers at some point. In addition, this allows the Apify platform to display the progress to your users in the UI.

A lightweight alternative to the request queue is simply storing the state of your application as a JSON object in the key-value store and checking for that when your Actor is starting.
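A minimal sketch of this lightweight approach follows. To keep it runnable anywhere, it uses a plain `Map` as a stand-in for the key-value store; in a real Actor you would read and write the record through the Apify SDK instead (for example `Actor.getValue` and `Actor.setValue` in the JavaScript SDK). The `STATE` key and the `processItems` helper are illustrative names, not part of any Apify API.

```javascript
const STATE_KEY = 'STATE'; // hypothetical record name

// Processes items in order and saves a checkpoint after each one,
// so a restarted run continues where the previous one stopped.
function processItems(items, store, handleItem) {
    const saved = store.get(STATE_KEY);
    const state = saved ? JSON.parse(saved) : { nextIndex: 0 };

    for (let i = state.nextIndex; i < items.length; i++) {
        handleItem(items[i]);
        state.nextIndex = i + 1;
        store.set(STATE_KEY, JSON.stringify(state));
    }
}

// Simulate a run that is interrupted once and then restarted.
const store = new Map(); // stand-in for the key-value store
const processed = [];
let interrupted = false;
const handler = (item) => {
    if (item === 'c' && !interrupted) {
        interrupted = true;
        throw new Error('simulated restart');
    }
    processed.push(item);
};

const items = ['a', 'b', 'c', 'd'];
try {
    processItems(items, store, handler);
} catch {
    // The first run stops here; the checkpoint already points at 'c'.
}
processItems(items, store, handler); // The second run resumes at 'c'.
console.log(processed);
```

Saving after every item is the simplest option; for large workloads you may prefer to checkpoint less often and accept that a few items get reprocessed after a restart.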

Fully-fledged Actors will often combine these two approaches for maximum reliability. You can find more on this topic in the https://docs.apify.com/platform/actors/development/builds-and-runs/state-persistence.md article.

### 4. Write Actorization code

Perhaps the most important part of the Actorization process is writing the code that will be executed when the Apify platform launches your Actor.

Unless you’re writing an application targeted directly at the Apify platform, this will take the form of a script that calls your code and integrates it with the Apify storages.

Apify provides SDKs for https://docs.apify.com/sdk/js and https://docs.apify.com/sdk/python, plus a https://docs.apify.com/cli that allows easy interaction with the Apify platform from the command line.

Check out the https://docs.apify.com/platform/actors/development/programming-interface.md documentation article for details on interacting with the Apify platform in your Actor's code.

### 5. Deploy the Actor

You can deploy to the Apify platform with the `apify push` command of the https://docs.apify.com/cli. For details, see the https://docs.apify.com/platform/actors/development/deployment.md documentation.

### 6. Publish and monetize

For details on publishing the Actor in https://apify.com/store, see the https://docs.apify.com/platform/actors/publishing.md. You can also follow our guides on https://docs.apify.com/academy/actor-marketing-playbook/actor-basics/how-to-create-an-actor-readme.md and https://docs.apify.com/academy/actor-marketing-playbook.md.


---

# Advanced web scraping

In the https://docs.apify.com/academy/web-scraping-for-beginners.md course, we have learned the necessary basics required to create a scraper. In the following courses, we learned more about specific practices and techniques that will help us to solve most of the problems we will face.

In this course, we will take all of that knowledge, add a few more advanced concepts, and apply them to learn how to build a production-ready web scraper.

## What does production-ready mean

To scrape large and complex websites, we need to scale two essential aspects of the scraper: crawling and data extraction. Big websites can have millions of pages and the data we want to extract requires more sophisticated parsing techniques than just selecting elements by CSS selectors or using APIs as they are.

We will also touch on monitoring, performance, anti-scraping protections, and debugging.

If you've managed to follow along with all of the courses prior to this one, then you're more than ready to take these upcoming lessons on 😎

## First up

First, we will explore https://docs.apify.com/academy/advanced-web-scraping/crawling/sitemaps-vs-search.md that will help us to find all pages or products on the website.


---

# Crawling sitemaps

In the previous lesson, we learned about the utility (and dangers) of crawling sitemaps. In this lesson, we will take an in-depth look at how to crawl them.

We will look at the following topics:

* How to find sitemap URLs
* How to set up HTTP requests to download sitemaps
* How to parse URLs from sitemaps
* Using Crawlee to get all URLs in a few lines of code

## How to find sitemap URLs

Sitemaps are commonly limited to a maximum of 50,000 URLs each, so a website will usually have a whole list of them. There can be a master sitemap containing the URLs of all other sitemaps, or the sitemaps might simply be indexed in `robots.txt` and/or have auto-incremented URLs like `/sitemap1.xml`, `/sitemap2.xml`, etc.

### Google

You can try your luck on Google by searching for `site:example.com sitemap.xml` or `site:example.com sitemap.xml.gz` and see if you get any results. If you do, you can try to download the sitemap and see if it contains any useful URLs. The success of this approach depends on the website telling Google to index the sitemap file itself which is rather uncommon.

### robots.txt

If the website has a `robots.txt` file, it often contains sitemap URLs. The sitemap URLs are usually listed under the `Sitemap:` directive.
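As a rough illustration, the directive can be picked out of an already downloaded `robots.txt` with a few lines of string handling. The `extractSitemapUrls` helper and the sample file below are made up for this sketch, and the matching is deliberately simple (case-insensitive, one directive per line).

```javascript
// Returns all URLs listed under `Sitemap:` directives in a robots.txt body.
function extractSitemapUrls(robotsTxt) {
    return robotsTxt
        .split(/\r?\n/)
        .map((line) => line.trim())
        .filter((line) => /^sitemap:/i.test(line))
        .map((line) => line.slice('sitemap:'.length).trim())
        .filter((url) => url.length > 0);
}

// Hypothetical robots.txt content
const sampleRobotsTxt = [
    'User-agent: *',
    'Disallow: /cart',
    'Sitemap: https://example.com/sitemap_index.xml',
    'sitemap: https://example.com/sitemap-products.xml.gz',
].join('\n');

console.log(extractSitemapUrls(sampleRobotsTxt));
```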

### Common URL paths

You can check some common URL paths, such as the following:

* `/sitemap.xml`
* `/product_index.xml`
* `/product_template.xml`
* `/sitemap_index.xml`
* `/sitemaps/sitemap_index.xml`
* `/sitemap/product_index.xml`
* `/media/sitemap.xml`
* `/media/sitemap/sitemap.xml`
* `/media/sitemap/index.xml`

Also make sure you test the list with the `.gz`, `.tar.gz` and `.tgz` extensions and with capitalized words (e.g. `/Sitemap_index.xml.tar.gz`).
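Generating these variations by hand gets tedious, so here is a small sketch that builds the candidate list for one website. The `buildSitemapCandidates` helper is hypothetical, and "capitalizing" is interpreted here as upper-casing the first letter of the file name only; adjust that to whatever variations you want to probe.

```javascript
const COMMON_PATHS = ['/sitemap.xml', '/sitemap_index.xml', '/sitemaps/sitemap_index.xml', '/media/sitemap.xml'];
const EXTENSIONS = ['', '.gz', '.tar.gz', '.tgz'];

// '/media/sitemap.xml' -> '/media/Sitemap.xml'
function capitalizeFileName(path) {
    return path.replace(/\/([a-z])([^/]*)$/, (_, first, rest) => `/${first.toUpperCase()}${rest}`);
}

// Combines every path (original and capitalized) with every extension.
function buildSitemapCandidates(origin, paths = COMMON_PATHS) {
    const candidates = new Set();
    for (const path of paths) {
        for (const variant of [path, capitalizeFileName(path)]) {
            for (const extension of EXTENSIONS) {
                candidates.add(new URL(`${variant}${extension}`, origin).href);
            }
        }
    }
    return [...candidates];
}

console.log(buildSitemapCandidates('https://example.com'));
```

Each candidate still has to be requested to see whether it exists, so keep the list reasonably short to avoid flooding the target site.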

Some websites also provide an HTML version, to help indexing bots find new content. Those include:

* `/sitemap`
* `/category-sitemap`
* `/sitemap.html`
* `/sitemap_index`

Apify provides the https://apify.com/vaclavrut/sitemap-sniffer, an open-source Actor that scans the URL variations automatically, so you don't have to check them manually.

## How to set up HTTP requests to download sitemaps

For most sitemaps, you can make a single HTTP request and parse the downloaded XML text. Some sitemaps are compressed and have to be streamed and decompressed. The code can get fairly complicated, but scraping frameworks, such as Crawlee, can do this out of the box.
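If you handle the download yourself, one practical detail is that a compressed sitemap is not always recognizable from its URL. A gzip stream always starts with the bytes `0x1f 0x8b`, so you can check the downloaded body before deciding how to process it. The `looksGzipped` helper below is an illustrative sketch; the actual decompression could then be done with Node.js's built-in `zlib` module.

```javascript
// Checks the gzip magic number at the start of a downloaded body.
function looksGzipped(bytes) {
    return bytes.length >= 2 && bytes[0] === 0x1f && bytes[1] === 0x8b;
}

const plainXml = new TextEncoder().encode('<?xml version="1.0" encoding="UTF-8"?>');
const gzipHeader = Uint8Array.from([0x1f, 0x8b, 0x08, 0x00]);

console.log(looksGzipped(plainXml));
console.log(looksGzipped(gzipHeader));
```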

## How to parse URLs from sitemaps

Use your favorite XML parser to extract the URLs from inside the `<loc>` tags. Just be careful that the sitemap might contain other URLs that you don't want to crawl (e.g. `/about`, `/contact`, or various special category sections). For specific code examples, see https://docs.apify.com/academy/node-js/scraping-from-sitemaps.md.
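For a quick experiment without any dependencies, the sketch below pulls the URLs out with a regular expression and keeps only those under an assumed `/products/` path. Both the `extractLocUrls` helper and the path prefix are made up for this example, and a regular expression is only a shortcut: a real XML parser is more robust.

```javascript
// Extracts the text of every <loc> element from a sitemap XML string.
function extractLocUrls(xml) {
    const urls = [];
    for (const match of xml.matchAll(/<loc>\s*([\s\S]*?)\s*<\/loc>/gi)) {
        const raw = match[1].replace(/^<!\[CDATA\[|\]\]>$/g, '');
        // Sitemaps escape ampersands as &amp;
        urls.push(raw.replace(/&amp;/g, '&').trim());
    }
    return urls;
}

// Assumption for this example: product detail pages live under /products/
const isProductUrl = (url) => new URL(url).pathname.startsWith('/products/');

const sampleSitemap = `<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/about</loc></url>
  <url><loc>https://example.com/products/1?ref=a&amp;b=2</loc></url>
  <url><loc> https://example.com/products/2 </loc></url>
</urlset>`;

const productUrls = extractLocUrls(sampleSitemap).filter(isProductUrl);
console.log(productUrls);
```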

## Using Crawlee

Fortunately, you don't have to worry about any of the above steps if you use https://crawlee.dev, a scraping framework with rich support for traversing and parsing sitemaps. It can traverse nested sitemaps, download and parse compressed sitemaps, and extract URLs from them. You can get all the URLs in a few lines of code:


```
import { RobotsFile } from 'crawlee';

const robots = await RobotsFile.find('https://www.mysite.com');

const allWebsiteUrls = await robots.parseUrlsFromSitemaps();
```


## Next up

That's all we need to know about sitemaps for now. Let's dive into a much more interesting topic - search, filters, and pagination.


---

# Scraping websites with search

In this lesson, we will start with a simpler example of scraping HTML-based websites with limited pagination.

Limiting pagination is a common practice on e-commerce sites. It makes sense: a real user will never want to look through more than 200 pages of results – only bots love unlimited pagination. Fortunately, there are ways to overcome this limit while keeping our code clean and generic.

*Image: pagination on a Google search results page.*

> In a rush? Skip the tutorial and get the code example at https://github.com/apify-projects/apify-extra-library/tree/master/examples/crawler-with-filters.

## How to overcome the limit

Websites usually limit the pagination of a single (sub)category to somewhere between 1,000 and 20,000 listings, while the site might have over a million listings in total. Without a proven algorithm, scraping all listings requires a lot of manual work and is almost impossible.

We will first look at a couple of ideas that don't work so well and then present the one that does: filter ranges.

### Going deeper into subcategories

This is usually the first solution that comes to mind. You traverse the smallest subcategories and hope that those are below the pagination limits. Unfortunately, there are two big problems with this approach:

1. Any subcategory might be bigger than the pagination limit.
2. Some listings from the parent category might not be present in any subcategory.

While you can often manually test if the second problem is true on the site, the first problem is a hard blocker. You might be just lucky, and it may work on this site but usually, traversing subcategories is not enough. It can be used as a first step of the solution but not as the solution itself.

### Using filters

Most websites also provide a way for the user to select search filters. These allow a more granular level of search than categories and can be combined with them. Common filters allow you to select a **color**, **size**, **location** and similar attributes.

At first, it might seem like an easy solution. Enqueue all possible filter combinations and that should be so granular that it will never hit a pagination limit. Unfortunately, this solution is still far from good.

1. There is no guarantee that some products won't slip through the chosen filter combinations.
2. The resulting split might be too granular and end up having too many tiny paginations with many duplicate products. This leads to scraping a lot more pages than necessary and makes analytics much harder.

### Using filter ranges

The best option is to use only a specific type of filter that can be used as a range. The most common one is **price range** but there may be others like the apartment size, etc. You can split the pagination pages to only contain listings within that range, e.g. products costing between $10 and $20.

This has several benefits:

1. All listings can eventually be found in a range.
2. The ranges do not overlap, so we scrape the smallest possible number of pages and avoid duplicate listings.
3. Ranges can be controlled by a generic algorithm that can be reused for different sites.

## Splitting pages with range filters

In the previous section, we analyzed different options to split the pages to overcome the pagination limit. We have chosen range filters as the most reliable way to do that. In this section, we will discuss a generic algorithm to work with ranges, look at a few special cases and then write an example crawler.

![An example of range filters on a website](/assets/images/pagination-filters-ad8028367191ccc8ad1c7835e3f21067.png)

### The algorithm

The core algorithm can be used on any (even overlapping) range. This is a simplified presentation; we will discuss the details later.

1. We choose a few pivot ranges with a similar number of products and enqueue them. For example, **$0-$10**, **$10-$100**, **$100-$1000**, **$1000-$10000**, **$10000-**.
2. For each range, we open the page and check if the listings are below the limit. If yes, we continue to step 3. If not, we split the filter in half, e.g. **$0-$10** to **$0-$5** and **$5-$10** and enqueue those again. We recursively repeat step **2** for each range as long as needed.
3. We now have a pagination URL that is below the limit, so we enqueue it under a pagination label and start enqueuing products.

Because the algorithm is recursive, we don't need to think about how big the final ranges should be; the algorithm will find them over time.
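To see the three steps in action without touching a real website, here is a small simulation. Everything in it is an assumption made for illustration: the listing prices are generated in memory (as integer cents), `countListings` stands in for opening a filtered page and reading the number of results, and the limit of 1,000 is arbitrary.

```javascript
// Hypothetical stand-in for a website: 5,000 listings with prices in cents.
const PRICES = Array.from({ length: 5000 }, (_, i) => (i * 37) % 20000);
const PAGINATION_LIMIT = 1000; // assumed maximum of listings reachable per filter

// Stand-in for "open the filtered page and read the number of listings".
function countListings({ min, max }) {
    return PRICES.filter((price) => price >= min && (max === null || price <= max)).length;
}

function findRangesUnderLimit(pivotRanges) {
    const queue = [...pivotRanges]; // Step 1: enqueue the pivot ranges
    const finished = [];
    while (queue.length > 0) {
        const range = queue.shift();
        const { min, max } = range;
        // Step 3: the range fits under the limit (or is a single value that cannot be split)
        if (countListings(range) <= PAGINATION_LIMIT || min === max) {
            finished.push(range);
            continue;
        }
        // Step 2: split in half; an open-ended range is split at double its minimum
        const middle = max === null ? Math.max(min * 2, 1) : min + Math.floor((max - min) / 2);
        queue.push({ min, max: middle }, { min: middle + 1, max });
    }
    return finished;
}

const finalRanges = findRangesUnderLimit([
    { min: 0, max: 9999 },
    { min: 10000, max: null }, // open-ended
]);
console.log(finalRanges);
```

The sketch uses a queue instead of direct recursion because that mirrors how a crawler works: each split produces new requests that are processed later.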

### Special cases to look for

We have the base algorithm, but before we start coding, let's answer a few questions to get more insight.

#### Can the ranges overlap?

Some sites will allow you to construct non-overlapping ranges. For example, you can set the ranges with cents, e.g. **$0-$4.99**, **$5-$9.99**, etc. If that is possible, create the pivot ranges this way, too.

Non-overlapping ranges should remove the possibility of duplicate products (unless a listing can have more than one value) and result in the lowest number of pages.

If the website supports only overlapping ranges (e.g. **$0-$5**, **$5-$10**), it is not a big problem. Only a small portion of the listings will be duplicates, and they can be removed using a https://docs.apify.com/platform/storage/request-queue.md.

#### Can a listing have more values?

In rare cases, a listing can have more than one value that you are filtering in a range. A typical example is Amazon, where each product has several offers and those offers have different prices. If any of those offers is within the range, the product is shown.

There is no easy way to get around this, but the price range split works even with duplicate listings. Use a https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Set or the request queue to deduplicate them.
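A minimal sketch of the `Set`-based deduplication could look like this. It assumes every listing carries a stable `id` field (hypothetical here); on a real site, you would use whatever unique identifier the listings expose, such as a product ID or a canonical URL.

```javascript
// Keeps only listings whose ID has not been seen yet.
// Pass the same `seenIds` set for every page so duplicates across ranges are caught.
function dedupeListings(listings, seenIds = new Set()) {
    const unique = [];
    for (const listing of listings) {
        if (seenIds.has(listing.id)) continue;
        seenIds.add(listing.id);
        unique.push(listing);
    }
    return unique;
}

// Hypothetical results of two ranges in which the same product appears twice
const seenIds = new Set();
const fromFirstRange = dedupeListings([{ id: 'A1', price: 4.5 }, { id: 'B2', price: 5 }], seenIds);
const fromSecondRange = dedupeListings([{ id: 'B2', price: 5 }, { id: 'C3', price: 7 }], seenIds);
console.log(fromFirstRange, fromSecondRange);
```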

#### How is the range passed to the URL?

In the easiest case, you can pass the range directly in the page's URL. For example, `https://example.com/products?price=0-10`. Sometimes, you will need to do some query composition because the price range might be encoded together with more information into a single parameter.

Some sites don't have page URLs with filters and instead load the filtered products via https://developer.mozilla.org/en-US/docs/Web/API/XMLHttpRequest. Those can be GET or POST requests with varying **URL** and **payload** syntax.

The nice thing here is that if you get to understand how their internal API works, you can have it return more products per page or extract full product details just from this single request.

In addition, XHRs are smaller and faster than loading an HTML page. On the other hand, you should not abuse them (for example, by setting overly large limits), as this can expose your scraper.

#### Does the website show the number of products for each filtered page?

If it does, it's a nice bonus. It gives us a way to check if we are over or below the pagination limit and helps with analytics.

If it doesn't, we have to find a different way to check if the number of listings is within a limit. One option is to go to the last allowed page of the pagination. If that page is still full of products, we can assume the filter is over the limit.
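Both checks can be wrapped in one small helper. The sketch below is only an illustration: the parameter names are invented, the page size and maximum number of pages are values you would find out by inspecting the site, and the "full last page" rule is a heuristic rather than a guarantee.

```javascript
// Decides whether a filtered listing is over the pagination limit.
// `shownCount` is the total displayed by the site (null if unavailable),
// `lastPageItems` is the number of items on the last allowed page (null if not checked).
function isOverPaginationLimit({ shownCount = null, lastPageItems = null, pageSize, maxPages }) {
    if (shownCount !== null) return shownCount > pageSize * maxPages;
    // No count available: a full last page suggests more listings are hidden beyond it
    return lastPageItems !== null && lastPageItems >= pageSize;
}

console.log(isOverPaginationLimit({ shownCount: 1500, pageSize: 20, maxPages: 50 }));
console.log(isOverPaginationLimit({ lastPageItems: 7, pageSize: 20, maxPages: 50 }));
```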

#### How to handle (open) ends of the range

Logically, every full (price) range starts at 0 and ends at infinity. But the way this is encoded will differ on each site. Each end of the price range can be either closed (a specific value, such as 0) or open (infinity). Open ranges require special handling when you split them (we will get to that).

Most sites will let you start with 0 (there might be exceptions, where you will have to make the start open), so we can use just that. The high end is more complicated. Because you don't know the biggest price, it is best to leave it open and handle it specially. Internally you can assign `null` to the value.

Here are a few examples of a query parameter with an open and closed high-end range:

* Open: `p:100-` (higher than 100), Closed: `p:100-200` (between 100 and 200)
* Open: `min_price=100`, Closed: `min_price=100&max_price=200`
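Internally, both styles can be produced from the same `{ min, max }` object, with `null` marking the open end. The two helpers below are illustrative only; the parameter names mirror the examples above and will differ from site to site.

```javascript
// Style 1: separate query parameters; the open end is simply omitted.
function buildRangeUrl(baseUrl, { min, max }) {
    const url = new URL(baseUrl);
    url.searchParams.set('min_price', String(min));
    if (max !== null) url.searchParams.set('max_price', String(max));
    return url.href;
}

// Style 2: a single compact value such as `p:100-200` or `p:100-`.
function buildCompactRange({ min, max }) {
    return `p:${min}-${max ?? ''}`;
}

console.log(buildRangeUrl('https://example.com/products', { min: 100, max: null }));
console.log(buildRangeUrl('https://example.com/products', { min: 100, max: 200 }));
console.log(buildCompactRange({ min: 100, max: null }));
```

Checking `max !== null` explicitly, rather than relying on truthiness, keeps a closed range that ends at 0 from being mistaken for an open one.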

#### Can the range exceed the limit on a single value?

In very rare cases, a site will have so many listings that a single value (e.g. **$100** or **$4.99**) will include a number of listings over the limit. The algorithm will recurse until the **min** value equals the **max** value and then stop because it cannot split that single value anymore.

In this rare case, you will need to combine the range with another range or other filters to get an even deeper split.

### Implementing a range filter

This section shows a code example implementing our solution for an imaginary website. Writing a real solution will bring up more complex problems but the previous section should prepare you for some of them.

First, let's define our imaginary site:

* It has a single `/products` path that contains all the products that we want to scrape.
* **Max** pagination limit is **1000**.
* The site contains over a million products.
* It allows for filtering over a price range with query parameters `min_price` and `max_price`.
* If `min_price` or `max_price` are not defined, it opens that end of the range (all products up to or all products over that).
* The site allows specifying the price in cents.
* Pagination is done via `page` query parameter.

#### Define and enqueue pivot ranges

This step is not necessary, but it is useful: with sensible pivot ranges, the algorithm doesn't start by splitting over values that are too large or too small.


```
import { Actor } from 'apify';
import { CheerioCrawler } from 'crawlee';

await Actor.init();

const MAX_PRODUCTS_PAGINATION = 1000;

// Just an example, choose what makes sense for your site
const PIVOT_PRICE_RANGES = [
    { min: 0, max: 9.99 },
    { min: 10, max: 99.99 },
    { min: 100, max: 999.99 },
    { min: 1000, max: 9999.99 },
    { min: 10000, max: null }, // open-ended
];

// Let's create a helper function for creating the filter URLs, you can move those to a utils.js file
const createFilterUrl = ({ min, max }) => {
    const minString = `min_price=${min}`;
    // We don't want to pass the parameter at all if it is null (open-ended)
    const maxString = max ? `&max_price=${max}` : '';
    return `https://www.mysite.com/products?${minString}${maxString}`;
};

// And another helper for getting filters back from the URL, we could also pass them in userData
const getFiltersFromUrl = (url) => {
    const min = Number(url.match(/min_price=([0-9.]+)/)[1]);
    // Max price might be empty
    const maxMatch = url.match(/max_price=([0-9.]+)/);
    const max = maxMatch ? Number(maxMatch[1]) : null;
    return { min, max };
};

// Actor setup things here
const crawler = new CheerioCrawler({
    async requestHandler(context) {
        // ...
    },
});

// Let's create the pivot requests
const initialRequests = [];
for (const { min, max } of PIVOT_PRICE_RANGES) {
    initialRequests.push({
        url: createFilterUrl({ min, max }),
        label: 'FILTER',
    });
}
// Let's start the crawl
await crawler.run(initialRequests);

await Actor.exit();
```


#### Define the logic for the `FILTER` page


```
import { CheerioCrawler } from 'crawlee';

// Doesn't matter what Crawler class we choose
const crawler = new CheerioCrawler({
    // Crawler options here
    // ...
    async requestHandler({ request, $ }) {
        const { label } = request;
        if (label === 'FILTER') {
            // Of course, change the selectors and make it more robust
            const numberOfProducts = Number($('.product-count').text());

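            // The filter is either good enough or we have to split it
            if (numberOfProducts <= MAX_PRODUCTS_PAGINATION) {
                // The range is under the limit, so we enqueue its first pagination page.
                // This could be optimized so the page is not opened again.
                await crawler.addRequests([{
                    url: `${request.url}&page=1`,
                    label: 'PAGINATION',
                }]);
            } else {
                // Here we have to split the filter
            }
        }
        if (label === 'PAGINATION') {
            // Scrape products, enqueue the next page, etc.
        }
    },
});

// This function expects a filter like { min: 0, max: 100 } (max can be null for an open-ended range)
// and returns an array of 2 new filters that split it in half
function splitFilter({ min, max }) {
    if (max && min > max) {
        throw new Error(`WRONG FILTER - min(${min}) is greater than max(${max})`);
    }

    // We create a middle value for the split. If max is null, we will use double min as the middle value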
    const middle = max
        ? min + Math.floor((max - min) / 2)
        : min * 2;

    // We have to do the Math.max and Math.min to prevent having min > max
    const filterMin = {
        min,
        max: Math.max(middle, min),
    };
    const filterMax = {
        min: max ? Math.min(middle + 1, max) : middle + 1,
        max,
    };
    // We return 2 new filters
    return [filterMin, filterMax];
}
```


#### Enqueue the filters

Let's finish the crawler now. This code example will go inside the `else` block of the previous crawler example.


```
const { min, max } = getFiltersFromUrl(request.url);
// Our generic splitFilter function doesn't account for decimal values, so we convert dollars to cents and back.
// Math.round guards against floating-point errors (e.g. 9.99 * 100), and an open-ended max stays null.
const newFilters = splitFilter({
    min: Math.round(min * 100),
    max: max === null ? null : Math.round(max * 100),
});

// And we enqueue those 2 new filters so the process will recursively repeat until all pages get to the PAGINATION phase
const requestsToEnqueue = [];
for (const filter of newFilters) {
    requestsToEnqueue.push({
        // Remember that we have to convert back from cents to dollars
        url: createFilterUrl({
            min: filter.min / 100,
            max: filter.max === null ? null : filter.max / 100,
        }),
        label: 'FILTER',
    });
}

await crawler.addRequests(requestsToEnqueue);
```


## Summary

And that's it. We have an elegant solution for a complicated problem. In a real project, you would want to make this a bit more robust and save useful statistics, as described in https://docs.apify.com/academy/expert-scraping-with-apify/saving-useful-stats.md. This will let you know what filters you went through and how many products each of them had.

Check out the full code example at https://github.com/apify-projects/apify-extra-library/tree/master/examples/crawler-with-filters.


---

# Sitemaps vs search

The core crawling problem comes down to ensuring that we reliably find all detail pages on the target website or inside its categories. This is trivial for small sites. We just open the home page or category pages and paginate to the end as we did in the https://docs.apify.com/academy/web-scraping-for-beginners.md course.

Unfortunately, *most modern websites restrict pagination* only to somewhere between 1 and 10,000 products. Solving this problem might seem relatively straightforward at first but there are multiple hurdles that we will explore in this lesson.

There are two main approaches to solving this problem:

* Extracting all page URLs from the website's *sitemap*.
* Using **categories, search and filters** to split the website so we get under the pagination limit.

Both of these approaches have their pros and cons so the best solution is to *use both and combine the results*. Here we will learn why.

## Pros and cons of sitemaps

A sitemap is usually a simple XML file that contains a list of all pages on the website. Sitemaps are created and maintained mainly for search engines like Google to help ensure that the website gets fully indexed there. They are commonly located at URLs like `https://example.com/sitemap.xml` or `https://example.com/sitemap.xml.gz`. We will get to work with sitemaps in the next lesson.

### Pros

* *Quick to set up* - The logic to find all sitemaps and extract all URLs is usually simple and can be done in a few lines of code.
* *Fast to run* - You only need to run a single request for each sitemap that contains up to 50,000 URLs. This means you can get all the URLs in a matter of seconds.
* *Usually complete* - Websites have an incentive to keep their sitemaps up to date as they are used by search engines. This means that they usually contain all pages on the website.

### Cons

* *Does not directly reflect the website* - There is no way you can ensure that all pages on the website are in the sitemap. The sitemap also can contain pages that were already removed and will return 404s. This is a major downside of sitemaps which prevents us from using them as the only source of URLs.
* *Updated in intervals* - Sitemaps are usually not updated in real-time. This means that you might miss some pages if you scrape them too soon after they were added to the website. Common update intervals are 1 day or 1 week.
* *Hard to find or unavailable* - Sitemaps are not always trivial to locate. They can be deployed on a CDN with unpredictable URLs. Sometimes they are not available at all.
* *Streamed, compressed, and archived* - Sitemaps are often streamed and archived with .tgz extensions and compressed with gzip. This means that you cannot use default HTTP client settings and must handle these cases with extra code or use a scraping framework.

## Pros and cons of categories, search, and filters

This approach means traversing the website like a normal user does by going through categories, setting up different filters, ranges, and sorting options. The goal is to ensure that we cover all categories or ranges where products can be located, and that for each of those we stay under the pagination limit.

The pros and cons of this approach are pretty much the opposite of relying on sitemaps.

### Pros

* *Directly reflects the website* - With most scraping use-cases, we want to analyze the website as the regular users see it. By going through the intended user flow, we ensure that we are getting the same pages as the users.
* *Updated in real-time* - The website is updated in real-time so we can be sure that we are getting all pages.
* *Often contain detailed data* - While sitemaps are usually just a list of URLs, categories, searches and filters often contain additional data like product names, prices, categories, etc., especially if available via a JSON API. This means that we can sometimes get all the data we need without going to the detail pages.

### Cons

* *Complex to set up* - The logic to traverse the website is usually complex and can take a lot of time to get right. We will get to this in the next lessons.
* *Slow to run* - The traversing can require a lot of requests. Some filters or categories will have products we already found.
* *Not always complete* - Sometimes the combination of filters and categories will not allow us to ensure we have all products. This is especially painful for sites where we don't know the exact number of products we are looking for. The tools we'll build in the following lessons will help us with this.

## Do we know how many products there are?

Most websites list a total number of detail pages somewhere. It might be displayed on the home page, search results, or be provided in the API response. We just need to make sure that this number really represents the whole site or category we are looking to scrape. By knowing the total number of products, we can tell whether our approach to scraping all of them succeeded or if we still need to refine it.
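As a rough sketch of such a check, the snippet below merges the URLs found through both approaches and compares the result with the total reported by the site. The `mergeAndCheck` helper, the sample URLs and the reported total are all made up for illustration; in practice, you would probably also normalize the URLs before comparing them.

```javascript
// Merges URLs from both sources and compares the result with the total reported by the site.
function mergeAndCheck(sitemapUrls, searchUrls, reportedTotal) {
    const allUrls = new Set([...sitemapUrls, ...searchUrls]);
    return {
        urls: [...allUrls],
        found: allUrls.size,
        missing: Math.max(reportedTotal - allUrls.size, 0),
        complete: allUrls.size >= reportedTotal,
    };
}

const fromSitemap = ['https://example.com/products/1', 'https://example.com/products/2'];
const fromSearch = ['https://example.com/products/2', 'https://example.com/products/3'];

const coverage = mergeAndCheck(fromSitemap, fromSearch, 4);
console.log(coverage.found, coverage.missing, coverage.complete);
```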

Some sites, like Amazon, do not provide exact numbers. In this case, we have to work with what they give us and put even more effort into making our scraping logic accurate. We will tackle this in the following lessons as well.

## Next up

Next, we will look into https://docs.apify.com/academy/advanced-web-scraping/crawling/crawling-sitemaps.md. After that, we will go through all the intricacies of category, search and filter crawling, and build tools implementing a generic approach that we can use on any website. Finally, we will combine the results of both and set up monitoring and persistence to ensure we can run this regularly without any manual controls.


---

# Tips and tricks for robustness

**Learn how to make your automated processes more effective. Avoid common web scraping and web automation pitfalls, future-proof your programs and improve your processes.**

***

This collection of tips and tricks aims to help you make your scrapers work smoother and produce fewer errors.

## Proofs and verification

**Absence of evidence ≠ evidence of absence**.

Make sure output remains consistent regardless of any changes at the target host/website:

* Always base all important checks on the **presence** of proof.
* Never build any important checks on the **absence** of anything.

The absence of an expected element or message does **not** prove an action has been (un)successful. The website might have been updated or expected content may no longer exist in the original form. The **action relying on the absence** of something might still be failing. Instead, it must rely on **proof of presence**.

**Good**: Rely on the presence of an element or other content confirming a successful action.


```
async function isPaymentSuccessful() {
    try {
        await page.waitForSelector('#PaymentAccepted');
    } catch (error) {
        return OUTPUT.paymentFailure;
    }

    return OUTPUT.paymentSuccess;
}
```


**Avoid**: Relying on the absence of an element that may have been updated or changed.


```
async function isPaymentSuccessful() {
    const $paymentAmount = await page.$('#PaymentAmount');

    if (!$paymentAmount) return OUTPUT.paymentSuccess;
}
```


## Presumption of failure

**Every action has failed until it has provably succeeded.**

Always assume an action has failed before having a proof of success. Always verify important steps to avoid false positives or false negatives.

* False positive = **false / failed** outcome reported as **true / successful** on output.
* False negative = **true / successful** outcome reported as **false / failed** on output.

Assuming any action has been successful without direct proof is dangerous. Disprove failure actively through proof of success instead. Only then consider output valid and verified.

**Good**: Verify outcome through proof. Clearly disprove failure of an important action.


```
async function submitPayment() {
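    // Click and wait for the navigation together so the navigation event is not missed.
    await Promise.all([
        page.click('#submitPayment'), // assumes the button's ID is "submitPayment"
        page.waitForNavigation(),
    ]);

    try {
        // Optional chaining keeps polling instead of throwing while the element does not exist yet.
        await page.waitForFunction(
            (selector) => document.querySelector(selector)?.innerText.includes('Payment Success'),
            { polling: 'mutation' },
            '#PaymentOutcome',
        );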
    } catch (error) {
        return OUTPUT.paymentFailure;
    }

    return OUTPUT.paymentSuccess;
}
```


**Avoid**: Not verifying an outcome. It can fail despite output claiming otherwise.


```
async function submitPayment() {
    await Promise.all([
        page.click('#submitPayment'),
        page.waitForNavigation(),
    ]);

    return OUTPUT.paymentSuccess;
}
```


## Targeting elements

Be both as specific and as generic as possible at the same time.

### DOM element selectors

Make sure your [CSS selectors](https://developer.mozilla.org/en-US/docs/Web/CSS/CSS_Selectors) have the best chance to remain valid after a website is updated.

* Prefer [higher-specificity](https://developer.mozilla.org/en-US/docs/Web/CSS/Specificity) selectors over lower-specificity ones (**#id** over **.class**).
* Use [attribute selectors](https://developer.mozilla.org/en-US/docs/Web/CSS/Attribute_selectors) to search parts of attributes (prefix, suffix, etc.).
* Use element attributes with the **lowest probability of a future change**.
* Completely **avoid or strip** selectors of values that are clearly **random**.
* Completely **avoid or strip** selectors of values that are clearly **flexible**.
* **Extend low-specificity** selectors to reduce the probability of **collisions**.

Below is an example of stripping away too-specific parts of a selector that are likely random or subject to change.

`#P_L_v201w3_t3_ReceiptToolStripLabel` => `a[id*="ReceiptToolStripLabel"]`

If you are reasonably confident a page layout will remain without any dramatic future changes **and** need to increase the selector specificity to reduce the chance of a collision with other selectors, you can extend the selector as per the principle below.

`#ReceiptToolStripLabel_P_L_v201w3_t3` => `table li > a[id^="ReceiptToolStripLabel"]`
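
The stripping shown in the two examples above could be automated roughly as follows. This is only a sketch: the `stableIdSelector` helper is hypothetical, and its heuristic (treating underscore-separated tokens that are short or contain digits as generated) is an assumption that will not fit every website.

```javascript
// Build an attribute selector from the part of an ID that looks stable.
function stableIdSelector(id, tag = '') {
    const tokens = id.split('_');
    // Assumption: short tokens and tokens with digits are random or auto-generated.
    const stable = tokens.filter((token) => token.length >= 4 && !/\d/.test(token));
    if (stable.length === 0) return null; // nothing trustworthy to match on

    // Use the longest stable token as the anchor of the selector.
    const anchor = stable.reduce((a, b) => (b.length > a.length ? b : a));
    // Prefix match when the stable part starts the ID, substring match otherwise.
    const operator = id.startsWith(anchor) ? '^=' : '*=';
    return `${tag}[id${operator}"${anchor}"]`;
}

console.log(stableIdSelector('P_L_v201w3_t3_ReceiptToolStripLabel', 'a'));
console.log(stableIdSelector('ReceiptToolStripLabel_P_L_v201w3_t3', 'a'));
```

The generated selectors are a starting point. Extending them with parent elements, as in `table li > a[...]`, still requires knowledge of the page layout.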

### Content pattern matching

Matching elements by content is already natively supported by https://playwright.dev/. Playwright is a https://nodejs.org/en/ library that allows you to automate Chromium, Firefox and WebKit with a single API.

In [Puppeteer](https://pptr.dev/), you can use custom utility functions to [polyfill](https://developer.mozilla.org/en-US/docs/Glossary/Polyfill) this functionality.

## Event-bound flows

Always strive to make code as fluid as possible. Listen to events and react to them as needed by triggering consecutive actions immediately.

* **Avoid** any **fixed-duration** delays wherever possible.
* Prefer fluid flow based on the **occurrence of events**.


```
// Avoid:
await page.waitForTimeout(timeout);

// Good:
await page.waitForFunction(myFunction, options, args);

// Good:
await page.waitForFunction(() => {
    return window.location.href.includes('path');
});

// Good:
await page.waitForFunction(
    (selector) => document.querySelector(selector).innerText,
    { polling: 'mutation' },
    '[data-qa="btnAppleSignUp"]',
);
```


---

# AI agent tutorial

**In this section of the Apify Academy, we show you how to build an AI agent with the CrewAI Python framework. You’ll learn how to create an agent for Instagram analysis and integrate it with LLMs and Apify Actors.**

***

AI agents are goal-oriented systems that make independent decisions. They interact with environments using predefined tools and workflows to automate complex tasks.

On Apify, AI agents are built as Actors—serverless cloud programs for web scraping, data processing, and AI deployment. Apify evolved from running scrapers in the cloud to supporting LLMs that follow predefined workflows with dynamically defined goals.

## Prerequisites

To build an effective AI agent, you need prompts to guide it, tools for external interactions, a large language model (LLM) to connect the components, an agentic framework to handle LLM behavior, and a platform to run, deploy, and scale the solution.

## Benefits of using Apify for AI agents

Apify provides a complete platform for building and deploying AI agents with the following benefits:

* *Serverless execution* - without infrastructure management
* *Stateful execution* - with agent memory capabilities
* *Monetization options* - through usage-based charging
* *Extensive tool ecosystem* - with thousands of available Actors
* *Scalability and reliability* - for production environments
* *Pre-integrated tools* - for web scraping and automation

## Building an AI agent

### Step 1: Define the use case

This tutorial creates a social media analysis agent that analyzes Instagram posts based on user queries using the [Instagram Scraper](https://apify.com/apify/instagram-scraper) Actor.

*Example:*

* *Input:* "Analyze the last 10 posts from @openai and summarize AI trends."
* *Output:* Trend analysis based on post content.

### Step 2: Configure input and output

Define the input format (URL, JSON configuration, or text query) and output format (text response or structured data) for your agent.

*Example input:*

* User query: "Analyze @openai posts for AI trends"
* OpenAI model selection (e.g., `gpt-4`)

*Example output:*

* Text response with insights
* Data stored in an Apify [dataset](https://docs.apify.com/platform/storage/dataset.md)

Agent memory

Agents can include memory for storing information between conversations. Single-task agents typically do not require memory.

### Step 3: Set up the development environment

Install the Apify CLI, which allows you to create, run, and deploy Actors from your local machine.


```
npm install -g @apify/cli
```


Create a new Actor project from the CrewAI template and navigate into the new directory.


```
apify create agent-actor -t python-crewai
cd agent-actor
```


### Step 4: Understand the project structure

The template includes:

* `.actor/` – Actor configuration files.

  * `actor.json` – The Actor's definition.
  * `input_schema.json` – Defines the UI for the Actor's input.
  * `dataset_schema.json` – Defines the structure of the output data.
  * `pay_per_event.json` – Configuration for monetization.

* `src/` – Source code.

  * `main.py` – The main script for Actor execution, agent, and task definition.
  * `tools.py` – Implementations of the tools the agent can use.
  * `models.py` – Pydantic models for structured tool output.
  * `ppe_utils.py` – Helper functions for pay-per-event monetization.

### Step 5: Define input and output schemas

Update `.actor/input_schema.json` to define the Actor's inputs. This schema generates a user interface for running the Actor on the Apify platform.


```
{
  "title": "Instagram Analysis Agent Input",
  "type": "object",
  "schemaVersion": 1,
  "properties": {
    "query": {
      "title": "Query",
      "type": "string",
      "description": "Task for the agent to perform",
      "example": "Analyze @openai posts for AI trends"
    },
    "modelName": {
      "title": "Model Name",
      "type": "string",
      "description": "OpenAI model to use",
      "default": "gpt-4"
    }
  },
  "required": ["query"]
}
```


Define the dataset schema in `.actor/dataset_schema.json`. This helps structure the data pushed to the dataset.


```
{
  "title": "Instagram Analysis Output",
  "type": "object",
  "properties": {
    "query": {
      "title": "Query",
      "type": "string"
    },
    "response": {
      "title": "Response",
      "type": "string"
    }
  }
}
```


### Step 6: Configure tools

The Instagram post scraper tool is implemented using the [Instagram Scraper](https://apify.com/apify/instagram-scraper) Actor. The tool returns structured output as Pydantic models defined in `src/models.py`:


```
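from datetime import datetime

from pydantic import BaseModel


class InstagramPost(BaseModel):
    """Structured representation of one scraped Instagram post."""
    id: str
    url: str
    caption: str
    timestamp: datetime
    likes_count: int
    comments_count: int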
```


The tool is defined in `src/tools.py` and includes:

* Tool description and argument schema for the agent
* Integration with Instagram Scraper Actor
* Data retrieval and formatting

### Step 7: Implement the agent

The agent implementation in `src/main.py` includes:

1. Handle Actor input: Read the user's query and any other parameters from the Actor input.


   ```
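   from apify import Actor


   async def main():
       async with Actor:
           # get_input() returns None when no input is provided, so fall back to an empty dict.
           actor_input = await Actor.get_input() or {}
           query = actor_input.get("query")
           model_name = actor_input.get("modelName", "gpt-4")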
   ```


2. Define the agent: Instantiate the agent, giving it a role, a goal, and access to the tools you configured.


   ```
   agent = Agent(
       role="Social Media Analyst",
       goal="Analyze Instagram posts and provide insights",
       backstory="Expert in social media analysis and trend identification",
       tools=[instagram_scraper_tool],
       llm=ChatOpenAI(model=model_name)
   )
   ```


3. Create task and crew: Define the task for the agent to complete based on the user's query.


   ```
   task = Task(
       description=query,
       agent=agent,
       expected_output="Detailed analysis with insights"
   )

   crew = Crew(
       agents=[agent],
       tasks=[task]
   )
   ```


4. Execute and save results: Kick off the crew to run the task and save the final result to the Actor's default dataset.


   ```
   result = crew.kickoff()
   await Actor.push_data({
       "query": query,
       "response": str(result)
   })
   ```


### Step 8: Test locally

Run the agent on your local machine using the Apify CLI. Ensure you have set any required environment variables (e.g., `OPENAI_API_KEY`).


```
apify run
```


### Step 9: Deploy to Apify

Push your Actor's code to the Apify platform.


```
apify push
```


After deployment:

1. Navigate to your Actor's settings.
2. Set `OPENAI_API_KEY` as a secret environment variable.
3. Rebuild the Actor version to apply the changes.

### Step 10: Test the deployed agent

Run the agent on the platform with a sample query and monitor the results in the output dataset.


```
Analyze the posts of @openai and @googledeepmind and summarize current trends in AI.
```


Troubleshooting

Common issues and solutions:

* *Agent fails to call tools:* Check that the tool descriptions in `src/tools.py` are clear and the argument schemas are correct.
* *Instagram scraper fails:* Verify that the Instagram usernames exist and are public. Check the scraper Actor's run logs for specific errors.
* *Missing API key:* Ensure `OPENAI_API_KEY` is set as a secret environment variable in your Actor's Settings.

## Monetizing your AI agent

Apify's pay-per-event (PPE) pricing model allows charging users based on specific triggered events through the API or SDKs.

How pay-per-event pricing works

If you want more details about PPE pricing, refer to our [pay-per-event documentation](https://docs.apify.com/platform/actors/publishing/monetize/pay-per-event.md).

### Step 1: Define chargeable events

You can configure charges for events like the Actor starting, a task completing successfully, or custom events such as specific API calls.

Example event definition:


```
{
  "eventName": "task-completed",
  "description": "Charge for completed analysis task",
  "price": 0.10
}
```


### Step 2: Implement charging in code

Add charging logic to your code:


```
# The Apify Python SDK takes the event name and count as keyword arguments.
await Actor.charge(event_name="task-completed", count=1)
```


### Step 3: Configure PPE settings

1. Enable pay-per-event monetization in Actor settings.
2. Define events from `pay_per_event.json`.
3. Set pricing for each event.

### Step 4: Publish the agent

Before making your agent public on https://apify.com/store, complete the following checklist:

* Update README with usage instructions.
* Validate `input_schema.json` and `dataset_schema.json`.
* Verify `OPENAI_API_KEY` environment variable is handled correctly.
* Check monetization settings on the Actor publication page.
* Test the Actor thoroughly.
* Set your Actor's visibility to public.

## Next steps

To continue developing AI agents:

1. *Use the CrewAI template:* Start with `apify create agent-actor -t python-crewai`
2. *Explore other templates:* Visit the Apify templates page for alternatives
3. *Review existing agents:* Check the AI agents collection on Apify Store
4. *Publish and monetize:* Deploy with `apify push` and enable monetization


---

# Anti-scraping protections

**Understand the various anti-scraping measures different sites use to prevent bots from accessing them, and how to appear more human to fix these issues.**

***

If at any point in time you've strayed away from the Academy's demo content, and into the Wild West by writing some scrapers of your own, you may have been hit with anti-scraping measures. This is extremely common in the scraping world; however, the good thing is that there are always solutions.

This section covers the essentials of mitigating anti-scraping protections, such as proxies, HTTP headers and cookies, and a few other things to consider when working on a reliable and scalable crawler. Proper usage of the methods taught in the next lessons will allow you to extract data which is specific to a certain location, enable your crawler to browse websites as a logged-in user, and more.

In development, it is crucial to check and adjust the configurations related to our next lessons' topics, as doing this can fix blocking issues on the majority of websites.

## Quick start

If you don't have time to read about the theory behind anti-scraping protections to fine-tune your scraping project and instead you need to get unblocked ASAP, here are some quick tips:

* Use high-quality proxies. [Residential proxies](https://docs.apify.com/platform/proxy/residential-proxy.md) are the least blocked. You can find many providers out there, like Apify, BrightData, Oxylabs, NetNut, etc.
* Set **real-user-like HTTP settings** and **browser fingerprints**. [Crawlee](https://crawlee.dev/) uses statistically generated realistic HTTP headers and browser fingerprints by default for all of its crawlers.
* Use a browser to pass bot capture challenges. We recommend [Playwright with Firefox](https://crawlee.dev/docs/examples/playwright-crawler-firefox) because it is not that common for scraping. You can also play with the [`headless`](https://crawlee.dev/api/playwright-crawler/interface/PlaywrightCrawlerOptions#headless) option and adjust other [fingerprint generator options](https://crawlee.dev/api/browser-pool/interface/FingerprintGeneratorOptions).
* Consider extracting data from **[internal APIs](https://docs.apify.com/academy/api-scraping.md)** or **mobile app APIs**. They are usually much less protected.
* Increase the number of request retries significantly, to at least 10, with [`maxRequestRetries`](https://crawlee.dev/api/basic-crawler/interface/BasicCrawlerOptions#maxRequestRetries). Rotate sessions after every error by setting [`maxErrorScore`](https://crawlee.dev/api/core/interface/SessionOptions#maxErrorScore) to 1.
* If you cannot afford to use browsers for performance reasons, you can try [Playwright's request API](https://playwright.dev/docs/api/class-playwright#playwright-request) or [node-libcurl](https://www.npmjs.com/package/node-libcurl) as the HTTP library for [CheerioCrawler](https://crawlee.dev/api/cheerio-crawler/class/CheerioCrawler) or [BasicCrawler](https://crawlee.dev/api/basic-crawler/class/BasicCrawler), instead of their default [got-scraping](https://crawlee.dev/docs/guides/got-scraping) HTTP back end. These libraries have access to native code, which offers much finer control over the HTTP traffic and mimics real browsers better than a plain Node.js implementation like `got-scraping` can. These libraries should become part of Crawlee itself in the future.
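
As a rough sketch, the retry and session tips above might translate into crawler options along these lines. The option names come from the Crawlee API pages linked in the list. The values are illustrative, and the `unblockOptions` variable name is an assumption.

```javascript
// Options object only. It could be spread into a crawler constructor, for example:
// new PlaywrightCrawler({ ...unblockOptions, requestHandler })
const unblockOptions = {
    // Retry each request up to 10 times before marking it as failed.
    maxRequestRetries: 10,
    // Keep the session pool enabled so that working sessions are reused.
    useSessionPool: true,
    sessionPoolOptions: {
        sessionOptions: {
            // Retire a session after its first error, which rotates to a new one.
            maxErrorScore: 1,
        },
    },
};

console.log(unblockOptions);
```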

In the vast majority of cases, this configuration should lead to success. Success doesn't mean that all requests will go through unblocked; that is not realistic. Some IP address and fingerprint combinations will still be blocked, but the automatic retry system takes care of that. If you can get at least 10% of your requests through, you can still scrape the whole website with enough retries. The default [SessionPool](https://crawlee.dev/api/core/class/SessionPool) configuration will preserve the working sessions, and eventually the success rate will increase.
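
To get a feeling for what "enough retries" means, here is a back-of-the-envelope calculation. It assumes that every attempt succeeds independently with the same probability, which is a pessimistic simplification because the session pool keeps reusing sessions that work. Both function names are made up for this sketch.

```javascript
// Probability that a request still has not succeeded after `attempts` independent tries.
function failureAfter(successRate, attempts) {
    return (1 - successRate) ** attempts;
}

// Smallest number of attempts after which the target share of requests has succeeded.
function attemptsForCoverage(successRate, targetCoverage) {
    return Math.ceil(Math.log(1 - targetCoverage) / Math.log(1 - successRate));
}

// With a 10% success rate, 1 initial try plus 10 retries leaves about 31% of requests failing.
console.log(failureAfter(0.1, 11).toFixed(2)); // "0.31"
// Under the same assumption, reaching 99% of requests takes up to 44 attempts each.
console.log(attemptsForCoverage(0.1, 0.99)); // 44
```

Under this simplified model, a very low success rate calls for a retry limit well above 10, unless the session pool raises the success rate over time.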

If the above tips didn't help, you can try to fiddle with the following:

* Try different browsers. Crawlee & Playwright support Chromium, Firefox and WebKit out of the box. You can also try the [Brave browser](https://brave.com), which [can be configured for Playwright](https://blog.apify.com/unlocking-the-potential-of-brave-and-playwright-for-browser-automation/).
* Don't use browsers at all. Sometimes the anti-scraping protections are extremely sensitive to browser behavior but will allow plain HTTP requests (with the right headers) just fine. Don't forget to match the specific [HTTP headers](https://docs.apify.com/academy/concepts/http-headers.md) for each request.
* Decrease concurrency. Slower scraping means you can blend in better with the rest of the traffic.
* Add human-like behavior. Don't traverse the website like a bot (paginating quickly from 1 to 100). Instead, visit various types of pages, add time randomizations, and you can even introduce some mouse movements and clicks.
* Try Puppeteer with the [puppeteer-extra-plugin-stealth](https://github.com/berstend/puppeteer-extra/tree/master/packages/puppeteer-extra-plugin-stealth) plugin. Generally, Crawlee's default configuration should have stronger bypassing, but some features might land first in the stealth plugin.
* Find different sources of the data. The data might be rendered to the HTML, but you could also find it in JavaScript (inlined in the HTML or in files) or in the API responses. APIs in particular are often much less protected (if you use the right headers).
* Reverse engineer the JavaScript challenges that run on the page so you can figure out how to bypass them. This is a very advanced topic that you can read about online. We plan to introduce more content about this.
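
The "time randomizations" tip above can be sketched with two small helpers. The names, the 1 to 4 second range, and the `visit` callback are assumptions made for this example. In a real crawler, the pause would go wherever the next page is requested.

```javascript
// Resolve after `ms` milliseconds.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Pick a delay between minMs and maxMs. The random source is injectable to make testing easier.
function randomDelayMs(minMs, maxMs, random = Math.random) {
    return Math.round(minMs + random() * (maxMs - minMs));
}

// Visit pages one by one with an irregular pause after each visit.
async function visitWithPauses(urls, visit) {
    for (const url of urls) {
        await visit(url);
        await sleep(randomDelayMs(1000, 4000));
    }
}
```

Irregular pauses slow the scraper down, so they are usually combined with a lower concurrency rather than applied to a highly parallel crawl.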

Keep in mind that there is no silver bullet solution. You can find many anti-scraping systems, and each of them behaves differently depending on the website's configuration. That is why "trying a few things" usually leads to success. You will find more details about these tricks in the [mitigation](https://docs.apify.com/academy/anti-scraping/mitigation.md) section below.

## First of all, why do websites want to block bots?

What's up with that?! A website might have a variety of reasons to block bots from accessing it. Here are a few of the main ones:

* To prevent malicious bots from crawling the site to steal sensitive data like passwords or personal data about users.
* To avoid server performance hits due to bots making a large number of requests to the website at a single time.
* To prevent competitors from gaining market insights about their business.
* To prevent bots from scraping their content and selling it to other websites or re-publishing it.
* To not skew their analytics data with bot traffic.
* If it is a social media website, they might be attempting to keep away bots programmed to mass create fake profiles (which are usually sold later).

> We recommend checking out https://blog.apify.com/is-web-scraping-legal/.

Unfortunately for these websites, they have to make compromises and tradeoffs. While super strong anti-bot protections will surely prevent the majority of bots from accessing their content, there is also a higher chance of regular users being flagged as bots and being blocked as well. Because of this, different sites have different scraping-difficulty levels based on the anti-scraping measures they take.

> Going into this topic, it's important to understand that there is no one silver bullet solution to bypassing protections against bots. Even if two websites are using Cloudflare (for example), one of them might be significantly more difficult to scrape due to harsher Cloudflare configurations. It is all about configuration, not the anti-scraping tool itself.

## The principles of anti-scraping protections

Anti-scraping protections can work on many different layers and use a large number of bot-identification techniques.

1. **Where you are coming from** - The IP address of the incoming traffic is always available to the website. Proxies are used to emulate different IP addresses, but their quality matters a lot.
2. **How you look** - With each request, the website can analyze its HTTP headers, TLS version, ciphers, and other information. Moreover, if you use a browser, the website can also analyze the whole browser fingerprint and run challenges to classify your hardware (like graphics hardware acceleration).
3. **What you are scraping** - The same data can be extracted in many ways from a website. You can get the initial HTML or you can use a browser to render the full page or you can reverse engineer internal APIs. Each of those endpoints can be protected differently.
4. **How you behave** - The website can see patterns in how you are ordering your requests, how fast you are scraping, etc. It can also analyze browser behavior like mouse movement, clicks or key presses.

These are the 4 main principles that anti-scraping protections are based on.

Not all websites use all of these principles but they encompass the possibilities websites have to track and block bots. All techniques that help you mitigate anti-scraping protections are based on making yourself blend in with the crowd of regular users with each of these principles.

A bot can usually be detected in one of two ways, which follow two different types of web scraping:

1. Crawlers using **HTTP requests**
2. Crawlers using **browser automation** (usually with a headless browser)

Once a bot is detected, there are some countermeasures a website takes to prevent it from re-accessing it. The protection techniques are divided into two main categories:

1. Uses only the **information provided within the HTTP request**, such as headers, IP addresses, TLS versions, ciphers, etc.
2. Uses **JavaScript evaluation to collect browser fingerprint**, or even track the user behavior on the website. These JavaScript evaluations can also track mouse movement or keys pressed. Based on the information gathered, they can decide if the user is a bot or a human. This method is often paired with the first one.

Once one of these methods detects that the user is a bot, it will take countermeasures depending on how advanced its techniques are.

A common workflow of a website after it has detected a bot goes as follows:

1. The bot is added to the "greylist" (a list of suspicious IP addresses, fingerprints or any other value that can be used to uniquely identify the bot).
2. A [Turing test](https://en.wikipedia.org/wiki/Turing_test) is provided to the bot, typically a **captcha**. If the bot succeeds, it is added to the whitelist.
3. If the bot fails the captcha, it is added to the blacklist.
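
The three steps above can be read as a small state machine. The sketch below uses made-up state and event names. Real anti-scraping systems are far more nuanced and usually work with scores rather than fixed lists.

```javascript
// Hypothetical visitor states and the events that move a visitor between them.
const transitions = {
    unknown: { botDetected: 'greylisted' },
    greylisted: { captchaSolved: 'whitelisted', captchaFailed: 'blacklisted' },
};

// Unknown events and final states leave the status unchanged.
function nextStatus(status, event) {
    return transitions[status]?.[event] ?? status;
}

console.log(nextStatus('unknown', 'botDetected')); // greylisted
console.log(nextStatus('greylisted', 'captchaFailed')); // blacklisted
```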

One thing to keep in mind while navigating through this course is that advanced anti-scraping methods are able to identify non-humans not only by one value (such as a single header value, or IP address), but are able to identify them through more complex things such as header combinations.

Watch a conference talk by https://github.com/mnmkng, which provides an overview of various anti-scraping measures and tactics for circumventing them.

https://www.youtube-nocookie.com/embed/aXil0K-M-Vs

Several years old?

Although the talk, given in 2021, features some outdated code examples, it still serves well as a general overview.

## Common anti-scraping measures

Because we here at Apify scrape for a living, we have discovered many popular and niche anti-scraping techniques. We've compiled them into a short and comprehensible list here to help understand the roadblocks before this course teaches you how to get around them.

> Not all issues you encounter are caused by anti-scraping systems. Sometimes, it's a configuration issue. Learn [how to analyze pages and fix errors](https://docs.apify.com/academy/node-js/analyzing-pages-and-fixing-errors.md).

### IP rate-limiting

This is the most straightforward and standard protection. It is mainly implemented to prevent DDoS attacks, but it also works for blocking scrapers. Websites using rate limiting don't allow more than a defined number of requests from one IP address in a certain time span. If the maximum number of requests is low, there is a high potential for false positives, because an IP address does not always identify a single user: in a large company, for example, hundreds of employees can share the same IP address.
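
To make the mechanism concrete, here is a minimal sketch of how a website might count requests per IP address in fixed time windows. The class name and the limits are illustrative, and real implementations often use sliding windows or token buckets instead.

```javascript
// Fixed-window limiter: at most `limit` requests per IP address in each `windowMs` window.
class IpRateLimiter {
    constructor(limit, windowMs) {
        this.limit = limit;
        this.windowMs = windowMs;
        this.windows = new Map(); // ip -> { start, count }
    }

    // Returns true if the request is allowed and false if it should be rejected.
    allow(ip, now = Date.now()) {
        const current = this.windows.get(ip);
        if (!current || now - current.start >= this.windowMs) {
            // First request from this IP, or the previous window has expired.
            this.windows.set(ip, { start: now, count: 1 });
            return true;
        }
        current.count += 1;
        return current.count <= this.limit;
    }
}

// Three requests per minute are allowed, so the fourth request within the window is rejected.
const limiter = new IpRateLimiter(3, 60_000);
const results = [0, 1, 2, 3].map((second) => limiter.allow('203.0.113.7', second * 1000));
console.log(results); // [ true, true, true, false ]
```

The example also shows where the false positives come from: every user behind `203.0.113.7` shares the same counter.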

> Learn more about rate limiting https://docs.apify.com/academy/anti-scraping/techniques/rate-limiting.md

### Header checking

This type of bot identification is based on the fact that humans access web pages through browsers, which send specific sets of [HTTP headers](https://docs.apify.com/academy/concepts/http-headers.md) along with every request. The most commonly known header that helps to detect bots is the `User-Agent` header, which holds a value that identifies which browser is being used, and what version it's running. Though `User-Agent` is the most commonly used header for the **Header checking** method, other headers are sometimes used as well. The evaluation is often also based on header consistency, checking the headers against known combinations sent by real browsers.
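
A simplified sketch of such a consistency check follows. The rules are assumptions chosen for illustration and do not reflect any specific vendor's logic. The `sec-ch-ua` rule relies on recent Chromium-based browsers sending client hint headers on HTTPS requests.

```javascript
// Return a list of reasons why a set of request headers looks automated.
function headerRedFlags(headers) {
    // HTTP header names are case-insensitive, so normalize them first.
    const normalized = Object.fromEntries(
        Object.entries(headers).map(([name, value]) => [name.toLowerCase(), value]),
    );
    const flags = [];
    const userAgent = normalized['user-agent'] ?? '';

    if (!userAgent) flags.push('missing User-Agent');
    if (!normalized['accept-language']) flags.push('missing Accept-Language');
    // A Chrome User-Agent without client hints is an inconsistent combination.
    if (userAgent.includes('Chrome/') && !normalized['sec-ch-ua']) {
        flags.push('Chrome User-Agent without sec-ch-ua');
    }
    return flags;
}

// A request that only spoofs the User-Agent still stands out.
console.log(headerRedFlags({ 'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) Chrome/120.0.0.0' }));
```

This is why changing only the `User-Agent` is rarely enough: the remaining headers have to match what that browser would really send.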

### URL analysis

This technique is based solely on how bots operate. It compares the number of visits to data-rich pages with the number of visits to other pages. The ratio of data-rich to regular page visits has to be high to identify a bot reliably and reduce false positives.

### Regular structure changes

By definition, this is not an anti-scraping method, but it can heavily affect the reliability of a scraper. If your target website drastically changes its CSS selectors, and your scraper is heavily reliant on selectors, it could break. In principle, websites using this method change their HTML structure or CSS selectors randomly and frequently, making the parsing of the data harder, and requiring more maintenance of the bot.

One of the best ways to avoid breaking your scraper due to website structure changes is to limit your reliance on data from HTML elements as much as possible (see [API scraping](https://docs.apify.com/academy/api-scraping.md) and [JavaScript in HTML](https://docs.apify.com/academy/node-js/js-in-html.md)).

### IP session consistency

This technique is commonly used to block a bot from accessing the website altogether. It works on the principle that every entity that accesses the site gets a token. This token is then saved together with the IP address and HTTP request information such as `User-Agent` and other specific headers. If the entity makes another request, but without the session token, the IP address is added to the greylist.
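
The bookkeeping described above might look roughly like the following sketch. The class and its methods are hypothetical. Treating a token that comes back with a different IP address or `User-Agent` as inconsistent is an assumption added for the example.

```javascript
// Remember which IP address and User-Agent each session token was issued to.
class SessionConsistencyChecker {
    constructor() {
        this.sessions = new Map(); // token -> { ip, userAgent }
        this.knownIps = new Set();
        this.greylist = new Set();
    }

    // Issue a token on the first visit. The token format is a placeholder and not secure.
    issueToken(ip, userAgent) {
        const token = `token-${this.sessions.size + 1}`;
        this.sessions.set(token, { ip, userAgent });
        this.knownIps.add(ip);
        return token;
    }

    // Greylist a known IP address that returns without its token or with a mismatched one.
    check(ip, userAgent, token) {
        const session = this.sessions.get(token);
        const consistent = Boolean(session) && session.ip === ip && session.userAgent === userAgent;
        if (!consistent && this.knownIps.has(ip)) this.greylist.add(ip);
        return consistent;
    }
}
```

For a scraper, the practical consequence is to keep cookies, IP address, and headers together for the lifetime of a session instead of mixing them between requests.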

### Interval analysis

This technique is based on analyzing the time intervals between visits to a website. If the intervals are very similar, the entity is added to the greylist. The premise is that a bot runs at regular intervals, for example, as a CRON job that starts every Monday. It is a long-term strategy, so it should be used as an extension of other techniques. This technique needs only the information from the HTTP request to identify the frequency of the visits.
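
One way to quantify "very similar" intervals is the coefficient of variation of the gaps between visits. The sketch below is illustrative only. The 0.05 threshold and both function names are assumptions.

```javascript
// Coefficient of variation (standard deviation divided by mean) of the gaps between visits.
function intervalVariation(timestamps) {
    const sorted = [...timestamps].sort((a, b) => a - b);
    const gaps = sorted.slice(1).map((time, index) => time - sorted[index]);
    if (gaps.length < 2) return null; // not enough data to judge

    const mean = gaps.reduce((sum, gap) => sum + gap, 0) / gaps.length;
    if (mean === 0) return 0;
    const variance = gaps.reduce((sum, gap) => sum + (gap - mean) ** 2, 0) / gaps.length;
    return Math.sqrt(variance) / mean;
}

// Hypothetical rule: very regular visits (low variation) look like a scheduled job.
function looksScheduled(timestamps, threshold = 0.05) {
    const variation = intervalVariation(timestamps);
    return variation !== null && variation < threshold;
}

const DAY = 24 * 60 * 60 * 1000;
console.log(looksScheduled([0, 7 * DAY, 14 * DAY, 21 * DAY])); // true: exactly weekly visits
console.log(looksScheduled([0, 1 * DAY, 9 * DAY, 12 * DAY, 30 * DAY])); // false: irregular visits
```

On the scraper side, starting scheduled runs at slightly varying times makes such a pattern less obvious.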

### Browser fingerprinting

One of the most successful and advanced methods is collecting the browser's "fingerprint", which is a fancy name for information such as fonts, audio codecs, canvas fingerprint, graphics card, and more. Browser fingerprints are highly unique, so they are a reliable means of identifying a specific user (or bot). If the fingerprint provides different/inconsistent information, the user is added to the greylist.

> It's important to note that this method also blocks all users that cannot evaluate JavaScript (such as bots sending only static HTTP requests), and combines both of the fundamental methods mentioned earlier.
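
The consistency part of fingerprinting can be illustrated with a small check that compares the `User-Agent` with values that page JavaScript would read from `navigator`. This is a deliberately tiny sketch with hypothetical rules. Real systems combine many more signals, such as fonts, canvas, and audio.

```javascript
// Collect contradictions between what the browser claims and what page JavaScript reports.
// The input is a plain object with values that would normally be read from `navigator`.
function fingerprintInconsistencies({ userAgent, platform, webdriver = false }) {
    const issues = [];
    if (userAgent.includes('Windows NT') && !platform.startsWith('Win')) {
        issues.push('User-Agent claims Windows, but navigator.platform does not');
    }
    if (userAgent.includes('Macintosh') && !platform.startsWith('Mac')) {
        issues.push('User-Agent claims macOS, but navigator.platform does not');
    }
    // Automation frameworks set navigator.webdriver to true unless it is patched.
    if (webdriver) issues.push('navigator.webdriver is true');
    return issues;
}

// A Linux machine pretending to be Windows through the User-Agent alone.
console.log(fingerprintInconsistencies({
    userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0.0.0',
    platform: 'Linux x86_64',
}));
```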

### Honeypots

The honeypot approach is based on providing links that only bots can see. A typical example is hidden pagination. A bot usually needs to go through all the pages in the pagination, so the website adds a "fake" last page behind a link that is hidden from human users but matches the same selector as the real pagination links. Once the bot visits the link, it is automatically blacklisted. This method needs only the HTTP information.

## First up

In our [anti-scraping techniques](https://docs.apify.com/academy/anti-scraping/techniques.md) section, we'll discuss in more depth the various anti-scraping methods and techniques websites use, as well as how to mitigate these protections.


---

# Anti-scraping mitigation

**After learning about the various anti-scraping techniques websites use, learn how to mitigate them with a few different techniques.**

***

In the [techniques](https://docs.apify.com/academy/anti-scraping/techniques.md) section of this course, you learned about multiple methods websites use to prevent bots from accessing their content. This **Mitigation** section is all about how to circumvent these protections using various techniques.

## Next up

In the [proxies lesson](https://docs.apify.com/academy/anti-scraping/mitigation/proxies.md) of this section, you'll learn what proxies are and how to use them in your own crawler.


---

# Bypassing Cloudflare browser check

**Learn how to bypass Cloudflare browser challenge with Crawlee.**

***

If you find yourself stuck, there are a few strategies that you can employ. One key strategy is to ensure that your browser fingerprint is consistent. In some cases, the default browser fingerprint may actually be more effective than an inconsistently generated fingerprint. Additionally, it may be beneficial to avoid masking a Linux browser to look like a Windows or macOS browser, although this will depend on the specific configuration of the website you are targeting.

For those using Crawlee, the library provides out-of-the-box support for generating consistent fingerprints that are able to pass the Cloudflare challenge. However, it's important to note that in some cases, the Cloudflare challenge screen may return a 403 status code even if it is evaluating the fingerprint and the request is not blocked. This can cause the default Crawlee browser crawlers to throw an error and not wait until the challenge is submitted and the page is redirected to the target webpage.

To address this issue, it is necessary to alter the crawler configuration. For example, you might use the following code to remove default blocked status code handling from the crawler:


```
const crawler = new PlaywrightCrawler({
    ...otherOptions,
    sessionPoolOptions: {
        blockedStatusCodes: [],
    },
});
```


It's important to note that by removing default blocked status code handling, you should also add custom session retire logic on blocked pages to reduce retries. Additionally, you should add waiting logic to start the automation logic only after the Cloudflare challenge is solved and the page is redirected. This can be accomplished by waiting for a common selector that is available on all pages, such as a header logo.
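
A minimal sketch of this waiting and retiring logic is shown below. It assumes a Crawlee browser crawling context that provides a Playwright `page` and a `session`. The `waitForChallengeToPass` name, the `#header-logo` selector, and the 30-second timeout are placeholders to adapt to the target website.

```javascript
// Wait for an element that only exists on real pages. Retire the session if it never appears.
async function waitForChallengeToPass({ page, session }, selector = '#header-logo') {
    try {
        await page.waitForSelector(selector, { timeout: 30_000 });
    } catch (error) {
        // Still on the challenge or a block page: stop reusing this session's IP and fingerprint.
        session.retire();
        // Rethrow so that the crawler treats the request as failed and retries it.
        throw new Error(`Cloudflare challenge not passed: ${error.message}`);
    }
}
```

Inside a `requestHandler`, the helper could be awaited first, for example with `await waitForChallengeToPass(context)`, so that the extraction logic only runs on the target page.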

In some cases, the browser may not pass the check and you may be presented with a captcha, indicating that your IP address has been graylisted. If you are working with a large pool of proxies, you can retire the session and use another IP. However, if you have a small pool of proxies, you might want to get the IP whitelisted instead. To do this, you'll need to solve the captcha to improve your IP address's reputation. You can find various captcha-solving services, such as https://anti-captcha.com/, that you can use for this purpose. For more info, check the section about [captchas](https://docs.apify.com/academy/anti-scraping/techniques/captchas.md).

![Cloudflare captcha](https://images.ctfassets.net/slt3lc6tev37/6sN2VXiUaJpjxqVfTbZEJd/9a4e13cbf08ce29797167c133c534e1f/image1.png)

In summary, while Cloudflare's browser challenge is designed to protect websites from automated scraping, it can be bypassed by ensuring a consistent browser fingerprint and customizing your scraping strategy. Crawlee offers out-of-the-box support for generating consistent fingerprints, but you may need to adjust your crawler configuration to handle Cloudflare's response. By following these tips, you can successfully navigate Cloudflare's browser challenge and continue scraping the data you need.


---

# Generating fingerprints

**Learn how to use two super handy npm libraries to generate fingerprints and inject them into a Playwright or Puppeteer page.**

***

In [Crawlee](https://crawlee.dev), you can use [FingerprintOptions](https://crawlee.dev/api/browser-pool/interface/FingerprintOptions) on a crawler to automatically generate fingerprints.


```
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    browserPoolOptions: {
        fingerprintOptions: {
            fingerprintGeneratorOptions: {
                browsers: [{ name: 'firefox', minVersion: 80 }],
                devices: ['desktop'],
                operatingSystems: ['windows'],
            },
        },
    },
});
```


> Note that Crawlee will automatically generate fingerprints for you with no configuration necessary, but the option to configure them yourself is still there within **browserPoolOptions**.

## Using the fingerprint-generator package

Crawlee uses the https://github.com/apify/fingerprint-suite npm package to do its fingerprint generating magic. For maximum control outside of Crawlee, you can install it on its own. With this package, you can generate browser fingerprints.

> It is crucial to generate fingerprints for the specific browser and operating system being used to trick the protections successfully. For example, if you are trying to overcome protection locally with Firefox on a macOS system, you should generate fingerprints for Firefox and macOS to achieve the best results.


```
import { FingerprintGenerator } from 'fingerprint-generator';

// Instantiate the fingerprint generator with
// configuration options
const fingerprintGenerator = new FingerprintGenerator({
    browsers: [
        { name: 'firefox', minVersion: 80 },
    ],
    devices: [
        'desktop',
    ],
    operatingSystems: [
        'windows',
    ],
});

// Grab a fingerprint from the fingerprint generator
const generated = fingerprintGenerator.getFingerprint({
    locales: ['en-US', 'en'],
});
```


## Injecting fingerprints

Once you've manually generated a fingerprint using the **Fingerprint generator** package, it can be injected into the browser using https://github.com/apify/fingerprint-injector. This tool allows you to inject fingerprints into browsers automated by Playwright or Puppeteer:


```
import { FingerprintGenerator } from 'fingerprint-generator';
import { FingerprintInjector } from 'fingerprint-injector';
import { chromium } from 'playwright';

// Instantiate a fingerprint injector
const fingerprintInjector = new FingerprintInjector();

// Launch a browser in Playwright
const browser = await chromium.launch();

// Instantiate the fingerprint generator with
// configuration options
const fingerprintGenerator = new FingerprintGenerator({
    browsers: [
        // Match the browser that is actually launched (Chromium)
        { name: 'chrome', minVersion: 80 },
    ],
    devices: [
        'desktop',
    ],
    operatingSystems: [
        'windows',
    ],
});

// Grab a fingerprint
const generated = fingerprintGenerator.getFingerprint({
    locales: ['en-US', 'en'],
});

// Create a new browser context, plugging in
// some values from the fingerprint
const context = await browser.newContext({
    userAgent: generated.fingerprint.userAgent,
    locale: generated.fingerprint.navigator.language,
});

// Attach the fingerprint to the newly created
// browser context
await fingerprintInjector.attachFingerprintToPlaywright(context, generated);

// Create a new page and go to Google
const page = await context.newPage();
await page.goto('https://google.com');
```


> Note that [Crawlee](https://crawlee.dev) automatically applies a wide variety of fingerprints by default, so it is not required to do this unless you aren't using Crawlee or you need a super specific custom fingerprint to scrape with.

## Generating headers

Headers are also used by websites to fingerprint users (or bots), so it might sometimes be necessary to generate some user-like headers to mitigate anti-scraping protections. As with fingerprints, **Crawlee** automatically generates headers for you, but you can have full control by using the https://github.com/apify/browser-headers-generator package.


```
import BrowserHeadersGenerator from 'browser-headers-generator';

const browserHeadersGenerator = new BrowserHeadersGenerator({
    operatingSystems: ['windows'],
    browsers: ['chrome'],
});

await browserHeadersGenerator.initialize();

const randomBrowserHeaders = await browserHeadersGenerator.getRandomizedHeaders();
```


## Wrap up

That's it for the **Mitigation** course for now, but be on the lookout for future lessons! We release lessons as we write them, and will be updating the Academy frequently, so be sure to check back every once in a while for new content!


---

# Proxies

**Learn all about proxies, how they work, and how they can be leveraged in a scraper to avoid blocking and other anti-scraping tactics.**

***

A proxy server provides a gateway between users and the internet. In our case, more specifically, it sits between the crawler and the target website.

Many websites have [rate limiting](https://docs.apify.com/academy/anti-scraping/techniques/rate-limiting.md) set up, which is when a website **limits** the **rate** at which requests can be sent from a single IP address. When the crawler is expected to send a high number of requests, using a proxy is essential to let it run as smoothly as possible and avoid being blocked.

The following factors determine the quality of a proxy IP:

* How many users share the same proxy IP address?
* How did the previous user use (or overuse) the proxy?
* How long was the proxy left to "heal" before it was resold?
* What is the quality of the underlying server of the proxy? (latency)

Although IP quality is still the most important factor when it comes to using proxies and avoiding anti-scraping measures, nowadays it's not just about avoiding rate limiting. This brings new challenges for scrapers, which can no longer rely on IP rotation alone. Anti-scraping software providers, such as Cloudflare, have global databases of "suspicious" IP addresses. If you are unlucky, your newly bought IP might be blocked even before you use it. If the previous owners overused it, it might have already been marked as suspicious in many databases, or even (very likely) blocked altogether. If you care about the quality of your IPs, use them as a real user would, and any website will have a hard time banning them completely.

Fixing rate-limiting issues is only the tip of the iceberg of what proxies can do for your scrapers, though. By implementing proxies properly, you can successfully avoid the majority of anti-scraping measures listed in the https://docs.apify.com/academy/anti-scraping.md.

## About proxy links

To use a proxy, you need a proxy link, which contains the connection details, sometimes including credentials.


```
http://proxy.example.com:8080
```


The proxy link above has several parts:

* `http://` tells us we're using the HTTP protocol,
* `proxy.example.com` is a hostname, i.e. an address to the proxy server,
* `8080` is a port number.

Sometimes the proxy server has no name, so the link contains an IP address instead:


```
http://203.0.113.10:8080
```


If the proxy requires authentication, the proxy link can contain a username and password:


```
http://USERNAME:PASSWORD@proxy.example.com:8080
```
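
The parts of a proxy link can also be read programmatically. Here is a minimal sketch using the standard `URL` class available in Node.js and browsers. The `parseProxyUrl` helper name is our own and not part of any library.

```js
// Hypothetical helper that splits a proxy link into its parts.
function parseProxyUrl(proxyUrl) {
    const { protocol, hostname, port, username, password } = new URL(proxyUrl);
    return {
        // `protocol` comes with a trailing colon ("http:"), so strip it.
        protocol: protocol.replace(':', ''),
        hostname,
        port: Number(port),
        // Credentials are percent-encoded inside a URL, so decode them.
        username: decodeURIComponent(username),
        password: decodeURIComponent(password),
    };
}

console.log(parseProxyUrl('http://USERNAME:PASSWORD@proxy.example.com:8080'));
// {
//   protocol: 'http',
//   hostname: 'proxy.example.com',
//   port: 8080,
//   username: 'USERNAME',
//   password: 'PASSWORD'
// }
```

Note that `URL` reports an empty port when the link uses the protocol's default port, so this sketch would return `0` in that case.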


## Proxy rotation

Web scrapers can implement a method called "proxy rotation" to **rotate** the IP addresses they use to access websites. Each request can be assigned a different IP address, which makes it appear as if the requests are coming from different users in different locations. This reduces the chance of hitting per-IP rate limits and is a major factor when it comes to making a web scraper appear more human.
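
To make the idea concrete, here is a minimal sketch of round-robin rotation in plain JavaScript. The `createProxyRotator` helper and the proxy hostnames are made up for illustration. Production setups usually also track sessions and proxy health, which this sketch ignores.

```js
// Hypothetical helper: returns a function that hands out
// the proxies one after another and then starts over.
function createProxyRotator(proxyUrls) {
    let index = 0;
    return () => {
        const proxyUrl = proxyUrls[index];
        index = (index + 1) % proxyUrls.length;
        return proxyUrl;
    };
}

const nextProxy = createProxyRotator([
    'http://proxy-1.example.com:8080',
    'http://proxy-2.example.com:8080',
    'http://proxy-3.example.com:8080',
]);

// Each request gets the next proxy; the fourth wraps around to proxy-1.
for (const path of ['/page-1', '/page-2', '/page-3', '/page-4']) {
    console.log(`${path} -> ${nextProxy()}`);
}
```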

## Next up

Proxies are one of the most important things to understand when it comes to mitigating anti-scraping techniques in a scraper. Now that you're familiar with what they are, the next lesson will be teaching you how to configure your crawler in Crawlee to use and automatically rotate proxies. https://docs.apify.com/academy/anti-scraping/mitigation/using-proxies.md


---

# Using proxies

**Learn how to use and automagically rotate proxies in your scrapers by using Crawlee, and a bit about how to obtain pools of proxies.**

***

In the https://docs.apify.com/academy/web-scraping-for-beginners/crawling/pro-scraping.md course, we learned about the power of Crawlee, and how it can streamline the development process of web crawlers. You've already seen how powerful the `crawlee` package is; however, what you've been exposed to thus far is only the tip of the iceberg.

Because proxies are so widely used in the scraping world, Crawlee has built-in features for implementing them in an effective way. One of the main functionalities that comes baked into Crawlee is proxy rotation, which is when each request is sent through a different proxy from a proxy pool.

## Implementing proxies in a scraper

Let's borrow some scraper code from the end of the https://docs.apify.com/academy/web-scraping-for-beginners/crawling/pro-scraping.md lesson in our **Web scraping basics for JavaScript devs** course and paste it into a new file called **proxies.js**. This code enqueues all of the product links on https://demo-webstore.apify.org's on-sale page, then makes a request to each product page and scrapes data about each one:


```
// proxies.js
import { CheerioCrawler, Dataset } from 'crawlee';

const crawler = new CheerioCrawler({
    requestHandler: async ({ $, request, enqueueLinks }) => {
        if (request.label === 'START') {
            await enqueueLinks({
                selector: 'a[href*="/product/"]',
            });

            // When on the START page, we don't want to
            // extract any data after we extract the links.
            return;
        }

        // We copied and pasted the extraction code
        // from the previous lesson
        const title = $('h3').text().trim();
        const price = $('h3 + div').text().trim();
        const description = $('div[class*="Text_body"]').text().trim();

        // Instead of saving the data to a variable,
        // we immediately save everything to a file.
        await Dataset.pushData({
            title,
            description,
            price,
        });
    },
});

await crawler.addRequests([{
    url: 'https://demo-webstore.apify.org/search/on-sale',
    // By labeling the Request, we can identify it
    // later in the requestHandler.
    label: 'START',
}]);

await crawler.run();
```


In order to implement a proxy pool, we will first need some proxies. We'll quickly use the free https://apify.com/mstephen190/proxy-scraper on the Apify platform to get our hands on some quality proxies. Next, we'll need to set up a https://crawlee.dev/api/core/class/ProxyConfiguration and configure it with our custom proxies, like so:


```
import { ProxyConfiguration } from 'crawlee';

const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: ['http://45.42.177.37:3128', 'http://43.128.166.24:59394', 'http://51.79.49.178:3128'],
});
```


Awesome, so there's our proxy pool! Usually, a proxy pool is much larger than this; however, a three-proxy pool is totally fine for tutorial purposes. Finally, we can pass the `proxyConfiguration` into our crawler's options:


```
const crawler = new CheerioCrawler({
    proxyConfiguration,
    requestHandler: async ({ $, request, enqueueLinks }) => {
        if (request.label === 'START') {
            await enqueueLinks({
                selector: 'a[href*="/product/"]',
            });
            return;
        }

        const title = $('h3').text().trim();
        const price = $('h3 + div').text().trim();
        const description = $('div[class*="Text_body"]').text().trim();

        await Dataset.pushData({
            title,
            description,
            price,
        });
    },
});
```


> Note that if you run this code, it may not work, as the proxies could potentially be down/non-operating at the time you are going through this course.

That's it! The crawler will now automatically rotate through the proxies we provided in the `proxyUrls` option.

## A bit about debugging proxies

At the time of writing, the scraper above utilizing our custom proxy pool is working just fine. But how can we check that the scraper is for sure using the proxies we provided it, and more importantly, how can we debug proxies within our scraper? Luckily, within the same `context` object we've been destructuring `$` and `request` out of, there is a `proxyInfo` key as well. `proxyInfo` is an object which includes useful data about the proxy which was used to make the request.


```
const crawler = new CheerioCrawler({
    proxyConfiguration,
    // Destructure "proxyInfo" from the "context" object
    requestHandler: async ({ $, request, proxyInfo }) => {
        // Log its value
        console.log(proxyInfo);
        // ...
        // ...
    },
});
```


After modifying your code to log `proxyInfo` to the console and running the scraper, you're going to see some logs which look like this:

![proxyInfo being logged by the scraper](/assets/images/proxy-info-logs-edb7e733aab82acb15258e9d44ba8a64.png)

These logs confirm that our proxies are being used and rotated successfully by Crawlee, and can also be used to debug slow or broken proxies.
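
If the raw logs become noisy, the same information can be aggregated. The sketch below is one possible approach: `createProxyStats` is a hypothetical helper, and it assumes only that each `proxyInfo` object exposes a `url` property.

```js
// Hypothetical helper that counts how many requests went through each proxy.
function createProxyStats() {
    const counts = new Map();
    return {
        // Call this from the requestHandler with the context's proxyInfo.
        record(proxyInfo) {
            const key = proxyInfo?.url ?? 'no-proxy';
            counts.set(key, (counts.get(key) ?? 0) + 1);
        },
        // Returns a plain object mapping each proxy URL to its request count.
        summary() {
            return Object.fromEntries(counts);
        },
    };
}
```

In a crawler, you might call `stats.record(proxyInfo)` inside the `requestHandler` and print `stats.summary()` with `console.table()` once `crawler.run()` has finished. A proxy that never shows up, or shows up far less often than the others, is a good candidate for closer inspection.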

## Higher level proxy scraping

Though we will discuss it more in-depth in future courses, it is still important to mention that Crawlee has integrated support for the Apify SDK, which supports [Apify Proxy](https://apify.com/proxy), a service that provides access to pools of both residential and datacenter IP addresses. A `proxyConfiguration` using Apify Proxy might look something like this:


```
import { Actor } from 'apify';

const proxyConfiguration = await Actor.createProxyConfiguration({
    groups: ['SHADER'],
    countryCode: 'US',
});
```


Notice that we didn't provide it a list of proxy URLs. This is because the `SHADER` group already serves as our proxy pool (courtesy of Apify Proxy).

## Next up

https://docs.apify.com/academy/anti-scraping/mitigation/generating-fingerprints.md, we'll be checking out how to use two npm packages to generate and inject https://docs.apify.com/academy/anti-scraping/techniques/fingerprinting.md.


---

# Anti-scraping techniques

**Understand the various common (and obscure) anti-scraping techniques used by websites to prevent bots from accessing their content.**

***

In this section, we'll be discussing some of the most common (as well as some obscure) anti-scraping techniques used by websites to detect and block/limit bots from accessing their content.

When a scraper is detected, a website can respond in a variety of ways:

## "Access denied" page

This is a complete block which usually has a response status code of **403**. Usually, you'll hit an **Access denied** page if you have a bad IP address or the website is restricted in the country of the IP address.

> For a better understanding of what all the HTTP status codes mean, we recommend checking out https://http.cat/ which provides a highly professional description for each status code.

## Captcha page

Probably the most common blocking method. The website gives you a chance to prove that you are not a bot by presenting you with a captcha. We'll be covering captchas within this course.

## Redirect

Another common method is redirecting to the home page of the site (or a different location).

## Request timeout/Socket hangup

This is the cheapest defense mechanism where the website won't even respond to the request. Dealing with timeouts in a scraper can be challenging, because you have to differentiate them from regular network problems.

## Custom status code or message

Similar to getting an **Access denied** page, but some sites send along specific status codes (e.g. **503**) and messages explaining what was wrong with the request.

## Empty results

The website responds "normally," but pretends to not find any results. This requires manual testing to recognize the pattern.

## Fake results

The website responds with data, but the data is totally fake, which is very difficult to recognize and requires extensive manual testing. Luckily, this type of response is not all too common.
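
In a scraper, these responses have to be told apart before deciding whether to retry, rotate the proxy, or give up. The sketch below is a deliberately simplified heuristic: the `classifyResponse` helper, its input shape, and the markers it looks for are all assumptions, and real sites need rules tuned through manual testing. Timeouts and fake results are not covered, since the former never produce a response and the latter cannot be detected from a single response.

```js
// Hypothetical classifier for the response types described above.
function classifyResponse({ statusCode, requestedUrl, finalUrl, body, itemCount }) {
    if (statusCode === 403) return 'access-denied';
    if (/captcha/i.test(body)) return 'captcha';
    if (statusCode >= 400) return 'custom-status';
    // A redirect away from the requested page (for example, to the home page).
    if (new URL(finalUrl).pathname !== new URL(requestedUrl).pathname) return 'redirect';
    if (itemCount === 0) return 'empty-results';
    return 'ok';
}

console.log(classifyResponse({
    statusCode: 200,
    requestedUrl: 'https://example.com/search?q=shoes',
    finalUrl: 'https://example.com/',
    body: '<html>Welcome</html>',
    itemCount: 0,
}));
// redirect
```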

## Next up

In the https://docs.apify.com/academy/anti-scraping/techniques/rate-limiting.md of this course, you'll be learning about **rate limiting**, which is a technique used to prevent a large number of requests from being sent from one user.


---

# Browser challenges

> Learn how to navigate browser challenges like Cloudflare's to effectively scrape data from protected websites.

## Browser challenges

Browser challenges are a type of security measure that relies on browser fingerprints. These challenges typically involve a JavaScript program that collects both static and dynamic browser fingerprints. Static fingerprints include attributes such as User-Agent, video card, and number of CPU cores available. Dynamic fingerprints, on the other hand, might involve rendering fonts or objects in the canvas (known as a [canvas fingerprint](https://docs.apify.com/academy/anti-scraping/techniques/fingerprinting.md#with-canvases)), or playing audio in the [AudioContext](https://docs.apify.com/academy/anti-scraping/techniques/fingerprinting.md#from-audiocontext). We covered the details in the previous [fingerprinting](https://docs.apify.com/academy/anti-scraping/techniques/fingerprinting.md) lesson.

While some browser challenges are relatively straightforward - for example, loading an image and checking if it renders correctly - others can be much more complex. One well-known example of a complex browser challenge is Cloudflare's browser screen check. In this challenge, Cloudflare visually inspects the browser screen and blocks the first request if any inconsistencies are found. This approach provides an extra layer of protection against automated attacks.

Many online protections incorporate browser challenges into their security measures, but the specific techniques used can vary.

## Cloudflare browser challenge

One of the most well-known browser challenges is the one used by Cloudflare. Cloudflare has a massive dataset of legitimate canvas fingerprints and User-Agent pairs, which they use in conjunction with machine learning algorithms to detect any device property spoofing. This might include spoofed User-Agent headers, operating systems, or GPUs.

![Cloudflare browser check](https://images.ctfassets.net/slt3lc6tev37/55EYMR81XJCIG5uxLjQQOx/252a98adf90fa0ff2f70437cc5c0a3af/under-attack-mode_enabled.gif)

When you encounter a Cloudflare browser challenge, the platform checks your canvas fingerprint against the expected value. If there is a mismatch, the request is blocked. However, if your canvas fingerprint matches the expected value, Cloudflare issues a cookie that allows you to continue scraping - even without the browser - until the cookie is invalidated.
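
As a rough illustration of reusing that cookie outside the browser, the sketch below turns a list of cookies (in the `{ name, value }` shape returned by Playwright's `context.cookies()`) into a `Cookie` header for plain HTTP requests. The `toCookieHeader` helper and the sample values are assumptions. The clearance cookie is commonly named `cf_clearance`, and such requests typically need to reuse the same User-Agent and IP address as the browser that solved the challenge.

```js
// Hypothetical helper: serializes cookies into a Cookie request header.
function toCookieHeader(cookies) {
    return cookies.map(({ name, value }) => `${name}=${value}`).join('; ');
}

// Sample data in the shape returned by `await context.cookies()`.
const cookies = [
    { name: 'cf_clearance', value: 'example-token', domain: '.example.com' },
    { name: 'session', value: 'abc123', domain: '.example.com' },
];

const headers = {
    Cookie: toCookieHeader(cookies),
    // Assumption: the User-Agent the browser used when solving the challenge.
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:90.0) Gecko/20100101 Firefox/90.0',
};

console.log(headers.Cookie);
// cf_clearance=example-token; session=abc123
```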

It's worth noting that Cloudflare's protection is highly customizable, and can be adjusted to be extremely strict or relatively loose. This makes it a powerful tool for website owners who want to protect against automated traffic, while still allowing legitimate traffic to flow through.

If you want to learn how to bypass the Cloudflare challenge, visit the https://docs.apify.com/academy/anti-scraping/mitigation/cloudflare-challenge.md article.

## Next up

In the https://docs.apify.com/academy/anti-scraping/techniques/captchas.md, we'll be covering **captchas**, which were mentioned throughout this lesson. It's important to note that attempting to solve a captcha programmatically is the last resort - always try to avoid being presented with the captcha in the first place by using the techniques mentioned in this lesson.


---

# Captchas

**Learn about the reasons a bot might be presented a captcha, the best ways to avoid captchas in the first place, and how to programmatically solve them.**

***

In general, a website will present a user (or scraper) a captcha for 2 main reasons:

1. The website always does captcha checks to access the desired content.
2. One of the website's anti-bot measures (or the https://docs.apify.com/academy/anti-scraping/techniques/firewalls.md) has flagged the user as suspicious.

## Dealing with captchas

When you've hit a captcha, your first thought should not be how to programmatically solve it. Rather, you should consider why you received the captcha in the first place: most likely, your bot didn't appear enough like a real user to avoid being presented the challenge.

Have you exhausted all of the possible options to make your scraper appear more human-like? Are you:

* Using https://docs.apify.com/academy/anti-scraping/mitigation/proxies.md?
* Making the request with the proper https://docs.apify.com/academy/concepts/http-headers.md and https://docs.apify.com/academy/concepts/http-cookies.md?
* Generating and using a custom https://docs.apify.com/academy/anti-scraping/techniques/fingerprinting.md?
* Trying different general scraping methods (HTTP scraping, browser scraping)? If you are using browser scraping, have you tried using a different browser?

## Solving captchas

If you've tried everything you can to avoid being presented the captcha and are still facing this roadblock, there are methods to programmatically solve captchas.

Tons of different types of captchas exist, but one of the most popular is Google's https://www.google.com/recaptcha/about/.

![Google's reCAPTCHA](https://miro.medium.com/max/1400/1*4NhFKMxr-qXodjYpxtiE0w.gif)

**reCAPTCHA**s can be solved using the https://apify.com/petr_cermak/anti-captcha-recaptcha Actor on the Apify platform (note that this method requires an account on https://anti-captcha.com).

Another popular captcha is the https://www.geetest.com/en/adaptive-captcha-demo. You can learn how to solve these types of captchas in Puppeteer by reading this https://filipvitas.medium.com/how-to-solve-geetest-slider-captcha-with-js-ac764c4e9905. Amazon's captcha can similarly also be solved programmatically.

## Wrap up

In this course, you've learned about some of the most common (and some of the most advanced) anti-scraping techniques. Keep in mind that as the web (and technology in general) evolves, this section of the **Anti scraping** course will evolve as well. In the https://docs.apify.com/academy/anti-scraping/mitigation.md, we'll be discussing how to mitigate the anti-scraping techniques you learned about in this section.


---

# Fingerprinting

**Understand browser fingerprinting, an advanced technique used by websites to track user data and even block bots from accessing them.**

***

Browser fingerprinting is a method that some websites use to collect information about a browser's type and version, as well as the operating system being used, any active plugins, the time zone and language of the machine, the screen resolution, and various other active settings. All of this information is called the **fingerprint** of the browser, and the act of collecting it is called **fingerprinting**.

Yup! Surprisingly enough, browsers provide a lot of information about the user (and even their machine) that is accessible to websites! Browser fingerprinting wouldn't even be possible if it weren't for the sheer amount of information browsers provide, and the fact that most fingerprints are unique or nearly so.

Based on [research](https://www.eff.org/press/archives/2010/05/13) carried out by the Electronic Frontier Foundation, 84% of collected fingerprints were unique, and a further 9% belonged to sets of size two. The researchers also stated that even though fingerprints change over time, new ones can be matched up with old ones with 99.1% accuracy. This makes fingerprinting a very viable option for websites that want to track the online behavior of their users in order to serve hyper-personalized advertisements to them. In some cases, it is also used to aid in preventing bots from accessing the websites (or certain sections of them).

## What makes up a fingerprint?

To build a good fingerprint, websites must collect data from various places.

### From HTTP headers

Several https://docs.apify.com/academy/concepts/http-headers.md can be used to create a fingerprint about a user. Here are some of the main ones:

1. **User-Agent** provides information about the browser and its operating system (including its versions).
2. **Accept** tells the server which content types the browser can handle, and **Accept-Encoding** lists the compression algorithms it supports.
3. **Content-Language** and **Accept-Language** both indicate the user's (and browser's) preferred language.
4. **Referer** gives the server the address of the previous page from which the link was followed.

A few other headers commonly used for fingerprinting can be seen below:

![Fingerprinted headers](/assets/images/fingerprinted-headers-ec689af0e137398a072e51fb876a7a33.png)

### From window properties

The `window` object is a global variable that is accessible from JavaScript running in the browser. It is home to a vast number of functions, variables, and constructors, and most of the global configuration is stored there.

Most of the attributes that are used for fingerprinting are stored under the `window.navigator` object, which holds methods and info about the user's state and identity starting with the **User-Agent** itself and ending with the device's battery status. All of these properties can be used to fingerprint a device; however, most fingerprinting solutions (such as https://valve.github.io/fingerprintjs/) only use the most crucial ones.

Here is a list of some of the most crucial properties on the `window` object used for fingerprinting:

| Property                        | Example                                                                  | Description                                                                           |
| ------------------------------- | ------------------------------------------------------------------------ | ------------------------------------------------------------------------------------- |
| `screen.width`                  | `1680`                                                                   | Defines the width of the device screen.                                               |
| `screen.height`                 | `1050`                                                                   | Defines the height of the device screen.                                              |
| `screen.availWidth`             | `1680`                                                                   | The portion of the screen width available to the browser window.                      |
| `screen.availHeight`            | `1050`                                                                   | The portion of the screen height available to the browser window.                     |
| `navigator.userAgent`           | `'Mozilla/5.0 (X11; Linux x86_64; rv:90.0) Gecko/20100101 Firefox/90.0'` | Same as the HTTP header.                                                              |
| `navigator.platform`            | `'MacIntel'`                                                             | The platform the browser is running on.                                               |
| `navigator.cookieEnabled`       | `true`                                                                   | Whether or not the browser accepts cookies.                                           |
| `navigator.doNotTrack`          | `'1'`                                                                    | Indicates the browser's Do Not Track settings.                                        |
| `navigator.buildID`             | `20181001000000`                                                         | The build ID of the browser.                                                          |
| `navigator.product`             | `'Gecko'`                                                                | The layout engine used.                                                               |
| `navigator.productSub`          | `20030107`                                                               | The version of the layout engine used.                                                |
| `navigator.vendor`              | `'Google Inc.'`                                                          | Vendor of the browser.                                                                |
| `navigator.hardwareConcurrency` | `4`                                                                      | The number of logical processors the user's computer has available to run threads on. |
| `navigator.javaEnabled()`       | `false`                                                                  | Whether or not the user has enabled Java.                                             |
| `navigator.deviceMemory`        | `8`                                                                      | Approximately the amount of user memory (in gigabytes).                               |
| `navigator.language`            | `'en-US'`                                                                | The user's primary language.                                                          |
| `navigator.languages`           | `['en-US', 'cs-CZ', 'es']`                                               | The user's preferred languages, in order of preference.                               |

### From function calls

Fingerprinting tools can also collect pieces of information that are retrieved by calling specific functions:


```
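// These snippets run in the browser, not in Node.js.
// Create a WebGL context and load the extension that
// exposes the unmasked vendor and renderer constants
const gl = document.createElement('canvas').getContext('webgl');
const debugInfo = gl.getExtension('WEBGL_debug_renderer_info');

// Get the WebGL vendor information (constant 37445)
gl.getParameter(debugInfo.UNMASKED_VENDOR_WEBGL);

// Get the WebGL renderer information (constant 37446)
gl.getParameter(debugInfo.UNMASKED_RENDERER_WEBGL);

// Pass any codec into this function (ex. "audio/aac"). It will return
// either "maybe", "probably", or "" indicating whether
// or not the browser can play that codec. An empty
// string means that it can't be played.
document.createElement('audio').canPlayType('audio/aac');

// Query the state of a permission ("granted", "denied", or "prompt")
// without prompting the user. This reveals which permissions
// the user has enabled, and which are disabled.
await navigator.permissions.query({ name: 'geolocation' });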
```


### With canvases

This technique is based on rendering https://developer.mozilla.org/en-US/docs/Web/API/WebGL_API scenes to a canvas element and observing the pixels rendered. WebGL rendering is tightly connected with the hardware, and therefore provides high entropy. Here's a quick breakdown of how it works:

1. A JavaScript script creates a [`<canvas>` element](https://developer.mozilla.org/en-US/docs/Web/API/Canvas_API) and renders some font or a custom shape.
2. The script then gets the pixel map from the `<canvas>` element.
3. The collected pixel map is run through a cryptographic hash function, producing a hash that is specific to the device's hardware.
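
The hashing in step 3 can be illustrated outside a browser. The following is a minimal sketch, assuming the pixel data has already been read from the canvas (in a browser, it would typically come from `getImageData()`). The `hashPixels` helper and the sample pixel values are hypothetical, and FNV-1a is used only as a simple, non-cryptographic stand-in for the cryptographic hash a real script would use.

```javascript
// Hypothetical helper: reduce a pixel map to a short ID.
// FNV-1a (32-bit) is a stand-in here; a real script would use a cryptographic hash.
const hashPixels = (pixels) => {
    let hash = 0x811c9dc5; // FNV offset basis
    for (const byte of pixels) {
        hash ^= byte;
        // Multiply by the FNV prime, keeping the result within 32 bits
        hash = Math.imul(hash, 0x01000193) >>> 0;
    }
    return hash.toString(16).padStart(8, '0');
};

// Simulated RGBA values of the "same" rendering on two machines.
// Only one channel of one pixel differs, as a subtle anti-aliasing difference might cause.
const machineA = Uint8Array.from([255, 0, 0, 255, 12, 34, 56, 255]);
const machineB = Uint8Array.from([255, 0, 0, 255, 12, 35, 56, 255]);

console.log(hashPixels(machineA));
console.log(hashPixels(machineB));
```

Even a one-byte difference in the rendered pixels produces a different ID, which is what makes the rendering usable as a fingerprint.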

Canvas fingerprinting takes advantage of the CSS3 feature for importing fonts into CSS (called [web fonts](https://developer.mozilla.org/en-US/docs/Learn/CSS/Styling_text/Web_fonts)). This means it's not required to use just the machine's preinstalled fonts.

Here's an example of multiple WebGL scenes visibly being rendered differently on different machines:

![Differences in canvas element renderings](/assets/images/canvas-differences-f6c668c93ead711787a67a7dac7ea62b.png)

### From AudioContext

The [AudioContext](https://developer.mozilla.org/en-US/docs/Web/API/AudioContext) API represents an audio-processing graph built from audio modules linked together, each represented by an [AudioNode](https://developer.mozilla.org/en-US/docs/Web/API/AudioNode) (for example, an [OscillatorNode](https://developer.mozilla.org/en-US/docs/Web/API/OscillatorNode)).

In the simplest cases, the fingerprint can be obtained by checking for the existence of AudioContext. However, this doesn't provide very much information. In advanced cases, the technique used to collect a fingerprint from AudioContext is quite similar to the `<canvas>` method:

1. Audio is passed through an OscillatorNode.
2. The signal is processed and collected.
3. The collected signal is cryptographically hashed to provide a short ID.

> A downside of this method is that two identical machines with the same browser will get the same ID.

### From BatteryManager

The `navigator.getBattery()` function returns a promise which resolves with a [BatteryManager](https://developer.mozilla.org/en-US/docs/Web/API/BatteryManager) interface. BatteryManager offers information about whether or not the battery is charging, and how much time is left until the battery has fully discharged/charged.

On its own this method is quite weak, but it can be potent when combined with the `<canvas>` and AudioContext fingerprinting techniques mentioned above.

## Fingerprint example

When all is said and done, this is what a browser fingerprint might look like:


```
{
  "userAgent": "Mozilla/5.0 (X11; Linux x86_64; rv:90.0) Gecko/20100101 Firefox/90.0",
  "cookiesEnabled": true,
  "timezone": "Europe/Prague",
  "timezoneOffset": -60,
  "audioCodecs": {
    "ogg": "probably",
    "mp3": "maybe",
    "wav": "probably",
    "m4a": "maybe",
    "aac": "maybe"
  },
  "videoCodecs": {
    "ogg": "probably",
    "h264": "probably",
    "webm": "probably"
  },
  "videoCard": [
    "Intel Open Source Technology Center",
    "Mesa DRI Intel(R) HD Graphics 4600 (HSW GT2)"
  ],
  "productSub": "20100101",
  "hardwareConcurrency": 8,
  "multimediaDevices": {
    "speakers": 0,
    "micros": 0,
    "webcams": 0
  },
  "platform": "Linux x86_64",
  "pluginsSupport": true,
  "screenResolution": [ 1920, 1080 ],
  "availableScreenResolution": [ 1920, 1080 ],
  "colorDepth": 24,
  "touchSupport": {
    "maxTouchPoints": 0,
    "touchEvent": false,
    "touchStart": false
  },
  "languages": [ "en-US", "en" ]
}
```


## How it works

Sites employ multiple levels and different approaches to collect browser fingerprints. However, they all have one thing in common: they are using a script written in JavaScript to evaluate the target browser's context and collect information about it (oftentimes also storing it in their database, or in a cookie). These scripts are often obfuscated and difficult to track down and understand, especially if they are anti-bot scripts.

Multiple levels of script obfuscation are used to make fingerprinting scripts unreadable and hard to find:

### Randomization

The script is modified with some random JavaScript elements. Additionally, it also often incorporates a random number of whitespaces and other unusual formatting characters as well as cryptic variable and function names devoid of readable meaning.

### Data obfuscation

Two main data obfuscation techniques are widely employed:

1. **String splitting** uses the concatenation of multiple substrings. It is mostly used alongside an `eval()` or `document.write()`.
2. **Keyword replacement** allows the script to mask the accessed properties. This allows the script to have a random order of the substrings and makes it harder to detect.

Oftentimes, both of these data obfuscation techniques are used together.
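
As a rough illustration of both techniques, consider the sketch below. It is not taken from any real fingerprinting script; the `fakeNavigator` object is a hypothetical stand-in for the browser's `navigator`, so that the snippet runs anywhere.

```javascript
// Stand-in for the browser's `navigator` object (hypothetical values)
const fakeNavigator = { userAgent: 'Mozilla/5.0', language: 'en-US' };

// String splitting: the property name never appears as a single literal
const splitKey = ['us', 'erA', 'gent'].join('');

// Keyword replacement: property names live in a lookup table and are
// referenced only by index, so the access itself reveals nothing
const _0x1a = ['language', 'userAgent'];

const collected = {
    a: fakeNavigator[splitKey],
    b: fakeNavigator[_0x1a[0]],
};

console.log(collected); // { a: 'Mozilla/5.0', b: 'en-US' }
```

Searching such a script for the text `userAgent` next to `navigator` finds nothing useful, even though both properties are read.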

### Encoding

Built-in JavaScript encoding functions are used to transform the code into, for example, a hexadecimal string. Alternatively, a custom encoding function is used, and a custom decoding function decodes the code as it is evaluated in the browser.
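
For instance, a property name might be stored as a hexadecimal string and decoded only at runtime. The `decodeHex` helper below is a hypothetical example of such a custom decoding function, not code from a real anti-bot script.

```javascript
// Hypothetical custom decoder: turns a hex string back into readable text
const decodeHex = (hex) => hex
    .match(/.{2}/g)
    .map((pair) => String.fromCharCode(parseInt(pair, 16)))
    .join('');

// The property name "platform", stored as hexadecimal character codes
const encoded = '706c6174666f726d';

console.log(decodeHex(encoded)); // platform
```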

## Detecting fingerprinting scripts

As mentioned above, many sites obfuscate their fingerprinting scripts to make them harder to detect. Luckily for us, there are ways around this.

### Manual de-obfuscation

Almost all sites using fingerprinting and tracking scripts try to protect them as much as they can. However, it is impossible to make client-side JavaScript immune to reverse engineering. It is only possible to make reverse engineering difficult and unpleasant for the developer. The procedure used to make the code as unreadable as possible is called [obfuscation](https://www.techtarget.com/searchsecurity/definition/obfuscation#:~:text=Obfuscation%20means%20to%20make%20something,code%20is%20one%20obfuscation%20method.).

When you want to dig inside the protection code to determine exactly which data is collected, you will probably have to deobfuscate it. Be aware that this can be a very time-consuming process. Getting the code into a semi-readable state can take up to 1–2 days.

We recommend watching some videos from https://www.youtube.com/channel/UCJbZGfomrHtwpdjrARoMVaA/videos to learn the tooling necessary to deobfuscate code.

### Using browser extensions

Because obfuscating fingerprinting scripts has become so common, and because browser fingerprinting is such a big privacy concern, browser extensions such as [Don't Fingerprint Me](https://github.com/freethenation/DFPM) have been created to help detect them. In the extension's window, you can see a report on which functions commonly used for fingerprinting have been called, and which navigator properties have been accessed.

![Don't Fingerprint Me extension window](/assets/images/dont-fingerprint-me-51a71cc91aec391b54c341abe69c3cf6.png)

This extension provides monitoring of only a few critical attributes, but in order to deceive anti-scraping protections, the full list is needed. However, the extension does reveal the scripts that collect the fingerprints.

## Anti-bot fingerprinting

Websites that implement advanced fingerprinting techniques tie the fingerprint and certain headers (such as the **User-Agent** header) to the IP address of the user. These sites will block a user (or scraper) if it makes a request with one fingerprint and set of headers, then tries to make another request through the same proxy but with a different fingerprint.

When dealing with these cases, it's important to sync the generation of headers and fingerprints with the rotation of proxies (this is known as session rotation).
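
One way to picture this is a store that hands out exactly one identity per proxy. The sketch below is a simplified assumption of how such syncing could look; `getSession`, the proxy URLs, and the fingerprint fields are all hypothetical.

```javascript
// Hypothetical pool of identities to pick from
const USER_AGENTS = [
    'Mozilla/5.0 (X11; Linux x86_64; rv:90.0) Gecko/20100101 Firefox/90.0',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:90.0) Gecko/20100101 Firefox/90.0',
];

const sessions = new Map();

// Return the session bound to a proxy, creating it on first use.
// Headers and fingerprint are generated once and then reused, so the
// website always sees the same identity coming from the same IP address.
const getSession = (proxyUrl) => {
    if (!sessions.has(proxyUrl)) {
        const userAgent = USER_AGENTS[sessions.size % USER_AGENTS.length];
        sessions.set(proxyUrl, {
            proxyUrl,
            headers: { 'User-Agent': userAgent },
            fingerprint: { userAgent, languages: ['en-US', 'en'] },
        });
    }
    return sessions.get(proxyUrl);
};

// Rotating to a new proxy means switching to a whole new session
const first = getSession('http://proxy-1.example.com:8000');
const second = getSession('http://proxy-2.example.com:8000');

console.log(first.headers['User-Agent'] === first.fingerprint.userAgent); // true
console.log(first === getSession('http://proxy-1.example.com:8000')); // true
console.log(first.fingerprint.userAgent === second.fingerprint.userAgent); // false
```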

## Next up

In the [next lesson](https://docs.apify.com/academy/anti-scraping/techniques/geolocation.md), we'll be covering **geolocation** methods that websites use to grab the location from which a request has been made, and how they relate to anti-scraping.


---

# Firewalls

**Understand what a web-application firewall is, how it works, and the various common techniques for avoiding it altogether.**

***

A web-application firewall (or **WAF**) is a tool for website admins which allows them to set various access rules for their visitors. The rules can vary on each website and are usually hard to detect; therefore, on sites using a WAF, you need to run a set of tests to test the rules and find out their limits.

One of the most common WAFs one can come across is the one from [Cloudflare](https://www.cloudflare.com). It allows setting a waiting screen that runs a few tests against the visitor to determine whether it is a genuine visitor or a bot. However, not all WAFs are that easy to detect.

![Cloudflare waiting screen](/assets/images/cloudflare-bd22fffac9bd5e98e327247500da14cb.png)

## How it works

WAFs work on a similar premise as regular firewalls. Web admins define the rules, and the firewall executes them. As an example of how the WAF can work, we will take a look at Cloudflare's solution:

1. The visitor sends a request to the webpage.
2. The request is intercepted by the firewall.
3. The firewall decides if presenting a challenge (captcha) is necessary. If the user already solved a captcha in the past or nothing is suspicious, it will immediately forward the request to the application's server.
4. A captcha is presented which must be solved. Once it is solved, a [cookie](https://docs.apify.com/academy/concepts/http-cookies.md) is stored in the visitor's browser.
5. The request is forwarded to the application's server.

![Cloudflare WAF workflow](/assets/images/cloudflare-graphic-8f4223bc691752af247662e7778589ff.jpg)

Since there are multiple providers, it is essential to say that the challenges are not always graphical and can be entirely server-side (without any JavaScript evaluation in the visitor browser).
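
The request flow described above can be summarized in a few lines of code. This is a loose sketch rather than how Cloudflare or any other provider actually implements it; the `handleRequest` function, the `challenge_passed` cookie name, and the `isSuspicious` flag are all made up for illustration.

```javascript
// Hypothetical firewall decision: forward the request or present a challenge
const handleRequest = (request) => {
    // A visitor who already solved a challenge carries a cookie proving it
    const alreadyVerified = request.cookies.challenge_passed === 'true';

    if (alreadyVerified || !request.isSuspicious) {
        return { action: 'forward' };
    }

    // Otherwise, a challenge must be solved before the request is forwarded
    return { action: 'challenge' };
};

console.log(handleRequest({ cookies: {}, isSuspicious: true }).action); // challenge
console.log(handleRequest({ cookies: { challenge_passed: 'true' }, isSuspicious: true }).action); // forward
console.log(handleRequest({ cookies: {}, isSuspicious: false }).action); // forward
```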

## Bypassing web-application firewalls

* Using [proxies](https://docs.apify.com/academy/anti-scraping/mitigation/proxies.md).
* Mocking [headers](https://docs.apify.com/academy/concepts/http-headers.md).
* Overriding the browser's [fingerprint](https://docs.apify.com/academy/anti-scraping/techniques/fingerprinting.md) (most effective).
* Farming the [cookies](https://docs.apify.com/academy/concepts/http-cookies.md) from a website with a headless browser, then using the farmed cookies to do HTTP-based scraping (most performant).

As you likely already know, there is no solution that fits all. If you are struggling to get past a WAF provider, you can try using Firefox with Playwright.

## Next up

In the [next lesson](https://docs.apify.com/academy/anti-scraping/techniques/browser-challenges.md), we'll be covering **browser challenges** and specifically the Cloudflare browser challenge which is part of the Cloudflare WAF mentioned in this lesson.


---

# Geolocation

**Learn about the geolocation techniques to determine where requests are coming from, and a bit about how to avoid being blocked based on geolocation.**

***

Geolocation is yet another way websites can detect and block access or show limited data. Other than by using the [Geolocation API](https://developer.mozilla.org/en-US/docs/Web/API/Geolocation_API) (which requires user permission in order to receive location data), there are two main ways that websites geolocate a user (or bot) visiting them.

## Cookies & headers

Certain websites might use location-specific or language-specific [headers](https://docs.apify.com/academy/concepts/http-headers.md)/[cookies](https://docs.apify.com/academy/concepts/http-cookies.md) to geolocate a user. Some examples of these headers are `Accept-Language` and `CloudFront-Viewer-Country` (which is a custom HTTP header from [CloudFront](https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/adding-cloudfront-headers.html)).

On targets that use just cookies and headers to identify the location a request is coming from, it is pretty straightforward to make requests that appear to come from somewhere else.

## IP address

The oldest (and still most common) way of geolocating is based on the IP address used to make the request. Sometimes, country-specific sites block themselves from being accessed from any other country (some Chinese, Indian, Israeli, and Japanese websites do this).

[Proxies](https://docs.apify.com/academy/anti-scraping/mitigation/proxies.md) can be used in a scraper to bypass restrictions and to make requests from a different location. Oftentimes, proxies need to be used in combination with location-specific [cookies](https://docs.apify.com/academy/concepts/http-cookies.md)/[headers](https://docs.apify.com/academy/concepts/http-headers.md).

## Override/emulate geolocation when using a browser-based scraper

When using [Puppeteer](https://pptr.dev/#?product=Puppeteer&show=api-pagesetgeolocationoptions), you can emulate the geolocation with the `page.setGeolocation()` function.

In [Playwright](https://playwright.dev/docs/api/class-browsercontext#browsercontextsetgeolocationgeolocation), geolocation can be emulated by using `browserContext.setGeolocation()`.

Overriding browser geolocation should be used in tandem with a proper proxy corresponding to the emulated geolocation. You would still likely get blocked if you, for example, used a German proxy with the overridden location set to Japan.
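
To keep the two in sync, the coordinates can be derived from the proxy's country. The following sketch assumes a hypothetical `COUNTRY_COORDINATES` table and `geolocationForCountry` helper; the coordinates are approximate city centers chosen for illustration. The returned object has the `latitude`/`longitude` shape that both `page.setGeolocation()` and `browserContext.setGeolocation()` accept.

```javascript
// Hypothetical lookup table: proxy country code -> coordinates in that country
const COUNTRY_COORDINATES = {
    DE: { latitude: 52.52, longitude: 13.405 }, // Berlin
    JP: { latitude: 35.6762, longitude: 139.6503 }, // Tokyo
};

// Pick the geolocation that matches the country of the proxy in use
const geolocationForCountry = (countryCode) => {
    const coordinates = COUNTRY_COORDINATES[countryCode];
    if (!coordinates) {
        throw new Error(`No coordinates configured for country "${countryCode}"`);
    }
    return { ...coordinates };
};

// With a German proxy, emulate a German location as well
const geolocation = geolocationForCountry('DE');
console.log(geolocation); // { latitude: 52.52, longitude: 13.405 }
```

Failing loudly on an unknown country is a deliberate choice here: silently falling back to a default location would recreate the proxy/location mismatch the lookup is meant to prevent.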


---

# Rate-limiting

**Learn about rate-limiting, a common tactic used by websites to avoid a large and non-human rate of requests coming from a single IP address.**

***

When crawling a website, a web scraping bot will typically send many more requests from a single IP address than a human user could generate over the same period. Websites can monitor how many requests they receive from a single IP address, and block it or require a [captcha](https://docs.apify.com/academy/anti-scraping/techniques/captchas.md) test to continue making requests.

In the past, most websites had their own anti-scraping solutions, the most common of which was IP address rate-limiting. In recent years, the popularity of third-party specialized anti-scraping providers has dramatically increased, but a lot of websites still use rate-limiting to only allow a certain number of requests per second/minute/hour to be sent from a single IP; therefore, crawler requests have the potential of being blocked entirely quite quickly.
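
To get a feel for what a scraper is up against, here is a minimal sketch of a fixed-window, per-IP limiter a website might run. It is an assumption made for illustration; real implementations vary widely, and `createRateLimiter` and its parameters are hypothetical names.

```javascript
// Hypothetical per-IP limiter: allow at most `maxRequests` per `windowMs`
const createRateLimiter = ({ maxRequests, windowMs }) => {
    const windows = new Map(); // IP address -> { start, count }

    return (ip, now = Date.now()) => {
        const current = windows.get(ip);

        // Start a fresh window if none exists or the previous one has expired
        if (!current || now - current.start >= windowMs) {
            windows.set(ip, { start: now, count: 1 });
            return true;
        }

        current.count += 1;
        return current.count <= maxRequests;
    };
};

const isAllowed = createRateLimiter({ maxRequests: 3, windowMs: 60_000 });

// Four requests from the same IP within one minute: the fourth is blocked
const results = [0, 1000, 2000, 3000].map((time) => isAllowed('203.0.113.7', time));
console.log(results); // [ true, true, true, false ]

// A different IP address has its own counter
console.log(isAllowed('203.0.113.8', 3000)); // true
```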

In cases when a higher number of requests is expected for the crawler, using a [proxy](https://docs.apify.com/academy/anti-scraping/mitigation/proxies.md) and rotating the IPs is essential to let the crawler run as smoothly as possible and avoid being blocked.

## Dealing with rate limiting by rotating proxy or session

The most popular and effective way of avoiding rate-limiting issues is by rotating [proxies](https://docs.apify.com/academy/anti-scraping/mitigation/proxies.md) after every **n** requests, which makes your scraper appear as if it is making requests from various different places. Since the majority of rate-limiting solutions are based on IP addresses, rotating IPs allows a scraper to make a large number of requests to a website without getting restricted.
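
Stripped of any library, the idea looks roughly like the sketch below. The `createProxyRotator` helper and the proxy URLs are hypothetical, and the rotation is a plain round-robin.

```javascript
// Hypothetical helper: switch to the next proxy after every `n` requests
const createProxyRotator = (proxyUrls, n) => {
    let requestCount = 0;

    return () => {
        // Integer division selects the proxy; modulo wraps around the list
        const index = Math.floor(requestCount / n) % proxyUrls.length;
        requestCount += 1;
        return proxyUrls[index];
    };
};

const nextProxy = createProxyRotator(
    ['http://proxy-a.example.com:8000', 'http://proxy-b.example.com:8000'],
    2,
);

// Five consecutive requests: a, a, b, b, then back to a
const used = Array.from({ length: 5 }, () => nextProxy());
console.log(used.map((url) => new URL(url).hostname));
```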

In Crawlee, proxies are automatically rotated for you when you use `ProxyConfiguration` and a [SessionPool](https://crawlee.dev/api/core/class/SessionPool) within a crawler. The SessionPool handles a lot of the nitty-gritty of proxy rotating, especially with [browser-based crawlers](https://docs.apify.com/academy/puppeteer-playwright.md), by retiring a browser instance after a certain number of requests have been sent from it in order to use a new proxy (a browser instance must be retired in order to use a new proxy).

Here is an example of these features being used in a **PuppeteerCrawler** instance:


```
import { PuppeteerCrawler } from 'crawlee';
import { Actor } from 'apify';

const myCrawler = new PuppeteerCrawler({
    proxyConfiguration: await Actor.createProxyConfiguration({
        groups: ['RESIDENTIAL'],
    }),
    sessionPoolOptions: {
        // Note that a proxy is tied to a session
        sessionOptions: {
            // Let's say the website starts blocking requests after
            // 20 requests have been sent in the span of 1 minute from
            // a single user.
            // We can stay on the safe side and retire the browser
            // and rotate proxies after 15 pages (requests) have been opened.
            maxUsageCount: 15,
        },
    },
    // ...
});
```


> Take a look at the [Using proxies](https://docs.apify.com/academy/anti-scraping/mitigation/using-proxies.md) lesson to learn more about how to use proxies and rotate them in Crawlee.

### Configuring a session pool

To set up the SessionPool for different rate-limiting scenarios, you can use various configuration options in `sessionPoolOptions`. In the example above, we used `maxUsageCount` within `sessionOptions` to prevent more than 15 requests from being sent using a session before it was thrown away; however, a maximum age can also be set using `maxAgeSecs`.

When dealing with frequent and unpredictable blockage, the `maxErrorScore` option can be set to trash a session after it's hit a certain number of errors.
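
Put together, a stricter configuration might look like the following sketch. The values are arbitrary assumptions, not recommendations; the object would be passed as the `sessionPoolOptions` of a crawler.

```javascript
// Example values only; tune them to the website being scraped
const sessionPoolOptions = {
    sessionOptions: {
        // Retire a session after 15 requests...
        maxUsageCount: 15,
        // ...or after 5 minutes, whichever comes first
        maxAgeSecs: 300,
        // Throw a session away once its error score reaches 2
        maxErrorScore: 2,
    },
};

console.log(sessionPoolOptions.sessionOptions);
```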

To learn more about all configurations available in `sessionPoolOptions`, refer to the [Crawlee documentation](https://crawlee.dev/api/core/interface/SessionPoolOptions).

> Don't worry too much about these configurations. Crawlee's defaults are usually good enough for the majority of use cases.

## Next up

Though rate limiting is still common today, a lot of sites have improved over the years to use more complicated techniques such as **browser fingerprinting**, which is covered in the [next lesson](https://docs.apify.com/academy/anti-scraping/techniques/fingerprinting.md).


---

# Using Apify API

**A collection of various tutorials explaining how to interact with the Apify platform programmatically using its API.**

***

This section explains how you can run [Actors](https://docs.apify.com/platform/actors.md) using Apify's [API](https://docs.apify.com/api/v2.md), retrieve their results, and integrate them into your own product and workflows. You can do this using a raw HTTP client, or you can benefit from using one of our API clients for:

* [JavaScript](https://docs.apify.com/api/client/js)
* [Python](https://docs.apify.com/api/client/python)


---

# API scraping

**Learn all about how the professionals scrape various types of APIs with various configurations, parameters, and requirements.**

***

API scraping is locating a website's API endpoints, and fetching the desired data directly from their API, as opposed to parsing the data from their rendered HTML pages.

> **Note:** In the next few lessons, we'll be using [SoundCloud](https://soundcloud.com) as an example target, but the techniques described here can be applied to any site.

In this module, we will discuss the benefits and drawbacks of API scraping, how to locate an API, how to utilize its potential features, and how to work around some common roadblocks.

## What's an API?

An API is a custom service that lives on the server of any given website. It provides an intuitive way for the website's client-side pages to send and receive data to and from the server, where it can be stored in a database, manipulated, or used to perform an operation. Though not **all** sites have APIs, many do, especially those built as complex web applications. Learn more about APIs [in this article](https://blog.apify.com/what-is-an-api/).

## Different types of APIs

Websites use APIs which can be either REST or GraphQL. While REST is a vague architectural style based only on conventions, GraphQL is a specification.

REST APIs usually consist of many so-called endpoints, to which you can send your requests. In the responses, you are provided with information about various resources, such as users, products, etc. Examples of typical REST API requests:


```
GET https://api.example.com/users/123
GET https://api.example.com/comments/abc123?limit=100
POST https://api.example.com/orders
```


In a GraphQL API, all requests are `POST` and point to a single URL, typically something like `https://api.example.com/graphql`. To get data, you send along a query in the GraphQL query language, optionally with variables. An example of such a query:


```
query($number_of_repos: Int!) {
  viewer {
    name
    repositories(last: $number_of_repos) {
      nodes {
        name
      }
    }
  }
}
```
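
Sending such a query comes down to a `POST` request with a JSON body. The sketch below only builds the request options and does not send anything; `buildGraphqlRequest` is a hypothetical helper, and the `query`/`variables` body shape follows the common GraphQL-over-HTTP convention, which a particular API may not follow exactly.

```javascript
const query = `
query($number_of_repos: Int!) {
  viewer {
    name
    repositories(last: $number_of_repos) {
      nodes {
        name
      }
    }
  }
}`;

// Hypothetical helper: build options that could be passed to fetch()
const buildGraphqlRequest = (graphqlQuery, variables = {}) => ({
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ query: graphqlQuery, variables }),
});

const options = buildGraphqlRequest(query, { number_of_repos: 3 });
console.log(options.method); // POST
console.log(JSON.parse(options.body).variables); // { number_of_repos: 3 }

// To actually send it: fetch('https://api.example.com/graphql', options)
```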


## Advantages of API scraping



### 1. More reliable

Since the data is coming directly from the site's API, as opposed to the parsing of HTML content based on CSS selectors, it can be relied on more, as it is less likely to change. Typically, websites change their APIs much less frequently than they change the structure/selectors of their pages.

### 2. Configurable

Most APIs accept query parameters such as `maxPosts` or `fromCountry`. These parameters can be mapped to the configuration options of the scraper, which makes creating a scraper that supports various requirements and use-cases much easier. They can also be utilized to filter and/or limit data results.
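
For example, the scraper's input could be translated into query parameters as sketched below. The endpoint `https://api.example.com/posts`, the `buildApiUrl` helper, and the input fields are hypothetical.

```javascript
// Hypothetical helper: map scraper input to the API's query parameters
const buildApiUrl = (input) => {
    const url = new URL('https://api.example.com/posts');
    url.searchParams.set('maxPosts', String(input.maxPosts));
    // Optional filters are added only when the user provided them
    if (input.fromCountry) url.searchParams.set('fromCountry', input.fromCountry);
    return url.href;
};

console.log(buildApiUrl({ maxPosts: 50, fromCountry: 'CZ' }));
// https://api.example.com/posts?maxPosts=50&fromCountry=CZ
console.log(buildApiUrl({ maxPosts: 10 }));
// https://api.example.com/posts?maxPosts=10
```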

### 3. Fast and efficient

Especially for [dynamic pages](https://blog.apify.com/what-is-a-dynamic-page/), in which a headless browser would otherwise be required (it can sometimes be slow and cumbersome), scraping their API can prove to be much quicker and more efficient.

### 4. Easy on the target website

Depending on the website, sending large amounts of requests to their pages could result in a slight performance decrease on their end. By using their API instead, not only does your scraper run better, but it is less demanding of the target website.

## Disadvantages of API scraping



### 1. Sometimes requires special tokens

Many APIs will require the session cookie, an API key, or some other special value to be included within the header of the request in order to receive any data back. For certain projects, this can be a challenge.

### 2. Potential overhead

For complex APIs that require certain headers and/or payloads in order to make a successful request, return encoded data, have rate limits, or that use GraphQL, there can be a slight overhead in figuring out how to utilize them in a scraper.

## Extra challenges



### 1. Different data formats

APIs come in all different shapes and sizes. That means every API will vary in not only the quality of the data that it returns, but also the format that it is in. The two most common formats are JSON and HTML.

JSON responses are ideal, as they can be manipulated in JavaScript code. In general, no serious parsing is necessary, and the data can be filtered and formatted to fit a scraper's dataset schema.

APIs which output HTML generally return the raw HTML of a small component of the page which is already hydrated with data. In these cases, it is still worth using the API, as it is still more efficient than making a request to the entire page; even though the data does still need to be parsed from the HTML response.

### 2. Encoded data

Sometimes, a response will look something like this:


```
{
    "title": "Scraping Academy Message",
    "message": "SGVsbG8hIFlvdSBoYXZlIHN1Y2Nlc3NmdWxseSBkZWNvZGVkIHRoaXMgYmFzZTY0IGVuY29kZWQgbWVzc2FnZSEgV2UgaG9wZSB5b3UncmUgbGVhcm5pbmcgYSBsb3QgZnJvbSB0aGUgQXBpZnkgU2NyYXBpbmcgQWNhZGVteSE="
}
```


The response could also use some other encoding format. This example's `message` has some data encoded in [Base64](https://en.wikipedia.org/wiki/Base64), which is one of the most common encoding types. For testing out Base64 encoding and decoding, you can use https://www.base64encode.org/ and https://www.base64decode.org/. Within a project where Base64 decoding/encoding is necessary, the [Node.js Buffer class](https://nodejs.org/api/buffer.html) can be used like so:


```
const value = 'SGVsbG8hIFlvdSBoYXZlIHN1Y2Nlc3NmdWxseSBkZWNvZGVkIHRoaXMgYmFzZTY0IGVuY29kZWQgbWVzc2FnZSEgV2UgaG9wZSB5b3UncmUgbGVhcm5pbmcgYSBsb3QgZnJvbSB0aGUgQXBpZnkgU2NyYXBpbmcgQWNhZGVteSE=';

const decoded = Buffer.from(value, 'base64').toString('utf-8');

console.log(decoded);
```


## First up

Get started with this course by learning some general knowledge about API scraping in the [General API scraping](https://docs.apify.com/academy/api-scraping/general-api-scraping.md) section! This section will teach you everything you need to know about scraping APIs before moving into more complex sections.


---

# General API scraping

**Learn the benefits and drawbacks of API scraping, how to locate an API, how to utilize its features, and how to work around common roadblocks.**

***

This section will teach you everything you should know about API scraping before moving into the next sections in the **API Scraping** module. Learn how to find APIs, how to use them, how to paginate them, and how to get past some common roadblocks when dealing with them.

Each lesson will prepare you for real-world API scraping, and will help put yet another data extraction technique into your scraping toolbelt.

## Next up

In our [next lesson](https://docs.apify.com/academy/api-scraping/general-api-scraping/locating-and-learning.md), we will take a look at how to locate a website's API endpoints with DevTools, and how to use them. This is your entry point into learning how to scrape APIs.


---

# Dealing with headers, cookies, and tokens

**Learn about how some APIs require certain cookies, headers, and/or tokens to be present in a request in order for data to be received.**

***

Unfortunately, most APIs will require a valid cookie to be included in the `cookie` field within a request's headers in order to be authorized. Other APIs may require special tokens, or other data that validates the request.

Luckily, there are ways to retrieve and set cookies for requests prior to sending them, which will be covered more in-depth within future Scraping Academy modules. The most important things to know at the moment are:

## Cookies

1. For sites that heavily rely on cookies for user verification and request authorization, certain generic requests (such as to the website's main page, or to the target page) will return one or more `set-cookie` headers.
2. The `set-cookie` response headers can be parsed and used as the `cookie` header in the headers of a request. A great package for parsing these values from a response's headers is [set-cookie-parser](https://www.npmjs.com/package/set-cookie-parser). With this package, cookies can be parsed from headers like so:


```
import axios from 'axios';

// import the set-cookie-parser module
import setCookieParser from 'set-cookie-parser';

const getCookie = async () => {
    // make a request to the target site
    const response = await axios.get('https://www.example.com/');

    // parse the cookies from the response
    const cookies = setCookieParser.parse(response);

    // format the parsed data into a usable string
    const cookieString = cookies.map(({ name, value }) => `${name}=${value};`).join(' ');

    // log the final cookie string to be used in a 'cookie' header
    console.log(cookieString);
};

getCookie();
```


## Headers

Other APIs may not require a valid cookie header, but instead will require certain headers to be attached to the request which are typically attached when a user makes a "real" request from a browser. The most commonly required headers are:

* `User-Agent`
* `Referer`
* `Origin`
* `Host`

Headers required by the target API can be configured manually in a manner such as this, and attached to every single request the scraper sends:


```
const HEADERS = {
    // Note the leading space in the second string, which keeps the two parts separated
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko)'
        + ' Chrome/96.0.4664.110 YaBrowser/22.1.0.2500 Yowser/2.5 Safari/537.36',
    Referer: 'https://soundcloud.com',
    // ...
};
```


However, a much better option is to use either a custom implementation of generating random headers for each request, or to use a package such as [got-scraping](https://www.npmjs.com/package/got-scraping) to automatically do this.

With `got-scraping`, generating request-specific headers can be done right within a request with `headerGeneratorOptions`. Specific headers can also be set with the `headers` option:


```
import { gotScraping } from 'got-scraping';

const response = await gotScraping({
    url: 'https://example.com',
    headerGeneratorOptions: {
        browsers: [
            {
                name: 'chrome',
                minVersion: 87,
                maxVersion: 89,
            },
        ],
        devices: ['desktop'],
        locales: ['de-DE', 'en-US'],
        operatingSystems: ['windows', 'linux'],
    },
    headers: {
        'some-header': 'Hello, Academy!',
    },
});
```


## Tokens

For our SoundCloud example, testing the endpoint from the previous section in a tool like [Postman](https://docs.apify.com/academy/tools/postman.md) works perfectly, and returns the data we want; however, when the `client_id` parameter is removed, we receive a **401 Unauthorized** error. Luckily, the Client ID is the same for every user, which means that it is not tied to a session or an IP address (this is based on our own observations and tests). The big downside is that the token being used by SoundCloud changes every few weeks, so it shouldn't be hardcoded. This case is actually quite common, and is not only seen with SoundCloud.

Ideally, this `client_id` should be scraped dynamically, especially since it changes frequently, but unfortunately, the token cannot be found anywhere on SoundCloud's pages. We already know that it's available within the parameters of certain requests though, and luckily, [Puppeteer](https://github.com/puppeteer/puppeteer) offers a way to analyze each response when on a page. It's a bit like using browser DevTools, which you are already familiar with by now, but programmatically instead.

Here is a way you could dynamically scrape the `client_id` using Puppeteer:


```
// import the puppeteer module
import puppeteer from 'puppeteer';

const scrapeClientId = async () => {
    const browser = await puppeteer.launch({ headless: false });
    const page = await browser.newPage();

    // initialize a variable that will eventually hold the client_id
    let clientId = null;

    // handle each response
    page.on('response', async (res) => {
        // try to grab the 'client_id' parameter from each URL
        const id = new URL(res.url()).searchParams.get('client_id') ?? null;

        // if the parameter exists, set our clientId variable to the newly parsed value
        if (id) clientId = id;
    });

    // visit the page
    await page.goto('https://soundcloud.com/tiesto/tracks');

    // wait for a selector that ensures the page has time to load and make requests to its API
    await page.waitForSelector('.profileHeader__link');

    await browser.close();
    console.log(clientId); // log the retrieved client_id
};

scrapeClientId();
```


## Next up

Keep the code above in mind, because we'll be using it in the [next lesson](https://docs.apify.com/academy/api-scraping/general-api-scraping/handling-pagination.md) when paginating through results from SoundCloud's API.


---

# Handling pagination

**Learn about the three most popular API pagination techniques and how to handle each of them when scraping an API with pagination.**

***

When scraping large APIs, you'll quickly realize that most of them limit the number of results they respond with. For some APIs, the max number of results is 5, while for others it's 2000. Either way, they all have something in common: pagination.

If you've never dealt with it before, trying to scrape thousands to hundreds of thousands of items from an API with pagination can be a bit challenging. In this lesson, we'll be discussing a few of the different types of pagination, as well as how to work with them.

## Page-number pagination

The most common and rudimentary form of pagination uses page numbers. Imagine paginating through a typical e-commerce website.

![Amazon pagination](data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAhQAAAC7CAMAAAAKcffFAAABDlBMVEX////d3d3q7O7s7vDo6u319vjz9Pf29/n09fjy8/Xx8vXw8fTv8POjpqv8/Pzs7e/q7O+usbfn6eyNkJVVVVXt7/LbejXv8PLt7/ERERF1dnZcXF1nZ2iVlZjl5uhxcXJZWVng4OG+v8G0tbVkZGXS09SGhofKysu4ubqsra7Pz9DDxceoqKkyMzP4+Pny8vKUlJS5Wyn9+vXY2dnGyM2nqq6MjIxsbGzj4+SwsLI7PDy7vL+dnZ6ZmZmQkJF8fH3QkWT9/f3b3N2BgYFAQUG8Yy7Hx8eioqPprnRgYGDcfzjKzNCqqqt5eXkcHB20t7ulpaYmJicuLi/nqGxISUnoyrMcHBz58uzVnXPLhFPI+UfpAAAFzElEQVR42uzSQREAAAgDoNm/tH9tsIMMBAAAAAAAAAAAAAAAAAAAAAAAAAAAAKDUwCEFTwAAAAAAAGDZudPmpKEojONHFtFCQOtBepNAwtKwUyzKVhZZpKK0tnX//l/EmzrtmAS5MYkv0PPrTDPpPDM3A/8Xnc5QQgghhBBCCCGEEEIIcen07bvnO717ewpc8v3T3d4n3c5844cEiT8SsXj78VRQzce3wL182Uru0nr50u3MH+chYsJHIhbvTkHg9B1wr6Kx3aKv3M5844cEiT8SsXjuciLFRCS3M9+kWLAkIN6iiIpIbme+SdFgURQUBUURWBSPRCS3M7/4IcGiKCgKiiKwKB6KSG5nfvFDgkVRUBQURWBRPBaR3M784ocE6z+JItn/61GE/ERxtDyH0fLEvGwVu7tSFH+mqCjKepqDbQyce4zi6+s3L75ti+LgV4fGNVuODiwk56zQNE2cszy7AI3lzItTdKLj+goAKgrqY7ByHjJv6gl+UaYHNvn6gTv/UBQNnJx3dT0DW5S7MW9RfH7BvRZGscH2FbKCKIpONpsdsoZzBiczmD2+vTgVe6NqD1dQx2lxgiNRFMeMmXkO1QOb9uX/GEUeYIOVWKmyWOczU11fRAulGkCqtKmWYjAbK6jyW7kLMC+toKDpalEYxffPb7ZHEbcrs4nlXto6G7PzuGMmdoxjUBUAmGILLByHHLPhZT4eH6rxeA2vm9n45lMh3v/Und6wT59CcRf2MYrI2e+jaOMqiohqTsWugYMUGgBVPM7iCYxxbSD2QdHM17iYQfWqeyWKgnMbRZfVxVG0WcPTO5DFGuAUAM6xL4yicaPeRpG7bE4UNo80mymNdYwhk+WtUQzyex9FSPldFOOqgXo0inIGOjgCUGXQ9BloCphR4NqMoXsXxQM0i/AeRcLmmGHCQtoye8AYU/L2mdgce9AyA4cB1pxR2B5kMGCVxFBNZNkqcchGiQorsWwiIV8mtisOrY+0f1GsdKV878QSBVfqQBTbAEXUFgtF4dfaQxybUaRwAZBE7S4KWKN8HFwUh8vLjjiKVGXQYz3bTCy2xhzMcAEAG+wIo6gmsBnhUchM07RLOZFQ2TLsiKJfuNe8OdvvKLqoy/c6lig6rSTAzyiqqBmGMYIWdiuYM6Po4wQA9N59FNEu4rn3KDIWRyVWyVhJztmt5rVtJjSTsW4+uwYAWQyBheMQHkWmxoyhmukxmRtkMhdsGcpk5C+WXbl074YtMr/YvyigqGd+/zvFXRSru/d7oVyoYEaRxB5AH6egqD+jADhRde9RpC0WrJG2kZwzU7h5bZsJdbEInIa8eVWfgYXjkBqrptNTxtS0wcppU50pbJJOayy9Xf56Y7nfwyhg0BZHEdOVav6sAlBDrN5GARfYXvWwDhquKj0s5gedvqp4j+LoVyOGVc7yM8k5qxjFgsyytpnIBnsVDgp4UWtjGyych9QYf47+NVOPygyLtUY9MlyG1uz4aMKuKpGjLYyG9X4fo+CEUUBZR8QJACi8Bxjzb5EeI
o4BCvxi8FdLQVTq3qMI/0pltyw/k5yzQZOP1EPbTOQCbwGMEHGaBAvnITU24N/PmRoOj/lxN1WD1cPzLxjuLNmXediFfygKm1kuDzapTgu41rz187bv58/c4pd22yz/PufjHUh2HoLN7meJ1CwV9MPh/zsKb4KPQjzzTQq7RVH83ShSIpLbmW9SKlgUhdconohIbmd+8UOCRVF4jSIkIrmd+SaFgkVRUBQUxT8aRSS4L4rCexQREcntzDcpEiyKwutnSXOR3XKv3M5844cEKUefJfX4qfP6h9zhLrkPdbczf5yHiAkfiXj7/xRn2We7ZM+Sbme+mYcEJ3tG/5+CEEIIIYQQQgghhBBCCCGEEELIj3bpmAYAAIQBGPg3zQ83yY5WQ3nTsEjBUQAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAANEGpIZeHRnBbfwAAAAASUVORK5CYII=)

This implementation makes it fairly straightforward to programmatically paginate through an API, as it pretty much entails incrementing up or down in order to receive the next set of items. The page number is usually provided right in the parameters of the request URL; however, some APIs require it to be provided in the request body instead.
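As a minimal sketch of page-number pagination, the loop below increments a `page` query parameter until a page comes back empty. The endpoint `https://example.com/api/products`, the `page` parameter name, and the `fetchPage` helper are all hypothetical stand-ins for whatever the target API actually uses:

```javascript
// Build the URL for a given page number.
// NOTE: the base URL and the "page" parameter name are hypothetical.
const buildPageUrl = (baseUrl, page) => {
    const url = new URL(baseUrl);
    url.searchParams.set('page', String(page));
    return url.href;
};

// Walk through the pages one by one until an empty page is returned.
// "fetchPage" is any async function that takes a URL and resolves
// to an array of items (an assumption made for this sketch).
const scrapeAllPages = async (baseUrl, fetchPage, maxPages = 50) => {
    const items = [];
    for (let page = 1; page <= maxPages; page++) {
        const pageItems = await fetchPage(buildPageUrl(baseUrl, page));
        if (pageItems.length === 0) break; // no more results
        items.push(...pageItems);
    }
    return items;
};
```

In a real scraper, `fetchPage` would make the HTTP request and parse the response. The `maxPages` guard is there so that a misbehaving API can't keep the loop running forever.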

## Offset pagination

The second most popular pagination technique used is based on using a **limit** parameter along with an **offset** parameter. The **limit** says how many records should be returned in a single request, while the **offset** parameter says how many records should be skipped.

For example, let's say that we have this dataset and an API route to retrieve its items:


```
const myAwesomeDataset = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15];
```


If we were to make a request with the **limit** set to **5** and the **offset** parameter also set to **5**, the API would skip over the first five items and return `[6, 7, 8, 9, 10]`.
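To make this concrete, here is a rough sketch of how a server might apply the two parameters. The `getItems` function is hypothetical and only illustrates the skip-then-take behavior:

```javascript
const myAwesomeDataset = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15];

// Hypothetical handler: skip "offset" records, then return up to "limit" records
const getItems = ({ limit, offset }) => myAwesomeDataset.slice(offset, offset + limit);

console.log(getItems({ limit: 5, offset: 5 })); // [ 6, 7, 8, 9, 10 ]
console.log(getItems({ limit: 5, offset: 10 })); // [ 11, 12, 13, 14, 15 ]
```

To get the next page, the client adds the **limit** to the previous **offset**. An offset past the end of the dataset returns an empty list, which is a common signal to stop paginating.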

## Cursor pagination

Sometimes pagination uses a **cursor** instead of an **offset**. A cursor is a marker of an item in the dataset. It can be a date, a number, or a more or less random string of letters and numbers. A request with a **cursor** parameter results in an API response containing the items that follow the item the cursor points to.

One of the most painful things about scraping APIs with cursor pagination is that you can't skip to, for example, the 5th page. You have to paginate through each page one by one.
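The sketch below imitates a cursor-paginated API in memory, to show why pages have to be visited in order. The dataset, the `getPage` function, and the `nextCursor` field name are made up for illustration. Real APIs name and encode their cursors differently:

```javascript
const dataset = [
    { id: 'a1' }, { id: 'a2' }, { id: 'a3' }, { id: 'a4' }, { id: 'a5' },
];

// Hypothetical API: return up to "limit" items that follow the item the cursor points to
const getPage = ({ cursor = null, limit = 2 }) => {
    const start = cursor === null ? 0 : dataset.findIndex((item) => item.id === cursor) + 1;
    const items = dataset.slice(start, start + limit);
    const hasMore = start + limit < dataset.length;
    // the next cursor is the id of the last returned item, or null when nothing is left
    const nextCursor = hasMore ? items[items.length - 1].id : null;
    return { items, nextCursor };
};

// The client can only learn the next cursor from the previous response,
// so it has to request the pages one after another
const collectAll = () => {
    const all = [];
    let cursor = null;
    do {
        const page = getPage({ cursor });
        all.push(...page.items);
        cursor = page.nextCursor;
    } while (cursor !== null);
    return all;
};

console.log(collectAll().map((item) => item.id)); // [ 'a1', 'a2', 'a3', 'a4', 'a5' ]
```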

> Note: SoundCloud has migrated (https://developers.soundcloud.com/blog/pagination-updates-on-our-api/) over to using cursor-based pagination; however, they did not change the parameter name from **offset** to **cursor**. Always be on the lookout for this type of stuff!

## Using "next page"

In a minute, we're going to create a mini-project which will scrape the first 100 of Tiësto's tracks by keeping a **limit** of 20 and paginating through until we've scraped 100 items.

Luckily for us, SoundCloud's API (and many others) provides a **next\_href** property in each response, which means we don't have to directly deal with setting the **offset** (cursor) parameter:


```
//...
{
    "next_href": "https://api-v2.soundcloud.com/users/141707/tracks?offset=2020-03-13T00%3A00%3A00.000Z%2Ctracks%2C00774168919&limit=20&representation=https%3A%2F%2Fapi-v2.soundcloud.com%2Fusers%2F141707%2Ftracks%3Flimit%3D20",
    "query_urn": null
}
```


This URL can take various different forms, and can be given different names; however, they all generally do the same thing - bring you to the next page of results.
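To see what such a URL carries, the sketch below decodes the query parameters of the `next_href` value from the response above. Only the standard `URL` API is used, and the helper name `describeNextHref` is made up for this example:

```javascript
const nextHref = 'https://api-v2.soundcloud.com/users/141707/tracks?offset=2020-03-13T00%3A00%3A00.000Z%2Ctracks%2C00774168919&limit=20&representation=https%3A%2F%2Fapi-v2.soundcloud.com%2Fusers%2F141707%2Ftracks%3Flimit%3D20';

// Return the decoded pagination parameters of a "next page" URL
const describeNextHref = (href) => {
    const { searchParams } = new URL(href);
    return {
        offset: searchParams.get('offset'),
        limit: searchParams.get('limit'),
    };
};

console.log(describeNextHref(nextHref));
// { offset: '2020-03-13T00:00:00.000Z,tracks,00774168919', limit: '20' }
```

The decoded **offset** is a timestamp combined with an identifier rather than a count of records to skip, which is what you would expect from a cursor.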

## Mini project

First, create a new folder called **pagination-tutorial** and run this command inside of it:


```
# initialize the project, enable ES modules (needed for the
# import syntax used below), and install the puppeteer
# and got-scraping packages
npm init -y && npm pkg set type=module && npm i puppeteer got-scraping
```


Now, make a new file called **scrapeClientId.js**, copying the **client\_id** scraping code from the previous lesson and making a slight modification:


```
// scrapeClientId.js
import puppeteer from 'puppeteer';

// export the function to be used in a different file
export const scrapeClientId = async () => {
    const browser = await puppeteer.launch({ headless: true });
    const page = await browser.newPage();

    let clientId = null;

    page.on('response', async (res) => {
        const id = new URL(res.url()).searchParams.get('client_id') ?? null;
        if (id) clientId = id;
    });

    await page.goto('https://soundcloud.com/tiesto/tracks');
    await page.waitForSelector('.profileHeader__link');
    await browser.close();

    // return the client_id
    return clientId;
};
```


Now, in a new file called **index.js** we'll write the skeleton for our pagination and item-scraping code:


```
// index.js
// we will need gotScraping to make HTTP requests
import { gotScraping } from 'got-scraping';
import { scrapeClientId } from './scrapeClientId.js';

const scrape100Items = async () => {
    // the initial request URL (reassigned later as we paginate)
    let nextHref = 'https://api-v2.soundcloud.com/users/141707/tracks?limit=20&offset=0';

    // create an array for all of our scraped items to live
    const items = [];

    // scrape the client ID with the script from the
    // previous lesson
    const clientId = await scrapeClientId();

    // More code will go here
};
```


Let's now take a step back and think about the condition on which we should continue paginating:

1. If the API responds with a **next\_href** set to **null**, that means that there are no more pages, and that we have scraped all of the possible items and we should stop paginating.
2. If our items list has 100 records or more, we should stop paginating. Otherwise, we should continue until 100+ items have been reached.

With a full understanding of this condition, we can translate it into code:


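```
const scrape100Items = async () => {
    // ...previous code
    // continue making requests until we've reached 100+ items
    while (items.flat().length < 100) {
        // if the last response had no "next_href", there are no more
        // pages, so return what we have and stop paginating
        if (!nextHref) return items.flat();

        // add the "client_id" URL parameter to the nextHref URL
        const nextURL = new URL(nextHref);
        nextURL.searchParams.set('client_id', clientId);

        // make the paginated request and push its results into the
        // in-memory "items" array (SoundCloud returns the tracks
        // under the "collection" key)
        const res = await gotScraping({ url: nextURL.href });
        const json = JSON.parse(res.body);
        items.push(json.collection);

        // queue up the next page's URL
        nextHref = json.next_href;
    }

    // return the items once we've reached 100+ of them
    return items.flat();
};
```

> Note that it's better to add requests to a requests queue rather than processing them in memory. The crawlers offered by Crawlee (https://crawlee.dev/docs/) provide this functionality out of the box.

Here's what the final code looks like: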


```
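// index.js
import { gotScraping } from 'got-scraping';
import { scrapeClientId } from './scrapeClientId.js';

const scrape100Items = async () => {
    let nextHref = 'https://api-v2.soundcloud.com/users/141707/tracks?limit=20&offset=0';
    const items = [];

    const clientId = await scrapeClientId();

    while (items.flat().length < 100) {
        // no "next_href" means there are no more pages
        if (!nextHref) return items.flat();

        const nextURL = new URL(nextHref);
        nextURL.searchParams.set('client_id', clientId);

        const res = await gotScraping({ url: nextURL.href });
        const json = JSON.parse(res.body);
        items.push(json.collection);

        nextHref = json.next_href;
    }

    return items.flat();
};

(async () => {
    // run the function
    const data = await scrape100Items();

    // log the length of the items array
    console.log(data.length);
})();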
```


> We are using the `.flat()` method (https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Array/flat) when returning the **items** array to turn our array of arrays into a single array of items.

Here's what the output of this code looks like:


```
105
```


## Final note

Sometimes, APIs have limited pagination. That means that they limit the total number of results that can appear for a set of pages, or that they limit the pages to a certain number. To learn how to handle these cases, take a look at the https://docs.apify.com/academy/advanced-web-scraping/crawling/crawling-with-search.md article.

## Next up

This is the last lesson in the API scraping tutorial for now, but be on the lookout for more lessons soon to come! Thus far, you've learned how to:

1. Locate API endpoints
2. Understand located API endpoints and their parameters
3. Parse and modify cookies
4. Modify/set headers
5. Farm API tokens using Puppeteer
6. Use paginated APIs

If you'd still like to read more about API scraping, check out the GraphQL scraping course (https://docs.apify.com/academy/api-scraping/graphql-scraping.md)! GraphQL is the king of API scraping.


---

# Locating API endpoints

**Learn how to effectively locate a website's API endpoints, and learn how to use them to get the data you want faster and more reliably.**

***

In order to retrieve a website's API endpoints, as well as other data about them, the **Network** tab within Chrome's (or another browser's) DevTools can be used. This tab allows you to see all of the various network requests being made, and even allows you to filter them based on request type, response type, or by a keyword.

On our target page, we'll open up the Network tab, and filter by request type of `Fetch/XHR`, as opposed to the default of `All`. Next, we'll do some action on the page which causes the request for the target data to be sent, which will enable us to view the request in DevTools. The types of actions that need to be done can vary depending on the website, the type of page, and the type of data being returned. Sometimes, reloading the page is enough, while other times, a button must be clicked, or the page must be scrolled. For our example use case, reloading the page is sufficient.

*Here's what we can see in the Network tab after reloading the page:*

![Network tab results after completing an action on the page which results in the API being called](/assets/images/results-in-network-tab-be10d5fd17e35bf8aafca9b2899cdccd.png)

Let's say that our target data is a full list of Tiësto's uploaded songs on SoundCloud. We can use the **Filter** option to search for the keyword `tracks`, and see if any endpoints have been hit that include that word. Multiple results may still be in the list when using this feature, so it is important to carefully examine the payloads and responses of each request in order to ensure that the correct one is found.

> Filtering requests: To find what we're looking for, we must wisely choose what piece of data (in this case a keyword) we filter by. Think of something that is most likely to be part of the endpoint (in this case the string `tracks`).

After a little bit of digging through the different response values of each request in our filtered list within the Network tab, we can discover this endpoint, which returns a JSON list including 20 of Tiësto's latest tracks:

![Endpoint found in the Network tab](/assets/images/endpoint-found-6c93a91aff4ad378bf5b5b1baceeba3e.png)

## Learning the API

The majority of APIs, especially for popular sites that serve up large amounts of data, are configurable through different parameters, query options, or payload values. A lot of times, an endpoint discovered through the Network tab will reveal at least a few of these options.

Here's what our target endpoint's URL looks like coming directly from the Network tab:


```
https://api-v2.soundcloud.com/users/141707/tracks?representation=&client_id=zdUqm51WRIAByd0lVLntcaWRKzuEIB4X&limit=20&offset=0&linked_partitioning=1&app_version=1646987254&app_locale=en
```


Since our request doesn't have any body/payload, we need to analyze the URL. We can break this URL down into chunks that help us understand what each value does.

![Breaking down the request url into understandable chunks](/assets/images/analyzing-the-url-d13462b4beaa20eb6bab7d8f95091507.png)
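The same breakdown can be done in code. As a small sketch, the built-in `URL` class splits the endpoint from the Network tab into its origin, path, and query parameters:

```javascript
const endpoint = 'https://api-v2.soundcloud.com/users/141707/tracks?representation=&client_id=zdUqm51WRIAByd0lVLntcaWRKzuEIB4X&limit=20&offset=0&linked_partitioning=1&app_version=1646987254&app_locale=en';

const url = new URL(endpoint);

console.log(url.origin); // https://api-v2.soundcloud.com
console.log(url.pathname); // /users/141707/tracks

// Object.fromEntries turns the query string into a plain object
const params = Object.fromEntries(url.searchParams);
console.log(params.limit); // 20
console.log(params.offset); // 0
```

Note that every value in `params` is a string, so `limit` and `offset` need to be converted with `Number()` before doing any arithmetic on them.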

Understanding an API's various configurations helps with creating a game-plan on how to best scrape it, as many of the parameters can be utilized for pagination, or data-filtering. Additionally, these values can be mapped to a scraper's configuration options, which overall makes the scraper more versatile.

Let's say we want to receive all of the user's tracks in one request. Based on our observations of the endpoint's different parameters, we can modify the URL and utilize the `limit` option to return more than twenty songs. The `limit` option is extremely common with most APIs, and allows the person making the request to literally limit the maximum number of results to be returned in the request:


```
https://api-v2.soundcloud.com/users/141707/tracks?client_id=zdUqm51WRIAByd0lVLntcaWRKzuEIB4X&limit=99999
```


By using the ridiculously large number of `99999`, we ensure that all of the user's tracks will be captured in this single request. Luckily, SoundCloud's API has no cap on the `limit` parameter; however, most other APIs enforce a maximum to ensure that hundreds of thousands of results aren't retrieved at one time. For this use case, setting a massive results limit is not much of a risk, as most users don't have a track count over 500 anyway, but receiving too many results at once can result in overflow errors.
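Rather than editing the URL by hand, the parameters can be set programmatically. The helper below is a sketch. The name `buildTracksUrl` and its options object are made up for this example, while the endpoint and parameter names come from the URL above:

```javascript
// Build a tracks URL with a custom "limit", keeping only the
// parameters used in the shortened URL above
const buildTracksUrl = ({ userId, clientId, limit }) => {
    const url = new URL(`https://api-v2.soundcloud.com/users/${userId}/tracks`);
    url.searchParams.set('client_id', clientId);
    url.searchParams.set('limit', String(limit));
    return url.href;
};

console.log(buildTracksUrl({
    userId: 141707,
    clientId: 'zdUqm51WRIAByd0lVLntcaWRKzuEIB4X',
    limit: 99999,
}));
// https://api-v2.soundcloud.com/users/141707/tracks?client_id=zdUqm51WRIAByd0lVLntcaWRKzuEIB4X&limit=99999
```

Exposing `limit` as an option like this is one way to map an API parameter onto a scraper's own configuration.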

## Next up

The next lesson (https://docs.apify.com/academy/api-scraping/general-api-scraping/cookies-headers-tokens.md) will be all about cookies, headers, and tokens, and how they're relevant when scraping an API.


---

# GraphQL scraping

**Dig into the topic of scraping APIs which use the latest and greatest API technology - GraphQL. GraphQL APIs are very different from regular REST APIs.**

***

GraphQL (https://graphql.org/) APIs are different from the regular RESTful (https://www.redhat.com/en/topics/api/what-is-a-rest-api) APIs you're likely familiar with, which means that different methods and tooling are used to scrape them. This course will teach you everything you need to know about GraphQL to scrape an API built with it.

## How do I know if it's a GraphQL API?

In this section, we'll be scraping the GraphQL API of Cheddar (https://www.cheddar.com/). When you visit the website and make a search for anything while your **Network** tab is open, you'll see a request that has been sent to the endpoint **api.cheddar.com/graphql**.

![GraphQL endpoint](data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAWYAAACCCAMAAABCWpIEAAAABGdBTUEAALGPC/xhBQAAAAFzUkdCAK7OHOkAAAHWaVRYdFhNTDpjb20uYWRvYmUueG1wAAAAAAA8eDp4bXBtZXRhIHhtbG5zOng9ImFkb2JlOm5zOm1ldGEvIiB4OnhtcHRrPSJYTVAgQ29yZSA2LjAuMCI+CiAgIDxyZGY6UkRGIHhtbG5zOnJkZj0iaHR0cDovL3d3dy53My5vcmcvMTk5OS8wMi8yMi1yZGYtc3ludGF4LW5zIyI+CiAgICAgIDxyZGY6RGVzY3JpcHRpb24gcmRmOmFib3V0PSIiCiAgICAgICAgICAgIHhtbG5zOmV4aWY9Imh0dHA6Ly9ucy5hZG9iZS5jb20vZXhpZi8xLjAvIj4KICAgICAgICAgPGV4aWY6UGl4ZWxZRGltZW5zaW9uPjEzMDwvZXhpZjpQaXhlbFlEaW1lbnNpb24+CiAgICAgICAgIDxleGlmOlBpeGVsWERpbWVuc2lvbj4zNTg8L2V4aWY6UGl4ZWxYRGltZW5zaW9uPgogICAgICAgICA8ZXhpZjpVc2VyQ29tbWVudD5TY3JlZW5zaG90PC9leGlmOlVzZXJDb21tZW50PgogICAgICA8L3JkZjpEZXNjcmlwdGlvbj4KICAgPC9yZGY6UkRGPgo8L3g6eG1wbWV0YT4KFZoXxQAAAv1QTFRFDmOcKywpzc3NKi0pSUxQmqCmKSwoHyEjKiotAAAAvcbPvMXOKCgsHiAimZ+lLC0xKSwpJicrjpWbgYaMOj1ANjc6NDY5ISIlGhwdEhIURUhMMTM3lJqgHR4gkpifKywvlpyid3uBJCYnT1FOTlBNmJ6kPT9DzM7QREREusPMYWRpT1JOCwwNQEJGR0pOUFNYQEBAQ0NDIiMmWl5iT1FVNjg76+vrbXF27OzsZmtxTE5TOz5Bh42SlZuhyszNiY6UOTo+e4CFl52jQkJCLzAzcXZ6kZedsrrDRklNZmlvfoOJb3N3t8DIi5CWX2Jn5+nrCQsMQkVIU1dbHWygMDE1YmZrE2aesrvEtb7H6enpjJGXhYqPyMrMa290UFJPAb4A3t/ghouRTVBU2tzeucHJc3h+aW1yXGBlVlle5ObpxMTEjZSagIWKSkxQcnd8KSsvXFxci5KXKSoo4OLjZWhtSk1RGx0gg4iOWFtgen6CvLy9MDA1jKm9Im6hQ0ZJICAhMTMyfIGHpLbDVFRTc3RzkpOTGWmfu8PMr7e/Njo9hqa7oamxqLC4qbG5T1FPm7LAra+wNjY54uTnm5ydpKyz1dfZqbnEVVZYY5Ky5+fnb3R4lpeZnrPBioqKJyosrLS92NnapKWmsb7GUYiuO3yoRoKqwMHCwMbKrrzFnp+fg6S61dbVWVtgmZmcMnelaJWze31+xcjLysvKoaKj0dLTxcnLampqusPIw8fKZWVlQX6pLS8tj5CRJnGjxsfGc5u2Z2hoi9+Lio+VgYKBtre45OTkuMHJSElHLXSkdXl9Pj4+qaqsTU5NVtFWW46wHyElnaWsfaG5YWFhOHunG8oOsbO1la6/kqy+MDEvSoWsC3d2Aa0WBAQEd3h4ep+4NDYzl56mS0tLOtEkIMMfbm9uEIdhvO28DG2JBZk7DsEQubq6rrjAa5a0l6+/PbBaGLYuWdwzXNwzGBgYlq+/wsPHRctIddB8K8gtRMtFuuy6mLC/bpi1d523DizNK80pKj828AAAFZpJREFUeNrsmu9PGlsax6vbYw+JzrQWWJEBFAo2qyJVEfzBvYhcghrd1rAvBqsZJakuWMX6YxUT2wZ5a2yNMVPijb4xIdqkq8Ss0RhTSxqvvjJp0hdtmiZ91Zc3+wfsOQNoewWhtmmTvfMdODPPM5w5Mx8fn3MOhwuXRJe+mf7vL3Xua13gMfOYecw8Zh4zj5
nHzGPmMfOYecypMV/k9R10ghkmkeBrBXnBC7E9j5nHzGPmMfOYecw8Zh4zj/lPgNm9g0t64zTJA3damPsAAOp682eXN7SfbpIEtYnuJFF1rJbhdJ4U1wZ13V+PrL4ufgRK4R3hJ0+AjPQxO44cqDh0fAHmmfQwX+3Sm8uows9uoDl9zImqY3Vr0qGDautL+woU3xBzhQOqu2HpMTp19xdgJr1eh2Pe+2tyzA9WWNehoGSHpsMSQYOLZsJu5KORb+npzmIDQ7MLiTFfRUW/HLZRwFKhp25BuN+PME80qS2FZNQLB/otRGESzNHqUqqN6q11Ern2GjkJoa2yrwJyJkldhBXUBHyyn7T2RSCFdwoshQOwnwBOsYIqtKjHSahxElQrnKAaCwgDlNYBon4AlsmJOnESzB37z4GtFlIaObAMD49DjQcQhRAb8I6JqNPDn1c9+wkxaz8JJu/RvFeZIGkwC0jshmBo+f0LtmSWPZpn995EMqdDtFsQ4Xw+emUqnKl7SkuSYS6lChXE+OR4Acz9BQ6o2xDmpoKyXvWTmPcfREWNPDlmVF0ELE+EVJO9UC0l2qAdWAeHFZxp9TyBTaASynsS1q4zmyfrbLBC3VZmuaMFbXb5vgIUVOKmqRZtNahxANuIAeg7TPY2Yl4IGrWDPyfBXAiqa6h6lDQqiQ4NeoLnTm0PEGKjB7SO2JyQIh5qk2DOPnY+2t0kE+VmmmFcDLvhZ18rleyU1yt4NBT0sr8LBGG3n36tvMdO+SICQRHjf7OZGDMgCCC/NTE8YTUQcJiCI2pFc7sC9ELYWBDzOlEU2JNgjlYXgVZ4FwjNZnXv1Yfwl3/BweGY2TEIKWezGIgT18bqhi37ZnOHPB9UizSTCjAGYbWpBogg7OpzgBqoANpqda/VLjR4zOYRkAzzIOpUnuPcjPIEwtyjUXSDfGwMjkN4CwipvsRJQ4uSRjYWom2d93q91rh1wmuZy8DsxjMWK/Auk42wwb0Il5v3aKygD33kt0WafvEmIWZnZWUp2ldTwERAKSjdH0c3acZQa8BA1GtqTZ6bo9VF6GwPx6y6TQ3lvQhzzCwFemKsZcyTJDeLRBqnE5rwRwloUAPbXQVAab0S9FhwJ+F0ANS9EhelqLOs0zdxl0yGuQk9RMsJ5rJcQEUx49uHoIZqTYQ5G2q12XFZ/V6lIs45O/uPmCMbR5GSkpL3eeE1mWAluEkjom73EY18hzLfC4FAJ8lboJ8lzc0QjoCygRECxY/BpEU3SYIR1JuoY95cA4S1Z+RmjFkKKwmr1WoXk8QYQSLMMROa9p+LiPoOmLT2KgE91Var2S7WDNgHTQoU3ajpMoD+eR+2xzAL9WRbwcP+LqvVUZMeZhIYrDCKGd++HpRSY6cw45DN1tqPMZN+ZXa2wv/oNOZlXA7tSiLbAn/kniskOGRnJeyS4JB2S+iAwE/fw5iHDgQSdu8MzBWElLyKAqVNbeJGGk6nQyTvi3mbLUK0S4XZQTTCEcIMmyzINzgcN8eBAVLgVnLMWkAaKL21a3xMbUZ9oQIMOqS2PinRD/OJ1hjmcSc5UNc0ArQDd+QpMVtWyeZ2Pco2jUCLDYNFQ/aZFIkwY9nt2VfiUuKiPG6d8ProwyWzKdh0Ma4FwfbQELMWyduNsCzjFmywNBsQYMzoiF6RnIFZTAFQDwohSTzhMNfmAtAiinknugBwpsSM/kTA0ghhGWjDmE/MEdShwuSYhaB7oh01ICK7UKKvUKCmgEcPyywAjCvi0SwHhKkUGgCQJ43mh3HMV4EBPUE7IJ7Lbdiwoi6AskOqNxHmUYlSKUmihGPhEly8QSU+rYz67h+fVf6eYhZoJqEUvUB8wCt1nHjhhDSt4avoTPMsWbkGpGYF6vA00mhrIvLTa+lxqRCnf0mxA5KxiQYpTjhuxrEsKe5EW0ydKlUnt6lU06rpaWxNq4rRW3VNhVXM7a5FX18x2f6bre5Hz4EVQPN9GrqAKV+RFP/lRBmndT
mqrD/qnuorMJc1Sn80Ztjj+F7fafwwzH+ur45w1yfpPCfmazzmtJPGlZTRfHSNx/y1mK8cYy42XkdlZOoUZneATxrfDrPf+BKVLMa8xeGNlxzmrSjmLe7FR/N5MHd+jnlvMfLqXcZm2LiymfHeXbXGBC43+Njwh6zt8GLmuxl2aJbH/IWY8bTvOJoPQqGQceqoamHXFcqYWfL61jJ2ijaDxsDl5bWjEKubpfcO31YdvV27EUsa5ZOT+Xa7Vqvlf8B15o+7tPb8ycnyeDTPrK+vG6dCxqUlozHj5fa60fiARjlkMfDAyKwfGD/MurOyjqp2tg/j0Vwik+Xk5FzileKnig0NOt1v9z/PzSHjwkJo7zWzFshEmD9kZMwhzMsLC0uHsz4E1x8oYjvjmEcR5kv3eZCpMOtOY96g9/wzIZ0x4F9zZayteRdQ0tjZ8U6tqTDmZ77OXeNuLGlwmPloTon5xifRfDygCw4ZV+YzluiqlcUMf9i4yGxffp9ppENZTz9mZb3MrKJntvikce6kcTI92XqHy9cPuAHdu+gs8MHr+Dju+IiLZhmPOY1o1iXCnPZkOxrNPOcvSRrnxsxzPF/SSPsbOg4zP9BII5oR5/Ni1kWjmceYLma8eILXTzqLOzunpzu5ZZJpvGzC7Y4XUuIrKNG1E9W1EgmfmtPHvHVfWV5SUnLz5k38xiUyJJtvp95+KI+aN2PnTs5z4igzf+c5poVZp5u8NZr3uWTLRf9Z9xXdLs9LqlFuNJczxGNOifn6jRt4sDE5OYoUx4e2vOa5ImWezFc0P7p7O/zv1dFnRbNFYZ8yb/N2+DayfgqG/wt9LobZy3nMY04dzddRPDfo8idlWIo+j9Pj8TTJRkfX55ZGy4VCM/lybs43Nyfenpv7aXGxzx61Aosu1/IBU7Q09M+/8tGcTjSjeG5oyM/P4TiTubk2W64QHb1wzco2XC7X+oxru+aVKxh0Lcueug6Q1f2KCQZccw7Z22Dt6uPHJB/N6SUNxNluz0FpVpYjW6Xk8jF8HGI+5tx9Mccs3R5i0PYqyIT+x875xbR1nQEcXfHpi84JtmxcW8KesS2Bdi0ZlGssRAzhjwDXxDEywcFa4IHEASwTUEUQZCGp6EOxi60hh64TrUKrZetWrUuqNao0tWoTTWhRtSbSFE3qU16aPUTqy/5om/awc+61Mf+y4M7sIZxP4h6fc8/3nXN/97vfPfd+xt+7PbPCakwevT3zXlWV+81cIpHQJwTm/XnzsWMMsyZ6p2NU/fBFbsZTZXk3997PZ66/5b381oe5t6tu5x79cOaXWu3DqqrzudaOXEJ4834wq5y/39DAl7/8r7urXvvwcSI3PJNLfHm7//ynbyZOMrB4O/Hodu78px+zWuJ6Fbv3fVHd398tvHlf3sxBN1QicrpsW883iK92/zWR6F/p/xJXEv25FbzezzDnfocr/bnEStX1xPVX8Tf9/blcf5Pw5v14syoNDfmGrU90rqZutTR4im1VaIhsVswRRSAsAfOxykrB4v/gzQLzwWLOlwKzwCwwCxGYBWaBWYjALDALEZgFZoFZiMD8gmGuUX9Zwujfj7mJ5mleXIRTbGsD04j64xVz03gKZrf3jHr+98mbwjtbdlmdUl/lWk+Wj1h6bGttbNFZNsx1U/eNYNuPuXYw8+K4irmNYzZOOR3gZJiPb+85dEEr11WFc8v/xagjy7dmem7njrXYzpaC1U0h6nEN0FJA7jHS1jMZ31qbXKVlw2xk/gDwAN023agea0d1dqc9MmGfwBr7JWww6owRdPXpwFjfCWBr3I65B3Ee2guYL7datwOxkHpehOVnT3GQDvDCntq154jvQDDvMdIzMWNj+TDbTfU98AP8FfS2MuRnwdgOcNENbmyCHh/UVet0+ijMNUNv1AGTnu2YF02RdmguYLZBfQFIVo63oE0mcnwMU0skHvdNy1kSq0Fs3aAhA+KoPK72TD7lWz31sotliYbGMRwbItlaxF
Q8U5zluQClU3mrGA7R5WuYDpB1hnk+TlMUTTFKl4+wS6NriCxvV5qNkQ0vToZohq4WRmJK2SHE5swqjSvROJHnMD+96NI6XTXg8RjNhMqLWRU/GsHTYYQIdCmY3MQ8BeMdToiuwammxqY9ggaXtqMFzOMt+gLmIZMxsN2b9WTZl3mIadKEAwpzKtLCd01TtxqOAlzdp9yTsYUsejamEJWeLUEj9dCc9hasyn36tiwmAx12hjn1OL1O0TqudMidiMvkwuD8dqXVVbMjgMnsCPVRX36kFPiYEjqJPT1uOc7OFfHkpxclThMdZ0qmDCkvZkdwxA59qFOZhaEV8b6GuRF6RtXG+2Y7254sYK5kvo/YBR0jcCrYUwfePWLzHDbRnZgnMCijIgdG07zqv8F3zaksb9Ag2zYYUzHCMCvYtbw9NpuIqWh1ngQCMaIf6mPNlXrSgBcoWpzLAZpkmNdxh5LCoHmINQmNMsovayNxpSmOmbr4HXQuGyDe/PSi1ILrrVwpWGbMLDbXQxfzzmummsYOXq2Gi2FYwxbo6YSoydPo97/sHwMYbAf1R/5ZJDGgGWBajc1RdorymI9OW4pRtJLsxNyIbjZrfee6Bo2LSx7hRV9I/fx4tplhZgdqy2xiHhhgNiPEWrTqph6/32+JdaKVVHaTCJ6k2ClH64fsDLMddyi56AS7gtLJx+fiKFdqI+lJDfZwzNyzcTU14SHh/PR4bM7cM5ABDJcf8wMA5sLV3jaotUFzlMXmRnCMOaCnCYzedph1wiW/A462wiJf+Ck2sE3qmEurmGvUW6AtmUxObInNGmaU1fWRh5j5sbUpD1dR72Ee1VIIGp0b/MQoMrODtaTxxjL35kt6hrCAmRAeiIaMLkNl3qqytIbKANqzhj4WNALtLrYcsK/jLCli3qqUBWU0hnnM2kj4tMua3cQcaMU+jlmdnoYZ120KEHVOx8uImd28XjoyxxYSb+G1OtDZ4aLLCHCWUWTht64ZrawGQbaoYAGCCf+VKLhn1TBfAQfDzCW4G3M1pc3cYchSWk9ClK0qIlSmmUJsVuLquvRCXP2mgpHSFPdmdie04rhMiZzhxGr59bBEaXXB6hSlNIMRmW4wzGOEhih6NiiNccy9ecxFJS+lpBlf0TDnR/Iu0dRSAXMLU6Xh/PTcHHM7ugkzzneOLtHaMj8FGq7wrWJVnHCRhSw1drKqWuhf4mFMuZGPCrUdR0pd+OuJ2aTa8BRVT8q8BUOd+S6D/KhDyh7PShb/YLGieHhF8auTuZHWIrHlWUpKvWuzpTASKqOrRXPpLdPLy7Snhhz0w7aKudyiJ+ZdbYr2fRvX1raW0AE+FudHeiWVXXI/Z3oHj/mKyVD+I1SCrn3183kP/iXEYLDF8Lzp1Y6JV0fiDZ3ALERgFpiFCMwCs8AsRGAWmIV8N8zfIbPNVNpZYQcovAawNjciTwTsEDvs8T9CLu1l82D6kGEuObPNz0wt+qCIOQJ9z8UcICRWjXiBkI16NK0TIiMQLtnDgbnkzDbH7MWgillV8jhA1+6Gy3aYc2EkqbNFH6DHqKtuA8Xd2lTAbEyfIr5a2ucKLKPjqdWwiHqfh4z7ug8H5pIz2zXQy6JGEuxg1pR8c9A24gZIAsxf0cGkA7yuOjjbC6BchmgBczta6GKQGNBJGXOtcX/ZiRcmNpeU2a6BuV7ogFEjmDUlvxY0JpmHN1+CMRZPjE1QjUd1oMy2FL4pFEiaq0lDp4w4TpROsr5oOFyYS85sM8wjzHPnGWZNqUnDHOQaakpAVxeGS4i922MzXXdiH8PsZR7d1CXL+kOFueTMNsNsYnCnGWZN6UrhFsj8fwTm8QbYB8CJD+pAcU27ikGDyQi18Ew08jyr87BhLi2zzTCjjUVnhjmv5AFHOI/ZA20Tc7CmZxeGc2ds5ks/Um1NZXDEhx7+jYnDhbm0zHYNi7trEOaY80oGIzjymDHMAsmkASd00NW2GzMuEhKPsKUcpQ8PU2zeIi
VktncqoaGYSbNYXZrig71HUR9Paj2WQ/+wfSCZbfFOY6ccSGZbYBYiMAvMQgRmgfnFxVwh5ABFYBaYBWYhZcf8+s1f78/YmZuf7afbn7/9w9+//Vpg3i4LkiQN/6Ti93+5tdn0pzf26Hf6x6zj+7/d1nZHOrOr3zd3//nkyZO73+SrV7VT8/pprba1/+mrhwfzV9LwGz+SpKs/lX6x2SYN79Hxfemjd0LSidPPwfz1P+6q8jfNnz9hp+Zn7MRJ0mvscghJ0r83e74mSbEzn0l/rLh5YuGFx7wghU5XfPDOrWFJ+qRiYVg68a+KjyTpzgd3Pq/46s5CxcIJafhz3u+W9J/2y19VcSCKw4ffg80TpAsMFgO+iNukmMImXUiRKn9QUkRSiCEQbG1WkLsoIli4LFdBufeyxRZ7WfZMYkxx30A9lZlMRvjmzHfOKEY6Fj5FI94Zom2BUEFqRwhHkje7mud4+l7Hn139uF58m4UUI/KxpJWoArT7FKFciE/GnAn3/rM5YRXMUh33oNY+FINOOM2cAYZMNkiQM2hzuNe4wljA4N1rAYv3Q44xK/gVf9a83vVPPzlO/V37D3tBw5xcTmcr4DPRusmdEXlhhUh5j+DmydTIuZZGkibZCL6RxrDBzOA//YnB7GHQpv+Gs7OYICSbZYNpyTtFq0HVYm7jZhuHgl6ivB4V6XBWtN4fs6TWowpCyAfAnMUy5iqYGswVuxMd5jMC3TPJrmu6Jun24cTh008QqcGeY4s6OmPfMB9bBeeanF7uuxZNXbG4YQ55vQ1jDtXvB8BsgdV7gWswOwhePINZUMoiGCDIluVKmBmsUmTEpliuuYrZmP41NU0ggyrL5UTeuobX/o86+teWzlWVqYAejT2uh5/UmN7gZ2k4Frt5iej+MaeAexHwt1DpGMNIGMxYz6E2CsEZYWnxSN0ziFBgKnnivAe34hPAEpc5LkOMOjfT8d/Hx/v74VdLeSElbbFNxJn3Tc/R9t4+olKlptNwlH3/bnbNoR+QLqCW/KvAij2MuAAsBFzhuOLJrm+OiVZmyKY5ILjTKPO6m+4w0+vb4fDWXk9qpRAF9So2t3ddS87rWdpg1mr8ALdAGWc1R61JZ7oZ4vTKmhSzX+xu4tUMzSx5beGqr7eMr0XtKpVKP+j15BlPzE/Mz3hifmK+x/gP59AyW3AmPrcAAAAASUVORK5CYII=)

As a rule of thumb, when the endpoint ends with **/graphql** and it's a **POST** request, it's a 99.99% bulletproof indicator that the target site is using GraphQL. If you want to be 100% certain though, taking a look at the request payload will most definitely give it away.

![GraphQL payload](data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAfkAAACRCAMAAAD3uDFVAAAABGdBTUEAALGPC/xhBQAAAAFzUkdCAK7OHOkAAAHWaVRYdFhNTDpjb20uYWRvYmUueG1wAAAAAAA8eDp4bXBtZXRhIHhtbG5zOng9ImFkb2JlOm5zOm1ldGEvIiB4OnhtcHRrPSJYTVAgQ29yZSA2LjAuMCI+CiAgIDxyZGY6UkRGIHhtbG5zOnJkZj0iaHR0cDovL3d3dy53My5vcmcvMTk5OS8wMi8yMi1yZGYtc3ludGF4LW5zIyI+CiAgICAgIDxyZGY6RGVzY3JpcHRpb24gcmRmOmFib3V0PSIiCiAgICAgICAgICAgIHhtbG5zOmV4aWY9Imh0dHA6Ly9ucy5hZG9iZS5jb20vZXhpZi8xLjAvIj4KICAgICAgICAgPGV4aWY6UGl4ZWxZRGltZW5zaW9uPjE0NTwvZXhpZjpQaXhlbFlEaW1lbnNpb24+CiAgICAgICAgIDxleGlmOlBpeGVsWERpbWVuc2lvbj41MDU8L2V4aWY6UGl4ZWxYRGltZW5zaW9uPgogICAgICAgICA8ZXhpZjpVc2VyQ29tbWVudD5TY3JlZW5zaG90PC9leGlmOlVzZXJDb21tZW50PgogICAgICA8L3JkZjpEZXNjcmlwdGlvbj4KICAgPC9yZGY6UkRGPgo8L3g6eG1wbWV0YT4KiH/5+gAAAv1QTFRFHyEjRERE6OrtmZ+lAAAAmqCmzc3NKiotR0dHSUxQNdPFNNTH5+nsNzg6ISMm5ObpMDE1NczBjJKXIyUn6urqXK/VzMzMh4ySx8fHIScqNs7D5efqKywv293fLC0wHyIllZqgICUnSElMPj9DKmFdM76yuLq8MjM3OTo+e4CFTE5SbG1tJicqRUdKMZiQNTY6IzY3k5OUlpyih4iKLGxoNMO5hIWGNMe73d/hMrWrOz1A3+HkkJacgIWKgYaLNMC1TVBTmJ6kxcfKZWhtxMXFLoF7M7qvgoeMU1ZaNMu+K2VhUFJWSkpKeXl5o6Slam5ygIGDLoaAb3N3YmVpL4+I4ePneXp8ISwtJktKXF9jL4yFVlld1NXXwMHCLHZxYGNmLjAzWVtg1dfaNdDDQkVH19nc5OTlW1tbJT8/bXF2QUJGRklNMTEyJ1JQJkBAMrCmW6zTXFxeT5CwZmtwdnZ3J09NtLS1Y2NkJURELHFtXbDXfoOIk5mfMa6jKFhVNDY5MqmfX2FkZmluIjIzJDs6n5+gKVtYSH+bq6usKThBVVVWVqDDWKXJp6eoztDTyMrMTU1NNs7BjZSaV1dYJT4+d3uBsbGxra6vK2hky8vLvb7A5ujrp6mrMaugi5CWc3d7f3+AZmZnSk1RcnN0KSsuhYqPUFBQUlJTkJGSMJuSLXt2Pml/QEFCNcW4Ii4wMrKpoqKiMaKauLi4uru/SktPLXhyi4yOJzQ8yczOJDk6ODk9UVRYX19fM1BgWajNeX6Cio+UMpyVsLK0JS41b3ByamprVqPGMUpXtri6ycnJNMK2m5+gvb29fH1+MJWOMJSLdnl8IyovJDw7WqrQnJydSEhI5+fnmpqajpWbKmNfKl1aMZ+WCAgIL4mCJkZFEBAQlpaWOFtsQG+FNVZmmJiajo6OlJeZSoOfzM3RVJu+MDA1Kz1HSUlJFxcXLUNPTIqoLkVQMZ6VAwMDOVxvUpe4RnuWOmB0Q3SNO2R34eHhHR0dPGZ6YmJiRHaPROqlU5EPPgAAGj5JREFUeNrsm19IG9kawN2cOeeMYK7D7JDMQxKGBDRGJXQEKblJcapgILaRlBZCFJPFjUR9EcV9MCAVBBHsFYvKlkVr1bYPfSjeZR9UMFwqVLDQB2t9KOjDuixy+9K3Ln24Z/5Y1zYxaW23vev5Hubk+yZnznfO75zv+zJiUdG7Yi76MnLWxv36JkzJU/KUPCV
PyVPylDwlT8lT8pQ8JU/JU/KU/Fc/4WIqZ1P+RB5/LmFOJ5jK55Ciww+UPCVPyVPylDwlT8lT8pQ8JU/JU/KU/F8hXpfti5EPNKpXGHz/zmygIPINgiBsL1UdA/vQ8j7sF0J3NvIBCKFS4X5vgJn6Qqaq9obWutOvWYX18BPsxGEfTi0e6kTJI1CT3mMeF+R+UIYQvPtFj3Loxqchz61y5OLh3vvyhVOTR1VVc8n+Y5QvF07+gtXsXpEm3xugzlXIVElvc2dAtn1C8gkOK3W48+3SKXn3ldk8EzCbD455XIj7IVh/wA29u/QG+QT3icjHe1Y57mYPzk3ex4OdLhxvBDDD4fYd6LcGDFsk3DjR7ofylezkRXIZec2sJ4XtJ/9IPmCY6WlCfrd1e7v/lW4l97eF/hzk1fFn0+VSUIppw3nScYz9oUBUH90h/Yyj0gGOzGYlr/YuhuU4LMuTC3gWQH7QJk3KypoDu3ggteEDaUgGKcxZIahYwKE0sA7mIB+ezUC/D0uuNJTr69ewawKCSawq5OnAasaZxMTT7CturVDjT/RAIm7EAkD2EKXOugZkQjUiy7P+rL38ai88lDA8NRpC3tFodUguw1ltTqeJ9o6e1Zs932ZZO/8VIjBokwa6JpV4WAl5lOsOedEZhgHDVgH4RMbaHoZcLvK+ZP8rYembpSRj/4lhttcJ+dbkD4+2xwzrL8K1jte5yXdKk14oR3zacOUgiLtgN19vjD4RwQMwhNPR7Gfe7a62+nFCia3IYScMdqWf2qAciikRmzTj7IUeDvrvpKA5LHcFQdAHe518Jgf5SdjrkSpImA2BlCuyiDO8Mwp9qhKFbXf8PJbAgPME8nw9B3d+HoCNxTMZosTggHNRwm0gEVJA1l4gobeGp0bjUWyLkpm4oTurz+lUeX5wbs6Rbe2AnwgMFkPfsluJ1dVhlxyuI6cIZwKGrYJEU6v/Z9sKlyvPC68f7N7f3R0WmP0kM7r96rLllTDPMOeShvX7KYYZz0EeAgDT7V7Yho3hGgdwpJEsnaGGyZLzkUGYdd9f0FJsHZ55urwcTlfDXq+r2gbJivbKHuglSAIc9GAbdPYqsXiXLzWxvHwH5iLPY5zKqAmWBHhCPvrAVgerVYVfw7gd+qSc6e8t+WoyCS++4tfIx4mCeRKrrmQnT/azHt91Tw8bEFDMap7Xna3T5nQa8iTU96xyuaN9VC9TXFYgg/B1Wcvzhq2CzLiTJ6HPlpV8y+oqCfHMXlJICswfgntqieT5KpXzTcGwJmtz53k+FCLFjBf6sDFcTMHpGFk6Q+2EZpCYSUzkyPNer4vnsax+FeCUAv3FNuhSc2hUnUOE5yCpHkExR2pBq3lAe2Qu8gNkw8wckV/ZgZJOXm5Tl8ijRuQ85N3YSSi3aeTJ+NUQy9cJ2uzkFe150SHDU6PxQAU6VfK6s9XanE5T4ZFQ78iG/pB8CHTH412DvNWL+fAKIOHhQsCwVahbniu/AqM587xKeY4ZFRgGPUzeJeRfCKMMc23bsNqHGab7hDyvki/HxnAOkABxsnSGiuWnGS+oSOGceT4B8ERvPL7cNeha6OJlG4kBOKqsQDKHikWDvM8cD8oDs9Z4nPMURt4BU3Gsk/eTwc2wU0p8BHm1b1t28pkZLds/NTw1Gg80W9M24oburFmb0ynId4+SHO8YNeckz4EhfAe4pQh2wjAHIgtdMGDYVPLpWcwp108gf014+UIk5Cu3k1pt39KyW/S6wbBe3u4mTT7yxnB4QG7U1tFQ12AKS7A9N3kndKQkc7d1LaG4STFmgzxX7ifOz+Jq0GaQX+PjC9aBO9C5EE7nJS8nHJFFM0kTQ+T0ESUluxzkB8RHkY/ID3wSyFHbRzhbCrYfeqo3pMJzgzBxQ3dWn9Npon2Vevn2hNo+qEB5CPdCIFnhIFGAP2DYVPJBBcAZ7gTyfyQFoUzoZ14JY4T8j0y3XRBaigzrLiJKXvLGcGRFYuo6HqkeshlwbvI
+WHewCCHvdZACHkQJeQgnzHiFpIA1GweXtTOfhkDuxCkI0/nP/AWYItF+EYJM2q8q3Y0QSl1YiuUlv3xEPqiTP2iEIJf3CQVChWwn3VOjUX/VhaEbGs7qc/rM7/C8WlogV7WYMi8c2bQCdLk8zzu8qhfMyxck0Q8a+svdIyuz+7Kgd3jeE9WTJK5tS85tI/Wcq1z31fvnktarxTvbYOGPHOSww9jsjkH8kRIK2RaGcpaGbuM1keGp93gNrjurzun/4O1tjR196be3WoX31UgMTE6ClTPw3n5u7+UXf28f5b4i8rg6/Nf5Q/9iQ/9WR8lT8pQ8JU/JU/KUPCVPyf/tyFM5q/9dVUTlTMk3hlDyZ5c8S+UsCSVPyVPylDwVSp4KJU+FkqdCyVP5m5K/hFQRKwt5zP7Y6NsuGz81Hbt3Hz3M0kNETYU8+HELZfQFyFtGpkW0UchjWtGbwy5TG2jr2L1zpyE/30wZfQHyIsuuIvSMfb5h2exgb25axCmxb1/cZy+Jw+wT0SL2seNLFiQ+6icnveawSw3aZIc3kGWsT5wie0LcI+T7GiwbW8/YuVYLalhn50VL2UYe8vdKb7PsxccjF9l1e8k/n4wTdav0HvvvVvXmrRKTvUO9UXqZrbGzbOVVtuw/N86P3igpaX3G/nK15LsOivYDyd/b0f5PZ0fHWFk5ghpIuBZvEaSbSGxF6PZz9JytRSNzyFJmsdx8jjbHkLi1gZbmDfJ3+1F/M9og7OctlvFK1ErOfIcFLW2gc2wDmupHm+MWtCmifGfe3sqOl9SWtTSV/n7tYsnd82Xs76Ya9uowuXXbdP/J1damUvvtKVPtuVKW7TOx/yqx/3qjtGbYNDxnmr5t/56i/dAzf08j3/M2z6NKEpjn10XUh35sIugOyU+jvfUptDWCWmtrao+ivSZ9j55XvmlAzUuo+SG6T8gPk2M/R3bF3t7dfSTWojL2niUf+bFStqbkv2Utzaa5Nz+U/NZvZ0vtZR0m9TBfMk11VPbVqp+/u3FIvrSpyXSfZbf2H59/86bZRA/9B0f7CAEf0TFuDF8W0RJr0XD+im6x7LROvgaNbGrG6f+xd74hbWVZAD882Mfj3vDk7RIxcdyQxdgFh5BiYprthyylBuLYoqzThMzUD007yoLaitHoosg6aqszs4YRa4cRupTValHHAacj2xaks23ttpbalsEWpTttd2aZwra0uzOw/bDn/ckfk6jNzOwy6L0fkrz7zj3nnvO799zzHpa+wM3Lf5hwzu/e3bRHqL4mjykv5wd386NIfpDH1J2dLWyfl2vGh3zXS5zz1dxnu3cJBW92cXIbPMW90D3Me7hDrft03M7yrlz8VbAjSv49oZn7o3zzK2XAHsY2U/Kjf1JyvXZoF/InhHm+sXBP0V/kywL+5EME/RF/tIm/XXiqqLBwT+FHPF+9n2+ODRHkBXK8eZAvL8vGvC5XeO/yp4VRPqeXP7Gvmc8p4geFOxvueeFQU1Y5kv9ENzoycr1ayNp/qFr3hyvKoih8vC8v6xHXLwifvnkaF0ARkt8ljHLXBaHr/gcHRkZ6i/sZ24wrvFEl12sY7/A8cizAzdo7z7/7OZ7zRfz83nn+aDGfs30/f3KQ7yqc50c/4JsKE8h/zH/eeIIvFrDyK5LJn+LnH13jjzbz81g45LyPCWRww3Ne6MrNvYPke3XH75TrmoVd3BUhi5OrCeEhXjbl9mI1t0+39yTX2H9IIS/sPNB/Sldczp183PT6Y8Y289r+SEJtjxu+efSaUro3ZvPZOfzJfszvSFF4l8fyXTgiZ/suLAKxfouT34fJPoe/L1zns/uVp7r72PHxyJ1rcne28CibPzG/IflebjcmmTeF0zou97ggfMIVC1/lKnf6D3A63W3hk1yO21X2OI/TvaeSb8zidPh1heNeL2Zof5B3eCNKvVR2rkw5sF+MKp1l6gp5v1lOrGWjqzZZ/wu5swjLfK3j8Tkl/Va/rw49dyeTSR5
Jke5tVtbNEUXpuZF4caB0l7Hy7ociH2sK+ZduOfw8Y7BJyFcXjmSg/NwLFuAfL3mI/d8WLBiMPGuMPGuMPGuMPGuMPGubgzz790ZbszHyjDxrjPyq9mrS9c9YzLYE+e26nUk9b+kesqhtAfI7jqd0Ne1gUdsC5LPKU7qK3mJR25Tkt/1U+VvM36lXuftSRpRnsahtzj1/UCG/Tf752l5uW8qIX3F//TWL26bM9r9B8H9Wfn3K7U8zZLduF4vbpiSP+V7L9fBb7o2U27/nfsnCtkkrvIM/ieX4rOvypw6P9oMHGt+7XVDAKrxNXdvHD/e8vfjxKocP9a9x23Uf5OUBHM1jUdsCT3V/1+HT+2fc6XjPTt12FrUtQB7gFcR/QIxf/5zFbIuQZ42RZ42RZ42RZ42RZ42RZ42RZ42RZ42RZ42RZ23TkfffhfbaDOQNdeJ6t2ft0R+GDRR1Sr71BXy138GddNNbXPoOisKL/y+qL2Ppf0F+egI89S8tbeoglKxHpFRSsToorVxf1Q20ukjDawus3MvYmej0ZpYTe1ucqZL1HdgfWWcRl9g3sOWxpXhEsb2dLGCs+r6WfhTkh0jYVLqyMfkFabKU1qynyUy68bNteE2BSkKD7Rk6E52etW1VHrClJy9GZtZMHfWUes3rLrJU93w2b6XNlCwgk/9ell6e/LeHDz9vgG96AN55Al/36fseAPz7eYP+lnjpGUDPlwAfZjUmk28xBl1tELACTISgzkIsmCFbXPVEEq/ikrQ+1aStbuVLE3ARYqkDmHvqoG1gsgapw1Qa9JB6AzgXACIlomSDGWS7HCRjkwDuGxItabEA2KQBCFF5G/ipnOq6Hcqg1acQWXpaWwMhJ5nygqbIN0y8JaApmsO52q/i4Eg8uajTm5UIkdDMYiRAMANYJBd2utqMkh1E1GCp0shDide0hvUSZ8BeKmo+wRghw2YtOGKkg7hXYCxCZX2wHMlPGOasiFmKCijkNUsBl5ckZ0LNkm2KkHvQjbOeLtEchKGJDMkf/uLrW33wZQOA/jI0NFx8phfhTGvP2fOmnn+AQY/roIArjknX+CGMKTdYsSK5YQwxkXaob+u0EzOMU+vsXdOCBQwk6p6fOis7ISpgN/jkVdNGqzq7oZvki3ZTKR1fROngOE7fakauFRYIk26bBWNilPJttZOoP4DAqtTk4JqWE4U6aFWrQbb4NVW/6KKgKRp31vlJSFOUTwzgQRhLNH4qaNPT9nwd9fhXBkCskrO9Y2Jx4SpOMpRPK6LkDUH7GtZnpHtD8qJQfArT7lAwXwuOmXbYLNbYlrZSezJ51dKqPa9ZGpOtG9JauuHwG+xKxnRVag6CtJwZ+SetZvgySt6gv9B3qfUBnNHLO+xY67H/3MLvbb94JWlQiJphLkreQLzuq7QUxomchRZpnUZJ2U83nKQ7KpA/7Y7gLmvzyHc6OpRsT0zgCQCZSSB/g7jdEq5fY0CRawOpSt4Fir5aYosNSgqJlwwviuQuDMXIW4xuN5mLKpLGB+gABnVgElZPL06+Nn7OO5bRFszh9pdkPNPysoLABKS3brYGg1ZR82lZ9nE6Rj4EAU8M7OSAIYW8YilG3jsUszSG45KrJM2SHBKIk1emUy9lRv48Qu7pg54GEPWXb7b+7dixYwY4c0m597yn75u0g7qJKKdKjJhI2ifp0MAAejTuULel1b2qYKof1gT8dDx8D1d3m1Xpnk6Yuxen7h0X0flKC4y5UXoW/VlSF5mcLqCdqNrk/RN1OKnmHXZ1GGg75FPQFDnGUFFnVFGFY8GTxpX64Th5UwL5GcwIMDelkV9RSkufvNfTWy9ZRhiqT2hZXgNqcLRFmO6cj5JXLMUE2jtjlrSUms6SsUX+uRTBgFRGHRRbMiMv6h/cfKcPvr1gftZ6GZ7/0wCXxSj5B/rWi7A622uDSKlvwg1VXrMdY+2aNkA7Zvuryr0lQjtjKyQEkxMBTQDzrc0bIz9kDEPIpIVxzll
jp35cyT6HBY/sMEz60R81pTrkbYuh8CtXd4kvbexX7obvdUyBZ0ycxjCqiiocNqhZjCqaJETeTonZXpseHgtmhTwkk79L/N1Ktp9Ro4pHWVrr+eGSFiwpVJ9CpM4WsWvBiZIHoz0525tEb4UYI68JmAPtMUvpyGuW5jydUAuzdHFRrgRUB0OZVnj/atXjOX/xgv4dPOef3NIfPnwzSh4ON8ifV1LI4xMJwXPe5iROnFpIIkGjL0oeIvHav4oQgnNUBUweQtwx8uIUNRrFJTWMNielSKCCEAQG96iRBGLkx9VHubkOVaUjAEtpYu93EOqqQa5EolFFnRYakZO5pmjYKCaR16Yn++GIks83Emp0gaNK5iHWU0mSi0SnWqp2YvJNZ73FS8i0WfMJpih1G7TgmPG5rULONTilltXklae6Wc1SVMBHF2KWxjpSyWuWapw0iPmonga9MfJSMNOnuotnv+jDr7Pq1dmz8Ts39efXGmSzlcgreVa9ens2fmcSz9r4sVRjSBCwrSpXfIlvUd72OH04Vn1bYwgnCC6oQbdpb3KGgms804TnVKUrchg1RZ01CSbcC6nPijUbvUASRZL4DuGp67/tnF9IW3cUxw8/8Kk/c2XhPrg0SiNNikvAEpDQvEjwZXuwYZKO0nSOLPGhsC6VbowMJtp1NkuaMXywmLxYhQrp9hKJ6YM+JGqFIBjs0x4GtVArgha6toNS2O/+i7nJjal/ytSe78NN7i/3nHPv75PfXy6n2ir7Tskz3ZDWhQ/2s3yuFalN3AloK134/7z79bxIvlLJYGQHI5G8RvHvp/f0pI13NIv/PH2ibB5dbTHbJg9zf5/Q/PmnP75u2/1dffztp6rJQXO16H99d8AbJ/uK9M7kw9pNe3NmpxZx9Zcq27uNB1kBp+5c3aXFF9p7XA8+u7GH8N98+QPu26OQPArJo5A8CsmjkDwKyaOOP/kPO5miXncMyD/Z3sV9HBY/Xqy+qWXUqZl8Z8e6st/vtzfsdMX1a+C+ezBc3kcqmPl8nrnV5/NCprGzxNFw6Ml31zCefVz8Gv1X/Fjt42vYWMj53d6jhbi7ya/si6+l4reB34SjMwU0/+4OJSOVGm4ZhFwgS5Q46g+8kt0pMsDIp4xe8XHo0qEnX/e5ZUfj6cZy8vCwFvkpMrFX8joyVfHbBW4P5C9UpvS7xfkcLAKXvRc3v4dqpgPCMSGSB0/68JOv+0hq9hsrBWiOzkxHeb6P0V5bWwj1wYuVlRlGXyqLRvjgK4V8oY+fXWVW6zzPPvSct8TnMBFyKKeox3sTYu1wz8NAeGiGgU2lHCRj4q4ABNIqI5m8mSMc1w4dARpgvabbQbl5CBhY2RVN8udilI4CmOPUZZUj6TkXNQ4rRqp3B23624x8N6l3E4+mo8QigIPxM2Y5kodEL/VMQL+XGrqYLTcSj3erblmOpBipyTufHgHydXUS+tkkPOFP5WZ008EkQF/oTY5B1wm9vVwWjYZfBxXyyejGJv8ctv7R5TZZLZDtrMnWHpdwMkznrpMsOM7CNQLA3bSkvEL2vbP1c2AcgXlqURnpB/u/GpxQ2vyzzBl73Aq9CWi1Kc33mgnGykeCrONSE+NN5gYMy3IkK+lsCSwW27z6bRKBfHvcwpnjmo7MLoD4J+xmuest3X7SA4N6yHs67OQraCXeS8Mm9XNKkRQjNflR45zlyJB/G4G+LWhO9i3wLxn59e1xXi6LvoRCaEMmHwlGIvwaJPktcSJgul/02GIkQoKlLjamOhXyc8RodBILZHzCFZOczuVSGykSyTdRj9FIluAZtzig3XHL6hUzOd1mkTKXi+QHxNqXjdTvDgrk0/HOUTfVdFQkf0ts0saEH8D7PTu3M/JjUPacciRt8vMG5/jh7+398oge2uAfQjL4JLwwxMgPbZOXy6JvoTH0XCa/MBQOh3PsZC24Ve4zbRCaUUDoNsHxFM4TNqtqNZlMOsgsinUWt9Me7dsRybOun13dxCb
8LpLeiTxDwjTCIrF/khTJSmzQFahidFvMAuY5+YjTdGTOgJ4K5MVMwJa8j/kKjLDzSUa+YqEmR1KMQHokmbw3dfh7+x+LX9dnZwGG1mEzVEZeLotGCq+E3r4g4F9dyEE4DJs6eBlV9/bCDI+N82PUP8V6e5dLP0pAb2DD+iDI5MEsptRXGykMhN45dnkC5q3AGlGADeMdpLSXpyXUUr4zbHFykXbcZ2ZSJIW8bKTq7fUJpx4m4iMtjk5NR5Meq50I5JcF8B1wkrZDl8MyRuoZecmDqrcXIylGUl5BWOb6xXH+8M/wSlZ1r0JvAcIrPB8tkl8PhvjgQ7ksOsuHVoXS10G+kIuEgvwMbPHBlcflEE+Si6z1BgjlsjBuoEZWa5M0TmNF8lPkUTXyXTRuhx6OGrgG6DVwTpMw7BODf7u+S4ymnMTAvMYIMTbJkUQevqJR6buDU4QpAWlKPH5NRy1O6qQK+XuUi8f0cMVDCWu/1cj7ikaw5KFudkWGssHHSpeP4E7OdHPVslyhpCwX1olHDRcuIiyb7lpiWVYJ56Q6ay1ZQ4/TWtMfm4jaVrnnokrZyCYVNuF4Rhx95Ug1ZbVVdeQv3fBplfIC2mrdq7+iJE2d+iNI/iCkkxb0AnkNXabZPXtWpWzcjw7MkdZ/y3oMdm/3pbFhzeLxbkAdb/IoJI9C8igkj0LyKCSPQvIoJI9C8qjdkkd9UMI2j20eySN5FJJHIXkUkkcheRSSRyF5FJJHIXkUkkchedT/rf8AlIKDWVYBzO8AAAAASUVORK5CYII=)

Every GraphQL payload will be a JSON object with a **query** property, and a **variables** property if any variables were provided. If you take a closer look at the full **query** property of this request, you'll notice that it's stringified GraphQL language content.
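
As a rough illustration of that structure, the sketch below assembles such a payload by hand. The query text, the variable names, and the `buildPayload` helper are illustrative assumptions, not something taken from a real request.

```javascript
// A minimal sketch of how a GraphQL request payload is assembled.
// The query text and the variable names below are illustrative only.
const query = `
    query SearchQuery($query: String!, $count: Int!) {
        organization {
            media(query: $query, first: $count) {
                edges {
                    node {
                        title
                    }
                }
            }
        }
    }
`;

// Hypothetical helper that builds the payload object
const buildPayload = (queryText, variables = {}) => {
    const payload = { query: queryText };
    // "variables" is only included when there is something to send
    if (Object.keys(variables).length > 0) payload.variables = variables;
    return payload;
};

// The request body is the JSON-serialized payload, which is why
// the query shows up as one long string in the Network tab
const body = JSON.stringify(buildPayload(query, { query: 'test', count: 10 }));
```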

![Taking a closer look at the payload](/assets/images/stringified-syntax-d8dab2e70acddc32bdf220d05917e527.png)

## Advantages & disadvantages

We already discussed the advantages and disadvantages of API scraping in general in this course's introduction, but because GraphQL is such a different technology, scraping an API built with it comes with its own pros and cons.

### Advantages

1. GraphQL allows you as the developer to choose which fields are returned to you. Not only does this leave you with only the data you want and no unwanted fields, but it is also easier on the target server.

2. Allows access to data that is not readily available natively through the website.

3. Queries are heavily customizable due to features like **fragments**.

### Disadvantages

1. Though it's a fantastic technology with lots of awesome features, it is also more complex to understand.

2. GraphQL [introspection](https://docs.apify.com/academy/api-scraping/graphql-scraping/introspection.md) is disabled on many sites, which makes it more difficult to reap the full benefits of GraphQL.

## Next up

This course section's [next lesson](https://docs.apify.com/academy/api-scraping/graphql-scraping/modifying-variables.md) will discuss how to customize GraphQL queries without ever having to write any GraphQL language.


---

# Custom queries

**Learn how to write custom GraphQL queries, how to pass input values into GraphQL requests as variables, and how to retrieve and output the data from a scraper.**

***

Sometimes, the queries found in the **Network** tab aren't good enough for your use case. Or, perhaps they're even returning more data than what you're after (which can slow down the queries depending on how much data they're giving back). In these situations, it's a good idea to dig a bit deeper into the API and start writing your own custom use-case specific queries.

In this lesson, we're building a scraper which expects a single number (in **hours**) and a **query** string as its input. As output, it should provide data about the first 1000 Cheddar posts published within the last **n** hours which match the provided query. Each **post** object should contain the **title**, the **publishDate** and the **videoUrl** of the post.


```
[
    {
        "title": "FDA Authorizes 1st Breath Test for COVID-19 Infection",
        "publishDate": "2022-04-15T11:58:44-04:00",
        "videoUrl": "https://vod.chdrstatic.com/source%3Dbackend%2Cexpire%3D1651782479%2Cpath%3D%2Ftranscode%2Fb68f8133-3aa9-4c96-ac26-047452bbc9ce%2Ctoken%3D581fd52bb7f634834edca5c201619c014cd21eb20448cf89525bf101ca8a6f64/transcode/b68f8133-3aa9-4c96-ac26-047452bbc9ce/b68f8133-3aa9-4c96-ac26-047452bbc9ce.mp4"
    },
    {
        "...": "..."
    }
]
```


## Project setup

To make sure we're all on the same page, we're going to set up the project together by first creating a folder named **graphql-scraper**. Once navigated to the folder within your terminal, run the following command:


```
npm init -y && npm install graphql-tag puppeteer got-scraping
```


This command will first initialize the project with npm, then install the `puppeteer`, `graphql-tag`, and `got-scraping` packages, which we will need in this lesson.

Finally, create a file called **index.js**. This is the file we will be working in for the rest of the lesson. Because the code uses `import` statements and top-level `await`, also add `"type": "module"` to the generated **package.json**.

## Preparations

If we remember from the last lesson, we need to pass a valid "app token" within the **X-App-Token** header of every single request we make, or else we will be blocked. When testing queries, we copied this value straight from the **Network** tab; however, since this is a dynamic value, we should farm it.

Since we know requests with this header are sent right when the front page is loaded, it can be farmed by visiting the page and intercepting requests in Puppeteer like so:


```
// scrapeAppToken.js
import puppeteer from 'puppeteer';

const scrapeAppToken = async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();

    let appToken = null;

    page.on('response', (res) => {
        // grab the token from the request headers
        const token = res.request().headers()['x-app-token'];

        // if there is a token, remember the first one we see
        if (token && !appToken) appToken = token;
    });

    await page.goto('https://www.cheddar.com/');

    await page.waitForNetworkIdle();

    // close the browser only after the network is idle, so that
    // page.goto() is never interrupted by an early close
    await browser.close();

    // return the apptoken (or null)
    return appToken;
};

export default scrapeAppToken;
```


With this code, we're doing the same exact thing as we did in the previous lesson to grab this header value, except programmatically.

> To learn more about this method of scraping headers and tokens, refer to the [Cookies, headers, and tokens](https://docs.apify.com/academy/api-scraping/general-api-scraping/cookies-headers-tokens.md) lesson of the **General API scraping** section.

Now, we can import this function into our **index.js** and use it to create a `token` variable which will be passed as our **X-App-Token** header when scraping:


```
// index.js

// import the function
import scrapeAppToken from './scrapeAppToken.js';

const token = await scrapeAppToken();
```


## Building the query

First, we'll write a skeleton query where we define which variables we're expecting (from the user of the scraper):


```
query SearchQuery($query: String!, $max_age: Int!) {
    # query will go here
}
```


Also in the previous lesson, we learned that the **media** type is dependent on the **organization** type. This means to get any **media**, it must be wrapped in the **organization** query:


```
query SearchQuery($query: String!, $max_age: Int!) {
  organization {
    media(query: $query, max_age: $max_age, first: 1000) {
      # the fields we want will go here
    }
  }
}
```


Finally, since Cheddar is using [Relay-style cursor pagination](https://relay.dev/graphql/connections.htm#relay-style-cursor-pagination) for their API, we must access the data through the **edges** property, where each **node** is a result item:


```
query SearchQuery($query: String!, $max_age: Int!) {
  organization {
    media(query: $query, max_age: $max_age, first: 1000) {
      edges {
        node {
            # here we will define the fields we want
        }
      }
    }
  }
}
```


The next step is to fill out the fields we'd like back, and we've got our final query!


```
query SearchQuery($query: String!, $max_age: Int!) {
  organization {
    media(query: $query, max_age: $max_age, first: 1000) {
      edges {
        node {
          title # title
          public_at # this will be publishDate
          hero_video {
            video_urls {
              url # the first URL from these results will be videoUrl
            }
          }
        }
      }
    }
  }
}
```


## Making the request

Back in our code, we can import `gql` from `graphql-tag` and use it to store our query:


```
// index.js
import { gql } from 'graphql-tag';
import scrapeAppToken from './scrapeAppToken.js';

const token = await scrapeAppToken();

const GET_LATEST = gql`
    query SearchQuery($query: String!, $max_age: Int!) {
        organization {
            media(query: $query, max_age: $max_age, first: 1000) {
                edges {
                    node {
                        title
                        public_at
                        hero_video {
                            video_urls {
                                url
                            }
                        }
                        thumbnail_url
                    }
                }
            }
        }
    }
`;
```


Alternatively, if you don't want to write your GraphQL queries right within your JavaScript code, you can write them in files using the **.graphql** format, then read them from the filesystem or import them.

> In order to receive nice GraphQL syntax highlighting in these template literals, install the [GraphQL extension for VS Code](https://marketplace.visualstudio.com/items?itemName=GraphQL.vscode-graphql).

Then, we'll take our input and use it to create a **variables** object which will be used for the request:


```
// find posts from the last 48 hours that include the keyword "stocks".
// since we don't have any real input, we'll simulate some input
const testInput = { hours: 48, query: 'stocks' };

// the API takes max_age in seconds (minutes * 60),
// so to calculate this value, we do hours * 60^2
const variables = { query: testInput.query, max_age: Math.round(testInput.hours) * 60 ** 2 };
```
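
If the input comes from a user rather than a hard-coded object, it may be worth validating it before converting. The sketch below wraps the same conversion in a hypothetical `hoursToMaxAge` helper; the helper name and the validation rule are assumptions made for illustration.

```javascript
// Hypothetical helper: converts an "hours" input into seconds for max_age
const hoursToMaxAge = (hours) => {
    if (!Number.isFinite(hours) || hours <= 0) {
        throw new Error('"hours" must be a positive number');
    }
    // same arithmetic as above: whole hours * 60 minutes * 60 seconds
    return Math.round(hours) * 60 ** 2;
};

const exampleVariables = { query: 'stocks', max_age: hoursToMaxAge(48) };
console.log(exampleVariables.max_age); // 172800
```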


The next step is to take the query and the variables and marry them within a `gotScraping()` call, which will return the API response:


```
const data = await gotScraping('https://api.cheddar.com/graphql', {
    // we are expecting a JSON response back
    responseType: 'json',
    // we must use a post request
    method: 'POST',
    // this is where we pass in our token
    headers: { 'X-App-Token': token, 'Content-Type': 'application/json' },
    // here is our query with our variables
    body: JSON.stringify({ query: GET_LATEST.loc.source.body, variables }),
});
```
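
GraphQL APIs commonly report failures inside the response body (in an `errors` array) rather than only through the HTTP status code, so it can help to check for that before using the data. The following is a sketch under that assumption; `unwrapGraphQLResponse` is a hypothetical helper, and the exact error shape returned by this particular API is not confirmed here.

```javascript
// A sketch, assuming the API follows the usual GraphQL response shape:
// { data: {...} } on success and { errors: [{ message }, ...] } on failure.
const unwrapGraphQLResponse = (body) => {
    if (Array.isArray(body?.errors) && body.errors.length > 0) {
        const messages = body.errors.map((error) => error.message).join('; ');
        throw new Error(`GraphQL request failed: ${messages}`);
    }
    if (!body?.data) throw new Error('GraphQL response contains no data');
    return body.data;
};

// Usage with the response from the request above:
// const { organization } = unwrapGraphQLResponse(data.body);
```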


After making the request, all that's left is to format the data to match the expected dataset schema.

## Final code

Here's what our final project looks like:


```
// index.js
import { gql } from 'graphql-tag';
import { gotScraping } from 'got-scraping';
import scrapeAppToken from './scrapeAppToken.js';

// Scrape the token
const token = await scrapeAppToken();

// Define our query
const GET_LATEST = gql`
    query SearchQuery($query: String!, $max_age: Int!) {
        organization {
            media(query: $query, max_age: $max_age, first: 1000) {
                edges {
                    node {
                        title
                        public_at
                        hero_video {
                            video_urls {
                                url
                            }
                        }
                        thumbnail_url
                    }
                }
            }
        }
    }
`;

// Grab our input
const testInput = { hours: 48, query: 'stocks' };

// Calculate and prepare our variables
const variables = { query: testInput.query, max_age: Math.round(testInput.hours) * 60 ** 2 };

// Make the request
const { body: { data: { organization } } } = await gotScraping('https://api.cheddar.com/graphql', {
    responseType: 'json',
    method: 'POST',
    headers: { 'X-App-Token': token, 'Content-Type': 'application/json' },
    body: JSON.stringify({ query: GET_LATEST.loc.source.body, variables }),
});

// Format the data
const result = organization.media.edges.map(({ node }) => ({
    title: node?.title,
    publishDate: node?.public_at,
    // use the first video URL if there is one, otherwise null
    videoUrl: node?.hero_video?.video_urls?.[0]?.url ?? null,
}));

// Log the result
console.log(result);
```



```
// scrapeAppToken.js
import puppeteer from 'puppeteer';

const scrapeAppToken = async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();

    let appToken = null;

    page.on('response', (res) => {
        const token = res.request().headers()['x-app-token'];

        // remember the first token we see
        if (token && !appToken) appToken = token;
    });

    await page.goto('https://www.cheddar.com/');

    await page.waitForNetworkIdle();

    // close the browser only after the network is idle
    await browser.close();

    return appToken;
};

export default scrapeAppToken;
```


## Wrap up

If you've made it this far, that means that you've conquered the king of API scraping - GraphQL, and that you're ready to take on writing scrapers for the majority of websites out there. Nice work!

Take a moment to review the skills you learned in this section:

1. Modifying the variables of copied GraphQL queries
2. Introspecting a GraphQL API
3. Visualizing and understanding a GraphQL API introspection
4. Writing custom queries
5. Dealing with cursor-based relay pagination
6. Writing a GraphQL scraper with custom queries


---

# Introspection

**Understand what introspection is, and how it can help you understand a GraphQL API to take advantage of the features it has to offer before writing any code.**

***

[Introspection](https://graphql.org/learn/introspection/) is when you make a query to the target GraphQL API requesting information about its schema. When done properly, this can provide a whole lot of information about the API and the different **queries** and **mutations** it supports.

Just like when working with regular RESTful APIs in the [General API scraping](https://docs.apify.com/academy/api-scraping/general-api-scraping/locating-and-learning.md) section, it's important to learn a bit about the different available features of the GraphQL API (or at least of the query/mutation) you are scraping before actually writing any code.

Not only does becoming comfortable with and understanding the ins and outs of using the API make the development process easier, but it can also sometimes expose features which will return data you'd otherwise be scraping from a different location.

## Making the query

> **Warning:** The Cheddar website has changed, and the example below no longer works there. Nonetheless, the general approach is still viable on some websites, even though introspection is disabled on most.

In order to perform introspection on our [target website](https://www.cheddar.com), we need to make a request to their GraphQL API with this introspection query using [Insomnia](https://docs.apify.com/academy/tools/insomnia.md) or another HTTP client that supports GraphQL:

> To make a GraphQL query in Insomnia, make sure you've set the HTTP method to **POST** and the request body type to **GraphQL Query**.


```
query {
  __schema {
    queryType {
      name
    }
    mutationType {
      name
    }
    subscriptionType {
      name
    }
    types {
      ...FullType
    }
    directives {
      name
      description
      locations
      args {
        ...InputValue
      }
    }
  }
}
fragment FullType on __Type {
  kind
  name
  description
  fields(includeDeprecated: true) {
    name
    description
    args {
      ...InputValue
    }
    type {
      ...TypeRef
    }
    isDeprecated
    deprecationReason
  }
  inputFields {
    ...InputValue
  }
  interfaces {
    ...TypeRef
  }
  enumValues(includeDeprecated: true) {
    name
    description
    isDeprecated
    deprecationReason
  }
  possibleTypes {
    ...TypeRef
  }
}
fragment InputValue on __InputValue {
  name
  description
  type {
    ...TypeRef
  }
  defaultValue
}
fragment TypeRef on __Type {
  kind
  name
  ofType {
    kind
    name
    ofType {
      kind
      name
      ofType {
        kind
        name
        ofType {
          kind
          name
          ofType {
            kind
            name
            ofType {
              kind
              name
              ofType {
                kind
                name
              }
            }
          }
        }
      }
    }
  }
}
```


Here's what we got back from the request:

![GraphQL introspection request response](/assets/images/introspection-2f8159c4f926e20040ee65bfc4e18eb0.jpg)

The response body of our introspection query contains a whole lot of useful information about the API, such as the data types defined within it, as well as the queries and mutations available for retrieving/changing the data.
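
To get a feel for that structure programmatically, the sketch below lists the field names of a single type from an introspection result. It assumes the response has the shape produced by the query above (`data.__schema.types`); the `listTypeFields` helper and the sample object are hypothetical and hand-written, not real Cheddar data.

```javascript
// A sketch that lists the field names of one type in an introspection result.
// It assumes the response was produced by the introspection query above,
// i.e. that it has the shape { data: { __schema: { types: [...] } } }.
const listTypeFields = (introspection, typeName) => {
    const types = introspection?.data?.__schema?.types ?? [];
    const type = types.find(({ name }) => name === typeName);
    // "fields" is null for scalars, enums, and input objects
    return (type?.fields ?? []).map((field) => field.name);
};

// Tiny hand-written sample, not real Cheddar data
const sample = {
    data: {
        __schema: {
            types: [
                { kind: 'OBJECT', name: 'Organization', fields: [{ name: 'id' }, { name: 'media' }] },
                { kind: 'SCALAR', name: 'String', fields: null },
            ],
        },
    },
};

console.log(listTypeFields(sample, 'Organization')); // [ 'id', 'media' ]
```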

## Understanding the response

An introspection query's response body size will vary depending on how big the target API is. In our case, what we got back is a 27 thousand line JSON response 🤯 If you thought to yourself, "Wow, that's a whole lot to sift through! I don't want to look through that!", you are absolutely right. Luckily for us, there is a fantastic online tool called [GraphQL Voyager](https://graphql-kit.com/graphql-voyager/) (no install required) which can take this massive JSON response and turn it into a digestible visualization of the API.

Let's copy the response to our clipboard by clicking inside of the response body and pressing **CMD** + **A**, then subsequently **CMD** + **C**. Now, we'll head over to [GraphQL Voyager](https://graphql-kit.com/graphql-voyager/) and click on **Change Schema**. In the modal, we'll click on the **Introspection** tab and paste our data into the text area.

![Pasting the introspection](/assets/images/pasting-introspection-78e8ac32a797fcfd7f17f7f1685bbceb.png)

Finally, we can click on **Display** and immediately be shown a visualization of the API:

![GraphQL Voyager API visualization](/assets/images/voyager-interface-b74eff607e4985d5228ec7d08563f909.jpg)

Now that we have this visualization to work off of, it will be much easier to build a query of our own.

## Building a query

In future lessons, we'll be building more complex queries using **dynamic variables** and advanced features such as **fragments**; however, for now let's get our feet wet by using the data we have from GraphQL Voyager to build a query.

Right now, our goal is to fetch the 1000 most recent articles on [Cheddar](https://www.cheddar.com). From each article, we'd like to fetch the **title** and the **publish date**. After a bit of digging through the schema, we've come across the **media** field within the **organization** type, which has both **title** and **public\_at** fields - seems to check out!

![The media field pointing to datatype slugable](/assets/images/media-field-066b5bbc4dccdef44b38495648478deb.jpg)

Cool. Now we know we need to access **media** through the **organization** query. The **media** field also takes in some arguments, of which we will be using the **first** parameter set to **1000**. Let's start writing our query in Insomnia!

![Receiving a suggestion for a field titled edges](/assets/images/edges-suggested-65c22c50bf4e1682ec511f97e0790009.png)

While writing our query, we've hit a slight roadblock - the **media** type doesn't seem to be accepting a **title** field; however, we are being suggested an **edges** field. This signifies that Cheddar is using [cursor-based Relay pagination](https://relay.dev/graphql/connections.htm#relay-style-cursor-pagination), and that what is returned from media is actually a **Connection** type with multiple properties. The **edges** property contains the list of results we're after, and each result lies within a **Node** type accessible within **edges** as **node**. With this knowledge, we can finish writing our query:


```
query {
    organization {
        media(first: 1000) {
            edges {
                node {
                    title
                    public_at
                }
            }
        }
    }
}
```
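
Once a query shaped like this returns data, the results still sit inside the connection's **edges** list. The sketch below shows one way to unwrap them into a plain array; `nodesFromConnection` is a hypothetical helper, and the sample object is hand-written to mirror the query above rather than copied from a real response.

```javascript
// A sketch of unwrapping a Relay-style connection into a plain array.
// The sample object is hand-written to mirror the query above.
const nodesFromConnection = (connection) => (connection?.edges ?? []).map((edge) => edge.node);

const sampleData = {
    organization: {
        media: {
            edges: [
                { node: { title: 'First post', public_at: '2022-04-15T11:58:44-04:00' } },
                { node: { title: 'Second post', public_at: '2022-04-14T09:00:00-04:00' } },
            ],
        },
    },
};

const posts = nodesFromConnection(sampleData.organization.media);
console.log(posts.map((post) => post.title)); // [ 'First post', 'Second post' ]
```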


## Sending the query

Let's send it!

![Unauthorized](/assets/images/unauthorized-e5a911a6290b5515598de42cfb2f8b8a.png)

Oh, okay. That didn't work. But **why**?

Rest assured, nothing is wrong with our query. We are most likely missing an authorization token/parameter. Let's check back on the Cheddar website within our browser to see what types of headers are being sent with the requests there:

![Request headers back on the Cheddar website](/assets/images/cheddar-headers-37014534c6ca4250bc5c28b673373dda.jpg)

The **Authorization** and **X-App-Token** headers seem to be our culprits. Of course these values are dynamic, but for testing purposes we can copy them right from the **Network** tab and use them for our request in Insomnia.

![Successful request](/assets/images/successful-request-81d1fa87c1e58b7456a02376d395e38f.png)

Cool, it worked! Now we know that if we want to scrape this API, we'll likely have to scrape these authorization headers as well in order to not get blocked.

> For more information about cookies, headers, and tokens, refer back to [this lesson](https://docs.apify.com/academy/api-scraping/general-api-scraping/cookies-headers-tokens.md) from the previous section of the **API scraping** course.

## Introspection disabled?

If the target website is smart, they will have introspection disabled. One of the most widely used GraphQL development tools is [Apollo Server](https://www.apollographql.com/docs/apollo-server/), which disables introspection by default in production environments, so these cases are actually quite common.

![Introspection disabled](/assets/images/introspection-disabled-0b524331e3d8505a3e4c2cc6cdc3e39e.png)

In these cases, it is still possible to get some information about the API when using [Insomnia](https://docs.apify.com/academy/tools/insomnia.md) or [Postman](https://docs.apify.com/academy/tools/postman.md), due to the autocomplete that they provide. If we remember from the **Building a query** section of this lesson, we were able to receive autocomplete suggestions when we entered a non-existent field into the query. Though this is not as great as seeing an entire visualization of the API in GraphQL Voyager, it can still be quite helpful.

## Next up

[The next lesson](https://docs.apify.com/academy/api-scraping/graphql-scraping/custom-queries.md)'s code-along project will walk you through how to construct a custom GraphQL query for scraping purposes, how to accept input into it, and how to retrieve and output the data.


---

# Modifying variables

**Learn how to modify the variables of a JSON format GraphQL query to use the API without needing to write any GraphQL language or create custom queries.**

***

In the introduction of this course, we searched for the term **test** on the [Cheddar](https://www.cheddar.com/) website and discovered a request to their GraphQL API. The payload looked like this:


```
{
    "query": "query SearchQuery($query: String!, $count: Int!, $cursor: String) {\n    organization {\n        ...SearchList_organization\n        id\n    }\n    }\n    fragment SearchList_organization on Organization {\n    media(\n        first: $count\n        after: $cursor\n        query: $query\n        recency_weight: 0.6\n        recency_days: 30\n        include_private: false\n        include_unpublished: false\n    ) {\n        hitCount\n        edges {\n        node {\n            _score\n            id\n            ...StandardListCard_video\n            __typename\n        }\n        cursor\n        }\n        pageInfo {\n        endCursor\n        hasNextPage\n        }\n    }\n    }\n    fragment StandardListCard_video on Slugable {\n    ...Thumbnail_video\n    ...StandardTextCard_media\n    slug\n    id\n    __typename\n    }\n    fragment Thumbnail_video on Slugable {\n    original_thumbnails: thumbnails(aspect_ratio: ORIGINAL) {\n        small\n        medium\n        large\n    }\n    sd_thumbnails: thumbnails(aspect_ratio: SD) {\n        small\n        medium\n        large\n    }\n    hd_thumbnails: thumbnails(aspect_ratio: HD) {\n        small\n        medium\n        large\n    }\n    film_thumbnails: thumbnails(aspect_ratio: FILM) {\n        small\n        medium\n        large\n    }\n    square_thumbnails: thumbnails(aspect_ratio: SQUARE) {\n        small\n        medium\n        large\n    }\n    }\n    fragment StandardTextCard_media on Slugable {\n    public_at\n    updated_at\n    title\n    hero_video {\n        duration\n    }\n    description\n    }",
    "variables": { "query": "test","count": 10,"cursor": null },
    "operationName": "SearchQuery"
}
```


We also learned that every GraphQL request payload will have a **query** property, which contains a stringified version of the query, and a **variables** property, which contains any parameters for the query.

If we convert the query field to the `.graphql` format, we can get it nicely formatted with syntax highlighting (this requires a GraphQL extension for your editor):


```
query SearchQuery($query: String!, $count: Int!, $cursor: String) {
    organization {
        ...SearchList_organization
        id
    }
}

fragment SearchList_organization on Organization {
    media(
        first: $count
        after: $cursor
        query: $query
        recency_weight: 0.6
        recency_days: 30
        include_private: false
        include_unpublished: false
    ) {
        hitCount
        edges {
            node {
                _score
                id
                ...StandardListCard_video
                __typename
            }
            cursor
        }
        pageInfo {
            endCursor
            hasNextPage
        }
    }
}

fragment StandardListCard_video on Slugable {
    ...Thumbnail_video
    ...StandardTextCard_media
    slug
    id
    __typename
}

fragment Thumbnail_video on Slugable {
    original_thumbnails: thumbnails(aspect_ratio: ORIGINAL) {
        small
        medium
        large
    }
    sd_thumbnails: thumbnails(aspect_ratio: SD) {
        small
        medium
        large
    }
    hd_thumbnails: thumbnails(aspect_ratio: HD) {
        small
        medium
        large
    }
    film_thumbnails: thumbnails(aspect_ratio: FILM) {
        small
        medium
        large
    }
    square_thumbnails: thumbnails(aspect_ratio: SQUARE) {
        small
        medium
        large
    }
}

fragment StandardTextCard_media on Slugable {
    public_at
    updated_at
    title
    hero_video {
        duration
    }
    description
}
```


If the query provided in the payload you find in the **Network** tab is good enough for your scraper's needs, you don't actually have to go down the GraphQL rabbit hole. Rather, you can change the variables to receive the data you want. For example, right now, our example payload is set up to search for articles matching the keyword **test**. However, if we wanted to search for articles matching **cats** instead, we could do that by changing the **query** variable like so:


```
{
    "...": "...",
    "variables": { "query": "cats","count": 10,"cursor": null }
}
```
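
In a scraper, the same change can be made in code instead of by hand. The sketch below reuses a captured payload with different variables and also derives the payload for the next page from a response's `pageInfo`. The helper names are hypothetical, `capturedPayload` stands in for the JSON copied from the **Network** tab (its query string is abbreviated here), and the pagination part assumes the response contains the `pageInfo` fields requested in the query above.

```javascript
// A sketch of reusing a captured payload with different variables.
// "capturedPayload" stands in for the JSON copied from the Network tab.
const capturedPayload = {
    query: 'query SearchQuery($query: String!, $count: Int!, $cursor: String) { ... }',
    variables: { query: 'test', count: 10, cursor: null },
    operationName: 'SearchQuery',
};

// Return a copy of the payload with some variables replaced
const withVariables = (payload, overrides) => ({
    ...payload,
    variables: { ...payload.variables, ...overrides },
});

// Build the payload for the next page, or return null when there is none.
// Assumes the response includes the pageInfo fields requested in the query.
const nextPagePayload = (payload, pageInfo) => {
    if (!pageInfo?.hasNextPage) return null;
    return withVariables(payload, { cursor: pageInfo.endCursor });
};

const catsPayload = withVariables(capturedPayload, { query: 'cats' });
console.log(catsPayload.variables); // { query: 'cats', count: 10, cursor: null }
```

Copying the payload instead of mutating it keeps the captured original intact, which is convenient when the same query is reused for several searches.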


Depending on the API, doing just this can be sufficient. However, sometimes we want to utilize complex GraphQL features in order to optimize our scrapers or to receive more data than is being provided in the response of the request found in the **Network** tab. This is what we will be discussing in the next lessons.

## Next up

In the [next lesson](https://docs.apify.com/academy/api-scraping/graphql-scraping/introspection.md), we will walk you through how to learn about a GraphQL API before scraping it by using **introspection**.


---

# How to retry failed requests

**Learn how to re-scrape only failed requests in your run.**

***

Requests of a scraper can fail for many reasons. The most common causes are different page layouts or proxy blocking issues ([see how to analyze pages and fix errors](https://docs.apify.com/academy/node-js/analyzing-pages-and-fixing-errors)). Both [Apify](https://apify.com) and [Crawlee](https://crawlee.dev/) allow you to restart your scraper run from the point where it ended, but there is no native functionality to re-scrape only failed requests. Usually, you also want to first analyze the problem, update the code, and build it before trying again.

If you attempt to restart an already finished run, it will likely immediately finish because all the requests in the [request queue](https://crawlee.dev/docs/guides/request-storage) are marked as handled. You need to update the failed requests in the queue to be marked as pending again.

The additional complication is that the [Request](https://crawlee.dev/api/core/class/Request) object doesn't have anything like an `isFailed` property, so we have to approximate it using other fields. Fortunately, we can use the `errorMessages` and `retryCount` properties. Unless the user has explicitly overridden these properties, a failed request has more `errorMessages` than its `retryCount`. That happens because the last error, which no longer causes a retry, is still added to `errorMessages`.

A simplified code example can look like this:


```
// The code is similar for Crawlee-only projects, but it uses a different API
import { Actor } from 'apify';

const REQUEST_QUEUE_ID = 'pFCvCasdvsyvyZdfD'; // Replace with your valid request queue ID
const allRequests = [];
let exclusiveStartId = null;
// List all requests from the queue, we have to do it in a loop because the request queue list is paginated
for (; ;) {
    const { items: requests } = await Actor.apifyClient
        .requestQueue(REQUEST_QUEUE_ID)
        .listRequests({ exclusiveStartId, limit: 1000 });
    allRequests.push(...requests);
    // If we didn't get the full 1,000 requests, we have all and can finish the loop
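    if (requests.length < 1000) {
        break;
    }
    // Otherwise, continue listing after the last request we received
    exclusiveStartId = requests[requests.length - 1].id;
}

// Now we keep only the failed requests (more error messages than retries)
const failedRequests = allRequests.filter((request) => (request.errorMessages?.length || 0) > (request.retryCount || 0));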

// We need to update them 1 by 1 to the pristine state
for (const request of failedRequests) {
    request.retryCount = 0;
    request.errorMessages = [];
    // This tells the request queue to handle it again
    request.handledAt = null;
    await Actor.apifyClient.requestQueue(REQUEST_QUEUE_ID).updateRequest(request);
}

// And now we can resurrect our scraper again; it will only process the failed requests.
```
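
To make the heuristic concrete, the sketch below applies the same comparison to a few hand-written request objects. The sample data and the `isFailedRequest` name are illustrative assumptions; the counts simply follow the behavior described above, where each error is recorded but the last one no longer triggers a retry.

```javascript
// The heuristic from the text as a standalone predicate, applied to
// hand-written sample requests (not real queue data).
const isFailedRequest = (request) => (request.errorMessages?.length || 0) > (request.retryCount || 0);

const sampleRequests = [
    // Succeeded on the first try
    { url: 'https://example.com/a', retryCount: 0, errorMessages: [] },
    // Failed twice, then succeeded on the second retry
    { url: 'https://example.com/b', retryCount: 2, errorMessages: ['timeout', 'timeout'] },
    // Failed on the first try and on all 3 retries
    { url: 'https://example.com/c', retryCount: 3, errorMessages: ['blocked', 'blocked', 'blocked', 'blocked'] },
];

console.log(sampleRequests.filter(isFailedRequest).map((request) => request.url));
// [ 'https://example.com/c' ]
```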


## Resurrect automatically with a free public Actor

Fortunately, you don't need to implement this code into your workflow. [Apify Store](https://apify.com/store) provides the [Rebirth Failed Requests](https://apify.com/lukaskrivka/rebirth-failed-requests) Actor (that is [open-source](https://github.com/metalwarrior665/rebirth-failed-requests)) that does this and more. The Actor can automatically scan multiple runs of your Actors based on filters like `date started`. It can also automatically resurrect the runs after renewing the failed requests. That means you can bring your scrape to its final successful state with a single click on the Run button.


---

# Run Actor and retrieve data via API

**Learn how to run an Actor/task via the Apify API, wait for the job to finish, and retrieve its output data. Your key to integrating Actors with your projects.**

***



The most popular way of [integrating](https://help.apify.com/en/collections/1669769-integrations) the Apify platform with an external project/application is by programmatically running an [Actor](https://docs.apify.com/platform/actors.md) or [task](https://docs.apify.com/platform/actors/running/tasks.md), waiting for it to complete its run, then collecting its data and using it within the project. Follow this tutorial to get an idea of how to approach this. It isn't as complicated as it sounds!

> Remember to check out our [API documentation](https://docs.apify.com/api/v2.md) with examples in different languages and a live API console. We also recommend testing the API with a desktop client like [Postman](https://www.postman.com/) or [Insomnia](https://insomnia.rest).

Apify API offers two ways of interacting with it:

* Synchronously
* Asynchronously

If the Actor being run via API takes 5 minutes or less to complete a typical run, it should be called **synchronously**. Otherwise, if a typical run takes longer than 5 minutes, it should be called **asynchronously**.

## Run an Actor or task

> If you are unsure about the differences between an Actor and a task, you can read about them in the https://docs.apify.com/platform/actors/running/tasks.md documentation. In brief, tasks are pre-configured inputs for Actors.

The API endpoints and usage (for both sync and async) for [Actors](https://docs.apify.com/api/v2.md#tag/ActorsRun-collection/operation/act_runs_post) and [tasks](https://docs.apify.com/api/v2/actor-task-runs-post.md) are essentially the same.

To run, or **call**, an Actor/task, you will need a few things:

* The name or ID of the Actor/task. The name looks like `username~actorName` or `username~taskName`. The ID can be retrieved on the **Settings** page of the Actor/task.

* Your [API token](https://docs.apify.com/platform/integrations.md), which you can find on the **Integrations** page in [Apify Console](https://console.apify.com/account?tab=integrations) (do not share it with anyone!).

* Possibly an input, which is passed in JSON format as the request's **body**.

* Some other optional settings if you'd like to change the default values (such as allocated memory or the build).

The URL of the [POST request](https://developer.mozilla.org/en-US/docs/Web/HTTP/Methods/POST) to run an Actor looks like this:


```
https://api.apify.com/v2/acts/ACTOR_NAME_OR_ID/runs?token=YOUR_TOKEN
```


For tasks, we can switch the path from **acts** to **actor-tasks** and keep the rest the same:


```
https://api.apify.com/v2/actor-tasks/TASK_NAME_OR_ID/runs?token=YOUR_TOKEN
```


If we send a correct POST request to one of these endpoints, the Actor/task will start just as if we had pressed the **Start** button on the Actor's page in [Apify Console](https://console.apify.com).

### Additional settings

We can also add settings for the Actor (which will override the default settings) as additional query parameters. For example, if we wanted to change how much memory the Actor's run should be allocated and which build to run, we could add the `memory` and `build` parameters separated by `&`.


```
https://api.apify.com/v2/acts/ACTOR_NAME_OR_ID/runs?token=YOUR_TOKEN&memory=8192&build=beta
```


This works in almost exactly the same way for both Actors and tasks; however, for tasks, there is no reason to specify a [`build`](https://docs.apify.com/platform/actors/development/builds-and-runs/builds.md) parameter, as a task is already tied to one specific Actor build, which cannot be changed with query parameters.
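
As a rough sketch, the query string can be assembled with the standard `URL` API rather than by hand. The helper name `buildRunUrl` and its option names are made up for illustration and are not part of any Apify library; only the `token`, `memory`, and `build` query parameters come from the text above.

```javascript
// Hypothetical helper (not an Apify API): builds the run URL described above.
// `kind` is 'acts' for Actors or 'actor-tasks' for tasks.
function buildRunUrl({ kind = 'acts', nameOrId, token, memory, build }) {
    const url = new URL(`https://api.apify.com/v2/${kind}/${encodeURIComponent(nameOrId)}/runs`);
    url.searchParams.set('token', token);
    // Optional settings are only added when the caller provides them
    if (memory !== undefined) url.searchParams.set('memory', String(memory));
    if (build !== undefined) url.searchParams.set('build', build);
    return url.toString();
}

console.log(buildRunUrl({
    nameOrId: 'apify~web-scraper',
    token: 'YOUR_TOKEN',
    memory: 8192,
    build: 'beta',
}));
```

Building the URL this way also takes care of escaping values that contain special characters.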

### Input JSON

Most Actors would not be much use if input could not be passed into them to change their behavior. Additionally, even though tasks already have specified input configurations, it is handy to have the ability to overwrite task inputs through the **body** of the POST request.

> The input can technically be any [JSON](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/JSON), and will vary depending on the Actor being run. Ensure that you are familiar with the Actor's input schema while writing the body of the request.

Good Actors have reasonable defaults for most input fields, so if you want to run one of the major Actors from [Apify Store](https://apify.com/store), you usually do not need to provide all possible fields.

Via API, let's quickly try to run [Web Scraper](https://apify.com/apify/web-scraper), which is the most popular Actor on Apify Store at the moment. The full input with all possible fields is [fairly long](https://apify.com/apify/web-scraper?section=example-run), so we will not show it here. Because it has default values for most fields, we can provide a JSON input containing only the fields we'd like to customize. We will send a POST request to the endpoint below and add the JSON as the **body** of the request:


```
https://api.apify.com/v2/acts/apify~web-scraper/runs?token=YOUR_TOKEN
```


Here is how it looks in [Postman](https://www.postman.com/):

![Run an Actor via API in Postman](/assets/images/run-actor-postman-b89097bdd92cf55096e73719086cb847.png)

If we press **Send**, it will immediately return some info about the run. The `status` will be either `READY` (which means that it is waiting to be allocated on a server) or `RUNNING` (99% of cases).

![Actor run info in Postman](/assets/images/run-info-postman-0d11537cf5eeccf8a474cdeab4e8550d.png)

We will later use this **run info** JSON to retrieve the run's output data. This info about the run can also be retrieved with another call to the [Get run](https://docs.apify.com/api/v2/act-run-get.md) endpoint.
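
The same request can be sketched in Node.js with the built-in `fetch`. The function name `startRun` and the injectable `fetchImpl` parameter are illustrative choices, and the sketch assumes the API wraps the run info in a top-level `data` property.

```javascript
// Sketch: start an Actor run with a JSON input and return the run info object.
// `fetchImpl` is injectable only so the function can be tested without network access.
async function startRun(actorNameOrId, token, input, fetchImpl = fetch) {
    const url = `https://api.apify.com/v2/acts/${actorNameOrId}/runs?token=${token}`;
    const response = await fetchImpl(url, {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify(input),
    });
    if (!response.ok) {
        throw new Error(`Starting the run failed with HTTP status ${response.status}`);
    }
    // Assumption: the run info sits under the `data` property of the response
    const { data } = await response.json();
    return data;
}

// Example usage (performs a real API call, so it is left commented out):
// const run = await startRun('apify~web-scraper', 'YOUR_TOKEN', myInput);
// console.log(run.id, run.status);
```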

## JavaScript and Python client

If you are using JavaScript or Python, we highly recommend using the Apify API client ([JavaScript](https://docs.apify.com/api/client/js/), [Python](https://docs.apify.com/api/client/python/)) instead of the raw HTTP API. The client implements smart polling and exponential backoff, which makes calling Actors and getting results efficient.

You can skip most of this tutorial by following this code example that calls Google Search Results Scraper and logs its results:

Node.js:


```
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_API_TOKEN' });

const input = { queries: 'Food in NYC' };

// Run the Actor and wait for it to finish
// .call method waits infinitely long using smart polling
// Get back the run API object
const run = await client.actor('apify/google-search-scraper').call(input);

// Fetch and print Actor results from the run's dataset (if any)
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});
```

Python:



```
from apify_client import ApifyClient
client = ApifyClient(token='YOUR_API_TOKEN')

run_input = {
    "queries": "Food in NYC",
}

# Run the Actor and wait for it to finish
# .call method waits infinitely long using smart polling
# Get back the run API object
run = client.actor("apify/google-search-scraper").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)
```


By using our client, you don't need to worry about choosing between synchronous or asynchronous flow. But if you don't want your code to wait during `.call` (potentially for hours), continue reading below about how to implement webhooks.

## Synchronous flow

If each of your runs lasts less than 5 minutes, you can use a single [synchronous request](https://usergrid.apache.org/docs/introduction/async-vs-sync.html#synchronous). When running **synchronously**, the connection will be held for *up to* 5 minutes.

If your synchronous run exceeds the 5-minute time limit, the response will be a run object containing information about the run and the status of `RUNNING`. If that happens, you need to restart the run asynchronously and wait for the run to finish.

### Synchronous runs with dataset output

Most Actor runs will store their data in the default [dataset](https://docs.apify.com/platform/storage/dataset.md). The Apify API provides **run-sync-get-dataset-items** endpoints for [Actors](https://docs.apify.com/api/v2/act-run-sync-get-dataset-items-post.md) and [tasks](https://docs.apify.com/api/v2/actor-task-run-sync-get-dataset-items-post.md), which allow you to run an Actor and receive the items from the default dataset once the run has finished.

Here is a Node.js example of calling an Actor via the API and logging the dataset items to the console:


```
// Use your favorite HTTP client
import got from 'got';

// Specify your API token
// (find it at https://console.apify.com/account#/integrations)
const myToken = '';

// Start apify/google-search-scraper Actor
// and pass some queries into the JSON body
const response = await got({
    url: `https://api.apify.com/v2/acts/apify~google-search-scraper/run-sync-get-dataset-items?token=${myToken}`,
    method: 'POST',
    json: {
        queries: 'web scraping\nweb crawling',
    },
    responseType: 'json',
});

const items = response.body;

// Log each non-promoted search result for both queries
items.forEach((item) => {
    const { nonPromotedSearchResults } = item;
    nonPromotedSearchResults.forEach((result) => {
        const { title, url, description } = result;
        console.log(`${title}: ${url} --- ${description}`);
    });
});
```


### Synchronous runs with key-value store output

[Key-value stores](https://docs.apify.com/platform/storage/key-value-store.md) are useful for storing files like images, HTML snapshots, or JSON data. The Apify API provides **run-sync** endpoints for [Actors](https://docs.apify.com/api/v2/act-run-sync-post.md) and [tasks](https://docs.apify.com/api/v2/actor-task-run-sync-post.md), which allow you to run a specific Actor or task and receive the output. By default, they return the `OUTPUT` record from the default key-value store.
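
A minimal sketch of such a call, assuming the Actor stores a JSON `OUTPUT` record and finishes within the time limit. The function name `runSyncGetOutput` is hypothetical, and the case where the run outlasts the synchronous limit is deliberately not handled here.

```javascript
// Sketch: run an Actor synchronously and return its OUTPUT record parsed as JSON.
// Assumes the OUTPUT record is JSON; other content types would need response.text()
// or response.arrayBuffer() instead.
async function runSyncGetOutput(actorNameOrId, token, input, fetchImpl = fetch) {
    const url = `https://api.apify.com/v2/acts/${actorNameOrId}/run-sync?token=${token}`;
    const response = await fetchImpl(url, {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify(input),
    });
    if (!response.ok) {
        throw new Error(`Synchronous run failed with HTTP status ${response.status}`);
    }
    return response.json();
}

// Example usage (performs a real API call, so it is left commented out):
// const output = await runSyncGetOutput('ACTOR_NAME_OR_ID', 'YOUR_TOKEN', myInput);
```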

## Asynchronous flow

For runs longer than 5 minutes, the process consists of three steps:

* Run the Actor or task
* Wait for the run to finish
* Collect the data

### Wait for the run to finish

There may be cases where we need to run the Actor and go away. But in any kind of integration, we are usually interested in its output. We have three basic options for how to wait for the Actor/task to finish.

* `waitForFinish` parameter
* Webhooks
* Polling

#### `waitForFinish` parameter

This solution is quite similar to the synchronous flow. To make the POST request wait, add the `waitForFinish` parameter. It can have a value from `0` to `60`, which is the maximum time in seconds to wait (the max value for `waitForFinish` is 1 minute). Knowing this, we can extend the example URL like this:


```
https://api.apify.com/v2/acts/apify~web-scraper/runs?token=YOUR_TOKEN&waitForFinish=60
```


You can also use the `waitForFinish` parameter with the [Get run](https://docs.apify.com/api/v2/actor-run-get.md) endpoint to implement a smarter polling system.

Once again, the final response will be the **run info object**; however, now its status should be `SUCCEEDED` or `FAILED`. If the run exceeds the `waitForFinish` duration, the status will still be `RUNNING`.
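
One way to build on this is to repeat a waiting request until the run reaches a final state. The sketch below assumes the run endpoint lives at `/v2/actor-runs/RUN_ID` and that `ABORTED` and `TIMED-OUT` are final statuses in addition to the two named above; the function name and the `maxRequests` cap are illustrative.

```javascript
// Statuses after which the run will not change anymore.
// ABORTED and TIMED-OUT are included as an assumption beyond the text above.
const FINAL_STATUSES = new Set(['SUCCEEDED', 'FAILED', 'ABORTED', 'TIMED-OUT']);

// Sketch: each request waits up to 60 seconds on the server side,
// so very few requests are needed even for long runs.
async function waitForRunToFinish(runId, token, { maxRequests = 10, fetchImpl = fetch } = {}) {
    const url = `https://api.apify.com/v2/actor-runs/${runId}?token=${token}&waitForFinish=60`;
    for (let attempt = 0; attempt < maxRequests; attempt++) {
        const response = await fetchImpl(url);
        if (!response.ok) {
            throw new Error(`Getting the run failed with HTTP status ${response.status}`);
        }
        const { data: run } = await response.json();
        if (FINAL_STATUSES.has(run.status)) return run;
    }
    throw new Error(`Run ${runId} did not finish within ${maxRequests} requests`);
}
```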

#### Webhooks

If you have a server, [webhooks](https://docs.apify.com/platform/integrations/webhooks.md) are the most elegant and flexible solution for integrations with Apify. You can set up a webhook for any Actor or task, and that webhook will send a POST request to your server after an [event](https://docs.apify.com/platform/integrations/webhooks/events.md) has occurred.

Usually, this event is a successfully finished run, but you can also set a different webhook for failed runs, etc.

![Webhook example](/assets/images/webhook-8b2fcb569631f00cd1bcc8a6db263572.png)

The webhook will send you a pretty complicated [JSON payload](https://docs.apify.com/platform/integrations/webhooks/actions.md), but usually, you would only be interested in the `resource` object within the payload, which is like the **run info** JSON from the previous sections. We can leave the payload template as is for our example since it is all we need.

Once your server receives this request from the webhook, you know that the event happened, and you can ask for the complete data.

> Don't forget to respond to the webhook with a **200** status code! Otherwise, it will ping you again.
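
A receiving endpoint could be sketched as a plain Node.js request handler. The name `handleWebhook` and the `onRunFinished` callback are hypothetical, and the sketch assumes the default payload template, where `resource` carries `id`, `status`, and `defaultDatasetId`.

```javascript
// Sketch of a request handler for Apify webhooks (default payload template assumed).
// `onRunFinished` is a hypothetical callback supplied by your application.
async function handleWebhook(req, res, onRunFinished) {
    // Collect the raw request body
    const chunks = [];
    for await (const chunk of req) chunks.push(chunk);

    let payload;
    try {
        payload = JSON.parse(Buffer.concat(chunks).toString('utf8'));
    } catch {
        res.statusCode = 400;
        res.end('Invalid JSON');
        return;
    }

    // Acknowledge with 200 so the webhook is not sent again
    res.statusCode = 200;
    res.end('OK');

    // `resource` resembles the run info object from the previous sections
    const { resource } = payload;
    await onRunFinished({
        runId: resource.id,
        status: resource.status,
        datasetId: resource.defaultDatasetId,
    });
}
```

To try it, pass the handler to `http.createServer()` from Node's built-in `node:http` module, for example `http.createServer((req, res) => handleWebhook(req, res, onRunFinished))`, and point the webhook URL at that server.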

#### Polling

What if you don't have a server, and the run you'd like to do is much too long to use a synchronous call? In cases like these, periodic **polling** of the run's status is the solution.

When we run the Actor with the usual API call shown above, we will get back a response with the **run info** object. From this JSON object, we can then extract the ID of the Actor run that we just started from the `id` field. Then, we can set an interval that will poll the Apify API (let's say every 5 seconds) by calling the [Get run](https://docs.apify.com/api/v2/actor-run-get.md) endpoint to retrieve the run's status.

Replace the `RUN_ID` in the following URL with the ID you extracted earlier:


```
https://api.apify.com/v2/acts/ACTOR_NAME_OR_ID/runs/RUN_ID
```


Once a status of `SUCCEEDED` or `FAILED` has been received, we know the run has finished and can cancel the interval and finally collect the data.
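
The polling loop might look roughly like this. The function name `pollRunUntilFinished`, the `token` query parameter, and the extra `ABORTED` and `TIMED-OUT` statuses are assumptions added for the sketch; the URL shape and the 5-second interval come from the text above.

```javascript
// Statuses that mean the run is over (ABORTED and TIMED-OUT are assumed extras)
const FINISHED_STATUSES = new Set(['SUCCEEDED', 'FAILED', 'ABORTED', 'TIMED-OUT']);

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Sketch: ask for the run info every `intervalMs` until the run has finished.
async function pollRunUntilFinished(actorNameOrId, runId, token, { intervalMs = 5000, fetchImpl = fetch } = {}) {
    const url = `https://api.apify.com/v2/acts/${actorNameOrId}/runs/${runId}?token=${token}`;
    while (true) {
        const response = await fetchImpl(url);
        if (!response.ok) {
            throw new Error(`Getting the run failed with HTTP status ${response.status}`);
        }
        const { data: run } = await response.json();
        if (FINISHED_STATUSES.has(run.status)) return run;
        // Still READY or RUNNING: wait before asking again
        await sleep(intervalMs);
    }
}
```

A loop with `await` is used here instead of `setInterval` so that a slow response can never overlap with the next request.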

### Collecting the data

Unless you used the synchronous call mentioned above, you will have to make one additional request to the API to retrieve the data.

The **run info** JSON also contains the IDs of the default [dataset](https://docs.apify.com/platform/storage/dataset.md) and [key-value store](https://docs.apify.com/platform/storage/key-value-store.md) that are allocated separately for each run, which is usually everything you need. The fields are called `defaultDatasetId` and `defaultKeyValueStoreId`.

#### Retrieving a dataset

> If you are scraping products, or any list of items with similar fields, the [dataset](https://docs.apify.com/platform/storage/dataset.md) should be your storage of choice. Don't forget, though, that dataset items are immutable. This means that you can only add to the dataset, and not change the content that is already inside it.

To retrieve the data from a dataset, send a GET request to the [dataset items](https://docs.apify.com/api/v2/dataset-items-get.md) endpoint and pass the `defaultDatasetId` into the URL. For a GET request to the default dataset, no token is needed.


```
https://api.apify.com/v2/datasets/DATASET_ID/items
```


By default, it will return the data in JSON format, as an array of items.

You can use plenty of additional parameters. To learn more about them, visit the [API reference](https://docs.apify.com/api/v2/dataset-items-get.md). We will only mention that you can pass a `format` parameter that transforms the response into popular formats like CSV, XML, Excel, RSS, etc.

The items are paginated, which means you can ask only for a subset of the data. Specify this using the `limit` and `offset` parameters. This endpoint has a limit of 250,000 items that it can return per request. To retrieve more, you will need to send more requests incrementing the `offset` parameter.


```
https://api.apify.com/v2/datasets/DATASET_ID/items?format=csv&offset=250000
```
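
Fetching a dataset larger than one page could be sketched as the loop below. It assumes the endpoint answers with a plain JSON array of items, as the earlier Node.js example does when it reads `response.body` directly; the function name `fetchAllItems` is illustrative.

```javascript
// Sketch: download every item of a dataset page by page.
// Assumes the response body is a JSON array of items.
async function fetchAllItems(datasetId, { limit = 250000, fetchImpl = fetch } = {}) {
    const allItems = [];
    let offset = 0;
    while (true) {
        const url = `https://api.apify.com/v2/datasets/${datasetId}/items?format=json&limit=${limit}&offset=${offset}`;
        const response = await fetchImpl(url);
        if (!response.ok) {
            throw new Error(`Getting items failed with HTTP status ${response.status}`);
        }
        const items = await response.json();
        for (const item of items) allItems.push(item);
        // A page shorter than `limit` means there is nothing left to fetch
        if (items.length < limit) break;
        offset += limit;
    }
    return allItems;
}
```

For very large datasets, processing each page as it arrives is likely a better fit than collecting everything in memory.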


#### Retrieving a key-value store

> [Key-value stores](https://docs.apify.com/platform/storage/key-value-store.md) are mainly useful if you have a single output or any kind of files that cannot be [stringified](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/JSON/stringify) (such as images or PDFs).

When you want to retrieve something from a key-value store, the `defaultKeyValueStoreId` is *not* enough. You also need to know the name (or **key**) of the record you want to retrieve.

If you have a single output JSON, the convention is to return it as a record named `OUTPUT` to the default key-value store. To retrieve the record's content, call the [get record](https://docs.apify.com/api/v2/key-value-store-record-get.md) endpoint.


```
https://api.apify.com/v2/key-value-stores/STORE_ID/records/RECORD_KEY
```


If you don't know the keys (names) of the records in advance, you can retrieve just the keys with the [get list of keys](https://docs.apify.com/api/v2/key-value-store-keys-get.md) endpoint.

Keep in mind that you can get a maximum of 1000 keys per request, so you will need to paginate over the keys using the `exclusiveStartKey` parameter if you have more than 1000 keys. To do this, after each call, take the last record key and provide it as the `exclusiveStartKey` parameter. You can do this until you get 0 keys back.


```
https://api.apify.com/v2/key-value-stores/STORE_ID/keys?exclusiveStartKey=myLastRecordKey
```
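
The key pagination described above could be sketched as follows. The response shape (`data.items`, each with a `key` field) is an assumption, and `listAllKeys` is an illustrative name rather than an Apify function.

```javascript
// Sketch: collect all record keys of a key-value store.
// Assumes each response looks like { data: { items: [{ key: '...' }, ...] } }.
async function listAllKeys(storeId, { fetchImpl = fetch } = {}) {
    const keys = [];
    let exclusiveStartKey;
    while (true) {
        const url = new URL(`https://api.apify.com/v2/key-value-stores/${storeId}/keys`);
        if (exclusiveStartKey !== undefined) {
            url.searchParams.set('exclusiveStartKey', exclusiveStartKey);
        }
        const response = await fetchImpl(url.toString());
        if (!response.ok) {
            throw new Error(`Listing keys failed with HTTP status ${response.status}`);
        }
        const { data } = await response.json();
        // Zero keys back means the previous page was the last one
        if (data.items.length === 0) break;
        for (const { key } of data.items) keys.push(key);
        // Continue after the last key received so far
        exclusiveStartKey = keys[keys.length - 1];
    }
    return keys;
}
```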


---

# Tutorials on Apify Actors

**Learn how to deploy your API project to the Apify platform.**

***

This tutorial shows you how to add your existing RapidAPI project to Apify, giving you access to managed hosting, data storage, and a broader user base through Apify Store while maintaining your RapidAPI presence.

* https://docs.apify.com/academy/apify-actors/adding-rapidapi-project.md


---

# Adding your RapidAPI project to Apify

If you've published an API project on [RapidAPI](https://rapidapi.com/), you can expand your project's visibility by listing it on Apify Store. This gives you access to Apify's developer community and ecosystem.

***

## Why add your API project to Apify

By publishing your API project on Apify, you'll reach thousands of active users in Apify Store. You'll also get access to the Apify platform's infrastructure: managed hosting, data storage, scheduling, advanced web scraping and crawling capabilities, and integrated proxy management. These tools help you reach more users and enhance your API's functionality.

## Step-by-step guide

The approach is demonstrated on an app built on top of [Express.js](https://expressjs.com/), but with a few adaptations to the code, any API framework will work.

You'll deploy your API as an [Actor](https://apify.com/actors) - a serverless cloud program that runs on the Apify platform. Actors can handle everything from simple automation to running web servers.

### Prerequisites

You’ll need an [Apify account](https://console.apify.com/sign-in) - *it’s free and no credit card is required*. For simple migration and deployment, we recommend installing the Apify CLI:


```
curl -fsSL https://apify.com/install-cli.sh | bash
```


For other ways to install the CLI, check the [installation guide](https://docs.apify.com/cli/docs/installation), which has more details and all the options.

### Step 1: Initialize the Actor structure

Once you have the Apify CLI, run the following command:


```
apify init
```


The command sets up an Actor project in your current directory by creating `actor.json` (Actor configuration) and storage files (Dataset and Key-value store).

### Step 2: Add Actor logic

The initialization of the Actor is the first important thing. The second is the correct mapping of the PORT. Check the following example for inspiration:


```
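import path from 'node:path';
import { fileURLToPath } from 'node:url';

import { Actor } from 'apify';
import express from 'express';

await Actor.init(); // Initializes the Actor

// Top-level await implies an ES module, where __dirname must be derived manually
const __dirname = path.dirname(fileURLToPath(import.meta.url));

const app = express();
const PORT = Actor.config.get('containerPort'); // Specifies the PORT
const DATA_FILE = path.join(__dirname, 'data', 'items.json');

app.use(express.json());

// Rest of the logic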
```


Readiness checks

The Apify platform performs readiness checks by sending GET requests to `/` with the `x-apify-container-server-readiness-probe` header. For better resource efficiency, consider checking for this header and returning a simple response early, rather than processing it as a full request. This optimization is particularly useful for resource-intensive Actors.


```
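// Probes are sent to `/`, so it is enough to register the check on that path
app.get('/', (req, res, next) => {
    if (req.headers['x-apify-container-server-readiness-probe']) {
        console.log('Readiness probe');
        res.send('Hello, readiness probe!\n');
        return;
    }
    // Not a readiness probe: hand the request over to the regular routes
    next();
});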
```


### Step 3: Test your Actor locally

Once you’ve added the Actor logic, test your Actor locally with the following command:


```
apify run
```


Now, check that your server is running. Check one of your endpoints, for example `/health`.

### Step 4: Deploy your Actor to Apify

Now push your Actor to [Apify Console](https://console.apify.com/). You’ll be able to do this only if you’re logged in to your Apify account with the CLI. Run `apify info` to check, and if you’re not logged in yet, run `apify login`. This only needs to be done once. To push your project, run the following command:


```
apify push
```


### Step 5: Run your Actor

After pushing your Actor to the platform, in the terminal you’ll see an output similar to this:


```
2025-10-03T07:57:13.671Z ACTOR: Build finished.
Actor build detail https://console.apify.com/actors/a0c...
Actor detail https://console.apify.com/actors/aOc...
Success: Actor was deployed to Apify cloud and built there.
```


You can click the **Actor detail** link, or go to **Apify Console > My Actors**, and click on your Actor. Now, click on the **Settings** tab, and enable **Actor Standby**:

![Standby Actor](/assets/images/standby-46f0cc8b9b154e5a15f88cf43aa24005.png)

Two modes of Actors

Actors can run in two modes: as batch processing jobs that execute a single task and stop, or in **Standby mode** as a web server. For use cases like deploying an API that needs to respond to incoming requests in real-time, Standby mode is the best choice. It keeps your Actor running continuously and ready to handle HTTP requests like a standard web server.

Once you’ve saved the settings, go to the **Standby** tab, and click the **Test endpoint** button. It will start the Actor, and you can test it. Once the Actor is running, you're done with the migration!

## Next steps

Ready to monetize your Actor and start earning? Check out these guides:

* https://docs.apify.com/platform/actors/publishing/monetize
* https://docs.apify.com/platform/actors/publishing/publish

You can also extend your Actor with custom logic and leverage additional Apify platform features, such as storage or web scraping capabilities.


---

# Introduction to the Apify platform

**Learn all about the Apify platform, all of the tools it offers, and how it can improve your overall development experience.**

***

The [Apify platform](https://apify.com) was built to serve large-scale and high-performance web scraping and automation needs. It provides easy access to compute instances ([Actors](https://docs.apify.com/academy/getting-started/actors.md)), convenient request and result storages, proxies, scheduling, webhooks and more - all accessible through the **Console** web interface, the [API](https://docs.apify.com/api/v2.md), or our [JavaScript](https://docs.apify.com/api/client/js) and [Python](https://docs.apify.com/api/client/python) API clients.

## Category outline

In this category, you'll learn how to become an Apify platform developer from the ground up. From creating your first account, to developing Actors, this is your one-stop-shop for understanding how the platform works, and how to work with it.

## First up

We'll start off this category light, by showing you how to create an Apify account and get everything ready for development with the platform. https://docs.apify.com/academy/getting-started.md


---

# Using ready-made Apify scrapers

**Discover Apify's ready-made web scraping and automation tools. Compare Web Scraper, Cheerio Scraper and Puppeteer Scraper to decide which is right for you.**

***

Scraping and crawling the web can be difficult and time-consuming without the right tools. That's why Apify provides ready-made solutions to crawl and scrape any website. They are based on our [Actors](https://apify.com/actors), the [Apify SDK](https://docs.apify.com/sdk/js) and [Crawlee](https://crawlee.dev/).

Don't let the number of options confuse you. Unless you're really sure you need to use a specific tool, go ahead and use **Web Scraper** (https://docs.apify.com/academy/apify-scrapers/web-scraper.md). It is the easiest to pick up and can handle almost anything. Look at **Puppeteer Scraper** (https://docs.apify.com/academy/apify-scrapers/puppeteer-scraper.md) or **Cheerio Scraper** (https://docs.apify.com/academy/apify-scrapers/cheerio-scraper.md) only after you know your target websites well and need to optimize your scraper.

https://docs.apify.com/academy/apify-scrapers/getting-started.md

## Web Scraper

Web Scraper is a ready-made solution for scraping the web using the Chrome browser. It takes away all the work necessary to set up a browser for crawling, controls the browser automatically and produces machine-readable results in several common formats.

Underneath, it uses the Puppeteer library to control the browser, but you don't need to worry about that. Using a web UI and a little basic JavaScript, you can tweak it to serve almost any scraping need.

https://docs.apify.com/academy/apify-scrapers/web-scraper.md

## Cheerio Scraper

Cheerio Scraper is a ready-made solution for crawling the web using plain HTTP requests to retrieve HTML pages and then parsing and inspecting the HTML using the [Cheerio](https://www.npmjs.com/package/cheerio) library. It's blazing fast.

Cheerio is a server-side version of the popular jQuery library that does not run in the browser but instead constructs a DOM out of an HTML string and then provides the user an API to work with that DOM.

Cheerio Scraper is ideal for scraping websites that do not rely on client-side JavaScript to serve their content. It can be as much as 20 times faster than using a full-browser solution like Puppeteer.

https://docs.apify.com/academy/apify-scrapers/cheerio-scraper.md

## Puppeteer Scraper

Puppeteer Scraper is the most powerful scraper tool in our arsenal (aside from developing your own Actors). It uses the Puppeteer library to programmatically control a headless Chrome browser, and it can make it do almost anything. If using Web Scraper does not cut it, Puppeteer Scraper is what you need.

Puppeteer is a Node.js library, so knowledge of Node.js and its paradigms is expected when working with Puppeteer Scraper.

https://docs.apify.com/academy/apify-scrapers/puppeteer-scraper.md


---

# Scraping with Cheerio Scraper

This scraping tutorial will go into the nitty gritty details of extracting data from **https://apify.com/store** using **Cheerio Scraper** (https://apify.com/apify/cheerio-scraper). If you arrived here from the [Getting started](https://docs.apify.com/academy/apify-scrapers/getting-started.md) tutorial, great! You are ready to continue where we left off. If you haven't seen the Getting started tutorial yet, check it out. It will help you learn about Apify and scraping in general and set you up for this tutorial, because this one builds on topics and code examples discussed there.

## Getting to know our tools

In the [Getting started](https://docs.apify.com/academy/apify-scrapers/getting-started.md) tutorial, we've confirmed that the scraper works as expected, so now it's time to add more data to the results.

To do that, we'll be using the [Cheerio](https://github.com/cheeriojs/cheerio) library. This may not sound familiar, so let's try again. Does [jQuery](https://jquery.com/) ring a bell? If it does, you're in luck, because Cheerio is like jQuery that doesn't need an actual browser to run. Everything else is the same. All the functions you already know are there and even the familiar `$` is used. If you still have no idea what either of those are, don't worry. We'll walk you through using them step by step.

> Visit the [Cheerio repository](https://github.com/cheeriojs/cheerio) to learn more about it.

Now that's out of the way, let's open one of the Actor detail pages in the Store, for example the **Web Scraper** (https://apify.com/apify/web-scraper) page, and use our DevTools-Fu to scrape some data.

> If you're wondering why we're using Web Scraper as an example instead of Cheerio Scraper, it's only because we didn't want to triple the number of screenshots we needed to make. Lazy developers!

## Building our Page function

Before we start, let's do a quick recap of the data we chose to scrape:

1. **URL** - The URL that goes directly to the Actor's detail page.
2. **Unique identifier** - Such as **apify/web-scraper**.
3. **Title** - The title visible in the Actor's detail page.
4. **Description** - The Actor's description.
5. **Last modification date** - When the Actor was last modified.
6. **Number of runs** - How many times the Actor was run.

![Data to scrape](https://raw.githubusercontent.com/apify/actor-scraper/master/docs/img/scraping-practice.webp)

We've already scraped numbers 1 and 2 in the [Getting started](https://docs.apify.com/academy/apify-scrapers/getting-started.md) tutorial, so let's get to the next one on the list: title.

### Title

![Title](https://raw.githubusercontent.com/apify/actor-scraper/master/docs/img/title.webp)

By using the element selector tool, we find out that the title is there under an `<h1>` tag, as titles should be. Maybe surprisingly, we find that there are actually two `<h1>` tags on the detail page. This should get us thinking. Is there any parent element that includes our `<h1>` tag, but not the other ones? Yes, there is! A `<header>` element that we can use to select only the heading we're interested in.

> Remember that you can press CTRL+F (CMD+F) in the Elements tab of DevTools to open the search bar where you can quickly search for elements using their selectors. And always make sure to use the DevTools to verify your scraping process and assumptions. It's faster than changing the crawler code all the time.

To get the title, we need to find it using a `header h1` selector, which selects all `<h1>` elements that have a `<header>` ancestor. And as we already know, there's only one.


```
// Using Cheerio.
async function pageFunction(context) {
    const { $ } = context;
    // ... rest of your code can come here
    return {
        title: $('header h1').text(),
    };
}
```


### Description

Getting the Actor's description is a little more involved, but still pretty straightforward. We cannot search for a `<span>` tag, because there's a lot of them in the page. We need to narrow our search down a little. Using the DevTools we find that the Actor description is nested within the `<header>` element too, same as the title. Moreover, the actual description is nested inside a `<span>` tag with a class `actor-description`.

![Description](https://raw.githubusercontent.com/apify/actor-scraper/master/docs/img/description.webp)


```
async function pageFunction(context) {
    const { $ } = context;
    // ... rest of your code can come here
    return {
        title: $('header h1').text(),
        description: $('header span.actor-description').text(),
    };
}
```


### Modified date

The DevTools tell us that the `modifiedDate` can be found in a `<time>` element.

![Modified date](https://raw.githubusercontent.com/apify/actor-scraper/master/docs/img/modified-date.webp)


```
async function pageFunction(context) {
    const { $ } = context;
    // ... rest of your code can come here
    return {
        title: $('header h1').text(),
        description: $('header span.actor-description').text(),
        modifiedDate: new Date(
            Number(
                $('ul.ActorHeader-stats time').attr('datetime'),
            ),
        ),
    };
}
```


It might look a little too complex at first glance, but let us walk you through it. We find the `<time>` element. Then, we read its `datetime` attribute, because that's where a unix timestamp is stored as a `string`.

But we would much rather see a readable date in our results, not a unix timestamp, so we need to convert it. Unfortunately, the `new Date()` constructor will not interpret a timestamp passed as a `string`, so we cast the `string` to a `number` using the `Number()` function before actually calling `new Date()`. Phew!
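
The conversion can be tried in isolation. In this small sketch, the helper name `timestampStringToDate` and the sample value are made up, and the timestamp is assumed to be in milliseconds, which is what `new Date()` expects for a number.

```javascript
// Sketch: turn a timestamp stored as a string into a Date object.
// Assumes the value is in milliseconds since the Unix epoch.
function timestampStringToDate(value) {
    const millis = Number(value);
    if (Number.isNaN(millis)) {
        // Happens, for example, when the attribute is missing (undefined)
        throw new TypeError(`Not a numeric timestamp: ${value}`);
    }
    return new Date(millis);
}

console.log(timestampStringToDate('1600000000000').toISOString());
// 2020-09-13T12:26:40.000Z
```

The explicit `NaN` check is an addition for the sketch; the page function in this tutorial skips it for brevity.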

### Run count

And so we're finishing up with the `runCount`. There's no specific element like `<time>`, so we need to create a complex selector and then do a transformation on the result.


```
async function pageFunction(context) {
    const { $ } = context;
    // ... rest of your code can come here
    return {
        title: $('header h1').text(),
        description: $('header span.actor-description').text(),
        modifiedDate: new Date(
            Number(
                $('ul.ActorHeader-stats time').attr('datetime'),
            ),
        ),
        runCount: Number(
            $('ul.ActorHeader-stats > li:nth-of-type(3)')
                .text()
                .match(/[\d,]+/)[0]
                .replace(/,/g, ''),
        ),
    };
}
```


The `ul.ActorHeader-stats > li:nth-of-type(3)` selector looks complicated, but it only says that we're looking for a `<ul>` element with the `ActorHeader-stats` class and, within that element, for the third `<li>` element. We grab its text, but we're only interested in the number of runs. We parse the number out using a regular expression, but its type is still a `string`, so we finally convert the result to a `number` by wrapping it with a `Number()` call.

> The numbers are formatted with commas as thousands separators (e.g. `'1,234,567'`), so to extract it, we first use regular expression `/[\d,]+/` - it will search for consecutive number or comma characters. Then we extract the match via `.match(/[\d,]+/)[0]` and finally remove all the commas by calling `.replace(/,/g, '')`. We need to use `/,/g` with the global modifier to support large numbers with multiple separators, without it we would replace only the very first occurrence.
>
> This will give us a string (e.g. `'1234567'`) that can be converted via `Number` function.
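
The parsing steps can be checked on their own with a plain string. The helper name `parseRunCount` and the sample text are illustrative, and the `null` guard for text without any number is an addition that the page function above does not have.

```javascript
// Sketch: extract a number with thousands separators from a piece of text.
function parseRunCount(text) {
    // Find the first run of digits and commas, e.g. '1,234,567'
    const match = text.match(/[\d,]+/);
    if (!match) return null; // no number in the text
    // Remove every comma (note the global flag) and convert to a number
    return Number(match[0].replace(/,/g, ''));
}

console.log(parseRunCount('1,234,567 runs'));
// 1234567
```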

### Wrapping it up

And there we have it! All the data we needed in a single object. For the sake of completeness, let's add the properties we parsed from the URL earlier and we're good to go.


```
async function pageFunction(context) {
    const { $, request } = context;
    const { url } = request;
    // ... rest of your code can come here

    const uniqueIdentifier = url
        .split('/')
        .slice(-2)
        .join('/');

    return {
        url,
        uniqueIdentifier,
        title: $('header h1').text(),
        description: $('header span.actor-description').text(),
        modifiedDate: new Date(
            Number(
                $('ul.ActorHeader-stats time').attr('datetime'),
            ),
        ),
        runCount: Number(
            $('ul.ActorHeader-stats > li:nth-of-type(3)')
                .text()
                .match(/[\d,]+/)[0]
                .replace(/,/g, ''),
        ),
    };
}
```


All we need to do now is add this to our `pageFunction`:


```
async function pageFunction(context) {
    // $ is Cheerio
    const { request, log, skipLinks, $ } = context;
    if (request.userData.label === 'START') {
        log.info('Store opened!');
        // Do some stuff later.
    }
    if (request.userData.label === 'DETAIL') {
        const { url } = request;
        log.info(`Scraping ${url}`);
        await skipLinks();

        // Do some scraping.
        const uniqueIdentifier = url
            .split('/')
            .slice(-2)
            .join('/');

        return {
            url,
            uniqueIdentifier,
            title: $('header h1').text(),
            description: $('header span.actor-description').text(),
            modifiedDate: new Date(
                Number(
                    $('ul.ActorHeader-stats time').attr('datetime'),
                ),
            ),
            runCount: Number(
                $('ul.ActorHeader-stats > li:nth-of-type(3)')
                    .text()
                    .match(/[\d,]+/)[0]
                    .replace(/,/g, ''),
            ),
        };
    }
}
```


### Test run

As always, try hitting that **Save & Run** button and visit the **Dataset** preview of clean items. You should see a nice table of all the attributes correctly scraped. You nailed it!

## Pagination

Pagination is a term that represents "going to the next page of results". You may have noticed that we did not actually scrape all the Actors, just the first page of results. That's because to load the rest of the Actors, one needs to click the **Show more** button at the very bottom of the list. This is pagination.

> This is a typical JavaScript pagination, sometimes called infinite scroll. Other pages may use links that take you to the next page. If you encounter those, make a Pseudo URL for those links and they will be automatically enqueued to the request queue. Use a label to let the scraper know what kind of URL it's processing.

If you paid close attention, you may now see a problem. How do we click a button in the page when we're working with Cheerio? We don't have a browser to do it and we only have the HTML of the page to work with. The simple answer is that we can't click a button. Does that mean that we cannot get the data at all? Usually not, but it requires some clever DevTools-Fu.

### Analyzing the page

While with Web Scraper and **Puppeteer Scraper** (https://apify.com/apify/puppeteer-scraper), we could get away with clicking a button, with Cheerio Scraper we need to dig a little deeper into the page's architecture. For this, we will use the Network tab of the Chrome DevTools.

> DevTools is a powerful tool with many features, so if you're not familiar with it, please see the Chrome DevTools documentation (https://developer.chrome.com/docs/devtools/), which explains everything much better than we ever could.

We want to know what happens when we click the **Show more** button, so we open the DevTools **Network** tab and clear it. Then we click the **Show more** button and wait for incoming requests to appear in the list.

![Inspecting network requests in DevTools](https://raw.githubusercontent.com/apify/actor-scraper/master/docs/img/inspect-network.webp)

Now, this is interesting. It seems that we've only received two images after clicking the button and no additional data. This means that the data about Actors must already be available in the page and the **Show more** button only displays it. This is good news.

### Finding the Actors

Now that we know the information we seek is already in the page, we just need to find it. The first Actor in the store is Web Scraper, so let's try using the search tool in the **Elements** tab to find some reference to it. The first few hits do not provide any interesting information, but in the end, we find our goldmine: a `<script>` tag with the ID `__NEXT_DATA__` that seems to hold a lot of information about Web Scraper. In DevTools, you can right-click an element and click **Store as global variable** to make this element available in the **Console**.

![Finding the data in the Elements tab](https://raw.githubusercontent.com/apify/actor-scraper/master/docs/img/find-data.webp)

A `temp1` variable is now added to your console. We're mostly interested in its contents and we can get that using the `temp1.textContent` property. You can see that it's a rather large JSON string. How do we know? The `type` attribute of the `<script>` element says `application/json`. But working with a string would be very cumbersome, so we need to parse it.


```
const data = JSON.parse(temp1.textContent);
```


After entering the above command into the console, we can inspect the `data` variable and see that all the information we need is there, in the `data.props.pageProps.items` array. Great!

![Inspecting the parsed data in the Console](https://raw.githubusercontent.com/apify/actor-scraper/master/docs/img/inspect-data.webp)

> It's obvious that all the information we set to scrape is available in this one data object, so you might already be wondering, can I make one request to the store to get this JSON and then parse it out and be done with it in a single request? Yes you can! And that's the power of clever page analysis.
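
To illustrate the single-request idea, here is a minimal sketch that pulls the JSON out of a raw HTML string. It uses a regular expression only to stay self-contained; inside Cheerio Scraper you would use `$` instead. The function name `extractNextData` and the sample HTML are assumptions made up for this example.

```javascript
// Assumption: the page embeds its data in
// <script id="__NEXT_DATA__" type="application/json">...</script>.
function extractNextData(html) {
    const match = html.match(/<script[^>]*id="__NEXT_DATA__"[^>]*>([\s\S]*?)<\/script>/);
    if (!match) throw new Error('__NEXT_DATA__ script not found.');
    // The script body is a JSON string, so we parse it into an object.
    return JSON.parse(match[1]);
}

// Made-up sample that mimics the structure described above.
const sampleHtml = `<html><body>
<script id="__NEXT_DATA__" type="application/json">
{"props":{"pageProps":{"items":[{"name":"web-scraper","username":"apify"}]}}}
</script></body></html>`;

const { items } = extractNextData(sampleHtml).props.pageProps;
console.log(items[0].name); // web-scraper
```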

### Using the data to enqueue all Actor details

We don't really need to go to all the Actor details now, but for the sake of practice, let's imagine we only found Actor names such as `cheerio-scraper` and their owners, such as `apify` in the data. We will use this information to construct URLs that will take us to the Actor detail pages and enqueue those URLs into the request queue.


```
// We're not in DevTools anymore,
// so we use Cheerio to get the data.
const dataJson = $('#__NEXT_DATA__').html();
// We requested HTML, but the data are actually JSON.
const data = JSON.parse(dataJson);

for (const item of data.props.pageProps.items) {
    const { name, username } = item;
    const actorDetailUrl = `https://apify.com/${username}/${name}`;
    await context.enqueueRequest({
        url: actorDetailUrl,
        userData: {
            // Don't forget the label.
            label: 'DETAIL',
        },
    });
}
```


We iterate through the items we found, build Actor detail URLs from the available properties and then enqueue those URLs into the request queue. We need to specify the label too, otherwise our page function wouldn't know how to route those requests.
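
The URL-building step can also be tried in isolation. The following is a small sketch, assuming items with the `name` and `username` properties shown above; the helper name `buildDetailRequests` is hypothetical.

```javascript
// Hypothetical helper: turns store items into request objects for the queue.
function buildDetailRequests(items) {
    return items.map(({ name, username }) => ({
        url: `https://apify.com/${username}/${name}`,
        // The label lets the page function route the request later.
        userData: { label: 'DETAIL' },
    }));
}

const requests = buildDetailRequests([
    { name: 'cheerio-scraper', username: 'apify' },
]);
console.log(requests[0].url); // https://apify.com/apify/cheerio-scraper
```

In the page function, each of these objects could then be passed to `context.enqueueRequest()`.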

> If you're wondering how we know the structure of the URL, see the getting started tutorial (https://docs.apify.com/academy/apify-scrapers/getting-started.md) again.

### Plugging it into the Page function

We've got the general algorithm ready, so all that's left is to integrate it into our earlier `pageFunction`. Remember the `// Do some stuff later` comment? Let's replace it.


```
async function pageFunction(context) {
    const { request, log, skipLinks, $ } = context;
    if (request.userData.label === 'START') {
        log.info('Store opened!');

        const dataJson = $('#__NEXT_DATA__').html();
        // We requested HTML, but the data are actually JSON.
        const data = JSON.parse(dataJson);

        for (const item of data.props.pageProps.items) {
            const { name, username } = item;
            const actorDetailUrl = `https://apify.com/${username}/${name}`;
            await context.enqueueRequest({
                url: actorDetailUrl,
                userData: {
                    label: 'DETAIL',
                },
            });
        }
    }
    if (request.userData.label === 'DETAIL') {
        const { url } = request;
        log.info(`Scraping ${url}`);
        await skipLinks();

        // Do some scraping.
        const uniqueIdentifier = url
            .split('/')
            .slice(-2)
            .join('/');

        return {
            url,
            uniqueIdentifier,
            title: $('header h1').text(),
            description: $('header span.actor-description').text(),
            modifiedDate: new Date(
                Number(
                    $('ul.ActorHeader-stats time').attr('datetime'),
                ),
            ),
            runCount: Number(
                $('ul.ActorHeader-stats > li:nth-of-type(3)')
                    .text()
                    .match(/[\d,]+/)[0]
                    .replace(/,/g, ''),
            ),
        };
    }
}
```


That's it! You can now remove the **Max pages per run** limit, **Save & Run** your task and watch the scraper scrape all of the Actors' data. After it succeeds, open the **Dataset** tab again and click on **Preview**. You should have a table of all the Actors' details in front of you. If you do, great job! You've successfully scraped Apify Store. And if not, no worries, go through the code examples again, it's probably just a typo.

> There's an important caveat. The way we implemented pagination here is in no way a generic system that you can use with other websites. Cheerio is fast (and that means it's cheap), but it's not easy. Sometimes there's just no way to get all results with Cheerio only and other times it takes hours of research. Keep this in mind when choosing the right scraper for your job. But don't get discouraged. Oftentimes, the only thing you will ever need is to define a correct Pseudo URL. Do your research first before giving up on Cheerio Scraper.

## Downloading the scraped data

You already know the **Dataset** tab of the run console since this is where we've always previewed our data. Notice the row of data formats such as JSON, CSV, and Excel. Below it are options for viewing and downloading the data. Go ahead and try it.

> If you prefer working with an API, you can find the example endpoint under the API tab: **Get dataset items**.

### Clean items

You can view and download your data without modifications, or you can choose to only get **clean** items. Data that aren't cleaned include a record for each `pageFunction` invocation, even if you did not return any results. The record also includes hidden fields such as `#debug`, where you can find a variety of information that can help you with debugging your scrapers.

Clean items, on the other hand, include only the data you returned from the `pageFunction`. If you're only interested in the data you scraped, this format is what you will be using most of the time.

To control this, open the **Advanced options** view on the **Dataset** tab.
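
If you prefer the API, the same choice is exposed as a query parameter. The sketch below builds a URL for the **Get dataset items** endpoint; to the best of our knowledge it accepts `format` and `clean` parameters, but check the API reference before relying on them. The helper name `buildDatasetItemsUrl` and the dataset ID are placeholders.

```javascript
// Builds a URL for the "Get dataset items" endpoint.
// `datasetId` is supplied by the caller; the ID used below is a placeholder.
function buildDatasetItemsUrl(datasetId, { format = 'json', clean = true } = {}) {
    const url = new URL(`https://api.apify.com/v2/datasets/${encodeURIComponent(datasetId)}/items`);
    url.searchParams.set('format', format);
    // Ask for clean items only when requested.
    if (clean) url.searchParams.set('clean', 'true');
    return url.toString();
}

console.log(buildDatasetItemsUrl('MY_DATASET_ID', { format: 'csv' }));
// https://api.apify.com/v2/datasets/MY_DATASET_ID/items?format=csv&clean=true
```

The sketch deliberately leaves out the authentication token, which should never be shared in a URL.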

## Bonus: Making your code neater

You may have noticed that the `pageFunction` gets quite bulky. To make better sense of your code and have an easier time maintaining or extending your task, feel free to define other functions inside the `pageFunction` that encapsulate all the different logic. You can, for example, define a function for each of the different pages:


```
async function pageFunction(context) {
    switch (context.request.userData.label) {
        case 'START': return handleStart(context);
        case 'DETAIL': return handleDetail(context);
        default: throw new Error('Unknown request label.');
    }

    async function handleStart({ log, $ }) {
        log.info('Store opened!');

        const dataJson = $('#__NEXT_DATA__').html();
        // We requested HTML, but the data are actually JSON.
        const data = JSON.parse(dataJson);

        for (const item of data.props.pageProps.items) {
            const { name, username } = item;
            const actorDetailUrl = `https://apify.com/${username}/${name}`;
            await context.enqueueRequest({
                url: actorDetailUrl,
                userData: {
                    label: 'DETAIL',
                },
            });
        }
    }

    async function handleDetail({ request, log, skipLinks, $ }) {
        const { url } = request;
        log.info(`Scraping ${url}`);
        await skipLinks();

        // Do some scraping.
        const uniqueIdentifier = url
            .split('/')
            .slice(-2)
            .join('/');

        return {
            url,
            uniqueIdentifier,
            title: $('header h1').text(),
            description: $('header span.actor-description').text(),
            modifiedDate: new Date(
                Number(
                    $('ul.ActorHeader-stats time').attr('datetime'),
                ),
            ),
            runCount: Number(
                $('ul.ActorHeader-stats > li:nth-of-type(3)')
                    .text()
                    .match(/[\d,]+/)[0]
                    .replace(/,/g, ''),
            ),
        };
    }
}
```


> If you're confused by the functions being declared below their executions, it's called hoisting and it's a feature of JavaScript. It helps you put what matters on top, if you so desire.
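
Hoisting can be seen in a stripped-down sketch that runs outside the scraper. The function name `route` and the returned strings are made up for this illustration.

```javascript
function route(label) {
    switch (label) {
        case 'START': return handleStart();
        case 'DETAIL': return handleDetail();
        default: throw new Error('Unknown request label.');
    }

    // Function declarations are hoisted to the top of the enclosing function,
    // so they can be called above even though they are written down here.
    function handleStart() {
        return 'start handled';
    }

    function handleDetail() {
        return 'detail handled';
    }
}

console.log(route('DETAIL')); // detail handled
```

Note that this works for function declarations only. A function assigned to a `const` or `let` variable cannot be used before the line that defines it.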

## Final word

Thank you for reading this whole tutorial! Really! It's important to us that our users have the best information available to them so that they can use Apify effectively. We're glad that you made it all the way here and congratulations on creating your first scraping task. We hope that you liked the tutorial and if there's anything you'd like to ask, join us on Discord (https://discord.gg/jyEM2PRvMU)!

## What's next

* Check out the Apify SDK (https://docs.apify.com/sdk) and its tutorial (https://docs.apify.com/sdk/js/docs/guides/apify-platform) if you'd like to try building your own Actors. It's a bit more complex and involved than writing a `pageFunction`, but it allows you to fine-tune all the details of your scraper to your liking.
* Read more about Actors (https://docs.apify.com/platform/actors.md), from how they work to publishing (https://docs.apify.com/platform/actors/publishing.md) them in Apify Store, and even making money (https://blog.apify.com/make-regular-passive-income-developing-web-automation-actors-b0392278d085/) on Actors.
* Found out you're not into the coding part but would still like to use Apify Actors? Check out our ready-made solutions (https://apify.com/store) or order a custom Actor (https://apify.com/contact-sales) from an Apify-certified developer.

**Learn how to scrape a website using Apify's Cheerio Scraper. Build an Actor's page function, extract information from a web page and download your data.**

***


---

# Getting started with Apify scrapers

Welcome to the getting started tutorial! It will walk you through creating your first scraping task step by step. You will learn how to set up all the different configuration options, code a **Page function** (`pageFunction`), and finally download the scraped data either as an Excel sheet or in another format, such as JSON or CSV. But first, let's give you a brief introduction to web scraping with Apify.

## What is an Apify scraper

It doesn't matter whether you arrived here from **Web Scraper** (https://apify.com/apify/web-scraper), **Puppeteer Scraper** (https://apify.com/apify/puppeteer-scraper) or **Cheerio Scraper** (https://apify.com/apify/cheerio-scraper). All of them are **Actors** and for now, let's think of an **Actor** as an application that you can use with your own configuration. **apify/web-scraper** is therefore an application called **web-scraper**, built by **apify**, that you can configure to scrape any webpage. We call these configurations **tasks**.

> If you need help choosing the right scraper, see this article on choosing the right solution (https://help.apify.com/en/articles/3024655-choosing-the-right-solution). If you want to learn more about Actors in general, you can read our Actors page (https://apify.com/actors) or the Actors documentation (https://docs.apify.com/platform/actors.md).

You can create 10 different **tasks** for 10 different websites, with very different options, but there will always be just one **Actor**, the `apify/*-scraper` you chose. This is the essence of tasks. They are nothing but **saved configurations** of the Actor that you can run repeatedly.

## Trying it out

Depending on how you arrived at this tutorial, you may already have your first task created for the scraper of your choice. If not, the easiest way is to go to https://console.apify.com/actors#/store/ and select the Actor you want to base your task on. Then, click the **Create a new task** button in the top-right corner.

> This tutorial covers the use of **Web**, **Cheerio**, and **Puppeteer** scrapers, but a lot of the information here can be used with all Actors. For this tutorial, we will select **Web Scraper**.

![Actor selection](https://raw.githubusercontent.com/apify/actor-scraper/master/docs/img/actor-selection.webp)

### Running a task

This takes you to the **Input and options** tab of the task configuration. Before we delve into the details, let's see how the example works. You can see that there are already some pre-configured input values. It says that the task should visit **https://apify.com** and all its subpages, such as **https://apify.com/contact**, and scrape some data using the provided `pageFunction`, specifically the `<title>` of the page and its URL.

Scroll down to the **Performance and limits** section and set the **Max pages per run** option to **10**. This tells your task to finish after 10 pages have been visited. We don't need to crawl the whole domain to see that the Actor works.

> This also helps with keeping your compute unit (https://docs.apify.com/platform/actors/running/usage-and-resources.md) (CU) consumption low. To get an idea, our free plan includes 10 CUs and this run will consume about 0.04 CU, so you can run it 250 times a month for free. If you accidentally go over the limit, no worries, we won't charge you for it. You just won't be able to run more tasks that month.

Now click **Save & Run**! *(in the bottom-left part of your screen)*

### The run detail

After clicking **Save & Run**, the window will change to the run detail. Here, you will see the run's log. If it seems that nothing is happening, don't worry, it takes a few seconds for the run to fully boot up. In under a minute, you should have the 10 pages scraped. You will know that the run successfully completed when the `RUNNING` card in top-left corner changes to `SUCCEEDED`.

> Feel free to browse through the various new tabs: **Log**, **Info**, **Input** and others, but for the sake of brevity, we will not explain all their features in this tutorial.

Now that the run has `SUCCEEDED`, click on the glowing **Results** card to see the scrape's results. This takes you to the **Dataset** tab, where you can display or download the results in various formats. For now, click the **Preview** button. Voila, the scraped data!

![The run detail](https://raw.githubusercontent.com/apify/actor-scraper/master/docs/img/the-run-detail.webp)

Good job! We've run our first task and got some results. Let's learn how to change the default configuration to scrape something more interesting than the page's `<title>`.

## Creating your own task

Before we jump into the scraping itself, let's have a quick look at the user interface that's available to us. Click on the task's name in the top-left corner to visit the task's configuration.

![Task name](https://raw.githubusercontent.com/apify/actor-scraper/master/docs/img/task-name.webp)

### Input and options

The **Input** tab is where we started and it's the place where you create your scraping configuration. The Actor's creator prepares the **Input** form so that you can tell the Actor what to do. Feel free to check the tooltips of the various options to get a better idea of what they do. To display the tooltip, click the question mark next to each input field's name.

> We will not go through all the available input options in this tutorial. See the Actor's README for detailed information.

Below the input fields are the Build, Timeout and Memory options. Let's keep them at default settings for now. Remember that if you see a yellow `TIMED-OUT` status after running your task, you might want to come back here and increase the timeout.

> Timeouts are there to prevent tasks from running forever. Always set a reasonable timeout to prevent a rogue task from eating up all your compute units.

### Settings

In the settings tab, you can set options that are common to all tasks and not directly related to the Actor's purpose. Unless you've already changed the task's name, it's set to **my-task**, so why not try changing it to **my-first-scraper** and clicking **Save**.

### Runs

You can find all the task runs and their detail pages here. Every time you start a task, it will appear here in the list. Apify securely stores your ten most recent runs indefinitely, ensuring your records are always accessible. All of your task's runs and their outcomes, beyond the latest ten, will be stored here for the data retention period, which you can find under your plan (https://apify.com/pricing).

### Webhooks

Webhooks are a feature that helps keep you aware of what's happening with your tasks. You can set them up to inform you when a task starts, finishes, fails etc., or you can even use them to run more tasks, depending on the outcome of the original one. See the webhooks documentation (https://docs.apify.com/platform/integrations/webhooks.md).

### Information

Since tasks are configurations for Actors, this tab shows you all the information about the underlying Actor, the Apify scraper of your choice. You can see the available versions and their READMEs - it's always a good idea to read an Actor's README first before creating a task for it.

### API

The API tab gives you a quick overview of all the available API calls in case you would like to use your task programmatically. It also includes links to detailed API documentation. You can even try it out immediately using the **Test endpoint** button.

> Never share a URL containing the authentication token (`?token=...` parameter in the URLs), as this will compromise your account's security.

## Scraping theory

Since this is a tutorial, we'll be scraping our own website. Apify Store (https://apify.com/store) is a great candidate for some scraping practice. It's a page built on popular technologies, which displays a lot of different items in various categories, just like an online store, a typical scraping target, would.

### The goal

We want to create a scraper that scrapes all the Actors in the store and collects the following attributes for each Actor:

1. **URL** - The URL that goes directly to the Actor's detail page.
2. **Unique identifier** - Such as **apify/web-scraper**.
3. **Title** - The title visible in the Actor's detail page.
4. **Description** - The Actor's description.
5. **Last modification date** - When the Actor was last modified.
6. **Number of runs** - How many times the Actor was run.

Some of this information may be scraped directly from the listing pages, but for the rest, we will need to visit the detail pages of all the Actors.

### The start URL

In the **Input** tab of the task we have, we'll change the **Start URL** from **https://apify.com**. This will tell the scraper to start by opening a different URL. You can add more **Start URL**s or even upload a file with a list of them, but in this case, we'll be good with just one.

How do we choose the new **Start URL**? The goal is to scrape all Actors in the store, which is available at https://apify.com/store, so we choose this URL as our **Start URL**.


```
https://apify.com/store
```


We also need to somehow distinguish the **Start URL** from all the other URLs that the scraper will add later. To do this, click the **Details** button in the **Start URL** form and see the **User data** input. Here you can add any information you'll need during the scrape in a JSON format. For now, add a label to the **Start URL**.


```
{
  "label": "START"
}
```


### Filtering with a Link selector

The **Link selector**, together with **Pseudo URL**s, are your URL matching arsenal. The Link selector is a CSS selector and its purpose is to select the HTML elements where the scraper should look for URLs. And by looking for URLs, we mean finding the elements' `href` attributes. For example, to enqueue URLs from `<div class="my-class" href="...">` tags, we would enter `'div.my-class'`.

What's the connection to **Pseudo URL**s? Well, first, all the URLs found in the elements that match the Link selector are collected. Then, **Pseudo URL**s are used to filter through those URLs and enqueue only the ones that match the **Pseudo URL** structure.

To scrape all the Actors in Apify Store, we should use the Link selector to tell the scraper where to find the URLs we need. For now, let us tell you that the Link selector you're looking for is:


```
div.item > a
```


Save it as your **Link selector**. If you're wondering how we figured this out, follow along with the tutorial. By the time we finish, you'll know why we used this selector, too.

### Crawling the website with pseudo URLs

What is a **Pseudo URL**? Let us explain. Before we can start scraping the Actor details, we need to find all the links to the details. If the links follow a set structure, we can use a certain pattern to describe this structure. And that's what a **Pseudo URL** is. A pattern that describes a URL structure. By setting a **Pseudo URL**, all links that follow the given structure will automatically be added to the crawling queue.

Let's see an example. To find the pattern, open some of the Actor details in the store. You'll find that the URLs are always structured the same:


```
https://apify.com/{OWNER}/{NAME}
```


In the structures, only the `OWNER` and `NAME` change. We can leverage this in a **Pseudo URL**.

#### Making a pseudo URL

**Pseudo URL**s are URLs with some variable parts in them. Those variable parts are represented by regular expressions (https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions) enclosed in brackets `[]`.

Working with our Actor details example, we could produce a **Pseudo URL** like this:


```
https://apify.com/[.+]/[.+]
```


This **Pseudo URL** will match all Actor detail pages, such as:


```
https://apify.com/apify/web-scraper
```


But it will not match pages we're not interested in, such as:


```
https://apify.com/contact
```


In addition, together with the filter we set up using the **Link selector**, the scraper will now avoid URLs such as:


```
https://apify.com/industries/manufacturing
```


This is because even though it matches our **Pseudo URL**'s format, the HTML element that contains it does not match the `div.item > a` element we specified in the **Link selector**.
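
To see why the URLs above match or don't match, here is a simplified sketch that converts a **Pseudo URL** into a regular expression. This is not the scraper's actual implementation; the function names are hypothetical and the sketch does not support nested brackets.

```javascript
// Escapes characters that have a special meaning in regular expressions.
function escapeRegExp(text) {
    return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Simplified assumption: text inside [] is a regular expression,
// everything else must match literally.
function pseudoUrlToRegExp(pseudoUrl) {
    let pattern = '';
    let rest = pseudoUrl;
    while (rest.length > 0) {
        const open = rest.indexOf('[');
        if (open === -1) {
            pattern += escapeRegExp(rest);
            break;
        }
        const close = rest.indexOf(']', open);
        if (close === -1) throw new Error('Unclosed bracket in pseudo URL.');
        pattern += escapeRegExp(rest.slice(0, open));
        pattern += `(${rest.slice(open + 1, close)})`;
        rest = rest.slice(close + 1);
    }
    return new RegExp(`^${pattern}$`);
}

const detailPattern = pseudoUrlToRegExp('https://apify.com/[.+]/[.+]');
console.log(detailPattern.test('https://apify.com/apify/web-scraper')); // true
console.log(detailPattern.test('https://apify.com/contact')); // false
console.log(detailPattern.test('https://apify.com/industries/manufacturing')); // true
```

The last URL matches the pattern, which is why the **Link selector** is needed to filter it out.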

Let's use the above **Pseudo URL** in our task. We should also add a label as we did with our **Start URL**. This label will be added to all pages that were enqueued into the request queue using the given **Pseudo URL**.


```
{
  "label": "DETAIL"
}
```


### Test run

Now that we've added some configuration, it's time to test it. Run the task, keeping the **Max pages per run** set to `10` and the `pageFunction` as it is. You should see in the log that the scraper first visits the **Start URL** and then several of the Actor details matching the **Pseudo URL**.

## The page function

The `pageFunction` is a JavaScript function that gets executed for each page the scraper visits. To figure out how to create it, you must first inspect the page's structure to get an idea of its inner workings. The best tools for that are a browser's inbuilt developer tools - DevTools.

### Using DevTools

Open https://apify.com/store in the Chrome browser (or use any other browser, just note that the DevTools may differ slightly) and open the DevTools, either by right-clicking on the page and selecting **Inspect** or by pressing **F12**.

The DevTools window will pop up and display a lot of, perhaps unfamiliar, information. Don't worry about that too much - open the Elements tab (the one with the page's HTML). The Elements tab allows you to browse the page's structure and search within it using the search tool. You can open the search tool by pressing **CTRL+F** or **CMD+F**. Try typing **title** into the search bar.

You'll see that the Elements tab jumps to the first `<title>` element of the current page and that the title is **Store · Apify**. It's always good practice to do your research using the DevTools before writing the `pageFunction` and running your task.

![Using DevTools](https://raw.githubusercontent.com/apify/actor-scraper/master/docs/img/using-devtools.webp)

> For the sake of brevity, we won't go into the details of using the DevTools in this tutorial. If you're just starting out with DevTools, the Chrome DevTools documentation (https://developer.chrome.com/docs/devtools/) is a good place to begin.

### Understanding `context`

The `pageFunction` has access to global variables such as `window` and `document`, which are provided by the browser, as well as to `context`, which is the `pageFunction`'s single argument. `context` carries a lot of useful information and helpful functions, which are described in the Actor's README.

### New page function boilerplate

We know that we'll visit two kinds of pages, the list page (**Start URL**) and the detail pages (enqueued using the **Pseudo URL**). We want to enqueue links on the list page and scrape data on the detail page.

Since we're not covering jQuery in this tutorial for the sake of brevity, replace the default boilerplate with the code below.


```
async function pageFunction(context) {
    const { request, log, skipLinks } = context;
    if (request.userData.label === 'START') {
        log.info('Store opened!');
        // Do some stuff later.
    }
    if (request.userData.label === 'DETAIL') {
        log.info(`Scraping ${request.url}`);
        await skipLinks();
        // Do some scraping.
        return {
            // Scraped data.
        };
    }
}
```


This may seem like a lot of new information, but it's all connected to our earlier configuration.

### `context.request`

The `request` is an instance of the `Request` (https://sdk.apify.com/docs/api/request) class and holds information about the currently processed page, such as its `url`. Each `request` also has the `request.userData` property of type `Object`. While configuring the **Start URL** and the **Pseudo URL**, we gave them a `label`. We're now using these labels in the `pageFunction` to distinguish between the store page and the detail pages.

### `context.skipLinks()`

When a **Pseudo URL** is set, the scraper attempts to enqueue matching links on each page it visits. `skipLinks()` is used to tell the scraper that we don't want this to happen on the current page.

### `context.log`

`log` is used for printing messages to the console. You may be tempted to use `console.log()`, but this will not work unless you turn on the **Browser log** option. `log.info()` should be used for general messages, but you can also use `log.debug()` for messages that will only be shown when you turn on the **Debug log** option. See the `log` documentation (https://sdk.apify.com/docs/api/log).

### The page function's return value

The `pageFunction` may only return nothing, `null`, `Object` or `Object[]`. If an `Object` is returned, it will be saved as a single result. Returning an `Array` of `Objects` will save each item in the array as a result.

The scraping results are saved in a dataset (https://docs.apify.com/platform/storage/dataset.md) (one of the tabs in the run console, as you may remember). It behaves like a table. Each item is a row in the table and its properties are its columns. Returning the following `Object`:


```
async function pageFunction(context) {
    // ... rest of your code
    return {
        url: 'https://apify.com',
        title: 'Web Scraping, Data Extraction and Automation - Apify',
    };
}
```


will produce the following table:

| title                                                | url               |
| ---------------------------------------------------- | ----------------- |
| Web Scraping, Data Extraction and Automation - Apify | https://apify.com |
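
The return value rules described in this section can be summarized in a short sketch. The helper `toDatasetItems` is hypothetical and only illustrates how many clean result rows each kind of return value would produce; it is not the scraper's actual code.

```javascript
// Hypothetical helper: normalizes a pageFunction return value into result rows.
function toDatasetItems(returnValue) {
    // Returning nothing or null produces no result rows.
    if (returnValue === undefined || returnValue === null) return [];
    // An array saves each item as a row, a single object saves one row.
    return Array.isArray(returnValue) ? returnValue : [returnValue];
}

console.log(toDatasetItems({ url: 'https://apify.com' }).length); // 1
console.log(toDatasetItems([{ id: 1 }, { id: 2 }]).length); // 2
console.log(toDatasetItems(null).length); // 0
```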

## Scraper lifecycle

Now that we're familiar with all the pieces in the puzzle, we'll quickly take a look at the scraper lifecycle, or in other words, what the scraper actually does when it scrapes. It's quite straightforward.

The scraper:

1. Visits the first **Start URL** and waits for the page to load.
2. Executes the `pageFunction`.
3. Finds all the elements matching the **Link selector** and extracts their `href` attributes (URLs).
4. Uses the **pseudo URLs** to filter the extracted URLs and throws away those that don't match.
5. Enqueues the matching URLs to the end of the crawling queue.
6. Closes the page and selects a new URL to visit, either from the **Start URL**s if there are any left, or from the beginning of the crawling queue.

> When you're not using the request queue, the scraper repeats steps 1 and 2. You would not use the request queue when you already know all the URLs you want to visit. For example, when you have a pre-existing list of a thousand URLs that you uploaded as a text file. Or when scraping a single URL.
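
The lifecycle above can be sketched as a small in-memory simulation. Everything here is an illustration under stated assumptions: `simulateCrawl`, the `site` map of fake pages, the `matchesPseudoUrl` callback, and the `seen` set that skips already enqueued URLs are all hypothetical and are not part of the scraper's API.

```javascript
// A simplified, in-memory simulation of the lifecycle above.
// `site` maps each URL to the links found on that page (made-up data).
function simulateCrawl({ startUrls, site, matchesPseudoUrl, pageFunction }) {
    const startQueue = [...startUrls];
    const crawlingQueue = [];
    const seen = new Set(startUrls); // assumption: each URL is visited once
    const results = [];

    while (startQueue.length > 0 || crawlingQueue.length > 0) {
        // Steps 1 and 6: take a Start URL if any are left, otherwise
        // take the URL at the beginning of the crawling queue.
        const url = startQueue.length > 0
            ? startQueue.shift()
            : crawlingQueue.shift();

        // Step 2: execute the pageFunction.
        const result = pageFunction({ request: { url } });
        if (result) results.push(result);

        // Step 3: extract the links found on the page.
        const links = site[url] || [];
        // Steps 4 and 5: keep matching URLs and enqueue them at the end.
        for (const link of links) {
            if (matchesPseudoUrl(link) && !seen.has(link)) {
                seen.add(link);
                crawlingQueue.push(link);
            }
        }
    }
    return results;
}

const visited = simulateCrawl({
    startUrls: ['https://example.com/store'],
    site: {
        'https://example.com/store': [
            'https://example.com/store/a',
            'https://example.com/about',
            'https://example.com/store/b',
        ],
    },
    matchesPseudoUrl: (url) => url.startsWith('https://example.com/store/'),
    pageFunction: ({ request }) => ({ url: request.url }),
});

console.log(visited.map((item) => item.url));
```

The `/about` link is extracted in step 3 but thrown away in step 4, so only the start page and the two matching pages are visited.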

## Scraping practice

We've covered all the concepts that we need to understand to successfully scrape the data in our goal, so let's get to it. We will only output data that are already available to us in the page's URL. Remember that we also want to include the **URL** and a **Unique identifier** in our results. To get those, we need the `request.url`, because it is the URL and includes the Unique identifier.


```
const { url } = request;
const uniqueIdentifier = url.split('/').slice(-2).join('/');
```
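
Here is the same transformation broken into steps, using a sample Store URL. The `getUniqueIdentifier` function at the end is a hypothetical, more defensive variant (not something the tutorial relies on) that would also tolerate a trailing slash or a query string.

```javascript
// Walk through the transformation step by step.
const url = 'https://apify.com/apify/web-scraper';

const parts = url.split('/');
// ['https:', '', 'apify.com', 'apify', 'web-scraper']
const lastTwo = parts.slice(-2);
// ['apify', 'web-scraper']
const uniqueIdentifier = lastTwo.join('/');

console.log(uniqueIdentifier); // apify/web-scraper

// Hypothetical, more defensive variant: it ignores query strings
// and trailing slashes by working on the URL's pathname only.
function getUniqueIdentifier(rawUrl) {
    const { pathname } = new URL(rawUrl);
    return pathname.split('/').filter(Boolean).slice(-2).join('/');
}

console.log(getUniqueIdentifier('https://apify.com/apify/web-scraper/?tab=info'));
// apify/web-scraper
```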


### Test run 2

We'll add our first data to the `pageFunction` and carry out a test run to see that everything works as expected.


```
async function pageFunction(context) {
    const { request, log, skipLinks } = context;
    if (request.userData.label === 'START') {
        log.info('Store opened!');
        // Do some stuff later.
    }
    if (request.userData.label === 'DETAIL') {
        const { url } = request;
        log.info(`Scraping ${url}`);
        await skipLinks();

        // Do some scraping.
        const uniqueIdentifier = url
            .split('/')
            .slice(-2)
            .join('/');

        return {
            url,
            uniqueIdentifier,
        };
    }
}
```


Now **Save & Run** the task and once it finishes, check the dataset by clicking on the **Results** card. Click **Preview** and you should see the URLs and unique identifiers scraped. Great job!

## Choosing sides

Up until now, everything has been the same for all the Apify scrapers. Whether you're using Web Scraper, Puppeteer Scraper or Cheerio Scraper, what you've learned now will always be the same. This is great if you ever need to switch scrapers, because there's no need to learn everything from scratch.

Differences can be found in the code we use in the `pageFunction`. Often subtle, sometimes large. In the next part of the tutorial, we'll focus on the individual scrapers' specific implementation details. It's time to choose sides. But don't worry, at Apify, no side is the dark side.

* https://docs.apify.com/academy/apify-scrapers/web-scraper.md
* https://docs.apify.com/academy/apify-scrapers/cheerio-scraper.md
* https://docs.apify.com/academy/apify-scrapers/puppeteer-scraper.md

**Step-by-step tutorial that will help you get started with all Apify Scrapers. Learn the foundations of scraping the web with Apify and creating your own Actors.**


---

#

This scraping tutorial will go into the nitty-gritty details of extracting data from **https://apify.com/store** using **Puppeteer Scraper** (https://apify.com/apify/puppeteer-scraper). If you arrived here from the Getting started tutorial (https://docs.apify.com/academy/apify-scrapers/getting-started.md), great! You are ready to continue where we left off. If you haven't seen the Getting started tutorial yet, check it out. It will help you learn about Apify and scraping in general and set you up for this tutorial, because this one builds on topics and code examples discussed there.

## Getting to know our tools

In the Getting started tutorial (https://docs.apify.com/academy/apify-scrapers/getting-started), we've confirmed that the scraper works as expected, so now it's time to add more data to the results.

To do that, we'll be using the Puppeteer library (https://github.com/puppeteer/puppeteer). Puppeteer is a browser automation library that allows you to control a browser using JavaScript. That is, simulate a real human sitting in front of a computer, using a mouse and a keyboard. It gives you almost unlimited possibilities, but you need to learn quite a lot before you'll be able to use all of its features. We'll walk you through some of the basics of Puppeteer, so that you can start using it for some of the most typical scraping tasks, but if you really want to master it, you'll need to visit its documentation (https://pptr.dev/) and really dive deep into its intricacies.

> The purpose of Puppeteer Scraper is to remove some of the difficulty faced when using Puppeteer by wrapping it in a nice, manageable UI. It provides almost all of its features in a format that is much easier to grasp when first trying to scrape using Puppeteer.

### Web Scraper differences

At first glance, it may seem like **Web Scraper** (https://apify.com/apify/web-scraper) and Puppeteer Scraper are almost the same. Well, they are. In fact, Web Scraper uses Puppeteer underneath. The difference is the amount of control they give you. Where Web Scraper only gives you access to in-browser JavaScript and the `pageFunction` is executed in the browser context, Puppeteer Scraper's `pageFunction` is executed in Node.js context, giving you much more freedom to bend the browser to your will. You're the puppeteer and the browser is your puppet. It's also much easier to work with external APIs, databases or the Apify SDK (https://sdk.apify.com) in the Node.js context. The tradeoff is simplicity vs power. Web Scraper is simple, Puppeteer Scraper is powerful (and the Apify SDK is super-powerful).

> In other words, Web Scraper's `pageFunction` is like a single `page.evaluate(pageFunction)` call (https://pptr.dev/#?product=Puppeteer&show=api-pageevaluatepagefunction-args).

Now that that's out of the way, let's open one of the Actor detail pages in the Store, for example the Web Scraper page, and use our DevTools-Fu to scrape some data.

> If you're wondering why we're using Web Scraper as an example instead of Puppeteer Scraper, it's only because we didn't want to triple the number of screenshots we needed to make. Lazy developers!

## Building our Page function

Before we start, let's do a quick recap of the data we chose to scrape:

1. **URL** - The URL that goes directly to the Actor's detail page.
2. **Unique identifier** - Such as **apify/web-scraper**.
3. **Title** - The title visible in the Actor's detail page.
4. **Description** - The Actor's description.
5. **Last modification date** - When the Actor was last modified.
6. **Number of runs** - How many times the Actor was run.

![Scraping practice](https://raw.githubusercontent.com/apify/actor-scraper/master/docs/img/scraping-practice.webp)

We've already scraped numbers 1 and 2 in the https://docs.apify.com/academy/apify-scrapers/getting-started.md tutorial, so let's get to the next one on the list: title.

### Title

![Title](https://raw.githubusercontent.com/apify/actor-scraper/master/docs/img/title.webp)

By using the element selector tool, we find out that the title is there under an `<h1>` tag, as titles should be. Maybe surprisingly, we find that there are actually two `<h1>` tags on the detail page. This should get us thinking. Is there any parent element that includes our `<h1>` tag, but not the other ones? Yes, there is! A `<header>` element that we can use to select only the heading we're interested in.

> Remember that you can press CTRL+F (CMD+F) in the Elements tab of DevTools to open the search bar where you can quickly search for elements using their selectors. And always make sure to use the DevTools to verify your scraping process and assumptions. It's faster than changing the crawler code all the time.

To get the title we need to find it using a `header h1` selector, which selects all `<h1>` elements that have a `<header>` ancestor. And as we already know, there's only one.


```
// Using Puppeteer
async function pageFunction(context) {
    const { page } = context;
    const title = await page.$eval(
        'header h1',
        ((el) => el.textContent),
    );

    return {
        title,
    };
}
```


The `page.$eval()` function (https://pptr.dev/#?product=Puppeteer&show=api-elementhandleevalselector-pagefunction-args-1) allows you to run a function in the browser, with the selected element as the first argument. Here we use it to extract the text content of an `h1` element that's in the page. The return value of the function is automatically passed back to the Node.js context, so we receive an actual `string` with the element's text.

### Description

Getting the Actor's description is a little more involved, but still pretty straightforward. We cannot search by tag name alone, because there are a lot of matching elements in the page. We need to narrow our search down a little. Using the DevTools we find that the Actor description is nested within the `<header>` element too, same as the title. Moreover, the actual description is nested inside a `<span>` tag with a class `actor-description`.

![Description](https://raw.githubusercontent.com/apify/actor-scraper/master/docs/img/description.webp)


```
async function pageFunction(context) {
    const { page } = context;
    const title = await page.$eval(
        'header h1',
        ((el) => el.textContent),
    );
    const description = await page.$eval(
        'header span.actor-description',
        ((el) => el.textContent),
    );

    return {
        title,
        description,
    };
}
```


### Modified date

The DevTools tell us that the `modifiedDate` can be found in a `<time>` element.

![Modified date](https://raw.githubusercontent.com/apify/actor-scraper/master/docs/img/modified-date.webp)


```
async function pageFunction(context) {
    const { page } = context;
    const title = await page.$eval(
        'header h1',
        ((el) => el.textContent),
    );
    const description = await page.$eval(
        'header span.actor-description',
        ((el) => el.textContent),
    );

    const modifiedTimestamp = await page.$eval(
        'ul.ActorHeader-stats time',
        (el) => el.getAttribute('datetime'),
    );
    const modifiedDate = new Date(Number(modifiedTimestamp));

    return {
        title,
        description,
        modifiedDate,
    };
}
```


Puppeteer also offers a sibling function, `page.$$eval()` (https://pptr.dev/#?product=Puppeteer&show=api-elementhandleevalselector-pagefunction-args). Similarly to `page.$eval()`, it runs a function in the browser, only it does not provide you with a single `Element` as the function's argument, but rather with an `Array` of all the matching `Element`s. Once again, the return value of the function will be passed back to the Node.js context. Here a single element is enough, so the code above stays with `page.$eval()`.

It might look a little too complex at first glance, but let us walk you through it. We find the `<time>` element inside the `ul.ActorHeader-stats` list. Then, we read its `datetime` attribute, because that's where a Unix timestamp is stored as a `string`.

But we would much rather see a readable date in our results, not a Unix timestamp, so we need to convert it. Unfortunately, the `new Date()` constructor does not treat a `string` as a timestamp (it tries to parse it as a formatted date instead), so we cast the `string` to a `number` using the `Number()` function before actually calling `new Date()`. Phew!
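
The conversion can be tried in isolation. This is a minimal sketch with an assumed sample value; the real `datetime` attribute on the page will hold a different timestamp.

```javascript
// The datetime attribute arrives as a string (sample value, assumed).
const modifiedTimestamp = '1546300800000';

// Passing the string directly makes Date try to parse it as a
// formatted date, which does not produce the intended timestamp.
const fromString = new Date(modifiedTimestamp);
console.log(Number.isNaN(fromString.getTime()) ? 'not a usable date' : fromString);

// Converting to a number first makes Date treat the value as
// milliseconds since the Unix epoch.
const modifiedDate = new Date(Number(modifiedTimestamp));
console.log(modifiedDate.toISOString()); // 2019-01-01T00:00:00.000Z
```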

### Run count

And so we're finishing up with the `runCount`. There's no specific element like `<time>`, so we need to create a complex selector and then do a transformation on the result.


```
async function pageFunction(context) {
    const { page } = context;
    const title = await page.$eval(
        'header h1',
        ((el) => el.textContent),
    );
    const description = await page.$eval(
        'header span.actor-description',
        ((el) => el.textContent),
    );

    const modifiedTimestamp = await page.$eval(
        'ul.ActorHeader-stats time',
        (el) => el.getAttribute('datetime'),
    );
    const modifiedDate = new Date(Number(modifiedTimestamp));

    const runCountText = await page.$eval(
        'ul.ActorHeader-stats > li:nth-of-type(3)',
        ((el) => el.textContent),
    );
    const runCount = Number(runCountText.match(/[\d,]+/)[0].replace(/,/g, ''));

    return {
        title,
        description,
        modifiedDate,
        runCount,
    };
}
```


The `ul.ActorHeader-stats > li:nth-of-type(3)` selector looks complicated, but it only says that we're looking for a `<ul class="ActorHeader-stats ...">` element and, within that element, for the third `<li>` element. We grab its text, but we're only interested in the number of runs. We parse the number out using a regular expression, but its type is still a `string`, so we finally convert the result to a `number` by wrapping it with a `Number()` call.

> The numbers are formatted with commas as thousands separators (e.g. `'1,234,567'`), so to extract the number, we first use the regular expression `/[\d,]+/`, which matches a run of consecutive digit or comma characters. Then we take the match via `.match(/[\d,]+/)[0]` and finally remove all the commas by calling `.replace(/,/g, '')`. We need to use `/,/g` with the global modifier to support large numbers with multiple separators; without it, we would replace only the very first occurrence.
>
> This will give us a string (e.g. `'1234567'`) that can be converted via `Number` function.
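
The parsing step is easy to verify on its own. In this sketch, `parseRunCount` is a hypothetical helper name and the sample texts such as `'1,234,567 runs'` are assumed, since the exact wording on the page may differ.

```javascript
// Parses a run count out of text such as "1,234,567 runs".
function parseRunCount(text) {
    const match = text.match(/[\d,]+/);
    // Guard against text that contains no number at all.
    if (!match) return null;
    return Number(match[0].replace(/,/g, ''));
}

console.log(parseRunCount('1,234,567 runs')); // 1234567
console.log(parseRunCount('42 runs')); // 42

// Without the global flag only the first comma is removed,
// leaving "1234,567", which Number() cannot convert.
console.log(Number('1,234,567'.replace(',', ''))); // NaN
```

The `null` guard is an addition for illustration: `match()` returns `null` when nothing matches, and indexing into it with `[0]` would throw.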

### Wrapping it up

And there we have it! All the data we needed in a single object. For the sake of completeness, let's add the properties we parsed from the URL earlier and we're good to go.


```
async function pageFunction(context) {
    const { page, request } = context;
    const { url } = request;

    // ...

    const uniqueIdentifier = url
        .split('/')
        .slice(-2)
        .join('/');

    const title = await page.$eval(
        'header h1',
        ((el) => el.textContent),
    );
    const description = await page.$eval(
        'header span.actor-description',
        ((el) => el.textContent),
    );

    const modifiedTimestamp = await page.$eval(
        'ul.ActorHeader-stats time',
        (el) => el.getAttribute('datetime'),
    );
    const modifiedDate = new Date(Number(modifiedTimestamp));

    const runCountText = await page.$eval(
        'ul.ActorHeader-stats > li:nth-of-type(3)',
        ((el) => el.textContent),
    );
    const runCount = Number(runCountText.match(/[\d,]+/)[0].replace(/,/g, ''));

    return {
        url,
        uniqueIdentifier,
        title,
        description,
        modifiedDate,
        runCount,
    };
}
```


All we need to do now is add this to our `pageFunction`:


```
async function pageFunction(context) {
    // page is Puppeteer's page
    const { request, log, skipLinks, page } = context;

    if (request.userData.label === 'START') {
        log.info('Store opened!');
        // Do some stuff later.
    }
    if (request.userData.label === 'DETAIL') {
        const { url } = request;
        log.info(`Scraping ${url}`);
        await skipLinks();

        // Do some scraping.
        const uniqueIdentifier = url
            .split('/')
            .slice(-2)
            .join('/');

        // Get attributes in parallel to speed up the process.
        const titleP = page.$eval(
            'header h1',
            (el) => el.textContent,
        );
        const descriptionP = page.$eval(
            'header span.actor-description',
            (el) => el.textContent,
        );
        const modifiedTimestampP = page.$eval(
            'ul.ActorHeader-stats time',
            (el) => el.getAttribute('datetime'),
        );
        const runCountTextP = page.$eval(
            'ul.ActorHeader-stats > li:nth-of-type(3)',
            (el) => el.textContent,
        );

        const [
            title,
            description,
            modifiedTimestamp,
            runCountText,
        ] = await Promise.all([
            titleP,
            descriptionP,
            modifiedTimestampP,
            runCountTextP,
        ]);

        const modifiedDate = new Date(Number(modifiedTimestamp));
        const runCount = Number(runCountText.match(/[\d,]+/)[0].replace(/,/g, ''));

        return {
            url,
            uniqueIdentifier,
            title,
            description,
            modifiedDate,
            runCount,
        };
    }
}
```


> You have definitely noticed that we changed up the code a little bit. This is because the back and forth communication between Node.js and browser takes some time and it slows down the scraper. To limit the effect of this, we changed all the functions to start at the same time and only wait for all of them to finish at the end. This is called concurrency or parallelism. Unless the functions need to be executed in a specific order, it's often a good idea to run them concurrently to speed things up.
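
To see the effect of starting the calls together, here is a small sketch that needs no browser. `fakeEval` is a hypothetical stand-in for a call such as `page.$eval()` and the 50 ms delay is an arbitrary assumption.

```javascript
// Stand-in for a call such as page.$eval(): resolves with `value`
// after `ms` milliseconds. Purely illustrative.
function fakeEval(value, ms) {
    return new Promise((resolve) => setTimeout(() => resolve(value), ms));
}

// Each call waits for the previous one to finish.
async function scrapeSequentially() {
    const title = await fakeEval('Web Scraper', 50);
    const description = await fakeEval('Crawls websites', 50);
    return { title, description };
}

// Both calls start immediately; we wait for them together.
async function scrapeConcurrently() {
    const titleP = fakeEval('Web Scraper', 50);
    const descriptionP = fakeEval('Crawls websites', 50);
    const [title, description] = await Promise.all([titleP, descriptionP]);
    return { title, description };
}

// Runs `fn` and reports how long it took.
async function measure(fn) {
    const start = Date.now();
    const result = await fn();
    return { result, elapsedMs: Date.now() - start };
}

const timings = (async () => {
    const sequential = await measure(scrapeSequentially);
    const concurrent = await measure(scrapeConcurrently);
    console.log(`sequential: ~${sequential.elapsedMs} ms`);
    console.log(`concurrent: ~${concurrent.elapsedMs} ms`);
    return { sequential, concurrent };
})();
```

Both versions return the same object. The sequential one takes roughly the sum of the delays, while the concurrent one takes roughly as long as the slowest call.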

### Test run

As always, try hitting that **Save & Run** button and visit the **Dataset** preview of clean items. You should see a nice table of all the attributes correctly scraped. You nailed it!

## Pagination

Pagination is a term that represents "going to the next page of results". You may have noticed that we did not actually scrape all the Actors, just the first page of results. That's because to load the rest of the Actors, one needs to click the **Show more** button at the very bottom of the list. This is pagination.

> This is a typical form of JavaScript pagination, sometimes called infinite scroll. Other pages may use links that take you to the next page. If you encounter those, make a **Pseudo URL** for those links and they will be automatically enqueued to the request queue. Use a label to let the scraper know what kind of URL it's processing.

### Waiting for dynamic content

Before we talk about paginating, we need to have a quick look at dynamic content. Since Apify Store is a JavaScript application (a popular approach), the button might not exist in the page when the scraper runs the `pageFunction`.

How is this possible? Because the scraper only waits for the page to load its HTML before executing the `pageFunction`. If there's additional JavaScript that modifies the DOM afterwards, the `pageFunction` may execute before this JavaScript has had the time to run.

At first, you may think that the scraper is broken, but it just cannot wait for all the JavaScript in the page to finish executing. For a lot of pages, there's always some JavaScript executing or some network requests being made. It would never stop waiting. It is therefore up to you, the programmer, to wait for the elements you need.

#### The `context.page.waitFor()` function

`waitFor()` is a function that's available on the Puppeteer `page` object that's in turn available on the `context` argument of the `pageFunction` (as you already know from previous chapters). It helps you with, well, waiting for stuff. It accepts either a number of milliseconds to wait, a selector to await in the page, or a function to execute. It will stop waiting once the time elapses, the selector appears or the provided function returns `true`.

> See `page.waitFor()` (https://pptr.dev/#?product=Puppeteer&show=api-pagewaitforselectororfunctionortimeout-options-args) in the Puppeteer documentation.


```
// Waits for 2 seconds.
await page.waitFor(2000);
// Waits until an element with id "my-id" appears in the page.
await page.waitFor('#my-id');
// Waits until a "myObject" variable appears
// on the window object.
await page.waitFor(() => !!window.myObject);
```


The selector may never be found and the function might never return `true`, so the `page.waitFor()` function also has a timeout. The default is `30` seconds. You can override it by providing an options object as the second parameter, with a `timeout` property.


```
await page.waitFor('.bad-class', { timeout: 5000 });
```
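
If the waiting logic feels like magic, the following sketch shows the general idea of waiting for a condition with a timeout. It is explicitly not Puppeteer's implementation: `waitForCondition`, its `interval` option and the simulated `state` object are all hypothetical.

```javascript
// Illustrative polling helper; this is NOT Puppeteer's implementation.
// It resolves once `condition()` returns a truthy value and rejects
// when `timeout` milliseconds pass first.
function waitForCondition(condition, { timeout = 30000, interval = 10 } = {}) {
    const deadline = Date.now() + timeout;
    return new Promise((resolve, reject) => {
        const check = () => {
            if (condition()) {
                resolve();
            } else if (Date.now() >= deadline) {
                reject(new Error(`Timed out after ${timeout} ms`));
            } else {
                setTimeout(check, interval);
            }
        };
        check();
    });
}

// Simulate content that appears 50 ms after the "page" loads.
const state = { ready: false };
setTimeout(() => { state.ready = true; }, 50);

const waitResults = (async () => {
    await waitForCondition(() => state.ready, { timeout: 1000 });
    console.log('Content appeared.');

    try {
        // This condition never becomes true, so the wait times out.
        await waitForCondition(() => false, { timeout: 100 });
        return 'found';
    } catch (err) {
        console.log(err.message); // Timed out after 100 ms
        return 'timed out';
    }
})();
```

The important part is the rejected promise: a timeout surfaces as an error that you can catch, which is exactly what the pagination loop later in this tutorial relies on.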


With those tools, you should be able to handle any dynamic content the website throws at you.

### How to paginate

After going through the theory, let's design the algorithm:

1. Wait for the **Show more** button.
2. Click it.
3. Is there another **Show more** button?
   * Yes? Repeat from 1. (loop)
   * No? We're done. We have all the Actors.

#### Waiting for the button

Before we can wait for the button, we need to know its unique selector. A quick look in the DevTools tells us that the button's class is some weird randomly generated string, but fortunately, there's an enclosing `<div>` with a class of `show-more`. Great! Our unique selector:


```
div.show-more > button
```


> Don't forget to confirm our assumption in the DevTools finder tool (CTRL/CMD + F).

![Waiting for the button](https://raw.githubusercontent.com/apify/actor-scraper/master/docs/img/waiting-for-the-button.webp)

Now that we know what to wait for, we plug it into the `waitFor()` function.


```
await page.waitFor('div.show-more > button');
```


#### Clicking the button

We have a unique selector for the button and we know that it's already rendered in the page. Clicking it is a piece of cake. We'll use the Puppeteer `page` again to issue the click. Puppeteer will actually simulate moving the mouse over the element and making a left mouse click on it.


```
await page.click('div.show-more > button');
```


This will show the next page of Actors.

#### Repeating the process

We've shown two function calls, but how do we make this work together in the `pageFunction`?


```
async function pageFunction(context) {
    const { log, page } = context;

    // ...

    let timeout; // undefined
    const buttonSelector = 'div.show-more > button';
    for (;;) {
        log.info('Waiting for the "Show more" button.');
        try {
            // Default timeout first time.
            await page.waitFor(buttonSelector, { timeout });
            // 2 sec timeout after the first.
            timeout = 2000;
        } catch (err) {
            // Ignore the timeout error.
            log.info('Could not find the "Show more button", '
                + 'we\'ve reached the end.');
            break;
        }
        log.info('Clicking the "Show more" button.');
        await page.click(buttonSelector);
    }

    // ...

}
```


We want to run this until the `waitFor()` function throws, so that's why we use an infinite `for (;;)` loop (equivalent to `while (true)`). We're also not interested in the error, because we're expecting it, so we ignore it and print a log message instead.

You might be wondering what's up with the `timeout`. Well, for the first page load, we want to wait longer, so that all the page's JavaScript has had a chance to execute, but for the other iterations, the JavaScript is already loaded and we're waiting for the page to re-render so waiting for `2` seconds is enough to confirm that the button is not there. We don't want to stall the scraper for `30` seconds just to make sure that there's no button.
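
To check that the loop really terminates, it can be run against a fake page. In this sketch `createFakePage` and `paginate` are hypothetical names, and the fake `waitFor()` and `click()` only mimic the two behaviors the loop depends on: waiting fails once no button is left, and each click consumes one button.

```javascript
// Fake page with a fixed number of "Show more" buttons left to click.
// Its waitFor() and click() only mimic the behavior we rely on.
function createFakePage(buttonsLeft) {
    let remaining = buttonsLeft;
    return {
        async waitFor(selector) {
            if (remaining === 0) {
                throw new Error(`waiting for selector "${selector}" failed: timeout`);
            }
        },
        async click() {
            remaining -= 1;
        },
    };
}

async function paginate(page, log) {
    const buttonSelector = 'div.show-more > button';
    let timeout; // undefined means "use the default timeout"
    let clicks = 0;
    for (;;) {
        try {
            await page.waitFor(buttonSelector, { timeout });
            // Shorter timeout for every iteration after the first.
            timeout = 2000;
        } catch (err) {
            log.info('No more buttons, we\'ve reached the end.');
            break;
        }
        await page.click(buttonSelector);
        clicks += 1;
    }
    return clicks;
}

const clicksDone = paginate(createFakePage(3), { info: console.log });
clicksDone.then((clicks) => console.log(`Clicked ${clicks} times.`));
// Clicked 3 times.
```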

### Plugging it into the Page function

We've got the general algorithm ready, so all that's left is to integrate it into our earlier `pageFunction`. Remember the `// Do some stuff later` comment? Let's replace it.


```
async function pageFunction(context) {
    const { request, log, skipLinks, page } = context;
    if (request.userData.label === 'START') {
        log.info('Store opened!');
        let timeout; // undefined
        const buttonSelector = 'div.show-more > button';
        for (;;) {
            log.info('Waiting for the "Show more" button.');
            try {
                // Default timeout first time.
                await page.waitFor(buttonSelector, { timeout });
                // 2 sec timeout after the first.
                timeout = 2000;
            } catch (err) {
                // Ignore the timeout error.
                log.info('Could not find the "Show more button", '
                    + 'we\'ve reached the end.');
                break;
            }
            log.info('Clicking the "Show more" button.');
            await page.click(buttonSelector);
        }
    }

    if (request.userData.label === 'DETAIL') {
        const { url } = request;
        log.info(`Scraping ${url}`);
        await skipLinks();

        // Do some scraping.
        const uniqueIdentifier = url
            .split('/')
            .slice(-2)
            .join('/');

        // Get attributes in parallel to speed up the process.
        const titleP = page.$eval(
            'header h1',
            (el) => el.textContent,
        );
        const descriptionP = page.$eval(
            'header span.actor-description',
            (el) => el.textContent,
        );
        const modifiedTimestampP = page.$eval(
            'ul.ActorHeader-stats time',
            (el) => el.getAttribute('datetime'),
        );
        const runCountTextP = page.$eval(
            'ul.ActorHeader-stats > li:nth-of-type(3)',
            (el) => el.textContent,
        );

        const [
            title,
            description,
            modifiedTimestamp,
            runCountText,
        ] = await Promise.all([
            titleP,
            descriptionP,
            modifiedTimestampP,
            runCountTextP,
        ]);

        const modifiedDate = new Date(Number(modifiedTimestamp));
        const runCount = Number(runCountText.match(/[\d,]+/)[0].replace(/,/g, ''));

        return {
            url,
            uniqueIdentifier,
            title,
            description,
            modifiedDate,
            runCount,
        };
    }
}
```


That's it! You can now remove the **Max pages per run** limit, **Save & Run** your task and watch the scraper paginate through all the Actors and then scrape all of their data. After it succeeds, open the **Dataset** tab again and click on **Preview**. You should have a table of all the Actors' details in front of you. If you do, great job! You've successfully scraped Apify Store. And if not, no worries, go through the code examples again, it's probably just a typo.

![Plugging it into the pageFunction](https://raw.githubusercontent.com/apify/actor-scraper/master/docs/img/plugging-it-into-the-pagefunction.webp)

## Downloading the scraped data

You already know the **Dataset** tab of the run console since this is where we've always previewed our data. Notice the row of data formats such as JSON, CSV, and Excel. Below it are options for viewing and downloading the data. Go ahead and try it.

> If you prefer working with an API, you can find the example endpoint under the API tab: **Get dataset items**.

### Clean items

You can view and download your data without modifications, or you can choose to only get **clean** items. Data that aren't cleaned include a record for each `pageFunction` invocation, even if you did not return any results. The record also includes hidden fields such as `#debug`, where you can find a variety of information that can help you with debugging your scrapers.

Clean items, on the other hand, include only the data you returned from the `pageFunction`. If you're only interested in the data you scraped, this format is what you will be using most of the time.

To control this, open the **Advanced options** view on the **Dataset** tab.
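
The difference between the two views can be sketched in a few lines. This is only an illustration of the idea: `toCleanItems` is a hypothetical helper, the sample records are made up, and it assumes that hidden fields are the ones whose names start with `#`, as `#debug` does.

```javascript
// Illustrative only: mimics what "clean" output means, i.e. hidden
// fields (starting with "#") are dropped and empty records are skipped.
function toCleanItems(items) {
    return items
        .map((item) => Object.fromEntries(
            Object.entries(item).filter(([key]) => !key.startsWith('#')),
        ))
        .filter((item) => Object.keys(item).length > 0);
}

const rawItems = [
    // A START page that returned nothing still leaves a record.
    { '#debug': { url: 'https://apify.com/store' } },
    {
        url: 'https://apify.com/apify/web-scraper',
        uniqueIdentifier: 'apify/web-scraper',
        '#debug': { url: 'https://apify.com/apify/web-scraper' },
    },
];

const cleanItems = toCleanItems(rawItems);
console.log(cleanItems.length); // 1
console.log(Object.keys(cleanItems[0])); // [ 'url', 'uniqueIdentifier' ]
```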

## Bonus: Making your code neater

You may have noticed that the `pageFunction` gets quite bulky. To make better sense of your code and have an easier time maintaining or extending your task, feel free to define other functions inside the `pageFunction` that encapsulate all the different logic. You can, for example, define a function for each of the different pages:


```
async function pageFunction(context) {
    switch (context.request.userData.label) {
        case 'START': return handleStart(context);
        case 'DETAIL': return handleDetail(context);
        default: throw new Error('Unknown request label.');
    }

    async function handleStart({ log, page }) {
        log.info('Store opened!');
        let timeout; // undefined
        const buttonSelector = 'div.show-more > button';
        for (;;) {
            log.info('Waiting for the "Show more" button.');
            try {
                // Default timeout first time.
                await page.waitFor(buttonSelector, { timeout });
                // 2 sec timeout after the first.
                timeout = 2000;
            } catch (err) {
                // Ignore the timeout error.
                log.info('Could not find the "Show more button", '
                    + 'we\'ve reached the end.');
                break;
            }
            log.info('Clicking the "Show more" button.');
            await page.click(buttonSelector);
        }
    }

    async function handleDetail({
        request,
        log,
        skipLinks,
        page,
    }) {
        const { url } = request;
        log.info(`Scraping ${url}`);
        await skipLinks();

        // Do some scraping.
        const uniqueIdentifier = url
            .split('/')
            .slice(-2)
            .join('/');

        // Get attributes in parallel to speed up the process.
        const titleP = page.$eval(
            'header h1',
            (el) => el.textContent,
        );
        const descriptionP = page.$eval(
            'header span.actor-description',
            (el) => el.textContent,
        );
        const modifiedTimestampP = page.$eval(
            'ul.ActorHeader-stats time',
            (el) => el.getAttribute('datetime'),
        );
        const runCountTextP = page.$eval(
            'ul.ActorHeader-stats > li:nth-of-type(3)',
            (el) => el.textContent,
        );

        const [
            title,
            description,
            modifiedTimestamp,
            runCountText,
        ] = await Promise.all([
            titleP,
            descriptionP,
            modifiedTimestampP,
            runCountTextP,
        ]);

        const modifiedDate = new Date(Number(modifiedTimestamp));
        const runCount = Number(runCountText.match(/[\d,]+/)[0].replace(/,/g, ''));

        return {
            url,
            uniqueIdentifier,
            title,
            description,
            modifiedDate,
            runCount,
        };
    }
}
```


> If you're confused by the functions being declared below their executions, it's called hoisting and it's a feature of JavaScript. It helps you put what matters on top, if you so desire.
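
A stripped-down version of the same pattern shows hoisting at work. The `route` function and its handlers are hypothetical and only return strings, so the example runs without a scraper.

```javascript
function route(label) {
    // handleStart and handleDetail are declared further down, yet they
    // can be called here because function declarations are hoisted.
    switch (label) {
        case 'START': return handleStart();
        case 'DETAIL': return handleDetail();
        default: throw new Error(`Unknown label: ${label}`);
    }

    function handleStart() {
        return 'handling the start page';
    }

    function handleDetail() {
        return 'handling a detail page';
    }
}

console.log(route('DETAIL')); // handling a detail page
```

Note that this works for function declarations only. A function expression assigned to a `const` cannot be used before the line that defines it.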

## Bonus 2: Using jQuery with Puppeteer Scraper

If you're familiar with the jQuery library (https://jquery.com/), you may have looked at the scraping code and thought that it's unnecessarily complicated. That's probably up to everyone to decide on their own, but the good news is, you can use jQuery with Puppeteer Scraper too.

### Injecting jQuery

To be able to use jQuery, we first need to introduce it to the browser. The `Apify.utils.puppeteer.injectJQuery()` function (https://sdk.apify.com/docs/api/puppeteer#puppeteerinjectjquerypage) will help us with the task.

> Friendly warning: Injecting jQuery into a page may break the page itself, if it expects a specific version of jQuery to be available and you override it with an incompatible one. Be careful.

You can either call this function directly in your `pageFunction`, or you can set up jQuery injection in the **Pre goto function** in the **Input and options** section.


```
async function pageFunction(context) {
    const { Apify, page } = context;
    await Apify.utils.puppeteer.injectJQuery(page);

    // your code ...
}
```



```
async function preGotoFunction({ page, Apify }) {
    await Apify.utils.puppeteer.injectJQuery(page);
}
```


The two implementations are almost equal in effect, but not identical. Depending on the target website, you may see performance differences, or one might work while the other does not.

Let's try refactoring the Bonus 1 version of the `pageFunction` to use jQuery.


```
async function pageFunction(context) {
    switch (context.request.userData.label) {
        case 'START': return handleStart(context);
        case 'DETAIL': return handleDetail(context);
        default: throw new Error(`Unknown label: ${context.request.userData.label}`);
    }

    async function handleStart({ log, page }) {
        log.info('Store opened!');
        let timeout; // undefined
        const buttonSelector = 'div.show-more > button';
        for (;;) {
            log.info('Waiting for the "Show more" button.');
            try {
                await page.waitFor(buttonSelector, { timeout });
                timeout = 2000;
            } catch (err) {
                log.info('Could not find the "Show more button", '
                    + 'we\'ve reached the end.');
                break;
            }
            log.info('Clicking the "Show more" button.');
            await page.click(buttonSelector);
        }
    }

    async function handleDetail(contextInner) {
        const {
            request,
            log,
            skipLinks,
            page,
            Apify,
        } = contextInner;

        // Inject jQuery
        await Apify.utils.puppeteer.injectJQuery(page);

        const { url } = request;
        log.info(`Scraping ${url}`);
        await skipLinks();

        // Do some scraping.
        const uniqueIdentifier = url
            .split('/')
            .slice(-2)
            .join('/');

        // Use jQuery only inside page.evaluate (inside browser)
        const results = await page.evaluate(() => {
            return {
                title: $('header h1').text(),
                description: $('header span.actor-description').text(),
                modifiedDate: new Date(
                    Number(
                        $('ul.ActorHeader-stats time').attr('datetime'),
                    ),
                ).toISOString(),
                runCount: Number(
                    $('ul.ActorHeader-stats > li:nth-of-type(3)')
                        .text()
                        .match(/[\d,]+/)[0]
                        .replace(/,/g, ''),
                ),
            };
        });

        return {
            url,
            uniqueIdentifier,
            // Add results from browser to output
            ...results,
        };
    }
}
```


> There's an important takeaway from the example code. You can only use jQuery in the browser scope, even though you're injecting it outside of the browser. We're using the `page.evaluate()` function (https://pptr.dev/#?product=Puppeteer&show=api-pageevaluatepagefunction-args) to run the script in the context of the browser, and the return value is passed back to Node.js. Keep this in mind.

## Final word

Thank you for reading this whole tutorial! Really! It's important to us that our users have the best information available to them so that they can use Apify effectively. We're glad that you made it all the way here and congratulations on creating your first scraping task. We hope that you liked the tutorial and if there's anything you'd like to ask, join us on Discord (https://discord.gg/jyEM2PRvMU)!

## What's next

* Check out the Apify SDK (https://docs.apify.com/sdk) and its tutorial (https://docs.apify.com/sdk/js/docs/guides/apify-platform) if you'd like to try building your own Actors. It's a bit more complex and involved than writing a `pageFunction`, but it allows you to fine-tune all the details of your scraper to your liking.
* Take a deep dive into Actors (https://docs.apify.com/platform/actors.md), from how they work to publishing (https://docs.apify.com/platform/actors/publishing.md) them in Apify Store, and even making money (https://blog.apify.com/make-regular-passive-income-developing-web-automation-actors-b0392278d085/) on Actors.
* Found out you're not into the coding part but would still like to use Apify Actors? Check out our ready-made solutions in Apify Store (https://apify.com/store) or order a custom solution (https://apify.com/contact-sales) from an Apify-certified developer.

**Learn how to scrape a website using Apify's Puppeteer Scraper. Build an Actor's page function, extract information from a web page and download your data.**

***


---

# Scraping with Web Scraper

This scraping tutorial will go into the nitty-gritty details of extracting data from **https://apify.com/store** using **Web Scraper** (https://apify.com/apify/web-scraper). If you arrived here from the Getting started (https://docs.apify.com/academy/apify-scrapers/getting-started.md) tutorial, great! You are ready to continue where we left off. If you haven't seen Getting started yet, check it out. It will help you learn about Apify and scraping in general and set you up for this tutorial, because this one builds on topics and code examples discussed there.

## Getting to know our tools

In the Getting started (https://docs.apify.com/academy/apify-scrapers/getting-started) tutorial, we've confirmed that the scraper works as expected, so now it's time to add more data to the results.

To do that, we'll be using the jQuery library (https://jquery.com/), because it provides some nice tools and a lot of people familiar with JavaScript already know how to use it.

> Check out the jQuery docs (https://api.jquery.com/) if you're not familiar with it. And if you don't want to use it, that's okay. Everything can be done using pure JavaScript, too.

To add jQuery, all we need to do is turn on **Inject jQuery** under the **Input and options** tab. This will add a `context.jQuery` function that you can use.

Now that that's out of the way, let's open one of the Actor detail pages in the Store, for example the Web Scraper (https://apify.com/apify/web-scraper) page, and use our DevTools-Fu to scrape some data.

## Building our Page function

Before we start, let's do a quick recap of the data we chose to scrape:

1. **URL** - The URL that goes directly to the Actor's detail page.
2. **Unique identifier** - Such as **apify/web-scraper**.
3. **Title** - The title visible in the Actor's detail page.
4. **Description** - The Actor's description.
5. **Last modification date** - When the Actor was last modified.
6. **Number of runs** - How many times the Actor was run.

![Scraping practice](https://raw.githubusercontent.com/apify/actor-scraper/master/docs/img/scraping-practice.webp)

We've already scraped numbers 1 and 2 in the Getting started (https://docs.apify.com/academy/apify-scrapers/getting-started.md) tutorial, so let's get to the next one on the list: title.

### Title

![Title](https://raw.githubusercontent.com/apify/actor-scraper/master/docs/img/title.webp)

By using the element selector tool, we find out that the title is there under an `<h1>` tag, as titles should be. Maybe surprisingly, we find that there are actually two `<h1>` tags on the detail page. This should get us thinking. Is there any parent element that includes our `<h1>` tag, but not the other ones? Yes, there is! A `<header>` element that we can use to select only the heading we're interested in.

> Remember that you can press CTRL+F (CMD+F) in the Elements tab of DevTools to open the search bar where you can quickly search for elements using their selectors. And always make sure to use the DevTools to verify your scraping process and assumptions. It's faster than changing the crawler code all the time.

To get the title, we need to find it using a `header h1` selector, which selects all `<h1>` elements that have a `<header>` ancestor. And as we already know, there's only one.


```
// Using jQuery.
async function pageFunction(context) {
    const { jQuery: $ } = context;

    // ... rest of the code
    return {
        title: $('header h1').text(),
    };
}
```


### Description

Getting the Actor's description is a little more involved, but still pretty straightforward. We cannot search for a `<p>` tag, because there are a lot of them in the page. We need to narrow our search down a little. Using the DevTools, we find that the Actor description is nested within the `<header>` element too, same as the title. Moreover, the actual description is nested inside a `<span>` tag with a class `actor-description`.

![Description](https://raw.githubusercontent.com/apify/actor-scraper/master/docs/img/description.webp)


```
async function pageFunction(context) {
    const { jQuery: $ } = context;

    // ... rest of the code
    return {
        title: $('header h1').text(),
        description: $('header span.actor-description').text(),
    };
}
```


### Modified date

The DevTools tell us that the `modifiedDate` can be found in a `<time>` element.

![Modified date](https://raw.githubusercontent.com/apify/actor-scraper/master/docs/img/modified-date.webp)


```
async function pageFunction(context) {
    const { jQuery: $ } = context;

    // ... rest of the code
    return {
        title: $('header h1').text(),
        description: $('header span.actor-description').text(),
        modifiedDate: new Date(
            Number(
                $('ul.ActorHeader-stats time').attr('datetime'),
            ),
        ),
    };
}
```


It might look a little too complex at first glance, but let us walk you through it. We find the `<time>` element inside `ul.ActorHeader-stats`. Then, we read its `datetime` attribute, because that's where a unix timestamp is stored as a `string`.

But we would much rather see a readable date in our results, not a unix timestamp, so we need to convert it. Unfortunately, the `new Date()` constructor does not treat a numeric `string` as a timestamp (it tries to parse it as a date string instead), so we cast the `string` to a `number` using the `Number()` function before actually calling `new Date()`. Phew!
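To see the cast in isolation, here is a minimal sketch you can run in Node.js or the browser console. The `timestampToDate` helper and the sample value are made up for illustration and are not part of the scraper.

```javascript
// Hypothetical helper: converts a Unix timestamp in milliseconds,
// stored as a string (like the `datetime` attribute), to a Date.
function timestampToDate(timestampString) {
    const millis = Number(timestampString);
    if (!Number.isFinite(millis)) {
        throw new Error(`Not a numeric timestamp: ${timestampString}`);
    }
    return new Date(millis);
}

// Sample value made up for illustration.
const sampleTimestamp = '1546300800000';

// Casting to a number first treats the value as milliseconds since the epoch.
console.log(timestampToDate(sampleTimestamp).toISOString());
```

Calling `.toISOString()` on the result gives a readable, timezone-independent string, which is handy when you want the date stored as plain text.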

### Run count

And so we're finishing up with the `runCount`. There's no specific element like `<time>`, so we need to create a complex selector and then do a transformation on the result.


```
async function pageFunction(context) {
    const { jQuery: $ } = context;

    // ... rest of the code
    return {
        title: $('header h1').text(),
        description: $('header span.actor-description').text(),
        modifiedDate: new Date(
            Number(
                $('ul.ActorHeader-stats time').attr('datetime'),
            ),
        ),
        runCount: Number(
            $('ul.ActorHeader-stats > li:nth-of-type(3)')
                .text()
                .match(/[\d,]+/)[0]
                .replace(/,/g, ''),
        ),
    };
}
```


The `ul.ActorHeader-stats > li:nth-of-type(3)` selector looks complicated, but it only says that we're looking for a `<ul>` element with the `ActorHeader-stats` class and, among its direct children, the third `<li>` element. We grab its text, but we're only interested in the number of runs. We parse the number out using a regular expression, but its type is still a `string`, so we finally convert the result to a `number` by wrapping it with a `Number()` call.

> The numbers are formatted with commas as thousands separators (e.g. `'1,234,567'`), so to extract the value, we first use the regular expression `/[\d,]+/`, which searches for consecutive digit or comma characters. Then we extract the match via `.match(/[\d,]+/)[0]` and finally remove all the commas by calling `.replace(/,/g, '')`. We need to use `/,/g` with the global modifier to support large numbers with multiple separators; without it, we would replace only the very first occurrence.
>
> This will give us a string (e.g. `'1234567'`) that can be converted via the `Number` function.
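The parsing steps can be tried outside the scraper with a small sketch like the one below. The `parseRunCount` helper is hypothetical, and it deliberately differs from the tutorial code in two ways: the pattern must start with a digit, and a failed match returns `null` instead of throwing.

```javascript
// Hypothetical helper: extracts the first number (with optional thousands
// separators) from a piece of text. Returns null when no digits are found,
// whereas `.match(...)[0]` would throw on a failed match.
function parseRunCount(text) {
    const match = text.match(/\d[\d,]*/);
    if (!match) return null;
    // Remove every comma (global flag), then convert the string to a number.
    return Number(match[0].replace(/,/g, ''));
}

console.log(parseRunCount('1,234,567 runs')); // 1234567
console.log(parseRunCount('no runs yet')); // null
```

Returning `null` lets the caller decide what to store when the page layout changes and the number is missing.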

### Wrapping it up

And there we have it! All the data we needed in a single object. For the sake of completeness, let's add the properties we parsed from the URL earlier and we're good to go.


```
async function pageFunction(context) {
    const { request, jQuery: $ } = context;
    const { url } = request;

    // ... rest of the code

    const uniqueIdentifier = url.split('/').slice(-2).join('/');

    return {
        url,
        uniqueIdentifier,
        title: $('header h1').text(),
        description: $('header span.actor-description').text(),
        modifiedDate: new Date(
            Number(
                $('ul.ActorHeader-stats time').attr('datetime'),
            ),
        ),
        runCount: Number(
            $('ul.ActorHeader-stats > li:nth-of-type(3)')
                .text()
                .match(/[\d,]+/)[0]
                .replace(/,/g, ''),
        ),
    };
}
```
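If you'd like to check the `uniqueIdentifier` expression on its own, the following sketch wraps it in a function. The `getUniqueIdentifier` name is only an illustration and is not part of the scraper.

```javascript
// Hypothetical helper wrapping the expression used in the pageFunction:
// keep the last two path segments of the URL.
function getUniqueIdentifier(url) {
    return url.split('/').slice(-2).join('/');
}

console.log(getUniqueIdentifier('https://apify.com/apify/web-scraper')); // apify/web-scraper
```

Note that this simple approach assumes the URL has no trailing slash or query string; either one would end up in the result.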


All we need to do now is add this to our `pageFunction`:


```
async function pageFunction(context) {
    // use jQuery as $
    const { request, log, skipLinks, jQuery: $ } = context;

    if (request.userData.label === 'START') {
        log.info('Store opened!');
        // Do some stuff later.
    }
    if (request.userData.label === 'DETAIL') {
        const { url } = request;
        log.info(`Scraping ${url}`);
        await skipLinks();

        // Do some scraping.
        const uniqueIdentifier = url
            .split('/')
            .slice(-2)
            .join('/');

        return {
            url,
            uniqueIdentifier,
            title: $('header h1').text(),
            description: $('header span.actor-description').text(),
            modifiedDate: new Date(
                Number(
                    $('ul.ActorHeader-stats time').attr('datetime'),
                ),
            ),
            runCount: Number(
                $('ul.ActorHeader-stats > li:nth-of-type(3)')
                    .text()
                    .match(/[\d,]+/)[0]
                    .replace(/,/g, ''),
            ),
        };
    }
}
```


### Test run

As always, try hitting that **Save & Run** button and visit the **Dataset** preview of clean items. You should see a nice table of all the attributes correctly scraped. You nailed it!

## Pagination

Pagination is a term that represents "going to the next page of results". You may have noticed that we did not actually scrape all the Actors, just the first page of results. That's because to load the rest of the Actors, one needs to click the **Show more** button at the very bottom of the list. This is pagination.

> This is a typical form of JavaScript pagination, sometimes called infinite scroll. Other pages may use links that take you to the next page. If you encounter those, make a **Pseudo URL** for those links and they will be automatically enqueued to the request queue. Use a label to let the scraper know what kind of URL it's processing.

### Waiting for dynamic content

Before we talk about paginating, we need to have a quick look at dynamic content. Since Apify Store is a JavaScript application (a popular approach), the button might not exist in the page when the scraper runs the `pageFunction`.

How is this possible? Because the scraper only waits for the page to load its HTML before executing the `pageFunction`. If there's additional JavaScript that modifies the DOM afterwards, the `pageFunction` may execute before this JavaScript has had the time to run.

At first, you may think that the scraper is broken, but it just cannot wait for all the JavaScript in the page to finish executing. For a lot of pages, there's always some JavaScript executing or some network requests being made. It would never stop waiting. It is therefore up to you, the programmer, to wait for the elements you need.

#### The `context.waitFor()` function

`waitFor()` is a function that's available on the `context` object passed to the `pageFunction` and helps you with, well, waiting for stuff. It accepts either a number of milliseconds to wait, a selector to await in the page, or a function to execute. It will stop waiting once the time elapses, the selector appears or the provided function returns `true`.


```
// Waits for 2 seconds.
await waitFor(2000);
// Waits until an element with id "my-id" appears
// in the page.
await waitFor('#my-id');
// Waits until a "myObject" variable appears
// on the window object.
await waitFor(() => !!window.myObject);
```


The selector may never be found and the function might never return `true`, so the `waitFor()` function also has a timeout. The default is `20` seconds. You can override it by providing an options object as the second parameter, with a `timeoutMillis` property.


```
await waitFor('.bad-class', { timeoutMillis: 5000 });
```


With those tools, you should be able to handle any dynamic content the website throws at you.
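To make the timeout behavior more concrete, here is a simplified polling sketch. It is not Web Scraper's actual implementation of `waitFor()`; the `waitUntil` name and the `pollIntervalMillis` option are assumptions made for this example.

```javascript
// Simplified, hypothetical stand-in for a "wait for a condition" helper.
// It polls `predicate` until it returns a truthy value or the timeout elapses.
async function waitUntil(predicate, { timeoutMillis = 20000, pollIntervalMillis = 50 } = {}) {
    const deadline = Date.now() + timeoutMillis;
    for (;;) {
        if (await predicate()) return;
        if (Date.now() >= deadline) {
            throw new Error(`Timed out after ${timeoutMillis} ms.`);
        }
        // Sleep a little before checking again.
        await new Promise((resolve) => setTimeout(resolve, pollIntervalMillis));
    }
}

// Usage: the condition becomes true after roughly 100 ms.
const startedAt = Date.now();
waitUntil(() => Date.now() - startedAt >= 100, { timeoutMillis: 1000 })
    .then(() => console.log('Condition met.'))
    .catch((err) => console.log(err.message));
```

In this sketch, passing `{ timeoutMillis: undefined }` falls back to the default through JavaScript's default parameter values.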

### How to paginate

After going through the theory, let's design the algorithm:

1. Wait for the **Show more** button.

2. Click it.

3. Is there another **Show more** button?

   * Yes? Repeat from 1. (loop)
   * No? We're done. We have all the Actors.

#### Waiting for the button

Before we can wait for the button, we need to know its unique selector. A quick look in the DevTools tells us that the button's class is some weird randomly generated string, but fortunately, there's an enclosing `<div>` with a class of `show-more`. Great! Our unique selector:


```
div.show-more > button
```


> Don't forget to confirm our assumption in the DevTools finder tool (CTRL/CMD + F).

![Waiting for the button](https://raw.githubusercontent.com/apify/actor-scraper/master/docs/img/waiting-for-the-button.webp)

Now that we know what to wait for, we plug it into the `waitFor()` function.


```
await waitFor('div.show-more > button');
```


#### Clicking the button

We have a unique selector for the button and we know that it's already rendered in the page. Clicking it is a piece of cake. We'll use jQuery again, but feel free to use plain JavaScript, it works the same.


```
$('div.show-more > button').click();
```


This will show the next page of Actors.

#### Repeating the process

We've shown two function calls, but how do we make this work together in the `pageFunction`?


```
async function pageFunction(context) {

    // ...

    let timeoutMillis; // undefined
    const buttonSelector = 'div.show-more > button';
    for (;;) {
        log.info('Waiting for the "Show more" button.');
        try {
            // Default timeout first time.
            await waitFor(buttonSelector, { timeoutMillis });
            // 2 sec timeout after the first.
            timeoutMillis = 2000;
        } catch (err) {
            // Ignore the timeout error.
            log.info('Could not find the "Show more button", '
                + 'we\'ve reached the end.');
            break;
        }
        log.info('Clicking the "Show more" button.');
        $(buttonSelector).click();
    }

    // ...

}
```


We want to run this until the `waitFor()` function throws, so that's why we use an infinite `for (;;)` loop. We're also not interested in the error, because we're expecting it, so we ignore it and print a log message instead.

You might be wondering what's up with the `timeoutMillis`. Well, for the first page load, we want to wait longer, so that all the page's JavaScript has had a chance to execute, but for the other iterations, the JavaScript is already loaded and we're waiting for the page to re-render so waiting for `2` seconds is enough to confirm that the button is not there. We don't want to stall the scraper for `20` seconds just to make sure that there's no button.
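The loop logic can also be exercised without a browser if the waiting and clicking functions are passed in as parameters. The sketch below is only an illustration: `clickUntilGone`, `fakeWaitFor` and `fakeClick` are hypothetical names, and the fake page simply counts down the remaining buttons.

```javascript
// Hypothetical helper: keeps clicking while `waitFor` finds the button.
// `waitFor` and `click` are injected so the loop can run without a browser.
async function clickUntilGone({ waitFor, click, selector, laterTimeoutMillis = 2000 }) {
    let timeoutMillis; // undefined: use the default timeout the first time
    let clicks = 0;
    for (;;) {
        try {
            await waitFor(selector, { timeoutMillis });
            // Shorter timeout for every iteration after the first.
            timeoutMillis = laterTimeoutMillis;
        } catch (err) {
            // The expected timeout: no more button, so we are done.
            return clicks;
        }
        await click(selector);
        clicks++;
    }
}

// Fake page with three "Show more" buttons left to click.
let remainingButtons = 3;
const fakeWaitFor = async () => {
    if (remainingButtons === 0) throw new Error('Timeout');
};
const fakeClick = async () => { remainingButtons--; };

clickUntilGone({ waitFor: fakeWaitFor, click: fakeClick, selector: 'div.show-more > button' })
    .then((clicks) => console.log(`Clicked ${clicks} times.`)); // Clicked 3 times.
```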

### Plugging it into the pageFunction

We've got the general algorithm ready, so all that's left is to integrate it into our earlier `pageFunction`. Remember the `// Do some stuff later` comment? Let's replace it. And don't forget to destructure the `waitFor()` function on the first line.


```
async function pageFunction(context) {
    const {
        request,
        log,
        skipLinks,
        jQuery: $,
        waitFor,
    } = context;

    if (request.userData.label === 'START') {
        log.info('Store opened!');
        let timeoutMillis; // undefined
        const buttonSelector = 'div.show-more > button';
        for (;;) {
            log.info('Waiting for the "Show more" button.');
            try {
                // Default timeout first time.
                await waitFor(buttonSelector, { timeoutMillis });
                // 2 sec timeout after the first.
                timeoutMillis = 2000;
            } catch (err) {
                // Ignore the timeout error.
                log.info('Could not find the "Show more button", '
                    + 'we\'ve reached the end.');
                break;
            }
            log.info('Clicking the "Show more" button.');
            $(buttonSelector).click();
        }
    }
    if (request.userData.label === 'DETAIL') {
        const { url } = request;
        log.info(`Scraping ${url}`);
        await skipLinks();

        // Do some scraping.
        const uniqueIdentifier = url
            .split('/')
            .slice(-2)
            .join('/');

        return {
            url,
            uniqueIdentifier,
            title: $('header h1').text(),
            description: $('header span.actor-description').text(),
            modifiedDate: new Date(
                Number(
                    $('ul.ActorHeader-stats time').attr('datetime'),
                ),
            ),
            runCount: Number(
                $('ul.ActorHeader-stats > li:nth-of-type(3)')
                    .text()
                    .match(/[\d,]+/)[0]
                    .replace(/,/g, ''),
            ),
        };
    }
}
```


That's it! You can now remove the **Max pages per run** limit, **Save & Run** your task and watch the scraper paginate through all the Actors and then scrape all of their data. After it succeeds, open the **Dataset** tab again and click on **Preview**. You should have a table of all the Actors' details in front of you. If you do, great job! You've successfully scraped Apify Store. And if not, no worries, go through the code examples again, it's probably just a typo.

![Plugging it into the pageFunction](https://raw.githubusercontent.com/apify/actor-scraper/master/docs/img/plugging-it-into-the-pagefunction.webp)

## Downloading the scraped data

You already know the **Dataset** tab of the run console since this is where we've always previewed our data. Notice the row of data formats such as JSON, CSV, and Excel. Below it are options for viewing and downloading the data. Go ahead and try it.

> If you prefer working with an API, you can find the example endpoint under the API tab: **Get dataset items**.
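As a rough illustration of the API route, the sketch below builds a URL for the **Get dataset items** endpoint. The `buildDatasetItemsUrl` helper and the `YOUR_DATASET_ID` placeholder are assumptions for this example; copy the exact URL for your run from the API tab, and note that private datasets also need an API token.

```javascript
// Hypothetical helper: builds a URL for the "Get dataset items" endpoint.
// The dataset ID used below is a placeholder, not a real dataset.
function buildDatasetItemsUrl(datasetId, { format = 'json', clean = true } = {}) {
    const url = new URL(`https://api.apify.com/v2/datasets/${encodeURIComponent(datasetId)}/items`);
    url.searchParams.set('format', format);
    // `clean=true` asks for clean items only.
    if (clean) url.searchParams.set('clean', 'true');
    return url.toString();
}

console.log(buildDatasetItemsUrl('YOUR_DATASET_ID', { format: 'csv' }));
// https://api.apify.com/v2/datasets/YOUR_DATASET_ID/items?format=csv&clean=true
```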

### Clean items

You can view and download your data without modifications, or you can choose to only get **clean** items. Data that aren't cleaned include a record for each `pageFunction` invocation, even if you did not return any results. The record also includes hidden fields such as `#debug`, where you can find a variety of information that can help you with debugging your scrapers.

Clean items, on the other hand, include only the data you returned from the `pageFunction`. If you're only interested in the data you scraped, this format is what you will be using most of the time.

To control this, open the **Advanced options** view on the **Dataset** tab.
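To picture the difference between the two views, here is a rough sketch of what cleaning could look like. It assumes that hidden fields are the ones whose names start with `#` (like `#debug`) and that records left empty are dropped; the platform's actual logic may do more, and the sample records are made up.

```javascript
// Rough illustration of "clean" items: drop hidden fields (names starting
// with "#"), then drop records that end up empty.
function cleanItems(items) {
    return items
        .map((item) => Object.fromEntries(
            Object.entries(item).filter(([key]) => !key.startsWith('#')),
        ))
        .filter((item) => Object.keys(item).length > 0);
}

// Made-up sample records.
const rawItems = [
    { url: 'https://apify.com/apify/web-scraper', title: 'Web Scraper', '#debug': { statusCode: 200 } },
    { '#debug': { statusCode: 200 } }, // the pageFunction returned nothing here
];

console.log(cleanItems(rawItems));
```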

## Bonus: Making your code neater

You may have noticed that the `pageFunction` gets quite bulky. To make better sense of your code and have an easier time maintaining or extending your task, feel free to define other functions inside the `pageFunction` that encapsulate all the different logic. You can, for example, define a function for each of the different pages:


```
async function pageFunction(context) {
    switch (context.request.userData.label) {
        case 'START': return handleStart(context);
        case 'DETAIL': return handleDetail(context);
        default: throw new Error('Unknown request label.');
    }

    async function handleStart({ log, waitFor, jQuery: $ }) {
        log.info('Store opened!');
        let timeoutMillis; // undefined
        const buttonSelector = 'div.show-more > button';
        for (;;) {
            log.info('Waiting for the "Show more" button.');
            try {
                // Default timeout first time.
                await waitFor(buttonSelector, { timeoutMillis });
                // 2 sec timeout after the first.
                timeoutMillis = 2000;
            } catch (err) {
                // Ignore the timeout error.
                log.info('Could not find the "Show more button", '
                    + 'we\'ve reached the end.');
                break;
            }
            log.info('Clicking the "Show more" button.');
            $(buttonSelector).click();
        }
    }

    async function handleDetail({
        request,
        log,
        skipLinks,
        jQuery: $,
    }) {
        const { url } = request;
        log.info(`Scraping ${url}`);
        await skipLinks();

        // Do some scraping.
        const uniqueIdentifier = url
            .split('/')
            .slice(-2)
            .join('/');

        return {
            url,
            uniqueIdentifier,
            title: $('header h1').text(),
            description: $('header span.actor-description').text(),
            modifiedDate: new Date(
                Number(
                    $('ul.ActorHeader-stats time').attr('datetime'),
                ),
            ),
            runCount: Number(
                $('ul.ActorHeader-stats > li:nth-of-type(3)')
                    .text()
                    .match(/[\d,]+/)[0]
                    .replace(/,/g, ''),
            ),
        };
    }
}
```


> If you're confused by the functions being declared below their executions, it's called hoisting and it's a feature of JavaScript. It helps you put what matters on top, if you so desire.
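A minimal, standalone sketch of hoisting is shown below. The `main` and `greet` names are made up for the example.

```javascript
function main() {
    // `greet` is called before its declaration appears in the source.
    return greet('Apify');

    // Function declarations are hoisted to the top of the enclosing scope,
    // so this still works even though it comes after the `return`.
    function greet(name) {
        return `Hello, ${name}!`;
    }
}

console.log(main()); // Hello, Apify!
```

This applies to function declarations only. A function assigned to a `const` or `let` variable cannot be called before the line that defines it.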

## Final word

Thank you for reading this whole tutorial! Really! It's important to us that our users have the best information available to them so that they can use Apify effectively. We're glad that you made it all the way here and congratulations on creating your first scraping task. We hope that you liked the tutorial and if there's anything you'd like to ask, join us on Discord (https://discord.gg/jyEM2PRvMU)!

## What's next

* Check out the Apify SDK (https://docs.apify.com/sdk) and its tutorial (https://docs.apify.com/sdk/js/docs/guides/apify-platform) if you'd like to try building your own Actors. It's a bit more complex and involved than writing a `pageFunction`, but it allows you to fine-tune all the details of your scraper to your liking.
* Take a deep dive into Actors (https://docs.apify.com/platform/actors.md), from how they work to publishing (https://docs.apify.com/platform/actors/publishing.md) them in Apify Store, and even making money (https://blog.apify.com/make-regular-passive-income-developing-web-automation-actors-b0392278d085/) on Actors.
* Found out you're not into the coding part but would still like to use Apify Actors? Check out our ready-made solutions in Apify Store (https://apify.com/store) or order a custom solution (https://apify.com/contact-sales) from an Apify-certified developer.

**Learn how to scrape a website using Apify's Web Scraper. Build an Actor's page function, extract information from a web page and download your data.**

***


---

# Validate your Actor idea

Before investing time into building an Actor, validate that people actually need it. This guide shows you how to assess market demand using free tools and research techniques.

## Assess your motivation

Ask yourself: *Do you want to build this?*

You'll work on this Actor for a long time. The best Actors come from developers who genuinely care about the problem they're solving. You don't need to be obsessed, but you should feel excited. That enthusiasm carries you through challenges and shows in your work.

## Estimate demand with SEO data

Check if people are searching for solutions like yours. If your idea aligns with popular search queries, you have a built-in user base.

### Keyword demand

Search for terms related to your Actor's function. If you're building a Reddit sentiment analysis scraper, check volume for phrases like *Reddit data extractor* or *analyze Reddit comments tool*.

Use free tools:

* Google Keyword Planner (https://business.google.com/en-all/ad-tools/keyword-planner/)
* WhatsMySERP (https://chromewebstore.google.com/detail/whatsmyserp/chbmoagfhnkggnhbjpoonnmhnpjdjdod) Chrome extension
* Keywords Everywhere (https://keywordseverywhere.com/) (paid)

High search volume or multiple related terms indicate solid demand. Low or zero searches mean a very niche market, which isn't bad, but you'll rely more on direct marketing.

### Google autocomplete and related searches

Type your core keywords into Google and note the suggestions. Typing *scrape Amazon* might show *scrape Amazon reviews* or *Amazon price tracker*, confirming what people actually want.

### SEO difficulty and content gaps

Examine current search results. Few quality results for a query like *download data from \[obscure site]* indicates a content gap your Actor can fill.

Many results or ads for *Instagram scraper* means the market is proven but competitive. You'll need to differentiate.

Check keyword difficulty and domain authority. If difficulty is 70+ and top pages have 80+ domain authority with thousands of backlinks—and Apify already has an official Actor with 100,000+ users—you can't compete directly. Find an adjacent angle or specialization.

## Analyze Google Trends

Google Trends shows if interest in your idea is rising or falling. Declining trends are red flags. If searches dropped 90% over 12+ months (like *Clubhouse scraper* since 2021), that market has moved on.

Growth velocity matters more than current volume. A keyword growing from 10 to 100 monthly searches over 12 months shows exploding demand. Jump in early, before competition heats up.
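If you track search volumes in a spreadsheet or script, the growth and decline figures mentioned here come down to a relative change. The sketch below is a hypothetical helper, and the second pair of volumes is made up to show a 90% drop.

```javascript
// Hypothetical helper: relative change between two monthly search volumes,
// expressed as a percentage of the starting volume.
function percentChange(before, after) {
    if (before <= 0) throw new Error('`before` must be positive.');
    return ((after - before) * 100) / before;
}

console.log(percentChange(10, 100)); // 900
console.log(percentChange(1000, 100)); // -90
```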

Watch for spikes. Sudden jumps from media coverage or viral moments usually don't mean sustainable demand.

## Research community discussions

Beyond SEO data, go where your potential users are. Browse Reddit, Hacker News, Stack Overflow, X (Twitter), Discord, and Facebook groups. What problems are people discussing? What tools do they wish existed?

Document your findings. Note quotes and recurring themes like *Multiple marketers on Reddit want easy competitor pricing tracking—no existing solution mentioned*. These insights complement your SEO data and help you speak your users' language.

Zero discussion across multiple platforms over 4+ weeks means either no one cares about the problem or they've already solved it.

### Reddit

Search relevant subreddits (r/webscraping, r/datascience, r/SEO, r/marketing, or industry-specific ones) for questions like *How can I extract \[data] from \[site]?* or *I wish there was a tool to do X*. Multiple people independently asking for the same solution is strong validation.

Use the `site:` parameter in Google to search for relevant threads:


```
site:reddit.com extracting data from LinkedIn
```


You can also use tools like F5Bot (https://f5bot.com/) or GummySearch (https://gummysearch.com/).

### Q\&A forums and Stack Overflow

Look for questions about doing the task manually. If thinking about a LinkedIn scraper, check Stack Overflow for questions like *How can I scrape LinkedIn profiles?* Frequent questions or upvotes indicate many people trying to solve it without a good tool—an opportunity for your Actor.

Use the `site:` parameter:


```
site:stackoverflow.com extracting data from LinkedIn
```


### X and social media

Search keywords on X, LinkedIn, or other social media for professionals asking for recommendations like *Does anyone know a tool to monitor news about \[topic]?*

Run quick polls or ask your followers if they'd use a tool that does XYZ. A few positive responses validate your idea. Silence means rethink your value proposition. Engaging this way is early marketing.

Use the `site:` parameter:


```
site:x.com extracting data from LinkedIn
```


### Hacker News and niche forums

Platforms like Hacker News (https://news.ycombinator.com/) often have discussions on tech pain points and new tool launches. Search for keywords like *scrape Airbnb data* to see if people have shown interest or if someone launched a similar tool and what the reaction was.

Use the `site:` parameter:


```
site:news.ycombinator.com extracting data from LinkedIn
```


### Look for spending signals

Current spending patterns are the strongest signal. When users mention "currently paying $X/month for \[existing tool] but..." or "upgraded from free to paid plan because..." or specific competitor pricing, they are proven buyers.

You can also engage in communities. Answer related questions, share knowledge, build reputation. Mention your Actor idea casually where relevant: "I'm building a tool to solve exactly this, would you use it?" Track responses. Positive responses with questions about pricing or features mean genuine interest.

## Analyze GitHub repositories

Star counts signal market demand. Scrapy (https://github.com/scrapy/scrapy) has 58,000+ stars and Crawlee (https://github.com/apify/crawlee) has 20,000+, so demand for web scraping is validated. Use Star History (https://www.star-history.com/) to check if stars are rising (growing momentum) or flat.

Issue analysis reveals pain points your Actor could solve. High issue counts with active responses indicate healthy, used projects. Open issues with themes like *JavaScript rendering problems* or *CAPTCHA bypass needed* show gaps you can fill. Issues with 10+ upvotes mean multiple users face the same problem.

Fork and commit activity shows developers actively work with the technology. High fork-to-star ratios mean people are building extensions (evidence of real usage). Recent commits (within 30 days) indicate active maintenance and a healthy project. No commits for 6+ months suggests declining interest.

## Review Product Hunt launches

Study successful automation tool launches from the past 12-24 months on Product Hunt. Filter by *Browser Automation* and *Automation tools*, then sort by upvotes. Note which taglines, value propositions, and features resonated. Products with 500+ upvotes validated something—figure out what worked.

## Research Apify Store

Apify Store shows transparent competitive intelligence most marketplaces hide. Every Actor displays monthly users, ratings, pricing, and last updates, a data goldmine for what works and what doesn't.

Search your use case or segment thoroughly. List relevant Actors with their metrics: monthly users, ratings, pricing, last update, and creator. Create a feature comparison matrix. Analyze top performers' READMEs, documentation quality, and issues.

Review competitor issues tabs closely. High-quality READMEs with examples and clear value propositions perform better in Store search. Issues reveal unresolved pain points from actual users. If competitors have 20+ open issues with repeated themes, that's your differentiation roadmap.

### Assess market saturation

* 10-30 Actors: healthy competition (market validated, you need differentiation)
* 50+ Actors: saturated (need obvious gaps)
* 1-5 Actors: blue ocean or unproven demand (validate carefully)

If the market has 50+ Actors with strong leaders (Apify-maintained with 50,000+ users) and you can't articulate clear differentiation, pivot. If you spot feature gaps or underserved niches, continue.
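The thresholds above can be written down as a small helper, which is handy when you tally competitors for several niches in a spreadsheet or script. This is only a sketch: the function name `assessSaturation` is hypothetical, and counts that fall between the listed ranges (6-9 and 31-49) are not covered by the guidance, so the sketch reports them as `unclassified` rather than guessing.

```javascript
// Hypothetical helper that maps a competitor count to the labels used above.
// Counts between the documented ranges (6-9, 31-49) are deliberately left
// unclassified because the guidance does not say how to treat them.
function assessSaturation(actorCount) {
    if (actorCount >= 50) return 'saturated';
    if (actorCount >= 10 && actorCount <= 30) return 'healthy competition';
    if (actorCount >= 1 && actorCount <= 5) return 'blue ocean or unproven demand';
    return 'unclassified';
}
```

A label is a starting point for judgment, not a verdict: a saturated market with obvious feature gaps can still be worth entering.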

## Scan the broader market

Do a general Google search for tools or services that solve your problem. Your competition might not be another Actor—it could be a SaaS tool or API. If your idea is *monitor website uptime and screenshot changes*, established services probably exist.

Note direct competitors: How do they price it? What audience do they target? Are users satisfied or complaining? This validates that people pay for the service and reveals gaps you can fill.

Understanding the competition helps you refine your unique value—whether that's lower cost, better features, or targeting an underserved niche.

No existing solutions? Ask why. You might have found an untapped need, or it's a red flag (too difficult to implement, or the target website aggressively blocks scraping). Use your judgment.

## Get feedback from potential users

Reach out to people who match your target user profile. Building a real estate data Actor? Contact real estate analysts or agents (LinkedIn works well) and ask if a tool that does X would help them. Keep it informal—describe the problem you're solving and ask if they'd use or pay for it.

Direct feedback helps you:

* Validate your assumptions
* Understand pricing expectations
* Identify must-have features
* Refine your value proposition

Track responses carefully. Enthusiasm with specific questions about features or pricing indicates genuine interest. Generic "sounds interesting" responses mean keep validating.


---

# Find ideas for new Actors

Learn what kinds of software tools are suitable to be packaged and published as Actors on Apify, and where you can find inspiration for what to build.

***

## What can you build as an Actor

Actors (https://docs.apify.com/platform/actors) are a new concept for building serverless micro-apps, which are easy to develop, share, integrate, and build upon.

They are useful for backend automation jobs, which users set up, integrate into their workflow, and let run in the background, rather than consumer-facing applications that users need to interact with.

Actors can run in two modes:

* In *batch mode*, they take a well-defined input, perform a job, and produce a well-defined output. This is useful for longer-running operations, such as web crawling or data processing.
* In *standby mode*, they run as a web server at a specific public URL. This is useful for request-response style applications, such as APIs or MCP servers.

### Web scrapers and crawlers

This is the most common type of Actor on https://apify.com/store. These Actors navigate websites, collect information from web pages, and store structured data in datasets for further processing.

Examples:

* **Website-specific scrapers** (https://apify.com/junglee/amazon-crawler, https://apify.com/curious_coder/linkedin-profile-scraper)
* **Search engines** (https://apify.com/apify/google-search-scraper, https://apify.com/curious_coder/bing-search-scraper)
* **Social media** (https://apify.com/apidojo/twitter-scraper-lite, https://apify.com/apify/instagram-scraper)
* **E-commerce data** (https://apify.com/autofacts/shopify, https://apify.com/dtrungtin/ebay-items-scraper)
* **General-purpose crawlers** (https://apify.com/apify/web-scraper, https://apify.com/apify/website-content-crawler)

### SaaS API wrappers

These Actors wrap existing SaaS services as Actors to make them accessible through the Apify platform and its many integrations.

Examples:

* https://apify.com/apify/openrouter
* https://apify.com/parsera-labs/parsera
* https://apify.com/apify/super-scraper-api

### Open-source libraries

Many open-source automation or data processing tools do not have a presence in the cloud, and need to be installed locally in "just five easy steps". Wrap those tools as Actors and make it easy for users to try and integrate those tools.

Examples:

* https://apify.com/misceres/sherlock
* https://apify.com/vancura/docling
* https://apify.com/snshn/monolith
* https://apify.com/janbuchar/crawl4ai

For inspiration, check out the https://apify.com/store/categories/open-source in Apify Store, or the following list:

GitHub projects potentially suitable for turning into Actors

* https://github.com/bytedance/Dolphin
* https://github.com/google/langextract
* https://github.com/virattt/ai-hedge-fund
* https://github.com/jamesturk/scrapeghost/
* https://github.com/idosal/git-mcp
* https://github.com/browser-use/browser-use
* https://github.com/browserbase/stagehand
* https://github.com/BuilderIO/gpt-crawler
* https://github.com/errata-ai/vale
* https://github.com/scrapybara/scrapybara-demos
* https://github.com/David-patrick-chuks/Riona-AI-Agent
* https://github.com/projectdiscovery/katana
* https://github.com/exa-labs/company-researcher
* https://github.com/Janix-ai/mcp-validator
* https://github.com/JoshuaC215/agent-service-toolkit
* https://github.com/dequelabs/axe-core
* https://github.com/janreges/siteone-crawler
* https://github.com/eugeneyan/news-agents
* https://github.com/askui/askui
* https://github.com/Shubhamsaboo/awesome-llm-apps
* https://github.com/TheAgenticAI/TheAgenticBrowser
* https://github.com/zcaceres/markdownify-mcp

Open Source Fair Share

Developers of open-source Actors can earn passive affiliate income through Apify's https://apify.com/partners/open-source-fair-share program to help them support their projects.

### MCP servers and tools for AI

The Model Context Protocol (MCP, https://modelcontextprotocol.io/docs/getting-started/intro) lets AI agents interact with external tools and data sources. Many MCP servers are still stand-alone packages that need to be installed locally, which is both inefficient and insecure, or require an external service account. Publishing these packages as Actors makes the MCP servers remote and accessible through the Apify platform and ecosystem, including the new agentic payments protocols.

Examples:

* https://apify.com/jiri.spilka/playwright-mcp-server
* https://apify.com/mcp-servers/browserbase-mcp-server
* https://apify.com/agentify/firecrawl-mcp-server
* https://apify.com/agentify/brave-search-mcp-server

For more inspiration, check out the https://apify.com/store/categories/mcp-servers in Apify Store.

### AI agents

Build Actors that use LLMs to perform complex tasks autonomously. These Actors can navigate websites, make decisions, and complete multistep workflows.

Secure execution

Actors are cloud-based sandboxes that can securely run any AI-generated code.

For inspiration, check out the https://apify.com/store/categories/agents in Apify Store.

### Other

Any repetitive job matching the following criteria might be suitable for turning into an Actor:

* The job is best run in the background in the cloud and then forgotten.
* The task is isolated and can be described and delegated to another person.
* There are at least a few hundred people in the world dealing with this problem.

If you look closely, you'll start seeing opportunities for new Actors everywhere. Be creative!

## Use the Actor ideas page

The https://apify.com/ideas page is where you can find inspiration for new Actors sourced from the Apify community.

### Browse and claim ideas

1. *Visit* https://apify.com/ideas to find ideas that interest you. Look for ideas that align with your skills.

2. *Select an Actor idea*: Review the details and requirements. Check the status—if it's marked **Open to develop**, you can start building.

3. *Build your Actor*: Develop your Actor based on the idea. You don't need to notify Apify during development.

4. *Prepare for launch*: Ensure your Actor meets quality standards and has a comprehensive README with installation instructions, usage details, and examples.

5. *Publish your Actor*: Deploy your Actor on Apify Store and make it live.

6. *Claim the idea*: After publishing, email mailto:ideas@apify.com with your Actor URL and the original idea. Apify will tag the idea as **Completed** and link it to your Actor.

   1. To claim an idea, ensure your Actor is functional, README contains relevant information, and your Actor closely aligns with the original idea.

7. *Monitor and optimize*: Track your Actor's performance and user feedback. Make improvements to keep your Actor current.

#### Multiple developers for one idea

Apify Store can host multiple Actors with similar functions. However, the "first come, first served" rule applies—the first developer to claim an idea receives the **Completed** tag and a link from the Actor ideas page.

Competition motivates developers to improve the code. You can still build the Actor, but differentiate with a unique set of features.

### Submit your own ideas

The Ideas page is also where you contribute concepts to drive innovation in the community.

Here's how you can contribute too:

* *Submit ideas*: Share Actor concepts through the https://apify.typeform.com/to/BNON8poB#source=ideas. Provide clear details about what the tool should do and how it should work.

* *Engage with the community*: Upvote ideas you find intriguing. More support increases the likelihood a developer will build it.

## Find ideas from other sources

Beyond the https://apify.com/ideas page, you can find new Actor ideas through:

* SEO tools: Discover relevant search terms people use to find solutions
* Your experience: Draw from problems you've encountered in your work
* Community discussions: Browse Reddit, Twitter, Stack Overflow, and forums for user pain points
* Competitor analysis: Research existing tools and identify gaps

Once you have an idea, learn how to validate it: https://docs.apify.com/academy/build-and-publish/actor-ideas/actor-validation.md.


---

# Why publish Actors on Apify

Publishing Actors on Apify Store transforms your web scraping and automation code into revenue-generating products without the overhead of traditional SaaS development.

***

## What you get when you publish on Apify

When you publish your Actor on Apify Store, you eliminate the complexity of building and maintaining a traditional SaaS product. The platform handles infrastructure, billing, and distribution, so you can focus on your code.

### Skip the SaaS overhead

Your Actor gets its own dedicated landing page with built-in documentation hosting through README integration, giving you instant distribution with direct exposure to organic user traffic through Apify Store's marketplace. You won't pay hosting costs since the built-in cloud infrastructure with automatic scaling handles all compute needs. Payment infrastructure is completely handled for you with multiple payment options, automated billing, and transactions.

### No infrastructure headaches

Publishing on Apify Store means you don't need to purchase and manage domains or websites, build payment processing systems, set up hosting infrastructure, or handle customer billing manually. You also won't need to invest heavily in marketing since the marketplace presence drives discovery.

## Choose your pricing options

Apify Store offers flexible pricing models that let you match your Actor's value proposition:

* Pay-per-event (PPE): Charge for any custom events your Actor triggers (maximum flexibility, AI/MCP compatible, priority store placement)
* Pay-per-result (PPR): Set pricing based on dataset items generated (predictable costs for users, unlimited revenue potential)
* Rental: Charge a flat monthly fee for continuous access (users cover their own platform usage costs)

All models give you 80% of revenue, with platform usage costs deducted for PPR and PPE models.

Learn more in https://docs.apify.com/academy/actor-marketing-playbook/store-basics/how-actor-monetization-works.md.

## Why developers publish Actors

### Generate passive income

Developers successfully monetize their Actors through the Apify platform. Once published and promoted, Actors can generate recurring revenue with minimal maintenance.

Check out their success stories:

* https://blog.apify.com/web-scraping-freelance-financial-freedom/ - Achieved financial freedom through Actor development.
* https://apify.com/success-stories/paid-actor-journey-apify-freelancer-tugkan - Built a successful freelance career with paid Actors.

### Build your portfolio

Publishing Actors demonstrates your skills publicly. Your Actors become visible examples of your work, showcasing your technical expertise to potential clients while building your reputation in the developer community. This visibility can open freelance opportunities and establish you as a subject matter expert.

### Join a marketplace

Apify Store is a growing library of thousands of Actors, most created by community developers. When you publish, you reach users actively searching for automation solutions while benefiting from platform features like monitoring, scheduling, API access, and integrations. You get visibility through Store categories and search, plus access to analytics to understand user behavior and optimize pricing.

## What it takes to succeed

### Maintain quality

Public Actors require higher standards than private ones. Since users depend on your Actor, you'll need to commit to regular maintenance—reserve approximately 2 hours per week for bug fixes, updates, and user support. Thorough documentation is essential; write clear README files using simple language since users may not be developers. Set up automated testing or use manual testing to prevent user issues, and respond promptly to issues through the Issues tab, where your response time is publicly visible. Learn more about metrics determining quality in https://docs.apify.com/platform/actors/publishing/quality-score.md.

### When you need to change things

If you need to make breaking changes to your Actor, contact mailto:community@apify.com beforehand. Major pricing changes require 14-day notice and are limited to once per month. The platform helps communicate changes to your users.

## Getting started

Ready to publish? The process involves four main stages:

1. Development: Build your Actor using https://docs.apify.com/sdk, https://crawlee.dev/, or https://apify.com/templates
2. Publication: Set up display information, description, README, and monetization
3. Testing: Ensure your Actor works reliably with automated or manual tests
4. Promotion: Optimize for SEO, share on social media, and create tutorials

Learn more:

* https://docs.apify.com/academy/actor-marketing-playbook/store-basics/how-to-build-actors.md
* https://docs.apify.com/academy/actor-marketing-playbook/store-basics/how-store-works.md
* https://docs.apify.com/platform/actors/publishing/publish.md


---

# Concepts 🤔

**Learn about some common yet tricky concepts and terms that are used frequently within the academy, as well as in the world of scraper development.**

***

You'll see some terms and concepts frequently repeated throughout various courses in the academy. Many of these concepts are common, and even fundamental, in the scraping world, so they need to be explained to our course-takers. However, re-explaining each term every time it appears in a lesson would be inconvenient for our readers.

Because of this slight dilemma, and because there are no outside resources which compile all of these concepts into an educational and digestible form, we've decided to do just that. Welcome to the **Concepts** section of the Apify Academy's **Glossary**!

> It's important to note that there is no specific order to these concepts. All of them range in their relevance and importance to your everyday scraping endeavors.


---

# CSS selectors

CSS selectors are patterns used to select https://docs.apify.com/academy/concepts/html-elements.md on a web page. They are used in combination with CSS styles to change the appearance of web pages, and also in JavaScript to access and manipulate the elements on a web page.

> Querying of CSS selectors with JavaScript is done using `document.querySelector()` and `document.querySelectorAll()` (https://docs.apify.com/academy/concepts/querying-css-selectors.md).

## Common types of CSS selectors

Some of the most common types of CSS selectors are:

### Element selector

This is used to select elements by their tag name. For example, to select all `<p>` elements, you would use the `p` selector.


```
const paragraphs = document.querySelectorAll('p');
```


### Class selector

This is used to select elements by their class attribute. For example, to select all elements with the class of `highlight`, you would use the `.highlight` selector.


```
const highlightedElements = document.querySelectorAll('.highlight');
```


### ID selector

This is used to select an element by its `id` attribute. For example, to select an element with the id of `header`, you would use the `#header` selector.


```
const header = document.querySelector('#header');
```


### Attribute selector

This is used to select elements based on the value of an attribute. For example, to select all elements with the attribute `data-custom` whose value is `yes`, you would use the `[data-custom="yes"]` selector.


```
const customElements = document.querySelectorAll('[data-custom="yes"]');
```


### Chaining selectors

You can also chain multiple selectors together to select elements more precisely. For example, to select every `<p>` element that has the class `highlight`, you would use the `p.highlight` selector.


```
const highlightedParagraphs = document.querySelectorAll('p.highlight');
```


## CSS selectors in web scraping

CSS selectors are important for web scraping because they allow you to target specific elements on a web page and extract their data. When scraping a web page, you typically want to extract specific pieces of information from the page, such as text, images, or links. CSS selectors allow you to locate these elements on the page, so you can extract the data that you need.

For example, if you wanted to scrape a list of all the titles of blog posts on a website, you could use a CSS selector to select all the elements that contain the title text. Once you have selected these elements, you can extract the text from them and use it for your scraping project.

Additionally, when web scraping it is important to understand the structure of the website and CSS selectors can help you to navigate it. With them, you can select specific elements and their children, siblings, or parent elements. This allows you to extract data that is nested within other elements, or to navigate through the page structure to find the data you need.
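The blog-title example above could look roughly like the sketch below. The helper name `extractTitles` and the `article h2.post-title` selector are hypothetical; the right selector depends entirely on the markup of the page you are scraping. The function accepts any object that exposes `querySelectorAll()`, such as `document` in a browser.

```javascript
// Hypothetical helper: collects the trimmed text of every element that
// matches a CSS selector. `root` is anything with querySelectorAll(),
// for example `document` or a container element.
function extractTitles(root, selector = 'article h2.post-title') {
    // Array.from() turns the returned collection into a real array and
    // applies the mapping callback to each element in one pass.
    return Array.from(root.querySelectorAll(selector), (el) => el.textContent.trim());
}
```

In a browser console, you would call it as `extractTitles(document)`. Passing a narrower root, such as a sidebar element, limits the search to that part of the page.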

## Resources

* Find all the available CSS selectors and their syntax on MDN: https://developer.mozilla.org/en-US/docs/Web/CSS/CSS_Selectors.


---

# Dynamic pages and single-page applications (SPAs)

**Understand what makes a page dynamic, and how a page being dynamic might change your approach when writing a scraper for it.**

***

Oftentimes, web pages load additional information dynamically, long after their main body is loaded in the browser. A subset of dynamic pages takes this approach further and loads all of its content dynamically. Websites built this way are called single-page applications (SPAs), and they are widespread thanks to popular JavaScript libraries such as https://react.dev/ and https://vuejs.org/.

As you progress in your scraping journey, you'll quickly realize that different websites load their content and populate their pages with data in different ways. Some pages are rendered entirely on the server, some retrieve the data dynamically, and some use a combination of both those methods.

## How page loading works

The process of loading a page involves three main stages, each commonly referred to by name:

1. `DOMContentLoaded` - The initial HTML document is loaded, which contains the HTML as it was rendered on the website's server. It also includes all of the JavaScript which will be run in the next step.
2. `load` - The page's JavaScript is executed.
3. `networkidle` - Network requests (https://developer.mozilla.org/en-US/docs/Web/API/XMLHttpRequest) are sent and loaded, and data from these requests is populated onto the page. Many websites load essential data this way. These requests might be sent upon certain page events as well (not just the first load), such as scrolling or clicking. Unlike the first two, `networkidle` is not a native browser event; the name comes from browser automation tools such as Puppeteer and Playwright.

Now that we have a solid understanding of the different stages of page-loading, and the order they happen in, we can fully understand what a dynamic page is.

## What is dynamic content

Dynamic content is any content that is rendered **after** the `DOMContentLoaded` event, which means any content loaded by JavaScript during the `load` event, or after any network XHR/Fetch requests have been made.

Sometimes, it can be quite obvious when content is dynamically being rendered. For example, take a look at this gif:

![Image](https://blog.apify.com/content/images/2022/02/dynamicLoading-1--1--2.gif)

Here, it's very clear that new content is being generated. As we scroll down the Twitter feed, we can see the scroll bar jumping back up, signifying that more elements have been created using JavaScript.

Other times, it's less obvious though. Content can appear to be static (non-dynamic) when it is not, or even sometimes the other way around.
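When it isn't obvious, one rough check is to compare the raw HTML returned by a plain HTTP request with the text you see after the page has rendered in a browser. The sketch below assumes you have already obtained both strings; the function name `classifyContent` and its labels are hypothetical.

```javascript
// Hypothetical helper: decides where a piece of visible data comes from.
// rawHtml      - the HTML string returned by a plain HTTP request
// renderedText - the text of the page after the browser finished rendering
// snippet      - a piece of data you want to scrape, e.g. a product name
function classifyContent(rawHtml, renderedText, snippet) {
    // Present in the initial response: no browser is needed to reach it.
    if (rawHtml.includes(snippet)) return 'static';
    // Only present after rendering: it was added by JavaScript.
    if (renderedText.includes(snippet)) return 'dynamic';
    return 'not found';
}
```

Note that `static` here only means the snippet appears somewhere in the initial response. It may sit inside an embedded JSON blob rather than in visible markup, which is still good news for a scraper because no browser is required.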


---

# HTML elements

An HTML element is a building block of an HTML document. It is used to represent a piece of content on a web page, such as text, images, or videos. Each element is defined by a tag, which is a set of characters enclosed in angle brackets, such as `<p>`, `<img>`, or `<video>`. For example, this is a paragraph element:


```
<p>This is a paragraph of text.</p>
```


You can also add **attributes** to an element to provide additional information or to control how the element behaves. For example, the `src` attribute is used to specify the source of an image, like this:


```
<img src="image.jpg" alt="A description of the image">
```


In JavaScript, you can use the **DOM** (Document Object Model) to interact with elements on a web page. For example, you can use the `document.querySelector()` method (https://docs.apify.com/academy/concepts/querying-css-selectors.md) to select an element by its CSS selector (https://docs.apify.com/academy/concepts/css-selectors.md), like this:


```
const myElement = document.querySelector('#myId');
```


You can also use the `getElementById()` method to select an element by its `id`, like this:


```
const myElement = document.getElementById('myId');
```


You can also use the `getElementsByTagName()` method to select all elements of a certain type, like this:


```
const myElements = document.getElementsByTagName('p');
```


Once you have selected an element, you can use JavaScript to change its content, style, or behavior.
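As a small illustration of changing an element after selecting it, the sketch below updates the content, style, and an attribute in one go. The helper name `markAsVisited` and the `data-visited` attribute are hypothetical choices for this example.

```javascript
// Hypothetical helper: changes the content, style, and an attribute of an
// element that has already been selected.
function markAsVisited(element) {
    element.textContent = `${element.textContent} (visited)`; // content
    element.style.color = 'gray';                             // style
    element.setAttribute('data-visited', 'true');             // attribute
    return element;
}
```

In a browser, you could call it as `markAsVisited(document.querySelector('#myId'))`.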

In summary, an HTML element is a building block of a web page. It is defined by a **tag** with **attributes**, which provide additional information or control how the element behaves. You can use the **DOM** (Document Object Model) to interact with elements on a web page.


---

# HTTP cookies

**Learn a bit about what cookies are, and how they are utilized in scrapers to appear logged-in, view specific data, or even avoid blocking.**

***

HTTP cookies are small pieces of data sent by the server to the user's web browser. The browser typically stores them and sends them back with later requests to the same server. Cookies are usually represented as a string (if used together with a plain HTTP request) and sent with the request under the **Cookie** header (https://docs.apify.com/academy/concepts/http-headers.md).

## Most common uses of cookies in crawlers

1. To make the website show data to you as if you were a logged-in user.
2. To make the website show location-specific data (works for websites where you could set a zip code or country directly on the page, but unfortunately doesn't work for some location-based ads).
3. To make the website less suspicious of the crawler and let the crawler's traffic blend in with regular user traffic.
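With a plain HTTP request, cookies travel as a single string of `name=value` pairs separated by `; `. The sketch below builds that string from an object. The helper name `toCookieHeader` and the cookie names and values are hypothetical, and the values are assumed to be already safe to send without extra encoding.

```javascript
// Serializes a name -> value map into the string sent in the Cookie
// request header. Values are assumed to need no additional encoding.
function toCookieHeader(cookies) {
    return Object.entries(cookies)
        .map(([name, value]) => `${name}=${value}`)
        .join('; ');
}

// Hypothetical cookie names and values, for illustration only.
const headers = {
    cookie: toCookieHeader({ sessionId: 'abc123', zipCode: '10001' }),
};
```

The resulting `headers` object can be passed to whichever HTTP client your scraper uses.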

For local testing, we recommend using the https://chrome.google.com/webstore/detail/fngmhnnpilhplaeedifhccceomclgfbg Chrome extension.


---

# HTTP headers

**Understand what HTTP headers are, what they're used for, and three of the biggest differences between HTTP/1.1 and HTTP/2 headers.**

***

HTTP headers (https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers) let the client and the server pass additional information with an HTTP request or response. Headers are represented by an object where the keys are header names. Headers can also contain certain authentication tokens.

In general, there are 4 different paths you'll find yourself on when scraping a website and dealing with headers:

## No headers

For some websites, you won't need to worry about modifying headers at all, as there are no checks or verifications in place.

## Some default headers required

Some websites will require certain default browser headers to work properly, such as **User-Agent** (though this header is becoming less relevant for blocking, as there are more sophisticated ways to detect and block a suspicious user).

Another example of such a "default" header is **Referer**. Some e-commerce websites might share the same platform, and data is loaded through XMLHttpRequests to that platform, which would not know which data to return without knowing which exact website is requesting it.

## Custom headers required

A custom header is a non-standard HTTP header used for a specific website. For example, an imaginary website of **cool-stuff.com** might have a header with the name **X\_Cool\_Stuff\_Token** which is required for every single request to a product page.

Dealing with cases like these usually isn't difficult, but can sometimes be tedious.

## Very specific headers required

The most challenging websites to scrape are the ones that require a full set of site-specific headers to be included with the request. For example, not only would they potentially require proper **User-Agent** and **Referer** headers mentioned above, but also **Accept**, **Accept-Language**, **Accept-Encoding**, etc. with specific values.

Another big one to mention is the **Cookie** header. We cover this in more detail within the https://docs.apify.com/academy/concepts/http-cookies.md lesson.

You could use Chrome DevTools to inspect request headers, and https://docs.apify.com/academy/tools/insomnia.md or https://docs.apify.com/academy/tools/postman.md to test how the website behaves with or without specific headers.
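The cases above can be combined into one headers object. The sketch below is illustrative only: the helper name `buildHeaders` is hypothetical, the default values are placeholders that should be replaced with headers copied from a real request in Chrome DevTools, and the token value is made up for the imaginary **cool-stuff.com** site mentioned earlier.

```javascript
// Placeholder defaults that mimic a browser. Copy real values from a
// request observed in Chrome DevTools for the website you are scraping.
const DEFAULT_HEADERS = {
    'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    accept: 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'accept-language': 'en-US,en;q=0.9',
};

// Merges the defaults with an optional Referer and any site-specific headers.
function buildHeaders({ referer, extra = {} } = {}) {
    const headers = { ...DEFAULT_HEADERS, ...extra };
    if (referer) headers.referer = referer;
    return headers;
}

// Hypothetical custom header for the imaginary cool-stuff.com website.
const headers = buildHeaders({
    referer: 'https://cool-stuff.com/',
    extra: { X_Cool_Stuff_Token: 'example-token' },
});
```

Keeping the defaults in one place makes it easy to test how the website behaves when a single header is added or removed.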

## HTTP/1.1 vs HTTP/2 headers

HTTP/1.1 and HTTP/2 headers have several differences. Here are the three key differences that you should be aware of:

1. HTTP/2 headers do not include status messages. They only contain status codes.
2. Certain headers are no longer used in HTTP/2 (such as **Connection** along with a few others related to it like **Keep-Alive**). In HTTP/2, connection-specific headers are prohibited. While some browsers will ignore them, Safari and other WebKit-based browsers will outright reject any response that contains them. This is easy to do by accident and can cause serious problems.
3. While HTTP/1.1 headers are case-insensitive and could be sent by the browsers with capitalized letters (e.g. **Accept-Encoding**, **Cache-Control**, **User-Agent**), HTTP/2 headers must be lower-cased (e.g. **accept-encoding**, **cache-control**, **user-agent**).

> To learn more about the differences between HTTP/1.1 and HTTP/2 headers, check out this article: https://httptoolkit.com/blog/translating-http-2-into-http-1/
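Differences 2 and 3 can be applied mechanically when you reuse a set of HTTP/1.1 headers for an HTTP/2 request. The sketch below is a hedged illustration: the helper name `toHttp2Headers` is hypothetical, and the list of connection-specific headers goes beyond the two named above, with the extra entries assumed from the HTTP/2 specification (RFC 9113).

```javascript
// Connection-specific headers that HTTP/2 prohibits. Connection and
// Keep-Alive are named in the text above; the remaining entries are an
// assumption based on the HTTP/2 specification (RFC 9113).
const CONNECTION_SPECIFIC = new Set([
    'connection', 'keep-alive', 'proxy-connection', 'transfer-encoding', 'upgrade',
]);

// Lower-cases every header name and drops the prohibited ones.
function toHttp2Headers(http1Headers) {
    const result = {};
    for (const [name, value] of Object.entries(http1Headers)) {
        const lower = name.toLowerCase();
        if (!CONNECTION_SPECIFIC.has(lower)) result[lower] = value;
    }
    return result;
}
```

Many HTTP clients already perform this normalization, so a helper like this is mostly useful when you build requests by hand or compare captured traffic.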


---

# Querying elements

`document.querySelector()` and `document.querySelectorAll()` are JavaScript functions that allow you to select elements on a web page using CSS selectors (https://docs.apify.com/academy/concepts/css-selectors.md).

`document.querySelector()` is used to select the first element that matches the provided CSS selector. It returns the first matching element, or `null` if no matching element is found.

Here's an example of how you can use it:


```
const firstButton = document.querySelector('button');
```


This will select the first button element on the page and store it in the variable **firstButton**.

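`document.querySelectorAll()` is used to select all elements that match the provided CSS selector. It returns a `NodeList` (a collection of elements) that can be indexed and iterated like an array, for example with `forEach()`. It lacks array methods such as `map()`, so convert it with `Array.from()` when you need them.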

Here's an example of how you can use it:


```
const buttons = document.querySelectorAll('button');
```


This will select all button elements on the page and store them in the variable **buttons**.

Both functions can be used to access and manipulate the elements in the web page. Here's an example of how you can use `querySelectorAll()` to extract the text of all buttons.


```
const buttons = document.querySelectorAll('button');
// forEach() returns undefined, so use Array.from() to build an array of texts.
const buttonTexts = Array.from(buttons, (button) => button.textContent);
```


It's important to note that `querySelectorAll()` returns a static `NodeList`: a snapshot taken at the time of the call that does not update if the DOM changes later. Methods such as `getElementsByTagName()` return a live collection instead.


---

# What is robotic process automation (RPA)?

**Learn the basics of robotic process automation. Make your processes on the web and other software more efficient by automating repetitive tasks.**

***

RPA allows you to create software (also known as **bots**), which can imitate your digital actions. You can program bots to perform repetitive tasks faster, more reliably and more accurately than humans. Plus, they can do these tasks all day, every day.

## What can I use RPA for?

You can use RPA (https://apify.com/use-cases/rpa) to automate any repetitive task you perform using software. The tasks can range from https://apify.com/jakubbalada/content-checker to monitoring web pages for changes (such as changes in your competitors' pricing).

Other use cases for RPA include filling forms or https://apify.com/lukaskrivka/google-sheets while you get on with more important tasks. And it's not just simple tasks you can automate. How about https://apify.com/katerinahronik/toggl-invoice-download or posting content across several marketing channels at once?

## How does RPA work?

In a traditional automation workflow, you

1. Break a repetitive process down into chunks (https://kissflow.com/workflow/workflow-automation/an-8-step-checklist-to-get-your-workflow-ready-for-automation/), e.g. open website => log into website => click button "X" => download section "Y", etc.
2. Program a bot that does each of those chunks.
3. Execute the chunks of code in the right order (or in parallel).
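The three steps above can be sketched as a list of small functions run in sequence. Everything here is a placeholder: the step names mirror the example process, the step bodies only update a state object, and a real bot would drive a browser inside each step (which would also make the steps asynchronous).

```javascript
// Each chunk of the process is a small function that receives the shared
// state and returns the updated state. The bodies are placeholders; a real
// bot would perform browser actions here.
const steps = [
    function openWebsite(state) { return { ...state, page: 'home' }; },
    function logIn(state) { return { ...state, loggedIn: true }; },
    function clickButtonX(state) { return { ...state, page: 'section-y' }; },
    function downloadSectionY(state) {
        return { ...state, downloaded: state.loggedIn && state.page === 'section-y' };
    },
];

// Executes the chunks in order and records the name of each step that ran.
function runWorkflow(workflowSteps, initialState = {}) {
    return workflowSteps.reduce(
        (state, step) => ({ ...step(state), log: [...state.log, step.name] }),
        { ...initialState, log: [] },
    );
}

const result = runWorkflow(steps);
```

Keeping each chunk as its own function makes it easy to reorder steps, retry a single step, or reuse a step such as logging in across several workflows.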

With advances in machine learning (https://en.wikipedia.org/wiki/Machine_learning), it is becoming possible to record (https://www.nice.com/info/rpa-guide/process-recorder-function-in-rpa/) your workflows and analyze which of them can be automated. However, this technology is still not perfected and at times can even be less practical than the manual process.

## Is RPA the same as web scraping?

While web scraping (https://docs.apify.com/academy/web-scraping-for-beginners.md) is a kind of RPA, it focuses on extracting structured data. RPA focuses on the other tasks in browsers: everything except extracting information.

## Additional resources

An easy-to-follow [video](https://www.youtube.com/watch?v=9URSbTOE4YI) on what RPA is.

To learn about RPA in plain English, check out [this](https://enterprisersproject.com/article/2019/5/rpa-robotic-process-automation-how-explain) article.

[This](https://www.cio.com/article/227908/what-is-rpa-robotic-process-automation-explained.html) article explains what RPA is and discusses both its advantages and disadvantages.

You might also like to check out this article on [automating workflows](https://quandarycg.com/automating-workflows/).


---

# Deploying your code to Apify

**In this course learn how to take an existing project of yours and deploy it to the Apify platform as an Actor.**

***

This section will discuss how to use your newfound knowledge of the Apify platform and Actors from the [Getting started](https://docs.apify.com/academy/getting-started.md) section to deploy your existing project's code to the Apify platform as an Actor. Any program running in a Docker container can become an Apify Actor.

![The deployment workflow](/assets/images/deployment-workflow-72f8b289e512701951e27c687a932dfa.png)

Apify provides detailed guidance on how to deploy Node.js and Python programs as Actors, but apart from that you're not limited in what programming language you choose for your scraper.

![Supported languages](/assets/images/supported-languages-2b3aced02908c1def900dbace072201a.jpg)

Here are a few examples of Actors in other languages:

* https://apify.com/lukaskrivka/rust-actor-example
* https://apify.com/jirimoravcik/go-actor-example
* https://apify.com/jirimoravcik/julia-actor-example

## The "actorification" workflow

Follow these four main steps to turn a piece of code into an Actor:

1. Handle [inputs and outputs](https://docs.apify.com/academy/deploying-your-code/inputs-outputs.md).
2. Create an [input schema](https://docs.apify.com/academy/deploying-your-code/input-schema.md) **(optional)**.
3. Add a [Dockerfile](https://docs.apify.com/academy/deploying-your-code/docker-file.md).
4. [Deploy](https://docs.apify.com/academy/deploying-your-code/deploying.md) to the Apify platform!

## Our example project

For this section, we'll be turning this example project into an Actor:

* JavaScript
* Python


```
// index.js
const addAllNumbers = (...nums) => nums.reduce((total, curr) => total + curr, 0);

console.log(addAllNumbers(1, 2, 3, 4)); // -> 10
```



```
# index.py
def add_all_numbers (nums):
    total = 0

    for num in nums:
        total += num

    return total

print(add_all_numbers([1, 2, 3, 4])) # -> 10
```


> For all lessons in this section, we'll have examples for both Node.js and Python so that you can follow along in either language.

## Next up

[Next up](https://docs.apify.com/academy/deploying-your-code/inputs-outputs.md), we'll be learning how to accept input into our Actor as well as deliver output.


---

# Creating dataset schema

**Learn how to generate an appealing Overview table interface to preview your Actor results in real time on the Apify platform.**

***

The dataset schema generates an interface that enables users to instantly preview their Actor results in real time.

![Dataset Schema](/assets/images/output-schema-example-42bf91c1c1f39834fad5bbedf209acaa.png)

In this quick tutorial, you will learn how to set up an output tab for your own Actor.

## Implementation

First, create a `.actor` folder in the root of your Actor's source code. Then, create an `actor.json` file in this folder, which gives you `.actor/actor.json`.

![.actor/actor.json](/assets/images/actor-json-example-7f3c312c187b9f6f86879594a769f35f.webp)

Next, copy-paste the following template code into your `actor.json` file.


```
{
    "actorSpecification": 1,
    "name": "___ENTER_ACTOR_NAME____",
    "title": "___ENTER_ACTOR_TITLE____",
    "version": "1.0.0",
    "storages": {
        "dataset": {
            "actorSpecification": 1,
            "views": {
                "overview": {
                    "title": "Overview",
                    "transformation": {
                        "fields": [
                            "___EXAMPLE_NUMERIC_FIELD___",
                            "___EXAMPLE_PICTURE_URL_FIELD___",
                            "___EXAMPLE_LINK_URL_FIELD___",
                            "___EXAMPLE_TEXT_FIELD___",
                            "___EXAMPLE_BOOLEAN_FIELD___"
                        ]
                    },
                    "display": {
                        "component": "table",
                        "properties": {
                            "___EXAMPLE_NUMERIC_FIELD___": {
                                "label": "ID",
                                "format": "number"
                            },
                            "___EXAMPLE_PICTURE_URL_FIELD___": {
                                "format": "image"
                            },
                            "___EXAMPLE_LINK_URL_FIELD___": {
                                "label": "Clickable link",
                                "format": "link"
                            }
                        }
                    }
                }
            }
        }
    }
}
```


To configure the dataset schema, replace the fields in the template with the relevant fields to your Actor.

For reference, you can use the [Zappos Scraper source code](https://github.com/PerVillalva/zappos-scraper-actor/blob/main/.actor/actor.json) as an example of how the final implementation of the output tab should look in a live Actor.


```
{
    "actorSpecification": 1,
    "name": "zappos-scraper",
    "title": "Zappos Scraper",
    "description": "",
    "version": "1.0.0",
    "storages": {
        "dataset": {
            "actorSpecification": 1,
            "title": "Zappos.com Dataset",
            "description": "",
            "views": {
                "products": {
                    "title": "Overview",
                    "description": "It can take about one minute until the first results are available.",
                    "transformation": {
                        "fields": [
                            "imgUrl",
                            "brand",
                            "name",
                            "SKU",
                            "inStock",
                            "onSale",
                            "price",
                            "url"
                        ]
                    },
                    "display": {
                        "component": "table",
                        "properties": {
                            "imgUrl": {
                                "label": "Product image",
                                "format": "image"
                            },
                            "url": {
                                "label": "Link",
                                "format": "link"
                            },
                            "brand": {
                                "format": "text"
                            },
                            "name": {
                                "format": "text"
                            },
                            "SKU": {
                                "format": "text"
                            },
                            "inStock": {
                                "format": "boolean"
                            },
                            "onSale": {
                                "format": "boolean"
                            },
                            "price": {
                                "format": "text"
                            }
                        }
                    }
                }
            }
        }
    }
}
```


Note that the fields specified in the dataset schema should match the object keys of your resulting dataset.

Also, if your desired label has the same name as the defined object key, then you don't need to specify a label name. The schema will, by default, show a capitalized version of the key and even split camel case into separate words and capitalize all of them.
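
To get a feel for the default labels, here is a rough Python approximation of the behavior described above. It is a sketch based only on that description (capitalize the key and split camel case into capitalized words), not the platform's actual implementation, and `default_label` is a made-up helper name.

```python
import re


def default_label(key: str) -> str:
    """Approximate the default column label derived from a dataset key."""
    # Insert a space wherever a lowercase letter or digit is followed by an uppercase letter.
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", key)
    # Capitalize the first letter of every word and leave the rest untouched.
    return " ".join(word[:1].upper() + word[1:] for word in spaced.split(" "))


for key in ["imgUrl", "inStock", "price", "SKU"]:
    print(key, "->", default_label(key))
```

If the label produced this way is not what you want (as with `imgUrl` in the Zappos example), set an explicit `label` in the view's `properties`.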

The matching object for the Zappos Scraper shown in the example above will look something like this:


```
const results = {
    url: request.loadedUrl,
    imgUrl: $('#stage button[data-media="image"] img[itemprop="image"]').attr('src'),
    brand: $('span[itemprop="brand"]').text().trim(),
    name: $('meta[itemprop="name"]').attr('content'),
    SKU: $('*[itemprop~="sku"]').text().trim(),
    inStock: !request.url.includes('oosRedirected=true'),
    onSale: $('div[itemprop="offers"]').text().includes('OFF'),
    price: $('span[itemprop="price"]').text(),
};
```
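
Because the schema fields have to match the keys of the dataset items, it can help to check one item against the schema before deploying. The following Python sketch does that for a trimmed-down copy of the Zappos schema; `missing_fields` is a hypothetical helper and the sample item is invented for illustration.

```python
def missing_fields(actor_json: dict, view: str, item: dict) -> list:
    """Return the fields of a dataset view that are not present as keys in the item."""
    fields = actor_json["storages"]["dataset"]["views"][view]["transformation"]["fields"]
    return [field for field in fields if field not in item]


# Trimmed-down version of the Zappos schema, keeping only the parts the check needs.
actor_json = {
    "storages": {
        "dataset": {
            "views": {
                "products": {
                    "transformation": {
                        "fields": ["imgUrl", "brand", "name", "SKU", "inStock", "onSale", "price", "url"]
                    }
                }
            }
        }
    }
}

# Invented sample item that forgets the "onSale" key.
item = {
    "url": "https://example.com/product",
    "imgUrl": "https://example.com/product.jpg",
    "brand": "Example brand",
    "name": "Example shoe",
    "SKU": "123",
    "inStock": True,
    "price": "$10",
}

print(missing_fields(actor_json, "products", item))
```

An empty list means every column in the view has a matching key in the item.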


## Final result

Great! Now that everything is set up, it's time to run the Actor and admire your Actor's brand new output tab.

> Need some extra guidance? Visit the [dataset schema documentation](https://docs.apify.com/platform/actors/development/actor-definition/dataset-schema.md) for more detailed information about how to implement this feature.

A few seconds after running the Actor, you should see its results displayed in the `Overview` table.

![Output table overview](/assets/images/output-schema-final-example-0beffd41c710a5438a8fe1c4a72f0f07.webp)

## Next up

In the [next lesson](https://docs.apify.com/academy/deploying-your-code/docker-file.md), we'll learn about a very important file that is required for our project to run on the Apify platform - the Dockerfile.


---

# Publishing your Actor

**Push local code to the platform, or create a new Actor on the console and integrate it with a Git repository so that it can be rebuilt automatically whenever you push changes.**

***

Once you've **actorified** your code, there are two ways to deploy it to the Apify platform. You can either push the code directly from your local machine onto the platform, or you can create a blank Actor in the web interface, and then integrate its source code with a GitHub repository.

## With a Git repository

Before we deploy our project onto the Apify platform, let's ensure that we've pushed the changes we made in the last 3 lessons into our remote GitHub repository.

> The benefit of using this method is that any time you push to the Git repository, the code on the platform is also updated and the Actor is automatically rebuilt. Also, you don't have to use a GitHub repository - you can use GitLab or any other service you'd like.

### Creating the Actor

Before anything can be integrated, we've gotta create a new Actor. Let's head over to our [Apify Console](https://console.apify.com?asrc=developers_portal), navigate to the **Development** subsection and click on the **Develop new** button, then select the **Empty** template.

![Create new button](/assets/images/develop-new-actor-a499c8a2618fec73c828ddb4dcbb75b4.png)

### Changing source code location

In the **Source** tab on the new Actor's page, we'll click the dropdown menu under **Source code** and select **Git repository**. By default, this is set to **Web IDE**.

![Select source code location](/assets/images/select-source-location-8b84116417145746c275463c49e24baa.png)

Now we'll paste the link to our GitHub repository into the **Git URL** text field and click **Save**.

### Adding the webhook to the repository

The final step is to click on **API** in the top right corner of our Actor's page:

![API button](/assets/images/api-button-4384acadb7883bbad6c7f363c0c1a37c.jpg)

And scroll through all of the links until we find the **Build Actor** API endpoint. Now we'll copy this endpoint's URL, head back over to our GitHub repository and navigate to **Settings > Webhooks > Add webhook**. The final thing to do is to paste the URL and save the webhook.

![Adding a webhook to your GitHub repository](/assets/images/ci-github-integration-2ee82ac772eb3280155b7027a4259528.png)

That's it! The Actor should now pull its source code from the repository and automatically build.
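
Under the hood, the webhook just sends a POST request to the **Build Actor** endpoint. The sketch below shows roughly what that call looks like from Python using only the standard library. Treat the details as assumptions: the `version` and `tag` query parameters and their values are examples, the Actor ID and token are placeholders, and the URL you copy from the **API** dialog is the one to trust. The `build_actor_url` and `trigger_build` helpers are hypothetical names.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_BASE = "https://api.apify.com/v2"


def build_actor_url(actor_id: str, token: str, version: str = "0.0", tag: str = "latest") -> str:
    """Compose a Build Actor endpoint URL similar to the one pasted into the webhook."""
    query = urlencode({"token": token, "version": version, "tag": tag})
    return f"{API_BASE}/acts/{actor_id}/builds?{query}"


def trigger_build(actor_id: str, token: str) -> int:
    """Send the POST request and return the HTTP status code (not called in this example)."""
    request = Request(build_actor_url(actor_id, token), method="POST")
    with urlopen(request) as response:
        return response.status


# Placeholder values, replace them with your own Actor ID and API token.
print(build_actor_url("YOUR_USERNAME~YOUR_ACTOR_NAME", "YOUR_TOKEN"))
```

Since the URL contains your API token, treat it like a password and avoid committing it to the repository.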

## Without a GitHub repository (using the Apify CLI)

> If you don't yet have the Apify CLI, learn how to install it and log in by following along with [this brief lesson](https://docs.apify.com/academy/tools/apify-cli.md) about it.

If you're logged in to the Apify CLI, the `apify push` command can be used to push the code straight onto the Apify platform from your local machine (no GitHub repository required), where it will automatically be built for you. Prior to running this command, make sure that you have an **.actor/actor.json** file at the root of the project. If you don't already have one, you can use `apify init .` to automatically generate one for you.

One important thing to note is that you can use a `.gitignore` file to exclude files from being pushed. When you use `apify push` without a `.gitignore`, the full folder contents will be pushed, meaning that even the **storage** and **node\_modules** will be pushed. These files are unnecessary to push, as they are both generated on the platform.

> The `apify push` command should only really be used for quickly pushing and testing Actors on the platform during development. If you are ready to make your Actor public, use a Git repository instead, as you will reap the benefits of using Git and others will be able to contribute to the project.

## Deployed!

Great! Once you've pushed your Actor to the platform, you will find it listed under the **Actors** tab. When using the `apify push` command, you will have access to the multifile editor. For details about using the multifile editor, refer to https://docs.apify.com/academy/getting-started/creating-actors.md#web-ide.

![Deployed Actor on the Apify platform](/assets/images/actor-page-e3c2002c5e585e896614af6e3e38838e.jpg)

The next step is to test your Actor and experiment with the vast amount of features the platform has to offer.

## Wrap up

That's it! In this short section, you've learned how to take your code written in any programming language and turn it into a usable Actor that can run on the Apify platform! The next step is to start looking into [publishing your Actor](https://docs.apify.com/platform/actors/publishing.md), which allows you to monetize your work.


---

# Creating Actor Dockerfile

**Understand how to write a Dockerfile (Docker image blueprint) for your project so that it can be run within a Docker container on the Apify platform.**

***

The **Dockerfile** is a file which gives the Apify platform (or Docker, more specifically) instructions on how to create an environment for your code to run in. Every Actor must have a Dockerfile, as Actors run in Docker containers.

> Actors on the platform are always run in Docker containers; however, they can also be run in local Docker containers. This is not common practice though, as it requires more setup and a deeper understanding of Docker. For testing, it's best to run the Actor on the local OS (this requires you to have the underlying runtime installed, such as Node.js, Python, Rust, Go, etc).

## Base images

If your project doesn’t already contain a Dockerfile, don’t worry! Apify offers [base Docker images](https://docs.apify.com/sdk/js/docs/guides/docker-images) that are optimized for building and running Actors on the platform, which can be found on [Apify's Docker Hub profile](https://hub.docker.com/u/apify). When using a language for which Apify doesn't provide a base image, [Docker Hub](https://hub.docker.com/) provides a ton of free Docker images for most use-cases, upon which you can create your own images.

> Tip: You can see all of Apify's Docker images [on Docker Hub](https://hub.docker.com/u/apify).

At the base level, each Docker image contains a base operating system and usually also a programming language runtime (such as Node.js or Python). You can also find images with preinstalled libraries or install them yourself during the build step.

Once you find the base image you need, you can add it as the initial `FROM` statement:


```
FROM apify/actor-node:16
```


> For syntax highlighting in your Dockerfiles, download the [Docker extension for VS Code](https://code.visualstudio.com/docs/containers/overview#_installation).

## Writing the file

The rest of the Dockerfile is about copying the source code from the local filesystem into the container's filesystem, installing libraries, and setting the `CMD` instruction that launches the code (if omitted, it falls back to the one defined in the parent image).

> If you are not using a base image from Apify, then you should specify how to launch the source code of your Actor with the `CMD` instruction.

Here are the Dockerfiles for the Node.js and Python versions of our example project's Actor:

* Node.js Dockerfile
* Python Dockerfile


```
# First, specify the base Docker image.
# You can also use any other image from Docker Hub.
FROM apify/actor-node:16

# Second, copy just package.json and package-lock.json since they are the only files
# that affect npm install in the next step
COPY package*.json ./

# Install npm packages, skip optional and development dependencies to keep the
# image small. Avoid logging too much and print the dependency tree for debugging
RUN npm --quiet set progress=false \
 && npm install --only=prod --no-optional \
 && echo "Installed npm packages:" \
 && (npm list --all || true) \
 && echo "Node.js version:" \
 && node --version \
 && echo "npm version:" \
 && npm --version

# Next, copy the remaining files and directories with the source code.
# Since we do this after npm install, quick build will be really fast
# for simple source file changes.
COPY . ./
```



```
# First, specify the base Docker image.
# You can also use any other image from Docker Hub.
FROM apify/actor-python:3.9

# Second, copy just requirements.txt into the Actor image,
# since it should be the only file that affects "pip install" in the next step,
# in order to speed up the build
COPY requirements.txt ./

# Install the packages specified in requirements.txt,
# Print the installed Python version, pip version
# and all installed packages with their versions for debugging
RUN echo "Python version:" \
 && python --version \
 && echo "Pip version:" \
 && pip --version \
 && echo "Installing dependencies from requirements.txt:" \
 && pip install -r requirements.txt \
 && echo "All installed Python packages:" \
 && pip freeze

# Next, copy the remaining files and directories with the source code.
# Since we do this after installing the dependencies, quick build will be really fast
# for most source file changes.
COPY . ./

# Specify how to launch the source code of your Actor.
# By default, the main.py file is run
CMD python3 main.py
```


## Examples

The examples above show how to deploy Actors written in Node.js or Python, but you can use any language. As an inspiration, here are a few examples for other languages: Go, Rust, Julia.

* Go Actor Dockerfile
* Rust Actor Dockerfile
* Julia Actor Dockerfile


```
FROM golang:1.17.1-alpine

WORKDIR /app
COPY . .

RUN go mod download

RUN go build -o /example-actor
CMD ["/example-actor"]
```



```
# Image with prebuilt Rust. We use the newest 1.* version
# https://hub.docker.com/_/rust
FROM rust:1

# We copy only package setup so we cache building all dependencies
COPY Cargo* ./

# We need to have dummy main.rs file to be able to build
RUN mkdir src && echo "fn main() {}" > src/main.rs

# Build dependencies only
# Since we do this before copying  the rest of the files,
# the dependencies will be cached by Docker, allowing fast
# build times for new code changes
RUN cargo build --release

# Delete dummy main.rs
RUN rm -rf src

# Copy rest of the files
COPY . ./

# Build the source files
RUN cargo build --release

CMD ["./target/release/actor-example"]
```



```
FROM julia:1.7.1-alpine

WORKDIR /app
COPY . .

RUN julia install.jl

CMD ["julia", "main.jl"]
```


## Next up

In the [next lesson](https://docs.apify.com/academy/deploying-your-code/deploying.md), we'll push our code directly to the Apify platform, or create and integrate a new Actor on the Apify platform with our project's GitHub repository.


---

# How to write Actor input schema

**Learn how to generate a user interface on the platform for your Actor's input with a single file - the INPUT\_SCHEMA.json file.**

***

Though writing an [input schema](https://docs.apify.com/platform/actors/development/actor-definition/input-schema.md) for an Actor is not a required step, it is most definitely an ideal one. The Apify platform will read the **INPUT\_SCHEMA.json** file within the root of your project and generate a user interface for entering input into your Actor, which makes it significantly easier for non-developers (and even developers) to configure and understand the inputs your Actor can receive. Because of this, we'll be writing an input schema for our example Actor.

> Without an input schema, the users of our Actor will have to provide the input in JSON format, which can be problematic for those who are not familiar with JSON.

## Schema title & description

In the root of our project, we'll create a file named **INPUT\_SCHEMA.json** and start writing the first part of the schema.


```
{
    "title": "Adding Actor input",
    "description": "Add all values in list of numbers with an arbitrary length.",
    "type": "object",
    "schemaVersion": 1
}
```


The **title** and **description** describe what the input schema is for, and a bit about what the Actor itself does.

## Properties

In order to define all of the properties our Actor is expecting, we must include them within an object with a key of **properties**.


```
{
    "title": "Adding Actor input",
    "description": "Add all values in list of numbers with an arbitrary length.",
    "type": "object",
    "schemaVersion": 1,
    "properties": {
        "numbers": {
            "title": "Number list",
            "description": "The list of numbers to add up."
        }
    }
}
```


Each property's key corresponds to the name we're expecting within our code, while the **title** and **description** are what the user will see when configuring input on the platform.

## Property types & editor types

Within our new **numbers** property, there are two more fields we must specify. Firstly, we must let the platform know that we're expecting an array of numbers with the **type** field. Then, we should also instruct Apify on which UI component to render for this input property. In our case, we have an array of numbers, which means we should use the **json** editor type that we discovered in the [array section](https://docs.apify.com/platform/actors/development/actor-definition/input-schema/specification/v1.md#array) of the input schema documentation. We could also use **stringList**, but then we'd have to parse out the numbers from the strings.


```
{
    "title": "Adding Actor input",
    "description": "Add all values in list of numbers with an arbitrary length.",
    "type": "object",
    "schemaVersion": 1,
    "properties": {
        "numbers": {
            "title": "Number list",
            "description": "The list of numbers to add up.",
            "type": "array",
            "editor": "json"
        }
    }
}
```
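
To illustrate the extra work the **stringList** editor would bring, here is a small Python sketch of the parsing step we'd need. It assumes the input arrives as a list of strings; `parse_numbers` is a hypothetical helper, not part of any Apify library.

```python
def parse_numbers(values: list) -> list:
    """Convert a list of numeric strings into floats, failing loudly on bad input."""
    numbers = []
    for value in values:
        try:
            numbers.append(float(value))
        except ValueError:
            raise ValueError(f"Not a number: {value!r}")
    return numbers


print(sum(parse_numbers(["5", "2.5", " 3 "])))
```

With the **json** editor, the numbers arrive as numbers and this step is not needed.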


## Required fields

The great thing about building an input schema is that it will automatically validate your inputs based on their type, maximum value, minimum value, etc. Sometimes, you want to ensure that the user will always provide input for certain fields, as they are crucial to the Actor's run. This can be done by using the **required** field and passing in the names of the fields you'd like to require.


```
{
    "title": "Adding Actor input",
    "description": "Add all values in list of numbers with an arbitrary length.",
    "type": "object",
    "schemaVersion": 1,
    "properties": {
        "numbers": {
            "title": "Number list",
            "description": "The list of numbers to add up.",
            "type": "array",
            "editor": "json"
        }
    },
    "required": ["numbers"]
}
```


For our case, we've made the **numbers** field required, as it is crucial to our Actor's run.
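
The platform performs this validation for us, but the idea is simple enough to sketch. The following Python snippet is a heavily simplified stand-in that only checks required fields and a few basic types; it is not the platform's validator, and `validate_input` and `TYPE_MAP` are names invented for this example.

```python
# Maps a few input schema types to the Python types we expect after parsing JSON.
TYPE_MAP = {"array": list, "object": dict, "string": str, "boolean": bool}


def validate_input(schema: dict, actor_input: dict) -> list:
    """Return a list of human-readable validation errors (empty if the input is fine)."""
    errors = []
    for name in schema.get("required", []):
        if name not in actor_input:
            errors.append(f"Field '{name}' is required.")
    for name, prop in schema.get("properties", {}).items():
        expected = TYPE_MAP.get(prop.get("type"))
        if name in actor_input and expected and not isinstance(actor_input[name], expected):
            errors.append(f"Field '{name}' must be of type {prop['type']}.")
    return errors


schema = {
    "properties": {"numbers": {"type": "array", "editor": "json"}},
    "required": ["numbers"],
}

print(validate_input(schema, {}))
print(validate_input(schema, {"numbers": "5, 5"}))
print(validate_input(schema, {"numbers": [5, 5, 5, 5]}))
```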

## Final thoughts

Here is what the input schema we wrote will render on the platform:

![Rendered UI from input schema](/assets/images/rendered-ui-74b1f9f74dce9ba83249f733716a0745.png)

Later on, we'll be building more complex input schemas, as well as discussing how to write quality input schemas that allow the user to understand the Actor and not become overwhelmed.

You're not expected to memorize all of the fields that properties can take or the different editor types available, which is why it's always good to reference the [input schema documentation](https://docs.apify.com/platform/actors/development/actor-definition/input-schema.md) when writing a schema.

## Next up

In the [next lesson](https://docs.apify.com/platform/actors/development/actor-definition/dataset-schema.md), we'll learn how to generate an appealing Overview table to display our Actor's results in real time, so users can get immediate feedback about the data being extracted.


---

# Managing Actor inputs and outputs

**Learn to accept input into your Actor, do something with it, and then return output. Actors can be written in any language, so this concept is language agnostic.**

***

Most of the time when you're creating a project, your software expects some sort of input to work with. Often, you also want to provide some sort of output once your software has finished running. Apify provides a convenient way to handle inputs and deliver outputs.

An important thing to understand regarding inputs and outputs is that they are read/written differently depending on where the Actor is running:

* If your Actor is running locally, the inputs/outputs are usually provided in the filesystem, and environment variables are injected either by you, the developer, or by the Apify CLI by running the project with the `apify run` command.

* While running in a Docker container on the platform, environment variables are automatically injected, and inputs & outputs are provided and modified using Apify's REST API.

## A bit about storage

You can read/write your inputs/outputs to the [key-value store](https://docs.apify.com/platform/storage/key-value-store.md) or to the [dataset](https://docs.apify.com/platform/storage/dataset.md). The key-value store can be used to store any sort of unorganized/unrelated data in any format, while the data pushed to a dataset typically resembles a table with columns (fields) and rows (items). Each Actor's run is allocated both a default dataset and a default key-value store.

When running locally, these storages are accessible through the **storage** folder within your project's root directory, while on the platform they are accessible via Apify's API.
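
As a quick illustration of the local layout, the Python sketch below creates the default key-value store inside a **storage** folder and writes an **INPUT.json** file into it, then reads it back. It uses a temporary directory as a stand-in for the project root, and the helper names are made up for this example.

```python
import json
import tempfile
from pathlib import Path


def write_local_input(project_root: str, actor_input: dict) -> Path:
    """Write INPUT.json into the default local key-value store and return its path."""
    store = Path(project_root) / "storage" / "key_value_stores" / "default"
    store.mkdir(parents=True, exist_ok=True)
    input_path = store / "INPUT.json"
    input_path.write_text(json.dumps(actor_input, indent=4))
    return input_path


def read_local_input(project_root: str) -> dict:
    """Read INPUT.json back from the default local key-value store."""
    input_path = Path(project_root) / "storage" / "key_value_stores" / "default" / "INPUT.json"
    return json.loads(input_path.read_text())


# A temporary directory stands in for the project's root directory.
with tempfile.TemporaryDirectory() as project_root:
    write_local_input(project_root, {"numbers": [5, 5, 5, 5]})
    loaded = read_local_input(project_root)
    print(loaded)
```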

## Accepting input

There are multiple ways to accept input into your project. The option you go with depends on the language you have written your project in. If you are using Node.js for your repo's code, you can use the [`apify`](https://www.npmjs.com/package/apify) package. Otherwise, you can use the useful environment variables automatically set up for you by Apify to write utility functions which read the Actor's input and return it.

### Accepting input with the Apify SDK

Since we're using Node.js, let's install the `apify` package by running the following command:


```
npm install apify
```


Now, let's import `Actor` from `apify` and use the `Actor.getInput()` function to grab our input.


```
// index.js
import { Actor } from 'apify';

// We must initialize and exit the Actor. The rest of our code
// goes in between these two.
await Actor.init();

const input = await Actor.getInput();
console.log(input);

await Actor.exit();
```


If we run this right now, we'll see **null** in our terminal - this is because we never provided any sort of test input, which should be provided in the default key-value store. The `Actor.getInput()` function has detected that there is no **storage** folder and generated one for us.

![Default key-value store filepath](/assets/images/filepath-6c643f3e6fc1e05a2c8e477557a9dd4e.jpg)

We'll now add an **INPUT.json** file within **storage/key\_value\_stores/default** to match what we're expecting in our code.


```
{
    "numbers": [5, 5, 5, 5]
}
```


Then we can add our example project code from earlier. It will grab the input and use it to generate a solution which is logged into the console.


```
// index.js
import { Actor } from 'apify';

await Actor.init();

const { numbers } = await Actor.getInput();

const addAllNumbers = (...nums) => nums.reduce((total, curr) => total + curr, 0);

const solution = addAllNumbers(...numbers);

console.log(solution);

await Actor.exit();
```


Cool! When we run `node index.js`, we see **20**.

### Accepting input without the Apify SDK

Alternatively, when writing in a language other than JavaScript, we can create our own `get_input()` function which utilizes the Apify API when the Actor is running on the platform. For this example, we are using the [Apify Client](https://docs.apify.com/academy/getting-started/apify-client.md) for Python to access the API.


```
# index.py
from apify_client import ApifyClient
from os import environ
import json

client = ApifyClient(token='YOUR_TOKEN')

# If being run on the platform, the "APIFY_IS_AT_HOME" environment variable
# will be "1". Otherwise, it will be undefined/None
def is_on_apify ():
    return 'APIFY_IS_AT_HOME' in environ

# Get the input
def get_input ():
    if not is_on_apify():
        with open('./storage/key_value_stores/default/INPUT.json') as actor_input:
            return json.load(actor_input)

    kv_store = client.key_value_store(environ.get('APIFY_DEFAULT_KEY_VALUE_STORE_ID'))
    return kv_store.get_record('INPUT')['value']

def add_all_numbers (nums):
    total = 0

    for num in nums:
        total += num

    return total

actor_input = get_input()['numbers']

solution = add_all_numbers(actor_input)

print(solution)
```


> For a better understanding of the API endpoints for reading and modifying key-value stores, check the [API reference](https://docs.apify.com/api/v2/storage-key-value-stores.md).

## Writing output

Similarly to reading input, you can write the Actor's output either by using the Apify SDK in Node.js or by manually writing a utility function to do so.

### Writing output with the Apify SDK

In the SDK, we can write to the dataset with the `Actor.pushData()` function. Let's go ahead and write the solution of the `addAllNumbers()` function to the dataset store using this function:


```
// index.js

// This is our example project code from earlier.
// We will use the Apify input as its input.
import { Actor } from 'apify';

await Actor.init();

const { numbers } = await Actor.getInput();

const addAllNumbers = (...nums) => nums.reduce((total, curr) => total + curr, 0);

const solution = addAllNumbers(...numbers);

// And save its output to the default dataset
await Actor.pushData({ solution });

await Actor.exit();
```


### Writing output without the Apify SDK

Just as with the custom `get_input()` utility function, you can write a custom `set_output()` function as well if you cannot use the Apify SDK.

> You can read and write your output anywhere; however, it is standard practice to use a folder named **storage**.


```
# index.py
from apify_client import ApifyClient
from os import environ, makedirs
import json

client = ApifyClient(token='YOUR_TOKEN')

def is_on_apify ():
    return 'APIFY_IS_AT_HOME' in environ

def get_input ():
    if not is_on_apify():
        with open('./storage/key_value_stores/default/INPUT.json') as actor_input:
            return json.load(actor_input)

    kv_store = client.key_value_store(environ.get('APIFY_DEFAULT_KEY_VALUE_STORE_ID'))
    return kv_store.get_record('INPUT')['value']

# Push the solution to the dataset
def set_output (data):
    if not is_on_apify():
        # Make sure the local dataset folder exists before writing to it
        makedirs('./storage/datasets/default', exist_ok=True)
        with open('./storage/datasets/default/solution.json', 'w') as output:
            return output.write(json.dumps(data, indent=2))

    dataset = client.dataset(environ.get('APIFY_DEFAULT_DATASET_ID'))
    # push_items() accepts a list of JSON-serializable items
    dataset.push_items([data])

def add_all_numbers (nums):
    total = 0

    for num in nums:
        total += num

    return total

actor_input = get_input()['numbers']

solution = add_all_numbers(actor_input)

set_output({ 'solution': solution })
```


## Testing locally

Since we've changed our code a lot from the way it originally was by wrapping it in the Apify SDK to accept inputs and return outputs, we most definitely should test it locally before worrying about pushing it to the Apify platform.

After running our script, there should be a single item in the default dataset that looks like this:


```
{
    "solution": 20
}
```


## Next up

That's it! We've now added all of the files and code necessary to convert our software into an Actor. In the [next lesson](https://docs.apify.com/academy/deploying-your-code/input-schema.md), we'll learn how to generate a user interface for our Actor's input so that users don't have to provide the input in raw JSON format.


---

# Expert scraping with Apify

**After learning the basics of Actors and Apify, learn to develop pro-level scrapers on the Apify platform with this advanced course.**

***

This course will teach you the nitty-gritty of what it takes to build pro-level scrapers with Apify. We recommend that you've at least looked through all of the other courses in the academy prior to taking this one.

## Preparations

Before developing a pro-level Apify scraper, there are some important things you should have at least a bit of knowledge about (knowing the basics of each is enough to continue through this section), as well as some things that you should have installed on your system.

> If you've already gone through the [Web scraping basics for JavaScript devs](https://docs.apify.com/academy/web-scraping-for-beginners.md) course and the first courses of the [Apify platform](https://docs.apify.com/academy/apify-platform.md) category, you will be more than well equipped to continue on with the lessons in this course.

### Crawlee, Apify SDK, and the Apify CLI

If you're feeling ambitious, you don't need to have any prior experience with Crawlee to get started with this course; however, at least 5–10 minutes of exposure is recommended. If you haven't yet tried out Crawlee, you can refer to [this lesson](https://docs.apify.com/academy/web-scraping-for-beginners/crawling/pro-scraping.md) in the **Web scraping basics for JavaScript devs** course (and ideally follow along). To familiarize yourself with the Apify SDK, you can refer to the [Apify platform](https://docs.apify.com/academy/apify-platform.md) category.

The Apify CLI will play a core role in the running and testing of the Actor you will build, so if you haven't gotten it installed already, please refer to https://docs.apify.com/academy/tools/apify-cli.md.

### Git

In one of the later lessons, we'll be learning how to integrate our Actor on the Apify platform with a GitHub repository. For this, you'll need to understand at least the basics of [Git](https://git-scm.com/docs). Here's a [beginner-friendly tutorial](https://product.hubspot.com/blog/git-and-github-tutorial-for-beginners) to help you get started with Git.

### Docker

Docker is a massive topic on its own, but don't be worried! We only expect you to know and understand the very basics of it, which you can learn from the [Docker overview](https://docs.docker.com/guides/docker-overview/) (10-minute read).

### The basics of Actors

Part of this course will be learning more in-depth about Actors; however, some basic knowledge is already assumed. If you haven't yet gone through the [Actors](https://docs.apify.com/academy/getting-started/actors.md) lesson of the **Apify platform** course, it's highly recommended to at least give it a glance before moving forward.

## First up

In the [first lesson](https://docs.apify.com/academy/expert-scraping-with-apify/actors-webhooks.md), we'll be learning in-depth about integrating Actors with each other using webhooks.

> Each lesson will have a short *(and optional)* quiz that you can take at home to test your skills and knowledge related to the lesson's content. Some questions have straight factual answers, but some others can have varying opinionated answers.


---

# Webhooks & advanced Actor overview

**Learn more advanced details about Actors, how they work, and the default configurations they can take. Also, learn how to integrate your Actor with webhooks.**

***

Thus far, you've run Actors on the platform and written an Actor of your own, which you published to the platform yourself using the Apify CLI; therefore, it's fair to say that you are becoming more familiar and comfortable with the concept of **Actors**. Within this lesson, we'll take a more in-depth look at Actors and what they can do.

## Advanced Actor overview

In this course, we'll be working out of the Amazon scraper project from the **Web scraping basics for JavaScript devs** course. If you haven't already built that project, you can do it in [three short lessons](https://docs.apify.com/academy/web-scraping-for-beginners/challenge.md). We've made a few small modifications to the project with the Apify SDK, but 99% of the code is still the same.

Take another look at the files within your Amazon scraper project. You'll notice that there is a **Dockerfile**. Every single Actor has a Dockerfile, which defines the Actor's **image** and tells Docker how to spin up a container on the Apify platform that can successfully run the Actor's code. "Apify Actors" is a serverless platform that runs multiple Docker containers. For a deeper understanding of Actor Dockerfiles, refer to the [example Dockerfile in the Apify SDK docs](https://docs.apify.com/sdk/js/docs/guides/docker-images#example-dockerfile).

## Webhooks

Webhooks are a powerful tool that can be used for just about anything. You can set up actions to be taken when an Actor reaches a certain state (started, failed, succeeded, etc.). These actions usually take the form of an API call (generally a POST request).
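
As a rough illustration of what the receiving side of such a POST request might do, here is a minimal sketch. It assumes the request body follows Apify's default webhook payload template (with `eventType`, `eventData`, and `resource` fields); the helper name `getDatasetIdFromWebhook` and every ID in the sample are hypothetical placeholders.

```javascript
// A sample body shaped like the default webhook payload template.
// All IDs and dates below are made-up placeholders.
const samplePayload = {
    userId: 'user-id-placeholder',
    createdAt: '2024-01-01T00:00:00.000Z',
    eventType: 'ACTOR.RUN.SUCCEEDED',
    eventData: { actorId: 'actor-id-placeholder', actorRunId: 'run-id-placeholder' },
    resource: {
        id: 'run-id-placeholder',
        status: 'SUCCEEDED',
        defaultDatasetId: 'dataset-id-placeholder',
    },
};

// Hypothetical helper: returns the finished run's default dataset ID,
// or null when the event is anything other than a successful run.
function getDatasetIdFromWebhook(payload) {
    if (payload?.eventType !== 'ACTOR.RUN.SUCCEEDED') return null;
    return payload.resource?.defaultDatasetId ?? null;
}

console.log(getDatasetIdFromWebhook(samplePayload)); // dataset-id-placeholder
```

An Actor triggered by a webhook receives this payload as its input, so a check like this one is a natural first step before reading the dataset.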

## Learning 🧠

Prior to moving forward, please read over these resources:

* Read about [running Actors](https://docs.apify.com/platform/actors/running.md).
* Learn about [webhooks](https://docs.apify.com/platform/integrations/webhooks.md), which we will implement in the next lesson.
* Learn [how to run an Actor and retrieve its data](https://docs.apify.com/academy/api/run-actor-and-retrieve-data-via-api.md) using Apify's REST API.

## Knowledge check 📝

1. How do you allocate more CPU for an Actor's run?
2. Within itself, can you get the exact time that an Actor was started?
3. What are the types of default storages connected to an Actor's run?
4. Can you change the allocated memory of an Actor while it's running?
5. How can you run an Actor with Puppeteer on the Apify platform with headless mode set to `false`?

## Our task

In this task, we'll be building on top of what we already created in the [Web scraping basics for JavaScript devs](https://docs.apify.com/academy/web-scraping-for-beginners/challenge.md) course's final challenge, so keep those files safe!

Once our Amazon Actor has completed its run, we will, rather than sending an email to ourselves, call an Actor through a webhook. The Actor called will be a new Actor that we will create together, which will take the dataset ID as input, then subsequently filter through all of the results and return only the cheapest one for each product. All of the results of the Actor will be pushed to its default dataset.

[Solution](https://docs.apify.com/academy/expert-scraping-with-apify/solutions/integrating-webhooks.md)

## Next up

This course's [next lesson](https://docs.apify.com/academy/expert-scraping-with-apify/managing-source-code.md) is brief, but discusses a very important topic: managing your code and storing it in a safe place.


---

# Apify API & client

**Gain an in-depth understanding of the two main ways of programmatically interacting with the Apify platform - through the API, and through a client.**

***

You can use one of the two main ways to programmatically interact with the Apify platform: by directly using the [Apify API](https://docs.apify.com/api/v2.md), or by using the [JavaScript](https://docs.apify.com/api/client/js) and [Python](https://docs.apify.com/api/client/python) API clients. In the next two lessons, we'll be focusing on the API and the JavaScript client.

> Apify's API and JavaScript API client allow us to do anything a regular user can do when interacting with the platform's web interface, only programmatically.

## Learning 🧠

* Scroll through the [Apify API docs](https://docs.apify.com/api/v2.md) (there's a whole lot there, so you're not expected to memorize everything).
* Read about the Apify client in the [JavaScript client docs](https://docs.apify.com/api/client/js). It can also be found on [GitHub](https://github.com/apify/apify-client-js) and [npm](https://www.npmjs.com/package/apify-client).
* Learn about the [`Actor.newClient()`](https://docs.apify.com/sdk/js/reference/class/Actor#newClient) function in the Apify SDK.
* Skim through [this article](https://help.apify.com/en/articles/2868670-how-to-pass-data-from-web-scraper-to-another-actor) about API integration (the article is old, but still relevant).

## Knowledge check 📝

1. What is the relationship between the Apify API and the Apify client? Are there any significant differences?
2. How do you pass input when running an Actor or task via API?
3. Do you need to install the `apify-client` npm package when already using the `apify` package?

## Our task

We'll be creating another new Actor, which will have two jobs:

1. Programmatically call the task for the Amazon Actor.
2. Export its results into CSV format under a new key called **OUTPUT.csv** in the default key-value store.

Though it's a bit unintuitive, this is a perfect activity for learning how to use both the Apify API and the Apify JavaScript client.

The new Actor should take the following input values, which will be mapped to parameters in the API calls:


```
{
    // How much memory to allocate to the Amazon Actor
    // Must be a power of 2
    "memory": 4096,

    // Whether to use the JavaScript client to make the
    // call, or to use the API
    "useClient": false,

    // The fields in each item to return back. All other
    // fields should be omitted
    "fields": ["title", "itemUrl", "offer"],

    // The maximum number of items to return back
    "maxItems": 10
}
```
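
To make the mapping more concrete, here is a hedged sketch of how these input values might translate into API request URLs. The helper names `buildRunTaskUrl` and `buildDatasetItemsUrl` are hypothetical. The `memory` query parameter of the run-task endpoint and the `format`, `fields`, and `limit` parameters of the dataset-items endpoint come from the Apify API reference, so double-check them there before relying on this. Authentication with an API token is intentionally left out.

```javascript
const API_BASE = 'https://api.apify.com/v2';

// Hypothetical helper: URL for starting a run of a task (sent as a POST request).
// `memory` is the amount of memory for the run, in megabytes.
function buildRunTaskUrl(taskId, memory) {
    const url = new URL(`${API_BASE}/actor-tasks/${encodeURIComponent(taskId)}/runs`);
    url.searchParams.set('memory', String(memory));
    return url.toString();
}

// Hypothetical helper: URL for downloading dataset items as CSV (a GET request).
// `fields` limits the returned columns, `maxItems` limits the number of rows.
function buildDatasetItemsUrl(datasetId, { fields, maxItems }) {
    const url = new URL(`${API_BASE}/datasets/${encodeURIComponent(datasetId)}/items`);
    url.searchParams.set('format', 'csv');
    url.searchParams.set('fields', fields.join(','));
    url.searchParams.set('limit', String(maxItems));
    return url.toString();
}

// The task and dataset IDs below are made-up placeholders.
console.log(buildRunTaskUrl('my-user~my-task', 4096));
console.log(buildDatasetItemsUrl('dataset-id-placeholder', {
    fields: ['title', 'itemUrl', 'offer'],
    maxItems: 10,
}));
```

When the JavaScript client is used instead, the same values would be passed as options to the client's methods rather than assembled into URLs by hand.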


[Solution](https://docs.apify.com/academy/expert-scraping-with-apify/solutions/using-api-and-client.md)

## Next up

The [next lesson](https://docs.apify.com/academy/expert-scraping-with-apify/migrations-maintaining-state.md) will teach us everything we need to know about migrations and how to handle them properly to avoid losing any state, thereby increasing the reliability of our `demo-actor` Amazon scraper.


---

# Bypassing anti-scraping methods

**Learn about bypassing anti-scraping methods using proxies and proxy/session rotation together with Crawlee and the Apify SDK.**

***

Effectively bypassing anti-scraping software is one of the most crucial, but also one of the most difficult skills to master. The different types of [anti-scraping protections](https://docs.apify.com/academy/anti-scraping.md) can vary a lot on the web. Some websites aren't protected at all, some require only moderate IP rotation, and some cannot be scraped without using advanced techniques and workarounds. Additionally, because the web is evolving, anti-scraping techniques are also evolving and becoming more advanced.

It is generally quite difficult to recognize the anti-scraping protections a page may have when first inspecting it, so it is important to thoroughly investigate a site prior to writing any lines of code, as anti-scraping measures can significantly change your approach as well as complicate the development process of an Actor. As your skills expand, you will be able to spot anti-scraping measures quicker, and better evaluate the complexity of a new project.

You might have already noticed that we've been using the **RESIDENTIAL** proxy group in the `proxyConfiguration` within our Amazon scraping Actor. But what does that mean? This is a proxy group from [Apify Proxy](https://apify.com/proxy), which has been preventing us from being blocked by Amazon this entire time. We'll be learning more about proxies and Apify Proxy in this lesson.

## Learning 🧠

* Skim the [Apify Proxy page](https://apify.com/proxy) for a general idea of Apify Proxy.
* Give the [proxy documentation](https://docs.apify.com/platform/proxy.md) a solid read-through (feel free to skip most of the examples).
* Check out the [anti-scraping guide](https://docs.apify.com/academy/anti-scraping.md).
* Gain a solid understanding of the [`SessionPool`](https://crawlee.dev/api/core/class/SessionPool) class.
* Look at a few Actors on [Apify Store](https://apify.com/store). How are they utilizing proxies?

## Knowledge check 📝

1. What are the different types of proxies that Apify Proxy offers? What are the main differences between them?
2. Which proxy groups do users get on the free plan? Can they access the proxy from their computer?
3. How can you prevent an error from occurring if one of the proxy groups that a user has is removed? What are the best practices for these scenarios?
4. Does it make sense to rotate proxies when you are logged into a website?
5. Construct a proxy URL that will select proxies **only from the US**.
6. What do you need to do to rotate a proxy (one proxy usually has one IP)? How does this differ for CheerioCrawler and PuppeteerCrawler?
7. Name a few different ways a website can prevent you from scraping it.

## Our task

This time, we're going to build a trivial proxy-session manager for our Amazon scraping Actor. A session should be used a maximum of 5 times before being rotated; however, if a request fails, the IP should be rotated immediately.

Additionally, the proxies used by our scraper should now only be from the US.

[Solution](https://docs.apify.com/academy/expert-scraping-with-apify/solutions/rotating-proxies.md)

## Next up

Up [next](https://docs.apify.com/academy/expert-scraping-with-apify/saving-useful-stats.md), we'll be learning how to save useful stats about our run, which becomes more and more useful as a project scales.


---

# Managing source code

**Learn how to manage your Actor's source code more efficiently by integrating it with a GitHub repository. This is standard on the Apify platform.**

***

In this brief lesson, we'll discuss how to better manage an Actor's source code. Up 'til now, you've been developing your scripts locally, and then pushing the code directly to the Actor on the Apify platform; however, there is a much more optimal (and standard) way.

## Learning 🧠

Thus far, every time we've updated our code on the Apify platform, we've used the `apify push` CLI command; however, this can be problematic for a few reasons - mainly because, if someone else wants to make a change to/maintain your code, they don't have access to it, as it is on your local machine.

If you're not yet familiar with Git, please get familiar with it through the [Git documentation](https://git-scm.com/docs), then take a quick moment to read about the [GitHub integration](https://docs.apify.com/platform/integrations/github.md) in the Apify docs.

Also, try to explore the **Multifile editor** in one of the Actors you developed in the previous lessons before moving forward.

## Knowledge check 📝

1. Do you have to rebuild an Actor each time the source code is changed?
2. In Git, what is the difference between **pushing** changes and making a **pull request**?
3. Based on your knowledge and experience, is the `apify push` command worth using (in your opinion)?

[Solution](https://docs.apify.com/academy/expert-scraping-with-apify/solutions/managing-source.md)

## Our task

First, we must initialize a GitHub repository (you can use GitLab if you like, but this lesson's examples will be using GitHub). Then, after pushing our main Amazon Actor's code to the repo, we must switch its source code to use the content of the GitHub repository instead.

## Integrating GitHub source code

First, let's create a repository. This can be done [in several ways](https://kbroman.org/github_tutorial/pages/init.html), but in this lesson, we'll do it by creating the remote repository on GitHub's website:

![Create a new GitHub repo](/assets/images/github-new-repo-1e45ed3d75fdb3672b6253b016e1186d.png)

Then, we'll run the commands it tells us in our terminal (while within the **demo-actor** directory) to initialize the repository locally, and then push all of the files to the remote one.

After you've created your repo, navigate on the Apify platform to the Actor we called **demo-actor**. In the **Source** tab, click the dropdown menu under **Source code** and select **Git repository**. By default, this is set to **Web IDE**, which is what we've been using so far.

![Select source code location](/assets/images/select-source-location-8b84116417145746c275463c49e24baa.png)

Then, go ahead and paste the link to your repository into the **Git URL** text field and click **Save**.

The final step is to click on **API** in the top right corner of your Actor's page:

![API button](/assets/images/api-button-4384acadb7883bbad6c7f363c0c1a37c.jpg)

And scroll through all of the links until you find the **Build Actor** API endpoint. Copy this endpoint's URL, then head back over to your GitHub repository and navigate to **Settings > Webhooks > Add webhook**. The final thing to do is to paste the URL and save the webhook.

![Adding a webhook to your GitHub repo](/assets/images/ci-github-integration-2ee82ac772eb3280155b7027a4259528.png)

And you're done! 🎉

## Quick chat about code management

This was a bit of overhead, but the good news is that you don't ever have to configure this stuff again for this Actor. Now, every time the content of your **main**/**master** branch changes, the Actor on the Apify platform will rebuild based on the newest code.

Think of it as combining two steps into one! Normally, you'd have to do a `git push` from your terminal in order to get the newest code onto GitHub, then run `apify push` to push it to the platform.

It's also important to know that GitHub/GitLab repository integration is standard practice. As projects grow and the number of contributors and maintainers increases, it only makes sense to have a GitHub repository integrated with the project's Actor. For the remainder of this course, all Actors created will be integrated with a GitHub repository.

## Next up

In the [next lesson](https://docs.apify.com/academy/expert-scraping-with-apify/tasks-and-storage.md), you'll learn about the different ways to store scraped data, as well as how to utilize a cool feature to run pre-configured Actors.


---

# Migrations & maintaining state

**Learn about what Actor migrations are and how to handle them properly so that the state is not lost and runs can safely be resurrected.**

***

We already know that Actors are Docker containers that can be run on any server. This means that they can be allocated anywhere there is space available, making them very efficient. Unfortunately, there is one big caveat: Actors move - a lot. When an Actor moves, it is called a **migration**.

On migration, the process inside of an Actor is completely restarted and everything in its memory is lost, meaning that any values stored within variables or classes are lost.

When a migration happens, you want to do a so-called "state transition", which means saving any data you care about so the Actor can continue right where it left off before the migration.
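
The idea can be sketched without any Apify dependencies. In the following simplified simulation, a plain `Map` stands in for the key-value store and `runProcess` stands in for one life of the Actor's process. Both names are hypothetical, and the real platform events and storage calls are asynchronous.

```javascript
// Stand-in for a key-value store that outlives any single process.
const fakeKeyValueStore = new Map();

// Simulates one "life" of an Actor process: restore state, do work, persist.
function runProcess(pagesToScrape) {
    // Restore the previous state if there is one; otherwise start fresh.
    const state = fakeKeyValueStore.get('STATE') ?? { pagesScraped: 0 };

    state.pagesScraped += pagesToScrape;

    // "State transition": save what we care about before the process goes away.
    fakeKeyValueStore.set('STATE', { ...state });

    return state.pagesScraped;
}

console.log(runProcess(3)); // 3 (first process, before the migration)
// ...migration: everything in memory is gone, only the store remains...
console.log(runProcess(2)); // 5 (second process continues from the saved state)
```

Without the restore step at the top of `runProcess`, the second call would start counting from zero again, which is exactly the data loss a migration causes.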

## Learning 🧠

Read this [article](https://docs.apify.com/platform/actors/development/builds-and-runs/state-persistence.md) on migrations and dealing with state transitions.

Before moving forward, read about Actor [events](https://docs.apify.com/sdk/js/docs/upgrading/upgrading-to-v3#events) and how to listen for them.

## Knowledge check 📝

1. Actors have an option in the **Settings** tab to **Restart on error**. Would you use this feature for regular Actors? When would you use this feature?
2. Migrations happen randomly, but by [gracefully aborting a run](https://docs.apify.com/platform/actors/running/runs-and-builds.md#aborting-runs), you can simulate a similar situation. Try this out on the platform and observe what happens. What changes occur, and what remains the same for the restarted Actor's run?
3. Why don't you (usually) need to add any special migration handling code for a standard crawling/scraping Actor? Are there any features in the Crawlee/Apify SDK that handle this under the hood?
4. How can you intercept the migration event? How much time do you have after this event happens and before the Actor migrates?
5. When would you persist data to the default key-value store instead of to a named key-value store?

## Our task

Once again returning to our Amazon **demo-actor**, let's say that we need to store an object in memory (as a variable) containing all of the scraped ASINs as keys and the number of offers scraped from each ASIN as values. The object should follow this format:


```
{
    "B079ZJ1BPR": 3,
    "B07D4R4258": 21
}
```


Every 10 seconds, we should log the most up-to-date version of this object to the console. Additionally, the object should be able to survive Actor migrations, which means that even if the Actor were to migrate, its data would not be lost upon resurrection.

[Solution](https://docs.apify.com/academy/expert-scraping-with-apify/solutions/handling-migrations.md)

## Next up

You might have already noticed that we've been using the **RESIDENTIAL** proxy group in the `proxyConfiguration` within our Amazon scraping Actor. But what does that mean? Learn why we've used this group, about proxies, and about avoiding anti-scraping measures in the [next lesson](https://docs.apify.com/academy/expert-scraping-with-apify/bypassing-anti-scraping.md).


---

# Saving useful run statistics

**Understand how to save statistics about an Actor's run, what types of statistics you can save, and why you might want to save them for a large-scale scraper.**

***

Using Crawlee and the Apify SDK, we are now able to collect and format data coming directly from websites and save it into a key-value store or dataset. This is great, but sometimes, we want to store some extra data about the run itself, or about each request. We might want to store some extra general run information separately from our results or potentially include statistics about each request within its corresponding dataset item.

The types of values that are saved are totally up to you, but the most common are error scores, number of total saved items, number of request retries, number of captchas hit, etc. Storing these values is not always necessary, but can be valuable when debugging and maintaining an Actor. As your projects scale, this will become more and more useful and important.

## Learning 🧠

Before moving on, give these valuable resources a quick lookover:

* Refamiliarize yourself with the various data available on the [Request](https://crawlee.dev/api/core/class/Request) object.
* Learn about the [`failedRequestHandler`](https://crawlee.dev/api/browser-crawler/interface/BrowserCrawlerOptions#failedRequestHandler) function.
* Understand how to use the [`errorHandler`](https://crawlee.dev/api/browser-crawler/interface/BrowserCrawlerOptions#errorHandler) function to handle request failures.
* Ensure you are comfortable using [key-value stores](https://docs.apify.com/sdk/js/docs/guides/result-storage#key-value-store) and [datasets](https://docs.apify.com/sdk/js/docs/guides/result-storage#dataset), and understand the differences between the two storage types.

## Knowledge check 📝

1. Why might you want to store statistics about an Actor's run (or a specific request)?
2. In our Amazon scraper, we are trying to store the number of retries of a request once its data is pushed to the dataset. Where would you get this information? Where would you store it?
3. What is the difference between the `failedRequestHandler` and `errorHandler`?

## Our task

In our Amazon Actor, each dataset result must now have the following extra keys:


```
{
    "dateHandled": "date-here", // the date + time at which the request was handled
    "numberOfRetries": 4, // the number of retries of the request before running successfully
    "currentPendingRequests": 24 // the current number of requests left pending in the request queue
}
```


Also, an object including these values should be persisted during the run in the key-value store and logged to the console every 10 seconds:


```
{
    "errors": { // all of the errors for every request path
        "some-site.com/products/123": [
            "error1",
            "error2"
        ]
    },
    "totalSaved": 43 // total number of saved items throughout the entire run
}
```


[Solution](https://docs.apify.com/academy/expert-scraping-with-apify/solutions/saving-stats.md)

## Wrap up

Wow, you've learned a whole lot in this course, so give yourself the pat on the back that you deserve! If you were able to follow along with this course, that means that you're officially an **Apify pro**, and that you're equipped with all of the knowledge and tools you need to build awesome scalable web-scrapers either for your own personal projects or for the Apify platform.

Congratulations! 🎉


---

# Solutions

**View all of the solutions for all of the activities and tasks of this course. Please try to complete each task on your own before reading the solution!**

***

The final section of each lesson in this course will be a task which you as the course-taker are expected to complete before moving on to the next lesson. Each task's completion and understanding plays an important role in the ability to continue through the course.

If you ever get stuck, or if you feel like your solution could be more optimal, you can always refer to the **Solutions** section of the course. Each solution will have all of the code and explanations needed to understand it.

**Please** try to do each task **on your own** prior to checking out the solution!


---

# Handling migrations

**Get real-world experience of maintaining a stateful object stored in memory, which will be persisted through migrations and even graceful aborts.**

***

Let's first head into our **demo-actor** and create a new file named **asinTracker.js** in the **src** folder. Within this file, we are going to build a utility class which will allow us to store, modify, persist, and log our tracked ASIN data.

Here's the skeleton of our class:


```
// asinTracker.js
class ASINTracker {
    constructor() {
        this.state = {};

        // Log the state to the console every ten
        // seconds
        setInterval(() => console.log(this.state), 10000);
    }

    // Add an offer to the ASIN's offer count
    // If ASIN doesn't exist yet, set it to 0
    incrementASIN(asin) {
        if (this.state[asin] === undefined) {
            this.state[asin] = 0;
            return;
        }

        this.state[asin] += 1;
    }
}

// It is only a utility class, so we will immediately
// create an instance of it and export that. We only
// need one instance for our use case.
export default new ASINTracker();
```


Multiple techniques exist for storing data in memory; however, this is the most modular way, as all state-persistence and modification logic will be held in this file.
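
To see how the counts evolve, here is a small stand-alone trace of the same `incrementASIN` logic, copied into a plain function (without the logging interval) so that it can be run in isolation. The first call registers the ASIN with a count of 0, and each later call adds one offer.

```javascript
// Same logic as ASINTracker.incrementASIN, operating on a plain object.
const state = {};

function incrementASIN(asin) {
    if (state[asin] === undefined) {
        state[asin] = 0;
        return;
    }

    state[asin] += 1;
}

incrementASIN('B079ZJ1BPR'); // product discovered, count starts at 0
incrementASIN('B079ZJ1BPR'); // first offer scraped
incrementASIN('B079ZJ1BPR'); // second offer scraped
incrementASIN('B079ZJ1BPR'); // third offer scraped

console.log(state); // { B079ZJ1BPR: 3 }
```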

Here is our updated **routes.js** file which is now utilizing this utility class to track the number of offers for each product ASIN:


```
// routes.js
import { createCheerioRouter } from '@crawlee/cheerio';
import { BASE_URL, OFFERS_URL, labels } from './constants.js';
import tracker from './asinTracker.js';
import { dataset } from './main.js';

export const router = createCheerioRouter();

router.addHandler(labels.START, async ({ $, crawler, request }) => {
    const { keyword } = request.userData;

    const products = $('div > div[data-asin]:not([data-asin=""])');

    for (const product of products) {
        const element = $(product);
        const titleElement = $(element.find('.a-text-normal[href]'));

        const url = `${BASE_URL}${titleElement.attr('href')}`;

        // For each product, add it to the ASIN tracker
        // and initialize its collected offers count to 0
        tracker.incrementASIN(element.attr('data-asin'));

        await crawler.addRequests([{
            url,
            label: labels.PRODUCT,
            userData: {
                data: {
                    title: titleElement.first().text().trim(),
                    asin: element.attr('data-asin'),
                    itemUrl: url,
                    keyword,
                },
            },
        }]);
    }
});

router.addHandler(labels.PRODUCT, async ({ $, crawler, request }) => {
    const { data } = request.userData;

    const element = $('div#productDescription');

    await crawler.addRequests([{
        url: OFFERS_URL(data.asin),
        label: labels.OFFERS,
        userData: {
            data: {
                ...data,
                description: element.text().trim(),
            },
        },
    }]);
});

router.addHandler(labels.OFFERS, async ({ $, request }) => {
    const { data } = request.userData;

    const { asin } = data;

    for (const offer of $('#aod-offer')) {
        // For each offer, add 1 to the ASIN's
        // offer count
        tracker.incrementASIN(asin);

        const element = $(offer);

        await dataset.pushData({
            ...data,
            sellerName: element.find('div[id*="soldBy"] a[aria-label]').text().trim(),
            offer: element.find('.a-price .a-offscreen').text().trim(),
        });
    }
});
```


## Persisting state

The **persistState** event is automatically fired (by default) every 60 seconds by the Apify SDK while the Actor is running and is also fired when the **migrating** event occurs.

In order to persist our ASIN tracker object, let's use the `Actor.on` function to listen for the **persistState** event and store it in the key-value store each time it is emitted.


```
// asinTracker.js
import { Actor } from 'apify';
// We've updated our constants.js file to include the name
// of this new key in the key-value store
import { ASIN_TRACKER } from './constants.js';

class ASINTracker {
    constructor() {
        this.state = {};

        Actor.on('persistState', async () => {
            await Actor.setValue(ASIN_TRACKER, this.state);
        });

        setInterval(() => console.log(this.state), 10000);
    }

    incrementASIN(asin) {
        if (this.state[asin] === undefined) {
            this.state[asin] = 0;
            return;
        }

        this.state[asin] += 1;
    }
}

export default new ASINTracker();
```


## Handling resurrections

Great! Now our state will be persisted every 60 seconds in the key-value store. However, we're not done. Let's say that the Actor migrates and is resurrected. We never actually update the `state` variable of our `ASINTracker` class with the state stored in the key-value store, so as our code currently stands, we still don't support state-persistence on migrations.

In order to fix this, let's create a method called `initialize` which will be called at the very beginning of the Actor's run, and will check the key-value store for a previous state under the key **ASIN-TRACKER**. If a previous state does live there, then it will update the class' `state` variable with the value read from the key-value store:


```
// asinTracker.js
import { Actor } from 'apify';
import { ASIN_TRACKER } from './constants.js';

class ASINTracker {
    constructor() {
        this.state = {};

        Actor.on('persistState', async () => {
            await Actor.setValue(ASIN_TRACKER, this.state);
        });

        setInterval(() => console.log(this.state), 10000);
    }

    async initialize() {
        // Read the data from the key-value store. If it
        // doesn't exist, it will be null
        const data = await Actor.getValue(ASIN_TRACKER);

        // If the data does exist, replace the current state
        // (initialized as an empty object) with the data
        if (data) this.state = data;
    }

    incrementASIN(asin) {
        if (this.state[asin] === undefined) {
            this.state[asin] = 0;
            return;
        }

        this.state[asin] += 1;
    }
}

export default new ASINTracker();
```


We'll now call this function at the top level of the **main.js** file to ensure it is the first thing that gets called when the Actor starts up:


```
// main.js

// ...
import tracker from './asinTracker.js';

// The Actor.init() function should be executed before
// the tracker's initialization
await Actor.init();

await tracker.initialize();
// ...
```


That's everything! Now, even if the Actor migrates (or is gracefully aborted and then resurrected), this `state` object will always be persisted.

## Quiz answers 📝

**Q: Actors have an option in the Settings tab to Restart on error. Would you use this feature for regular Actors? When would you use this feature?**

**A:** It's best not to enable this option by default. If an Actor fails, there must be a reason, which should be investigated first. This means the failure case should be handled when resurrecting the Actor, and the state should be persisted beforehand.

**Q: Migrations happen randomly, but by [aborting a run gracefully](https://docs.apify.com/platform/actors/running/runs-and-builds.md#aborting-runs), you can simulate a similar situation. Try this out on the platform and observe what happens. What changes occur, and what remains the same for the restarted Actor's run?**

**A:** After aborting or throwing an error mid-process, it manages to start back from where it was upon resurrection.

**Q: Why don't you (usually) need to add any special migration handling code for a standard crawling/scraping Actor? Are there any features in Crawlee or Apify SDK that handle this under the hood?**

**A:** Because Crawlee and the Apify SDK handle migrations for us under the hood. If you want to add custom migration-handling code, you can use `Actor.on()` to listen for the `migrating` or `persistState` events and save the current state in the key-value store (or elsewhere).

**Q: How can you intercept the migration event? How much time do you have after this event happens and before the Actor migrates?**

**A:** By using the `Actor.on` function. You have a maximum of a few seconds before shutdown after the `migrating` event has been fired.

**Q: When would you persist data to the default key-value store instead of to a named key-value store?**

**A:** Persisting data to the default key-value store would help when handling an Actor's run state or with storing metadata about the run (such as results, miscellaneous files, or logs). Using a named key-value store allows you to persist data at the account level to handle data across multiple Actor runs.

## Wrap up

In this activity, we learned how to persist custom values on an interval as well as after Actor migrations by using the `persistState` event and the key-value store. With this knowledge, you can safely increase your Actor's performance by storing data in variables and then pushing them to the dataset periodically/at the end of the Actor's run as opposed to pushing data immediately after it's been collected.

One important thing to note is that this workflow can be used to replace the usage of `userData` to pass data between requests, as it allows for the creation of a "global store" which all requests have access to at any time.


---

# Integrating webhooks

**Learn how to integrate webhooks into your Actors. Webhooks are a super powerful tool, and can be used to do almost anything!**

***

In this lesson we'll be writing a new Actor and integrating it with our beloved Amazon scraping Actor. First, we'll navigate to the same directory where our **demo-actor** folder lives, and run `apify create filter-actor` *(once again, you can name the Actor whatever you want, but for this lesson, we'll be calling the new Actor **filter-actor**)*. When prompted about the programming language, select **JavaScript**:


```
$ apify create filter-actor
? Choose the programming language of your new Actor:
❯ JavaScript
  TypeScript
  Python
```


Then use the arrow down key to select **Empty JavaScript Project**:


```
$ apify create filter-actor
✔ Choose the programming language of your new Actor: JavaScript
? Choose a template for your new Actor. Detailed information about the template will be shown in the next step.
  Crawlee + Playwright + Chrome
  Crawlee + Playwright + Camoufox
  Bootstrap CheerioCrawler
  Cypress
❯ Empty JavaScript Project
  Standby JavaScript Project
  ...
```


As a last step, confirm the choices by selecting **Install template** and wait until our new Actor is ready.

## Building the new Actor

First of all, we should clear out any of the boilerplate code within **main.js** to get a clean slate:


```
// main.js
import { Actor } from 'apify';

await Actor.init();

// ...

await Actor.exit();
```


We'll be passing the ID of the Amazon Actor's default dataset along to the new Actor, so we can expect that as an input:


```
const { datasetId } = await Actor.getInput();
const dataset = await Actor.openDataset(datasetId);
// ...
```


> **Accessing cloud datasets locally:** When running the Actor locally, use the `forceCloud` option to open a dataset from platform storage: `Actor.openDataset(datasetId, { forceCloud: true })`.

Next, we'll grab hold of the dataset's items with the `dataset.getData()` function:


```
const { items } = await dataset.getData();
```


While several methods can achieve the goal output of this Actor, using [`Array.reduce()`](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Array/reduce) is the most concise approach:


```
const filtered = items.reduce((acc, curr) => {
    // Grab the price of the item matching our current
    // item's ASIN in the map. If it doesn't exist, set
    // "prevPrice" to null
    const prevPrice = acc?.[curr.asin] ? +acc[curr.asin].offer.slice(1) : null;

    // Grab the price of our current offer
    const price = +curr.offer.slice(1);

    // If the item doesn't yet exist in the map, add it.
    // Or, if the current offer's price is less than the
    // saved one, replace the saved one
    if (!acc[curr.asin] || prevPrice > price) acc[curr.asin] = curr;

    // Return the map
    return acc;
}, {});
```


The results should be an array, so we can take the map we just created and push an array of its values to the Actor's default dataset:


```
await Actor.pushData(Object.values(filtered));
```
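
To check the filtering logic without running an Actor, the same reduction can be tried on a few hand-written items. The sample data and the `parsePrice` helper below are hypothetical and not part of the lesson's code. The helper is there because slicing off the first character and converting with `+` yields `NaN` for prices with a thousands separator, such as `$1,299.00`.

```javascript
// Hypothetical sample items shaped like the Amazon Actor's output
const items = [
    { asin: 'B0AAA', sellerName: 'Seller 1', offer: '$24.99' },
    { asin: 'B0AAA', sellerName: 'Seller 2', offer: '$19.99' },
    { asin: 'B0BBB', sellerName: 'Seller 3', offer: '$1,299.00' },
    { asin: 'B0BBB', sellerName: 'Seller 4', offer: '$1,349.00' },
];

// Hypothetical helper: keep only digits and the decimal point, so that
// '$1,299.00' becomes 1299 instead of NaN
const parsePrice = (offer) => Number(offer.replace(/[^0-9.]/g, ''));

const cheapestByAsin = (list) => {
    const map = list.reduce((acc, curr) => {
        const saved = acc[curr.asin];

        // Keep the current item if its ASIN is new, or if it is cheaper
        // than the offer saved so far
        if (!saved || parsePrice(saved.offer) > parsePrice(curr.offer)) {
            acc[curr.asin] = curr;
        }

        return acc;
    }, {});

    return Object.values(map);
};

const cheapest = cheapestByAsin(items);
console.log(cheapest.map(({ asin, sellerName, offer }) => `${asin}: ${sellerName} at ${offer}`));
```

With this sample, only **Seller 2** and **Seller 3** remain, one item per ASIN.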


Our final code looks like this:


```
import { Actor } from 'apify';

await Actor.init();

const { datasetId } = await Actor.getInput();
const dataset = await Actor.openDataset(datasetId);

const { items } = await dataset.getData();

const filtered = items.reduce((acc, curr) => {
    const prevPrice = acc?.[curr.asin] ? +acc[curr.asin].offer.slice(1) : null;
    const price = +curr.offer.slice(1);

    if (!acc[curr.asin] || prevPrice > price) acc[curr.asin] = curr;

    return acc;
}, {});

await Actor.pushData(Object.values(filtered));

await Actor.exit();
```


Cool! But **wait**, don't forget to configure the **INPUT\_SCHEMA.json** file as well! It's not necessary to do this step, as we'll be calling the Actor through Apify's API within a webhook, but it's still good to get into the habit of writing quality input schemas that describe the input values your Actors are expecting.


```
{
    "title": "Amazon Filter Actor",
    "type": "object",
    "schemaVersion": 1,
    "properties": {
        "datasetId": {
            "title": "Dataset ID",
            "type": "string",
            "description": "Enter the ID of the dataset.",
            "editor": "textfield"
        }
    },
    "required": ["datasetId"]
}
```


Now we're done, and we can push it up to the Apify platform with the `apify push` command:


```
$ apify push
Info: Created Actor with name filter-actor on Apify.
Info: Deploying Actor 'filter-actor' to Apify.
Run: Updated version 0.0 for Actor filter-actor.
Run: Building Actor filter-actor
(timestamp) ACTOR: Extracting Actor documentation from README.md
(timestamp) ACTOR: Building Docker image.
...
(timestamp) ACTOR: Pushing Docker image to repository.
(timestamp) ACTOR: Build finished.
Actor build detail https://console.apify.com/actors/Yk1bieximsduYDydP#/builds/0.0.1
Actor detail https://console.apify.com/actors/Yk1bieximsduYDydP
Success: Actor was deployed to Apify cloud and built there.
```


## Setting up the webhook

We'll use the [Apify API](https://docs.apify.com/academy/api/run-actor-and-retrieve-data-via-api.md) to set up the webhook. To compose the HTTP request, we'll need either the ID of our Actor or its technical name. Let's take a second look at the end of the output of the `apify push` command:


```
...
Actor build detail https://console.apify.com/actors/Yk1bieximsduYDydP#/builds/0.0.1
Actor detail https://console.apify.com/actors/Yk1bieximsduYDydP
Success: Actor was deployed to Apify cloud and built there.
```


The URLs tell us that our Actor's ID is `Yk1bieximsduYDydP`. With this `actorId`, and our `token`, which is retrievable through **Settings > Integrations** on the Apify Console, we can construct a link which will call the Actor:


```
https://api.apify.com/v2/acts/Yk1bieximsduYDydP/runs?token=YOUR_TOKEN_HERE
```


We can also use our username and the name of the Actor like this:


```
https://api.apify.com/v2/acts/USERNAME~filter-actor/runs?token=YOUR_TOKEN_HERE
```


Whichever one you choose is totally up to your preference.
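
If you'd rather not assemble these links by hand, a small helper can do it. This is only a sketch: `buildRunUrl` is a hypothetical name, and the token value is a placeholder.

```javascript
// Hypothetical helper: accepts either the Actor's ID or its
// USERNAME~actor-name form and returns the "run Actor" endpoint URL
const buildRunUrl = (actorIdOrName, token) => {
    const url = new URL(`https://api.apify.com/v2/acts/${encodeURIComponent(actorIdOrName)}/runs`);
    url.searchParams.set('token', token);
    return url.toString();
};

console.log(buildRunUrl('Yk1bieximsduYDydP', 'YOUR_TOKEN_HERE'));
console.log(buildRunUrl('USERNAME~filter-actor', 'YOUR_TOKEN_HERE'));
```

Both calls produce the same links as the ones written out above.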

Next, within the Amazon scraping Actor, we will click the **Integrations** tab and choose **Webhook**, then fill out the details to look like this:

![Configuring a webhook](/assets/images/adding-webhook-c76d2f73bb0cadcf48620b59db1a1a9c.jpg)

We have chosen to run the webhook once the Actor has succeeded, which means that its default dataset will surely be populated. Since the filtering Actor is expecting the default dataset ID of the Amazon Actor, we use the `resource` variable to grab hold of the `defaultDatasetId`.
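
Since the screenshot's contents aren't reproduced in text, here is a rough sketch of how such a payload template turns into the filter Actor's input. It assumes the template references `{{resource.defaultDatasetId}}` and that each variable is replaced by its JSON-stringified value. The `renderTemplate` function and the sample `context` are hypothetical and only imitate the substitution; they are not Apify's implementation.

```javascript
// A payload template as it could be entered in the webhook form. It is
// not valid JSON until the {{...}} variable has been replaced.
const payloadTemplate = '{ "datasetId": {{resource.defaultDatasetId}} }';

// Toy renderer (NOT Apify's implementation): replace each {{path}} with
// the JSON-stringified value found at that path in the context object
const renderTemplate = (template, context) => template.replace(
    /\{\{\s*([\w.]+)\s*\}\}/g,
    (_, path) => JSON.stringify(path.split('.').reduce((obj, key) => obj?.[key], context)),
);

// Hypothetical excerpt of the finished run, exposed as `resource`
const context = { resource: { id: 'RUN_ID', defaultDatasetId: 'DATASET_ID' } };

const payload = JSON.parse(renderTemplate(payloadTemplate, context));
console.log(payload);
```

The rendered payload is what the filter Actor receives as its input, which is why `Actor.getInput()` there can destructure `datasetId`.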

Click **Save**, then run the Amazon **demo-actor** again.

## Making sure it worked

If everything worked, then at the end of the **demo-actor**'s run, we should see this within the **Integrations** tab:

![Webhook succeeded](/assets/images/webhook-succeeded-f95ddb172f63747d28dc72e5cdbb9c21.png)

Additionally, we should be able to see that our **filter-actor** was run, and have access to its dataset:

![Dataset preview](/assets/images/dataset-preview-711de106446452a93cc8c15675d77a4d.png)

## Quiz answers 📝

**Q: How do you allocate more CPU for an Actor's run?**

**A:** CPU is allocated in proportion to memory, so you get more CPU by allocating more memory. On the platform, memory can be set in the run options when starting the Actor, and the default memory can be changed in the Actor's **Settings** tab. When running locally, you can set the **APIFY\_MEMORY\_MBYTES** environment variable. 4 GB of memory corresponds to 1 CPU core on the Apify platform.
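
As a quick arithmetic sketch of that ratio (the helper name is hypothetical, and the only fact used is the 4 GB per core figure from the answer above):

```javascript
// 4 GB (4096 MB) of memory corresponds to 1 CPU core
const MB_PER_CPU_CORE = 4096;

// Hypothetical helper: CPU share for a given memory allocation
const cpuCoresForMemory = (memoryMbytes) => memoryMbytes / MB_PER_CPU_CORE;

for (const memory of [1024, 4096, 8192]) {
    console.log(`${memory} MB -> ${cpuCoresForMemory(memory)} CPU core(s)`);
}
```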

**Q: Within itself, can you get the exact time that an Actor was started?**

**A:** Yes. The time the Actor was started can be retrieved through the `startedAt` property from the `Actor.getEnv()` function, or directly from `process.env.APIFY_STARTED_AT`.

**Q: What are the types of default storages connected to an Actor's run?**

**A:** Every Actor's run is given a default key-value store and a default dataset. The default key-value store by default has the `INPUT` and `OUTPUT` keys. The run also gets a default request queue.

**Q: Can you change the allocated memory of an Actor while it's running?**

**A:** Not while it's running. You'd need to stop it and run a new one. However, there is an option to soft abort an Actor, then resurrect the run with a different memory configuration.

**Q: How can you run an Actor with Puppeteer on the Apify platform with headless mode set to `false`?**

**A:** This can be done by using the `actor-node-puppeteer-chrome` Docker image and making sure that `launchContext.launchOptions.headless` in `PuppeteerCrawlerOptions` is set to `false`.

## Wrap up

See that?! Integrating webhooks is a piece of cake on the Apify platform! You'll soon discover that the platform factors away a lot of complex things and allows you to focus on what's most important - developing and releasing Actors.


---

# Managing source

**View in-depth answers for all three of the quiz questions that were provided in the corresponding lesson about managing source code.**

***

In the lesson corresponding to this solution, we discussed an extremely important topic: source code management. Though we solved the task right in the lesson, we've still included the quiz answers here.

## Quiz answers

**Q: Do you have to rebuild an Actor each time the source code is changed?**

**A:** Yes. It needs to be built into an image, saved in a registry, and later on run in a container.

**Q: In Git, what is the difference between pushing changes and making a pull request?**

**A:** Pushing changes updates the remote branch with the content of the local branch. Code changes are usually pushed to a branch parallel to the one you eventually want them to end up in.

When creating a pull request, the code is meant to be reviewed, or at least pass all the test suites before being merged into the target branch.

**Q: Based on your knowledge and experience, is the `apify push` command worth using (in your opinion)?**

**A:** The `apify push` command can sometimes be useful when testing ideas; however, it is preferable to use the GitHub integration rather than pushing directly to the platform.


---

# Rotating proxies/sessions

**Learn firsthand how to rotate proxies and sessions in order to avoid the majority of the most common anti-scraping protections.**

***

If you take a look at our current code for the Amazon scraping Actor, you might notice this snippet:


```
const proxyConfiguration = await Actor.createProxyConfiguration({
    groups: ['RESIDENTIAL'],
});
```


We didn't provide much explanation for this initially, as it was not directly relevant to the lesson at hand. When you [create a proxy configuration](https://docs.apify.com/academy/anti-scraping/mitigation/using-proxies.md) and pass it to a crawler, Crawlee will make the crawler automatically rotate through the proxies. This entire time, we've been using the **RESIDENTIAL** proxy group to avoid being blocked by Amazon.

> Go ahead and try commenting out the proxy configuration code then running the scraper. What happens?

In order to rotate sessions, we must utilize the [**SessionPool**](https://crawlee.dev/api/core/class/SessionPool), which we've actually already been using by setting the **useSessionPool** option in our crawler's configuration to **true**. The SessionPool advances the concept of proxy rotation by tying proxies to user-like sessions and rotating those instead. In addition to a proxy, each user-like session has cookies attached to it (and potentially a browser fingerprint as well).

## Configuring SessionPool

Let's go ahead and add a **sessionPoolOptions** key to our crawler's configuration so that we can modify the default settings:


```
const crawler = new CheerioCrawler({
    requestList,
    requestQueue,
    proxyConfiguration,
    useSessionPool: true,
    // This is where our session pool
    // configuration lives
    sessionPoolOptions: {
        // We can add options for each
        // session created by the session
        // pool here
        sessionOptions: {

        },
    },
    maxConcurrency: 50,
    // ...
});
```


Now, we'll use the **maxUsageCount** key to force each session to be thrown away after 5 uses and **maxErrorScore** to trash a session once it receives an error.


```
const crawler = new CheerioCrawler({
    requestList,
    requestQueue,
    proxyConfiguration,
    useSessionPool: true,
    sessionPoolOptions: {
        sessionOptions: {
            maxUsageCount: 5,
            maxErrorScore: 1,
        },
    },
    maxConcurrency: 50,
    // ...
});
```


And that's it! We've successfully configured the session pool to match the task's requirements.
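
To build some intuition for what these two options do, here is a toy model of a single session. It is a simplified sketch, not Crawlee's implementation: the `ToySession` class is hypothetical and only tracks a usage counter and an error score against the two limits.

```javascript
// Toy model of a session (NOT Crawlee's Session class)
class ToySession {
    constructor({ maxUsageCount, maxErrorScore }) {
        this.maxUsageCount = maxUsageCount;
        this.maxErrorScore = maxErrorScore;
        this.usageCount = 0;
        this.errorScore = 0;
    }

    // A successful request only counts as one use
    markGood() {
        this.usageCount += 1;
    }

    // A failed request counts as one use and raises the error score
    markBad() {
        this.usageCount += 1;
        this.errorScore += 1;
    }

    // The session stays in rotation only while both limits hold
    isUsable() {
        return this.usageCount < this.maxUsageCount && this.errorScore < this.maxErrorScore;
    }
}

const options = { maxUsageCount: 5, maxErrorScore: 1 };

// Five successful requests exhaust the usage limit
const worn = new ToySession(options);
for (let i = 0; i < 5; i++) worn.markGood();

// A single error is enough to discard the session
const burned = new ToySession(options);
burned.markBad();

console.log({ worn: worn.isUsable(), burned: burned.isUsable() });
```

Both sessions end up unusable, so under these settings the pool would need to hand out fresh ones.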

## Limiting proxy location

The final requirement was to use proxies only from the US. Back in our **ProxyConfiguration**, we need to add the **countryCode** key and set it to **US**:


```
const proxyConfiguration = await Actor.createProxyConfiguration({
    groups: ['RESIDENTIAL'],
    countryCode: 'US',
});
```


## Quiz answers

**Q: What are the different types of proxies that Apify proxy offers? What are the main differences between them?**

**A:** Datacenter, residential, and Google SERP proxies with sub-groups. Datacenter proxies are fast and cheap but have a higher chance of being blocked on certain sites in comparison to residential proxies, which are IP addresses located in homes and offices around the world. Google SERP proxies are specifically for Google.

**Q: Which proxy groups do users get on the free plan? Can they access the proxy from their computer?**

**A:** All users have access to the **BUYPROXIES94952**, **GOOGLE\_SERP** and **RESIDENTIAL** groups. Free users cannot access the proxy from outside the Apify platform (paying users can).

**Q: How can you prevent an error from occurring if one of the proxy groups that a user has is removed? What are the best practices for these scenarios?**

**A:** By making the scraper's proxy configurable by the user through the Actor's input. That way, they can switch proxies if the Actor stops working due to proxy-related issues. It can also be done by using the **AUTO** proxy instead of specific groups.

**Q: Does it make sense to rotate proxies when you are logged into a website?**

**A:** No, because most websites tie an IP address to a session. If you start making requests with cookies used with a different IP address, the website might see it as unusual activity and either block the scraper or automatically log out.

**Q: Construct a proxy URL that will select proxies only from the US.**

**A:** `http://country-US:YOUR_PROXY_PASSWORD@proxy.apify.com:8000`
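
A proxy URL of this shape can also be assembled in code, with the proxy password placed between the colon and the `@`. The `buildProxyUrl` helper and the password value below are hypothetical placeholders.

```javascript
// Hypothetical helper: the country is encoded in the username part,
// and the proxy password goes between the colon and the "@"
const buildProxyUrl = ({ countryCode, password }) => {
    const username = `country-${countryCode.toUpperCase()}`;
    return `http://${username}:${encodeURIComponent(password)}@proxy.apify.com:8000`;
};

const proxyUrl = buildProxyUrl({ countryCode: 'us', password: 'YOUR_PROXY_PASSWORD' });
console.log(proxyUrl);
```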

**Q: What do you need to do to rotate a proxy (one proxy usually has one IP)? How does this differ for CheerioCrawler and PuppeteerCrawler?**

**A:** Making a new request with the proxy endpoint above will automatically rotate it. Sessions can also be used to automatically do this. While proxy rotation is fairly straightforward for Cheerio, it's more complex in Puppeteer, as you have to retire the browser each time a new proxy is rotated in. The SessionPool will automatically retire a browser when a session is retired. Sessions can be manually retired with `session.retire()`.

**Q: Name a few different ways how a website can prevent you from scraping it.**

**A:** IP detection and rate-limiting, browser/fingerprint detection, user behavior tracking, etc.

## Wrap up

In this solution, you learned one of the most important concepts in web scraping - proxy/session rotation. With your newfound knowledge of the SessionPool, you'll be (practically) unstoppable!


---

# Saving run stats

**Implement the saving of general statistics about an Actor's run, as well as adding request-specific statistics to dataset items.**

***

The code in this solution will be similar to what we already did in the **Handling migrations** solution; however, we'll be storing and logging different data. First, let's create a new file called **Stats.js** and write a utility class for storing our run stats:


```
// Stats.js
import { Actor } from 'apify';

class Stats {
    constructor() {
        this.state = {
            errors: {},
            totalSaved: 0,
        };
    }

    async initialize() {
        const data = await Actor.getValue('STATS');

        if (data) this.state = data;

        Actor.on('persistState', async () => {
            await Actor.setValue('STATS', this.state);
        });

        setInterval(() => console.log(this.state), 10000);
    }

    addError(url, errorMessage) {
        if (!this.state.errors?.[url]) this.state.errors[url] = [];
        this.state.errors[url].push(errorMessage);
    }

    success() {
        this.state.totalSaved += 1;
    }
}

export default new Stats();
```


Cool, very similar to the **ASINTracker** class we wrote earlier. We'll now import **Stats** into our **main.js** file and initialize it along with the ASIN tracker:


```
// ...
import Stats from './Stats.js';

await Actor.init();
await tracker.initialize();
await Stats.initialize();
// ...
```


## Tracking errors

In order to keep track of errors, we must write a new function within the crawler's configuration called **errorHandler**. Passed into this function is an object containing an **Error** object for the error which occurred and the **Request** object, as well as information about the session and proxy which were used for the request.


```
const crawler = new CheerioCrawler({
    proxyConfiguration,
    useSessionPool: true,
    sessionPoolOptions: {
        persistStateKey: 'AMAZON-SESSIONS',
        sessionOptions: {
            maxUsageCount: 5,
            maxErrorScore: 1,
        },
    },
    maxConcurrency: 50,
    requestHandler: router,
    // Handle all failed requests
    errorHandler: async ({ error, request }) => {
        // Add an error for this url to our error tracker
        Stats.addError(request.url, error?.message);
    },
});
```


## Tracking total saved

Now, we'll increment our **totalSaved** count for every offer added to the dataset.


```
router.addHandler(labels.OFFERS, async ({ $, request }) => {
    const { data } = request.userData;

    const { asin } = data;

    for (const offer of $('#aod-offer')) {
        tracker.incrementASIN(asin);
        // Add 1 to totalSaved for every offer
        Stats.success();

        const element = $(offer);

        await dataset.pushData({
            ...data,
            sellerName: element.find('div[id*="soldBy"] a[aria-label]').text().trim(),
            offer: element.find('.a-price .a-offscreen').text().trim(),
        });
    }
});
```


## Saving stats with dataset items

Still in the **OFFERS** handler, we need to add a few extra keys to the items which are pushed to the dataset. Luckily, all of the data required by the task is accessible in the context object.


```
router.addHandler(labels.OFFERS, async ({ $, request }) => {
    const { data } = request.userData;

    const { asin } = data;

    for (const offer of $('#aod-offer')) {
        tracker.incrementASIN(asin);
        // Add 1 to totalSaved for every offer
        Stats.success();

        const element = $(offer);

        await dataset.pushData({
            ...data,
            sellerName: element.find('div[id*="soldBy"] a[aria-label]').text().trim(),
            offer: element.find('.a-price .a-offscreen').text().trim(),
            // Store the handledAt date or current date if that is undefined
            dateHandled: request.handledAt || new Date().toISOString(),
            // Access the number of retries on the request object
            numberOfRetries: request.retryCount,
            // Grab the number of pending requests from the requestQueue
            currentPendingRequests: (await requestQueue.getInfo()).pendingRequestCount,
        });
    }
});
```


## Quiz answers

**Q: Why might you want to store statistics about an Actor's run (or a specific request)?**

**A:** If certain types of requests are error-prone, you might want to save stats about the run to look at them later to either eliminate or better handle the errors. Things like **dateHandled** can be generally useful information.

**Q: In our Amazon scraper, we are trying to store the number of retries of a request once its data is pushed to the dataset. Where would you get this information? Where would you store it?**

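**A:** This information is available directly on the request object under the property **retryCount**. It can be stored alongside the scraped data in each dataset item, as we did with the **numberOfRetries** key.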

**Q: What is the difference between the `failedRequestHandler` and `errorHandler`?**

**A:** `failedRequestHandler` runs once, after a request has failed and used up all of its retries (`maxRequestRetries`). `errorHandler` runs after each failed attempt, before the request is retried.
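
The difference is easier to see when counting the calls. The following toy model is a sketch and not Crawlee's code: `runRequest` is a hypothetical function, and it assumes that `errorHandler` is invoked before each retry while `failedRequestHandler` is invoked once when no retries are left.

```javascript
// Toy model of a crawler's retry loop (NOT Crawlee's implementation).
// `attemptsThatFail` says how many attempts fail before one succeeds.
const runRequest = ({ attemptsThatFail, maxRequestRetries }) => {
    const calls = { errorHandler: 0, failedRequestHandler: 0 };
    let retryCount = 0;

    for (;;) {
        const attemptNumber = retryCount + 1;
        const failed = attemptNumber <= attemptsThatFail;

        if (!failed) return { ...calls, succeeded: true, retryCount };

        if (retryCount < maxRequestRetries) {
            // Assumed to run before every retry
            calls.errorHandler += 1;
            retryCount += 1;
        } else {
            // Assumed to run once, after the retries have run out
            calls.failedRequestHandler += 1;
            return { ...calls, succeeded: false, retryCount };
        }
    }
};

console.log(runRequest({ attemptsThatFail: 1, maxRequestRetries: 3 }));
console.log(runRequest({ attemptsThatFail: Infinity, maxRequestRetries: 3 }));
```

In the first call, the request recovers on its second attempt, so only `errorHandler` is counted. In the second call, every attempt fails, so `errorHandler` is counted three times and `failedRequestHandler` once.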


---

# Using the Apify API & JavaScript client

**Learn how to interact with the Apify API directly through the well-documented RESTful routes, or by using the proprietary Apify JavaScript client.**

***

Since we need to create another Actor, we'll once again use the `apify create` command and start from an empty template. This time, let's call our project **actor-caller**:


```
$ apify create actor-caller
? Choose the programming language of your new Actor:
❯ JavaScript
  TypeScript
  Python
```


Again, use the arrow down key to select **Empty JavaScript Project**:


```
$ apify create actor-caller
✔ Choose the programming language of your new Actor: JavaScript
? Choose a template for your new Actor. Detailed information about the template will be shown in the next step.
  Crawlee + Playwright + Chrome
  Crawlee + Playwright + Camoufox
  Bootstrap CheerioCrawler
  Cypress
❯ Empty JavaScript Project
  Standby JavaScript Project
  ...
```


Confirm the choices by selecting **Install template** and wait until our new Actor is ready. Now let's also set up some boilerplate, grabbing our inputs and creating a constant variable for the task:


```
import { Actor } from 'apify';
import axios from 'axios';

await Actor.init();

const { useClient, memory, fields, maxItems } = await Actor.getInput();

const TASK = 'YOUR_USERNAME~demo-actor-task';

// our future code will go here

await Actor.exit();
```


## Calling a task via JavaScript client

When using the `apify-client` package, you can create a new client instance by using `new ApifyClient()`. Within the Apify SDK however, it is not necessary to even install the `apify-client` package, as the `Actor.newClient()` function is available for use.

We'll start by creating a function called `withClient()` and creating a new client, then calling the task:


```
const withClient = async () => {
    const client = Actor.newClient();
    const task = client.task(TASK);

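    // The first argument is the task input override; run options
    // such as memory go in the second argument
    const { id } = await task.call(undefined, { memory });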
};
```


After the task has run, we'll grab hold of its dataset, then attempt to download the items, plugging in our `maxItems` and `fields` inputs. Then, once the data has been downloaded, we'll push it to the default key-value store under a key named **OUTPUT**, with the content type set to CSV.


```
const withClient = async () => {
    const client = Actor.newClient();
    const task = client.task(TASK);

    // Run options go in the second argument; the first is the input override
    const { id } = await task.call(undefined, { memory });

    const dataset = client.run(id).dataset();

    const items = await dataset.downloadItems('csv', {
        limit: maxItems,
        fields,
    });

    // If the content type is anything other than JSON, it must
    // be specified within the third options parameter
    return Actor.setValue('OUTPUT', items, { contentType: 'text/csv' });
};
```


## Calling a task via API

First, we'll create a function named `withAPI` (right under the `withClient()` function) and instantiate a new variable which represents the API endpoint to run our task:


```
const withAPI = async () => {
    const uri = `https://api.apify.com/v2/actor-tasks/${TASK}/run-sync-get-dataset-items?`;
};
```


To add the query parameters to the URL, we could create a super long string literal, plugging in all of our input values; however, there is a much better way: [`URLSearchParams`](https://nodejs.org/api/url.html#new-urlsearchparams). By using `URLSearchParams`, we can add the query parameters in an object:


```
const withAPI = async () => {
    const uri = `https://api.apify.com/v2/actor-tasks/${TASK}/run-sync-get-dataset-items?`;
    const url = new URL(uri);

    url.search = new URLSearchParams({
        memory,
        format: 'csv',
        limit: maxItems,
        fields: fields.join(','),
        token: process.env.APIFY_TOKEN,
    });
};
```
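
One detail worth knowing about this approach is that `URLSearchParams` converts every value to a string, so a missing input would be sent as the literal text `undefined`. The sketch below shows one way to guard against that. The `buildTaskUrl` helper and the sample values are hypothetical and not part of the lesson's code.

```javascript
// Hypothetical helper: build the task's run endpoint with query parameters
const buildTaskUrl = (task, params) => {
    const url = new URL(`https://api.apify.com/v2/actor-tasks/${task}/run-sync-get-dataset-items`);

    // Drop undefined/null values; URLSearchParams would otherwise
    // serialize them as the literal strings "undefined" and "null"
    const defined = Object.entries(params).filter(([, value]) => value != null);

    url.search = new URLSearchParams(defined).toString();
    return url;
};

const url = buildTaskUrl('YOUR_USERNAME~demo-actor-task', {
    memory: 4096,
    format: 'csv',
    limit: 10,
    fields: ['title', 'url', 'price'].join(','),
    token: undefined, // e.g. APIFY_TOKEN is not set when testing locally
});

console.log(url.toString());
```

Numbers are stringified automatically, the comma-separated `fields` value is percent-encoded for us, and the missing `token` is left out instead of being sent as a bogus value.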


Finally, let's make a `POST` request to our endpoint. You can use any library you want, but in this example, we'll use [axios](https://www.npmjs.com/package/axios). Don't forget to run `npm install axios` if you're going to use this package too!


```
const withAPI = async () => {
    const uri = `https://api.apify.com/v2/actor-tasks/${TASK}/run-sync-get-dataset-items?`;
    const url = new URL(uri);

    url.search = new URLSearchParams({
        memory,
        format: 'csv',
        limit: maxItems,
        fields: fields.join(','),
        token: process.env.APIFY_TOKEN,
    });

    const { data } = await axios.post(url.toString());

    return Actor.setValue('OUTPUT', data, { contentType: 'text/csv' });
};
```


## Finalizing the Actor

Now, since we've written both of these functions, all we have to do is write a conditional statement based on the boolean value from `useClient`:


```
if (useClient) await withClient();
else await withAPI();
```


And before we push to the platform, let's not forget to write an input schema in the **INPUT\_SCHEMA.json** file:


```
{
  "title": "Actor Caller",
  "type": "object",
  "schemaVersion": 1,
  "properties": {
    "memory": {
      "title": "Memory",
      "type": "integer",
      "description": "Select memory in megabytes.",
      "default": 4096,
      "maximum": 32768,
      "unit": "MB"
    },
    "useClient": {
      "title": "Use client?",
      "type": "boolean",
      "description": "Specifies whether the Apify JS client, or the pure Apify API should be used.",
      "default": true
    },
    "fields": {
      "title": "Fields",
      "type": "array",
      "description": "Enter the dataset fields to export to CSV",
      "prefill": ["title", "url", "price"],
      "editor": "stringList"
    },
    "maxItems": {
      "title": "Max items",
      "type": "integer",
      "description": "Fill the maximum number of items to export.",
      "default": 10
    }
  },
  "required": ["useClient", "memory", "fields", "maxItems"]
}
```


## Final code

To ensure we're on the same page, here is what the final code looks like:


```
import { Actor } from 'apify';
import axios from 'axios';

await Actor.init();

const { useClient, memory, fields, maxItems } = await Actor.getInput();

const TASK = 'YOUR_USERNAME~demo-actor-task';

const withClient = async () => {
    const client = Actor.newClient();
    const task = client.task(TASK);

    // Run options go in the second argument; the first is the input override
    const { id } = await task.call(undefined, { memory });

    const dataset = client.run(id).dataset();

    const items = await dataset.downloadItems('csv', {
        limit: maxItems,
        fields,
    });

    return Actor.setValue('OUTPUT', items, { contentType: 'text/csv' });
};

const withAPI = async () => {
    const uri = `https://api.apify.com/v2/actor-tasks/${TASK}/run-sync-get-dataset-items?`;
    const url = new URL(uri);

    url.search = new URLSearchParams({
        memory,
        format: 'csv',
        limit: maxItems,
        fields: fields.join(','),
        token: process.env.APIFY_TOKEN,
    });

    const { data } = await axios.post(url.toString());

    return Actor.setValue('OUTPUT', data, { contentType: 'text/csv' });
};

if (useClient) {
    await withClient();
} else {
    await withAPI();
}

await Actor.exit();
```


## Quiz answers 📝

**Q: What is the relationship between the Apify API and Apify client? Are there any significant differences?**

**A:** The Apify client mimics the Apify API, so there aren't any super significant differences. It's super handy as it helps with managing the API calls (parsing, error handling, retries, etc.) and even adds convenience functions.

The one main difference is that the Apify client automatically uses [exponential backoff](https://docs.apify.com/api/client/js/docs#retries-with-exponential-backoff) to deal with errors.
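
For readers who haven't met the term, exponential backoff means waiting longer after each failed attempt, typically doubling the delay. The sketch below illustrates the general technique only; it is not the client's actual implementation, and the function names and the 500 ms starting delay are assumptions made for the example.

```javascript
// Delay before each retry: the minimum delay, doubled on every attempt
const backoffDelays = ({ maxRetries, minDelayMillis }) => Array.from(
    { length: maxRetries },
    (_, attempt) => minDelayMillis * 2 ** attempt,
);

// Generic retry wrapper built on top of those delays. `fn` is any async
// function; `sleep` can be swapped out, which is handy in tests.
const retryWithBackoff = async (
    fn,
    options,
    sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms)),
) => {
    const delays = backoffDelays(options);

    for (let attempt = 0; ; attempt++) {
        try {
            return await fn(attempt);
        } catch (error) {
            // Give up once every retry has been used
            if (attempt >= delays.length) throw error;
            await sleep(delays[attempt]);
        }
    }
};

console.log(backoffDelays({ maxRetries: 4, minDelayMillis: 500 }));
```

When calling the API directly with a library like axios, a wrapper along these lines is something you would have to add yourself.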

**Q: How do you pass input when running an Actor or task via API?**

**A:** The input should be passed in the **body** of the request when running an Actor/task via API.

**Q: Do you need to install the `apify-client` npm package when already using the `apify` package?**

**A:** No. The Apify client is available right in the SDK with the `Actor.newClient()` function.

## Wrap up

That's it! Now, if you want to go above and beyond, you should create a GitHub repository for this Actor, integrate it with a new one on the Apify platform, and test if it works there as well (with multiple input configurations).


---

# Using storage & creating tasks

## Quiz answers 📝

**Q: What is the relationship between Actors and tasks?**

**A:** Tasks are pre-configured runs of Actors. The configurations of an Actor can be saved as a task so that it doesn't have to be manually configured every single time.

**Q: What are the differences between default (unnamed) and named storage? Which one would you use for everyday usage?**

**A:** Unnamed storage is persisted for only 7 days, while named storage is persisted indefinitely. For everyday usage, it is best to use default unnamed storages unless the data should explicitly be persisted for more than 7 days.

> With named storages, it's easier to verify that you're using the correct store, as they can be referred to by name rather than by an ID.

**Q: What is data retention, and how does it work for all types of storages (default and named)?**

**A:** Default/unnamed storages expire after 7 days unless otherwise specified. Named storages are retained indefinitely.

## Wrap up

You've learned how to use the different storage options available on Apify, the two different types of storage, as well as how to create tasks for Actors.


---

# Tasks & storage

**Understand how to save the configurations for Actors with Actor tasks. Also, learn about storage and the different types Apify offers.**

***

Tasks and storage are very different things; however, they are also tied together in many ways. **Tasks** run Actors, Actors return data, and data is stored in different types of **storage**.

## Tasks

Tasks are a very useful feature that allows us to save pre-configured inputs for Actors. This means that rather than configuring the Actor every time, or saving screenshots of various Actor configurations, you can store the configurations right in your Apify account and run the Actor with them at will.

## Storage

Storage allows us to save persistent data for further processing. As you'll learn, there are two main storage options on the Apify platform, as well as two main storage types (**named** and **unnamed**) with one big difference between them.

## Learning 🧠

* Check out the [documentation on tasks](https://docs.apify.com/platform/actors/running/tasks.md).
* Read about [dataset storage](https://docs.apify.com/platform/storage/dataset.md) on the Apify platform.
* Understand the [differences between named and unnamed storages](https://docs.apify.com/platform/storage/usage.md#named-and-unnamed-storages).
* Learn about the [`Dataset`](https://docs.apify.com/sdk/js/reference/class/Dataset) and [`KeyValueStore`](https://docs.apify.com/sdk/js/reference/class/KeyValueStore) objects in the Apify SDK.

## Knowledge check 📝

1. What is the relationship between Actors and tasks?
2. What are the differences between default (unnamed) and named storage? Which one would you use for everyday usage?
3. What is data retention, and how does it work for all types of storages (default and named)?

https://docs.apify.com/academy/expert-scraping-with-apify/solutions/using-storage-creating-tasks.md

## Next up

The [next lesson](https://docs.apify.com/academy/expert-scraping-with-apify/apify-api-and-client.md) is very exciting, as it will unlock the ability to seamlessly integrate your Apify Actors into your own external projects and applications with the Apify API.


---

# Monetizing your Actor

**Learn how you can monetize your web scraping and automation projects by publishing Actors to users in Apify Store.**

***

When you publish your Actor on the Apify platform, you have the option to make it a *Paid Actor* and earn revenue from users who benefit from your tool. You can choose between two pricing models:

* Rental
* Pay-per-result

## Rental pricing model

With the rental model, you can specify a free trial period and a monthly rental price. After the trial, users with an [Apify paid plan](https://apify.com/pricing) can continue using your Actor by paying the monthly fee. You receive 80% of the total rental fees collected each month.

Example - rental pricing model

You make your Actor rental with a 7-day free trial and then $30/month. During the first calendar month, three users start to use your Actor:

1. The first user, on an Apify paid plan, starts the free trial on the 15th.
2. The second user, on an Apify paid plan, starts the free trial on the 25th.
3. The third user, on the Apify free plan, starts the free trial on the 20th.

The first user pays their first rent when the 7-day free trial ends, i.e., on the 22nd. The second user only starts paying the rent next month. The third user is on the Apify free plan, so after the free trial ends on the 27th, they are not charged and cannot use the Actor further until they get a paid plan. Your profit is computed only from the first user. They were charged $30, so 80% of this goes to you, i.e., *0.8 \* 30 = $24*.
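
The arithmetic in this example can be sketched in a few lines of Python. The function name `rental_profit` and its parameters are purely illustrative and not part of any Apify API; only the 80% share and the $30 price come from the text above.

```python
# Illustrative sketch only: these names are hypothetical, not an Apify API.
CREATOR_SHARE = 0.8  # share of rental fees paid out to the Actor's author


def rental_profit(monthly_price_usd: float, charged_users: int) -> float:
    """Return the payout for the users who were actually charged rent this month."""
    return round(CREATOR_SHARE * monthly_price_usd * charged_users, 2)


# Only the first user was charged during the first calendar month.
print(rental_profit(30, 1))  # 24.0
```

Users still in their trial, or on the free plan, are left out of `charged_users`.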

## Pay-per-result pricing model

In this model, you set a price per 1000 results. Users are charged based on the number of results your Actor produces. Your profit is calculated as 80% of the revenue minus platform usage costs. The formula is:

`(0.8 * revenue) - costs = profit`

### Pay-per-result unit pricing for cost computation

| Service                         | Unit price                 |
| ------------------------------- | -------------------------- |
| Compute unit                    | **$0.3** / CU              |
| Residential proxies             | **$13** / GB               |
| SERPs proxy                     | **$3** / 1,000 SERPs       |
| Data transfer - external        | **$0.20** / GB             |
| Data transfer - internal        | **$0.05** / GB             |
| Dataset - timed storage         | **$1.00** / 1,000 GB-hours |
| Dataset - reads                 | **$0.0004** / 1,000 reads  |
| Dataset - writes                | **$0.005** / 1,000 writes  |
| Key-value store - timed storage | **$1.00** / 1,000 GB-hours |
| Key-value store - reads         | **$0.005** / 1,000 reads   |
| Key-value store - writes        | **$0.05** / 1,000 writes   |
| Key-value store - lists         | **$0.05** / 1,000 lists    |
| Request queue - timed storage   | **$4.00** / 1,000 GB-hours |
| Request queue - reads           | **$0.004** / 1,000 reads   |
| Request queue - writes          | **$0.02** / 1,000 writes   |

Only revenue & cost for Apify customers on paid plans are taken into consideration when computing your profit. Users on free plans are not reflected there, although you can see statistics about the potential revenue of users that are currently on free plans in Actor Insights in the Apify Console.

What are Gigabyte-hours?

Gigabyte-hours (GB-hours) are a unit of measurement used to quantify data storage and processing capacity over time. To calculate GB-hours, multiply the amount of data in gigabytes by the number of hours it's stored or processed.

For example, if you host 50GB of data for 30 days:

* Convert days to hours: *30 \* 24 = 720*
* Multiply data size by hours: *50 \* 720 = 36,000*

This means that storing 50 GB of data for 30 days results in 36,000 GB-hours.
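
The same calculation as a minimal Python sketch. The helper name `gb_hours` is an assumption made for illustration; the $1.00 per 1,000 GB-hours rate is the dataset timed storage price from the table above.

```python
def gb_hours(size_gb: float, days: float) -> float:
    """Convert a stored volume and a duration in days into GB-hours."""
    return size_gb * days * 24


usage = gb_hours(50, 30)
print(usage)  # 36000

# Cost at the dataset timed storage rate of $1.00 per 1,000 GB-hours.
cost_usd = usage / 1000 * 1.00
print(cost_usd)  # 36.0
```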

Read more about Actors in the Store and different pricing models from the perspective of your users in the [Store documentation](https://docs.apify.com/platform/actors/running/actors-in-store).

Example - pay-per-result pricing model

You make your Actor pay-per-result and set the price to $1 per 1,000 results. During the first month, two users on Apify paid plans use your Actor to get 50,000 and 20,000 results, costing them $50 and $20 respectively. Let's say the underlying platform usage for the first user is $5 and for the second $2. A third user, this time on the Apify free plan, uses the Actor to get 5,000 results, with underlying platform usage of $0.5.

Your profit is computed only from the first two users, since they are on Apify paid plans. The revenue for the first user is $50 and for the second $20, i.e., total revenue is $70. The total underlying cost is *$5 + $2 = $7*. Since your profit is 80% of the revenue minus the cost, it would be *0.8 \* 70 - 7 = $49*.
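
Here is the formula `(0.8 * revenue) - costs = profit` applied to this example as a short Python sketch. The function and parameter names are hypothetical and chosen only for illustration; the numbers are the ones from the example.

```python
# Illustrative sketch only: these names are hypothetical, not an Apify API.
CREATOR_SHARE = 0.8


def pay_per_result_profit(price_per_1000_usd, paid_user_runs):
    """Compute profit from (results, platform_cost_usd) pairs of paid-plan users."""
    revenue = sum(results / 1000 * price_per_1000_usd for results, _ in paid_user_runs)
    costs = sum(cost for _, cost in paid_user_runs)
    return round(CREATOR_SHARE * revenue - costs, 2)


# The free-plan user (5,000 results, $0.5 usage) is excluded from the input.
print(pay_per_result_profit(1.0, [(50_000, 5.0), (20_000, 2.0)]))  # 49.0
```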

### Best practices for Pay-per-results Actors

To ensure profitable operation:

* Set memory limits in your [`actor.json`](https://docs.apify.com/platform/actors/development/actor-definition/actor-json) file to control platform usage costs
* Implement the `ACTOR_MAX_PAID_DATASET_ITEMS` check to prevent excess result generation
* Test your Actor with various result volumes to determine optimal pricing
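
A minimal sketch of what an `ACTOR_MAX_PAID_DATASET_ITEMS` check could look like in Python. It assumes the variable holds an integer as a string and treats a missing value as "no limit"; both are assumptions, and the helper names `max_paid_items` and `cap_results` are hypothetical.

```python
import os


def max_paid_items(env=os.environ):
    """Return the limit from ACTOR_MAX_PAID_DATASET_ITEMS, or None if it is unset."""
    raw = env.get("ACTOR_MAX_PAID_DATASET_ITEMS")
    if raw is None or not raw.strip():
        return None  # assumption: no variable means no limit
    return int(raw)


def cap_results(items, limit):
    """Drop any results beyond the limit before they are stored."""
    return items if limit is None else items[:limit]


# A plain dict stands in for the environment here, for demonstration.
limit = max_paid_items({"ACTOR_MAX_PAID_DATASET_ITEMS": "3"})
print(cap_results(list(range(10)), limit))  # [0, 1, 2]
```

In a real Actor, you would stop scraping once the limit is reached rather than trimming afterwards, so no platform usage is spent on results the user will not pay for.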

## Setting up monetization

1. Navigate to your [Actors page](https://console.apify.com/actors?tab=my) in the Apify Console, choose the Actor that you want to monetize, and select the **Publication** tab.

   ![Monetization section](/assets/images/monetization-section-5ea234343a91208580100eb37c1b9e7f.png)

2. Open the **Monetization** section and complete your billing and payment details.

   ![Set up monetization](/assets/images/monetize_actor_set_up_monetization-7612e44589223f7e92b8adcd006bc1bb.png)

3. Follow the monetization wizard to configure your pricing model.

   ![Monetization wizard](/assets/images/monetization_wizard-39bd82ef5ffa7a6f5a9143d2892178a4.png)

### Changing monetization

You can change the monetization setting of your Actor by using the same wizard as for the setup in the **Monetization** section of your Actor's **Publication** tab.

Most changes take effect **immediately**. However, **major changes** require a 14-day notice period and are limited to once per month to protect users.

**Major changes** that require 14-day notice include:

* Changing the pricing model (e.g., from rental to pay-per-result)
* Increasing prices
* Adding new pay-per-event charges

All other changes (such as decreasing prices, adjusting descriptions, or removing pay-per-event charges) take effect immediately.

Frequency of major monetization adjustments

You can make major monetization changes to each Actor only **once per month**. After making a major change, you must wait until it takes effect (14 days) plus an additional period before making another major change. For further information & guidelines, please refer to our [Store terms and conditions](https://apify.com/store-terms-and-conditions).

## Payouts & analytics

Payout invoices are generated automatically on the 14th of each month. Review your invoice in the Settings > Payout section within one week. If not approved by the 20th, the system will auto-approve on the 21st.

Track your Actor's performance through:

* The payout section for financial records

* Actor Analytics for usage statistics

  ![Actor analytics](/assets/images/actor_analytics-72d29767ca18eb8c642d199bb488627f.png)

* Individual Actor Insights for detailed performance metrics

  ![Actor insights](/assets/images/actor-insights-5178afe3392983f919cf0f8755be182a.png)

## Promoting your Actor

Create SEO-optimized descriptions and README files to improve search engine visibility. Share your Actor on multiple channels:

* Post on Reddit, Quora, and social media platforms
* Create tutorial videos demonstrating key features
* Publish articles about your Actor on relevant websites
* Consider creating a product showcase on platforms like Product Hunt

Remember to tag Apify in your social media posts for additional exposure. Effective promotion can significantly impact your Actor's success, making the difference between an Actor with many paying users and one with few or none.

Learn more about promoting your Actor from https://apify.notion.site/3fdc9fd4c8164649a2024c9ca7a2d0da?v=6d262c0b026d49bfa45771cd71f8c9ab.


---

# Getting started

**Get started with the Apify platform by creating an account and learning about the Apify Console, which is where all Apify Actors are born!**

***

Your gateway to the Apify platform is your Apify account. The great thing about creating an account is that we support integration with both Google and GitHub, which takes only about 30 seconds!

1. Create your account on the [sign-up](https://console.apify.com/sign-up?asrc=developers_portal) page.
2. Check your email. You should have received a verification email with a link. Click it!
3. Done! 👍

## Getting to know the platform

Now that you have an account, you have access to the [Apify Console](https://console.apify.com?asrc=developers_portal), which is a wonderful place where you can utilize all of the features the platform has to offer, as well as manage and test your own projects.

## Next up

In our next lesson, we'll learn about something super exciting - **Actors**. Actors are the living and breathing core of the Apify platform and are an extremely powerful concept. What are you waiting for? Let's jump [right in](https://docs.apify.com/academy/getting-started/actors.md)!


---

# Actors

**What is an Actor? How do we create them? Learn the basics of what Actors are, how they work, and try out an Actor yourself right on the Apify platform!**

***

After you've followed the **Getting started** lesson, you're almost ready to start creating some Actors! But before we get into that, let's discuss what an Actor is, and a bit about how they work.

## What's an Actor?

When you deploy your script to the Apify platform, it is then called an **Actor**, which is a [serverless microservice](https://www.datadoghq.com/knowledge-center/serverless-architecture/serverless-microservices/#:~:text=Serverless%20microservices%20are%20cloud-based,suited%20for%20microservice-based%20architectures.) that accepts an input and produces an output. Actors can run for a few seconds, hours, or even infinitely. An Actor can perform anything from a basic action such as filling out a web form or sending an email, to complex operations such as crawling an entire website and removing duplicates from a large dataset.

Once an Actor has been pushed to the Apify platform, it can be shared with the world through the [Apify Store](https://apify.com/store), and even monetized after going public.

> Though the majority of Actors that are currently on the Apify platform are scrapers, crawlers, or automation software, Actors are not limited to scraping. They can be any program running in a Docker container.

## Actors on the Apify platform

For a super quick and dirty understanding of what a published Actor looks like, and how it works, let's run an SEO audit of **apify.com** using the [SEO audit tool](https://apify.com/misceres/seo-audit-tool).

On the front page of the Actor, click the green **Try for free** button. If you're logged into the Apify account you created during the [Getting started](https://docs.apify.com/academy/getting-started.md) lesson, you'll be taken to the Apify Console and greeted with a page that looks like this:

![Actor configuration](/assets/images/seo-actor-config-6cde16dcb2bc752723bf7c6ed8364075.png)

This is where we can provide input to the Actor. The defaults here are just fine, so we'll leave it as is and click the green **Start** button to run it. While the Actor is running, you'll see it log some information about itself.

![Actor logs](/assets/images/actor-logs-a100ea07b38cdbe0ff6bc9cf3d808472.jpg)

After the Actor has completed its run (you'll know this when you see **SEO audit for apify.com finished.** in the logs), the results of the run can be viewed by clicking the **Results** tab, then subsequently the **View in another tab** option under **Export**.

## The "Actors" tab

While still on the platform, click on the **Actors** tab. This tab is your one-stop shop for seeing which Actors you've used recently, and which ones you've developed yourself. You will be frequently using this tab when developing and testing on the Apify platform.

![The "Actors" tab on the Apify platform](/assets/images/actors-tab-6244fff86563e1f10b96f275583162a2.jpg)

Now that you know the basics of what Actors are and how to use them, it's time to develop **an Actor of your own**!

## Next up

Get ready, because in the https://docs.apify.com/academy/getting-started/creating-actors.md, you'll be writing your very own Actor!


---

# The Apify API

**Learn how to use the Apify API to programmatically call your Actors, retrieve data stored on the platform, view Actor logs, and more!**

***

The [Apify API](https://docs.apify.com/api/v2.md) is your ticket to the Apify platform without even needing to access the [Apify Console](https://console.apify.com?asrc=developers_portal) web interface. The API is organized around RESTful HTTP endpoints.

In this lesson, we'll be learning how to use the Apify API to call an Actor and view its results. We'll be using the Actor we created in the previous lesson, so if you haven't already gotten that one set up, go ahead and do that before moving forward if you'd like to follow along.

## Finding your endpoint

Within one of your Actors on the [Apify Console](https://console.apify.com?asrc=developers_portal) (we'll use the **adding-actor** from the previous lesson), click on the **API** button in the top right-hand corner:

![The "API" button on an Actor's page on the Apify Console](/assets/images/api-tab-1fb75598685ed64e58605cd51734d19c.jpg)

You should see a long list of API endpoints that you can copy and paste elsewhere, or even test right within the **API** modal. Go ahead and copy the endpoint labeled **Run Actor synchronously and get dataset items**. It should look something like this:


```
https://api.apify.com/v2/acts/YOUR_USERNAME~adding-actor/run-sync-get-dataset-items?token=YOUR_TOKEN
```


> In this lesson, we'll only be focusing on this one endpoint, as it is the most popularly used one; however, don't let this limit your curiosity! Take a look at the other endpoints in the **API** window to learn about everything you can do to your Actor programmatically.

Now, let's move over to our favorite HTTP client (in this lesson, we'll use [Insomnia](https://docs.apify.com/academy/tools/insomnia.md) to prepare and send the request).

## Providing input

Our **adding-actor** takes in two input values (`num1` and `num2`). When using the Actor on the platform, you provide these fields either through the UI generated by the **INPUT\_SCHEMA.json**, or directly in JSON format. When making an API call to run an Actor, the input must be provided in the **body** of the POST request as a JSON object.

![Providing input](/assets/images/provide-input-16fe316e976462f5e2d9ede9158b6b8b.jpg)

## Parameters

Let's say we want to run our **adding-actor** via API and view its results in CSV format at the end. We'll achieve this by passing the **format** parameter with a value of **csv** to change the output format:


```
https://api.apify.com/v2/acts/YOUR_USERNAME~adding-actor/run-sync-get-dataset-items?token=YOUR_TOKEN_HERE&format=csv
```


Additional parameters can be passed to this endpoint. You can learn about them in our [API documentation](https://docs.apify.com/api/v2/act-run-sync-get-dataset-items-post.md).

> Network components can record visited URLs, so it's more secure to send the token as an HTTP header, not as a parameter. The header should look like `Authorization: Bearer YOUR_TOKEN`. Popular HTTP clients, such as [Postman](https://docs.apify.com/academy/tools/postman.md) or [Insomnia](https://docs.apify.com/academy/tools/insomnia.md), provide a convenient way to configure the Authorization header for all your API requests.
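
As a rough sketch, here is how such a request could be prepared in Python with only the standard library, sending the token in the `Authorization` header instead of the URL. The helper name `build_run_request` is hypothetical; the endpoint, the `format` parameter, and the header shape are the ones described in this lesson.

```python
import json
import urllib.request


def build_run_request(username, actor_name, token, run_input, fmt="csv"):
    """Prepare (but do not send) a POST request that runs an Actor synchronously."""
    url = (
        f"https://api.apify.com/v2/acts/{username}~{actor_name}"
        f"/run-sync-get-dataset-items?format={fmt}"
    )
    return urllib.request.Request(
        url,
        data=json.dumps(run_input).encode("utf-8"),  # input goes in the body
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",  # token stays out of the URL
        },
        method="POST",
    )


request = build_run_request(
    "YOUR_USERNAME", "adding-actor", "YOUR_TOKEN", {"num1": 1, "num2": 8}
)
# To actually send it:
# body = urllib.request.urlopen(request).read().decode("utf-8")
```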

## Sending the request

If you're not using an HTTP client, you can send the request through your terminal with this command:


```
curl -d '{"num1":1, "num2":8}' -H "Content-Type: application/json" -X POST "https://api.apify.com/v2/acts/YOUR_USERNAME~adding-actor/run-sync-get-dataset-items?token=YOUR_TOKEN_HERE&format=csv"
```


Here's the response we got:

![API response](/assets/images/api-csv-response-486ba68d3939c6f5c9328f8fefa5c7a2.png)

And there it is! The Actor was run with our inputs of **num1** and **num2**, then the dataset results were returned back to us in CSV format.

## Apify API's many features

What we've done in this lesson only scratches the surface of what the Apify API can do. Right from Insomnia, or from any HTTP client, you can manage [datasets](https://docs.apify.com/api/v2/storage-datasets.md) and [key-value stores](https://docs.apify.com/api/v2/storage-key-value-stores.md), [add to request queues](https://docs.apify.com/api/v2/storage-request-queues.md), [manage the requests in them](https://docs.apify.com/api/v2/storage-request-queues-requests.md), and much more! Basically, whatever you can do on the platform's web interface, you can also do through the API.

## Next up

[In the next lesson](https://docs.apify.com/academy/getting-started/apify-client.md), we'll be learning about how to use Apify's JavaScript and Python clients to interact with the API right within our code.


---

# Apify client

**Interact with the Apify API in your code by using the apify-client package, which is available for both JavaScript and Python.**

***

Now that you've gotten your toes wet with interacting with the Apify API through raw HTTP requests, you're ready to become familiar with the **Apify client**, which is a package available for both JavaScript and Python that allows you to interact with the API in your code without explicitly needing to make any GET or POST requests.

This lesson will provide code examples for both Node.js and Python, so regardless of the language you are using, you can follow along!

## Examples

You can access `apify-client` examples in the Console Actor detail page. Click the **API** button and then the **API Client** dropdown button.

![API button](/assets/images/api-button-16287c6b358ebf6ad02c35f2ece5c333.png)

## Installing and importing

If you are going to use the client in Node.js, use this command within one of your projects to install the package through npm:


```
npm install apify-client
```


In Python, you can install it from PyPI with this command:


```
pip install apify-client
```


After installing the package, let's make a file named **client** and import the Apify client like so:

* Node.js
* Python


```
// client.js
import { ApifyClient } from 'apify-client';
```



```
# client.py
from apify_client import ApifyClient
```


## Running an Actor

In the last lesson, we ran the **adding-actor** and retrieved its dataset items. That's exactly what we're going to do now; however, by using the Apify client instead.

Before we can use the client though, we must create a new instance of the `ApifyClient` class and pass it our API token from the [Integrations page](https://console.apify.com/account?tab=integrations&asrc=developers_portal) on the Apify Console:

* Node.js
* Python


```
const client = new ApifyClient({
    token: 'YOUR_TOKEN',
});
```



```
client = ApifyClient(token='YOUR_TOKEN')
```


> If you are planning on publishing your code to a public GitHub/GitLab repository or anywhere else online, be sure to set your API token as an environment variable, and never hardcode it directly into your script.
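
One way to follow this advice in Python is sketched below. The variable name `APIFY_TOKEN` and the helper `get_apify_token` are assumptions made for illustration; use whatever name fits your setup.

```python
import os


def get_apify_token(env=os.environ):
    """Read the API token from the environment instead of hardcoding it."""
    token = env.get("APIFY_TOKEN")  # assumed variable name
    if not token:
        raise RuntimeError("Set the APIFY_TOKEN environment variable first.")
    return token


# Usage with the client from this lesson (requires `pip install apify-client`):
# from apify_client import ApifyClient
# client = ApifyClient(token=get_apify_token())
```

Failing early with a clear message is a deliberate choice here: a missing token is easier to diagnose at startup than as an authorization error on the first API call.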

Now that we've got our instance, we can point to an Actor using the [`client.actor()`](https://docs.apify.com/api/client/js/reference/class/ApifyClient#actor) function, then call the Actor with some input using the [`.call()`](https://docs.apify.com/api/client/js/reference/class/ApifyClient#actor) function - the first parameter of which is the input for the Actor.

* Node.js
* Python


```
const run = await client.actor('YOUR_USERNAME/adding-actor').call({
    num1: 4,
    num2: 2,
});
```



```
run = client.actor('YOUR_USERNAME/adding-actor').call(run_input={
    'num1': 4,
    'num2': 2
})
```


> Learn more about the `.call()` function in our [documentation](https://docs.apify.com/api/client/js/reference/class/ApifyClient#actor).

## Downloading dataset items

Once an Actor's run has completed, it will return a **run info** object that looks something like this:

![Run info object](/assets/images/run-info-5744283cdcb67851aa05d10ef782d69d.jpg)

The `run` variable we created in the last section points to the **run info** object of the run we created with the `.call()` function, which means that through this variable, we can access the run's `defaultDatasetId`. This ID can then be passed into the `client.dataset()` function.

* Node.js
* Python


```
const dataset = client.dataset(run.defaultDatasetId);
```



```
dataset = client.dataset(run['defaultDatasetId'])
```


Finally, we can download the items in the dataset by using the **list items** function, then log them to the console.

* Node.js
* Python


```
const { items } = await dataset.listItems();

console.log(items);
```



```
items = dataset.list_items().items

print(items)
```


The final code for running the Actor and fetching its dataset items looks like this:

* Node.js
* Python


```
// client.js
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({
    token: 'YOUR_TOKEN',
});

const run = await client.actor('YOUR_USERNAME/adding-actor').call({
    num1: 4,
    num2: 2,
});

const dataset = client.dataset(run.defaultDatasetId);

const { items } = await dataset.listItems();

console.log(items);
```



```
# client.py
from apify_client import ApifyClient

client = ApifyClient(token='YOUR_TOKEN')

run = client.actor('YOUR_USERNAME/adding-actor').call(run_input={
    'num1': 4,
    'num2': 2
})

dataset = client.dataset(run['defaultDatasetId'])

items = dataset.list_items().items

print(items)
```


## Updating an Actor

If you check the **Settings** tab within your **adding-actor**, you'll notice that the default timeout set for the Actor is **360 seconds**. This is a bit overkill considering that the Actor only adds two numbers together - the run should never take more than 20 seconds (even this is a generous number). The default memory allocated to the Actor is **256 MB**, which is reasonable for our purposes.

Let's change these two Actor settings via the Apify client using the [`actor.update()`](https://docs.apify.com/api/client/js/reference/class/ActorClient#update) function. This function will call the **update Actor** endpoint, which can take `defaultRunOptions` as an input property. You can find the shape of the `defaultRunOptions` in the [API documentation](https://docs.apify.com/api/v2/act-put.md). Perfect!

First, we'll create a pointer to our Actor, similar to before (except this time, we won't be using `.call()` at the end):

* Node.js
* Python


```
const actor = client.actor('YOUR_USERNAME/adding-actor');
```



```
actor = client.actor('YOUR_USERNAME/adding-actor')
```


Then, we'll call the `.update()` method on the `actor` variable we created and pass in our new **default run options**:

* Node.js
* Python


```
await actor.update({
    defaultRunOptions: {
        build: 'latest',
        memoryMbytes: 256,
        timeoutSecs: 20,
    },
});
```



```
actor.update(default_run_build='latest', default_run_memory_mbytes=256, default_run_timeout_secs=20)
```


After running the code, go back to the **Settings** page of **adding-actor**. If your default options now look like this, then it worked!

![New run defaults](/assets/images/new-defaults-ba42f0ce8c11e3b3a26e55d07f2d77b5.jpg)

## Overview

You can do so much more with the Apify client than running Actors, updating Actors, and downloading dataset items. The purpose of this lesson was to get you comfortable using the client in your own projects, as it's the absolute best developer tool for integrating the Apify platform with an external system.

For a more in-depth understanding of the Apify API client, give these a quick lookover:

* https://docs.apify.com/api/client/js
* https://docs.apify.com/api/client/python

## Next up

Now that you're familiar and a bit more comfortable with the Apify platform, you're ready to start deploying your code to Apify! In the [next section](https://docs.apify.com/academy/deploying-your-code.md), you'll learn how to take any project written in any programming language and turn it into an Actor.


---

# Creating Actors

**This lesson offers hands-on experience in building and running Actors in Apify Console using a template. By the end of it, you will be able to build and run your first Actor using an Actor template.**

***

You can create an Actor in several ways. For example, you can create one from your own source code hosted in a Git repository or on your local machine. But in this tutorial, we'll focus on the easiest method: selecting an Actor code template. We don't need to install any special software, and everything can be done directly in Apify Console using an Apify account.

## Choose the source

Once you're in Apify Console, go to https://console.apify.com/actors, and click on the **Develop new** button in the top right-hand corner.

![Develop an Actor button](/assets/images/develop-new-actor-a499c8a2618fec73c828ddb4dcbb75b4.png)

You'll be presented with a page featuring two ways to get started with a new Actor.

1. Creating an Actor from existing source code (using Git providers or pushing the code from your local machine using Apify CLI)
2. Creating an Actor from a code template

| Existing source code                                                                                                    | Code templates                                                                                                          |
| ----------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------- |
| ![Create an Actor from source code](/assets/images/create-actor-from-source-code-3b8f6761162e4c51daea94589b9e2407.png)  | ![Create an Actor from code templates](/assets/images/create-actor-from-templates-80f2545ea6bf5071f073ab66af3d9973.png) |

## Creating Actor from existing source code

If you already have your code hosted by a Git provider, you can use it to create an Actor by linking the repository. If you use GitHub, you can use our [GitHub integration](https://docs.apify.com/platform/integrations/github.md) to create an Actor from your public or private repository. You can also use GitLab, Bitbucket or other Git providers or external repositories.

![Create an Actor from Git repository](/assets/images/create-actor-git-0f6cdca6e156997d67fc7078944c97c9.png)

You can also push your existing code from your local machine using the [Apify CLI](https://docs.apify.com/cli). This is useful when you develop your code locally and then want to push it to the Apify Console to run it as an Actor in the cloud. For this option, you'll need the [Apify CLI installed](https://docs.apify.com/cli/docs/installation) on your machine. By clicking on the **Push your code using the Apify command-line interface (CLI)** button, you will be presented with instructions on how to push your code to the Apify Console.

![Push your code using the Apify CLI](/assets/images/create-actor-cli-4a172ba02eb3aeda5fc286317274f201.png)

## Creating Actor from code template

Python, JavaScript, and TypeScript have several template options that you can use.

> You can select one from the list on this page or you can browse all the templates in the template library by clicking on the **View all templates** button in the right corner.

For example, let's choose the **Start with JavaScript** template and click on the template card.

![JavaScript template card](/assets/images/create-actor-template-javascript-card-c532263658eb98fa3d68a1b522c4af94.png)

You will end up on a template detail page where you can see all the important information about the template - description, included features, used technologies, and the use case of this template. More importantly, there is a code preview and also instructions for how the code works.

![JavaScript template detail page](/assets/images/create-actor-template-detail-page-8ff37bb2c50a5756663f61ffca76a010.png)

### Using the template in the Web IDE

By clicking the **Use this template** button, you will create the Actor in Apify Console and be taken to the **Code** tab with the [Web IDE](https://docs.apify.com/platform/actors/development/quick-start/web-ide.md), where you can see the code of the template and start editing it.

> The Web IDE is a great tool for developing your Actor directly in Apify Console without the need to install or use any other software.

![Web IDE](/assets/images/create-actor-web-ide-53857177e9d96389456c6d0e5feff72a.png)

### Using the template locally

If you want to use the template locally, you can again use the [Apify CLI](https://docs.apify.com/cli) to download the template to your local machine.

> Creating an Actor from a template locally is a great option if you want to develop your code using your local environment and IDE and then push the final solution back to the Apify Console.

When you click on the **Use locally** button, you'll be presented with instructions on how to create an Actor from this template in your local environment.

With the Apify CLI installed, you can run the following commands in your terminal:


```
apify create my-actor -t getting_started_node
```



```
cd my-actor
apify run
```


![Use the template locally](/assets/images/create-actor-template-locally-b4d9caaebe286c60cbc29017f02ab3d4.png)

## Start with scraping single page

This template is a great starting point for web scraping as it extracts data from a single website. It uses [Axios](https://axios-http.com/docs/intro) for downloading the page content and [Cheerio](https://cheerio.js.org/) for parsing the HTML from the content.

Let's see what's inside the **Start with JavaScript** template. The main logic of the template lives in the `src/main.js` file.


```
// Apify SDK - toolkit for building Apify Actors (Read more at https://docs.apify.com/sdk/js/).
import { Actor } from 'apify';
// Axios - Promise based HTTP client for the browser and node.js (Read more at https://axios-http.com/docs/intro).
import axios from 'axios';
// Cheerio - The fast, flexible & elegant library for parsing and manipulating HTML and XML (Read more at https://cheerio.js.org/).
import * as cheerio from 'cheerio';

// The init() call configures the Actor for its environment. It's recommended to start every Actor with an init().
await Actor.init();

// Structure of input is defined in input_schema.json
const input = await Actor.getInput();
const { url } = input;

// Fetch the HTML content of the page.
const response = await axios.get(url);

// Parse the downloaded HTML with Cheerio to enable data extraction.
const $ = cheerio.load(response.data);

// Extract all headings from the page (tag name and text).
const headings = [];
$('h1, h2, h3, h4, h5, h6').each((i, element) => {
    const headingObject = {
        level: $(element).prop('tagName').toLowerCase(),
        text: $(element).text(),
    };
    console.log('Extracted heading', headingObject);
    headings.push(headingObject);
});

// Save headings to Dataset - a table-like storage.
await Actor.pushData(headings);

// Gracefully exit the Actor process. It's recommended to quit all Actors with an exit().
await Actor.exit();
```


The Actor takes the `url` from the input and then:

1. Sends a request to the URL.
2. Downloads the page's HTML content.
3. Extracts headings (H1 - H6) from the page.
4. Stores the extracted data.

The extracted data is stored in the [dataset](https://docs.apify.com/platform/storage/dataset.md), where you can preview and download it. We'll show how to do that later in the Run the Actor section.

> Feel free to play around with the code and add more features to it. For example, you can extract all the links or all the images from the page, or completely change the logic of this template. Keep in mind that this template uses an [input schema](https://docs.apify.com/academy/deploying-your-code/input-schema.md) defined in the `.actor/input_schema.json` file and linked from `.actor/actor.json`. If you want to change the input schema, you need to change it in those files as well. Learn more about Actor input and output in the [Inputs & outputs](https://docs.apify.com/academy/getting-started/inputs-outputs.md) lesson.

## Build the Actor 🧱

In order to run the Actor, you need to [build](https://docs.apify.com/platform/actors/development/builds-and-runs/builds.md) it first. Click the **Build** button at the bottom of the page or the **Build now** button right under the code editor.

![Build the Actor](/assets/images/build-actor-5aaefc12ec3684c08bd92818b88e3576.png)

After you've clicked the **Build** button, it'll take around 5–10 seconds to complete the build. You'll know it's finished when you see a green **Start** button.


## Fill the input

Now we are ready to run the Actor. But before we do that, let's give the Actor some input by going to the **Input** tab.

The **Input** tab is where you can provide the Actor with some meaningful input. In this case, we'll be providing the Actor with a URL to scrape. For now, we'll use the prefilled value of the [Apify website](https://apify.com/) (`https://apify.com/`).

You can change the website you want to extract the data from by changing the URL in the input field.

![Input tab](/assets/images/actor-input-tab-93256e980a452661e0a608910bddecb1.png)

## Run the Actor

Once you have provided the Actor with the URL you want to extract data from, click the **Start** button and wait a few seconds. You should see the Actor run logs in the **Last run** tab.

![Actor run logs](/assets/images/actor-run-1c928e9040dac9112be91f2bfbfde02f.png)

After the Actor finishes, you can preview or download the extracted data by clicking on the **Export X results** button.

![Export results](/assets/images/actor-run-dataset-a27223a2b496df661e18f8e311c9bfc4.png)

And that's it! You've just created your first Actor and extracted data from a website 🎉.

## Next up

We've created an Actor, but how can we give it more complex inputs and make it do stuff based on these inputs? This is exactly what we'll be discussing in the [next lesson](https://docs.apify.com/academy/getting-started/inputs-outputs.md)'s activity.


---

# Inputs & outputs

**Create an Actor from scratch which takes an input, processes that input, and then outputs a result that can be used elsewhere.**

***

Actors, like any other programs, take inputs and generate outputs. The Apify platform has a way to specify what inputs the Actor expects, and a way to temporarily or permanently store its results.

In this lesson, we'll be demonstrating inputs and outputs by building an Actor which takes two numbers as input, adds them up, and then outputs the result.

## Accept input into an Actor

Let's first create another new Actor using the same template as before. Feel free to refer to the [previous lesson](https://docs.apify.com/academy/getting-started/creating-actors.md) for a refresher on how to do this.

Replace all of the code in **main.js** with this code snippet:


```
import { Actor } from 'apify';

await Actor.init();

// Grab our numbers which were inputted
const { num1, num2 } = await Actor.getInput();

// Calculate the solution
const solution = num1 + num2;

// Push the solution to the dataset
await Actor.pushData({ solution });

await Actor.exit();
```


Then, replace everything in **INPUT\_SCHEMA.json** with this:

> This step isn't necessary, as the Actor will still be able to take input in JSON format without it; however, we are providing the content for this Actor's input schema in this lesson, as it will give the Apify platform a blueprint off of which it can generate a nice UI for your inputs, as well as validate their values.


```
{
    "title": "Number adder",
    "type": "object",
    "schemaVersion": 1,
    "properties": {
        "num1": {
            "title": "1st Number",
            "type": "integer",
            "description": "First number.",
            "editor": "number"
        },
        "num2": {
            "title": "2nd Number",
            "type": "integer",
            "description": "Second number.",
            "editor": "number"
        }
    },
    "required": ["num1", "num2"]
}
```
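
To make the role of the schema more concrete, here is a minimal sketch of the kind of checks it describes, written as a plain function. The `validateInput` helper is hypothetical and not part of the Apify SDK. The platform performs its own validation based on the schema, and this sketch only mirrors the `required` and `integer` rules from the JSON above.

```javascript
// Hypothetical helper (not part of the Apify SDK): mirrors the "required"
// and "integer" rules declared in the input schema above.
function validateInput(input) {
    const errors = [];
    for (const field of ['num1', 'num2']) {
        const value = input?.[field];
        if (value === undefined || value === null) {
            errors.push(`Field "${field}" is required.`);
        } else if (!Number.isInteger(value)) {
            errors.push(`Field "${field}" must be an integer.`);
        }
    }
    return errors;
}

// A string instead of a number, and a missing second field.
console.log(validateInput({ num1: '4' }));
// Valid input produces an empty list of errors.
console.log(validateInput({ num1: 4, num2: 5 }));
```

Inside the Actor, you could call such a function right after `Actor.getInput()` and fail early when it returns any errors. This matters for our adder because `'4' + 5` in JavaScript yields the string `'45'` rather than the number `9`.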


> If you're interested in learning more about how the code works, and what the **INPUT\_SCHEMA.json** means, read about [inputs](https://docs.apify.com/sdk/js/docs/examples/accept-user-input) and [adding data to a dataset](https://docs.apify.com/sdk/js/docs/examples/add-data-to-dataset) in the Apify SDK documentation, and refer to the [input schema specification](https://docs.apify.com/platform/actors/development/actor-definition/input-schema/specification/v1.md#integer).

Finally, **Save** and **Build** the Actor just as you did in the previous lesson.

## Configuring an Actor with inputs

By default, after running a build, the **Last build** tab will be selected, where you can see all of the logs related to building the Actor. Inputs can be configured within the **Input** tab.

![Configuring inputs](/assets/images/configure-inputs-0efc6f6ade028079e5da7b87e966bdcf.jpg)

Enter any two numbers you'd like, then press **Start**. The Actor's run should be completed almost immediately.

## View Actor results

Since we've pushed the result into the default dataset, both the result and some information about it can be viewed in two places inside the **Last run** tab:

1. **Export** button
2. **Storage** → **Dataset** (scroll below the main view)

On the results tab, there are a whole lot of options for which format to view/download the data in. Keep the default of **JSON** selected, and click on **Preview**.

![Dataset preview](/assets/images/dataset-preview-da23f5956de7eccb38a691f09fd3dd1c.png)

There's our solution! Did it work for you as well? Now, we can download the data right from the Dataset tab to be used elsewhere, or even programmatically retrieve it by using [Apify's API](https://docs.apify.com/api/v2.md) (we'll be discussing how to do this in the next lesson).

It's important to note that the default dataset of the Actor, which we pushed our solution to, will be retained for 7 days. If we wanted the data to be retained for an indefinite period of time, we'd have to use a named dataset. For more information about named storages vs unnamed storages, read a bit about [data retention on the Apify platform](https://docs.apify.com/platform/storage/usage.md#data-retention).

## Next up

In the [next lesson](https://docs.apify.com/academy/getting-started/apify-api.md)'s fun activity, you'll learn how to call the Actor we created in this lesson programmatically using one of Apify's most powerful tools - the Apify API.


---

# Why a glossary?

**Browse important web scraping concepts, tools and topics in succinct articles explaining common web development terms in a web scraping and automation context.**

***

Web scraping comes with a lot of terms that are specific to the area. Some of them are tools and libraries, like [Puppeteer and Playwright](https://docs.apify.com/academy/puppeteer-playwright.md) or Insomnia. Others are general topics that have a special place in web scraping, like headless browsers or browser fingerprints. And some topics are related to all web development, but play a special role in web scraping, such as HTTP headers and cookies.

When writing the academy, we realized very early on that we needed a place to reference these terms, but quickly found out that the usual tutorials and guides available all over the web weren't ideal. The explanations were too broad and generic and did not fit the web scraping context. With the **Apify Academy** glossary, we aim to provide you with short articles and lessons that provide the necessary web scraping context for specific terms, then link to other parts of the web for further in-depth reading.


---

# Scraping with Node.js

**A collection of various Node.js tutorials on scraping sitemaps, optimizing your scrapers, using popular Node.js web scraping libraries, and more.**

***

This section contains various web-scraping or web-scraping related tutorials for Node.js. Whether you're trying to scrape from a website with sitemaps, struggling with a dynamic page, want to optimize your slow Puppeteer scraper, or need some general tips for scraping in Node.js, this section is right for you.


---

# How to add external libraries to Web Scraper

Sometimes you need to use some extra JavaScript in your [Web Scraper](https://apify.com/apify/web-scraper) page functions. Whether it is to work with dates and times using [Moment.js](https://momentjs.com/), or to manipulate the DOM using [jQuery](https://jquery.com/), libraries save precious time and make your code more concise and readable. Web Scraper already provides a way to add jQuery to your page functions. All you need to do is check the Inject jQuery input option. There's also the option to Inject Underscore, a popular helper function library.

In this tutorial, we'll learn how to inject any JavaScript library into your page functions, with the only limitation being that the library needs to be available somewhere on the internet as a downloadable file (typically a CDN).

## Injecting Moment.js

Moment.js is a very popular library for working with date and time. It helps you with the parsing, manipulation, and formatting of datetime values in multiple locales and has become the de-facto standard for this kind of work in JavaScript.

To inject Moment.js into our page function (or any other library using the same method), we first need to find a link to download it from. We can find it in the [Moment.js documentation](https://momentjs.com/docs/#/use-it/browser/) under the CDN links.

> https://cdnjs.cloudflare.com/ajax/libs/moment.js/2.24.0/moment.min.js

Now we have two options. Inject the library using plain JavaScript, or if you prefer working with jQuery, use a jQuery helper.

## Injecting a library with plain JavaScript


```
async function pageFunction(context) {
    const libraryUrl = 'https://cdnjs.cloudflare.com/ajax/libs/moment.js/2.24.0/moment.min.js';

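    // Inject Moment.js
    await new Promise((resolve, reject) => {
        const script = document.createElement('script');
        script.src = libraryUrl;
        script.addEventListener('load', resolve);
        // Reject if the script fails to load, so the page function does not hang.
        script.addEventListener('error', () => reject(new Error(`Failed to load ${libraryUrl}`)));
        document.body.append(script);
    });

    // Confirm that it works.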
    const now = moment().format('ddd, hA');
    context.log.info(`NOW: ${now}`);
}
```


We're creating a script element in the page's DOM and waiting for the script to load. Afterwards, we confirm that the library has been successfully loaded by using one of its functions.

## Injecting a library using jQuery

After you select the Inject jQuery input option, jQuery will become available in your page function as `context.jQuery`.


```
async function pageFunction(context) {
    const libraryUrl = 'https://cdnjs.cloudflare.com/ajax/libs/moment.js/2.24.0/moment.min.js';

    const $ = context.jQuery;

    // Inject Moment.js
    await $.getScript(libraryUrl);

    // Confirm that it works.
    const now = moment().format('ddd, hA');
    context.log.info(`NOW: ${now}`);
}
```


With jQuery, we're using the `$.getScript()` helper to fetch the script for us and wait for it to load.

## Dealing with errors

Some websites employ security measures that disallow loading external scripts within their pages. Luckily, those measures can be overridden with Web Scraper. If you are encountering errors saying that your library cannot be loaded due to a security policy, select the Ignore CORS and CSP input option at the very bottom of Web Scraper input and the errors should go away.

Happy scraping!


---

# How to analyze and fix errors when scraping a website

**Learn how to deal with random crashes in your web-scraping and automation jobs. Find out the essentials of debugging and fixing problems in your crawlers.**

***

Debugging is absolutely essential in programming. Even if you don't call yourself a programmer, having basic debugging skills will make building crawlers easier. It will also help you save money by allowing you to avoid hiring an expensive developer to solve your issue for you.

This quick lesson covers the absolute basics by discussing some of the most common problems and the simplest tools for analyzing and fixing them.

## Possible causes

It is often tricky to see the full scope of what can go wrong. We assume that once the code is set up correctly, it will keep working. Unfortunately, that is rarely true in the realm of web scraping and automation.

Websites change, they introduce new [anti-scraping technologies](https://docs.apify.com/academy/anti-scraping.md), programming tools change and, in addition, people make mistakes.

Here are the most common reasons your working solution may break.

* The website changes its layout or [data feed](https://www.datafeedwatch.com/academy/data-feed).
* A site's layout changes depending on location or uses [A/B testing](https://www.youtube.com/watch?v=XDoKXaGrUxE&feature=youtu.be).
* A page starts to block you (recognizes you as a bot).
* The website [loads its data later dynamically](https://docs.apify.com/academy/node-js/dealing-with-dynamic-pages.md), so the code works only sometimes, if you are slow or lucky enough.
* You made a mistake when updating your code.
* Your [proxies](https://docs.apify.com/academy/anti-scraping/mitigation/proxies.md) aren't working.
* You have upgraded your [dependencies](https://www.quora.com/What-is-a-dependency-in-coding) (other software that your software relies upon), and the new versions no longer work (this is harder to debug).

## Diagnosing/analyzing the issue

Web scraping and automation are very specific types of programming. It is not possible to rely on specialized debugging tools, since the code does not output the same results every time. However, there are still many ways to diagnose issues in a crawler.

> Many issues are edge cases, which occur in one of a thousand pages or are time-dependent. Because of this, you cannot rely only on [determinism](https://en.wikipedia.org/wiki/Deterministic_algorithm).

### Logging

Logging is an essential tool for any programmer. When used correctly, it helps you capture a surprising amount of information. Here are some general rules for logging:

* Usually, having **many logs** is better than having **no logs** at all.
* Putting more information into one line, rather than logging multiple short lines, helps reduce the overall log size.
* Focus on numbers. Log how many items you extract from a page, etc.
* Structure your logs and use the same structure in all your logs.
* Append the current page's URL to each log. This lets you immediately open that page and review it.

Here's an example of what a structured log message might look like:


```
[CATEGORY]: Products: 20, Unique products: 4, Next page: true --- https://apify.com/store
```


The log begins with the **page type**. Usually, we use labels such as **\[CATEGORY]** and **\[DETAIL]**. Then, we log important numbers and other information. Finally, we add the page's URL, so we can check if the log is correct.
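
A small helper keeps every log line in this shape. The sketch below is only an illustration: `formatLog` is a hypothetical name, and the label, counters, and URL are the ones from the example message above.

```javascript
// Hypothetical helper: builds one structured log line in the format
// "[LABEL]: name: value, name: value --- url".
function formatLog(label, stats, url) {
    const details = Object.entries(stats)
        .map(([name, value]) => `${name}: ${value}`)
        .join(', ');
    return `[${label}]: ${details} --- ${url}`;
}

console.log(formatLog(
    'CATEGORY',
    { Products: 20, 'Unique products': 4, 'Next page': true },
    'https://apify.com/store',
));
// [CATEGORY]: Products: 20, Unique products: 4, Next page: true --- https://apify.com/store
```

Because the structure lives in one function, every page type logs the same fields in the same order, which makes the logs easier to scan and search.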

#### Logging errors

Errors require a different approach because, if your code crashes, your usual logs will not be called. Instead, exception handlers will print the error, but these are usually ugly messages with a [stack trace](https://en.wikipedia.org/wiki/Stack_trace) that only the experts will understand.

You can overcome this by adding [try/catch blocks](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Statements/try...catch) into your code. In the catch block, explain what happened and re-throw the error (so the request is automatically retried).


```
try {
    // Sensitive code block
    // ...
} catch (error) {
    // You know where the code crashed so you can explain here
    throw new Error('Request failed during login with an error', { cause: error });
}
```
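
When an error is re-thrown with a `cause`, the original error stays attached to the new one. A small helper can flatten that chain into a single readable log line. This is a sketch: `describeError` is a hypothetical name, and it relies only on the standard `cause` property of JavaScript errors.

```javascript
// Hypothetical helper: walks the `cause` chain of an error and joins the
// messages, so the explanation and the original failure fit on one line.
function describeError(error) {
    const messages = [];
    const seen = new Set(); // guards against circular cause chains
    let current = error;
    while (current instanceof Error && !seen.has(current)) {
        seen.add(current);
        messages.push(current.message);
        current = current.cause;
    }
    return messages.join(' <- caused by: ');
}

const original = new Error('Navigation timeout');
const wrapped = new Error('Request failed during login with an error', { cause: original });
console.log(describeError(wrapped));
// Request failed during login with an error <- caused by: Navigation timeout
```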


Read more about logging and error handling in our developer [best practices](https://docs.apify.com/academy/web-scraping-for-beginners/best-practices.md) section.

### Saving snapshots

By snapshots, we mean **screenshots** if you use a [browser with Puppeteer/Playwright](https://docs.apify.com/academy/puppeteer-playwright.md) and HTML saved into a [key-value store](https://crawlee.dev/api/core/class/KeyValueStore) that you can display in your own browser. Snapshots are useful throughout your code but especially important in error handling.

Note that an error may happen on only a few pages out of a thousand and look completely random. In such cases, saving and analyzing a snapshot is often the only practical way to investigate.

Snapshots can tell you if:

* A website has changed its layout. This can also mean A/B testing or different content for different locations.
* You have been blocked: you see a [CAPTCHA](https://en.wikipedia.org/wiki/CAPTCHA) or an **Access Denied** page.
* Data loads later dynamically: the page is empty.
* The page was redirected: the content is different.

You can learn how to take snapshots in Puppeteer or Playwright in [this short lesson](https://docs.apify.com/academy/puppeteer-playwright/page/page-methods.md).

#### When to save snapshots

The most common approach is to save on error. We can enhance our previous try/catch block like this:


```
import { Actor } from 'apify';
import { puppeteerUtils } from 'crawlee';

// ...
// storeId is the ID of the current key-value store, where we save snapshots
const storeId = Actor.getEnv().defaultKeyValueStoreId;
try {
    // Sensitive code block
    // ...
} catch (error) {
    // Change the way you save it depending on what tool you use
    const randomNumber = Math.random();
    const key = `ERROR-LOGIN-${randomNumber}`;
    await puppeteerUtils.saveSnapshot(page, { key });
    // saveSnapshot adds the .jpg and .html file extensions to the key automatically
    const screenshotLink = `https://api.apify.com/v2/key-value-stores/${storeId}/records/${key}.jpg`;

    // You know where the code crashed so you can explain here
    throw new Error(`Request failed during login with an error. Screenshot: ${screenshotLink}`, { cause: error });
}
// ...
```


To make the error snapshot descriptive, we name it **ERROR-LOGIN**. We add a random number so that later **ERROR-LOGIN** snapshots do not overwrite this one and we can see all of them. If you can use an ID of some sort instead, that is even better.

**Beware:**

* The snapshot's **name** (key) can only contain letters, numbers, dots and dashes. Other characters will cause an error, which makes the random number a safe pick.
* Do not overdo the snapshots. Once you get out of the testing phase, limit them to critical places. Saving snapshots uses resources.
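
If you build keys from an ID instead of a random number, it helps to sanitize the ID first. The following sketch assumes the character rule stated above (letters, numbers, dots and dashes). `makeSnapshotKey` is a hypothetical helper, not an Apify or Crawlee function.

```javascript
// Hypothetical helper: replaces every character outside letters, numbers,
// dots and dashes with a dash, so the result follows the key rule above.
function makeSnapshotKey(label, id) {
    const clean = (value) => String(value).replace(/[^a-zA-Z0-9.-]/g, '-');
    return `${clean(label)}-${clean(id)}`;
}

console.log(makeSnapshotKey('ERROR-LOGIN', 'user@example.com'));
// ERROR-LOGIN-user-example.com
```

Note that different IDs can map to the same key after sanitizing (for example `a@b` and `a-b`), so this works best with IDs that are already mostly alphanumeric.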

### Error reporting

Logging and snapshotting are great tools but once you reach a certain run size, it may be hard to read through them all. For a large project, it is handy to create a more sophisticated reporting system.

#### With the Apify SDK

This example extends our snapshot solution above by creating a [named dataset](https://docs.apify.com/platform/storage/usage.md#named-and-unnamed-storages) (named datasets have infinite retention), where we will accumulate error reports. Those reports will explain what happened and will link to a saved snapshot, so we can do a quick visual check.


```
import { Actor } from 'apify';
import { puppeteerUtils } from 'crawlee';

await Actor.init();
// ...
// Let's create reporting dataset
// If you already have one, this will continue adding to it
const reportingDataset = await Actor.openDataset('REPORTING');

try {
    // Sensitive code block
    // ...
} catch (error) {
    // Change the way you save it depending on what tool you use
    const randomNumber = Math.random();
    const key = `ERROR-LOGIN-${randomNumber}`;
    // The store gets removed with the run after data retention period so the links will stop working eventually
    // You can store the snapshots infinitely in a named KV store by adding `keyValueStoreName` option
    await puppeteerUtils.saveSnapshot(page, { key });

    // To create the reporting URLs, we need to know the Key-Value store and run IDs
    const { actorRunId, defaultKeyValueStoreId } = Actor.getEnv();

    // We create a report object
    const report = {
        errorType: 'login',
        errorMessage: error.toString(),
        // .html and .jpg file extensions are added automatically by the saveSnapshot function
        htmlSnapshotUrl: `https://api.apify.com/v2/key-value-stores/${defaultKeyValueStoreId}/records/${key}.html`,
        screenshotUrl: `https://api.apify.com/v2/key-value-stores/${defaultKeyValueStoreId}/records/${key}.jpg`,
        runUrl: `https://console.apify.com/actors/runs/${actorRunId}`,
    };

    // And we push the report to our reporting dataset
    await reportingDataset.pushData(report);

    // You know where the code crashed so you can explain here
    throw new Error('Request failed during login with an error', { cause: error });
}
// ...
await Actor.exit();
```
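
Once reports accumulate in the dataset, a short summary per error type is often easier to read than the raw items. The sketch below assumes the report objects carry the `errorType` field used above. `summarizeReports` is a hypothetical helper, and in a real Actor the items would come from the reporting dataset rather than a hard-coded array.

```javascript
// Hypothetical helper: counts how many reports were stored for each error type.
function summarizeReports(reports) {
    const counts = {};
    for (const { errorType } of reports) {
        counts[errorType] = (counts[errorType] ?? 0) + 1;
    }
    return counts;
}

// Sample data standing in for items read from the reporting dataset.
const sampleReports = [
    { errorType: 'login' },
    { errorType: 'login' },
    { errorType: 'checkout' },
];
console.log(summarizeReports(sampleReports)); // { login: 2, checkout: 1 }
```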


---

# Apify's free Google SERP API

Do you need to regularly grab SERP data about your target keywords? Apify provides a free SERP API that includes organic search, ads, people also ask, etc. Free Apify accounts come with unlimited proxy access and $5 of credit. To get started, head over to the [Google Search Scraper](https://apify.com/apify/google-search-scraper) page and click the `Try me` button. You'll be taken to a page where you can enter the search query, region, language and other settings.

![Apify Google SERP API](/assets/images/gserp-api-2621c8ee29f74544ef0ec986a4a8989a.png)

Hit `Save & Run` and you'll have the downloaded data as soon as the query finishes. To have it run at a regular frequency, you can set up the task to run on an [automatic schedule](https://docs.apify.com/platform/schedules.md#setting-up-a-new-schedule).

To run from the API, send a [POST request](https://docs.apify.com/api/v2/actor-task-run-sync-get-dataset-items-post.md) to an endpoint such as `https://api.apify.com/v2/actor-tasks/TASK_NAME_OR_ID/runs?token=YOUR_TOKEN`. Include any required input as a JSON object in the request's body.
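
The request described above can be prepared with a small helper. This is a sketch under stated assumptions: `buildRunRequest` is a hypothetical name, the endpoint is passed in by the caller, and the `queries` input field is an assumption about the Actor's input that you should check against its input schema.

```javascript
// Hypothetical helper: prepares the URL and options for a POST request
// that passes the API token as a query parameter and the input as JSON.
function buildRunRequest(endpoint, token, input) {
    const url = new URL(endpoint);
    url.searchParams.set('token', token);
    return {
        url: url.toString(),
        options: {
            method: 'POST',
            headers: { 'Content-Type': 'application/json' },
            body: JSON.stringify(input),
        },
    };
}

// `queries` is an assumed input field; replace the endpoint and token with your own.
const { url, options } = buildRunRequest(
    'https://api.example.com/v2/runs',
    'YOUR_TOKEN',
    { queries: 'web scraping' },
);
console.log(url, options.method);
// To actually send it: const response = await fetch(url, options);
```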

Keep in mind that, as Google search uses a non-deterministic algorithm, output results may vary even if the input settings are exactly the same.


---

# Avoid EACCES error in Actor builds with a custom Dockerfile

Sometimes when building an Actor using a custom Dockerfile, you might receive errors like:


```
Missing write access to ...
```


or


```
EACCES: permission denied
```


This problem is usually caused by the fact that by default, the `COPY` Dockerfile instruction copies files as the root user (with UID and GID of 0), while your Dockerfile probably uses another user to copy files and run commands.

To fix this problem, make sure the `COPY` instruction in your Dockerfile uses the `--chown` flag. For example, instead of


```
COPY . ./
```


use


```
COPY --chown=myuser:myuser . ./
```


where `myuser` is the user and group defined by the `USER` instruction in the base Docker image. To learn more, see the [Dockerfile documentation](https://docs.docker.com/reference/dockerfile/#copy).

Hope this helps!


---

# Block requests in Puppeteer

> **Improve performance: use `blockRequests`**
>
> Unfortunately, in recent versions of Puppeteer, request interception disables the native cache and slows down the Actor significantly. Therefore, it's not recommended to follow the examples shown in this article. Instead, use the [blockRequests](https://crawlee.dev/api/puppeteer-crawler/namespace/puppeteerUtils#BlockRequestsOptions) utility function from [Crawlee](https://crawlee.dev). It uses a different mechanism and doesn't slow down your process.

When using Puppeteer, a webpage will often load many resources that are not actually necessary for your use case. For example, a page could be loading many tracking libraries that are completely unnecessary for most crawlers but cause the page to use more traffic and load slower.

For example, take this web page: https://edition.cnn.com/. If we run an Actor that measures the amount of data downloaded from each response until the page is fully loaded, we get these results:

![Actor loading](/assets/images/actor-load-e6fc832092a1c94156fd96b3522c2c3b.png)

Now if we want to optimize this to keep the webpage looking the same, but ignore unnecessary requests, then after


```
const page = await browser.newPage();
```


we can use this piece of code:


```
await page.setRequestInterception(true);
page.on('request', (request) => {
    if (someCondition) request.abort();
    else request.continue();
});
```


Where `someCondition` is a custom condition (not actually implemented in the code above) that checks whether a request should be aborted.

For our example we will only disable some tracking scripts and then check if everything looks the same.

Here is the code used:


```
await page.setRequestInterception(true);
page.on('request', (request) => {
    const url = request.url();
    const filters = [
        'livefyre',
        'moatad',
        'analytics',
        'controltag',
        'chartbeat',
    ];
    const shouldAbort = filters.some((urlPart) => url.includes(urlPart));
    if (shouldAbort) request.abort();
    else request.continue();
});
```


With this code set up, this is the output:

![Improved Actor loading](/assets/images/improved-actor-loading-a1e7b6b855bb90ba1780f19f3653a34c.png)

And except for different ads, the page should look the same.

From this, we can see that just by blocking a few analytics and tracking scripts, the page loaded nearly 25 seconds faster and downloaded 35% less data (approximately, since the data is measured after it's decompressed).
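
The filtering decision can also be pulled out into a small pure function, which makes it possible to test the filter list without launching a browser. This is a sketch: `shouldAbort` and `BLOCKED_URL_PARTS` are hypothetical names, and the list simply repeats the filters used in the example above.

```javascript
// The same URL fragments as in the example above.
const BLOCKED_URL_PARTS = ['livefyre', 'moatad', 'analytics', 'controltag', 'chartbeat'];

// Hypothetical helper: returns true when the URL contains any blocked fragment.
function shouldAbort(url, blockedParts = BLOCKED_URL_PARTS) {
    return blockedParts.some((part) => url.includes(part));
}

console.log(shouldAbort('https://static.chartbeat.com/js/chartbeat.js')); // true
console.log(shouldAbort('https://edition.cnn.com/')); // false

// Inside the request handler it would be used like this:
// if (shouldAbort(request.url())) request.abort();
// else request.continue();
```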

Hopefully this helps you make your solutions faster and use fewer resources.


---

# How to optimize Puppeteer by caching responses

**Learn why it is important for performance to cache responses in memory when intercepting requests in Puppeteer and how to implement it in your code.**

***

> In the latest version of Puppeteer, the request-interception function inconveniently disables the native cache and significantly slows down the crawler. Therefore, it's not recommended to follow the examples shown in this article unless you have a very specific use case where the default browser cache is not enough (e.g. caching over multiple scraper runs).

When running crawlers that go through a single website, each open page has to load all resources again. The problem is that each resource needs to be downloaded through the network, which can be slow and/or unstable (especially when proxies are used).

For this reason, in this article, we will take a look at how to use memory to cache responses in Puppeteer (only those that contain header **cache-control** with **max-age** above **0**).

In this example, we will use a scraper which goes through top stories on the CNN website and takes a screenshot of each opened page. The scraper is very slow right now because it waits till all network requests are finished and because the posts contain videos. If the scraper runs with disabled caching, these statistics will show at the end of the run:

![Bad run stats](/assets/images/bad-scraper-stats-b38622928fa3b188cae38d285750451e.png)

As you can see, we used 177MB of traffic for 10 posts (that is how many posts are in the top-stories column) and 1 main page.

From the screenshot above, it's clear that most of the traffic is coming from script files (124MB) and documents (22.8MB). For this kind of situation, it's always good to check if the content of the page is cacheable. You can do that using Chrome's Developer Tools.

## Understanding and reproducing the issue

If we go to the CNN website, open up the tools and go to the **Network** tab, we will find an option to disable caching.

![Disabling cache in the Network tab](/assets/images/cnn-network-tab-0ca18e39872e758ab7f60f2cd601e0f1.png)

Once caching is disabled, we can take a look at how much data is transferred when we open the page. This is visible at the bottom of the developer tools.

![5.3MB of data transferred](/assets/images/slow-no-cache-0681379c53774a230ff67f2ec4704f7c.png)

If we uncheck the disable-cache checkbox and refresh the page, we will see how much data we can save by caching responses.

![642KB of data transferred](/assets/images/fast-with-cache-1a683d4e3a74468186b8d004c5fba276.png)

By comparison, the data transfer appears to be reduced by 88%!
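
The percentage can be double-checked with a few lines of JavaScript. This is only a sketch: `reductionPercent` is a hypothetical helper name, and the two sizes are the values read from the screenshots above, converted to megabytes.

```javascript
// Percentage of traffic saved, given two transfer sizes in the same unit.
function reductionPercent(before, after) {
    return Math.round((1 - after / before) * 100);
}

// 5.3 MB without cache vs. 642 KB (0.642 MB) with cache.
console.log(reductionPercent(5.3, 0.642)); // 88
```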

## Solving the problem by creating an in-memory cache

We can now emulate this and cache responses in Puppeteer. All we have to do is to check, when the response is received, whether it contains the **cache-control** header, and whether it's set with a **max-age** higher than **0**. If so, then we'll save the headers, URL, and body of the response to memory, and on the next request check if the requested URL is already stored in the cache.

The code will look like this:


```
// On top of your code
const cache = {};

// The code below should go between newPage function and goto function

await page.setRequestInterception(true);

page.on('request', async (request) => {
    const url = request.url();
    if (cache[url] && cache[url].expires > Date.now()) {
        await request.respond(cache[url]);
        return;
    }
    request.continue();
});

page.on('response', async (response) => {
    const url = response.url();
    const headers = response.headers();
    const cacheControl = headers['cache-control'] || '';
    const maxAgeMatch = cacheControl.match(/max-age=(\d+)/);
    const maxAge = maxAgeMatch && maxAgeMatch.length > 1 ? parseInt(maxAgeMatch[1], 10) : 0;
    if (maxAge) {
        if (cache[url] && cache[url].expires > Date.now()) return;

        let buffer;
        try {
            buffer = await response.buffer();
        } catch (error) {
            // some responses do not contain buffer and do not need to be cached
            return;
        }

        cache[url] = {
            status: response.status(),
            headers: response.headers(),
            body: buffer,
            expires: Date.now() + (maxAge * 1000),
        };
    }
});
```


> If the code above looks completely foreign to you, we recommend going through our free https://docs.apify.com/academy/puppeteer-playwright.md.

After implementing this code, we can run the scraper again.

![Good run results](/assets/images/good-run-results-38dc359a0a3b4cdf6b7611255218d234.png)

Looking at the statistics, caching responses in Puppeteer brought the traffic down from 177MB to 13.4MB, which is a reduction of data transfer by 92%. The related screenshots can be found https://my.apify.com/storage/key-value/iWQ3mQE2XsLA2eErL.

It did not speed up the crawler, but that is only because the crawler is set to wait until the network is nearly idle, and CNN has a lot of tracking and analytics scripts that keep the network busy.
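
The freshness rule used by the handlers (cache only when **max-age** is above **0**, serve only while the entry has not expired) can also be isolated into small pure functions, which makes it easier to test without a browser. The following is a minimal sketch; `parseMaxAge`, `toCacheEntry`, and `isFresh` are hypothetical names, not part of Puppeteer.

```javascript
// Extracts max-age (in seconds) from a Cache-Control header value; 0 when absent.
function parseMaxAge(cacheControl = '') {
    const match = cacheControl.match(/max-age=(\d+)/);
    return match ? parseInt(match[1], 10) : 0;
}

// Builds an entry in the same shape the article stores under cache[url].
// Returns null when the response is not cacheable under the max-age rule.
function toCacheEntry({ status, headers, body }, now = Date.now()) {
    const maxAge = parseMaxAge(headers['cache-control']);
    if (!maxAge) return null;
    return { status, headers, body, expires: now + maxAge * 1000 };
}

// An entry can be served only while its expiration time is in the future.
function isFresh(entry, now = Date.now()) {
    return Boolean(entry) && entry.expires > now;
}
```

With these helpers, the `request` handler would respond from the cache only when `isFresh(cache[url])` is true, and the `response` handler would store whatever `toCacheEntry(...)` returns when it is not `null`.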

## Implementation in Crawlee

Since most of you are likely using https://crawlee.dev, here is what response caching would look like using `PuppeteerCrawler`:



```
import { Dataset, PuppeteerCrawler } from 'crawlee';

const cache = {};

const crawler = new PuppeteerCrawler({
    preNavigationHooks: [async ({ page }) => {
        await page.setRequestInterception(true);

        page.on('request', async (request) => {
            const url = request.url();
            if (cache[url] && cache[url].expires > Date.now()) {
                await request.respond(cache[url]);
                return;
            }
            request.continue();
        });

        page.on('response', async (response) => {
            const url = response.url();
            const headers = response.headers();
            const cacheControl = headers['cache-control'] || '';
            const maxAgeMatch = cacheControl.match(/max-age=(\d+)/);
            const maxAge = maxAgeMatch && maxAgeMatch.length > 1 ? parseInt(maxAgeMatch[1], 10) : 0;

            if (maxAge) {
                // Skip responses that are already cached and still fresh
                if (cache[url] && cache[url].expires > Date.now()) return;

                let buffer;
                try {
                    buffer = await response.buffer();
                } catch {
                    // some responses do not contain buffer and do not need to be cached
                    return;
                }

                cache[url] = {
                    status: response.status(),
                    headers: response.headers(),
                    body: buffer,
                    expires: Date.now() + maxAge * 1000,
                };
            }
        });
    }],
    requestHandler: async ({ page, request }) => {
        await Dataset.pushData({
            title: await page.title(),
            url: request.url,
            succeeded: true,
        });
    },
});

await crawler.run(['https://apify.com/store', 'https://apify.com']);
```


---

# How to choose the right scraper for the job

**Learn basic web scraping concepts to help you analyze a website and choose the best scraper for your particular use case.**

***

You can use one of the two main ways to proceed with building your crawler:

1. Using plain HTTP requests.
2. Using an automated browser.

We will briefly go through the pros and cons of both, and also cover the basic steps for determining which one you should go with.

## Performance

First, let's discuss performance. Plain HTTP request-based scraping will **always** be faster than browser-based scraping. When using plain requests, the page's HTML is not rendered, no JavaScript is executed, no images are loaded, etc. Also, there's no memory used by the browser, and there are no CPU-hungry operations.

If it were only a question of performance, you'd of course use request-based scraping every time; however, it's unfortunately not that simple.

## Dynamic pages & blocking

Some websites do not load any data without a browser, as they need to execute some scripts to show it (these are known as https://docs.apify.com/academy/node-js/dealing-with-dynamic-pages.md). Another problem is blocking. If the website collects a https://docs.apify.com/academy/anti-scraping/techniques/fingerprinting.md, it can distinguish between a real user and a bot (crawler) and block access.

## Making the choice

When choosing which scraper to use, we would suggest first checking whether the website works without JavaScript. Probably the easiest way to do so is to use the https://docs.apify.com/academy/tools/quick-javascript-switcher.md extension for Chrome. If JavaScript is not needed, or you've spotted some XHR requests in the **Network** tab with the data you need, you probably won't need to use an automated browser. You can then check what data is received in response using https://docs.apify.com/academy/tools/postman.md or https://docs.apify.com/academy/tools/insomnia.md, or try to send a few requests programmatically. If the data is there and you're not blocked straight away, a request-based scraper is probably the way to go.
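
Sending a request programmatically can be as simple as the sketch below. It assumes Node.js 18 or newer (which ships a global `fetch`); `missingFromStaticHtml` and `checkStatic` are hypothetical helper names, and the markers are any strings (a product name, a price) that you expect to find on the page.

```javascript
// Returns the markers that do NOT appear in the raw HTML.
// An empty result suggests the data is available without running JavaScript.
function missingFromStaticHtml(html, markers) {
    return markers.filter((marker) => !html.includes(marker));
}

// Downloads the page with a plain HTTP request and reports what is missing.
async function checkStatic(url, markers) {
    const response = await fetch(url);
    const html = await response.text();
    return { status: response.status, missing: missingFromStaticHtml(html, markers) };
}
```

If `missing` comes back empty and `status` is a regular success code, that is a hint (not a guarantee) that a request-based scraper will do.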

It also depends of course on whether you need to fill in some data (like a username and password) or select a location (such as entering a zip code manually). Tasks where interacting with the page is absolutely necessary cannot be done using plain HTTP scraping, and require headless browsers. In some cases, you might also decide to use a browser-based solution in order to better blend in with the rest of the "regular" traffic coming from real users.


---

# How to scrape from dynamic pages

**Learn about dynamic pages and dynamic content. How can we find out if a page is dynamic? How do we programmatically scrape dynamic content?**

***

## A quick experiment

From our adored and beloved https://demo-webstore.apify.org/, we have been tasked to scrape each product's title, price, and image from the https://demo-webstore.apify.org/search/new-arrivals page.

![New arrival products in Fakestore](/assets/images/new-arrivals-a6b6da0fc639633520351f429b66bf4f.jpg)

First, create a file called **dynamic.js** and copy-paste the following boilerplate code into it:


```
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    requestHandler: async ({ $, request }) => {
        // We'll put our logic here in a minute
    },
});

await crawler.addRequests([{ url: 'https://demo-webstore.apify.org/search/new-arrivals' }]);

await crawler.run();
```


If you're in a brand new project, don't forget to initialize your project, then install the necessary dependencies:


```
# this command will initialize your project
# and install the "crawlee" package (Cheerio is included as its dependency)
npm init -y && npm i crawlee
```


Now, let's write some data extraction code to extract each product's data. This should look familiar if you went through the https://docs.apify.com/academy/web-scraping-for-beginners/data-extraction.md lessons:


```
import { CheerioCrawler } from 'crawlee';

const BASE_URL = 'https://demo-webstore.apify.org';

const crawler = new CheerioCrawler({
    requestHandler: async ({ $, request }) => {
        const products = $('a[href*="/product/"]');

        const results = [...products].map((product) => {
            const elem = $(product);

            const title = elem.find('h3').text();
            const price = elem.find('div[class*="price"]').text();
            const image = elem.find('img[src]').attr('src');

            return {
                title,
                price,
                image: new URL(image, BASE_URL).href,
            };
        });

        console.log(results);
    },
});

await crawler.run([{ url: 'https://demo-webstore.apify.org/search/new-arrivals' }]);
```


> Here, we are using the https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Array/map function to loop through all of the product elements and save them into an array we call `results` all at the same time.

After running it, you might say, "Great! It works!" **But wait...** What are those results being logged to console?

![Bad results in console](/assets/images/bad-results-f0ad878dbe1965962328c43da45fb920.png)

Every single image seems to have the same exact "URL," but they are most definitely not the image URLs we are looking for. This is strange, because in the browser, we were getting URLs that looked like this:


```
https://demo-webstore.apify.org/_next/image?url=https%3A%2F%2Fm.media-amazon.com%2Fimages%2FI%2F81ywGFOb0eL._AC_UL1500_.jpg&w=3840&q=85
```


This happens because CheerioCrawler makes static HTTP requests, so it only captures the HTML that is available up to the `DOMContentLoaded` event. Any elements or attributes generated dynamically thereafter using JavaScript (and usually XHR/Fetch requests) are not part of the downloaded HTML, and therefore are not accessible through the `$` object.

What's the solution? We need to use something that is able to allow the page to follow through with the entire load process - a headless browser.

## Scraping dynamic content

Let's change a few lines of our code to switch the crawler type from CheerioCrawler to PuppeteerCrawler, which will run a headless browser, allowing the `load` and `networkidle` events to fire:

> Also, don't forget to run `npm i puppeteer` in order to install the `puppeteer` package!


```
import { PuppeteerCrawler } from 'crawlee';

const BASE_URL = 'https://demo-webstore.apify.org';

// Switch CheerioCrawler to PuppeteerCrawler
const crawler = new PuppeteerCrawler({
    // Replace "$" with "parseWithCheerio"
    requestHandler: async ({ parseWithCheerio, request }) => {
        // Create the $ Cheerio object based on the page's content
        const $ = await parseWithCheerio();

        const products = $('a[href*="/product/"]');

        const results = [...products].map((product) => {
            const elem = $(product);

            const title = elem.find('h3').text();
            const price = elem.find('div[class*="price"]').text();
            const image = elem.find('img[src]').attr('src');

            return {
                title,
                price,
                image: new URL(image, BASE_URL).href,
            };
        });

        console.log(results);
    },
});

await crawler.run([{ url: 'https://demo-webstore.apify.org/search/new-arrivals' }]);
```


After running this one, we can see that our results look different from before. We're getting the image links!

![Not perfect results](/assets/images/almost-there-689821c3a9b7953bbffa2ef30e67beab.png)

Well... Not quite. It seems that the only images which we got the full links to were the ones that were being displayed within the view of the browser. This means that the images are lazy-loaded. **Lazy-loading** is a common technique used across the web to improve performance. Lazy-loaded items allow the user to load content incrementally, as they perform some action. In most cases, including our current one, this action is scrolling.

We've gotta scroll down the page to load these images. Luckily, because we're using Crawlee, we don't have to write the logic that will achieve that, because a utility function specifically for Puppeteer called https://crawlee.dev/api/puppeteer-crawler/namespace/puppeteerUtils#infiniteScroll already exists right in the library, and can be accessed through `utils.puppeteer`. Let's add it to our code now:



```
import { Dataset, PuppeteerCrawler } from 'crawlee';

const BASE_URL = 'https://demo-webstore.apify.org';

const crawler = new PuppeteerCrawler({
    requestHandler: async ({ parseWithCheerio, infiniteScroll }) => {
        // Add the utility function
        await infiniteScroll();

        const $ = await parseWithCheerio();

        const products = $('a[href*="/product/"]');

        const results = [...products].map((product) => {
            const elem = $(product);

            const title = elem.find('h3').text();
            const price = elem.find('div[class*="price"]').text();
            const image = elem.find('img[src]').attr('src');

            return {
                title,
                price,
                image: new URL(image, BASE_URL).href,
            };
        });

        // Push our results to the dataset
        await Dataset.pushData(results);
    },
});

await crawler.run([{ url: 'https://demo-webstore.apify.org/search/new-arrivals' }]);
```


Let's run this and check our dataset results...


```
{
  "title": "women's shoes",
  "price": "$40.00 USD",
  "image": "https://demo-webstore.apify.org/_next/image?url=https%3A%2F%2Fdummyjson.com%2Fimage%2Fi%2Fproducts%2F46%2F1.jpg&w=3840&q=85"
}
```


Each product looks like this, and each image is a valid link that can be visited. These are the results we were after.
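
To catch lazy-loading problems early, the scraped items can be checked before they are saved. The sketch below assumes that placeholders show up as something other than an absolute `http(s)` URL (for example a `data:` URI); that is an assumption about the target site, and `isLoadedImageUrl` is a hypothetical helper name.

```javascript
// Returns true only for absolute http(s) URLs.
// Placeholders such as data: URIs or missing values are rejected.
function isLoadedImageUrl(src) {
    if (!src) return false;
    try {
        const { protocol } = new URL(src);
        return protocol === 'http:' || protocol === 'https:';
    } catch {
        return false; // not an absolute URL at all
    }
}

// Example: list the items whose image did not load.
const sample = [
    { title: 'a', image: 'https://demo-webstore.apify.org/_next/image?url=x' },
    { title: 'b', image: 'data:image/gif;base64,R0lGODlhAQABAAAAACw=' },
];
const incomplete = sample.filter((item) => !isLoadedImageUrl(item.image));
console.log(incomplete.map((item) => item.title)); // [ 'b' ]
```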

## Small Recap

Making static HTTP requests only downloads the HTML content from the `DOMContentLoaded` event. We must use a browser to allow dynamic code to load, or find different means altogether of scraping the data (see https://docs.apify.com/academy/api-scraping.md).


---

A lot of beginners struggle through trial and error while scraping a simple site. They write some code that might work, press the run button, see that an error happened, and continue writing more code that might work but probably won't. This is extremely inefficient and gets tedious really fast.

What beginners are missing are basic tools and tricks to get things done quickly. One of these tricks is the option to run JavaScript code directly in your browser.

Pressing F12 while browsing with Chrome, Firefox, or other popular browsers opens up the browser console, the magic toolbox of any web developer. The console allows you to run code in the context of the website you are on. Don't worry, you cannot mess the site up (well, unless you start doing really nasty tricks), as the page content is downloaded to your computer and any change is only local to your PC.

# Running code in a browser console

> Test your Page Function's code directly in your browser's console.

First, you need to inject jQuery. You can try to paste and run this snippet.


```
const jq = document.createElement('script');
jq.src = 'https://ajax.googleapis.com/ajax/libs/jquery/2.2.2/jquery.min.js';
document.getElementsByTagName('head')[0].appendChild(jq);
```


If that doesn't work because of a CORS violation, you can install https://chrome.google.com/webstore/detail/ekkjohcjbjcjjifokpingdbdlfekjcgi that injects jQuery on a button click.

You can test a `pageFunction` code in two ways in your console:

## Pasting and running a small code snippet

Usually, you don't need to paste in the whole pageFunction as you can isolate the critical part of the code you are trying to debug. You will need to remove any references to the `context` object and its properties like `request` and the final return statement but otherwise, the code should work 1:1.

I will also usually remove `const` declarations on the top level variables. This helps you to run the same code many times over without needing to restart the console (you cannot declare constants more than once). My declaration will change from:


```
const results = [];
// Scraping something to fill the results
```


into


```
results = [];
```


You can get all the information you need by running a snippet of your `pageFunction` like this:


```
results = [];
$('.my-list-item').each((i, el) => {
    results.push({
        title: $(el).find('.title').text().trim(),
        // other fields
    });
});
```


Now the `results` variable stays on the page and you can do whatever you wish with it. Log it to analyze if your scraping code is correct. Writing a single expression will also log it in a browser console.


```
results;
// Will log a nicely formatted [{ title: 'my-article-1'}, { title: 'my-article-2'}] etc.
```


## Pasting and running a full pageFunction

If you don't want to deal with copy/pasting a proper snippet, you can always paste the whole pageFunction. You will have to mock the context object when calling it. If you use some advanced tricks, this might not work but in most cases copy pasting this code should do it. This code is only for debugging your Page Function for a particular page. It does not crawl the website and the output is not saved anywhere.


```
async function pageFunction(context) {
    // this is your pageFunction
}
// Now you will call it with mocked context
pageFunction({
    request: {
        url: window.location.href,
        userData: { label: 'paste-a-label-if-you-use-one' },
    },
    async waitFor(ms) {
        console.log('(waitFor)');
        await new Promise((res) => setTimeout(res, ms));
    },
    enqueueRequest() { console.log('(enqueueRequest)', arguments); },
    skipLinks() { console.log('(skipLinks)', arguments); },
    jQuery: $,
});
```


Happy debugging!


---

# Filter out blocked proxies using sessions

*This article explains how the problem was solved before the https://docs.apify.com/sdk/js/docs/api/session-pool class was added into https://docs.apify.com/sdk/js. We are keeping the article here as it might be interesting for people who want to see how to work with sessions on a lower level. For any practical usage of sessions, follow the documentation and examples of SessionPool.*

### Overview of the problem

You want to crawl a website with a proxy pool, but most of your proxies are blocked. It's a very common situation. Proxies can be blocked for many reasons:

1. You overused them in your current Actor run and they got banned.

2. You overused them in some of your previous runs and they are still banned (and may never be unbanned).

3. Some other user with whom you share part of your proxy pool overused them when crawling the same website before you even touched it.

4. The proxies were actually banned before anyone used them to crawl the website because they share a subnetwork in some datacenter and all proxies of that subnet got banned.

5. The proxies actually got banned before anyone used them to crawl the website because they use anti-bot protection that bans proxies across websites (e.g. Cloudflare).

Nobody can make sure that a proxy will work infinitely. The only real solution to this problem is to use https://docs.apify.com/platform/proxy/residential-proxy.md, but they can sometimes be too costly.

However, usually, at least some of our proxies work. To crawl successfully, it is therefore imperative to handle blocked requests properly. You first need to discover that you are blocked, which usually means that either your request returned a status code greater than or equal to 400 (it didn't return the proper response) or that the page displayed a captcha. To ensure that this bad request is retried, you usually throw an error and it gets automatically retried later (our https://docs.apify.com/sdk/js handles this for you). Check out https://docs.apify.com/academy/node-js/handle-blocked-requests-puppeteer as inspiration for how to handle this situation with the `PuppeteerCrawler` class.
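
The detection step can be sketched as a small function that throws, so that the crawler retries the request. The captcha markers below are assumptions for illustration only; what a block page actually contains is specific to each website, and `isBlocked` and `assertNotBlocked` are hypothetical names.

```javascript
// Assumed, site-specific strings that indicate a captcha page.
const CAPTCHA_MARKERS = ['captcha', 'are you a robot'];

// A request counts as blocked on an error status or when the page shows a captcha.
function isBlocked(statusCode, html = '') {
    if (statusCode >= 400) return true;
    const lower = html.toLowerCase();
    return CAPTCHA_MARKERS.some((marker) => lower.includes(marker));
}

// Throwing lets the crawler's retry mechanism pick the request up again later.
function assertNotBlocked(statusCode, html) {
    if (isBlocked(statusCode, html)) {
        throw new Error(`Request blocked (status ${statusCode})`);
    }
}
```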

### Solution

Now we are able to retry bad requests and eventually, unless all of our proxies get banned, we should be able to successfully crawl what we want. The problem is that it takes too long and our log is full of errors. Fortunately, we can overcome this with https://docs.apify.com/platform/proxy/datacenter-proxy.md#username-parameters (look at the proxy and SDK documentation for how to use them in your Actors).

First, we define a `sessions` object at the top of our code (in global scope) to hold the state of our working sessions.

`let sessions;`

Then we need to define an interval that will ensure our sessions are periodically saved to the key-value store, so if the Actor restarts, we can load them.


```
setInterval(async () => {
    await Apify.setValue('SESSIONS', sessions);
}, 30 * 1000);
```


And inside our main function, we load the sessions the same way we load an input. If they were not saved yet (the Actor was not restarted), we instantiate them as an empty object.


```
Apify.main(async () => {
    sessions = (await Apify.getValue('SESSIONS')) || {};
    // ...the rest of your code
});
```


### Algorithm

You don't necessarily need to understand the solution below - it should be fine to copy/paste it to your Actor.

`sessions`  will be an object whose keys will be the names of the sessions and values will be objects with the name of the session (we choose a random number as a name here) and user agent (you can add any other useful properties that you want to match with each session.) This will be created automatically, for example:


```
{
    "0.7870849452667994": {
        "name": "0.7870849452667994",
        "userAgent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.67 Safari/537.36"
    },
    "0.4787584713044999": {
        "name": "0.4787584713044999",
        "userAgent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 Edge/16.16299"
    }
    // ...
}
```


Now let's get to the algorithm that will define which sessions to pick for a request. It can be done in many ways and this is by no means the ideal way, so I encourage you to find a more intelligent algorithm and paste it into the comments of this article.

This function takes `sessions`  as an argument and returns a `session`  object which will either be a random object from `sessions`  or a new one with random user agent.


```
const pickSession = (sessions, maxSessions = 100) => {

    // sessions is our sessions object, at the beginning instantiated as {}
    // maxSessions is a constant which should be the number of working proxies we aspire to have.
    // The lower the number, the sooner the working proxies get reused,
    // but the less often a new session is tried.
    // 100 is a reasonable default.
    // Since sessions is an object, we prepare an array of the session names
    const sessionsKeys = Object.keys(sessions);

    console.log(`Currently we have ${sessionsKeys.length} working sessions`);

    // We define a random floating number from 0 to 1 that will serve
    // both as a chance to pick the session and its possible name
    const randomNumber = Math.random();

    // The chance to pick a session will be higher when we have more working sessions
    const chanceToPickSession = sessionsKeys.length / maxSessions;

    console.log(`Chance to pick a working session is ${Math.round(chanceToPickSession * 100)}%`);

    // If the chance is higher than the random number, we pick one from the working sessions
    const willPickSession = chanceToPickSession > randomNumber;

    if (willPickSession) {
        // We randomly pick one of the working sessions and return it
        const indexToPick = Math.floor(sessionsKeys.length * Math.random());

        const nameToPick = sessionsKeys[indexToPick];

        console.log(`We picked a working session: ${nameToPick} on index ${indexToPick}`);

        return sessions[nameToPick];
    }
    // We create a new session object, assign a random userAgent to it and return it

    console.log(`Creating new session: ${randomNumber}`);

    return {
        name: randomNumber.toString(),
        userAgent: Apify.utils.getRandomUserAgent(),
    };

};
```


### Puppeteer example

We then use this function whenever we want to get the session for our request. Here is an example of how we would use it for bare-bones Puppeteer (for example, as a part of the `BasicCrawler` class).


```
const session = pickSession(sessions);
const browser = await Apify.launchPuppeteer({
    useApifyProxy: true,
    apifyProxySession: session.name,
    userAgent: session.userAgent,
});
```


Then we only need to add the session if the request was successful or remove it if it was not. It doesn't matter if we add the same session twice or delete a non-existent session (because of how JavaScript objects work).

After success: `sessions[session.name] = session;`

After failure (captcha, blocked request, etc.): `delete sessions[session.name]`
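
The two one-liners can be wrapped into a single helper. This is only a sketch, and `recordSessionResult` is a hypothetical name; it also shows that repeating either operation is harmless.

```javascript
// Keeps a session after a successful request and drops it after a failed one.
function recordSessionResult(sessions, session, succeeded) {
    if (succeeded) {
        sessions[session.name] = session;
    } else {
        delete sessions[session.name];
    }
    return sessions;
}

const sessions = {};
const session = { name: '0.42', userAgent: 'example-agent' };

recordSessionResult(sessions, session, true);
recordSessionResult(sessions, session, true); // adding twice keeps a single entry
console.log(Object.keys(sessions).length); // 1

recordSessionResult(sessions, session, false);
recordSessionResult(sessions, session, false); // deleting a missing key does not throw
console.log(Object.keys(sessions).length); // 0
```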

### PuppeteerCrawler example

Now you might start to wonder, "I have already prepared an Actor using PuppeteerCrawler, can I make it work there?". The problem is that with PuppeteerCrawler we don't have everything nicely inside one function scope like when using pure Puppeteer or BasicCrawler. Fortunately, there is a little hack that enables passing the session name to where we need it.

First, we define `launchPuppeteerFunction`, which tells the crawler how to create new browser instances, and we pass the picked session there.


```
const crawler = new Apify.PuppeteerCrawler({
    launchPuppeteerFunction: async () => {
        const session = pickSession(sessions);
        return Apify.launchPuppeteer({
            useApifyProxy: true,
            userAgent: `${session.userAgent} s=${session.name}`,
            apifyProxySession: session.name,
        });
    },
    // handlePageFunction etc.
});
```


We picked the session and added it to the browser as `apifyProxySession` but for userAgent, we didn't pass the User-Agent as it is but added the session name into it. That is the hack because we can retrieve the user agent from the Puppeteer browser itself.

Now we need to retrieve the session name back in the `gotoFunction`, pass it into userData and fix the hacked userAgent back to normal so it is not suspicious for the website.


```
const gotoFunction = async ({ request, page }) => {
    const userAgentWithSession = await page.browser().userAgent();
    const match = userAgentWithSession.match(/(.+) s=(.+)/);
    const session = {
        name: match[2],
        userAgent: match[1],
    };
    request.userData.session = session;
    await page.setUserAgent(session.userAgent);
    return page.goto(request.url, { timeout: 60000 });
};
```
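To see the encoding and decoding in isolation, here is a minimal sketch of the round trip. The helper names `encodeSession` and `decodeSession` are hypothetical and not part of the Apify SDK. The sketch also adds a `null` guard, which the `gotoFunction` above does not have, for browsers launched without the encoded session.

```javascript
// Hypothetical helper: same format as in launchPuppeteerFunction above.
const encodeSession = (session) => `${session.userAgent} s=${session.name}`;

// Hypothetical helper: same regular expression as in gotoFunction above.
const decodeSession = (userAgentWithSession) => {
    const match = userAgentWithSession.match(/(.+) s=(.+)/);
    // No encoded session found, e.g. the browser was launched differently.
    if (!match) return null;
    return { userAgent: match[1], name: match[2] };
};

const original = { name: 'session_42', userAgent: 'Mozilla/5.0 (X11; Linux x86_64)' };
const decoded = decodeSession(encodeSession(original));
console.log(decoded.name); // session_42
console.log(decoded.userAgent === original.userAgent); // true
```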


Now we have access to the session in the `handlePageFunction`, and the rest of the logic is the same as in the first example. We extract the session from the `userData` and wrap the whole code in try/catch: on success we add the session, and on error we delete it. It is also useful to retire the browser completely (check https://docs.apify.com/academy/node-js/handle-blocked-requests-puppeteer for reference), since the other requests will probably have a similar problem.


```
const handlePageFunction = async ({ request, page, puppeteerPool }) => {
    const { session } = request.userData;
    console.log(`URL: ${request.url}, session: ${session.name}, userAgent: ${session.userAgent}`);

    try {
        // your main logic that is executed on each page
        sessions[session.name] = session;
    } catch (e) {
        delete sessions[session.name];
        await puppeteerPool.retire(page.browser());
        throw e;
    }
};
```


### Things to consider

1. Since the good and bad proxies are getting filtered over time, this solution only makes sense for crawlers with at least hundreds of requests.

2. This solution will not help you if you don't have enough proxies for your job. It can even get your proxies banned faster (since the good ones will be used more often), so you should be cautious about the speed of your crawl.

3. If you are more concerned about the speed of your crawler and less about banning proxies, set the `maxSessions` parameter of the `pickSession` function to a number relatively lower than your total number of proxies. If, on the other hand, keeping your proxies alive is more important, set `maxSessions` relatively higher so you will always pick new proxies.

4. Since sessions only last 24 hours, if you have bigger intervals between your crawler runs, they will start fresh each time.


---

One of the main defense mechanisms websites use to ensure they are not scraped by bots is allowing only a limited number of requests from a specific IP address. That's why Apify provides a [proxy](https://docs.apify.com/platform/proxy) component with intelligent rotation. With a large enough pool of proxies, you can multiply the number of allowed requests per day to cover your crawling needs. Let's look at how we can rotate proxies when using our [SDK](https://github.com/apify/apify-sdk-js).

# BasicCrawler

> Getting around website defense mechanisms when crawling.

You can use `handleRequestFunction` to set up proxy rotation for a [BasicCrawler](https://crawlee.dev/api/basic-crawler/class/BasicCrawler). The following example shows how to use a fresh proxy on each request if you make requests through the popular [request-promise](https://www.npmjs.com/package/request-promise) npm package:


```
const Apify = require('apify');
const requestPromise = require('request-promise');

const PROXY_PASSWORD = process.env.APIFY_PROXY_PASSWORD;
// Apify Proxy listens on port 8000
const proxyUrl = `http://auto:${PROXY_PASSWORD}@proxy.apify.com:8000`;

const crawler = new Apify.BasicCrawler({
    requestList: someInitializedRequestList,
    handleRequestFunction: async ({ request }) => {
        const response = await requestPromise({
            url: request.url,
            proxy: proxyUrl,
        });
    },
});
```


Each time `handleRequestFunction` is executed in this example, requestPromise will send a request through the least used proxy for that target domain. This way you will not burn through your proxies.

# Puppeteer Crawler

With [PuppeteerCrawler](https://docs.apify.com/sdk/js/docs/api/puppeteer-crawler), the situation is a little more complicated. That's because you have to restart the browser to change the proxy the browser is using. By default, PuppeteerCrawler restarts the browser every 100 requests, which can lead to a number of requests being wasted because the IP address the browser is using is already blocked by the website.

The straightforward solution would be to set the 'retireInstanceAfterRequestCount' option to 1. PuppeteerCrawler would then rotate the proxies in the same way as BasicCrawler. While this approach could sometimes be useful for the toughest websites, the price you pay is in performance. Restarting the browser is an expensive operation.

That's why PuppeteerCrawler offers a utility `retire()` function through the `PuppeteerPool` class. You can access `PuppeteerPool` by destructuring it from the object parameter of `gotoFunction` or `handlePageFunction`.


```
const crawler = new Apify.PuppeteerCrawler({
    requestList: someInitializedRequestList,
    launchPuppeteerOptions: {
        useApifyProxy: true,
    },
    handlePageFunction: async ({ request, page, puppeteerPool }) => {
        // you are on the page now
    },

});
```


It is up to the developer to spot when something is wrong with a request. A website can interfere with your crawling in [many ways](https://docs.apify.com/academy/anti-scraping). Page loading can be cancelled right away, it can time out, the page can display a captcha, some error or warning message, or the data may be missing or corrupted. The developer can then choose whether to handle these problems in the code or focus on receiving the proper data. Either way, if the request went wrong, you should throw a proper error.

Now that we know when the request is blocked, we can use the `retire()` function and continue crawling with a new proxy. Google is one of the most popular websites for scrapers, so let's code a Google search crawler. The two main blocking mechanisms used by Google are displaying their (in)famous 'sorry' captcha and not loading the page at all, so we will focus on covering these.

For example, let's assume we have already initialized a requestList of Google search pages. Let's show how you can use the retire() function in both gotoFunction and handlePageFunction.


```
const crawler = new Apify.PuppeteerCrawler({
    requestList: someInitializedRequestList,
    launchPuppeteerOptions: {
        useApifyProxy: true,
    },
    gotoFunction: async ({ request, page, puppeteerPool }) => {
        // Await the navigation; without `await`, `response` would be a (truthy) promise
        const response = await page.goto(request.url).catch(() => null);
        if (!response) {
            await puppeteerPool.retire(page.browser());
            throw new Error(`Page didn't load for ${request.url}`);
        }
        return response;
    },
    handlePageFunction: async ({ request, page, puppeteerPool }) => {
        if (page.url().includes('sorry')) {
            await puppeteerPool.retire(page.browser());
            throw new Error(`We got captcha for ${request.url}`);
        }
    },
    retireInstanceAfterRequestCount: 50,
});

Apify.main(async () => {
    await crawler.run();
});
```


Now we have a crawler that catches the most common blocking issues on Google. In `gotoFunction` we will catch if the page doesn't load and in the handlePageFunction we check if we were redirected to the 'sorry page'. In both cases we throw an error afterwards so the request is added back to the crawling queue (otherwise the crawler would think everything was okay and would treat that request as handled).
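The two checks can also be expressed as one pure function, which makes them easy to test without a browser. This is only a sketch: `getBlockReason` is a hypothetical helper, and matching on the `/sorry` path instead of the whole URL is an assumption about where the captcha page lives. It is stricter than the `includes('sorry')` check above, so a search for the word "sorry" is not reported as a block.

```javascript
// Hypothetical helper: returns a description of the block, or null if the page looks fine.
const getBlockReason = ({ response, requestUrl, loadedUrl }) => {
    // Navigation failed or timed out (the .catch(() => null) case above).
    if (!response) return `Page didn't load for ${requestUrl}`;
    // Assumption: the captcha page is served under the /sorry path.
    if (new URL(loadedUrl).pathname.startsWith('/sorry')) {
        return `We got captcha for ${requestUrl}`;
    }
    return null;
};

const requestUrl = 'https://www.google.com/search?q=sorry';
console.log(getBlockReason({ response: null, requestUrl, loadedUrl: requestUrl }));
console.log(getBlockReason({ response: {}, requestUrl, loadedUrl: 'https://www.google.com/sorry/index' }));
console.log(getBlockReason({ response: {}, requestUrl, loadedUrl: requestUrl })); // null
```

In the crawler, you would retire the browser and throw an error with the returned message whenever the result is not `null`.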


---

# How to fix 'Target closed' error in Puppeteer and Playwright

**Learn about common causes for the 'Target closed' error in browser automation and what you can do to fix it.**

***

The `Target closed` error happens when you try to access the `page` object (or some of its parent objects like the `browser`), but the underlying browser tab has already been closed. The exact error message can appear in several variants, such as `Target page, context or browser has been closed`, but none of them are very helpful for debugging. To debug it, attach logs in multiple places or use the headful mode.

## Out of memory

![Chrome crashed tab](/assets/images/chrome-crashed-tab-b7f5310d7661df3872ca9c294b3b28a5.png)

Browsers create a separate process for each tab. That means each tab lives with a separate memory space. If you have a lot of tabs open, you might run out of memory. The browser cannot close your old tabs to free extra memory so it will usually kill your current memory hungry tab.

### Memory solution

If you use [Crawlee](https://crawlee.dev/), your concurrency automatically scales up and down to fit in the allocated memory. You can change the allocated memory using the environment variable or the [Configuration](https://crawlee.dev/docs/guides/configuration) class. But very hungry pages can still occasionally cause sudden memory spikes, and you might have to limit the [max concurrency](https://crawlee.dev/docs/guides/scaling-crawlers#minconcurrency-and-maxconcurrency) of the crawler. This problem is very rare, though.

Without Crawlee, you will need to predict the maximum concurrency the particular use case can handle or increase the allocated memory.

## Page closed prematurely

If you close the page before executing all code that tries to access the page, you will get the 'Target closed' error. The most common cause is that your crawler doesn't properly wait for all actions and instead closes the page earlier than it should. Usually, this is caused by a forgotten `await` keyword (floating promise), by using event handlers like `page.on`, or by a wrongly ordered crawling loop.
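The floating promise case can be reproduced without a real browser. The sketch below uses a made-up `createFakePage` stand-in (not a Puppeteer or Playwright object) whose `title()` call takes a few milliseconds and fails once the page is closed.

```javascript
// A tiny stand-in for a browser page; not a real Puppeteer or Playwright object.
const createFakePage = () => {
    let closed = false;
    return {
        title: async () => {
            // Simulate the round trip to the browser.
            await new Promise((resolve) => setTimeout(resolve, 10));
            if (closed) throw new Error('Target closed');
            return 'Example title';
        },
        close: async () => { closed = true; },
    };
};

// Buggy: the promise from page.title() is left floating,
// so the page is closed before the call has finished.
const scrapeWithoutAwait = async () => {
    const page = createFakePage();
    const titlePromise = page.title(); // missing await
    await page.close();
    return titlePromise;
};

// Fixed: wait for the page call before closing the page.
const scrapeWithAwait = async () => {
    const page = createFakePage();
    const title = await page.title();
    await page.close();
    return title;
};

scrapeWithoutAwait().catch((e) => console.log('Without await:', e.message));
scrapeWithAwait().then((title) => console.log('With await:', title));
```

The first call rejects with `Target closed`, while the second one resolves with the title, even though both run exactly the same page operations.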

### Page closed solution

Add logs and snapshots, as described in [Analyzing pages and fixing errors](https://docs.apify.com/academy/node-js/analyzing-pages-and-fixing-errors), to see exactly at which point the crash occurs. See if you can spot one of the problems mentioned above. Adding a missing `await` is simple, but if your code runs in an event handler, you will need to wrap it in a try/catch block and ensure that you give it enough time to execute before you close the main crawling handler.

If you use Crawlee and utilize [preNavigationHooks](https://crawlee.dev/api/playwright-crawler/interface/PlaywrightCrawlerOptions#preNavigationHooks) to execute event handlers like `page.on` asynchronously, be aware that this can cause the problem mentioned above: the [requestHandler](https://crawlee.dev/api/playwright-crawler/interface/PlaywrightCrawlerOptions#requestHandler) may finish before the event handler accesses the `page`. You can solve this issue by making sure the `requestHandler` waits for all promises from the `preNavigationHooks`. To do that, attach the promises to the `context`, which is accessible to both functions, and await them before the scraping code starts.


```
const crawler = new PlaywrightCrawler({
    // ...other options
    preNavigationHooks: [
        async ({ page, context }) => {
            // Some action that takes time, we don't await here
            // Try/catch all non awaited code because it can cause unhandled rejection which crashes the whole process
            const responsePromise = page.waitForResponse('https://example.com/resource').catch((e) => e);
            // Attach the promise to the context which is accessible to requestHandler
            context.responsePromise = responsePromise;
        },
    ],
    requestHandler: async ({ request, page, context }) => {
        // We first wait for the response before doing anything else
        const response = await context.responsePromise;
        // Check if it errored out, otherwise proceed with parsing it
        if (typeof response === 'string' || response instanceof Error) {
            throw new Error(`Failed to load resource from response`, { cause: response });
        }
        // Now process the response and continue with the code synchronously
    },
});
```


If you are still unsure what causes your particular error, check with the community and Apify team on [Discord](https://discord.com/invite/jyEM2PRvMU).


---

# How to save screenshots from puppeteer

A good way to debug your puppeteer crawler in Apify Actors is to save a screenshot of a browser window to the Apify key-value store. You can do that using this function:


```
/**
* Store screen from puppeteer page to Apify key-value store
* @param page - Instance of puppeteer Page class https://pptr.dev/api/puppeteer.page
* @param [key] - Function stores your screen in Apify key-value store under this key
* @return {Promise}
*/
const saveScreen = async (page, key = 'debug-screen') => {
    const screenshotBuffer = await page.screenshot({ fullPage: true });
    await Apify.setValue(key, screenshotBuffer, { contentType: 'image/png' });
};
```


This function takes two parameters: `page` (an instance of a Puppeteer page) and `key` (the key under which your screen is stored in the Apify key-value store).

Because this is such a common use case, the Apify SDK has a utility function called [saveSnapshot](https://docs.apify.com/sdk/js/docs/api/puppeteer#puppeteersavesnapshot) that does exactly this and a little bit more:

* You can choose the quality of your screenshots (high-quality images take up more space)

* You can also save the HTML of the page

An example of such Apify Actor:


```
import { Actor } from 'apify';
import { puppeteerUtils, launchPuppeteer } from 'crawlee';

Actor.main(async () => {
    const input = await Actor.getValue('INPUT');

    console.log('Launching Puppeteer...');
    const browser = await launchPuppeteer();

    const page = await browser.newPage();
    await page.goto(input.url);

    await puppeteerUtils.saveSnapshot(page, { key: 'test-screen' });

    console.log('Closing Puppeteer...');
    await browser.close();

    console.log('Done.');
});
```


After you call the function, your screen appears in the KEY-VALUE STORE tab in the Actor console. You can click on the row with your saved screen and it'll open it in a new window.

![Puppeteer Key-Value store](/assets/images/kv-store-puppeteer-35b752a254c5d7f34d23bea8d97bb3dc.png)

If you have any questions, feel free to contact us in chat.

Happy coding!


---

# How to scrape hidden JavaScript objects in HTML

**Learn about "hidden" data found within the JavaScript of certain pages, which can increase the scraper reliability and improve your development experience.**

***

Depending on the technology the target website is using, the data to be collected can be found not only within HTML elements, but also in JSON format within `<script>` tags in the DOM.

The advantages of using these objects instead of parsing the HTML are that parsing JSON is much simpler, and more reliable than parsing HTML elements. They are much less likely to change, while the CSS selectors are prone to updates and re-namings every time the website is updated.

> **Note:** In this tutorial, we'll be using https://soundcloud.com as an example target, but the techniques described here can be applied to any site.

## Locating JSON objects within script tags

Using our DevTools, we can inspect our [target page](https://soundcloud.com/tiesto/tracks), or right click the page and click **View Page Source** to see the DOM. Next, we'll find a value on the page that we can predict would be in a potential API response. For our page, we'll use the **Tracks** count of `845`. On the **View Page Source** page, we'll do **⌘** + **F** and type in this value, which will show all matches for it within the DOM. This method can expose `<script>` tag objects which hold the target data.

![Find the value within the DOM using CMD + F](/assets/images/view-845-77582d897496190ac1b44e2eb4364273.png)

These data objects will usually be attached to the window object (often prefixed with two underscores - `__`). When scrolling to the beginning of the script tag on our **View Page Source** page, we see that the name of our target object is `__sc_hydration`. Heading back to DevTools and typing this into the console, the object is displayed.

![View the target data in the window object using the console in DevTools](/assets/images/view-object-in-window-b9e1031f84b636d9038ecf8a4f6b394d.png)

## Parsing

You can obtain these objects to be used and manipulated in JavaScript in two ways:

### 1. Parsing them directly from the HTML


```
// same as "document.querySelector('html').innerHTML"
const html = $.html();

const string = html.split('window.__sc_hydration = ')[1].split(';')[0];

const data = JSON.parse(string);

console.log(data);
```
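Splitting on the first `;` breaks as soon as the JSON itself contains a semicolon inside a string. A slightly more defensive variant is sketched below. The helper name `extractWindowObject` is hypothetical, the sample HTML is made up (it is not SoundCloud's real markup), and the sketch assumes the assignment is the last statement in its `<script>` tag.

```javascript
// Hypothetical helper: extracts the JSON assigned to `window.<variableName>` in the HTML.
const extractWindowObject = (html, variableName) => {
    const marker = `window.${variableName} = `;
    const start = html.indexOf(marker);
    if (start === -1) return null;
    // Assumption: the assignment is the last statement in its <script> tag.
    const end = html.indexOf('</script>', start);
    if (end === -1) return null;
    const json = html
        .slice(start + marker.length, end)
        .trim()
        .replace(/;$/, ''); // drop the trailing semicolon of the statement
    return JSON.parse(json);
};

// Made-up sample page; note the semicolon inside the "bio" string.
const sampleHtml = '<html><body><script>window.__sc_hydration = '
    + '[{"bio":"DJ; producer","stats":{"trackCount":845}}];</script></body></html>';

const data = extractWindowObject(sampleHtml, '__sc_hydration');
console.log(data[0].stats.trackCount); // 845
```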


### 2. Retrieving them within the context of the browser

Tools like [Puppeteer](https://github.com/puppeteer/puppeteer) allow us to run code within the context of the browser, as well as return things out of these functions and use the data back in the Node.js context.


```
const data = await page.evaluate(() => window.__sc_hydration);

console.log(data);
```


Which of these methods you use totally depends on the type of crawler you are using. Grabbing the data directly from the `window` object within the context of the browser using Puppeteer is of course the most reliable solution; however, it is less efficient than making a static HTTP request and parsing the object directly from the downloaded HTML.


---

# Scrape website in parallel with multiple Actor runs

**Learn how to run multiple instances of an Actor to scrape a website faster. This tutorial will guide you through the process of setting up your scraper.**

***



Imagine a large website that you need to scrape. You have a scraper that works well, but scraping the whole website is slow. You can speed up the scraping process by running multiple instances of the scraper in parallel. This tutorial will guide you through setting up your scraper to run multiple instances in parallel.

In a rush?

You can check the [full code example](https://github.com/apify/apify-docs/tree/master/examples/ts-parallel-scraping) right away.

## Managing Multiple Scraper Runs

To manage multiple instances of the scraper, we need to build an Orchestrator Actor to oversee the process. This Orchestrator Actor will initiate several scraper runs and manage their operations. It will set up a request queue and a dataset that the other Actor runs will utilize to crawl the website and store results. In this tutorial, we set up the Orchestrator Actor and the scraper Actor.

## Orchestrator Actor Configuration

The Orchestrator Actor orchestrates the parallel execution of scraper Actor runs. It runs multiple instances of the scraper Actor and passes the request queue and dataset to them. For the Actor's base structure, we use Apify CLI to create a new Actor with the following command, choosing the [empty TypeScript template](https://apify.com/templates/ts-empty).


```
apify create orchestrator-actor
```


If you don't have Apify CLI installed, check out our [installation guide](https://docs.apify.com/cli/docs/installation).

### Input Configuration

Let's start by defining the Input Schema for the Orchestrator Actor. The input for the Actor will specify configurations needed to initiate and manage multiple scraper Actors in parallel. Here’s the breakdown of the necessary input:

The two code blocks below are `input_schema.json` and `main.ts`, in that order.


```
{
    "title": "Orchestrator Actor Input",
    "type": "object",
    "schemaVersion": 1,
    "properties": {
        "parallelRunsCount": {
            "title": "Parallel Actor runs count",
            "type": "integer",
            "description": "Number of parallel runs of the Actor.",
            "default": 1
        },
        "targetActorId": {
            "title": "Actor ID",
            "type": "string",
            "editor": "textfield",
            "description": "ID of the Actor to run."
        },
        "targetActorInput": {
            "title": "Actor Input",
            "type": "object",
            "description": "Input of the Actor to run",
            "editor": "json",
            "prefill": {}
        },
        "targetActorRunOptions": {
            "title": "Actor Run Options",
            "type": "object",
            "description": "Options for the Actor run",
            "editor": "json",
            "prefill": {}
        }
    },
    "required": ["parallelRunsCount", "targetActorId"]
}
```



```
import { Actor, log } from 'apify';

interface Input {
    parallelRunsCount: number;
    targetActorId: string;
    targetActorInput: Record<string, unknown>;
    targetActorRunOptions: Record<string, unknown>;
}

await Actor.init();

const {
    parallelRunsCount = 1,
    targetActorId,
    targetActorInput = {},
    targetActorRunOptions = {},
} = await Actor.getInput<Input>() ?? {} as Input;
const { apifyClient } = Actor;

if (!targetActorId) throw new Error('Missing the "targetActorId" input!');
```


### Reusing dataset and request queue

The Orchestrator Actor will reuse its default dataset and request queue. The dataset stores the results of the scraping process, and the request queue is used as shared storage for processing requests.


```
import { Actor } from 'apify';

const requestQueue = await Actor.openRequestQueue();
const dataset = await Actor.openDataset();
```


### State

The Orchestrator Actor will maintain the state of the scraping runs to track progress and manage continuity. It will record the state of Actor runs, initializing this tracking with the first run. This persistent state ensures that, in migration or restart (resurrection) cases, the Actor can resume the same runs without losing progress.


```
import { Actor, log } from 'apify';

const { apifyClient } = Actor;
const state = await Actor.useState('actor-state', { parallelRunIds: [], isInitialized: false });

if (state.isInitialized) {
    for (const runId of state.parallelRunIds) {
        const runClient = apifyClient.run(runId);
        const run = await runClient.get();

        // This should only happen if the run was deleted or the state was incorrectly saved.
        if (!run) throw new Error(`The run ${runId} from state does not exist.`);

        if (run.status === 'RUNNING') {
            log.info('Parallel run is already running.', { runId });
        } else {
            log.info(`Parallel run was in state ${run.status}, resurrecting.`, { runId });
            await runClient.resurrect(targetActorRunOptions);
        }
    }
} else {
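    for (let i = 0; i < parallelRunsCount; i++) {
        // Start a scraper run and hand it the shared dataset and request queue
        const run = await Actor.start(targetActorId, {
            ...targetActorInput,
            datasetId: dataset.id,
            requestQueueId: requestQueue.id,
        }, targetActorRunOptions);
        log.info(`Started parallel run with ID: ${run.id}`, { runId: run.id });
        state.parallelRunIds.push(run.id);
    }
    state.isInitialized = true;
}

// Collect promises that resolve once each parallel run finishes
const parallelRunPromises = state.parallelRunIds.map((runId) => {
    const runClient = apifyClient.run(runId);
    return runClient.waitForFinish();
});

// Abort parallel runs if the main run is aborted
Actor.on('aborting', async () => {
    for (const runId of state.parallelRunIds) {
        log.info('Aborting run', { runId });
        await apifyClient.run(runId).abort();
    }
});

// Wait for all parallel runs to finish
await Promise.all(parallelRunPromises);

// Gracefully exit the Actor process. It's recommended to quit all Actors with an exit()
await Actor.exit();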
```
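The resume branch of the state handling can be tried locally without touching the platform. The following is a minimal sketch: `createFakeClient` is a made-up in-memory stand-in that only imitates the `run(id).get()` and `run(id).resurrect()` calls used above, and `resumeRuns` is a hypothetical helper name.

```javascript
// Hypothetical in-memory stand-in for the Apify client; only the methods used below.
const createFakeClient = (runs) => ({
    run: (runId) => ({
        get: async () => runs[runId],
        resurrect: async () => {
            runs[runId].status = 'RUNNING';
            return runs[runId];
        },
    }),
});

// Mirrors the "already initialized" branch: keep running runs, resurrect the rest.
const resumeRuns = async (client, runIds) => {
    const resurrected = [];
    for (const runId of runIds) {
        const runClient = client.run(runId);
        const run = await runClient.get();
        if (!run) throw new Error(`The run ${runId} from state does not exist.`);
        if (run.status !== 'RUNNING') {
            await runClient.resurrect();
            resurrected.push(runId);
        }
    }
    return resurrected;
};

const runs = { a: { status: 'RUNNING' }, b: { status: 'ABORTED' } };
resumeRuns(createFakeClient(runs), ['a', 'b']).then((resurrected) => {
    console.log(resurrected); // [ 'b' ]
});
```

Keeping the decision logic in a function that receives the client as a parameter makes it possible to test the orchestration separately from the real API.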


### Pushing to Apify

Once you have the Orchestrator Actor ready, you can push it to Apify using the following command from the root directory of the Actor project:


```
apify push
```


First log in

If you are pushing the Actor for the first time, you will need to [log in to your Apify account](https://docs.apify.com/cli/docs/reference#apify-login) first.

By running this command, you will be prompted to provide the Actor ID, which you can find in the Apify Console under the Actors tab.

![orchestrator-actor.png](/assets/images/orchestrator-actor-7a722f44faddf4f5e3a8439acb4baea0.png)

## Scraper Actor Configuration

The Scraper Actor performs website scraping. It operates using the request queue and dataset provided by the Orchestrator Actor. You will need to integrate your chosen scraper logic into this framework. The only thing you need to do is utilize the request queue and dataset initialized by the Orchestrator Actor.


```
import { Actor } from 'apify';

interface Input {
    requestQueueId: string;
    datasetId: string;
}

const {
    requestQueueId,
    datasetId,
} = await Actor.getInput<Input>() ?? {} as Input;

const requestQueue = await Actor.openRequestQueue(requestQueueId);
const dataset = await Actor.openDataset(datasetId);
```


Once you have initialized the request queue and dataset, you can start scraping the website. In this example, we will use the CheerioCrawler to scrape https://warehouse-theme-metal.myshopify.com/. You can create your scraper from the [Crawlee + Cheerio TypeScript template](https://apify.com/templates/ts-crawlee-cheerio).

The two code blocks below are `input_schema.json` and `main.ts`, in that order.


```
{
    "title": "Scraper Actor Input",
    "type": "object",
    "schemaVersion": 1,
    "properties": {
        "requestQueueId": {
            "title": "Request Queue ID",
            "type": "string",
            "editor": "textfield",
            "description": "Request queue to use in scraper."
        },
        "datasetId": {
            "title": "Dataset ID",
            "type": "string",
            "editor": "textfield",
            "description": "Dataset to use in scraper."
        }
    },
    "required": ["requestQueueId", "datasetId"]
}
```



```
import { Actor } from 'apify';
import { CheerioCrawler } from 'crawlee';

await Actor.init();

const { requestQueueId, datasetId } = (await Actor.getInput()) ?? {};

// Open the storages shared by the Orchestrator Actor
const requestQueue = await Actor.openRequestQueue(requestQueueId);
const dataset = await Actor.openDataset(datasetId);

const proxyConfiguration = await Actor.createProxyConfiguration();

const crawler = new CheerioCrawler({
    proxyConfiguration,
    requestQueue,
    requestHandler: async ({ enqueueLinks, request, $, log }) => {
        log.info('Processing page', { url: request.url });
        const newPages = await enqueueLinks({ selector: 'a[href]' });
        log.info(`Enqueued ${newPages.processedRequests.length} new pages.`);

        // If the product page is loaded, save the title and URL to the Dataset.
        if (request?.loadedUrl?.includes('/products/')) {
            const title = $('title').text();
            await dataset.pushData({ url: request.loadedUrl, title });
        }
    },
});

await crawler.run(['https://warehouse-theme-metal.myshopify.com/']);

// Gracefully exit the Actor process. It's recommended to quit all Actors with an exit()
await Actor.exit();
```


You can check the [full code of the Scraper Actor](https://github.com/apify/apify-docs/tree/master/examples/ts-parallel-scraping/scraper).

You need to push the Scraper Actor to Apify using the following command from the root directory of the Actor project:


```
apify push
```


After pushing the Scraper Actor to Apify, you must get the Actor ID from the Apify Console.

![scraper-actor.png](/assets/images/scraper-actor-1a5be4b501a30d12e2b13ab56d8f6a05.png)

## Run orchestration in Apify Console

Once you have the Orchestrator Actor and Scraper Actor pushed to Apify, you can run the Orchestrator Actor in the Apify Console. You can set the input for the Orchestrator Actor to specify the number of parallel runs and the target Actor ID, input, and run options. After you hit the **Start** button, the Orchestrator Actor will start the parallel runs of the Scraper Actor.

![orchestrator-actor-input.png](/assets/images/orchestrator-actor-input-37f6e29cbeb76c3db86773b4587e24ce.png)

After starting the Orchestrator Actor, you will see the parallel runs initiated in the Apify Console.

![scraper-actor-runs.png](/assets/images/scraper-actor-runs-e07ddb88c801539c276c62a4a110f2e2.png)

## Summary

In this tutorial, you learned how to run multiple instances of an Actor to scrape a website faster. You created an Orchestrator Actor to manage the parallel execution of the Scraper Actor runs. The Orchestrator Actor initialized the Scraper Actor runs and managed their state. The Scraper Actor utilized the request queue and dataset provided by the Orchestrator Actor to scrape the website. By running multiple instances of the Scraper Actor in parallel, you can speed up the scraping process.

The code in this tutorial is for learning purposes and does not cover all specific edge cases. You can modify it to suit your exact requirements and use cases.


---

# How to optimize and speed up your web scraper

**We all want our scrapers to run as cost-effectively as possible. Learn how to think about performance in the context of web scraping and automation.**

***

Especially if you are running your scrapers on [Apify](https://apify.com), performance is directly related to your wallet (or rather bank account). The slower and heavier your program is, the more proxy bandwidth, storage, and [compute units](https://help.apify.com/en/articles/3490384-what-is-a-compute-unit) it consumes, and the higher the [subscription plan](https://apify.com/pricing) you'll need.

The goal of optimization is to make the code run as fast as possible while using the least resources possible. On Apify, the resources are memory and CPU usage (don't forget that the more memory you allocate to a run, the bigger share of CPU you get - proportionally). The memory alone should never be a bottleneck though. If it is, that means either a bug (memory leak) or bad architecture of the program (you need to split the computation into smaller parts). The rest of this article will focus only on optimizing CPU usage. You allocate more memory only to get more power from the CPU.

One more thing to remember. Optimization has its own cost: development time. You should always think about how much time you're able to spend on it and if it's worth it.

Before we dive into the practical side of things, let us diverge with an analogy to help us think about the performance of scrapers.

## Game development analogy

Games are extremely complicated beasts. Every frame (usually 60 times a second), the game has to calculate the physics of the world, run AI, user input, and render everything into a beautiful scene. You can imagine that running all of that every 16 ms in a complicated game is a developer's nightmare. That's why a significant portion of game development is spent on optimizations. Every little waste matters.

This is mainly true in the programming heart of the game - the engine. The engine is responsible for the heavy lifting of performance critical parts like physics, animation, AI, and rendering. Once the engine is built, you can design the game on top of it. You can add different spells, conversation chains, items, animations etc. to make your game cool. Those extra things may not run every frame and don't need to be optimized as heavily as the engine itself.

Now, if you want to build your own game and you are not a C/C++ veteran with a team, you will likely use an existing engine (like Unreal or Unity) and focus on the design of the game environment itself. Unless you go crazy, the game will likely run just fine since those engines have already been optimized for you. Your job is to choose an appropriate engine and use it well.

## Back to scrapers

What are the engines of the scraping world? A [browser](https://github.com/puppeteer/puppeteer?tab=readme-ov-file#puppeteer), an [HTTP library](https://www.npmjs.com/package/@apify/http-request), an [HTML parser](https://github.com/cheeriojs/cheerio), and a [JSON parser](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/JSON/parse). The CPU spends more than 99% of its workload in these libraries. As with engines, you are unlikely to write these from scratch. Instead, you'll use something like [Crawlee](https://crawlee.dev) that handles a lot of the overhead for you.

It is about how you use these tools. The small amount of code you write in your [requestHandler](https://crawlee.dev/api/http-crawler/interface/HttpCrawlerOptions#requestHandler) is absolutely insignificant compared to what is running inside these tools. In other words, it doesn't matter how many functions you call or how many variables you extract. If you want to optimize your scrapers, you need to choose the lightweight option from the tools and use it as little as possible. A crawler scraping only a JSON API can be as much as 200 times faster and cheaper than a browser-based solution.

**Ranking of the tools from the most efficient to the least:**

1. **JSON API** (HTTP call + JSON parse) - Scraping an API (public or internal) is the best option. The response is usually smaller than the HTML page and the data are already structured and cheap to parse. Usable for about 30% of websites.
2. **Pure HTML** (HTTP call + HTML parse) - All data is on the main single HTML page. Often the HTML contains script and JSON data that are rich and nicely structured. Some pages can be quite big and the parsing is slower than for JSON. But it is still 10–20 times faster than a browser. Usable for about 90% of websites.
3. **Browser** (hundreds of HTTP calls, script execution, rendering) - Browsers are huge beasts. They do so much work to allow for smooth human interaction which makes them really inefficient for scraping. Use a browser only if it helps you bypass anti-scraping protection or if you need to interact with the page.
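The ranking above can be read as a simple decision rule. The sketch below is only an illustration: the function name `chooseScrapingMethod` and the three boolean flags are hypothetical, and in practice you find their values by inspecting the target website.

```javascript
// Hypothetical decision helper that follows the ranking above.
const chooseScrapingMethod = ({ hasJsonApi, dataInInitialHtml, needsBrowser }) => {
    // A browser is justified only for anti-scraping bypass or page interaction.
    if (needsBrowser) return 'browser';
    // Cheapest option: HTTP call + JSON parse.
    if (hasJsonApi) return 'json-api';
    // Next best: HTTP call + HTML parse.
    if (dataInInitialHtml) return 'pure-html';
    // Nothing lighter is available, so fall back to the browser.
    return 'browser';
};

console.log(chooseScrapingMethod({ hasJsonApi: true, dataInInitialHtml: true, needsBrowser: false })); // json-api
console.log(chooseScrapingMethod({ hasJsonApi: false, dataInInitialHtml: true, needsBrowser: false })); // pure-html
console.log(chooseScrapingMethod({ hasJsonApi: false, dataInInitialHtml: false, needsBrowser: false })); // browser
```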


---

Sometimes you need to process the same URL several times, but each time with a different setup. For example, you may want to submit the same form with different data each time.

Let's illustrate a solution to this problem by creating a scraper which starts with an array of keywords and inputs each of them to Google, one by one. Then it retrieves the results.

> This isn't an efficient solution to searching keywords on Google. You could directly enqueue search URLs like `https://www.google.cz/search?q=KEYWORD`.

# Enqueuing start pages for all keywords

> Solving a common problem: the scraper automatically deduplicates requests that share the same URL.

First, we need to start the scraper on the page from which we're going to do our enqueuing. To do that, we create one start URL with the label "enqueue" and URL "https://example.com/". Now we can proceed to enqueue all the pages. The first part of our `pageFunction` will look like this:


```
async function pageFunction(context) {
    const $ = context.jQuery;

    if (context.request.userData.label === 'enqueue') {
        // parse input keywords
        const keywords = context.customData;

        // process all the keywords
        for (const keyword of keywords) {
            // enqueue the page and pass the keyword in
            // the userData attribute
            await context.enqueueRequest({
                url: 'https://google.com',
                uniqueKey: `${Math.random()}`,
                userData: {
                    label: 'fill-form',
                    keyword,
                },
            });
        }
        // No return here because we don't extract any data yet
    }
}
```


To set the keywords, we're using the `customData` scraper parameter. This is useful for smaller data sets, but may not be perfect for bigger ones. For such cases, you may want to use something like the approach described in Scraping a list of URLs from a Google Sheets document (https://docs.apify.com/academy/node-js/scraping-urls-list-from-google-sheets).

Since we're enqueuing the same page more than once, we need to set our own uniqueKey so the page will be added to the queue (by default uniqueKey is set to be the same as the URL). The label for the next page will be "fill-form". We're passing the keyword to the next page in the userData field (this can contain any data).
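
A random `uniqueKey` works, but it also means that accidentally enqueuing the same keyword twice creates two requests. As a sketch of an alternative, the hypothetical helper below derives the key from the URL and the keyword, so each keyword is enqueued once and duplicates share a key. The helper name `buildKeywordRequests` and the key format are assumptions made for this example, not part of the scraper's API.

```javascript
// Hypothetical helper: one request object per keyword, all for the same URL.
function buildKeywordRequests(keywords, url = 'https://google.com') {
    return keywords.map((keyword) => ({
        url,
        // Same URL, but a distinct and repeatable key per keyword.
        uniqueKey: `${url}#keyword=${encodeURIComponent(keyword)}`,
        userData: { label: 'fill-form', keyword },
    }));
}

const requests = buildKeywordRequests(['apple', 'orange', 'apple']);
const uniqueKeys = new Set(requests.map((r) => r.uniqueKey));
console.log(requests.length, uniqueKeys.size); // 3 2
```

Inside the `pageFunction`, each of these objects could be passed to `context.enqueueRequest()` in place of the inline object literal.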

# Inputting the keyword into Google

Now we come to the next page (Google). We need to retrieve the keyword and input it into the Google search bar. This will be the next part of the pageFunction:


```
async function pageFunction(context) {
    const $ = context.jQuery;

    if (context.request.userData.label === 'enqueue') {
        // copy from the previous part
    } else if (context.request.userData.label === 'fill-form') {
        // retrieve the keyword
        const { keyword } = context.request.userData;

        // input the keyword into the search bar
        $('#lst-ib').val(keyword);

        // submit the form
        $('#tsf').submit();
    }
}
```


For the next page to correctly enqueue, we're going to need a new pseudoURL. Create a pseudoURL with the label "result" and the URL `https://www.google.com/search?[.+]`.
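
To show what such a pseudoURL matches, here is a simplified sketch that turns it into a regular expression: text inside square brackets is treated as a regular expression, and everything else is matched literally. This is an illustration only, not the platform's actual implementation, and the function name `pseudoUrlToRegExp` is made up.

```javascript
// Simplified sketch: [..] parts are regular expressions, the rest is literal.
// Nested brackets are not handled.
function pseudoUrlToRegExp(pseudoUrl) {
    let pattern = '';
    let i = 0;
    while (i < pseudoUrl.length) {
        if (pseudoUrl[i] === '[') {
            const end = pseudoUrl.indexOf(']', i);
            if (end === -1) throw new Error('Unclosed "[" in pseudo-URL');
            pattern += `(?:${pseudoUrl.slice(i + 1, end)})`;
            i = end + 1;
        } else {
            // Escape characters that have a special meaning in regular expressions.
            pattern += pseudoUrl[i].replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
            i += 1;
        }
    }
    return new RegExp(`^${pattern}$`);
}

const resultPage = pseudoUrlToRegExp('https://www.google.com/search?[.+]');
console.log(resultPage.test('https://www.google.com/search?q=apple')); // true
console.log(resultPage.test('https://www.google.com/maps')); // false
```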

Now we're on the last page and can finally extract the results.


```
async function pageFunction(context) {
    const $ = context.jQuery;

    if (context.request.userData.label === 'enqueue') {
        // copy from the previous part
    } else if (context.request.userData.label === 'result') {
        // create result array
        const result = [];

        // process all the results
        $('.rc').each((index, elem) => {

            // wrap element in jQuery
            const gResult = $(elem);

            // lookup link and text
            const link = gResult.find('.r a');
            const text = gResult.find('.s .st');

            // extract data and add it to result array
            result.push({
                name: link.text(),
                link: link.attr('href'),
                text: text.text(),
            });
        });
        // Now we finally return

        return result;
    }
}
```


To test the scraper, set the customData to something like this `["apple", "orange", "banana"]` and push the Run button to start.


---

# Request labels and how to pass data to other requests

Are you trying to use Actors for the first time and don't know how to deal with the request label or how to pass data to the request?

Here's how to do it.

If you are using the requestQueue, you can do it this way.

When you add a request to the queue, use the userData attribute.


```
// Open the request queue.
const requestQueue = await Apify.openRequestQueue();
// Add the request to the queue
await requestQueue.addRequest({
    url: 'https://www.example.com/',
    userData: {
        label: 'START',
    },
});
```


Right now, we have one request in the queue that has the label "START". Now we can specify which code should be executed for this request in the `handlePageFunction`.


```
if (request.userData.label === 'START') {
    // your code for the first request for example
    // enqueue the items of a shop
} else if (request.userData.label === 'ITEM') {
    // other code for the item of a shop
}
```


And in the same way you can keep adding requests in the handlePageFunction.
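
As the number of labels grows, the `if`/`else` chain can get long. One common way to keep it readable, shown here only as a sketch, is a lookup table that maps each label to its own handler. The handler bodies and the `routeRequest` helper are hypothetical and are not part of the Apify SDK.

```javascript
// Hypothetical handlers, one per label. Real ones would scrape the page.
const handlers = new Map([
    ['START', async (request) => `enqueue items found on ${request.url}`],
    ['ITEM', async (request) => `extract item data from ${request.url}`],
]);

// Pick the handler that matches the label stored in userData.
async function routeRequest(request) {
    const { label } = request.userData;
    const handler = handlers.get(label);
    if (!handler) throw new Error(`No handler for label "${label}"`);
    return handler(request);
}

routeRequest({ url: 'https://www.example.com/', userData: { label: 'START' } })
    .then((outcome) => console.log(outcome));
```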

You can also pass data to the request in the same way. For example, once we have extracted an item from the shop above, we want to extract some information about the seller. We need to pass the item object to the seller page, where we save, for example, the seller's rating.


```
await requestQueue.addRequest({
    url: sellerDetailUrl,
    userData: {
        label: 'SELLERDETAIL',
        data: itemObject,
    },
});
```


Now, on the "SELLERDETAIL" page, we can evaluate the page and merge the extracted data into the object from the item detail, for example like this:


```
const result = { ...request.userData.data, ...sellerDetail };
```


Save the results, and we're done!


```
await Apify.pushData(result);
```


---

# How to scrape from sitemaps

Processing sitemaps automatically with Crawlee

Crawlee allows you to scrape sitemaps with ease. If you are using Crawlee, you can skip the following steps and just gather all the URLs from the sitemap in a few lines of code.


```
import { RobotsFile } from 'crawlee';

const robots = await RobotsFile.find('https://www.mysite.com');

const allWebsiteUrls = await robots.parseUrlsFromSitemaps();
```


**The sitemap.xml file is a jackpot for every web scraper developer. Take advantage of this and learn an easier way to extract data from websites using Crawlee.**

***

Let's say we want to scrape a database of craft beers (https://www.brewbound.com/) before summer starts. If we are lucky, the website will contain a sitemap at https://www.brewbound.com/sitemap.xml.

> Check out Sitemap Sniffer (https://apify.com/vaclavrut/sitemap-sniffer), which can discover sitemaps in hidden locations!

## Analyzing the sitemap

The sitemap is usually located at the path **/sitemap.xml**. It is always worth trying that URL, as it is rarely linked anywhere on the site. It usually contains a list of all pages in XML format (https://en.wikipedia.org/wiki/XML).


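```
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url>
        <loc>http://www.brewbound.com/advertise</loc>
        <lastmod>2015-03-19</lastmod>
        <changefreq>daily</changefreq>
    </url>
    <url>
    ...
```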


The URLs of breweries take this form:


```
http://www.brewbound.com/breweries/[BREWERY_NAME]
```


And the URLs of craft beers look like this:


```
http://www.brewbound.com/breweries/[BREWERY_NAME]/[BEER_NAME]
```


They can be matched using the following regular expression:


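```
http(s)?:\/\/www\.brewbound\.com\/breweries\/[^\/<]+\/[^\/<]+
```

Note the two parts of the regular expression `[^\/<]` that contain `<`. They are there to exclude the `</loc>` tag, which closes each URL.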

## Scraping the sitemap in Crawlee

If you're scraping sitemaps (or anything else, really), Crawlee (https://crawlee.dev) is perfect for the job.

First, let's add the beer URLs from the sitemap to the `RequestList` (https://crawlee.dev/api/core/class/RequestList), using our regular expression to match only the (craft!!) beer URLs and not brewery pages, the contact page, etc.


```
const requestList = await RequestList.open(null, [{
    requestsFromUrl: 'https://www.brewbound.com/sitemap.xml',
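    regex: /http(s)?:\/\/www\.brewbound\.com\/breweries\/[^/<]+\/[^/<]+/gm,
}]);

const crawler = new PuppeteerCrawler({
    requestList,
    async requestHandler({ page }) {
        const beerPage = await page.evaluate(() => {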
            return document.getElementsByClassName('productreviews').length;
        });
        if (!beerPage) return;

        const data = await page.evaluate(() => {
            const title = document.getElementsByTagName('h1')[0].innerText;
            const [brewery, beer] = title.split(':');
            const description = document.getElementsByClassName('productreviews')[0].innerText;

            return { brewery, beer, description };
        });

        await Dataset.pushData(data);
    },
});
```
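
To check the regular expression before starting a crawl, it can be applied to a sitemap fragment directly. The sketch below uses a made-up fragment (the brewery and beer names are invented). It shows that only URLs with both a brewery and a beer segment match, and that the `<` in the character class keeps the closing `</loc>` tag out of the match.

```javascript
const BEER_URL_REGEX = /http(s)?:\/\/www\.brewbound\.com\/breweries\/[^/<]+\/[^/<]+/gm;

// Invented sitemap fragment: a generic page, a brewery page, and a beer page.
const sitemapFragment = `
<url><loc>http://www.brewbound.com/advertise</loc></url>
<url><loc>http://www.brewbound.com/breweries/example-brewery</loc></url>
<url><loc>http://www.brewbound.com/breweries/example-brewery/example-ale</loc></url>
`;

// With the "g" flag, match() returns every full match as a plain string.
const beerUrls = sitemapFragment.match(BEER_URL_REGEX) || [];
console.log(beerUrls);
```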


## Full code

If we create a new Actor using the code below on the Apify platform (https://docs.apify.com/academy/apify-platform.md), it returns a nicely formatted spreadsheet containing a list of breweries with their beers and descriptions.

Make sure to use the **apify/actor-node-puppeteer-chrome** image for your Dockerfile, otherwise the run will fail.

https://console.apify.com/actors/7tWSD8hrYzuc9Lte7?runConfig=eyJ1IjoiRWdQdHczb2VqNlRhRHQ1cW4iLCJ2IjoxfQ.eyJpbnB1dCI6IntcImNvZGVcIjpcImltcG9ydCB7IERhdGFzZXQsIFB1cHBldGVlckNyYXdsZXIsIFJlcXVlc3RMaXN0IH0gZnJvbSAnY3Jhd2xlZSc7XFxuXFxuY29uc3QgcmVxdWVzdExpc3QgPSBhd2FpdCBSZXF1ZXN0TGlzdC5vcGVuKG51bGwsIFt7XFxuICAgIHJlcXVlc3RzRnJvbVVybDogJ2h0dHBzOi8vd3d3LmJyZXdib3VuZC5jb20vc2l0ZW1hcC54bWwnLFxcbiAgICByZWdleDogL2h0dHAocyk_OlxcXFwvXFxcXC93d3dcXFxcLmJyZXdib3VuZFxcXFwuY29tXFxcXC9icmV3ZXJpZXNcXFxcL1teLzxdK1xcXFwvW14vPF0rL2dtLFxcbn1dKTtcXG5cXG5jb25zdCBjcmF3bGVyID0gbmV3IFB1cHBldGVlckNyYXdsZXIoe1xcbiAgICByZXF1ZXN0TGlzdCxcXG4gICAgYXN5bmMgcmVxdWVzdEhhbmRsZXIoeyBwYWdlIH0pIHtcXG4gICAgICAgIGNvbnN0IGJlZXJQYWdlID0gYXdhaXQgcGFnZS5ldmFsdWF0ZSgoKSA9PiB7XFxuICAgICAgICAgICAgcmV0dXJuIGRvY3VtZW50LmdldEVsZW1lbnRzQnlDbGFzc05hbWUoJ3Byb2R1Y3RyZXZpZXdzJykubGVuZ3RoO1xcbiAgICAgICAgfSk7XFxuICAgICAgICBpZiAoIWJlZXJQYWdlKSByZXR1cm47XFxuXFxuICAgICAgICBjb25zdCBkYXRhID0gYXdhaXQgcGFnZS5ldmFsdWF0ZSgoKSA9PiB7XFxuICAgICAgICAgICAgY29uc3QgdGl0bGUgPSBkb2N1bWVudC5nZXRFbGVtZW50c0J5VGFnTmFtZSgnaDEnKVswXS5pbm5lclRleHQ7XFxuICAgICAgICAgICAgY29uc3QgW2JyZXdlcnksIGJlZXJdID0gdGl0bGUuc3BsaXQoJzonKTtcXG4gICAgICAgICAgICBjb25zdCBkZXNjcmlwdGlvbiA9IGRvY3VtZW50LmdldEVsZW1lbnRzQnlDbGFzc05hbWUoJ3Byb2R1Y3RyZXZpZXdzJylbMF0uaW5uZXJUZXh0O1xcblxcbiAgICAgICAgICAgIHJldHVybiB7IGJyZXdlcnksIGJlZXIsIGRlc2NyaXB0aW9uIH07XFxuICAgICAgICB9KTtcXG5cXG4gICAgICAgIGF3YWl0IERhdGFzZXQucHVzaERhdGEoZGF0YSk7XFxuICAgIH0sXFxufSk7XFxuXFxuYXdhaXQgY3Jhd2xlci5ydW4oKTtcXG5cIn0iLCJvcHRpb25zIjp7ImJ1aWxkIjoibGF0ZXN0IiwiY29udGVudFR5cGUiOiJhcHBsaWNhdGlvbi9qc29uOyBjaGFyc2V0PXV0Zi04IiwibWVtb3J5Ijo0MDk2LCJ0aW1lb3V0IjoxODB9fQ.KFqjQiNxNkx_HPnvJ4H_W0e58W3L7D_Ga9pq_ZQ7tqI&asrc=run_on_apify


```
import { Dataset, PuppeteerCrawler, RequestList } from 'crawlee';

const requestList = await RequestList.open(null, [{
    requestsFromUrl: 'https://www.brewbound.com/sitemap.xml',
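    regex: /http(s)?:\/\/www\.brewbound\.com\/breweries\/[^/<]+\/[^/<]+/gm,
}]);

const crawler = new PuppeteerCrawler({
    requestList,
    async requestHandler({ page }) {
        const beerPage = await page.evaluate(() => {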
            return document.getElementsByClassName('productreviews').length;
        });
        if (!beerPage) return;

        const data = await page.evaluate(() => {
            const title = document.getElementsByTagName('h1')[0].innerText;
            const [brewery, beer] = title.split(':');
            const description = document.getElementsByClassName('productreviews')[0].innerText;

            return { brewery, beer, description };
        });

        await Dataset.pushData(data);
    },
});

await crawler.run();
```


---

# How to scrape sites with a shadow DOM

**The shadow DOM enables isolation of web components, but causes problems for those building web scrapers. Here's a workaround.**

***

Each website is represented by an HTML DOM, a tree-like structure consisting of HTML elements (e.g. paragraphs, images, videos) and text. Shadow DOM (https://developer.mozilla.org/en-US/docs/Web/API/Web_components/Using_shadow_DOM) allows separate DOM trees to be attached to the main DOM while remaining isolated in terms of CSS inheritance and JavaScript DOM manipulation. The CSS and JavaScript code of separate shadow DOM components does not clash, but the downside is that you can't access the content from outside.

Let's take a look at this page: https://www.alodokter.com/. If you click on the menu and open Chrome DevTools, you will see that the menu tree is attached to the main DOM as a shadow DOM under the element with the ID `top-navbar-view`.

![Shadow root of the top-navbar-view custom element](/assets/images/shadow-023c6b4266de5874b37593ca6e0a0ad6.png)

The rest of the content is rendered the same way. This makes it hard to scrape because `document.body.innerText`, `document.getElementsByTagName('a')`, and all others return an empty result.

The content of the menu can be accessed only via the `shadowRoot` (https://developer.mozilla.org/en-US/docs/Web/API/ShadowRoot) property. If you use jQuery, you can do the following:


```
// Find element that is shadow root of menu DOM tree.
const { shadowRoot } = document.getElementById('top-navbar-view');

// Create a copy of its HTML and use jQuery find links.
const links = $(shadowRoot.innerHTML).find('a');

// Get URLs from link elements. jQuery's map() passes (index, element);
// get() converts the result to a plain array.
const urls = links.map((index, el) => el.href).get();
```


However, this isn't very convenient, because you have to find the root element of each component you want to work with, and you can't take advantage of all the scripts and tools you already have.

Instead of that, we can replace the content of each element containing shadow DOM with the HTML of shadow DOM.


```
// Iterate over all elements in the main DOM.
for (const el of document.getElementsByTagName('*')) {
    // If element contains shadow root then replace its
    // content with the HTML of shadow DOM.
    if (el.shadowRoot) el.innerHTML = el.shadowRoot.innerHTML;
}
```


After you run this, you can access all the elements and content using jQuery or plain JavaScript. The downside is that it breaks all the interactive components because you create a new copy of the shadow DOM HTML content without the JavaScript code and CSS attached, so this must be done after all the content has been rendered.

Some websites may contain shadow DOMs recursively inside of shadow DOMs. In these cases, we must replace them with HTML recursively:


```
// Returns HTML of given shadow DOM.
const getShadowDomHtml = (shadowRoot) => {
    let shadowHTML = '';
    for (const el of shadowRoot.childNodes) {
        shadowHTML += el.nodeValue || el.outerHTML;
    }
    return shadowHTML;
};

// Recursively replaces shadow DOMs with their HTML.
const replaceShadowDomsWithHtml = (rootElement) => {
    for (const el of rootElement.querySelectorAll('*')) {
        if (el.shadowRoot) {
            // Resolve nested shadow DOMs first.
            replaceShadowDomsWithHtml(el.shadowRoot);
            el.innerHTML += getShadowDomHtml(el.shadowRoot);
        }
    }
};

replaceShadowDomsWithHtml(document.body);
```
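
If the page must stay interactive, a non-destructive variant is to search through the shadow roots instead of replacing them. This is a sketch of a different technique from the one above: the hypothetical `querySelectorAllDeep` helper walks every open shadow root reachable from a root node and collects the matching elements. It relies only on the standard `querySelectorAll()` method and `shadowRoot` property. Closed shadow roots are not reachable this way.

```javascript
// Hypothetical helper: like querySelectorAll(), but also descends into
// every open shadow root found under the given root node.
function querySelectorAllDeep(root, selector) {
    const matches = [...root.querySelectorAll(selector)];
    for (const el of root.querySelectorAll('*')) {
        if (el.shadowRoot) {
            matches.push(...querySelectorAllDeep(el.shadowRoot, selector));
        }
    }
    return matches;
}

// Only runs in a browser context, for example inside page.evaluate().
if (typeof document !== 'undefined') {
    const urls = querySelectorAllDeep(document, 'a').map((a) => a.href);
    console.log(urls);
}
```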


---

# Scraping a list of URLs from a Google Sheets document

You can export URLs from Google Sheets (https://workspace.google.com/products/sheets/), such as this sample spreadsheet (https://docs.google.com/spreadsheets/d/1-2mUcRAiBbCTVA5KcpFdEYWflLMLp9DDU3iJutvES4w), directly into the Start URLs field of an Actor (https://docs.apify.com/platform/actors.md).

1. Make sure the spreadsheet has one sheet and a simple structure to help the Actor find the URLs.

2. Add the `/gviz/tq?tqx=out:csv` query parameter to the Google Sheet URL base, right after the long document identifier part. For example, https://docs.google.com/spreadsheets/d/1-2mUcRAiBbCTVA5KcpFdEYWflLMLp9DDU3iJutvES4w/gviz/tq?tqx=out:csv. This automatically exports the spreadsheet to CSV format.

3. In the Actor's input, click Link remote text file and paste the URL there:

![List of URLs](/assets/images/gsheets-url-27adbc7f89057db71fc4d2f03a65cedf.png)

IMPORTANT: Make sure anyone with the link can view the document. Otherwise, the Actor will not be able to access it.

![Link sharing](/assets/images/anyone-with-link-38a1b714c55ca2b0f1ee21c9adaed0a3.png)
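
If you build the export link in code, the rule from step 2 can be sketched as a small helper. The function name `toCsvExportUrl` is made up for this example, and it assumes the usual `https://docs.google.com/spreadsheets/d/<document ID>/...` URL shape.

```javascript
// Hypothetical helper: turn any Google Sheets document URL into its CSV export URL.
function toCsvExportUrl(sheetUrl) {
    const match = sheetUrl.match(/^https:\/\/docs\.google\.com\/spreadsheets\/d\/([^/?#]+)/);
    if (!match) throw new Error(`Not a Google Sheets URL: ${sheetUrl}`);
    // Append the export suffix right after the document identifier.
    return `https://docs.google.com/spreadsheets/d/${match[1]}/gviz/tq?tqx=out:csv`;
}

console.log(toCsvExportUrl('https://docs.google.com/spreadsheets/d/1-2mUcRAiBbCTVA5KcpFdEYWflLMLp9DDU3iJutvES4w/edit#gid=0'));
```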


---

When doing web automation with Apify, it can sometimes be necessary to submit an HTML form with a file attachment. This article covers a situation where the file is publicly accessible (e.g. hosted somewhere) and uses an Apify Actor. If it's impossible to use request-promise, it might be necessary to use Puppeteer (https://docs.apify.com/academy/puppeteer-playwright/common-use-cases/submitting-a-form-with-a-file-attachment).

# Downloading the file to memory

**How to submit a form with attachment using request-promise.**

***

After creating a new Actor, the first thing to do is download the file. We can do that using the request-promise module, so make sure it is included.


```
const request = require('request-promise');
```


The actual downloading is going to be slightly different for text and binary files. For a text file, do it like this:


```
const fileData = await request('https://example.com/file.txt');
```


For a binary file, we need to provide additional parameters so as not to interpret it as text:


```
const fileData = await request({
    uri: 'https://example.com/file.pdf',
    encoding: null,
});
```


In this case, fileData will be a Buffer instead of a String.

# Submitting the form

When the file is ready, we can submit the form as follows:


```
await request({
    uri: 'https://example.com/submit-form.php',
    method: 'POST',

    formData: {
        // set any form values
        name: 'John',
        surname: 'Doe',
        email: 'john.doe@example.com',

        // add the attachment
        attachment: {
            value: fileData,
            options: {
                filename: 'file.pdf',
                contentType: 'application/pdf',
            },
        },
    },
});
```


The header Content-Type: multipart/form-data will be set automatically.
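
The `request-promise` package has since been deprecated, so as a hedged alternative, the sketch below builds the same multipart body with the `FormData` and `Blob` globals. It assumes Node.js 18 or newer, where both are built in. The helper name `buildAttachmentForm` and the sample file bytes are made up for this example.

```javascript
// Hypothetical helper: build a multipart form with text fields and one file.
function buildAttachmentForm(fields, fileData, filename, contentType) {
    const form = new FormData();
    for (const [name, value] of Object.entries(fields)) {
        form.append(name, value);
    }
    // Wrap the downloaded bytes in a Blob so they are sent as a file part.
    form.append('attachment', new Blob([fileData], { type: contentType }), filename);
    return form;
}

const form = buildAttachmentForm(
    { name: 'John', surname: 'Doe', email: 'john.doe@example.com' },
    Buffer.from('%PDF-1.4 example bytes'), // stands in for the downloaded file
    'file.pdf',
    'application/pdf',
);
console.log(form.get('attachment').name);

// To submit the form, pass it as the request body, for example:
// await fetch('https://example.com/submit-form.php', { method: 'POST', body: form });
```

When a `FormData` object is used as the body, `fetch()` sets the multipart `Content-Type` header, including the boundary, on its own.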


---

# Submitting forms on .ASPX pages

Apify users sometimes need to submit a form on pages created with ASP.NET (URL typically ends with .aspx). These pages have a different approach for how they submit forms and navigate through pages.

This tutorial shows you how to handle these kinds of pages. This approach is based on a blog post (https://web.archive.org/web/20230530120937/https://toddhayton.com/2015/05/04/scraping-aspnet-pages-with-ajax-pagination/) by Todd Hayton, where he explains how crawlers for ASP.NET pages should work.

First of all, you need to copy and paste this function into your Web Scraper (https://apify.com/apify/web-scraper) *Page function*:


```
const enqueueAspxForm = async function (request, formSelector, submitButtonSelector, async) {
    request.payload = $(formSelector).serialize();
    if ($(submitButtonSelector).length) {
        request.payload += decodeURIComponent(`&${$(submitButtonSelector).attr('name')}=${$(submitButtonSelector).attr('value')}`);
    }
    request.payload += decodeURIComponent(`&__ASYNCPOST=${async.toString()}`);
    request.method = 'POST';
    // uniqueKey is expected to be a string, as in `${Math.random()}` used earlier.
    request.uniqueKey = `${Math.random()}`;
    await context.enqueueRequest(request);
    return request;
};
```


The function has these parameters:

`request` - the object that describes the next request

`formSelector` - selector for the form to be submitted, e.g. `form[name="test"]`

`submitButtonSelector` - selector for the button that submits the form, e.g. `#nextPageButton`

`async` - if true, request returns only params, not HTML content

Then you can use it in your Page function as follows:


```
await enqueueAspxForm({
    url: 'http://architectfinder.aia.org/frmSearch.aspx',
    userData: { label: 'SEARCH-RESULT' },
}, 'form[name="aspnetForm"]', '#ctl00_ContentPlaceHolder1_btnSearch', false);
```
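
To make the payload format more concrete, here is a sketch of the same idea outside the browser, using the built-in `URLSearchParams` class instead of jQuery's `serialize()`. The field values, the control names, and the helper name `buildAspxPayload` are invented for illustration. `__VIEWSTATE` and `__EVENTVALIDATION` are the hidden fields that ASP.NET forms typically carry and expect to be posted back.

```javascript
// Hypothetical helper: build the URL-encoded POST body for an ASP.NET form.
function buildAspxPayload(formFields, submitButton, isAsync) {
    const params = new URLSearchParams(formFields);
    // The clicked button must be part of the payload, as in a real submit.
    if (submitButton) params.append(submitButton.name, submitButton.value);
    params.append('__ASYNCPOST', String(isAsync));
    return params.toString();
}

const payload = buildAspxPayload(
    { __VIEWSTATE: 'abc123', __EVENTVALIDATION: 'xyz789', ctl00$txtCity: 'New York' },
    { name: 'ctl00$btnSearch', value: 'Search' },
    false,
);
console.log(payload);
```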


---

# Using man-in-the-middle proxy to intercept requests in Puppeteer

Sometimes you may need to intercept (or maybe block) requests in headless Chrome / Puppeteer, but `page.setRequestInterception()` is not 100% reliable when the request is started in a new window.

One possible way to intercept these requests is to use a man-in-the-middle (MITM) proxy, i.e. a proxy server that can intercept and modify HTTP requests, even those over HTTPS. In this example, we're going to use `http-mitm-proxy` (https://github.com/joeferner/node-http-mitm-proxy), since it has all the tools that we need.

First we set up the MITM proxy:


```
const { promisify } = require('util');
const { exec } = require('child_process');
const Proxy = require('http-mitm-proxy');
const Promise = require('bluebird');

const execPromise = promisify(exec);

const wait = (timeout) => new Promise((resolve) => setTimeout(resolve, timeout));

const setupProxy = async (port) => {
    // Setup chromium certs directory
    // WARNING: this only works in debian docker images
    // modify it for any other use cases or local usage.
    await execPromise('mkdir -p $HOME/.pki/nssdb');
    await execPromise('certutil -d sql:$HOME/.pki/nssdb -N');
    const proxy = Proxy();
    proxy.use(Proxy.wildcard);
    proxy.use(Proxy.gunzip);
    return new Promise((resolve, reject) => {
        proxy.listen({ port, silent: true }, (err) => {
            if (err) return reject(err);
            // Add CA certificate to chromium and return initialize proxy object
            execPromise('certutil -d sql:$HOME/.pki/nssdb -A -t "C,," -n mitm-ca -i ./.http-mitm-proxy/certs/ca.pem')
                .then(() => resolve(proxy))
                .catch(reject);
        });
    });
};
```


Then we'll need a Docker image that has the `certutil` utility. Here is an example Dockerfile (https://github.com/apify/actor-example-proxy-intercept-request/blob/master/Dockerfile) that can create such an image. It is based on the `apify/actor-node-chrome` (https://hub.docker.com/r/apify/actor-node-chrome/) image, which contains Puppeteer.

Now we need to specify how the proxy shall handle the intercepted requests:


```
// Setup blocking of requests in proxy
const proxyPort = 8000;
// setupProxy() is async, so wait for the initialized proxy object
const proxy = await setupProxy(proxyPort);
proxy.onRequest((context, callback) => {
    if (blockRequests) {
        const request = context.clientToProxyRequest;
        // Log out blocked requests
        console.log('Blocked request:', request.headers.host, request.url);

        // Close the connection with custom content
        context.proxyToClientResponse.end('Blocked');
        return;
    }
    return callback();
});
```


The final step is to let Puppeteer use the local proxy:


```
// Launch puppeteer with local proxy
const browser = await puppeteer.launch({
    args: ['--no-sandbox', `--proxy-server=localhost:${proxyPort}`],
});
```


And we're done! By adjusting the `blockRequests` variable, you can allow or block any request initiated through Puppeteer.
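
The example leaves `blockRequests` undefined on purpose. One way to fill it in, shown here purely as a sketch, is a small predicate that decides per request based on the host name. The `BLOCKED_HOSTS` list and the `shouldBlock` name are hypothetical.

```javascript
// Hypothetical blocklist. Subdomains of these hosts are blocked as well.
const BLOCKED_HOSTS = ['ads.example.com', 'tracker.example.net'];

function shouldBlock(host) {
    if (!host) return false;
    // The Host header may include a port, e.g. "ads.example.com:443".
    const hostname = host.split(':')[0].toLowerCase();
    return BLOCKED_HOSTS.some((blocked) => hostname === blocked || hostname.endsWith(`.${blocked}`));
}

console.log(shouldBlock('ads.example.com:443')); // true
console.log(shouldBlock('www.example.com')); // false
```

Inside `proxy.onRequest`, the `if (blockRequests)` condition could then become `if (shouldBlock(context.clientToProxyRequest.headers.host))`.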

Here is a GitHub repository with a full example and all necessary files: https://github.com/apify/actor-example-proxy-intercept-request

If you have any questions, feel free to contact us in the chat.

Happy intercepting!


---

# Waiting for dynamic content

Use these helper functions to wait for data:

* `page.waitFor` in Puppeteer (https://pptr.dev/) or in Puppeteer Scraper (https://apify.com/apify/puppeteer-scraper).

* `context.waitFor` in Web Scraper (https://apify.com/apify/web-scraper).

Pass in time in milliseconds or a selector to wait for.

Examples:

* `await page.waitFor(10000)` - waits for 10 seconds.

* `await context.waitFor('my-selector')` - waits for `my-selector` to appear on the page.

For details, code examples, and advanced use cases, visit our tutorial on waiting (https://docs.apify.com/academy/puppeteer-playwright/page/waiting.md).
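
Conceptually, waiting for a selector boils down to checking a condition repeatedly until it holds or a timeout expires. The helper below is a generic sketch of that polling idea in plain JavaScript. It is not Puppeteer's implementation, and the name `waitForCondition` and its options are made up for this example.

```javascript
// Hypothetical helper: poll `check` until it returns a truthy value
// or the timeout (in milliseconds) runs out.
async function waitForCondition(check, { timeout = 5000, interval = 100 } = {}) {
    const deadline = Date.now() + timeout;
    while (Date.now() < deadline) {
        const value = await check();
        if (value) return value;
        // Sleep before the next attempt.
        await new Promise((resolve) => setTimeout(resolve, interval));
    }
    throw new Error(`Condition not met within ${timeout} ms`);
}

// Example: the condition becomes true on the third check.
let attempts = 0;
waitForCondition(() => { attempts += 1; return attempts >= 3; }, { interval: 10 })
    .then(() => console.log(`Condition met after ${attempts} checks`));
```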


---

# When to use Puppeteer Scraper

You may have read in the Web Scraper (https://apify.com/apify/web-scraper) readme or somewhere else at Apify that Puppeteer Scraper (https://apify.com/apify/puppeteer-scraper) is more powerful and gives you more control over the browser, enabling you to do almost anything. But what does that really mean? In this article, we will talk about the differences in more detail and show you some minimal examples to strengthen that understanding.

## What exactly is Puppeteer?

Both the Web Scraper and Puppeteer Scraper use Puppeteer to control the Chrome browser, so, what's the difference? Consider Puppeteer and Chrome as two separate programs.

Puppeteer is a JavaScript program that's used to control the browser and by controlling we mean opening tabs, closing tabs, moving the mouse, clicking buttons, typing on the keyboard, managing network activity, etc. If a website is watching for any of these events, there is no way for it to know that those actions were performed by a robot and not a human user. Chrome is just Chrome as you know it.

*Robot browsers can be detected in numerous ways. But there is no way to tell whether a specific mouse click was made by a user or a robot.*

Ok, so both Web Scraper and Puppeteer Scraper use Puppeteer to give commands to Chrome. Where's the difference? It's called the execution environment.

## Execution environment

It may sound fancy, but it's just a technical term for "where does my code run". When you open DevTools and start typing JavaScript in the browser Console, it gets executed in the browser. The browser is the code's execution environment. But you can't control the browser from the inside. For that, you need a different environment. Puppeteer's environment is Node.js. If you don't know what Node.js is, don't worry about it too much. Remember that it's the environment where Puppeteer runs.

By now you probably figured this out on your own, so this will not come as a surprise. The difference between Web Scraper and Puppeteer Scraper is where your page function gets executed. When using the Web Scraper, it's executed in the browser environment. It means that it gets access to all the browser specific features such as the `window` or `document` objects, but it cannot control the browser with Puppeteer directly. This is done automatically in the background by the scraper. Whereas in Puppeteer Scraper, the page function is executed in the Node.js environment, giving you full access to Puppeteer and all its features.

![Puppeteer Scraper Diagram](/assets/images/puppeteer-scraper-diagram-5eb36bbee183cfd0066ee3807e8f9073.jpeg) *This does not mean that you can't execute in-browser code with Puppeteer Scraper. Keep reading to learn how.*

## Practical differences

Ok, cool, different environments, but how does that help you scrape stuff? Actually, quite a lot. Some things you just can't do from within the browser, but you can do them with Puppeteer. We will not attempt to create an exhaustive list, but rather show you some very useful features that we use every day in our scraping.

## Evaluating in-browser code

In Web Scraper, everything runs in the browser, so there's really not much to talk about there. With Puppeteer Scraper, it's a single function call away.


```
const bodyHTML = await context.page.evaluate(() => {
    console.log('This will be printed in browser console.');
    return document.body.innerHTML;
});
```


The `context.page.evaluate()` call executes the provided function in the browser environment and passes the return value back to the Node.js environment. One very important caveat though! Since we're in different environments, we cannot use our existing variables, such as `context`, inside the evaluated function, because they are not available there. Different environments, different variables.

*See the* `page.evaluate()` *documentation (https://pptr.dev/#?product=Puppeteer&show=api-pageevaluatepagefunction-args) for info on how to pass variables from Node.js to the browser.*
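
As a brief sketch of that mechanism: extra arguments given to `page.evaluate()` after the function are serialized and handed to the function inside the browser. The `getText` helper below is hypothetical and expects a Puppeteer `page` object to be passed in.

```javascript
// Hypothetical helper: read the text of the first element matching a selector.
async function getText(page, selector) {
    // `selector` lives in Node.js; it reaches the browser as the `sel` argument.
    return page.evaluate((sel) => {
        const el = document.querySelector(sel);
        return el ? el.textContent : null;
    }, selector);
}
```

In a Puppeteer Scraper page function, this could be called as `await getText(context.page, 'h1')`.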

With the help of Apify SDK, we can even inject jQuery into the browser. You can use the `Pre goto function` input option to manipulate the page's environment before it loads.


```
async function preGotoFunction({ request, page, Apify }) {
    await Apify.utils.puppeteer.injectJQuery(page);
}
```


This will make jQuery available in all pages. You can then use it in `context.page.evaluate()` calls:


```
const bodyText = await context.page.evaluate(() => {
    return $('body').text();
});
```


You can do a lot of DOM manipulation directly from Node.js / Puppeteer, but when you're planning to do a lot of sequential operations, it's often better and faster to do it with jQuery in a single `context.page.evaluate()` call than using multiple `context.page.$`, `context.page.$eval()` and other Puppeteer methods.

## Navigation to other pages (URLs)

In Web Scraper, your page function literally runs within a page so it makes sense that when this page gets destroyed, the page function throws an error. Sadly, navigation (going to a different URL) destroys pages, so whenever you click a button in Web Scraper that forces the browser to navigate somewhere else, you end up with an error. In Puppeteer Scraper, this is not an issue, because the `page` object gets updated with new data seamlessly.

Imagine that you currently have `https://example.com/page-1` open and there's a button on the page that will take you to `https://example.com/page-2`. Or that you're on `https://google.com` and you fill in the search bar and click on the search button.

Consider the following code inside Web Scraper page function:


```
await context.waitFor('button');
$('button').click();
```


With a `button` that takes you to the next page or launches a Google search (which takes you to the results page), the page function will fail with a nasty error.

However, when using Puppeteer Scraper, this code:


```
await context.page.waitFor('button');
await Promise.all([
    context.page.waitForNavigation(),
    context.page.click('button'),
]);
```


Will work as expected and after the `Promise.all()` call resolves, you will have the next page loaded and ready for scraping.

Pay special attention to the `page.waitForNavigation()` (https://pptr.dev/#?product=Puppeteer&show=api-pagewaitfornavigationoptions) call which is very important. It pauses your script until the navigation completes. Without it, the execution would start immediately after the mouse click. It's also important that you place it before the click itself, otherwise it creates a race condition and your script will behave unpredictably.

You can go even further and navigate programmatically by calling:


```
await context.page.goto('https://some-new-page.com');
```


## Intercepting network activity

Some very useful scraping techniques revolve around listening to network requests and responses and even modifying them on the fly. Web Scraper's page function doesn't have access to the network, besides calling JavaScript APIs such as `fetch()`. Puppeteer Scraper, on the other hand, has full control over the browser's network activity.

You can listen to all the network requests that are being dispatched from the browser. For example, the following code will print all their URLs to the console.


```
context.page.on('request', (req) => console.log(req.url()));
```


This can be useful in many ways, such as blocking unwanted assets or scripts from being downloaded, modifying request methods or faking responses, etc.

*Explaining how to do interception properly is out of scope of this article. See the Puppeteer docs on* `page.setRequestInterception()` *(https://pptr.dev/#?product=Puppeteer&show=api-pagesetrequestinterceptionvalue) and the Apify SDK helper (https://docs.apify.com/sdk/js/docs/api/puppeteer#puppeteeraddinterceptrequesthandler-promise) for request interception.*

## Enqueueing JavaScript links

A large number of websites use either form submissions or JavaScript redirects for navigation and displaying of data. With Web Scraper, you cannot crawl those websites, because there are no links to find and enqueue on those pages. Puppeteer Scraper enables you to automatically click all those elements that cause navigation, intercept the navigation requests and enqueue them to the request queue.

If it seems complicated, don't worry. We've abstracted all the complexity away to a `Clickable elements selector` input option. When left empty, none of the said clicking and intercepting happens, but once you choose a selector, Puppeteer Scraper will automatically click all the selected elements, watch for page navigations and enqueue them into the `RequestQueue`.

*The* `Clickable elements selector` *will also work on regular non-JavaScript links, however, it is significantly slower than using the plain* `Link selector`*. Unless you know you need it, use the* `Link selector` *for best performance.*

## Word of caution

Since we're actually clicking in the page, which may or may not trigger some nasty JavaScript, anything can happen really, including the page completely breaking. Three common scenarios exist though.

## Plain form submit navigations

This works out of the box. It's typically used on older websites such as https://www.remax.com.tr/ofis-office-franchise-girisimci-agent-arama. For a site like this you can set the `Clickable elements selector` and you're good to go:


```
'a[onclick ^= getPage]';
```


## Form submit navigations with side-effects

Those are similar to the ones above with an important caveat. Once you click the first thing, it usually modifies the page in a way that causes more clicking to become impossible. We deal with those by scraping the pages one by one, using the pagination "next" button. See http://www.maxwellrender.com/materials/ and use the following selector:


```
'li.page-item.next a';
```


## Frontend navigations

Websites often won't navigate away just to fetch the next set of results. They will do it in the background and update the displayed data. You can paginate such websites with either Web Scraper or Puppeteer Scraper. Try it on https://www.udemy.com/topic/javascript/ for example. Click the next button to load the next set of courses.


```
// Web Scraper
$('li a span.pagination-next').click();

// Puppeteer Scraper
await page.click('li a span.pagination-next');
```
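To go through several result pages in Puppeteer Scraper, the single click can be wrapped in a loop. The following is a minimal sketch, assuming the "next" button disappears on the last page; `clickThroughPages`, `maxClicks`, and `pauseMs` are hypothetical names introduced for this example, and the fixed pause is a crude stand-in for waiting until the new results are rendered.

```javascript
// Hypothetical helper: keeps clicking the "next" button until it disappears
// or `maxClicks` is reached. `page` is expected to be a Puppeteer Page.
async function clickThroughPages(page, nextSelector, { maxClicks = 10, pauseMs = 1000 } = {}) {
    let clicks = 0;
    while (clicks < maxClicks) {
        // page.$() resolves to null when no element matches the selector
        const nextButton = await page.$(nextSelector);
        if (!nextButton) break;
        await page.click(nextSelector);
        clicks += 1;
        // Crude fixed pause so the page can load the next set of results
        await new Promise((resolve) => setTimeout(resolve, pauseMs));
    }
    return clicks;
}

// Possible usage (not executed here):
// await clickThroughPages(page, 'li a span.pagination-next', { maxClicks: 5 });
```

The `maxClicks` cap guards against pages where the button never disappears.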


## Using Apify SDK

Apify SDK (https://docs.apify.com/sdk/js) is the library we used to build all of our scrapers. For power users, it is the best tool out there to scrape using JavaScript. If you're not yet ready to start writing your own Actors using the SDK, Puppeteer Scraper enables you to use its features without having to worry about building your own Actors.

The possibilities are endless, but to show you some examples:

* Check out the https://docs.apify.com/sdk/js/docs/api/puppeteer#puppeteer.infiniteScroll function that enables scraping pages with infinite scroll in one line of code.

* https://docs.apify.com/sdk/js/docs/api/puppeteer#puppeteer.blockRequests allows you to block network requests based on URL patterns.

* https://docs.apify.com/sdk/js/docs/api/apify#module_Apify.openDataset lets you work with any dataset under your account.

* Make HTTP requests with `Apify.utils.requestAsBrowser()` to fetch external resources.

And we're only scratching the surface here.

## Wrapping it up

Many more techniques are available to Puppeteer Scraper that are either too complicated to replicate in Web Scraper or downright impossible to do. Web Scraper is a great tool for basic scraping, because it goes right to the point and uses in-browser JavaScript which is well-known to millions of people, even non-developers.

Once you start hitting some roadblocks, you may find that Puppeteer Scraper is just what you need to overcome them. And if Puppeteer Scraper still doesn't cut it, there's still Apify SDK to rule them all. We hope you found this tutorial helpful and happy scraping.


---

# How to use Apify from PHP

Apify's API (https://docs.apify.com/api/v2) allows you to use the platform from basically anywhere. Many projects are and will continue to be built using PHP (https://www.php.net/). This tutorial shows you how to use Apify in these projects, both in plain PHP and in frameworks built on it.

Apify does not have an official PHP client (yet), so we are going to use https://github.com/guzzle/guzzle, a great library for HTTP requests. By covering a few fundamental endpoints, this tutorial will show you the principles you can use for all Apify API endpoints.

## Before you start

Make sure you have an Apify account and API token. You will find the token in the Integrations section (https://console.apify.com/account#/integrations) of Apify Console.

If you don't already have guzzle installed in your project (or just want to try out the code examples), run `composer require guzzlehttp/guzzle` to install it in the current directory.

## Preparing the client

To get a guzzle instance ready to be used with the Apify API, we first need to set up the base endpoint and authentication.


```
require 'vendor/autoload.php';

$client = new \GuzzleHttp\Client([
    'base_uri' => 'https://api.apify.com/v2/',
    'headers' => [
        // Replace <YOUR_API_TOKEN> with your actual token
        'Authorization' => 'Bearer <YOUR_API_TOKEN>',
    ]
]);
```


Note that we pass the API token in the header. It can also be passed as a query string `token` parameter, but passing it in the header is preferred and more secure.

To check whether everything works well, we'll try to get information about the current user (https://docs.apify.com/api/v2/users-me-get.md).


```
// Call the endpoint using our client
// Note that the path does not have a leading slash
$response = $client->get('users/me');
// Parse the response (most Apify API endpoints return JSON)
$parsedResponse = \json_decode($response->getBody(), true);
// The actual data are usually present under the `data` key
$data = $parsedResponse['data'];

echo \json_encode($data, JSON_PRETTY_PRINT);
```


If, instead of data, you see an error saying `Authentication token is not valid`, check if the API token you used to instantiate the client is valid.

## Running an Actor

Now that we have our guzzle client ready to go, we can run some Actors. Let's try the **Contact Details Scraper** (https://apify.com/vdrmota/contact-info-scraper).

The API documentation for running an Actor (https://docs.apify.com/api/v2/act-runs-post.md) states that an Actor's input should be passed as JSON in the request body. Other options are passed as query parameters.


```
// To run the Actor, we make a POST request to its run's endpoint
// To identify the Actor, you can use its ID, but you can also pass
// the full Actor name [username]~[actorName] or just ~[actorName] for
// your own Actors
$response = $client->post('acts/vdrmota~contact-info-scraper/runs', [
  // Actors usually accept JSON as input. When using the `json` key in
  // a POST request's options, guzzle sets proper request headers
  // and serializes the array we pass in
  'json' => [
    'startUrls' => [
        ['url' => 'https://www.apify.com/contact']
    ],
    'maxDepth' => 0,
  ],
  // Other run options are passed in as query parameters
  // This is optional since Actors usually have reasonable defaults
  'query' => [ 'timeout' => 30 ],
]);
$parsedResponse = \json_decode($response->getBody(), true);
$data = $parsedResponse['data'];

echo \json_encode($data, JSON_PRETTY_PRINT);
```


You should see information about the run, including its ID and the ID of its default dataset (https://docs.apify.com/platform/storage/dataset.md). Take note of these, as we will need them later.

## Getting the results from dataset

Actors usually store their output in a default dataset. The Actor runs API (https://docs.apify.com/api/v2/actor-runs.md) lets you get overall info about an Actor run's default dataset.


```
// Replace <RUN_ID> with the run ID you got earlier
$response = $client->get('actor-runs/<RUN_ID>/dataset');
$parsedResponse = \json_decode($response->getBody(), true);
$data = $parsedResponse['data'];

echo \json_encode($data, JSON_PRETTY_PRINT);
```


As you can see, the response contains overall stats about the dataset, like its number of items, but not the actual data. To get those, we have to call the **items** endpoint.


```
// Replace <RUN_ID> with the run ID from earlier
$response = $client->get('actor-runs/<RUN_ID>/dataset/items');
// The dataset items endpoint returns an array of dataset items
// they are not under the `data` key like in other endpoints
$data = \json_decode($response->getBody(), true);

echo \json_encode($data, JSON_PRETTY_PRINT);
```


Some of the Actors write to datasets other than the default. In these cases, you need to have the dataset ID and call the `datasets/<DATASET_ID>` and `datasets/<DATASET_ID>/items` endpoints instead.

For larger datasets, you can paginate through the results by passing query parameters.


```
$response = $client->get('datasets/<DATASET_ID>/items', [
    'query' => [
        'offset' => 20,
        'limit' => 10,
    ]
]);
$parsedResponse = \json_decode($response->getBody(), true);
echo \json_encode($parsedResponse, JSON_PRETTY_PRINT);
```


All the available parameters are described in the dataset items API documentation (https://docs.apify.com/api/v2/dataset-items-get.md) and work for all datasets.

## Getting the results from key-value stores

Datasets are great for structured data, but are not suited for binary files like images or PDFs. In these cases, Actors store their output in key-value stores (https://docs.apify.com/platform/storage/key-value-store.md). One such Actor is the **HTML String To PDF** (https://apify.com/mhamas/html-string-to-pdf) converter. Let's run it.


```
$response = $client->post('acts/mhamas~html-string-to-pdf/runs', [
    'json' => [
        'htmlString' => 'Hello World'
    ],
]);
$parsedResponse = \json_decode($response->getBody(), true);
$data = $parsedResponse['data'];

echo \json_encode($data, JSON_PRETTY_PRINT);
```


Keep track of the returned run ID.

Similar to datasets, we can get overall info about the default key-value store.


```
// Replace <RUN_ID> with the ID returned by the code above
$response = $client->get('actor-runs/<RUN_ID>/key-value-store');
$parsedResponse = \json_decode($response->getBody(), true);
$data = $parsedResponse['data'];

echo \json_encode($data, JSON_PRETTY_PRINT);
```


The items in key-value stores are not structured, so we cannot use the same approach as we did with dataset items. We can obtain some information about a store's content using its **keys** endpoint.


```
// Don't forget to replace <RUN_ID> with the ID you got earlier
$response = $client->get('actor-runs/<RUN_ID>/key-value-store/keys');
$parsedResponse = \json_decode($response->getBody(), true);
$data = $parsedResponse['data'];

echo \json_encode($data, JSON_PRETTY_PRINT);
```


We can see that there are two record keys: `INPUT` and `OUTPUT`. The HTML String to PDF Actor's README states that the PDF is stored under the `OUTPUT` key. Let's download it:


```
// Don't forget to replace <RUN_ID> with the run ID
$response = $client->get('actor-runs/<RUN_ID>/key-value-store/records/OUTPUT');
// Make sure that the destination (filename) is writable
file_put_contents(__DIR__ . '/hello-world.pdf', $response->getBody());
```


If you open the generated `hello-world.pdf` file, you should see... well, "Hello World".

If the Actor stored the data in a key-value store other than the default, we can use the standalone endpoints, `key-value-stores/<STORE_ID>`, `key-value-stores/<STORE_ID>/keys`, and `key-value-stores/<STORE_ID>/records/<KEY>`. They behave the same way as the default endpoints. See the key-value stores API documentation (https://docs.apify.com/api/v2/storage-key-value-stores.md) for details.

## When are the data ready

It takes some time for an Actor to generate its output. Some Actors even run for days! In the previous examples, we chose Actors whose runs only take a few seconds. This meant the runs had enough time to finish before we ran the code to retrieve their dataset or key-value store (so the Actor had time to produce some output). If we ran the code immediately after starting a longer-running Actor, the dataset would probably still be empty.

For Actors that are expected to be quick, we can use the `waitForFinish` parameter. Then, the running Actor's endpoint does not respond immediately but waits until the run finishes (up to the given limit). Let's try this with the HTML String to PDF Actor.


```
$response = $client->post('acts/mhamas~html-string-to-pdf/runs', [
    'json' => [
        'htmlString' => 'Hi World'
    ],
    // Pass in how long we want to wait, in seconds
    'query' => [ 'waitForFinish' => 60 ]
]);
$parsedResponse = \json_decode($response->getBody(), true);
$data = $parsedResponse['data'];

echo \json_encode($data, JSON_PRETTY_PRINT);

$runId = $data['id'];
$response = $client->get(sprintf('actor-runs/%s/key-value-store/records/OUTPUT', $runId));
file_put_contents(__DIR__ . '/hi-world.pdf', $response->getBody());
```


## Webhooks

For Actors that take longer to run, we can use webhooks (https://docs.apify.com/platform/integrations/webhooks.md). A webhook is an HTTP POST request that is sent to a specified URL when an Actor's status changes. We can use webhooks as a kind of notification that is sent when a run finishes. You can set them up using query parameters. If we used webhooks in the example above, it would look like this:


```
// Webhooks need to be passed as a base64-encoded JSON string
$webhooks = \base64_encode(\json_encode([
    [
        // The webhook can be sent on multiple events
        // this one fires when the run succeeds
        'eventTypes' => ['ACTOR.RUN.SUCCEEDED'],
        // Set this to some url that you can react to
        // To see what is sent to the URL,
        // you can set up a temporary request bin at https://requestbin.com/r
        'requestUrl' => '',
    ],
]));
$response = $client->post('acts/mhamas~html-string-to-pdf/runs', [
    'json' => [
        'htmlString' => 'Hello World'
    ],
    'query' => [ 'webhooks' => $webhooks ]
]);
```


## How to use Apify Proxy

Let's use another important feature: Apify Proxy (https://docs.apify.com/platform/proxy.md). If you want to make sure that your server's IP address won't get blocked somewhere when making requests, you can use the automatic proxy selection mode.


```
$client = new \GuzzleHttp\Client([
    // Replace <YOUR_PROXY_PASSWORD> below with your password
    // found at https://console.apify.com/proxy
    'proxy' => 'http://auto:<YOUR_PROXY_PASSWORD>@proxy.apify.com:8000'
]);

// This request will be made through an automatically chosen proxy
$response = $client->get("http://proxy.apify.com/?format=json");
echo $response->getBody();
```


If you want to maintain the same IP between requests, you can use the session mode.


```
$client = new \GuzzleHttp\Client([
    // Replace <YOUR_PROXY_PASSWORD> below with your password
    // found at https://console.apify.com/proxy
    'proxy' => 'http://session-my_session:<YOUR_PROXY_PASSWORD>@proxy.apify.com:8000'
]);

// Both responses should contain the same clientIp
$response = $client->get("https://api.apify.com/v2/browser-info");
echo $response->getBody();

$response = $client->get("https://api.apify.com/v2/browser-info");
echo $response->getBody();
```


See https://docs.apify.com/platform/proxy/usage.md for more details on using specific proxies.

## Feedback

Are you interested in an Apify PHP client or other PHP-related content? Do you have some feedback on this tutorial? Let us know at https://apify.typeform.com/to/KqhmiJge#source=tutorial_use_apify_from_php!


---

# Puppeteer & Playwright course

**Learn in-depth how to use two of the most popular Node.js libraries for controlling a headless browser - Puppeteer and Playwright.**

***

Puppeteer (https://pptr.dev/) and Playwright (https://playwright.dev/) are libraries that allow you to automate browsing. Based on your instructions, they can open a browser window, load a website, click on links, etc. They can also do this *headlessly*, i.e., in a way that the browser window isn't visible, which is faster.

Both packages were developed by the same team and are very similar, which is why we have combined the Puppeteer course and the Playwright course into one super-course that shows code examples for both technologies. The two differ in only small ways, and those will always be highlighted in the examples.

> Each lesson's activity will contain examples for both libraries, but we recommend using Playwright, as it is newer and has more features and better documentation (https://playwright.dev/docs/intro).

## Advantages of using a headless browser

When automating a headless browser, you can do a whole lot more in comparison to making HTTP requests for static content. In fact, you can programmatically do pretty much anything a human could do with a browser, such as clicking elements, taking screenshots, typing into text areas, etc.

Additionally, since the requests aren't static, dynamic pages (https://docs.apify.com/academy/concepts/dynamic-pages.md) can be rendered and interacted with, and data from the dynamic content can be scraped. Turn on headed mode (https://playwright.dev/docs/api/class-testoptions#test-options-headless) with `headless: false` to see exactly what the browser is doing.

Browsers can also be effective for overcoming anti-scraping measures (https://docs.apify.com/academy/anti-scraping.md), especially if the website is running browser challenges (https://docs.apify.com/academy/anti-scraping/techniques/browser-challenges.md).

## Disadvantages of headless browsers

Browsers are slow and expensive to run. In the follow-up courses, the Apify Academy will show you how to scrape websites without a browser. Every website can potentially be reverse-engineered into a series of quick and cheap HTTP calls, but it might require significant effort and specialized knowledge.

## Setup

For this course, we'll be jumping right into the features of these awesome libraries and expecting you to already have an environment set up. Here's how we set up our environment:

1. Make sure you've installed Node.js (https://nodejs.org/en/)
2. Create a new folder called **puppeteer-playwright** (or whatever you want to call it)
3. Run the command `npm init -y` within your new folder to automatically initialize the project
4. Add `"type": "module"` to the **package.json** file
5. Create a new file named **index.js**
6. Install the library you're going to be using during this course:

* Install Playwright
* Install Puppeteer


```
npm install playwright
```



```
npm install puppeteer
```


> For a more in-depth guide on how to set up the basic environment we'll be using in this tutorial, check out the https://docs.apify.com/academy/web-scraping-for-beginners/data-extraction/computer-preparation.md lesson in the **Web scraping basics for JavaScript devs** course

## Course overview

1. https://docs.apify.com/academy/puppeteer-playwright/browser.md

2. https://docs.apify.com/academy/puppeteer-playwright/page.md

   * https://docs.apify.com/academy/puppeteer-playwright/page/interacting-with-a-page.md
   * https://docs.apify.com/academy/puppeteer-playwright/page/waiting.md
   * https://docs.apify.com/academy/puppeteer-playwright/page/page-methods.md

3. https://docs.apify.com/academy/puppeteer-playwright/executing-scripts.md

   * https://docs.apify.com/academy/puppeteer-playwright/executing-scripts/injecting-code.md
   * https://docs.apify.com/academy/puppeteer-playwright/executing-scripts/collecting-data.md

4. https://docs.apify.com/academy/puppeteer-playwright/reading-intercepting-requests.md

5. https://docs.apify.com/academy/puppeteer-playwright/proxies.md

6. https://docs.apify.com/academy/puppeteer-playwright/browser-contexts.md

7. https://docs.apify.com/academy/puppeteer-playwright/common-use-cases.md

## First up

In the first lesson (https://docs.apify.com/academy/puppeteer-playwright/browser.md) of this course, we'll be learning a bit about how to create and use the **Browser** object.


---

# Browser

**Understand what the Browser object is in Puppeteer/Playwright, how to create one, and a bit about how to interact with one.**

***

In order to automate a browser in Playwright or Puppeteer, we need to open one up programmatically. Playwright supports Chromium, Firefox, and WebKit (Safari), while Puppeteer only supports Chromium-based browsers. To keep the examples for both libraries comparable, we've chosen to use Chromium in the Playwright examples as well.

Let's start by using the `launch()` function in the **index.js** file we created in the intro to this course:

* Playwright
* Puppeteer


```
import { chromium } from 'playwright';

await chromium.launch();

console.log('launched!');
```



```
import puppeteer from 'puppeteer';

await puppeteer.launch();

console.log('launched!');
```


When we run this code with the command `node index.js`, a browser will open up; however, we won't actually see anything. This is because the default mode of a browser after `launch()`ing it is **headless**, meaning that it has no visible UI.

> If you run this code right now, it will hang. Use **Control** + **C** to force quit the program.

## Launch options

In order to see what's actually happening, we can pass an **options** object (https://pptr.dev/#?product=Puppeteer&version=v13.7.0&show=api-puppeteerlaunchoptions, https://playwright.dev/docs/api/class-browsertype#browser-type-launch) with **headless** set to **false**.

* Playwright
* Puppeteer


```
import { chromium } from 'playwright';

const browser = await chromium.launch({ headless: false });
await browser.newPage();
```



```
import puppeteer from 'puppeteer';

const browser = await puppeteer.launch({ headless: false });
await browser.newPage();
```


Now we'll actually see a browser open up.

![Chromium browser opened by Puppeteer/Playwright](/assets/images/chromium-844298b27f771e8c1bb0441bf5572180.jpg)

You can pass a whole lot more options to the `launch()` function. We'll be getting into those a little bit later on.
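As a small preview, here is a sketch of how launch options might be switched between a debugging setup and a normal run. The `buildLaunchOptions` helper and its `debug` flag are hypothetical names made up for this example; `headless` and `slowMo` are launch options accepted by both Puppeteer and Playwright.

```javascript
// Hypothetical helper: builds launch options for local debugging vs. normal runs.
function buildLaunchOptions({ debug = false } = {}) {
    return {
        // Show the browser window only while debugging
        headless: !debug,
        // slowMo delays each browser operation by the given number of milliseconds
        slowMo: debug ? 250 : 0,
    };
}

// Possible usage (not executed here):
// const browser = await chromium.launch(buildLaunchOptions({ debug: true }));
```

The 250 ms delay is an arbitrary value chosen so that the actions are easy to follow by eye.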

## Browser methods

The `launch()` function also returns a **Browser** object (https://pptr.dev/#?product=Puppeteer&version=v13.7.0&show=api-class-browser, https://playwright.dev/docs/api/class-browser), which is a representation of the browser. This object has many methods, which allow us to interact with the browser from our code. One of them is `close()`. Until now, we've been using **Control** + **C** to force quit the process, but with this function, we'll no longer have to do that.

* Playwright
* Puppeteer


```
import { chromium } from 'playwright';

const browser = await chromium.launch({ headless: false });
await browser.newPage();

// code will be here in the future

await browser.close();
```



```
import puppeteer from 'puppeteer';

const browser = await puppeteer.launch({ headless: false });
await browser.newPage();

// code will be here in the future

await browser.close();
```


## Next up

Now that we can open a browser, let's move on to the next lesson (https://docs.apify.com/academy/puppeteer-playwright/page.md), where we will learn how to create pages and visit websites programmatically.


---

# Creating multiple browser contexts

**Learn what a browser context is, how to create one, how to emulate devices, and how to use browser contexts to automate multiple sessions at one time.**

***

A BrowserContext (https://playwright.dev/docs/api/class-browsercontext) is an isolated incognito session within a **Browser** instance. This means that contexts can have different device/screen size configurations, different language and color scheme settings, etc. Multiple browser contexts are useful when automating logins to several accounts simultaneously, or in any other case where multiple sessions are required.

When we create a **Browser** object by using the `launch()` function, a single browser context (https://playwright.dev/docs/browser-contexts) is automatically created. In order to create more, we use the `browser.newContext()` function (https://playwright.dev/docs/api/class-browser#browser-new-context) in Playwright, and `browser.createIncognitoBrowserContext()` (https://pptr.dev/#?product=Puppeteer&version=v14.1.0&show=api-browsercreateincognitobrowsercontextoptions) in Puppeteer.

* Playwright
* Puppeteer


```
const myNewContext = await browser.newContext();
```



```
const myNewContext = await browser.createIncognitoBrowserContext();
```
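To illustrate the multi-account scenario, the sketch below opens one isolated context and page per account. It is written against the Playwright `browser.newContext()` call shown above; the `openSessions` helper and the account names are hypothetical and only serve the example.

```javascript
// Hypothetical helper: opens one isolated context (and one page) per account name.
// `browser` is expected to be a Playwright Browser instance.
async function openSessions(browser, accountNames) {
    const sessions = new Map();
    for (const name of accountNames) {
        // Each context has its own cookies and storage, so logins don't collide
        const context = await browser.newContext();
        const page = await context.newPage();
        sessions.set(name, { context, page });
    }
    return sessions;
}

// Possible usage (not executed here):
// const sessions = await openSessions(browser, ['alice', 'bob']);
// await sessions.get('alice').page.goto('https://example.com/login');
```

Storing the sessions in a `Map` keyed by account name keeps each page easy to look up later.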


## Persistent vs non-persistent browser contexts

In both examples above, we are creating a new **non-persistent** browser context, which means that once it closes, all of its cookies, cache, etc. will be lost. For some cases, that's okay, but in most situations, the performance hit from this is too large. This is why we have **persistent** browser contexts. Persistent browser contexts open up a bit slower and they store all their cache, cookies, session storage, and local storage in a file on disk.

In Puppeteer, the **default** browser context is the persistent one, while in Playwright we have to use `BrowserType.launchPersistentContext()` (https://playwright.dev/docs/api/class-browsertype#browser-type-launch-persistent-context) instead of `BrowserType.launch()` in order for the default context to be persistent.

* Playwright
* Puppeteer


```
import { chromium } from 'playwright';

// Here, we launch a persistent browser context. The first
// argument is the location to store the data.
const browser = await chromium.launchPersistentContext('./persistent-context', { headless: false });

const page = await browser.newPage();

await browser.close();
```



```
import puppeteer from 'puppeteer';

const browser = await puppeteer.launch({ headless: false });

// This page will be under the default context, which is persistent.
// Cache, cookies, etc. will be stored on disk and persisted
const page = await browser.newPage();

await browser.close();
```


## Using browser contexts

In both Playwright and Puppeteer, various devices (iPhones, iPads, Androids, etc.) can be emulated by using `playwright.devices` (https://playwright.dev/docs/api/class-playwright#playwright-devices) or `puppeteer.devices` (https://pptr.dev/#?product=Puppeteer&version=v14.1.0&show=api-puppeteerdevices). We'll be using this to create two different browser contexts, one emulating an iPhone, and one emulating an Android device:

* Playwright
* Puppeteer


```
import { chromium, devices } from 'playwright';

// Launch the browser
const browser = await chromium.launch({ headless: false });

const iPhone = devices['iPhone 11 Pro'];
// Create a new context for our iPhone emulation
const iPhoneContext = await browser.newContext({ ...iPhone });
// Open a page on the newly created iPhone context
const iPhonePage = await iPhoneContext.newPage();

const android = devices['Galaxy Note 3'];
// Create a new context for our Android emulation
const androidContext = await browser.newContext({ ...android });
// Open a page on the newly created Android context
const androidPage = await androidContext.newPage();

// The code in the next step will go here

await browser.close();
```



```
import puppeteer from 'puppeteer';

// Launch the browser
const browser = await puppeteer.launch({ headless: false });

const iPhone = puppeteer.devices['iPhone 11 Pro'];
// Create a new context for our iPhone emulation
const iPhoneContext = await browser.createIncognitoBrowserContext();
// Open a page on the newly created iPhone context
const iPhonePage = await iPhoneContext.newPage();
// Emulate the device
await iPhonePage.emulate(iPhone);

const android = puppeteer.devices['Galaxy Note 3'];
// Create a new context for our Android emulation
const androidContext = await browser.createIncognitoBrowserContext();
// Open a page on the newly created Android context
const androidPage = await androidContext.newPage();
// Emulate the device
await androidPage.emulate(android);

// The code in the next step will go here

await browser.close();
```


Then, we'll make both `iPhonePage` and `androidPage` visit https://www.deviceinfo.me/, which is a website that displays the type of device you have, the operating system you're using, and more device and location-specific information.


```
// Go to deviceinfo.me on both at the same time
await Promise.all([iPhonePage.goto('https://www.deviceinfo.me/'), androidPage.goto('https://www.deviceinfo.me/')]);

// Wait for 10 seconds on both before shutting down
await Promise.all([iPhonePage.waitForTimeout(10000), androidPage.waitForTimeout(10000)]);
```


Let's go ahead and run our code and analyze the data on each **deviceinfo.me** page. Here's what we see:

![deviceinfo.me results for both browser contexts](/assets/images/dual-contexts-1cf77aac6062264d0ba205af600f5c5a.jpg)

We see that **deviceinfo.me** detects both contexts as using different devices, despite the fact they're visiting the same page at the same time. This shows firsthand that different browser contexts can have totally different configurations, as they all have separate sessions.

## Accessing browser contexts

When working with multiple browser contexts, it can be difficult to keep track of all of them and making changes becomes a repetitive job. This is why the **Browser** instance returned from the `launch()` function also has a `contexts()` function (`browserContexts()` in Puppeteer). This function returns an array of all the contexts that are currently attached to the browser.

Let's go ahead and use this function to loop through all of our browser contexts and make them log **Site visited** to the console whenever the website is visited:

* Playwright
* Puppeteer


```
for (const context of browser.contexts()) {
    // In Playwright, lots of events are supported in the "on" function of
    // a BrowserContext instance
    context.on('request', (req) => req.url() === 'https://www.deviceinfo.me/' && console.log('Site visited'));
}
```



```
for (const context of browser.browserContexts()) {
    // In Puppeteer, only three events are supported in the "on" function
    // of a BrowserContext instance
    context.on('targetchanged', () => console.log('Site visited'));
}
```


After adding this above our `page.goto`s and running the code once again, we see this logged to the console:


```
Site visited
Site visited
```


Cool! We've modified both our `iPhoneContext` and `androidContext`, as well as our default context, to log the message.

> Note that the Puppeteer code and Playwright code are slightly different in the examples above. The Playwright code will log **Site visited** any time the specific URL is visited, while the Puppeteer code will log any time the target URL is changed to anything.

Finally, in Puppeteer, you can use the `browser.defaultBrowserContext()` function to grab hold of the default context at any point.
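One possible use of the default context is telling it apart from the contexts we created ourselves. The sketch below is a hypothetical `getExtraContexts` helper built only on the `browserContexts()` and `defaultBrowserContext()` functions mentioned above.

```javascript
// Hypothetical helper: returns every context except the default one.
// `browser` is expected to be a Puppeteer Browser instance.
function getExtraContexts(browser) {
    const defaultContext = browser.defaultBrowserContext();
    return browser.browserContexts().filter((context) => context !== defaultContext);
}

// Possible usage (not executed here):
// for (const context of getExtraContexts(browser)) await context.close();
```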

## Wrap up

Thus far in this course, you've learned how to launch a browser, open a page, run scripts on a page, extract data from a page, intercept requests made on the page, use proxies, and use multiple browser contexts. Stay tuned for new lessons!


---

# Common use cases

**Learn about some of the most common use cases of Playwright and Puppeteer, and how to handle these use cases when you run into them.**

***

You can do just about anything with a headless browser, but there are some extremely common use cases that are important to understand and be prepared for. This short section is all about solving these common situations. Here's what we'll be covering:

1. Login flow (logging into an account)
2. Paginating through results on a website
3. Solving browser challenges (ex. captchas)
4. More!

## Next up

The first lesson (https://docs.apify.com/academy/puppeteer-playwright/common-use-cases/logging-into-a-website.md) of this section is all about logging into a website and running multiple concurrent operations within a user's account.


---

# Downloading files

**Learn how to automatically download and save files to the disk using two of the most popular web automation libraries, Puppeteer and Playwright.**

***

Downloading a file using Puppeteer can be tricky. On some systems, there can be issues with the usual file saving process that prevent you from doing it in a straightforward way. However, there are different techniques that work (most of the time).

These techniques are only necessary when we don't have a direct file link, which is usually the case when the file being downloaded is based on more complicated data export.

## Setting up a download path

Let's start with the easiest technique. This method tells the browser which folder to save a file into when Puppeteer triggers a download by clicking on a link or button.


```
const client = await page.target().createCDPSession();
await client.send('Page.setDownloadBehavior', { behavior: 'allow', downloadPath: './my-downloads' });
```


We use the mysterious `client` API, which gives us access to all the functions of the underlying Chrome DevTools Protocol (https://pptr.dev/api/puppeteer.cdpsession), on top of which Puppeteer and Playwright are built. Basically, it extends Puppeteer's functionality. Then we can download the file by clicking on the button.


```
await page.click('.export-button');
```


Let's wait for one minute. In a real use case, you want to check the state of the file in the file system.


```
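// page.waitFor() was removed from recent Puppeteer versions, so use a plain timer
await new Promise((resolve) => setTimeout(resolve, 60000));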
```
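Checking the state of the file could look roughly like the sketch below, which polls a directory listing instead of waiting a fixed minute. The `waitForDownload` helper and its options are hypothetical names introduced for this example. It also assumes that Chromium marks unfinished downloads with a `.crdownload` suffix, so adjust the filter if your browser behaves differently.

```javascript
// Hypothetical helper: polls a directory listing until a finished download appears.
// `listFiles` is any function returning an array of file names,
// e.g. () => fs.readdirSync('./my-downloads').
// Assumption: unfinished Chromium downloads carry a `.crdownload` suffix.
async function waitForDownload(listFiles, { timeoutMs = 60000, intervalMs = 500 } = {}) {
    const deadline = Date.now() + timeoutMs;
    while (Date.now() <= deadline) {
        const finished = listFiles().filter((name) => !name.endsWith('.crdownload'));
        if (finished.length > 0) return finished[0];
        // Nothing finished yet, wait a bit before looking again
        await new Promise((resolve) => setTimeout(resolve, intervalMs));
    }
    throw new Error(`No finished download appeared within ${timeoutMs} ms`);
}
```

Passing the directory listing in as a function keeps the helper independent of the file system, which also makes it simple to test.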


To extract the file from the file system into memory, we have to first find its name, and then we can read it.


```
import fs from 'fs';

const fileNames = fs.readdirSync('./my-downloads');

// Let's pick the first one
const fileData = fs.readFileSync(`./my-downloads/${fileNames[0]}`);

// ...Now we can do whatever we want with the data
```


## Intercepting and replicating a file download request

For this second option, we can trigger the file download, intercept the request going out, and then replicate it to get the actual data. First, we need to enable request interception. This is done using the following line of code:


```
await page.setRequestInterception(true);
```


Next, we need to trigger the actual file export. We might need to fill in some form, select an exported file type, etc. In the end, it will look something like this:


```
page.click('.export-button'); // not awaited on purpose
```


We don't need to await this promise since we'll be waiting for the result of this action anyway (the triggered request).

The crucial part is intercepting the request that would result in downloading the file. Since the interception is already enabled, we just need to wait for the request to be sent.


```
const xRequest = await new Promise((resolve) => {
    page.on('request', (interceptedRequest) => {
        interceptedRequest.abort(); // abort it so the browser doesn't download the file itself
        resolve(interceptedRequest);
    });
});
```
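
The listener above treats the first request it sees as the download request, which only holds when clicking the button triggers nothing else. A slightly more defensive sketch is shown below. `waitForInterceptedRequest` and the `isDownloadRequest` predicate are hypothetical names, and the `/export` URL check in the usage comment is an assumption about the target site. It relies only on `page.on('request')` plus the `continue()` and `abort()` methods of intercepted requests.

```javascript
// Hypothetical helper: resolves with the first intercepted request accepted by `isDownloadRequest`.
function waitForInterceptedRequest(page, isDownloadRequest) {
    return new Promise((resolve) => {
        let captured = false;

        page.on('request', (interceptedRequest) => {
            if (captured || !isDownloadRequest(interceptedRequest)) {
                // Let unrelated requests (images, scripts, XHRs) through untouched
                interceptedRequest.continue();
                return;
            }

            captured = true;
            // Abort so the browser does not download the file itself
            interceptedRequest.abort();
            resolve(interceptedRequest);
        });
    });
}

// Usage (request interception must already be enabled):
// const pendingRequest = waitForInterceptedRequest(page, (req) => req.url().includes('/export'));
// page.click('.export-button');
// const xRequest = await pendingRequest;
```

Registering the listener before clicking avoids missing a request that fires immediately, and continuing every other request keeps the page from stalling while interception is on.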


The last step is to convert the intercepted Puppeteer request into a `request-promise` options object. We need to have the `request-promise` package installed (note that it is now marked as deprecated on npm).


```
import request from 'request-promise';
```


The intercepted request does not include cookies, so we need to add them ourselves.


```
const options = {
    encoding: null, // return the body as a Buffer instead of a string
    // Use the public accessors rather than private underscore-prefixed fields
    method: xRequest.method(),
    uri: xRequest.url(),
    body: xRequest.postData(),
    headers: xRequest.headers(),
};

// Add the cookies
const cookies = await page.cookies();
options.headers.Cookie = cookies.map((ck) => `${ck.name}=${ck.value}`).join('; ');

// Resend the request
const response = await request(options);
```


Now, the response contains the binary data of the downloaded file. It can be saved to the disk, uploaded somewhere, or [submitted with another form](https://docs.apify.com/academy/puppeteer-playwright/common-use-cases/submitting-a-form-with-a-file-attachment.md).
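
As an alternative to `request-promise`, the same replay can be sketched with the `fetch()` function built into Node.js 18 and later. This is a swapped-in technique rather than the method used above. `toFetchArgs` and `replayAndSave` are hypothetical helper names, and the sketch assumes the intercepted request exposes Puppeteer's `method()`, `url()`, `postData()` and `headers()` accessors and that the server accepts the copied headers as they are.

```javascript
import { writeFile } from 'node:fs/promises';

// Hypothetical helper: converts an intercepted request and the page's cookies into fetch() arguments.
function toFetchArgs(interceptedRequest, cookies) {
    // Copy the headers so the intercepted request itself is not mutated
    const headers = { ...interceptedRequest.headers() };
    if (cookies.length > 0) {
        headers.cookie = cookies.map(({ name, value }) => `${name}=${value}`).join('; ');
    }

    return [
        interceptedRequest.url(),
        {
            method: interceptedRequest.method(),
            headers,
            body: interceptedRequest.postData(), // undefined for requests without a body
        },
    ];
}

// Hypothetical helper: resends the request and writes the response body to `filePath`.
async function replayAndSave(interceptedRequest, cookies, filePath) {
    const [url, init] = toFetchArgs(interceptedRequest, cookies);
    const response = await fetch(url, init);
    if (!response.ok) throw new Error(`Download failed with status ${response.status}`);

    const fileData = Buffer.from(await response.arrayBuffer());
    await writeFile(filePath, fileData);
    return fileData;
}

// Usage:
// const fileData = await replayAndSave(xRequest, await page.cookies(), './export.bin');
```

Keeping the conversion in its own function makes it easy to inspect or adjust the headers before the request is sent again.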


---

# Logging into a website

**Understand the "login flow" - logging into a website, then maintaining a logged-in status within different browser contexts for an efficient automation process.**

***

Whether it's auto-renewing a service, automatically sending a message on an interval, or automatically cancelling a Netflix subscription, one of the most popular things headless browsers are used for is automating things within a user's account on a certain website. Of course, automating anything on a user's account requires the automation of the login process as well. In this lesson, we'll be covering how to build a login flow from start to finish with Playwright or Puppeteer.

> In this lesson, we'll be using [Yahoo](https://www.yahoo.com/) as an example. Feel free to follow along using the academy Yahoo account credentials, or even deviate from the lesson a bit and try building a login flow for a different website of your choosing!

## Inputting credentials

The full logging in process on Yahoo goes like this:

1. Accept their cookies policy, then load the main page.
2. Click on the **Sign in** button and load the sign-in page.
3. Enter the username and click the button.
4. Enter the password and click the button, then load the main page again (but now logged in).

When we lay out the steps like this in [pseudocode](https://en.wikipedia.org/wiki/Pseudocode), it becomes significantly easier to translate them into code. Here are the four steps above in JavaScript:

* Playwright
* Puppeteer


```
import { chromium } from 'playwright';

// Launch a browser and open a page
const browser = await chromium.launch({ headless: false });
const page = await browser.newPage();

await page.goto('https://www.yahoo.com/');

// Agree to the cookies terms, then click on the "Sign in" button
await page.click('button[name="agree"]');