475 changes: 475 additions & 0 deletions website/versioned_docs/version-1.5/changelog.md

Large diffs are not rendered by default.

36 changes: 36 additions & 0 deletions website/versioned_docs/version-1.5/index.md
@@ -0,0 +1,36 @@
---
title: Apify CLI overview
description: An introduction to Apify CLI, a command-line interface for creating, developing, building, and running Apify Actors and managing the Apify cloud platform.
sidebar_label: Overview
---

The Apify command-line interface (Apify CLI) helps you create, develop, build, and run [Apify Actors](https://apify.com/actors), and manage the Apify platform from any computer.

## What are Apify Actors?

Actors are cloud programs that can perform arbitrary web scraping, automation, or data processing jobs. They accept input, perform their job, and generate output.

## Why use Apify CLI?

The Apify CLI enables you to develop Actors locally on your computer using your preferred tools:

- Your favorite code editor
- Version control system
- Development tools and workflows

This gives you full control over your development environment and makes it easier to work on complex projects. You can leverage the [Apify SDK](https://github.com/apify/apify-sdk-js) with all its powerful features, then push your Actor to the Apify platform for deployment when ready.

:::note Actor development environment

Actors run in Docker containers on the Apify platform. With an appropriate `Dockerfile`, you can build Actors in any programming language. We recommend JavaScript/Node.js and Python, for which we have the most comprehensive libraries and support.

:::

## Learn more

Learn everything you need to use the Apify CLI effectively:

- Learn how to [install](./installation.md) the CLI on your system
- Get started with your [first Actor project](./quick-start.md)
- See the complete [reference of all CLI commands](./reference.md) and options
- Find [solutions to common issues](./troubleshooting.md)
87 changes: 87 additions & 0 deletions website/versioned_docs/version-1.5/installation.md
@@ -0,0 +1,87 @@
---
title: Installation
description: Learn how to install Apify CLI using installation scripts, Homebrew, or NPM.
---

Learn how to install Apify CLI using installation scripts, Homebrew, or NPM.

---

## Installation scripts

### macOS / Linux

```bash
curl -fsSL https://apify.com/install-cli.sh | bash
```

### Windows

```powershell
irm https://apify.com/install-cli.ps1 | iex
```

:::tip No need for Node.js

If you install Apify CLI using our installation scripts, you don't need Node.js. The scripts use [Bun](https://bun.sh/) to create a standalone executable file.

This approach eliminates Node.js dependency management, which is useful for Python developers or users working in non-Node.js environments.

:::

## Homebrew

```bash
brew install apify-cli
```

:::tip Homebrew and Node.js dependency

When you install Apify CLI using Homebrew, it automatically installs Node.js as a dependency. If you already have Node.js installed through another method (e.g., `nvm`), this may create version conflicts.

If you experience Node.js version conflicts, modify your `PATH` environment variable to prioritize your preferred Node.js installation over Homebrew's version.

:::

## NPM

First, make sure you have [Node.js](https://nodejs.org) version 22 or higher with NPM installed on your computer:

```bash showLineNumbers
node --version
npm --version
```

Install or upgrade Apify CLI by running:

```bash
npm install -g apify-cli
```

:::tip Troubleshooting

If you receive a permission error, read npm's [official guide](https://docs.npmjs.com/resolving-eacces-permissions-errors-when-installing-packages-globally) on installing packages globally.

:::

## Verify installation

You can verify the installation by running the following command:

```bash
apify --version
```

The output should resemble the following (exact details like version or platform may vary):

```bash
apify-cli/1.0.1 (0dfcfd8) running on darwin-arm64 with bun-1.2.19 (emulating node 24.3.0), installed via bundle
```
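If you want to check the installed version from a script (for example, in a CI job), the following is a minimal sketch. It assumes the output keeps the `apify-cli/<version> ...` shape shown in the sample above, which is not guaranteed; the helper names `parse_cli_version` and `installed_cli_version` are hypothetical.

```python
import re
import shutil
import subprocess
from typing import Optional

# Assumption: the first token of the output looks like "apify-cli/1.0.1",
# as in the sample output above. The exact format may change between releases.
VERSION_PATTERN = re.compile(r"^apify-cli/(?P<version>\d+\.\d+\.\d+\S*)")


def parse_cli_version(output: str) -> Optional[str]:
    """Extract the version from `apify --version` output, or return None."""
    match = VERSION_PATTERN.match(output.strip())
    return match.group("version") if match else None


def installed_cli_version() -> Optional[str]:
    """Return the installed Apify CLI version, or None if it cannot be determined."""
    if shutil.which("apify") is None:
        # The `apify` executable is not on PATH.
        return None
    result = subprocess.run(
        ["apify", "--version"], capture_output=True, text=True, check=False
    )
    return parse_cli_version(result.stdout)
```

Calling `installed_cli_version()` returns `None` both when the CLI is missing and when the output does not match the assumed pattern, so treat `None` as "unknown" rather than "not installed".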

## Upgrading

Upgrading Apify CLI is as simple as running the following command:

```bash
apify upgrade
```
188 changes: 188 additions & 0 deletions website/versioned_docs/version-1.5/integrating-scrapy.md
@@ -0,0 +1,188 @@
---
title: Integrating Scrapy projects
description: Learn how to run Scrapy projects as Apify Actors and deploy them on the Apify platform.
sidebar_label: Integrating Scrapy projects
---

[Scrapy](https://scrapy.org/) is a widely used open-source web scraping framework for Python. Scrapy projects can now be executed on the Apify platform using our dedicated wrapping tool. This tool allows users to transform their Scrapy projects into [Apify Actors](https://docs.apify.com/platform/actors) with just a few simple commands.

## Prerequisites

Before you begin, make sure you have the Apify CLI installed on your system. If you haven't installed it yet, follow the [installation guide](./installation.md).

## Actorization of your existing Scrapy spider

Assuming your Scrapy project is set up, navigate to the project root where the `scrapy.cfg` file is located.

```bash
cd your_scraper
```

Verify the directory contents to ensure the correct location.

```bash showLineNumbers
$ ls -R
.:
your_scraper README.md requirements.txt scrapy.cfg

./your_scraper:
__init__.py items.py __main__.py main.py pipelines.py settings.py spiders

./your_scraper/spiders:
your_spider.py __init__.py
```

To convert your Scrapy project into an Apify Actor, initiate the wrapping process by executing the following command:

```bash
apify init
```

The script will prompt you with a series of questions. Upon completion, the output might resemble the following:

```bash showLineNumbers
Info: The current directory looks like a Scrapy project. Using automatic project wrapping.
? Enter the Scrapy BOT_NAME (see settings.py): books_scraper
? What folder are the Scrapy spider modules stored in? (see SPIDER_MODULES in settings.py): books_scraper.spiders
? Pick the Scrapy spider you want to wrap: BookSpider (/home/path/to/actor-scrapy-books-example/books_scraper/spiders/book.py)
Info: Downloading the latest Scrapy wrapper template...
Info: Wrapping the Scrapy project...
Success: The Scrapy project has been wrapped successfully.
```

For example, here is the [source code](https://github.com/apify/actor-scrapy-books-example) of an actorized Scrapy project, and [here](https://apify.com/vdusek/scrapy-books-example) is the corresponding Actor in Apify Store.

### Run the Actor locally

Create a Python virtual environment by running:

```bash
python -m virtualenv .venv
```

Activate the virtual environment:

```bash
source .venv/bin/activate
```

Install Python dependencies using the provided requirements file named `requirements_apify.txt`. Ensure these requirements are installed before executing your project as an Apify Actor locally. You can put your own dependencies there as well.

```bash
pip install -r requirements_apify.txt [-r requirements.txt]
```

Finally, execute the Apify Actor.

```bash
apify run [--purge]
```

If [ActorDatasetPushPipeline](https://github.com/apify/apify-sdk-python/blob/master/src/apify/scrapy/pipelines.py) is configured, the Actor's output will be stored in the `storage/datasets/default/` directory.
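To inspect the scraped items after a local run, you can read them back from that directory. The sketch below assumes each item is stored as its own `.json` file and that any metadata file starts with a double underscore; both are assumptions about the local storage layout, and `load_local_dataset` is a hypothetical helper.

```python
import json
from pathlib import Path


def load_local_dataset(dataset_dir="storage/datasets/default"):
    """Load all items from a local dataset directory into a list."""
    items = []
    for path in sorted(Path(dataset_dir).glob("*.json")):
        # Assumption: files such as "__metadata__.json" describe the dataset
        # itself and are not scraped items, so they are skipped.
        if path.name.startswith("__"):
            continue
        with path.open(encoding="utf-8") as file:
            items.append(json.load(file))
    return items
```

For example, `len(load_local_dataset())` gives the number of items the last local run produced, provided the layout matches the assumptions above.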

### Run the scraper as Scrapy project

The project remains executable as a Scrapy project.

```bash
scrapy crawl your_spider -o books.json
```

## Deploy on Apify

### Log in to Apify

You will need to provide your [Apify API Token](https://console.apify.com/settings/integrations) to complete this action.

```bash
apify login
```

### Deploy your Actor

This command will deploy and build the Actor on the Apify platform. You can find your newly created Actor under [Actors -> My Actors](https://console.apify.com/actors?tab=my).

```bash
apify push
```

## What the wrapping process does

The initialization command adds the necessary files to your project and updates some existing ones, while preserving its functionality as a typical Scrapy project. The additional requirements file, named `requirements_apify.txt`, includes the Apify Python SDK and other essential requirements. The `.actor/` directory contains the basic configuration of your Actor. We provide two new Python files, [main.py](https://github.com/apify/actor-templates/blob/master/templates/python-scrapy/src/main.py) and [\_\_main\_\_.py](https://github.com/apify/actor-templates/blob/master/templates/python-scrapy/src/__main__.py), which encapsulate the Scrapy project within an Actor. These files also import and use a few Scrapy components from our [Python SDK](https://github.com/apify/apify-sdk-python/tree/master/src/apify/scrapy). These components integrate Scrapy projects with the Apify platform. Further details about these components are provided in the following subsections.

### Scheduler

The [scheduler](https://docs.scrapy.org/en/latest/topics/scheduler.html) is a core component of Scrapy responsible for receiving and providing requests to be processed. To leverage the [Apify request queue](https://docs.apify.com/platform/storage/request-queue) for storing requests, a custom scheduler becomes necessary. Fortunately, Scrapy is a modular framework, allowing the creation of custom components. As a result, we have implemented the [ApifyScheduler](https://github.com/apify/apify-sdk-python/blob/master/src/apify/scrapy/scheduler.py). When using the Apify CLI wrapping tool, the scheduler is configured in the [src/main.py](https://github.com/apify/actor-templates/blob/master/templates/python-scrapy/src/main.py) file of your Actor.

### Dataset push pipeline

[Item pipelines](https://docs.scrapy.org/en/latest/topics/item-pipeline.html) are used for the processing of the results produced by your spiders. To handle the transmission of result data to the [Apify dataset](https://docs.apify.com/platform/storage/dataset), we have implemented the [ActorDatasetPushPipeline](https://github.com/apify/apify-sdk-python/blob/master/src/apify/scrapy/pipelines.py). When using the Apify CLI wrapping tool, the pipeline is configured in the [src/main.py](https://github.com/apify/actor-templates/blob/master/templates/python-scrapy/src/main.py) file of your Actor. It is assigned the highest integer value (1000), ensuring its execution as the final step in the pipeline sequence.
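As a rough illustration of why the value 1000 places the pipeline last: Scrapy runs item pipelines in ascending order of their configured integer. The sketch below uses a plain dictionary shaped like Scrapy's `ITEM_PIPELINES` setting. The `books_scraper.pipelines.CleanTitlePipeline` entry is hypothetical, and the dotted path of `ActorDatasetPushPipeline` is inferred from the linked source file, so treat it as an assumption.

```python
# Shaped like Scrapy's ITEM_PIPELINES setting: dotted class path -> order value.
ITEM_PIPELINES = {
    # Hypothetical project pipeline, used only for illustration.
    "books_scraper.pipelines.CleanTitlePipeline": 300,
    # Assumed import path, inferred from the linked SDK source file.
    "apify.scrapy.pipelines.ActorDatasetPushPipeline": 1000,
}


def execution_order(pipelines):
    """Return pipeline paths in the order Scrapy applies them (lowest value first)."""
    return [path for path, _ in sorted(pipelines.items(), key=lambda entry: entry[1])]


order = execution_order(ITEM_PIPELINES)
```

With this configuration, `order[-1]` is the dataset push pipeline, so items reach the Apify dataset only after every other pipeline has processed them.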

### Retry middleware

[Downloader middlewares](https://docs.scrapy.org/en/latest/topics/downloader-middleware.html) are a way to hook into Scrapy's request/response processing. Scrapy comes with various default middlewares, including the [RetryMiddleware](https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#module-scrapy.downloadermiddlewares.retry), designed to handle retries for requests that may have failed due to temporary issues. When integrating with the [Apify request queue](https://docs.apify.com/platform/storage/request-queue), this middleware must be extended so that it communicates with the request queue, marking requests either as handled or as ready for a retry. When using the Apify CLI wrapping tool, the default `RetryMiddleware` is disabled, and [ApifyRetryMiddleware](https://github.com/apify/apify-sdk-python/blob/master/src/apify/scrapy/middlewares/apify_retry.py) takes its place. Configuration for the middlewares is established in the [src/main.py](https://github.com/apify/actor-templates/blob/master/templates/python-scrapy/src/main.py) file of your Actor.

### HTTP proxy middleware

Another default Scrapy [downloader middleware](https://docs.scrapy.org/en/latest/topics/downloader-middleware.html) that requires replacement is [HttpProxyMiddleware](https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#module-scrapy.downloadermiddlewares.httpproxy). To use proxies managed through the Apify [ProxyConfiguration](https://github.com/apify/apify-sdk-python/blob/master/src/apify/proxy_configuration.py), we provide [ApifyHttpProxyMiddleware](https://github.com/apify/apify-sdk-python/blob/master/src/apify/scrapy/middlewares/apify_proxy.py). When using the Apify CLI wrapping tool, the default `HttpProxyMiddleware` is disabled, and [ApifyHttpProxyMiddleware](https://github.com/apify/apify-sdk-python/blob/master/src/apify/scrapy/middlewares/apify_proxy.py) takes its place. Additionally, inspect the [.actor/input_schema.json](https://github.com/apify/actor-templates/blob/master/templates/python-scrapy/.actor/input_schema.json) file, where proxy configuration is specified as an input property for your Actor. This input is processed together with the middleware configuration in [src/main.py](https://github.com/apify/actor-templates/blob/master/templates/python-scrapy/src/main.py).
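The replacement of both default middlewares can be pictured with a dictionary shaped like Scrapy's `DOWNLOADER_MIDDLEWARES` setting, where a value of `None` disables a middleware. This is only a sketch: the Apify dotted paths are inferred from the linked source files, and the order values `950` and `1000` are illustrative assumptions, not necessarily what the wrapper writes.

```python
# Shaped like Scrapy's DOWNLOADER_MIDDLEWARES setting.
DOWNLOADER_MIDDLEWARES = {
    # Setting a built-in middleware to None disables it.
    "scrapy.downloadermiddlewares.retry.RetryMiddleware": None,
    "scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware": None,
    # Assumed import paths and illustrative order values.
    "apify.scrapy.middlewares.apify_proxy.ApifyHttpProxyMiddleware": 950,
    "apify.scrapy.middlewares.apify_retry.ApifyRetryMiddleware": 1000,
}


def enabled_middlewares(middlewares):
    """Return the enabled middleware paths, sorted by their order value."""
    enabled = {path: order for path, order in middlewares.items() if order is not None}
    return sorted(enabled, key=enabled.get)


active = enabled_middlewares(DOWNLOADER_MIDDLEWARES)
```

Only the two Apify middlewares remain in `active`, which mirrors the "disabled, and takes its place" behavior described in the two sections above.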

## Known limitations

Running Scrapy projects on the Apify platform has some known limitations.

### Asynchronous code in spiders and other components

Scrapy's asynchronous execution is based on the [Twisted](https://twisted.org/) library, not on
[AsyncIO](https://docs.python.org/3/library/asyncio.html), which introduces some complications.

Due to the asynchronous nature of Actors, all of their code is executed as a coroutine inside `asyncio.run`.
In order to execute Scrapy code inside an Actor, following the section
[Run Scrapy from a script](https://docs.scrapy.org/en/latest/topics/practices.html?highlight=CrawlerProcess#run-scrapy-from-a-script)
from the official Scrapy documentation, we need to invoke the
[`CrawlerProcess.start`](https://github.com/scrapy/scrapy/blob/2.11.0/scrapy/crawler.py#L393:L427)
method. This method triggers Twisted's event loop, also known as a reactor.
Consequently, Twisted's event loop is executed within AsyncIO's event loop.
On top of that, AsyncIO code in spiders or other components requires a new
AsyncIO event loop, within which the coroutines from these components are executed. This means
an AsyncIO event loop runs inside the Twisted event loop, which itself runs inside the outer AsyncIO event loop.
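The core of the problem can be reproduced with the standard library alone, without Scrapy or Twisted. The following minimal sketch (the function names are hypothetical) tries to start a second AsyncIO event loop from code that already runs inside `asyncio.run`, which plain AsyncIO refuses to do:

```python
import asyncio


async def fetch_value():
    """Stand-in for a coroutine used inside a spider or another component."""
    return 42


async def outer():
    """Runs inside asyncio.run and tries to drive a second event loop."""
    coroutine = fetch_value()
    inner_loop = asyncio.new_event_loop()
    try:
        # Starting another loop while one is already running raises RuntimeError.
        return inner_loop.run_until_complete(coroutine)
    except RuntimeError as error:
        coroutine.close()  # Avoid a "coroutine was never awaited" warning.
        return f"RuntimeError: {error}"
    finally:
        inner_loop.close()


result = asyncio.run(outer())
```

Here `result` holds the error message rather than `42`, which is why nested event loops need special handling.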

We have resolved this issue by leveraging the [nest-asyncio](https://pypi.org/project/nest-asyncio/) library,
enabling the execution of nested AsyncIO event loops. For executing a coroutine within a spider or other component,
it is recommended to use Apify's instance of the nested event loop. Refer to the code example below or derive
inspiration from Apify's Scrapy components, such as the
[ApifyScheduler](https://github.com/apify/apify-sdk-python/blob/v1.5.0/src/apify/scrapy/scheduler.py#L114).

```python
from apify.scrapy.utils import nested_event_loop
...

# Coroutine execution inside a spider
nested_event_loop.run_until_complete(my_coroutine())
```

### More spiders per Actor

It is recommended to execute only one Scrapy spider per Apify Actor.

Mapping multiple Scrapy spiders to a single Apify Actor does not make much sense. We would have to create a separate
instance of the [request queue](https://docs.apify.com/platform/storage/request-queue) for every spider.
Also, every spider can produce a different output, resulting in a mess in the output
[dataset](https://docs.apify.com/platform/storage/dataset). One solution could be to store the output
of every spider in a different [key-value store](https://docs.apify.com/platform/storage/key-value-store). However,
a much simpler solution is to have a single spider per Actor.

If you want to share common Scrapy components (middlewares, item pipelines, ...) among multiple spiders (Actors), you
can put them in a dedicated Python package and install it in each Actor's environment. Another
option is to keep multiple spiders in one Actor, but execute only one spider per Actor run.
Which spider is executed in an Actor run can be specified in the
[input schema](https://docs.apify.com/academy/deploying-your-code/input-schema).
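A minimal sketch of that second option is shown below. It assumes a hypothetical `spiderName` input field and a hypothetical registry of spiders; neither is part of the wrapper template, and the string values stand in for real spider classes.

```python
# Hypothetical default used when the Actor input does not name a spider.
DEFAULT_SPIDER = "books"


def select_spider(actor_input, available_spiders):
    """Pick the single spider to execute in this Actor run.

    `actor_input` is the Actor input as a dict (or None), and
    `available_spiders` maps a spider name to its spider class.
    """
    name = (actor_input or {}).get("spiderName", DEFAULT_SPIDER)
    if name not in available_spiders:
        known = ", ".join(sorted(available_spiders))
        raise ValueError(f"Unknown spider '{name}'. Available spiders: {known}")
    return available_spiders[name]


# Placeholder registry; in a real project the values would be spider classes.
SPIDERS = {"books": "BookSpider", "authors": "AuthorSpider"}
chosen = select_spider({"spiderName": "authors"}, SPIDERS)
```

Failing fast on an unknown name keeps a mistyped input from silently running the wrong spider.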

## Additional links

- [Scrapy Books Example Actor](https://apify.com/vdusek/scrapy-books-example)
- [Python Actor Scrapy template](https://apify.com/templates/python-scrapy)
- [Apify SDK for Python](https://docs.apify.com/sdk/python)
- [Apify platform](https://docs.apify.com/platform)
- [Join our developer community on Discord](https://discord.com/invite/jyEM2PRvMU)

> We welcome any feedback! Please feel free to contact us at [python@apify.com](mailto:python@apify.com). Thank you for your valuable input.
84 changes: 84 additions & 0 deletions website/versioned_docs/version-1.5/quick-start.md
@@ -0,0 +1,84 @@
---
title: Quick start
description: Learn how to create, run, and manage Actors using Apify CLI.
---

Learn how to create, run, and manage Actors using Apify CLI.

## Prerequisites

Before you begin, make sure you have the Apify CLI installed on your system. If you haven't installed it yet, follow the [installation guide](./installation.md).

## Step 1: Create your Actor

Run the following command in your terminal. It will guide you step by step through the creation process.

```bash
apify create
```

:::info Explore Actor templates

The Apify CLI will prompt you to choose a template. Browse the [full list of templates](https://apify.com/templates) to find the best fit for your Actor.

:::

## Step 2: Run your Actor

Once the Actor is initialized, you can run it:

```bash
apify run
```

You'll see output similar to this in your terminal:

```bash
INFO System info {"apifyVersion":"3.4.3","apifyClientVersion":"2.12.6","crawleeVersion":"3.13.10","osType":"Darwin","nodeVersion":"v22.17.0"}
Extracted heading { level: 'h1', text: 'Your full‑stack platform for web scraping' }
Extracted heading { level: 'h3', text: 'TikTok Scraper' }
Extracted heading { level: 'h3', text: 'Google Maps Scraper' }
Extracted heading { level: 'h3', text: 'Instagram Scraper' }
```

## Step 3: Push your Actor

Once you are ready, you can push your Actor to the Apify platform, where you can schedule runs, or make the Actor public for other developers.

### Log in to Apify Console

```bash
apify login
```

:::note Create an Apify account

Before you can interact with the Apify Console, [create an Apify account](https://console.apify.com/).
When you run `apify login`, you can choose one of the following methods:

- Sign in via the Apify Console in your browser — recommended.
- Provide an [Apify API token](https://console.apify.com/settings/integrations) — alternative method.

The interactive prompt will guide you through either option.

:::

### Push to Apify Console

```bash
apify push
```

## Step 4: Call your Actor (optional)

You can run your Actor on the Apify platform. In the following example, the command runs `apify/hello-world` on the Apify platform.

```bash
apify call apify/hello-world
```

## Next steps

- Check the [command reference](./reference.md) for more information about individual commands.
- If you have a problem with the Apify CLI, check the [troubleshooting](./troubleshooting.md) guide.
- Learn more about [Actors](https://docs.apify.com/platform/actors).