Crawling across requests
========================

### On this page

1. [Initial request](#content-initial-request)
2. [Subsequent requests](#content-subsequent-requests)

You can use `limitPerExecution()` to break up long-running crawls across multiple HTTP requests. This is useful in serverless environments, or whenever you want to avoid hitting execution timeouts. See [setting crawl limits](/docs/crawler/v9/configuring-the-crawler/setting-crawl-limits) for all available limit options.

Initial request
---------------

To spread a crawl across multiple requests, create a queue instance and pass it to the crawler. The crawler fills the queue as pages are processed and new URLs are discovered. After the crawler finishes (because it hit the per-execution limit), serialize the queue and store it.

```php
use Spatie\Crawler\Crawler;
use Spatie\Crawler\CrawlQueues\ArrayCrawlQueue;

$queue = new ArrayCrawlQueue(); // or your custom queue

// Crawl the first batch of URLs
Crawler::create('https://example.com')
    ->crawlQueue($queue)
    ->limitPerExecution(10)
    ->start();

// Serialize and store the queue for the next request
$serializedQueue = serialize($queue);
```
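In a real application the serialized queue has to survive between requests, so it needs to go into persistent storage. Where you store it is up to you; a minimal sketch, assuming file storage (the path is hypothetical, and a cache or database works just as well):

```php
// Persist the serialized queue so the next request can pick it up.
// '/tmp/crawl-queue.txt' is an arbitrary example path.
file_put_contents('/tmp/crawl-queue.txt', serialize($queue));
```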

Subsequent requests
-------------------

On subsequent requests, unserialize the stored queue and pass it to the crawler again:

```php
use Spatie\Crawler\Crawler;

$queue = unserialize($serializedQueue);

// Crawl the next batch of URLs
Crawler::create('https://example.com')
    ->crawlQueue($queue)
    ->limitPerExecution(10)
    ->start();

// Serialize and store the queue again
$serializedQueue = serialize($queue);
```

The crawler's behavior is driven entirely by the information in the queue: it only continues where it left off if you pass the same (unserialized) queue instance back in. If you pass a fresh queue instead, the progress and limits of previous runs won't apply.
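Putting the two halves together, a single entry point can drive the whole crawl across executions. The sketch below is not part of the package's API: the file-based storage is an assumption (swap in whatever persistence you use), and it relies on the queue's `hasPendingUrls()` method from the `CrawlQueue` contract to detect completion.

```php
use Spatie\Crawler\Crawler;
use Spatie\Crawler\CrawlQueues\ArrayCrawlQueue;

// Hypothetical storage location for the serialized queue.
$path = '/tmp/crawl-queue.txt';

// Resume the stored queue if one exists, otherwise start fresh.
$queue = file_exists($path)
    ? unserialize(file_get_contents($path))
    : new ArrayCrawlQueue();

// Crawl the next batch of URLs.
Crawler::create('https://example.com')
    ->crawlQueue($queue)
    ->limitPerExecution(10)
    ->start();

if ($queue->hasPendingUrls()) {
    // More URLs remain: store the queue for the next execution.
    file_put_contents($path, serialize($queue));
} else {
    // Crawl finished: remove the stored queue.
    @unlink($path);
}
```

Running this script repeatedly (via cron, a queue worker, or successive HTTP requests) advances the crawl by up to ten URLs per execution until the queue is drained.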

A more detailed example can be found in [this repository](https://github.com/spekulatius/spatie-crawler-cached-queue-example).
