Custom crawl queue | crawler | Spatie


Custom crawl queue
==================


When crawling a site, the crawler stores URLs to be crawled in a queue. By default, this queue is stored in memory using the built-in `ArrayCrawlQueue`.

URL normalization
-----------------------------------------------------------------------------------------------------------

The built-in `ArrayCrawlQueue` normalizes URLs before using them as deduplication keys. This means that `https://Example.com/page` and `https://example.com/page/` are treated as the same URL, preventing redundant requests.

The following normalizations are applied (per RFC 3986):

- Lowercasing scheme and host
- Removing default ports (`:80` for http, `:443` for https)
- Stripping trailing slashes (except for the root `/`)
- Removing empty query strings
- Stripping URL fragments

The original URL is preserved on the `CrawlUrl` object and used for HTTP requests and observer notifications. Only the internal deduplication key uses the normalized form.

If you implement a custom crawl queue, consider applying similar normalizations to avoid crawling duplicate URLs.
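As an illustration, the rules above could be implemented with a small helper. The `normalizeUrl` function below is hypothetical and not part of the package; it is a sketch of how a deduplication key might be derived:

```
<?php

// Hypothetical helper applying the RFC 3986 normalizations listed above.
// Not part of spatie/crawler; shown only to illustrate a deduplication key.
function normalizeUrl(string $url): string
{
    $parts = parse_url($url);

    // Lowercase scheme and host.
    $scheme = strtolower($parts['scheme'] ?? 'http');
    $host = strtolower($parts['host'] ?? '');

    // Remove default ports (:80 for http, :443 for https).
    $port = $parts['port'] ?? null;
    if (($scheme === 'http' && $port === 80) || ($scheme === 'https' && $port === 443)) {
        $port = null;
    }

    // Strip trailing slashes, except for the root path.
    $path = $parts['path'] ?? '/';
    if ($path !== '/') {
        $path = rtrim($path, '/');
        if ($path === '') {
            $path = '/';
        }
    }

    // Keep non-empty query strings; drop empty queries and fragments entirely.
    $query = isset($parts['query']) && $parts['query'] !== '' ? '?'.$parts['query'] : '';

    return $scheme.'://'.$host.($port !== null ? ':'.$port : '').$path.$query;
}

echo normalizeUrl('https://Example.com:443/page/'); // https://example.com/page
```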

Writing your own queue
----------------------

When a site is very large, you may want to store the queue elsewhere, for example in a database. You can write your own crawl queue by implementing the `Spatie\Crawler\CrawlQueues\CrawlQueue` interface and passing an instance to the crawler:

```
use Spatie\Crawler\Crawler;

Crawler::create('https://example.com')
    ->crawlQueue(new MyCustomQueue())
    ->start();
```

The `CrawlQueue` interface requires the following methods:

```
interface CrawlQueue
{
    public function add(CrawlUrl $url): self;
    public function has(string $url): bool;
    public function hasPendingUrls(): bool;
    public function getUrlById(mixed $id): CrawlUrl;
    public function getPendingUrl(): ?CrawlUrl;
    public function hasAlreadyBeenProcessed(CrawlUrl $url): bool;
    public function markAsProcessed(CrawlUrl $crawlUrl): void;
    public function getProcessedUrlCount(): int;
    public function getUrlCount(): int;        // total URLs added to the queue
    public function getPendingUrlCount(): int;  // URLs not yet processed
}
```

The `getUrlCount()` and `getPendingUrlCount()` methods are used by the `CrawlProgress` object to report queue statistics. See [tracking progress](/docs/crawler/v9/basic-usage/tracking-progress) for details.
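To make the shape of an implementation concrete, here is a minimal in-memory sketch along the lines of `ArrayCrawlQueue`. The `CrawlUrl` class below is a simplified stand-in so the example is self-contained; a real implementation would declare `implements Spatie\Crawler\CrawlQueues\CrawlQueue` and work with the package's own `Spatie\Crawler\CrawlUrl`:

```
<?php

// Simplified stand-in for Spatie\Crawler\CrawlUrl so this sketch runs on its own.
class CrawlUrl
{
    public function __construct(
        public string $url,
        public mixed $id = null,
    ) {}
}

// Minimal in-memory queue following the CrawlQueue interface shown above.
// A real implementation could back these arrays with a database table.
class SimpleCrawlQueue
{
    /** @var array<string, CrawlUrl> all URLs ever added, keyed by URL */
    protected array $urls = [];

    /** @var array<string, CrawlUrl> URLs not yet processed, keyed by URL */
    protected array $pending = [];

    public function add(CrawlUrl $url): self
    {
        if (! $this->has($url->url)) {
            $url->id = $url->url; // use the URL itself as the id
            $this->urls[$url->url] = $url;
            $this->pending[$url->url] = $url;
        }

        return $this;
    }

    public function has(string $url): bool
    {
        return isset($this->urls[$url]);
    }

    public function hasPendingUrls(): bool
    {
        return count($this->pending) > 0;
    }

    public function getUrlById(mixed $id): CrawlUrl
    {
        return $this->urls[$id];
    }

    public function getPendingUrl(): ?CrawlUrl
    {
        $first = array_key_first($this->pending);

        return $first === null ? null : $this->pending[$first];
    }

    public function hasAlreadyBeenProcessed(CrawlUrl $url): bool
    {
        return $this->has($url->url) && ! isset($this->pending[$url->url]);
    }

    public function markAsProcessed(CrawlUrl $crawlUrl): void
    {
        unset($this->pending[$crawlUrl->url]);
    }

    public function getProcessedUrlCount(): int
    {
        return count($this->urls) - count($this->pending);
    }

    public function getUrlCount(): int
    {
        return count($this->urls);
    }

    public function getPendingUrlCount(): int
    {
        return count($this->pending);
    }
}
```

A database-backed variant would replace the two arrays with queries against a table holding a `url`, an `id`, and a `processed_at` column, keeping the same method contracts.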

Here are some queue implementations:

- [ArrayCrawlQueue](https://github.com/spatie/crawler/blob/main/src/CrawlQueues/ArrayCrawlQueue.php) (built-in, in-memory)
- [RedisCrawlQueue](https://github.com/repat/spatie-crawler-redis) (third party)
- [CacheCrawlQueue for Laravel](https://github.com/spekulatius/spatie-crawler-toolkit-for-laravel) (third party)
- [Laravel Model as Queue](https://github.com/insign/spatie-crawler-queue-with-laravel-model) (third party example)
