Testing | crawler | Spatie


 Testing
=======

###  On this page

1. [ How faking works ](#content-how-faking-works)
2. [ Faking with custom status codes and headers ](#content-faking-with-custom-status-codes-and-headers)
3. [ Faking with foundUrls ](#content-faking-with-foundurls)
4. [ Testing depth limits ](#content-testing-depth-limits)
5. [ Testing finish reasons ](#content-testing-finish-reasons)

The crawler provides a `fake()` method that lets you test your crawl logic without making real HTTP requests. Pass an array mapping URLs to HTML strings:

```php
use Spatie\Crawler\Crawler;
use Spatie\Crawler\CrawlResponse;

$crawled = [];

Crawler::create('https://example.com')
    ->fake([
        'https://example.com' => '<a href="/about">About</a>',
        'https://example.com/about' => 'About page',
    ])
    ->onCrawled(function (string $url, CrawlResponse $response) use (&$crawled) {
        $crawled[] = $url;
    })
    ->start();

// $crawled will contain both URLs
```

How faking works
----------------

When `fake()` is used, the crawler replaces Guzzle's HTTP handler with a fake handler that serves responses from the array you provided. Any URL not present in the array receives a 404 response.

The fake handler normalizes URLs (handling trailing slashes, for example) and automatically handles `robots.txt` requests. If you don't include a `robots.txt` URL in the fakes array, it returns a 404, which the crawler treats as imposing no restrictions.
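Both behaviors can be exercised in a test. The sketch below assumes the crawler respects `robots.txt` by default (see the Respecting robots.txt section); it fakes a `robots.txt` that disallows one path and collects the URLs that were actually crawled:

```php
use Spatie\Crawler\Crawler;
use Spatie\Crawler\CrawlResponse;

$crawled = [];

Crawler::create('https://example.com')
    ->fake([
        // Serve robots.txt as plain text with a Disallow rule
        'https://example.com/robots.txt' => CrawlResponse::fake(
            "User-agent: *\nDisallow: /private",
            200,
            ['Content-Type' => 'text/plain'],
        ),
        'https://example.com' => '<a href="/private">Private</a> <a href="/public">Public</a>',
        'https://example.com/private' => 'Secret page',
        'https://example.com/public' => 'Public page',
    ])
    ->onCrawled(function (string $url, CrawlResponse $response) use (&$crawled) {
        $crawled[] = $url;
    })
    ->start();

// https://example.com/private should be absent from $crawled
```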

Faking with custom status codes and headers
-------------------------------------------

By default, fake responses return a 200 status with a `text/html` content type. You can use `CrawlResponse::fake()` to create responses with custom status codes and headers:

```php
use Spatie\Crawler\Crawler;
use Spatie\Crawler\CrawlResponse;

Crawler::create('https://example.com')
    ->fake([
        // link every faked page from the start page so each one gets requested
        'https://example.com' => '<a href="/redirect">Redirect</a> <a href="/protected">Protected</a> <a href="/audio.mp3">Audio</a>',
        'https://example.com/redirect' => CrawlResponse::fake('', 301, [
            'Location' => 'https://example.com/new-location',
        ]),
        'https://example.com/protected' => CrawlResponse::fake('Forbidden', 403),
        'https://example.com/audio.mp3' => CrawlResponse::fake('audio data', 200, [
            'Content-Type' => 'audio/mpeg',
        ]),
    ])
    ->start();
```
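To assert on those custom responses, you can collect status codes in an `onCrawled` callback. Note that the accessor for reading the status code from `CrawlResponse` is an assumption here (shown as `$response->status()`); check the Crawl responses page for the actual method name:

```php
use Spatie\Crawler\Crawler;
use Spatie\Crawler\CrawlResponse;

$statuses = [];

Crawler::create('https://example.com')
    ->fake([
        'https://example.com' => '<a href="/protected">Protected</a>',
        'https://example.com/protected' => CrawlResponse::fake('Forbidden', 403),
    ])
    ->onCrawled(function (string $url, CrawlResponse $response) use (&$statuses) {
        // status() is an assumed accessor; see the "Crawl responses" docs
        $statuses[$url] = $response->status();
    })
    ->start();

// $statuses['https://example.com/protected'] should be 403
```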

Faking with foundUrls
---------------------

The `fake()` method combines well with `foundUrls()`, letting you assert on the full set of discovered URLs:

```php
use Spatie\Crawler\Crawler;

$urls = Crawler::create('https://example.com')
    ->fake([
        'https://example.com' => '<a href="/page-1">Page 1</a><a href="/page-2">Page 2</a>',
        'https://example.com/page-1' => 'Page 1',
        'https://example.com/page-2' => 'Page 2',
    ])
    ->internalOnly()
    ->foundUrls();

expect($urls)->toHaveCount(3);
```

Testing depth limits
--------------------

Use `depth()` together with `fake()` to verify that the crawler stops following links past the configured depth:

```php
use Spatie\Crawler\Crawler;

$urls = Crawler::create('https://example.com')
    ->fake([
        'https://example.com' => '<a href="/level-1">Level 1</a>',
        'https://example.com/level-1' => '<a href="/level-2">Level 2</a>',
        'https://example.com/level-2' => 'Level 2',
    ])
    ->depth(1)
    ->foundUrls();

// Only the start URL and level-1 will be crawled (depth 0 and 1)
expect($urls)->toHaveCount(2);
```

Testing finish reasons
----------------------

You can assert which `FinishReason` was returned by `start()`:

```php
use Spatie\Crawler\Crawler;
use Spatie\Crawler\Enums\FinishReason;

$reason = Crawler::create('https://example.com')
    ->fake([
        'https://example.com' => '<a href="/page">Page</a>',
        'https://example.com/page' => 'Page',
    ])
    ->limit(1)
    ->start();

expect($reason)->toBe(FinishReason::CrawlLimitReached);
```
