Filtering URLs | crawler | Spatie

Filtering URLs
==============


By default, the crawler will crawl every URL it finds, including links to external sites. You can control which URLs are crawled using scope helpers or custom crawl profiles.

Scope helpers
-------------

The simplest way to filter URLs is with the built-in scope helpers:

```
use Spatie\Crawler\Crawler;

// Only crawl URLs on the same host
Crawler::create('https://example.com')
    ->internalOnly()
    ->start();

// Crawl URLs on the same host and its subdomains
Crawler::create('https://example.com')
    ->internalOnly()
    ->includeSubdomains()
    ->start();
```

### Matching www and non-www

By default, `internalOnly()` treats `example.com` and `www.example.com` as different hosts. If you want them to be treated as equivalent, chain the `matchWww()` method:

```
use Spatie\Crawler\Crawler;

Crawler::create('https://example.com')
    ->internalOnly()
    ->matchWww()
    ->start();
```

This will crawl links on both `example.com` and `www.example.com`. It works in both directions: starting from `www.example.com` will also include `example.com` links.

### Combining matchWww and includeSubdomains

Both `matchWww()` and `includeSubdomains()` can be used together. When `includeSubdomains()` is enabled, www is stripped from both hosts before the subdomain check. This means `blog.example.com` will match a base URL of `www.example.com`.

```
use Spatie\Crawler\Crawler;

Crawler::create('https://www.example.com')
    ->internalOnly()
    ->matchWww()
    ->includeSubdomains()
    ->start();
```

This will crawl `www.example.com`, `example.com`, `blog.example.com`, `cdn.example.com`, and any other subdomain of `example.com`.
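The host matching described above can be illustrated with a small standalone sketch. This mirrors the documented behavior only and is not the library's actual implementation; the `hostIsInScope` helper is made up for this example:

```
// Illustration of the documented matching with matchWww() and
// includeSubdomains(): strip a leading "www." from both hosts,
// then accept exact matches and any subdomain of the base host.
// (hypothetical helper, not part of the package)
function hostIsInScope(string $url, string $baseUrl): bool
{
    $strip = fn (string $host) => preg_replace('/^www\./', '', $host);

    $host = $strip(parse_url($url, PHP_URL_HOST));
    $base = $strip(parse_url($baseUrl, PHP_URL_HOST));

    return $host === $base || str_ends_with($host, '.' . $base);
}
```

With a base URL of `https://www.example.com`, this accepts `example.com`, `blog.example.com`, and `cdn.example.com`, but rejects unrelated hosts such as `notexample.com`, because the subdomain check requires a `.example.com` suffix.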

Inline filtering
----------------

For custom filtering logic, use the `shouldCrawl` method with a closure:

```
use Spatie\Crawler\Crawler;

Crawler::create('https://example.com')
    ->shouldCrawl(function (string $url) {
        return !str_contains($url, '/admin');
    })
    ->start();
```
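Because the closure only receives a URL string, its logic can be checked on its own before wiring it into a crawl:

```
// The same predicate as above, exercised directly.
$shouldCrawl = fn (string $url): bool => !str_contains($url, '/admin');

var_dump($shouldCrawl('https://example.com/blog'));        // bool(true)
var_dump($shouldCrawl('https://example.com/admin/users')); // bool(false)
```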

Custom crawl profiles
---------------------

For reusable filtering logic, create a class that implements `Spatie\Crawler\CrawlProfiles\CrawlProfile`:

```
use Spatie\Crawler\CrawlProfiles\CrawlProfile;

class MyCustomProfile implements CrawlProfile
{
    public function shouldCrawl(string $url): bool
    {
        return parse_url($url, PHP_URL_HOST) === 'example.com'
            && !str_contains($url, '/private');
    }
}
```

Then pass it to the crawler:

```
use Spatie\Crawler\Crawler;

Crawler::create('https://example.com')
    ->crawlProfile(new MyCustomProfile())
    ->start();
```

This package comes with three built-in profiles:

- `CrawlAllUrls`: crawls all URLs on all pages, including external sites (this is the default)
- `CrawlInternalUrls`: only crawls URLs on the same host
- `CrawlSubdomains`: crawls URLs on the same host and its subdomains

Always crawl and never crawl
----------------------------

Sometimes you need to override your crawl profile for specific URL patterns. The `alwaysCrawl` and `neverCrawl` methods accept arrays of patterns (using `fnmatch` syntax) that take priority over your crawl profile.

```
use Spatie\Crawler\Crawler;

Crawler::create('https://example.com')
    ->internalOnly()
    ->alwaysCrawl(['https://cdn.example.com/*'])
    ->neverCrawl(['*/admin/*', '*/tmp/*'])
    ->start();
```
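Assuming plain `fnmatch()` semantics as the text above describes, `*` matches any run of characters, including slashes (PHP's `fnmatch` only treats `/` specially when the `FNM_PATHNAME` flag is set). A quick standalone check of the patterns used in this example:

```
// fnmatch() with default flags: '*' matches any characters, '/' included.
var_dump(fnmatch('*/admin/*', 'https://example.com/admin/users'));                 // bool(true)
var_dump(fnmatch('https://cdn.example.com/*', 'https://cdn.example.com/app.js'));  // bool(true)
var_dump(fnmatch('*/tmp/*', 'https://example.com/blog/post'));                     // bool(false)
```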

`alwaysCrawl` patterns bypass both the crawl profile and `robots.txt` rules. This is useful for checking external assets (like CDN resources) while keeping the crawl scoped to your own site.

`neverCrawl` patterns block matching URLs from being added to the crawl queue, regardless of what the crawl profile returns.

When a URL matches both an `alwaysCrawl` and a `neverCrawl` pattern, `alwaysCrawl` wins.
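The documented precedence can be sketched as a small decision function. This is an illustration of the rules stated above, not the package's actual code; `matchesAny` and `shouldAddToQueue` are hypothetical names:

```
// Hypothetical sketch of the documented precedence:
// alwaysCrawl > neverCrawl > crawl profile.
function matchesAny(string $url, array $patterns): bool
{
    foreach ($patterns as $pattern) {
        if (fnmatch($pattern, $url)) {
            return true;
        }
    }
    return false;
}

function shouldAddToQueue(string $url, array $always, array $never, callable $profile): bool
{
    if (matchesAny($url, $always)) {
        return true;  // alwaysCrawl wins, even over neverCrawl
    }
    if (matchesAny($url, $never)) {
        return false; // neverCrawl blocks URLs the profile would allow
    }
    return $profile($url); // otherwise, defer to the crawl profile
}
```

With the patterns from the example above, `https://cdn.example.com/app.js` is queued despite being external, `https://example.com/admin/users` is blocked despite being internal, and every other URL falls through to the profile.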
