By default, the crawler will crawl every URL it finds, including links to external sites. You can control which URLs are crawled using scope helpers or custom crawl profiles.
## Scope helpers
The simplest way to filter URLs is with the built-in scope helpers:
```php
use Spatie\Crawler\Crawler;

// Only crawl URLs on the same host as the base URL
Crawler::create('https://example.com')
    ->internalOnly()
    ->start();

// Also crawl URLs on subdomains of the base URL
Crawler::create('https://example.com')
    ->internalOnly()
    ->includeSubdomains()
    ->start();
```
## Matching www and non-www
By default, internalOnly() treats example.com and www.example.com as different hosts. If you want them to be treated as equivalent, chain the matchWww() method:
```php
use Spatie\Crawler\Crawler;

Crawler::create('https://example.com')
    ->internalOnly()
    ->matchWww()
    ->start();
```
This will crawl links on both example.com and www.example.com. It works in both directions: starting from www.example.com will also include example.com links.
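As a minimal sketch of the reverse direction, using the same chain of helpers shown above:

```php
use Spatie\Crawler\Crawler;

// Starting from the www host also picks up links on the bare host
Crawler::create('https://www.example.com')
    ->internalOnly()
    ->matchWww()
    ->start();
```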
## Combining matchWww and includeSubdomains
Both matchWww() and includeSubdomains() can be used together. When includeSubdomains() is enabled, www is stripped from both hosts before the subdomain check. This means blog.example.com will match a base URL of www.example.com.
```php
use Spatie\Crawler\Crawler;

Crawler::create('https://www.example.com')
    ->internalOnly()
    ->matchWww()
    ->includeSubdomains()
    ->start();
```
This will crawl www.example.com, example.com, blog.example.com, cdn.example.com, and any other subdomain of example.com.
## Inline filtering
For custom filtering logic, use the shouldCrawl method with a closure:
```php
use Spatie\Crawler\Crawler;

Crawler::create('https://example.com')
    ->shouldCrawl(function (string $url) {
        // Skip any URL containing /admin
        return ! str_contains($url, '/admin');
    })
    ->start();
```
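The closure receives the full URL as a string, so any PHP logic works inside it. As a sketch (the extension list here is purely illustrative), you could skip binary assets by file extension:

```php
use Spatie\Crawler\Crawler;

// Illustrative extension list: skip URLs whose path ends in a binary asset
$skippedExtensions = ['pdf', 'zip', 'jpg', 'png'];

Crawler::create('https://example.com')
    ->shouldCrawl(function (string $url) use ($skippedExtensions) {
        $path = (string) parse_url($url, PHP_URL_PATH);
        $extension = strtolower(pathinfo($path, PATHINFO_EXTENSION));

        return ! in_array($extension, $skippedExtensions, true);
    })
    ->start();
```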
## Custom crawl profiles
For reusable filtering logic, create a class that implements Spatie\Crawler\CrawlProfiles\CrawlProfile:
```php
use Spatie\Crawler\CrawlProfiles\CrawlProfile;

class MyCustomProfile implements CrawlProfile
{
    public function shouldCrawl(string $url): bool
    {
        // Only crawl example.com URLs outside of /private
        return parse_url($url, PHP_URL_HOST) === 'example.com'
            && ! str_contains($url, '/private');
    }
}
```
Then pass it to the crawler:
```php
use Spatie\Crawler\Crawler;

Crawler::create('https://example.com')
    ->crawlProfile(new MyCustomProfile())
    ->start();
```
This package comes with three built-in profiles:
- CrawlAllUrls: crawls all URLs on all pages, including external sites (this is the default)
- CrawlInternalUrls: only crawls URLs on the same host
- CrawlSubdomains: crawls URLs on the same host and its subdomains
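Built-in profiles are passed to crawlProfile() the same way as a custom one. The sketch below assumes CrawlInternalUrls accepts the base URL as a constructor argument, as in the upstream package; check the signature in your installed version:

```php
use Spatie\Crawler\Crawler;
use Spatie\Crawler\CrawlProfiles\CrawlInternalUrls;

// Assumption: CrawlInternalUrls takes the base URL in its constructor
Crawler::create('https://example.com')
    ->crawlProfile(new CrawlInternalUrls('https://example.com'))
    ->start();
```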
## Always crawl and never crawl
Sometimes you need to override your crawl profile for specific URL patterns. The alwaysCrawl and neverCrawl methods accept arrays of patterns (using fnmatch syntax) that take priority over your crawl profile.
```php
use Spatie\Crawler\Crawler;

Crawler::create('https://example.com')
    ->internalOnly()
    // Still crawl CDN assets even though they are external
    ->alwaysCrawl(['https://cdn.example.com/*'])
    // Never queue admin or tmp URLs, even though they are internal
    ->neverCrawl(['*/admin/*', '*/tmp/*'])
    ->start();
```
alwaysCrawl patterns bypass both the crawl profile and robots.txt rules. This is useful for checking external assets (like CDN resources) while keeping the crawl scoped to your own site.
neverCrawl patterns block matching URLs from being added to the crawl queue, regardless of what the crawl profile returns.
When a URL matches both an alwaysCrawl and a neverCrawl pattern, alwaysCrawl wins.
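To make the precedence concrete, here is a sketch where one URL matches both lists (the health-check path is hypothetical):

```php
use Spatie\Crawler\Crawler;

// Hypothetical paths: /admin/health matches both lists.
// alwaysCrawl takes precedence, so that URL is still crawled,
// while every other /admin URL stays blocked by neverCrawl.
Crawler::create('https://example.com')
    ->internalOnly()
    ->alwaysCrawl(['*/admin/health'])
    ->neverCrawl(['*/admin/*'])
    ->start();
```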