By default, the crawler continues until it has crawled every page it can find. This can cause problems in constrained environments, such as serverless platforms with strict execution time limits.
## Crawl depth
You can limit how deep the crawler goes using the depth() method.
```php
use Spatie\Crawler\Crawler;

Crawler::create('https://example.com')
    ->depth(2)
    ->start();
```
A depth of 0 means only the start URL will be crawled; a depth of 1 adds the pages the start URL links to, and so on.
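For example, a depth of 0 restricts the crawl to the start URL itself:

```php
use Spatie\Crawler\Crawler;

// Only https://example.com itself is fetched; none of its links are followed.
Crawler::create('https://example.com')
    ->depth(0)
    ->start();
```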
## Crawl and time limits
The crawl behavior can be controlled with these options:

- limit(): the maximum number of URLs to crawl across all executions
- limitPerExecution(): the maximum number of URLs to crawl during the current execution
- timeLimit(): the maximum execution time in seconds across all executions
- timeLimitPerExecution(): the maximum execution time in seconds for the current execution

When any of these limits is reached, the crawler stops and returns a FinishReason from start(). See tracking progress for details.
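As a minimal sketch of inspecting that return value (the exact FinishReason cases are listed in the tracking progress section):

```php
use Spatie\Crawler\Crawler;

$reason = Crawler::create('https://example.com')
    ->limit(5)
    ->start();

// Every PHP enum case exposes its case name via the ->name property,
// so this prints which condition stopped the crawl.
echo $reason->name;
```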
## Using the total crawl limit
The limit() method allows you to limit the total number of URLs to crawl, no matter how often you call the crawler.
```php
use Spatie\Crawler\Crawler;
use Spatie\Crawler\Enums\FinishReason;

$queue = <your queue implementation>;

// Crawls 5 URLs, then stops; $reason is a FinishReason enum case.
$reason = Crawler::create('https://example.com')
    ->crawlQueue($queue)
    ->limit(5)
    ->start();

// Doesn't crawl any further: the total limit has already been reached.
Crawler::create('https://example.com')
    ->crawlQueue($queue)
    ->limit(5)
    ->start();
```
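Both calls above receive the same $queue instance; the shared queue is presumably what carries the crawl state across executions. As a sketch, assuming the package's ArrayCrawlQueue implementation (the exact class and namespace may differ between versions):

```php
use Spatie\Crawler\Crawler;
use Spatie\Crawler\CrawlQueues\ArrayCrawlQueue; // assumed namespace

// A single in-memory queue reused across executions in the same process;
// for crawls spread over separate requests you would need a queue backed
// by persistent storage instead.
$queue = new ArrayCrawlQueue();

Crawler::create('https://example.com')
    ->crawlQueue($queue)
    ->limit(5)
    ->start();
```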
## Using the current crawl limit
The limitPerExecution() method limits how many URLs are crawled in a single execution. This is especially useful when crawling across multiple requests. The following code processes 5 pages per execution, with no limit on the total number of pages crawled.
```php
use Spatie\Crawler\Crawler;

$queue = <your queue implementation>;

// Crawls 5 URLs, then stops.
Crawler::create('https://example.com')
    ->crawlQueue($queue)
    ->limitPerExecution(5)
    ->start();

// Crawls the next 5 URLs from the same queue.
Crawler::create('https://example.com')
    ->crawlQueue($queue)
    ->limitPerExecution(5)
    ->start();
```
## Using time limits
The timeLimit() method sets the maximum execution time across all executions. The timeLimitPerExecution() method sets the maximum execution time for a single crawl. Both accept a value in seconds.
```php
use Spatie\Crawler\Crawler;

// Stops crawling after 60 seconds in total, across all executions.
$reason = Crawler::create('https://example.com')
    ->timeLimit(60)
    ->start();

$queue = <your queue implementation>;

// Stops the current execution after 30 seconds; a later execution
// can pick up the same queue where this one left off.
Crawler::create('https://example.com')
    ->crawlQueue($queue)
    ->timeLimitPerExecution(30)
    ->start();
```
## Combining limits
All limits can be combined to control the crawler:
```php
use Spatie\Crawler\Crawler;

$queue = <your queue implementation>;

// First execution: crawls 5 URLs (the per-execution limit).
Crawler::create('https://example.com')
    ->crawlQueue($queue)
    ->limit(10)
    ->limitPerExecution(5)
    ->start();

// Second execution: crawls 5 more URLs, reaching the total limit of 10.
Crawler::create('https://example.com')
    ->crawlQueue($queue)
    ->limit(10)
    ->limitPerExecution(5)
    ->start();

// Third execution: doesn't crawl anything, as the total limit has been reached.
Crawler::create('https://example.com')
    ->crawlQueue($queue)
    ->limit(10)
    ->limitPerExecution(5)
    ->start();
```
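Time and count limits can also be mixed. Since the crawler stops when any limit is reached, a sketch like this caps a single execution at 5 URLs or 30 seconds, whichever comes first:

```php
use Spatie\Crawler\Crawler;

// Stops after 5 URLs or 30 seconds, whichever limit is hit first.
Crawler::create('https://example.com')
    ->limitPerExecution(5)
    ->timeLimitPerExecution(30)
    ->start();
```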