The crawler used by the sitemap generator can be configured in several ways. For example, you can tell it to ignore robots checks:
```php
use Spatie\Crawler\Crawler;
use Spatie\Sitemap\SitemapGenerator;

SitemapGenerator::create('https://example.com')
    ->configureCrawler(function (Crawler $crawler) {
        $crawler->ignoreRobots();
    })
    ->writeToFile($sitemapPath);
```
## Limiting the number of pages crawled

You can limit the number of pages crawled by calling `setMaximumCrawlCount`:
```php
use Spatie\Sitemap\SitemapGenerator;

SitemapGenerator::create('https://example.com')
    ->setMaximumCrawlCount(500)
    ->writeToFile($sitemapPath);
```
## Setting the crawl concurrency
You can set the number of concurrent connections the crawler will use:
```php
use Spatie\Sitemap\SitemapGenerator;

SitemapGenerator::create('https://example.com')
    ->setConcurrency(1)
    ->writeToFile($sitemapPath);
```
The default concurrency is 10.
## Controlling the crawl depth
You can limit how deep the crawler follows links:
```php
use Spatie\Crawler\Crawler;
use Spatie\Sitemap\SitemapGenerator;

SitemapGenerator::create('https://example.com')
    ->configureCrawler(function (Crawler $crawler) {
        $crawler->depth(3);
    })
    ->writeToFile($sitemapPath);
```
## Executing JavaScript

The sitemap generator can execute JavaScript on each page, so it also discovers links that are generated by your scripts. You can enable this feature by setting `execute_javascript` in the config file to `true`.
Under the hood, headless Chrome is used to execute JavaScript. You'll need to install spatie/browsershot separately:
```bash
composer require spatie/browsershot
```
The Browsershot documentation lists its requirements and gives some pointers on how to install headless Chrome on your system.

The package will make an educated guess as to where Chrome is installed on your system. You can also set the path explicitly in `config/sitemap.php`.
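
As a rough sketch, the relevant part of `config/sitemap.php` might look like the fragment below. The `execute_javascript` key is the one named above. The `chrome_binary_path` key name, its `null` fallback, and the example path are assumptions for illustration, so check the config file published by the package for the exact key.

```php
<?php

// config/sitemap.php (fragment; all other keys omitted)
return [
    // Run JavaScript on each page so links injected by scripts are discovered.
    'execute_javascript' => true,

    // Assumed key name and example path: point this at your Chrome binary.
    // Leaving it unset (null) is assumed to let the package guess the location.
    'chrome_binary_path' => '/usr/bin/google-chrome',
];
```

Setting the path explicitly is mainly useful when Chrome is installed somewhere non-standard and the package's guess fails.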