By default, the crawler will respect robots data from robots.txt files, meta tags, and response headers. More information on the spec can be found at robotstxt.org.
Parsing robots data is done by the spatie/robots-txt package.
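For orientation, the three sources named above usually take the following shapes. The paths and directive values here are illustrative assumptions and are not taken from any particular site.

```txt
# 1. robots.txt, served from the site root
User-agent: *
Disallow: /private/

# 2. Robots meta tag, placed in a page's <head>
<meta name="robots" content="noindex, nofollow">

# 3. HTTP response header
X-Robots-Tag: noindex, nofollow
```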
## Ignoring robots rules
You can disable all robots checks using the `ignoreRobots` method.

    use Spatie\Crawler\Crawler;

    // Crawl without applying any robots checks.
    Crawler::create('https://example.com')
        ->ignoreRobots()
        ->start();

After disabling robots checks, you can re-enable them using the `respectRobots` method.

    use Spatie\Crawler\Crawler;

    $crawler = Crawler::create('https://example.com')
        ->ignoreRobots();

    // Later: turn robots checks back on for the same crawler.
    $crawler->respectRobots();

## Accepting nofollow links
By default, the crawler rejects all links containing `rel="nofollow"`. You can disable this check using the `followNofollow` method.

    use Spatie\Crawler\Crawler;

    // Follow links even when they are marked rel="nofollow".
    Crawler::create('https://example.com')
        ->followNofollow()
        ->start();

You can re-enable nofollow rejection using the `rejectNofollowLinks` method.

    use Spatie\Crawler\Crawler;

    $crawler = Crawler::create('https://example.com')
        ->followNofollow();

    // Later: go back to rejecting rel="nofollow" links.
    $crawler->rejectNofollowLinks();

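To make it concrete which links a nofollow check of this kind would drop, here is a minimal, self-contained sketch. It is not the crawler's implementation: `hasNofollow` and `followableLinks` are hypothetical helper names, the links are assumed to be already extracted into arrays, and treating `rel` as a case-insensitive, space-separated token list is an assumption based on how HTML defines the attribute.

```php
<?php

// Illustrative sketch only; NOT the crawler's own code.
// hasNofollow() and followableLinks() are hypothetical helpers.

/**
 * Report whether a rel attribute value contains the "nofollow" token.
 */
function hasNofollow(string $rel): bool
{
    // rel is a space-separated list of tokens, compared case-insensitively.
    $tokens = preg_split('/\s+/', strtolower(trim($rel)), -1, PREG_SPLIT_NO_EMPTY) ?: [];

    return in_array('nofollow', $tokens, true);
}

/**
 * Keep only the hrefs of links that are not marked nofollow.
 *
 * @param array<int, array{href: string, rel: string}> $links
 * @return string[]
 */
function followableLinks(array $links): array
{
    $hrefs = [];

    foreach ($links as $link) {
        if (hasNofollow($link['rel'])) {
            continue; // rejected, as in the default behaviour described above
        }

        $hrefs[] = $link['href'];
    }

    return $hrefs;
}

$links = [
    ['href' => '/about', 'rel' => ''],
    ['href' => '/login', 'rel' => 'nofollow noopener'],
];

print_r(followableLinks($links)); // only "/about" remains
```

With `followNofollow` enabled, the equivalent of this filter is skipped and both links would be eligible for crawling.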
## Custom user agent
The crawler's user agent is also used when checking robots.txt rules. When you set a custom user agent, robots.txt rules that target that agent are respected. For example, if the site's robots.txt contains:

    User-agent: my-agent
    Disallow: /

then a crawler using `my-agent` as its user agent will not crawl any pages on the site.
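The agent-specific matching described above can be sketched in plain PHP. This is a deliberately simplified illustration and not how spatie/robots-txt is implemented: `parseDisallowRules` and `isDisallowed` are hypothetical names, only `User-agent` and `Disallow` lines are handled, paths are matched by plain prefix, and the fallback to the `*` group is an assumption about typical robots.txt handling. It requires PHP 8 for `str_contains` and `str_starts_with`.

```php
<?php

// Simplified illustration; NOT the spatie/robots-txt implementation.
// Ignores Allow lines, wildcards, and other robots.txt features.

/**
 * Map each user agent (lowercased) to its list of disallowed path prefixes.
 *
 * @return array<string, string[]>
 */
function parseDisallowRules(string $robotsTxt): array
{
    $rules = [];
    $currentAgents = [];
    $lastWasAgent = false;

    foreach (preg_split('/\R/', $robotsTxt) as $line) {
        // Drop comments and surrounding whitespace.
        $line = trim(explode('#', $line, 2)[0]);

        if ($line === '' || ! str_contains($line, ':')) {
            continue;
        }

        [$field, $value] = array_map('trim', explode(':', $line, 2));
        $field = strtolower($field);

        if ($field === 'user-agent') {
            // Consecutive User-agent lines belong to the same group.
            if (! $lastWasAgent) {
                $currentAgents = [];
            }

            $agent = strtolower($value);
            $currentAgents[] = $agent;
            $rules[$agent] ??= [];
            $lastWasAgent = true;

            continue;
        }

        if ($field === 'disallow' && $value !== '') {
            foreach ($currentAgents as $agent) {
                $rules[$agent][] = $value;
            }
        }

        $lastWasAgent = false;
    }

    return $rules;
}

function isDisallowed(string $robotsTxt, string $userAgent, string $path): bool
{
    $rules = parseDisallowRules($robotsTxt);

    // Prefer the group for this agent; otherwise fall back to the "*" group.
    $prefixes = $rules[strtolower($userAgent)] ?? $rules['*'] ?? [];

    foreach ($prefixes as $prefix) {
        if (str_starts_with($path, $prefix)) {
            return true;
        }
    }

    return false;
}

$robotsTxt = "User-agent: my-agent\nDisallow: /\n";

var_dump(isDisallowed($robotsTxt, 'my-agent', '/any/page'));    // bool(true)
var_dump(isDisallowed($robotsTxt, 'other-agent', '/any/page')); // bool(false)
```

In this sketch, the same robots.txt blocks everything for `my-agent` but nothing for a differently named agent, which is why the choice of user agent changes what gets crawled.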