You can customize how links are extracted from a page by creating a class that implements the `UrlParser` interface. The `extractUrls` method should return an array of `ExtractedUrl` objects:
```php
use Spatie\Crawler\Enums\ResourceType;
use Spatie\Crawler\ExtractedUrl;
use Spatie\Crawler\UrlParsers\UrlParser;

class MyUrlParser implements UrlParser
{
    public function extractUrls(string $html, string $baseUrl): array
    {
        return [
            new ExtractedUrl(
                url: 'https://example.com/page',
                linkText: 'Example page',
                resourceType: ResourceType::Link,
            ),
        ];
    }
}
```
Each `ExtractedUrl` has the following properties:

- `url`: the discovered URL
- `linkText`: the text content of the link (optional)
- `resourceType`: the type of resource (`Link`, `Image`, `Script`, `Stylesheet`, or `OpenGraphImage`)
- `malformedReason`: if set, the URL is treated as malformed and will be skipped
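To make the interface more concrete, here is a sketch of a parser that uses PHP's built-in `DOMDocument` to collect anchor hrefs and flags obviously invalid absolute URLs via `malformedReason`. It is based only on the interface described above; in particular, passing `malformedReason` as a named constructor argument is an assumption, so check your installed version's `ExtractedUrl` signature before using it.

```php
use Spatie\Crawler\Enums\ResourceType;
use Spatie\Crawler\ExtractedUrl;
use Spatie\Crawler\UrlParsers\UrlParser;

class AnchorOnlyUrlParser implements UrlParser
{
    public function extractUrls(string $html, string $baseUrl): array
    {
        $dom = new \DOMDocument();

        // Suppress warnings: real-world HTML is rarely well-formed.
        @$dom->loadHTML($html);

        $urls = [];

        foreach ($dom->getElementsByTagName('a') as $anchor) {
            $href = trim($anchor->getAttribute('href'));

            // Skip empty hrefs and same-page fragment links.
            if ($href === '' || str_starts_with($href, '#')) {
                continue;
            }

            // Flag invalid absolute URLs so the crawler skips them.
            $isAbsolute = str_contains($href, '://');
            $malformedReason = ($isAbsolute && filter_var($href, FILTER_VALIDATE_URL) === false)
                ? 'Invalid absolute URL'
                : null;

            $urls[] = new ExtractedUrl(
                url: $href,
                linkText: trim($anchor->textContent),
                resourceType: ResourceType::Link,
                malformedReason: $malformedReason,
            );
        }

        return $urls;
    }
}
```

Relative hrefs are returned as-is here; whether you resolve them against `$baseUrl` yourself or let the crawler do it depends on how your version of the package treats the `url` value.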
By default, the `LinkUrlParser` is used. It extracts URLs from `<a>` tags, `<link rel="next/prev">`, and `<link hreflang>` elements. When resource extraction is enabled, it also extracts images, scripts, stylesheets, and Open Graph images.
To use your custom parser, pass it to the `urlParser` method:

```php
use Spatie\Crawler\Crawler;

Crawler::create('https://example.com')
    ->urlParser(new MyUrlParser())
    ->start();
```
## Crawling sitemaps
There is a built-in option to parse sitemaps instead of (or in addition to) following links. It supports sitemap index files.
```php
use Spatie\Crawler\Crawler;

Crawler::create('https://example.com')
    ->parseSitemaps()
    ->start();
```