By default, the crawler only extracts links (<a> tags and some <link> tags) from each page. You can also instruct it to extract images, scripts, stylesheets, and Open Graph images. This is useful for broken asset checking, content auditing, or building a complete inventory of a site's resources.
Use the alsoExtract method to extract additional resource types alongside links:
use Spatie\Crawler\Crawler;
use Spatie\Crawler\CrawlResponse;
use Spatie\Crawler\Enums\ResourceType;
Crawler::create('https://example.com')
    ->alsoExtract(ResourceType::Image, ResourceType::Stylesheet)
    ->onCrawled(function (string $url, CrawlResponse $response) {
        echo "{$response->resourceType()->value}: {$url}\n";
    })
    ->start();
The available resource types are:
| Type | What it extracts |
| --- | --- |
| ResourceType::Link | <a> tags, <link rel="next/prev">, <link hreflang> (always included) |
| ResourceType::Image | <img src> and <img data-src> (lazy loaded images) |
| ResourceType::Script | <script src> and <link rel="modulepreload"> |
| ResourceType::Stylesheet | <link rel="stylesheet">, <link type="text/css">, <link as="style"> |
| ResourceType::OpenGraphImage | <meta property="og:image"> and <meta property="twitter:image"> |
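To make the Image row concrete, here is a minimal sketch of reading both src and data-src from img elements using PHP's built-in DOMDocument. The function name extractImageUrls is hypothetical and this is not the crawler's own extraction code; it only illustrates which attributes that row refers to.

```php
<?php

// Hypothetical helper for illustration only; not part of spatie/crawler.
// Collects the values of src and data-src from every <img> element.
function extractImageUrls(string $html): array
{
    $document = new DOMDocument();

    // Real-world HTML is rarely valid; collect parser warnings instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $document->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $urls = [];
    foreach ($document->getElementsByTagName('img') as $image) {
        foreach (['src', 'data-src'] as $attribute) {
            $value = trim($image->getAttribute($attribute));
            if ($value !== '') {
                $urls[] = $value;
            }
        }
    }

    // Drop duplicates while keeping first-seen order.
    return array_values(array_unique($urls));
}
```

A lazy-loaded image typically carries a placeholder in src and the real file in data-src, so a sketch like this returns both values for such an element.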
To extract everything at once, use extractAll:
Crawler::create('https://example.com')
    ->extractAll()
    ->onCrawled(function (string $url, CrawlResponse $response) {
        // $response->resourceType() tells you which kind of resource was crawled
    })
    ->start();
## Resource types in observers
When using observers, the resource type is available through the CrawlResponse:
use Spatie\Crawler\CrawlObservers\CrawlObserver;
use Spatie\Crawler\CrawlProgress;
use Spatie\Crawler\CrawlResponse;
use Spatie\Crawler\Enums\ResourceType;
class AssetChecker extends CrawlObserver
{
    public function crawled(
        string $url,
        CrawlResponse $response,
        CrawlProgress $progress,
    ): void {
        if ($response->resourceType() === ResourceType::Image && $response->status() === 404) {
            echo "Broken image: {$url} (found on {$response->foundOnUrl()})\n";
        }
    }
}
## Base href support
When extracting resources (images, scripts, stylesheets, and Open Graph images), the crawler respects the <base href> tag in the HTML. If a page contains <base href="https://example.com/assets/">, relative resource URLs will be resolved against that base URL instead of the page URL.
Links (<a> tags) also respect <base href> through Symfony's DomCrawler.
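The effect of a base URL on relative resource paths can be sketched in plain PHP. The helpers effectiveBase and resolveUrl below are hypothetical and deliberately simplified (no handling of ".." segments, query strings, or a relative base href); they are not how the crawler resolves URLs, only an illustration of the rule described above.

```php
<?php

// Hypothetical helpers for illustration only; not part of spatie/crawler.

// Returns the href of the first <base> tag, or the page URL when there is none.
function effectiveBase(string $html, string $pageUrl): string
{
    if (preg_match('#<base\s[^>]*href=["\']([^"\']+)["\']#i', $html, $matches) === 1) {
        return $matches[1];
    }

    return $pageUrl;
}

// Simplified resolver: assumes $base is an absolute http(s) URL.
function resolveUrl(string $relative, string $base): string
{
    // Already absolute (has a scheme): return unchanged.
    if (preg_match('#^[a-z][a-z0-9+.-]*://#i', $relative) === 1) {
        return $relative;
    }

    $parts = parse_url($base);
    $origin = $parts['scheme'] . '://' . $parts['host']
        . (isset($parts['port']) ? ':' . $parts['port'] : '');

    // Protocol-relative URL: reuse only the scheme of the base.
    if (str_starts_with($relative, '//')) {
        return $parts['scheme'] . ':' . $relative;
    }

    // Root-relative path: keep only the origin of the base.
    if (str_starts_with($relative, '/')) {
        return $origin . $relative;
    }

    // Plain relative path: replace everything after the last "/" of the base path.
    $path = $parts['path'] ?? '/';
    $directory = substr($path, 0, strrpos($path, '/') + 1);

    return $origin . $directory . $relative;
}
```

With a page at https://example.com/blog/post that declares the base URL from the example above, img/logo.png resolves under /assets/ rather than under /blog/.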
## Malformed URLs
When the crawler encounters a malformed URL in the HTML (for example, href="https:///invalid"), it reports the URL through your onFailed callback or your observer's crawlFailed method instead of silently ignoring it. The RequestException message contains the reason the URL could not be parsed.
use GuzzleHttp\Exception\RequestException;
use Spatie\Crawler\Crawler;
use Spatie\Crawler\CrawlProgress;
Crawler::create('https://example.com')
    ->onFailed(function (string $url, RequestException $exception, CrawlProgress $progress) {
        // Malformed URLs are reported with a message describing the parse failure
        if (str_contains($exception->getMessage(), 'Malformed URL')) {
            echo "Found malformed URL: {$url}\n";
        }
    })
    ->start();
## Resource types in collected URLs
When using foundUrls(), each CrawledUrl includes the resource type:
$urls = Crawler::create('https://example.com')
    ->extractAll()
    ->foundUrls();
foreach ($urls as $crawledUrl) {
    echo "{$crawledUrl->resourceType->value}: {$crawledUrl->url} ({$crawledUrl->status})\n";
}
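For a broken asset report, the collected URLs could be grouped by resource type. The sketch below uses stand-in classes (DemoResourceType, DemoCrawledUrl) so that it runs without the package. They only mirror the three properties used above; the enum string values and the integer status are assumptions, not the library's actual definitions. The helper brokenUrlsByType is likewise hypothetical.

```php
<?php

// Stand-ins for illustration only; the real classes come from spatie/crawler.
// The string values below are assumed, not taken from the library.
enum DemoResourceType: string
{
    case Link = 'link';
    case Image = 'image';
    case Stylesheet = 'stylesheet';
}

final class DemoCrawledUrl
{
    public function __construct(
        public readonly string $url,
        public readonly int $status,
        public readonly DemoResourceType $resourceType,
    ) {
    }
}

/**
 * Groups every URL with a 4xx or 5xx status by its resource type value.
 *
 * @param iterable<DemoCrawledUrl> $urls
 * @return array<string, list<string>>
 */
function brokenUrlsByType(iterable $urls): array
{
    $broken = [];

    foreach ($urls as $crawledUrl) {
        if ($crawledUrl->status >= 400) {
            $broken[$crawledUrl->resourceType->value][] = $crawledUrl->url;
        }
    }

    return $broken;
}
```

Because the function reads only url, status, and resourceType->value, the same loop should work on the objects returned by foundUrls(), provided status is an integer HTTP code as assumed here.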