Extracting resources
====================


By default, the crawler only extracts links (`<a>` tags and some `<link>` tags) from each page. You can also instruct it to extract images, scripts, stylesheets, and Open Graph images. This is useful for broken asset checking, content auditing, or building a complete inventory of a site's resources.

Extracting specific resource types
----------------------------------

Use the `alsoExtract` method to extract additional resource types alongside links:

```
use Spatie\Crawler\Crawler;
use Spatie\Crawler\CrawlResponse;
use Spatie\Crawler\Enums\ResourceType;

Crawler::create('https://example.com')
    ->alsoExtract(ResourceType::Image, ResourceType::Stylesheet)
    ->onCrawled(function (string $url, CrawlResponse $response) {
        echo "{$response->resourceType()->value}: {$url}\n";
    })
    ->start();
```

The available resource types are:

| Type | What it extracts |
| --- | --- |
| `ResourceType::Link` | `<a>` tags, `<link>`, `<area>` (always included) |
| `ResourceType::Image` | `<img src>` and `<img data-src>` (lazy-loaded images) |
| `ResourceType::Script` | `<script src>` and `<link rel="preload" as="script">` |
| `ResourceType::Stylesheet` | `<link rel="stylesheet">`, `<style>` `@import` rules, `<link rel="preload" as="style">` |
| `ResourceType::OpenGraphImage` | `<meta property="og:image">` and `<meta property="og:image:secure_url">` |

Extracting all resource types
-----------------------------

To extract everything at once, use `extractAll`:

```
Crawler::create('https://example.com')
    ->extractAll()
    ->onCrawled(function (string $url, CrawlResponse $response) {
        // $response->resourceType() tells you what kind of resource this is
    })
    ->start();
```

Resource types in observers
---------------------------

When using observers, the resource type is available through the `CrawlResponse`:

```
use Spatie\Crawler\CrawlObservers\CrawlObserver;
use Spatie\Crawler\CrawlProgress;
use Spatie\Crawler\CrawlResponse;
use Spatie\Crawler\Enums\ResourceType;

class AssetChecker extends CrawlObserver
{
    public function crawled(
        string $url,
        CrawlResponse $response,
        CrawlProgress $progress,
    ): void {
        if ($response->resourceType() === ResourceType::Image && $response->status() === 404) {
            echo "Broken image: {$url} (found on {$response->foundOnUrl()})\n";
        }
    }
}
```
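To take effect, the observer has to be registered on the crawler. The registration method shown below (`setCrawlObserver`) follows the naming used in earlier versions of the package and is an assumption here:

```php
use Spatie\Crawler\Crawler;
use Spatie\Crawler\Enums\ResourceType;

// setCrawlObserver mirrors the method name from earlier crawler
// versions; adjust if v9 names it differently.
Crawler::create('https://example.com')
    ->alsoExtract(ResourceType::Image)
    ->setCrawlObserver(new AssetChecker())
    ->start();
```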

Base href support
-----------------

When extracting resources (images, scripts, stylesheets, and Open Graph images), the crawler respects the `<base href>` tag in the HTML. If a page contains a `<base>` tag with an `href` attribute, relative resource URLs will be resolved against that base URL instead of the page URL.

Links (`<a>` tags) also respect `<base href>` through Symfony's DomCrawler.
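For example, given a page at `https://example.com/page` containing this (hypothetical) markup:

```html
<head>
  <!-- All relative URLs below resolve against this base, not the page URL -->
  <base href="https://cdn.example.com/assets/">
  <link rel="stylesheet" href="app.css">
</head>
<body>
  <img src="logo.png">
</body>
```

the crawler resolves `logo.png` to `https://cdn.example.com/assets/logo.png` and `app.css` to `https://cdn.example.com/assets/app.css`, rather than resolving them against `https://example.com/`.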

Malformed URLs
--------------

When the crawler encounters a malformed URL in the HTML (for example, `href="https:///invalid"`), it will report it through your `crawlFailed` callback or observer instead of silently ignoring it. The `RequestException` message will contain the reason the URL could not be parsed.

```
use GuzzleHttp\Exception\RequestException;
use Spatie\Crawler\Crawler;
use Spatie\Crawler\CrawlProgress;

Crawler::create('https://example.com')
    ->onFailed(function (string $url, RequestException $exception, CrawlProgress $progress) {
        if (str_contains($exception->getMessage(), 'Malformed URL')) {
            echo "Found malformed URL: {$url}\n";
        }
    })
    ->start();
```

Resource types in collected URLs
--------------------------------

When using `foundUrls()`, each `CrawledUrl` includes the resource type:

```
$urls = Crawler::create('https://example.com')
    ->extractAll()
    ->foundUrls();

foreach ($urls as $crawledUrl) {
    echo "{$crawledUrl->resourceType->value}: {$crawledUrl->url} ({$crawledUrl->status})\n";
}
```
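Since `resourceType` is an enum on each `CrawledUrl`, the collection is easy to summarize. A sketch that counts collected URLs per resource type, reusing the property names from the snippet above:

```php
use Spatie\Crawler\Crawler;

$urls = Crawler::create('https://example.com')
    ->extractAll()
    ->foundUrls();

// Tally collected URLs per resource type (link, image, script, ...).
$counts = [];
foreach ($urls as $crawledUrl) {
    $type = $crawledUrl->resourceType->value;
    $counts[$type] = ($counts[$type] ?? 0) + 1;
}

foreach ($counts as $type => $count) {
    echo "{$type}: {$count}\n";
}
```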
