Custom link extraction | crawler | Spatie

 Custom link extraction
======================

You can customize how links are extracted from a page by creating a class that implements the `UrlParser` interface. The `extractUrls` method should return an array of `ExtractedUrl` objects:

```
use Spatie\Crawler\Enums\ResourceType;
use Spatie\Crawler\ExtractedUrl;
use Spatie\Crawler\UrlParsers\UrlParser;

class MyUrlParser implements UrlParser
{
    /** @return ExtractedUrl[] */
    public function extractUrls(string $html, string $baseUrl): array
    {
        // parse the HTML and return an array of discovered URLs
        return [
            new ExtractedUrl(
                url: 'https://example.com/page',
                linkText: 'Example page',
                resourceType: ResourceType::Link,
            ),
        ];
    }
}
```

Each `ExtractedUrl` has the following properties:

- `url`: the discovered URL
- `linkText`: the text content of the link (optional)
- `resourceType`: the type of resource (`Link`, `Image`, `Script`, `Stylesheet`, or `OpenGraphImage`)
- `malformedReason`: if set, the URL is treated as malformed and will be skipped

By default, the `LinkUrlParser` is used. It extracts URLs from `<a>` tags, `<link rel="next">`, and `<link rel="prev">` elements. When [resource extraction](/docs/crawler/v9/configuring-the-crawler/extracting-resources) is enabled, it also extracts images, scripts, stylesheets, and Open Graph images.

To use your custom parser, pass it to the `urlParser` method:

```
use Spatie\Crawler\Crawler;

Crawler::create('https://example.com')
    ->urlParser(new MyUrlParser())
    ->start();
```

Crawling sitemaps
-----------------

There is a built-in option to parse sitemaps instead of (or in addition to) following links. It supports sitemap index files.

```
use Spatie\Crawler\Crawler;

Crawler::create('https://example.com')
    ->parseSitemaps()
    ->start();
```
