Configuring requests | crawler | Spatie


Configuring requests
====================

### On this page

1. [User agent](#content-user-agent)
2. [Extra headers](#content-extra-headers)
3. [Timeouts](#content-timeouts)
4. [Authentication](#content-authentication)
5. [SSL verification](#content-ssl-verification)
6. [Proxy](#content-proxy)
7. [Cookies](#content-cookies)
8. [Query parameters](#content-query-parameters)
9. [Retrying failed requests](#content-retrying-failed-requests)
10. [Guzzle middleware](#content-guzzle-middleware)
11. [Custom Guzzle client options](#content-custom-guzzle-client-options)
12. [Redirects](#content-redirects)
13. [Streaming responses](#content-streaming-responses)

User agent
--------------------------------------------------------------------------------------

By default, the crawler identifies itself as `*`. You can set a custom user agent using the `userAgent` method.

```
use Spatie\Crawler\Crawler;

Crawler::create('https://example.com')
    ->userAgent('MyBot/1.0')
    ->start();
```

The user agent is also used when checking `robots.txt` rules, so make sure it matches any user-agent-specific rules you want to respect.
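As an illustration (the `robots.txt` rules and the `MyBot` name here are hypothetical), a site could target your crawler's user agent like this:

```
use Spatie\Crawler\Crawler;

// Suppose the target site's robots.txt contains a rule for your bot:
//
//   User-agent: MyBot
//   Disallow: /private
//
// Setting a matching user agent lets the crawler apply those rules:
Crawler::create('https://example.com')
    ->userAgent('MyBot')
    ->start();
```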

Extra headers
-----------------------------------------------------------------------------------------------

You can add extra headers to every request the crawler makes using the `headers` method.

```
use Spatie\Crawler\Crawler;

Crawler::create('https://example.com')
    ->headers([
        'Accept-Language' => 'en-US',
        'X-Custom-Header' => 'value',
    ])
    ->start();
```

The headers will be merged with the default headers. You can call `headers` multiple times. Each call will merge the new headers with the previously set ones.
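For instance, a sketch of two `headers` calls (the header names are illustrative):

```
use Spatie\Crawler\Crawler;

Crawler::create('https://example.com')
    ->headers(['Accept-Language' => 'en-US'])
    // Merged with the previous call; both headers are sent on every request.
    ->headers(['X-Request-Source' => 'crawler'])
    ->start();
```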

Timeouts
--------------------------------------------------------------------------------

By default, the crawler uses a 10-second timeout for both connecting and receiving a response. You can change these values with the `connectTimeout` and `requestTimeout` methods, both of which accept a value in seconds.

```
use Spatie\Crawler\Crawler;

Crawler::create('https://example.com')
    ->connectTimeout(5)
    ->requestTimeout(30)
    ->start();
```

The `connectTimeout` method sets the maximum number of seconds to wait while trying to connect to the server. The `requestTimeout` method sets the maximum number of seconds to wait for the entire request (including the response) to complete.

Authentication
--------------------------------------------------------------------------------------------------

When crawling sites that require authentication, you can use the `basicAuth` or `token` methods.

The `basicAuth` method configures HTTP Basic authentication for all requests.

```
use Spatie\Crawler\Crawler;

Crawler::create('https://example.com')
    ->basicAuth('username', 'password')
    ->start();
```

The `token` method sets an `Authorization` header. It defaults to the `Bearer` type.

```
use Spatie\Crawler\Crawler;

Crawler::create('https://example.com')
    ->token('your-api-token')
    ->start();
```

You can pass a second argument to change the token type.

```
use Spatie\Crawler\Crawler;

Crawler::create('https://example.com')
    ->token('your-api-token', 'Token')
    ->start();
```

SSL verification
--------------------------------------------------------------------------------------------------------

When crawling sites with self-signed or invalid SSL certificates (for example, a staging environment), you can disable certificate verification using the `withoutVerifying` method.

```
use Spatie\Crawler\Crawler;

Crawler::create('https://staging.example.com')
    ->withoutVerifying()
    ->start();
```

You should only use this for trusted environments. In production, always keep SSL verification enabled.

Proxy
-----------------------------------------------------------------------

You can route all crawler requests through a proxy server using the `proxy` method.

```
use Spatie\Crawler\Crawler;

Crawler::create('https://example.com')
    ->proxy('http://proxy-server:8080')
    ->start();
```

This accepts any proxy string supported by Guzzle, including authenticated proxies.

```
use Spatie\Crawler\Crawler;

Crawler::create('https://example.com')
    ->proxy('http://username:password@proxy-server:8080')
    ->start();
```

Cookies
-----------------------------------------------------------------------------

You can send cookies with every request using the `cookies` method. This is useful when crawling a site that requires a session cookie or other cookie-based authentication.

```
use Spatie\Crawler\Crawler;

Crawler::create('https://example.com')
    ->cookies(['session_id' => 'abc123', 'token' => 'xyz'], 'example.com')
    ->start();
```

The first argument is an array of cookie names and values. The second argument is the domain the cookies belong to.

Query parameters
--------------------------------------------------------------------------------------------------------

You can append query parameters to every request the crawler makes using the `queryParameters` method. This is useful for passing API keys or other parameters that need to be present on every request.

```
use Spatie\Crawler\Crawler;

Crawler::create('https://example.com')
    ->queryParameters(['api_key' => 'your-key'])
    ->start();
```

You can call `queryParameters` multiple times. Each call will merge the new parameters with the previously set ones.
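For example, a sketch of two merged calls (the parameter names are illustrative):

```
use Spatie\Crawler\Crawler;

Crawler::create('https://example.com')
    ->queryParameters(['api_key' => 'your-key'])
    // Merged with the previous call; both parameters are appended to every URL.
    ->queryParameters(['locale' => 'en'])
    ->start();
```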

Retrying failed requests
--------------------------------------------------------------------------------------------------------------------------------

Some servers occasionally return 5xx errors or drop connections. You can configure the crawler to automatically retry failed requests using the `retry` method.

```
use Spatie\Crawler\Crawler;

Crawler::create('https://example.com')
    ->retry(times: 2, delayInMs: 500)
    ->start();
```

The first argument is the maximum number of retries per request. The second argument is the base delay between retries in milliseconds. The delay increases linearly with each attempt (500ms, 1000ms, 1500ms, ...).

A request will be retried when it results in a connection error or a 5xx response status code.
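The linear schedule can be sketched in plain PHP (`retryDelays` is a hypothetical helper for illustration, not part of the crawler API):

```
<?php

// Computes the linear backoff schedule described above:
// each retry waits attempt × delayInMs milliseconds.
function retryDelays(int $times, int $delayInMs): array
{
    $delays = [];
    for ($attempt = 1; $attempt <= $times; $attempt++) {
        $delays[] = $attempt * $delayInMs;
    }

    return $delays;
}

retryDelays(3, 500); // [500, 1000, 1500]
```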

Guzzle middleware
-----------------------------------------------------------------------------------------------------------

You can add custom [Guzzle middleware](https://docs.guzzlephp.org/en/stable/handlers-and-middleware.html) to the underlying HTTP client using the `middleware` method. This lets you hook into the request/response lifecycle for logging, caching, modifying headers, or any other purpose.

```
use GuzzleHttp\Middleware;
use Psr\Http\Message\RequestInterface;
use Spatie\Crawler\Crawler;

Crawler::create('https://example.com')
    ->middleware(Middleware::mapRequest(function (RequestInterface $request) {
        return $request->withHeader('X-Custom-Header', 'value');
    }), 'add-custom-header')
    ->start();
```

The first argument is a callable that follows Guzzle's middleware signature. The optional second argument is a name for the middleware, which can be useful for debugging.

You can call `middleware` multiple times to add multiple middlewares. They will be pushed onto the handler stack in the order they are added.

```
use GuzzleHttp\Middleware;
use Spatie\Crawler\Crawler;

// $loggingMiddleware and $cachingMiddleware are placeholders for
// middlewares you have defined elsewhere.
Crawler::create('https://example.com')
    ->middleware($loggingMiddleware, 'logging')
    ->middleware($cachingMiddleware, 'caching')
    ->start();
```

Custom Guzzle client options
--------------------------------------------------------------------------------------------------------------------------------------------

The second argument to `Crawler::create()` accepts an array of [Guzzle request options](https://docs.guzzlephp.org/en/stable/request-options.html). These are merged with the crawler's defaults, so you only need to specify the options you want to change.

```
use GuzzleHttp\RequestOptions;
use Spatie\Crawler\Crawler;

Crawler::create('https://example.com', [
    RequestOptions::STREAM => true,
    RequestOptions::TIMEOUT => 30,
])->start();
```

The defaults are:

```
[
    RequestOptions::COOKIES => true,
    RequestOptions::CONNECT_TIMEOUT => 10,
    RequestOptions::TIMEOUT => 10,
    RequestOptions::ALLOW_REDIRECTS => ['track_redirects' => true],
    RequestOptions::HEADERS => ['User-Agent' => '*'],
]
```

To explicitly remove a default option, set it to `null`:

```
use GuzzleHttp\RequestOptions;
use Spatie\Crawler\Crawler;

Crawler::create('https://example.com', [
    RequestOptions::COOKIES => null, // removes the default COOKIES option entirely
])->start();
```

Redirects
-----------------------------------------------------------------------------------

By default, the crawler follows redirects and tracks the redirect chain. This means that when a URL redirects to another location, the crawler will follow the redirect and use the final URL as the base for extracting links.

If you need to disable redirect following, you can pass custom client options:

```
use GuzzleHttp\RequestOptions;
use Spatie\Crawler\Crawler;

Crawler::create('https://example.com', [
    RequestOptions::ALLOW_REDIRECTS => false,
])->start();
```

Streaming responses
-----------------------------------------------------------------------------------------------------------------

For sites with large responses, you can enable streaming to reduce memory usage. When streaming is enabled, response bodies are read in chunks rather than loaded entirely into memory.

```
use Spatie\Crawler\Crawler;

Crawler::create('https://example.com')
    ->stream()
    ->start();
```
