CrawlerConfig

org.llm4s.rag.loader.CrawlerConfig
See the CrawlerConfig companion object
final case class CrawlerConfig(maxDepth: Int, maxPages: Int, followPatterns: Seq[String], excludePatterns: Seq[String], respectRobotsTxt: Boolean, delayMs: Int, timeoutMs: Int, userAgent: String, maxQueueSize: Int, includeQueryParams: Boolean, sameDomainOnly: Boolean, acceptContentTypes: Set[String])

Configuration for web crawling.

Controls how the WebCrawlerLoader discovers and fetches web pages.
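
As a rough illustration of the signature above, the sketch below builds a configuration with every field spelled out. The case class here is a local stand-in that copies the documented parameter list so the snippet compiles without llm4s on the classpath; with the library available you would import org.llm4s.rag.loader.CrawlerConfig instead. All values, including the user agent string and the URL patterns, are illustrative assumptions rather than library defaults.

```scala
object CrawlerConfigExample {
  // Local stand-in mirroring the documented signature (not the library class itself).
  final case class CrawlerConfig(
    maxDepth: Int,
    maxPages: Int,
    followPatterns: Seq[String],
    excludePatterns: Seq[String],
    respectRobotsTxt: Boolean,
    delayMs: Int,
    timeoutMs: Int,
    userAgent: String,
    maxQueueSize: Int,
    includeQueryParams: Boolean,
    sameDomainOnly: Boolean,
    acceptContentTypes: Set[String]
  )

  // Illustrative values only; none of these are claimed to be the library defaults.
  val config: CrawlerConfig = CrawlerConfig(
    maxDepth = 2,                                   // seeds plus two levels of links
    maxPages = 200,
    followPatterns = Seq("https://example.com/docs/*"),
    excludePatterns = Seq("*.pdf"),
    respectRobotsTxt = true,
    delayMs = 500,                                  // half a second between requests
    timeoutMs = 30000,
    userAgent = "ExampleBot/1.0",                   // hypothetical user agent
    maxQueueSize = 5000,
    includeQueryParams = false,
    sameDomainOnly = true,
    acceptContentTypes = Set("text/html", "application/xhtml+xml")
  )
}
```

Named arguments keep a twelve-parameter constructor readable and guard against swapping two Int or Boolean fields by position.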

Value parameters

acceptContentTypes

Content types to process (others are skipped)

delayMs

Delay between requests in milliseconds (rate limiting)

excludePatterns

URL patterns to exclude (glob syntax)

followPatterns

URL patterns to follow (glob syntax with asterisk wildcards)

includeQueryParams

Whether to treat URLs with different query params as distinct pages

maxDepth

Maximum link depth to follow from seed URLs (0 = seed URLs only)

maxPages

Maximum total pages to crawl

maxQueueSize

Maximum number of URLs to queue (prevents unbounded memory usage)

respectRobotsTxt

Whether to respect robots.txt directives

sameDomainOnly

Whether to restrict crawling to the same domain as seed URLs

timeoutMs

HTTP request timeout in milliseconds

userAgent

User agent string for HTTP requests
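
To make the glob-style followPatterns and excludePatterns more concrete, here is a minimal sketch of how such patterns could be evaluated against a URL. This is not taken from the WebCrawlerLoader source. The helper names globToRegex and shouldVisit are hypothetical, and the sketch assumes that an asterisk matches any run of characters, that a pattern must match the whole URL, that an empty follow list means "follow everything", and that an exclude match overrides a follow match.

```scala
import java.util.regex.Pattern

object GlobSketch {
  // Hypothetical helper: turn a glob into a regex. Each `*` becomes `.*`;
  // every other character is quoted so it is matched literally.
  def globToRegex(glob: String): Pattern = {
    val body = glob
      .split("\\*", -1)                       // limit -1 keeps trailing empty segments
      .map(part => Pattern.quote(part))
      .mkString(".*")
    Pattern.compile(body)
  }

  // Hypothetical helper: decide whether a URL passes both pattern lists.
  // Assumption: empty `follow` means no restriction; `exclude` always wins.
  def shouldVisit(url: String, follow: Seq[String], exclude: Seq[String]): Boolean = {
    val followed = follow.isEmpty || follow.exists(p => globToRegex(p).matcher(url).matches())
    val excluded = exclude.exists(p => globToRegex(p).matcher(url).matches())
    followed && !excluded
  }
}
```

Under these assumptions, a follow pattern of https://example.com/docs/* combined with an exclude pattern of *.pdf would admit https://example.com/docs/intro but reject https://example.com/docs/manual.pdf.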

Attributes

Companion
object
Supertypes
trait Serializable
trait Product
trait Equals
class Object
trait Matchable
class Any

Members list

Value members

Concrete methods

def withContentTypes(types: Set[String]): CrawlerConfig

Set acceptable content types


def withDelay(ms: Int): CrawlerConfig

Set rate limit delay


def withExcludePatterns(patterns: String*): CrawlerConfig

Set URL patterns to exclude


def withFollowPatterns(patterns: String*): CrawlerConfig

Set URL patterns to follow


def withMaxDepth(depth: Int): CrawlerConfig

Set max crawl depth


def withMaxPages(pages: Int): CrawlerConfig

Set max pages to crawl


Set max queue size


def withQueryParams(include: Boolean): CrawlerConfig

Set whether to include query parameters in URL comparison


def withRobotsTxt(respect: Boolean): CrawlerConfig

Set whether to respect robots.txt


def withSameDomainOnly(enabled: Boolean): CrawlerConfig

Set whether to restrict to same domain


def withTimeout(ms: Int): CrawlerConfig

Set request timeout


def withUserAgent(ua: String): CrawlerConfig

Set user agent
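
The with* methods each return a new CrawlerConfig, so they can be chained. The sketch below shows that style on a trimmed, hypothetical stand-in that keeps only four of the twelve fields. It assumes each method is a thin wrapper over the case class copy method and that withFollowPatterns replaces the existing list rather than appending to it; the page does not state either detail, so check the library source before relying on them.

```scala
object BuilderSketch {
  // Trimmed stand-in for illustration; the real class has twelve fields.
  final case class CrawlerConfig(
    maxDepth: Int,
    maxPages: Int,
    followPatterns: Seq[String],
    delayMs: Int
  ) {
    // Assumed implementations: each method copies the config with one field changed.
    def withMaxDepth(depth: Int): CrawlerConfig = copy(maxDepth = depth)
    def withMaxPages(pages: Int): CrawlerConfig = copy(maxPages = pages)
    def withFollowPatterns(patterns: String*): CrawlerConfig =
      copy(followPatterns = patterns.toSeq)
    def withDelay(ms: Int): CrawlerConfig = copy(delayMs = ms)
  }

  // Illustrative starting point, not the library defaults.
  val base: CrawlerConfig =
    CrawlerConfig(maxDepth = 3, maxPages = 1000, followPatterns = Seq.empty, delayMs = 500)

  // Each call returns a fresh value; `base` itself is never modified.
  val docsOnly: CrawlerConfig = base
    .withMaxDepth(1)
    .withMaxPages(50)
    .withFollowPatterns("https://example.com/docs/*")
    .withDelay(1000)
}
```

Because the configuration is an immutable case class, a shared base value can be specialised per crawl without one caller's changes leaking into another's.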


Inherited methods

def productElementNames: Iterator[String]

Inherited from: Product

def productIterator: Iterator[Any]

Inherited from: Product
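
The two inherited Product methods can be zipped to list a configuration's field names alongside their values, which is handy for logging. The sketch below uses a small hypothetical case class, MiniConfig, instead of the full CrawlerConfig; the same calls apply to any case class on Scala 2.13 or later.

```scala
object ProductSketch {
  // Hypothetical two-field case class standing in for CrawlerConfig.
  final case class MiniConfig(maxDepth: Int, userAgent: String)

  val cfg: MiniConfig = MiniConfig(2, "ExampleBot/1.0")

  // Pair each field name with its value, in declaration order.
  val fields: List[(String, Any)] =
    cfg.productElementNames.zip(cfg.productIterator).toList

  // Render as "name=value" pairs for a log line.
  val rendered: String =
    fields.map { case (name, value) => s"$name=$value" }.mkString(", ")
}
```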