org.llm4s.rag.loader.CrawlerConfig
See the CrawlerConfig companion object
final case class CrawlerConfig(maxDepth: Int, maxPages: Int, followPatterns: Seq[String], excludePatterns: Seq[String], respectRobotsTxt: Boolean, delayMs: Int, timeoutMs: Int, userAgent: String, maxQueueSize: Int, includeQueryParams: Boolean, sameDomainOnly: Boolean, acceptContentTypes: Set[String])
Configuration for web crawling.
Controls how the WebCrawlerLoader discovers and fetches web pages.
Value parameters
- acceptContentTypes: Content types to process (others are skipped)
- delayMs: Delay between requests in milliseconds (rate limiting)
- excludePatterns: URL patterns to exclude (glob syntax)
- followPatterns: URL patterns to follow (glob syntax with asterisk wildcards)
- includeQueryParams: Whether to treat URLs with different query params as distinct pages
- maxDepth: Maximum link depth to follow from seed URLs (0 = seed URLs only)
- maxPages: Maximum total pages to crawl
- maxQueueSize: Maximum number of URLs to queue (prevents unbounded memory usage)
- respectRobotsTxt: Whether to respect robots.txt directives
- sameDomainOnly: Whether to restrict crawling to the same domain as seed URLs
- timeoutMs: HTTP request timeout in milliseconds
- userAgent: User agent string for HTTP requests
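The parameters above can be combined as in the following minimal sketch. To keep it compilable on its own, the case class is redeclared locally with the signature documented on this page; in a real project you would import org.llm4s.rag.loader.CrawlerConfig instead. Every value shown (the example.com URL, the pattern strings, the limits, the user agent) is an illustrative assumption, not a library default, and the exact glob semantics applied to followPatterns and excludePatterns are not specified here.

```scala
// Local stand-in mirroring the documented signature so the sketch compiles alone.
// In real code: import org.llm4s.rag.loader.CrawlerConfig
final case class CrawlerConfig(
  maxDepth: Int,
  maxPages: Int,
  followPatterns: Seq[String],
  excludePatterns: Seq[String],
  respectRobotsTxt: Boolean,
  delayMs: Int,
  timeoutMs: Int,
  userAgent: String,
  maxQueueSize: Int,
  includeQueryParams: Boolean,
  sameDomainOnly: Boolean,
  acceptContentTypes: Set[String]
)

// A polite, bounded crawl of a hypothetical documentation site.
// All values are assumptions chosen for illustration only.
val docsCrawl: CrawlerConfig = CrawlerConfig(
  maxDepth = 2,                                       // seeds plus two levels of links
  maxPages = 200,                                     // hard cap on pages fetched
  followPatterns = Seq("https://example.com/docs/*"), // hypothetical glob pattern
  excludePatterns = Seq("*/changelog/*"),             // hypothetical glob pattern
  respectRobotsTxt = true,
  delayMs = 500,                                      // wait half a second between requests
  timeoutMs = 10000,                                  // 10 second HTTP timeout
  userAgent = "example-docs-crawler/0.1",             // hypothetical user agent
  maxQueueSize = 5000,                                // bound on queued URLs
  includeQueryParams = false,                         // ?a=1 and ?a=2 count as one page
  sameDomainOnly = true,
  acceptContentTypes = Set("text/html")
)

// copy derives a variant without repeating every field;
// maxDepth = 0 limits the crawl to the seed URLs themselves.
val seedsOnly: CrawlerConfig = docsCrawl.copy(maxDepth = 0, maxPages = 10)
```

Because CrawlerConfig is a case class, copy leaves the original value untouched, so one base configuration can be shared and specialised per crawl.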
Attributes
- Companion: object
- Supertypes: trait Serializable, trait Product, trait Equals, class Object, trait Matchable, class Any