WebCrawlerLoader

org.llm4s.rag.loader.WebCrawlerLoader
See the WebCrawlerLoader companion object
final case class WebCrawlerLoader(seedUrls: Seq[String], config: CrawlerConfig, metadata: Map[String, String]) extends DocumentLoader

Load documents by crawling from seed URLs.

Features:

  • Breadth-first link discovery
  • Domain/pattern restrictions
  • robots.txt support
  • Rate limiting
  • HTML to text conversion
  • Deduplication by URL
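
The traversal behind this feature list can be pictured with a small, self-contained sketch. It is not the library's implementation: `CrawlSketch`, `crawlOrder`, and the `linksOf` function (a stand-in for fetching a page and extracting its links) are hypothetical names, and robots.txt handling, rate limiting, and pattern filtering are left out. It illustrates only breadth-first discovery with deduplication by URL, bounded by a depth limit and a page limit.

```scala
import scala.collection.mutable

object CrawlSketch {
  // Hypothetical stand-in for "fetch the page and extract its links".
  type LinkSource = String => Seq[String]

  /** Breadth-first traversal from the seeds, visiting each URL at most once. */
  def crawlOrder(
      seeds: Seq[String],
      linksOf: LinkSource,
      maxDepth: Int,
      maxPages: Int
  ): Vector[String] = {
    // Insertion-ordered set: deduplicates by URL and remembers visit order.
    val visited = mutable.LinkedHashSet.empty[String]
    // FIFO queue of (url, depth) pairs gives breadth-first order.
    val queue = mutable.Queue.empty[(String, Int)]
    seeds.foreach(seed => queue.enqueue((seed, 0)))

    while (queue.nonEmpty && visited.size < maxPages) {
      val (url, depth) = queue.dequeue()
      if (!visited.contains(url)) {
        visited += url
        // Only expand links while still above the depth limit.
        if (depth < maxDepth) {
          linksOf(url).foreach { link =>
            if (!visited.contains(link)) queue.enqueue((link, depth + 1))
          }
        }
      }
    }
    visited.toVector
  }
}
```

Because a `Map[String, Seq[String]]` is itself a function from URL to links, an in-memory link graph can be passed as `linksOf` to try the sketch without any network access.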

Value parameters

config

Crawler configuration

metadata

Additional metadata for all documents

seedUrls

Starting URLs to crawl from

Attributes

Companion
object
Supertypes
trait Serializable
trait Product
trait Equals
class Object
trait Matchable
class Any

Members list

Value members

Concrete methods

def description: String

Human-readable description of this loader.

Used for logging and debugging.

override def estimatedCount: Option[Int]

Estimated number of documents (if known).

Used for progress reporting and resource allocation. Returns None if count is unknown or expensive to compute.

Attributes

Definition Classes
DocumentLoader

def load(): Iterator[LoadResult]

Load documents from this source.

Returns an iterator of LoadResult for streaming large document sets. Each result is either a successfully loaded document or a loading error. This allows processing to continue even when some documents fail.

Attributes

Returns

Iterator of load results (successes and failures)
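
Since the iterator mixes successes and failures, a caller typically separates the two while draining it. The sketch below assumes a stand-in shape for the results: the real `LoadResult` type is not shown on this page, so `Either[String, String]` (error message on the left, document text on the right) is used purely for illustration, and `LoadResultSketch` and `drain` are hypothetical names.

```scala
object LoadResultSketch {
  // Hypothetical stand-in for LoadResult:
  // Left is a loading error, Right is the text of a loaded document.
  type Result = Either[String, String]

  /** Consume the iterator once, collecting documents and errors separately. */
  def drain(results: Iterator[Result]): (Vector[String], Vector[String]) =
    results.foldLeft((Vector.empty[String], Vector.empty[String])) {
      case ((docs, errors), Right(doc)) => (docs :+ doc, errors)
      case ((docs, errors), Left(err))  => (docs, errors :+ err)
    }
}
```

A failed page ends up in the second vector and does not interrupt the fold, which mirrors the "processing continues when some documents fail" behaviour described above. An iterator can be traversed only once, so both collections are built in a single pass.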

def withDelay(ms: Int): WebCrawlerLoader

Set the rate-limit delay, in milliseconds.

def withExcludePatterns(patterns: String*): WebCrawlerLoader

Set exclude patterns

def withFollowPatterns(patterns: String*): WebCrawlerLoader

Set follow patterns

def withMaxDepth(depth: Int): WebCrawlerLoader

Set the maximum crawl depth.

def withMaxPages(pages: Int): WebCrawlerLoader

Set the maximum number of pages to crawl.

Set maximum queue size

def withMetadata(m: Map[String, String]): WebCrawlerLoader

Add metadata

def withQueryParams(include: Boolean): WebCrawlerLoader

Set whether to include query parameters

def withRobotsTxt(respect: Boolean): WebCrawlerLoader

Set whether to respect robots.txt

def withSameDomainOnly(enabled: Boolean): WebCrawlerLoader

Set whether to restrict to same domain

def withSeed(url: String): WebCrawlerLoader

Add a seed URL

def withSeeds(urls: String*): WebCrawlerLoader

Add multiple seed URLs

Set request timeout
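
Each of the `with...` methods above returns a `WebCrawlerLoader`, which suggests the usual immutable-builder pattern for case classes: every call yields a modified copy and leaves the original untouched. The sketch below shows that pattern with simplified stand-ins. `BuilderSketch`, `Config`, and `Loader` are hypothetical, the default values are arbitrary placeholders, and the fields of the real `CrawlerConfig` are not shown on this page.

```scala
object BuilderSketch {
  // Hypothetical, simplified stand-in for CrawlerConfig; defaults are placeholders.
  final case class Config(maxDepth: Int = 3, delayMs: Int = 500)

  // Hypothetical stand-in for the loader, with a handful of builder methods.
  final case class Loader(
      seedUrls: Seq[String],
      config: Config = Config(),
      metadata: Map[String, String] = Map.empty
  ) {
    // Each method returns a new Loader built with copy; `this` is never mutated.
    def withSeed(url: String): Loader = copy(seedUrls = seedUrls :+ url)
    def withMaxDepth(depth: Int): Loader = copy(config = config.copy(maxDepth = depth))
    def withDelay(ms: Int): Loader = copy(config = config.copy(delayMs = ms))
    // Assumption: "Add metadata" merges into the existing map rather than replacing it.
    def withMetadata(m: Map[String, String]): Loader = copy(metadata = metadata ++ m)
  }
}
```

With this shape, calls chain naturally, for example `Loader(Seq("https://example.com")).withMaxDepth(1).withDelay(250)`, and a base loader can be reused safely as the starting point for several differently tuned variants.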

Inherited methods

Combine this loader with another.

Creates a composite loader that loads from both sources.

Attributes

Inherited from:
DocumentLoader
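
One way to picture a composite loader is as a wrapper whose `load()` chains the iterators of its two parts. The sketch below is an illustration under assumptions, not the library's code: `CompositeSketch`, `Source`, `Fixed`, and `Both` are hypothetical names, results are reduced to plain strings, and the first-then-second ordering is a guess, since the description only states that both sources are loaded.

```scala
object CompositeSketch {
  // Hypothetical stand-in for DocumentLoader, reduced to a single method.
  trait Source {
    def load(): Iterator[String]
  }

  // A source backed by an in-memory sequence, used only to exercise the composite.
  final case class Fixed(items: Seq[String]) extends Source {
    def load(): Iterator[String] = items.iterator
  }

  // Yields everything from `first`, then everything from `second`.
  // Iterator ++ is lazy, so `second.load()` is not called until `first` is exhausted.
  final case class Both(first: Source, second: Source) extends Source {
    def load(): Iterator[String] = first.load() ++ second.load()
  }
}
```

Because the composite is itself a `Source`, composites can be nested to combine more than two loaders.
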
def productElementNames: Iterator[String]

Attributes

Inherited from:
Product
def productIterator: Iterator[Any]

Attributes

Inherited from:
Product