
WebCrawlerLoader

org.llm4s.rag.loader.WebCrawlerLoader

See the WebCrawlerLoader companion object

final case class WebCrawlerLoader(seedUrls: Seq[String], config: CrawlerConfig, metadata: Map[String, String]) extends DocumentLoader

Load documents by crawling from seed URLs.

Features:

Breadth-first link discovery
Domain/pattern restrictions
robots.txt support
Rate limiting
HTML to text conversion
Deduplication by URL
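
The combination of breadth-first discovery, domain restriction, and URL deduplication can be illustrated with a minimal sketch. It is not the WebCrawlerLoader implementation: `CrawlSketch`, `crawl`, `LinkFetcher`, and every parameter name are hypothetical, and link fetching is replaced by a plain function so the traversal can be run without network access (Scala 2.13 or later is assumed).

```scala
import java.net.URI
import scala.collection.immutable.Queue
import scala.util.Try

object CrawlSketch {
  // Hypothetical stand-in for "fetch a page and extract its links".
  type LinkFetcher = String => Seq[String]

  /** Breadth-first traversal from the seeds; returns URLs in visit order. */
  def crawl(
    seeds: Seq[String],
    fetchLinks: LinkFetcher,
    maxDepth: Int,
    maxPages: Int,
    sameDomainOnly: Boolean
  ): Vector[String] = {
    // Hosts of the seed URLs define the "same domain" restriction.
    val allowedHosts = seeds.flatMap(hostOf).toSet

    @annotation.tailrec
    def loop(
      queue: Queue[(String, Int)],
      seen: Set[String],
      visited: Vector[String]
    ): Vector[String] =
      if (visited.size >= maxPages) visited
      else
        queue.dequeueOption match {
          case None => visited
          case Some(((url, depth), rest)) =>
            // Links are followed only while the depth limit allows it.
            val candidates =
              if (depth >= maxDepth) Seq.empty[String]
              else
                fetchLinks(url).distinct.filter { link =>
                  !seen.contains(link) &&
                  (!sameDomainOnly || hostOf(link).exists(allowedHosts))
                }
            loop(
              rest.enqueueAll(candidates.map(_ -> (depth + 1))),
              seen ++ candidates, // deduplication by URL
              visited :+ url
            )
        }

    val start = seeds.distinct
    loop(Queue.from(start.map(_ -> 0)), start.toSet, Vector.empty)
  }

  // Returns None for malformed URLs or URLs without a host.
  private def hostOf(url: String): Option[String] =
    Try(Option(new URI(url).getHost)).toOption.flatten
}
```

A FIFO queue gives the breadth-first order, and the `seen` set ensures each URL is enqueued at most once. Rate limiting, robots.txt handling, and HTML-to-text conversion are left out of this sketch.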

Value parameters

config: Crawler configuration
metadata: Additional metadata for all documents
seedUrls: Starting URLs to crawl from

Attributes

Companion: object
Supertypes: trait Serializable, trait Product, trait Equals, trait DocumentLoader, class Object, trait Matchable, class Any

Members list

Value members

Concrete methods

Human-readable description of this loader.

Used for logging and debugging.


Estimated number of documents (if known).

Used for progress reporting and resource allocation. Returns None if count is unknown or expensive to compute.
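
As a rough illustration of how an optional estimate might drive progress reporting, the sketch below formats a progress line and falls back to a bare counter when no estimate is available. `ProgressSketch`, `progress`, and the `Option[Int]` element type are assumptions made for the example; the page only states that `None` is returned when the count is unknown.

```scala
object ProgressSketch {
  /** Formats a progress line; uses a bare counter when the total is unknown. */
  def progress(loaded: Int, estimatedCount: Option[Int]): String =
    estimatedCount match {
      case Some(total) if total > 0 =>
        // The estimate may be too low, so the percentage is capped at 100.
        val pct = math.min(100, loaded * 100 / total)
        s"$loaded/$total documents ($pct%)"
      case _ =>
        s"$loaded documents (total unknown)"
    }
}
```

For a crawler the total is typically not known up front, so callers should expect to take the fallback branch.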

Attributes

Definition Classes: DocumentLoader

Load documents from this source.

Returns an iterator of LoadResult for streaming large document sets. Each result is either a successfully loaded document or a loading error. This allows processing to continue even when some documents fail.

Attributes

Returns: Iterator of load results (successes and failures)
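
The streaming contract described above (one result per document, failures reported inline rather than thrown) can be sketched with stand-in types. `Result`, `Loaded`, `Failed`, `Summary`, and `drain` are hypothetical names; the real LoadResult type in llm4s may be shaped differently.

```scala
object LoadResultSketch {
  // Hypothetical stand-ins for a per-document success or failure.
  sealed trait Result
  final case class Loaded(id: String, text: String) extends Result
  final case class Failed(source: String, reason: String) extends Result

  final case class Summary(loaded: Int, failures: Vector[Failed])

  /** Consumes the iterator once, handling each document as it arrives. */
  def drain(results: Iterator[Result])(handle: Loaded => Unit): Summary =
    results.foldLeft(Summary(0, Vector.empty)) {
      case (acc, doc: Loaded) =>
        handle(doc)
        acc.copy(loaded = acc.loaded + 1)
      case (acc, err: Failed) =>
        // A failure is recorded and processing continues with the next result.
        acc.copy(failures = acc.failures :+ err)
    }
}
```

Because the input is an iterator, documents are processed one at a time and the full set never needs to be held in memory; only the failures are collected for later reporting.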

Set rate limit delay

Set exclude patterns

Set follow patterns

Set max depth

Set max pages

Set maximum queue size

Add metadata

Set whether to include query parameters

Set whether to respect robots.txt

Set whether to restrict to same domain

Add a seed URL

Add multiple seed URLs

Set request timeout

Inherited methods

Combine this loader with another.

Creates a composite loader that loads from both sources.

Attributes

Inherited from: DocumentLoader
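
One way a composite loader of this kind could work is by concatenating the iterators of the two sources. The sketch below uses a deliberately reduced interface: `SketchLoader`, `combine`, and `fixed` are hypothetical, documents are represented as plain strings, and the real DocumentLoader method for combining loaders may have a different name and signature.

```scala
object CompositeSketch {
  // Hypothetical minimal loader interface for illustration.
  trait SketchLoader { self =>
    def load(): Iterator[String]
    def description: String

    /** Loads everything from this loader, then everything from `other`. */
    def combine(other: SketchLoader): SketchLoader = new SketchLoader {
      // Iterator's ++ takes its argument by name, so the second source
      // is only opened once the first has been exhausted.
      def load(): Iterator[String] = self.load() ++ other.load()
      def description: String = s"${self.description} + ${other.description}"
    }
  }

  /** Test helper: a loader over a fixed set of documents. */
  def fixed(name: String, docs: String*): SketchLoader = new SketchLoader {
    def load(): Iterator[String] = docs.iterator
    def description: String = name
  }
}
```

Because the result is itself a loader, combinations nest: `a.combine(b).combine(c)` yields the documents of `a`, then `b`, then `c`.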

