Load documents by crawling from seed URLs.
Features:
- Breadth-first link discovery
- Domain/pattern restrictions
- robots.txt support
- Rate limiting
- HTML to text conversion
- Deduplication by URL
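The breadth-first discovery, URL deduplication, and depth/page limits listed above can be sketched without any network access. This is a minimal illustration only: `CrawlSketch`, `crawl`, and the `links` function (which stands in for fetching a page and extracting its links) are hypothetical names, not part of this loader's API.

```scala
import scala.collection.mutable

object CrawlSketch {
  /** Breadth-first traversal from the seeds, visiting each URL at most once.
    * `links` is a hypothetical stand-in for "fetch the page and extract its links".
    */
  def crawl(
      seeds: Seq[String],
      links: String => Seq[String],
      maxDepth: Int,
      maxPages: Int
  ): Vector[String] = {
    val seen    = mutable.Set.empty[String]           // deduplication by URL
    val queue   = mutable.Queue.empty[(String, Int)]  // (url, depth) in FIFO order
    val visited = Vector.newBuilder[String]
    var count   = 0

    seeds.foreach(s => if (seen.add(s)) queue.enqueue((s, 0)))

    while (queue.nonEmpty && count < maxPages) {
      val (url, depth) = queue.dequeue()
      visited += url
      count += 1
      // Only expand links while below the depth limit.
      if (depth < maxDepth)
        links(url).foreach(next => if (seen.add(next)) queue.enqueue((next, depth + 1)))
    }
    visited.result()
  }
}
```

Because the queue is first-in, first-out, every page at depth n is visited before any page at depth n + 1, which is what makes the traversal breadth-first.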
Value parameters
- config: Crawler configuration
- metadata: Additional metadata for all documents
- seedUrls: Starting URLs to crawl from
Attributes
- Companion: object
- Supertypes: trait Serializable, trait Product, trait Equals, trait DocumentLoader, class Object, trait Matchable, class Any
Members list
Value members
Concrete methods
Human-readable description of this loader.
Used for logging and debugging.
Estimated number of documents (if known).
Used for progress reporting and resource allocation. Returns None if the count is unknown or expensive to compute.
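A caller doing progress reporting has to handle both the known and the unknown case of the estimate. The sketch below assumes the estimate arrives as an `Option[Int]`; `ProgressSketch` and `progress` are hypothetical helpers, not members of this loader.

```scala
object ProgressSketch {
  /** Formats a progress line, falling back to a plain counter when no estimate exists. */
  def progress(loaded: Int, estimated: Option[Int]): String =
    estimated match {
      case Some(total) if total > 0 => s"$loaded/$total (${loaded * 100 / total}%)"
      case _                        => s"$loaded loaded (total unknown)"
    }
}
```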
Load documents from this source.
Returns an iterator of LoadResult for streaming large document sets. Each result is either a successfully loaded document or a loading error, so processing can continue even when some documents fail.
Attributes
- Returns: Iterator of load results (successes and failures)
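To illustrate how a consumer might process such a stream without stopping at the first failure, here is a sketch with stand-in types. `Result`, `Loaded`, and `Failed` are hypothetical; the real LoadResult type may be shaped differently.

```scala
object LoadResultSketch {
  // Hypothetical stand-ins for the success and failure cases of a load result.
  sealed trait Result
  final case class Loaded(url: String, text: String) extends Result
  final case class Failed(url: String, reason: String) extends Result

  /** Consumes the iterator one element at a time, tallying (successes, failures). */
  def summarize(results: Iterator[Result]): (Int, Int) =
    results.foldLeft((0, 0)) {
      case ((ok, bad), _: Loaded) => (ok + 1, bad)
      case ((ok, bad), _: Failed) => (ok, bad + 1)
    }
}
```

Since the iterator is consumed element by element, a failed page costs one entry in the tally rather than aborting the whole load.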
Set rate limit delay.
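A rate limit delay can be pictured as a minimum gap between consecutive requests. The following is a minimal sketch, assuming the delay is expressed in milliseconds; `RateLimitSketch` and its methods are hypothetical and do not reflect how this loader implements throttling.

```scala
object RateLimitSketch {
  /** Milliseconds still to wait before the next request may be sent. */
  def remainingDelayMs(lastRequestMs: Long, nowMs: Long, delayMs: Long): Long =
    math.max(0L, delayMs - (nowMs - lastRequestMs))

  /** Blocks until at least delayMs has passed since the last request,
    * then returns the timestamp to record for the request about to be made.
    */
  def throttle(lastRequestMs: Long, delayMs: Long): Long = {
    val waitMs = remainingDelayMs(lastRequestMs, System.currentTimeMillis(), delayMs)
    if (waitMs > 0) Thread.sleep(waitMs)
    System.currentTimeMillis()
  }
}
```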
Set exclude patterns.
Set follow patterns.
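One plausible way for follow and exclude patterns to interact is sketched below. It assumes the patterns are regular expressions matched against the whole URL and that exclusion wins over inclusion; the pattern syntax this loader actually accepts is not stated here, and `PatternSketch` is a hypothetical name.

```scala
object PatternSketch {
  /** A URL is followed if it matches some follow pattern (or none are set)
    * and matches no exclude pattern. Patterns are treated as full-match regexes.
    */
  def shouldFollow(url: String, follow: Seq[String], exclude: Seq[String]): Boolean = {
    val included = follow.isEmpty || follow.exists(p => url.matches(p))
    val excluded = exclude.exists(p => url.matches(p))
    included && !excluded
  }
}
```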
Set max depth.
Set max pages.
Set maximum queue size.
Add metadata.
Set whether to include query parameters.
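Whether query parameters are kept matters most for deduplication, since `?page=1` and `?page=2` either count as one URL or as two. The sketch below assumes the flag controls the key used to decide whether a URL was already seen; `UrlKeySketch` and `key` are hypothetical and the loader's real normalization may differ.

```scala
object UrlKeySketch {
  /** Builds a deduplication key: always drops the fragment,
    * and drops the query string too unless includeQuery is set.
    */
  def key(url: String, includeQuery: Boolean): String = {
    val noFragment = url.takeWhile(_ != '#')
    if (includeQuery) noFragment else noFragment.takeWhile(_ != '?')
  }
}
```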
Set whether to respect robots.txt.
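Respecting robots.txt means checking each path against the site's Disallow rules before fetching it. The sketch below is a deliberately simplified reading of the format: it only handles the `User-agent: *` group and plain path prefixes, ignoring Allow rules, wildcards, and groups that list several user agents. `RobotsSketch` is a hypothetical name and not this loader's parser.

```scala
object RobotsSketch {
  /** Collects Disallow path prefixes from the `User-agent: *` group. */
  def disallowed(robotsTxt: String): List[String] = {
    var inStarGroup = false
    val rules = List.newBuilder[String]
    robotsTxt.linesIterator
      .map(_.takeWhile(_ != '#').trim) // strip comments and surrounding whitespace
      .filter(_.nonEmpty)
      .foreach { line =>
        val idx = line.indexOf(':')
        if (idx > 0) {
          val field = line.take(idx).trim.toLowerCase
          val value = line.drop(idx + 1).trim
          field match {
            case "user-agent"                                => inStarGroup = value == "*"
            case "disallow" if inStarGroup && value.nonEmpty => rules += value
            case _                                           => () // ignored in this sketch
          }
        }
      }
    rules.result()
  }

  /** A path is allowed when no disallowed prefix matches it. */
  def allowed(path: String, rules: List[String]): Boolean =
    !rules.exists(r => path.startsWith(r))
}
```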
Set whether to restrict crawling to the same domain.
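A same-domain restriction can be sketched as a host comparison against the seed URLs. This version assumes an exact, case-insensitive host match, so subdomains count as different domains; the loader may define "same domain" differently. `SameDomainSketch` is a hypothetical name.

```scala
import java.net.URI
import scala.util.Try

object SameDomainSketch {
  /** Lower-cased host of a URL, or None if the URL is malformed or has no host. */
  private def host(url: String): Option[String] =
    Try(new URI(url).getHost).toOption.flatMap(h => Option(h)).map(_.toLowerCase)

  /** True when the candidate's host equals the host of at least one seed. */
  def sameDomain(candidate: String, seeds: Seq[String]): Boolean =
    host(candidate).exists(h => seeds.flatMap(s => host(s)).contains(h))
}
```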
Add a seed URL.
Add multiple seed URLs.
Set request timeout.
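Since the loader has Product, Equals, and Serializable among its supertypes, it is presumably a case class, and setters like the ones above would then typically return a modified copy rather than mutate in place. The sketch below shows that pattern under this assumption; `Crawler`, every method name, and every default value are hypothetical, as the actual signatures are not shown here.

```scala
object BuilderSketch {
  // Hypothetical loader; field names, method names, and defaults are illustrative only.
  final case class Crawler(
      seedUrls: Seq[String] = Seq.empty,
      maxDepth: Int = 3,
      maxPages: Int = 100,
      metadata: Map[String, String] = Map.empty
  ) {
    // Each "setter" returns a new immutable instance via copy.
    def withSeed(url: String): Crawler        = copy(seedUrls = seedUrls :+ url)
    def withSeeds(urls: Seq[String]): Crawler = copy(seedUrls = seedUrls ++ urls)
    def withMaxDepth(depth: Int): Crawler     = copy(maxDepth = depth)
    def withMaxPages(pages: Int): Crawler     = copy(maxPages = pages)
    def withMetadata(entries: Map[String, String]): Crawler =
      copy(metadata = metadata ++ entries)
  }
}
```

With this shape, calls chain naturally and the original instance stays unchanged, which keeps a shared base configuration safe to reuse.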
Inherited methods
Combine this loader with another.
Creates a composite loader that loads from both sources.
Attributes
- Inherited from: DocumentLoader
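A composite loader of this kind can be pictured as concatenating the two result streams. The sketch below uses a hypothetical, stripped-down `Loader` trait that yields strings; the real DocumentLoader has more members and yields load results, and its combining method may be named differently.

```scala
object CombineSketch {
  // Hypothetical minimal loader interface, used only for this illustration.
  trait Loader { self =>
    def load(): Iterator[String]

    /** Composite loader: yields everything from this loader, then from the other. */
    def combine(other: Loader): Loader = new Loader {
      // Iterator's ++ takes its argument by name, so the second loader
      // is not started until the first iterator is exhausted.
      def load(): Iterator[String] = self.load() ++ other.load()
    }
  }
}
```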
Attributes
- Inherited from: Product