org.llm4s.rag.loader.internal

Type members

Classlikes

Utility for matching URLs against glob-style patterns.

Supports:

  • A single asterisk (*) matches any string within a path segment (non-greedy)
  • A double asterisk (**) matches any string, including path separators
  • A question mark (?) matches a single character
  • All other characters are matched literally

Examples:

  • *.example.com/* matches any subdomain and path
  • example.com/docs/** matches any path under /docs/
  • example.com/page?.html matches page1.html, page2.html, etc.
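
The wildcard rules above can be illustrated by translating a glob into a regular expression. The following is only a minimal sketch, not the library's implementation: the names GlobSketch, toRegex and matches are hypothetical, and it assumes that a single asterisk and a question mark never match the "/" separator.

```scala
import java.util.regex.Pattern

// Hypothetical sketch; not the actual org.llm4s implementation.
object GlobSketch {

  // Translate a glob into an anchored regular expression.
  def toRegex(glob: String): Pattern = {
    val sb = new StringBuilder("^")
    var i = 0
    while (i < glob.length) {
      val c = glob.charAt(i)
      if (c == '*' && i + 1 < glob.length && glob.charAt(i + 1) == '*') {
        sb.append(".*") // double asterisk: may cross path separators
        i += 2
      } else if (c == '*') {
        sb.append("[^/]*?") // single asterisk: stays inside one segment (assumption)
        i += 1
      } else if (c == '?') {
        sb.append("[^/]") // exactly one non-separator character (assumption)
        i += 1
      } else {
        sb.append(Pattern.quote(c.toString)) // everything else is literal
        i += 1
      }
    }
    sb.append("$")
    Pattern.compile(sb.toString)
  }

  // True when the whole URL string matches the glob.
  def matches(glob: String, url: String): Boolean =
    toRegex(glob).matcher(url).matches()
}
```

Under these assumptions, GlobSketch.matches("example.com/docs/**", "example.com/docs/a/b.html") holds, while GlobSketch.matches("example.com/*", "example.com/a/b") does not, because the single asterisk stops at the path separator.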

Attributes

Supertypes
class Object
trait Matchable
class Any
Self type

Utility for extracting clean text content and links from HTML.

Uses JSoup for parsing and provides:

  • Title extraction
  • Main content extraction (removing nav, header, footer, etc.)
  • Link extraction for crawling
  • Clean text output suitable for RAG chunking
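
As a rough illustration of the features listed above, the sketch below uses JSoup directly. It is an assumption-laden example rather than the library's code: HtmlExtractSketch, Extracted and extract are hypothetical names, and the set of elements treated as boilerplate (nav, header, footer, aside, script, style) is a guess at what "etc." covers.

```scala
import org.jsoup.Jsoup
import scala.jdk.CollectionConverters._

// Hypothetical sketch; not the actual org.llm4s implementation.
object HtmlExtractSketch {

  final case class Extracted(title: String, text: String, links: List[String])

  def extract(html: String, baseUrl: String): Extracted = {
    // The base URL lets JSoup resolve relative hrefs to absolute URLs.
    val doc = Jsoup.parse(html, baseUrl)
    val title = doc.title()

    // Collect links before stripping boilerplate so that navigation
    // links are still available to a crawler.
    val links = doc.select("a[href]").asScala.toList
      .map(_.absUrl("href"))
      .filter(_.nonEmpty)
      .distinct

    // Remove elements assumed to be boilerplate, then read the remaining text.
    doc.select("nav, header, footer, aside, script, style").remove()
    val text = doc.body().text()

    Extracted(title, text, links)
  }
}
```

Collecting links before removing the boilerplate elements is a design choice of this sketch: menus and footers rarely belong in a RAG chunk, but they often point to pages worth crawling.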

Attributes

Supertypes
class Object
trait Matchable
class Any
Self type

Parser and cache for robots.txt files.

Supports:

  • User-agent directive
  • Disallow directive
  • Allow directive
  • Crawl-delay directive
  • Wildcard patterns (* and $)
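
To show how the directives above could fit together, here is a minimal sketch of a robots.txt parser. It is not the library's parser and it omits caching. RobotsSketch, Rule, Group, parse, isAllowed and crawlDelay are hypothetical names, and the precedence rule (longest matching pattern wins, Allow wins ties) is an assumption borrowed from RFC 9309.

```scala
import java.util.regex.Pattern
import scala.collection.mutable.ListBuffer

// Hypothetical sketch; not the actual org.llm4s implementation.
object RobotsSketch {
  final case class Rule(allow: Boolean, pattern: String)
  final case class Group(agents: List[String], rules: List[Rule], crawlDelay: Option[Double])

  def parse(content: String): List[Group] = {
    val groups = ListBuffer.empty[Group]
    var agents = List.empty[String]
    var rules = List.empty[Rule]
    var delay = Option.empty[Double]
    var inRules = false

    def flush(): Unit = {
      if (agents.nonEmpty) groups += Group(agents, rules.reverse, delay)
      agents = Nil; rules = Nil; delay = None; inRules = false
    }

    content.linesIterator.foreach { raw =>
      val line = raw.takeWhile(_ != '#').trim // drop comments
      val idx = line.indexOf(':')
      if (idx > 0) {
        val key = line.substring(0, idx).trim.toLowerCase
        val value = line.substring(idx + 1).trim
        key match {
          case "user-agent" =>
            if (inRules) flush() // a User-agent line after rules starts a new group
            if (value.nonEmpty) agents = value.toLowerCase :: agents
          case "allow" if agents.nonEmpty =>
            inRules = true
            if (value.nonEmpty) rules = Rule(allow = true, value) :: rules
          case "disallow" if agents.nonEmpty =>
            inRules = true
            if (value.nonEmpty) rules = Rule(allow = false, value) :: rules
          case "crawl-delay" if agents.nonEmpty =>
            inRules = true
            delay = value.toDoubleOption
          case _ => () // unknown directives are ignored
        }
      }
    }
    flush()
    groups.toList
  }

  // '*' matches any character sequence; a trailing '$' anchors the end of the path.
  private def toRegex(pattern: String): Pattern = {
    val anchored = pattern.endsWith("$")
    val body = if (anchored) pattern.dropRight(1) else pattern
    val rx = body.split("\\*", -1).map(s => Pattern.quote(s)).mkString(".*")
    Pattern.compile("^" + rx + (if (anchored) "$" else ".*"))
  }

  // Prefer a group naming the agent; fall back to the '*' group.
  private def groupFor(groups: List[Group], userAgent: String): Option[Group] = {
    val ua = userAgent.toLowerCase
    groups.find(_.agents.exists(a => a != "*" && ua.contains(a)))
      .orElse(groups.find(_.agents.contains("*")))
  }

  def isAllowed(groups: List[Group], userAgent: String, path: String): Boolean = {
    val matching = groupFor(groups, userAgent).toList
      .flatMap(_.rules)
      .filter(r => toRegex(r.pattern).matcher(path).matches())
    // Assumption: the longest pattern wins and Allow wins ties; no match means allowed.
    if (matching.isEmpty) true
    else matching.maxBy(r => (r.pattern.length, r.allow)).allow
  }

  def crawlDelay(groups: List[Group], userAgent: String): Option[Double] =
    groupFor(groups, userAgent).flatMap(_.crawlDelay)
}
```

With a file that disallows /private/ but allows /private/public.html, this sketch permits the latter because its pattern is longer, and a rule such as "Disallow: /*.pdf$" blocks /files/report.pdf but not /files/report.pdf?x=1.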

Attributes

Supertypes
class Object
trait Matchable
class Any
Self type
object UrlNormalizer

Utility for normalizing URLs to ensure consistent deduplication.

Handles:

  • Scheme normalization (lowercase)
  • Host normalization (lowercase)
  • Path normalization (remove trailing slash, decode/encode consistently)
  • Fragment removal
  • Optional query parameter handling
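
The normalization steps above can be sketched with java.net.URI. This is an illustrative example only, not the behaviour of UrlNormalizer itself: UrlNormalizeSketch, normalize and the keepQuery flag are hypothetical, the input is assumed to be an absolute URL with a host, and consistent percent-encoding of the path is left out.

```scala
import java.net.URI

// Hypothetical sketch; not the actual UrlNormalizer implementation.
object UrlNormalizeSketch {

  // Assumes an absolute URL with a scheme and host, e.g. "https://example.com/a".
  def normalize(url: String, keepQuery: Boolean = true): String = {
    // URI.normalize() resolves "." and ".." path segments.
    val uri = new URI(url.trim).normalize()

    val scheme = uri.getScheme.toLowerCase
    val host = uri.getHost.toLowerCase
    val port = if (uri.getPort == -1) "" else s":${uri.getPort}"

    // Remove a trailing slash so "/docs/" and "/docs" deduplicate to one URL.
    val rawPath = Option(uri.getRawPath).getOrElse("")
    val path = if (rawPath.endsWith("/")) rawPath.dropRight(1) else rawPath

    // The query is kept or dropped as a whole; the fragment is never emitted.
    val query = Option(uri.getRawQuery).filter(_ => keepQuery).map("?" + _).getOrElse("")

    s"$scheme://$host$port$path$query"
  }
}
```

For example, under these assumptions normalize("HTTPS://Example.COM/Docs/#intro") yields "https://example.com/Docs": the scheme and host are lowercased, the trailing slash and fragment are removed, and the case of the path is preserved.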

Attributes

Supertypes
class Object
trait Matchable
class Any
Self type