package org.llm4s.rag.loader.internal
Members list
Type members
Classlikes
object GlobPatternMatcher
Utility for matching URLs against glob-style patterns.
Supports:
- A single asterisk (*) matches any string (non-greedy within path segments)
- A double asterisk (**) matches any string, including path separators
- A question mark (?) matches a single character
- All other characters are matched literally
Examples:
- *.example.com/* matches any subdomain and path, e.g. subdomain.example.com/path
- example.com/docs/** matches any path under /docs/, e.g. example.com/docs/anything
- example.com/page?.html matches page1.html, page2.html, etc.
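The wildcard rules above can be illustrated by translating a glob into a regular expression. The following is a minimal sketch, not the library's implementation: the object name GlobSketch and its methods are hypothetical, and it assumes that a single asterisk stops at "/" (one reading of "non-greedy within path segments") while a question mark matches any one character.

```scala
import java.util.regex.Pattern

// Hypothetical sketch; the real GlobPatternMatcher API may differ.
object GlobSketch {

  /** Translate a glob pattern into a regular expression string. */
  def toRegex(glob: String): String = {
    val sb = new StringBuilder
    var i = 0
    while (i < glob.length) {
      val c = glob.charAt(i)
      if (c == '*' && i + 1 < glob.length && glob.charAt(i + 1) == '*') {
        sb.append(".*") // double asterisk: may cross path separators
        i += 1          // consume the second asterisk
      } else if (c == '*') sb.append("[^/]*") // single asterisk: stays inside one segment (assumption)
      else if (c == '?') sb.append(".")       // question mark: exactly one character
      else sb.append(Pattern.quote(c.toString)) // everything else is literal
      i += 1
    }
    sb.toString
  }

  /** True when the whole URL matches the glob. */
  def matches(glob: String, url: String): Boolean =
    url.matches(toRegex(glob)) // String.matches requires a full-string match
}
```

Under these assumptions, GlobSketch.matches("example.com/docs/**", "example.com/docs/a/b") holds, while "*.example.com/*" does not match a URL whose path has more than one segment.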
Attributes
- Supertypes: class Object, trait Matchable, class Any
- Self type: GlobPatternMatcher.type
object HtmlContentExtractor
Utility for extracting clean text content and links from HTML.
Uses JSoup for parsing and provides:
- Title extraction
- Main content extraction (removing nav, header, footer, etc.)
- Link extraction for crawling
- Clean text output suitable for RAG chunking
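The extraction steps listed above can be sketched directly with JSoup. This is an illustrative sketch only: the object HtmlExtractSketch, the case class Extracted, and the choice of elements to strip are assumptions, not the HtmlContentExtractor API. It requires the jsoup library on the classpath (the first line is a scala-cli directive; the version shown is an assumption).

```scala
//> using dep org.jsoup:jsoup:1.17.2
import org.jsoup.Jsoup
import scala.jdk.CollectionConverters._

// Hypothetical sketch; names and the stripped-element list are assumptions.
object HtmlExtractSketch {
  final case class Extracted(title: String, text: String, links: List[String])

  def extract(html: String, baseUrl: String): Extracted = {
    val doc = Jsoup.parse(html, baseUrl) // baseUrl lets jsoup resolve relative links
    val title = doc.title()

    // Collect links first, so navigation links are still available for crawling.
    val links = doc
      .select("a[href]")
      .asScala
      .map(_.absUrl("href")) // absolute URL, or "" if it cannot be resolved
      .filter(_.nonEmpty)
      .toList
      .distinct

    // Drop page chrome before taking the text used for chunking.
    doc.select("nav, header, footer, aside, script, style").remove()
    val text = doc.body().text() // whitespace-normalized visible text

    Extracted(title, text, links)
  }
}
```

Links are gathered before the boilerplate elements are removed so that a crawler can still follow navigation links even though their text is excluded from the output.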
Attributes
- Supertypes: class Object, trait Matchable, class Any
- Self type: HtmlContentExtractor.type
object RobotsTxtParser
Parser and cache for robots.txt files.
Supports:
- User-agent directive
- Disallow directive
- Allow directive
- Crawl-delay directive
- Wildcard patterns (* and $)
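To show how the directives listed above interact, here is a minimal sketch of parsing a robots.txt body and checking a path against it. The names RobotsSketch, Rules, parse and isAllowed are hypothetical, caching is omitted, and the precedence rule (longest matching pattern wins, Allow wins ties) is an assumption borrowed from the Robots Exclusion Protocol rather than something this page states.

```scala
import java.util.regex.Pattern
import scala.collection.mutable

// Hypothetical sketch; the real RobotsTxtParser API may differ.
object RobotsSketch {
  final case class Rules(allow: List[String], disallow: List[String], crawlDelay: Option[Double])
  private val empty = Rules(Nil, Nil, None)

  /** Collect the rules for `agent`, falling back to the "*" group. */
  def parse(content: String, agent: String): Rules = {
    val groups = mutable.Map.empty[String, Rules]
    var current = List.empty[String] // agents of the group being read
    var inHeader = false             // true while reading consecutive User-agent lines

    def update(f: Rules => Rules): Unit =
      current.foreach(a => groups(a) = f(groups.getOrElse(a, empty)))

    content.linesIterator.foreach { raw =>
      val line = raw.takeWhile(_ != '#').trim // strip comments
      val idx = line.indexOf(':')
      if (idx > 0) {
        val key = line.substring(0, idx).trim.toLowerCase
        val value = line.substring(idx + 1).trim
        key match {
          case "user-agent" =>
            if (!inHeader) current = Nil
            current = value.toLowerCase :: current
            groups.getOrElseUpdate(value.toLowerCase, empty)
            inHeader = true
          case "allow" if value.nonEmpty =>
            update(r => r.copy(allow = r.allow :+ value))
            inHeader = false
          case "disallow" if value.nonEmpty =>
            update(r => r.copy(disallow = r.disallow :+ value))
            inHeader = false
          case "crawl-delay" =>
            update(r => r.copy(crawlDelay = value.toDoubleOption))
            inHeader = false
          case _ => inHeader = false
        }
      }
    }
    groups.get(agent.toLowerCase).orElse(groups.get("*")).getOrElse(empty)
  }

  // "*" matches any run of characters; a trailing "$" anchors the end of the path.
  private def toRegex(rule: String): Pattern = {
    val anchored = rule.endsWith("$")
    val body = if (anchored) rule.dropRight(1) else rule
    val quoted = body.split("\\*", -1).map(s => Pattern.quote(s)).mkString(".*")
    Pattern.compile(quoted + (if (anchored) "$" else ""))
  }

  /** Longest matching pattern wins; Allow wins ties (assumed precedence). */
  def isAllowed(rules: Rules, path: String): Boolean = {
    def longest(patterns: List[String]): Int =
      patterns.filter(p => toRegex(p).matcher(path).lookingAt()).map(_.length).maxOption.getOrElse(-1)
    longest(rules.allow) >= longest(rules.disallow)
  }
}
```

A crawler would typically call parse once per host, keep the resulting Rules, and wait for crawlDelay seconds between requests when it is defined.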
Attributes
- Supertypes: class Object, trait Matchable, class Any
- Self type: RobotsTxtParser.type
object UrlNormalizer
Utility for normalizing URLs to ensure consistent deduplication.
Handles:
- Scheme normalization (lowercase)
- Host normalization (lowercase)
- Path normalization (remove trailing slash, decode/encode consistently)
- Fragment removal
- Optional query parameter handling
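The normalization steps above can be approximated with java.net.URI. This is a sketch under stated assumptions: UrlNormalizeSketch, normalize and the keepQuery flag are hypothetical names, the input is assumed to be an absolute URL with a host, and the consistent decode/encode step is left out for brevity.

```scala
import java.net.URI

// Hypothetical sketch; the real UrlNormalizer API may differ.
object UrlNormalizeSketch {

  /** Normalize an absolute URL so that equivalent URLs compare equal. */
  def normalize(url: String, keepQuery: Boolean = true): String = {
    val uri = new URI(url.trim).normalize() // also resolves "." and ".." segments
    val scheme = uri.getScheme.toLowerCase  // scheme normalization
    val host = uri.getHost.toLowerCase      // host normalization
    val port = if (uri.getPort == -1) "" else s":${uri.getPort}"

    // Remove a trailing slash, but keep the root path "/" as it is.
    val rawPath = Option(uri.getRawPath).getOrElse("")
    val path =
      if (rawPath.length > 1 && rawPath.endsWith("/")) rawPath.dropRight(1) else rawPath

    // Optional query handling: keep it verbatim or drop it entirely.
    val query =
      if (keepQuery) Option(uri.getRawQuery).map(q => "?" + q).getOrElse("") else ""

    // The fragment is never appended, which removes it.
    s"$scheme://$host$port$path$query"
  }
}
```

With this sketch, "HTTP://Example.COM/Docs/#intro" and "http://example.com/Docs" normalize to the same string, so a crawler's visited set would treat them as one page.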
Attributes
- Supertypes: class Object, trait Matchable, class Any
- Self type: UrlNormalizer.type