HtmlContentExtractor
org.llm4s.rag.loader.internal.HtmlContentExtractor
object HtmlContentExtractor
Utility for extracting clean text content and links from HTML.
Uses JSoup for parsing and provides:
- Title extraction
- Main content extraction (removing nav, header, footer, etc.)
- Link extraction for crawling
- Clean text output suitable for RAG chunking
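The features listed above can be illustrated with a minimal sketch written directly against jsoup. This is not the actual implementation or API of `HtmlContentExtractor`, whose members are not shown on this page. The names `ExtractedPage`, `HtmlExtractionSketch`, `extract`, and the boilerplate selector list are hypothetical and chosen only for illustration.

```scala
import org.jsoup.Jsoup
import scala.jdk.CollectionConverters._

// Hypothetical result type; the real extractor's return type is not shown on this page.
final case class ExtractedPage(title: String, text: String, links: Seq[String])

object HtmlExtractionSketch {
  // Elements treated as boilerplate in this sketch; the real list is an assumption.
  private val BoilerplateSelector = "nav, header, footer, aside, script, style"

  def extract(html: String, baseUrl: String): ExtractedPage = {
    // The base URL lets jsoup resolve relative hrefs to absolute URLs.
    val doc = Jsoup.parse(html, baseUrl)

    // Title extraction: contents of the <title> element, or "" if absent.
    val title = doc.title()

    // Link extraction: collected before boilerplate removal so that
    // navigation links are still available to a crawler.
    val links = doc
      .select("a[href]")
      .asScala
      .map(_.absUrl("href"))
      .filter(_.nonEmpty)
      .distinct
      .toSeq

    // Main content extraction: drop boilerplate elements, then read the
    // whitespace-normalized text of what remains in <body>.
    doc.select(BoilerplateSelector).remove()
    val text = Option(doc.body()).map(_.text()).getOrElse("")

    ExtractedPage(title, text, links)
  }
}
```

A possible call, assuming the sketch above, is `HtmlExtractionSketch.extract(html, "https://example.com/")`. The returned `text` field could then be passed to a chunker, and `links` to a crawl queue.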
Attributes
- Supertypes
  class Object, trait Matchable, class Any
- Self type
  HtmlContentExtractor.type