HtmlContentExtractor

org.llm4s.rag.loader.internal.HtmlContentExtractor

Utility for extracting clean text content and links from HTML.

Uses JSoup for parsing and provides:

  • Title extraction
  • Main content extraction (removing nav, header, footer, etc.)
  • Link extraction for crawling
  • Clean text output suitable for RAG chunking

Supertypes
class Object
trait Matchable
class Any

Members list

Type members

Classlikes

final case class ExtractionResult(title: String, content: String, links: Seq[String], description: Option[String])

Result of extracting content from an HTML page.

Value parameters

  • content: Clean text content
  • description: Meta description if available
  • links: Discovered links on the page
  • title: Page title
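
As a minimal sketch of how these fields fit together, the snippet below declares a local stand-in with the same shape as the documented signature, so it compiles without llm4s on the classpath. The object name, the sample values, and the `summary` helper are hypothetical and exist only for illustration.

```scala
// Local stand-in mirroring the documented signature; in real code the
// class would come from org.llm4s.rag.loader.internal.HtmlContentExtractor.
final case class ExtractionResult(
  title: String,
  content: String,
  links: Seq[String],
  description: Option[String]
)

object ExtractionResultExample {
  // Hypothetical values for illustration only.
  val page: ExtractionResult = ExtractionResult(
    title = "Getting Started",
    content = "Install the library and run the first example.",
    links = Seq("https://example.com/docs/install", "https://example.com/docs/faq"),
    description = None
  )

  // description is an Option, so callers need a fallback for pages
  // that have no meta description. Here the title is used instead.
  def summary(r: ExtractionResult): String =
    r.description.getOrElse(r.title)
}
```

Because `description` is optional, code that builds chunk metadata from a result should handle the `None` case explicitly rather than assume a description is present.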

Supertypes
trait Serializable
trait Product
trait Equals
class Object
trait Matchable
class Any

Value members

Concrete methods

def extract(html: String, baseUrl: String): ExtractionResult

Extract content and links from HTML.

Value parameters

  • baseUrl: Base URL for resolving relative links
  • html: Raw HTML content

Returns

An ExtractionResult containing the title, clean text content, discovered links, and the meta description if available
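
To make the steps concrete, here is a rough sketch of how this kind of extraction can be done with JSoup. It is not the library's actual implementation: the `PageSketch` class, the `extractSketch` name, and the list of elements treated as non-content are assumptions. The first line is a scala-cli directive that pulls in JSoup; with another build tool, add the dependency there instead.

```scala
//> using dep org.jsoup:jsoup:1.17.2
import org.jsoup.Jsoup
import scala.jdk.CollectionConverters._

// Hypothetical result type with the same fields as ExtractionResult.
final case class PageSketch(
  title: String,
  content: String,
  links: Seq[String],
  description: Option[String]
)

object ExtractSketch {
  def extractSketch(html: String, baseUrl: String): PageSketch = {
    // Passing baseUrl lets JSoup resolve relative hrefs later on.
    val doc = Jsoup.parse(html, baseUrl)
    val title = doc.title()

    // Elements.attr returns "" when no element matches, so filter empties.
    val description =
      Option(doc.select("meta[name=description]").attr("content"))
        .map(_.trim)
        .filter(_.nonEmpty)

    // Collect links before pruning, so links inside nav or footer survive.
    // The "abs:" prefix asks JSoup for the absolute form of each href.
    val links =
      doc.select("a[href]").eachAttr("abs:href").asScala.toSeq.distinct

    // Assumed set of non-content elements; the real extractor may differ.
    doc.select("nav, header, footer, script, style").remove()

    // text() returns whitespace-normalized text of the remaining body.
    PageSketch(title, doc.body().text(), links, description)
  }
}
```

Collecting links before removing navigation elements is a design choice in this sketch: a crawler usually wants links from menus and footers even though their text is noise for chunking.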

def extractLinksOnly(html: String, baseUrl: String): Seq[String]

Extract only the links from HTML (faster when only links are needed).

Value parameters

  • baseUrl: Base URL for resolving relative links
  • html: Raw HTML content

Returns

Sequence of absolute URLs
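
The snippet below illustrates what resolving relative links against a base URL means, using only `java.net.URI` from the standard library. It is a sketch rather than the method's implementation: the `toAbsolute` helper is hypothetical, and the restriction to http/https plus de-duplication are assumptions about what a crawler would typically want.

```scala
import java.net.URI
import scala.util.Try

object LinkResolutionSketch {
  // Hypothetical helper: turns raw href values into absolute URLs.
  def toAbsolute(baseUrl: String, hrefs: Seq[String]): Seq[String] = {
    val base = URI.create(baseUrl)
    hrefs
      // resolve() throws on malformed input, so invalid hrefs are skipped.
      .flatMap(h => Try(base.resolve(h.trim)).toOption)
      // Assumption: only web links are useful for crawling (drops mailto: etc.).
      .filter(u => u.getScheme == "http" || u.getScheme == "https")
      .map(_.toString)
      .distinct
  }
}
```

For example, with the base `https://example.com/docs/index.html`, the href `guide.html` resolves to `https://example.com/docs/guide.html` and `/about` resolves to `https://example.com/about`, while an already absolute URL is returned unchanged.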