HtmlContentExtractor

org.llm4s.rag.loader.internal.HtmlContentExtractor

object HtmlContentExtractor

Utility for extracting clean text content and links from HTML.

Uses JSoup for parsing and provides:

Title extraction
Main content extraction (removing nav, header, footer, etc.)
Link extraction for crawling
Clean text output suitable for RAG chunking

Attributes

Supertypes: class Object, trait Matchable, class Any
Self type: HtmlContentExtractor.type

Members list

Type members

Classlikes

ExtractionResult: Result of extracting content from an HTML page.

Value parameters

content: Clean text content
description: Meta description if available
links: Discovered links on the page
title: Page title

Attributes

Supertypes: trait Serializable, trait Product, trait Equals, class Object, trait Matchable, class Any

Value members

Concrete methods

Extract content and links from HTML.

Value parameters

baseUrl: Base URL for resolving relative links
html: Raw HTML content

Attributes

Returns: ExtractionResult with title, content, and links
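
The extraction described above could be approximated directly with jsoup, roughly as follows. This is a minimal sketch, not the library's implementation: the names ExtractionSketch and ExtractSketch.extract, the field types, and the list of removed elements are all assumptions made for illustration, since the generated page does not show the actual signatures.

```scala
import org.jsoup.Jsoup
import scala.jdk.CollectionConverters._

// Hypothetical stand-in for the documented result type; field types are assumed.
final case class ExtractionSketch(
  title: String,
  description: Option[String],
  content: String,
  links: Seq[String]
)

object ExtractSketch {
  def extract(html: String, baseUrl: String): ExtractionSketch = {
    // Passing baseUrl lets jsoup resolve relative hrefs through absUrl.
    val doc = Jsoup.parse(html, baseUrl)

    val title = doc.title()

    // attr returns an empty string when no matching meta tag exists.
    val description =
      Option(doc.select("meta[name=description]").attr("content"))
        .map(_.trim)
        .filter(_.nonEmpty)

    // Collect links before stripping page chrome, so navigation links are kept.
    val links = doc.select("a[href]").asScala.toSeq
      .map(_.absUrl("href"))
      .filter(_.nonEmpty)
      .distinct

    // Remove elements that rarely carry main content (assumed selector list).
    doc.select("nav, header, footer, aside, script, style").remove()
    val content = doc.body().text()

    ExtractionSketch(title, description, content, links)
  }
}
```

In this sketch the links are gathered before the navigation elements are removed, so a crawler still sees menu links while the text passed to chunking excludes them. Whether HtmlContentExtractor makes the same choice is not stated in the documentation.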

Extract just the links from HTML (faster if only links needed).

Value parameters

baseUrl: Base URL for resolving relative links
html: Raw HTML content

Attributes

Returns: Sequence of absolute URLs
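
A link-only pass might look like the sketch below, which parses the page and resolves each href against the base URL without building cleaned text or metadata. The object name LinkSketch, the method name extractLinks, and the http/https filter are assumptions for illustration; the documentation only states that the method returns a sequence of absolute URLs.

```scala
import org.jsoup.Jsoup
import scala.jdk.CollectionConverters._

object LinkSketch {
  // Hypothetical helper: returns absolute http(s) URLs in document order.
  def extractLinks(html: String, baseUrl: String): Seq[String] =
    Jsoup.parse(html, baseUrl)
      .select("a[href]")
      .asScala
      .toSeq
      .map(_.absUrl("href")) // empty string when the href cannot be resolved
      .filter(u => u.startsWith("http://") || u.startsWith("https://"))
      .distinct
}
```

Filtering on the scheme drops mailto: and javascript: targets, which a crawler cannot fetch. Fragment-only links such as #top still resolve to the page URL plus the fragment, so a caller may want to strip fragments before deduplicating.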
