UrlNormalizer

org.llm4s.rag.loader.internal.UrlNormalizer
object UrlNormalizer

Utility for normalizing URLs to ensure consistent deduplication.

Handles:

  • Scheme normalization (lowercase)
  • Host normalization (lowercase)
  • Path normalization (remove trailing slash, decode/encode consistently)
  • Fragment removal
  • Optional query parameter handling

Attributes

Graph
Supertypes
class Object
trait Matchable
class Any
Self type

Members list

Value members

Concrete methods

def extractDomain(url: String): Option[String]

Extract the domain (host) from a URL.

Extract the domain (host) from a URL.

Value parameters

url

URL to extract domain from

Attributes

Returns

Domain string (lowercase), or None if invalid

def isInDomains(url: String, allowedDomains: Set[String]): Boolean

Check if a URL belongs to one of the allowed domains.

Check if a URL belongs to one of the allowed domains.

Value parameters

allowedDomains

Set of allowed domains

url

URL to check

Attributes

Returns

true if URL's domain matches or is subdomain of an allowed domain

def isValidHttpUrl(url: String): Boolean

Check if a URL is a valid HTTP/HTTPS URL.

Check if a URL is a valid HTTP/HTTPS URL.

Value parameters

url

URL to validate

Attributes

Returns

true if URL is valid HTTP or HTTPS

def normalize(url: String, includeQueryParams: Boolean): String

Normalize a URL for comparison and deduplication.

Normalize a URL for comparison and deduplication.

Value parameters

includeQueryParams

Whether to keep query parameters

url

URL string to normalize

Attributes

Returns

Normalized URL string, or original if parsing fails

def resolve(baseUrl: String, href: String, includeQueryParams: Boolean): Option[String]

Resolve a potentially relative URL against a base URL.

Resolve a potentially relative URL against a base URL.

Value parameters

baseUrl

Base URL (page the link was found on)

href

Link href (may be relative or absolute)

includeQueryParams

Whether to keep query parameters

Attributes

Returns

Resolved and normalized absolute URL