RobotsTxtParser

org.llm4s.rag.loader.internal.RobotsTxtParser

Parser and cache for robots.txt files.

Supports:

  • User-agent directive
  • Disallow directive
  • Allow directive
  • Crawl-delay directive
  • Wildcard patterns (* and $)
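
A minimal usage sketch, assuming RobotsTxtParser is an object; the robots.txt content, user-agent string, and expected outputs are illustrative:

  import org.llm4s.rag.loader.internal.RobotsTxtParser

  // Parse a robots.txt body for a specific crawler.
  val content =
    """User-agent: *
      |Disallow: /private/
      |Allow: /private/public-page.html
      |Crawl-delay: 2
      |""".stripMargin

  val rules = RobotsTxtParser.parse(content, targetUserAgent = "llm4s-bot")

  println(rules.disallowRules) // e.g. Seq("/private/")
  println(rules.allowRules)    // e.g. Seq("/private/public-page.html")
  println(rules.crawlDelay)    // e.g. Some(2)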

Attributes

Supertypes
class Object
trait Matchable
class Any
Self type
RobotsTxtParser.type

Members list

Type members

Classlikes

final case class RobotsTxt(allowRules: Seq[String], disallowRules: Seq[String], crawlDelay: Option[Int])

Parsed robots.txt rules for a domain.

Value parameters

allowRules

Paths that are explicitly allowed

crawlDelay

Suggested delay between requests in seconds

disallowRules

Paths that are disallowed

Attributes

Companion
object
Supertypes
trait Serializable
trait Product
trait Equals
class Object
trait Matchable
class Any
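Since RobotsTxt is a plain case class, instances can also be constructed directly, for example when stubbing rules in tests. A short sketch, assuming RobotsTxtParser is an object so the nested class can be imported as shown (the path values are illustrative):

  import org.llm4s.rag.loader.internal.RobotsTxtParser.RobotsTxt

  val stubRules = RobotsTxt(
    allowRules    = Seq("/blog/"),
    disallowRules = Seq("/admin/", "/tmp/"),
    crawlDelay    = Some(1)
  )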
object RobotsTxt

Attributes

Companion
class
Supertypes
trait Product
trait Mirror
class Object
trait Matchable
class Any
Self type
RobotsTxt.type

Value members

Concrete methods

def clearCache(): Unit

Clear the robots.txt cache (for testing or memory management).

Attributes

def getRules(url: String, userAgent: String, timeoutMs: Int): RobotsTxt

Get parsed robots.txt rules for a URL.

Value parameters

timeoutMs

Request timeout in milliseconds

url

URL whose robots.txt rules should be retrieved

userAgent

User agent string

Attributes

Returns

Parsed rules for this user agent
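
A sketch of fetching the rules for a live URL, assuming RobotsTxtParser is an object; the URL, user agent, and timeout values are illustrative:

  val rules = RobotsTxtParser.getRules(
    url       = "https://example.com/docs/page.html",
    userAgent = "llm4s-bot",
    timeoutMs = 5000
  )

  // Honour the suggested crawl delay, if the site declares one.
  rules.crawlDelay.foreach(seconds => Thread.sleep(seconds * 1000L))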

def isAllowed(url: String, userAgent: String, timeoutMs: Int): Boolean

Check if a URL is allowed according to robots.txt.

Fetches and caches robots.txt for the domain if not already cached.

Value parameters

timeoutMs

Request timeout in milliseconds

url

URL to check

userAgent

User agent string

Attributes

Returns

true if the URL is allowed
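
A sketch of gating a fetch on robots.txt, assuming RobotsTxtParser is an object; the URL, user agent, and timeout values are illustrative:

  val url = "https://example.com/private/report.html"
  val allowed = RobotsTxtParser.isAllowed(url, userAgent = "llm4s-bot", timeoutMs = 5000)

  if (allowed)
    println(s"Fetching $url")            // proceed to download and index the page
  else
    println(s"Skipping $url (robots)")   // disallowed for this user agent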

def parse(content: String, targetUserAgent: String): RobotsTxt

Parse robots.txt content for a specific user agent.

Follows standard robots.txt parsing rules:

  • Look for matching User-agent group or "*"
  • Collect Allow/Disallow rules
  • Parse Crawl-delay

Attributes
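
A sketch showing how a matching User-agent group would be selected over the "*" group, assuming RobotsTxtParser is an object; the robots.txt content, user agent, and expected results are illustrative:

  val content =
    """User-agent: *
      |Disallow: /
      |
      |User-agent: llm4s-bot
      |Disallow: /search$
      |Allow: /api/*/docs
      |Crawl-delay: 5
      |""".stripMargin

  val rules = RobotsTxtParser.parse(content, targetUserAgent = "llm4s-bot")
  // Expected to reflect the llm4s-bot group rather than "*":
  // e.g. disallowRules containing "/search$", allowRules containing "/api/*/docs",
  // and crawlDelay of Some(5).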