RobotsTxtParser

org.llm4s.rag.loader.internal.RobotsTxtParser

Parser and cache for robots.txt files.

Supports:

  • User-agent directive
  • Disallow directive
  • Allow directive
  • Crawl-delay directive
  • Wildcard patterns (* and $)
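
A minimal usage sketch, assuming RobotsTxtParser is an object; the robots.txt content, user-agent string, and expected outputs are illustrative:

  import org.llm4s.rag.loader.internal.RobotsTxtParser

  // Parse a robots.txt body for a specific crawler.
  val content =
    """User-agent: *
      |Disallow: /private/
      |Allow: /private/public-page.html
      |Crawl-delay: 2
      |""".stripMargin

  val rules = RobotsTxtParser.parse(content, targetUserAgent = "llm4s-bot")

  println(rules.disallowRules) // e.g. Seq("/private/")
  println(rules.allowRules)    // e.g. Seq("/private/public-page.html")
  println(rules.crawlDelay)    // e.g. Some(2)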

Attributes

Supertypes
class Object
trait Matchable
class Any
Self type
RobotsTxtParser.type

Members list

Type members

Classlikes

final case class RobotsTxt(allowRules: Seq[String], disallowRules: Seq[String], crawlDelay: Option[Int])

Parsed robots.txt rules for a domain.

Value parameters

allowRules

Paths that are explicitly allowed

crawlDelay

Suggested delay between requests in seconds

disallowRules

Paths that are disallowed

Attributes

Companion
object
Supertypes
trait Serializable
trait Product
trait Equals
class Object
trait Matchable
class Any
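Since RobotsTxt is a plain case class, instances can also be constructed directly, for example when stubbing rules in tests. A short sketch, assuming RobotsTxtParser is an object so the nested class can be imported as shown (the path values are illustrative):

  import org.llm4s.rag.loader.internal.RobotsTxtParser.RobotsTxt

  val stubRules = RobotsTxt(
    allowRules    = Seq("/blog/"),
    disallowRules = Seq("/admin/", "/tmp/"),
    crawlDelay    = Some(1)
  )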
object RobotsTxt

Attributes

Companion
class
Supertypes
trait Product
trait Mirror
class Object
trait Matchable
class Any
Self type
RobotsTxt.type

Value members

Concrete methods

def clearCache(): Unit

Clear the robots.txt cache (for testing or memory management).

Attributes

def getRules(url: String, userAgent: String, timeoutMs: Int): RobotsTxt

Get parsed robots.txt rules for a URL.

Value parameters

timeoutMs

Request timeout in milliseconds

url

URL whose robots.txt rules should be retrieved

userAgent

User agent string

Attributes

Returns

Parsed rules for this user agent
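
A sketch of fetching the rules for a live URL, assuming RobotsTxtParser is an object; the URL, user agent, and timeout values are illustrative:

  val rules = RobotsTxtParser.getRules(
    url       = "https://example.com/docs/page.html",
    userAgent = "llm4s-bot",
    timeoutMs = 5000
  )

  // Honour the suggested crawl delay, if the site declares one.
  rules.crawlDelay.foreach(seconds => Thread.sleep(seconds * 1000L))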

def isAllowed(url: String, userAgent: String, timeoutMs: Int): Boolean

Check if a URL is allowed according to robots.txt.

Fetches and caches robots.txt for the domain if not already cached.

Value parameters

timeoutMs

Request timeout in milliseconds

url

URL to check

userAgent

User agent string

Attributes

Returns

true if the URL is allowed
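
A sketch of gating a fetch on robots.txt, assuming RobotsTxtParser is an object; the URL, user agent, and timeout values are illustrative:

  val url = "https://example.com/private/report.html"
  val allowed = RobotsTxtParser.isAllowed(url, userAgent = "llm4s-bot", timeoutMs = 5000)

  if (allowed)
    println(s"Fetching $url")            // proceed to download and index the page
  else
    println(s"Skipping $url (robots)")   // disallowed for this user agent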

def parse(content: String, targetUserAgent: String): RobotsTxt

Parse robots.txt content for a specific user agent.

Follows standard robots.txt parsing rules:

  • Look for matching User-agent group or "*"
  • Collect Allow/Disallow rules
  • Parse Crawl-delay

Attributes
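
A sketch showing how a matching User-agent group would be selected over the "*" group, assuming RobotsTxtParser is an object; the robots.txt content, user agent, and expected results are illustrative:

  val content =
    """User-agent: *
      |Disallow: /
      |
      |User-agent: llm4s-bot
      |Disallow: /search$
      |Allow: /api/*/docs
      |Crawl-delay: 5
      |""".stripMargin

  val rules = RobotsTxtParser.parse(content, targetUserAgent = "llm4s-bot")
  // Expected to reflect the llm4s-bot group rather than "*":
  // e.g. disallowRules containing "/search$", allowRules containing "/api/*/docs",
  // and crawlDelay of Some(5).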