The robots.txt file, located at the root of a website, lets the webmaster define which paths the crawlers of search engines are allowed or denied to visit.
This class reads a robots.txt file and allows or denies a provided URL, as a search engine would do.
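For illustration, a minimal robots.txt file using the directives handled by this class (the paths and host names are made up):

  User-agent: *
  Disallow: /private/
  Allow: /private/public-report.html
  Crawl-delay: 10
  Host: www.example.com
  Sitemap: https://www.example.com/sitemap.xml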
Namespace Domframework
/** This class analyzes the provided robots.txt file content and gives access to the configured data for DomSearch. It can examine a URL against the robots.txt rules and return whether the URL is allowed to be used or not. The format of the robots.txt file is defined here:
https://www.rfc-editor.org/rfc/rfc9309.txt
http://www.robotstxt.org/norobots-rfc.txt
https://en.wikipedia.org/wiki/Robots_exclusion_standard
No property available
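A minimal usage sketch. The class name (\Domframework\Robotstxt) and the method names (getContent(), urlAllow()) are assumptions derived from the descriptions below, not confirmed signatures; adapt them to the real Domframework API.

  <?php
  // Sketch only: class and method names are assumed, not confirmed by this page
  $content = file_get_contents("https://www.example.com/robots.txt");
  $robots = new \Domframework\Robotstxt();
  $robots->getContent($content, "examplebot");   // Analyze the rules for the "examplebot" crawler
  if ($robots->urlAllow("https://www.example.com/private/page.html"))
    echo "Crawling allowed\n";
  else
    echo "Crawling denied\n";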
/** Return TRUE if the provided URL can be used according to the robots.txt definition, or FALSE if it is not the case
@param string $url The URL to check
@return boolean The result of the test
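A sketch of the URL test, assuming the method is named urlAllow(). The expected results follow the most-specific-match rule of RFC 9309 (the longest matching rule decides, Allow winning ties), assuming the class implements it.

  <?php
  // Assumed names (Robotstxt, getContent, urlAllow); rules taken from the sample file above
  $content = "User-agent: *\nDisallow: /private/\nAllow: /private/public-report.html\n";
  $robots = new \Domframework\Robotstxt();
  $robots->getContent($content, "examplebot");
  var_dump($robots->urlAllow("https://www.example.com/index.html"));                  // Expected true: no rule matches
  var_dump($robots->urlAllow("https://www.example.com/private/secret.html"));         // Expected false: Disallow /private/
  var_dump($robots->urlAllow("https://www.example.com/private/public-report.html"));  // Expected true: the longer Allow rule wins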
/** Get the robots.txt file content and analyze it
@param string $content The robots.txt file content to analyze
@param string $crawlerName The crawler name to use in the analysis
@return $this
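Because the method returns $this, the analysis and the URL test can be chained; the crawler name selects which User-agent group of the file applies. A hedged sketch with assumed names:

  <?php
  // Assumed names; the "examplebot" group should apply, not the catch-all "*" group
  $content = "User-agent: examplebot\nDisallow: /drafts/\n\nUser-agent: *\nDisallow: /\n";
  $allowed = (new \Domframework\Robotstxt())
             ->getContent($content, "examplebot")
             ->urlAllow("https://www.example.com/news/");
  var_dump($allowed);   // Expected true: only /drafts/ is disallowed for examplebot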
/** Return the allowed URLs
@return array $allow The array of allow rules
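A short sketch, assuming the getter is named getAllow():

  <?php
  // Assumed names; dumps the Allow rules collected for the selected crawler
  $robots = new \Domframework\Robotstxt();
  $robots->getContent("User-agent: *\nAllow: /public/\nDisallow: /\n", "examplebot");
  print_r($robots->getAllow());   // e.g. ["/public/"]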
/** Return the crawl delay
@return integer $crawldelay The Crawl-delay value defined in robots.txt
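A sketch of a polite fetch loop honouring the returned delay, with assumed method names (getContent(), getCrawldelay()):

  <?php
  // Assumed names; the returned delay is used to pace the requests
  $robots = new \Domframework\Robotstxt();
  $robots->getContent("User-agent: *\nCrawl-delay: 10\n", "examplebot");
  $delay = $robots->getCrawldelay();   // 10 seconds with this sample content
  $urlsToFetch = ["https://www.example.com/a.html", "https://www.example.com/b.html"];
  foreach ($urlsToFetch as $url) {
    // ... fetch $url here ...
    sleep($delay);                     // Wait between two requests
  }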
/** Return the disallowed URLs
@return array $disallow The array of disallow rules
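A short sketch, assuming the getter is named getDisallow(). Remember that per RFC 9309 an Allow rule can still override a Disallow rule when it is the more specific match:

  <?php
  // Assumed names; returns the Disallow rules of the matching User-agent group
  $robots = new \Domframework\Robotstxt();
  $robots->getContent("User-agent: *\nDisallow: /private/\nDisallow: /tmp/\n", "examplebot");
  print_r($robots->getDisallow());   // e.g. ["/private/", "/tmp/"]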
/** Return the lines where an error occurred
The key of the array is the line number
@return array The errors
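A sketch that lists the reported errors, assuming the getter is named getErrors() and that, as stated above, the array keys are the line numbers:

  <?php
  // Assumed names; the third line of this sample content is not a valid directive
  $robots = new \Domframework\Robotstxt();
  $robots->getContent("User-agent: *\nDisallow: /private/\nNotADirective: foo\n", "examplebot");
  foreach ($robots->getErrors() as $line => $error)
    echo "Line $line: " . print_r($error, true) . "\n";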
/** Return the host
@return string $host The Host string defined in robots.txt
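A short sketch, assuming the getter is named getHost(). The Host directive is a non-standard extension (popularised by Yandex):

  <?php
  // Assumed names; returns the value of the "Host:" line, if any
  $robots = new \Domframework\Robotstxt();
  $robots->getContent("User-agent: *\nHost: www.example.com\n", "examplebot");
  echo $robots->getHost();   // "www.example.com"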
/** Return the matchRule
@return string $matchRule The rule that matched during the URLAllow test
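A sketch that shows which rule decided the last URL test, assuming the getter is named getMatchRule(); this is handy when debugging why a URL is refused:

  <?php
  // Assumed names; after the test, the matching rule can be inspected
  $robots = new \Domframework\Robotstxt();
  $robots->getContent("User-agent: *\nDisallow: /private/\n", "examplebot");
  $robots->urlAllow("https://www.example.com/private/doc.html");
  echo $robots->getMatchRule();   // Expected to report the "/private/" rule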
/** Return the sitemap URLs
@return array $sitemap The array of sitemap URLs
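A short sketch, assuming the getter is named getSitemap(); the returned URLs can be used to seed a crawl:

  <?php
  // Assumed names; lists the "Sitemap:" URLs declared in the file
  $robots = new \Domframework\Robotstxt();
  $robots->getContent("Sitemap: https://www.example.com/sitemap.xml\n", "examplebot");
  foreach ($robots->getSitemap() as $sitemapUrl)
    echo $sitemapUrl . "\n";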