Robots.txt file parser

The robots.txt file, located at the root of a website, lets the webmaster define which paths search engine crawlers are allowed or forbidden to crawl.

This class reads a robots.txt file and allows or denies a provided URL, as a search engine would.
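To make the allow/deny decision concrete, here is a minimal sketch of the "longest match wins" rule described in RFC 9309. It is an illustration only, not the code of Domframework\Robotstxt: the function name `isPathAllowed` and its parameters are hypothetical, and the `*` and `$` wildcards of the RFC are deliberately left out.

```php
<?php
// Illustrative only: simplified RFC 9309 matching, plain prefix rules,
// no "*" or "$" wildcards. Not the Domframework\Robotstxt implementation.
function isPathAllowed(string $path, array $allow, array $disallow): bool
{
    // When no rule matches, the path is allowed.
    $best = ['length' => -1, 'allowed' => true];
    foreach (['allow' => $allow, 'disallow' => $disallow] as $type => $rules) {
        foreach ($rules as $rule) {
            // Skip empty rules and rules that are not a prefix of the path.
            if ($rule === '' || strncmp($path, $rule, strlen($rule)) !== 0) {
                continue;
            }
            $length = strlen($rule);
            // The longest rule wins; on a tie, "allow" takes precedence.
            if ($length > $best['length']
                || ($length === $best['length'] && $type === 'allow')) {
                $best = ['length' => $length, 'allowed' => $type === 'allow'];
            }
        }
    }
    return $best['allowed'];
}
```

With `Disallow: /private/` and `Allow: /private/public/`, this sketch denies `/private/page.html` but allows `/private/public/page.html`, because the allow rule is the longer match.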

The class definition

Class Domframework\Robotstxt

Namespace Domframework

Description

/**
 This class analyzes the provided robots.txt file content and gives access
 to the configured data for DomSearch.
 It can check a URL against the robots.txt file and report whether the URL
 is allowed to be used or not.
 The format of the robots.txt file is defined here:
   https://www.rfc-editor.org/rfc/rfc9309.txt
   http://www.robotstxt.org/norobots-rfc.txt
   https://en.wikipedia.org/wiki/Robots_exclusion_standard

Properties

No property available

Methods

public function URLAllow ( $url)
/**
 Return true if the provided URL is allowed by the robots.txt
 definition, or false if it is not
 @param string $url The URL to check
 @return boolean The result of the test

public function __construct ( $content, $crawlerName)
/**
 Read the robots.txt file content and analyze it
 @param string $content The robots.txt file content to analyze
 @param string $crawlerName The crawler name to use in the analysis
 @return $this

public function allow ()
/**
 Return the allowed URLs
 @return array $allow The array of allow rules

public function crawldelay ()
/**
 Return the crawl delay
 @return integer $crawldelay The crawlDelay defined in robots.txt

public function disallow ()
/**
 Return the disallowed URLs
 @return array $disallow The array of disallow rules

public function errors ()
/**
 Return the lines where an error occurred
 The key of the array is the number of the line containing the error
 @return array The errors

public function host ()
/**
 Return the host
 @return string $host The Host string defined in robots.txt

public function matchRule ()
/**
 Return the matchRule
 @return string $matchRule The rule that matched during the URLAllow test

public function sitemaps ()
/**
 Return the sitemap URLs
 @return array $sitemap The array of sitemap URLs
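
The methods above can be combined as in the following sketch. The helper `robotsReport` is hypothetical and only relies on the documented `URLAllow()` and `matchRule()` signatures. It assumes that `matchRule()` reflects the most recent `URLAllow()` call, which the description suggests but does not state outright. The commented usage assumes the class can be autoloaded, and the example URL and paths are placeholders.

```php
<?php
// Sketch only: $robots is expected to be a Domframework\Robotstxt instance,
// but any object exposing URLAllow() and matchRule() works here.
function robotsReport(object $robots, array $urls): array
{
    $report = [];
    foreach ($urls as $url) {
        // URLAllow() returns true when the URL may be used by the crawler.
        $allowed = $robots->URLAllow($url);
        // Assumption: matchRule() describes the URLAllow() test just made.
        $report[$url] = [
            'allowed' => $allowed,
            'rule'    => $robots->matchRule(),
        ];
    }
    return $report;
}

// Hypothetical usage, assuming the class is autoloadable:
// $content = file_get_contents('https://www.example.org/robots.txt');
// $robots  = new \Domframework\Robotstxt($content, 'DomSearch');
// print_r(robotsReport($robots, ['/index.html', '/private/page.html']));
// print_r($robots->errors());   // lines of robots.txt that failed to parse
```

Whether `URLAllow()` expects a full URL or only a path is not specified here, so check the behavior against a known robots.txt before relying on it.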