Wednesday, December 22, 2010

Controlling Crawling and Indexing

Search engines generally have two main stages to make content available for users in search results. These are crawling and indexing. Crawling is the act of search engine crawlers accessing publicly available web pages. This involves looking at the web pages and following the links on those pages, just as a human user would do. Indexing involves gathering together information about a page so that it can be made available (“served”) through search results.
Automated website crawlers are powerful tools to help crawl and index content on the web. As a webmaster, you may wish to guide them towards your useful content and away from irrelevant content. The robots.txt file controls crawling, while the robots meta tag and the X-Robots-Tag HTTP header control indexing. The robots.txt standard predates Google and is the accepted method of controlling crawling of a website.
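The two indexing controls can be sketched in a few lines of Python. This is only an illustration: the helper names `robots_meta_tag` and `x_robots_tag_header`, the sample path, and the choice to block PDF files are assumptions made for the example, not part of any server's API. The meta tag goes in a page's HTML head, while the header is sent with the HTTP response and so also works for files that have no HTML head.

```python
from html import escape


def robots_meta_tag(directives=("noindex",)):
    """Build a robots meta tag to place inside a page's <head> (hypothetical helper)."""
    content = ", ".join(directives)
    return f'<meta name="robots" content="{escape(content)}">'


def x_robots_tag_header(path, blocked_suffixes=(".pdf",)):
    """Return extra response headers for a path (hypothetical helper).

    Files matching blocked_suffixes get an X-Robots-Tag header asking
    search engines not to index them; other paths get no extra header.
    """
    if path.lower().endswith(blocked_suffixes):
        return {"X-Robots-Tag": "noindex"}
    return {}


print(robots_meta_tag())
print(x_robots_tag_header("/files/report.pdf"))
```

A web application would merge the returned dictionary into its response headers; how that is done depends on the server or framework in use.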
  • How to use robots.txt: It’s a simple text file that tells search robots which pages you don’t want them to crawl.
  • How to use robots meta tags: These are for users who can’t control the robots.txt file, such as Blogspot users. Blogger users can keep their content out of search engines by using these robots meta tags.
  • How to use the X-Robots-Tag header: Another method for restricting what search engines index. It can also keep non-HTML files such as PDFs out of the index.
Controlling Crawling

The advantage of the robots.txt file is that it lets you specify which paths on your site may be crawled. Crawlers request the robots.txt file from the server before crawling. Within the robots.txt file, you can include sections for specific (or all) crawlers with instructions (“directives”) that let them know which parts can or cannot be crawled.

Location of the robots.txt file

The robots.txt file must be located at the root of the website host that it should be valid for. For instance, in order to control crawling on all URLs below http://www.example.com/, the robots.txt file must be located at http://www.example.com/robots.txt. A robots.txt file can be placed on subdomains (like http://website.example.com/robots.txt) or on non-standard ports (http://example.com:8181/robots.txt), but it cannot be placed in a subdirectory (http://example.com/pages/robots.txt).

More about Crawling and Indexing at Kensium Business Process Outsourcing.
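As a rough sketch of how a crawler might apply these rules, the Python below derives the robots.txt location for any page URL and checks two URLs against a small set of directives using the standard library's `urllib.robotparser`. The helper `robots_txt_url`, the crawler name `ExampleBot`, and the `/private/` rule are made up for illustration.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


def robots_txt_url(page_url):
    """Return the robots.txt URL for the host (and port) serving page_url."""
    parts = urlsplit(page_url)
    # netloc keeps the subdomain and any non-standard port; the path is
    # always /robots.txt at the root, never inside a subdirectory.
    return f"{parts.scheme}://{parts.netloc}/robots.txt"


# Hypothetical directives: all crawlers are asked to stay out of /private/.
SAMPLE_RULES = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(SAMPLE_RULES.splitlines())

print(robots_txt_url("http://example.com:8181/pages/about.html"))
print(parser.can_fetch("ExampleBot", "http://www.example.com/private/data.html"))
print(parser.can_fetch("ExampleBot", "http://www.example.com/pages/about.html"))
```

A real crawler would fetch the file from the derived URL (for example with `RobotFileParser.set_url` followed by `read`) instead of parsing an inline string.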
