BLoG IT: robots.txt

robots.txt

In a nutshell

The Robots Exclusion Protocol:

Web site owners use the /robots.txt file to give instructions about their site to
web robots.

robots.txt:

In short, it is a file placed on a site to help the blogger or user: it lets the displayed content be enjoyed by the general public and makes it easier to introduce that content to the public, while those who want privacy can use it to limit which content is made available.

Sample:

It works like this: a robot wants to visit a Web site URL, say http://www.example.com/welcome.html. Before it does so, it first checks for http://www.example.com/robots.txt, and finds:

User-agent: *
Disallow: /

Example 1: restricting how far visitors may go when opening the blog, up to the limits set by the admin.

The "User-agent: *" means this section applies to all robots. The "Disallow: /" tells the robot that it should not visit any pages on the site.
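To make the rule concrete, here is a minimal sketch using Python's standard urllib.robotparser module. The robot name "ExampleBot" and the helper is_allowed are hypothetical, chosen only for illustration; a real robot would fetch the file over the network rather than read it from a string.

```python
from urllib.robotparser import RobotFileParser

# The sample file from the text above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if user_agent may fetch url under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # "ExampleBot" is a made-up robot name.
    print(is_allowed(ROBOTS_TXT, "ExampleBot", "http://www.example.com/welcome.html"))
```

With "Disallow: /" in place, the check comes back negative for every path on the site, whatever the robot is called.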

There are two important considerations when using /robots.txt:

- Robots can ignore your /robots.txt. In particular, malware robots that scan the web for security vulnerabilities, and email address harvesters used by spammers, will pay no attention.
- The /robots.txt file is a publicly available file. Anyone can see what sections of your server you don't want robots to use.

So don't try to use /robots.txt to hide information.

The details

According to its standardization, /robots.txt is described by two documents:

- The original, from 1994: A Standard for Robot Exclusion.
- From 1997, for Web robots: A Method for Web Robots Control, which is limited to Web sites only.

In addition there are external resources:

The /robots.txt standard is not actively developed. See "What about further development of /robots.txt?" for more discussion.

The rest of this page gives an overview of how to use /robots.txt on your server, with some simple recipes. To learn more see also the FAQ.

How to create a /robots.txt file

- Where to put it

- The short answer: in the top-level directory of your web server.

- The longer answer:

When a robot looks for the "/robots.txt" file for a URL, it strips the path component from the URL (everything from the first single slash), and puts "/robots.txt" in its place.

For example, for "http://www.example.com/shop/index.html", it will remove the "/shop/index.html", and replace it with "/robots.txt", and will end up with "http://www.example.com/robots.txt".
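The path replacement described above can be sketched with Python's standard urllib.parse module. The function name robots_url is hypothetical; the sketch assumes the input is an absolute URL with a scheme and host.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Replace the path (and any query or fragment) of page_url with /robots.txt."""
    parts = urlsplit(page_url)
    # Keep scheme and host; drop everything from the first single slash onward.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


if __name__ == "__main__":
    print(robots_url("http://www.example.com/shop/index.html"))
    # prints http://www.example.com/robots.txt
```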

So, as a web site owner you need to put it in the right place on your web server for that resulting URL to work. Usually that is the same place where you put your web site's main "index.html" welcome page. Where exactly that is, and how to put the file there, depends on your web server software.

Remember to use all lower case for the filename: "robots.txt", not "Robots.TXT".

What to put in it

The "/robots.txt" file is a text file, with one or more records. It usually contains a single record looking like this:

User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /~joe/

In this example, three directories are excluded.

Note that you need a separate "Disallow" line for every URL prefix you want to exclude -- you cannot say "Disallow: /cgi-bin/ /tmp/" on a single line. Also, you may not have blank lines in a record, as they are used to delimit multiple records.
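The record structure can be illustrated with a small hand-written parser. This is only a sketch under simplifying assumptions: the function parse_records is hypothetical, it recognizes just the User-agent and Disallow fields, and it does not handle comments.

```python
def parse_records(text: str) -> list[dict[str, list[str]]]:
    """Split robots.txt text into records, using blank lines as delimiters."""
    records = []
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            current = None  # a blank line closes the current record
            continue
        field, sep, value = line.partition(":")
        field = field.strip().lower()
        if not sep or field not in ("user-agent", "disallow"):
            continue  # ignore anything that is not a known "field: value" line
        if current is None:
            current = {"user-agent": [], "disallow": []}
            records.append(current)
        # The whole value is ONE prefix, even if it contains spaces.
        current[field].append(value.strip())
    return records


if __name__ == "__main__":
    sample = "User-agent: *\nDisallow: /cgi-bin/\nDisallow: /tmp/\n\nUser-agent: BadBot\nDisallow: /\n"
    for record in parse_records(sample):
        print(record)
```

Because each Disallow value is kept whole, a line such as "Disallow: /cgi-bin/ /tmp/" would yield the single prefix "/cgi-bin/ /tmp/", which matches neither directory.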

Note also that globbing and regular expressions are not supported in either the User-agent or Disallow lines. The '*' in the User-agent field is a special value meaning "any robot". Specifically, you cannot have lines like "User-agent: *bot*", "Disallow: /tmp/*" or "Disallow: *.gif".
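The following sketch shows why such lines do not work under the plain prefix matching described here. The helper is_disallowed is hypothetical and assumes each Disallow value is compared literally against the start of the URL path.

```python
def is_disallowed(path: str, disallow_prefixes: list[str]) -> bool:
    """Return True if path starts with any non-empty Disallow prefix."""
    # An empty Disallow value excludes nothing, so it is skipped.
    return any(prefix != "" and path.startswith(prefix) for prefix in disallow_prefixes)


if __name__ == "__main__":
    print(is_disallowed("/tmp/report.html", ["/tmp/"]))   # literal prefix matches
    print(is_disallowed("/tmp/report.html", ["/tmp/*"]))  # '*' is treated as a literal character
    print(is_disallowed("/images/logo.gif", ["*.gif"]))   # no path starts with "*.gif"
```

Only the first call reports a match; the other two values never match because the '*' is not expanded.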

What you want to exclude depends on your server. Everything not explicitly disallowed is considered fair game to retrieve. Here follow some examples:

To exclude all robots from the entire server

User-agent: *
Disallow: /

To allow all robots complete access

User-agent: *
Disallow:

(or just create an empty "/robots.txt" file, or don't use one at all)

To exclude all robots from part of the server

User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /junk/

To exclude a single robot

User-agent: BadBot
Disallow: /

To allow a single robot (not multiple)

User-agent: Google
Disallow:

User-agent: *
Disallow: /
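As a quick check of this two-record recipe, the sketch below feeds it to Python's standard urllib.robotparser module. The helper check_agents and the robot name "OtherBot" are hypothetical and serve only as illustration.

```python
from urllib.robotparser import RobotFileParser

# The two records from the recipe above, separated by a blank line.
ROBOTS_TXT = """\
User-agent: Google
Disallow:

User-agent: *
Disallow: /
"""


def check_agents(robots_txt: str, agents: list[str], url: str) -> dict[str, bool]:
    """Map each robot name to whether it may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


if __name__ == "__main__":
    # "OtherBot" stands in for any robot not named in the file.
    print(check_agents(ROBOTS_TXT, ["Google", "OtherBot"], "http://www.example.com/page.html"))
```

The named robot matches its own record, where the empty Disallow excludes nothing, while every other robot falls through to the "*" record and is shut out.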

To exclude all files except one (single-content site)

This is currently a bit awkward, as there is no "Allow" field. The easy way is to put all files to be disallowed into a separate directory, say "stuff", and leave the one file in the level above this directory:

User-agent: *
Disallow: /~joe/stuff/

Alternatively you can explicitly disallow all disallowed pages:

User-agent: *
Disallow: /~joe/junk.html
Disallow: /~joe/foo.html
Disallow: /~joe/bar.html
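When the list of pages is long, writing the record by hand gets tedious. A small generator can help; the function build_record below is a hypothetical helper, not part of any standard tool, and it simply emits one Disallow line per path.

```python
def build_record(user_agent: str, disallowed_paths: list[str]) -> str:
    """Build one robots.txt record with a separate Disallow line per path."""
    lines = [f"User-agent: {user_agent}"]
    if disallowed_paths:
        lines += [f"Disallow: {path}" for path in disallowed_paths]
    else:
        lines.append("Disallow:")  # an empty value means nothing is excluded
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    pages = ["/~joe/junk.html", "/~joe/foo.html", "/~joe/bar.html"]
    print(build_record("*", pages), end="")
```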

Source: http://www.robotstxt.org/robotstxt.html


Thursday, 12 August 2010
