SEO News Tip of the Day!
It’s been a long time since my last post: too much work and not enough time. Anyway, a short while ago the SEO forums were ablaze with the latest patent release from Google. The patent included loads of different techniques a search engine could use to determine relevancy, and I say “could” because it would be very difficult to use all of the items covered in the patent. My own view is that many items were included to dilute the true gems.

One part that really caught my eye was using domain registration information as part of a search engine’s algorithm. I remember reading a news story several months ago that Google had become a domain registrar, and many presumed domain registration would be one of Google’s many new adventures. This gossip was stopped as quickly as it started, with Google announcing that no domain registration service was on the cards. So why become a domain registrar?

Length of Domain Registration, logical ranking criteria

Well, as mentioned above, part of the patent focused on using domain information to determine the reliability of a website. In short, a domain that has only been registered for one year may be more volatile than a domain that’s been registered for ten years. If you’re serious about your online venture, it makes sense to register your domain for a longer period of time. Google could use this fact as part of their ranking criteria, and I suspect they have been doing so for some time.

So if you haven’t already done so, I would recommend contacting your domain registration company to register your domain for at least five years. It may not affect your rankings at all, but on the other hand it may give you just enough of a boost to push some of those keywords onto the front pages. With domain registration costing as little as it does nowadays, it may turn out to be the most cost-effective ranking boost you’ve ever paid for.
If it doesn’t produce any results, no harm done and you may have just saved yourself a couple bucks on next year’s budget :)
Search Engine Spiders Lost Without Guidance - Post This Sign!
The robots.txt file follows the robots exclusion standard,
which web crawlers/robots read to learn which files and
directories you want them to stay OUT of on your site. Not all
crawlers/bots follow the exclusion standard and will continue
crawling your site anyway. I like to call them "Bad Bots" or
trespassers. We block them by IP exclusion which is another
story entirely.

This is a very simple overview of robots.txt basics for
webmasters. For a complete and thorough lesson, visit
http://www.robotstxt.org/

To see the proper format for a somewhat standard robots.txt
file look directly below. That file should be at the root of
the domain because that is where the crawlers expect it to be,
not in some secondary directory.

Below is the proper format for a robots.txt file ----->

User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /group/

User-agent: msnbot
Crawl-delay: 10

User-agent: Teoma
Crawl-delay: 10

User-agent: Slurp
Crawl-delay: 10

User-agent: aipbot
Disallow: /

User-agent: BecomeBot
Disallow: /

User-agent: psbot
Disallow: /

--------> End of robots.txt file

This tiny text file is saved as a plain text document and
ALWAYS with the name "robots.txt" in the root of your domain.

A quick review of the listed information from the robots.txt
file above follows. The "User-agent: msnbot" entry is from MSN,
Slurp is from Yahoo and Teoma is from AskJeeves. The others
listed are "Bad" bots that crawl very fast and to nobody's
benefit but their own, so we ask them to stay out entirely.
The * asterisk is a wild card that means "All"
crawlers/spiders/bots should stay out of that group of files
or directories listed.

The instruction "Disallow: /" tells a bot that it should
stay out entirely, and those with "Crawl-delay: 10" are those
that crawled our site too quickly and caused it to bog down
and overuse the server resources. Google crawls more slowly
than the others and doesn't require that instruction, so is
not specifically listed in the above robots.txt file.
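
To see how a crawler that honors the standard would read the
file above, here is a minimal sketch using the robots.txt
parser in Python's standard library. The rules are copied from
the file above; the example.com URLs and the helper name
load_rules are placeholders made up for illustration. One thing
the sketch makes visible: this parser applies the * section
only to bots that have no section of their own, so msnbot picks
up its Crawl-delay but not the Disallow lines listed under *.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt rules from the article, as one string.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /group/

User-agent: msnbot
Crawl-delay: 10

User-agent: Teoma
Crawl-delay: 10

User-agent: Slurp
Crawl-delay: 10

User-agent: aipbot
Disallow: /

User-agent: BecomeBot
Disallow: /

User-agent: psbot
Disallow: /
"""


def load_rules(text):
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)

# example.com is a placeholder domain, not one from the article.
checks = [
    ("Googlebot", "http://example.com/index.html"),
    ("Googlebot", "http://example.com/cgi-bin/form.cgi"),
    ("msnbot", "http://example.com/cgi-bin/form.cgi"),
    ("psbot", "http://example.com/index.html"),
]
for agent, url in checks:
    verdict = "allowed" if rules.can_fetch(agent, url) else "blocked"
    print(f"{agent:10} {url:40} {verdict}")

# crawl_delay() returns None when no Crawl-delay applies to the bot.
for agent in ("msnbot", "Slurp", "Googlebot"):
    print(f"{agent:10} crawl delay: {rules.crawl_delay(agent)}")
```

If you want the well-behaved bots to skip the same directories
as everyone else, the same sketch lets you try out a changed
file before you upload it.
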
The Crawl-delay instruction is only needed on very large sites
with hundreds or thousands of pages. The wildcard asterisk *
applies to every crawler, bot and spider that has no section
of its own in the file, including Googlebot. A bot named in its
own "User-agent" section follows only that section.

Those we gave the "Crawl-delay: 10" instruction to were
requesting as many as 7 pages every second and so we asked
them to slow down. The number you see is seconds and you can
change it to suit your server capacity, based on their
crawling rate. Ten seconds between page requests is far more
leisurely and stops them from asking for more pages than your
server can dish up.

(You can discover how fast robots and spiders are crawling by
looking at your raw server logs, which show pages requested
by precise times to within a hundredth of a second. They are
available from your web host, or ask your web or IT person.
If you have server access, your server logs can be found in
the root directory, and you can usually download compressed
server log files by calendar day right off your server. You'll
need a utility that can expand compressed files to open and
read those plain text raw server log files.)

To see the contents of any robots.txt file just type
/robots.txt after any domain name. If they have that file up,
you will see it displayed as a text file in your web browser.
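
Because the file always sits at the root of the domain, its
address can be worked out from any page on the site. Here is a
small sketch using only Python's standard library; the function
name robots_url and the sample page path are made up for
illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url):
    """Return the robots.txt address for the site that serves page_url."""
    parts = urlsplit(page_url)
    if not parts.scheme or not parts.netloc:
        raise ValueError("expected a full URL such as http://www.example.com/page")
    # Keep the scheme and host; drop the path, query string and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# The page path below is a made-up example.
print(robots_url("http://www.Amazon.com/some/deep/page.html?ref=1"))
```

Paste the printed address into your browser to view the file.
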
Click on the link below to see that file for Amazon.com

http://www.Amazon.com/robots.txt

You can see the contents of any website robots.txt file that
way. The robots.txt shown above is what we currently use at
Publish101 Web Content Distributor, just launched in May of
2005. We did an extensive case study and published a series of
articles on crawler behavior and indexing delays known as the
Google Sandbox. That Google Sandbox Case Study is highly
instructive on many levels for webmasters everywhere about the
importance of this often ignored little text file.

One thing we didn't expect to glean from the research involved
in indexing delays (known as the Google Sandbox) was the
importance of robots.txt files to quick and efficient crawling
by the spiders from the major search engines and the number of
heavy crawls from bots that will do no earthly good to the
site owner, yet crawl most sites extensively and heavily,
straining servers to the breaking point with requests for
pages coming as fast as 7 pages per second.

We discovered in our launch of the new site that Google and
Yahoo will crawl the site whether or not you use a robots.txt
file, but MSN seems to REQUIRE it before they will begin
crawling at all. All of the search engine robots seem to
request the file on a regular basis to verify that it hasn't
changed.

Then when you DO change it, they will stop crawling for brief
periods and repeatedly ask for that robots.txt file during
that time without crawling any additional pages. (Perhaps they
had a list of pages to visit that included the directory or
files you have instructed them to stay out of and must now
adjust their crawling schedule to eliminate those files from
their list.)

Most webmasters instruct the bots to stay out of "image"
directories and the "cgi-bin" directory as well as any
directories containing private or proprietary files intended
only for users of an intranet or password protected sections
of your site. Clearly, you should direct the bots to stay out
of any private areas that you don't want indexed by the search
engines.

The importance of robots.txt is rarely discussed by average
webmasters and I've even had some of my business clients'
webmasters ask me what it is and how to implement it when I
tell them how important it is to both site security and
efficient crawling by the search engines. This should be
standard knowledge by webmasters at substantial companies, but
this illustrates how little attention is paid to use of
robots.txt.

The search engine spiders really do want your guidance and
this tiny text file is the best way to provide crawlers and
bots a clear signpost to warn off trespassers and protect
private property - and to warmly welcome invited guests, such
as the big three search engines while asking them nicely to
stay out of private areas.

Copyright © August 17, 2005 by Mike Banks Valentine

Google Sandbox Case Study http://publish101.com/Sandbox2
Mike Banks Valentine operates http://Publish101.com
Free Web Content Distribution for Article Marketers and
Provides content aggregation, press release optimization
and custom web content for Search Engine Positioning
http://www.seoptimism.com/SEO_Contact.htm