© Copyright July 18, 2005 Mike Banks Valentine
Search engine listing delays, which have come to be called the
Google Sandbox effect, actually occur in one form or another at
each of the four top-tier search engines. MSN, it seems, has the
shortest indexing delay at 30 days. This article is the second
in a series following the spiders through a brand new web site,
beginning on May 11, 2005, when the site first went live under a
newly purchased domain name.
First Case Study Article
Previously we looked at the first 35 days and detailed the
crawling behavior of Googlebot, Teoma, MSNbot and Slurp as they
traversed the pages of this new site. We discovered that each
spider displays distinctly different crawling frequency and
similarly different indexing patterns.
For reference, about 15 to 20 new pages are added to the site
daily, and each is linked from the home page for a day. The site
structure is non-traditional: there are no categories, and the
linking structure is tied to author pages that list each
author's articles, plus a "related articles" index that links to
relevant pages containing similar content.
So let's review where each spider stands, comparing the pages
crawled with the pages indexed by each engine.
Teoma, the AskJeeves spider, has crawled most of the pages on
the site, yet Ask.com has indexed no pages 60 days later at this
writing. This is clearly a site aging delay that's modeled on
Google's Sandbox behavior. The Teoma spider has crawled more
pages on this site than any other engine over the 60 day period,
but it appears to be tired of crawling, as it has not returned
since July 13 - its first break in 60 days.
In the first two days, Googlebot gobbled up 250 pages and didn't
return until 60 days later, and Google has not indexed even a
single page in the 60 days since that initial crawl. But
Googlebot is showing a renewed interest in crawling the site
since the first crawling case study article was published on
several high traffic sites. Now Googlebot is looking at a few
pages each day, so far no more than about 20 pages, at a
decidedly lackluster pace. It is a true "crawl" that will keep
it occupied for years if it continues that slowly.
MSNbot crawled timidly for the first 45 days, looking over 30 to
50 pages daily, but not until it found a robots.txt file. We had
neglected to post one to the site for a week, then bobbled the
ball as we changed site structure and failed to implement
robots.txt in new subdomains until day 25 - and THEN MSNbot
didn't return until day 30. If little else has been discovered
about initial crawls and indexing, we have seen that MSNbot
relies heavily on the robots.txt file and that proper
implementation of that file will speed crawling.
MSNbot is now crawling with enthusiasm at anywhere between 200
and 800 pages daily. As a matter of fact, we had to use a
"crawl-delay" directive in the robots.txt file after MSNbot began
hitting 6 pages per second last week. The MSN index now shows
4905 pages 60 days into this experiment. Cached pages change
weekly. MSNbot has apparently found that it likes how we changed
the page structure to include a new feature which links to
questions from several other article pages.
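
As a rough illustration of the "crawl-delay" instruction
mentioned above, here is a minimal Python sketch. The robots.txt
text and the 10-second value are assumptions made up for this
example, not the settings used on the case study site, and
Crawl-delay is a non-standard extension, so each crawler decides
whether and how to honor it. The sketch uses the standard
library's urllib.robotparser to confirm which delay a given
user-agent would read from the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt. The 10-second delay is an illustrative
# value, not the setting used on the site in this case study.
ROBOTS_TXT = """\
User-agent: msnbot
Crawl-delay: 10

User-agent: *
Disallow:
"""


def crawl_delay_for(robots_txt, user_agent):
    """Return the Crawl-delay in seconds that applies to user_agent, or None."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.crawl_delay(user_agent)


def max_daily_fetches(delay_seconds):
    """Upper bound on fetches per day if a bot waits delay_seconds between requests."""
    return 86400 // delay_seconds


if __name__ == "__main__":
    delay = crawl_delay_for(ROBOTS_TXT, "msnbot")
    print("msnbot delay:", delay)
    print("max fetches per day:", max_daily_fetches(delay))
```

For scale: a bot sustaining 6 pages per second could request
518,400 pages in a day, while a bot that honors a 10-second
delay is capped at 8,640.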
Slurp is strangely inactive, then hyperactive, in alternating
periods. The Yahoo crawler will look at 40 pages one day and
then 4000 the next, then simply look at the home page for a few
days, then jump back in for 3000 pages the next day and go back
to only reviewing robots.txt for two days. Consistency is not a
curse suffered by Slurp. Yahoo now shows 6 pages in its index;
one is an error page and another is an "index/of" page, as we
have not posted a home page to several subdomains. But Slurp has
easily crawled 15,000 pages to date.
Lessons learned in the first 60 days on a new site follow:
1) Google crawls 250 pages on first discovery of links to a
site. Then it doesn't return until it finds more links, and it
crawls slowly. Google has failed to index this new domain for 60
days.
2) Yahoo looks for error pages, and once it finds bad links it
will crawl them ceaselessly until you tell it to stop. Then it
won't crawl at all for weeks, until crawling heavily one day and
lightly the next in random fashion.
3) MSNbot requires robots.txt files, and once it decides it
likes your site, it may crawl too fast, requiring "crawl-delay"
instructions in that robots.txt file. Implement robots.txt
immediately.
4) Bad bots can strain resources and hit too many pages too
quickly until you tell them to stay out. We banned 3 bots
outright after they slammed our servers for a day or two. We
noted that "aipbot" crawled first, then "BecomeBot" came along,
and then "Pbot" from Picsearch.com crawled heavily looking for
image files we don't have. Bad bots, stay out. It is best to
implement robots.txt exclusions for all but the top engines if
their crawlers strain your server resources. We considered
excluding the Chinese search engine Baidu.com when it began
crawling heavily early on. We don't expect much traffic from
China, but why exclude one billion people? Especially since
Google is rumored to be considering a possible purchase of
Baidu.com as an entry to the Chinese market.
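
The robots.txt exclusions described in lesson 4 might look
something like the following minimal Python sketch. The bot names
are copied as spelled in this article; the exact user-agent
tokens are an assumption and should be confirmed against your
own server logs before relying on them. The example.com URL and
the helper names are hypothetical. The sketch builds a
robots.txt that shuts out the listed bots while leaving every
other crawler unrestricted, then checks the result with the
standard library's urllib.robotparser.

```python
from urllib.robotparser import RobotFileParser

# Bot names as spelled in the article; the real user-agent tokens
# are an assumption and should be checked against your own logs.
BANNED_BOTS = ["aipbot", "BecomeBot", "Pbot"]


def build_robots_txt(banned):
    """Build robots.txt text that bans the named bots and allows all others."""
    blocks = [f"User-agent: {name}\nDisallow: /\n" for name in banned]
    # An empty Disallow value means "nothing is off limits".
    blocks.append("User-agent: *\nDisallow:\n")
    return "\n".join(blocks)


def is_allowed(robots_txt, user_agent,
               url="http://www.example.com/article.html"):
    """Return True if user_agent may fetch url under robots_txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    text = build_robots_txt(BANNED_BOTS)
    print(text)
    for agent in BANNED_BOTS + ["Googlebot", "Baiduspider"]:
        print(agent, "allowed:", is_allowed(text, agent))
```

Keep in mind that robots.txt is only a request. Well-behaved
crawlers honor it, but a bot that ignores the file has to be
blocked at the server instead.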
The bottom line is that all engines seem to delay indexing of
new domain names for at least thirty days. Google so far has
delayed indexing THIS new domain for 60 days since first
crawling it. AskJeeves has crawled thousands of pages while
indexing none of them. MSN indexes faster than the other engines
but requires a robots.txt file. Yahoo's Slurp has crawled on
again, off again for 60 days, but Yahoo has indexed only six of
the 15,000 or more pages crawled to date.
We seem to have settled that there is a clear indexing delay,
but whether this site specifically is "Sandboxed" and whether
delays apply universally is less clear. Many webmasters claim
that they have been indexed fully within 30 days of first
posting a new domain. We'd love to see others track spiders
through new sites following launch and document their results
publicly so that indexing and crawling behavior can be verified.
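
For anyone who wants to try this kind of tracking, here is a
minimal Python sketch of one way to tally spider visits per day
from a web server access log. It is not the method used for this
case study. It assumes the common "combined" log format, and the
sample log lines, IP addresses and the tally_crawls helper are
all made up for illustration. The spider tokens match the four
crawlers discussed in this article.

```python
import re
from collections import Counter

# Substrings to look for in the user-agent field, mapped to engine names.
BOT_TOKENS = {
    "googlebot": "Google",
    "msnbot": "MSN",
    "slurp": "Yahoo",
    "teoma": "AskJeeves",
}

# Assumes the "combined" access log format.
LOG_LINE = re.compile(
    r'\[(?P<day>[^:]+):[^\]]*\] "(?:GET|HEAD) (?P<path>\S+)[^"]*" '
    r'\d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

# Made-up sample lines for illustration only.
SAMPLE_LOG = [
    '192.0.2.1 - - [11/May/2005:06:12:01 -0700] "GET /index.html HTTP/1.1" '
    '200 5120 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"',
    '192.0.2.1 - - [11/May/2005:06:12:09 -0700] "GET /a1.html HTTP/1.1" '
    '200 4096 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"',
    '192.0.2.2 - - [12/May/2005:09:30:44 -0700] "GET /robots.txt HTTP/1.0" '
    '404 210 "-" "msnbot/1.0 (+http://search.msn.com/msnbot.htm)"',
    '192.0.2.3 - - [12/May/2005:10:02:13 -0700] "GET /a1.html HTTP/1.1" '
    '200 4096 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"',
]


def tally_crawls(lines):
    """Count requests per (day, engine) for the known spiders."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if not match:
            continue  # Skip lines that do not fit the assumed format.
        agent = match.group("agent").lower()
        for token, engine in BOT_TOKENS.items():
            if token in agent:
                counts[(match.group("day"), engine)] += 1
                break
    return counts


if __name__ == "__main__":
    for (day, engine), hits in sorted(tally_crawls(SAMPLE_LOG).items()):
        print(day, engine, hits)
```

Note that this counts requests, not unique pages, and that it
trusts the user-agent string, which any client can fake.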
About the author:
Mike Banks Valentine is a search engine optimization specialist
who operates WebSite101 Ecommerce Tutorial and will continue
reporting on this case study chronicling search indexing of
Publish101 Article Resource.
http://www.seoptimism.com/SEO_Contact.htm