| 01 |
Google Search
Engine Info
Top Search Engine Info has dedicated this section
to everything you ever wanted to know about Google.
We have collected every press release Google has
ever published, along with all the financial
records it has released.
If you would also like to know a little about
Google's infrastructure, have a look at the news
article below. We cannot verify its accuracy or even
its age, but it is still worth the read!
|
| |
| 02 |
Major
Search Engines Like Google Unite to Support a
Common Mechanism For Website Submission!
Sitemaps protocol will enable Google, Yahoo! and
Microsoft to provide more comprehensive and fresh
search results
In the first joint and open initiative to improve
the Web crawl process for search engines, Google,
Yahoo! and Microsoft today announced support for
Sitemaps 0.90 (www.sitemaps.org), a free and easy
way for webmasters to notify search engines about
their websites and be indexed more comprehensively
and efficiently, resulting in better representation
in search indices. For users, Sitemaps enables
higher quality, fresher search results. An
initiative initially driven by Yahoo! and Google,
Sitemaps builds upon the pioneering Sitemaps 0.84,
released by Google in June of 2005, which is now
being adopted by Yahoo! and Microsoft to offer a
single protocol to enhance Web crawling efforts.
Together, the sponsoring companies will continue to
collaborate on the Sitemaps protocol and publish
enhancements on a jointly maintained website
www.sitemaps.org, which provides all of the details
about the Sitemaps protocol.
How Sitemaps Work
A Sitemap is an XML file that can be made available
on a website and acts as a marker for search engines
to crawl certain pages. It is an easy way for
webmasters to make their sites more search engine
friendly. It does this by conveniently allowing
webmasters to list all of their URLs along with
optional metadata, such as the last time the page
changed, to improve how search engines crawl and
index their websites.
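As a rough illustration of the paragraph above, the following sketch builds a small Sitemap file with Python's standard library. The `urlset`, `url`, `loc` and `lastmod` element names and the namespace come from the protocol published at sitemaps.org; the function name `build_sitemap`, the example.com URLs and the dates are made up for the example.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the Sitemaps 0.90 protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages):
    """Return Sitemap XML for an iterable of (url, lastmod) pairs.

    lastmod is optional metadata; pass None to leave it out.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        if lastmod is not None:
            ET.SubElement(url, "lastmod").text = lastmod
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical site with two pages, one of which has a known change date.
xml_text = build_sitemap([
    ("http://www.example.com/", "2006-11-16"),
    ("http://www.example.com/catalog", None),
])
print(xml_text)
```

The resulting text would be saved as a file on the web server and its URL submitted to the participating search engines.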
Sitemaps enhance the current model of Web crawling
by allowing webmasters to list all their Web pages
to improve comprehensiveness, notify search engines
of changes or new pages to help freshness, and
identify unchanged pages to prevent unnecessary
crawling and save bandwidth. Webmasters can now
universally submit their content in a uniform
manner. Any webmaster can submit their Sitemap to
any search engine which has adopted the protocol.
The Sitemaps protocol used by Google has been widely
adopted by many Web properties, including sites from
the Wikimedia Foundation. Any company that manages
dynamic content and a lot of web pages can benefit
from Sitemaps. For example, if a company that
uses a content management system (CMS) to
deliver custom web content (e.g., pricing,
availability and promotional offers) to thousands
of URLs places a Sitemap file on its web servers,
search engine crawlers will be able to discover what
pages are present and which have recently changed,
and to crawl them accordingly. By using Sitemaps,
new links can reach search engine users more rapidly
by informing search engine "spiders" and helping
them to crawl more pages and discover new content
faster. This can also drive online traffic and make
search engine marketing more effective by delivering
better results to users.
For companies looking to improve user experience
while keeping costs low, Sitemaps also helps make
more efficient use of bandwidth. Sitemaps can help
search engines find a company’s newest content more
efficiently and avoid the need to revisit unchanged
pages. Sitemaps can list what is new on a site and
quickly guide crawlers to that new content.
"At industry conferences, webmasters have asked for
open standards just like this," said Danny Sullivan,
editor-in-chief of Search Engine Watch. "This is a
great development for the whole community and
addresses a real need of webmasters in a very
convenient fashion. I believe it will lead to
greater collaboration in the industry for common
standards, including those based around robots.txt,
a file that gives Web crawlers direction when they
visit a website."
"Announcing industry supported Sitemaps is an
important milestone for all of us because it will
help webmasters and search engines get the most
relevant information to users faster. Sitemaps
address the challenges of a growing and dynamic Web
by letting webmasters and search engines talk to
each other, enabling a better web crawl and better
results," said Narayanan Shivakumar, Distinguished
Entrepreneur with Google. "Our initial efforts have
provided webmasters with useful information about
their sites, and the information we've received in
turn has improved the quality of Google's search."
"The launch of Sitemaps is significant because it
allows for a single, easy way for websites to
provide content and metadata to search engines,"
said Tim Mayer, senior director of product
management, Yahoo Search. "Sitemaps helps webmasters
surface content that is typically difficult for
crawlers to discover, leading to a more
comprehensive search experience for users."
"The quality of your index is predicated by the
quality of your sources and Windows Live Search is
happy to be working with Google and Yahoo! on
Sitemaps to not only help webmasters, but also help
consumers by delivering more relevant search results
so they can find what they’re looking for faster,"
said Ken Moss, General Manager of Windows Live
Search at Microsoft.
The protocol will be available at sitemaps.org, and
the companies plan to have Yahoo Small Business host
the site. Any site owner can create and upload an
XML Sitemap and submit the URL of the file to
participating search engines.
|
| |
| 03 |
Look at what makes the great Google tick!
More than four billion Web pages, each an average
of 10KB, all indexed and cached (copied).
Up to 2,000 computers in a cluster, with over 30
clusters.
Over 60,000 computers! Yes, 60,000 PCs.
104 interface languages (how many can you name?).
One petabyte of data in a cluster -- so much that
hard disk error rates of 1 in 10 to the power 15
begin to be a real problem if not planned for.
Constant transfer rates of over 60Gbps (gigabits
per second).
An expectation that sixty machines will fail every
day.
No complete system failure since February 2000.
It is one of the largest computing projects on the
planet, arguably employing more computers than any
other single, fully managed system (we're not
counting distributed computing projects here), some
200 computer science PhDs, and 600 other computer
scientists.
And it is all hidden behind a deceptively simple,
white, Web page that contains a single one-line text
box and a button that says Google Search.
Nobody hides the complexity of the job better than
Google does; so long as we have a connection to the
Internet, the Google search page is there day and
night, every day of the year, and it is not just
there, but it returns results. Google recognises
that the returns are not always perfect, and there
are still issues there -- more on those later -- but
when you understand the complexity of the system
behind that Web page you may be able to forgive the
imperfections. You may even agree that what Google
achieves is nothing short of magic.
Google's vice-president of engineering, Urs Hölzle,
who has been with the company since 1999 and who is
now a Google fellow, gave an insight recently to
would-be Google employees into just what it takes to
run an operation on such a scale, with such
reliability. Read on for some of the secrets of
Google's magic.
Google's vision is broader than most people imagine,
said Hölzle: "Most people say Google is a search
engine but our mission is to organise information to
make it accessible."
Behind that, he said, comes a vast scale of
computing power based on cheap, no-name hardware
that is prone to failure. There are hardware
malfunctions not just once, but time and time again,
many times a day.
Yes, that's right, Google is built on imperfect
hardware. The magic is writing software that accepts
that hardware will fail, and expeditiously deals
with that reality, says Hölzle.
Google indexes over four billion Web pages, using an
average of 10KB per page, which comes to about 40TB.
Google is asked to search this data over 1,000 times
every second of every day, and typically comes back
with sub-second response rates. If anything goes
wrong, said Hölzle, "you can't just switch the
system off and switch it back on again."
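The figures in the paragraph above can be checked with a few lines of arithmetic. This sketch assumes decimal units (1KB = 1,000 bytes, 1TB = 10^12 bytes), which the article does not specify.

```python
PAGES = 4_000_000_000          # "over four billion Web pages"
BYTES_PER_PAGE = 10 * 1000     # "an average of 10KB per page" (decimal KB assumed)
QUERIES_PER_SECOND = 1000      # "over 1,000 times every second"

index_terabytes = PAGES * BYTES_PER_PAGE / 1000**4
queries_per_day = QUERIES_PER_SECOND * 24 * 60 * 60

print(f"Index size: about {index_terabytes:.0f}TB")
print(f"Queries per day: at least {queries_per_day:,}")
```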
The job is not helped by the nature of the Web. "In
academia," said Hölzle, "the information retrieval
field has been around for years, but that is for
books in libraries. On the Web, content is not
nicely written - there are many different grades of
quality."
Some, he noted, may not even have text. "You may
think we don't need to know about those but that’s
not true - it may be the home page of a very large
company where the Webmaster decided to have
everything graphical. The company name may not even
appear on the page."
Google deals with such pages by regarding the Web
not as a collection of text documents, but a
collection of linked text documents, with each link
containing valuable information.
"Take a link pointing to the Stanford university
home page," said Hölzle. "This tells us several
things: First, that someone must think pointing to
Stanford is important. The text in the link also
gives us some idea of what is on the page being
pointed to. And if we know something about the page
that contains the link we can tell something about
the quality of the page being linked to."
This knowledge is encapsulated in Google's famous
PageRank algorithm, which looks not just at the
number of links to a page but at the quality or
weight of those links, to help determine which page
is most likely to be of use, and so which is
presented at the top of the list when the search
results are returned to the user. Hölzle believes
the PageRank algorithm is 'relatively' spam
resistant, and those interested in exactly how it
works can find more information here.
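For readers who want a feel for the idea, here is a minimal sketch of the simplified, textbook form of PageRank computed by repeated iteration. It is not Google's production algorithm; the damping factor of 0.85, the iteration count and the tiny link graph are all assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook PageRank by power iteration.

    links maps each page to the list of pages it links to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its weight on, split across its outgoing links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its weight evenly.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical four-page Web in which three pages point at "stanford".
ranks = pagerank({
    "a": ["stanford"],
    "b": ["stanford"],
    "c": ["stanford", "a"],
    "stanford": ["a"],
})
best = max(ranks, key=ranks.get)
print(best, round(ranks[best], 3))
```

Note that page "a" ends up ranked well above "b" and "c" even though only two pages link to it, because one of those links comes from the highly ranked "stanford" page. That is the "quality or weight of those links" at work.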
Obviously it would be impractical to run the
algorithm on every page for every query, so Google
breaks the problem down.
When a query comes in to the system it is sent off
to index servers, which contain an index of the Web.
This index is a mapping of each word to each page
that contains that word. For instance, the word
'Imperial' will point to a list of documents
containing that word, and similarly for 'College'.
For a search on 'Imperial College' Google does a
Boolean 'AND' operation on the two words to get a
list of what Hölzle calls 'word pages'.
"We also consider additional data, such as where in
the page does the word occur: in the title, the
footnote, is it in bold or not, and so on."
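The word-to-page mapping and the Boolean 'AND' described above can be sketched as a tiny inverted index. The three documents and the function names `build_index` and `search_and` are invented for the example, and the sketch ignores the additional data (title, bold text and so on) that Hölzle mentions.

```python
from collections import defaultdict


def build_index(docs):
    """Map each word to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search_and(index, query):
    """Return the IDs of documents that contain every word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    postings = [index.get(word, set()) for word in words]
    return set.intersection(*postings)


docs = {
    1: "Imperial College London",
    2: "Imperial measurement units",
    3: "College admissions",
}
index = build_index(docs)
print(search_and(index, "Imperial College"))
```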
Each index server indexes only part of the Web, as
the whole Web will not fit on a single machine --
certainly not the type of machines that Google uses.
Google's index of the Web is distributed across many
machines, and the query gets sent to many of them --
Google calls each one a shard (of the Web). Each one
works on its part of the problem.
Google computes the top 1000 or so results, and
those come back as document IDs rather than text.
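A minimal sketch of that scatter-and-gather step, assuming each shard holds precomputed per-word scores for its own documents: every shard returns its best document IDs and the partial lists are merged into one overall top list. The data layout and the function names are hypothetical, and a real system would query the shards in parallel rather than in a loop.

```python
import heapq


def search_shard(shard, word, k):
    """Return up to k (score, doc_id) pairs from a single shard.

    shard maps doc_id -> {word: score} for the documents it holds.
    """
    hits = ((scores[word], doc_id)
            for doc_id, scores in shard.items() if word in scores)
    return heapq.nlargest(k, hits)


def search_all(shards, word, k):
    """Ask every shard for its top k, then merge into a global top k."""
    partial = []
    for shard in shards:
        partial.extend(search_shard(shard, word, k))
    return [doc_id for _, doc_id in heapq.nlargest(k, partial)]


# Two hypothetical shards, each holding a different part of the Web.
shards = [
    {1: {"google": 0.9}, 2: {"google": 0.2}},
    {3: {"google": 0.5, "web": 0.1}, 4: {"web": 0.7}},
]
top_ids = search_all(shards, "google", k=2)
print(top_ids)
```

Only document IDs travel back from the shards, which matches the article's point that the text itself is fetched later from the document servers.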
The next step is to use document servers, which
contain a copy of the Web as crawled by Google's
spiders. Again the Web is essentially chopped up so
that each machine contains one part of the Web. When
a match is found, it is sent to the ad server which
matches the ads and produces the familiar results
page.
Google's business model works because all this is
done on cheap hardware, which allows it to run the
service free-of-charge to users, and charge only for
advertising.
The hardware
"Even though it is a big problem", said Hölzle, "it
is tractable, and not just technically but
economically too. You can use very cheap hardware,
but to do this you have to have the right software."
Google runs its systems on cheap, no-name 1U and 2U
servers -- so cheap that Google refers to them as
PCs. After all, each one has a standard x86 PC
processor, a standard IDE hard disk, and standard PC
reliability -- which means it is expected to fail
once in three years.
On a PC at home, that is acceptable for many people
(if only because they're used to it), but on the
scale that Google works at it becomes a real issue;
in a cluster of 1,000 PCs you would expect, on
average, one to fail every day. "On our scale you
cannot deal with this failure by hand," said Hölzle.
"We wrote our software to assume that the components
will fail and we can just work around it. This
software is what makes it work."
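The failure figures quoted in the article follow from simple arithmetic. This sketch assumes failures are spread evenly over time, so a machine that fails once in three years fails on any given day with probability 1/1095; the article's "one a day" and "sixty a day" are rounded versions of the results.

```python
DAYS_BETWEEN_FAILURES = 3 * 365  # "expected to fail once in three years"


def expected_failures_per_day(machines, days_between_failures=DAYS_BETWEEN_FAILURES):
    """Average number of machines failing per day in a fleet of this size."""
    return machines / days_between_failures


for fleet in (1_000, 2_000, 60_000):
    print(f"{fleet:>6} machines: {expected_failures_per_day(fleet):.1f} failures a day")
```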
One key idea is replication. "This server that
contains this shard of the Web, let's have two, or
10," said Hölzle. "This sounds expensive, but if you
have a high-volume service you need that replication
anyway. So you have replication and redundancy for
free. If one fails you have a 10 percent reduction in
service, so no failures so long as the load balancer
works. So failure becomes a manageable event."
In reality, he said, Google probably has "50 copies
of every server". Google replicates servers, sets of
servers and entire data centres, added Hölzle, and
has not had a complete system failure since February
2000. Back then it had a single data centre, and the
main switch failed, shutting the search engine down
for an hour. Today the company mirrors everything
across multiple independent data centres, and the
fault tolerance works across sites, "so if we lose a
data centre we can continue elsewhere -- and it
happens more often than you would think. Stuff
happens and you have to deal with it."
A new data centre can be up and running in under
three days. "Our data centre now is like an iMac,"
said Hölzle. "You have two cables, power and data.
All you need is a truck to bring the servers in and
the whole burning in, operating system install and
configuration is automated."
Working around failure of cheap hardware, said
Hölzle, is fairly simple. If a connection breaks it
means that machine has crashed so no more queries
are sent to it. If there is no response to a query
then again that signals a problem, and it can cut it
out of the loop.
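The rule described above (stop sending queries to a machine once it fails to answer) might look something like this in miniature. The class and function names are hypothetical, replicas are modelled as plain Python callables, and a real load balancer would also need timeouts and a way to bring repaired machines back.

```python
class ReplicaDown(Exception):
    """Raised when a replica's connection breaks or it does not respond."""


class ShardClient:
    """Sends queries to replicas of one shard, dropping any that fail."""

    def __init__(self, replicas):
        self.healthy = list(replicas)

    def query(self, text):
        for replica in list(self.healthy):  # iterate over a copy while removing
            try:
                return replica(text)
            except ReplicaDown:
                # Cut the failed machine out of the loop and try the next one.
                self.healthy.remove(replica)
        raise RuntimeError("no healthy replica left for this shard")


def crashed_replica(text):
    raise ReplicaDown("connection broken")


def working_replica(text):
    return f"results for {text!r}"


client = ShardClient([crashed_replica, working_replica])
answer = client.query("imperial college")
print(answer)
```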
That is redundancy taken care of, but what about
scaling? The Web grows every year, as does the number
of people using it, and that means more strain on
Google's servers.
Google has two crucial factors in its favour. First,
the whole problem is what Hölzle refers to as
embarrassingly parallel, which means that if you
double the amount of hardware, you can double
performance (or capacity if you prefer -- the
important point is that there are no diminishing
returns as there would be with less parallel
problems).
The second factor in Google's favour is the falling
cost of hardware. If the index size doubles, then
the embarrassingly parallel nature of the problem
means that Google could double the number of
machines and get the same response time so it can
grow linearly with traffic. "In reality (from a
business point of view) we would like to grow less
than linear to keep costs down," said Hölzle, "but
luckily the hardware keeps getting cheaper."
So every year as the Web gets bigger and requires
more hardware to index, search and return Web pages,
hardware gets cheaper so it "more or less evens out"
to use Hölzle's words.
As the scale of the operation increases, it
introduces some particular problems that would not
be an issue on smaller systems. For instance, Google
uses IDE drives for all its storage. They are fast
and cheap, but not highly reliable. To help deal
with this, Google developed its own file system --
called the Google File System, or GFS -- which
assumes an individual unit of storage can go away at
any time either because of a crash, a lost disk or
just because someone stepped on a cable.
The power of three
There are no disk arrays within individual PCs;
instead Google stores every bit of data in
triplicate on three machines on three racks on three
data switches to make sure there is no single point
of failure between you and the data. "We use this
for hundreds of terabytes of data," said Hölzle.
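One way to picture the "three machines on three racks on three data switches" rule is as a placement check: no two copies may share a rack or a switch. The sketch below is only an illustration of that constraint, not how GFS actually chooses machines; the `Machine` type, the machine names and the simple first-fit strategy are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Machine:
    name: str
    rack: str
    switch: str


def place_replicas(machines, copies=3):
    """Pick machines so that no two copies share a rack or a switch.

    First-fit only: it may give up on layouts where a cleverer search
    would still find a valid placement.
    """
    chosen, racks, switches = [], set(), set()
    for machine in machines:
        if machine.rack in racks or machine.switch in switches:
            continue
        chosen.append(machine)
        racks.add(machine.rack)
        switches.add(machine.switch)
        if len(chosen) == copies:
            return chosen
    raise ValueError("not enough independent racks and switches")


machines = [
    Machine("pc1", "rackA", "sw1"),
    Machine("pc2", "rackA", "sw1"),
    Machine("pc3", "rackB", "sw2"),
    Machine("pc4", "rackC", "sw2"),
    Machine("pc5", "rackC", "sw3"),
]
placement = place_replicas(machines)
print([m.name for m in placement])
```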
Don't expect to see GFS on a desktop near you any
time soon -- it is not a general-purpose file
system. For instance, a GFS block size is 64MB,
compared with the more usual 2KB on a desktop file
system. Hölzle said Google has 30 plus clusters
running GFS, some as large as 2,000 machines with
petabytes of storage. These large clusters can
sustain read/write speeds of 2Gbps -- a feat made
possible because each PC manages 2Mbps.
Once, said Hölzle, "someone disconnected an
80-machine rack from a GFS cluster, and the
computation slowed down as the system began to
re-replicate and we lost some bandwidth, but it
continued to work. This is really important if you
have 2,000 machines in a cluster." If you have 2000
machines then you can expect to see two failures a
day.
Running thousands of cheap servers with relatively
high failure rates is not an easy job. Standard
tools don't work at this scale, so Google has had to
develop them in-house. Some of the other challenges
the company continues to face include:
Debugging: "You see things on the real site you
never saw in testing because some special set of
circumstances that create a bug," said Hölzle. "This
can create non-trivial but fun problems to work on."
Data errors: A regular IDE hard disk will have an
error rate in the order of 1 in 10 to the power of
15 - that is, one millionth of one billionth of the
data written to it may get corrupted and the hard
disk's own error checking will not pick it up. "But when
you have a petabyte of data you need to start
worrying about these failures," said Hölzle. "You
must expect that you will have undetected bit errors
on your disk several times a month, even with
hardware checking built-in, so GFS does have an
extra level of checksumming. Again this is something
we didn’t expect, but things happen."
Spelling: Google wrote its own spell checker, and
maintains that nobody knows as many spelling errors
as it does. The amount of computing power available
at the company means it can afford to begin teaching
the system which words are related -- for instance
"Imperial", "College" and "London". It's a job that
takes many CPU years, and which would not have been
possible without these thousands of machines. "When
you have tons of data and tons of computation you
can make things work that don’t work on smaller
systems," said Hölzle. One goal of the company now
is to develop a better conceptual understanding of
text, to get from the text string to a concept.
Power density: "There is an interesting problem when
you use PCs," said Hölzle. "If you go to a
commercial data centre and look at what they can
support, you'll see a typical design allowing for
50W to 100W per square foot. At 200W per square foot
you notice the sales person still wants to sell it
but their international tech guy starts sweating. At
300W per square foot they cry out in pain."
Eighty mid-range PCs in a rack, of which you will
find many dozens in a Google data centre, produce
over 500W per square foot. "So we're not going to
blade technology," said Hölzle. "We're already too
dense. Finally Intel has realised this is a problem
and is now focusing more on power efficiency, but it
took some time to get the message across."
Quality of search results: One big area of
complaints for Google is connected to the growing
prominence of commercial search results -- in
particular price comparison engines and e-commerce
sites. Hölzle is quick to defend Google's
performance "on every metric", but admits there is a
problem with the Web getting, as he puts it, "more
commercial". Even three years ago, he said, the Web
had much more of a grass roots feeling to it. "We
have thought of having a button saying 'give me less
commercial results'," but the company has shied away
from implementing this yet.
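The "extra level of checksumming" mentioned under data errors can be illustrated with a generic sketch: store a checksum alongside each block and recompute it on every read. The article does not say which checksum GFS uses, so CRC-32 from Python's `zlib` module is simply an assumption here, as are the function names.

```python
import zlib


def store(block: bytes):
    """Return the block together with a checksum computed at write time."""
    return block, zlib.crc32(block)


def verify(block: bytes, checksum: int) -> bool:
    """Recompute the checksum at read time and compare with the stored one."""
    return zlib.crc32(block) == checksum


data, checksum = store(b"some page content held on a cheap IDE disk")

# Simulate a silent single-bit error that the disk itself did not report.
corrupted = bytearray(data)
corrupted[0] ^= 0b00000001

print(verify(data, checksum))
print(verify(bytes(corrupted), checksum))
```

When a check like this fails, a system that keeps several copies of every block can discard the bad copy and re-replicate from a good one.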
Sent via email to us in 2005.
Contact Us
If you would like to contact us, please email us at
contact@top-search-engine-info.co.uk