The Internet has well over ten billion pages and is still growing rapidly. To find the proverbial needle in this immense haystack (or the tiny fly in the Web), there are at least two basic approaches: using a search engine or a search directory. Search directories are useful for browsing general topics, while search engines work well when searching for specific information.
The term "search engine" is often used generically to describe both crawler-based search engines and human-powered directories. These two types of search engines gather their listings in radically different ways.
Crawler-based
search engines, such as Google, create their listings automatically. They
"crawl" or "spider" the web, then people search through
what they have found.
If you change
your web pages, crawler-based search engines eventually find these changes, and
that can affect how you are listed. Page titles, body copy and other elements
all play a role.
Crawler-based
search engines have three major elements. First is the spider, also called the
crawler. The spider visits a web page, reads it, and then follows links to other
pages within the site. This is what it means when someone refers to a site
being "spidered" or "crawled." The spider returns to the
site on a regular basis, such as every month or two, to look for changes.
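To make the idea concrete, here is a minimal sketch of a spider in Python, assuming a single seed URL and a small page limit; the helper names and limits are invented for this illustration and do not describe any real engine's crawler.

    # A minimal, illustrative spider: fetch a page, extract its links,
    # and keep following links within the same site (breadth-first).
    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin, urlparse
    from urllib.request import urlopen

    class LinkExtractor(HTMLParser):
        """Collect the href targets of every <a> tag on a page."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(seed_url, max_pages=10):
        """Visit pages starting at seed_url, staying on the same host."""
        host = urlparse(seed_url).netloc
        queue, seen, pages = deque([seed_url]), {seed_url}, {}
        while queue and len(pages) < max_pages:
            url = queue.popleft()
            try:
                html = urlopen(url, timeout=5).read().decode("utf-8", "ignore")
            except OSError:
                continue  # skip pages that fail to load
            pages[url] = html  # hand the raw page over to the indexer
            parser = LinkExtractor()
            parser.feed(html)
            for link in parser.links:
                absolute = urljoin(url, link)
                if urlparse(absolute).netloc == host and absolute not in seen:
                    seen.add(absolute)
                    queue.append(absolute)
        return pages
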
Everything the spider
finds goes into the second part of the search engine, the index. The index,
sometimes called the catalog, is like a giant book containing a copy of every
web page that the spider finds. If a web page changes, then this book is
updated with new information.
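A toy version of such a catalog is easy to picture in code. The sketch below maps every word to the set of pages that contain it; the simple tokenizer and the function names are assumptions made purely for illustration.

    import re
    from collections import defaultdict

    def tokenize(text):
        """Lowercase a page's text and split it into simple word tokens."""
        return re.findall(r"[a-z0-9]+", text.lower())

    def build_index(pages):
        """Map every word to the set of page URLs that contain it.

        `pages` is a dict of {url: page text}, e.g. the output of a spider.
        """
        index = defaultdict(set)
        for url, text in pages.items():
            for word in tokenize(text):
                index[word].add(url)
        return index

    # When a crawled page changes, re-indexing it replaces its entries,
    # much like updating the "giant book" described above.
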
Sometimes it
can take a while for new pages or changes that the spider finds to be added to
the index. Thus, a web page may have been "spidered" but not yet
"indexed." Until it is indexed -- added to the index -- it is not
available to those searching with the search engine.
Search engine
software is the third part of a search engine. This is the program that sifts
through the millions of pages recorded in the index to find matches to a search
and rank them in order of what it believes is most relevant. You can learn more
about how search engine software ranks web pages as you read on.
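Continuing the toy index above, the query-processing piece might look up each search term, keep only the pages containing all of them, and sort by some precomputed relevance score. The sketch is deliberately naive; real ranking algorithms are far more elaborate (and secret).

    def search(query, index, scores):
        """Return pages containing every query word, best-scoring first.

        `scores` stands in for whatever relevance signals the engine has
        precomputed (keyword frequency, link analysis, and so on).
        """
        words = tokenize(query)
        if not words:
            return []
        matching = set.intersection(*(index.get(w, set()) for w in words))
        return sorted(matching, key=lambda url: scores.get(url, 0), reverse=True)
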
A human-powered
directory, such as the Open Directory, depends on humans for its listings. You
submit a short description to the directory for your entire site, or editors
write one for sites they review. A search looks for matches only in the
descriptions submitted.
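The difference is easy to see in a sketch: a directory search scans only the short, human-written descriptions, never the pages themselves. The sample sites and descriptions below are invented for illustration.

    # A directory stores one short description per site, not page contents.
    directory = {
        "http://example-gardening.test": "Tips and guides for home gardeners.",
        "http://example-recipes.test": "A collection of traditional family recipes.",
    }

    def directory_search(query):
        """Match the query text against the submitted descriptions only."""
        q = query.lower()
        return [url for url, desc in directory.items() if q in desc.lower()]

    print(directory_search("recipes"))  # -> ["http://example-recipes.test"]
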
Changing your
web pages has no effect on your listing. Things that are useful for improving a
listing with a search engine have nothing to do with improving a listing in a
directory. The only exception is that a good site, with good content, might be
more likely to get reviewed for free than a poor site.
In the web's early days, a search engine typically presented either crawler-based results or human-powered listings. Today, it is extremely common for both types of results to be presented. Usually, a hybrid search engine will favor one type of listing over the other. For example, MSN Search is more likely to present human-powered listings from LookSmart. However, it also presents crawler-based results (as provided by Inktomi), especially for more obscure queries.
Search for anything using your favorite crawler-based search engine. Nearly instantly, the search engine will sort through the millions of pages it knows about and present you with the ones that match your topic. The matches will even be ranked, so that the most relevant ones come first.
Of course, the search engines don't always get it right. Non-relevant pages make it through, and sometimes it may take a little more digging to find what you are looking for. But, by and large, search engines do an amazing job.
So how do crawler-based search engines go about determining relevancy when confronted with hundreds of millions of web pages to sort through? They follow a set of rules, known as an algorithm. Exactly how a particular search engine's algorithm works is a closely kept trade secret. However, all major search engines follow the general rules below.
One of the main rules in a ranking algorithm involves the location and frequency of keywords on a web page. Pages with the search terms appearing in the HTML title tag are often assumed to be more relevant than others to the topic.
Search engines
will also check to see if the search keywords appear near the top of a web
page, such as in the headline or in the first few paragraphs of text. They
assume that any page relevant to the topic will mention those words right from
the beginning.
Frequency is
the other major factor in how search engines determine relevancy. A search
engine will analyze how often keywords appear in relation to other words in a
web page. Those with a higher frequency are often deemed more relevant than
other web pages.
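Put together, the location and frequency rules might be pictured as a scoring function like the one below. The particular weights, the "near the top" cutoff, and the function name are assumptions invented for this example, not the rules of any actual engine.

    import re

    def location_frequency_score(term, title, body):
        """Score one page for one search term using location and frequency."""
        term = term.lower()
        words = re.findall(r"[a-z0-9]+", body.lower())
        if not words:
            return 0.0
        score = 0.0
        if term in title.lower():
            score += 3.0                 # term appears in the HTML title tag
        if term in words[:50]:
            score += 2.0                 # term appears near the top of the page
        frequency = words.count(term) / len(words)
        score += 10.0 * frequency        # relative frequency of the term
        return score
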
Search engines
may also penalize pages or exclude them from the index, if they detect search
engine "spamming." An example is when a word is repeated hundreds of
times on a page, to increase the frequency and propel the page higher in the
listings. Search engines watch for common spamming methods in a variety of
ways, including following up on complaints from their users.
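One crude check for that kind of keyword stuffing is sketched below: if a single word accounts for an implausibly large share of a page's text, the page is flagged. The 20% threshold and minimum word count are arbitrary illustrative cutoffs, not values any engine is known to use.

    import re
    from collections import Counter

    def looks_like_keyword_stuffing(body, threshold=0.20, min_words=100):
        """Flag a page where one word dominates the text."""
        words = re.findall(r"[a-z0-9]+", body.lower())
        if len(words) < min_words:       # too little text to judge fairly
            return False
        top_count = Counter(words).most_common(1)[0][1]
        return top_count / len(words) > threshold
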
Crawler-based
search engines have plenty of experience now with webmasters who constantly
rewrite their web pages in an attempt to gain better rankings. Some
sophisticated webmasters may even go to great lengths to "reverse
engineer" the location/frequency systems used by a particular search
engine. Because of this, all major search engines now also make use of
"off the page" ranking criteria.
Off the page factors are those that a webmaster cannot easily influence. Chief among these
is link analysis. By analyzing how pages link to each other, a search engine
can both determine what a page is about and whether that page is deemed to be
"important" and thus deserving of a ranking boost. In addition,
sophisticated techniques are used to screen out attempts by webmasters to build
"artificial" links designed to boost their rankings.
Another off the
page factor is clickthrough measurement. In short, this means that a search
engine may watch what results someone selects for a particular search, then
eventually drop high-ranking pages that aren't attracting clicks, while
promoting lower-ranking pages that do pull in visitors. As with link analysis, systems are used to compensate for artificial clicks generated by eager webmasters.
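A rough sketch of that idea in code: blend each result's original rank with its observed clickthrough rate for the query, then reorder. The 0.5 blending weight and the data layout are invented purely for illustration.

    def rerank_by_clicks(ranked_urls, impressions, clicks, weight=0.5):
        """Blend original rank position with observed clickthrough rate.

        `impressions` and `clicks` map URL -> counts for this query.
        """
        def score(position_and_url):
            position, url = position_and_url
            rank_score = 1.0 / (position + 1)              # higher for top results
            ctr = clicks.get(url, 0) / max(impressions.get(url, 1), 1)
            return rank_score + weight * ctr
        reordered = sorted(enumerate(ranked_urls), key=score, reverse=True)
        return [url for _, url in reordered]
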
Google has developed an advanced search technology that involves a series of simultaneous calculations, typically occurring in under half a second, without human intervention. At the heart of this technology are PageRank™ and the hypertext-matching analysis developed by Larry Page and Sergey Brin. Google's search architecture is also scalable, which enables it to continue indexing the Internet as it expands.
PageRank performs an objective measurement of the importance of web
pages and is calculated by solving an equation of 500 million variables and
more than 3 billion terms. Google does not count links; instead PageRank uses
the vast link structure of the web as an organizational tool. In essence,
Google interprets a link from Page A to Page B as a "vote" by Page A
for Page B. Google assesses a page's importance by the votes it receives.
Google also analyzes the pages that cast the votes. Votes cast by pages that
are themselves "important" weigh more heavily and help to make other
pages important. Important, high-quality pages receive a higher PageRank and
are ordered or ranked higher in the results. Google's technology uses the
collective intelligence of the web to determine a page's importance. Google
does not use editors or its own employees to judge a page's importance.
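The published form of the PageRank idea can be sketched as an iterative calculation over the link graph, as below. The damping factor of 0.85 follows the convention of Page and Brin's original paper, the tiny example graph is invented, and the production system is of course enormously larger and more sophisticated than this sketch.

    def pagerank(link_graph, damping=0.85, iterations=50):
        """Iteratively compute PageRank scores for a small link graph.

        `link_graph` maps each page to the pages it links to. A link from
        page A to page B acts as a "vote" for B, weighted by A's own score
        and divided among all of A's outgoing links.
        """
        pages = set(link_graph) | {t for ts in link_graph.values() for t in ts}
        n = len(pages)
        rank = {page: 1.0 / n for page in pages}
        for _ in range(iterations):
            new_rank = {page: (1.0 - damping) / n for page in pages}
            for source, targets in link_graph.items():
                if not targets:
                    continue   # dangling pages are ignored in this simplified sketch
                share = rank[source] / len(targets)
                for target in targets:
                    new_rank[target] += damping * share
            rank = new_rank
        return rank

    # Page C is linked to by both A and B, so it ends up the most "important".
    print(pagerank({"A": ["C"], "B": ["C"], "C": ["A"]}))
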
Unlike
conventional search engines, Google is hypertext-based. It analyzes all the
content on each web page and factors in fonts, subdivisions, and the precise
positions of all terms on the page. Google also analyzes the content of
neighboring web pages. All of this data enables Google to return results that
are more relevant to user queries. As a result, millions of users worldwide look to Google as the fastest, easiest way to find exactly the information they're looking for on the web on the first try.