Google Special Report: Database Components
Data from search engine analysis run on March 4-6, 2002.
by Greg R.
Google's Multifaceted Database
Google's Web database has several facets of interest to searchers. This article compares the sizes Google reports for the components of its Web database with actual search results, to show what searchers can expect to find.
Google's Reported Numbers
Google includes some results (URLs) that it has not actually indexed. In addition, it includes other file types like PDFs, PostScript, and others. According to a company press release from Dec. 11, 2001 along with additional information from company representatives, the breakdown is something like this:
| Component | In millions | Percent |
| --- | --- | --- |
| Indexed Web Pages | 1,465 | 73.1% |
| Unindexed URLs | 500 | 25% |
| Other file types | 35 | 1.75% |
| Daily Reindexed Web Pages | 3 | 0.15% |
Counting all of the above, Google reports over 2 billion "Web documents." However, an analysis of the results from 25 very specific searches suggests that the effective size is considerably smaller, since most searchers will see very few of the unindexed URLs. What are these four categories?
- Indexed Web Pages: Regular search engine results -- Web pages whose words have been indexed.
- Unindexed URLs: URLs for Web pages or documents that Google's spider has not actually visited and that have not been indexed. See my page on Google's Unindexed URLs for more details and an example.
- Other File Types: Web-accessible documents that are not Web pages, such as Adobe Acrobat PDF, Microsoft Word, PostScript, Excel, PowerPoint, WordPerfect, and other files.
- Daily Reindexed Web Pages: Regular indexed Web pages like those in the first category, except that Google has noticed that they are frequently updated and therefore reindexes them every day or so. In Google's results, these pages display the date they were last refreshed after the URL and size.
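The reported breakdown can be rechecked with a little arithmetic. The following is a minimal sketch in Python, assuming that each percentage is that component's share of the sum of the four reported counts; Google's press release does not state this explicitly. The names `reported` and `shares` are made up for this illustration.

```python
# Counts in millions, as reported in the breakdown above.
reported = {
    "Indexed Web Pages": 1465,
    "Unindexed URLs": 500,
    "Other file types": 35,
    "Daily Reindexed Web Pages": 3,
}


def shares(counts):
    """Return each component's share of the total, in percent.

    Assumption: the percentages are taken against the sum of all
    four components, not against a separately published total.
    """
    total = sum(counts.values())
    return {name: 100 * n / total for name, n in counts.items()}


total = sum(reported.values())
print(f"Total: {total:,} million")
for name, pct in shares(reported).items():
    print(f"{name:<26} {pct:6.2f}%")
```

Under that assumption the four counts sum to 2,003 million, which agrees with the "over 2 billion" figure, and the computed shares round to the percentages Google gives.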
Analysis of 25 Google Searches
So what can we expect from our searches? The upper pie chart to the right represents a total of 8,371 results from 25 small, specific, one-word searches on Google. The largest slice includes both the Indexed Web Pages and the Daily Reindexed Web Pages, since the latter were not counted separately. The lower pie chart shows the subdivision of the Other File Types and the key identifies the file extension and the actual number of results for each type. PDFs are by far the most numerous.
See some of my other statistical analyses such as the relative size and change over time for more search engine database information.