Domain names from search engines

To get more host (domain) names for the skitter destination list, we queried a number of search engines with a list of 30,672 words. The list was obtained by parsing about 1,000,000 words of popular science articles from the 1997 issues of Chance News at Dartmouth College (a statistics table for this material was compiled at the same time).

The queries were first fed to Alta Vista and later to the other search engines. The search, however, was reset in the middle due to a computer failure, so the conditions under which the data were gathered differ significantly. (We tried two more engines, but were unable to trick InfoSeek (Goto.com) into returning data for our queries, and Northern Light, which is located on the East Coast, responded too slowly for our purposes.) We intend to make another run in the near future.

The columns in the table have the following meaning. The column named "Batches" gives the number of matches returned per query (and thus per HTTP connection). This number is 50 for Excite and 10 for all other engines.

Engine          Batches  Worked  Words  Names in seq.  Unique names  grep sex
Alta Vista           10   90.5h   1064         571817        156822       283
Google               10   39.5h   1180         311602         87962       130
Excite               50     17h    487         308277        159335      2826
HotBot (Lycos)       10   11.5h    173          81679         53164        74

The column titled "Names in seq." contains the number of names in the whole sequence retrieved from the search engine when immediately repeated names (the same name on the next line of output) are removed, but repetitions separated by other names are not. (We originally ran the Alta Vista script without taking this into account, and removed adjacent repetitions only at the processing stage.)
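
The removal of immediately repeated names can be sketched as follows. This is a minimal Python illustration, not the script used in the experiment; the function name and the sample host names are made up for the example.

```python
from itertools import groupby

def collapse_adjacent(names):
    """Drop a name only when it repeats the name on the previous line.

    Repetitions separated by other names are kept, as in the
    "names in sequence" count.
    """
    return [name for name, _ in groupby(names)]

# Hypothetical output lines from one engine.
sample = ["a.example", "a.example", "b.example",
          "a.example", "a.example", "c.example"]

in_sequence = collapse_adjacent(sample)
print(len(in_sequence))       # names in sequence
print(len(set(in_sequence)))  # unique names
```

The first count corresponds to the "names in sequence" column and the second to the "unique names" column.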

Note that Excite is a clear favorite in this run (see, however, the data from another experiment below). It produces 5.3 times as many unique names per hour as Alta Vista (Google is about 1.3 times as fast as Alta Vista by the same measure). Excite also produces about twice as many unique names per input word as Alta Vista, and distinct names make up more than 50% of all its output names. The last two numbers are somewhat better for HotBot, although this might be due to the limited size of the sample. Nonetheless, the output of unique names per hour for Excite is still about twice that of HotBot, despite the shorter runtime biasing the results in favor of the latter. Finally, in accordance with its name, about 1.75% of all domain names returned by Excite contain the combination "sex". The closest competitor, Alta Vista, has about 0.175% and the next one, Google, has about 0.15% "hits" of that type.
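
Most of the ratios quoted above can be recomputed from the table. The sketch below is only a convenience for checking the arithmetic; it assumes that the "sex" share is taken relative to the unique names (the reading that reproduces the quoted 1.75%), and it uses the runtimes as rounded in the table, so the results may differ slightly from the figures in the text.

```python
# engine: (hours, words, names in sequence, unique names, names with "sex")
runs = {
    "Alta Vista": (90.5, 1064, 571817, 156822, 283),
    "Google":     (39.5, 1180, 311602, 87962, 130),
    "Excite":     (17.0, 487, 308277, 159335, 2826),
    "HotBot":     (11.5, 173, 81679, 53164, 74),
}

def rates(hours, words, names, unique, sex):
    """Derived rates for one engine, all based on the unique-name count."""
    return {
        "per_hour": unique / hours,
        "per_word": unique / words,
        "distinct_share": unique / names,  # distinct names among all output names
        "sex_share": sex / unique,         # assumption: relative to unique names
    }

table = {engine: rates(*row) for engine, row in runs.items()}

for engine, r in table.items():
    print(f"{engine:10s} {r['per_hour']:8.0f} {r['per_word']:6.0f} "
          f"{r['distinct_share']:6.1%} {r['sex_share']:6.2%}")
```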

Over the next few days, we restarted the experiment from the 1200th word in our list and ran Excite, Alta Vista and HotBot for another 66.7 hours. (Google for some reason stopped answering our queries after a few hundred words.) Now the runtimes are the same (within a few minutes of each other) and we can safely exclude this factor from the comparison. All scripts now also remove adjacent repetitions, so this factor can be excluded as well. Filtering on the specific domain names which are aggressively promoted by an engine, however, is still different: we have to filter on 1 name (altavista) for Alta Vista, 3 names (hotbot, lycos, doubleclick) for HotBot, and 7 for Excite. Batches are the same as before, i.e. 10 links per query for HotBot and Alta Vista and 50 for Excite. Each search engine is queried for up to 1000 links per word, so the number of HTTP requests is at most 100 times the number of words for HotBot and Alta Vista, and at most 20 times the number of words for Excite. Our goal is to get as many distinct (unique) domain names as possible; for that reason we give a number of statistics for unique names computed on a "per-count" basis.

Engine        HTTPs   Words     Names   Unique   /HTTP   /hour   /word   /name
Alta Vista    73963    1040    673148   191153    2.58    2866     184   0.284
HotBot       103877    1279   1057083   365579    3.52    5481     286   0.346
Excite        57080    3007   2676888   682605   11.96   10234     227   0.255

This data differs in many respects from the data we gathered in the first experiment and is possibly more credible because the conditions were the same for every engine. In particular, it is important that the engines were started and their outputs sampled simultaneously, and that they ran within the same network environment. (The machine on which all of them ran, an Intel FreeBSD box, was the same in both experiments.)
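
The four "per-count" columns are simply the unique-name count divided by the number of HTTP requests, hours, words and names in sequence. A small Python check, using the 66.7-hour runtime quoted above (the variable and function names are ours, chosen for illustration):

```python
HOURS = 66.7  # runtime of the second experiment, as quoted in the text

# engine: (HTTP requests, words, names in sequence, unique names)
counts = {
    "Alta Vista": (73963, 1040, 673148, 191153),
    "HotBot":     (103877, 1279, 1057083, 365579),
    "Excite":     (57080, 3007, 2676888, 682605),
}

def per_count(https, words, names, unique, hours=HOURS):
    """Unique names per HTTP request, per hour, per word and per name."""
    return (unique / https, unique / hours, unique / words, unique / names)

for engine, row in counts.items():
    per_http, per_hour, per_word, per_name = per_count(*row)
    print(f"{engine:10s} {per_http:6.2f} {per_hour:6.0f} "
          f"{per_word:4.0f} {per_name:6.3f}")
```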

Note that although Excite is once again the favorite in terms of per-hour and per-HTTP output of unique names (and its "sex drive" is about 2.8%, whereas HotBot has about 0.3% and Alta Vista 0.2%; this, of course, may have nothing to do with existence or content), HotBot produces more unique names per word and apparently fewer repeated names per name. This, however, may be due to saturation, which arises when more data is taken from the engine: Excite processed about 2.5 times as many words as HotBot and produced about 2.5 times as many domain names. Yet another reason may be parsing errors. HotBot output is much harder to parse, as the domain names are buried in links to the engine itself, which also contain lots of hex-converted punctuation. This results in a relatively high amount of garbage (non-domain names) in the HotBot output.
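
To illustrate the kind of parsing involved, here is a minimal Python sketch that recovers a host name from a redirect link with hex-escaped punctuation. The link layout, the engine.example host and the "target" parameter are hypothetical; the actual format of HotBot links is not described here and our scripts may have worked quite differently.

```python
from urllib.parse import parse_qs, urlsplit

def target_host(link, param="target"):
    """Return the host name hidden in a redirect link, or None.

    Assumes (hypothetically) that the real URL is carried, percent-encoded,
    in the query parameter `param` of a link pointing back at the engine.
    """
    try:
        query = parse_qs(urlsplit(link).query)  # parse_qs undoes %XX escapes
        for value in query.get(param, []):
            host = urlsplit(value).hostname
            if host and "." in host:
                return host
    except ValueError:
        pass  # malformed URL: treat as garbage
    return None

link = ("http://engine.example/redirect"
        "?target=http%3A%2F%2Fwww.Example.org%2Fpage%3Fid%3D1&rank=3")
print(target_host(link))
```

Anything for which the function returns None would count as garbage rather than as a domain name.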

Excite's dominance on a per-HTTP basis may also be contested, since it can retrieve 5 times as much data per HTTP connection; when this is taken into account (11.96 divided by 5), its output goes down to 2.4 unique names per connection, which is close to Alta Vista's. Yet another unresolved issue is whether the domain names produced by those engines still exist and are reachable from local hosts. A preliminary computation shows that many of them are not. Nevertheless, the data that we have collected so far suggests that if one has to choose a search engine for domain name retrieval, Excite should be given a try.
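
The batch-size normalization can be written out as follows. This is only a sketch of the arithmetic in Python, using the batch sizes and the /HTTP column given above; the helper names are ours.

```python
import math

LINKS_PER_WORD = 1000  # links requested per word in the second experiment

batch = {"Alta Vista": 10, "HotBot": 10, "Excite": 50}
unique_per_http = {"Alta Vista": 2.58, "HotBot": 3.52, "Excite": 11.96}

def max_requests_per_word(batch_size, links=LINKS_PER_WORD):
    """Upper bound on HTTP requests needed to fetch `links` links for one word."""
    return math.ceil(links / batch_size)

def per_ten_links(engine):
    """Unique names per HTTP request, rescaled to a batch of 10 links."""
    return unique_per_http[engine] * 10 / batch[engine]

for engine in batch:
    print(engine, max_requests_per_word(batch[engine]),
          round(per_ten_links(engine), 2))
```

On this scale HotBot, rather than Excite, yields the most unique names per 10 links retrieved.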

The experiment described above ran from May 21, 2000, 23:55 until June 6, 2000, 17:10, which is 15 days and 17.25 hours, or 377.25 hours. In that time, Excite ran from word 1200 to word 20,000, Alta Vista from 1200 to 15,633 and HotBot from 1200 to 11,700. The total numbers of lines, words and bytes in the output files are:

 6441761  7631666 112893952 altavistaHosts1200.dat
14761908 15804213 275292160 exciteHosts1200.dat
  183734   238742   3399680 googleHosts1200.dat
 6985718  8861646 133726208 hotbotHosts1200.dat
28373121 32536267 525312000 total

In fact, the line count for the Alta Vista file should be reduced by the number of HTTP connections, since the file contains an extra newline per connection (this was overlooked in the script). The total number of connections made to each of the engines equals
594952 Alta Vista
343632 Excite
625312 HotBot
which reduces the total number of lines retrieved from Alta Vista to about 5.85 million, or roughly 40% of those from Excite. The number of different names in each file equals
 635184 altavistaHosts1200u.nms
1475475 exciteHosts1200u.nms
1112902 hotbotHosts1200u.nms
3223561 total
which, after filtering out some non-names, becomes
 627371 altavistaHosts1200u.nms
1468377 exciteHosts1200u.nms
1089234 hotbotHosts1200u.nms

When merged together, these files contain 2,945,415 different domain names. Note that this is almost as many as the sum of the three counts (3,184,982), so these three engines possess complementary knowledge of the Internet. Note also that, despite the much smaller size of its file, HotBot produces a much greater number of unique names per output line than Excite.
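
The figures in this last part can be cross-checked with a few lines of Python. The sketch below only restates the counts given above; it assumes that the extra-newline correction applies to the Alta Vista file alone, as described in the text.

```python
lines = {"Alta Vista": 6441761, "Excite": 14761908, "HotBot": 6985718}
connections = {"Alta Vista": 594952, "Excite": 343632, "HotBot": 625312}
unique_names = {"Alta Vista": 627371, "Excite": 1468377, "HotBot": 1089234}
MERGED = 2945415  # distinct names after merging the three filtered files

# Remove the extra newline written per connection (Alta Vista file only).
corrected_lines = dict(lines)
corrected_lines["Alta Vista"] -= connections["Alta Vista"]

# Unique names per output line for each engine.
names_per_line = {e: unique_names[e] / corrected_lines[e] for e in lines}

# How close the merged set is to the plain sum of the three counts.
total = sum(unique_names.values())
merged_share = MERGED / total

print(corrected_lines["Alta Vista"])
for engine, value in names_per_line.items():
    print(f"{engine:10s} {value:.3f}")
print(f"{merged_share:.1%} of the summed counts survive merging")
```

The closer the merged share is to 100%, the less the three engines overlap in the names they return.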