To get more host (domain) names for the skitter destination list, we queried a number of search engines, feeding them a list of 30,672 words obtained by parsing about 1,000,000 words from popular science articles contained in Chance News of 1997 at Dartmouth College (a statistics table for this material was compiled at the same time).
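How the word list was extracted is not described here, so the following is only a minimal sketch of one way to reduce running text to a list of distinct query words. The tokenization rule (lower-cased runs of ASCII letters) and the names `distinct_words` and `sample` are assumptions made for illustration.

```python
import re

def distinct_words(text):
    """Return the sorted distinct lower-cased alphabetic words found in text."""
    # Assumption: a "word" is a maximal run of ASCII letters.
    return sorted(set(re.findall(r"[a-z]+", text.lower())))

# Hypothetical sample text standing in for the article corpus.
sample = "Chance News reviews news items; chance plays a role in the news."
query_words = distinct_words(sample)
print(len(query_words), query_words)
```

Each entry of `query_words` would then be submitted to a search engine as one query.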
The queries were fed first to Alta Vista and later to the other search engines. The search, however, was reset midway because of a computer failure, so the conditions under which the data were gathered differ significantly from engine to engine. (We tried two more engines, but were unable to trick InfoSeek (Goto.com) into returning data for our queries. Northern Light, which is located on the East Coast, responded too slowly for our purposes.) We intend to make another run in the near future.
The columns in the table have the following meaning. The column named "Batches" gives the number of matches returned per query (and thus per HTTP connection). This number is 50 for Excite and 10 for all other engines.

Engine          Batches  Worked  Words  Names in seq.  Unique names  grep sex
Alta Vista           10   90.5h   1064         571817        156822       283
Google               10   39.5h   1180         311602         87962       130
Excite               50     17h    487         308277        159335      2826
HotBot (Lycos)       10   11.5h    173          81679         53164        74

The column titled "Names in seq." contains the number of names in the whole sequence retrieved from the search engine when immediately repeated names (the same name on the next line of output) are removed, but repetitions separated by other names are not. (We originally ran the Alta Vista script without taking this into account, and removed adjacent repetitions only at the processing stage.)
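The difference between "names in sequence" and "unique names" can be illustrated with a small sketch. It is not the script used in the experiment; the function name `collapse_adjacent` and the sample host names are hypothetical.

```python
from itertools import groupby

def collapse_adjacent(names):
    """Drop a name only when it repeats the name directly before it."""
    # groupby yields one group per run of equal consecutive items.
    return [name for name, _run in groupby(names)]

# Hypothetical output sequence from one engine.
sequence = ["a.com", "a.com", "b.org", "a.com", "c.net", "c.net"]

names_in_sequence = collapse_adjacent(sequence)  # adjacent repeats removed
unique_names = set(sequence)                     # all repeats removed

print(len(names_in_sequence), len(unique_names))
```

Here `a.com` is kept twice in `names_in_sequence` because its two runs are separated by `b.org`, while it counts only once among the unique names.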
Note that Excite is a clear favorite in this run (see, however, the data from another experiment below). It produces about 5.3 times as many unique names per hour as Alta Vista (Google is about 1.3 times as fast as Alta Vista in terms of unique domain names per hour). Excite also produces about twice as many unique names per input word as Alta Vista, and distinct names make up more than 50% of all its output names. HotBot's share of distinct names is higher still (about 65%) and its yield per word is close to Excite's, although this might be due to the limited size of the sample. Nonetheless, the output of unique names per hour for Excite is still about twice that for HotBot, despite the shorter runtime biasing the results in favor of the latter. Finally, in accordance with its name, about 1.75% of all unique domain names returned by Excite contain the combination "sex". The closest competitor, Alta Vista, has about 0.175%, and the next one, Google, has about 0.15% "hits" of that type.
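The comparisons above can be recomputed from the first table. The sketch below uses only the table values; the names `first_run` and `rates` are illustrative. Because the hours in the table are rounded, the Excite to Alta Vista per-hour ratio comes out near 5.4, close to the figure quoted above.

```python
# (hours worked, words, names in sequence, unique names, names containing "sex")
first_run = {
    "Alta Vista": (90.5, 1064, 571817, 156822, 283),
    "Google": (39.5, 1180, 311602, 87962, 130),
    "Excite": (17.0, 487, 308277, 159335, 2826),
    "HotBot": (11.5, 173, 81679, 53164, 74),
}

def rates(hours, words, in_seq, unique, sex):
    """Derive the per-hour, per-word and share figures for one engine."""
    return {
        "per_hour": unique / hours,
        "per_word": unique / words,
        "distinct_share": unique / in_seq,  # unique names / names in sequence
        "sex_share": sex / unique,          # share of unique names with "sex"
    }

stats = {engine: rates(*row) for engine, row in first_run.items()}
base = stats["Alta Vista"]
for engine, s in stats.items():
    print(f"{engine:10s} {s['per_hour'] / base['per_hour']:5.2f}x per hour, "
          f"{s['per_word'] / base['per_word']:5.2f}x per word, "
          f"{s['distinct_share']:6.1%} distinct, {s['sex_share']:6.2%} sex")
```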
In the next few days, we restarted the experiment from the 1200th word in our list and ran Excite, Alta Vista and HotBot for another 66.7 hours. (Google for some reason stopped answering our queries after a few hundred words.) Now the runtimes are the same (within a few minutes of each other) and we can safely exclude this factor from the comparison. All scripts now also remove adjacent repetitions, so this factor can also be excluded. Filtering on the specific domain names which are aggressively promoted by an engine, however, is still different: we have to filter on 1 name (altavista) for Alta Vista, 3 names (hotbot, lycos, doubleclick) for HotBot, and 7 for Excite. Batches are the same as before, i.e. 10 links per query for HotBot and Alta Vista and 50 for Excite. Each search engine is queried for 1000 links per word, so the number of HTTP requests is 100 times the number of words for HotBot and Alta Vista, and 20 times the number of words for Excite. Our goal is to get as many distinct (unique) domain names as possible; for that reason we give a number of statistics computed on a "per-count" basis.

Engine       HTTPs  Words    Names  Unique  /HTTP  /hour  /word  /name
Alta Vista   73963   1040   673148  191153   2.58   2866    184  0.284
HotBot      103877   1279  1057083  365579   3.52   5481    286  0.346
Excite       57080   3007  2676888  682605  11.96  10234    227  0.255

This data differs in many respects from the data we gathered in the first experiment and is possibly more credible because the conditions were the same for each and every engine. In particular, it is important that the engines were started and the outputs were sampled simultaneously, and that they ran within the same network environment. (The machine on which all of them ran, an Intel FreeBSD box, was the same in both experiments.)
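The four derived columns of the second table follow directly from the raw counts and the 66.7-hour runtime. A minimal sketch of that arithmetic, with illustrative names `second_run` and `derived`:

```python
HOURS = 66.7  # common runtime of the second experiment

# (HTTP requests, words, names in sequence, unique names)
second_run = {
    "Alta Vista": (73963, 1040, 673148, 191153),
    "HotBot": (103877, 1279, 1057083, 365579),
    "Excite": (57080, 3007, 2676888, 682605),
}

def derived(https, words, names, unique, hours=HOURS):
    """Unique names per HTTP request, per hour, per word and per name."""
    return (
        round(unique / https, 2),
        round(unique / hours),
        round(unique / words),
        round(unique / names, 3),
    )

for engine, row in second_run.items():
    print(engine, derived(*row))
```

Rounded this way, the results match the /HTTP, /hour, /word and /name columns of the table.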
Note that although Excite is once again the favorite in terms of per-hour and per-HTTP output of unique names (and its "sex drive" is about 2.8%, whereas HotBot has about 0.3% and Alta Vista 0.2%; this, of course, may have nothing to do with existence or content), HotBot produces more unique names per word and apparently fewer repeated names per name. This, however, may be due to saturation, which arises when more data is taken from the engine: Excite processed about 2.4 times as many words as HotBot and produced about 2.5 times as many domain names. Yet another reason may be parsing errors. HotBot output is much harder to parse, as the domain names are buried in links to the engine itself, which also contain lots of hex-encoded punctuation. This results in a relatively high amount of garbage (non-domain names) in the HotBot output.
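To illustrate the parsing problem, here is a minimal sketch of pulling a host name out of a redirect link whose target is hex-encoded (percent-encoded). The actual HotBot link format is not given here, so the example URL, the `target` parameter name and the function name `host_from_redirect` are all hypothetical.

```python
from urllib.parse import parse_qs, urlparse

def host_from_redirect(link, param="target"):
    """Return the host name of the URL carried in a redirect link, or None."""
    # parse_qs decodes %XX escapes in the query values.
    query = parse_qs(urlparse(link).query)
    targets = query.get(param)
    if not targets:
        return None  # no embedded target: treat the link as garbage
    return urlparse(targets[0]).hostname

# Hypothetical redirect link pointing back at the engine itself.
link = ("http://engine.example/director.asp"
        "?id=1&target=http%3A%2F%2Fwww.caida.org%2Ftools%2F")
print(host_from_redirect(link))
```

Returning `None` for links without a recognizable target is one way to keep non-domain garbage out of the name list.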
Excite's dominance on a per-HTTP basis may also be contested, since it retrieves 5 times as many links per HTTP connection; when this is taken into account (11.96 divided by 5), its output goes down to 2.4 unique names per connection, which is close to Alta Vista's. Yet another unresolved issue is whether the domain names produced by those engines are still in existence and reachable from local hosts. A preliminary computation shows that many of them are not. Nevertheless, the data that we have collected so far suggests that if one has to choose a search engine for domain name retrieval, Excite should be given a try.
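The batch-size normalization can be written out for all three engines. This sketch rescales each engine's unique names per HTTP request to a common batch of 10 links, using the counts from the second experiment; the names `runs` and `unique_per_ten_links` are illustrative.

```python
# (HTTP requests, links per request, unique names) from the second experiment
runs = {
    "Alta Vista": (73963, 10, 191153),
    "HotBot": (103877, 10, 365579),
    "Excite": (57080, 50, 682605),
}

def unique_per_ten_links(https, batch, unique):
    """Unique names per HTTP request, rescaled to a 10-link batch."""
    return unique / https * 10 / batch

normalized = {engine: round(unique_per_ten_links(*row), 2)
              for engine, row in runs.items()}
print(normalized)
```

On this scale Excite falls slightly below Alta Vista, and HotBot has the highest yield per requested link.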
The experiment described above ran from May 21, 2000, 23:55 until June 6, 2000, 17:10, which is 15 days and 17.25 hours, or 377.25 hours. In that time, Excite ran from word 1200 to word 20,000, Alta Vista from 1200 to 15,633 and HotBot from 1200 to 11,700. The total numbers of lines, words and bytes in the files are

 6441761   7631666  112893952  altavistaHosts1200.dat
14761908  15804213  275292160  exciteHosts1200.dat
  183734    238742    3399680  googleHosts1200.dat
 6985718   8861646  133726208  hotbotHosts1200.dat
28373121  32536267  525312000  total

In fact, the line count for the Alta Vista file should be reduced by the number of HTTP connections, since the file contains an extra newline per connection (this was overlooked in the script). The total number of connections made to each of the engines equals

594952  Alta Vista
343632  Excite
625312  HotBot

which reduces the total number of lines retrieved from Alta Vista to about 5.85 million, or roughly 40% of those from Excite. The number of different names in each file equals

 635184  altavistaHosts1200u.nms
1475475  exciteHosts1200u.nms
1112902  hotbotHosts1200u.nms
3223561  total

which after filtering out some nonnames becomes

 627371  altavistaHosts1200u.nms
1468377  exciteHosts1200u.nms
1089234  hotbotHosts1200u.nms

When merged together, these files contain 2,945,415 different domain names. Note that this is almost as many as the sum of their numbers, so these three engines possess complementary knowledge of the Internet. Note also that, despite the much smaller size of its file, HotBot produces a much greater number of unique names per output line than Excite.
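The overlap between engines and the yield per output line can be checked from the counts given above. The sketch below uses only those counts (with the Alta Vista line count reduced by its number of connections); the variable names are illustrative.

```python
# Unique names per engine after filtering, and the size of the merged set
filtered = {"Alta Vista": 627371, "Excite": 1468377, "HotBot": 1089234}
merged = 2945415

# Output lines per engine; Alta Vista has one extra newline per connection
lines = {
    "Alta Vista": 6441761 - 594952,
    "Excite": 14761908,
    "HotBot": 6985718,
}

total = sum(filtered.values())
# Surplus occurrences: a name found by k engines contributes k - 1 here.
surplus = total - merged
merged_share = merged / total

per_line = {engine: filtered[engine] / lines[engine] for engine in filtered}

print(total, surplus, f"{merged_share:.1%}")
for engine, value in per_line.items():
    print(f"{engine:10s} {value:.3f} unique names per output line")
```

The merged set retains more than 90% of the summed per-engine counts, which is the sense in which the three engines complement each other.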