KinoSearch Benchmarks

RESULTS A: Documents indexed, but full text not stored

Engine            Environment    Time             Memory
----------------  -------------  ---------------  -------
Lucene 1.9.1      JVM 1.4          40.40 secs     80 MB
Lucene 1.9.1      JVM 1.5          42.17 secs     94 MB
KinoSearch 0.09   Perl 5.8.8       58.01 secs     28 MB
KinoSearch 0.09   Perl 5.8.6       68.23 secs     29 MB
Plucene 1.24      Perl 5.8.8     1914.07 secs     skipped
Plucene 1.24      Perl 5.8.6     1951.02 secs     skipped

RESULTS B: Documents indexed, stored, and "vectorized"

Engine            Environment    Time             Memory
----------------  -------------  ---------------  -------
Lucene 1.9.1      JVM 1.4          57.71 secs     161 MB
Lucene 1.9.1      JVM 1.5          59.99 secs     193 MB
KinoSearch 0.09   Perl 5.8.8       62.91 secs     28 MB
KinoSearch 0.09   Perl 5.8.6       74.77 secs     29 MB
Plucene 1.24      Perl 5.8.8     1931.49 secs     skipped
Plucene 1.24      Perl 5.8.6     1963.82 secs     skipped

Test Corpus

19043 news articles drawn from the Reuters 21578 collection, Distribution 1.0.

Software

Hardware

Methodology

The Reuters 21578 collection comes in SGML format, but since the goal is to measure indexers and not SGML parsers, a supplemental Perl script first expands the articles out onto the file system, one file per article, spread out over 22 directories.
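The expansion step might look roughly like the sketch below. It is written in Python rather than Perl and is not the supplemental script itself. The function name `expand_sgml`, the `articleN.txt` file naming, and the choice to skip articles that have no BODY element are all assumptions made for illustration; the text above does not say how the 19043 articles were selected or how the 22 directories were chosen (one directory per source .sgm file would be a natural guess, since the distribution ships as 22 such files).

```python
import re
import tempfile
from pathlib import Path

# Each article in the collection is wrapped in <REUTERS ... NEWID="n"> ... </REUTERS>.
ARTICLE_RE = re.compile(
    r'<REUTERS\b[^>]*\bNEWID="(\d+)"[^>]*>(.*?)</REUTERS>', re.DOTALL
)
TITLE_RE = re.compile(r"<TITLE>(.*?)</TITLE>", re.DOTALL)
BODY_RE = re.compile(r"<BODY>(.*?)</BODY>", re.DOTALL)


def expand_sgml(sgml_text, out_dir):
    """Write one plain-text file per article into out_dir; return the count."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = 0
    for new_id, article in ARTICLE_RE.findall(sgml_text):
        body = BODY_RE.search(article)
        if body is None:
            continue  # assumption: articles without body text are skipped
        title = TITLE_RE.search(article)
        title_text = title.group(1).strip() if title else ""
        # Hypothetical naming scheme: articleN.txt, N being the NEWID attribute.
        (out_dir / f"article{new_id}.txt").write_text(
            f"{title_text}\n\n{body.group(1).strip()}\n", encoding="utf-8"
        )
        written += 1
    return written


# Tiny made-up sample in the same shape as the real files.
sample = (
    '<REUTERS TOPICS="YES" NEWID="1">\n'
    "<TEXT><TITLE>FIRST TITLE</TITLE>\n<BODY>First body.</BODY></TEXT>\n"
    "</REUTERS>\n"
    '<REUTERS TOPICS="NO" NEWID="2">\n'
    "<TEXT><TITLE>NO BODY HERE</TITLE></TEXT>\n"
    "</REUTERS>\n"
)

with tempfile.TemporaryDirectory() as tmp:
    count = expand_sgml(sample, Path(tmp) / "reut2-000")
    print(f"articles written: {count}")
```

Doing this once, ahead of time, keeps SGML parsing cost out of every timed run, so the indexers only ever see plain files.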

The indexing apps perform several identical indexing runs, each following immediately after another. OS caching can have a huge impact upon performance -- starting cold typically doubles the time it takes to complete the first rep -- so all of the tests are prepped by at least one dry run. Once the indexer launches, the machine is not touched until the process completes.

Each indexer is run under maximally favorable conditions -- within reason (no internal API hacks). For instance, JVM startup and warmup time slow down the first pass of the Lucene indexer, so all passes are run under a single process. In contrast, provided that the OS cache is suitably warm, the Perl apps consistently perform best on the first run during a given process (a quirk which might be due to the overhead of managing a fragmented memory pool on subsequent runs), so they spawn a child process for each rep.
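The child-process-per-rep strategy can be sketched as follows. This is a minimal Python illustration, not the benchmark scripts' own code: the helper name `time_reps_in_children` and the do-nothing demo command are hypothetical, and this version times the whole child including interpreter startup, which may not be where the real apps start and stop their clocks.

```python
import subprocess
import sys
import time


def time_reps_in_children(command, reps):
    """Run command once per rep, each in a fresh child process.

    Returns the wall-clock seconds for each rep. Because every rep gets a
    new process, no rep inherits a fragmented heap from the one before it.
    """
    timings = []
    for _ in range(reps):
        start = time.perf_counter()
        subprocess.run(command, check=True)
        timings.append(time.perf_counter() - start)
    return timings


# Placeholder workload: a child interpreter that exits immediately.
demo_command = [sys.executable, "-c", "pass"]
for rep, secs in enumerate(time_reps_in_children(demo_command, reps=3), start=1):
    print(f"{rep}    Secs: {secs:.2f}")
```

The single-process approach used for Lucene is the mirror image: one launch, with the rep loop inside the process so that later reps benefit from an already warm JVM.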

As they finish, the apps provide two means: an arithmetic mean and a truncated mean. The truncated mean, which discards outliers and is best known as a technique for normalizing scores at judged sporting events, is given greater prominence.
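To make the two means concrete, here is a small Python sketch. The figures in the raw data are consistent with dropping the single fastest and single slowest of six reps and averaging the remaining four; how the benchmark scripts pick the discard count for other rep counts is not stated here, so the `discard_each_end` parameter is an assumption.

```python
def truncated_mean(times, discard_each_end=1):
    """Mean of what is left after dropping the extremes at both ends."""
    if len(times) <= 2 * discard_each_end:
        raise ValueError("not enough values to truncate")
    ordered = sorted(times)
    kept = ordered[discard_each_end:len(ordered) - discard_each_end]
    return sum(kept) / len(kept)


# Six reps from the KinoSearch 0.09 / Perl 5.8.8 run without stored text.
reps = [57.62, 58.45, 57.68, 57.98, 60.22, 57.92]
arithmetic = sum(reps) / len(reps)
truncated = truncated_mean(reps)

print(f"Mean: {arithmetic:.2f} secs")
print(f"Truncated mean ({len(reps) - 2} kept, 2 discarded): {truncated:.2f} secs")
```

For this run the sketch reproduces the reported 58.31 and 58.01 secs. The 60.22 sec rep pulls the arithmetic mean up, while the truncated mean ignores it.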

Memory consumption is measured using a very crude method: eyeballing RPRVT in the output of top. This is done on a separate run, as top consumes a lot of CPU cycles and would skew the timing results. (There has to be a better way. If you know of one, please contact me.)
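One possible alternative, offered only as a sketch: on Unix-like systems the kernel records a peak resident set size for finished child processes, and a small wrapper can read it back without polling. The Python below uses the standard `resource` module; the helper name `peak_child_rss_bytes` is hypothetical. Note that peak RSS is not the same quantity as RPRVT (it includes shared pages), so numbers gathered this way would not be directly comparable with the ones in the tables above.

```python
import resource
import subprocess
import sys


def peak_child_rss_bytes(command):
    """Run command to completion and return the peak RSS reported for children.

    RUSAGE_CHILDREN reports the largest value over all children this process
    has waited for, so use a fresh wrapper process for each measurement.
    """
    subprocess.run(command, check=True)
    peak = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux but in bytes on macOS.
    return peak if sys.platform == "darwin" else peak * 1024


# Placeholder workload: a child interpreter that exits immediately.
peak = peak_child_rss_bytes([sys.executable, "-c", "pass"])
print(f"peak RSS: {peak / (1024 * 1024):.1f} MB")
```

Because the measurement happens after the child exits, it adds no CPU load during the run, which is the problem with watching top.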

Notes

Raw Data

slothbear:~/Desktop/ks/t/benchmarks marvin$ java -server -Xmx500M \
> -XX:CompileThreshold=100 LuceneIndexer -reps 6 -store 1
---------------------------------------------------
1   Secs: 59.11  Docs: 19043
2   Secs: 57.24  Docs: 19043
3   Secs: 57.89  Docs: 19043
4   Secs: 56.61  Docs: 19043
5   Secs: 56.48  Docs: 19043
6   Secs: 60.38  Docs: 19043
---------------------------------------------------
Lucene 1.9.1
JVM 1.4.2_09 (Apple Computer, Inc.)
Mac OS X 10.4.6 ppc
Mean: 57.95 secs
Truncated mean (4 kept, 2 discarded): 57.71 secs
---------------------------------------------------


slothbear:~/Desktop/ks/t/benchmarks marvin$ java -server -Xmx500M \
> -XX:CompileThreshold=100 LuceneIndexer -reps 6         
---------------------------------------------------
1   Secs: 41.76  Docs: 19043
2   Secs: 40.12  Docs: 19043
3   Secs: 40.24  Docs: 19043
4   Secs: 40.99  Docs: 19043
5   Secs: 40.12  Docs: 19043
6   Secs: 40.25  Docs: 19043
---------------------------------------------------
Lucene 1.9.1
JVM 1.4.2_09 (Apple Computer, Inc.)
Mac OS X 10.4.6 ppc
Mean: 40.58 secs
Truncated mean (4 kept, 2 discarded): 40.40 secs
---------------------------------------------------


slothbear:~/Desktop/ks/t/benchmarks marvin$ java15 -server -Xmx500M \
> -XX:CompileThreshold=100 LuceneIndexer -reps 6 -store 1
---------------------------------------------------
1   Secs: 61.35  Docs: 19043
2   Secs: 59.10  Docs: 19043
3   Secs: 60.04  Docs: 19043
4   Secs: 59.45  Docs: 19043
5   Secs: 62.18  Docs: 19043
6   Secs: 58.34  Docs: 19043
---------------------------------------------------
Lucene 1.9.1
JVM 1.5.0_02 (Apple Computer, Inc.)
Mac OS X 10.4.6 ppc
Mean: 60.08 secs
Truncated mean (4 kept, 2 discarded): 59.99 secs
---------------------------------------------------


slothbear:~/Desktop/ks/t/benchmarks marvin$ java15 -server -Xmx500M \
> -XX:CompileThreshold=100 LuceneIndexer -reps 6         
---------------------------------------------------
1   Secs: 42.07  Docs: 19043
2   Secs: 41.89  Docs: 19043
3   Secs: 42.04  Docs: 19043
4   Secs: 41.77  Docs: 19043
5   Secs: 42.68  Docs: 19043
6   Secs: 43.19  Docs: 19043
---------------------------------------------------
Lucene 1.9.1
JVM 1.5.0_02 (Apple Computer, Inc.)
Mac OS X 10.4.6 ppc
Mean: 42.27 secs
Truncated mean (4 kept, 2 discarded): 42.17 secs
---------------------------------------------------


slothbear:~/Desktop/ks/t/benchmarks marvin$ perl -Mblib \
> indexers/kinosearch_indexer.plx --reps=6
------------------------------------------------------------
1    Secs: 68.45  Docs: 19043
2    Secs: 67.78  Docs: 19043
3    Secs: 68.01  Docs: 19043
4    Secs: 72.09  Docs: 19043
5    Secs: 68.15  Docs: 19043
6    Secs: 68.32  Docs: 19043
------------------------------------------------------------
KinoSearch 0.09 
Perl 5.8.6
Thread support: yes
Darwin 8.6.0 Power Macintosh
Mean: 68.80 secs 
Truncated mean (4 kept, 2 discarded): 68.23 secs
------------------------------------------------------------


slothbear:~/Desktop/ks/t/benchmarks marvin$ perl -Mblib \
> indexers/kinosearch_indexer.plx --reps=6 --store=1
------------------------------------------------------------
1    Secs: 75.19  Docs: 19043
2    Secs: 75.60  Docs: 19043
3    Secs: 73.07  Docs: 19043
4    Secs: 73.35  Docs: 19043
5    Secs: 74.94  Docs: 19043
6    Secs: 75.83  Docs: 19043
------------------------------------------------------------
KinoSearch 0.09 
Perl 5.8.6
Thread support: yes
Darwin 8.6.0 Power Macintosh
Mean: 74.66 secs 
Truncated mean (4 kept, 2 discarded): 74.77 secs
------------------------------------------------------------


slothbear:~/Desktop/ks/t/benchmarks marvin$ perl588 -Mblib \
> indexers/kinosearch_indexer.plx --reps=6
------------------------------------------------------------
1    Secs: 57.62  Docs: 19043
2    Secs: 58.45  Docs: 19043
3    Secs: 57.68  Docs: 19043
4    Secs: 57.98  Docs: 19043
5    Secs: 60.22  Docs: 19043
6    Secs: 57.92  Docs: 19043
------------------------------------------------------------
KinoSearch 0.09 
Perl 5.8.8
Thread support: no
Darwin 8.6.0 Power Macintosh
Mean: 58.31 secs 
Truncated mean (4 kept, 2 discarded): 58.01 secs
------------------------------------------------------------


slothbear:~/Desktop/ks/t/benchmarks marvin$ perl588 -Mblib \
> indexers/kinosearch_indexer.plx --reps=6 --store=1
------------------------------------------------------------
1    Secs: 63.03  Docs: 19043
2    Secs: 62.92  Docs: 19043
3    Secs: 62.68  Docs: 19043
4    Secs: 65.53  Docs: 19043
5    Secs: 62.95  Docs: 19043
6    Secs: 62.74  Docs: 19043
------------------------------------------------------------
KinoSearch 0.09 
Perl 5.8.8
Thread support: no
Darwin 8.6.0 Power Macintosh
Mean: 63.31 secs 
Truncated mean (4 kept, 2 discarded): 62.91 secs
------------------------------------------------------------


slothbear:~/Desktop/ks/t/benchmarks marvin$ perl -Mblib \
> indexers/plucene_indexer.plx --reps=6
------------------------------------------------------------
1    Secs: 1972.23  Docs: 19043
2    Secs: 1952.69  Docs: 19043
3    Secs: 1948.12  Docs: 19043
4    Secs: 1949.07  Docs: 19043
5    Secs: 1945.95  Docs: 19043
6    Secs: 1954.19  Docs: 19043
------------------------------------------------------------
Plucene 1.24 
Perl 5.8.6
Thread support: yes
Darwin 8.6.0 Power Macintosh
Mean: 1953.71 secs 
Truncated mean (4 kept, 2 discarded): 1951.02 secs
------------------------------------------------------------


slothbear:~/Desktop/ks/t/benchmarks marvin$ perl -Mblib \
> indexers/plucene_indexer.plx --reps=6 --store=1
------------------------------------------------------------
1    Secs: 1958.82  Docs: 19043
2    Secs: 1959.02  Docs: 19043
3    Secs: 1961.08  Docs: 19043
4    Secs: 1965.85  Docs: 19043
5    Secs: 1970.43  Docs: 19043
6    Secs: 1969.34  Docs: 19043
------------------------------------------------------------
Plucene 1.24 
Perl 5.8.6
Thread support: yes
Darwin 8.6.0 Power Macintosh
Mean: 1964.09 secs 
Truncated mean (4 kept, 2 discarded): 1963.82 secs
------------------------------------------------------------


slothbear:~/Desktop/ks/t/benchmarks marvin$ perl588 \
> indexers/plucene_indexer.plx --reps=6
------------------------------------------------------------
1    Secs: 1928.86  Docs: 19043
2    Secs: 1919.43  Docs: 19043
3    Secs: 1914.60  Docs: 19043
4    Secs: 1911.27  Docs: 19043
5    Secs: 1910.99  Docs: 19043
6    Secs: 1906.60  Docs: 19043
------------------------------------------------------------
Plucene 1.24 
Perl 5.8.8
Thread support: no
Darwin 8.6.0 Power Macintosh
Mean: 1915.29 secs 
Truncated mean (4 kept, 2 discarded): 1914.07 secs
------------------------------------------------------------


slothbear:~/Desktop/ks/t/benchmarks marvin$ perl588 \
> indexers/plucene_indexer.plx --reps=6 --store=1
------------------------------------------------------------
1    Secs: 2155.41  Docs: 19043
2    Secs: 1925.17  Docs: 19043
3    Secs: 1936.31  Docs: 19043
4    Secs: 1932.67  Docs: 19043
5    Secs: 1924.51  Docs: 19043
6    Secs: 1931.80  Docs: 19043
------------------------------------------------------------
Plucene 1.24 
Perl 5.8.8
Thread support: no
Darwin 8.6.0 Power Macintosh
Mean: 1967.65 secs 
Truncated mean (4 kept, 2 discarded): 1931.49 secs
------------------------------------------------------------

Copyright © 2004-2008 Marvin Humphrey