Big datasets for full-text search benchmarking

A few times recently I’ve looked for large datasets to experiment with or benchmark against, and I have usually come up blank.

Yesterday I spent longer than usual on the problem and found a few, which I’ll share here.

Project Gutenberg

http://www.gutenberg.org/

This project hosts the content of over 33,000 books. You can download the data as one book per file, and there are full instructions for downloading/mirroring here. It seems that they’ve blocked AWS IP ranges from mirroring content from their site, which is a shame.

The Westbury Lab USENET Corpus

http://www.psych.ualberta.ca/~westburylab/downloads/usenetcorpus.download.html

The contents of some USENET groups have been anonymised and cleaned up to form a set containing >28M documents and >25B words. Each week’s data is stored as a single text file and each post is simply delimited, which makes parsing a breeze. You can easily load this data into a MySQL database with a command similar to:

LOAD DATA LOCAL INFILE '/path/to/file.txt' INTO TABLE myTable LINES TERMINATED BY '---END.OF.DOCUMENT---';
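If you’d rather feed the posts into something other than MySQL, the same delimiter can be split on directly. Here is a minimal Python sketch of that idea. The function name iter_documents and the chunk size are my own choices for illustration, and the only thing taken from the corpus is the delimiter string from the command above. It reads the file in chunks so that a whole week’s data never has to sit in memory at once.

```python
import io
from typing import Iterator, TextIO

# Delimiter taken from the LOAD DATA command above.
DELIMITER = "---END.OF.DOCUMENT---"


def iter_documents(stream: TextIO,
                   delimiter: str = DELIMITER,
                   chunk_size: int = 1 << 20) -> Iterator[str]:
    """Yield one post at a time from a delimited text stream.

    iter_documents is a hypothetical helper, not part of any corpus tooling.
    """
    buffer = ""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buffer += chunk
        parts = buffer.split(delimiter)
        # The last part may be an unfinished post (or half a delimiter),
        # so keep it in the buffer until more data arrives.
        buffer = parts.pop()
        for part in parts:
            doc = part.strip()
            if doc:
                yield doc
    # Whatever is left after the final delimiter is the last post, if any.
    tail = buffer.strip()
    if tail:
        yield tail


# Tiny in-memory stand-in for one weekly file (made-up content).
sample = io.StringIO(
    "first post\n---END.OF.DOCUMENT---\nsecond post\n---END.OF.DOCUMENT---\n"
)
docs = list(iter_documents(sample, chunk_size=8))
sizes = [len(d.encode("utf-8")) for d in docs]
print(len(docs), "documents, mean size", sum(sizes) / len(sizes), "bytes")
```

For a real file you would pass an open file handle instead of the StringIO object. I haven’t checked the character encoding of the corpus, so opening it with something like errors="replace" is probably a sensible precaution.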

It’s also available as part of Amazon’s AWS Public Datasets offering as EBS snapshot snap-c1d156aa in the US West AWS region. Using AWS is a really quick way of getting hold of this dataset without the need to wait for any downloading to complete.

I found this a really nice dataset. With each document at around 5kB, it seemed a sensible size for benchmarking email body text, blog/publishing posts, etc.

Wikipedia Text

http://dumps.wikimedia.org/

Wikipedia provide huge database dumps. It seems that there’s an AWS Snapshot snap-8041f2e9 which contains this data too but it’s a couple of years old. There’s also a “WEX” extract (snap-1781757e on AWS) created by the Freebase team which is provided as XML markup ready to be easily imported into a db table, one row per article.
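If you go for the raw dumps rather than WEX, the article dumps are far too big to parse in one go, so a streaming parser is the way to do it. The sketch below is only an illustration, and it makes a few assumptions. It assumes the usual MediaWiki export layout of page elements containing a title and a revision/text element. It ignores the XML namespace because the schema version varies between dumps. The function names are my own.

```python
import io
import xml.etree.ElementTree as ET
from typing import BinaryIO, Iterator, Tuple


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_articles(stream: BinaryIO) -> Iterator[Tuple[str, str]]:
    """Yield (title, wikitext) pairs from a MediaWiki XML export stream.

    Hypothetical helper; assumes <page><title/><revision><text/></revision></page>.
    """
    title, text = None, None
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text or ""
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield (title or "", text or "")
            title, text = None, None
            # Drop the page's children so memory use stays roughly flat.
            elem.clear()


# Made-up two-page sample standing in for a real dump file.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Example</title>
    <revision><text>Some '''wiki''' markup.</text></revision>
  </page>
  <page>
    <title>Empty</title>
    <revision><text /></revision>
  </page>
</mediawiki>"""

articles = list(iter_articles(io.BytesIO(SAMPLE)))
for article_title, body in articles:
    print(article_title, len(body))
```

The dumps are distributed compressed, so for a real file you would wrap it with something like bz2.open(path, "rb") before passing it in. The text you get back is still wiki markup, so you may want to strip that before indexing.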

In doing this research I came across a couple of very interesting projects that extract/compile metadata from various sources including Wikipedia. They’re Freebase and DBpedia. I hope to play with some of their datasets and write a post on that in the future.