Big datasets for full-text search benchmarking
April 21, 2011
A few times recently I’ve looked for large datasets to experiment with and benchmark against, and I’ve usually come up blank.
Yesterday I spent longer than usual on the problem and found a few, which I’ll share here.
Project Gutenberg
This project hosts the content of over 33,000 books. You can download the data as one book per file, and the project provides full instructions for downloading and mirroring. It seems that they’ve blocked AWS IP ranges from mirroring content from their site, which is a shame.
The Westbury Lab USENET Corpus
http://www.psych.ualberta.ca/~westburylab/downloads/usenetcorpus.download.html
The contents of some USENET groups have been anonymised and cleaned up to form a set containing >28M documents and >25B words. Each week’s data is stored as a single text file, and each post ends with a simple delimiter, which makes parsing a breeze. You can easily load this data into a MySQL database with a command similar to:
LOAD DATA LOCAL INFILE '/path/to/file.txt' INTO TABLE myTable LINES TERMINATED BY '---END.OF.DOCUMENT---';
The corpus is also available as part of Amazon’s AWS Public Datasets offering, as EBS snapshot snap-c1d156aa in the US West AWS region. Using AWS is a really quick way of getting hold of this dataset without waiting for a download to complete.
I found this a really nice dataset. With each document at around 5kB, it seemed a sensible size for benchmarking email body text, blog/publishing posts, etc.
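If you’d rather feed the posts to something other than MySQL, the same delimiter makes it easy to split a weekly file yourself. Here’s a minimal Python sketch of that idea. The function name, the chunk size and the choice to strip surrounding whitespace are my own assumptions, not part of the corpus documentation:

```python
import io
from typing import Iterator, TextIO

# Delimiter taken from the LOAD DATA command above.
DELIMITER = "---END.OF.DOCUMENT---"


def iter_documents(stream: TextIO,
                   delimiter: str = DELIMITER,
                   chunk_size: int = 1 << 20) -> Iterator[str]:
    """Yield one post at a time without reading the whole file into memory."""
    buffer = ""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buffer += chunk
        # Every piece before the last delimiter is a complete post; the final
        # piece may be cut off mid-post (or mid-delimiter), so keep it back.
        *complete, buffer = buffer.split(delimiter)
        for doc in complete:
            doc = doc.strip()
            if doc:
                yield doc
    # Anything left after the last delimiter (usually just whitespace).
    tail = buffer.strip()
    if tail:
        yield tail


# Tiny in-memory example standing in for a real weekly file.
sample = io.StringIO(
    "first post\n---END.OF.DOCUMENT---\nsecond post\n---END.OF.DOCUMENT---\n"
)
print(list(iter_documents(sample, chunk_size=16)))  # ['first post', 'second post']
```

For a real file you would pass in something like open(path, encoding="utf-8", errors="replace"). I haven’t checked the encoding of the corpus files, so treat that as a guess.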
Wikipedia Text
http://dumps.wikimedia.org/
Wikipedia provides huge database dumps. It seems that there’s an AWS snapshot (snap-8041f2e9) which contains this data too, but it’s a couple of years old. There’s also a “WEX” extract (snap-1781757e on AWS), created by the Freebase team, which is provided as XML markup ready to be easily imported into a db table, one row per article.
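If you work from the raw dumps rather than WEX, the article dumps are MediaWiki XML exports, so you can stream them one page at a time instead of loading the whole file. The sketch below shows one way to do that in Python. It assumes the usual page/title/text element layout of the export format, ignores everything else (redirects, multiple revisions, namespaces), and the helper names are made up for illustration:

```python
import io
import xml.etree.ElementTree as ET
from typing import BinaryIO, Iterator, Tuple


def local_name(tag: str) -> str:
    """Drop the '{namespace}' prefix that ElementTree puts in front of tags."""
    return tag.rsplit("}", 1)[-1]


def iter_articles(stream: BinaryIO) -> Iterator[Tuple[str, str]]:
    """Yield (title, wikitext) pairs from a MediaWiki XML export stream."""
    title, text = None, ""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text or ""
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            if title is not None:
                yield title, text
            title, text = None, ""
            elem.clear()  # drop the finished page's children to save memory


# Tiny hand-written sample standing in for a real dump file.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Example</title>
    <revision><text>Some ''wiki'' text.</text></revision>
  </page>
</mediawiki>"""

for title, text in iter_articles(io.BytesIO(SAMPLE)):
    print(title, len(text))
```

The dumps are normally bz2-compressed, so for a real file you could pass bz2.open(path, "rb") from the standard library as the stream. Each (title, text) pair then maps naturally onto one row per article, much like the WEX layout.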
While doing this research I came across a couple of very interesting projects that extract and compile metadata from various sources, including Wikipedia: Freebase and DBpedia. I hope to play with some of their datasets and write a post about them in the future.