dot-notdot.com

Pre-Query Web Document Clustering

My final-year project at Lancaster University. As a dissertation-level project, it had tougher requirements and demanded much more of me than my previous two projects. Its goal was bigger than anything I'd attempted before: build a search engine. Not just any old search engine, however: this one needed the ability to cluster data together, similar to how Clusty and Carrot^2 cluster pages on the World Wide Web.
Well, learning from my mistakes the year before, I started by identifying my data source up front. The answer (rather poetically for a university student) came from Wikipedia, which offers its entire database of content for download. It was almost too good to be true. After a bit of data processing to simplify the data and then some basic indexing, I had everything I needed - the search engine "Lime" was born.
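To give a flavour of what "basic indexing" can mean, here is a minimal sketch of an inverted index in Python. It is an illustration only, not Lime's actual code: the function names, the tokenizer, and the sample documents are all assumptions made up for this example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each token to the set of document titles that contain it."""
    index = defaultdict(set)
    for title, body in docs.items():
        for token in tokenize(body):
            index[token].add(title)
    return index


def search(index, query):
    """Return the titles that contain every query token (a simple AND search)."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    # Copy the first posting set so the index itself is never mutated.
    results = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        results &= index.get(token, set())
    return results


# Hypothetical sample documents, not taken from the real Wikipedia dump.
docs = {
    "Lime (fruit)": "A lime is a green citrus fruit.",
    "Lemon": "A lemon is a yellow citrus fruit.",
    "Lancaster": "Lancaster is a city in England.",
}
index = build_index(docs)
```

Calling `search(index, "citrus fruit")` looks up each word's set of titles and intersects them, so only documents containing both words come back. A real index over the full Wikipedia dump would likely need on-disk storage and ranking, which this sketch leaves out.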

Project Proposal
Final Dissertation