CPS842 (F2022) - Wikipedia Search Engine

Brief intro about the project

This is the final project for CPS842 - Information Retrieval and Web Search, made by David Phan, Sania Syed, Kirill Shmakov, and Kim Rikter-Svendsen.

The goal of this project is to apply what we learned in the course by building a meaningful search engine. We chose a subset of 1000 Wikipedia articles and used two models to build our search engine.

Here's how it works:
  • Okapi BM25 model - Scores how relevant each Wikipedia article is to the search terms by looking at the article's contents. Articles are ranked by this score, and a higher score indicates a more relevant article.
  • Naive Bayes (NB) model - Helps the engine infer which topics are most relevant to the search terms. We trained the model on a subset of articles with their topic labels and tested it on the remaining data.
  • The final result is a search engine that infers the topics the user is interested in from the search terms and returns only the most relevant articles within those topics.
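To make the flow above concrete, here is a minimal, self-contained sketch of the idea. It is not the project's actual code. Everything in it is an assumption made for illustration: the names `tokenize`, `BM25`, `NaiveBayes`, `search`, `DOCS`, and `LABELS`, the whitespace tokenizer, the common defaults `k1=1.5` and `b=0.75`, add-one smoothing, and the four toy documents. To stay short, it trains the classifier on all of the toy documents and picks a single topic per query.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Assumption: lowercase + whitespace split stands in for real preprocessing.
    return text.lower().split()


class BM25:
    """Okapi BM25 scorer over a small in-memory collection."""

    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.tfs = [Counter(tokenize(d)) for d in docs]       # term frequencies per doc
        self.lengths = [sum(tf.values()) for tf in self.tfs]  # doc lengths in tokens
        self.avgdl = sum(self.lengths) / len(docs)            # average doc length
        self.df = Counter(t for tf in self.tfs for t in tf)   # document frequencies
        self.n = len(docs)

    def idf(self, term):
        n_t = self.df.get(term, 0)
        # The "+ 1" inside the log keeps the IDF non-negative.
        return math.log((self.n - n_t + 0.5) / (n_t + 0.5) + 1)

    def score(self, query, i):
        tf = self.tfs[i]
        norm = 1 - self.b + self.b * self.lengths[i] / self.avgdl
        return sum(
            self.idf(t) * tf[t] * (self.k1 + 1) / (tf[t] + self.k1 * norm)
            for t in tokenize(query)
        )


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(tokenize(doc))
        self.vocab = {t for c in self.word_counts.values() for t in c}

    def predict(self, text):
        total_docs = sum(self.class_counts.values())

        def log_prob(label):
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            prior = math.log(self.class_counts[label] / total_docs)
            return prior + sum(
                math.log((counts[t] + 1) / denom) for t in tokenize(text)
            )

        return max(self.class_counts, key=log_prob)


def search(query, docs, labels, top_k=3):
    # Step 1: infer the topic of the query with Naive Bayes.
    topic = NaiveBayes(docs, labels).predict(query)
    # Step 2: rank only the articles in that topic by their BM25 score.
    bm25 = BM25(docs)
    candidates = [i for i, label in enumerate(labels) if label == topic]
    ranked = sorted(candidates, key=lambda i: bm25.score(query, i), reverse=True)
    return topic, [docs[i] for i in ranked[:top_k]]


# Toy data, made up for this example.
DOCS = [
    "python is a programming language used for software",
    "java is a programming language that runs on a virtual machine",
    "the lion is a large cat that lives in africa",
    "the python is a large snake that lives in asia and africa",
]
LABELS = ["computing", "computing", "animals", "animals"]

if __name__ == "__main__":
    print(search("programming language", DOCS, LABELS))
    print(search("large snake", DOCS, LABELS))
```

Filtering by topic before ranking is what keeps an article about the snake out of the results for a programming query, even though both kinds of article contain the word "python".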

Link to repo
But enough theory, have a try and see how it goes :)