From c08650229939d22baa5e6aa507837958071e80a3 Mon Sep 17 00:00:00 2001
From: Sherlock <130759470+actuallysherlock@users.noreply.github.com>
Date: Sat, 3 May 2025 16:51:20 +0500
Subject: [PATCH] =?UTF-8?q?=F0=9F=A7=A9=20Fix:=20README?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Made the following changes:

- Improved README structure by making the list items consistent.
- Renumbered the duplicated "4. Search Setup" step and restored the Flask
  server setup instructions as step 5.

---
 README.md | 80 ++++++++++++++++++++++++++++++++-----------------------
 1 file changed, 46 insertions(+), 34 deletions(-)

diff --git a/README.md b/README.md
index 050eeeb..b6def0c 100644
--- a/README.md
+++ b/README.md
@@ -8,76 +8,88 @@ Sloth Search is a project that aims to recreate Google, including crawling, inde
 The project is divided into the following folders:
 
 - **Client**: Contains the front-end code, providing a user interface similar to Google search, where users can enter queries and view search results.
+
 - **Search**: Contains the core components of Sloth Search, which replicate the three main parts of Google:
+
   - **Crawling**: The web crawler that collects information from the web.
+
   - **Indexing**: Processing and storing the content collected by the crawler for efficient searching.
+
   - **Serving (PageRank)**: Serving search results based on their relevance and the PageRank algorithm.
+
 - **Server**: Contains the search API used to handle client requests and provide search results.
 
 ## Installation and Setup
 
-1. **Clone the Repository**
+**1. Clone the Repository**
 
-   ```sh
-   git clone https://github.com/The-CodingSloth/sloth-search.git
-   cd sloth-search
-   ```
+```sh
+git clone https://github.com/The-CodingSloth/sloth-search.git
+cd sloth-search
+```
 
-2. ## Install the necessary Python dependencies, run:
+**2. Install the necessary Python dependencies**
 
 ```sh
 pip install -r requirements.txt
 ```
 
-3. **Client Setup**
+**3. Client Setup**
 
-   - The client contains the HTML, CSS, and JavaScript code to run the front-end.
-   - Open the `index.html` file in your browser, or use a static file server to serve the client code locally.
-   - You can also use the live server extension.
+- The client contains the HTML, CSS, and JavaScript code to run the front-end.
 
-4. **Search Setup**
+- Open the `index.html` file in your browser, or use a static file server to serve the client code locally.
+
+- You can also use the Live Server extension.
+
+**4. Search Setup**
+
+- The `search` directory contains the code for crawling, indexing, and serving.
 
-- The `Search` directory contains the code for crawling, indexing, and serving.
 - You can start the process by running:
-  ```sh
-  python search/complete_examples/advanced_pagerank.py
-  ```
+
+```sh
+python search/complete_examples/advanced_pagerank.py
+```
+
 - This will crawl, index, and prepare the content for searching.
-- If you want to run any other files do the same process:
+
+- If you want to run any other files, do the same process:
 
 ```sh
 python search/
 ```
 
-4. **Search Setup**
-   - The server uses Flask to provide an API for search queries.
-   - Start the Flask server by navigating to the `Server` directory and running:
-   ```sh
-   python google_search_api.py
-   ```
-
+**5. Server Setup**
+
+- The server uses Flask to provide an API for search queries.
+
+- Start the Flask server by navigating to the `Server` directory and running:
+
+```sh
+python google_search_api.py
+```
+
 ## How It Works
 
-1. **Crawling**
+**1. Crawling**
 
-   - The crawler starts with a set of seed URLs and collects links and content from the web.
-   - It respects `robots.txt` to avoid being blocked and to ensure ethical crawling.
-   - Parsed data is stored in a format ready for indexing.
+- The crawler starts with a set of seed URLs and collects links and content from the web.
 
-2. **Indexing**
+- It respects `robots.txt` to avoid being blocked and to ensure ethical crawling.
 
-   - The indexing module processes the crawled pages.
-   - The content is tokenized, cleaned, stemmed, and stop words are removed using the NLTK library.
-   - The resulting indexed data is saved to be used by the search API.
+- Parsed data is stored in a format ready for indexing.
 
-3. **Serving and PageRank**
-   - The PageRank algorithm is used to rank pages based on their importance.
-   - When a user searches for a query through the client, the server uses the indexed data and PageRank scores to return the most relevant pages.
+**2. Indexing**
+
+- The indexing module processes the crawled pages.
+
+- The content is tokenized, cleaned, stemmed, and stop words are removed using the NLTK library (a sketch follows at the end of this section).
+
+- The resulting indexed data is saved to be used by the search API.
+
+**3. Serving and PageRank**
+
+- The PageRank algorithm is used to rank pages based on their importance (see the sketch below).
+
+- When a user searches for a query through the client, the server uses the indexed data and PageRank scores to return the most relevant pages.
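+
+The two sketches below make the indexing and ranking steps concrete. They are illustrative only: the helper names are hypothetical, and the project's actual implementations live in the `search` directory and may differ.
+
+First, a minimal indexing-style preprocessing pass with NLTK (assumes `nltk` is installed and downloads its `punkt` and `stopwords` data on first run):
+
+```python
+import nltk
+from nltk.corpus import stopwords
+from nltk.stem import PorterStemmer
+from nltk.tokenize import word_tokenize
+
+# One-time downloads of the tokenizer model and the stop-word list.
+nltk.download("punkt", quiet=True)
+nltk.download("stopwords", quiet=True)
+
+def preprocess(text):
+    """Tokenize, clean, remove stop words, and stem a document."""
+    stemmer = PorterStemmer()
+    stop_words = set(stopwords.words("english"))
+    tokens = word_tokenize(text.lower())                 # tokenize
+    tokens = [t for t in tokens if t.isalnum()]          # drop punctuation
+    tokens = [t for t in tokens if t not in stop_words]  # remove stop words
+    return [stemmer.stem(t) for t in tokens]             # stem
+
+print(preprocess("Sloths are slowly searching the entire web."))
+```
+
+Second, a sketch of iterative (power-iteration) PageRank with damping, in the spirit of `search/complete_examples/advanced_pagerank.py`:
+
+```python
+def pagerank(links, damping=0.85, iterations=50):
+    """links maps each page to the list of pages it links to."""
+    pages = set(links) | {t for targets in links.values() for t in targets}
+    n = len(pages)
+    rank = {p: 1.0 / n for p in pages}  # start from a uniform distribution
+    for _ in range(iterations):
+        new_rank = {p: (1.0 - damping) / n for p in pages}
+        for page in pages:
+            targets = links.get(page, [])
+            if targets:
+                share = damping * rank[page] / len(targets)
+                for t in targets:  # pass a share of rank along each out-link
+                    new_rank[t] += share
+            else:  # dangling page: spread its rank evenly across all pages
+                for p in pages:
+                    new_rank[p] += damping * rank[page] / n
+        rank = new_rank
+    return rank
+
+# Tiny three-page web: "a" should end up with the highest rank.
+print(pagerank({"a": ["b"], "b": ["a", "c"], "c": ["a"]}))
+```
+
+Both snippets are self-contained and can be run directly with `python`; they are not wired into the project's pipeline.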
 
 ## Important Notes
 
 - **Respecting Websites**: The crawler respects `robots.txt` rules. Please make sure not to overload any websites.
+
 - **PageRank Algorithm**: The implementation of the PageRank algorithm uses an iterative approach to rank pages based on the links between them.
+
 - **Data Storage**: The crawler and indexer use CSV files for data storage (`advanced_pagerank_inverted_index.csv` and `advanced_pagerank.csv`). Make sure these files are writable during execution.
 
 ## Contributing