🧩 Fix: README

Made the following changes:

- Improved README structure by making the list-item formatting consistent.
Sherlock 2025-05-03 16:51:20 +05:00 committed by GitHub
parent a541cc8a7e
commit c086502299

@@ -8,76 +8,88 @@ Sloth Search is a project that aims to recreate Google, including crawling, inde
The project is divided into the following folders:
- **Client**: Contains the front-end code, providing a user interface similar to Google search, where users can enter queries and view search results.
- **Search**: Contains the core components of Sloth Search, which replicate the three main parts of Google:
  - **Crawling**: The web crawler that collects information from the web.
  - **Indexing**: Processing and storing the content collected by the crawler for efficient searching.
  - **Serving (PageRank)**: Serving search results based on their relevance and the PageRank algorithm.
- **Server**: Contains the search API used to handle client requests and provide search results.
## Installation and Setup
**1. Clone the Repository**
```sh
git clone https://github.com/The-CodingSloth/sloth-search.git
cd sloth-search
```
**2. Install the necessary Python dependencies**
```sh
pip install -r requirements.txt
```
**3. Client Setup**
- The client contains the HTML, CSS, and JavaScript code to run the front-end.
- Open the `index.html` file in your browser, or use a static file server to serve the client code locally.
- You can also use the Live Server extension.
**4. Search Setup**
- The `Search` directory contains the code for crawling, indexing, and serving.
- You can start the process by running:
```sh
python search/complete_examples/advanced_pagerank.py
```
- This will crawl, index, and prepare the content for searching.
- If you want to run any other files, do the same process:
```sh
python search/<path to file you want to run>
```
**5. Server Setup**
- The server uses Flask to provide an API for search queries.
- Start the Flask server by navigating to the `Server` directory and running:
```sh
python google_search_api.py
```
## How It Works
**1. Crawling**
- The crawler starts with a set of seed URLs and collects links and content from the web.
- It respects `robots.txt` to avoid being blocked and to ensure ethical crawling.
- Parsed data is stored in a format ready for indexing.
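The crawling step above can be sketched with only the Python standard library. This is a minimal, offline illustration, not the project's actual crawler: the `LinkExtractor` class, the robots rules, and the page snippet are all stand-ins made up for the example.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags, mimicking the link-gathering step."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# Offline robots.txt check: feed the rules directly instead of fetching them.
rules = RobotFileParser()
rules.parse(["User-agent: *", "Disallow: /private/"])

page = '<a href="/docs">Docs</a><a href="/private/a">Hidden</a>'
extractor = LinkExtractor()
extractor.feed(page)

# Keep only links that robots.txt allows us to fetch.
allowed = [link for link in extractor.links
           if rules.can_fetch("*", "https://example.com" + link)]
print(allowed)  # only /docs survives the robots.txt filter
```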
**2. Indexing**
- The indexing module processes the crawled pages.
- The content is tokenized, cleaned, stemmed, and stop words are removed using the NLTK library.
- The resulting indexed data is saved to be used by the search API.
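The indexing pipeline above can be sketched as follows. To keep the example self-contained it uses only the standard library: the tiny stop-word list and the crude suffix-stripping `crude_stem` are stand-ins for NLTK's stop-word corpus and Porter stemmer, which the project actually uses.

```python
import re

STOPWORDS = {"the", "is", "a", "of", "and", "for", "to"}  # tiny stand-in list

def crude_stem(token: str) -> str:
    # Stand-in for NLTK's PorterStemmer: strip a few common suffixes.
    for suffix in ("ing", "ers", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def index_tokens(text: str) -> list[str]:
    tokens = re.findall(r"[a-z0-9]+", text.lower())      # tokenize + clean
    tokens = [t for t in tokens if t not in STOPWORDS]   # drop stop words
    return [crude_stem(t) for t in tokens]               # stem

print(index_tokens("Crawling and indexing the collected pages"))
```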
**3. Serving and PageRank**
- The PageRank algorithm is used to rank pages based on their importance.
- When a user searches for a query through the client, the server uses the indexed data and PageRank scores to return the most relevant pages.
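A minimal iterative PageRank over a link graph looks like the sketch below. The graph shape (a dict mapping each page to its outgoing links) and the damping factor 0.85 are conventional choices for illustration; the project's own parameters may differ.

```python
def pagerank(links: dict, damping: float = 0.85, iterations: int = 50) -> dict:
    """Iterative PageRank: each page repeatedly shares its rank across its outlinks."""
    pages = list(links)
    n = len(pages)
    ranks = {p: 1.0 / n for p in pages}  # start with a uniform distribution
    for _ in range(iterations):
        new = {}
        for p in pages:
            # Sum the rank shares arriving from every page that links to p.
            incoming = sum(ranks[q] / len(links[q]) for q in pages if p in links[q])
            new[p] = (1 - damping) / n + damping * incoming
        ranks = new
    return ranks

# Toy graph: "c" is linked from both "a" and "b", so it ranks highest.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
print(max(ranks, key=ranks.get))  # c
```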
## Important Notes
- **Respecting Websites**: The crawler respects `robots.txt` rules. Please make sure not to overload any websites.
- **PageRank Algorithm**: The implementation of the PageRank algorithm uses an iterative approach to rank pages based on the links.
- **Data Storage**: The crawler and indexer use CSV files for data storage (`advanced_pagerank_inverted_index.csv` and `advanced_pagerank.csv`). Make sure these files are writable during execution.
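Storing an inverted index in CSV can work roughly as below. The two-column layout (a term plus a space-separated postings list) and the sample data are assumptions for illustration; the actual columns in `advanced_pagerank_inverted_index.csv` may differ.

```python
import csv
import io

# Hypothetical inverted index: term -> list of URLs containing it.
inverted_index = {
    "sloth": ["https://a.example", "https://b.example"],
    "search": ["https://b.example"],
}

buffer = io.StringIO()  # stands in for the CSV file on disk
writer = csv.writer(buffer)
writer.writerow(["term", "urls"])
for term, urls in inverted_index.items():
    writer.writerow([term, " ".join(urls)])

# Reading the CSV back restores the postings lists.
buffer.seek(0)
reader = csv.DictReader(buffer)
restored = {row["term"]: row["urls"].split() for row in reader}
print(restored == inverted_index)  # True
```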
## Contributing