🧩 Fix: README

Made the following changes:

- Improved README structure by making the list items consistent.

- No other changes.
Sherlock 2025-05-03 16:51:20 +05:00 committed by GitHub
parent a541cc8a7e
commit c086502299

@@ -8,76 +8,88 @@ Sloth Search is a project that aims to recreate Google, including crawling, inde
The project is divided into the following folders:
- **Client**: Contains the front-end code, providing a user interface similar to Google search, where users can enter queries and view search results.
- **Search**: Contains the core components of Sloth Search, which replicate the three main parts of Google:
- **Crawling**: The web crawler that collects information from the web.
- **Indexing**: Processing and storing the content collected by the crawler for efficient searching.
- **Serving (PageRank)**: Serving search results based on their relevance and PageRank algorithm.
- **Server**: Contains the search API used to handle client requests and provide search results.
## Installation and Setup
**1. Clone the Repository**
```sh
git clone https://github.com/The-CodingSloth/sloth-search.git
cd sloth-search
```
**2. Install the necessary Python dependencies**
```sh
pip install -r requirements.txt
```
**3. Client Setup**
- The client contains the HTML, CSS, and JavaScript code to run the front-end.
- Open the `index.html` file in your browser, or use a static file server to serve the client code locally.
- You can also use the Live Server extension in your editor.
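If you don't have a static file server handy, here is a minimal sketch using Python's built-in `http.server` module (the `Client` directory name is taken from the folder list above; the port is an arbitrary choice):
```python
# Minimal sketch: serve the front-end locally with Python's built-in
# http.server, as an alternative to the Live Server extension.
# Assumes the front-end files live in the "Client" directory.
import functools
import http.server

PORT = 8000  # arbitrary choice
handler = functools.partial(
    http.server.SimpleHTTPRequestHandler, directory="Client"
)
with http.server.ThreadingHTTPServer(("localhost", PORT), handler) as httpd:
    print(f"Serving the client at http://localhost:{PORT}")
    httpd.serve_forever()
```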
**4. Search Setup**
- The `Search` directory contains the code for crawling, indexing, and serving.
- You can start the process by running:
```sh
python search/complete_examples/advanced_pagerank.py
```
- This will crawl, index, and prepare the content for searching.
- If you want to run any other files, do the same process:
```sh
python search/<path to file you want to run>
```
**5. Server Setup**
- The server uses Flask to provide an API for search queries.
- Start the Flask server by navigating to the `Server` directory and running:
```sh
python google_search_api.py
```
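Once the server is running, you can try it with a quick request. The endpoint path and query parameter below are assumptions (as is Flask's default port), so check `google_search_api.py` for the actual route:
```python
# Hypothetical usage sketch: query the running Flask search API.
# "/search", the "q" parameter, and port 5000 are assumptions; adjust
# them to match the routes defined in google_search_api.py.
import json
import urllib.parse
import urllib.request

query = urllib.parse.urlencode({"q": "sloth facts"})
url = f"http://localhost:5000/search?{query}"

with urllib.request.urlopen(url) as response:
    results = json.loads(response.read().decode("utf-8"))  # assumes a JSON response
print(results)
```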
## How It Works
**1. Crawling**
- The crawler starts with a set of seed URLs and collects links and content from the web.
- It respects `robots.txt` to avoid being blocked and to ensure ethical crawling.
- Parsed data is stored in a format ready for indexing.
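For intuition, here is a minimal sketch of that crawl loop. This is not the project's actual crawler (see the `search` code for that), the `SlothSearchBot` user-agent name is made up, and error handling is omitted for brevity:
```python
# Toy crawler sketch: fetch pages from seed URLs, consult robots.txt
# before each fetch, and collect links for further crawling.
import urllib.robotparser
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen

class LinkParser(HTMLParser):
    """Collects the href value of every anchor tag in a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def allowed(url, agent="SlothSearchBot"):  # user-agent name is hypothetical
    # Respect robots.txt: only fetch URLs the site permits.
    parts = urlparse(url)
    robots = urllib.robotparser.RobotFileParser(
        f"{parts.scheme}://{parts.netloc}/robots.txt"
    )
    robots.read()
    return robots.can_fetch(agent, url)

def crawl(seeds, limit=10):
    seen, queue, pages = set(), list(seeds), {}
    while queue and len(pages) < limit:
        url = queue.pop(0)
        if url in seen or not allowed(url):
            continue
        seen.add(url)
        html = urlopen(url).read().decode("utf-8", errors="ignore")
        pages[url] = html  # stored in a form ready for indexing
        parser = LinkParser()
        parser.feed(html)
        queue.extend(urljoin(url, link) for link in parser.links)
    return pages
```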
**2. Indexing**
- The indexing module processes the crawled pages.
- The content is tokenized, cleaned, and stemmed, and stop words are removed, all using the NLTK library.
- The resulting indexed data is saved to be used by the search API.
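As a sketch of that preprocessing pipeline (illustrative, not the project's exact code), the NLTK steps might look like this:
```python
# Tokenize, lowercase, remove stop words and punctuation, then stem.
# One-time setup: import nltk; nltk.download("punkt"); nltk.download("stopwords")
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

def preprocess(text):
    stemmer = PorterStemmer()
    stops = set(stopwords.words("english"))
    tokens = word_tokenize(text.lower())  # tokenize and lowercase
    return [
        stemmer.stem(token)
        for token in tokens
        if token.isalpha() and token not in stops  # drop punctuation/stop words
    ]

print(preprocess("Sloths are moving slowly through the trees."))
# -> ['sloth', 'move', 'slowli', 'tree']
```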
**3. Serving and PageRank**
- The PageRank algorithm is used to rank pages based on their importance.
- When a user searches for a query through the client, the server uses the indexed data and PageRank scores to return the most relevant pages.
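For intuition, here is a toy version of the iterative PageRank computation; it is a generic sketch of the algorithm, not the project's implementation:
```python
# Iterative PageRank over a small link graph: each page's rank is a base
# share of (1 - damping) plus the damped rank flowing in from its inlinks.
def pagerank(links, damping=0.85, iterations=50):
    pages = list(links)
    rank = {page: 1.0 / len(pages) for page in pages}
    for _ in range(iterations):
        new_rank = {}
        for page in pages:
            # Sum the rank flowing in from every page that links to this one.
            incoming = sum(
                rank[other] / len(links[other])
                for other in pages
                if page in links[other]
            )
            new_rank[page] = (1 - damping) / len(pages) + damping * incoming
        rank = new_rank
    return rank

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
print(pagerank(graph))  # "c" ends up with the highest rank (most inbound rank flow)
```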
## Important Notes
- **Respecting Websites**: The crawler respects `robots.txt` rules. Please make sure not to overload any websites.
- **PageRank Algorithm**: The PageRank implementation uses an iterative approach to rank pages based on the link structure between them.
- **Data Storage**: The crawler and indexer use CSV files for data storage (`advanced_pagerank_inverted_index.csv` and `advanced_pagerank.csv`). Make sure these files are writable during execution.
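As a quick pre-run sanity check for that last note, you could verify the files are writable using only the standard library (file names come from the note above; running from the directory containing them is an assumption):
```python
# Check that existing storage files are writable before running the pipeline.
import os

for path in ("advanced_pagerank_inverted_index.csv", "advanced_pagerank.csv"):
    if os.path.exists(path) and not os.access(path, os.W_OK):
        raise PermissionError(f"{path} exists but is not writable")
```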
## Contributing