metadata
title: Quote Search
emoji: 📚
colorFrom: purple
colorTo: purple
sdk: static
pinned: false
license: mit
short_description: 'Client-side AI quote search: fast, private, no servers.'
thumbnail: >-
  https://cdn-uploads.huggingface.co/production/uploads/684a3909e791345bf47d0899/_GjWDxNUBjg96NfZefwaE.png

Client-Side Semantic Quote Retrieval Engine

This project implements a "Zero-Infrastructure" client-side semantic quote retrieval engine. All computationally intensive tasks, including vectorization, indexing, and model quantization, are handled offline. Search and retrieval then run entirely within the user's browser, so queries never leave the device and no backend server has to be operated.

Project Overview

The core idea is to pre-process a large dataset of quotes into a highly optimized, quantized vector index. This index, along with a small machine learning model, is then loaded by the client-side application. When a user enters a query, the application generates an embedding for the query and performs a fast Approximate Nearest Neighbor Search (ANNS) directly in the browser to find semantically similar quotes.
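
As a rough illustration of the retrieval step, the sketch below quantizes unit-length vectors to 8-bit integers and ranks them by integer dot product. It is a minimal sketch, not the project's implementation: it uses an exact brute-force scan rather than an approximate nearest neighbor index, the vectors are toy 3-dimensional stand-ins for real embeddings, and the function names (normalize, quantize_int8, top_k) and the symmetric scale of 127 are assumptions made for the example.

```python
import math
from typing import List, Sequence, Tuple

SCALE = 127  # assumed symmetric int8 range [-127, 127]


def normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        return list(vec)
    return [x / norm for x in vec]


def quantize_int8(vec: Sequence[float]) -> List[int]:
    """Normalize, then map each component to an integer in [-127, 127]."""
    return [int(round(x * SCALE)) for x in normalize(vec)]


def top_k(query: Sequence[float],
          index: Sequence[Sequence[int]],
          k: int = 3) -> List[Tuple[int, float]]:
    """Return (row, approximate cosine similarity) for the k best rows.

    This is an exact scan over every row; a real ANN index would
    visit only a subset of the rows.
    """
    q = quantize_int8(query)
    scored = []
    for row_id, row in enumerate(index):
        dot = sum(a * b for a, b in zip(q, row))
        scored.append((dot, row_id))
    # Highest score first; ties are broken by the lower row id.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(row_id, dot / (SCALE * SCALE)) for dot, row_id in scored[:k]]
```

For example, with stored vectors [1, 0, 0], [0, 1, 0], and [0.9, 0.1, 0] quantized into an index, calling top_k([1, 0, 0], index, k=2) ranks the first vector highest, followed by the third.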

Data Source

The quotes data used in this project was sourced from: https://archive.org/details/quotes_20230625

Setup and Usage

  1. Offline Data Processing:

    • Ensure you have Python and the necessary libraries (e.g., pandas, numpy, torch, sentence-transformers, tqdm) installed.
    • Run the offline_processing.py script to generate the quotes_index.bin file. This file contains the pre-computed embeddings and metadata.
    python offline_processing.py
    
    • Note: This step can be time-consuming for large datasets, especially on the first run, when the embedding model must be downloaded.
    • Important: The script validates that categories in the CSV contain no uppercase letters. Rows with invalid categories are ignored.
  2. Client-Side Application:

    • Open index.html in your web browser.
    • The search input and button are immediately available.
    • The application first checks for a cached index. If none is found, it displays a message warning of a significant one-time download (the model and the index), which happens on your first search.
    • On that first search, quotes_index.bin is loaded (from the cache or by download) and the required machine learning model is downloaded on demand via transformers.js.
    • A progress bar with detailed status is shown during downloads and processing and disappears once they complete.
    • A "Delete Cached Data" button appears whenever data is cached, allowing you to clear local storage.
    • Enter your search queries to retrieve semantically similar quotes.
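
The category check mentioned in step 1 could look roughly like the following. This is a sketch under stated assumptions, not the code in offline_processing.py: the column names (quote, author, category) and the helper names are hypothetical, and it uses the standard csv module where the project may well use pandas.

```python
import csv
import io
from typing import Dict, List


def has_valid_category(category: str) -> bool:
    """True when the category contains no uppercase letters."""
    return not any(ch.isupper() for ch in category)


def filter_rows(csv_text: str,
                category_column: str = "category") -> List[Dict[str, str]]:
    """Parse CSV text and keep only rows whose category passes the check.

    The column name is a hypothetical default chosen for this example.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    kept = []
    for row in reader:
        category = row.get(category_column) or ""
        if has_valid_category(category):
            kept.append(row)
        # Rows with an uppercase letter in the category are skipped.
    return kept
```

Given three rows with the categories life, Courage, and "hope, love", filter_rows keeps the first and third and drops the second.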

Technologies Used

  • Frontend: HTML, CSS (Tailwind CSS), JavaScript
  • Offline Processing: Python (pandas, numpy, torch, sentence-transformers)
  • Embedding Model: nomic-ai/nomic-embed-text-v1.5
  • Client-Side ML: transformers.js, Web Workers
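
One detail worth keeping in mind with this embedding model: the nomic-embed-text-v1.5 model card describes task prefixes, with "search_document: " for stored texts and "search_query: " for queries. Whether this project applies them is an assumption on my part; the sketch below only shows how the two sides might be prepared consistently before embedding, and the helper names are hypothetical.

```python
# Task prefixes as described on the nomic-embed-text-v1.5 model card.
SEARCH_DOCUMENT_PREFIX = "search_document: "
SEARCH_QUERY_PREFIX = "search_query: "


def prepare_quote(text: str) -> str:
    """Prefix a quote before it is embedded offline (hypothetical helper)."""
    return SEARCH_DOCUMENT_PREFIX + text.strip()


def prepare_query(text: str) -> str:
    """Prefix a user query before it is embedded in the browser (hypothetical helper)."""
    return SEARCH_QUERY_PREFIX + text.strip()
```

If prefixes are used at all, the offline script and the client need to agree on them, since the index and the queries must be embedded in the same way for similarity scores to be meaningful.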