title: Quote Search
emoji: π
colorFrom: purple
colorTo: purple
sdk: static
pinned: false
license: mit
short_description: 'Client-side AI quote search: fast, private, no servers.'
thumbnail: >-
  https://cdn-uploads.huggingface.co/production/uploads/684a3909e791345bf47d0899/_GjWDxNUBjg96NfZefwaE.png
Client-Side Semantic Quote Retrieval Engine
This project implements a "Zero-Infrastructure" client-side semantic quote retrieval engine. All computationally intensive tasks, including vectorization, indexing, and model quantization, are handled offline. The search and retrieval operations then occur entirely within the user's browser, ensuring data privacy and minimal operational overhead.
Project Overview
The core idea is to pre-process a large dataset of quotes into a highly optimized, quantized vector index. This index, along with a small machine learning model, is then loaded by the client-side application. When a user enters a query, the application generates an embedding for the query and performs a fast Approximate Nearest Neighbor Search (ANNS) directly in the browser to find semantically similar quotes.
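To make the retrieval idea concrete, here is a minimal sketch of quantizing embeddings and ranking them against a query. It is an illustration under stated assumptions, not the project's actual code: it uses symmetric int8 quantization and an exhaustive scan rather than a true approximate nearest neighbor index, toy 3-dimensional vectors instead of real model embeddings, and placeholder quote strings. The function names (`normalize`, `quantize_int8`, `search`) are hypothetical. It uses only the standard library so it runs anywhere.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so the dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def quantize_int8(vec):
    """Map unit-vector components from [-1, 1] to integers in [-127, 127]."""
    # Assumption: symmetric scalar quantization; the real index may differ.
    return [round(x * 127) for x in vec]


def search(query, index, quotes, top_k=3):
    """Rank quotes by dot product between the quantized query and stored vectors."""
    q = quantize_int8(normalize(query))
    scored = [
        (sum(a * b for a, b in zip(q, vec)), quote)
        for vec, quote in zip(index, quotes)
    ]
    # Exhaustive scan: exact over the quantized vectors, not an ANN structure.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [quote for _, quote in scored[:top_k]]


# Toy data standing in for real quotes and model embeddings.
quotes = ["placeholder quote A", "placeholder quote B", "placeholder quote C"]
embeddings = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.1], [0.8, 0.0, 0.6]]

# Offline step: normalize and quantize every embedding once.
index = [quantize_int8(normalize(e)) for e in embeddings]

# Online step: quantize the query the same way and rank.
results = search([1.0, 0.0, 0.1], index, quotes, top_k=2)
print(results)
```

Storing one signed byte per dimension instead of a 32-bit float cuts the index to roughly a quarter of its size, which matters when the whole index has to be downloaded by the browser.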
Data Source
The quotes data used in this project was sourced from: https://archive.org/details/quotes_20230625
Setup and Usage
Offline Data Processing:
- Ensure you have Python and the necessary libraries (e.g., `pandas`, `numpy`, `torch`, `sentence-transformers`, `tqdm`) installed.
- Run the `offline_processing.py` script to generate the `quotes_index.bin` file, which contains the pre-computed embeddings and metadata: `python offline_processing.py`
  - Note: This step can be time-consuming for large datasets, especially the first time, as the embedding model needs to be downloaded.
  - Important: The script now includes validation to ensure that categories in the CSV do not contain uppercase letters. Rows with invalid categories will be ignored.
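The category validation described above can be sketched roughly as follows. This is a guess at the behavior, not the code in `offline_processing.py`: the column names (`quote`, `author`, `category`), the helper name `load_valid_rows`, and the choice to accept empty categories are all assumptions made for illustration, and the sample rows are placeholders.

```python
import csv
import io


def load_valid_rows(csv_file, category_field="category"):
    """Yield rows whose category contains no uppercase letters."""
    for row in csv.DictReader(csv_file):
        category = row.get(category_field) or ""
        if any(ch.isupper() for ch in category):
            continue  # Invalid category: the row is ignored.
        yield row


# Placeholder CSV content; a real run would open the quotes file instead.
sample = io.StringIO(
    "quote,author,category\n"
    "placeholder one,Author A,wisdom\n"
    "placeholder two,Author B,Life\n"
    'placeholder three,Author C,"love, hope"\n'
)

valid_rows = list(load_valid_rows(sample))
print([row["category"] for row in valid_rows])
```

Filtering before embedding keeps invalid rows out of the index entirely, so the client never has to re-check categories at search time.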
Client-Side Application:
- Open `index.html` in your web browser.
- The search input and button are immediately available.
- The application will first check for a cached index. If not found, it will display a message indicating a significant one-time download (for the model and index) which will occur on your first search.
- The `quotes_index.bin` file is loaded (from cache or downloaded) and the necessary machine learning model (via `transformers.js`) is downloaded on demand during your first search.
- A progress bar with detailed status will be shown during downloads and processing, disappearing once complete.
- A "Delete Cached Data" button will appear when data is cached, allowing you to clear local storage.
- Enter your search queries to retrieve semantically similar quotes.
Technologies Used
- Frontend: HTML, CSS (Tailwind CSS), JavaScript
- Offline Processing: Python (pandas, numpy, torch, sentence-transformers)
- Embedding Model: nomic-ai/nomic-embed-text-v1.5
- Client-Side ML: transformers.js, Web Workers