Spaces:

Bellok
/

warbler-cda

Running on Zero

Bellok commited on 5 days ago

Commit

2c1a23f

1 Parent(s): 9692d79

revert(app): limit HF datasets to 50k arxiv and core packs for space constraints

Reduced dataset ingestion in README.md and app.py to exclude most HF datasets except for a balanced 50k arxiv papers, novels, and npc-dialogue, lowering from ~2.6M to ~100k documents. Updated Gradio SDK to 6.0.2 and added thumbnail for deployment efficiency on HuggingFace Spaces. Addresses resource limits while maintaining core functionality.

Files changed (2) hide show

README.md +23 -15
app.py +12 -13

README.md CHANGED Viewed

@@ -4,17 +4,19 @@ emoji: 🦜
 colorFrom: blue
 colorTo: purple
 sdk: gradio
-sdk_version: 4.44.0
 app_file: app.py
 pinned: false
 license: mit
-short_description: RAG system with 8D FractalStat and 2.6M+ documents
 tags:
-  - rag
-  - semantic-search
-  - retrieval
-  - fastapi
-  - fractalstat
 ---
 # Warbler CDA - Cognitive Development Architecture RAG System
@@ -53,22 +55,28 @@ A **production-ready RAG (Retrieval-Augmented Generation) system** with **Fracta
 The Warbler system is trained on carefully curated, MIT-licensed datasets from HuggingFace:
-### Primary Datasets
 - **arXiv Papers** (`nick007x/arxiv-papers`) - 2.5M+ scholarly papers covering scientific domains
 - **Prompt Engineering Report** (`PromptSystematicReview/ThePromptReport`) - 83 comprehensive prompt documentation entries
 - **Generated Novels** (`GOAT-AI/generated-novels`) - 20 narrative-rich novels for storytelling patterns
 - **Technical Manuals** (`nlasso/anac-manuals-23`) - 52 procedural and operational documents
 - **ChatEnv Enterprise** (`SustcZhangYX/ChatEnv`) - 112K+ software development conversations
 - **Portuguese Education** (`Solshine/Portuguese_Language_Education_Texts`) - 21 multilingual educational texts
 - **Educational Stories** (`MU-NLPC/Edustories-en`) - 1.5K+ case studies and learning narratives
-### Original Warbler Packs
-- `warbler-pack-core` - Core narrative and reasoning patterns
-- `warbler-pack-wisdom-scrolls` - Philosophical and wisdom-based content
-- `warbler-pack-faction-politics` - Political and faction dynamics
 All datasets are provided under MIT or compatible licenses. For complete attribution, see the HuggingFace Hub pages listed above.
 ## 📦 Installation
@@ -387,4 +395,4 @@ MIT License - see [LICENSE](LICENSE) for details.
 ---
-### **Made with ❤️ by Tiny Walnut Games**

 colorFrom: blue
 colorTo: purple
 sdk: gradio
+sdk_version: 6.0.2
 app_file: app.py
 pinned: false
 license: mit
+short_description: RAG system with 8D FractalStat and 100k documents
 tags:
+- rag
+- semantic-search
+- retrieval
+- fastapi
+- fractalstat
+thumbnail: >-
+  https://cdn-uploads.huggingface.co/production/uploads/68c705b6fc90bcc7a4f56721/8G2TJJT8enAFaBLJGTXka.png
 ---
 # Warbler CDA - Cognitive Development Architecture RAG System
 The Warbler system is trained on carefully curated, MIT-licensed datasets from HuggingFace:
+### Original Warbler Packs
+- `warbler-pack-core` - Core narrative and reasoning patterns
+- `warbler-pack-wisdom-scrolls` - Philosophical and wisdom-based content
+- `warbler-pack-faction-politics` - Political and faction dynamics
+### HuggingFace Datasets
 - **arXiv Papers** (`nick007x/arxiv-papers`) - 2.5M+ scholarly papers covering scientific domains
+  - Due to space limits, we only ingest 100k of these documents for use on HuggingFace Spaces.
 - **Prompt Engineering Report** (`PromptSystematicReview/ThePromptReport`) - 83 comprehensive prompt documentation entries
+  - Currently unavailable due to same reasons above.
 - **Generated Novels** (`GOAT-AI/generated-novels`) - 20 narrative-rich novels for storytelling patterns
+  - Currently unavailable due to same reasons above.
 - **Technical Manuals** (`nlasso/anac-manuals-23`) - 52 procedural and operational documents
+  - Currently unavailable due to same reasons above.
 - **ChatEnv Enterprise** (`SustcZhangYX/ChatEnv`) - 112K+ software development conversations
+  - Currently unavailable due to same reasons above.
 - **Portuguese Education** (`Solshine/Portuguese_Language_Education_Texts`) - 21 multilingual educational texts
+  - Currently unavailable due to same reasons above.
 - **Educational Stories** (`MU-NLPC/Edustories-en`) - 1.5K+ case studies and learning narratives
 All datasets are provided under MIT or compatible licenses. For complete attribution, see the HuggingFace Hub pages listed above.
 ## 📦 Installation
 ---
+### **Made with ❤️ by Tiny Walnut Games**

app.py CHANGED Viewed

@@ -80,9 +80,16 @@ if len(documents) == 0:
         try:
             ingestor = HFWarblerIngestor(packs_dir=pack_loader.packs_dir, verbose=True)
-            # Re-enable HF dataset downloads with better timeout handling and limits
             datasets_to_download = [
-                "arxiv",  # Physics papers - minimal set for HF Spaces
             ]
             total_docs = 0
@@ -92,8 +99,8 @@ if len(documents) == 0:
                 try:
                     print(f"📦 Downloading {dataset} (timeout: 3 minutes)...")
-                    # Scale up now that CPU embeddings are solid - 100k papers for massive knowledge base
-                    arxiv_limit = 100000 if dataset == "arxiv" else None  # Maximum capacity now!
                     success = ingestor.ingest_dataset(dataset, arxiv_limit=arxiv_limit)
                     if success:
@@ -169,14 +176,6 @@ else:
             if embedding and hasattr(embedding_provider, "compute_fractalstat_from_embedding"):
                 fractalstat_coords = embedding_provider.compute_fractalstat_from_embedding(embedding)
-            # DEBUG - check embedding creation
-            import sys
-            embedding_summary = f"zero_embedding" if not embedding else f"embedding_dim_{len(embedding)}"
-            print(
-                f"DEBUG: Adding doc {doc['id'][:50]}... with {embedding_summary}",
-                file=sys.stderr,
-            )
             api.add_document(
                 doc_id=doc["id"],
                 content=doc["content"],
@@ -345,4 +344,4 @@ with gr.Blocks(title="Warbler CDA - FractalStat RAG") as demo:
         """)
 if __name__ == "__main__":
-    demo.launch(server_name="0.0.0.0", server_port=7860)

         try:
             ingestor = HFWarblerIngestor(packs_dir=pack_loader.packs_dir, verbose=True)
+            # Enable all available HF dataset packs for maximum knowledge diversity
             datasets_to_download = [
+                #"arxiv",      # Physics and mathematics papers
+                #"edustories", # Educational narratives and stories
+                "novels",     # Fiction literature
+                #"manuals",    # Technical documentation
+                #"enterprise", # Business and corporate content
+                "npc-dialogue", # Game character conversations
+                #"portuguese-edu", # Portuguese educational content
+                #"prompt-report"   # AI prompt engineering reports
             ]
             total_docs = 0
                 try:
                     print(f"📦 Downloading {dataset} (timeout: 3 minutes)...")
+                    # Balance between coverage and deployment time - 50k arxiv papers plus all other packs
+                    arxiv_limit = 50000 if dataset == "arxiv" else None  # Balanced capacity
                     success = ingestor.ingest_dataset(dataset, arxiv_limit=arxiv_limit)
                     if success:
             if embedding and hasattr(embedding_provider, "compute_fractalstat_from_embedding"):
                 fractalstat_coords = embedding_provider.compute_fractalstat_from_embedding(embedding)
             api.add_document(
                 doc_id=doc["id"],
                 content=doc["content"],
         """)
 if __name__ == "__main__":
+    demo.launch(share=True, server_name="0.0.0.0", server_port=7860)