Bellok commited on
Commit
2c1a23f
·
1 Parent(s): 9692d79

revert(app): limit HF datasets to 50k arxiv and core packs for space constraints

Browse files

Reduced dataset ingestion in README.md and app.py to exclude most HF datasets except for a balanced 50k arxiv papers, novels, and npc-dialogue, lowering from ~2.6M to ~100k documents. Updated Gradio SDK to 6.0.2 and added thumbnail for deployment efficiency on HuggingFace Spaces. Addresses resource limits while maintaining core functionality.

Files changed (2) hide show
  1. README.md +23 -15
  2. app.py +12 -13
README.md CHANGED
@@ -4,17 +4,19 @@ emoji: 🦜
4
  colorFrom: blue
5
  colorTo: purple
6
  sdk: gradio
7
- sdk_version: 4.44.0
8
  app_file: app.py
9
  pinned: false
10
  license: mit
11
- short_description: RAG system with 8D FractalStat and 2.6M+ documents
12
  tags:
13
- - rag
14
- - semantic-search
15
- - retrieval
16
- - fastapi
17
- - fractalstat
 
 
18
  ---
19
 
20
  # Warbler CDA - Cognitive Development Architecture RAG System
@@ -53,22 +55,28 @@ A **production-ready RAG (Retrieval-Augmented Generation) system** with **Fracta
53
 
54
  The Warbler system is trained on carefully curated, MIT-licensed datasets from HuggingFace:
55
 
56
- ### Primary Datasets
 
 
 
 
 
 
57
 
58
  - **arXiv Papers** (`nick007x/arxiv-papers`) - 2.5M+ scholarly papers covering scientific domains
 
59
  - **Prompt Engineering Report** (`PromptSystematicReview/ThePromptReport`) - 83 comprehensive prompt documentation entries
 
60
  - **Generated Novels** (`GOAT-AI/generated-novels`) - 20 narrative-rich novels for storytelling patterns
 
61
  - **Technical Manuals** (`nlasso/anac-manuals-23`) - 52 procedural and operational documents
 
62
  - **ChatEnv Enterprise** (`SustcZhangYX/ChatEnv`) - 112K+ software development conversations
 
63
  - **Portuguese Education** (`Solshine/Portuguese_Language_Education_Texts`) - 21 multilingual educational texts
 
64
  - **Educational Stories** (`MU-NLPC/Edustories-en`) - 1.5K+ case studies and learning narratives
65
 
66
- ### Original Warbler Packs
67
-
68
- - `warbler-pack-core` - Core narrative and reasoning patterns
69
- - `warbler-pack-wisdom-scrolls` - Philosophical and wisdom-based content
70
- - `warbler-pack-faction-politics` - Political and faction dynamics
71
-
72
  All datasets are provided under MIT or compatible licenses. For complete attribution, see the HuggingFace Hub pages listed above.
73
 
74
  ## 📦 Installation
@@ -387,4 +395,4 @@ MIT License - see [LICENSE](LICENSE) for details.
387
 
388
  ---
389
 
390
- ### **Made with ❤️ by Tiny Walnut Games**
 
4
  colorFrom: blue
5
  colorTo: purple
6
  sdk: gradio
7
+ sdk_version: 6.0.2
8
  app_file: app.py
9
  pinned: false
10
  license: mit
11
+ short_description: RAG system with 8D FractalStat and 100k documents
12
  tags:
13
+ - rag
14
+ - semantic-search
15
+ - retrieval
16
+ - fastapi
17
+ - fractalstat
18
+ thumbnail: >-
19
+ https://cdn-uploads.huggingface.co/production/uploads/68c705b6fc90bcc7a4f56721/8G2TJJT8enAFaBLJGTXka.png
20
  ---
21
 
22
  # Warbler CDA - Cognitive Development Architecture RAG System
 
55
 
56
  The Warbler system is trained on carefully curated, MIT-licensed datasets from HuggingFace:
57
 
58
+ ### Original Warbler Packs
59
+
60
+ - `warbler-pack-core` - Core narrative and reasoning patterns
61
+ - `warbler-pack-wisdom-scrolls` - Philosophical and wisdom-based content
62
+ - `warbler-pack-faction-politics` - Political and faction dynamics
63
+
64
+ ### HuggingFace Datasets
65
 
66
  - **arXiv Papers** (`nick007x/arxiv-papers`) - 2.5M+ scholarly papers covering scientific domains
67
+ - Due to space limits, we only ingest 100k of these documents for use on HuggingFace Spaces.
68
  - **Prompt Engineering Report** (`PromptSystematicReview/ThePromptReport`) - 83 comprehensive prompt documentation entries
69
+ - Currently unavailable due to same reasons above.
70
  - **Generated Novels** (`GOAT-AI/generated-novels`) - 20 narrative-rich novels for storytelling patterns
71
+ - Currently unavailable due to same reasons above.
72
  - **Technical Manuals** (`nlasso/anac-manuals-23`) - 52 procedural and operational documents
73
+ - Currently unavailable due to same reasons above.
74
  - **ChatEnv Enterprise** (`SustcZhangYX/ChatEnv`) - 112K+ software development conversations
75
+ - Currently unavailable due to same reasons above.
76
  - **Portuguese Education** (`Solshine/Portuguese_Language_Education_Texts`) - 21 multilingual educational texts
77
+ - Currently unavailable due to same reasons above.
78
  - **Educational Stories** (`MU-NLPC/Edustories-en`) - 1.5K+ case studies and learning narratives
79
 
 
 
 
 
 
 
80
  All datasets are provided under MIT or compatible licenses. For complete attribution, see the HuggingFace Hub pages listed above.
81
 
82
  ## 📦 Installation
 
395
 
396
  ---
397
 
398
+ ### **Made with ❤️ by Tiny Walnut Games**
app.py CHANGED
@@ -80,9 +80,16 @@ if len(documents) == 0:
80
  try:
81
  ingestor = HFWarblerIngestor(packs_dir=pack_loader.packs_dir, verbose=True)
82
 
83
- # Re-enable HF dataset downloads with better timeout handling and limits
84
  datasets_to_download = [
85
- "arxiv", # Physics papers - minimal set for HF Spaces
 
 
 
 
 
 
 
86
  ]
87
 
88
  total_docs = 0
@@ -92,8 +99,8 @@ if len(documents) == 0:
92
  try:
93
  print(f"📦 Downloading {dataset} (timeout: 3 minutes)...")
94
 
95
- # Scale up now that CPU embeddings are solid - 100k papers for massive knowledge base
96
- arxiv_limit = 100000 if dataset == "arxiv" else None # Maximum capacity now!
97
 
98
  success = ingestor.ingest_dataset(dataset, arxiv_limit=arxiv_limit)
99
  if success:
@@ -169,14 +176,6 @@ else:
169
  if embedding and hasattr(embedding_provider, "compute_fractalstat_from_embedding"):
170
  fractalstat_coords = embedding_provider.compute_fractalstat_from_embedding(embedding)
171
 
172
- # DEBUG - check embedding creation
173
- import sys
174
- embedding_summary = f"zero_embedding" if not embedding else f"embedding_dim_{len(embedding)}"
175
- print(
176
- f"DEBUG: Adding doc {doc['id'][:50]}... with {embedding_summary}",
177
- file=sys.stderr,
178
- )
179
-
180
  api.add_document(
181
  doc_id=doc["id"],
182
  content=doc["content"],
@@ -345,4 +344,4 @@ with gr.Blocks(title="Warbler CDA - FractalStat RAG") as demo:
345
  """)
346
 
347
  if __name__ == "__main__":
348
- demo.launch(server_name="0.0.0.0", server_port=7860)
 
80
  try:
81
  ingestor = HFWarblerIngestor(packs_dir=pack_loader.packs_dir, verbose=True)
82
 
83
+ # Enable all available HF dataset packs for maximum knowledge diversity
84
  datasets_to_download = [
85
+ #"arxiv", # Physics and mathematics papers
86
+ #"edustories", # Educational narratives and stories
87
+ "novels", # Fiction literature
88
+ #"manuals", # Technical documentation
89
+ #"enterprise", # Business and corporate content
90
+ "npc-dialogue", # Game character conversations
91
+ #"portuguese-edu", # Portuguese educational content
92
+ #"prompt-report" # AI prompt engineering reports
93
  ]
94
 
95
  total_docs = 0
 
99
  try:
100
  print(f"📦 Downloading {dataset} (timeout: 3 minutes)...")
101
 
102
+ # Balance between coverage and deployment time - 50k arxiv papers plus all other packs
103
+ arxiv_limit = 50000 if dataset == "arxiv" else None # Balanced capacity
104
 
105
  success = ingestor.ingest_dataset(dataset, arxiv_limit=arxiv_limit)
106
  if success:
 
176
  if embedding and hasattr(embedding_provider, "compute_fractalstat_from_embedding"):
177
  fractalstat_coords = embedding_provider.compute_fractalstat_from_embedding(embedding)
178
 
 
 
 
 
 
 
 
 
179
  api.add_document(
180
  doc_id=doc["id"],
181
  content=doc["content"],
 
344
  """)
345
 
346
  if __name__ == "__main__":
347
+ demo.launch(share=True, server_name="0.0.0.0", server_port=7860)