akhaliq (HF Staff) committed
Commit c2ef2a2 · verified · 1 Parent(s): 9029ce6

Upload app.py with huggingface_hub

Files changed (1):
  1. app.py +221 -0
app.py ADDED
@@ -0,0 +1,221 @@
I'll create a Gradio chat application using the IBM Granite model with ZeroGPU optimization. Here's the complete implementation:

```python
import gradio as gr
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import spaces

# Model configuration
MODEL_PATH = "ibm-granite/granite-4.0-h-small"

# Load tokenizer (doesn't need GPU)
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)

# Load model in half precision and place it on CUDA
# (on ZeroGPU Spaces, the GPU is attached on demand)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True
)
model.to('cuda')
model.eval()

@spaces.GPU(duration=60)
def generate_response(message, history):
    """Generate a response using the IBM Granite model with ZeroGPU."""

    # Rebuild the conversation history as chat messages
    chat = []
    for user_msg, assistant_msg in history:
        chat.append({"role": "user", "content": user_msg})
        if assistant_msg:
            chat.append({"role": "assistant", "content": assistant_msg})

    # Add the current message
    chat.append({"role": "user", "content": message})

    # Apply the model's chat template
    formatted_chat = tokenizer.apply_chat_template(
        chat,
        tokenize=False,
        add_generation_prompt=True
    )

    # Tokenize the prompt
    input_tokens = tokenizer(
        formatted_chat,
        return_tensors="pt",
        truncation=True,
        max_length=2048
    ).to('cuda')

    # Generate output tokens
    with torch.no_grad():
        output = model.generate(
            **input_tokens,
            max_new_tokens=512,
            temperature=0.7,
            top_p=0.95,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
            eos_token_id=tokenizer.eos_token_id
        )

    # Decode only the newly generated tokens. Slicing the decoded string by
    # len(formatted_chat) would be unreliable, since skip_special_tokens
    # changes the decoded length relative to the raw prompt text.
    prompt_length = input_tokens["input_ids"].shape[1]
    response = tokenizer.decode(
        output[0][prompt_length:],
        skip_special_tokens=True
    ).strip()

    return response

# Create the Gradio interface
with gr.Blocks(title="IBM Granite Chat", theme=gr.themes.Soft()) as demo:
    gr.HTML(
        """
        <div style="text-align: center; max-width: 800px; margin: 0 auto; padding: 20px;">
            <h1 style="font-size: 2.5em; margin-bottom: 0.5em;">🪨 IBM Granite 4.0 Chat</h1>
            <p style="font-size: 1.1em; color: #666; margin-bottom: 1em;">
                Chat with the IBM Granite 4.0-h Small model, powered by ZeroGPU
            </p>
            <p style="font-size: 0.9em; color: #888;">
                <a href="https://huggingface.co/spaces/akhaliq/anycoder" target="_blank" style="color: #007bff; text-decoration: none;">
                    Built with anycoder
                </a>
            </p>
        </div>
        """
    )

    chatbot = gr.Chatbot(
        height=500,
        bubble_full_width=False,
        show_copy_button=True,
        layout="panel"
    )

    with gr.Row():
        msg = gr.Textbox(
            label="Your Message",
            placeholder="Type your message here and press Enter...",
            lines=2,
            scale=9,
            autofocus=True
        )
        submit_btn = gr.Button("Send", variant="primary", scale=1)

    with gr.Row():
        clear_btn = gr.ClearButton([msg, chatbot], value="🗑️ Clear Chat")

    with gr.Accordion("Advanced Settings", open=False):
        gr.Markdown("""
        ### Model Information
        - **Model**: IBM Granite 4.0-h Small
        - **Precision**: `torch.float16` for reduced memory use
        - **Powered by**: Hugging Face ZeroGPU

        ### Tips for Better Responses
        - Be specific and clear in your questions
        - Provide context when needed
        - The model handles coding, analysis, and general conversation
        """)

    # Example prompts
    gr.Examples(
        examples=[
            "Explain quantum computing in simple terms",
            "Write a Python function to calculate factorial",
            "What are the main differences between machine learning and deep learning?",
            "Help me debug this code: def add(a, b) return a + b",
            "Create a healthy meal plan for a week",
            "Explain the concept of blockchain technology",
        ],
        inputs=msg,
        label="Example Prompts"
    )

    # Event handlers
    def user_submit(message, history):
        # Ignore empty submissions; otherwise append the new user turn
        if not message.strip():
            return "", history
        return "", history + [[message, None]]

    def bot_response(history):
        # Only respond if the latest turn is still waiting for an answer
        if not history or history[-1][1] is not None:
            return history

        user_message = history[-1][0]
        bot_message = generate_response(user_message, history[:-1])
        history[-1][1] = bot_message
        return history

    # Connect events: submit via Enter or the Send button
    msg.submit(user_submit, [msg, chatbot], [msg, chatbot], queue=False).then(
        bot_response, chatbot, chatbot
    )
    submit_btn.click(user_submit, [msg, chatbot], [msg, chatbot], queue=False).then(
        bot_response, chatbot, chatbot
    )

    # Footer
    gr.HTML(
        """
        <div style="text-align: center; margin-top: 30px; padding: 20px; border-top: 1px solid #e0e0e0;">
            <p style="color: #666; font-size: 0.9em;">
                This application uses the IBM Granite 4.0-h Small model to generate responses.
                <br>Responses are AI-generated and should be verified for accuracy.
            </p>
        </div>
        """
    )

# Launch the application
if __name__ == "__main__":
    demo.queue()
    demo.launch(
        show_api=False,
        share=False
    )
```
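
To sanity-check the generation path without launching the UI, the function can be called directly. A minimal sketch, assuming this file is saved as `app.py` and a CUDA device is visible to the process (on Spaces, the `@spaces.GPU` decorator takes care of that):

```python
# Minimal smoke test for generate_response (a sketch; assumes app.py is
# importable and a CUDA device is available to this process)
from app import generate_response

history = [["Hello!", "Hi there! How can I help?"]]  # one completed turn
print(generate_response("Summarize our conversation so far.", history))
```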

This implementation includes:

## Key Features:

1. **ZeroGPU Integration**: Uses the `@spaces.GPU(duration=60)` decorator so a GPU is allocated only for the duration of each generation call
2. **Optimized Model Loading**:
   - Uses `torch.float16` for reduced memory usage
   - `low_cpu_mem_usage=True` for efficient loading
   - The model is placed on CUDA at startup; ZeroGPU attaches the GPU only while a decorated function runs
3. **Clean Chat Interface**:
   - Maintains conversation history
   - Formats messages with the model's chat template
   - Decodes only the newly generated tokens, so responses come back clean
4. **User-Friendly Features**:
   - Example prompts for quick testing
   - Clear chat button
   - Advanced settings accordion with model information
   - Responsive design with a modern theme
5. **Proper Message Handling**:
   - Conversation history management
   - Tokenization with truncation to a 2048-token context window
   - Temperature and top-p sampling for more natural responses
6. **Performance Optimizations**:
   - Uses `torch.no_grad()` for inference
   - Efficient token generation with proper padding
   - Queue management for a smooth user experience (a streaming variant is sketched below)
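
To improve perceived latency, token streaming is the usual next step. Here is a sketch of a streaming variant, assuming the same `model`, `tokenizer`, and `spaces` globals as in the app above (`generate_response_streaming` is a hypothetical name, and the history handling is omitted for brevity); Gradio renders each yielded partial string as it arrives:

```python
from threading import Thread
from transformers import TextIteratorStreamer

@spaces.GPU(duration=60)
def generate_response_streaming(message, history):
    # History handling omitted for brevity; see generate_response above
    chat = [{"role": "user", "content": message}]
    prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

    # The streamer yields decoded text chunks while generate() runs in a thread
    streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
    thread = Thread(target=model.generate,
                    kwargs=dict(**inputs, streamer=streamer, max_new_tokens=512))
    thread.start()

    partial = ""
    for chunk in streamer:
        partial += chunk
        yield partial  # Gradio updates the chat with each partial response
```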

The app provides a professional chat interface for interacting with the IBM Granite model, with ZeroGPU ensuring efficient resource usage on Hugging Face Spaces.
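
As the commit message notes, the file itself was uploaded with `huggingface_hub`. For reference, a sketch of that upload call (the `repo_id` here is hypothetical, and a token with write access is assumed to be configured):

```python
from huggingface_hub import HfApi

api = HfApi()
api.upload_file(
    path_or_fileobj="app.py",
    path_in_repo="app.py",
    repo_id="your-username/granite-chat",  # hypothetical Space id
    repo_type="space",
)
```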