import streamlit as st
import streamlit.components.v1 as components


def run_model_arch() -> None:
    """
    Displays the model architecture and the accompanying abstract and design details for the
    Knowledge-Based Visual Question Answering (KB-VQA) model.

    This function reads an HTML file containing the model architecture and renders it in a
    Streamlit application. It also provides detailed descriptions of the research abstract and
    the design of the KB-VQA model.

    Returns:
        None
    """
    # Read the model architecture HTML file
    with open("Files/Model Arch.html", 'r', encoding='utf-8') as f:
        model_arch_html = f.read()

    col1, col2 = st.columns(2)

    with col1:
        st.markdown("#### Model Architecture")
        components.html(model_arch_html, height=1600)

    with col2:
        st.markdown("#### Abstract")
| st.markdown(""" | |
| <div style="text-align: justify;"> | |
| Navigating the frontier of the Visual Turing Test, this research delves into multimodal learning to bridge | |
| the gap between visual perception and linguistic interpretation, a foundational challenge in artificial | |
| intelligence. It scrutinizes the integration of visual cognition and external knowledge, emphasizing the | |
| pivotal role of the Transformer model in enhancing language processing and supporting complex multimodal tasks. | |
| This research explores the task of Knowledge-Based Visual Question Answering (KB-VQA), examining the influence | |
| of Pre-Trained Large Language Models (PT-LLMs) and Pre-Trained Multimodal Models (PT-LMMs), which have | |
| transformed the machine learning landscape by utilizing expansive, pre-trained knowledge repositories to tackle | |
| complex tasks, thereby enhancing KB-VQA systems. | |
        An examination of existing KB-VQA methodologies led to a refined approach that converts visual content into
        the linguistic domain, creating detailed captions and object enumerations. This process leverages the
        implicit knowledge and inferential capabilities of PT-LLMs. The research refines the fine-tuning of PT-LLMs
        by integrating specialized tokens, enhancing the models’ ability to interpret visual contexts. The research
        also reviews current image representation techniques and knowledge sources, advocating for the utilization
        of implicit knowledge in PT-LLMs, especially for tasks that do not require specialized expertise.
        Rigorous ablation experiments were conducted to assess the impact of various visual context elements on
        model performance, with a particular focus on the importance of the image descriptions generated during the
        captioning phase. The study includes a comprehensive analysis of major KB-VQA datasets, specifically the
        OK-VQA corpus, and critically evaluates the metrics used, incorporating semantic evaluation with GPT-4 to
        align the assessment with practical application needs.
        The evaluation results underscore the developed model’s competent and competitive performance. It achieves
        a VQA score of 63.57% under syntactic evaluation and excels with an Exact Match (EM) score of 68.36%.
        Further, semantic evaluations yield even more impressive outcomes, with VQA and EM scores of 71.09% and
        72.55%, respectively. These results demonstrate that the model effectively applies reasoning over the
        visual context and successfully retrieves the necessary knowledge to answer visual questions.
        </div>
        """, unsafe_allow_html=True)
| st.markdown("<br>" * 2, unsafe_allow_html=True) | |
| st.markdown("#### Design") | |
| st.markdown(""" | |
| <div style="text-align: justify;"> | |
        As illustrated in the architecture diagram, the model operates through a sequential pipeline, beginning with
        the Image to Language Transformation Module. In this module, the image is processed simultaneously by frozen
        image captioning and object detection models, aiming to comprehensively capture the visual context and cues.
        These models, selected for their initial effectiveness, are designed to be pluggable, allowing for easy
        replacement with more advanced models as new technologies develop, thus ensuring the module remains at the
        forefront of technological advancement.
        Following this, the Prompt Engineering Module processes the generated captions and the list of detected
        objects, along with their bounding boxes and confidence levels, merging these elements with the question at
        hand using a meticulously crafted prompting template. The pipeline ends with a fine-tuned Pre-Trained Large
        Language Model (PT-LLM), which is responsible for performing reasoning and deriving the knowledge required
        to formulate an informed response to the question.
        </div>
        """, unsafe_allow_html=True)