AUXteam committed on
Commit
db83589
·
verified ·
1 Parent(s): 083fccb

Upload folder using huggingface_hub

.gitattributes CHANGED
@@ -33,3 +33,6 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
+ ai_scientist/fewshot_examples/132_automated_relational.pdf filter=lfs diff=lfs merge=lfs -text
37
+ ai_scientist/fewshot_examples/2_carpe_diem.pdf filter=lfs diff=lfs merge=lfs -text
38
+ ai_scientist/fewshot_examples/attention.pdf filter=lfs diff=lfs merge=lfs -text
ai_scientist/__init__.py ADDED
File without changes
ai_scientist/fewshot_examples/132_automated_relational.json ADDED
@@ -0,0 +1,3 @@
1
+ {
2
+ "review": "{\n \"Summary\": \"The paper provides an interesting direction in the meta-learning field. In particular, it proposes to enhance meta learning performance by fully exploring relations across multiple tasks. To capture such information, the authors develop a heterogeneity-aware meta-learning framework by introducing a novel architecture--meta-knowledge graph, which can dynamically find the most relevant structure for new tasks.\",\n \"Strengths\": [\n \"The paper takes one of the most important issues of meta-learning: task heterogeneity. For me, the problem itself is real and practical.\",\n \"The proposed meta-knowledge graph is novel for capturing the relation between tasks and addressing the problem of task heterogeneity. Graph structure provides a more flexible way of modeling relations. The design for using the prototype-based relational graph to query the meta-knowledge graph is reasonable and interesting.\",\n \"This paper provides comprehensive experiments, including both qualitative analysis and quantitative results, to show the effectiveness of the proposed framework. The newly constructed Art-Multi dataset further enhances the difficulty of tasks and makes the performance more convincing.\"\n ],\n \"Weaknesses\": [\n \"Although the proposed method provides several ablation studies, I still suggest the authors conduct the following ablation studies to enhance the quality of the paper: (1) It might be valuable to investigate the modulation function. In the paper, the authors compare sigmoid, tanh, and Film layer. Can the authors analyze the results by reducing the number of gating parameters in Eq. 10 by sharing the gate value of each filter in Conv layers? (2) What is the performance of the proposed model by changing the type of aggregators?\",\n \"For the autoencoder aggregator, it would be better to provide more details about it, which seems not very clear to me.\",\n \"In the qualitative analysis (i.e., Figure 2 and Figure 3), the authors provide one visualization for each task. It would be more convincing if the authors can provide more cases in the rebuttal period.\"\n ],\n \"Originality\": 3,\n \"Quality\": 3,\n \"Clarity\": 3,\n \"Significance\": 4,\n \"Questions\": [\n \"Please address and clarify the cons above.\"\n ],\n \"Limitations\": [\n \"My major concern is about the clarity of the paper and some additional ablation models (see cons below). Hopefully the authors can address my concern in the rebuttal period.\"\n ],\n \"Ethical Concerns\": false,\n \"Soundness\": 3,\n \"Presentation\": 3,\n \"Contribution\": 3,\n \"Overall\": 7,\n \"Confidence\": 5,\n \"Decision\": \"Accept\"\n}"
3
+ }
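For reference, the few-shot example added above stores the review as a JSON-encoded string inside the outer JSON object, so it has to be decoded twice when loaded. A minimal sketch (the keys used below are the ones visible in the file; the loading code itself is illustrative, not part of this commit):

```python
import json

# Load the few-shot review example added in this commit; example["review"] is itself a
# JSON string, so it is decoded a second time.
with open("ai_scientist/fewshot_examples/132_automated_relational.json") as f:
    example = json.load(f)

review = json.loads(example["review"])
print(review["Decision"], review["Overall"])  # fields present in the file above
```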
ai_scientist/fewshot_examples/132_automated_relational.pdf ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:a29ed4d84f6be5b9547097c2bc8bd57bfe197e91dc8f4ec9bcde6b545e7abe59
3
+ size 1348476
ai_scientist/fewshot_examples/132_automated_relational.txt ADDED
@@ -0,0 +1,1190 @@
1
+ # AUTOMATED RELATIONAL META-LEARNING
2
+
3
+ **Anonymous authors**
4
+ Paper under double-blind review
5
+
6
+ ABSTRACT
7
+
8
+ In order to efficiently learn with a small amount of data on new tasks, meta-learning
9
+ transfers knowledge learned from previous tasks to the new ones. However, a
10
+ critical challenge in meta-learning is the task heterogeneity which cannot be well
11
+ handled by traditional globally shared meta-learning methods. In addition, current
12
+ task-specific meta-learning methods may either suffer from hand-crafted structure
13
+ design or lack the capability to capture complex relations between tasks. In this
14
+ paper, motivated by the way of knowledge organization in knowledge bases, we
15
+ propose an automated relational meta-learning (ARML) framework that automatically extracts the cross-task relations and constructs the meta-knowledge graph.
16
+ When a new task arrives, it can quickly find the most relevant structure and tailor
17
+ the learned structure knowledge to the meta-learner. As a result, the proposed
18
+ framework not only addresses the challenge of task heterogeneity by a learned
19
+ meta-knowledge graph, but also increases the model interpretability. We conduct
20
+ extensive experiments on 2D toy regression and few-shot image classification and
21
+ the results demonstrate the superiority of ARML over state-of-the-art baselines.
22
+
23
+ 1 INTRODUCTION
24
+
25
+ Learning quickly with a few samples is the key characteristic of human intelligence, which remains a
26
+ daunting problem in machine intelligence. The mechanism of learning to learn (a.k.a., meta-learning)
27
+ is widely used to generalize and transfer prior knowledge learned from previous tasks to improve
28
+ the effectiveness of learning on new tasks, which has benefited various applications, ranging from
29
+ computer vision (Kang et al., 2019; Liu et al., 2019) to natural language processing (Gu et al., 2018;
30
+ Lin et al., 2019). Most existing meta-learning algorithms learn a globally shared meta-learner
31
+ (e.g., parameter initialization (Finn et al., 2017), meta-optimizer (Ravi & Larochelle, 2016), metric
32
+ space (Snell et al., 2017)). However, globally shared meta-learners fail to handle tasks lying in
33
+ different distributions, which is known as task heterogeneity. Task heterogeneity has been regarded as
34
+ one of the most challenging issues in few-shot learning, and thus it is desirable to design meta-learning
35
+ models that effectively optimize each of the heterogeneous tasks.
36
+
37
+ The key challenge in dealing with task heterogeneity is how to customize the globally shared meta-learner
+ using task-aware information. Recently, a handful of works try to solve the problem by learning
39
+ a task-specific representation for tailoring the transferred knowledge to each task (Oreshkin et al.,
40
+ 2018; Vuorio et al., 2019; Lee & Choi, 2018). However, the success of these methods relies on the
41
+ impaired knowledge generalization among closely correlated tasks (e.g., the tasks sampled from the
42
+ same distribution). Recently, learning the underlying structure among tasks has provided a more effective
+ way of balancing customization and generalization. Representatively, Yao et al. propose a
44
+ hierarchically structured meta-learning method to customize the globally shared knowledge to each
45
+ cluster in a hierarchical way (Yao et al., 2019). Nonetheless, the hierarchical clustering structure
46
+ completely relies on the handcrafted design which needs to be tuned carefully and may lack the
47
+ capability to capture complex relationships.
48
+
49
+ Hence, we are motivated to propose a framework to automatically extract underlying relational
50
+ structures from previously learned tasks and leverage those relational structures to facilitate knowledge
51
+ customization on a new task. This inspiration comes from the way of structuring knowledge in
52
+ knowledge bases (i.e., knowledge graphs). In knowledge bases, the underlying relational structures
53
+ across text entities are automatically constructed and applied to a new query to improve the searching
54
+ efficiency. In the meta-learning problem, similarly, we aim at automatically establishing the metaknowledge graph between prior knowledge learned from previous tasks. When a new task arrives,
55
+ it queries the meta-knowledge graph and quickly attends to the most relevant entities (nodes), and
56
+ then takes advantage of the relational knowledge structures between them to boost the learning
57
+ effectiveness with the limited training data.
58
+
59
+
60
+ -----
61
+
62
+ The proposed meta-learning framework is named as Automated Relational Meta-Learning (ARML).
63
+ Specifically, the ARML framework automatically builds the meta-knowledge graph from meta-training tasks to memorize and organize learned knowledge from historical tasks, where each vertex
+ represents one type of meta-knowledge (e.g., the common contour between birds and aircraft). To
65
+ learn the meta-knowledge graph at meta-training time, for each task, we construct a prototype-based
66
+ relational graph for each class, where each vertex represents one prototype. The prototype-based
67
+ relational graph not only captures the underlying relationship behind samples, but alleviates the
68
+ potential effects of abnormal samples. The meta-knowledge graph is then learned by summarizing
69
+ the information from the corresponding prototype-based relational graphs of meta-training tasks.
70
+ After constructing the meta-knowledge graph, when a new task comes in, the prototype-based
71
+ relational graph of the new task taps into the meta-knowledge graph for acquiring the most relevant
72
+ knowledge, which further enhances the task representation and facilitates its training process.
73
+
74
+ Our major contributions of the proposed ARML are three-fold: (1) it automatically constructs the
75
+ meta-knowledge graph to facilitate learning a new task; (2) it empirically outperforms state-of-the-art
76
+ meta-learning algorithms; (3) the meta-knowledge graph well captures the relationship among tasks
77
+ and improves the interpretability of meta-learning algorithms.
78
+
79
+ 2 RELATED WORK
80
+
81
+ Meta-learning, allowing machines to learn new skills or adapt to new environments rapidly with a
82
+ few training examples, has been demonstrated to be successful in both supervised learning tasks
83
+ (e.g., few-shot image classification) and reinforcement learning settings. There are mainly three
84
+ research lines of meta-learning: (1) black-box amortized methods design black-box meta-learners
85
+ (e.g., neural networks) to infer the model parameters (Ravi & Larochelle, 2016; Andrychowicz et al.,
86
+ 2016; Mishra et al., 2018); (2) gradient-based methods aim to learn an optimized initialization of
87
+ model parameters, which can be adapted to new tasks by a few steps of gradient descent (Finn et al.,
88
+ 2017; 2018; Lee & Choi, 2018); (3) non-parameteric methods combine parameteric meta-learners
89
+ and non-parameteric learners to learn an appropriate distance metric for few-shot classification (Snell
90
+ et al., 2017; Vinyals et al., 2016; Yang et al., 2018; Oreshkin et al., 2018; Yoon et al., 2019).
91
+
92
+ Our work is built upon the gradient-based meta-learning methods. In the line of gradient-based
93
+ meta-learning, most algorithms learn a globally shared meta-learners from all previous tasks (Finn
94
+ et al., 2017; Li et al., 2017; Flennerhag et al., 2019), to improve the effectiveness of learning process
95
+ on new tasks. However, these algorithms typically lack the ability to handle heterogeneous tasks
96
+ (i.e., tasks sampled from sufficiently different distributions). To tackle this challenge, recent works
97
+ tailor the globally shared initialization to different tasks by leveraging task-specific information (Lee
98
+ & Choi, 2018; Vuorio et al., 2019; Oreshkin et al., 2018) and using probabilistic models (Grant
99
+ et al., 2018; Yoon et al., 2018; Gordon et al., 2019). Recently, HSML customizes the global shared
100
+ initialization with a manually designed hierarchical clustering structure to balance the generalization
101
+ and customization between previous tasks (Yao et al., 2019). However, the hierarchical structure
102
+ may not accurately reflect the real structure since it highly relies on the hand-crafted design. In
103
+ addition, the clustering structure further constrains the complexity of the relational structures that can be captured. In
+ contrast, to customize each task, our proposed ARML leverages the most relevant structure from the meta-knowledge
+ graph, which is automatically constructed from previous knowledge. Thus, ARML not only discovers
+ more accurate underlying structures to improve the effectiveness of meta-learning algorithms, but
+ also enhances the model interpretability through the meta-knowledge graph.
108
+
109
+ 3 PRELIMINARIES
110
+
111
+ **Few-shot Learning** Considering a task Ti, the goal of few-shot learning is to learn a model with
112
+ a dataset $\mathcal{D}_i = \{\mathcal{D}_i^{tr}, \mathcal{D}_i^{ts}\}$, where the labeled training set $\mathcal{D}_i^{tr} = \{\mathbf{x}_j^{tr}, y_j^{tr} \mid \forall j \in [1, N^{tr}]\}$ only has a
+ few samples and $\mathcal{D}_i^{ts}$ represents the corresponding test set. A learning model (a.k.a., base model) $f$
+ with parameters $\theta$ is used to evaluate the effectiveness on $\mathcal{D}_i^{ts}$ by minimizing the expected empirical
+ loss on $\mathcal{D}_i^{tr}$, i.e., $\mathcal{L}(\mathcal{D}_i^{tr}, \theta)$, and obtain the optimal parameters $\theta_i$. For the regression problem, the loss
+ function is defined based on the mean square error, i.e., $\sum_{(\mathbf{x}_j, y_j) \in \mathcal{D}_i^{tr}} \|f_\theta(\mathbf{x}_j) - y_j\|_2^2$, and for the
+ classification problem, the loss function uses the cross-entropy loss, i.e., $-\sum_{(\mathbf{x}_j, y_j) \in \mathcal{D}_i^{tr}} \log p(y_j \mid \mathbf{x}_j, f_\theta)$.
+
+ Usually, optimizing and learning the parameter $\theta$ for task $\mathcal{T}_i$ with only a few labeled training samples
+ is difficult. To address this limitation, meta-learning provides a new perspective to improve the
+ performance by leveraging knowledge from multiple tasks.
123
+
124
+
125
+ -----
126
+
127
+ **Meta-learning and Model-agnostic Meta-learning** In meta-learning, a sequence of tasks
128
+ _{T1, ..., TI_ _} are sampled from a task-level probability distribution p(T ), where each one is a few-shot_
129
+ learning task. To facilitate adaptation to incoming tasks, the meta-learning algorithm aims to find
+ a well-generalized meta-learner over the $I$ training tasks at the meta-training phase. At the meta-testing phase, the
+ optimal meta-learner is applied to adapt to new tasks $\mathcal{T}_t$. In this way, meta-learning algorithms are
132
+ capable of adapting to new tasks efficiently even with a shortage of training data for a new task.
133
+
134
+ Model-agnostic meta-learning (MAML) (Finn et al., 2017), one of the representative algorithms in
135
+ gradient-based meta-learning, regards the meta-learner as the initialization of parameter θ, i.e., θ0,
136
+ and learns a well-generalized initialization $\theta_0^*$ during the meta-training process. The optimization
+ problem is formulated as (with one gradient step as an example):
+
+ $$\theta_0^* := \arg\min_{\theta_0} \sum_{i=1}^{I} \mathcal{L}(f_{\theta_i}, \mathcal{D}_i^{ts}) = \arg\min_{\theta_0} \sum_{i=1}^{I} \mathcal{L}\big(f_{\theta_0 - \alpha \nabla_\theta \mathcal{L}(f_\theta, \mathcal{D}_i^{tr})}, \mathcal{D}_i^{ts}\big). \quad (1)$$
+
+ At the meta-testing phase, to obtain the adaptive parameter $\theta_t$ for each new task $\mathcal{T}_t$, we fine-tune the
+ initialization $\theta_0^*$ by performing a few steps of gradient updates, i.e., $f_{\theta_t} = f_{\theta_0^* - \alpha \nabla_\theta \mathcal{L}(f_\theta, \mathcal{D}_t^{tr})}$.
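As a concrete illustration of the bi-level optimization in Equation 1, below is a minimal sketch of one MAML outer-loop step for a small regression base model. It assumes PyTorch; `sample_task` is a hypothetical helper returning support/query tensors, and the step sizes are placeholders rather than the paper's settings.

```python
import torch

def forward(params, x):
    """Base model f_theta: a small two-layer MLP applied functionally to a parameter list."""
    w1, b1, w2, b2 = params
    return torch.relu(x @ w1 + b1) @ w2 + b2

def maml_outer_step(params, sample_task, inner_lr=0.01, outer_lr=0.001, num_tasks=4):
    """One meta-update of Eq. (1): adapt on D_i^tr, accumulate the loss on D_i^ts."""
    meta_loss = 0.0
    for _ in range(num_tasks):
        x_tr, y_tr, x_ts, y_ts = sample_task()            # hypothetical task sampler
        tr_loss = torch.mean((forward(params, x_tr) - y_tr) ** 2)
        grads = torch.autograd.grad(tr_loss, params, create_graph=True)
        adapted = [p - inner_lr * g for p, g in zip(params, grads)]   # inner-loop step
        meta_loss = meta_loss + torch.mean((forward(adapted, x_ts) - y_ts) ** 2)
    meta_grads = torch.autograd.grad(meta_loss, params)
    return [(p - outer_lr * g).detach().requires_grad_() for p, g in zip(params, meta_grads)]
```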
161
+
162
+ 4 METHODOLOGY
163
+
164
+ In this section, we introduce the details of the proposed ARML. To better explain how it works,
165
+ we show its framework in Figure 1. The goal of ARML is to facilitate the learning process of new
166
+ tasks by leveraging transferable knowledge learned from historical tasks. To achieve this goal, we
167
+ introduce a meta-knowledge graph, which is automatically constructed at the meta-training time, to
168
+ organize and memorize historical learned knowledge. Given a task, which is built as a prototypebased relational structure, it taps into the meta-knowledge graph to acquire relevant knowledge for
169
+ enhancing its own representation. The enhanced prototype representation further aggregate and
170
+ incorporate with meta-learner for fast and effective adaptions by utilizing a modulating function. In
171
+ the following subsections, we elaborate three key components: prototype-based sample structuring,
172
+ automated meta-knowledge graph construction and utilization, and task-specific knowledge fusion
173
+ and adaptation, respectively.
174
+
175
+ [Figure 1 diagram: prototypes → prototype-based relational structure R_i → propagation with the meta-knowledge graph G → aggregators → modulation of the initialization θ0; see the caption below.]
198
+
199
+
200
+ Figure 1: The framework of ARML. For each task $\mathcal{T}_i$, ARML first builds a prototype-based relational
+ structure $\mathcal{R}_i$ by mapping the training samples $\mathcal{D}_i^{tr}$ into prototypes, where each prototype represents
+ one class. Then, $\mathcal{R}_i$ interacts with the meta-knowledge graph $\mathcal{G}$ to acquire the most relevant historical
+ knowledge by information propagation. Finally, the task-specific modulation tailors the globally
+ shared initialization $\theta_0$ by aggregating the raw prototypes and the enriched prototypes, which absorb
206
+ relevant historical information from the meta-knowledge graph.
207
+
208
+ 4.1 PROTOTYPE-BASED SAMPLE STRUCTURING
209
+
210
+ Given a task which involves either classifications or regressions regarding a set of samples, we first
211
+ investigate the relationships among these samples. Such relationship is represented by a graph, called
212
+ prototype-based relational graph in this work, where the vertices in the graph denote the prototypes
213
+ of different classes while the edges and the corresponding edge weights are created based on the
214
+
215
+
216
+ -----
217
+
218
+ similarities between prototypes. Constructing the relational graph based on prototypes instead of raw
219
+ samples allows us to alleviate the issue raised by abnormal samples: abnormal samples, which
+ are located far away from normal samples, can pose significant concerns, especially when only a limited
+ number of samples are available for training. Specifically, for the classification problem, the prototype,
+ denoted by $\mathbf{c}_i^k \in \mathbb{R}^d$, is defined as:
+
+ $$\mathbf{c}_i^k = \frac{1}{N_k^{tr}} \sum_{j=1}^{N^{tr}} \mathbb{1}(y_j = k)\, \mathcal{E}(\mathbf{x}_j), \quad (2)$$
+
+ where $N_k^{tr}$ denotes the number of samples in class $k$. $\mathcal{E}$ is an embedding function, which projects
+ $\mathbf{x}_j$ into a hidden space where samples from the same class are located closer to each other while
+ samples from different classes stay apart. For the regression problem, it is not straightforward to construct
+ the prototypes explicitly based on class information. Therefore, we cluster samples by learning an
+ assignment matrix $\mathbf{P}_i \in \mathbb{R}^{K \times N^{tr}}$. Specifically, we formulate the process as:
+
+ $$\mathbf{P}_i = \mathrm{Softmax}(\mathbf{W}_p \mathcal{E}^{T}(\mathbf{X}) + \mathbf{b}_p), \qquad \mathbf{c}_i^k = \mathbf{P}_i[k]\, \mathcal{F}(\mathbf{X}), \quad (3)$$
+
+ where $\mathbf{P}_i[k]$ represents the $k$-th row of $\mathbf{P}_i$. Thus, training samples are clustered into $K$ clusters, which
+ serve as the representations of the prototypes.
+
+ After calculating all prototype representations $\{\mathbf{c}_i^k \mid \forall k \in [1, K]\}$, which serve as the vertices in the
+ prototype-based relational graph $\mathcal{R}_i$, we further define the edges and the corresponding edge weights.
+ The edge weight $A_{\mathcal{R}_i}(\mathbf{c}_i^j, \mathbf{c}_i^m)$ between two prototypes $\mathbf{c}_i^j$ and $\mathbf{c}_i^m$ is gauged by the similarity
+ between them. Formally:
+
+ $$A_{\mathcal{R}_i}(\mathbf{c}_i^j, \mathbf{c}_i^m) = \sigma(\mathbf{W}_r(|\mathbf{c}_i^j - \mathbf{c}_i^m|/\gamma_r) + \mathbf{b}_r), \quad (4)$$
+
+ where $\mathbf{W}_r$ and $\mathbf{b}_r$ are learnable parameters, $\gamma_r$ is a scalar, and $\sigma$ is the Sigmoid function, which
+ normalizes the weight between 0 and 1. For simplicity, we denote the prototype-based relational graph
+ as $\mathcal{R}_i = (C_{\mathcal{R}_i}, A_{\mathcal{R}_i})$, where $C_{\mathcal{R}_i} = \{\mathbf{c}_i^j \mid \forall j \in [1, K]\} \in \mathbb{R}^{K \times d}$ represents the set of vertices, each of which
+ corresponds to the prototype of a class, while $A_{\mathcal{R}_i} = \{A_{\mathcal{R}_i}(\mathbf{c}_i^j, \mathbf{c}_i^m) \mid \forall j, m \in [1, K]\} \in \mathbb{R}^{K \times K}$
+ gives the adjacency matrix, which indicates the proximity between prototypes.
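To make Equations 2 and 4 concrete, here is a minimal sketch of prototype construction and the similarity-weighted adjacency, assuming PyTorch; the tensor shapes and the linear layer standing in for $\mathbf{W}_r, \mathbf{b}_r$ are illustrative assumptions, not the authors' code.

```python
import torch

def build_prototype_graph(embeddings, labels, num_classes, gamma_r=1.0):
    """Sketch of Eqs. (2) and (4): class prototypes and sigmoid-normalized edge weights.

    embeddings: (N_tr, d) outputs of the embedding function E; labels: (N_tr,) class indices.
    """
    d = embeddings.size(1)
    # Eq. (2): each prototype is the mean embedding of the samples in its class.
    prototypes = torch.stack([embeddings[labels == k].mean(dim=0) for k in range(num_classes)])
    # Eq. (4): edge weight from the scaled absolute difference of prototypes (W_r, b_r learnable).
    w_r = torch.nn.Linear(d, 1)
    diff = (prototypes.unsqueeze(1) - prototypes.unsqueeze(0)).abs() / gamma_r   # (K, K, d)
    adjacency = torch.sigmoid(w_r(diff)).squeeze(-1)                              # (K, K) in (0, 1)
    return prototypes, adjacency
```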
266
+
267
+ 4.2 AUTOMATED META-KNOWLEDGE GRAPH CONSTRUCTION AND UTILIZATION
268
+
269
+ In this section, we first discuss how to organize and distill knowledge from historical learning process
270
+ and then expound how to leverage such knowledge to benefit the training of new tasks. To organize
271
+ and distill knowledge from historical learning process, we construct and maintain a meta-knowledge
272
+ graph. The vertices represent different types of meta-knowledge (e.g., the common contour between
273
+ aircrafts and birds) and the edges are automatically constructed and reflect the relationship between
274
+ meta-knowledge. When serving a new task, we refer to the meta-knowledge, which allows us to
275
+ efficiently and automatically identify relational knowledge from previous tasks. In this way, the
276
+ training of a new task can benefit from related training experience and get optimized much faster
277
+ than otherwise possible. In this paper, the meta-knowledge graph is automatically constructed at the
278
+ meta-training phase. The details of the construction are elaborated as follows:
279
+
280
+ Assuming the representation of a vertex $g$ is given by $\mathbf{h}^g \in \mathbb{R}^d$, we define the meta-knowledge
+ graph as $\mathcal{G} = (H_{\mathcal{G}}, A_{\mathcal{G}})$, where $H_{\mathcal{G}} = \{\mathbf{h}^j \mid \forall j \in [1, G]\} \in \mathbb{R}^{G \times d}$ and $A_{\mathcal{G}} = \{A_{\mathcal{G}}(\mathbf{h}^j, \mathbf{h}^m) \mid \forall j, m \in [1, G]\} \in \mathbb{R}^{G \times G}$
+ denote the vertex feature matrix and vertex adjacency matrix, respectively. To better
+ explain the construction of the meta-knowledge graph, we first discuss the vertex representations $H_{\mathcal{G}}$.
+ During meta-training, tasks arrive one after another in a sequence and their corresponding vertex
+ representations are expected to be updated dynamically in a timely manner. Therefore, the vertex
+ representations of the meta-knowledge graph are parameterized and learned at training time.
+ Moreover, to encourage the diversity of meta-knowledge encoded in the meta-knowledge graph,
+ the vertex representations are randomly initialized. Analogous to the definition of edge weights in the
+ prototype-based relational graph $\mathcal{R}_i$ in equation 4, the weight between a pair of vertices $j$ and $m$ is
+ constructed as:
+
+ $$A_{\mathcal{G}}(\mathbf{h}^j, \mathbf{h}^m) = \sigma(\mathbf{W}_o(|\mathbf{h}^j - \mathbf{h}^m|/\gamma_o) + \mathbf{b}_o), \quad (5)$$
+
+ where $\mathbf{W}_o$ and $\mathbf{b}_o$ represent learnable parameters and $\gamma_o$ is a scalar.
296
+
297
+ To enhance the learning of new tasks with involvement of historical knowledge, we query the
298
+ prototype-based relational graph in the meta-knowledge graph to obtain the relevant knowledge in
299
+ history. The ideal query mechanism is expected to optimize both graph representations simultaneously
300
+
301
+
302
+ -----
303
+
304
+ at the meta-training time, with the training of one graph facilitating the training of the other. In light
305
+ of this, we construct a super-graph Si by connecting the prototype-based relational graph Ri with the
306
+ meta-knowledge graph G for each task Ti. The union of the vertices in Ri and G contributes to the
307
+ vertices in the super-graph. The edges in Ri and G are also reserved in the super-graph. We connect
308
+ $\mathcal{R}_i$ with $\mathcal{G}$ by creating links between the prototype-based relational graph and the meta-knowledge
+ graph. The link between a prototype $\mathbf{c}_i^j$ in the prototype-based relational graph and a vertex $\mathbf{h}^m$ in the
+ meta-knowledge graph is weighted by the similarity between them. More precisely, for each prototype $\mathbf{c}_i^j$,
+ the link weight $A_{\mathcal{S}}(\mathbf{c}_i^j, \mathbf{h}^m)$ is calculated by applying a softmax over the Euclidean distances between $\mathbf{c}_i^j$
+ and $\{\mathbf{h}^m \mid \forall m \in [1, G]\}$ as follows:
+
+ $$A_{\mathcal{S}}(\mathbf{c}_i^j, \mathbf{h}^k) = \frac{\exp\big(-\|(\mathbf{c}_i^j - \mathbf{h}^k)/\gamma_s\|_2^2/2\big)}{\sum_{k'=1}^{G} \exp\big(-\|(\mathbf{c}_i^j - \mathbf{h}^{k'})/\gamma_s\|_2^2/2\big)}, \quad (6)$$
+
+ where $\gamma_s$ is a scaling factor. We denote the intra-adjacency matrix as $A_{\mathcal{S}} = \{A_{\mathcal{S}}(\mathbf{c}_i^j, \mathbf{h}^m) \mid \forall j \in [1, K], m \in [1, G]\} \in \mathbb{R}^{K \times G}$.
+ Thus, for task $\mathcal{T}_i$, the adjacency matrix and feature matrix of the super-graph $\mathcal{S}_i = (\mathbf{A}_i, \mathbf{H}_i)$ are defined as
+ $\mathbf{A}_i = (A_{\mathcal{R}_i}, A_{\mathcal{S}};\, A_{\mathcal{S}}^{T}, A_{\mathcal{G}}) \in \mathbb{R}^{(K+G) \times (K+G)}$ and $\mathbf{H}_i = (C_{\mathcal{R}_i}; H_{\mathcal{G}}) \in \mathbb{R}^{(K+G) \times d}$, respectively.
323
+
324
+ After constructing the super-graph Si, we are able to propagate the most relevant knowledge from
325
+ the meta-knowledge graph $\mathcal{G}$ to the prototype-based relational graph $\mathcal{R}_i$ by introducing a Graph Neural
+ Network (GNN). In this work, following the "message-passing" framework (Gilmer et al., 2017),
+ the GNN is formulated as:
+
+ $$\mathbf{H}_i^{(l+1)} = \mathrm{MP}(\mathbf{A}_i, \mathbf{H}_i^{(l)}; \mathbf{W}^{(l)}), \quad (7)$$
+
+ where $\mathrm{MP}(\cdot)$ is the message-passing function and has several possible implementations (Hamilton
+ et al., 2017; Kipf & Welling, 2017; Veličković et al., 2018), $\mathbf{H}_i^{(l)}$ is the vertex embedding after $l$
+ layers of GNN, and $\mathbf{W}^{(l)}$ is a learnable weight matrix of layer $l$. The input is $\mathbf{H}_i^{(0)} = \mathbf{H}_i$. After stacking
+ $L$ GNN layers, we obtain the information-propagated feature representation for the prototype-based
+ relational graph $\mathcal{R}_i$ as the top-$K$ rows of $\mathbf{H}_i^{(L)}$, denoted as $\hat{C}_{\mathcal{R}_i} = \{\hat{\mathbf{c}}_i^j \mid j \in [1, K]\}$.
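A minimal sketch of Equations 6 and 7, assuming PyTorch: it links prototypes to meta-knowledge vertices, assembles the super-graph, and runs one simple row-normalized message-passing layer as a stand-in for the GCN layer used in the paper; shapes and names are assumptions for illustration.

```python
import torch

def propagate_supergraph(prototypes, meta_vertices, A_R, A_G, W_gcn, gamma_s=1.0):
    """Sketch of Eqs. (6)-(7): build the super-graph S_i and do one message-passing step.

    prototypes: (K, d) C_R_i; meta_vertices: (G, d) H_G; A_R: (K, K); A_G: (G, G);
    W_gcn: (d, d) learnable weight standing in for W^(l).
    """
    # Eq. (6): softmax over scaled squared Euclidean distances gives the intra-adjacency A_S.
    dist2 = torch.cdist(prototypes / gamma_s, meta_vertices / gamma_s) ** 2
    A_S = torch.softmax(-dist2 / 2, dim=1)                       # (K, G)
    # Assemble the super-graph adjacency A_i and feature matrix H_i.
    A_i = torch.cat([torch.cat([A_R, A_S], dim=1),
                     torch.cat([A_S.t(), A_G], dim=1)], dim=0)   # (K+G, K+G)
    H_i = torch.cat([prototypes, meta_vertices], dim=0)          # (K+G, d)
    # Eq. (7): one simplified message-passing layer (row-normalized adjacency, tanh activation).
    A_norm = A_i / A_i.sum(dim=1, keepdim=True).clamp_min(1e-8)
    H_next = torch.tanh(A_norm @ H_i @ W_gcn)
    return H_next[:prototypes.size(0)]                           # top-K rows: enriched prototypes
```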
334
+
335
+ 4.3 TASK-SPECIFIC KNOWLEDGE FUSION AND ADAPTATION
336
+
337
+ After propagating information from the meta-knowledge graph to the prototype-based relational graph, in
+ this section, we discuss how to learn a well-generalized meta-learner for fast and effective adaptation
339
+ to new tasks with limited training data. To tackle the challenge of task heterogeneity, in this
340
+ paper, we incorporate task-specific information to customize the globally shared meta-learner (e.g.,
341
+ initialization here) by leveraging a modulating function, which has been proven to be effective to
342
+ provide customized initialization in previous studies (Wang et al., 2019; Vuorio et al., 2019).
343
+
344
+ The modulating function relies on well-discriminated task representations, while it is difficult to learn
345
+ all the representations by merely utilizing the loss signal derived from the test set $\mathcal{D}_i^{ts}$. To encourage such
+ stability, we introduce two reconstructions by utilizing two autoencoders. There are two collections
+ of representations, i.e., $C_{\mathcal{R}_i}$ and $\hat{C}_{\mathcal{R}_i}$, which contribute the most to the creation of the task-specific
+ meta-learner. $C_{\mathcal{R}_i}$ expresses the raw prototype information without tapping into the meta-knowledge
+ graph, while $\hat{C}_{\mathcal{R}_i}$ gives the prototype representations after absorbing the relevant knowledge from the
+ meta-knowledge graph. Therefore, the two reconstructions are built on $C_{\mathcal{R}_i}$ and $\hat{C}_{\mathcal{R}_i}$. To reconstruct
+ $C_{\mathcal{R}_i}$, an aggregator $\mathrm{AG}^q(\cdot)$ (e.g., a recurrent network or fully connected layers) is used to encode $C_{\mathcal{R}_i}$
+ into a dense representation, which is further fed into a decoder $\mathrm{AG}^q_{dec}(\cdot)$ to achieve the reconstruction.
+ Then, the corresponding task representation $\mathbf{q}_i$ of $C_{\mathcal{R}_i}$ is summarized by applying a mean pooling
+ operator over prototypes on the encoded dense representation. Formally,
+
+ $$\mathbf{q}_i = \mathrm{MeanPool}(\mathrm{AG}^q(C_{\mathcal{R}_i})) = \frac{1}{K}\sum_{j=1}^{K} \mathrm{AG}^q(\mathbf{c}_i^j), \qquad \mathcal{L}_q = \|C_{\mathcal{R}_i} - \mathrm{AG}^q_{dec}(\mathrm{AG}^q(C_{\mathcal{R}_i}))\|_F^2 \quad (8)$$
+
+ Similarly, we reconstruct $\hat{C}_{\mathcal{R}_i}$ and get the corresponding task representation $\mathbf{t}_i$ as follows:
+
+ $$\mathbf{t}_i = \mathrm{MeanPool}(\mathrm{AG}^t(\hat{C}_{\mathcal{R}_i})) = \frac{1}{K}\sum_{j=1}^{K} \mathrm{AG}^t(\hat{\mathbf{c}}_i^j), \qquad \mathcal{L}_t = \|\hat{C}_{\mathcal{R}_i} - \mathrm{AG}^t_{dec}(\mathrm{AG}^t(\hat{C}_{\mathcal{R}_i}))\|_F^2 \quad (9)$$
387
+ The reconstruction errors in Equations 8 and 9 pose an extra constraint to enhance the training
388
+ stability, leading to improvement of task representation learning.
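For concreteness, below is a minimal sketch of a GRU-based aggregator with the reconstruction term of Equations 8 and 9, assuming PyTorch; hidden sizes and class names are illustrative assumptions rather than the paper's exact architecture.

```python
import torch

class PrototypeAggregator(torch.nn.Module):
    """Sketch of Eq. (8)/(9): GRU encoder-decoder over prototypes plus a mean-pooled task vector."""

    def __init__(self, d, hidden=64):
        super().__init__()
        self.enc = torch.nn.GRU(d, hidden, batch_first=True)
        self.dec = torch.nn.GRU(hidden, d, batch_first=True)

    def forward(self, protos):                                   # protos: (K, d), e.g. C_R_i
        enc_out, _ = self.enc(protos.unsqueeze(0))               # (1, K, hidden)
        task_repr = enc_out.mean(dim=1).squeeze(0)               # mean pooling -> q_i or t_i
        recon, _ = self.dec(enc_out)                             # decode back to (1, K, d)
        recon_loss = ((recon.squeeze(0) - protos) ** 2).sum()    # Frobenius-norm reconstruction error
        return task_repr, recon_loss
```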
389
+
390
+
391
+ -----
392
+
393
+ **Algorithm 1 Meta-Training Process of ARML**
394
+
395
+ **Require: p(T ): distribution over tasks; K: Number of vertices in meta-knowledge graph; α: stepsize**
396
+ for gradient descent of each task (i.e., inner loop stepsize); β: stepsize for meta-optimization (i.e.,
397
+ outer loop stepsize); µ1, µ2: balancing factors in loss function
398
+
399
+ 1: Randomly initialize all learnable parameters Φ
400
+ 2: while not done do
401
+ 3: Sample a batch of tasks {Ti|i ∈ [1, I]} from p(T )
402
+
403
+ 4: **for all Ti do**
404
+
405
+ 5: Sample training set Di[tr] [and testing set][ D]i[ts]
406
+
407
+ 6: Construct the prototype-based relational graph Ri by computing prototype in equation 2
408
+ and weight in equation 4
409
+
410
+ 7: Compute the similarity between each prototype and meta-knowledge vertex in equation 6
411
+ and construct the super-graph Si
412
+
413
+ 8: Apply GNN on super-graph Si and get the information-propagated representation **C[ˆ]** _Ri_
414
+
415
+ 9: Aggregate CRi in equation 8 and **C[ˆ]** _Ri in equation 9 to get the representations qi, ti and_
416
+ reconstruction loss Lq, Lt
417
+
418
+ 10: Compute the task-specific initialization $\theta_{0i}$ in equation 10 and update parameters $\theta_i = \theta_{0i} - \alpha \nabla_\theta \mathcal{L}(f_\theta, \mathcal{D}_i^{tr})$
+
+ 11: **end for**
+
+ 12: Update $\Phi \leftarrow \Phi - \beta \nabla_\Phi \sum_{i=1}^{I} \big[ \mathcal{L}(f_{\theta_i}, \mathcal{D}_i^{ts}) + \mu_1 \mathcal{L}_t + \mu_2 \mathcal{L}_q \big]$
+
+ 13: end while
428
+
429
+
430
+ After getting the task representation qi and ti, the modulating function is then used to tailor the
431
+ task-specific information to the globally shared initialization θ0, which is formulated as:
432
+
433
+ $$\theta_{0i} = \sigma(\mathbf{W}_g(\mathbf{t}_i \oplus \mathbf{q}_i) + \mathbf{b}_g) \circ \theta_0, \quad (10)$$
+
+ where $\mathbf{W}_g$ and $\mathbf{b}_g$ are learnable parameters of a fully connected layer. Note that we adopt the Sigmoid
436
+ gating as exemplary and more discussion about different modulating functions can be found in
437
+ ablation studies of Section 5.
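As a concrete illustration of Equation 10, here is a minimal sketch of sigmoid gating over a flattened initialization, assuming PyTorch; the function name and the flat-parameter layout are hypothetical conveniences, not the authors' implementation.

```python
import torch

def modulate_init(theta0_flat, q_i, t_i, W_g, b_g):
    """Eq. (10): sigmoid-gate the shared initialization theta_0 with task representations q_i, t_i.

    theta0_flat: (P,) flattened shared initialization; W_g: (P, 2 * d_task); b_g: (P,).
    """
    gate = torch.sigmoid(W_g @ torch.cat([t_i, q_i]) + b_g)   # element-wise gate in (0, 1)
    return gate * theta0_flat                                  # task-specific initialization theta_0i
```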
438
+
439
+ For each task Ti, we perform the gradient descent process from θ0i and reach its optimal parameter θi.
440
+ Combining the reconstruction loss Lt and Lq with the meta-learning loss defined in equation 1, the
441
+ overall objective function of ARML is:
442
+
443
+ $$\min_\Phi \mathcal{L}_{all} = \min_\Phi \mathcal{L} + \mu_1 \mathcal{L}_t + \mu_2 \mathcal{L}_q = \min_\Phi \sum_{i=1}^{I} \mathcal{L}\big(f_{\theta_{0i} - \alpha \nabla_\theta \mathcal{L}(f_\theta, \mathcal{D}_i^{tr})}, \mathcal{D}_i^{ts}\big) + \mu_1 \mathcal{L}_t + \mu_2 \mathcal{L}_q, \quad (11)$$
450
+
451
+ where µ1 and µ2 are introduced to balance the importance of these three items. Φ represents all
452
+ learnable parameters. The meta-training process of ARML is shown in Alg. 1. The
453
+ details of the meta-testing process of ARML are available in Appendix A.
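As a minimal illustration of Equation 11 and line 12 of Algorithm 1, the sketch below combines the meta-loss with the two reconstruction terms, assuming PyTorch tensors `meta_loss`, `loss_t`, `loss_q` computed as in the earlier sketches; the balancing-factor values are placeholders.

```python
def arml_outer_loss(meta_loss, loss_t, loss_q, mu1=0.01, mu2=0.01):
    """Eq. (11): total objective L_all = L + mu1 * L_t + mu2 * L_q (mu values assumed)."""
    return meta_loss + mu1 * loss_t + mu2 * loss_q

# Typical usage inside the outer loop (Alg. 1, line 12):
#   total = arml_outer_loss(meta_loss, loss_t, loss_q)
#   total.backward(); optimizer.step()
```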
454
+
455
+ 5 EXPERIMENTS
456
+
457
+ In this section, we conduct extensive experiments to demonstrate the effectiveness of the ARML on
458
+ 2D regression and few-shot classification with the goal of answering the following questions: (1) Can
459
+ ARML outperform other meta-learning methods?; (2) Can our proposed components improve the
460
+ learning performance?; (3) Can ARML framework improve the model interpretability by discovering
461
+ reasonable meta-knowledge graph?
462
+
463
+ 5.1 EXPERIMENTAL SETTINGS
464
+
465
+ **Methods for Comparison** We compare our proposed ARML with two types of baselines: gradient-based meta-learning algorithms and non-parametric meta-learning algorithms.
466
+
467
+ _For gradient-based meta-learning methods: both globally shared methods (MAML (Finn et al.,_
468
+ 2017), Meta-SGD (Li et al., 2017)) and task-specific methods (MT-Net (Lee & Choi, 2018), MUMOMAML (Vuorio et al., 2019), HSML (Yao et al., 2019)) are considered for comparison.
469
+
470
+ _For non-parametric meta-learning methods: we select globally shared method Prototypical Network_
471
+ (ProtoNet) (Snell et al., 2017) and task-specific method TADAM (Oreshkin et al., 2018) as baselines.
472
+ Note that, following the traditional settings, non-parametric baselines are only used in few-shot
473
+ classification problem. The detailed implementations of baselines are discussed in Appendix B.3.
474
+
475
+
476
+ -----
477
+
478
+ **Hyperparameter Settings** For the aggregation functions in the autoencoder structure ($\mathrm{AG}^t$, $\mathrm{AG}^t_{dec}$,
+ $\mathrm{AG}^q$, $\mathrm{AG}^q_{dec}$), we use a GRU as the encoder and decoder in this autoencoder framework. We
+ adopt a one-layer GCN (Kipf & Welling, 2017) with tanh activation as the implementation of the GNN
+ in equation 7. For the modulation network, we try sigmoid, tanh, and FiLM modulation and find that
+ sigmoid modulation performs better. Thus, in the following experiments, we use sigmoid modulation as the
+ modulating function. More detailed discussion of the experimental settings is presented in Appendix B.
484
+
485
+ 5.2 2D REGRESSION
486
+
487
+ **Dataset Description.** In 2D regression problem, we adopt the similar regression problem settings
488
+ as (Finn et al., 2018; Vuorio et al., 2019; Yao et al., 2019; Rusu et al., 2019), which includes several
489
+ families of functions. In this paper, to model more complex relational structures, we design a 2D
490
+ regression problem rather than the traditional 1D regression. Inputs $x \sim U[0.0, 5.0]$ and $y \sim U[0.0, 5.0]$
+ are sampled randomly, and random Gaussian noise with standard deviation 0.3 is added to the
+ output. Furthermore, six underlying families of functions are selected:
+ (1) Sinusoids: $z(x, y) = a_s \sin(w_s x + b_s)$, where $a_s \sim U[0.1, 5.0]$, $b_s \sim U[0, 2\pi]$, $w_s \sim U[0.8, 1.2]$;
+ (2) Line: $z(x, y) = a_l x + b_l$, where $a_l \sim U[-3.0, 3.0]$, $b_l \sim U[-3.0, 3.0]$;
+ (3) Quadratic: $z(x, y) = a_q x^2 + b_q x + c_q$, where $a_q \sim U[-0.2, 0.2]$, $b_q \sim U[-2.0, 2.0]$, $c_q \sim U[-3.0, 3.0]$;
+ (4) Cubic: $z(x, y) = a_c x^3 + b_c x^2 + c_c x + d_c$, where $a_c \sim U[-0.1, 0.1]$, $b_c \sim U[-0.2, 0.2]$, $c_c \sim U[-2.0, 2.0]$, $d_c \sim U[-3.0, 3.0]$;
+ (5) Quadratic Surface: $z(x, y) = a_{qs} x^2 + b_{qs} y^2$, where $a_{qs} \sim U[-1.0, 1.0]$, $b_{qs} \sim U[-1.0, 1.0]$;
+ (6) Ripple: $z(x, y) = \sin(-a_r(x^2 + y^2)) + b_r$, where $a_r \sim U[-0.2, 0.2]$, $b_r \sim U[-3.0, 3.0]$.
+ Note that functions 1-4 are located in the subspace of $y = 1$. Following (Finn et al., 2017), we use two fully connected layers with
+ 40 neurons as the base model. The number of vertices of the meta-knowledge graph is set as 6.
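The task distribution above is straightforward to reproduce; here is a minimal sketch for the Sinusoids family (the other five families follow the same pattern), using NumPy; the function and argument names are illustrative.

```python
import numpy as np

def sample_sinusoid_task(k_shot=10, noise_std=0.3, rng=None):
    """Sketch of one 10-shot task from the Sinusoids family: z(x, y) = a*sin(w*x + b) + noise."""
    rng = rng or np.random.default_rng()
    a, b, w = rng.uniform(0.1, 5.0), rng.uniform(0.0, 2 * np.pi), rng.uniform(0.8, 1.2)
    xy = rng.uniform(0.0, 5.0, size=(k_shot, 2))                 # inputs (x, y) ~ U[0, 5]^2
    z = a * np.sin(w * xy[:, 0] + b) + rng.normal(0.0, noise_std, size=k_shot)
    return xy, z
```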
501
+
502
+ **Results and Analysis.** In Figure 2, we summarize the interpretation of meta-knowledge graph
503
+ (see top figure) and the quantitative results (see bottom table) of 10-shot 2D regression. In the
504
+ bottom table, we can observe that ARML achieves the best performance as compared to competitive
505
+ gradient-based meta-learning methods, i.e., globally shared models and task-specific models. This
506
+ finding demonstrates that the meta-knowledge graph is necessary to model and capture task-specific
507
+ information. The superior performance can also be interpreted in the top figure. In the left, we
508
+ show the heatmap between prototypes and meta-knowledge vertices (deeper color means higher
509
+ similarity). We can see that sinusoids and line activate V1 and V4, which may represent curve and
510
+ line, respectively. V1 and V4 also contribute to quadratic and quadratic surface, which also show
511
+ the similarity between these two families of functions. V3 is activated in P0 of all functions and the
512
+ quadratic surface and ripple further activate V1 in P0, which may show the difference between 2D
+ functions and 3D functions (sinusoid, line, quadratic and cubic lie in the subspace). Specifically,
+ in the right figure, we illustrate the meta-knowledge graph, where we set a threshold to filter out
+ links with low similarity scores and show the rest. We can see that V3 is the most popular vertex and
516
525
+ connected with V1, V5 (represent curve) and V4 (represent line). V1 is further connected with V5,
526
+ demonstrating the similarity of curve representation.
527
+
528
+ [Figure 2 (top): similarity heatmap between prototypes (P0, P1) and meta-knowledge vertices (V0-V5), and the learned meta-knowledge graph connecting V0-V5 across the six function families (Sinusoids, Line, Quadratic, Cubic, Quadratic Surface, Ripple).]
+
+ | Model | MAML | Meta-SGD | MT-Net | MUMOMAML | HSML | ARML |
+ | --- | --- | --- | --- | --- | --- | --- |
+ | 10-shot | 2.292 ± 0.163 | 2.908 ± 0.229 | 1.757 ± 0.120 | 0.523 ± 0.036 | 0.494 ± 0.038 | **0.438 ± 0.029** |
547
+ Figure 2: In the top figure, we show the interpretation of meta-knowledge graph. The left heatmap
548
+ shows the similarity between prototypes (P0, P1) and meta-knowledge vertices (V0-V5). The right
549
+ part shows the meta-knowledge graph. In the bottom table, we show the overall performance (mean
550
+ square error with 95% confidence) of 10-shot 2D regression.
551
+
552
+
553
+ -----
554
+
555
+ 5.3 FEW-SHOT CLASSIFICATION
556
+
557
+ **Dataset Description and Settings** In the few-shot classification problem, we first use the benchmark proposed in (Yao et al., 2019), where four fine-grained image classification datasets are included
558
+ (i.e., CUB-200-2011 (Bird), Describable Textures Dataset (Texture), FGVC of Aircraft (Aircraft),
559
+ and FGVCx-Fungi (Fungi)). For each few-shot classification task, it samples classes from one of four
560
+ datasets. In this paper, we call this dataset as Plain-Multi and each fine-grained dataset as subdataset.
561
+
562
+ Then, to demonstrate the effectiveness of our proposed model for handling more complex underlying
563
+ structures, in this paper, we increase the difficulty of few-shot classification problem by introducing
564
+ two image filters: a blur filter and a pencil filter. Similar to (Jerfel et al., 2019), for each image in Plain-Multi, one artistic filter is applied to simulate a changing distribution of few-shot classification
+ tasks. After applying the filters, the total number of subdatasets is 12 and each task is sampled from
+ one of them. This dataset is named Art-Multi. More detailed descriptions of the effects of the different
+ filters are discussed in Appendix C.
568
+
569
+ Following the traditional meta-learning settings, all datasets are divided into meta-training, metavalidation and meta-testing classes. The traditional N-way K-shot settings are used to split training and
570
+ test set for each task. We adopt the standard four-block convolutional layers as the base learner (Finn
571
+ et al., 2017; Snell et al., 2017). The number of vertices of meta-knowledge graph for Plain-Multi
572
+ and Art-Multi datasets are set as 4 and 8, respectively. Additionally, for miniImagenet, where, similar
+ to (Finn et al., 2018), tasks are constructed from a single domain and do not exhibit heterogeneity,
+ we compare our proposed ARML with other baselines and present the results in Appendix D.
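For concreteness, a minimal sketch of N-way K-shot episode construction as used in these benchmarks; the `class_to_images` mapping and the query-set size are assumptions for illustration, not the authors' data pipeline.

```python
import random

def sample_episode(class_to_images, n_way=5, k_shot=5, k_query=15):
    """Build one N-way K-shot task: a support (training) set and a query (test) set."""
    classes = random.sample(list(class_to_images), n_way)
    support, query = [], []
    for label, cls in enumerate(classes):
        images = random.sample(class_to_images[cls], k_shot + k_query)
        support += [(img, label) for img in images[:k_shot]]   # D_i^tr
        query += [(img, label) for img in images[k_shot:]]     # D_i^ts
    return support, query
```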
575
+
576
+ 5.3.1 PERFORMANCE VALIDATION
577
+
578
+ **Overall Quantitative Analyses** Experimental results for Plain-Multi and Art-Multi are shown in
+ Table 1 and Table 2, respectively. For each dataset, the accuracy with a 95% confidence
+ interval is reported. Note that, due to the space limitation, for the Art-Multi dataset, we only show
+ the average value for each filter; the full results are shown in Table 9 of Appendix E. In
+ these two tables, first, we can observe that task-specific models (MT-Net, MUMOMAML, HSML,
+ TADAM) significantly outperform globally shared models (MAML, Meta-SGD, ProtoNet) in both
+ the gradient-based and non-parametric meta-learning research lines. Second, comparing ARML with
585
+ other task-specific gradient-based meta-learning methods, the better performance confirms that
586
+ ARML can model and extract task-specific information more accurately by leveraging the constructed
587
+ meta-knowledge graph. Especially, the performance gap between the ARML and HSML verifies the
588
+ benefits of relational structure compared with isolated clustering structure. Finally, as a gradient-based
589
+ meta-learning algorithm, ARML can also outperform ProtoNet and TADAM, two representative
590
+ non-parametric meta-learning algorithms.
591
+
592
+ Table 1: Overall few-shot classification results (accuracy ± 95% confidence) on Plain-Multi dataset.
593
+
594
+ | Settings | Algorithms | Data: Bird | Data: Texture | Data: Aircraft | Data: Fungi |
+ | --- | --- | --- | --- | --- | --- |
+ | 5-way 1-shot | MAML | 53.94 ± 1.45% | 31.66 ± 1.31% | 51.37 ± 1.38% | 42.12 ± 1.36% |
+ | | MetaSGD | 55.58 ± 1.43% | 32.38 ± 1.32% | 52.99 ± 1.36% | 41.74 ± 1.34% |
+ | | MT-Net | 58.72 ± 1.43% | 32.80 ± 1.35% | 47.72 ± 1.46% | 43.11 ± 1.42% |
+ | | MUMOMAML | 56.82 ± 1.49% | 33.81 ± 1.36% | 53.14 ± 1.39% | 42.22 ± 1.40% |
+ | | HSML | 60.98 ± 1.50% | 35.01 ± 1.36% | 57.38 ± 1.40% | 44.02 ± 1.39% |
+ | | ProtoNet | 54.11 ± 1.38% | 32.52 ± 1.28% | 50.63 ± 1.35% | 41.05 ± 1.37% |
+ | | TADAM | 56.58 ± 1.34% | 33.34 ± 1.27% | 53.24 ± 1.33% | 43.06 ± 1.33% |
+ | | ARML | 62.33 ± 1.47% | 35.65 ± 1.40% | 58.56 ± 1.41% | 44.82 ± 1.38% |
+ | 5-way 5-shot | MAML | 68.52 ± 0.79% | 44.56 ± 0.68% | 66.18 ± 0.71% | 51.85 ± 0.85% |
+ | | MetaSGD | 67.87 ± 0.74% | 45.49 ± 0.68% | 66.84 ± 0.70% | 52.51 ± 0.81% |
+ | | MT-Net | 69.22 ± 0.75% | 46.57 ± 0.70% | 63.03 ± 0.69% | 53.49 ± 0.83% |
+ | | MUMOMAML | 70.49 ± 0.76% | 45.89 ± 0.69% | 67.31 ± 0.68% | 53.96 ± 0.82% |
+ | | HSML | 71.68 ± 0.73% | 48.08 ± 0.69% | 73.49 ± 0.68% | 56.32 ± 0.80% |
+ | | ProtoNet | 68.67 ± 0.72% | 45.21 ± 0.67% | 65.29 ± 0.68% | 51.27 ± 0.81% |
+ | | TADAM | 69.13 ± 0.75% | 45.78 ± 0.65% | 69.87 ± 0.66% | 53.15 ± 0.82% |
+ | | ARML | 73.34 ± 0.70% | 49.67 ± 0.67% | 74.88 ± 0.64% | 57.55 ± 0.82% |
608
+
609
+ -----
610
+
611
+ Table 2: Overall few-shot classification results (accuracy ± 95% confidence) on Art-Multi dataset.
612
+
613
+ |Settings|Algorithms|Avg. Origninal Avg. Blur Avg. Pencil|
614
+ |---|---|---|
615
+
616
+
617
+ |MAML 42.70 ± 1.35% 40.53 ± 1.38% 36.71 ± 1.37% MetaSGD 44.21 ± 1.38% 42.36 ± 1.39% 37.21 ± 1.39% MT-Net 43.94 ± 1.40% 41.64 ± 1.37% 37.79 ± 1.38% 5-way, 1-shot MUMOMAML 45.63 ± 1.39% 41.59 ± 1.38% 39.24 ± 1.36% HSML 45.68 ± 1.37% 42.62 ± 1.38% 39.78 ± 1.36% Protonet 42.08 ± 1.34% 40.51 ± 1.37% 36.24 ± 1.35% TADAM 44.73 ± 1.33% 42.44 ± 1.35% 39.02 ± 1.34% ARML 47.92 ± 1.34% 44.43 ± 1.34% 41.44 ± 1.34%|MAML MetaSGD MT-Net MUMOMAML HSML|42.70 ± 1.35% 40.53 ± 1.38% 36.71 ± 1.37% 44.21 ± 1.38% 42.36 ± 1.39% 37.21 ± 1.39% 43.94 ± 1.40% 41.64 ± 1.37% 37.79 ± 1.38% 45.63 ± 1.39% 41.59 ± 1.38% 39.24 ± 1.36% 45.68 ± 1.37% 42.62 ± 1.38% 39.78 ± 1.36%|
618
+ |---|---|---|
619
+ ||Protonet TADAM|42.08 ± 1.34% 40.51 ± 1.37% 36.24 ± 1.35% 44.73 ± 1.33% 42.44 ± 1.35% 39.02 ± 1.34%|
620
+ ||ARML|47.92 ± 1.34% 44.43 ± 1.34% 41.44 ± 1.34%|
621
+
622
+
623
+ |ARML 47.92 ± 1.34% 44.43 ± 1.34% 41.44 ± 1.34%|ARML|47.92 ± 1.34% 44.43 ± 1.34% 41.44 ± 1.34%|
624
+ |---|---|---|
625
+ |MAML 58.30 ± 0.74% 55.71 ± 0.74% 49.59 ± 0.73% MetaSGD 57.82 ± 0.72% 55.54 ± 0.73% 50.24 ± 0.72% MT-Net 57.95 ± 0.74% 54.65 ± 0.73% 49.18 ± 0.73% 5-way, 5-shot MUMOMAML 58.60 ± 0.75% 56.29 ± 0.72% 51.15 ± 0.73% HSML 60.63 ± 0.73% 57.91 ± 0.72% 53.93 ± 0.72% Protonet 58.12 ± 0.74% 55.07 ± 0.73% 50.15 ± 0.74% TADAM 60.35 ± 0.72% 58.36 ± 0.73% 53.15 ± 0.74% ARML 61.78 ± 0.74% 58.73 ± 0.75% 55.27 ± 0.73%|MAML MetaSGD MT-Net MUMOMAML HSML|58.30 ± 0.74% 55.71 ± 0.74% 49.59 ± 0.73% 57.82 ± 0.72% 55.54 ± 0.73% 50.24 ± 0.72% 57.95 ± 0.74% 54.65 ± 0.73% 49.18 ± 0.73% 58.60 ± 0.75% 56.29 ± 0.72% 51.15 ± 0.73% 60.63 ± 0.73% 57.91 ± 0.72% 53.93 ± 0.72%|
626
+ ||Protonet TADAM|58.12 ± 0.74% 55.07 ± 0.73% 50.15 ± 0.74% 60.35 ± 0.72% 58.36 ± 0.73% 53.15 ± 0.74%|
627
+ ||ARML|61.78 ± 0.74% 58.73 ± 0.75% 55.27 ± 0.73%|
628
+
629
+
630
+
631
+ **Model Ablation Study** In this section, we perform the ablation study of the proposed ARML to
632
+ demonstrate the effectiveness of each component. The results of ablation study on 5-way, 5-shot
633
+ scenario for Art-Multi dataset are presented in Table 3. In Appendix F, we also show the full results
634
+ for Art-Multi in Table 6 and the ablation study of Plain-Multi in Table 7. Specifically, to show
635
+ the effectiveness of prototype construction, in ablation I, we use the mean pooling aggregation
636
+ of each sample rather than the prototype-based relational graph to interact with meta-knowledge
637
+ graph. In ablation II, we use all samples to construct the sample-level relational graph without
638
+ using the prototype. Compared with ablation I and II, the better performance of ARML shows
639
+ that structuring samples as prototypes can (1) better handle the underlying relations and (2) alleviate the effect of
+ potential anomalies.
641
+
642
+ In ablation III, we remove the meta-knowledge graph and use the prototype-based relational graph
643
+ structure with aggregator $\mathrm{AG}^q$ as the task representation. The better performance of ARML demonstrates the effectiveness of the meta-knowledge graph for capturing the relational structure and facilitating
644
+ the classification performance. We further remove the reconstruction loss and show the results in
645
+ ablation IV and the results demonstrate that the autoencoder structure can benefit the process of
646
+ learning the representation.
647
+
648
+ In ablations V and VI, we change the modulating function to tanh and FiLM (Perez et al., 2018),
649
+ respectively. We can see that ARML is not very sensitive to the modulating function, and sigmoid
650
+ function is slightly better than other activation functions in most cases.
651
+
652
+ Table 3: Results (accuracy ± 95% confidence) of Ablation Models (5-way, 5-shot) on Art-Multi.
653
+
654
+ | Ablation Models | Ave. Original | Ave. Blur | Ave. Pencil |
+ | --- | --- | --- | --- |
+ | I. no prototype-based graph | 60.80 ± 0.74% | 58.36 ± 0.73% | 54.79 ± 0.73% |
+ | II. no prototype | 61.34 ± 0.73% | 58.34 ± 0.74% | 54.81 ± 0.73% |
+ | III. no meta-knowledge graph | 59.99 ± 0.75% | 57.79 ± 0.73% | 53.68 ± 0.74% |
+ | IV. no reconstruction loss | 59.07 ± 0.73% | 57.20 ± 0.74% | 52.45 ± 0.73% |
+ | V. tanh modulation | 62.34 ± 0.74% | 58.58 ± 0.75% | 54.01 ± 0.74% |
+ | VI. film modulation | 60.06 ± 0.75% | 57.47 ± 0.73% | 52.06 ± 0.74% |
+ | ARML | 61.78 ± 0.74% | 58.73 ± 0.75% | 55.27 ± 0.73% |
668
+
669
+
670
+ 5.3.2 ANALYSIS OF CONSTRUCTED META-KNOWLEDGE GRAPH
671
+
672
+ In this section, we conduct extensive analysis for the constructed meta-knowledge graph, which is
673
+ regarded as the key component in ARML. Due to the space limit, we only present the results on ArtMulti datasets. For Plain-Multi, the analysis with similar observations are discussed in Appendix G.
674
+
675
+
676
+ -----
677
+
678
+ **Performance vs. Number of Vertices** We first investigate the impact of the number of vertices in the meta-knowledge graph. The results are shown in Table 4. From the results, we can notice that the
+ performance saturates as the number of vertices reaches around 8. One potential reason is that 8
+ vertices are enough to capture the potential relations. If we had a larger dataset with more complex
+ relations, more vertices might be needed. In addition, if the meta-knowledge graph does not have enough
+ vertices, the worse performance suggests that the graph may not be able to capture enough relations
683
+ across tasks.
684
+
685
+ Table 4: Sensitivity analysis with different # of vertices in meta-knowledge graph (5-way, 5-shot).
686
+
687
+ | # of vertices | Ave. Original | Ave. Blur | Ave. Pencil |
+ | --- | --- | --- | --- |
+ | 4 | 61.18 ± 0.72% | 58.13 ± 0.73% | 54.88 ± 0.75% |
+ | 8 | 61.78 ± 0.74% | 58.73 ± 0.75% | 55.27 ± 0.73% |
+ | 12 | 61.66 ± 0.73% | 58.61 ± 0.72% | 55.07 ± 0.74% |
+ | 16 | 61.75 ± 0.73% | 58.67 ± 0.74% | 55.26 ± 0.73% |
+ | 20 | 61.91 ± 0.74% | 58.92 ± 0.73% | 55.24 ± 0.72% |
693
+
694
+
695
+
696
+ **Model Interpretation Analysis of Meta-Knowledge Graph** We then analyze the learned metaknowledge graph. For each subdataset, we randomly select one task as exemplary. For each task,
697
+ in the left part of Figure 3 we show the similarity heatmap between prototypes and vertices in
698
+ the meta-knowledge graph, where deeper color means higher similarity. V0-V7 and P0-P5 denote
+ the different vertices and prototypes, respectively. The meta-knowledge graph is also illustrated
+ in the right part. Similar to the graph in 2D regression, we set a threshold to filter links with low
+ similarity and illustrate the rest of them. First, we can see that V1 is mainly activated by bird
702
+ and aircraft (including all filters), which may reflect the shape similarity between bird and aircraft.
703
+ Second, V2, V3, V4 are firstly activated by texture and they form a loop in the meta-knowledge
704
+ graph. Especially, V2 also benefits images with blur and pencil filters. Thus, V2 may represent the
705
+ main texture and facilitate the training process on other subdatasets. The meta-knowledge graph also
706
+ shows the importance of V2 since it is connected with almost all other vertices. Third, when we use
707
+ blur filter, in most cases (bird blur, texture blur, fungi blur), V7 is activated. Thus, V7 may show the
708
+ similarity of images with the blur filter. In addition, the connections between V7 and V2 and V3 show that
+ classifying blurred images may depend on the texture information. Fourth, V6 (activated mostly by aircraft)
+ connects with V2 and V3, justifying the importance of texture information for classifying aircraft.
711
+
712
+ [Figure 3 diagram: similarity heatmaps for one task from each subdataset (Bird, Texture, Aircraft, Fungi, with original, blur, and pencil variants) and the learned meta-knowledge graph over vertices V0-V7; see the caption below.]
729
+
730
+
731
+ Figure 3: Interpretation of meta-knowledge graph on Art-Multi dataset. For each subdataset, we
732
+ randomly select one task from them. In the left, we show the similarity heatmap between prototypes
733
+ (P0-P5) and meta-knowledge vertices (V0-V7). In the right part, we show the meta-knowledge graph.
734
+
735
+ 6 CONCLUSION
736
+
737
+ In this paper, to improve the effectiveness of meta-learning for handling heterogeneous tasks, we
+ propose a new framework called ARML, which automatically extracts relations across tasks and
+ constructs a meta-knowledge graph. When a new task comes in, it can quickly find the most relevant
740
+ relations through the meta-knowledge graph and use this knowledge to facilitate its training process.
741
+ The experiments demonstrate the effectiveness of our proposed algorithm.
742
+
743
+
744
+ -----
745
+
746
+ REFERENCES
747
+
748
+ Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W Hoffman, David Pfau, Tom Schaul,
749
+ Brendan Shillingford, and Nando De Freitas. Learning to learn by gradient descent by gradient
750
+ descent. In NeurIPS, pp. 3981–3989, 2016.
751
+
752
+ Chelsea Finn and Sergey Levine. Meta-learning and universality: Deep representations and gradient
753
+ descent can approximate any learning algorithm. In ICLR, 2018.
754
+
755
+ Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of
756
+ deep networks. In ICML, pp. 1126–1135, 2017.
757
+
758
+ Chelsea Finn, Kelvin Xu, and Sergey Levine. Probabilistic model-agnostic meta-learning. In NeurIPS,
759
+ 2018.
760
+
761
+ Sebastian Flennerhag, Pablo G Moreno, Neil D Lawrence, and Andreas Damianou. Transferring
762
+ knowledge across learning processes. ICLR, 2019.
763
+
764
+ Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl. Neural
765
+ message passing for quantum chemistry. In ICML, pp. 1263–1272. JMLR. org, 2017.
766
+
767
+ Jonathan Gordon, John Bronskill, Matthias Bauer, Sebastian Nowozin, and Richard E Turner. Metalearning probabilistic inference for prediction. In ICLR, 2019.
768
+
769
+ Erin Grant, Chelsea Finn, Sergey Levine, Trevor Darrell, and Thomas Griffiths. Recasting gradientbased meta-learning as hierarchical bayes. In ICLR, 2018.
770
+
771
+ Jiatao Gu, Yong Wang, Yun Chen, Kyunghyun Cho, and Victor OK Li. Meta-learning for low-resource
772
+ neural machine translation. In EMNLP, 2018.
773
+
774
+ Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In
775
+ _NeurIPS, pp. 1024–1034, 2017._
776
+
777
+ Ghassen Jerfel, Erin Grant, Thomas L Griffiths, and Katherine Heller. Reconciling meta-learning and
778
+ continual learning with online mixtures of tasks. NeurIPS, 2019.
779
+
780
+ Bingyi Kang, Zhuang Liu, Xin Wang, Fisher Yu, Jiashi Feng, and Trevor Darrell. Few-shot object
781
+ detection via feature reweighting. In ICCV, 2019.
782
+
783
+ Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks.
784
+ In ICLR, 2017.
785
+
786
+ Yoonho Lee and Seungjin Choi. Gradient-based meta-learning with learned layerwise metric and
787
+ subspace. In ICML, pp. 2933–2942, 2018.
788
+
789
+ Zhenguo Li, Fengwei Zhou, Fei Chen, and Hang Li. Meta-sgd: Learning to learn quickly for few
790
+ shot learning. arXiv preprint arXiv:1707.09835, 2017.
791
+
792
+ Zhaojiang Lin, Andrea Madotto, Chien-Sheng Wu, and Pascale Fung. Personalizing dialogue agents
793
+ via meta-learning. 2019.
794
+
795
+ Ming-Yu Liu, Xun Huang, Arun Mallya, Tero Karras, Timo Aila, Jaakko Lehtinen, and Jan Kautz.
796
+ Few-shot unsupervised image-to-image translation. arXiv preprint arXiv:1905.01723, 2019.
797
+
798
+ Nikhil Mishra, Mostafa Rohaninejad, Xi Chen, and Pieter Abbeel. A simple neural attentive meta-learner. In ICLR, 2018.
799
+
800
+ Alex Nichol and John Schulman. Reptile: a scalable metalearning algorithm. arXiv preprint
801
+ _arXiv:1803.02999, 2018._
802
+
803
+ Boris Oreshkin, Pau Rodríguez López, and Alexandre Lacoste. Tadam: Task dependent adaptive
+ metric for improved few-shot learning. In NeurIPS, pp. 721–731, 2018.
805
+
806
+ Ethan Perez, Florian Strub, Harm de Vries, Vincent Dumoulin, and Aaron C. Courville. Film: Visual
807
+ reasoning with a general conditioning layer. In AAAI, 2018.
808
+
809
+
810
+ -----
811
+
812
+ Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. ICLR, 2016.
813
+
814
+ Andrei A Rusu, Dushyant Rao, Jakub Sygnowski, Oriol Vinyals, Razvan Pascanu, Simon Osindero,
815
+ and Raia Hadsell. Meta-learning with latent embedding optimization. In ICLR, 2019.
816
+
817
+ Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. In
818
+ _NeurIPS, pp. 4077–4087, 2017._
819
+
820
+ Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua
+ Bengio. Graph attention networks. In ICLR, 2018.
822
+
823
+ Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al. Matching networks for one
824
+ shot learning. In NeurIPS, pp. 3630–3638, 2016.
825
+
826
+ Risto Vuorio, Shao-Hua Sun, Hexiang Hu, and Joseph J Lim. Toward multimodal model-agnostic
827
+ meta-learning. NeurIPS, 2019.
828
+
829
+ Xin Wang, Fisher Yu, Ruth Wang, Trevor Darrell, and Joseph E Gonzalez. Tafe-net: Task-aware
830
+ feature embeddings for low shot learning. In CVPR, pp. 1831–1840, 2019.
831
+
832
+ Flood Sung, Yongxin Yang, Li Zhang, Tao Xiang, Philip HS Torr, and Timothy M Hospedales.
833
+ Learning to compare: Relation network for few-shot learning. In CVPR, 2018.
834
+
835
+ Huaxiu Yao, Ying Wei, Junzhou Huang, and Zhenhui Li. Hierarchically structured meta-learning. In
836
+ _ICML, pp. 7045–7054, 2019._
837
+
838
+ Jaesik Yoon, Taesup Kim, Ousmane Dia, Sungwoong Kim, Yoshua Bengio, and Sungjin Ahn.
839
+ Bayesian model-agnostic meta-learning. In NeurIPS, pp. 7343–7353, 2018.
840
+
841
+ Sung Whan Yoon, Jun Seo, and Jaekyun Moon. Tapnet: Neural network augmented with task-adaptive
842
+ projection for few-shot learning. In ICML, 2019.
843
+
844
+
845
+ -----
846
+
847
+ A ALGORITHM IN META-TESTING PROCESS
848
+
849
+ **Algorithm 2 Meta-Testing Process of ARML**
850
+
851
+ **Require:** Training data D_t^tr of a new task T_t
+
+ 1: Construct the prototype-based relational graph R_t by computing the prototypes in equation 2 and
+ the weights in equation 4
+ 2: Compute the similarity between each prototype and each meta-knowledge vertex in equation 6 and
+ construct the super-graph S_t
+ 3: Apply the GNN on the super-graph S_t and get the updated prototype representation Ĉ_{R_t}
+ 4: Aggregate C_{R_t} in equation 8 and Ĉ_{R_t} in equation 9 and get the representations q_t, t_t
+ 5: Compute the task-specific initialization θ_{0t} in equation 10
+ 6: Update parameters θ_t = θ_{0t} − α∇_θ L(f_θ, D_t^tr)
866
+
867
+
868
+ B HYPERPARAMETERS SETTINGS
869
+
870
+ B.1 2D REGRESSION
871
+
872
+ In the 2D regression problem, we set the inner-loop stepsize (i.e., α) and the outer-loop stepsize (i.e., β) to
+ 0.001 and 0.001, respectively. The embedding function E is a single layer with 40 neurons. The
+ autoencoder aggregator is built from gated recurrent structures. We set the meta-batch size to
+ 25 and the number of inner-loop gradient steps to 5.
876
+
877
+ B.2 FEW-SHOT IMAGE CLASSIFICATION
878
+
879
+ In few-shot image classification, for both the Plain-Multi and Art-Multi datasets, we set the
+ inner stepsize (i.e., α) to 0.001 and the outer stepsize (i.e., β) to 0.01. For the embedding function E,
+ we employ two convolutional layers with 3 × 3 filters. The channel size of these two convolutional
+ layers is 32. After the convolutional layers, we use two fully connected layers with 384 and 64 neurons,
+ respectively. Similar to the hyperparameter settings in 2D regression, the autoencoder aggregator
+ is built from gated recurrent structures, i.e., AG^t, AG^t_dec, AG^q, AG^q_dec are all GRUs. The
+ meta-batch size is set to 4. For the inner loop, we use 5 gradient steps.
886
+
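+ As a concrete illustration, the description above can be turned into the following minimal sketch of the
+ embedding function E. The framework, the 84 × 84 input resolution, and the pooling/normalization layers
+ are not specified here, so those choices below are assumptions made only for illustration.
+
+ ```python
+ import torch
+ import torch.nn as nn
+
+ class Embedding(nn.Module):
+     """Hypothetical sketch of E: two 3x3 conv layers (32 channels) + FC layers with 384 and 64 units."""
+     def __init__(self, in_channels=3, feat_size=21):  # feat_size assumes 84x84 inputs and two 2x2 poolings
+         super().__init__()
+         self.conv = nn.Sequential(
+             nn.Conv2d(in_channels, 32, kernel_size=3, padding=1), nn.BatchNorm2d(32), nn.ReLU(), nn.MaxPool2d(2),
+             nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.BatchNorm2d(32), nn.ReLU(), nn.MaxPool2d(2),
+         )
+         self.fc = nn.Sequential(
+             nn.Flatten(), nn.Linear(32 * feat_size * feat_size, 384), nn.ReLU(), nn.Linear(384, 64),
+         )
+
+     def forward(self, x):
+         return self.fc(self.conv(x))
+
+ emb = Embedding()
+ print(emb(torch.randn(4, 3, 84, 84)).shape)  # -> torch.Size([4, 64])
+ ```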
887
+ B.3 DETAILED BASELINE SETTINGS
888
+
889
+ For the gradient-based baselines (i.e., MAML, MetaSGD, MT-Net, BMAML, MUMOMAML,
+ HSML), we use the same inner-loop and outer-loop stepsizes as our ARML. As for the
+ non-parametric meta-learning algorithms, i.e., TADAM and the Prototypical Network, we use the
892
+ same meta-training and meta-testing process as gradient-based models. Additionally, TADAM uses
893
+ the same embedding function E as ARML for fair comparison (i.e., similar expressive ability).
894
+
895
+ C ADDITIONAL DISCUSSION OF DATASETS
896
+
897
+ In this dataset, we use pencil and blur filters to change the task distribution. To investigate the effect
+ of the pencil and blur filters, we provide one example in Figure 4. We can observe that different filters
+ result in different data distributions. All filters used are provided by OpenCV[1].
900
+
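+ A minimal sketch of how such filters could be produced with OpenCV is shown below. The exact filter
+ parameters used to build Art-Multi are not given here, so the kernel size, the pencil-sketch parameters,
+ and the file names are assumptions for illustration only.
+
+ ```python
+ import cv2
+
+ image = cv2.imread("example.jpg")  # hypothetical input image
+
+ # Blur filter: a simple Gaussian blur (the kernel size is an assumption).
+ blurred = cv2.GaussianBlur(image, (15, 15), 0)
+
+ # Pencil filter: cv2.pencilSketch returns grayscale and color pencil-style renderings.
+ pencil_gray, pencil_color = cv2.pencilSketch(image, sigma_s=60, sigma_r=0.07, shade_factor=0.05)
+
+ cv2.imwrite("example_blur.jpg", blurred)
+ cv2.imwrite("example_pencil.jpg", pencil_color)
+ ```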
901
+ D RESULTS ON MINIIMAGENET
902
+
903
+ For MiniImagenet, since it does not have the characteristic of task heterogeneity, we show the results in
+ Table 5. In this table, we compare ARML on the MiniImagenet dataset with other gradient-based meta-learning
+ models (the first four baselines are globally shared models and the next four are task-specific models).
+ Similar to Finn et al. (2018), we apply the standard 4-block convolutional layers for each
907
+
908
+ 1https://opencv.org/
909
+
910
+
911
+ -----
912
+
913
+ (a) : Plain Image (b) : with blur filter (c) : with pencil filter
914
+
915
+ Figure 4: Effect of different filters.
916
+
917
+ baseline. For MT-Net, we use the results reported in Yao et al. (2019), which control for the same
+ expressive power of the model. The results indicate that our proposed ARML outperforms the original
+ MAML and achieves comparable performance with task-specific models (e.g., MT-Net, PLATIPUS,
+ HSML). Most task-specific models achieve similar performance on this standard benchmark due
+ to the homogeneity between tasks.
922
+
923
+ Table 5: Performance comparison on the 5-way, 1-shot MiniImagenet dataset.
924
+
925
+ |Algorithms|5-way 1-shot Accuracy|
+ |---|---|
+ |MAML (Finn et al., 2017)|48.70 ± 1.84%|
+ |LLAMA (Finn & Levine, 2018)|49.40 ± 1.83%|
+ |Reptile (Nichol & Schulman, 2018)|49.97 ± 0.32%|
+ |MetaSGD (Li et al., 2017)|50.47 ± 1.87%|
+ |MT-Net (Lee & Choi, 2018)|49.75 ± 1.83%|
+ |MUMOMAML (Vuorio et al., 2019)|49.86 ± 1.85%|
+ |HSML (Yao et al., 2019)|50.38 ± 1.85%|
+ |PLATIPUS (Finn et al., 2018)|50.13 ± 1.86%|
+ |ARML|50.42 ± 1.73%|
936
+
937
+
938
+ E ADDITIONAL RESULTS OF FEW-SHOT IMAGE CLASSIFICATION
939
+
940
+ E.1 FULL OVERALL RESULTS TABLE OF ART-MULTI DATASET
941
+
942
+ We provide the full results table of the Art-Multi dataset in Table 9. In this table, we can see that our proposed
+ ARML outperforms almost all baselines on every sub-dataset.
944
+
945
+ F FURTHER INVESTIGATION OF ABLATION STUDY
946
+
947
+ In this section, we first show the full evaluation results of the model ablation study on the Art-Multi dataset
+ in Table 6. Note that, for the tanh activation (ablation model V), the performance is similar to that of
+ the sigmoid activation. On some sub-datasets, the results are even better. We choose the sigmoid
+ activation for ARML because it achieves better performance than the tanh activation on more
+ sub-datasets. Then, for the Plain-Multi dataset, we show the results in Table 7. The conclusion of the ablation study
+ on the Plain-Multi dataset is similar to the conclusion drawn from the results on Art-Multi. The
+ improvement on these two datasets verifies the necessity of the joint framework in ARML.
954
+
955
+ G ADDITIONAL ANALYSIS OF META-KNOWLEDGE GRAPH
956
+
957
+ In this section, we add more interpretation analysis of meta-knowledge graph. First, we show the full
958
+ evaluation results of sensitivity analysis on Art-Multi dataset in Table 8.
959
+
960
+
961
+ -----
962
+
963
+ Table 6: Full evaluation results of model ablation study on Art-Multi dataset. B, T, A, F represent
964
+ bird, texture, aircraft, fungi, respectively. Plain means original image.
965
+
966
+ |Model|B Plain|B Blur|B Pencil|T Plain|T Blur|T Pencil|
+ |---|---|---|---|---|---|---|
+ |I. no prototype-based graph|72.08%|71.06%|66.83%|45.23%|39.97%|41.67%|
+ |II. no prototype|72.99%|70.92%|67.19%|45.17%|40.05%|41.04%|
+ |III. no meta-knowledge graph|70.79%|69.53%|64.87%|43.37%|39.86%|41.23%|
+ |IV. no reconstruction loss|70.82%|69.87%|65.32%|44.02%|40.18%|40.52%|
+ |V. tanh|72.70%|69.53%|66.85%|45.81%|40.79%|38.64%|
+ |VI. film|71.52%|68.70%|64.23%|43.83%|40.52%|39.49%|
+ |ARML|73.05%|71.31%|67.14%|45.32%|40.15%|41.98%|
+
+ |Model|A Plain|A Blur|A Pencil|F Plain|F Blur|F Pencil|
+ |---|---|---|---|---|---|---|
+ |I. no prototype-based graph|70.06%|68.02%|60.66%|55.81%|54.39%|50.01%|
+ |II. no prototype|71.10%|67.59%|61.07%|56.11%|54.82%|49.95%|
+ |III. no meta-knowledge graph|69.97%|68.03%|59.72%|55.84%|53.72%|48.91%|
+ |IV. no reconstruction loss|66.83%|65.73%|55.98%|54.62%|53.02%|48.01%|
+ |V. tanh|73.96%|69.70%|60.75%|56.87%|54.30%|49.82%|
+ |VI. film|69.13%|66.93%|55.59%|55.77%|53.72%|48.92%|
+ |ARML|71.89%|68.59%|61.41%|56.83%|54.87%|50.53%|
1003
+
1004
+
1005
+ Table 7: Results of Model Ablation (5-way, 5-shot results) on Plain-Multi dataset.
1006
+
1007
+ |Ablation Models|Bird|Texture|Aircraft|Fungi|
+ |---|---|---|---|---|
+ |I. no sample-level graph|71.96 ± 0.72%|48.79 ± 0.67%|74.02 ± 0.65%|56.83 ± 0.80%|
+ |II. no prototype|72.86 ± 0.74%|49.03 ± 0.69%|74.36 ± 0.65%|57.02 ± 0.81%|
+ |III. no meta-knowledge graph|71.23 ± 0.75%|47.96 ± 0.68%|73.71 ± 0.69%|55.97 ± 0.82%|
+ |IV. no reconstruction loss|70.99 ± 0.74%|48.03 ± 0.69%|69.86 ± 0.66%|55.78 ± 0.83%|
+ |V. tanh|73.45 ± 0.71%|49.23 ± 0.66%|74.39 ± 0.65%|57.38 ± 0.80%|
+ |VI. film|72.95 ± 0.73%|49.18 ± 0.69%|73.82 ± 0.68%|56.89 ± 0.80%|
+ |ARML|73.34 ± 0.70%|49.67 ± 0.67%|74.88 ± 0.64%|57.55 ± 0.82%|
1021
+
1022
+
1023
+ Then, we analyze the meta-knowledge graph on the Plain-Multi dataset by visualizing the learned
+ meta-knowledge graph (as shown in Figure 5). In this figure, we can see that
+ different sub-datasets activate different vertices. Specifically, V2, which is mainly activated by texture,
+ plays a significant role in aircraft and fungi. Thus, V2 connects with V3 and V1 in the
+ meta-knowledge graph, which are mainly activated by fungi and aircraft, respectively. In addition,
+ V0 is also activated by aircraft because of the similar contour between aircraft and birds. Furthermore,
+ in the meta-knowledge graph, V0 connects with V3, which reflects the similarity of environment between
+ bird images and fungi images.
1030
+
1031
+
1032
+ -----
1033
+
1034
+ [Figure 5 graphic: meta-knowledge graph with vertices V0–V3 linked to the Bird, Texture, Aircraft, and Fungi sub-datasets.]
1050
+
1051
+ Figure 5: Interpretation of meta-knowledge graph on Plain-Multi dataset. For each subdataset, one
1052
+ task is randomly selected from them. In the left figure, we show the similarity heatmap between
1053
+ prototypes (P1-P5) and meta-knowledge vertices (denoted as E1-E4), where deeper color means
1054
+ higher similarity. In the right part, we show the meta-knowledge graph, where a threshold is also set
1055
+ to filter low similarity links.
1056
+
1057
+ Table 8: Full evaluation results of performance v.s. # vertices of meta-knowledge graph on Art-Multi.
1058
+ B, T, A, F represent bird, texture, aircraft, fungi, respectively. Plain means original image.
1059
+
1060
+ |# of Vertices|B Plain|B Blur|B Pencil|T Plain|T Blur|T Pencil|
+ |---|---|---|---|---|---|---|
+ |4|72.29%|70.36%|67.88%|45.37%|41.05%|41.43%|
+ |8|73.05%|71.31%|67.14%|45.32%|40.15%|41.98%|
+ |12|73.45%|70.64%|67.41%|44.53%|41.41%|41.05%|
+ |16|72.68%|70.18%|68.34%|45.63%|41.43%|42.18%|
+ |20|73.41%|71.07%|68.64%|46.26%|41.80%|41.61%|
+
+ |# of Vertices|A Plain|A Blur|A Pencil|F Plain|F Blur|F Pencil|
+ |---|---|---|---|---|---|---|
+ |4|70.98%|67.36%|60.46%|56.07%|53.77%|50.08%|
+ |8|71.89%|68.59%|61.41%|56.83%|54.87%|50.53%|
+ |12|71.78%|67.26%|60.97%|56.87%|55.14%|50.86%|
+ |16|71.96%|68.55%|61.14%|56.76%|54.54%|49.41%|
+ |20|72.02%|68.29%|60.59%|55.95%|54.53%|50.13%|
1075
+
1076
+
1077
+ -----
1078
+
1079
+ Table 9: Full results on the Art-Multi dataset (Section E.1). B, T, A, F represent bird, texture, aircraft,
+ fungi, respectively; Plain means the original image.
+
+ 5-way, 1-shot:
+
+ |Algorithms|B Plain|B Blur|B Pencil|T Plain|T Blur|T Pencil|A Plain|A Blur|A Pencil|F Plain|F Blur|F Pencil|
+ |---|---|---|---|---|---|---|---|---|---|---|---|---|
+ |MAML|55.27%|52.62%|48.58%|30.57%|28.65%|28.39%|45.59%|42.24%|34.52%|39.37%|38.58%|35.38%|
+ |MetaSGD|55.23%|53.08%|48.18%|29.28%|28.70%|28.38%|51.24%|47.29%|35.98%|41.08%|40.38%|36.30%|
+ |MT-Net|56.99%|54.21%|50.25%|32.13%|29.63%|29.23%|43.64%|40.08%|33.73%|43.02%|42.64%|37.96%|
+ |MUMOMAML|57.73%|53.18%|50.96%|31.88%|29.72%|29.90%|49.95%|43.36%|39.61%|42.97%|40.08%|36.52%|
+ |HSML|58.15%|53.20%|51.09%|32.01%|30.21%|30.17%|49.98%|45.79%|40.87%|42.58%|41.29%|37.01%|
+ |ProtoNet|53.67%|50.98%|46.66%|31.37%|29.08%|28.48%|45.54%|43.94%|35.49%|37.71%|38.00%|34.36%|
+ |TADAM|54.76%|52.18%|48.85%|32.03%|29.90%|30.82%|50.42%|47.59%|40.17%|41.73%|40.09%|36.27%|
+ |ARML|59.67%|54.89%|52.97%|32.31%|30.77%|31.51%|51.99%|47.92%|41.93%|44.69%|42.13%|38.36%|
+
+ 5-way, 5-shot:
+
+ |Algorithms|B Plain|B Blur|B Pencil|T Plain|T Blur|T Pencil|A Plain|A Blur|A Pencil|F Plain|F Blur|F Pencil|
+ |---|---|---|---|---|---|---|---|---|---|---|---|---|
+ |MAML|71.51%|68.65%|63.93%|42.96%|39.59%|38.87%|64.68%|62.54%|49.20%|54.08%|52.02%|46.39%|
+ |MetaSGD|71.31%|68.73%|64.33%|41.89%|37.79%|37.91%|64.88%|63.36%|52.31%|53.18%|52.26%|46.43%|
+ |MT-Net|71.18%|69.29%|68.28%|43.23%|39.42%|39.20%|63.39%|58.29%|46.12%|54.01%|51.70%|47.02%|
+ |MUMOMAML|71.57%|70.50%|64.57%|44.57%|40.31%|40.07%|63.36%|61.55%|52.17%|54.89%|52.82%|47.79%|
+ |HSML|71.75%|69.31%|65.62%|44.68%|40.13%|41.33%|70.12%|67.63%|59.40%|55.97%|54.60%|49.40%|
+ |ProtoNet|70.42%|67.90%|61.82%|44.78%|38.43%|38.40%|65.84%|63.41%|54.08%|51.45%|50.56%|46.33%|
+ |TADAM|70.08%|69.05%|65.45%|44.93%|41.80%|40.18%|70.35%|68.56%|59.09%|56.04%|54.04%|47.85%|
+ |ARML|73.05%|71.31%|67.14%|45.32%|40.15%|41.98%|71.89%|68.59%|61.41%|56.83%|54.87%|50.53%|
1111
+
1112
+
1113
+
1188
+
1189
+ -----
1190
+
ai_scientist/fewshot_examples/2_carpe_diem.json ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ {
2
+ "review": "{\n \"Summary\": \"This paper proposes Recency Bias, an adaptive mini batch selection method for training deep neural networks. To select informative minibatches for training, the proposed method maintains a fixed size sliding window of past model predictions for each data sample. At a given iteration, samples which have highly inconsistent predictions within the sliding window are added to the minibatch. The main contribution of this paper is the introduction of a sliding window to remember past model predictions, as an improvement over the SOTA approach: Active Bias, which maintains a growing window of model predictions. Empirical studies are performed to show the superiority of Recency Bias over two SOTA approaches. Results are shown on the task of (1) image classification from scratch and (2) image classification by fine-tuning pretrained networks.\",\n \"Strengths\": [\n \"The idea of using a sliding window over a growing window in active batch selection is interesting.\",\n \"Overall, the paper is well written. In particular, the Related Work section has a nice flow and puts the proposed method into context. Despite the method having limited novelty (sliding window instead of a growing window), the method has been well motivated by pointing out the limitations in SOTA methods.\",\n \"The results section is well structured. It's nice to see hyperparameter tuning results; and loss convergence graphs in various learning settings for each dataset.\"\n ],\n \"Weaknesses\": [\n \"The key concern about the paper is the lack of rigorous experimentation to study the usefulness of the proposed method. Despite the paper stating that there have been earlier work (Joseph et al, 2019 and Wang et al, 2019) that attempt mini-batch selection, the paper does not compare with them. This is limiting. Further, since the proposed method is not specific to the domain of images, evaluating it on tasks other than image classification, such as text classification for instance, would have helped validate its applicability across domains.\",\n \"Considering the limited results, a deeper analysis of the proposed method would have been nice. The idea of a sliding window over a growing window is a generic one, and there have been many efforts to theoretically analyze active learning over the last two decades. How does the proposed method fit in there? (For e.g., how does the expected model variance change in this setting?) Some form of theoretical/analytical reasoning behind the effectiveness of recency bias (which is missing) would provide greater insights to the community and facilitate further research in this direction.\",\n \"The claim of 20.5% reduction in test error mentioned in the abstract has not been clearly addressed and pointed out in the results section of the paper.\",\n \"The results would have been more complete if results were shown in a setting where just recency bias is used without the use of the selection pressure parameter. In other words, an ablation study on the effect of the selection pressure parameter would have been very useful.\",\n \"The intuition behind the method is described well, however, the proposed method would have been really solidified if it were analysed in the context of a simple machine learning problem (such as logistic regression). 
As an example, verifying if the chosen minibatch samples are actually close to the decision boundary of a model (even if the model is very simple) would have helped analyze the proposed method well.\"\n ],\n \"Originality\": 3,\n \"Quality\": 2,\n \"Clarity\": 4,\n \"Significance\": 2,\n \"Questions\": [\n \"How important is the warm-up phase to the proposed method? Considering the paper states that this is required to get good estimates of the quantization index of the samples, some ablation studies on reducing/increasing the warm-up phase and showing the results would have been useful to understand this.\",\n \"Fig 4: Why are there sharp dips periodically in all the graphs? What do these correspond to?\",\n \"The results are not conclusively in favor of the proposed method, and only is marginally better than the competitors. Why does online batch perform consistently than the proposed method? There is no discussion of these inferences from the results.\"\n ],\n \"Limitations\": [\n \"The primary concern is about the strength of the experimental results, which showed only a modest benefit on relatively simple datasets.\"\n ],\n \"Ethical Concerns\": false,\n \"Soundness\": 2,\n \"Presentation\": 3,\n \"Contribution\": 2,\n \"Overall\": 4,\n \"Confidence\": 3,\n \"Decision\": \"Reject\"\n}"
3
+ }
ai_scientist/fewshot_examples/2_carpe_diem.pdf ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:bae395d4e77efb99634b66a0c91616dab4d4af3d34e3e4eb745821e6ce7edcb1
3
+ size 858387
ai_scientist/fewshot_examples/2_carpe_diem.txt ADDED
@@ -0,0 +1,1035 @@
 
 
 
 
1
+ # CARPE DIEM, SEIZE THE SAMPLES UNCERTAIN “AT
2
+ ## THE MOMENT” FOR ADAPTIVE BATCH SELECTION
3
+
4
+ **Anonymous authors**
5
+ Paper under double-blind review
6
+
7
+ ABSTRACT
8
+
9
+ The performance of deep neural networks is significantly affected by how well
10
+ mini-batches are constructed. In this paper, we propose a novel adaptive batch
11
+ selection algorithm called Recency Bias that exploits the uncertain samples
12
+ predicted inconsistently in recent iterations. The historical label predictions of
13
+ each sample are used to evaluate its predictive uncertainty within a sliding window.
14
+ By taking advantage of this design, Recency Bias not only accelerates the training
15
+ step but also achieves a more accurate network. We demonstrate the superiority
16
+ of Recency Bias by extensive evaluation on two independent tasks. Compared with
17
+ existing batch selection methods, the results showed that Recency Bias reduced
18
+ the test error by up to 20.5% in a fixed wall-clock training time. At the same time,
19
+ it improved the training time by up to 59.3% to reach the same test error.
20
+
21
+ 1 INTRODUCTION
22
+
23
+ Stochastic gradient descent (SGD) for randomly selected mini-batch samples is commonly used to
24
+ train deep neural networks (DNNs). However, many recent studies have pointed out that the performance of DNNs is heavily dependent on how well the mini-batch samples are selected (Shrivastava
25
+ et al., 2016; Chang et al., 2017; Katharopoulos & Fleuret, 2018). In earlier approaches, a sample’s difficulty is employed to identify proper mini-batch samples, and these approaches achieve
26
+ a more accurate and robust network (Han et al., 2018) or expedite the training convergence of
27
+ SGD (Loshchilov & Hutter, 2016). However, the two opposing difficulty-based strategies, i.e., preferring easy samples (Kumar et al., 2010; Han et al., 2018) versus hard samples (Loshchilov & Hutter,
28
+ 2016; Shrivastava et al., 2016), work well in different situations. Thus, for practical reasons to cover
29
+ more diverse situations, recent approaches begin to exploit a sample’s uncertainty that indicates the
30
+ consistency of previous predictions (Chang et al., 2017; Song et al., 2019).
31
+
32
+ An important question here is how to evaluate the sample’s uncertainty based on its historical
33
+ predictions during the training process. Intuitively, because a series of historical predictions can
34
+ be seen as a series of data indexed in chronological order, the uncertainty can be measured based on
35
+ _two forms of handling time-series observations: (i) a growing window (Figure 1(a)) that consistently_
36
+ increases the size of a window to use all available observations and (ii) a sliding window (Figure 1(b))
37
+ that maintains a window of a fixed size on the most recent observations by deleting outdated ones.
38
+ While the state-of-the-art algorithm, Active Bias (Chang et al., 2017), adopts the growing window,
39
+ we propose to use the sliding window in this paper.
40
+
41
+ [Figure 1 graphic: (a) a growing window covering all available historical observations; (b) a sliding window covering only recent observations, with outdated ones discarded.]
54
+ (a) Growing Window. (b) Sliding Window.
55
+
56
+ Figure 1: Two forms of handling the time-series observations.
57
+
58
+ In more detail, Active Bias recognizes uncertain samples based on the inconsistency of the predictions
59
+ in the entire history of past SGD iterations. Then, it emphasizes such uncertain samples by choosing
60
+ them with high probability for the next mini-batch. However, according to our experiments presented
61
+
62
+
63
+ -----
64
+
65
+ [Figure 2 graphic: label-prediction histories of two horse images over previous training iterations. Both samples have inconsistent outdated predictions but consistent recent predictions (one always "Horse", i.e., too easy; one always "Deer", i.e., too hard), so Active Bias assigns them high uncertainty while Recency Bias assigns them low uncertainty.]
93
+ Figure 2: The difference in sample uncertainty estimated by Active Bias and Recency Bias.
94
+
95
+ in Section 5.2, such uncertain samples slowed down the convergence speed of training, though they
96
+ ultimately reduced the generalization error. This weakness is attributed to the inherent limitation of
97
+ the growing window, where older observations could be too outdated (Torgo, 2011). In other words,
98
+ the outdated predictions no longer represent a network’s current behavior. As illustrated in Figure
99
+ 2, when the label predictions of two samples were inconsistent for a long time, Active Bias invariably
100
+ regards them as highly uncertain, although their recent label predictions become consistent along
101
+ with the network’s training progress. This characteristic evidently entails the risk of emphasizing
102
+ uninformative samples that are too easy or too hard at the current moment, thereby slowing down
103
+ the convergence speed of training.
104
+
105
+ Therefore, we propose a simple but effective batch selection method, called Recency Bias, that takes
106
+ advantage of the sliding window to evaluate the uncertainty in fresher observations. As opposed to
107
+ _Active Bias, Recency Bias excludes the outdated predictions by managing a sliding window of a fixed_
108
+ size and picks up the samples predicted inconsistently within the sliding window. Thus, as shown
109
+ in Figure 2, the two samples uninformative at the moment are no longer selected by Recency Bias
110
+ simply because their recent predictions are consistent. Consequently, since informative samples are
111
+ effectively selected throughout the training process, this strategy not only accelerates the training
112
+ speed but also leads to a more accurate network.
113
+
114
+ To validate the superiority of Recency Bias, two popular convolutional neural networks (CNNs) were
115
+ trained for two independent tasks: image classification and fine tuning. We compared Recency Bias
116
+ with not only random batch selection (baseline) but also two state-of-the-art batch selection strategies.
117
+ Compared with three batch selection strategies, Recency Bias provided a relative reduction of test
118
+ error by 1.81%–20.5% in a fixed wall-clock training time. At the same time, it significantly reduced
119
+ the execution time by 24.6%–59.3% to reach the same test error.
120
+
121
+ 2 RELATED WORK
122
+
123
+ Let D = {(x_i, y_i) | 1 ≤ i ≤ N} be the entire training dataset composed of a sample x_i with its
+ true label y_i, where N is the total number of training samples. Then, a straightforward strategy to
+ construct a mini-batch M = {(x_i, y_i) | 1 ≤ i ≤ b} is to select b samples uniformly at random (i.e.,
+ P(x_i | D) = 1/N) from the training dataset D.
129
+
130
+ Because not all samples have an equal impact on training, many research efforts have been devoted
131
+ to develop advanced sampling schemes. Bengio et al. (2009) first took easy samples and then
132
+ gradually increased the difficulty of samples using heuristic rules. Kumar et al. (2010) determined the
133
+ easiness of the samples using their prediction errors. Recently, Tsvetkov et al. (2016) used Bayesian
134
+ optimization to learn an optimal curriculum for training dense, distributed word representations.
135
+ Sachan & Xing (2016) emphasized that the right curriculum must introduce a small number of the
136
+ samples dissimilar to those previously seen. Fan et al. (2017) proposed a neural data filter based on
137
+ reinforcement learning to select training samples adaptively. However, it is common for deep learning
138
+ to emphasize hard samples because of the plethora of easy ones (Katharopoulos & Fleuret, 2018).
139
+
140
+ Loshchilov & Hutter (2016) proposed a difficulty-based sampling scheme, called Online Batch,
141
+ that uses the rank of the loss computed from previous epochs. Online Batch sorts the previously
142
+ computed losses of samples in descending order and exponentially decays the sampling probability
143
+ of a sample according to its rank r. Then, the r-th ranked sample x(r) is selected with the probability
144
+ dropping by a factor of exp(log(s_e)/N), where s_e is the selection pressure parameter that affects
+ the probability gap between the most and the least important samples. When normalized to sum
+ to 1.0, the probability P(x_(r) | D; s_e) is defined by Eq. (1). It has been reported that Online Batch
148
+
149
+
150
+ -----
151
+
152
+ accelerates the convergence of training but deteriorates the generalization error because of the
153
+ overfitting to hard training samples (Loshchilov & Hutter, 2016).
154
+
155
+ P(x_(r) | D; s_e) = (1 / exp(log(s_e)/N)^r) / (Σ_{j=1}^{N} 1 / exp(log(s_e)/N)^j)    (1)
159
+
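+ The following minimal sketch illustrates Eq. (1) numerically; the variable names and the stand-in losses
+ are assumptions for illustration, not code from Loshchilov & Hutter (2016).
+
+ ```python
+ import numpy as np
+
+ def online_batch_probabilities(losses, s_e):
+     """Eq. (1): rank samples by loss (descending) and decay the sampling probability exponentially with rank."""
+     N = len(losses)
+     ranks = np.empty(N, dtype=int)
+     ranks[np.argsort(-losses)] = np.arange(1, N + 1)  # rank 1 = largest loss
+     weights = 1.0 / np.exp(np.log(s_e) / N) ** ranks
+     return weights / weights.sum()
+
+ losses = np.random.default_rng(0).exponential(size=1000)  # stand-in per-sample losses
+ p = online_batch_probabilities(losses, s_e=100.0)         # p[i]: probability of picking sample i
+ ```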
160
+ Most close to our work, Chang et al. (2017) devised an uncertainty-based sampling scheme, called
+ Active Bias, that chooses uncertain samples with high probability for the next batch. Active Bias
+ maintains the history H_i^{t−1} that stores all h(y_i | x_i) before the current iteration t (i.e., growing window),
+ where h(y_i | x_i) is the softmax probability of a given sample x_i for its true label y_i. Then, it measures
+ the uncertainty of the sample x_i by computing the variance over all h(y_i | x_i) in H_i^{t−1} and draws the
+ next mini-batch samples based on the normalized probability P(x_i | D, H_i^{t−1}; ϵ) in Eq. (2), where ϵ is
+ the smoothness constant to prevent the low variance samples from never being selected again. As
+ mentioned earlier in Section 1, Active Bias slows down the training process because the oldest part in
+ the history H_i^{t−1} no longer represents the current behavior of the network.
+
+ P(x_i | D, H_i^{t−1}; ϵ) = (ŝtd(H_i^{t−1}) + ϵ) / (Σ_{j=1}^{N} (ŝtd(H_j^{t−1}) + ϵ)),
+ ŝtd(H_i^{t−1}) = sqrt( var(h(y_i | x_i)) + var(h(y_i | x_i))^2 / |H_i^{t−1}| )    (2)
177
+
178
+ For the completeness of the survey, we include the recent studies on submodular batch selection.
179
+ Joseph et al. (2019) and Wang et al. (2019) designed their own submodular objectives that cover
180
+ diverse aspects, such as sample redundancy and sample representativeness, for more effective
181
+ batch selection. Differently from their work, we explore the issue of truly uncertain samples in
182
+ an orthogonal perspective. Our uncertainty measure can be easily injected into their submodular
183
+ optimization framework as a measure of sample informativeness.
184
+
185
+ In Section 5, we will confirm that Recency Bias outperforms Online Batch and Active Bias, which are
186
+ regarded as two state-of-the-art adaptive batch selection methods for deep learning.
187
+
188
+ 3 _Recency Bias COMPONENTS_
189
+
190
+ 3.1 CRITERION OF AN UNCERTAIN SAMPLE
191
+
192
+ The main challenge of Recency Bias is to identify the samples whose recent label predictions are
193
+ highly inconsistent, which are neither too easy nor too hard at the moment. Thus, we adopt the
194
+ _predictive uncertainty (Song et al., 2019) in Definition 3.1 that uses the information entropy (Chandler,_
195
+ 1987) to measure the inconsistency of recent label predictions. Here, the sample with high predictive
196
+ uncertainty is regarded as uncertain and selected with high probability for the next mini-batch.
197
+ **Definition 3.1. (Predictive Uncertainty) Let ŷ_t = Φ(x_i, θ_t) be the predicted label of a sample x_i at**
+ time t and H_{x_i}(q) = {ŷ_{t_1}, ŷ_{t_2}, . . ., ŷ_{t_q}} be the label history of the sample x_i that stores the predicted
+ labels at the previous q times, where Φ is a neural network. The label history H_{x_i}(q) corresponds
+ to the sliding window of size q to compute the uncertainty of the sample x_i. Next, p(y | x_i; q) is
+ formulated such that it provides the probability of the label y ∈ {1, 2, ..., k} estimated as the label of
+ the sample x_i based on H_{x_i}(q) as in Eq. (3), where [·] is the Iverson bracket[1].
+
+ p(y | x_i; q) = ( Σ_{ŷ ∈ H_{x_i}(q)} [ŷ = y] ) / |H_{x_i}(q)|    (3)
+
+ Then, to quantify the uncertainty of the sample x_i, the predictive uncertainty F(x_i; q) is defined by
+ Eq. (4), where δ is the standardization term to normalize the value to [0, 1].
+
+ F(x_i; q) = −(1/δ) Σ_{j=1}^{k} p(j | x_i; q) log p(j | x_i; q),  δ = − log(1/k)    (4) □
230
+
231
+ 1The Iverson bracket [p] returns 1 if p is true; 0 otherwise.
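+ A minimal numerical sketch of Definition 3.1 follows; the window contents and class count below are
+ invented for illustration and are not taken from the paper.
+
+ ```python
+ import numpy as np
+ from collections import deque
+
+ def predictive_uncertainty(label_history, num_classes):
+     """F(x; q): entropy of the labels predicted within the sliding window, normalized by -log(1/k)."""
+     counts = np.bincount(np.asarray(label_history), minlength=num_classes)
+     p = counts / len(label_history)
+     nonzero = p[p > 0]
+     entropy = -np.sum(nonzero * np.log(nonzero))
+     return entropy / (-np.log(1.0 / num_classes))
+
+ # Example: a sliding window (q = 10) of recent predictions for one sample, with k = 10 classes.
+ history = deque([3, 7, 3, 3, 7, 3, 7, 7, 3, 3], maxlen=10)
+ print(predictive_uncertainty(history, num_classes=10))  # ~0.29: moderately uncertain
+ ```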
232
+
233
+
234
+ -----
235
+
236
+ 3.2 SAMPLING PROBABILITY FOR MINI-BATCH CONSTRUCTION
237
+
238
+ To construct next mini-batch samples, we assign the sampling probability according to the predictive
239
+ uncertainty in Definition 3.1. Motivated by Loshchilov & Hutter (2016), the sampling probability
240
+ of a given sample xi is exponentially decayed with its predictive uncertainty F (xi; q). In detail,
241
+ we adopt the quantization method (Chen & Wornell, 2001) and use the quantization index to decay
242
+ the sampling probability. The index is obtained by the simple quantizer Q in Eq. (5), where ∆ is
243
+ the quantization step size. Compared with the rank-based index (Loshchilov & Hutter, 2016), the
244
+ quantization index is known to well reflect the difference in actual values (Widrow et al., 1996).
245
+
246
+ Q(F(x_i; q)) = ⌈(1 − F(x_i; q)) / ∆⌉,  0 ≤ F(x_i; q) ≤ 1    (5)
248
+
249
+ In Eq. (5), we set ∆ to be 1/N such that the index is bounded to N (the total number of samples).
+ Then, the sampling probability P(x_i | D; s_e) is defined as in Eq. (6). The higher the predictive
+ uncertainty, the smaller the quantization index. Therefore, a higher sampling probability is assigned
+ to uncertain samples in Eq. (6).
+
+ P(x_i | D; s_e) = (1 / exp(log(s_e)/N)^{Q(F(x_i; q))}) / (Σ_{j=1}^{N} 1 / exp(log(s_e)/N)^{Q(F(x_j; q))})    (6)
259
+
260
+ Meanwhile, it is known that using only some part of the training data exacerbates the overfitting problem
261
+ at a late stage of training (Loshchilov & Hutter, 2016; Zhou & Bilmes, 2018). Thus, to alleviate
262
+ the problem, we include more training samples as the training progresses by exponentially decaying
263
+ the selection pressure se as in Eq. (7). At each epoch e from e0 to eend, the selection pressure
264
+ _se exponentially decreases from se0 to 1. Because this technique gradually reduces the sampling_
265
+ probability gap between the most and the least uncertain samples, more diverse samples are selected
266
+ for the next mini-batch at a later epoch. When the selection pressure se becomes 1, the mini-batch
267
+ samples are randomly chosen from the entire dataset.
268
+
269
+ s_e = s_{e_0} · ( exp( log(1/s_{e_0}) / (e_end − e_0) ) )^{e − e_0}    (7)
272
+
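+ The sketch below strings Eqs. (5)-(7) together; the uncertainty values, epoch schedule, and batch size
+ are stand-ins chosen only for illustration.
+
+ ```python
+ import numpy as np
+
+ def selection_pressure(e, s_e0=100.0, e0=0, e_end=100):
+     """Eq. (7): decay s_e exponentially from s_e0 (at epoch e0) to 1 (at epoch e_end)."""
+     return s_e0 * np.exp(np.log(1.0 / s_e0) / (e_end - e0)) ** (e - e0)
+
+ def sampling_probabilities(uncertainties, s_e):
+     """Eqs. (5)-(6): quantize F(x; q) with step 1/N, then decay the probability exponentially with the index."""
+     N = len(uncertainties)
+     q_idx = np.ceil((1.0 - uncertainties) * N)         # Eq. (5) with delta = 1/N
+     weights = 1.0 / np.exp(np.log(s_e) / N) ** q_idx   # smaller index (higher uncertainty) -> larger weight
+     return weights / weights.sum()
+
+ rng = np.random.default_rng(0)
+ F = rng.uniform(size=1000)                             # stand-in predictive uncertainties
+ p = sampling_probabilities(F, selection_pressure(e=20))
+ batch = rng.choice(len(F), size=128, replace=False, p=p)  # indices of the next adaptive mini-batch
+ ```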
273
+ 4 _Recency Bias ALGORITHM_
274
+
275
+ **Algorithm 1 Recency Bias Algorithm**
276
+
277
+ INPUT: D: data, epochs, b: batch size, q: window size, s_{e_0}: initial selection pressure, γ: warm-up
+ OUTPUT: θ_t: model parameter
+
+ 1: t ← 1;
+ 2: θ_t ← Initialize the model parameter;
+ 3: for i = 1 to epochs do
+ 4:   /* Sampling Probability Derivation */
+ 5:   if i > γ then
+ 6:     s_e ← Decay_Selection_Pressure(s_{e_0}, i); /* Decaying s_e by Eq. (7) */
+ 7:     for m = 1 to N do /* Updating the index and the sampling probability in a batch */
+ 8:       q_dict[x_m] = Q(F(x_m; q)); /* By Eq. (5) */
+ 9:     p_table ← Compute_Prob(q_dict, s_e); /* By Eq. (6) */
+ 10:  /* Network Training */
+ 11:  for j = 1 to N/b do /* Mini-batch */
+ 12:    if i ≤ γ then /* Warm-up */
+ 13:      {(x_1, y_1), . . ., (x_b, y_b)} ← Randomly select next mini-batch samples;
+ 14:    else /* Adaptive batch selection */
+ 15:      {(x_1, y_1), . . ., (x_b, y_b)} ← Select next mini-batch samples based on p_table;
+ 16:    losses, labels ← Inference_Step({(x_1, y_1), . . ., (x_b, y_b)}, θ_t); /* Forward */
+ 17:    θ_{t+1} ← SGD_Step(losses, θ_t); /* Backward */
+ 18:    Update_Label_History(labels); /* By Definition 3.1 */
+ 19:    t ← t + 1;
+ 20: return θ_t;
321
+
322
+ Algorithm 1 describes the overall procedure of Recency Bias. The algorithm requires a warm-up
323
+ period of γ epochs because the quantization index for each sample is not confirmed yet. During
324
+ the warm-up period, which should be at least q epochs (γ ≥ _q) to obtain the label history of size_
325
+
326
+
327
+ -----
328
+
329
+ _q, randomly selected mini-batch samples are used for the network update (Lines 12–13). After the_
330
+ warm-up period, the algorithm decays the selection pressure se and updates not only the quantization
331
+ index but also the sampling probability in a batch at the beginning of each epoch (Lines 4–9).
332
+ Subsequently, the uncertain samples are selected for the next mini-batch according to the updated
333
+ sampling probability (Line 14–15), and then the label history is updated along with the network
334
+ update (Lines 16–19).
335
+
336
+ Overall, the key technical novelty of Recency Bias is to incorporate the notion of a sliding win_dow (Line 8) rather than a growing window into adaptive batch selection, thereby improving both_
337
+ training speed and generalization error.
338
+
339
+ **Time Complexity: The main “additional” cost of Recency Bias is the derivation of the sampling**
340
+ probability for each sample (Lines 4–9). Because only simple mathematical operations are needed
341
+ per sample, its time complexity is linear to the number of samples (i.e., O(N )), which is negligible
342
+ compared with that of the forward and backward steps of a complex network (Lines 16–17). Therefore,
343
+ we contend that Recency Bias does not add the complexity of an underlying optimization algorithm.
344
+
345
+ 5 EVALUATION
346
+
347
+ We empirically show the improvement of Recency Bias over not only Random Batch (baseline) but also
348
+ _Online Batch (Loshchilov & Hutter, 2016) and Active Bias (Chang et al., 2017), which are two state-_
349
+ of-the-art adaptive batch selections. In particular, we elaborate on the effect of the sliding window
350
+ approach (Recency Bias) compared with the growing window approach (Active Bias). Random Batch
351
+ selects next mini-batch samples uniformly at random from the entire dataset. Online Batch selects hard
352
+ samples based on the rank of the loss computed from previous epochs. Active Bias selects uncertain
353
+ samples with high variance of true label probabilities in the growing window. All the algorithms
354
+ were implemented using TensorFlow 1.8.0 and executed using a single NVIDIA Titan Volta GPU.
355
+ [For reproducibility, we provide the source code at https://github.com/anonymized.](https://github.com/anonymized)
356
+
357
+ Image classification and fine-tuning tasks were performed to validate the superiority of Recency Bias.
358
+ Because fine-tuning is used to quickly adapt to a new dataset, it is suitable to reap the benefit of fast
359
+ training speed. In support of reliable evaluation, we repeated every task thrice and reported the average
360
+ and standard error of the best test errors. The best test error in a given time has been widely used for
361
+ the studies on fast and accurate training (Katharopoulos & Fleuret, 2018; Loshchilov & Hutter, 2016).
362
+
363
+ 5.1 ANALYSIS ON SELECTED MINI-BATCH SAMPLES
364
+
365
+ For an in-depth analysis on selected samples, we plot the loss distribution of mini-batch samples
366
+ selected from CIFAR-10 by four different strategies in Figure 3. (i) The distribution of Online Batch
367
+ is the most skewed toward high loss by the design principle of selecting hard samples. (ii) Active Bias
368
+ emphasizes moderately hard samples at an early training stage in considering that its loss distribution
369
+ lies between those of Random Batch and Online Batch. However, owing to the outdated predictions
370
+ caused by the growing window, the proportion of easy samples with low loss increases at a late
371
+ training stage. These easy samples, which are misclassified as uncertain at that stage, tend to make the
372
+ convergence of training slow down. (iii) In contrast to Active Bias, by virtue of the sliding window,
373
+ the distribution of Recency Bias lies between those of Random Batch and Online Batch regardless of
374
+ the training stage. Consequently, Recency Bias continues to highlight the moderately hard samples,
375
+ which are likely to be informative, during the training process.
376
+
377
+ [Figure 3 graphic: loss histograms (log-scale loss on the x-axis) for Random Batch, Online Batch, Active Bias (growing window), and Recency Bias (sliding window).]
385
+
386
+
387
+ (a) Early Stage (30%). (b) Late Stage (70%).
388
+
389
+ Figure 3: The loss distribution of mini-batch samples selected by four batch selection strategies: (a)
390
+ and (b) show the loss distribution at the 30% and 70% of total training epochs, respectively.
391
+
392
+
393
+ -----
394
+
395
+ 5.2 TASK I: IMAGE CLASSIFICATION
396
+
397
+ **Experiment Setting: We trained DenseNet (L=40, k=12) and ResNet (L=50) with a momentum**
398
+ optimizer and an SGD optimizer on three benchmark datasets: MNIST (10 classes)[2], classification
399
+ of handwritten digits (LeCun, 1998), and CIFAR-10 (10 classes)[3] and CIFAR-100 (100 classes)[3],
400
+ classification of a subset of 80 million categorical images (Krizhevsky et al., 2014). Specifically, we
401
+ used data augmentation, batch normalization, a momentum of 0.9, and a batch size of 128. As for the
402
+ algorithm parameters, we fixed the window size q = 10 and the initial selection pressure se0 = 100,[4]
403
+ which were the best values found by the grid search (see Appendix A for details). The warm-up
404
+ epoch γ was set to be 15. To reduce the performance variance caused by randomly initialized model
405
+ parameters, all parameters were shared by all algorithms during the warm-up period. Regarding
406
+ the training schedule, we trained the network for 40, 000 iterations and used an initial learning rate
407
+ of 0.1, which was divided by 10 at 50% and 75% of the total number of training iterations.
408
+
409
+ **Results: Figure 4 shows the convergence curves of training loss and test error for four batch selection**
410
+ strategies using DenseNet and a momentum optimizer. In order to highlight the improvement of
411
+ _Recency Bias over the baseline (Random Batch), their lines are dark colored. The best test errors in_
412
+ Figures 4(b), 4(d), and 4(f) are summarized on the left side of Table 1.
413
+
414
+ In general, Recency Bias achieved the most accurate network while accelerating the training process
415
+ on all datasets. The training loss of Recency Bias converged faster (Figures 4(a), 4(c), and 4(e))
416
+ without the increase in the generalization error, thereby achieving the lower test error (Figures 4(b),
417
+ 4(d), and 4(f)). In contrast, the test error of Online Batch was not the best even if its training loss
418
+ converged the fastest among all strategies. As the training difficulty increased from CIFAR-10 to
419
+ CIFAR-100, the test error of Online Batch became even worse than that of Random Batch. That
420
+ is, emphasizing hard samples accelerated the training step but made the network overfit to hard
421
+ samples. Meanwhile, Active Bias tended to make the network generalize better on test data.
422
+ In CIFAR-10, despite its highest training loss, the test error of Active Bias was better than that of
423
+ _Random Batch. However, Active Bias slowed down the training process because of the limitation_
424
+ of growing windows, as discussed in Section 5.1. We note that, although both Recency Bias and
425
+ _Active Bias exploited uncertain samples, only Recency Bias based on sliding windows succeeded_
426
+ to not only speed up the training process but also reduce the generalization error.
427
+
428
+ The results of the best test error for ResNet or an SGD optimizer are summarized in Tables 1 and
429
+ 2 (see Appendix B for more details). Regardless of a neural network and an optimizer, Recency
430
+ _Bias achieved the lowest test error except in MNIST with an SGD optimizer. The improvement of_
431
+ _Recency Bias over the others was higher with an SGD optimizer than with a momentum optimizer._
432
+
433
+ Table 1: The best test errors (%) of four batch selection strategies using DenseNet.
434
+
435
+ |Optimizer|Momentum in Figure 4| | |SGD in Figure 7 (Appendix B.1)| | |
436
+ |---|---|---|---|---|---|---|
437
+ |Method|MNIST|CIFAR-10|CIFAR-100|MNIST|CIFAR-10|CIFAR-100|
438
+ |Random Batch|0.527 ± 0.03|7.33 ± 0.09|28.0 ± 0.16|1.23 ± 0.03|14.9 ± 0.09|40.2 ± 0.06|
439
+ |Online Batch|0.514 ± 0.01|7.00 ± 0.10|28.4 ± 0.25|0.765 ± 0.02|13.5 ± 0.02|40.7 ± 0.12|
440
+ |Active Bias|0.616 ± 0.03|7.07 ± 0.04|27.9 ± 0.11|0.679 ± 0.02|14.2 ± 0.25|42.9 ± 0.05|
441
+ |Recency Bias|0.490 ± 0.02|6.60 ± 0.02|27.1 ± 0.19|0.986 ± 0.06|13.2 ± 0.11|38.7 ± 0.11|
442
+
443
+
444
+
445
+ Table 2: The best test errors (%) of four batch selection strategies using ResNet.
446
+
447
+ |Optimizer|Momentum in Figure 8 (Appendix B.2)| | |SGD in Figure 9 (Appendix B.3)| | |
448
+ |---|---|---|---|---|---|---|
449
+ |Method|MNIST|CIFAR-10|CIFAR-100|MNIST|CIFAR-10|CIFAR-100|
450
+ |Random Batch|0.636 ± 0.04|10.2 ± 0.12|33.2 ± 0.07|1.16 ± 0.03|12.7 ± 0.09|40.1 ± 0.16|
451
+ |Online Batch|0.666 ± 0.05|10.1 ± 0.05|33.4 ± 0.01|0.890 ± 0.03|12.2 ± 0.08|40.7 ± 0.09|
452
+ |Active Bias|0.613 ± 0.04|10.6 ± 0.08|34.2 ± 0.07|0.804 ± 0.01|13.5 ± 0.07|45.6 ± 0.07|
453
+ |Recency Bias|0.607 ± 0.01|9.79 ± 0.04|32.4 ± 0.04|0.972 ± 0.03|11.6 ± 0.09|38.9 ± 0.14|
454
+
455
+
456
+
457
+ [2http://yann.lecun.com/exdb/mnist](http://yann.lecun.com/exdb/mnist)
458
+ [3https://www.cs.toronto.edu/~kriz/cifar.html](https://www.cs.toronto.edu/~kriz/cifar.html)
459
+ 4Online Batch also used the same decaying selection pressure value.
460
+
461
+
462
+ -----
463
+
464
+ [Figure 4 graphics: training loss and test error versus wall-clock time (s) for Random Batch, Online Batch, Active Bias, and Recency Bias.]
+
+ (a) MNIST Training Loss. (b) MNIST Test Error. (c) CIFAR-10 Training Loss. (d) CIFAR-10 Test Error. (e) CIFAR-100 Training Loss. (f) CIFAR-100 Test Error.
529
+
530
+ Figure 4: Convergence curves of four batch selection strategies using DenseNet with momentum.
531
+
532
+
533
+ 5.3 TASK II: FINE-TUNING
534
+
535
+ **Experiment Setting: We prepared DenseNet (L=121, k=32) previously trained on ImageNet (Deng**
536
+ et al., 2009) and then fine-tuned the network on two benchmark datasets: MIT-67 (67 classes)[5],
537
+ classification of indoor scenes (Quattoni & Torralba, 2009), and Food-100 (100 classes)[6], classification of popular foods in Japan (Kawano & Yanai, 2014). After replacing the last classification
538
+ layer, the network was trained end-to-end for 50 epochs with a batch size 32 and a constant learning
539
+ rate 2 × 10[−][4]. Data augmentation was not applied here. The other configurations were the same
540
+ as those in Section 5.2.
541
+
542
+ **Results on Test Error: Figure 5 shows the convergence curves of training loss and test error for**
543
+ the fine-tuning task on MIT-67 and Food-100. Overall, all convergence curves showed similar trends
544
+ to those of the classification task in Figure 4. Only Recency Bias converged faster than Random
545
+ _Batch in both training loss and test error. Online Batch converged the fastest in training loss, but_
546
+ its test error was rather higher than Random Batch owing to the overfitting. Active Bias converged the
547
+
548
+
549
+ [5http://web.mit.edu/torralba/www/indoor.html](http://web.mit.edu/torralba/www/indoor.html)
550
+ [6http://foodcam.mobi/dataset100.html](http://foodcam.mobi/dataset100.html)
551
+
552
+
553
+ -----
554
+
555
+ [Figure 5 graphics: training loss and test error versus wall-clock time (s) for Random Batch, Online Batch, Active Bias, and Recency Bias, with annotated training-time reductions of 24.6% (MIT-67) and 26.1% (Food-100) for Recency Bias over Random Batch.]
+
+ (a) MIT-67 Training Loss. (b) MIT-67 Test Error. (c) Food-100 Training Loss. (d) Food-100 Test Error.
622
+
623
+ Figure 5: Convergence curves for fine-tuning on two benchmark datasets.
624
+
625
+
626
+ Table 3: Recency Bias’s reduction in training time over other batch selection strategies.
627
+
628
+
629
+ |Method|MIT-67|FOOD-100|
630
+ |---|---|---|
631
+ |Random Batch|(5,218 − 3,936)/5,218 × 100 = 24.6%|(7,263 − 5,365)/7,263 × 100 = 26.1%|
+ |Online Batch|(6,079 − 3,823)/6,079 × 100 = 37.1%|(8,333 − 3,685)/8,333 × 100 = 55.8%|
+ |Active Bias|(5,738 − 3,032)/5,738 × 100 = 47.2%|(7,933 − 3,227)/7,933 × 100 = 59.3%|
634
+
635
+
636
+ slowest in both training loss and test error. Quantitatively, compared with Random Batch, Recency
637
+ _Bias reduced the test error by 2.88% and 1.81% in MIT-67 and Food-100, respectively._
638
+
639
+ **Results on Training Time: Moreover, to assess the performance gain in training time, we computed**
640
+ the reduction in the training time taken to reach the same error. For example, in Figure 5(b), the
641
+ best test error of 28.8% achieved in 5, 218 seconds by Random Batch could be achieved only in
642
+ 3, 936 seconds by Recency Bias; thus, Recency Bias improved the training time by 24.6%. Table
643
+ 3 summarizes the reduction in the training time of Recency Bias over three other batch selection
644
+ strategies. Notably, Recency Bias improved the training time by 24.6%–47.2% and 26.1%–59.3% in
645
+ fine-tuning MIT-67 and FOOD-100 datasets, respectively.
646
+
647
+ 6 CONCLUSION
648
+
649
+
650
+ In this paper, we presented a novel adaptive batch selection algorithm called Recency Bias that
651
+ emphasizes predictively uncertain samples for accelerating the training of neural networks. Toward
652
+ this goal, the predictive uncertainty of each sample is evaluated using its recent label predictions
653
+ managed by a sliding window of a fixed size. Then, uncertain samples at the moment are selected with
654
+ high probability for the next mini-batch. We conducted extensive experiments on both classification
655
+ and fine-tuning tasks. The results showed that Recency Bias is effective in reducing the training
656
+ time as well as the best test error. It is worth noting that using all historical observations to
657
+ estimate the uncertainty has the side effect of slowing down the training process. Overall, combining
658
+ uncertainty-based selection with a sliding window greatly improves the power of adaptive batch selection.
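+
+ A minimal sketch of the selection loop summarized above (illustrative only, not the paper's code: the
+ uncertainty measure and the probability assignment below are assumptions standing in for the exact
+ formulas defined in the method section; num_samples is the training-set size):
+
+ ```python
+ import numpy as np
+ from collections import Counter, deque
+
+ q = 10                 # sliding window size (Appendix A)
+ se = 100.0             # selection pressure (Appendix A)
+ num_samples = 50_000   # e.g., CIFAR-10 training set size
+
+ # recent predicted labels per sample, kept in a fixed-size window
+ histories = [deque(maxlen=q) for _ in range(num_samples)]
+
+ def uncertainty(hist):
+     # illustrative: disagreement among the recent predictions (1 = maximally uncertain)
+     if not hist:
+         return 1.0
+     most_common_count = Counter(hist).most_common(1)[0][1]
+     return 1.0 - most_common_count / len(hist)
+
+ def sampling_probs(histories, se):
+     # illustrative: rank samples by uncertainty and decay the probability geometrically
+     # so that the most uncertain sample is se times more likely than the least uncertain
+     u = np.array([uncertainty(h) for h in histories])
+     ranks = np.argsort(np.argsort(-u))             # 0 = most uncertain
+     n = len(u)
+     p = np.exp(-np.log(se) * ranks / max(n - 1, 1))
+     return p / p.sum()
+
+ # each step: draw the next mini-batch indices with these probabilities, then append the
+ # model's newly predicted labels to the corresponding histories
+ ```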
659
+
660
+
661
+ -----
662
+
663
+ REFERENCES
664
+
665
+ Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In
666
+ _ICML, pp. 41–48, 2009._
667
+
668
+ David Chandler. Introduction to modern statistical mechanics. Oxford University Press, 1987.
669
+
670
+ Haw-Shiuan Chang, Erik Learned-Miller, and Andrew McCallum. Active Bias: Training more
671
+ accurate neural networks by emphasizing high variance samples. In NeurIPS, pp. 1002–1012,
672
+ 2017.
673
+
674
+ Brian Chen and Gregory W Wornell. Quantization index modulation: A class of provably good
675
+ methods for digital watermarking and information embedding. IEEE Trans. on Information Theory,
676
+ 47(4):1423–1443, 2001.
677
+
678
+ Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale
679
+ hierarchical image database. In CVPR, pp. 248–255, 2009.
680
+
681
+ Yang Fan, Fei Tian, Tao Qin, and Tie-Yan Liu. Neural data filter for bootstrapping stochastic gradient
682
+ descent. In ICLR, 2017.
683
+
684
+ Bo Han, Quanming Yao, Xingrui Yu, Gang Niu, Miao Xu, Weihua Hu, Ivor Tsang, and Masashi
685
+ Sugiyama. Co-teaching: Robust training of deep neural networks with extremely noisy labels. In
686
+ _NeurIPS, pp. 8527–8537, 2018._
687
+
688
+ KJ Joseph, Krishnakant Singh, Vineeth N Balasubramanian, et al. Submodular batch selection for
689
+ training deep neural networks. In IJCAI, pp. 2677–3683, 2019.
690
+
691
+ Angelos Katharopoulos and François Fleuret. Not all samples are created equal: Deep learning with
692
+ importance sampling. In ICML, pp. 2525–2534, 2018.
693
+
694
+ Y. Kawano and K. Yanai. Food image recognition with deep convolutional features. In UbiComp,
695
+ 2014.
696
+
697
+ Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. CIFAR-10 and CIFAR-100 datasets, 2014.
698
+ [https://www.cs.toronto.edu/~kriz/cifar.html.](https://www.cs.toronto.edu/~kriz/cifar.html)
699
+
700
+ M Pawan Kumar, Benjamin Packer, and Daphne Koller. Self-paced learning for latent variable
701
+ models. In NeurIPS, pp. 1189–1197, 2010.
702
+
703
+ [Yann LeCun. The MNIST database of handwritten digits, 1998. http://yann.lecun.com/](http://yann.lecun.com/exdb/mnist)
704
+ [exdb/mnist.](http://yann.lecun.com/exdb/mnist)
705
+
706
+ Ilya Loshchilov and Frank Hutter. Online batch selection for faster training of neural networks. In
707
+ _ICLR, 2016._
708
+
709
+ Ariadna Quattoni and Antonio Torralba. Recognizing indoor scenes. In CVPR, pp. 413–420, 2009.
710
+
711
+ Mrinmaya Sachan and Eric Xing. Easy questions first? A case study on curriculum learning for
712
+ question answering. In ACL, pp. 453–463, 2016.
713
+
714
+ Abhinav Shrivastava, Abhinav Gupta, and Ross Girshick. Training region-based object detectors
715
+ with online hard example mining. In CVPR, pp. 761–769, 2016.
716
+
717
+ Hwanjun Song, Minseok Kim, and Jae-Gil Lee. SELFIE: Refurbishing unclean samples for robust
718
+ deep learning. In ICML, pp. 5907–5915, 2019.
719
+
720
+ Luis Torgo. Data mining with R: learning with case studies. Chapman and Hall/CRC, 2011.
721
+
722
+ Yulia Tsvetkov, Manaal Faruqui, Wang Ling, Brian MacWhinney, and Chris Dyer. Learning the
723
+ curriculum with bayesian optimization for task-specific word representation learning. In ACL, pp.
724
+ 130–139, 2016.
725
+
726
+ Shengjie Wang, Wenruo Bai, Chandrashekhar Lavania, and Jeff Bilmes. Fixing mini-batch sequences
727
+ with hierarchical robust partitioning. In AISTATS, pp. 3352–3361, 2019.
728
+
729
+
730
+ -----
731
+
732
+ Bernard Widrow, Istvan Kollar, and Ming-Chang Liu. Statistical theory of quantization. IEEE
733
+ _Transactions on instrumentation and measurement, 45(2):353–361, 1996._
734
+
735
+ Tianyi Zhou and Jeff Bilmes. Minimax curriculum learning: Machine teaching with desirable
736
+ difficulties and scheduled diversity. In ICLR, 2018.
737
+
738
+
739
+ -----
740
+
741
+ A HYPERPARAMETER SELECTION
742
+
743
+ _Recency Bias_ takes two hyperparameters: (i) the initial selection pressure se0, which determines
+ the sampling probability gap between the most and the least uncertain samples, and (ii) the window
+ size q, which determines how many recent label predictions are involved in estimating the uncertainty.
+ To decide the best hyperparameters, we trained ResNet (L=50) on CIFAR-10 and CIFAR-100 with a
+ momentum optimizer and chose the two values by a grid search over
+ se0 ∈ {1, 10, 100, 1000} and q ∈ {5, 10, 15}.
750
+
751
+
752
+
753
+ [Figure: best test error versus initial selection pressure se0 ∈ {1, 10, 100, 1000} for window sizes q = 5, 10, 15. Panels: (a) CIFAR-10, (b) CIFAR-100.]
778
+
779
+ Figure 6: Grid search on CIFAR-10 and CIFAR-100 datasets using ResNet.
780
+
781
+ Figure 6 shows the test errors of Recency Bias obtained by the grid search on the two datasets.
782
+ Regarding the initial selection pressure se0, the lowest test error was typically achieved when the
783
+ _se0 value was 100. As for the window size q, the test error was almost always the lowest when the q_
784
+ value was 10. Similar trends were observed for the other combinations of a neural network and an
785
+ optimizer. Therefore, in all experiments, we set se0 to be 100 and q to be 10.
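+
+ A hedged sketch of this grid search (illustrative; train_and_evaluate is a hypothetical helper that
+ trains the network with the given setting and returns the best test error):
+
+ ```python
+ best = None
+ for se0 in [1, 10, 100, 1000]:
+     for q in [5, 10, 15]:
+         err = train_and_evaluate(se0=se0, window_size=q)   # hypothetical helper
+         if best is None or err < best[0]:
+             best = (err, se0, q)
+ print(f"best test error {best[0]:.3f} with se0={best[1]}, q={best[2]}")
+ ```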
786
+
787
+
788
+ -----
789
+
790
+ B GENERALIZATION OF Recency Bias
+
+ CONVERGENCE CURVES USING DENSENET WITH SGD
+
+ Figure 7 shows the convergence curves of training loss and test error for four batch selection strategies
+ using DenseNet and an SGD optimizer, which corresponds to the right side of Table 1.
+
+ [Figure: convergence plots. Legend: Random Batch, Online Batch, Active Bias, Recency Bias. Panels: (a) MNIST Training Loss, (b) MNIST Test Error, (c) CIFAR-10 Training Loss, (d) CIFAR-10 Test Error, (e) CIFAR-100 Training Loss, (f) CIFAR-100 Test Error. x-axis: Time (s).]
877
+
878
+ Figure 7: Convergence curves of four batch selection strategies using DenseNet with SGD.
879
+
880
+
881
+ -----
882
+
883
+ CONVERGENCE CURVES USING RESNET WITH MOMENTUM
+
+ Figure 8 shows the convergence curves of training loss and test error for four batch selection strategies
+ using ResNet and a momentum optimizer, which corresponds to the left side of Table 2.
+
+ [Figure: convergence plots. Legend: Random Batch, Online Batch, Active Bias, Recency Bias. Panels: (a) MNIST Training Loss, (b) MNIST Test Error, (c) CIFAR-10 Training Loss, (d) CIFAR-10 Test Error, (e) CIFAR-100 Training Loss, (f) CIFAR-100 Test Error. x-axis: Time (s).]
928
+
929
+ Figure 8: Convergence curves of four batch selection strategies using ResNet with momentum.
930
+
931
+
932
+ -----
933
+
934
+ CONVERGENCE CURVES USING RESNET WITH SGD
+
+ Figure 9 shows the convergence curves of training loss and test error for four batch selection strategies
+ using ResNet and an SGD optimizer, which corresponds to the right side of Table 2.
+
+ [Figure: convergence plots. Legend: Random Batch, Online Batch, Active Bias, Recency Bias. Panels: (a) MNIST Training Loss, (b) MNIST Test Error, (c) CIFAR-10 Training Loss, (d) CIFAR-10 Test Error, (e) CIFAR-100 Training Loss, (f) CIFAR-100 Test Error. x-axis: Time (s).]
1031
+ Figure 9: Convergence curves of four batch selection strategies using ResNet with SGD.
1032
+
1033
+
1034
+ -----
1035
+
ai_scientist/fewshot_examples/attention.json ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ {
2
+ "review": "{\n \"Summary\": \"The paper proposes the Transformer, a novel neural network architecture that relies entirely on self-attention mechanisms, eschewing traditional recurrent and convolutional layers. This innovation allows the model to achieve state-of-the-art results in machine translation tasks with significant improvements in both training efficiency and translation quality. The paper includes detailed descriptions of the model architecture, including multi-head attention and positional encodings, as well as extensive experimental results to validate the model's performance.\",\n \"Questions\": [\n \"Could the authors provide more detailed comparisons with other recent models not included in Table 2?\",\n \"What is the impact of varying the number of layers (N) in both the encoder and decoder stacks?\",\n \"Can the authors provide more insights into the choice of hyperparameters, especially the learning rate schedule and warmup steps?\"\n ],\n \"Limitations\": [\n \"The paper does not explore the application of the Transformer to tasks beyond machine translation, such as image or audio processing.\",\n \"The discussion on the potential negative societal impacts of the model is minimal and could be expanded.\"\n ],\n \"Ethical Concerns\": false,\n \"Soundness\": 4,\n \"Presentation\": 3,\n \"Contribution\": 4,\n \"Overall\": 8,\n \"Confidence\": 5,\n \"Strengths\": [\n \"The Transformer model introduces a highly innovative use of self-attention mechanisms, replacing traditional recurrent and convolutional layers.\",\n \"Comprehensive experimental validation showing state-of-the-art performance in machine translation tasks.\",\n \"Clear and detailed description of the model architecture and its components, facilitating reproducibility and further research.\"\n ],\n \"Weaknesses\": [\n \"Limited discussion on the application of the model to other domains beyond machine translation.\",\n \"The paper could benefit from a deeper analysis of the potential negative societal impacts of the model.\"\n ],\n \"Originality\": 4,\n \"Quality\": 4,\n \"Clarity\": 4,\n \"Significance\": 4,\n \"Decision\": \"Accept\"\n}"
3
+ }
ai_scientist/fewshot_examples/attention.pdf ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:d87d482d5ae7960e2e43d7dd6d21377e60e73e8fce1bf2a01aff7aca8a08c537
3
+ size 569417
ai_scientist/fewshot_examples/attention.txt ADDED
@@ -0,0 +1,662 @@
1
+ # Attention Is All You Need
2
+
3
+
4
+ **Ashish Vaswani[∗]**
5
+ Google Brain
6
+ ```
7
+ avaswani@google.com
8
+
9
+ ```
10
+ **Llion Jones[∗]**
11
+ Google Research
12
+ ```
13
+ llion@google.com
14
+
15
+ ```
16
+
17
+ **Noam Shazeer[∗]**
18
+ Google Brain
19
+ ```
20
+ noam@google.com
21
+
22
+ ```
23
+
24
+ **Niki Parmar[∗]**
25
+ Google Research
26
+ ```
27
+ nikip@google.com
28
+
29
+ ```
30
+
31
+ **Jakob Uszkoreit[∗]**
32
+ Google Research
33
+ ```
34
+ usz@google.com
35
+
36
+ ```
37
+
38
+ **Aidan N. Gomez[∗†]**
39
+ University of Toronto
40
+ ```
41
+ aidan@cs.toronto.edu
42
+
43
+ ```
44
+
45
+ **Łukasz Kaiser[∗]**
46
+ Google Brain
47
+ ```
48
+ lukaszkaiser@google.com
49
+
50
+ ```
51
+
52
+ **Illia Polosukhin[∗‡]**
53
+ ```
54
+ illia.polosukhin@gmail.com
55
+
56
+ ```
57
+ **Abstract**
58
+
59
+ The dominant sequence transduction models are based on complex recurrent or
60
+ convolutional neural networks that include an encoder and a decoder. The best
61
+ performing models also connect the encoder and decoder through an attention
62
+ mechanism. We propose a new simple network architecture, the Transformer,
63
+ based solely on attention mechanisms, dispensing with recurrence and convolutions
64
+ entirely. Experiments on two machine translation tasks show these models to
65
+ be superior in quality while being more parallelizable and requiring significantly
66
+ less time to train. Our model achieves 28.4 BLEU on the WMT 2014 Englishto-German translation task, improving over the existing best results, including
67
+ ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task,
68
+ our model establishes a new single-model state-of-the-art BLEU score of 41.0 after
69
+ training for 3.5 days on eight GPUs, a small fraction of the training costs of the
70
+ best models from the literature.
71
+
72
+ **1** **Introduction**
73
+
74
+ Recurrent neural networks, long short-term memory [12] and gated recurrent [7] neural networks
75
+ in particular, have been firmly established as state of the art approaches in sequence modeling and
76
+ transduction problems such as language modeling and machine translation [29, 2, 5]. Numerous
77
+ efforts have since continued to push the boundaries of recurrent language models and encoder-decoder
78
+ architectures [31, 21, 13].
79
+
80
+ _∗Equal contribution. Listing order is random. Jakob proposed replacing RNNs with self-attention and started_
81
+ the effort to evaluate this idea. Ashish, with Illia, designed and implemented the first Transformer models and
82
+ has been crucially involved in every aspect of this work. Noam proposed scaled dot-product attention, multi-head
83
+ attention and the parameter-free position representation and became the other person involved in nearly every
84
+ detail. Niki designed, implemented, tuned and evaluated countless model variants in our original codebase and
85
+ tensor2tensor. Llion also experimented with novel model variants, was responsible for our initial codebase, and
86
+ efficient inference and visualizations. Lukasz and Aidan spent countless long days designing various parts of and
87
+ implementing tensor2tensor, replacing our earlier codebase, greatly improving results and massively accelerating
88
+ our research.
89
+ _†Work performed while at Google Brain._
90
+ _‡Work performed while at Google Research._
91
+
92
+ 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.
93
+
94
+
95
+ -----
96
+
97
+ Recurrent models typically factor computation along the symbol positions of the input and output
98
+ sequences. Aligning the positions to steps in computation time, they generate a sequence of hidden
99
+ states ht, as a function of the previous hidden state ht 1 and the input for position t. This inherently
100
+ _−_
101
+ sequential nature precludes parallelization within training examples, which becomes critical at longer
102
+ sequence lengths, as memory constraints limit batching across examples. Recent work has achieved
103
+ significant improvements in computational efficiency through factorization tricks [18] and conditional
104
+ computation [26], while also improving model performance in case of the latter. The fundamental
105
+ constraint of sequential computation, however, remains.
106
+
107
+ Attention mechanisms have become an integral part of compelling sequence modeling and transduction models in various tasks, allowing modeling of dependencies without regard to their distance in
108
+ the input or output sequences [2, 16]. In all but a few cases [22], however, such attention mechanisms
109
+ are used in conjunction with a recurrent network.
110
+
111
+ In this work we propose the Transformer, a model architecture eschewing recurrence and instead
112
+ relying entirely on an attention mechanism to draw global dependencies between input and output.
113
+ The Transformer allows for significantly more parallelization and can reach a new state of the art in
114
+ translation quality after being trained for as little as twelve hours on eight P100 GPUs.
115
+
116
+ **2** **Background**
117
+
118
+ The goal of reducing sequential computation also forms the foundation of the Extended Neural GPU
119
+
120
+ [20], ByteNet [15] and ConvS2S [8], all of which use convolutional neural networks as basic building
121
+ block, computing hidden representations in parallel for all input and output positions. In these models,
122
+ the number of operations required to relate signals from two arbitrary input or output positions grows
123
+ in the distance between positions, linearly for ConvS2S and logarithmically for ByteNet. This makes
124
+ it more difficult to learn dependencies between distant positions [11]. In the Transformer this is
125
+ reduced to a constant number of operations, albeit at the cost of reduced effective resolution due
126
+ to averaging attention-weighted positions, an effect we counteract with Multi-Head Attention as
127
+ described in section 3.2.
128
+
129
+ Self-attention, sometimes called intra-attention is an attention mechanism relating different positions
130
+ of a single sequence in order to compute a representation of the sequence. Self-attention has been
131
+ used successfully in a variety of tasks including reading comprehension, abstractive summarization,
132
+ textual entailment and learning task-independent sentence representations [4, 22, 23, 19].
133
+
134
+ End-to-end memory networks are based on a recurrent attention mechanism instead of sequencealigned recurrence and have been shown to perform well on simple-language question answering and
135
+ language modeling tasks [28].
136
+
137
+ To the best of our knowledge, however, the Transformer is the first transduction model relying
138
+ entirely on self-attention to compute representations of its input and output without using sequencealigned RNNs or convolution. In the following sections, we will describe the Transformer, motivate
139
+ self-attention and discuss its advantages over models such as [14, 15] and [8].
140
+
141
+ **3** **Model Architecture**
142
+
143
+ Most competitive neural sequence transduction models have an encoder-decoder structure [5, 2, 29].
144
+ Here, the encoder maps an input sequence of symbol representations (x1, ..., xn) to a sequence
145
+ of continuous representations z = (z1, ..., zn). Given z, the decoder then generates an output
146
+ sequence (y1, ..., ym) of symbols one element at a time. At each step the model is auto-regressive
147
+
148
+ [9], consuming the previously generated symbols as additional input when generating the next.
149
+
150
+ The Transformer follows this overall architecture using stacked self-attention and point-wise, fully
151
+ connected layers for both the encoder and decoder, shown in the left and right halves of Figure 1,
152
+ respectively.
153
+
154
+ **3.1** **Encoder and Decoder Stacks**
155
+
156
+ **Encoder:** The encoder is composed of a stack of N = 6 identical layers. Each layer has two
157
+ sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position
158
+
159
+ -----
160
+
161
+ Figure 1: The Transformer - model architecture.
162
+
163
+ wise fully connected feed-forward network. We employ a residual connection [10] around each of
164
+ the two sub-layers, followed by layer normalization [1]. That is, the output of each sub-layer is
165
+ LayerNorm(x + Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer
166
+ itself. To facilitate these residual connections, all sub-layers in the model, as well as the embedding
167
+ layers, produce outputs of dimension dmodel = 512.
168
+
169
+ **Decoder:** The decoder is also composed of a stack of N = 6 identical layers. In addition to the two
170
+ sub-layers in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head
171
+ attention over the output of the encoder stack. Similar to the encoder, we employ residual connections
172
+ around each of the sub-layers, followed by layer normalization. We also modify the self-attention
173
+ sub-layer in the decoder stack to prevent positions from attending to subsequent positions. This
174
+ masking, combined with fact that the output embeddings are offset by one position, ensures that the
175
+ predictions for position i can depend only on the known outputs at positions less than i.
176
+
177
+ **3.2** **Attention**
178
+
179
+ An attention function can be described as mapping a query and a set of key-value pairs to an output,
180
+ where the query, keys, values, and output are all vectors. The output is computed as a weighted sum
181
+ of the values, where the weight assigned to each value is computed by a compatibility function of the
182
+ query with the corresponding key.
183
+
184
+ **3.2.1** **Scaled Dot-Product Attention**
185
+
186
+ We call our particular attention "Scaled Dot-Product Attention" (Figure 2). The input consists of
187
+ queries and keys of dimension dk, and values of dimension dv. We compute the dot products of the
188
+
189
+
190
+ -----
191
+
192
+ Scaled Dot-Product Attention Multi-Head Attention
193
+
194
+ Figure 2: (left) Scaled Dot-Product Attention. (right) Multi-Head Attention consists of several
195
+ attention layers running in parallel.
196
+
197
+ query with all keys, divide each by √d_k, and apply a softmax function to obtain the weights on the
+ values.
201
+
202
+ In practice, we compute the attention function on a set of queries simultaneously, packed together
203
+ into a matrix Q. The keys and values are also packed together into matrices K and V . We compute
204
+ the matrix of outputs as:
205
+
206
+ Attention(Q, K, V) = softmax(QK^T / √d_k) V    (1)
207
+
208
+ The two most commonly used attention functions are additive attention [2], and dot-product (multiplicative) attention. Dot-product attention is identical to our algorithm, except for the scaling factor
209
+ of 1/√d_k. Additive attention computes the compatibility function using a feed-forward network with
210
+ a single hidden layer. While the two are similar in theoretical complexity, dot-product attention is
211
+ much faster and more space-efficient in practice, since it can be implemented using highly optimized
212
+ matrix multiplication code.
213
+
214
+ While for small values of dk the two mechanisms perform similarly, additive attention outperforms
215
+ dot product attention without scaling for larger values of dk [3]. We suspect that for large values of
216
+ _dk, the dot products grow large in magnitude, pushing the softmax function into regions where it has_
217
+ extremely small gradients [4]. To counteract this effect, we scale the dot products by 1/√d_k.
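+
+ A minimal NumPy sketch of equation (1) (illustrative, not the paper's reference implementation; the
+ batch dimension is omitted):
+
+ ```python
+ import numpy as np
+
+ def scaled_dot_product_attention(Q, K, V):
+     # Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v)
+     d_k = Q.shape[-1]
+     scores = Q @ K.T / np.sqrt(d_k)                      # (n_q, n_k)
+     weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
+     weights /= weights.sum(axis=-1, keepdims=True)       # row-wise softmax
+     return weights @ V                                   # (n_q, d_v)
+ ```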
218
+
219
+ **3.2.2** **Multi-Head Attention**
220
+
221
+ Instead of performing a single attention function with dmodel-dimensional keys, values and queries,
222
+ we found it beneficial to linearly project the queries, keys and values h times with different, learned
223
+ linear projections to dk, dk and dv dimensions, respectively. On each of these projected versions of
224
+ queries, keys and values we then perform the attention function in parallel, yielding dv-dimensional
225
+ output values. These are concatenated and once again projected, resulting in the final values, as
226
+ depicted in Figure 2.
227
+
228
+ Multi-head attention allows the model to jointly attend to information from different representation
229
+ subspaces at different positions. With a single attention head, averaging inhibits this.
230
+
231
+ (Footnote 4) To illustrate why the dot products get large, assume that the components of q and k are independent random
+ variables with mean 0 and variance 1. Then their dot product, q · k = Σ_{i=1}^{d_k} q_i k_i, has mean 0 and variance d_k.
235
+
236
+
237
+ -----
238
+
239
+ MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O
+
+ where head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)
+
+ where the projections are parameter matrices W_i^Q ∈ R^(d_model × d_k), W_i^K ∈ R^(d_model × d_k), W_i^V ∈ R^(d_model × d_v)
+ and W^O ∈ R^(h·d_v × d_model).
246
+
247
+ In this work we employ h = 8 parallel attention layers, or heads. For each of these we use
248
+ _dk = dv = dmodel/h = 64. Due to the reduced dimension of each head, the total computational cost_
249
+ is similar to that of single-head attention with full dimensionality.
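+
+ A hedged NumPy sketch of the projections described above (random matrices stand in for the learned
+ projections; `scaled_dot_product_attention` refers to the sketch in Section 3.2.1):
+
+ ```python
+ import numpy as np
+
+ rng = np.random.default_rng(0)
+ d_model, h = 512, 8
+ d_k = d_v = d_model // h                                 # 64, as in the paper
+
+ W_Q = rng.normal(size=(h, d_model, d_k))
+ W_K = rng.normal(size=(h, d_model, d_k))
+ W_V = rng.normal(size=(h, d_model, d_v))
+ W_O = rng.normal(size=(h * d_v, d_model))
+
+ def multi_head_attention(Q, K, V):
+     # Q: (n_q, d_model), K, V: (n_k, d_model)
+     heads = [scaled_dot_product_attention(Q @ W_Q[i], K @ W_K[i], V @ W_V[i])
+              for i in range(h)]                          # each (n_q, d_v)
+     return np.concatenate(heads, axis=-1) @ W_O          # (n_q, d_model)
+ ```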
250
+
251
+ **3.2.3** **Applications of Attention in our Model**
252
+
253
+ The Transformer uses multi-head attention in three different ways:
254
+
255
+ _• In "encoder-decoder attention" layers, the queries come from the previous decoder layer,_
256
+ and the memory keys and values come from the output of the encoder. This allows every
257
+ position in the decoder to attend over all positions in the input sequence. This mimics the
258
+ typical encoder-decoder attention mechanisms in sequence-to-sequence models such as
259
+
260
+ [31, 2, 8].
261
+
262
+ _• The encoder contains self-attention layers. In a self-attention layer all of the keys, values_
263
+ and queries come from the same place, in this case, the output of the previous layer in the
264
+ encoder. Each position in the encoder can attend to all positions in the previous layer of the
265
+ encoder.
266
+
267
+ _• Similarly, self-attention layers in the decoder allow each position in the decoder to attend to_
268
+ all positions in the decoder up to and including that position. We need to prevent leftward
269
+ information flow in the decoder to preserve the auto-regressive property. We implement this
270
+ inside of scaled dot-product attention by masking out (setting to −∞) all values in the input
271
+ of the softmax which correspond to illegal connections. See Figure 2.
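+
+ A small illustrative sketch of the masking described in the last item above (not from the paper):
+
+ ```python
+ import numpy as np
+
+ def causal_mask(n):
+     # True where position j > i, i.e. the "illegal" connections for query position i
+     return np.triu(np.ones((n, n), dtype=bool), k=1)
+
+ def masked_softmax(scores):
+     scores = scores.copy()
+     scores[causal_mask(scores.shape[-1])] = -np.inf      # mask out illegal connections
+     w = np.exp(scores - scores.max(axis=-1, keepdims=True))
+     return w / w.sum(axis=-1, keepdims=True)
+ ```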
272
+
273
+ **3.3** **Position-wise Feed-Forward Networks**
274
+
275
+ In addition to attention sub-layers, each of the layers in our encoder and decoder contains a fully
276
+ connected feed-forward network, which is applied to each position separately and identically. This
277
+ consists of two linear transformations with a ReLU activation in between.
278
+
279
+ FFN(x) = max(0, xW1 + b1)W2 + b2 (2)
280
+
281
+ While the linear transformations are the same across different positions, they use different parameters
282
+ from layer to layer. Another way of describing this is as two convolutions with kernel size 1.
283
+ The dimensionality of input and output is dmodel = 512, and the inner-layer has dimensionality
284
+ _dff = 2048._
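+
+ A minimal sketch of equation (2) with the stated dimensions (illustrative; random matrices stand in
+ for the learned parameters):
+
+ ```python
+ import numpy as np
+
+ rng = np.random.default_rng(0)
+ d_model, d_ff = 512, 2048
+ W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
+ W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
+
+ def ffn(x):
+     # applied to each position separately and identically; x: (n, d_model)
+     return np.maximum(0.0, x @ W1 + b1) @ W2 + b2
+ ```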
285
+
286
+ **3.4** **Embeddings and Softmax**
287
+
288
+ Similarly to other sequence transduction models, we use learned embeddings to convert the input
289
+ tokens and output tokens to vectors of dimension dmodel. We also use the usual learned linear transformation and softmax function to convert the decoder output to predicted next-token probabilities. In
290
+ our model, we share the same weight matrix between the two embedding layers and the pre-softmax
291
+ linear transformation, similar to [24]. In the embedding layers, we multiply those weights by √d_model.
294
+
295
+ **3.5** **Positional Encoding**
296
+
297
+ Since our model contains no recurrence and no convolution, in order for the model to make use of the
298
+ order of the sequence, we must inject some information about the relative or absolute position of the
299
+ tokens in the sequence. To this end, we add "positional encodings" to the input embeddings at the
300
+
301
+
302
+ -----
303
+
304
+ Table 1: Maximum path lengths, per-layer complexity and minimum number of sequential operations
305
+ for different layer types. n is the sequence length, d is the representation dimension, k is the kernel
306
+ size of convolutions and r the size of the neighborhood in restricted self-attention.
307
+
308
+ |Layer Type|Complexity per Layer|Sequential Operations|Maximum Path Length|
+ |---|---|---|---|
+ |Self-Attention|O(n² · d)|O(1)|O(1)|
+ |Recurrent|O(n · d²)|O(n)|O(n)|
+ |Convolutional|O(k · n · d²)|O(1)|O(log_k(n))|
+ |Self-Attention (restricted)|O(r · n · d)|O(1)|O(n/r)|
316
+
317
+ bottoms of the encoder and decoder stacks. The positional encodings have the same dimension dmodel
318
+ as the embeddings, so that the two can be summed. There are many choices of positional encodings,
319
+ learned and fixed [8].
320
+
321
+ In this work, we use sine and cosine functions of different frequencies:
322
+
323
+ PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
+
+ PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
326
+
327
+ where pos is the position and i is the dimension. That is, each dimension of the positional encoding
328
+ corresponds to a sinusoid. The wavelengths form a geometric progression from 2π to 10000 · 2π. We
329
+ chose this function because we hypothesized it would allow the model to easily learn to attend by
330
+ relative positions, since for any fixed offset k, PE_(pos+k) can be represented as a linear function of
+ PE_pos.
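+
+ A short sketch of these encodings (illustrative):
+
+ ```python
+ import numpy as np
+
+ def positional_encoding(max_len, d_model):
+     pos = np.arange(max_len)[:, None]                     # (max_len, 1)
+     i = np.arange(d_model // 2)[None, :]                  # (1, d_model/2)
+     angles = pos / np.power(10000.0, 2.0 * i / d_model)   # wavelengths 2π ... 10000·2π
+     pe = np.zeros((max_len, d_model))
+     pe[:, 0::2] = np.sin(angles)                          # even dimensions
+     pe[:, 1::2] = np.cos(angles)                          # odd dimensions
+     return pe
+ ```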
332
+
333
+ We also experimented with using learned positional embeddings [8] instead, and found that the two
334
+ versions produced nearly identical results (see Table 3 row (E)). We chose the sinusoidal version
335
+ because it may allow the model to extrapolate to sequence lengths longer than the ones encountered
336
+ during training.
337
+
338
+ **4** **Why Self-Attention**
339
+
340
+ In this section we compare various aspects of self-attention layers to the recurrent and convolutional layers commonly used for mapping one variable-length sequence of symbol representations
341
+ (x1, ..., xn) to another sequence of equal length (z1, ..., zn), with xi, zi ∈ R^d, such as a hidden
+ layer in a typical sequence transduction encoder or decoder. Motivating our use of self-attention we
+ consider three desiderata.
343
+
344
+ One is the total computational complexity per layer. Another is the amount of computation that can
345
+ be parallelized, as measured by the minimum number of sequential operations required.
346
+
347
+ The third is the path length between long-range dependencies in the network. Learning long-range
348
+ dependencies is a key challenge in many sequence transduction tasks. One key factor affecting the
349
+ ability to learn such dependencies is the length of the paths forward and backward signals have to
350
+ traverse in the network. The shorter these paths between any combination of positions in the input
351
+ and output sequences, the easier it is to learn long-range dependencies [11]. Hence we also compare
352
+ the maximum path length between any two input and output positions in networks composed of the
353
+ different layer types.
354
+
355
+ As noted in Table 1, a self-attention layer connects all positions with a constant number of sequentially
356
+ executed operations, whereas a recurrent layer requires O(n) sequential operations. In terms of
357
+ computational complexity, self-attention layers are faster than recurrent layers when the sequence
358
+ length n is smaller than the representation dimensionality d, which is most often the case with
359
+ sentence representations used by state-of-the-art models in machine translations, such as word-piece
360
+
361
+ [31] and byte-pair [25] representations. To improve computational performance for tasks involving
362
+ very long sequences, self-attention could be restricted to considering only a neighborhood of size r in
363
+
364
+
365
+ -----
366
+
367
+ the input sequence centered around the respective output position. This would increase the maximum
368
+ path length to O(n/r). We plan to investigate this approach further in future work.
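+
+ As a quick numeric illustration of the comparison above (values chosen only for illustration): with a
+ sentence length of n = 70 and d = 512, per-layer work scales as n²·d = 70² × 512 ≈ 2.5 × 10^6 for
+ self-attention versus n·d² = 70 × 512² ≈ 1.8 × 10^7 for a recurrent layer, consistent with
+ self-attention being cheaper whenever n < d.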
369
+
370
+ A single convolutional layer with kernel width k < n does not connect all pairs of input and output
371
+ positions. Doing so requires a stack of O(n/k) convolutional layers in the case of contiguous kernels,
372
+ or O(log_k(n)) in the case of dilated convolutions [15], increasing the length of the longest paths
373
+ between any two positions in the network. Convolutional layers are generally more expensive than
374
+ recurrent layers, by a factor of k. Separable convolutions [6], however, decrease the complexity
375
+ considerably, to O(k · n · d + n · d²). Even with k = n, however, the complexity of a separable
376
+ convolution is equal to the combination of a self-attention layer and a point-wise feed-forward layer,
377
+ the approach we take in our model.
378
+
379
+ As side benefit, self-attention could yield more interpretable models. We inspect attention distributions
380
+ from our models and present and discuss examples in the appendix. Not only do individual attention
381
+ heads clearly learn to perform different tasks, many appear to exhibit behavior related to the syntactic
382
+ and semantic structure of the sentences.
383
+
384
+ **5** **Training**
385
+
386
+ This section describes the training regime for our models.
387
+
388
+ **5.1** **Training Data and Batching**
389
+
390
+ We trained on the standard WMT 2014 English-German dataset consisting of about 4.5 million
391
+ sentence pairs. Sentences were encoded using byte-pair encoding [3], which has a shared sourcetarget vocabulary of about 37000 tokens. For English-French, we used the significantly larger WMT
392
+ 2014 English-French dataset consisting of 36M sentences and split tokens into a 32000 word-piece
393
+ vocabulary [31]. Sentence pairs were batched together by approximate sequence length. Each training
394
+ batch contained a set of sentence pairs containing approximately 25000 source tokens and 25000
395
+ target tokens.
396
+
397
+ **5.2** **Hardware and Schedule**
398
+
399
+ We trained our models on one machine with 8 NVIDIA P100 GPUs. For our base models using
400
+ the hyperparameters described throughout the paper, each training step took about 0.4 seconds. We
401
+ trained the base models for a total of 100,000 steps or 12 hours. For our big models (described on the
402
+ bottom line of Table 3), step time was 1.0 seconds. The big models were trained for 300,000 steps
403
+ (3.5 days).
404
+
405
+ **5.3** **Optimizer**
406
+
407
+ We used the Adam optimizer [17] with β1 = 0.9, β2 = 0.98 and ε = 10^−9. We varied the learning
408
+ rate over the course of training, according to the formula:
409
+
410
+ lrate = d_model^(−0.5) · min(step_num^(−0.5), step_num · warmup_steps^(−1.5))    (3)
413
+
414
+ This corresponds to increasing the learning rate linearly for the first warmup_steps training steps,
415
+ and decreasing it thereafter proportionally to the inverse square root of the step number. We used
416
+ _warmup_steps = 4000._
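+
+ A small sketch of this schedule (illustrative):
+
+ ```python
+ def transformer_lrate(step, d_model=512, warmup_steps=4000):
+     # equation (3): linear warmup followed by inverse-square-root decay
+     step = max(step, 1)
+     return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
+ ```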
417
+
418
+ **5.4** **Regularization**
419
+
420
+ We employ three types of regularization during training:
421
+
422
+ **Residual Dropout** We apply dropout [27] to the output of each sub-layer, before it is added to the
423
+ sub-layer input and normalized. In addition, we apply dropout to the sums of the embeddings and the
424
+ positional encodings in both the encoder and decoder stacks. For the base model, we use a rate of
425
+ _Pdrop = 0.1._
426
+
427
+
428
+ -----
429
+
430
+ Table 2: The Transformer achieves better BLEU scores than previous state-of-the-art models on the
431
+ English-to-German and English-to-French newstest2014 tests at a fraction of the training cost.
432
+
433
+ |Model|BLEU EN-DE|BLEU EN-FR|Training Cost (FLOPs) EN-DE|Training Cost (FLOPs) EN-FR|
+ |---|---|---|---|---|
+ |ByteNet [15]|23.75||||
+ |Deep-Att + PosUnk [32]||39.2||1.0 · 10^20|
+ |GNMT + RL [31]|24.6|39.92|2.3 · 10^19|1.4 · 10^20|
+ |ConvS2S [8]|25.16|40.46|9.6 · 10^18|1.5 · 10^20|
+ |MoE [26]|26.03|40.56|2.0 · 10^19|1.2 · 10^20|
+ |Deep-Att + PosUnk Ensemble [32]||40.4||8.0 · 10^20|
+ |GNMT + RL Ensemble [31]|26.30|41.16|1.8 · 10^20|1.1 · 10^21|
+ |ConvS2S Ensemble [8]|26.36|**41.29**|7.7 · 10^19|1.2 · 10^21|
+ |Transformer (base model)|27.3|38.1|**3.3 · 10^18**|**3.3 · 10^18**|
+ |Transformer (big)|**28.4**|**41.0**|2.3 · 10^19|2.3 · 10^19|
456
+
457
+
458
+ **Label Smoothing** During training, we employed label smoothing of value ε_ls = 0.1 [30]. This
459
+ hurts perplexity, as the model learns to be more unsure, but improves accuracy and BLEU score.
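+
+ A sketch of one common formulation of label smoothing (illustrative; the paper's exact variant may
+ distribute the smoothing mass slightly differently):
+
+ ```python
+ import numpy as np
+
+ def smoothed_targets(labels, vocab_size, eps=0.1):
+     # keep 1 - eps on the true token; spread eps uniformly over the other tokens
+     t = np.full((len(labels), vocab_size), eps / (vocab_size - 1))
+     t[np.arange(len(labels)), labels] = 1.0 - eps
+     return t
+ ```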
460
+
461
+ **6** **Results**
462
+
463
+ **6.1** **Machine Translation**
464
+
465
+ On the WMT 2014 English-to-German translation task, the big transformer model (Transformer (big)
466
+ in Table 2) outperforms the best previously reported models (including ensembles) by more than 2.0
467
+ BLEU, establishing a new state-of-the-art BLEU score of 28.4. The configuration of this model is
468
+ listed in the bottom line of Table 3. Training took 3.5 days on 8 P100 GPUs. Even our base model
469
+ surpasses all previously published models and ensembles, at a fraction of the training cost of any of
470
+ the competitive models.
471
+
472
+ On the WMT 2014 English-to-French translation task, our big model achieves a BLEU score of 41.0,
473
+ outperforming all of the previously published single models, at less than 1/4 the training cost of the
474
+ previous state-of-the-art model. The Transformer (big) model trained for English-to-French used
475
+ dropout rate Pdrop = 0.1, instead of 0.3.
476
+
477
+ For the base models, we used a single model obtained by averaging the last 5 checkpoints, which
478
+ were written at 10-minute intervals. For the big models, we averaged the last 20 checkpoints. We
479
+ used beam search with a beam size of 4 and length penalty α = 0.6 [31]. These hyperparameters
480
+ were chosen after experimentation on the development set. We set the maximum output length during
481
+ inference to input length + 50, but terminate early when possible [31].
482
+
483
+ Table 2 summarizes our results and compares our translation quality and training costs to other model
484
+ architectures from the literature. We estimate the number of floating point operations used to train a
485
+ model by multiplying the training time, the number of GPUs used, and an estimate of the sustained
486
+ single-precision floating-point capacity of each GPU [5].
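+
+ For example, under this estimate the base model's cost works out to roughly
+ 12 h × 3600 s/h × 8 GPUs × 9.5 × 10^12 FLOPS ≈ 3.3 × 10^18 FLOPs, and the big model's to
+ 3.5 days × 86,400 s/day × 8 × 9.5 × 10^12 ≈ 2.3 × 10^19 FLOPs, matching the Transformer rows in
+ Table 2 (using the P100 figure given in footnote 5).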
487
+
488
+ **6.2** **Model Variations**
489
+
490
+ To evaluate the importance of different components of the Transformer, we varied our base model
491
+ in different ways, measuring the change in performance on English-to-German translation on the
492
+ development set, newstest2013. We used beam search as described in the previous section, but no
493
+ checkpoint averaging. We present these results in Table 3.
494
+
495
+ In Table 3 rows (A), we vary the number of attention heads and the attention key and value dimensions,
496
+ keeping the amount of computation constant, as described in Section 3.2.2. While single-head
497
+ attention is 0.9 BLEU worse than the best setting, quality also drops off with too many heads.
498
+
499
+ (Footnote 5) We used values of 2.8, 3.7, 6.0 and 9.5 TFLOPS for K80, K40, M40 and P100, respectively.
500
+
501
+
502
+ -----
503
+
504
+ Table 3: Variations on the Transformer architecture. Unlisted values are identical to those of the base
505
+ model. All metrics are on the English-to-German translation development set, newstest2013. Listed
506
+ perplexities are per-wordpiece, according to our byte-pair encoding, and should not be compared to
507
+ per-word perplexities.
508
+
509
+ |Variant|N|d_model|d_ff|h|d_k|d_v|P_drop|ε_ls|train steps|PPL (dev)|BLEU (dev)|params ×10^6|
+ |---|---|---|---|---|---|---|---|---|---|---|---|---|
+ |base|6|512|2048|8|64|64|0.1|0.1|100K|4.92|25.8|65|
+ |(A)||||1|512|512||||5.29|24.9||
+ |(A)||||4|128|128||||5.00|25.5||
+ |(A)||||16|32|32||||4.91|25.8||
+ |(A)||||32|16|16||||5.01|25.4||
+ |(B)|||||16|||||5.16|25.1|58|
+ |(B)|||||32|||||5.01|25.4|60|
+ |(C)|2|||||||||6.11|23.7|36|
+ |(C)|4|||||||||5.19|25.3|50|
+ |(C)|8|||||||||4.88|25.5|80|
+ |(C)||256|||32|32||||5.75|24.5|28|
+ |(C)||1024|||128|128||||4.66|26.0|168|
+ |(C)|||1024|||||||5.12|25.4|53|
+ |(C)|||4096|||||||4.75|26.2|90|
+ |(D)|||||||0.0|||5.77|24.6||
+ |(D)|||||||0.2|||4.95|25.5||
+ |(D)||||||||0.0||4.67|25.3||
+ |(D)||||||||0.2||5.47|25.7||
+ |(E)|positional embedding instead of sinusoids|||||||||4.92|25.7||
+ |big|6|1024|4096|16|||0.3||300K|4.33|26.4|213|
518
+
519
+
520
+
521
+ In Table 3 rows (B), we observe that reducing the attention key size dk hurts model quality. This
522
+ suggests that determining compatibility is not easy and that a more sophisticated compatibility
523
+ function than dot product may be beneficial. We further observe in rows (C) and (D) that, as expected,
524
+ bigger models are better, and dropout is very helpful in avoiding over-fitting. In row (E) we replace our
525
+ sinusoidal positional encoding with learned positional embeddings [8], and observe nearly identical
526
+ results to the base model.
527
+
528
+ **7** **Conclusion**
529
+
530
+ In this work, we presented the Transformer, the first sequence transduction model based entirely on
531
+ attention, replacing the recurrent layers most commonly used in encoder-decoder architectures with
532
+ multi-headed self-attention.
533
+
534
+ For translation tasks, the Transformer can be trained significantly faster than architectures based
535
+ on recurrent or convolutional layers. On both WMT 2014 English-to-German and WMT 2014
536
+ English-to-French translation tasks, we achieve a new state of the art. In the former task our best
537
+ model outperforms even all previously reported ensembles.
538
+
539
+ We are excited about the future of attention-based models and plan to apply them to other tasks. We
540
+ plan to extend the Transformer to problems involving input and output modalities other than text and
541
+ to investigate local, restricted attention mechanisms to efficiently handle large inputs and outputs
542
+ such as images, audio and video. Making generation less sequential is another research goals of ours.
543
+
544
+ [The code we used to train and evaluate our models is available at https://github.com/](https://github.com/tensorflow/tensor2tensor)
545
+ ```
546
+ tensorflow/tensor2tensor.
547
+
548
+ ```
549
+ **Acknowledgements** We are grateful to Nal Kalchbrenner and Stephan Gouws for their fruitful
550
+ comments, corrections and inspiration.
551
+
552
+
553
+ -----
554
+
555
+ **References**
556
+
557
+ [1] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint
558
+ _arXiv:1607.06450, 2016._
559
+
560
+ [2] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly
561
+ learning to align and translate. CoRR, abs/1409.0473, 2014.
562
+
563
+ [3] Denny Britz, Anna Goldie, Minh-Thang Luong, and Quoc V. Le. Massive exploration of neural
564
+ machine translation architectures. CoRR, abs/1703.03906, 2017.
565
+
566
+ [4] Jianpeng Cheng, Li Dong, and Mirella Lapata. Long short-term memory-networks for machine
567
+ reading. arXiv preprint arXiv:1601.06733, 2016.
568
+
569
+ [5] Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Fethi Bougares, Holger Schwenk,
570
+ and Yoshua Bengio. Learning phrase representations using rnn encoder-decoder for statistical
571
+ machine translation. CoRR, abs/1406.1078, 2014.
572
+
573
+ [6] Francois Chollet. Xception: Deep learning with depthwise separable convolutions. arXiv
574
+ _preprint arXiv:1610.02357, 2016._
575
+
576
+ [7] Junyoung Chung, Çaglar Gülçehre, Kyunghyun Cho, and Yoshua Bengio. Empirical evaluation
577
+ of gated recurrent neural networks on sequence modeling. CoRR, abs/1412.3555, 2014.
578
+
579
+ [8] Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. Convolutional sequence to sequence learning. arXiv preprint arXiv:1705.03122v2, 2017.
580
+
581
+ [9] Alex Graves. Generating sequences with recurrent neural networks. _arXiv preprint_
582
+ _arXiv:1308.0850, 2013._
583
+
584
+ [10] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern
585
+ _Recognition, pages 770–778, 2016._
586
+
587
+ [11] Sepp Hochreiter, Yoshua Bengio, Paolo Frasconi, and Jürgen Schmidhuber. Gradient flow in
588
+ recurrent nets: the difficulty of learning long-term dependencies, 2001.
589
+
590
+ [12] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation,
591
+ 9(8):1735–1780, 1997.
592
+
593
+ [13] Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu. Exploring
594
+ the limits of language modeling. arXiv preprint arXiv:1602.02410, 2016.
595
+
596
+ [14] Łukasz Kaiser and Ilya Sutskever. Neural GPUs learn algorithms. In International Conference
597
+ _on Learning Representations (ICLR), 2016._
598
+
599
+ [15] Nal Kalchbrenner, Lasse Espeholt, Karen Simonyan, Aaron van den Oord, Alex Graves, and Koray Kavukcuoglu. Neural machine translation in linear time. arXiv preprint arXiv:1610.10099v2,
600
+ 2017.
601
+
602
+ [16] Yoon Kim, Carl Denton, Luong Hoang, and Alexander M. Rush. Structured attention networks.
603
+ In International Conference on Learning Representations, 2017.
604
+
605
+ [17] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
606
+
607
+ [18] Oleksii Kuchaiev and Boris Ginsburg. Factorization tricks for LSTM networks. arXiv preprint
608
+ _arXiv:1703.10722, 2017._
609
+
610
+ [19] Zhouhan Lin, Minwei Feng, Cicero Nogueira dos Santos, Mo Yu, Bing Xiang, Bowen
611
+ Zhou, and Yoshua Bengio. A structured self-attentive sentence embedding. arXiv preprint
612
+ _arXiv:1703.03130, 2017._
613
+
614
+ [20] Samy Bengio Łukasz Kaiser. Can active memory replace attention? In Advances in Neural
615
+ _Information Processing Systems, (NIPS), 2016._
616
+
617
+
618
+ -----
619
+
620
+ [21] Minh-Thang Luong, Hieu Pham, and Christopher D Manning. Effective approaches to attentionbased neural machine translation. arXiv preprint arXiv:1508.04025, 2015.
621
+
622
+ [22] Ankur Parikh, Oscar Täckström, Dipanjan Das, and Jakob Uszkoreit. A decomposable attention
623
+ model. In Empirical Methods in Natural Language Processing, 2016.
624
+
625
+ [23] Romain Paulus, Caiming Xiong, and Richard Socher. A deep reinforced model for abstractive
626
+ summarization. arXiv preprint arXiv:1705.04304, 2017.
627
+
628
+ [24] Ofir Press and Lior Wolf. Using the output embedding to improve language models. arXiv
629
+ _preprint arXiv:1608.05859, 2016._
630
+
631
+ [25] Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words
632
+ with subword units. arXiv preprint arXiv:1508.07909, 2015.
633
+
634
+ [26] Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton,
635
+ and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts
636
+ layer. arXiv preprint arXiv:1701.06538, 2017.
637
+
638
+ [27] Nitish Srivastava, Geoffrey E Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine
639
+ _Learning Research, 15(1):1929–1958, 2014._
640
+
641
+ [28] Sainbayar Sukhbaatar, arthur szlam, Jason Weston, and Rob Fergus. End-to-end memory
642
+ networks. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors,
643
+ _Advances in Neural Information Processing Systems 28, pages 2440–2448. Curran Associates,_
644
+ Inc., 2015.
645
+
646
+ [29] Ilya Sutskever, Oriol Vinyals, and Quoc VV Le. Sequence to sequence learning with neural
647
+ networks. In Advances in Neural Information Processing Systems, pages 3104–3112, 2014.
648
+
649
+ [30] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna.
650
+ Rethinking the inception architecture for computer vision. CoRR, abs/1512.00567, 2015.
651
+
652
+ [31] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang
653
+ Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google’s neural machine
654
+ translation system: Bridging the gap between human and machine translation. arXiv preprint
655
+ _arXiv:1609.08144, 2016._
656
+
657
+ [32] Jie Zhou, Ying Cao, Xuguang Wang, Peng Li, and Wei Xu. Deep recurrent models with
658
+ fast-forward connections for neural machine translation. CoRR, abs/1606.04199, 2016.
659
+
660
+
661
+ -----
662
+
ai_scientist/generate_ideas.py ADDED
@@ -0,0 +1,546 @@
1
+ import json
2
+ import os
3
+ import os.path as osp
4
+ import time
5
+ from typing import List, Dict, Union
6
+
7
+ import backoff
8
+ import requests
9
+
10
+ from ai_scientist.llm import get_response_from_llm, extract_json_between_markers, create_client, AVAILABLE_LLMS
11
+
12
+ S2_API_KEY = os.getenv("S2_API_KEY")
13
+
14
+ idea_first_prompt = """{task_description}
15
+ <experiment.py>
16
+ {code}
17
+ </experiment.py>
18
+
19
+ Here are the ideas that you have already generated:
20
+
21
+ '''
22
+ {prev_ideas_string}
23
+ '''
24
+
25
+ Come up with the next impactful and creative idea for research experiments and directions you can feasibly investigate with the code provided.
26
+ Note that you will not have access to any additional resources or datasets.
27
+ Make sure any idea is not overfit the specific training dataset or model, and has wider significance.
28
+
29
+ Respond in the following format:
30
+
31
+ THOUGHT:
32
+ <THOUGHT>
33
+
34
+ NEW IDEA JSON:
35
+ ```json
36
+ <JSON>
37
+ ```
38
+
39
+ In <THOUGHT>, first briefly discuss your intuitions and motivations for the idea. Detail your high-level plan, necessary design choices and ideal outcomes of the experiments. Justify how the idea is different from the existing ones.
40
+
41
+ In <JSON>, provide the new idea in JSON format with the following fields:
42
+ - "Name": A shortened descriptor of the idea. Lowercase, no spaces, underscores allowed.
43
+ - "Title": A title for the idea, will be used for the report writing.
44
+ - "Experiment": An outline of the implementation. E.g. which functions need to be added or modified, how results will be obtained, ...
45
+ - "Interestingness": A rating from 1 to 10 (lowest to highest).
46
+ - "Feasibility": A rating from 1 to 10 (lowest to highest).
47
+ - "Novelty": A rating from 1 to 10 (lowest to highest).
48
+
49
+ Be cautious and realistic on your ratings.
50
+ This JSON will be automatically parsed, so ensure the format is precise.
51
+ You will have {num_reflections} rounds to iterate on the idea, but do not need to use them all.
52
+ """
53
+
54
+ idea_reflection_prompt = """Round {current_round}/{num_reflections}.
55
+ In your thoughts, first carefully consider the quality, novelty, and feasibility of the idea you just created.
56
+ Include any other factors that you think are important in evaluating the idea.
57
+ Ensure the idea is clear and concise, and the JSON is the correct format.
58
+ Do not make things overly complicated.
59
+ In the next attempt, try and refine and improve your idea.
60
+ Stick to the spirit of the original idea unless there are glaring issues.
61
+
62
+ Respond in the same format as before:
63
+ THOUGHT:
64
+ <THOUGHT>
65
+
66
+ NEW IDEA JSON:
67
+ ```json
68
+ <JSON>
69
+ ```
70
+
71
+ If there is nothing to improve, simply repeat the previous JSON EXACTLY after the thought and include "I am done" at the end of the thoughts but before the JSON.
72
+ ONLY INCLUDE "I am done" IF YOU ARE MAKING NO MORE CHANGES."""
73
+
74
+
75
+ # GENERATE IDEAS
76
+ def generate_ideas(
77
+ base_dir,
78
+ client,
79
+ model,
80
+ skip_generation=False,
81
+ max_num_generations=20,
82
+ num_reflections=5,
83
+ ):
84
+ if skip_generation:
85
+ # Load existing ideas from file
86
+ try:
87
+ with open(osp.join(base_dir, "ideas.json"), "r") as f:
88
+ ideas = json.load(f)
89
+ print("Loaded existing ideas:")
90
+ for idea in ideas:
91
+ print(idea)
92
+ return ideas
93
+ except FileNotFoundError:
94
+ print("No existing ideas found. Generating new ideas.")
95
+ except json.JSONDecodeError:
96
+ print("Error decoding existing ideas. Generating new ideas.")
97
+
98
+ idea_str_archive = []
99
+ with open(osp.join(base_dir, "seed_ideas.json"), "r") as f:
100
+ seed_ideas = json.load(f)
101
+ for seed_idea in seed_ideas:
102
+ idea_str_archive.append(json.dumps(seed_idea))
103
+
104
+ with open(osp.join(base_dir, "experiment.py"), "r") as f:
105
+ code = f.read()
106
+
107
+ with open(osp.join(base_dir, "prompt.json"), "r") as f:
108
+ prompt = json.load(f)
109
+
110
+ idea_system_prompt = prompt["system"]
111
+
112
+ for _ in range(max_num_generations):
113
+ print()
114
+ print(f"Generating idea {_ + 1}/{max_num_generations}")
115
+ try:
116
+ prev_ideas_string = "\n\n".join(idea_str_archive)
117
+
118
+ msg_history = []
119
+ print(f"Iteration 1/{num_reflections}")
120
+ text, msg_history = get_response_from_llm(
121
+ idea_first_prompt.format(
122
+ task_description=prompt["task_description"],
123
+ code=code,
124
+ prev_ideas_string=prev_ideas_string,
125
+ num_reflections=num_reflections,
126
+ ),
127
+ client=client,
128
+ model=model,
129
+ system_message=idea_system_prompt,
130
+ msg_history=msg_history,
131
+ )
132
+ ## PARSE OUTPUT
133
+ json_output = extract_json_between_markers(text)
134
+ assert json_output is not None, "Failed to extract JSON from LLM output"
135
+ print(json_output)
136
+
137
+ # Iteratively improve task.
138
+ if num_reflections > 1:
139
+ for j in range(num_reflections - 1):
140
+ print(f"Iteration {j + 2}/{num_reflections}")
141
+ text, msg_history = get_response_from_llm(
142
+ idea_reflection_prompt.format(
143
+ current_round=j + 2, num_reflections=num_reflections
144
+ ),
145
+ client=client,
146
+ model=model,
147
+ system_message=idea_system_prompt,
148
+ msg_history=msg_history,
149
+ )
150
+ ## PARSE OUTPUT
151
+ json_output = extract_json_between_markers(text)
152
+ assert (
153
+ json_output is not None
154
+ ), "Failed to extract JSON from LLM output"
155
+ print(json_output)
156
+
157
+ if "I am done" in text:
158
+ print(f"Idea generation converged after {j + 2} iterations.")
159
+ break
160
+
161
+ idea_str_archive.append(json.dumps(json_output))
162
+ except Exception as e:
163
+ print(f"Failed to generate idea: {e}")
164
+ continue
165
+
166
+ ## SAVE IDEAS
167
+ ideas = []
168
+ for idea_str in idea_str_archive:
169
+ ideas.append(json.loads(idea_str))
170
+
171
+ with open(osp.join(base_dir, "ideas.json"), "w") as f:
172
+ json.dump(ideas, f, indent=4)
173
+
174
+ return ideas
175
+
176
+
177
+ # GENERATE IDEAS OPEN-ENDED
178
+ def generate_next_idea(
179
+ base_dir,
180
+ client,
181
+ model,
182
+ prev_idea_archive=[],
183
+ num_reflections=5,
184
+ max_attempts=10,
185
+ ):
186
+ idea_archive = prev_idea_archive
187
+ original_archive_size = len(idea_archive)
188
+
189
+ print(f"Generating idea {original_archive_size + 1}")
190
+
191
+ if len(prev_idea_archive) == 0:
192
+ print(f"First iteration, taking seed ideas")
193
+ # seed the archive on the first run with pre-existing ideas
194
+ with open(osp.join(base_dir, "seed_ideas.json"), "r") as f:
195
+ seed_ideas = json.load(f)
196
+ for seed_idea in seed_ideas[:1]:
197
+ idea_archive.append(seed_idea)
198
+ else:
199
+ with open(osp.join(base_dir, "experiment.py"), "r") as f:
200
+ code = f.read()
201
+ with open(osp.join(base_dir, "prompt.json"), "r") as f:
202
+ prompt = json.load(f)
203
+ idea_system_prompt = prompt["system"]
204
+
205
+ for _ in range(max_attempts):
206
+ try:
207
+ idea_strings = []
208
+ for idea in idea_archive:
209
+ idea_strings.append(json.dumps(idea))
210
+ prev_ideas_string = "\n\n".join(idea_strings)
211
+
212
+ msg_history = []
213
+ print(f"Iteration 1/{num_reflections}")
214
+ text, msg_history = get_response_from_llm(
215
+ idea_first_prompt.format(
216
+ task_description=prompt["task_description"],
217
+ code=code,
218
+ prev_ideas_string=prev_ideas_string,
219
+ num_reflections=num_reflections,
220
+ )
221
+ + """
222
+ Completed ideas have an additional "Score" field which indicates the assessment by an expert ML reviewer.
223
+ This is on a standard 1-10 ML conference scale.
224
+ Scores of 0 indicate the idea failed either during experimentation, writeup or reviewing.
225
+ """,
226
+ client=client,
227
+ model=model,
228
+ system_message=idea_system_prompt,
229
+ msg_history=msg_history,
230
+ )
231
+ ## PARSE OUTPUT
232
+ json_output = extract_json_between_markers(text)
233
+ assert json_output is not None, "Failed to extract JSON from LLM output"
234
+ print(json_output)
235
+
236
+ # Iteratively improve task.
237
+ if num_reflections > 1:
238
+ for j in range(num_reflections - 1):
239
+ print(f"Iteration {j + 2}/{num_reflections}")
240
+ text, msg_history = get_response_from_llm(
241
+ idea_reflection_prompt.format(
242
+ current_round=j + 2, num_reflections=num_reflections
243
+ ),
244
+ client=client,
245
+ model=model,
246
+ system_message=idea_system_prompt,
247
+ msg_history=msg_history,
248
+ )
249
+ ## PARSE OUTPUT
250
+ json_output = extract_json_between_markers(text)
251
+ assert (
252
+ json_output is not None
253
+ ), "Failed to extract JSON from LLM output"
254
+ print(json_output)
255
+
256
+ if "I am done" in text:
257
+ print(
258
+ f"Idea generation converged after {j + 2} iterations."
259
+ )
260
+ break
261
+
262
+ idea_archive.append(json_output)
263
+ break
264
+ except Exception as e:
265
+ print(f"Failed to generate idea: {e}")
266
+ continue
267
+
268
+ ## SAVE IDEAS
269
+ with open(osp.join(base_dir, "ideas.json"), "w") as f:
270
+ json.dump(idea_archive, f, indent=4)
271
+
272
+ return idea_archive
273
+
274
+
275
+ def on_backoff(details):
276
+ print(
277
+ f"Backing off {details['wait']:0.1f} seconds after {details['tries']} tries "
278
+ f"calling function {details['target'].__name__} at {time.strftime('%X')}"
279
+ )
280
+
281
+
282
+ @backoff.on_exception(
283
+ backoff.expo, requests.exceptions.HTTPError, on_backoff=on_backoff
284
+ )
285
+ def search_for_papers(query, result_limit=10, engine="semanticscholar") -> Union[None, List[Dict]]:
286
+ if not query:
287
+ return None
288
+ if engine == "semanticscholar":
289
+ rsp = requests.get(
290
+ "https://api.semanticscholar.org/graph/v1/paper/search",
291
+ headers={"X-API-KEY": S2_API_KEY} if S2_API_KEY else {},
292
+ params={
293
+ "query": query,
294
+ "limit": result_limit,
295
+ "fields": "title,authors,venue,year,abstract,citationStyles,citationCount",
296
+ },
297
+ )
298
+ print(f"Response Status Code: {rsp.status_code}")
299
+ print(
300
+ f"Response Content: {rsp.text[:500]}"
301
+ ) # Print the first 500 characters of the response content
302
+ rsp.raise_for_status()
303
+ results = rsp.json()
304
+ total = results["total"]
305
+ time.sleep(1.0)
306
+ if not total:
307
+ return None
308
+
309
+ papers = results["data"]
310
+ return papers
311
+ elif engine == "openalex":
312
+ import pyalex
313
+ from pyalex import Work, Works
314
+ mail = os.environ.get("OPENALEX_MAIL_ADDRESS", None)
315
+ if mail is None:
316
+ print("[WARNING] Please set OPENALEX_MAIL_ADDRESS for better access to OpenAlex API!")
317
+ else:
318
+ pyalex.config.email = mail
319
+
320
+ def extract_info_from_work(work: Work, max_abstract_length: int = 1000) -> dict[str, str]:
321
+ # "Unknown" is returned when venue is unknown...
322
+ venue = "Unknown"
323
+ for i, location in enumerate(work["locations"]):
324
+ if location["source"] is not None:
325
+ venue = location["source"]["display_name"]
326
+ if venue != "":
327
+ break
328
+ title = work["title"]
329
+ abstract = work["abstract"]
330
+ if abstract is None:
331
+ abstract = ""
332
+ if len(abstract) > max_abstract_length:
333
+ # To avoid context length exceed error.
334
+ print(f"[WARNING] {title=}: {len(abstract)=} is too long! Use first {max_abstract_length} chars.")
335
+ abstract = abstract[:max_abstract_length]
336
+ authors_list = [author["author"]["display_name"] for author in work["authorships"]]
337
+ authors = " and ".join(authors_list) if len(authors_list) < 20 else f"{authors_list[0]} et al."
338
+ paper = dict(
339
+ title=title,
340
+ authors=authors,
341
+ venue=venue,
342
+ year=work["publication_year"],
343
+ abstract=abstract,
344
+ citationCount=work["cited_by_count"],
345
+ )
346
+ return paper
347
+
348
+ works: List[Dict] = Works().search(query).get(per_page=result_limit)
349
+ papers: List[Dict[str, str]] = [extract_info_from_work(work) for work in works]
350
+ return papers
351
+ else:
352
+ raise NotImplementedError(f"{engine=} not supported!")
353
+
354
+
355
+
356
+ novelty_system_msg = """You are an ambitious AI PhD student who is looking to publish a paper that will contribute significantly to the field.
357
+ You have an idea and you want to check if it is novel or not. I.e., not overlapping significantly with existing literature or already well explored.
358
+ Be a harsh critic for novelty, ensure there is a sufficient contribution in the idea for a new conference or workshop paper.
359
+ You will be given access to the Semantic Scholar API, which you may use to survey the literature and find relevant papers to help you make your decision.
360
+ The top 10 results for any search query will be presented to you with the abstracts.
361
+
362
+ You will be given {num_rounds} rounds to decide on the paper, but you do not need to use them all.
363
+ At any round, you may exit early and decide on the novelty of the idea.
364
+ Decide a paper idea is novel if, after sufficient searching, you have not found a paper that significantly overlaps with your idea.
365
+ Decide a paper idea is not novel if you have found a paper that significantly overlaps with your idea.
366
+
367
+ {task_description}
368
+ <experiment.py>
369
+ {code}
370
+ </experiment.py>
371
+ """
372
+
373
+ novelty_prompt = '''Round {current_round}/{num_rounds}.
374
+ You have this idea:
375
+
376
+ """
377
+ {idea}
378
+ """
379
+
380
+ The results of the last query are (empty on first round):
381
+ """
382
+ {last_query_results}
383
+ """
384
+
385
+ Respond in the following format:
386
+
387
+ THOUGHT:
388
+ <THOUGHT>
389
+
390
+ RESPONSE:
391
+ ```json
392
+ <JSON>
393
+ ```
394
+
395
+ In <THOUGHT>, first briefly reason over the idea and identify any query that could help you make your decision.
396
+ If you have made your decision, add "Decision made: novel." or "Decision made: not novel." to your thoughts.
397
+
398
+ In <JSON>, respond in JSON format with ONLY the following field:
399
+ - "Query": An optional search query to search the literature (e.g. attention is all you need). You must make a query if you have not decided this round.
400
+
401
+ A query will work best if you are able to recall the exact name of the paper you are looking for, or the authors.
402
+ This JSON will be automatically parsed, so ensure the format is precise.'''
403
+
404
+
405
+ def check_idea_novelty(
406
+ ideas,
407
+ base_dir,
408
+ client,
409
+ model,
410
+ max_num_iterations=10,
411
+ engine="semanticscholar",
412
+ ):
413
+ with open(osp.join(base_dir, "experiment.py"), "r") as f:
414
+ code = f.read()
415
+ with open(osp.join(base_dir, "prompt.json"), "r") as f:
416
+ prompt = json.load(f)
417
+ task_description = prompt["task_description"]
418
+
419
+ for idx, idea in enumerate(ideas):
420
+ if "novel" in idea:
421
+ print(f"Skipping idea {idx}, already checked.")
422
+ continue
423
+
424
+ print(f"\nChecking novelty of idea {idx}: {idea['Name']}")
425
+
426
+ novel = False
427
+ msg_history = []
428
+ papers_str = ""
429
+
430
+ for j in range(max_num_iterations):
431
+ try:
432
+ text, msg_history = get_response_from_llm(
433
+ novelty_prompt.format(
434
+ current_round=j + 1,
435
+ num_rounds=max_num_iterations,
436
+ idea=idea,
437
+ last_query_results=papers_str,
438
+ ),
439
+ client=client,
440
+ model=model,
441
+ system_message=novelty_system_msg.format(
442
+ num_rounds=max_num_iterations,
443
+ task_description=task_description,
444
+ code=code,
445
+ ),
446
+ msg_history=msg_history,
447
+ )
448
+ if "decision made: novel" in text.lower():
449
+ print("Decision made: novel after round", j)
450
+ novel = True
451
+ break
452
+ if "decision made: not novel" in text.lower():
453
+ print("Decision made: not novel after round", j)
454
+ break
455
+
456
+ ## PARSE OUTPUT
457
+ json_output = extract_json_between_markers(text)
458
+ assert json_output is not None, "Failed to extract JSON from LLM output"
459
+
460
+ ## SEARCH FOR PAPERS
461
+ query = json_output["Query"]
462
+ papers = search_for_papers(query, result_limit=10, engine=engine)
463
+ if papers is None:
464
+ papers_str = "No papers found."
465
+
466
+ paper_strings = []
467
+ for i, paper in enumerate(papers):
468
+ paper_strings.append(
469
+ """{i}: {title}. {authors}. {venue}, {year}.\nNumber of citations: {cites}\nAbstract: {abstract}""".format(
470
+ i=i,
471
+ title=paper["title"],
472
+ authors=paper["authors"],
473
+ venue=paper["venue"],
474
+ year=paper["year"],
475
+ cites=paper["citationCount"],
476
+ abstract=paper["abstract"],
477
+ )
478
+ )
479
+ papers_str = "\n\n".join(paper_strings)
480
+
481
+ except Exception as e:
482
+ print(f"Error: {e}")
483
+ continue
484
+
485
+ idea["novel"] = novel
486
+
487
+ # Save results to JSON file
488
+ results_file = osp.join(base_dir, "ideas.json")
489
+ with open(results_file, "w") as f:
490
+ json.dump(ideas, f, indent=4)
491
+
492
+ return ideas
493
+
494
+
495
+ if __name__ == "__main__":
496
+ MAX_NUM_GENERATIONS = 32
497
+ NUM_REFLECTIONS = 5
498
+ import argparse
499
+
500
+ parser = argparse.ArgumentParser(description="Generate AI scientist ideas")
501
+ # add type of experiment (nanoGPT, Boston, etc.)
502
+ parser.add_argument(
503
+ "--experiment",
504
+ type=str,
505
+ default="nanoGPT",
506
+ help="Experiment to run AI Scientist on.",
507
+ )
508
+ parser.add_argument(
509
+ "--model",
510
+ type=str,
511
+ default="gpt-4o-2024-05-13",
512
+ choices=AVAILABLE_LLMS,
513
+ help="Model to use for AI Scientist.",
514
+ )
515
+ parser.add_argument(
516
+ "--skip-idea-generation",
517
+ action="store_true",
518
+ help="Skip idea generation and use existing ideas.",
519
+ )
520
+ parser.add_argument(
521
+ "--check-novelty",
522
+ action="store_true",
523
+ help="Check novelty of ideas.",
524
+ )
525
+ args = parser.parse_args()
526
+
527
+ # Create client
528
+ client, client_model = create_client(args.model)
529
+
530
+ base_dir = osp.join("templates", args.experiment)
531
+ results_dir = osp.join("results", args.experiment)
532
+ ideas = generate_ideas(
533
+ base_dir,
534
+ client=client,
535
+ model=client_model,
536
+ skip_generation=args.skip_idea_generation,
537
+ max_num_generations=MAX_NUM_GENERATIONS,
538
+ num_reflections=NUM_REFLECTIONS,
539
+ )
540
+ if args.check_novelty:
541
+ ideas = check_idea_novelty(
542
+ ideas,
543
+ base_dir=base_dir,
544
+ client=client,
545
+ model=client_model,
546
+ )
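
For anyone adapting this module to a new template, the sketch below illustrates the two JSON files that `generate_ideas` reads from `base_dir` (an `experiment.py` must also be present there). The directory name and all field values are illustrative assumptions; only the keys mirror what the code above actually accesses.

```python
# A minimal sketch of the template files generate_ideas() expects.
# The path and the concrete values are hypothetical; only the keys
# ("system", "task_description", and the idea fields listed in
# idea_first_prompt) come from the code above.
import json
import os

base_dir = "templates/my_template"  # hypothetical template directory
os.makedirs(base_dir, exist_ok=True)

prompt = {
    "system": "You are an ambitious AI researcher proposing experiments.",
    "task_description": "Propose experiments that build on the provided training script.",
}
seed_ideas = [
    {
        "Name": "baseline_sweep",
        "Title": "A Simple Baseline Hyperparameter Sweep",
        "Experiment": "Sweep the learning rate in experiment.py and compare final losses.",
        "Interestingness": 4,
        "Feasibility": 9,
        "Novelty": 2,
    }
]

with open(os.path.join(base_dir, "prompt.json"), "w") as f:
    json.dump(prompt, f, indent=4)
with open(os.path.join(base_dir, "seed_ideas.json"), "w") as f:
    json.dump(seed_ideas, f, indent=4)
```
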
ai_scientist/llm.py ADDED
@@ -0,0 +1,351 @@
1
+ import json
2
+ import os
3
+ import re
4
+
5
+ import anthropic
6
+ import backoff
7
+ import openai
8
+ import google.generativeai as genai
9
+ from google.generativeai.types import GenerationConfig
10
+
11
+ MAX_NUM_TOKENS = 4096
12
+
13
+ AVAILABLE_LLMS = [
14
+ # Anthropic models
15
+ "claude-3-5-sonnet-20240620",
16
+ "claude-3-5-sonnet-20241022",
17
+ # OpenAI models
18
+ "gpt-4o-mini",
19
+ "gpt-4o-mini-2024-07-18",
20
+ "gpt-4o",
21
+ "gpt-4o-2024-05-13",
22
+ "gpt-4o-2024-08-06",
23
+ "gpt-4.1",
24
+ "gpt-4.1-2025-04-14",
25
+ "gpt-4.1-mini",
26
+ "gpt-4.1-mini-2025-04-14",
27
+ "gpt-4.1-nano",
28
+ "gpt-4.1-nano-2025-04-14",
29
+ "o1",
30
+ "o1-2024-12-17",
31
+ "o1-preview-2024-09-12",
32
+ "o1-mini",
33
+ "o1-mini-2024-09-12",
34
+ "o3-mini",
35
+ "o3-mini-2025-01-31",
36
+ # OpenRouter models
37
+ "llama3.1-405b",
38
+ # Anthropic Claude models via Amazon Bedrock
39
+ "bedrock/anthropic.claude-3-sonnet-20240229-v1:0",
40
+ "bedrock/anthropic.claude-3-5-sonnet-20240620-v1:0",
41
+ "bedrock/anthropic.claude-3-5-sonnet-20241022-v2:0",
42
+ "bedrock/anthropic.claude-3-haiku-20240307-v1:0",
43
+ "bedrock/anthropic.claude-3-opus-20240229-v1:0",
44
+ # Anthropic Claude models Vertex AI
45
+ "vertex_ai/claude-3-opus@20240229",
46
+ "vertex_ai/claude-3-5-sonnet@20240620",
47
+ "vertex_ai/claude-3-5-sonnet-v2@20241022",
48
+ "vertex_ai/claude-3-sonnet@20240229",
49
+ "vertex_ai/claude-3-haiku@20240307",
50
+ # DeepSeek models
51
+ "deepseek-chat",
52
+ "deepseek-coder",
53
+ "deepseek-reasoner",
54
+ # Google Gemini models
55
+ "gemini-1.5-flash",
56
+ "gemini-1.5-pro",
57
+ "gemini-2.0-flash",
58
+ "gemini-2.0-flash-lite",
59
+ "gemini-2.0-flash-thinking-exp-01-21",
60
+ "gemini-2.5-pro-preview-03-25",
61
+ "gemini-2.5-pro-exp-03-25",
62
+ ]
63
+
64
+
65
+ # Get N responses from a single message, used for ensembling.
66
+ @backoff.on_exception(backoff.expo, (openai.RateLimitError, openai.APITimeoutError))
67
+ def get_batch_responses_from_llm(
68
+ msg,
69
+ client,
70
+ model,
71
+ system_message,
72
+ print_debug=False,
73
+ msg_history=None,
74
+ temperature=0.75,
75
+ n_responses=1,
76
+ ):
77
+ if msg_history is None:
78
+ msg_history = []
79
+
80
+ if 'gpt' in model:
81
+ new_msg_history = msg_history + [{"role": "user", "content": msg}]
82
+ response = client.chat.completions.create(
83
+ model=model,
84
+ messages=[
85
+ {"role": "system", "content": system_message},
86
+ *new_msg_history,
87
+ ],
88
+ temperature=temperature,
89
+ max_tokens=MAX_NUM_TOKENS,
90
+ n=n_responses,
91
+ stop=None,
92
+ seed=0,
93
+ )
94
+ content = [r.message.content for r in response.choices]
95
+ new_msg_history = [
96
+ new_msg_history + [{"role": "assistant", "content": c}] for c in content
97
+ ]
98
+ elif model == "llama-3-1-405b-instruct":
99
+ new_msg_history = msg_history + [{"role": "user", "content": msg}]
100
+ response = client.chat.completions.create(
101
+ model="meta-llama/llama-3.1-405b-instruct",
102
+ messages=[
103
+ {"role": "system", "content": system_message},
104
+ *new_msg_history,
105
+ ],
106
+ temperature=temperature,
107
+ max_tokens=MAX_NUM_TOKENS,
108
+ n=n_responses,
109
+ stop=None,
110
+ )
111
+ content = [r.message.content for r in response.choices]
112
+ new_msg_history = [
113
+ new_msg_history + [{"role": "assistant", "content": c}] for c in content
114
+ ]
115
+ else:
116
+ content, new_msg_history = [], []
117
+ for _ in range(n_responses):
118
+ c, hist = get_response_from_llm(
119
+ msg,
120
+ client,
121
+ model,
122
+ system_message,
123
+ print_debug=False,
124
+ msg_history=None,
125
+ temperature=temperature,
126
+ )
127
+ content.append(c)
128
+ new_msg_history.append(hist)
129
+
130
+ if print_debug:
131
+ print()
132
+ print("*" * 20 + " LLM START " + "*" * 20)
133
+ for j, msg in enumerate(new_msg_history[0]):
134
+ print(f'{j}, {msg["role"]}: {msg["content"]}')
135
+ print(content)
136
+ print("*" * 21 + " LLM END " + "*" * 21)
137
+ print()
138
+
139
+ return content, new_msg_history
140
+
141
+
142
+ @backoff.on_exception(backoff.expo, (openai.RateLimitError, openai.APITimeoutError))
143
+ def get_response_from_llm(
144
+ msg,
145
+ client,
146
+ model,
147
+ system_message,
148
+ print_debug=False,
149
+ msg_history=None,
150
+ temperature=0.75,
151
+ ):
152
+ if msg_history is None:
153
+ msg_history = []
154
+
155
+ if "claude" in model:
156
+ new_msg_history = msg_history + [
157
+ {
158
+ "role": "user",
159
+ "content": [
160
+ {
161
+ "type": "text",
162
+ "text": msg,
163
+ }
164
+ ],
165
+ }
166
+ ]
167
+ response = client.messages.create(
168
+ model=model,
169
+ max_tokens=MAX_NUM_TOKENS,
170
+ temperature=temperature,
171
+ system=system_message,
172
+ messages=new_msg_history,
173
+ )
174
+ content = response.content[0].text
175
+ new_msg_history = new_msg_history + [
176
+ {
177
+ "role": "assistant",
178
+ "content": [
179
+ {
180
+ "type": "text",
181
+ "text": content,
182
+ }
183
+ ],
184
+ }
185
+ ]
186
+ elif 'gpt' in model:
187
+ new_msg_history = msg_history + [{"role": "user", "content": msg}]
188
+ response = client.chat.completions.create(
189
+ model=model,
190
+ messages=[
191
+ {"role": "system", "content": system_message},
192
+ *new_msg_history,
193
+ ],
194
+ temperature=temperature,
195
+ max_tokens=MAX_NUM_TOKENS,
196
+ n=1,
197
+ stop=None,
198
+ seed=0,
199
+ )
200
+ content = response.choices[0].message.content
201
+ new_msg_history = new_msg_history + [{"role": "assistant", "content": content}]
202
+ elif "o1" in model or "o3" in model:
203
+ new_msg_history = msg_history + [{"role": "user", "content": msg}]
204
+ response = client.chat.completions.create(
205
+ model=model,
206
+ messages=[
207
+ {"role": "user", "content": system_message},
208
+ *new_msg_history,
209
+ ],
210
+ temperature=1,
211
+ max_completion_tokens=MAX_NUM_TOKENS,
212
+ n=1,
213
+ seed=0,
214
+ )
215
+ content = response.choices[0].message.content
216
+ new_msg_history = new_msg_history + [{"role": "assistant", "content": content}]
217
+ elif model in ["meta-llama/llama-3.1-405b-instruct", "llama-3-1-405b-instruct"]:
218
+ new_msg_history = msg_history + [{"role": "user", "content": msg}]
219
+ response = client.chat.completions.create(
220
+ model="meta-llama/llama-3.1-405b-instruct",
221
+ messages=[
222
+ {"role": "system", "content": system_message},
223
+ *new_msg_history,
224
+ ],
225
+ temperature=temperature,
226
+ max_tokens=MAX_NUM_TOKENS,
227
+ n=1,
228
+ stop=None,
229
+ )
230
+ content = response.choices[0].message.content
231
+ new_msg_history = new_msg_history + [{"role": "assistant", "content": content}]
232
+ elif model in ["deepseek-chat", "deepseek-coder"]:
233
+ new_msg_history = msg_history + [{"role": "user", "content": msg}]
234
+ response = client.chat.completions.create(
235
+ model=model,
236
+ messages=[
237
+ {"role": "system", "content": system_message},
238
+ *new_msg_history,
239
+ ],
240
+ temperature=temperature,
241
+ max_tokens=MAX_NUM_TOKENS,
242
+ n=1,
243
+ stop=None,
244
+ )
245
+ content = response.choices[0].message.content
246
+ new_msg_history = new_msg_history + [{"role": "assistant", "content": content}]
247
+ elif model in ["deepseek-reasoner"]:
248
+ new_msg_history = msg_history + [{"role": "user", "content": msg}]
249
+ response = client.chat.completions.create(
250
+ model=model,
251
+ messages=[
252
+ {"role": "system", "content": system_message},
253
+ *new_msg_history,
254
+ ],
255
+ n=1,
256
+ stop=None,
257
+ )
258
+ content = response.choices[0].message.content
259
+ new_msg_history = new_msg_history + [{"role": "assistant", "content": content}]
260
+ elif "gemini" in model:
261
+ new_msg_history = msg_history + [{"role": "user", "content": msg}]
262
+ response = client.chat.completions.create(
263
+ model=model,
264
+ messages=[
265
+ {"role": "system", "content": system_message},
266
+ *new_msg_history,
267
+ ],
268
+ temperature=temperature,
269
+ max_tokens=MAX_NUM_TOKENS,
270
+ n=1,
271
+ )
272
+ content = response.choices[0].message.content
273
+ new_msg_history = new_msg_history + [{"role": "assistant", "content": content}]
274
+ else:
275
+ raise ValueError(f"Model {model} not supported.")
276
+
277
+ if print_debug:
278
+ print()
279
+ print("*" * 20 + " LLM START " + "*" * 20)
280
+ for j, msg in enumerate(new_msg_history):
281
+ print(f'{j}, {msg["role"]}: {msg["content"]}')
282
+ print(content)
283
+ print("*" * 21 + " LLM END " + "*" * 21)
284
+ print()
285
+
286
+ return content, new_msg_history
287
+
288
+
289
+ def extract_json_between_markers(llm_output):
290
+ # Regular expression pattern to find JSON content between ```json and ```
291
+ json_pattern = r"```json(.*?)```"
292
+ matches = re.findall(json_pattern, llm_output, re.DOTALL)
293
+
294
+ if not matches:
295
+ # Fallback: Try to find any JSON-like content in the output
296
+ json_pattern = r"\{.*?\}"
297
+ matches = re.findall(json_pattern, llm_output, re.DOTALL)
298
+
299
+ for json_string in matches:
300
+ json_string = json_string.strip()
301
+ try:
302
+ parsed_json = json.loads(json_string)
303
+ return parsed_json
304
+ except json.JSONDecodeError:
305
+ # Attempt to fix common JSON issues
306
+ try:
307
+ # Remove invalid control characters
308
+ json_string_clean = re.sub(r"[\x00-\x1F\x7F]", "", json_string)
309
+ parsed_json = json.loads(json_string_clean)
310
+ return parsed_json
311
+ except json.JSONDecodeError:
312
+ continue # Try next match
313
+
314
+ return None # No valid JSON found
315
+
316
+
317
+ def create_client(model):
318
+ if model.startswith("claude-"):
319
+ print(f"Using Anthropic API with model {model}.")
320
+ return anthropic.Anthropic(), model
321
+ elif model.startswith("bedrock") and "claude" in model:
322
+ client_model = model.split("/")[-1]
323
+ print(f"Using Amazon Bedrock with model {client_model}.")
324
+ return anthropic.AnthropicBedrock(), client_model
325
+ elif model.startswith("vertex_ai") and "claude" in model:
326
+ client_model = model.split("/")[-1]
327
+ print(f"Using Vertex AI with model {client_model}.")
328
+ return anthropic.AnthropicVertex(), client_model
329
+ elif 'gpt' in model or "o1" in model or "o3" in model:
330
+ print(f"Using OpenAI API with model {model}.")
331
+ return openai.OpenAI(), model
332
+ elif model in ["deepseek-chat", "deepseek-reasoner", "deepseek-coder"]:
333
+ print(f"Using OpenAI API with {model}.")
334
+ return openai.OpenAI(
335
+ api_key=os.environ["DEEPSEEK_API_KEY"],
336
+ base_url="https://api.deepseek.com"
337
+ ), model
338
+ elif model == "llama3.1-405b":
339
+ print(f"Using OpenAI API with {model}.")
340
+ return openai.OpenAI(
341
+ api_key=os.environ["OPENROUTER_API_KEY"],
342
+ base_url="https://openrouter.ai/api/v1"
343
+ ), "meta-llama/llama-3.1-405b-instruct"
344
+ elif "gemini" in model:
345
+ print(f"Using OpenAI API with {model}.")
346
+ return openai.OpenAI(
347
+ api_key=os.environ["GEMINI_API_KEY"],
348
+ base_url="https://generativelanguage.googleapis.com/v1beta/openai/"
349
+ ), model
350
+ else:
351
+ raise ValueError(f"Model {model} not supported.")
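
A rough usage sketch for this module, assuming the relevant API key (here `OPENAI_API_KEY`) is set and the chosen model is available to your account; the prompt text and model name are placeholders:

```python
# Illustrative only: the model name and prompt are placeholders, and an
# OPENAI_API_KEY is assumed to be configured in the environment.
from ai_scientist.llm import create_client, get_response_from_llm, extract_json_between_markers

client, client_model = create_client("gpt-4o-2024-05-13")

text, msg_history = get_response_from_llm(
    'Reply with a JSON object containing a single key "ok" set to true, inside a json code fence.',
    client=client,
    model=client_model,
    system_message="You are a helpful assistant.",
    msg_history=[],
)
parsed = extract_json_between_markers(text)  # dict on success, None if no valid JSON was found
print(parsed)
```
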
ai_scientist/perform_experiments.py ADDED
@@ -0,0 +1,166 @@
1
+ import json
2
+ import os.path as osp
3
+ import shutil
4
+ import subprocess
5
+ import sys
6
+ from subprocess import TimeoutExpired
7
+
8
+ MAX_ITERS = 4
9
+ MAX_RUNS = 5
10
+ MAX_STDERR_OUTPUT = 1500
11
+
12
+ coder_prompt = """Your goal is to implement the following idea: {title}.
13
+ The proposed experiment is as follows: {idea}.
14
+ You are given a total of up to {max_runs} runs to complete the necessary experiments. You do not need to use all {max_runs}.
15
+
16
+ First, plan the list of experiments you would like to run. For example, if you are sweeping over a specific hyperparameter, plan each value you would like to test for each run.
17
+
18
+ Note that we already provide the vanilla baseline results, so you do not need to re-run it.
19
+
20
+ For reference, the baseline results are as follows:
21
+
22
+ {baseline_results}
23
+
24
+ After you complete each change, we will run the command `python experiment.py --out_dir=run_i` where i is the run number and evaluate the results.
25
+ YOUR PROPOSED CHANGE MUST USE THIS COMMAND FORMAT, DO NOT ADD ADDITIONAL COMMAND LINE ARGS.
26
+ You can then implement the next thing on your list."""
27
+
28
+
29
+ # RUN EXPERIMENT
30
+ def run_experiment(folder_name, run_num, timeout=7200):
31
+ cwd = osp.abspath(folder_name)
32
+ # COPY CODE SO WE CAN SEE IT.
33
+ shutil.copy(
34
+ osp.join(folder_name, "experiment.py"),
35
+ osp.join(folder_name, f"run_{run_num}.py"),
36
+ )
37
+
38
+ # LAUNCH COMMAND
39
+ command = [
40
+ "python",
41
+ "experiment.py",
42
+ f"--out_dir=run_{run_num}",
43
+ ]
44
+ try:
45
+ result = subprocess.run(
46
+ command, cwd=cwd, stderr=subprocess.PIPE, text=True, timeout=timeout
47
+ )
48
+
49
+ if result.stderr:
50
+ print(result.stderr, file=sys.stderr)
51
+
52
+ if result.returncode != 0:
53
+ print(f"Run {run_num} failed with return code {result.returncode}")
54
+ if osp.exists(osp.join(cwd, f"run_{run_num}")):
55
+ shutil.rmtree(osp.join(cwd, f"run_{run_num}"))
56
+ print(f"Run failed with the following error {result.stderr}")
57
+ stderr_output = result.stderr
58
+ if len(stderr_output) > MAX_STDERR_OUTPUT:
59
+ stderr_output = "..." + stderr_output[-MAX_STDERR_OUTPUT:]
60
+ next_prompt = f"Run failed with the following error {stderr_output}"
61
+ else:
62
+ with open(osp.join(cwd, f"run_{run_num}", "final_info.json"), "r") as f:
63
+ results = json.load(f)
64
+ results = {k: v["means"] for k, v in results.items()}
65
+
66
+ next_prompt = f"""Run {run_num} completed. Here are the results:
67
+ {results}
68
+
69
+ Decide if you need to re-plan your experiments given the result (you often will not need to).
70
+
71
+ Someone else will be using `notes.txt` to perform a writeup on this in the future.
72
+ Please include *all* relevant information for the writeup on Run {run_num}, including an experiment description and the run number. Be as verbose as necessary.
73
+
74
+ Then, implement the next thing on your list.
75
+ We will then run the command `python experiment.py --out_dir=run_{run_num + 1}`.
76
+ YOUR PROPOSED CHANGE MUST USE THIS COMMAND FORMAT, DO NOT ADD ADDITIONAL COMMAND LINE ARGS.
77
+ If you are finished with experiments, respond with 'ALL_COMPLETED'."""
78
+ return result.returncode, next_prompt
79
+ except TimeoutExpired:
80
+ print(f"Run {run_num} timed out after {timeout} seconds")
81
+ if osp.exists(osp.join(cwd, f"run_{run_num}")):
82
+ shutil.rmtree(osp.join(cwd, f"run_{run_num}"))
83
+ next_prompt = f"Run timed out after {timeout} seconds"
84
+ return 1, next_prompt
85
+
86
+
87
+ # RUN PLOTTING
88
+ def run_plotting(folder_name, timeout=600):
89
+ cwd = osp.abspath(folder_name)
90
+ # LAUNCH COMMAND
91
+ command = [
92
+ "python",
93
+ "plot.py",
94
+ ]
95
+ try:
96
+ result = subprocess.run(
97
+ command, cwd=cwd, stderr=subprocess.PIPE, text=True, timeout=timeout
98
+ )
99
+
100
+ if result.stderr:
101
+ print(result.stderr, file=sys.stderr)
102
+
103
+ if result.returncode != 0:
104
+ print(f"Plotting failed with return code {result.returncode}")
105
+ next_prompt = f"Plotting failed with the following error {result.stderr}"
106
+ else:
107
+ next_prompt = ""
108
+ return result.returncode, next_prompt
109
+ except TimeoutExpired:
110
+ print(f"Plotting timed out after {timeout} seconds")
111
+ next_prompt = f"Plotting timed out after {timeout} seconds"
112
+ return 1, next_prompt
113
+
114
+
115
+ # PERFORM EXPERIMENTS
116
+ def perform_experiments(idea, folder_name, coder, baseline_results) -> bool:
117
+ ## RUN EXPERIMENT
118
+ current_iter = 0
119
+ run = 1
120
+ next_prompt = coder_prompt.format(
121
+ title=idea["Title"],
122
+ idea=idea["Experiment"],
123
+ max_runs=MAX_RUNS,
124
+ baseline_results=baseline_results,
125
+ )
126
+ while run < MAX_RUNS + 1:
127
+ if current_iter >= MAX_ITERS:
128
+ print("Max iterations reached")
129
+ break
130
+ coder_out = coder.run(next_prompt)
131
+ print(coder_out)
132
+ if "ALL_COMPLETED" in coder_out:
133
+ break
134
+ return_code, next_prompt = run_experiment(folder_name, run)
135
+ if return_code == 0:
136
+ run += 1
137
+ current_iter = 0
138
+ current_iter += 1
139
+ if current_iter >= MAX_ITERS:
140
+ print("Not all experiments completed.")
141
+ return False
142
+
143
+ current_iter = 0
144
+ next_prompt = """
145
+ Great job! Please modify `plot.py` to generate the most relevant plots for the final writeup.
146
+
147
+ In particular, be sure to fill in the "labels" dictionary with the correct names for each run that you want to plot.
148
+
149
+ Only the runs in the `labels` dictionary will be plotted, so make sure to include all relevant runs.
150
+
151
+ We will be running the command `python plot.py` to generate the plots.
152
+ """
153
+ while True:
154
+ _ = coder.run(next_prompt)
155
+ return_code, next_prompt = run_plotting(folder_name)
156
+ current_iter += 1
157
+ if return_code == 0 or current_iter >= MAX_ITERS:
158
+ break
159
+ next_prompt = """
160
+ Please modify `notes.txt` with a description of what each plot shows along with the filename of the figure. Please do so in-depth.
161
+
162
+ Somebody else will be using `notes.txt` to write a report on this in the future.
163
+ """
164
+ coder.run(next_prompt)
165
+
166
+ return True
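
Note that `perform_experiments` only assumes the `coder` object exposes a `run(prompt) -> str` method; in the full pipeline this is an aider coder, but any stand-in with that interface works. A rough sketch under that assumption (the folder and idea below are hypothetical, and the folder must already contain `experiment.py`, `plot.py`, and `notes.txt`):

```python
# Illustrative stand-in for the aider coder; it only demonstrates the
# run(prompt) contract that perform_experiments() relies on.
from ai_scientist.perform_experiments import perform_experiments

class EchoCoder:
    def run(self, prompt: str) -> str:
        print(prompt[:200])        # show the start of what the coder is asked to do
        return "ALL_COMPLETED"     # immediately signal that no code changes are needed

idea = {"Title": "Baseline sweep", "Experiment": "Sweep the learning rate."}  # hypothetical
success = perform_experiments(idea, "templates/my_template", EchoCoder(), baseline_results={})
print("finished:", success)
```
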
ai_scientist/perform_review.py ADDED
@@ -0,0 +1,395 @@
1
+ import os
2
+ import numpy as np
3
+ import json
4
+ from pypdf import PdfReader
5
+ import pymupdf
6
+ import pymupdf4llm
7
+ from ai_scientist.llm import (
8
+ get_response_from_llm,
9
+ get_batch_responses_from_llm,
10
+ extract_json_between_markers,
11
+ )
12
+
13
+ reviewer_system_prompt_base = (
14
+ "You are an AI researcher who is reviewing a paper that was submitted to a prestigious ML venue."
15
+ " Be critical and cautious in your decision."
16
+ )
17
+
18
+ reviewer_system_prompt_neg = (
19
+ reviewer_system_prompt_base
20
+ + " If a paper is bad or you are unsure, give it bad scores and reject it."
21
+ )
22
+ reviewer_system_prompt_pos = (
23
+ reviewer_system_prompt_base
24
+ + " If a paper is good or you are unsure, give it good scores and accept it."
25
+ )
26
+
27
+ template_instructions = """
28
+ Respond in the following format:
29
+
30
+ THOUGHT:
31
+ <THOUGHT>
32
+
33
+ REVIEW JSON:
34
+ ```json
35
+ <JSON>
36
+ ```
37
+
38
+ In <THOUGHT>, first briefly discuss your intuitions and reasoning for the evaluation.
39
+ Detail your high-level arguments, necessary choices and desired outcomes of the review.
40
+ Do not make generic comments here, but be specific to your current paper.
41
+ Treat this as the note-taking phase of your review.
42
+
43
+ In <JSON>, provide the review in JSON format with the following fields in the order:
44
+ - "Summary": A summary of the paper content and its contributions.
45
+ - "Strengths": A list of strengths of the paper.
46
+ - "Weaknesses": A list of weaknesses of the paper.
47
+ - "Originality": A rating from 1 to 4 (low, medium, high, very high).
48
+ - "Quality": A rating from 1 to 4 (low, medium, high, very high).
49
+ - "Clarity": A rating from 1 to 4 (low, medium, high, very high).
50
+ - "Significance": A rating from 1 to 4 (low, medium, high, very high).
51
+ - "Questions": A set of clarifying questions to be answered by the paper authors.
52
+ - "Limitations": A set of limitations and potential negative societal impacts of the work.
53
+ - "Ethical Concerns": A boolean value indicating whether there are ethical concerns.
54
+ - "Soundness": A rating from 1 to 4 (poor, fair, good, excellent).
55
+ - "Presentation": A rating from 1 to 4 (poor, fair, good, excellent).
56
+ - "Contribution": A rating from 1 to 4 (poor, fair, good, excellent).
57
+ - "Overall": A rating from 1 to 10 (very strong reject to award quality).
58
+ - "Confidence": A rating from 1 to 5 (low, medium, high, very high, absolute).
59
+ - "Decision": A decision that has to be one of the following: Accept, Reject.
60
+
61
+ For the "Decision" field, don't use Weak Accept, Borderline Accept, Borderline Reject, or Strong Reject. Instead, only use Accept or Reject.
62
+ This JSON will be automatically parsed, so ensure the format is precise.
63
+ """
64
+
65
+ neurips_form = (
66
+ """
67
+ ## Review Form
68
+ Below is a description of the questions you will be asked on the review form for each paper and some guidelines on what to consider when answering these questions.
69
+ When writing your review, please keep in mind that after decisions have been made, reviews and meta-reviews of accepted papers and opted-in rejected papers will be made public.
70
+
71
+ 1. Summary: Briefly summarize the paper and its contributions. This is not the place to critique the paper; the authors should generally agree with a well-written summary.
72
+ - Strengths and Weaknesses: Please provide a thorough assessment of the strengths and weaknesses of the paper, touching on each of the following dimensions:
73
+ - Originality: Are the tasks or methods new? Is the work a novel combination of well-known techniques? (This can be valuable!) Is it clear how this work differs from previous contributions? Is related work adequately cited?
74
+ - Quality: Is the submission technically sound? Are claims well supported (e.g., by theoretical analysis or experimental results)? Are the methods used appropriate? Is this a complete piece of work or work in progress? Are the authors careful and honest about evaluating both the strengths and weaknesses of their work?
75
+ - Clarity: Is the submission clearly written? Is it well organized? (If not, please make constructive suggestions for improving its clarity.) Does it adequately inform the reader? (Note that a superbly written paper provides enough information for an expert reader to reproduce its results.)
76
+ - Significance: Are the results important? Are others (researchers or practitioners) likely to use the ideas or build on them? Does the submission address a difficult task in a better way than previous work? Does it advance the state of the art in a demonstrable way? Does it provide unique data, unique conclusions about existing data, or a unique theoretical or experimental approach?
77
+
78
+ 2. Questions: Please list up and carefully describe any questions and suggestions for the authors. Think of the things where a response from the author can change your opinion, clarify a confusion or address a limitation. This can be very important for a productive rebuttal and discussion phase with the authors.
79
+
80
+ 3. Limitations: Have the authors adequately addressed the limitations and potential negative societal impact of their work? If not, please include constructive suggestions for improvement.
81
+ In general, authors should be rewarded rather than punished for being up front about the limitations of their work and any potential negative societal impact. You are encouraged to think through whether any critical points are missing and provide these as feedback for the authors.
82
+
83
+ 4. Ethical concerns: If there are ethical issues with this paper, please flag the paper for an ethics review. For guidance on when this is appropriate, please review the NeurIPS ethics guidelines.
84
+
85
+ 5. Soundness: Please assign the paper a numerical rating on the following scale to indicate the soundness of the technical claims, experimental and research methodology and on whether the central claims of the paper are adequately supported with evidence.
86
+ 4: excellent
87
+ 3: good
88
+ 2: fair
89
+ 1: poor
90
+
91
+ 6. Presentation: Please assign the paper a numerical rating on the following scale to indicate the quality of the presentation. This should take into account the writing style and clarity, as well as contextualization relative to prior work.
92
+ 4: excellent
93
+ 3: good
94
+ 2: fair
95
+ 1: poor
96
+
97
+ 7. Contribution: Please assign the paper a numerical rating on the following scale to indicate the quality of the overall contribution this paper makes to the research area being studied. Are the questions being asked important? Does the paper bring a significant originality of ideas and/or execution? Are the results valuable to share with the broader NeurIPS community?
98
+ 4: excellent
99
+ 3: good
100
+ 2: fair
101
+ 1: poor
102
+
103
+ 8. Overall: Please provide an "overall score" for this submission. Choices:
104
+ 10: Award quality: Technically flawless paper with groundbreaking impact on one or more areas of AI, with exceptionally strong evaluation, reproducibility, and resources, and no unaddressed ethical considerations.
105
+ 9: Very Strong Accept: Technically flawless paper with groundbreaking impact on at least one area of AI and excellent impact on multiple areas of AI, with flawless evaluation, resources, and reproducibility, and no unaddressed ethical considerations.
106
+ 8: Strong Accept: Technically strong paper with novel ideas, excellent impact on at least one area of AI or high-to-excellent impact on multiple areas of AI, with excellent evaluation, resources, and reproducibility, and no unaddressed ethical considerations.
107
+ 7: Accept: Technically solid paper, with high impact on at least one sub-area of AI or moderate-to-high impact on more than one area of AI, with good-to-excellent evaluation, resources, reproducibility, and no unaddressed ethical considerations.
108
+ 6: Weak Accept: Technically solid, moderate-to-high impact paper, with no major concerns with respect to evaluation, resources, reproducibility, ethical considerations.
109
+ 5: Borderline accept: Technically solid paper where reasons to accept outweigh reasons to reject, e.g., limited evaluation. Please use sparingly.
110
+ 4: Borderline reject: Technically solid paper where reasons to reject, e.g., limited evaluation, outweigh reasons to accept, e.g., good evaluation. Please use sparingly.
111
+ 3: Reject: For instance, a paper with technical flaws, weak evaluation, inadequate reproducibility and incompletely addressed ethical considerations.
112
+ 2: Strong Reject: For instance, a paper with major technical flaws, and/or poor evaluation, limited impact, poor reproducibility and mostly unaddressed ethical considerations.
113
+ 1: Very Strong Reject: For instance, a paper with trivial results or unaddressed ethical considerations
114
+
115
+ 9. Confidence: Please provide a "confidence score" for your assessment of this submission to indicate how confident you are in your evaluation. Choices:
116
+ 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.
117
+ 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
118
+ 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
119
+ 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
120
+ 1: Your assessment is an educated guess. The submission is not in your area or the submission was difficult to understand. Math/other details were not carefully checked.
121
+ """
122
+ + template_instructions
123
+ )
124
+
125
+
126
+ def perform_review(
127
+ text,
128
+ model,
129
+ client,
130
+ num_reflections=1,
131
+ num_fs_examples=1,
132
+ num_reviews_ensemble=1,
133
+ temperature=0.75,
134
+ msg_history=None,
135
+ return_msg_history=False,
136
+ reviewer_system_prompt=reviewer_system_prompt_neg,
137
+ review_instruction_form=neurips_form,
138
+ ):
139
+ if num_fs_examples > 0:
140
+ fs_prompt = get_review_fewshot_examples(num_fs_examples)
141
+ base_prompt = review_instruction_form + fs_prompt
142
+ else:
143
+ base_prompt = review_instruction_form
144
+
145
+ base_prompt += f"""
146
+ Here is the paper you are asked to review:
147
+ ```
148
+ {text}
149
+ ```"""
150
+
151
+ if num_reviews_ensemble > 1:
152
+ llm_review, msg_histories = get_batch_responses_from_llm(
153
+ base_prompt,
154
+ model=model,
155
+ client=client,
156
+ system_message=reviewer_system_prompt,
157
+ print_debug=False,
158
+ msg_history=msg_history,
159
+ # Higher temperature to encourage diversity.
160
+ temperature=0.75,
161
+ n_responses=num_reviews_ensemble,
162
+ )
163
+ parsed_reviews = []
164
+ for idx, rev in enumerate(llm_review):
165
+ try:
166
+ parsed_reviews.append(extract_json_between_markers(rev))
167
+ except Exception as e:
168
+ print(f"Ensemble review {idx} failed: {e}")
169
+ parsed_reviews = [r for r in parsed_reviews if r is not None]
170
+ review = get_meta_review(model, client, temperature, parsed_reviews)
171
+
172
+ # take first valid in case meta-reviewer fails
173
+ if review is None:
174
+ review = parsed_reviews[0]
175
+
176
+ # Replace numerical scores with the average of the ensemble.
177
+ for score, limits in [
178
+ ("Originality", (1, 4)),
179
+ ("Quality", (1, 4)),
180
+ ("Clarity", (1, 4)),
181
+ ("Significance", (1, 4)),
182
+ ("Soundness", (1, 4)),
183
+ ("Presentation", (1, 4)),
184
+ ("Contribution", (1, 4)),
185
+ ("Overall", (1, 10)),
186
+ ("Confidence", (1, 5)),
187
+ ]:
188
+ scores = []
189
+ for r in parsed_reviews:
190
+ if score in r and limits[1] >= r[score] >= limits[0]:
191
+ scores.append(r[score])
192
+ review[score] = int(round(np.mean(scores)))
193
+
194
+ # Rewrite the message history with the valid one and new aggregated review.
195
+ msg_history = msg_histories[0][:-1]
196
+ msg_history += [
197
+ {
198
+ "role": "assistant",
199
+ "content": f"""
200
+ THOUGHT:
201
+ I will start by aggregating the opinions of {num_reviews_ensemble} reviewers that I previously obtained.
202
+
203
+ REVIEW JSON:
204
+ ```json
205
+ {json.dumps(review)}
206
+ ```
207
+ """,
208
+ }
209
+ ]
210
+ else:
211
+ llm_review, msg_history = get_response_from_llm(
212
+ base_prompt,
213
+ model=model,
214
+ client=client,
215
+ system_message=reviewer_system_prompt,
216
+ print_debug=False,
217
+ msg_history=msg_history,
218
+ temperature=temperature,
219
+ )
220
+ review = extract_json_between_markers(llm_review)
221
+
222
+ if num_reflections > 1:
223
+ for j in range(num_reflections - 1):
224
+ # print(f"Reflection: {j + 2}/{num_reflections}")
225
+ text, msg_history = get_response_from_llm(
226
+ reviewer_reflection_prompt,
227
+ client=client,
228
+ model=model,
229
+ system_message=reviewer_system_prompt,
230
+ msg_history=msg_history,
231
+ temperature=temperature,
232
+ )
233
+ review = extract_json_between_markers(text)
234
+ assert review is not None, "Failed to extract JSON from LLM output"
235
+
236
+ if "I am done" in text:
237
+ # print(f"Review generation converged after {j + 2} iterations.")
238
+ break
239
+
240
+ if return_msg_history:
241
+ return review, msg_history
242
+ else:
243
+ return review
244
+
245
+
246
+ reviewer_reflection_prompt = """Round {current_round}/{num_reflections}.
247
+ In your thoughts, first carefully consider the accuracy and soundness of the review you just created.
248
+ Include any other factors that you think are important in evaluating the paper.
249
+ Ensure the review is clear and concise, and the JSON is in the correct format.
250
+ Do not make things overly complicated.
251
+ In the next attempt, try to refine and improve your review.
252
+ Stick to the spirit of the original review unless there are glaring issues.
253
+
254
+ Respond in the same format as before:
255
+ THOUGHT:
256
+ <THOUGHT>
257
+
258
+ REVIEW JSON:
259
+ ```json
260
+ <JSON>
261
+ ```
262
+
263
+ If there is nothing to improve, simply repeat the previous JSON EXACTLY after the thought and include "I am done" at the end of the thoughts but before the JSON.
264
+ ONLY INCLUDE "I am done" IF YOU ARE MAKING NO MORE CHANGES."""
265
+
266
+
267
+ def load_paper(pdf_path, num_pages=None, min_size=100):
268
+ try:
269
+ if num_pages is None:
270
+ text = pymupdf4llm.to_markdown(pdf_path)
271
+ else:
272
+ reader = PdfReader(pdf_path)
273
+ min_pages = min(len(reader.pages), num_pages)
274
+ text = pymupdf4llm.to_markdown(pdf_path, pages=list(range(min_pages)))
275
+ if len(text) < min_size:
276
+ raise Exception("Text too short")
277
+ except Exception as e:
278
+ print(f"Error with pymupdf4llm, falling back to pymupdf: {e}")
279
+ try:
280
+ doc = pymupdf.open(pdf_path) # open a document
281
+ if num_pages:
282
+ doc = doc[:num_pages]
283
+ text = ""
284
+ for page in doc: # iterate the document pages
285
+ text = text + page.get_text() # get plain text encoded as UTF-8
286
+ if len(text) < min_size:
287
+ raise Exception("Text too short")
288
+ except Exception as e:
289
+ print(f"Error with pymupdf, falling back to pypdf: {e}")
290
+ reader = PdfReader(pdf_path)
291
+ if num_pages is None:
292
+ text = "".join(page.extract_text() for page in reader.pages)
293
+ else:
294
+ text = "".join(page.extract_text() for page in reader.pages[:num_pages])
295
+ if len(text) < min_size:
296
+ raise Exception("Text too short")
297
+
298
+ return text
299
+
300
+
301
+ def load_review(path):
302
+ with open(path, "r") as json_file:
303
+ loaded = json.load(json_file)
304
+ return loaded["review"]
305
+
306
+
307
+ # get directory of this file
308
+ dir_path = os.path.dirname(os.path.realpath(__file__))
309
+
310
+ fewshot_papers = [
311
+ os.path.join(dir_path, "fewshot_examples/132_automated_relational.pdf"),
312
+ os.path.join(dir_path, "fewshot_examples/attention.pdf"),
313
+ os.path.join(dir_path, "fewshot_examples/2_carpe_diem.pdf"),
314
+ ]
315
+
316
+ fewshot_reviews = [
317
+ os.path.join(dir_path, "fewshot_examples/132_automated_relational.json"),
318
+ os.path.join(dir_path, "fewshot_examples/attention.json"),
319
+ os.path.join(dir_path, "fewshot_examples/2_carpe_diem.json"),
320
+ ]
321
+
322
+
323
+ def get_review_fewshot_examples(num_fs_examples=1):
324
+ fewshot_prompt = """
325
+ Below are some sample reviews, copied from previous machine learning conferences.
326
+ Note that while each review is formatted differently according to each reviewer's style, the reviews are well-structured and therefore easy to navigate.
327
+ """
328
+ for paper, review in zip(
329
+ fewshot_papers[:num_fs_examples], fewshot_reviews[:num_fs_examples]
330
+ ):
331
+ txt_path = paper.replace(".pdf", ".txt")
332
+ if os.path.exists(txt_path):
333
+ with open(txt_path, "r") as f:
334
+ paper_text = f.read()
335
+ else:
336
+ paper_text = load_paper(paper)
337
+ review_text = load_review(review)
338
+ fewshot_prompt += f"""
339
+ Paper:
340
+
341
+ ```
342
+ {paper_text}
343
+ ```
344
+
345
+ Review:
346
+
347
+ ```
348
+ {review_text}
349
+ ```
350
+ """
351
+
352
+ return fewshot_prompt
353
+
354
+
355
+ meta_reviewer_system_prompt = """You are an Area Chair at a machine learning conference.
356
+ You are in charge of meta-reviewing a paper that was reviewed by {reviewer_count} reviewers.
357
+ Your job is to aggregate the reviews into a single meta-review in the same format.
358
+ Be critical and cautious in your decision, find consensus, and respect the opinion of all the reviewers."""
359
+
360
+
361
+ def get_meta_review(model, client, temperature, reviews):
362
+ # Write a meta-review from a set of individual reviews
363
+ review_text = ""
364
+ for i, r in enumerate(reviews):
365
+ review_text += f"""
366
+ Review {i + 1}/{len(reviews)}:
367
+ ```
368
+ {json.dumps(r)}
369
+ ```
370
+ """
371
+ base_prompt = neurips_form + review_text
372
+
373
+ llm_review, msg_history = get_response_from_llm(
374
+ base_prompt,
375
+ model=model,
376
+ client=client,
377
+ system_message=meta_reviewer_system_prompt.format(reviewer_count=len(reviews)),
378
+ print_debug=False,
379
+ msg_history=None,
380
+ temperature=temperature,
381
+ )
382
+ meta_review = extract_json_between_markers(llm_review)
383
+ return meta_review
384
+
385
+
386
+ def perform_improvement(review, coder):
387
+ improvement_prompt = '''The following review has been created for your research paper:
388
+ """
389
+ {review}
390
+ """
391
+
392
+ Improve the text using the review.'''.format(
393
+ review=json.dumps(review)
394
+ )
395
+ coder_out = coder.run(improvement_prompt)
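
A rough sketch of reviewing a single PDF with this module; the PDF path and model name are placeholders, an API key is assumed to be configured, and the printed fields are only present if the model returns the requested JSON:

```python
# Illustrative only: "paper_to_review.pdf" and the model name are placeholders.
from ai_scientist.llm import create_client
from ai_scientist.perform_review import load_paper, perform_review

client, client_model = create_client("gpt-4o-2024-05-13")
paper_text = load_paper("paper_to_review.pdf")  # hypothetical path to a local PDF

review = perform_review(
    paper_text,
    model=client_model,
    client=client,
    num_reflections=1,
    num_fs_examples=1,
    num_reviews_ensemble=1,
)
# Assumes the model followed the requested review JSON format.
print(review["Overall"], review["Decision"])
```
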
ai_scientist/perform_writeup.py ADDED
@@ -0,0 +1,579 @@
1
+ import argparse
2
+ import json
3
+ import os
4
+ import os.path as osp
5
+ import re
6
+ import shutil
7
+ import subprocess
8
+ from typing import Optional, Tuple
9
+
10
+ from ai_scientist.generate_ideas import search_for_papers
11
+ from ai_scientist.llm import get_response_from_llm, extract_json_between_markers, create_client, AVAILABLE_LLMS
12
+
13
+
14
+ # GENERATE LATEX
15
+ def generate_latex(coder, folder_name, pdf_file, timeout=30, num_error_corrections=5):
16
+ folder = osp.abspath(folder_name)
17
+ cwd = osp.join(folder, "latex") # Fixed potential issue with path
18
+ writeup_file = osp.join(cwd, "template.tex")
19
+
20
+ # Check all references are valid and in the references.bib file
21
+ with open(writeup_file, "r") as f:
22
+ tex_text = f.read()
23
+ cites = re.findall(r"\\cite[a-z]*{([^}]*)}", tex_text)
24
+ references_bib = re.search(
25
+ r"\\begin{filecontents}{references.bib}(.*?)\\end{filecontents}",
26
+ tex_text,
27
+ re.DOTALL,
28
+ )
29
+ if references_bib is None:
30
+ print("No references.bib found in template.tex")
31
+ return
32
+ bib_text = references_bib.group(1)
33
+ cites = [cite.strip() for item in cites for cite in item.split(",")]
34
+ for cite in cites:
35
+ if cite not in bib_text:
36
+ print(f"Reference {cite} not found in references.")
37
+ prompt = f"""Reference {cite} not found in references.bib. Is this included under a different name?
38
+ If so, please modify the citation in template.tex to match the name in references.bib at the top. Otherwise, remove the cite."""
39
+ coder.run(prompt)
40
+
41
+ # Check all included figures are actually in the directory.
42
+ with open(writeup_file, "r") as f:
43
+ tex_text = f.read()
44
+ referenced_figs = re.findall(r"\\includegraphics.*?{(.*?)}", tex_text)
45
+ all_figs = [f for f in os.listdir(folder) if f.endswith(".png")]
46
+ for figure in referenced_figs:
47
+ if figure not in all_figs:
48
+ print(f"Figure {figure} not found in directory.")
49
+ prompt = f"""The image {figure} was not found in the directory. The images in the directory are: {all_figs}.
50
+ Please ensure that the figure is in the directory and that the filename is correct. Check the notes to see what each figure contains."""
51
+ coder.run(prompt)
52
+
53
+ # Remove duplicate figures.
54
+ with open(writeup_file, "r") as f:
55
+ tex_text = f.read()
56
+ referenced_figs = re.findall(r"\\includegraphics.*?{(.*?)}", tex_text)
57
+ duplicates = {x for x in referenced_figs if referenced_figs.count(x) > 1}
58
+ if duplicates:
59
+ for dup in duplicates:
60
+ print(f"Duplicate figure found: {dup}.")
61
+ prompt = f"""Duplicate figures found: {dup}. Ensure any figure is only included once.
62
+ If duplicated, identify the best location for the figure and remove any other."""
63
+ coder.run(prompt)
64
+
65
+ # Remove duplicate section headers.
66
+ with open(writeup_file, "r") as f:
67
+ tex_text = f.read()
68
+ sections = re.findall(r"\\section{([^}]*)}", tex_text)
69
+ duplicates = {x for x in sections if sections.count(x) > 1}
70
+ if duplicates:
71
+ for dup in duplicates:
72
+ print(f"Duplicate section header found: {dup}")
73
+ prompt = f"""Duplicate section header found: {dup}. Ensure any section header is declared once.
74
+ If duplicated, identify the best location for the section header and remove any other."""
75
+ coder.run(prompt)
76
+
77
+ # Iteratively fix any LaTeX bugs
78
+ for i in range(num_error_corrections):
79
+ # Filter trivial bugs in chktex
80
+ check_output = os.popen(f"chktex {writeup_file} -q -n2 -n24 -n13 -n1").read()
81
+ if check_output:
82
+ prompt = f"""Please fix the following LaTeX errors in `template.tex` guided by the output of `chktex`:
83
+ {check_output}.
84
+
85
+ Make the minimal fix required and do not remove or change any packages.
86
+ Pay attention to any accidental uses of HTML syntax, e.g. </end instead of \\end.
87
+ """
88
+ coder.run(prompt)
89
+ else:
90
+ break
91
+ compile_latex(cwd, pdf_file, timeout=timeout)
92
+
93
+
94
+ def compile_latex(cwd, pdf_file, timeout=30):
95
+ print("GENERATING LATEX")
96
+
97
+ commands = [
98
+ ["pdflatex", "-interaction=nonstopmode", "template.tex"],
99
+ ["bibtex", "template"],
100
+ ["pdflatex", "-interaction=nonstopmode", "template.tex"],
101
+ ["pdflatex", "-interaction=nonstopmode", "template.tex"],
102
+ ]
103
+
104
+ for command in commands:
105
+ try:
106
+ result = subprocess.run(
107
+ command,
108
+ cwd=cwd,
109
+ stdout=subprocess.PIPE,
110
+ stderr=subprocess.PIPE,
111
+ text=True,
112
+ timeout=timeout,
113
+ )
114
+ print("Standard Output:\n", result.stdout)
115
+ print("Standard Error:\n", result.stderr)
116
+ except subprocess.TimeoutExpired:
117
+ print(f"Latex timed out after {timeout} seconds")
118
+ except subprocess.CalledProcessError as e:
119
+ print(f"Error running command {' '.join(command)}: {e}")
120
+
121
+ print("FINISHED GENERATING LATEX")
122
+
123
+ # Attempt to move the PDF to the desired location
124
+ try:
125
+ shutil.move(osp.join(cwd, "template.pdf"), pdf_file)
126
+ except FileNotFoundError:
127
+ print("Failed to rename PDF.")
128
+
129
+
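As a usage note: compile_latex only needs the path to the latex/ directory and the desired output PDF path. A minimal sketch of calling it on its own might look like the lines below; the results/my_idea folder name is a hypothetical placeholder, not taken from this repository.

import os.path as osp

folder = osp.abspath("results/my_idea")  # assumed project folder containing latex/template.tex
compile_latex(osp.join(folder, "latex"), osp.join(folder, "my_idea.pdf"), timeout=60)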
130
+ per_section_tips = {
131
+ "Abstract": """
132
+ - TL;DR of the paper
133
+ - What are we trying to do and why is it relevant?
134
+ - Why is this hard?
135
+ - How do we solve it (i.e. our contribution!)
136
+ - How do we verify that we solved it (e.g. Experiments and results)
137
+
138
+ Please make sure the abstract reads smoothly and is well-motivated. This should be one continuous paragraph with no breaks between the lines.
139
+ """,
140
+ "Introduction": """
141
+ - Longer version of the Abstract, i.e. of the entire paper
142
+ - What are we trying to do and why is it relevant?
143
+ - Why is this hard?
144
+ - How do we solve it (i.e. our contribution!)
145
+ - How do we verify that we solved it (e.g. Experiments and results)
146
+ - New trend: specifically list your contributions as bullet points
147
+ - Extra space? Future work!
148
+ """,
149
+ "Related Work": """
150
+ - Academic siblings of our work, i.e. alternative attempts in literature at trying to solve the same problem.
151
+ - Goal is to “Compare and contrast” - how does their approach differ in either assumptions or method? If their method is applicable to our Problem Setting I expect a comparison in the experimental section. If not, there needs to be a clear statement why a given method is not applicable.
152
+ - Note: Just describing what another paper is doing is not enough. We need to compare and contrast.
153
+ """,
154
+ "Background": """
155
+ - Academic Ancestors of our work, i.e. all concepts and prior work that are required for understanding our method.
156
+ - Usually includes a subsection, Problem Setting, which formally introduces the problem setting and notation (Formalism) for our method. Highlights any specific assumptions that are made that are unusual.
157
+ - Note: If our paper introduces a novel problem setting as part of its contributions, it's best to have a separate Section.
158
+ """,
159
+ "Method": """
160
+ - What we do. Why we do it. All described using the general Formalism introduced in the Problem Setting and building on top of the concepts / foundations introduced in Background.
161
+ """,
162
+ "Experimental Setup": """
163
+ - How do we test that our stuff works? Introduces a specific instantiation of the Problem Setting and specific implementation details of our Method for this Problem Setting.
164
+ - Do not imagine unknown hardware details.
165
+ - Includes a description of the dataset, evaluation metrics, important hyperparameters, and implementation details.
166
+ """,
167
+ "Results": """
168
+ - Shows the results of running Method on our problem described in Experimental Setup.
169
+ - Includes statements on hyperparameters and other potential issues of fairness.
170
+ - Only includes results that have actually been run and saved in the logs. Do not hallucinate results that don't exist.
171
+ - If results exist: compares to baselines and includes statistics and confidence intervals.
172
+ - If results exist: includes ablation studies to show that specific parts of the method are relevant.
173
+ - Discusses limitations of the method.
174
+ - Make sure to include all the results from the experiments, and include all relevant figures.
175
+ """,
176
+ "Conclusion": """
177
+ - Brief recap of the entire paper.
178
+ - To keep going with the analogy, you can think of future work as (potential) academic offspring.
179
+ """,
180
+ }
181
+
182
+ error_list = """- Unenclosed math symbols
183
+ - Only reference figures that exist in our directory
184
+ - LaTeX syntax errors
185
+ - Numerical results that do not come from explicit experiments and logs
186
+ - Repeatedly defined figure labels
187
+ - References to papers that are not in the .bib file, DO NOT ADD ANY NEW CITATIONS!
188
+ - Unnecessary verbosity or repetition, unclear text
189
+ - Results or insights in the `notes.txt` that have not yet been included
190
+ - Any relevant figures that have not yet been included in the text
191
+ - Closing any \\begin{{figure}} with a \\end{{figure}} and \\begin{{table}} with a \\end{{table}}, etc.
192
+ - Duplicate headers, e.g. duplicated \\section{{Introduction}} or \\end{{document}}
193
+ - Unescaped symbols, e.g. shakespeare_char should be shakespeare\\_char in text
194
+ - Incorrect closing of environments, e.g. </end{{figure}}> instead of \\end{{figure}}
195
+ """
196
+
197
+ refinement_prompt = (
198
+ """Great job! Now criticize and refine only the {section} that you just wrote.
199
+ Make this complete in this pass, do not leave any placeholders.
200
+
201
+ Pay particular attention to fixing any errors such as:
202
+ """
203
+ + error_list
204
+ )
205
+
206
+ second_refinement_prompt = (
207
+ """Criticize and refine the {section} only. Recall the advice:
208
+ {tips}
209
+ Make this complete in this pass, do not leave any placeholders.
210
+
211
+ Pay attention to how it fits in with the rest of the paper.
212
+ Identify any redundancies (e.g. repeated figures or repeated text); if there are any, decide where in the paper things should be cut.
213
+ Identify where we can save space, and be more concise without weakening the message of the text.
214
+ Fix any remaining errors as before:
215
+ """
216
+ + error_list
217
+ )
218
+
219
+ # CITATION HELPERS
220
+ citation_system_msg = """You are an ambitious AI PhD student who is looking to publish a paper that will contribute significantly to the field.
221
+ You have already written an initial draft of the paper and now you are looking to add missing citations to related papers throughout the paper.
222
+ The related work section already has some initial comments on which papers to add and discuss.
223
+
224
+ Focus on completing the existing write-up and do not add entirely new elements unless necessary.
225
+ Ensure every point in the paper is substantiated with sufficient evidence.
226
+ Feel free to add more cites to a particular point if there are only one or two references.
227
+ Ensure no paper is cited without a corresponding reference in the `references.bib` file.
228
+ Ensure each paragraph of the related work has sufficient background, e.g. a few papers cited.
229
+ You will be given access to the Semantic Scholar API, only add citations that you have found using the API.
230
+ Aim to discuss a broad range of relevant papers, not just the most popular ones.
231
+ Make sure not to copy verbatim from prior literature to avoid plagiarism.
232
+
233
+ You will be prompted to give a precise description of where and how to add the cite, and a search query for the paper to be cited.
234
+ Finally, you will select the most relevant cite from the search results (top 10 results will be shown).
235
+ You will have {total_rounds} rounds to add to the references, but do not need to use them all.
236
+
237
+ DO NOT ADD A CITATION THAT ALREADY EXISTS!"""
238
+
239
+ citation_first_prompt = '''Round {current_round}/{total_rounds}:
240
+
241
+ You have written this LaTeX draft so far:
242
+
243
+ """
244
+ {draft}
245
+ """
246
+
247
+ Identify the most important citation that you still need to add, and the query to find the paper.
248
+
249
+ Respond in the following format:
250
+
251
+ THOUGHT:
252
+ <THOUGHT>
253
+
254
+ RESPONSE:
255
+ ```json
256
+ <JSON>
257
+ ```
258
+
259
+ In <THOUGHT>, first briefly reason over the paper and identify where citations should be added.
260
+ If no more citations are needed, add "No more citations needed" to your thoughts.
261
+ Do not add "No more citations needed" if you are adding citations this round.
262
+
263
+ In <JSON>, respond in JSON format with the following fields:
264
+ - "Description": A precise description of the required edit, along with the proposed text and location where it should be made.
265
+ - "Query": The search query to find the paper (e.g. attention is all you need).
266
+
267
+ Ensure the description is sufficient to make the change without further context. Someone else will make the change.
268
+ The query will work best if you are able to recall the exact name of the paper you are looking for, or the authors.
269
+ This JSON will be automatically parsed, so ensure the format is precise.'''
270
+
271
+ citation_second_prompt = """Search has recovered the following articles:
272
+
273
+ {papers}
274
+
275
+ Respond in the following format:
276
+
277
+ THOUGHT:
278
+ <THOUGHT>
279
+
280
+ RESPONSE:
281
+ ```json
282
+ <JSON>
283
+ ```
284
+
285
+ In <THOUGHT>, first briefly reason over the search results and identify which citation best fits your paper and the location where it should be added.
286
+ If none are appropriate, add "Do not add any" to your thoughts.
287
+
288
+ In <JSON>, respond in JSON format with the following fields:
289
+ - "Selected": A list of the indices of the selected papers to be cited, e.g. "[0, 1]". Can be "[]" if no papers are selected. This must be a string.
290
+ - "Description": Update the previous description of the required edit if needed. Ensure that any cites precisely match the name in the bibtex!!!
291
+
292
+ Do not select papers that are already in the `references.bib` file at the top of the draft, or if the same citation exists under a different name.
293
+ This JSON will be automatically parsed, so ensure the format is precise."""
294
+
295
+
296
+ def get_citation_aider_prompt(
297
+ client, model, draft, current_round, total_rounds, engine="semanticscholar"
298
+ ) -> Tuple[Optional[str], bool]:
299
+ msg_history = []
300
+ try:
301
+ text, msg_history = get_response_from_llm(
302
+ citation_first_prompt.format(
303
+ draft=draft, current_round=current_round, total_rounds=total_rounds
304
+ ),
305
+ client=client,
306
+ model=model,
307
+ system_message=citation_system_msg.format(total_rounds=total_rounds),
308
+ msg_history=msg_history,
309
+ )
310
+ if "No more citations needed" in text:
311
+ print("No more citations needed.")
312
+ return None, True
313
+
314
+ ## PARSE OUTPUT
315
+ json_output = extract_json_between_markers(text)
316
+ assert json_output is not None, "Failed to extract JSON from LLM output"
317
+ query = json_output["Query"]
318
+ papers = search_for_papers(query, engine=engine)
319
+ except Exception as e:
320
+ print(f"Error: {e}")
321
+ return None, False
322
+
323
+ if papers is None:
324
+ print("No papers found.")
325
+ return None, False
326
+
327
+ paper_strings = []
328
+ for i, paper in enumerate(papers):
329
+ paper_strings.append(
330
+ """{i}: {title}. {authors}. {venue}, {year}.\nAbstract: {abstract}""".format(
331
+ i=i,
332
+ title=paper["title"],
333
+ authors=paper["authors"],
334
+ venue=paper["venue"],
335
+ year=paper["year"],
336
+ abstract=paper["abstract"],
337
+ )
338
+ )
339
+ papers_str = "\n\n".join(paper_strings)
340
+
341
+ try:
342
+ text, msg_history = get_response_from_llm(
343
+ citation_second_prompt.format(
344
+ papers=papers_str,
345
+ current_round=current_round,
346
+ total_rounds=total_rounds,
347
+ ),
348
+ client=client,
349
+ model=model,
350
+ system_message=citation_system_msg.format(total_rounds=total_rounds),
351
+ msg_history=msg_history,
352
+ )
353
+ if "Do not add any" in text:
354
+ print("Do not add any.")
355
+ return None, False
356
+ ## PARSE OUTPUT
357
+ json_output = extract_json_between_markers(text)
358
+ assert json_output is not None, "Failed to extract JSON from LLM output"
359
+ desc = json_output["Description"]
360
+ selected_papers = json_output["Selected"]
361
+ selected_papers = str(selected_papers)
362
+
363
+ # convert to list
364
+ if selected_papers != "[]":
365
+ selected_papers = list(map(int, selected_papers.strip("[]").split(",")))
366
+ assert all(
367
+ [0 <= i < len(papers) for i in selected_papers]
368
+ ), "Invalid paper index"
369
+ bibtexs = [papers[i]["citationStyles"]["bibtex"] for i in selected_papers]
370
+ bibtex_string = "\n".join(bibtexs)
371
+ else:
372
+ return None, False
373
+
374
+ except Exception as e:
375
+ print(f"Error: {e}")
376
+ return None, False
377
+
378
+ # Add citation to draft
379
+ aider_format = '''The following citations have just been added to the end of the `references.bib` file definition at the top of the file:
380
+ """
381
+ {bibtex}
382
+ """
383
+ You do not need to add them yourself.
384
+ ABSOLUTELY DO NOT ADD IT AGAIN!!!
385
+
386
+ Make the proposed change to the draft incorporating these new cites:
387
+ {description}
388
+
389
+ Use your judgment for whether these should be cited anywhere else.
390
+ Make sure that any citation precisely matches the name in `references.bib`. Change its name to the correct name in the bibtex if needed.
391
+ Ensure the citation is well-integrated into the text.'''
392
+
393
+ aider_prompt = (
394
+ aider_format.format(bibtex=bibtex_string, description=desc)
395
+ + """\n You must use \cite or \citet to reference papers, do not manually type out author names."""
396
+ )
397
+ return aider_prompt, False
398
+
399
+
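A minimal sketch of driving this citation helper outside of perform_writeup; the model name, draft path, and coder object here are assumptions for illustration only.

client, client_model = create_client("gpt-4o-2024-05-13")
with open("latex/template.tex") as f:
    draft = f.read()
aider_prompt, done = get_citation_aider_prompt(client, client_model, draft, current_round=0, total_rounds=20)
if aider_prompt is not None and not done:
    # `coder` is an aider Coder with template.tex in its file list, configured as in the __main__ block below
    coder.run(aider_prompt)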
400
+ # PERFORM WRITEUP
401
+ def perform_writeup(
402
+ idea, folder_name, coder, cite_client, cite_model, num_cite_rounds=20, engine="semanticscholar"
403
+ ):
404
+ # CURRENTLY ASSUMES LATEX
405
+ abstract_prompt = f"""We've provided the `latex/template.tex` file to the project. We will be filling it in section by section.
406
+
407
+ First, please fill in the "Title" and "Abstract" sections of the writeup.
408
+
409
+ Some tips are provided below:
410
+ {per_section_tips["Abstract"]}
411
+
412
+ Before every paragraph, please include a brief description of what you plan to write in that paragraph in a comment.
413
+
414
+ Be sure to first name the file and use *SEARCH/REPLACE* blocks to perform these edits.
415
+ """
416
+ coder_out = coder.run(abstract_prompt)
417
+ coder_out = coder.run(
418
+ refinement_prompt.format(section="Abstract")
419
+ .replace(r"{{", "{")
420
+ .replace(r"}}", "}")
421
+ )
422
+ for section in [
423
+ "Introduction",
424
+ "Background",
425
+ "Method",
426
+ "Experimental Setup",
427
+ "Results",
428
+ "Conclusion",
429
+ ]:
430
+ section_prompt = f"""Please fill in the {section} of the writeup. Some tips are provided below:
431
+ {per_section_tips[section]}
432
+
433
+ Be sure to use \cite or \citet where relevant, referring to the works provided in the file.
434
+ Do not cite anything that is not already in `references.bib`. Do not add any new entries to this.
435
+
436
+ Keep the experimental results (figures and tables) only in the Results section, and make sure that any captions are filled in.
437
+ In this pass, do not reference anything in later sections of the paper.
438
+
439
+ Before every paragraph, please include a brief description of what you plan to write in that paragraph in a comment.
440
+
441
+ Be sure to first name the file and use *SEARCH/REPLACE* blocks to perform these edits.
442
+ """
443
+ coder_out = coder.run(section_prompt)
444
+ coder_out = coder.run(
445
+ refinement_prompt.format(section=section)
446
+ .replace(r"{{", "{")
447
+ .replace(r"}}", "}")
448
+ )
449
+
450
+ # SKETCH THE RELATED WORK
451
+ section_prompt = f"""Please fill in the Related Work of the writeup. Some tips are provided below:
452
+
453
+ {per_section_tips["Related Work"]}
454
+
455
+ For this section, very briefly sketch out the structure of the section, and clearly indicate what papers you intend to include.
456
+ Do this all in LaTeX comments using %.
457
+ The related work should be concise, only plan to discuss the most relevant work.
458
+ Do not modify `references.bib` to add any new citations, this will be filled in at a later stage.
459
+
460
+ Be sure to first name the file and use *SEARCH/REPLACE* blocks to perform these edits.
461
+ """
462
+ coder_out = coder.run(section_prompt)
463
+
464
+ # Fill paper with cites.
465
+ for _ in range(num_cite_rounds):
466
+ with open(osp.join(folder_name, "latex", "template.tex"), "r") as f:
467
+ draft = f.read()
468
+ prompt, done = get_citation_aider_prompt(
469
+ cite_client, cite_model, draft, _, num_cite_rounds, engine=engine
470
+ )
471
+ if done:
472
+ break
473
+ if prompt is not None:
474
+ # extract bibtex string
475
+ bibtex_string = prompt.split('"""')[1]
476
+ # insert this into draft before the "\end{filecontents}" line
477
+ search_str = r"\end{filecontents}"
478
+ draft = draft.replace(search_str, f"{bibtex_string}{search_str}")
479
+ with open(osp.join(folder_name, "latex", "template.tex"), "w") as f:
480
+ f.write(draft)
481
+ coder_out = coder.run(prompt)
482
+
483
+ coder_out = coder.run(
484
+ refinement_prompt.format(section="Related Work")
485
+ .replace(r"{{", "{")
486
+ .replace(r"}}", "}")
487
+ )
488
+
489
+ ## SECOND REFINEMENT LOOP
490
+ coder.run(
491
+ """Great job! Now that there is a complete draft of the entire paper, let's refine each section again.
492
+ First, re-think the Title if necessary. Keep this concise and descriptive of the paper's concept, but try to be creative with it."""
493
+ )
494
+ for section in [
495
+ "Abstract",
496
+ "Related Work",
497
+ "Introduction",
498
+ "Background",
499
+ "Method",
500
+ "Experimental Setup",
501
+ "Results",
502
+ "Conclusion",
503
+ ]:
504
+ coder_out = coder.run(
505
+ second_refinement_prompt.format(
506
+ section=section, tips=per_section_tips[section]
507
+ )
508
+ .replace(r"{{", "{")
509
+ .replace(r"}}", "}")
510
+ )
511
+
512
+ generate_latex(coder, folder_name, f"{folder_name}/{idea['Name']}.pdf")
513
+
514
+
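For orientation, perform_writeup assumes a project folder that already contains latex/template.tex (with its filecontents references.bib block) and the experiment files tracked by the coder. A rough, illustrative call might look as follows; the idea dict and folder path are placeholders.

idea = {"Name": "my_idea"}  # only the Name field is read here, to name the output PDF
perform_writeup(idea, "results/my_idea", coder, cite_client, cite_model, num_cite_rounds=20)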
515
+ if __name__ == "__main__":
516
+ from aider.coders import Coder
517
+ from aider.models import Model
518
+ from aider.io import InputOutput
519
+ import json
520
+
521
+ parser = argparse.ArgumentParser(description="Perform writeup for a project")
522
+ parser.add_argument("--folder", type=str)
523
+ parser.add_argument("--no-writing", action="store_true", help="Only generate")
524
+ parser.add_argument(
525
+ "--model",
526
+ type=str,
527
+ default="gpt-4o-2024-05-13",
528
+ choices=AVAILABLE_LLMS,
529
+ help="Model to use for AI Scientist.",
530
+ )
531
+ parser.add_argument(
532
+ "--engine",
533
+ type=str,
534
+ default="semanticscholar",
535
+ choices=["semanticscholar", "openalex"],
536
+ help="Scholar engine to use.",
537
+ )
538
+ args = parser.parse_args()
539
+ client, client_model = create_client(args.model)
540
+ print("Make sure you cleaned the Aider logs if re-generating the writeup!")
541
+ folder_name = args.folder
542
+ idea_name = osp.basename(folder_name)
543
+ exp_file = osp.join(folder_name, "experiment.py")
544
+ vis_file = osp.join(folder_name, "plot.py")
545
+ notes = osp.join(folder_name, "notes.txt")
546
+ model = args.model
547
+ writeup_file = osp.join(folder_name, "latex", "template.tex")
548
+ ideas_file = osp.join(folder_name, "ideas.json")
549
+ with open(ideas_file, "r") as f:
550
+ ideas = json.load(f)
551
+ for idea in ideas:
552
+ if idea["Name"] in idea_name:
553
+ print(f"Found idea: {idea['Name']}")
554
+ break
555
+ if idea["Name"] not in idea_name:
556
+ raise ValueError(f"Idea {idea_name} not found")
557
+ fnames = [exp_file, writeup_file, notes]
558
+ io = InputOutput(yes=True, chat_history_file=f"{folder_name}/{idea_name}_aider.txt")
559
+ if args.model == "deepseek-coder-v2-0724":
560
+ main_model = Model("deepseek/deepseek-coder")
561
+ elif args.model == "llama3.1-405b":
562
+ main_model = Model("openrouter/meta-llama/llama-3.1-405b-instruct")
563
+ else:
564
+ main_model = Model(model)
565
+ coder = Coder.create(
566
+ main_model=main_model,
567
+ fnames=fnames,
568
+ io=io,
569
+ stream=False,
570
+ use_git=False,
571
+ edit_format="diff",
572
+ )
573
+ if args.no_writing:
574
+ generate_latex(coder, args.folder, f"{args.folder}/test.pdf")
575
+ else:
576
+ try:
577
+ perform_writeup(idea, folder_name, coder, client, client_model, engine=args.engine)
578
+ except Exception as e:
579
+ print(f"Failed to perform writeup: {e}")