gpt2-dpo-mcqa

This model is a fine-tuned version of mNLP-project/gpt2-finetuned-mcqa; the training dataset is not specified. Evaluation-set metrics (validation loss and reward statistics) are listed per epoch in the results table.

Model description

More information needed

The hyperparameters used during training are not listed. Training and evaluation metrics at the end of each epoch:

Training Loss	Epoch	Step	Validation Loss	Rewards/chosen	Rewards/rejected	Rewards/accuracies	Rewards/margins	Logps/rejected	Logps/chosen	Logits/rejected	Logits/chosen
0.6407	0.9993	668	0.6460	0.7721	0.6216	0.6295	0.1505	-664.7236	-794.3383	-15.1273	-11.7899
0.6498	2.0	1337	0.6374	1.2927	1.0475	0.6325	0.2453	-660.4651	-789.1318	-14.9517	-11.6401
0.6468	2.9993	2005	0.6342	1.3734	1.1102	0.6388	0.2632	-659.8373	-788.3249	-14.9535	-11.6481
0.6113	4.0	2674	0.6332	1.3317	1.0769	0.6444	0.2548	-660.1705	-788.7426	-14.9930	-11.6897
0.5826	4.9993	3342	0.6310	1.4580	1.1845	0.6414	0.2735	-659.0944	-787.4795	-14.9328	-11.6364
0.5613	6.0	4011	0.6317	1.4979	1.2181	0.6407	0.2798	-658.7584	-787.0804	-14.9234	-11.6271
0.581	6.9993	4679	0.6316	1.5084	1.2260	0.6437	0.2825	-658.6798	-786.9750	-14.9319	-11.6377
0.571	8.0	5348	0.6320	1.4992	1.2184	0.6425	0.2808	-658.7557	-787.0676	-14.9334	-11.6373
0.5943	8.9993	6016	0.6317	1.5126	1.2294	0.6437	0.2832	-658.6454	-786.9331	-14.9226	-11.6269
0.5635	9.9925	6680	0.6317	1.5142	1.2308	0.6433	0.2835	-658.6317	-786.9168	-14.9211	-11.6256
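
The reward columns can be sanity-checked against each other. The sketch below assumes that Rewards/margins is logged as Rewards/chosen minus Rewards/rejected, which is the usual convention for DPO training logs but is not stated in this card. The values are copied from the table above; the function names are illustrative only.

```python
# Rows copied from the results table:
# (epoch, step, validation_loss, rewards_chosen, rewards_rejected, rewards_margin)
ROWS = [
    (0.9993, 668, 0.6460, 0.7721, 0.6216, 0.1505),
    (2.0, 1337, 0.6374, 1.2927, 1.0475, 0.2453),
    (2.9993, 2005, 0.6342, 1.3734, 1.1102, 0.2632),
    (4.0, 2674, 0.6332, 1.3317, 1.0769, 0.2548),
    (4.9993, 3342, 0.6310, 1.4580, 1.1845, 0.2735),
    (6.0, 4011, 0.6317, 1.4979, 1.2181, 0.2798),
    (6.9993, 4679, 0.6316, 1.5084, 1.2260, 0.2825),
    (8.0, 5348, 0.6320, 1.4992, 1.2184, 0.2808),
    (8.9993, 6016, 0.6317, 1.5126, 1.2294, 0.2832),
    (9.9925, 6680, 0.6317, 1.5142, 1.2308, 0.2835),
]


def margin_mismatches(rows, tol=2e-4):
    """Return the steps where chosen - rejected differs from the logged margin.

    The tolerance allows for the table values being rounded to four decimals.
    """
    return [
        step
        for (_, step, _, chosen, rejected, margin) in rows
        if abs((chosen - rejected) - margin) > tol
    ]


def best_by_validation_loss(rows):
    """Return the row with the lowest validation loss."""
    return min(rows, key=lambda row: row[2])


if __name__ == "__main__":
    print("Steps with inconsistent margins:", margin_mismatches(ROWS))
    epoch, step, val_loss, *_ = best_by_validation_loss(ROWS)
    print(f"Lowest validation loss {val_loss} at step {step} (epoch {epoch})")
```

On these rows the margins agree with the difference of the two reward columns up to rounding, and the lowest validation loss (0.6310) occurs at step 3342; later epochs lower the training loss without improving the validation loss further.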

Model size: 0.1B params
Tensor type: F32
Weights format: Safetensors
