Update README.md
README.md
CHANGED
@@ -77,11 +77,13 @@ The combined dataset[GIZ/policy_qa_v0_1](https://huggingface.co/datasets/GIZ/pol
The pre-processing operations used to produce the final training dataset were as follows:

1. The dataset is filtered on the 'medium' value in the 'strategy' column (sequence length = 85), selecting only IKITracs samples.
- 2. For
- 3.
- 4.
- 5. The
- 6.
+ 2. For ClimateWatch, all rows are removed because they were assessed to have no taxonomical alignment with the IKITracs labels inherent to the dataset.
+ 3. For IKITracs, labels are assigned based on the presence of 'parameter' values matching the category mapping taxonomy defined by TraCS (ref. below).
+ 4. If 'context_translated' is available and the 'language' is not English, 'context' is replaced with 'context_translated'. As a result, the model is trained on English translations of the original text samples.
+ 5. The dataset is "exploded": the text samples in the 'context' column, which are lists, are converted into separate rows, and labels are merged to align with the associated samples.
+ 6. The 'match_onanswer' and 'answerWordcount' columns are used conditionally to select high-quality samples: a high percentage of word matches in 'match_onanswer' is preferred, but a lower percentage is accepted when 'answerWordcount' is high.
+ 7. Data is then augmented using sentence shuffle from the ```albumentations``` library and insertions from ```nlpaug```. This increases the number of training samples available for under-represented classes. Given the large number of classes for this classifier, it is unsurprising that some categories have very low representation. Here, classes with fewer instances than 1/3 of the most represented classes are categorized as under-represented, and each of their instances is augmented, effectively doubling the number of instances for these classes.
+ 8. To address the remaining class imbalances, the ratio of negative instances to positive instances for each class is computed to produce a weights array. This array is passed to a custom multi-label trainer function, which is used during hyperparameter tuning and final model training.
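
Steps 4 to 6 above could be sketched in pandas roughly as follows. This is a minimal illustration, not the code used to build the dataset: the column names come from the list above, while the `"en"` language code, the `labels` column used in the example, and all three thresholds are hypothetical placeholders, since the README does not state them.

```python
import pandas as pd

# Hypothetical thresholds: the README does not give the actual values.
HIGH_MATCH = 0.8   # assumed "high % of word matches"
LOW_MATCH = 0.5    # assumed lower bound accepted for long answers
MIN_WORDS = 20     # assumed "high answerWordcount"


def prefer_translation(df: pd.DataFrame) -> pd.DataFrame:
    """Step 4: use 'context_translated' for non-English rows when it exists."""
    out = df.copy()
    # Assumes English rows are marked with the language code "en".
    use_translated = out["context_translated"].notna() & (out["language"] != "en")
    out["context"] = out["context_translated"].where(use_translated, out["context"])
    return out


def explode_contexts(df: pd.DataFrame) -> pd.DataFrame:
    """Step 5: one row per text sample; the other columns repeat for each row."""
    return df.explode("context", ignore_index=True)


def select_high_quality(df: pd.DataFrame) -> pd.DataFrame:
    """Step 6: keep strong matches, or weaker matches with a long answer."""
    strong = df["match_onanswer"] >= HIGH_MATCH
    long_answer = (df["match_onanswer"] >= LOW_MATCH) & (df["answerWordcount"] >= MIN_WORDS)
    return df[strong | long_answer].reset_index(drop=True)


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Apply steps 4 to 6 in the order given in the list above."""
    return select_high_quality(explode_contexts(prefer_translation(df)))
```

Because `DataFrame.explode` repeats every other column for each new row, labels stored alongside a list of text samples stay attached to each sample after the split.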
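
The class-balance logic in steps 7 and 8 can be illustrated with plain Python on a multi-hot label matrix. This is a sketch under stated assumptions: the 1/3 rule is read here as "fewer than one third of the largest class count", and the function names are hypothetical, not taken from the training code.

```python
from typing import List, Sequence


def class_counts(label_rows: Sequence[Sequence[int]]) -> List[int]:
    """Number of positive instances per class in a multi-hot label matrix."""
    return [sum(column) for column in zip(*label_rows)]


def underrepresented(counts: Sequence[int], fraction: float = 1 / 3) -> List[int]:
    """Step 7 (assumed reading): classes below `fraction` of the largest count."""
    threshold = fraction * max(counts)
    return [i for i, count in enumerate(counts) if count < threshold]


def neg_pos_weights(label_rows: Sequence[Sequence[int]]) -> List[float]:
    """Step 8: ratio of negative to positive instances for each class."""
    total = len(label_rows)
    weights = []
    for positives in class_counts(label_rows):
        if positives == 0:
            raise ValueError("every class needs at least one positive instance")
        weights.append((total - positives) / positives)
    return weights


# Toy example: 6 samples, 3 classes.
labels = [
    [1, 1, 0],
    [1, 1, 0],
    [1, 1, 0],
    [1, 0, 1],
    [1, 0, 0],
    [1, 0, 0],
]
counts = class_counts(labels)        # [6, 3, 1]
rare = underrepresented(counts)      # [2]
weights = neg_pos_weights(labels)    # [0.0, 1.0, 5.0]
```

One common way to use such an array is as the `pos_weight` argument of PyTorch's `BCEWithLogitsLoss`; the README does not say whether the custom trainer does this.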
### **Parameter to category mapping taxonomy**
|index|Category|Parameter|