{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Sentence Processing for NLP\n",
"\n",
"In this notebook, we will see the importance of sentence processing and the techniques that we used to train the models.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We will use a corpus of `10_000` sentences to demonstrate the difference between the different techniques.\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"[nltk_data] Downloading package punkt_tab to /Users/az-r-\n",
"[nltk_data] ow/nltk_data...\n",
"[nltk_data] Package punkt_tab is already up-to-date!\n"
]
}
],
"source": [
"from app.travel_resolver.libs.nlp.data_processing import from_bio_file_to_examples\n",
"\n",
"\n",
"sentences, labels, vocab, unique_labels = from_bio_file_to_examples(\n",
" \"data/bio/fr.bio/10k_samples.bio\"\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"On the same corpus, we will be comparing :\n",
"\n",
"- No changes\n",
"- Stopwords removal + lowercasing\n",
"- Stopwords removal + lowercasing + stemming\n",
"\n",
"To avoid confusion, variables in relation with the sentences that hasn't been altered with will be prefixed with `o` for original, `sl` for stopwords + lowercasing and `sls` for stopwords + lowercasing + stemming.\n",
"\n",
 If you're not familiar">
"> If you're not familiar with stemming, it is an attempt to reduce a word to its root by stripping what is presumed to be a prefix or a suffix (e.g. _chocolates_ $\\to$ _chocolate_, _retrieval_ $\\to$ _retrieve_). The resulting stem is not guaranteed to be a dictionary word.\n"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [],
"source": [
"import nltk\n",
"from app.travel_resolver.libs.nlp.data_processing import process_sentence\n",
"\n",
"o_sentences = [nltk.tokenize.word_tokenize(sentence) for sentence in sentences]\n",
"sl_sentences = []\n",
"sl_labels = []\n",
"sls_sentences = []\n",
"sls_labels = []\n",
"\n",
"for sentence, label in zip(sentences, labels):\n",
" sl_sentence, sl_label = process_sentence(\n",
" sentence, return_tokens=True, labels_to_adapt=label\n",
" )\n",
" sls_sentence, sls_label = process_sentence(\n",
" sentence, stemming=True, return_tokens=True, labels_to_adapt=label\n",
" )\n",
"\n",
" # print(len(sl_sentence))\n",
" # print(len(sl_label))\n",
" # break\n",
" sl_sentences.append(sl_sentence)\n",
" sl_labels.append(sl_label)\n",
" sls_sentences.append(sls_sentence)\n",
" sls_labels.append(sls_label)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"from collections import Counter\n",
"\n",
"o_word_count = Counter([word for sentence in o_sentences for word in sentence])\n",
"sl_word_count = Counter([word for sentence in sl_sentences for word in sentence])\n",
"sls_word_count = Counter([word for sentence in sls_sentences for word in sentence])"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "",
"text/plain": [
"<Figure size 1000x600 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"import matplotlib.pyplot as plt\n",
"\n",
"o_word_count_values = list(o_word_count.values())\n",
"sl_word_count_values = list(sl_word_count.values())\n",
"sls_word_count_values = list(sls_word_count.values())\n",
"\n",
"# Plot the distribution\n",
"plt.figure(figsize=(10, 6))\n",
"plt.hist(\n",
" o_word_count_values,\n",
" bins=50,\n",
" color=\"blue\",\n",
" edgecolor=\"black\",\n",
" alpha=0.7,\n",
" label=\"Original\",\n",
")\n",
"plt.hist(\n",
" sl_word_count_values, bins=50, color=\"red\", edgecolor=\"black\", alpha=0.6, label=\"SL\"\n",
")\n",
"plt.hist(\n",
" sls_word_count_values,\n",
" bins=50,\n",
" color=\"yellow\",\n",
" edgecolor=\"black\",\n",
" alpha=0.4,\n",
" label=\"SLS\",\n",
")\n",
"plt.yscale(\"log\") # Optional: use log scale for y-axis if the distribution is skewed\n",
"plt.xlabel(\"Word Frequency\")\n",
"plt.ylabel(\"Number of Words\")\n",
"plt.title(\"Distribution of Word Frequency in Corpus\")\n",
"plt.legend()\n",
"plt.show()"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"ββββββββββββ€ββββββββ€ββββββββββ€ββββββββββ€ββββββββββββ€ββββββββββββββββββββββββ\n",
"β β max β mean β std β n_words β avg_tokens/sentence β\n",
"ββββββββββββͺββββββββͺββββββββββͺββββββββββͺββββββββββββͺββββββββββββββββββββββββ‘\n",
"β Original β 12127 β 41.3419 β 367.982 β 3270 β 13.61 β\n",
"ββββββββββββΌββββββββΌββββββββββΌββββββββββΌββββββββββββΌββββββββββββββββββββββββ€\n",
"β SL β 7825 β 27.8919 β 224.392 β 3248 β 9.12041 β\n",
"ββββββββββββΌββββββββΌββββββββββΌββββββββββΌββββββββββββΌββββββββββββββββββββββββ€\n",
"β SLS β 7825 β 28.2397 β 229.139 β 3208 β 9.12041 β\n",
"ββββββββββββ§ββββββββ§ββββββββββ§ββββββββββ§ββββββββββββ§ββββββββββββββββββββββββ\n"
]
}
],
"source": [
"import numpy as np\n",
"from tabulate import tabulate\n",
"\n",
"word_distributions = np.array(\n",
" [\n",
" [\"\", \"max\", \"mean\", \"std\", \"n_words\", \"avg_tokens/sentence\"],\n",
" [\n",
" \"Original\",\n",
" np.max(o_word_count_values),\n",
" np.mean(o_word_count_values),\n",
" np.std(o_word_count_values),\n",
" len(o_word_count),\n",
" np.mean([len(sentence) for sentence in o_sentences]),\n",
" ],\n",
" [\n",
" \"SL\",\n",
" np.max(sl_word_count_values),\n",
" np.mean(sl_word_count_values),\n",
" np.std(sl_word_count_values),\n",
" len(sl_word_count),\n",
" np.mean([len(sentence) for sentence in sl_sentences]),\n",
" ],\n",
" [\n",
" \"SLS\",\n",
" np.max(sls_word_count_values),\n",
" np.mean(sls_word_count_values),\n",
" np.std(sls_word_count_values),\n",
" len(sls_word_count),\n",
" np.mean([len(sentence) for sentence in sls_sentences]),\n",
" ],\n",
" ]\n",
")\n",
"\n",
"print(tabulate(word_distributions, headers=\"firstrow\", tablefmt=\"fancy_grid\"))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"With the processing techniques, we were able to reduce :\n",
"\n",
"- Mean\n",
"- Standard Deviation (`std`)\n",
"- Number of words (`n_words`)\n",
"- Average Tokens per sentence (`avg_tokens/sentence`)\n"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [],
"source": [
"o_label_count = Counter(sum(labels, []))\n",
"sl_label_count = Counter(sum(sl_labels, []))\n",
"sls_label_count = Counter(sum(sls_labels, []))"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(Counter({0: 113556, 1: 10836, 2: 10796}),\n",
" Counter({0: 68961, 1: 10836, 2: 10796}),\n",
" Counter({0: 68961, 1: 10836, 2: 10796}))"
]
},
"execution_count": 39,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"o_label_count, sl_label_count, sls_label_count"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Both **SL** and **SLS** have the same ratio for each of the labels therefore we will be comparing either one of them with the **O** corpus.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "",
"text/plain": [
"<Figure size 1000x600 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"o_label_ratio = [c / sum(o_label_count.values()) for c in o_label_count.values()]\n",
"sl_label_ratio = [c / sum(sl_label_count.values()) for c in sl_label_count.values()]\n",
"\n",
"x = np.arange(len(unique_labels))\n",
"bar_width = 0.4\n",
"\n",
"o_label_x = [i - (bar_width / 2) for i in o_label_count.keys()]\n",
"sl_label_x = [i + (bar_width / 2) for i in sl_label_count.keys()]\n",
"\n",
"fig = plt.figure(figsize=(10, 6))\n",
"\n",
"# Creating a bar plot\n",
"plt.bar(sl_label_x, sl_label_ratio, color=\"red\", width=0.4, label=\"SL\")\n",
"plt.bar(o_label_x, o_label_ratio, color=\"blue\", width=0.4, label=\"Original\")\n",
"plt.xlabel(\"Labels\")\n",
"plt.ylabel(\"Frequency\")\n",
"plt.title(\"Label Distribution\")\n",
"plt.legend()\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Processing the sentences helped used increase the amount of classes representation as well as decrease the ratio of \"Outside\" words whilst keeping the \"Normal\" representation.\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
}