{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Assignment 3: Hello Vectors\n",
"\n",
"Welcome to this week's programming assignment of the specialization. In this assignment we will explore word vectors.\n",
"In natural language processing, we represent each word as a vector consisting of numbers.\n",
"The vector encodes the meaning of the word. These numbers (or weights) for each word are learned using various machine\n",
"learning models, which we will explore in more detail later in this specialization. Rather than make you code the\n",
"machine learning models from scratch, we will show you how to use them. In the real world, you can always load the\n",
"trained word vectors, and you will almost never have to train them from scratch. In this assignment you will\n",
"\n",
"- Predict analogies between words.\n",
"- Use PCA to reduce the dimensionality of the word embeddings and plot them in two dimensions.\n",
"- Compare word embeddings by using a similarity measure (the cosine similarity).\n",
"- Understand how these vector space models work.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Important Note on Submission to the AutoGrader\n",
"\n",
"Before submitting your assignment to the AutoGrader, please make sure you are not doing the following:\n",
"\n",
"1. You have not added any _extra_ `print` statement(s) in the assignment.\n",
"2. You have not added any _extra_ code cell(s) in the assignment.\n",
"3. You have not changed any of the function parameters.\n",
"4. You are not using any global variables inside your graded exercises. Unless specifically instructed to do so, please refrain from it and use the local variables instead.\n",
"5. You are not changing the assignment code where it is not required, like creating _extra_ variables.\n",
"\n",
"If you do any of the following, you will get something like, `Grader Error: Grader feedback not found` (or similarly unexpected) error upon submitting your assignment. Before asking for help/debugging the errors in your assignment, check for these first. If this is the case, and you don't remember the changes you have made, you can get a fresh copy of the assignment by following these [instructions](https://www.coursera.org/learn/classification-vector-spaces-in-nlp/supplement/YLuAg/h-ow-to-refresh-your-workspace)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Table of Contents\n",
"\n",
"- [1 - Predict the Countries from Capitals](#1)\n",
" - [1.1 Importing the Data](#1-1)\n",
" - [1.2 Cosine Similarity](#1-2)\n",
" - [Exercise 1 - cosine_similarity (UNQ_C1)](#ex-1)\n",
" - [1.3 Euclidean Distance](#1-3)\n",
" - [Exercise 2 - euclidean (UNQ_C2)](#ex-2)\n",
" - [1.4 Finding the Country of each Capital](#1-4)\n",
" - [Exercise 3 - get_country (UNQ_C3)](#ex-3)\n",
" - [1.5 Model Accuracy](#1-5)\n",
" - [Exercise 4 - get_accuracy (UNQ_C4)](#ex-4)\n",
"- [2 - Plotting the vectors using PCA](#2)\n",
" - [Exercise 5 - compute_pca (UNQ_C5)](#ex-5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"## 1 - Predict the Countries from Capitals\n",
"\n",
"During the presentation of the module, we have illustrated the word analogies\n",
"by finding the capital of a country from the country. In this part of the assignment\n",
"we have changed the problem a bit. You are asked to predict the **countries** \n",
"that correspond to some **capitals**.\n",
"You are playing trivia against some second grader who just took their geography test and knows all the capitals by heart.\n",
"Thanks to NLP, you will be able to answer the questions properly. In other words, you will write a program that can give\n",
"you the country by its capital. That way you are pretty sure you will win the trivia game. We will start by exploring the data set.\n",
"\n",
"
\n",
"\n",
"\n",
"### 1.1 Importing the Data\n",
"\n",
"As usual, you start by importing some essential Python libraries and the load dataset.\n",
"The dataset will be loaded as a [Pandas DataFrame](https://pandas.pydata.org/pandas-docs/stable/getting_started/dsintro.html),\n",
"which is very a common method in data science. Because of the large size of the data,\n",
"this may take a few minutes."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"# Run this cell to import packages.\n",
"import pickle\n",
"import numpy as np\n",
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"import w3_unittest\n",
"\n",
"from utils import get_vectors"
]
},
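{
"cell_type": "markdown",
"metadata": {},
"source": [
"Loading a space-separated file of capital and country pairs into a DataFrame might look like the sketch below. The file name `capitals.txt` is an assumption for illustration; it is not part of the graded code.\n",
"\n",
"```python\n",
"# Hypothetical sketch: the file name and delimiter are assumptions.\n",
"data = pd.read_csv('capitals.txt', delimiter=' ')\n",
"data.columns = ['city1', 'country1', 'city2', 'country2']\n",
"\n",
"# Show the first five rows of the DataFrame.\n",
"data.head(5)\n",
"```"
]
},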
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The first five rows of the dataset look like this:\n",
"\n",
"| | city1 | country1 | city2 | country2 |\n",
"| --- | --- | --- | --- | --- |\n",
"| 0 | Athens | Greece | Bangkok | Thailand |\n",
"| 1 | Athens | Greece | Beijing | China |\n",
"| 2 | Athens | Greece | Berlin | Germany |\n",
"| 3 | Athens | Greece | Bern | Switzerland |\n",
"| 4 | Athens | Greece | Cairo | Egypt |\n",
"\n",
"You will implement a function that can tell you the country of a given capital.\n",
"You should use the word-analogy methodology described above. To do this,\n",
"you'll first compute the cosine similarity metric or the Euclidean distance."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"### 1.2 Cosine Similarity\n",
"\n",
"The cosine similarity function is:\n",
"\n",
"$$\\cos (\\theta)=\\frac{\\mathbf{A} \\cdot \\mathbf{B}}{\\|\\mathbf{A}\\|\\|\\mathbf{B}\\|}=\\frac{\\sum_{i=1}^{n} A_{i} B_{i}}{\\sqrt{\\sum_{i=1}^{n} A_{i}^{2}} \\sqrt{\\sum_{i=1}^{n} B_{i}^{2}}}\\tag{1}$$\n",
"\n",
"$A$ and $B$ represent the word vectors and $A_i$ or $B_i$ represent index i of that vector. Note that if A and B are identical, you will get $cos(\\theta) = 1$.\n",
"* Otherwise, if they are the total opposite, meaning, $A= -B$, then you would get $cos(\\theta) = -1$.\n",
"* If you get $cos(\\theta) =0$, that means that they are orthogonal (or perpendicular).\n",
"* Numbers between 0 and 1 indicate a similarity score.\n",
"* Numbers between -1 and 0 indicate a dissimilarity score.\n",
"\n",
"\n",
"### Exercise 1 - cosine_similarity\n",
"Implement a function that takes in two word vectors and computes the cosine distance."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"### Exercise 5 - compute_pca\n",
"\n",
"**Instructions**: \n",
"\n",
"Implement a program that takes in a data set where each row corresponds to a word vector. \n",
"* The word vectors are of dimension 300. \n",
"* Use PCA to change the 300 dimensions to `n_components` dimensions. \n",
"* The new matrix should be of dimension `m, n_components`. \n",
"\n",
"* First de-mean the data\n",
"* Get the eigenvalues using `linalg.eigh`. Use 'eigh' rather than 'eig' since R is symmetric. The performance gain when using eigh instead of eig is substantial.\n",
"* Sort the eigenvectors and eigenvalues by decreasing order of the eigenvalues.\n",
"* Get a subset of the eigenvectors (choose how many principle components you want to use using n_components).\n",
"* Return the new transformation of the data by multiplying the eigenvectors with the original data."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n", "
axis = 0, you take the mean for each column. If you set axis = 1, you take the mean for each row. Remember that each row is a word vector, and the number of columns are the number of dimensions in a word vector. rowvar is True. From the documentation: \"If rowvar is True (default), then each row represents a variable, with observations in the columns.\" In our case, each row is a word vector observation, and each column is a feature (variable). x[::-1].x[indices_sorted].x[:,indices_sorted](n_observations, n_features). (n_features, n_components).(n_components, n_features) and the data (n_features, n_observations).(n_components,n_observations). Take its transpose to get the shape (n_observations, n_components).| \n", " 0.43437323\n", " | \n", "\n", " 0.49820384\n", " | \n", "
| \n", " 0.42077249\n", " | \n", "\n", " -0.50351448\n", " | \n", "
| \n", " -0.85514571\n", " | \n", "\n", " 0.00531064\n", " | \n", "