diff --git a/stanza/demo/Stanza_CoreNLP_Interface.ipynb b/stanza/demo/Stanza_CoreNLP_Interface.ipynb new file mode 100644 index 0000000000000000000000000000000000000000..a19cffbe55afe5ef9220d69706e71dcd379342bb --- /dev/null +++ b/stanza/demo/Stanza_CoreNLP_Interface.ipynb @@ -0,0 +1,485 @@ +{ + "nbformat": 4, + "nbformat_minor": 0, + "metadata": { + "colab": { + "name": "Stanza-CoreNLP-Interface.ipynb", + "provenance": [], + "collapsed_sections": [], + "toc_visible": true + }, + "kernelspec": { + "name": "python3", + "display_name": "Python 3" + } + }, + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "2-4lzQTC9yxG", + "colab_type": "text" + }, + "source": [ + "# Stanza: A Tutorial on the Python CoreNLP Interface\n", + "\n", + "![Latest Version](https://img.shields.io/pypi/v/stanza.svg?colorB=bc4545)\n", + "![Python Versions](https://img.shields.io/pypi/pyversions/stanza.svg?colorB=bc4545)\n", + "\n", + "While the Stanza library implements accurate neural network modules for basic functionalities such as part-of-speech tagging and dependency parsing, the [Stanford CoreNLP Java library](https://stanfordnlp.github.io/CoreNLP/) has been developed for years and offers more complementary features such as coreference resolution and relation extraction. To unlock these features, the Stanza library also offers an officially maintained Python interface to the CoreNLP Java library. This interface allows you to get NLP annotations from CoreNLP by writing native Python code.\n", + "\n", + "\n", + "This tutorial walks you through the installation, setup and basic usage of this Python CoreNLP interface. If you want to learn how to use the neural network components in Stanza, please refer to other tutorials." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "YpKwWeVkASGt", + "colab_type": "text" + }, + "source": [ + "## 1. Installation\n", + "\n", + "Before the installation starts, please make sure that you have Python 3 and Java installed on your computer. Since Colab already has them installed, we'll skip this procedure in this notebook."
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "k1Az2ECuAfG8", + "colab_type": "text" + }, + "source": [ + "### Installing Stanza\n", + "\n", + "Installing and importing Stanza are as simple as running the following commands:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "xiFwYAgW4Mss", + "colab_type": "code", + "colab": {} + }, + "source": [ + "# Install stanza; note that the prefix \"!\" is not needed if you are running in a terminal\n", + "!pip install stanza\n", + "\n", + "# Import stanza\n", + "import stanza" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "2zFvaA8_A32_", + "colab_type": "text" + }, + "source": [ + "### Setting up Stanford CoreNLP\n", + "\n", + "In order for the interface to work, the Stanford CoreNLP library has to be installed and the `CORENLP_HOME` environment variable has to point to the installation location.\n", + "\n", + "Here we are going to show you how to download and install the CoreNLP library on your machine, with Stanza's installation command:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "MgK6-LPV-OdA", + "colab_type": "code", + "colab": {} + }, + "source": [ + "# Download the Stanford CoreNLP package with Stanza's installation command\n", + "# This'll take several minutes, depending on the network speed\n", + "corenlp_dir = './corenlp'\n", + "stanza.install_corenlp(dir=corenlp_dir)\n", + "\n", + "# Set the CORENLP_HOME environment variable to point to the installation location\n", + "import os\n", + "os.environ[\"CORENLP_HOME\"] = corenlp_dir" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Jdq8MT-NAhKj", + "colab_type": "text" + }, + "source": [ + "That's all for the installation! 🎉 We can now double-check that the installation was successful by listing files in the CoreNLP directory. You should be able to see a number of `.jar` files by running the following command:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "K5eIOaJp_tuo", + "colab_type": "code", + "colab": {} + }, + "source": [ + "# Examine the CoreNLP installation folder to make sure the installation is successful\n", + "!ls $CORENLP_HOME" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "S0xb9BHt__gx", + "colab_type": "text" + }, + "source": [ + "**Note 1**:\n", + "If you want to use the interface in a terminal (instead of a Colab notebook), you can set the `CORENLP_HOME` environment variable with:\n", + "\n", + "```bash\n", + "export CORENLP_HOME=path_to_corenlp_dir\n", + "```\n", + "\n", + "Here we instead set this variable with the Python `os` library, simply because the `export` command is not well supported in Colab notebooks.\n", + "\n", + "\n", + "**Note 2**:\n", + "The `stanza.install_corenlp()` function is only available since Stanza v1.1.1. If you are using an earlier version of Stanza, please check out our [manual installation page](https://stanfordnlp.github.io/stanza/client_setup.html#manual-installation) for how to install CoreNLP on your computer.\n", + "\n", + "**Note 3**:\n", + "Besides the installation function, we also provide a `stanza.download_corenlp_models()` function to help you download additional CoreNLP models for different languages that are not shipped with the default installation. 
Check out our [automatic installation page](https://stanfordnlp.github.io/stanza/client_setup.html#automated-installation) for more information on how to use it." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "xJsuO6D8D05q", + "colab_type": "text" + }, + "source": [ + "## 2. Annotating Text with the CoreNLP Interface" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "dZNHxXHkH1K2", + "colab_type": "text" + }, + "source": [ + "### Constructing CoreNLPClient\n", + "\n", + "At a high level, the CoreNLP Python interface works by first starting a background Java CoreNLP server process, and then initializing a client instance in Python which passes the text to the background server process and accepts the returned annotation results.\n", + "\n", + "We wrap these functionalities in a `CoreNLPClient` class. Therefore, we need to start by importing this class from Stanza." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "LS4OKnqJ8wui", + "colab_type": "code", + "colab": {} + }, + "source": [ + "# Import client module\n", + "from stanza.server import CoreNLPClient" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "WP4Dz6PIJHeL", + "colab_type": "text" + }, + "source": [ + "After the import is done, we can construct a `CoreNLPClient` instance. The constructor method takes a Python list of annotator names as an argument. Here let's explore some basic annotators including tokenization, sentence splitting, part-of-speech tagging, lemmatization and named entity recognition (NER). \n", + "\n", + "Additionally, the client constructor accepts a `memory` argument, which specifies how much memory will be allocated to the background Java process. An `endpoint` option can be used to specify the port used for communication between the server and the client. The default port is 9000. However, since this port is already occupied by a system process in Colab, we'll manually set it to 9001 in the following example.\n", + "\n", + "Also, here we manually set `be_quiet=True` to avoid an I/O issue in the Colab notebook. You should be able to use `be_quiet=False` on your own computer, which will print detailed logging information from CoreNLP during usage.\n", + "\n", + "For more options for constructing the client, please refer to the [CoreNLP Client Options List](https://stanfordnlp.github.io/stanza/corenlp_client.html#corenlp-client-options)." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "mbOBugvd9JaM", + "colab_type": "code", + "colab": {} + }, + "source": [ + "# Construct a CoreNLPClient with some basic annotators, a memory allocation of 4GB, and port number 9001\n", + "client = CoreNLPClient(\n", + " annotators=['tokenize','ssplit', 'pos', 'lemma', 'ner'], \n", + " memory='4G', \n", + " endpoint='http://localhost:9001',\n", + " be_quiet=True)\n", + "print(client)\n", + "\n", + "# Start the background server and wait for some time\n", + "# Note that in practice this is totally optional, as by default the server will be started when the first annotation is performed\n", + "client.start()\n", + "import time; time.sleep(10)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "kgTiVjNydmIW", + "colab_type": "text" + }, + "source": [ + "After the above code block finishes executing, if you print the background processes, you should be able to find the Java CoreNLP server running."
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "spZrJ-oFdkdF", + "colab_type": "code", + "colab": {} + }, + "source": [ + "# Print background processes and look for java\n", + "# You should be able to see a StanfordCoreNLPServer java process running in the background\n", + "!ps -o pid,cmd | grep java" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "KxJeJ0D2LoOs", + "colab_type": "text" + }, + "source": [ + "### Annotating Text\n", + "\n", + "Annotating a piece of text is as simple as passing the text to the `annotate` function of the client object. After the annotation is complete, a `Document` object will be returned with all annotations.\n", + "\n", + "Note that although in general annotations are very fast, the first annotation might take a while to complete in the notebook. Please be patient." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "s194RnNg5z95", + "colab_type": "code", + "colab": {} + }, + "source": [ + "# Annotate some text\n", + "text = \"Albert Einstein was a German-born theoretical physicist. He developed the theory of relativity.\"\n", + "document = client.annotate(text)\n", + "print(type(document))" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "semmA3e0TcM1", + "colab_type": "text" + }, + "source": [ + "## 3. Accessing Annotations\n", + "\n", + "Annotations can be accessed from the returned `Document` object.\n", + "\n", + "A `Document` contains a list of `Sentence`s, which contain a list of `Token`s. Here let's first explore the annotations stored in all tokens." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "lIO4B5d6Rk4I", + "colab_type": "code", + "colab": {} + }, + "source": [ + "# Iterate over all tokens in all sentences, and print out the word, lemma, pos and ner tags\n", + "print(\"{:12s}\\t{:12s}\\t{:6s}\\t{}\".format(\"Word\", \"Lemma\", \"POS\", \"NER\"))\n", + "\n", + "for i, sent in enumerate(document.sentence):\n", + " print(\"[Sentence {}]\".format(i+1))\n", + " for t in sent.token:\n", + " print(\"{:12s}\\t{:12s}\\t{:6s}\\t{}\".format(t.word, t.lemma, t.pos, t.ner))\n", + " print(\"\")" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "msrJfvu8VV9m", + "colab_type": "text" + }, + "source": [ + "Alternatively, you can also browse the NER results by iterating over entity mentions in the sentences. For example:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ezEjc9LeV2Xs", + "colab_type": "code", + "colab": {} + }, + "source": [ + "# Iterate over all detected entity mentions\n", + "print(\"{:30s}\\t{}\".format(\"Mention\", \"Type\"))\n", + "\n", + "for sent in document.sentence:\n", + " for m in sent.mentions:\n", + " print(\"{:30s}\\t{}\".format(m.entityMentionText, m.entityType))" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ueGzBZ3hWzkN", + "colab_type": "text" + }, + "source": [ + "To print all annotations a sentence, token or mention has, you can simply print the corresponding object."
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "4_S8o2BHXIed", + "colab_type": "code", + "colab": {} + }, + "source": [ + "# Print annotations of a token\n", + "print(document.sentence[0].token[0])\n", + "\n", + "# Print annotations of a mention\n", + "print(document.sentence[0].mentions[0])" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Qp66wjZ10xia", + "colab_type": "text" + }, + "source": [ + "**Note**: Since the Stanza CoreNLP client interface simply ports the CoreNLP annotation results to native Python objects, for a comprehensive list of available annotators and how their annotation results can be accessed, you will need to visit the [Stanford CoreNLP website](https://stanfordnlp.github.io/CoreNLP/)." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "IPqzMK90X0w3", + "colab_type": "text" + }, + "source": [ + "## 4. Shutting Down the CoreNLP Server\n", + "\n", + "To shut down the background CoreNLP server process, simply call the `stop` function of the client. Note that once a server is shut down, you'll have to restart the server with the `start()` function before any annotation is requested." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "xrJq8lZ3Nw7b", + "colab_type": "code", + "colab": {} + }, + "source": [ + "# Shut down the background CoreNLP server\n", + "client.stop()\n", + "\n", + "time.sleep(10)\n", + "!ps -o pid,cmd | grep java" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "23Vwa_ifYfF7", + "colab_type": "text" + }, + "source": [ + "### More Information\n", + "\n", + "For more information on how to use the `CoreNLPClient`, please go to the [CoreNLPClient documentation page](https://stanfordnlp.github.io/stanza/corenlp_client.html)." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "YUrVT6kA_Bzx", + "colab_type": "text" + }, + "source": [ + "## 5. Simplifying Client Usage with the Python `with` statement\n", + "\n", + "In the above demo, we explicitly called the `client.start()` and `client.stop()` functions to start and stop a client-server connection. However, doing this in practice is usually suboptimal, since you may forget to call the `stop()` function at the end, resulting in an unused server process occupying your machine's memory.\n", + "\n", + "To solve this, a simple solution is to use the client interface with the [Python `with` statement](https://docs.python.org/3/reference/compound_stmts.html#the-with-statement). The `with` statement provides an elegant way to automatically start and stop the server process in your Python program, without you needing to worry about it. The following code snippet demonstrates how to establish a client, annotate an example text and then stop the server with a simple `with` statement. Note that we **always recommend** using the `with` statement when working with the Stanza CoreNLP client interface."
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "H0ct2-R4AvJh", + "colab_type": "code", + "colab": {} + }, + "source": [ + "print(\"Starting a server with the Python \\\"with\\\" statement...\")\n", + "with CoreNLPClient(annotators=['tokenize','ssplit', 'pos', 'lemma', 'ner'], \n", + " memory='4G', endpoint='http://localhost:9001', be_quiet=True) as client:\n", + " text = \"Albert Einstein was a German-born theoretical physicist.\"\n", + " document = client.annotate(text)\n", + "\n", + " print(\"{:30s}\\t{}\".format(\"Mention\", \"Type\"))\n", + " for sent in document.sentence:\n", + " for m in sent.mentions:\n", + " print(\"{:30s}\\t{}\".format(m.entityMentionText, m.entityType))\n", + "\n", + "print(\"\\nThe server should be stopped upon exit from the \\\"with\\\" statement.\")" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "W435Lwc4YqKb", + "colab_type": "text" + }, + "source": [ + "## 6. Other Resources\n", + "\n", + "- [Stanza Homepage](https://stanfordnlp.github.io/stanza/)\n", + "- [FAQs](https://stanfordnlp.github.io/stanza/faq.html)\n", + "- [GitHub Repo](https://github.com/stanfordnlp/stanza)\n", + "- [Reporting Issues](https://github.com/stanfordnlp/stanza/issues)\n" + ] + } + ] +} \ No newline at end of file diff --git a/stanza/demo/en_test.conllu.txt b/stanza/demo/en_test.conllu.txt new file mode 100644 index 0000000000000000000000000000000000000000..0dfb9d9179703c5a8fa6c2ea2aa1f1dde9ee7cb2 --- /dev/null +++ b/stanza/demo/en_test.conllu.txt @@ -0,0 +1,79 @@ +# newdoc id = weblog-blogspot.com_zentelligence_20040423000200_ENG_20040423_000200 +# sent_id = weblog-blogspot.com_zentelligence_20040423000200_ENG_20040423_000200-0001 +# newpar id = weblog-blogspot.com_zentelligence_20040423000200_ENG_20040423_000200-p0001 +# text = What if Google Morphed Into GoogleOS? +1 What what PRON WP PronType=Int 0 root 0:root _ +2 if if SCONJ IN _ 4 mark 4:mark _ +3 Google Google PROPN NNP Number=Sing 4 nsubj 4:nsubj _ +4 Morphed morph VERB VBD Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin 1 advcl 1:advcl:if _ +5 Into into ADP IN _ 6 case 6:case _ +6 GoogleOS GoogleOS PROPN NNP Number=Sing 4 obl 4:obl:into SpaceAfter=No +7 ? ? PUNCT . _ 4 punct 4:punct _ + +# sent_id = weblog-blogspot.com_zentelligence_20040423000200_ENG_20040423_000200-0002 +# text = What if Google expanded on its search-engine (and now e-mail) wares into a full-fledged operating system? 
+1 What what PRON WP PronType=Int 0 root 0:root _ +2 if if SCONJ IN _ 4 mark 4:mark _ +3 Google Google PROPN NNP Number=Sing 4 nsubj 4:nsubj _ +4 expanded expand VERB VBD Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin 1 advcl 1:advcl:if _ +5 on on ADP IN _ 15 case 15:case _ +6 its its PRON PRP$ Gender=Neut|Number=Sing|Person=3|Poss=Yes|PronType=Prs 15 nmod:poss 15:nmod:poss _ +7 search search NOUN NN Number=Sing 9 compound 9:compound SpaceAfter=No +8 - - PUNCT HYPH _ 9 punct 9:punct SpaceAfter=No +9 engine engine NOUN NN Number=Sing 15 compound 15:compound _ +10 ( ( PUNCT -LRB- _ 9 punct 9:punct SpaceAfter=No +11 and and CCONJ CC _ 13 cc 13:cc _ +12 now now ADV RB _ 13 advmod 13:advmod _ +13 e-mail e-mail NOUN NN Number=Sing 9 conj 9:conj:and|15:compound SpaceAfter=No +14 ) ) PUNCT -RRB- _ 15 punct 15:punct _ +15 wares wares NOUN NNS Number=Plur 4 obl 4:obl:on _ +16 into into ADP IN _ 22 case 22:case _ +17 a a DET DT Definite=Ind|PronType=Art 22 det 22:det _ +18 full full ADV RB _ 20 advmod 20:advmod SpaceAfter=No +19 - - PUNCT HYPH _ 20 punct 20:punct SpaceAfter=No +20 fledged fledged ADJ JJ Degree=Pos 22 amod 22:amod _ +21 operating operating NOUN NN Number=Sing 22 compound 22:compound _ +22 system system NOUN NN Number=Sing 4 obl 4:obl:into SpaceAfter=No +23 ? ? PUNCT . _ 4 punct 4:punct _ + +# sent_id = weblog-blogspot.com_zentelligence_20040423000200_ENG_20040423_000200-0003 +# text = [via Microsoft Watch from Mary Jo Foley ] +1 [ [ PUNCT -LRB- _ 4 punct 4:punct SpaceAfter=No +2 via via ADP IN _ 4 case 4:case _ +3 Microsoft Microsoft PROPN NNP Number=Sing 4 compound 4:compound _ +4 Watch Watch PROPN NNP Number=Sing 0 root 0:root _ +5 from from ADP IN _ 6 case 6:case _ +6 Mary Mary PROPN NNP Number=Sing 4 nmod 4:nmod:from _ +7 Jo Jo PROPN NNP Number=Sing 6 flat 6:flat _ +8 Foley Foley PROPN NNP Number=Sing 6 flat 6:flat _ +9 ] ] PUNCT -RRB- _ 4 punct 4:punct _ + +# newdoc id = weblog-blogspot.com_marketview_20050511222700_ENG_20050511_222700 +# sent_id = weblog-blogspot.com_marketview_20050511222700_ENG_20050511_222700-0001 +# newpar id = weblog-blogspot.com_marketview_20050511222700_ENG_20050511_222700-p0001 +# text = (And, by the way, is anybody else just a little nostalgic for the days when that was a good thing?) 
+1 ( ( PUNCT -LRB- _ 14 punct 14:punct SpaceAfter=No +2 And and CCONJ CC _ 14 cc 14:cc SpaceAfter=No +3 , , PUNCT , _ 14 punct 14:punct _ +4 by by ADP IN _ 6 case 6:case _ +5 the the DET DT Definite=Def|PronType=Art 6 det 6:det _ +6 way way NOUN NN Number=Sing 14 obl 14:obl:by SpaceAfter=No +7 , , PUNCT , _ 14 punct 14:punct _ +8 is be AUX VBZ Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin 14 cop 14:cop _ +9 anybody anybody PRON NN Number=Sing 14 nsubj 14:nsubj _ +10 else else ADJ JJ Degree=Pos 9 amod 9:amod _ +11 just just ADV RB _ 13 advmod 13:advmod _ +12 a a DET DT Definite=Ind|PronType=Art 13 det 13:det _ +13 little little ADJ JJ Degree=Pos 14 obl:npmod 14:obl:npmod _ +14 nostalgic nostalgic NOUN NN Number=Sing 0 root 0:root _ +15 for for ADP IN _ 17 case 17:case _ +16 the the DET DT Definite=Def|PronType=Art 17 det 17:det _ +17 days day NOUN NNS Number=Plur 14 nmod 14:nmod:for|23:obl:npmod _ +18 when when ADV WRB PronType=Rel 23 advmod 17:ref _ +19 that that PRON DT Number=Sing|PronType=Dem 23 nsubj 23:nsubj _ +20 was be AUX VBD Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin 23 cop 23:cop _ +21 a a DET DT Definite=Ind|PronType=Art 23 det 23:det _ +22 good good ADJ JJ Degree=Pos 23 amod 23:amod _ +23 thing thing NOUN NN Number=Sing 17 acl:relcl 17:acl:relcl SpaceAfter=No +24 ? ? PUNCT . _ 14 punct 14:punct SpaceAfter=No +25 ) ) PUNCT -RRB- _ 14 punct 14:punct _ \ No newline at end of file diff --git a/stanza/demo/semgrex visualization.ipynb b/stanza/demo/semgrex visualization.ipynb new file mode 100644 index 0000000000000000000000000000000000000000..b2b0b44c4c034df1a5c1cd0ffaa9da234ecf2ee7 --- /dev/null +++ b/stanza/demo/semgrex visualization.ipynb @@ -0,0 +1,367 @@ +{ + "cells": [ + { + "cell_type": "code", + "execution_count": null, + "id": "2787d5f5", + "metadata": {}, + "outputs": [], + "source": [ + "import stanza\n", + "from stanza.server.semgrex import Semgrex\n", + "from stanza.models.common.constant import is_right_to_left\n", + "import spacy\n", + "from spacy import displacy\n", + "from spacy.tokens import Doc\n", + "from IPython.display import display, HTML\n", + "\n", + "\n", + "\"\"\"\n", + "IMPORTANT: For the code in this module to run, you must have corenlp and Java installed on your machine. 
Additionally,\n", + "set an environment variable CLASSPATH equal to the path of your corenlp directory.\n", + "\n", + "Example: CLASSPATH=C:\\\\Users\\\\Alex\\\\PycharmProjects\\\\pythonProject\\\\stanford-corenlp-4.5.0\\\\stanford-corenlp-4.5.0\\\\*\n", + "\"\"\"\n", + "\n", + "%env CLASSPATH=C:\\\\stanford-corenlp-4.5.2\\\\stanford-corenlp-4.5.2\\\\*\n", + "def get_sentences_html(doc, language):\n", + " \"\"\"\n", + " Returns a list of the HTML strings of the dependency visualizations of a given stanza doc object.\n", + "\n", + " The 'language' arg is the two-letter language code for the document to be processed.\n", + "\n", + " First converts the stanza doc object to a spacy doc object and uses displacy to generate an HTML\n", + " string for each sentence of the doc object.\n", + " \"\"\"\n", + " html_strings = []\n", + "\n", + " # blank model - we don't use any of the model features, just the visualization\n", + " nlp = spacy.blank(\"en\")\n", + " sentences_to_visualize = []\n", + " for sentence in doc.sentences:\n", + " words, lemmas, heads, deps, tags = [], [], [], [], []\n", + " if is_right_to_left(language): # order of words displayed is reversed, dependency arcs remain intact\n", + " sent_len = len(sentence.words)\n", + " for word in reversed(sentence.words):\n", + " words.append(word.text)\n", + " lemmas.append(word.lemma)\n", + " deps.append(word.deprel)\n", + " tags.append(word.upos)\n", + " if word.head == 0: # spaCy head indexes are formatted differently than that of Stanza\n", + " heads.append(sent_len - word.id)\n", + " else:\n", + " heads.append(sent_len - word.head)\n", + " else: # left to right rendering\n", + " for word in sentence.words:\n", + " words.append(word.text)\n", + " lemmas.append(word.lemma)\n", + " deps.append(word.deprel)\n", + " tags.append(word.upos)\n", + " if word.head == 0:\n", + " heads.append(word.id - 1)\n", + " else:\n", + " heads.append(word.head - 1)\n", + " document_result = Doc(nlp.vocab, words=words, lemmas=lemmas, heads=heads, deps=deps, pos=tags)\n", + " sentences_to_visualize.append(document_result)\n", + "\n", + " for line in sentences_to_visualize: # render all sentences through displaCy\n", + " html_strings.append(displacy.render(line, style=\"dep\",\n", + " options={\"compact\": True, \"word_spacing\": 30, \"distance\": 100,\n", + " \"arrow_spacing\": 20}, jupyter=False))\n", + " return html_strings\n", + "\n", + "\n", + "def find_nth(haystack, needle, n):\n", + " \"\"\"\n", + " Returns the starting index of the nth occurrence of the substring 'needle' in the string 'haystack'.\n", + " \"\"\"\n", + " start = haystack.find(needle)\n", + " while start >= 0 and n > 1:\n", + " start = haystack.find(needle, start + len(needle))\n", + " n -= 1\n", + " return start\n", + "\n", + "\n", + "def round_base(num, base=10):\n", + " \"\"\"\n", + " Rounding a number to its nearest multiple of the base. 
round_base(49.2, base=50) = 50.\n", + " \"\"\"\n", + " return base * round(num/base)\n", + "\n", + "\n", + "def process_sentence_html(orig_html, semgrex_sentence):\n", + " \"\"\"\n", + " Takes a semgrex sentence object and modifies the HTML of the original sentence's deprel visualization,\n", + " highlighting words involved in the search queries and adding the label of the word inside of the semgrex match.\n", + "\n", + " Returns the modified html string of the sentence's deprel visualization.\n", + " \"\"\"\n", + " tracker = {} # keep track of which words have multiple labels\n", + " DEFAULT_TSPAN_COUNT = 2 # the original displacy html assigns two objects per object\n", + " CLOSING_TSPAN_LEN = 8 # is 8 chars long\n", + " colors = ['#4477AA', '#66CCEE', '#228833', '#CCBB44', '#EE6677', '#AA3377', '#BBBBBB']\n", + " css_bolded_class = \"\\n\"\n", + " found_index = orig_html.find(\"\\n\") # returns index where the opening ends\n", + " # insert the new style class into html string\n", + " orig_html = orig_html[: found_index + 1] + css_bolded_class + orig_html[found_index + 1:]\n", + "\n", + " # Add color to words in the match, bold words in the match\n", + " for query in semgrex_sentence.result:\n", + " for i, match in enumerate(query.match):\n", + " color = colors[i]\n", + " paired_dy = 2\n", + " for node in match.node:\n", + " name, match_index = node.name, node.matchIndex\n", + " # edit existing to change color and bold the text\n", + " start = find_nth(orig_html, \" of interest\n", + " if match_index not in tracker: # if we've already bolded and colored, keep the first color\n", + " tspan_start = orig_html.find(\" inside of the \n", + " tspan_end = orig_html.find(\"\", start) # finds start of the end of the above \n", + " tspan_substr = orig_html[tspan_start: tspan_end + CLOSING_TSPAN_LEN + 1] + \"\\n\"\n", + " # color words in the hit and bold words in the hit\n", + " edited_tspan = tspan_substr.replace('class=\"displacy-word\"', 'class=\"bolded\"').replace(\n", + " 'fill=\"currentColor\"', f'fill=\"{color}\"')\n", + " # insert edited object into html string\n", + " orig_html = orig_html[: tspan_start] + edited_tspan + orig_html[tspan_end + CLOSING_TSPAN_LEN + 2:]\n", + " tracker[match_index] = DEFAULT_TSPAN_COUNT\n", + "\n", + " # next, we have to insert the new object for the label\n", + " # Copy old to copy formatting when creating new later\n", + " prev_tspan_start = find_nth(orig_html[start:], \" start index\n", + " prev_tspan_end = find_nth(orig_html[start:], \"\",\n", + " tracker[match_index] - 1) + start # find the prev start index\n", + " prev_tspan = orig_html[prev_tspan_start: prev_tspan_end + CLOSING_TSPAN_LEN + 1]\n", + "\n", + " # Find spot to insert new tspan\n", + " closing_tspan_start = find_nth(orig_html[start:], \"\", tracker[match_index]) + start\n", + " up_to_new_tspan = orig_html[: closing_tspan_start + CLOSING_TSPAN_LEN + 1]\n", + " rest_need_add_newline = orig_html[closing_tspan_start + CLOSING_TSPAN_LEN + 1:]\n", + "\n", + " # Calculate proper x value in svg\n", + " x_value_start = prev_tspan.find('x=\"')\n", + " x_value_end = prev_tspan[x_value_start + 3:].find('\"') + 3 # 3 is the length of the 'x=\"' substring\n", + " x_value = prev_tspan[x_value_start + 3: x_value_end + x_value_start]\n", + "\n", + " # Calculate proper y value in svg\n", + " DEFAULT_DY_VAL, dy = 2, 2\n", + " if paired_dy != DEFAULT_DY_VAL and node == match.node[\n", + " 1]: # we're on the second node and need to adjust height to match the paired node\n", + " dy = paired_dy\n", + " if node == 
match.node[0]:\n", + " paired_node_level = 2\n", + " if match.node[1].matchIndex in tracker: # check if we need to adjust heights of labels\n", + " paired_node_level = tracker[match.node[1].matchIndex]\n", + " dif = tracker[match_index] - paired_node_level\n", + " if dif > 0: # current node has more labels\n", + " paired_dy = DEFAULT_DY_VAL * dif + 1\n", + " dy = DEFAULT_DY_VAL\n", + " else: # paired node has more labels, adjust this label down\n", + " dy = DEFAULT_DY_VAL * (abs(dif) + 1)\n", + " paired_dy = DEFAULT_DY_VAL\n", + "\n", + " # Insert new object\n", + " new_tspan = f' {name[: 3].title()}.\\n' # abbreviate label names to 3 chars\n", + " orig_html = up_to_new_tspan + new_tspan + rest_need_add_newline\n", + " tracker[match_index] += 1\n", + " return orig_html\n", + "\n", + "\n", + "def render_html_strings(edited_html_strings):\n", + " \"\"\"\n", + " Renders the HTML to make the edits visible\n", + " \"\"\"\n", + " for html_string in edited_html_strings:\n", + " display(HTML(html_string))\n", + "\n", + "\n", + "def visualize_search_doc(doc, semgrex_queries, lang_code, start_match=0, end_match=10):\n", + " \"\"\"\n", + " Visualizes the semgrex results of running semgrex search on a stanza doc object with the given list of\n", + " semgrex queries. Returns a list of the edited HTML strings from the doc. Each element in the list represents\n", + " the HTML to render one of the sentences in the document.\n", + "\n", + " 'lang_code' is the two-letter language abbreviation for the language that the stanza doc object is written in.\n", + "\n", + "\n", + " 'start_match' and 'end_match' determine which matches to visualize. Works similar to splices, so that\n", + " start_match=0 and end_match=10 will display the first 10 semgrex matches.\n", + " \"\"\"\n", + " matches_count = 0 # Limits number of visualizations\n", + " with Semgrex(classpath=\"$CLASSPATH\") as sem:\n", + " edited_html_strings = []\n", + " semgrex_results = sem.process(doc, *semgrex_queries)\n", + " # one html string for each sentence\n", + " unedited_html_strings = get_sentences_html(doc, lang_code)\n", + " for i in range(len(unedited_html_strings)):\n", + "\n", + " if matches_count >= end_match: # we've collected enough matches, stop early\n", + " break\n", + "\n", + " # check if sentence has matches, if not then do not visualize\n", + " has_none = True\n", + " for query in semgrex_results.result[i].result:\n", + " for match in query.match:\n", + " if match:\n", + " has_none = False\n", + "\n", + " # Process HTML if queries have matches\n", + " if not has_none:\n", + " if start_match <= matches_count < end_match:\n", + " edited_string = process_sentence_html(unedited_html_strings[i], semgrex_results.result[i])\n", + " edited_string = adjust_dep_arrows(edited_string)\n", + " edited_html_strings.append(edited_string)\n", + " matches_count += 1\n", + "\n", + " render_html_strings(edited_html_strings)\n", + " return edited_html_strings\n", + "\n", + "\n", + "def visualize_search_str(text, semgrex_queries, lang_code):\n", + " \"\"\"\n", + " Visualizes the deprel of the semgrex results from running semgrex search on a string with the given list of\n", + " semgrex queries. Returns a list of the edited HTML strings. 
Each element in the list represents\n", + " the HTML to render one of the sentences in the document.\n", + "\n", + " Internally, this function converts the string into a stanza doc object before processing the doc object.\n", + "\n", + " 'lang_code' is the two-letter language abbreviation for the language that the stanza doc object is written in.\n", + " \"\"\"\n", + " nlp = stanza.Pipeline(lang_code, processors=\"tokenize, pos, lemma, depparse\")\n", + " doc = nlp(text)\n", + " return visualize_search_doc(doc, semgrex_queries, lang_code)\n", + "\n", + "\n", + "def adjust_dep_arrows(raw_html):\n", + " \"\"\"\n", + " The default spaCy dependency visualization has misaligned arrows.\n", + " We fix arrows by aligning arrow ends and bodies to the word that they are directed to. If a word has an\n", + " arrowhead that is pointing not directly on the word's center, align the arrowhead to match the center of the word.\n", + "\n", + " returns the edited html with fixed arrow placement\n", + " \"\"\"\n", + " HTML_ARROW_BEGINNING = ''\n", + " HTML_ARROW_ENDING = \"\"\n", + " HTML_ARROW_ENDING_LEN = 6 # there are 2 newline chars after the arrow ending\n", + " arrows_start_idx = find_nth(haystack=raw_html, needle='', n=1)\n", + " words_html, arrows_html = raw_html[: arrows_start_idx], raw_html[arrows_start_idx:] # separate html for words and arrows\n", + " final_html = words_html # continually concatenate to this after processing each arrow\n", + " arrow_number = 1 # which arrow we're editing (1-indexed)\n", + " start_idx, end_of_class_idx = find_nth(haystack=arrows_html, needle=HTML_ARROW_BEGINNING, n=arrow_number), find_nth(arrows_html, HTML_ARROW_ENDING, arrow_number)\n", + " while start_idx != -1: # edit every arrow\n", + " arrow_section = arrows_html[start_idx: end_of_class_idx + HTML_ARROW_ENDING_LEN] # slice a single svg arrow object\n", + " if arrow_section[-1] == \"<\": # this is the last arrow in the HTML, don't cut the splice early\n", + " arrow_section = arrows_html[start_idx:]\n", + " edited_arrow_section = edit_dep_arrow(arrow_section)\n", + "\n", + " final_html = final_html + edited_arrow_section # continually update html with new arrow html until done\n", + "\n", + " # Prepare for next iteration\n", + " arrow_number += 1\n", + " start_idx = find_nth(arrows_html, '', n=arrow_number)\n", + " end_of_class_idx = find_nth(arrows_html, \"\", arrow_number)\n", + " return final_html\n", + "\n", + "\n", + "def edit_dep_arrow(arrow_html):\n", + " \"\"\"\n", + " The formatting of a displacy arrow in svg is the following:\n", + " \n", + " \n", + " \n", + " csubj\n", + " \n", + " \n", + " \n", + "\n", + " We edit the 'd = ...' parts of the section to fix the arrow direction and length\n", + "\n", + " returns the arrow_html with distances fixed\n", + " \"\"\"\n", + " WORD_SPACING = 50 # words start at x=50 and are separated by 100s so their x values are multiples of 50\n", + " M_OFFSET = 4 # length of 'd=\"M' that we search for to extract the number from d=\"M70, for instance\n", + " ARROW_PIXEL_SIZE = 4\n", + " first_d_idx, second_d_idx = find_nth(arrow_html, 'd=\"M', 1), find_nth(arrow_html, 'd=\"M', 2) # find where d=\"M starts\n", + " first_d_cutoff, second_d_cutoff = arrow_html.find(\",\", first_d_idx), arrow_html.find(\",\", second_d_idx) # isolate the number after 'M' e.g. 
'M70'\n", + " # gives svg x values of arrow body starting position and arrowhead position\n", + " arrow_position, arrowhead_position = float(arrow_html[first_d_idx + M_OFFSET: first_d_cutoff]), float(arrow_html[second_d_idx + M_OFFSET: second_d_cutoff])\n", + " # gives starting index of where 'fill=\"none\"' or 'fill=\"currentColor\"' begin, reference points to end the d= section\n", + " first_fill_start_idx, second_fill_start_idx = find_nth(arrow_html, \"fill\", n=1), find_nth(arrow_html, \"fill\", n=3)\n", + "\n", + " # isolate the d= ... section to edit\n", + " first_d, second_d = arrow_html[first_d_idx: first_fill_start_idx], arrow_html[second_d_idx: second_fill_start_idx]\n", + " first_d_split, second_d_split = first_d.split(\",\"), second_d.split(\",\")\n", + "\n", + " if arrow_position == arrowhead_position: # This arrow is incoming onto the word, center the arrow/head to word center\n", + " corrected_arrow_pos = corrected_arrowhead_pos = round_base(arrow_position, base=WORD_SPACING)\n", + "\n", + " # edit first_d -- arrow body\n", + " second_term = first_d_split[1].split(\" \")[0] + \" \" + str(corrected_arrow_pos)\n", + " first_d = 'd=\"M' + str(corrected_arrow_pos) + \",\" + second_term + \",\" + \",\".join(first_d_split[2:])\n", + "\n", + " # edit second_d -- arrowhead\n", + " second_term = second_d_split[1].split(\" \")[0] + \" L\" + str(corrected_arrowhead_pos - ARROW_PIXEL_SIZE)\n", + " third_term = second_d_split[2].split(\" \")[0] + \" \" + str(corrected_arrowhead_pos + ARROW_PIXEL_SIZE)\n", + " second_d = 'd=\"M' + str(corrected_arrowhead_pos) + \",\" + second_term + \",\" + third_term + \",\" + \",\".join(second_d_split[3:])\n", + " else: # This arrow is outgoing to another word, center the arrow/head to that word's center\n", + " corrected_arrowhead_pos = round_base(arrowhead_position, base=WORD_SPACING)\n", + "\n", + " # edit first_d -- arrow body\n", + " third_term = first_d_split[2].split(\" \")[0] + \" \" + str(corrected_arrowhead_pos)\n", + " fourth_term = first_d_split[3].split(\" \")[0] + \" \" + str(corrected_arrowhead_pos)\n", + " terms = [first_d_split[0], first_d_split[1], third_term, fourth_term] + first_d_split[4:]\n", + " first_d = \",\".join(terms)\n", + "\n", + " # edit second_d -- arrow head\n", + " first_term = f'd=\"M{corrected_arrowhead_pos}'\n", + " second_term = second_d_split[1].split(\" \")[0] + \" L\" + str(corrected_arrowhead_pos - ARROW_PIXEL_SIZE)\n", + " third_term = second_d_split[2].split(\" \")[0] + \" \" + str(corrected_arrowhead_pos + ARROW_PIXEL_SIZE)\n", + " terms = [first_term, second_term, third_term] + second_d_split[3:]\n", + " second_d = \",\".join(terms)\n", + " # rebuild and return html\n", + " return arrow_html[:first_d_idx] + first_d + \" \" + arrow_html[first_fill_start_idx:second_d_idx] + second_d + \" \" + arrow_html[second_fill_start_idx:]\n", + "\n", + "\n", + "def main():\n", + " nlp = stanza.Pipeline(\"en\", processors=\"tokenize,pos,lemma,depparse\")\n", + "\n", + " # doc = nlp(\"This a dummy sentence. Banning opal removed all artifact decks from the meta. I miss playing lantern. This is a dummy sentence.\")\n", + " doc = nlp(\"Banning opal removed artifact decks from the meta. 
Banning tennis resulted in players banning people.\")\n", + " # A single result .result[i].result[j] is a list of matches for sentence i on semgrex query j.\n", + " queries = [\"{pos:NN}=object + + + + + + + Sił Zbrojnych + + + Siły Zbrojne + + + + + + + + + +""".strip() + + +EMPTY_SENTENCE = """""" + +def test_extract_entities_from_sentence(): + rt = ET.fromstring(SENTENCE_SAMPLE) + entities = extract_entities_from_sentence(rt) + assert entities == EXPECTED_ENTITIES['1-p']['1.39-s'] + + rt = ET.fromstring(EMPTY_SENTENCE) + entities = extract_entities_from_sentence(rt) + assert entities == [] + + + +# picked completely at random, one sample file for testing: +# 610-1-000248/ann_named.xml +# only the first sentence is used in the morpho file +SAMPLE_ANN = """ + + + + + + + +

+ + + + + + + + Sił Zbrojnych + + + Siły Zbrojne + + + + + + + + + + + +

+

+ + +

+

+ +

+ +
+
+
+""".lstrip() + + + +SAMPLE_MORPHO = """ + + + + + + + + +

+ + + + + 2 + + + + + + + + + + + + + + + + + 2 + + + + + + + + + + + + + + 2:adj:sg:nom:n:pos + + + + + + + + + . + + + + + + + + + . + + + + + + + + + + + + + + .:interp + + + + + + + + + Wezwanie + + + + + + wezwanie + + + + + + + + + + + + + + + wezwać + + + + + + + + + + + + + + + + + wezwanie:subst:sg:acc:n + + + + + + + + + , + + + + + + + + + , + + + + + + + + + + + + + + ,:interp + + + + + + + + + o + + + + + + o + + + + + + + + + + + o + + + + + + + + + + + + + + ojciec + + + + + + + + + + + + + + o:prep:loc + + + + + + + + + którym + + + + + + który + + + + + + + + + + + + + + + + + + + + + + + + + + + + który:adj:sg:loc:n:pos + + + + + + + + + mowa + + + + + + mowa + + + + + + + + + + + + + + mowa:subst:sg:nom:f + + + + + + + + + w + + + + + + w + + + + + + + + + + + + + + wiek + + + + + + + + + + + wielki + + + + + + + + + + + wiersz + + + + + + + + + + + wieś + + + + + + + + + + + wyspa + + + + + + + + + + + + + + w:prep:loc:nwok + + + + + + + + + ust + + + + + + usta + + + + + + + + + + + ustęp + + + + + + + + + + + + + + ustęp:brev:pun + + + + + + + + + . + + + + + + + + + . + + + + + + + + + + + + + + .:interp + + + + + + + + + 1 + + + + + + + + + + + + + + + + + 1 + + + + + + + + + + + + + + 1:adj:sg:loc:m3:pos + + + + + + + + + , + + + + + + + + + , + + + + + + + + + + + + + + ,:interp + + + + + + + + + doręcza + + + + + + doręczać + + + + + + + + + + + doręcze + + + + + + + + + + + + + + + + + + + doręczać:fin:sg:ter:imperf + + + + + + + + + się + + + + + + się + + + + + + + + + + + + + + się:qub + + + + + + + + + na + + + + + + na + + + + + + + + + + + na + + + + + + + + + + + + + + + + + na:prep:acc + + + + + + + + + czternaście + + + + + + czternaście + + + + + + + + + + + + + + + + + + + + + + + + + + + czternaście:num:pl:acc:m3:rec + + + + + + + + + dni + + + + + + dni + + + + + + + + + + + + + + + + dzień + + + + + + + + + + + + + + + + + + + dzień:subst:pl:gen:m3 + + + + + + + + + przed + + + + + + przed + + + + + + + + + + + + + + + + + przed:prep:inst:nwok + + + + + + + + + terminem + + + + + + termin + + + + + + + + + + + + + + termin:subst:sg:inst:m3 + + + + + + + + + wykonania + + + + + + wykonanie + + + + + + + + + + + + + + + + wykonać + + + + + + + + + + + + + + wykonać:ger:sg:gen:n:perf:aff + + + + + + + + + świadczenia + + + + + + świadczenie + + + + + + + + + + + + + + + + świadczyć + + + + + + + + + + + + + + świadczenie:subst:sg:gen:n + + + + + + + + + , + + + + + + + + + , + + + + + + + + + + + + + + ,:interp + + + + + + + + + z + + + + + + z + + + + + + + + + + + + + + + z + + + + + + + + + + + zeszyt + + + + + + + + + + + + + + z:prep:inst:nwok + + + + + + + + + wyjątkiem + + + + + + wyjątek + + + + + + + + + + + + + + wyjątek:subst:sg:inst:m3 + + + + + + + + + przypadków + + + + + + przypadek + + + + + + + + + + + + + + przypadek:subst:pl:gen:m3 + + + + + + + + + , + + + + + + + + + , + + + + + + + + + + + + + + ,:interp + + + + + + + + + w + + + + + + w + + + + + + + + + + + + + + wiek + + + + + + + + + + + wielki + + + + + + + + + + + wiersz + + + + + + + + + + + wieś + + + + + + + + + + + wyspa + + + + + + + + + + + + + + w:prep:loc:nwok + + + + + + + + + których + + + + + + który + + + + + + + + + + + + + + + + + + + + + + + + + + który:adj:pl:loc:m3:pos + + + + + + + + + wykonanie + + + + + + wykonanie + + + + + + + + + + + + + + + wykonać + + + + + + + + + + + + + + + + + wykonać:ger:sg:nom:n:perf:aff + + + + + + + + + świadczenia + + + + + + świadczenie + + + + + + + + + + + + + + + + świadczyć + + + + + + + + + + + + + + 
świadczenie:subst:sg:gen:n + + + + + + + + + następuje + + + + + + następować + + + + + + + + + + + + + + następować:fin:sg:ter:imperf + + + + + + + + + w + + + + + + w + + + + + + + + + + + + + + wiek + + + + + + + + + + + wielki + + + + + + + + + + + wiersz + + + + + + + + + + + wieś + + + + + + + + + + + wyspa + + + + + + + + + + + + + + w:prep:loc:nwok + + + + + + + + + celu + + + + + + Cela + + + + + + + + + + + cel + + + + + + + + + + + + + + + + + + cel:subst:sg:loc:m3 + + + + + + + + + sprawdzenia + + + + + + sprawdzić + + + + + + + + + + + + + + sprawdzić:ger:sg:gen:n:perf:aff + + + + + + + + + gotowości + + + + + + gotowość + + + + + + + + + + + + + + + + + + + + + + + gotowość:subst:sg:gen:f + + + + + + + + + mobilizacyjnej + + + + + + mobilizacyjny + + + + + + + + + + + + + + + + + + mobilizacyjny:adj:sg:gen:f:pos + + + + + + + + + Sił + + + + + + siła + + + + + + + + + + + siły + + + + + + + + + + + + + + siła:subst:pl:gen:f + + + + + + + + + Zbrojnych + + + + + + zbrojny + + + + + + + + + + + + + + + + + + + + + + + + + + zbrojny:adj:pl:gen:f:pos + + + + + + + + + . + + + + + + + + + . + + + + + + + + + + + + + + .:interp + + + + + + +

+ +
+
+
+""".lstrip() diff --git a/stanza/stanza/tests/resources/test_charlm_depparse.py b/stanza/stanza/tests/resources/test_charlm_depparse.py new file mode 100644 index 0000000000000000000000000000000000000000..990a3d3092d61dd0d81dd8780b51b0b56f5a1513 --- /dev/null +++ b/stanza/stanza/tests/resources/test_charlm_depparse.py @@ -0,0 +1,32 @@ +import pytest + +from stanza.resources.default_packages import default_charlms, depparse_charlms +from stanza.resources.print_charlm_depparse import list_depparse + +def test_list_depparse(): + models = list_depparse() + + # check that it's picking up the models which don't have specific charlms + # first, make sure the default assumption of the test is still true... + # if this test fails, find a different language which isn't in depparse_charlms + assert "af" not in depparse_charlms + assert "af" in default_charlms + assert "af_afribooms_charlm" in models + assert "af_afribooms_nocharlm" in models + + # assert that it's picking up the models which do have specific charlms that aren't None + # again, first make sure the default assumptions are true + # if one of these next few tests fail, just update the test + assert "en" in depparse_charlms + assert "en" in default_charlms + assert "ewt" not in depparse_charlms["en"] + assert "craft" in depparse_charlms["en"] + assert "mimic" in depparse_charlms["en"] + # now, check the results + assert "en_ewt_charlm" in models + assert "en_ewt_nocharlm" in models + assert "en_mimic_charlm" in models + # haven't yet trained w/ and w/o for the bio models + assert "en_mimic_nocharlm" not in models + assert "en_craft_charlm" not in models + assert "en_craft_nocharlm" in models diff --git a/stanza/stanza/tests/resources/test_common.py b/stanza/stanza/tests/resources/test_common.py new file mode 100644 index 0000000000000000000000000000000000000000..ae91bdc18ddfbf04b201625790b2d1c75e8d8f3d --- /dev/null +++ b/stanza/stanza/tests/resources/test_common.py @@ -0,0 +1,132 @@ +""" +Test various resource downloading functions from resources/common.py +""" + +import os +import pytest +import tempfile + +import stanza +from stanza.resources import common +from stanza.tests import TEST_MODELS_DIR, TEST_WORKING_DIR + +pytestmark = [pytest.mark.travis, pytest.mark.client] + +def test_assert_file_exists(): + with tempfile.TemporaryDirectory(dir=TEST_WORKING_DIR) as test_dir: + filename = os.path.join(test_dir, "test.txt") + with pytest.raises(FileNotFoundError): + common.assert_file_exists(filename) + + with open(filename, "w", encoding="utf-8") as fout: + fout.write("Unban mox opal!") + # MD5 of the fake model file, not any real model files in the system + EXPECTED_MD5 = "44dbf21b4e89cea5184615a72a825a36" + common.assert_file_exists(filename) + common.assert_file_exists(filename, md5=EXPECTED_MD5) + + with pytest.raises(ValueError): + common.assert_file_exists(filename, md5="12345") + + with pytest.raises(ValueError): + common.assert_file_exists(filename, md5="12345", alternate_md5="12345") + + common.assert_file_exists(filename, md5="12345", alternate_md5=EXPECTED_MD5) + + +def test_download_tokenize_mwt(): + with tempfile.TemporaryDirectory(dir=TEST_WORKING_DIR) as test_dir: + stanza.download("en", model_dir=test_dir, processors="tokenize", package="ewt", verbose=False) + pipeline = stanza.Pipeline("en", model_dir=test_dir, processors="tokenize", package="ewt") + assert isinstance(pipeline, stanza.Pipeline) + # mwt should be added to the list + assert len(pipeline.loaded_processors) == 2 + +def test_download_non_default(): + """ + 
Test the download path for a single file rather than the default zip + + The expectation is that an NER model will also download two charlm models. + If that layout changes on purpose, this test will fail and will need to be updated + """ + with tempfile.TemporaryDirectory(dir=TEST_WORKING_DIR) as test_dir: + stanza.download("en", model_dir=test_dir, processors="ner", package="ontonotes_charlm", verbose=False) + assert sorted(os.listdir(test_dir)) == ['en', 'resources.json'] + en_dir = os.path.join(test_dir, 'en') + en_dir_listing = sorted(os.listdir(en_dir)) + assert en_dir_listing == ['backward_charlm', 'forward_charlm', 'ner', 'pretrain'] + assert os.listdir(os.path.join(en_dir, 'ner')) == ['ontonotes_charlm.pt'] + for i in en_dir_listing: + assert len(os.listdir(os.path.join(en_dir, i))) == 1 + + +def test_download_two_models(): + """ + Test the download path for two NER models + + The package system should now allow for multiple NER models to be + specified, and a consequence of that is it should be possible to + download two models at once + + The expectation is that the two different NER models both download + a different forward & backward charlm. If that changes, the test + will fail. Best way to update it will be two different models + which download two different charlms + """ + with tempfile.TemporaryDirectory(dir=TEST_WORKING_DIR) as test_dir: + stanza.download("en", model_dir=test_dir, processors="ner", package={"ner": ["ontonotes_charlm", "anatem"]}, verbose=False) + assert sorted(os.listdir(test_dir)) == ['en', 'resources.json'] + en_dir = os.path.join(test_dir, 'en') + en_dir_listing = sorted(os.listdir(en_dir)) + assert en_dir_listing == ['backward_charlm', 'forward_charlm', 'ner', 'pretrain'] + assert sorted(os.listdir(os.path.join(en_dir, 'ner'))) == ['anatem.pt', 'ontonotes_charlm.pt'] + for i in en_dir_listing: + assert len(os.listdir(os.path.join(en_dir, i))) == 2 + + +def test_process_pipeline_parameters(): + """ + Test a few options for specifying which processors to load + """ + with tempfile.TemporaryDirectory(dir=TEST_WORKING_DIR) as test_dir: + lang, model_dir, package, processors = common.process_pipeline_parameters("en", test_dir, None, "tokenize,pos") + assert processors == {"tokenize": "default", "pos": "default"} + assert package == None + + lang, model_dir, package, processors = common.process_pipeline_parameters("en", test_dir, {"tokenize": "spacy"}, "tokenize,pos") + assert processors == {"tokenize": "spacy", "pos": "default"} + assert package == None + + lang, model_dir, package, processors = common.process_pipeline_parameters("en", test_dir, {"pos": "ewt"}, "tokenize,pos") + assert processors == {"tokenize": "default", "pos": "ewt"} + assert package == None + + lang, model_dir, package, processors = common.process_pipeline_parameters("en", test_dir, "ewt", "tokenize,pos") + assert processors == {"tokenize": "ewt", "pos": "ewt"} + assert package == None + +def test_language_resources(): + resources = common.load_resources_json(TEST_MODELS_DIR) + + # check that an unknown language comes back as None + bad_lang = 'z' + while bad_lang in resources and len(bad_lang) < 100: + bad_lang = bad_lang + 'z' + assert bad_lang not in resources + assert common.get_language_resources(resources, bad_lang) == None + + # check the parameters of the test make sense + # there should be 'zh' which is an alias of 'zh-hans' + assert "zh" in resources + assert "alias" in resources["zh"] + assert resources["zh"]["alias"] == "zh-hans" + + # check that getting the resources for 
either 'zh' or 'zh-hans' + # return the simplified Chinese resources + zh_resources = common.get_language_resources(resources, "zh") + assert "tokenize" in zh_resources + assert "alias" not in zh_resources + assert "Chinese" in zh_resources["lang_name"] + + zh_hans_resources = common.get_language_resources(resources, "zh-hans") + assert zh_resources == zh_hans_resources diff --git a/stanza/stanza/tests/resources/test_installation.py b/stanza/stanza/tests/resources/test_installation.py new file mode 100644 index 0000000000000000000000000000000000000000..0b99721afdbc005153d2a7a9cef8272b9e14c549 --- /dev/null +++ b/stanza/stanza/tests/resources/test_installation.py @@ -0,0 +1,48 @@ +""" +Test installation functions. +""" + +import os +import pytest +import shutil +import tempfile + +import stanza +from stanza.tests import TEST_WORKING_DIR + +pytestmark = [pytest.mark.travis, pytest.mark.client] + +def test_install_corenlp(): + # we do not reset the CORENLP_HOME variable since this may impact the + # client tests + with tempfile.TemporaryDirectory(dir=TEST_WORKING_DIR) as test_dir: + + # the download method doesn't install over existing directories + shutil.rmtree(test_dir) + stanza.install_corenlp(dir=test_dir) + + assert os.path.isdir(test_dir), "Installation destination directory not found." + jar_files = [f for f in os.listdir(test_dir) \ + if f.endswith('.jar') and f.startswith('stanford-corenlp')] + assert len(jar_files) > 0, \ + "Cannot find stanford-corenlp jar files in the installation directory." + assert not os.path.exists(os.path.join(test_dir, 'corenlp.zip')), \ + "Downloaded zip file was not removed." + +def test_download_corenlp_models(): + model_name = "arabic" + version = "4.2.2" + + with tempfile.TemporaryDirectory(dir=TEST_WORKING_DIR) as test_dir: + stanza.download_corenlp_models(model=model_name, version=version, dir=test_dir) + + dest_file = os.path.join(test_dir, f"stanford-corenlp-{version}-models-{model_name}.jar") + assert os.path.isfile(dest_file), "Downloaded model file not found." + +def test_download_tokenize_mwt(): + with tempfile.TemporaryDirectory(dir=TEST_WORKING_DIR) as test_dir: + stanza.download("en", model_dir=test_dir, processors="tokenize", package="ewt", verbose=False) + pipeline = stanza.Pipeline("en", model_dir=test_dir, processors="tokenize", package="ewt") + assert isinstance(pipeline, stanza.Pipeline) + # mwt should be added to the list + assert len(pipeline.loaded_processors) == 2 diff --git a/stanza/stanza/tests/server/__init__.py b/stanza/stanza/tests/server/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/stanza/stanza/tests/server/test_client.py b/stanza/stanza/tests/server/test_client.py new file mode 100644 index 0000000000000000000000000000000000000000..3cb8fbf74bedbcbca6d9557f9fa6ea2498884f6c --- /dev/null +++ b/stanza/stanza/tests/server/test_client.py @@ -0,0 +1,239 @@ +""" +Tests that call a running CoreNLPClient. 
+""" + +from http.server import BaseHTTPRequestHandler, HTTPServer +import multiprocessing +import pytest +import requests +import stanza.server as corenlp +import stanza.server.client as client +import shlex +import subprocess +import time + +from stanza.models.constituency import tree_reader +from stanza.tests import * + +# set the marker for this module +pytestmark = [pytest.mark.travis, pytest.mark.client] + +TEXT = "Chris wrote a simple sentence that he parsed with Stanford CoreNLP.\n" + +MAX_REQUEST_ATTEMPTS = 5 + +EN_GOLD = """ +Sentence #1 (12 tokens): +Chris wrote a simple sentence that he parsed with Stanford CoreNLP. + +Tokens: +[Text=Chris CharacterOffsetBegin=0 CharacterOffsetEnd=5 PartOfSpeech=NNP] +[Text=wrote CharacterOffsetBegin=6 CharacterOffsetEnd=11 PartOfSpeech=VBD] +[Text=a CharacterOffsetBegin=12 CharacterOffsetEnd=13 PartOfSpeech=DT] +[Text=simple CharacterOffsetBegin=14 CharacterOffsetEnd=20 PartOfSpeech=JJ] +[Text=sentence CharacterOffsetBegin=21 CharacterOffsetEnd=29 PartOfSpeech=NN] +[Text=that CharacterOffsetBegin=30 CharacterOffsetEnd=34 PartOfSpeech=WDT] +[Text=he CharacterOffsetBegin=35 CharacterOffsetEnd=37 PartOfSpeech=PRP] +[Text=parsed CharacterOffsetBegin=38 CharacterOffsetEnd=44 PartOfSpeech=VBD] +[Text=with CharacterOffsetBegin=45 CharacterOffsetEnd=49 PartOfSpeech=IN] +[Text=Stanford CharacterOffsetBegin=50 CharacterOffsetEnd=58 PartOfSpeech=NNP] +[Text=CoreNLP CharacterOffsetBegin=59 CharacterOffsetEnd=66 PartOfSpeech=NNP] +[Text=. CharacterOffsetBegin=66 CharacterOffsetEnd=67 PartOfSpeech=.] +""".strip() + +def run_webserver(port, timeout_secs): + class HTTPTimeoutHandler(BaseHTTPRequestHandler): + def do_POST(self): + time.sleep(timeout_secs) + self.send_response(200) + self.send_header('Content-type', 'text/plain; charset=utf-8') + self.end_headers() + self.wfile.write("HTTPMockServerTimeout") + + HTTPServer(('127.0.0.1', port), HTTPTimeoutHandler).serve_forever() + +class HTTPMockServerTimeoutContext: + """ For launching an HTTP server on certain port with an specified delay at responses """ + def __init__(self, port, timeout_secs): + self.port = port + self.timeout_secs = timeout_secs + + def __enter__(self): + self.p = multiprocessing.Process(target=run_webserver, args=(self.port, self.timeout_secs)) + self.p.daemon = True + self.p.start() + + def __exit__(self, exc_type, exc_value, exc_traceback): + self.p.terminate() + +class TestCoreNLPClient: + @pytest.fixture(scope="class") + def corenlp_client(self): + """ Client to run tests on """ + client = corenlp.CoreNLPClient(annotators='tokenize,ssplit,pos,lemma,ner,depparse', + server_id='stanza_main_test_server') + yield client + client.stop() + + + def test_connect(self, corenlp_client): + corenlp_client.ensure_alive() + assert corenlp_client.is_active + assert corenlp_client.is_alive() + + + def test_context_manager(self): + with corenlp.CoreNLPClient(annotators="tokenize,ssplit", + endpoint="http://localhost:9001") as context_client: + ann = context_client.annotate(TEXT) + assert corenlp.to_text(ann.sentence[0]) == TEXT[:-1] + + def test_no_duplicate_servers(self): + """We expect a second server on the same port to fail""" + with pytest.raises(corenlp.PermanentlyFailedException): + with corenlp.CoreNLPClient(annotators="tokenize,ssplit") as duplicate_server: + raise RuntimeError("This should have failed") + + def test_annotate(self, corenlp_client): + ann = corenlp_client.annotate(TEXT) + assert corenlp.to_text(ann.sentence[0]) == TEXT[:-1] + + + def test_update(self, corenlp_client): + ann = 
corenlp_client.annotate(TEXT) + ann = corenlp_client.update(ann) + assert corenlp.to_text(ann.sentence[0]) == TEXT[:-1] + + + def test_tokensregex(self, corenlp_client): + pattern = '([ner: PERSON]+) /wrote/ /an?/ []{0,3} /sentence|article/' + matches = corenlp_client.tokensregex(TEXT, pattern) + assert len(matches["sentences"]) == 1 + assert matches["sentences"][0]["length"] == 1 + assert matches == { + "sentences": [{ + "0": { + "text": "Chris wrote a simple sentence", + "begin": 0, + "end": 5, + "1": { + "text": "Chris", + "begin": 0, + "end": 1 + }}, + "length": 1 + },]} + + + def test_semgrex(self, corenlp_client): + pattern = '{word:wrote} >nsubj {}=subject >obj {}=object' + matches = corenlp_client.semgrex(TEXT, pattern, to_words=True) + assert matches == [ + { + "text": "wrote", + "begin": 1, + "end": 2, + "$subject": { + "text": "Chris", + "begin": 0, + "end": 1 + }, + "$object": { + "text": "sentence", + "begin": 4, + "end": 5 + }, + "sentence": 0,}] + + def test_tregex(self, corenlp_client): + # the PP should be easy to parse + pattern = 'PP < NP' + matches = corenlp_client.tregex(TEXT, pattern) + print(matches) + assert matches == { + 'sentences': [ + {'0': {'sentIndex': 0, 'characterOffsetBegin': 45, 'codepointOffsetBegin': 45, 'characterOffsetEnd': 66, 'codepointOffsetEnd': 66, + 'match': '(PP (IN with)\n (NP (NNP Stanford) (NNP CoreNLP)))\n', + 'spanString': 'with Stanford CoreNLP', 'namedNodes': []}} + ] + } + + def test_tregex_trees(self, corenlp_client): + """ + Test the results of tregex run on trees w/o parsing + """ + trees = tree_reader.read_trees("(ROOT (S (NP (NNP Jennifer)) (VP (VBZ has) (NP (JJ blue) (NN skin))))) (ROOT (S (NP (PRP I)) (VP (VBP like) (NP (PRP$ her) (NNS antennae)))))") + pattern = "VP < NP" + matches = corenlp_client.tregex(pattern=pattern, trees=trees) + assert matches == { + 'sentences': [ + {'0': {'sentIndex': 0, 'match': '(VP (VBZ has)\n (NP (JJ blue) (NN skin)))\n', 'spanString': 'has blue skin', 'namedNodes': []}}, + {'0': {'sentIndex': 1, 'match': '(VP (VBP like)\n (NP (PRP$ her) (NNS antennae)))\n', 'spanString': 'like her antennae', 'namedNodes': []}} + ] + } + + @pytest.fixture + def external_server_9001(self): + corenlp_home = client.resolve_classpath(None) + start_cmd = f'java -Xmx5g -cp "{corenlp_home}" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9001 ' \ + f'-timeout 60000 -server_id stanza_external_server -serverProperties {SERVER_TEST_PROPS}' + start_cmd = start_cmd and shlex.split(start_cmd) + external_server_process = subprocess.Popen(start_cmd) + + yield external_server_process + + assert external_server_process + external_server_process.terminate() + external_server_process.wait(5) + + def test_external_server_legacy_start_server(self, external_server_9001): + """ Test starting up an external server and accessing with a client with start_server=False """ + with corenlp.CoreNLPClient(start_server=False, endpoint="http://localhost:9001") as external_server_client: + ann = external_server_client.annotate(TEXT, annotators='tokenize,ssplit,pos', output_format='text') + assert ann.strip() == EN_GOLD + + def test_external_server_available(self, external_server_9001): + """ Test starting up an external available server and accessing with a client with start_server=StartServer.DONT_START """ + time.sleep(5) # wait and make sure the external CoreNLP server is up and running + with corenlp.CoreNLPClient(start_server=corenlp.StartServer.DONT_START, endpoint="http://localhost:9001") as external_server_client: + ann = 
external_server_client.annotate(TEXT, annotators='tokenize,ssplit,pos', output_format='text') + assert ann.strip() == EN_GOLD + + def test_external_server_unavailable(self): + """ Test accessing with a client with start_server=StartServer.DONT_START to an external unavailable server """ + with pytest.raises(corenlp.AnnotationException): + with corenlp.CoreNLPClient(start_server=corenlp.StartServer.DONT_START, endpoint="http://localhost:9001") as external_server_client: + ann = external_server_client.annotate(TEXT, annotators='tokenize,ssplit,pos', output_format='text') + + def test_external_server_timeout(self): + """ Test starting up an external server with long response time (20 seconds) and accessing with a client with start_server=StartServer.DONT_START and timeout=5000""" + with HTTPMockServerTimeoutContext(9001, 20): + time.sleep(5) # wait and make sure the external HTTPMockServer server is up and running + with pytest.raises(corenlp.TimeoutException): + with corenlp.CoreNLPClient(start_server=corenlp.StartServer.DONT_START, endpoint="http://localhost:9001", timeout=5000) as external_server_client: + ann = external_server_client.annotate(TEXT, annotators='tokenize,ssplit,pos', output_format='text') + + def test_external_server_try_start_with_external(self, external_server_9001): + """ Test starting up an external server and accessing with a client with start_server=StartServer.TRY_START """ + time.sleep(5) # wait and make sure the external CoreNLP server is up and running + with corenlp.CoreNLPClient(start_server=corenlp.StartServer.TRY_START, + annotators='tokenize,ssplit,pos', + endpoint="http://localhost:9001") as external_server_client: + ann = external_server_client.annotate(TEXT, annotators='tokenize,ssplit,pos', output_format='text') + assert external_server_client.server is None, "If this is not None, that indicates the client started a server instead of reusing an existing one" + assert ann.strip() == EN_GOLD + + def test_external_server_try_start(self): + """ Test starting up a server with a client with start_server=StartServer.TRY_START """ + with corenlp.CoreNLPClient(start_server=corenlp.StartServer.TRY_START, + annotators='tokenize,ssplit,pos', + endpoint="http://localhost:9001") as external_server_client: + ann = external_server_client.annotate(TEXT, annotators='tokenize,ssplit,pos', output_format='text') + assert ann.strip() == EN_GOLD + + def test_external_server_force_start(self, external_server_9001): + """ Test starting up an external server and accessing with a client with start_server=StartServer.FORCE_START """ + time.sleep(5) # wait and make sure the external CoreNLP server is up and running + with pytest.raises(corenlp.PermanentlyFailedException): + with corenlp.CoreNLPClient(start_server=corenlp.StartServer.FORCE_START, endpoint="http://localhost:9001") as external_server_client: + ann = external_server_client.annotate(TEXT, annotators='tokenize,ssplit,pos', output_format='text') diff --git a/stanza/stanza/tests/server/test_java_protobuf_requests.py b/stanza/stanza/tests/server/test_java_protobuf_requests.py new file mode 100644 index 0000000000000000000000000000000000000000..e7beceda482044e026738b4bb8206e1efd1e2a2a --- /dev/null +++ b/stanza/stanza/tests/server/test_java_protobuf_requests.py @@ -0,0 +1,93 @@ +import tempfile + +import pytest + +from stanza.models.common.utils import misc_to_space_after, space_after_to_misc +from stanza.models.constituency import tree_reader +from stanza.server import java_protobuf_requests +from stanza.tests import * +from 
stanza.utils.conll import CoNLL
+from stanza.protobuf import DependencyGraph
+
+pytestmark = [pytest.mark.travis, pytest.mark.pipeline]
+
+def check_tree(proto_tree, py_tree, py_score):
+    tree, tree_score = java_protobuf_requests.from_tree(proto_tree)
+    assert tree_score == py_score
+    assert tree == py_tree
+
+def test_build_tree():
+    text="((S (VP (VB Unban)) (NP (NNP Mox) (NNP Opal))))\n( (SBARQ (WHNP (WP Who)) (SQ (VP (VBZ sits) (PP (IN in) (NP (DT this) (NN seat))))) (. ?)))"
+    trees = tree_reader.read_trees(text)
+    assert len(trees) == 2
+
+    for tree in trees:
+        # round-trip each tree through the protobuf representation and check it survives intact
+        proto_tree = java_protobuf_requests.build_tree(tree, 1.0)
+        check_tree(proto_tree, tree, 1.0)
+
+
+ESTONIAN_EMPTY_DEPS = """
+# sent_id = ewtb2_000035_15
+# text = Ja paari aasta pärast rôômalt maasikatele ...
+1 Ja ja CCONJ J _ 3 cc 5.1:cc _
+2 paari paar NUM N Case=Gen|Number=Sing|NumForm=Word|NumType=Card 3 nummod 3:nummod _
+3 aasta aasta NOUN S Case=Gen|Number=Sing 0 root 5.1:obl _
+4 pärast pärast ADP K AdpType=Post 3 case 3:case _
+5 rôômalt rõõmsalt ADV D Typo=Yes 3 advmod 5.1:advmod Orphan=Yes|CorrectForm=rõõmsalt
+5.1 panna panema VERB V VerbForm=Inf _ _ 0:root Empty=5.1
+6 maasikatele maasikas NOUN S Case=All|Number=Plur 3 obl 5.1:obl Orphan=Yes
+7 ... ... PUNCT Z _ 3 punct 5.1:punct _
+""".strip()
+
+
+def test_convert_networkx_graph():
+    doc = CoNLL.conll2doc(input_str=ESTONIAN_EMPTY_DEPS, ignore_gapping=False)
+    deps = doc.sentences[0]._enhanced_dependencies
+
+    graph = DependencyGraph()
+    java_protobuf_requests.convert_networkx_graph(graph, doc.sentences[0], 0)
+    assert len(graph.rootNode) == 1
+    assert graph.rootNode[0] == 0
+    nodes = sorted([(x.index, x.emptyIndex) for x in graph.node])
+    expected_nodes = [(1,0), (2,0), (3,0), (4,0), (5,0), (5,1), (6,0), (7,0)]
+    assert nodes == expected_nodes
+
+    edges = [(x.target, x.dep) for x in graph.edge if x.source == 5 and x.sourceEmpty == 1]
+    edges = sorted(edges)
+    expected_edges = [(1, 'cc'), (3, 'obl'), (5, 'advmod'), (6, 'obl'), (7, 'punct')]
+    assert edges == expected_edges
+
+ENGLISH_NBSP_SAMPLE="""
+# sent_id = newsgroup-groups.google.com_n3td3v_e874a1e5eb995654_ENG_20060120_052200-0011
+# text = Please note that neither the e-mail address nor name of the sender have been verified.
+1 Please please INTJ UH _ 2 discourse _ _
+2 note note VERB VB Mood=Imp|VerbForm=Fin 0 root _ _
+3 that that SCONJ IN _ 15 mark _ _
+4 neither neither CCONJ CC _ 7 cc:preconj _ _
+5 the the DET DT Definite=Def|PronType=Art 7 det _ _
+6 e-mail e-mail NOUN NN Number=Sing 7 compound _ _
+7 address address NOUN NN Number=Sing 15 nsubj:pass _ _
+8 nor nor CCONJ CC _ 9 cc _ _
+9 name name NOUN NN Number=Sing 7 conj _ _
+10 of of ADP IN _ 12 case _ _
+11 the the DET DT Definite=Def|PronType=Art 12 det _ _
+12 sender sender NOUN NN Number=Sing 7 nmod _ _
+13 have have AUX VBP Mood=Ind|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin 15 aux _ SpacesAfter=\\u00A0
+14 been be AUX VBN Tense=Past|VerbForm=Part 15 aux:pass _ _
+15 verified verify VERB VBN Tense=Past|VerbForm=Part|Voice=Pass 2 ccomp _ SpaceAfter=No
+16 . . PUNCT . _ 2 punct _ _
+""".strip()
+
+def test_nbsp_doc():
+    """
+    Test that the space conversion methods will convert to and from NBSP
+    """
+    doc = CoNLL.conll2doc(input_str=ENGLISH_NBSP_SAMPLE)
+
+    assert doc.sentences[0].text == "Please note that neither the e-mail address nor name of the sender have been verified."
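+    # tokens[12] (0-based) is "have", whose MISC column in the sample above carries SpacesAfter=\u00A0,
+    # so its spaces_after value should come back as a single non-breaking space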
+ assert doc.sentences[0].tokens[12].spaces_after == " " + assert misc_to_space_after("SpacesAfter=\\u00A0") == ' ' + assert space_after_to_misc(' ') == "SpacesAfter=\\u00A0" + + conllu = "{:C}".format(doc) + assert conllu == ENGLISH_NBSP_SAMPLE diff --git a/stanza/stanza/tests/server/test_morphology.py b/stanza/stanza/tests/server/test_morphology.py new file mode 100644 index 0000000000000000000000000000000000000000..74e77ca12e3d023d2e8d353521ee1fe310167aed --- /dev/null +++ b/stanza/stanza/tests/server/test_morphology.py @@ -0,0 +1,23 @@ +""" +Test the most basic functionality of the morphology script +""" + +import pytest + +from stanza.server.morphology import Morphology, process_text + +words = ["Jennifer", "has", "the", "prettiest", "antennae"] +tags = ["NNP", "VBZ", "DT", "JJS", "NNS"] +expected = ["Jennifer", "have", "the", "pretty", "antenna"] + +def test_process_text(): + result = process_text(words, tags) + lemma = [x.lemma for x in result.words] + print(lemma) + assert lemma == expected + +def test_basic_morphology(): + with Morphology() as morph: + result = morph.process(words, tags) + lemma = [x.lemma for x in result.words] + assert lemma == expected diff --git a/stanza/stanza/tests/server/test_parser_eval.py b/stanza/stanza/tests/server/test_parser_eval.py new file mode 100644 index 0000000000000000000000000000000000000000..0ed68269539cf73cbde81e3a651cb4b7098fd7eb --- /dev/null +++ b/stanza/stanza/tests/server/test_parser_eval.py @@ -0,0 +1,60 @@ +""" +Test the parser eval interface +""" + +import pytest +import stanza +from stanza.models.constituency import tree_reader +from stanza.protobuf import EvaluateParserRequest, EvaluateParserResponse +from stanza.server.parser_eval import build_request, collate, EvaluateParser, ParseResult +from stanza.tests.server.test_java_protobuf_requests import check_tree + +from stanza.tests import * + +pytestmark = [pytest.mark.travis, pytest.mark.client] + +def build_one_tree_treebank(fake_scores=True): + text = "((S (VP (VB Unban)) (NP (NNP Mox) (NNP Opal))))" + trees = tree_reader.read_trees(text) + assert len(trees) == 1 + gold = trees[0] + if fake_scores: + prediction = (gold, 1.0) + treebank = [ParseResult(gold, [prediction], None, None)] + return treebank + else: + prediction = gold + return collate([gold], [prediction]) + +def check_build(fake_scores=True): + treebank = build_one_tree_treebank(fake_scores) + request = build_request(treebank) + + assert len(request.treebank) == 1 + check_tree(request.treebank[0].gold, treebank[0][0], None) + assert len(request.treebank[0].predicted) == 1 + if fake_scores: + check_tree(request.treebank[0].predicted[0], treebank[0][1][0][0], treebank[0][1][0][1]) + else: + check_tree(request.treebank[0].predicted[0], treebank[0][1][0], None) + + +def test_build_tuple_request(): + check_build(True) + +def test_build_notuple_request(): + check_build(False) + +def test_score_one_tree_tuples(): + treebank = build_one_tree_treebank(True) + + with EvaluateParser() as ep: + response = ep.process(treebank) + assert response.f1 == pytest.approx(1.0) + +def test_score_one_tree_notuples(): + treebank = build_one_tree_treebank(False) + + with EvaluateParser() as ep: + response = ep.process(treebank) + assert response.f1 == pytest.approx(1.0) diff --git a/stanza/stanza/tests/server/test_semgrex.py b/stanza/stanza/tests/server/test_semgrex.py new file mode 100644 index 0000000000000000000000000000000000000000..4e6c68cc890ba83bf2d0189d9c3b489e848e096c --- /dev/null +++ b/stanza/stanza/tests/server/test_semgrex.py @@ 
-0,0 +1,282 @@ +""" +Test the semgrex interface +""" + +import pytest +import stanza +import stanza.server.semgrex as semgrex +from stanza.models.common.doc import Document +from stanza.protobuf import SemgrexRequest +from stanza.utils.conll import CoNLL + +from stanza.tests import * + +pytestmark = [pytest.mark.travis, pytest.mark.client] + +TEST_ONE_SENTENCE = [[ + { + "id": 1, + "text": "Unban", + "lemma": "unban", + "upos": "VERB", + "xpos": "VB", + "feats": "Mood=Imp|VerbForm=Fin", + "head": 0, + "deprel": "root", + "misc": "start_char=0|end_char=5" + }, + { + "id": 2, + "text": "Mox", + "lemma": "Mox", + "upos": "PROPN", + "xpos": "NNP", + "feats": "Number=Sing", + "head": 3, + "deprel": "compound", + "misc": "start_char=6|end_char=9" + }, + { + "id": 3, + "text": "Opal", + "lemma": "Opal", + "upos": "PROPN", + "xpos": "NNP", + "feats": "Number=Sing", + "head": 1, + "deprel": "obj", + "misc": "start_char=10|end_char=14", + "ner": "GEM" + }, + { + "id": 4, + "text": "!", + "lemma": "!", + "upos": "PUNCT", + "xpos": ".", + "head": 1, + "deprel": "punct", + "misc": "start_char=14|end_char=15" + }]] + +TEST_TWO_SENTENCES = [[ + { + "id": 1, + "text": "Unban", + "lemma": "unban", + "upos": "VERB", + "xpos": "VB", + "feats": "Mood=Imp|VerbForm=Fin", + "head": 0, + "deprel": "root", + "misc": "start_char=0|end_char=5" + }, + { + "id": 2, + "text": "Mox", + "lemma": "Mox", + "upos": "PROPN", + "xpos": "NNP", + "feats": "Number=Sing", + "head": 3, + "deprel": "compound", + "misc": "start_char=6|end_char=9" + }, + { + "id": 3, + "text": "Opal", + "lemma": "Opal", + "upos": "PROPN", + "xpos": "NNP", + "feats": "Number=Sing", + "head": 1, + "deprel": "obj", + "misc": "start_char=10|end_char=14" + }, + { + "id": 4, + "text": "!", + "lemma": "!", + "upos": "PUNCT", + "xpos": ".", + "head": 1, + "deprel": "punct", + "misc": "start_char=14|end_char=15" + }], + [{ + "id": 1, + "text": "Unban", + "lemma": "unban", + "upos": "VERB", + "xpos": "VB", + "feats": "Mood=Imp|VerbForm=Fin", + "head": 0, + "deprel": "root", + "misc": "start_char=16|end_char=21" + }, + { + "id": 2, + "text": "Mox", + "lemma": "Mox", + "upos": "PROPN", + "xpos": "NNP", + "feats": "Number=Sing", + "head": 3, + "deprel": "compound", + "misc": "start_char=22|end_char=25" + }, + { + "id": 3, + "text": "Opal", + "lemma": "Opal", + "upos": "PROPN", + "xpos": "NNP", + "feats": "Number=Sing", + "head": 1, + "deprel": "obj", + "misc": "start_char=26|end_char=30" + }, + { + "id": 4, + "text": "!", + "lemma": "!", + "upos": "PUNCT", + "xpos": ".", + "head": 1, + "deprel": "punct", + "misc": "start_char=30|end_char=31" + }]] + +ONE_SENTENCE_DOC = Document(TEST_ONE_SENTENCE, "Unban Mox Opal!") +TWO_SENTENCE_DOC = Document(TEST_TWO_SENTENCES, "Unban Mox Opal! 
Unban Mox Opal!") + + +def check_response(response, response_len=1, semgrex_len=1, source_index=1, target_index=3, reln='obj'): + assert len(response.result) == response_len + assert len(response.result[0].result) == semgrex_len + for semgrex_result in response.result[0].result: + assert len(semgrex_result.match) == 1 + assert semgrex_result.match[0].matchIndex == source_index + for match in semgrex_result.match: + assert len(match.node) == 2 + assert match.node[0].name == 'source' + assert match.node[0].matchIndex == source_index + assert match.node[1].name == 'target' + assert match.node[1].matchIndex == target_index + assert len(match.reln) == 1 + assert match.reln[0].name == 'zzz' + assert match.reln[0].reln == reln + +def test_multi(): + with semgrex.Semgrex() as sem: + response = sem.process(ONE_SENTENCE_DOC, "{}=source >obj=zzz {}=target") + check_response(response) + response = sem.process(ONE_SENTENCE_DOC, "{}=source >obj=zzz {}=target") + check_response(response) + response = sem.process(TWO_SENTENCE_DOC, "{}=source >obj=zzz {}=target") + check_response(response, response_len=2) + +def test_single_sentence(): + response = semgrex.process_doc(ONE_SENTENCE_DOC, "{}=source >obj=zzz {}=target") + check_response(response) + +def test_two_semgrex(): + response = semgrex.process_doc(ONE_SENTENCE_DOC, "{}=source >obj=zzz {}=target", "{}=source >obj=zzz {}=target") + check_response(response, semgrex_len=2) + +def test_two_sentences(): + response = semgrex.process_doc(TWO_SENTENCE_DOC, "{}=source >obj=zzz {}=target") + check_response(response, response_len=2) + +def test_word_attribute(): + response = semgrex.process_doc(ONE_SENTENCE_DOC, "{word:Mox}=source <=zzz {word:Opal}=target") + check_response(response, response_len=1, source_index=2, reln='compound') + +def test_lemma_attribute(): + response = semgrex.process_doc(ONE_SENTENCE_DOC, "{lemma:Mox}=source <=zzz {lemma:Opal}=target") + check_response(response, response_len=1, source_index=2, reln='compound') + +def test_xpos_attribute(): + response = semgrex.process_doc(ONE_SENTENCE_DOC, "{tag:NNP}=source <=zzz {word:Opal}=target") + check_response(response, response_len=1, source_index=2, reln='compound') + response = semgrex.process_doc(ONE_SENTENCE_DOC, "{pos:NNP}=source <=zzz {word:Opal}=target") + check_response(response, response_len=1, source_index=2, reln='compound') + +def test_upos_attribute(): + response = semgrex.process_doc(ONE_SENTENCE_DOC, "{cpos:PROPN}=source <=zzz {word:Opal}=target") + check_response(response, response_len=1, source_index=2, reln='compound') + +def test_ner_attribute(): + response = semgrex.process_doc(ONE_SENTENCE_DOC, "{cpos:PROPN}=source <=zzz {ner:GEM}=target") + check_response(response, response_len=1, source_index=2, reln='compound') + +def test_hand_built_request(): + """ + Essentially a test program: the result should be a response with + one match, two named nodes, one named relation + """ + request = SemgrexRequest() + request.semgrex.append("{}=source >obj=zzz {}=target") + query = request.query.add() + + for idx, word in enumerate(['Unban', 'Mox', 'Opal']): + token = query.token.add() + token.word = word + token.value = word + + node = query.graph.node.add() + node.sentenceIndex = 1 + node.index = idx+1 + + edge = query.graph.edge.add() + edge.source = 1 + edge.target = 3 + edge.dep = 'obj' + + edge = query.graph.edge.add() + edge.source = 3 + edge.target = 2 + edge.dep = 'compound' + + response = semgrex.send_semgrex_request(request) + check_response(response) + +BLANK_DEPENDENCY_SENTENCE = 
""" +# sent_id = weblog-juancole.com_juancole_20051126063000_ENG_20051126_063000-0007 +# text = You wonder if he was manipulating the market with his bombing targets. +1 You you PRON PRP Case=Nom|Person=2|PronType=Prs 2 nsubj _ _ +2 wonder wonder VERB VBP Mood=Ind|Number=Sing|Person=2|Tense=Pres|VerbForm=Fin 1 _ _ _ +3 if if SCONJ IN _ 6 mark _ _ +4 he he PRON PRP Case=Nom|Gender=Masc|Number=Sing|Person=3|PronType=Prs 6 nsubj _ _ +5 was be AUX VBD Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin 6 aux _ _ +6 manipulating manipulate VERB VBG Tense=Pres|VerbForm=Part 2 ccomp _ _ +7 the the DET DT Definite=Def|PronType=Art 8 det _ _ +8 market market NOUN NN Number=Sing 6 obj _ _ +9 with with ADP IN _ 12 case _ _ +10 his his PRON PRP$ Case=Gen|Gender=Masc|Number=Sing|Person=3|Poss=Yes|PronType=Prs 12 nmod:poss _ _ +11 bombing bombing NOUN NN Number=Sing 12 compound _ _ +12 targets target NOUN NNS Number=Plur 6 obl _ SpaceAfter=No +13 . . PUNCT . _ 2 punct _ _ +""".lstrip() + + +def test_blank_dependency(): + """ + A user / contributor sent a dependency file with blank dependency labels and twisted up roots + """ + blank_dep_doc = CoNLL.conll2doc(input_str=BLANK_DEPENDENCY_SENTENCE) + blank_dep_request = semgrex.build_request(blank_dep_doc, "{}=root <_=edge {}") + response = semgrex.send_semgrex_request(blank_dep_request) + assert len(response.result) == 1 + assert len(response.result[0].result) == 1 + assert len(response.result[0].result[0].match) == 1 + # there should be a named node... + assert len(response.result[0].result[0].match[0].node) == 1 + assert response.result[0].result[0].match[0].node[0].name == 'root' + assert response.result[0].result[0].match[0].node[0].matchIndex == 2 + + # ... and a named edge + assert len(response.result[0].result[0].match[0].edge) == 1 + assert response.result[0].result[0].match[0].edge[0].source == 1 + assert response.result[0].result[0].match[0].edge[0].target == 2 + assert response.result[0].result[0].match[0].edge[0].reln == "_" diff --git a/stanza/stanza/tests/server/test_server_pretokenized.py b/stanza/stanza/tests/server/test_server_pretokenized.py new file mode 100644 index 0000000000000000000000000000000000000000..9ee971dd4f24815ee5f2df6192d0d0f1ee251fb2 --- /dev/null +++ b/stanza/stanza/tests/server/test_server_pretokenized.py @@ -0,0 +1,76 @@ +""" +Misc tests for the server +""" + +import pytest +import re + +from stanza.server import CoreNLPClient + +pytestmark = pytest.mark.client + +tokens = {} +tags = {} + +# Italian examples +tokens["italian"] = [ + "È vero , tutti possiamo essere sostituiti .\n Alcune chiamate partirono da il Quirinale ." +] +tags["italian"] = [ + [ + ["AUX", "ADJ", "PUNCT", "PRON", "AUX", "AUX", "VERB", "PUNCT"], + ["DET", "NOUN", "VERB", "ADP", "DET", "PROPN", "PUNCT"], + ], +] + + +# French examples +tokens["french"] = [ + ( + "Les études durent six ans mais leur contenu diffère donc selon les Facultés .\n" + "Il est fêté le 22 mai ." 
+ ) +] +tags["french"] = [ + [ + ["DET", "NOUN", "VERB", "NUM", "NOUN", "CCONJ", "DET", "NOUN", "VERB", "ADV", "ADP", "DET", "PROPN", "PUNCT"], + ["PRON", "AUX", "VERB", "DET", "NUM", "NOUN", "PUNCT"] + ], +] + + +# English examples +tokens["english"] = ["This shouldn't be split .\n I hope it's not ."] +tags["english"] = [ + [ + ["DT", "NN", "VB", "VBN", "."], + ["PRP", "VBP", "PRP$", "RB", "."], + ], +] + + +def pretokenized_test(lang): + """Test submitting pretokenized French text.""" + with CoreNLPClient( + properties=lang, + annotators="pos", + pretokenized=True, + be_quiet=True, + ) as client: + for input_text, gold_tags in zip(tokens[lang], tags[lang]): + ann = client.annotate(input_text) + for sentence_tags, sentence in zip(gold_tags, ann.sentence): + result_tags = [tok.pos for tok in sentence.token] + assert sentence_tags == result_tags + + +def test_english_pretokenized(): + pretokenized_test("english") + + +def test_italian_pretokenized(): + pretokenized_test("italian") + + +def test_french_pretokenized(): + pretokenized_test("french") diff --git a/stanza/stanza/tests/server/test_server_request.py b/stanza/stanza/tests/server/test_server_request.py new file mode 100644 index 0000000000000000000000000000000000000000..1c3244f03d186405b521b9d3ad994f9bbff3b51f --- /dev/null +++ b/stanza/stanza/tests/server/test_server_request.py @@ -0,0 +1,223 @@ +""" +Tests for setting request properties of servers +""" + +import json +import pytest +import stanza.server as corenlp + +from stanza.protobuf import Document +from stanza.tests import TEST_WORKING_DIR, compare_ignoring_whitespace + +pytestmark = pytest.mark.client + +EN_DOC = "Joe Smith lives in California." + +# results with an example properties file +EN_DOC_GOLD = """ +Sentence #1 (6 tokens): +Joe Smith lives in California. + +Tokens: +[Text=Joe CharacterOffsetBegin=0 CharacterOffsetEnd=3 PartOfSpeech=NNP] +[Text=Smith CharacterOffsetBegin=4 CharacterOffsetEnd=9 PartOfSpeech=NNP] +[Text=lives CharacterOffsetBegin=10 CharacterOffsetEnd=15 PartOfSpeech=VBZ] +[Text=in CharacterOffsetBegin=16 CharacterOffsetEnd=18 PartOfSpeech=IN] +[Text=California CharacterOffsetBegin=19 CharacterOffsetEnd=29 PartOfSpeech=NNP] +[Text=. CharacterOffsetBegin=29 CharacterOffsetEnd=30 PartOfSpeech=.] +""" + +GERMAN_DOC = "Angela Merkel ist seit 2005 Bundeskanzlerin der Bundesrepublik Deutschland." + +GERMAN_DOC_GOLD = """ +Sentence #1 (10 tokens): +Angela Merkel ist seit 2005 Bundeskanzlerin der Bundesrepublik Deutschland. + +Tokens: +[Text=Angela CharacterOffsetBegin=0 CharacterOffsetEnd=6 PartOfSpeech=PROPN] +[Text=Merkel CharacterOffsetBegin=7 CharacterOffsetEnd=13 PartOfSpeech=PROPN] +[Text=ist CharacterOffsetBegin=14 CharacterOffsetEnd=17 PartOfSpeech=AUX] +[Text=seit CharacterOffsetBegin=18 CharacterOffsetEnd=22 PartOfSpeech=ADP] +[Text=2005 CharacterOffsetBegin=23 CharacterOffsetEnd=27 PartOfSpeech=NUM] +[Text=Bundeskanzlerin CharacterOffsetBegin=28 CharacterOffsetEnd=43 PartOfSpeech=NOUN] +[Text=der CharacterOffsetBegin=44 CharacterOffsetEnd=47 PartOfSpeech=DET] +[Text=Bundesrepublik CharacterOffsetBegin=48 CharacterOffsetEnd=62 PartOfSpeech=PROPN] +[Text=Deutschland CharacterOffsetBegin=63 CharacterOffsetEnd=74 PartOfSpeech=PROPN] +[Text=. 
CharacterOffsetBegin=74 CharacterOffsetEnd=75 PartOfSpeech=PUNCT] +""" + +FRENCH_CUSTOM_PROPS = {'annotators': 'tokenize,ssplit,mwt,pos,parse', + 'tokenize.language': 'fr', + 'pos.model': 'edu/stanford/nlp/models/pos-tagger/french-ud.tagger', + 'parse.model': 'edu/stanford/nlp/models/srparser/frenchSR.ser.gz', + 'mwt.mappingFile': 'edu/stanford/nlp/models/mwt/french/french-mwt.tsv', + 'mwt.pos.model': 'edu/stanford/nlp/models/mwt/french/french-mwt.tagger', + 'mwt.statisticalMappingFile': 'edu/stanford/nlp/models/mwt/french/french-mwt-statistical.tsv', + 'mwt.preserveCasing': 'false', + 'outputFormat': 'text'} + +FRENCH_EXTRA_PROPS = {'annotators': 'tokenize,ssplit,mwt,pos,depparse', + 'tokenize.language': 'fr', + 'pos.model': 'edu/stanford/nlp/models/pos-tagger/french-ud.tagger', + 'mwt.mappingFile': 'edu/stanford/nlp/models/mwt/french/french-mwt.tsv', + 'mwt.pos.model': 'edu/stanford/nlp/models/mwt/french/french-mwt.tagger', + 'mwt.statisticalMappingFile': 'edu/stanford/nlp/models/mwt/french/french-mwt-statistical.tsv', + 'mwt.preserveCasing': 'false', + 'depparse.model': 'edu/stanford/nlp/models/parser/nndep/UD_French.gz'} + +FRENCH_DOC = "Cette enquête préliminaire fait suite aux révélations de l’hebdomadaire quelques jours plus tôt." + +FRENCH_CUSTOM_GOLD = """ +Sentence #1 (16 tokens): +Cette enquête préliminaire fait suite aux révélations de l’hebdomadaire quelques jours plus tôt. + +Tokens: +[Text=Cette CharacterOffsetBegin=0 CharacterOffsetEnd=5 PartOfSpeech=DET] +[Text=enquête CharacterOffsetBegin=6 CharacterOffsetEnd=13 PartOfSpeech=NOUN] +[Text=préliminaire CharacterOffsetBegin=14 CharacterOffsetEnd=26 PartOfSpeech=ADJ] +[Text=fait CharacterOffsetBegin=27 CharacterOffsetEnd=31 PartOfSpeech=VERB] +[Text=suite CharacterOffsetBegin=32 CharacterOffsetEnd=37 PartOfSpeech=NOUN] +[Text=à CharacterOffsetBegin=38 CharacterOffsetEnd=41 PartOfSpeech=ADP] +[Text=les CharacterOffsetBegin=38 CharacterOffsetEnd=41 PartOfSpeech=DET] +[Text=révélations CharacterOffsetBegin=42 CharacterOffsetEnd=53 PartOfSpeech=NOUN] +[Text=de CharacterOffsetBegin=54 CharacterOffsetEnd=56 PartOfSpeech=ADP] +[Text=l’ CharacterOffsetBegin=57 CharacterOffsetEnd=59 PartOfSpeech=NOUN] +[Text=hebdomadaire CharacterOffsetBegin=59 CharacterOffsetEnd=71 PartOfSpeech=ADJ] +[Text=quelques CharacterOffsetBegin=72 CharacterOffsetEnd=80 PartOfSpeech=DET] +[Text=jours CharacterOffsetBegin=81 CharacterOffsetEnd=86 PartOfSpeech=NOUN] +[Text=plus CharacterOffsetBegin=87 CharacterOffsetEnd=91 PartOfSpeech=ADV] +[Text=tôt CharacterOffsetBegin=92 CharacterOffsetEnd=95 PartOfSpeech=ADV] +[Text=. CharacterOffsetBegin=95 CharacterOffsetEnd=96 PartOfSpeech=PUNCT] + +Constituency parse: +(ROOT + (SENT + (NP (DET Cette) + (MWN (NOUN enquête) (ADJ préliminaire))) + (VN + (MWV (VERB fait) (NOUN suite))) + (PP (ADP à) + (NP (DET les) (NOUN révélations) + (PP (ADP de) + (NP (NOUN l’) + (AP (ADJ hebdomadaire)))))) + (NP (DET quelques) (NOUN jours)) + (AdP (ADV plus) (ADV tôt)) + (PUNCT .))) +""" + +FRENCH_EXTRA_GOLD = """ +Sentence #1 (16 tokens): +Cette enquête préliminaire fait suite aux révélations de l’hebdomadaire quelques jours plus tôt. 
+ +Tokens: +[Text=Cette CharacterOffsetBegin=0 CharacterOffsetEnd=5 PartOfSpeech=DET] +[Text=enquête CharacterOffsetBegin=6 CharacterOffsetEnd=13 PartOfSpeech=NOUN] +[Text=préliminaire CharacterOffsetBegin=14 CharacterOffsetEnd=26 PartOfSpeech=ADJ] +[Text=fait CharacterOffsetBegin=27 CharacterOffsetEnd=31 PartOfSpeech=VERB] +[Text=suite CharacterOffsetBegin=32 CharacterOffsetEnd=37 PartOfSpeech=NOUN] +[Text=à CharacterOffsetBegin=38 CharacterOffsetEnd=41 PartOfSpeech=ADP] +[Text=les CharacterOffsetBegin=38 CharacterOffsetEnd=41 PartOfSpeech=DET] +[Text=révélations CharacterOffsetBegin=42 CharacterOffsetEnd=53 PartOfSpeech=NOUN] +[Text=de CharacterOffsetBegin=54 CharacterOffsetEnd=56 PartOfSpeech=ADP] +[Text=l’ CharacterOffsetBegin=57 CharacterOffsetEnd=59 PartOfSpeech=NOUN] +[Text=hebdomadaire CharacterOffsetBegin=59 CharacterOffsetEnd=71 PartOfSpeech=ADJ] +[Text=quelques CharacterOffsetBegin=72 CharacterOffsetEnd=80 PartOfSpeech=DET] +[Text=jours CharacterOffsetBegin=81 CharacterOffsetEnd=86 PartOfSpeech=NOUN] +[Text=plus CharacterOffsetBegin=87 CharacterOffsetEnd=91 PartOfSpeech=ADV] +[Text=tôt CharacterOffsetBegin=92 CharacterOffsetEnd=95 PartOfSpeech=ADV] +[Text=. CharacterOffsetBegin=95 CharacterOffsetEnd=96 PartOfSpeech=PUNCT] + +Dependency Parse (enhanced plus plus dependencies): +root(ROOT-0, fait-4) +det(enquête-2, Cette-1) +nsubj(fait-4, enquête-2) +amod(enquête-2, préliminaire-3) +obj(fait-4, suite-5) +case(révélations-8, à-6) +det(révélations-8, les-7) +obl:à(fait-4, révélations-8) +case(l’-10, de-9) +nmod:de(révélations-8, l’-10) +amod(révélations-8, hebdomadaire-11) +det(jours-13, quelques-12) +obl(fait-4, jours-13) +advmod(tôt-15, plus-14) +advmod(jours-13, tôt-15) +punct(fait-4, .-16) +""" + +FRENCH_JSON_GOLD = json.loads(open(f'{TEST_WORKING_DIR}/out/example_french.json', encoding="utf-8").read()) + +ES_DOC = 'Andrés Manuel López Obrador es el presidente de México.' + +ES_PROPS = {'annotators': 'tokenize,ssplit,mwt,pos,depparse', 'tokenize.language': 'es', + 'pos.model': 'edu/stanford/nlp/models/pos-tagger/spanish-ud.tagger', + 'mwt.mappingFile': 'edu/stanford/nlp/models/mwt/spanish/spanish-mwt.tsv', + 'depparse.model': 'edu/stanford/nlp/models/parser/nndep/UD_Spanish.gz'} + +ES_PROPS_GOLD = """ +Sentence #1 (10 tokens): +Andrés Manuel López Obrador es el presidente de México. + +Tokens: +[Text=Andrés CharacterOffsetBegin=0 CharacterOffsetEnd=6 PartOfSpeech=PROPN] +[Text=Manuel CharacterOffsetBegin=7 CharacterOffsetEnd=13 PartOfSpeech=PROPN] +[Text=López CharacterOffsetBegin=14 CharacterOffsetEnd=19 PartOfSpeech=PROPN] +[Text=Obrador CharacterOffsetBegin=20 CharacterOffsetEnd=27 PartOfSpeech=PROPN] +[Text=es CharacterOffsetBegin=28 CharacterOffsetEnd=30 PartOfSpeech=AUX] +[Text=el CharacterOffsetBegin=31 CharacterOffsetEnd=33 PartOfSpeech=DET] +[Text=presidente CharacterOffsetBegin=34 CharacterOffsetEnd=44 PartOfSpeech=NOUN] +[Text=de CharacterOffsetBegin=45 CharacterOffsetEnd=47 PartOfSpeech=ADP] +[Text=México CharacterOffsetBegin=48 CharacterOffsetEnd=54 PartOfSpeech=PROPN] +[Text=. 
CharacterOffsetBegin=54 CharacterOffsetEnd=55 PartOfSpeech=PUNCT] + +Dependency Parse (enhanced plus plus dependencies): +root(ROOT-0, presidente-7) +nsubj(presidente-7, Andrés-1) +flat(Andrés-1, Manuel-2) +flat(Andrés-1, López-3) +flat(Andrés-1, Obrador-4) +cop(presidente-7, es-5) +det(presidente-7, el-6) +case(México-9, de-8) +nmod:de(presidente-7, México-9) +punct(presidente-7, .-10) +""" + +class TestServerRequest: + @pytest.fixture(scope="class") + def corenlp_client(self): + """ Client to run tests on """ + client = corenlp.CoreNLPClient(annotators='tokenize,ssplit,pos', server_id='stanza_request_tests_server') + yield client + client.stop() + + + def test_basic(self, corenlp_client): + """ Basic test of making a request, test default output format is a Document """ + ann = corenlp_client.annotate(EN_DOC, output_format="text") + compare_ignoring_whitespace(ann, EN_DOC_GOLD) + ann = corenlp_client.annotate(EN_DOC) + assert isinstance(ann, Document) + + + def test_python_dict(self, corenlp_client): + """ Test using a Python dictionary to specify all request properties """ + ann = corenlp_client.annotate(ES_DOC, properties=ES_PROPS, output_format="text") + compare_ignoring_whitespace(ann, ES_PROPS_GOLD) + ann = corenlp_client.annotate(FRENCH_DOC, properties=FRENCH_CUSTOM_PROPS) + compare_ignoring_whitespace(ann, FRENCH_CUSTOM_GOLD) + + + def test_lang_setting(self, corenlp_client): + """ Test using a Stanford CoreNLP supported languages as a properties key """ + ann = corenlp_client.annotate(GERMAN_DOC, properties="german", output_format="text") + compare_ignoring_whitespace(ann, GERMAN_DOC_GOLD) + + + def test_annotators_and_output_format(self, corenlp_client): + """ Test setting the annotators and output_format """ + ann = corenlp_client.annotate(FRENCH_DOC, properties=FRENCH_EXTRA_PROPS, + annotators="tokenize,ssplit,mwt,pos", output_format="json") + assert ann == FRENCH_JSON_GOLD diff --git a/stanza/stanza/tests/server/test_server_start.py b/stanza/stanza/tests/server/test_server_start.py new file mode 100644 index 0000000000000000000000000000000000000000..de50d488ae016ebc6d0b1609566f40913c494aaf --- /dev/null +++ b/stanza/stanza/tests/server/test_server_start.py @@ -0,0 +1,214 @@ +""" +Tests for starting a server in Python code +""" + +import pytest +import stanza.server as corenlp +from stanza.server.client import AnnotationException +import time + +from stanza.tests import * + +pytestmark = pytest.mark.client + +EN_DOC = "Joe Smith lives in California." + +# results on EN_DOC with standard StanfordCoreNLP defaults +EN_PRELOAD_GOLD = """ +Sentence #1 (6 tokens): +Joe Smith lives in California. + +Tokens: +[Text=Joe CharacterOffsetBegin=0 CharacterOffsetEnd=3 PartOfSpeech=NNP Lemma=Joe NamedEntityTag=PERSON] +[Text=Smith CharacterOffsetBegin=4 CharacterOffsetEnd=9 PartOfSpeech=NNP Lemma=Smith NamedEntityTag=PERSON] +[Text=lives CharacterOffsetBegin=10 CharacterOffsetEnd=15 PartOfSpeech=VBZ Lemma=live NamedEntityTag=O] +[Text=in CharacterOffsetBegin=16 CharacterOffsetEnd=18 PartOfSpeech=IN Lemma=in NamedEntityTag=O] +[Text=California CharacterOffsetBegin=19 CharacterOffsetEnd=29 PartOfSpeech=NNP Lemma=California NamedEntityTag=STATE_OR_PROVINCE] +[Text=. CharacterOffsetBegin=29 CharacterOffsetEnd=30 PartOfSpeech=. Lemma=. 
NamedEntityTag=O] + +Dependency Parse (enhanced plus plus dependencies): +root(ROOT-0, lives-3) +compound(Smith-2, Joe-1) +nsubj(lives-3, Smith-2) +case(California-5, in-4) +obl:in(lives-3, California-5) +punct(lives-3, .-6) + +Extracted the following NER entity mentions: +Joe Smith PERSON PERSON:0.9972202681743931 +California STATE_OR_PROVINCE LOCATION:0.9990868267559281 + +Extracted the following KBP triples: +1.0 Joe Smith per:statesorprovinces_of_residence California +""" + +# results with an example properties file +EN_PROPS_FILE_GOLD = """ +Sentence #1 (6 tokens): +Joe Smith lives in California. + +Tokens: +[Text=Joe CharacterOffsetBegin=0 CharacterOffsetEnd=3 PartOfSpeech=NNP] +[Text=Smith CharacterOffsetBegin=4 CharacterOffsetEnd=9 PartOfSpeech=NNP] +[Text=lives CharacterOffsetBegin=10 CharacterOffsetEnd=15 PartOfSpeech=VBZ] +[Text=in CharacterOffsetBegin=16 CharacterOffsetEnd=18 PartOfSpeech=IN] +[Text=California CharacterOffsetBegin=19 CharacterOffsetEnd=29 PartOfSpeech=NNP] +[Text=. CharacterOffsetBegin=29 CharacterOffsetEnd=30 PartOfSpeech=.] +""" + +GERMAN_DOC = "Angela Merkel ist seit 2005 Bundeskanzlerin der Bundesrepublik Deutschland." + +# results with standard German properties +GERMAN_FULL_PROPS_GOLD = """ +Sentence #1 (10 tokens): +Angela Merkel ist seit 2005 Bundeskanzlerin der Bundesrepublik Deutschland. + +Tokens: +[Text=Angela CharacterOffsetBegin=0 CharacterOffsetEnd=6 PartOfSpeech=PROPN Lemma=angela NamedEntityTag=PERSON] +[Text=Merkel CharacterOffsetBegin=7 CharacterOffsetEnd=13 PartOfSpeech=PROPN Lemma=merkel NamedEntityTag=PERSON] +[Text=ist CharacterOffsetBegin=14 CharacterOffsetEnd=17 PartOfSpeech=AUX Lemma=ist NamedEntityTag=O] +[Text=seit CharacterOffsetBegin=18 CharacterOffsetEnd=22 PartOfSpeech=ADP Lemma=seit NamedEntityTag=O] +[Text=2005 CharacterOffsetBegin=23 CharacterOffsetEnd=27 PartOfSpeech=NUM Lemma=2005 NamedEntityTag=O] +[Text=Bundeskanzlerin CharacterOffsetBegin=28 CharacterOffsetEnd=43 PartOfSpeech=NOUN Lemma=bundeskanzlerin NamedEntityTag=O] +[Text=der CharacterOffsetBegin=44 CharacterOffsetEnd=47 PartOfSpeech=DET Lemma=der NamedEntityTag=O] +[Text=Bundesrepublik CharacterOffsetBegin=48 CharacterOffsetEnd=62 PartOfSpeech=PROPN Lemma=bundesrepublik NamedEntityTag=LOCATION] +[Text=Deutschland CharacterOffsetBegin=63 CharacterOffsetEnd=74 PartOfSpeech=PROPN Lemma=deutschland NamedEntityTag=LOCATION] +[Text=. CharacterOffsetBegin=74 CharacterOffsetEnd=75 PartOfSpeech=PUNCT Lemma=. NamedEntityTag=O] + +Dependency Parse (enhanced plus plus dependencies): +root(ROOT-0, Bundeskanzlerin-6) +nsubj(Bundeskanzlerin-6, Angela-1) +flat(Angela-1, Merkel-2) +cop(Bundeskanzlerin-6, ist-3) +case(2005-5, seit-4) +nmod:seit(Bundeskanzlerin-6, 2005-5) +det(Bundesrepublik-8, der-7) +nmod(Bundeskanzlerin-6, Bundesrepublik-8) +appos(Bundesrepublik-8, Deutschland-9) +punct(Bundeskanzlerin-6, .-10) + +Extracted the following NER entity mentions: +Angela Merkel PERSON PERSON:0.9999981583351504 +Bundesrepublik Deutschland LOCATION LOCATION:0.9682902289749544 +""" + + +GERMAN_SMALL_PROPS = {'annotators': 'tokenize,ssplit,pos', 'tokenize.language': 'de', + 'pos.model': 'edu/stanford/nlp/models/pos-tagger/german-ud.tagger'} + +# results with custom Python dictionary set properties +GERMAN_SMALL_PROPS_GOLD = """ +Sentence #1 (10 tokens): +Angela Merkel ist seit 2005 Bundeskanzlerin der Bundesrepublik Deutschland. 
+ +Tokens: +[Text=Angela CharacterOffsetBegin=0 CharacterOffsetEnd=6 PartOfSpeech=PROPN] +[Text=Merkel CharacterOffsetBegin=7 CharacterOffsetEnd=13 PartOfSpeech=PROPN] +[Text=ist CharacterOffsetBegin=14 CharacterOffsetEnd=17 PartOfSpeech=AUX] +[Text=seit CharacterOffsetBegin=18 CharacterOffsetEnd=22 PartOfSpeech=ADP] +[Text=2005 CharacterOffsetBegin=23 CharacterOffsetEnd=27 PartOfSpeech=NUM] +[Text=Bundeskanzlerin CharacterOffsetBegin=28 CharacterOffsetEnd=43 PartOfSpeech=NOUN] +[Text=der CharacterOffsetBegin=44 CharacterOffsetEnd=47 PartOfSpeech=DET] +[Text=Bundesrepublik CharacterOffsetBegin=48 CharacterOffsetEnd=62 PartOfSpeech=PROPN] +[Text=Deutschland CharacterOffsetBegin=63 CharacterOffsetEnd=74 PartOfSpeech=PROPN] +[Text=. CharacterOffsetBegin=74 CharacterOffsetEnd=75 PartOfSpeech=PUNCT] +""" + +# results with custom Python dictionary set properties and annotators=tokenize,ssplit +GERMAN_SMALL_PROPS_W_ANNOTATORS_GOLD = """ +Sentence #1 (10 tokens): +Angela Merkel ist seit 2005 Bundeskanzlerin der Bundesrepublik Deutschland. + +Tokens: +[Text=Angela CharacterOffsetBegin=0 CharacterOffsetEnd=6] +[Text=Merkel CharacterOffsetBegin=7 CharacterOffsetEnd=13] +[Text=ist CharacterOffsetBegin=14 CharacterOffsetEnd=17] +[Text=seit CharacterOffsetBegin=18 CharacterOffsetEnd=22] +[Text=2005 CharacterOffsetBegin=23 CharacterOffsetEnd=27] +[Text=Bundeskanzlerin CharacterOffsetBegin=28 CharacterOffsetEnd=43] +[Text=der CharacterOffsetBegin=44 CharacterOffsetEnd=47] +[Text=Bundesrepublik CharacterOffsetBegin=48 CharacterOffsetEnd=62] +[Text=Deutschland CharacterOffsetBegin=63 CharacterOffsetEnd=74] +[Text=. CharacterOffsetBegin=74 CharacterOffsetEnd=75] +""" + +# properties for username/password example +USERNAME_PASS_PROPS = {'annotators': 'tokenize,ssplit,pos'} + +USERNAME_PASS_GOLD = """ +Sentence #1 (6 tokens): +Joe Smith lives in California. + +Tokens: +[Text=Joe CharacterOffsetBegin=0 CharacterOffsetEnd=3 PartOfSpeech=NNP] +[Text=Smith CharacterOffsetBegin=4 CharacterOffsetEnd=9 PartOfSpeech=NNP] +[Text=lives CharacterOffsetBegin=10 CharacterOffsetEnd=15 PartOfSpeech=VBZ] +[Text=in CharacterOffsetBegin=16 CharacterOffsetEnd=18 PartOfSpeech=IN] +[Text=California CharacterOffsetBegin=19 CharacterOffsetEnd=29 PartOfSpeech=NNP] +[Text=. CharacterOffsetBegin=29 CharacterOffsetEnd=30 PartOfSpeech=.] 
+""" + + +def annotate_and_time(client, text, properties={}): + """ Submit an annotation request and return how long it took """ + start = time.time() + ann = client.annotate(text, properties=properties, output_format="text") + end = time.time() + return {'annotation': ann, 'start_time': start, 'end_time': end} + +def test_preload(): + """ Test that the default annotators load fully immediately upon server start """ + with corenlp.CoreNLPClient(server_id='test_server_start_preload') as client: + # wait for annotators to load + time.sleep(140) + results = annotate_and_time(client, EN_DOC) + compare_ignoring_whitespace(results['annotation'], EN_PRELOAD_GOLD) + assert results['end_time'] - results['start_time'] < 3 + + +def test_props_file(): + """ Test starting the server with a props file """ + with corenlp.CoreNLPClient(properties=SERVER_TEST_PROPS, server_id='test_server_start_props_file') as client: + ann = client.annotate(EN_DOC, output_format="text") + assert ann.strip() == EN_PROPS_FILE_GOLD.strip() + + +def test_lang_start(): + """ Test starting the server with a Stanford CoreNLP language name """ + with corenlp.CoreNLPClient(properties='german', server_id='test_server_start_lang_name') as client: + ann = client.annotate(GERMAN_DOC, output_format='text') + compare_ignoring_whitespace(ann, GERMAN_FULL_PROPS_GOLD) + + +def test_python_dict(): + """ Test starting the server with a Python dictionary as default properties """ + with corenlp.CoreNLPClient(properties=GERMAN_SMALL_PROPS, server_id='test_server_start_python_dict') as client: + ann = client.annotate(GERMAN_DOC, output_format='text') + assert ann.strip() == GERMAN_SMALL_PROPS_GOLD.strip() + + +def test_python_dict_w_annotators(): + """ Test starting the server with a Python dictionary as default properties, override annotators """ + with corenlp.CoreNLPClient(properties=GERMAN_SMALL_PROPS, annotators="tokenize,ssplit", + server_id='test_server_start_python_dict_w_annotators') as client: + ann = client.annotate(GERMAN_DOC, output_format='text') + assert ann.strip() == GERMAN_SMALL_PROPS_W_ANNOTATORS_GOLD.strip() + + +def test_username_password(): + """ Test starting a server with a username and password """ + with corenlp.CoreNLPClient(properties=USERNAME_PASS_PROPS, username='user-1234', password='1234', + server_id="test_server_username_pass") as client: + # check with correct password + ann = client.annotate(EN_DOC, output_format='text', username='user-1234', password='1234') + assert ann.strip() == USERNAME_PASS_GOLD.strip() + # check with incorrect password, should throw AnnotationException + try: + ann = client.annotate(EN_DOC, output_format='text', username='user-1234', password='12345') + assert False + except AnnotationException as ae: + pass + except Exception as e: + assert False + + diff --git a/stanza/stanza/tests/server/test_ssurgeon.py b/stanza/stanza/tests/server/test_ssurgeon.py new file mode 100644 index 0000000000000000000000000000000000000000..3e8f29f77be76d81d8e4721bc7ec12a49f980785 --- /dev/null +++ b/stanza/stanza/tests/server/test_ssurgeon.py @@ -0,0 +1,425 @@ +import pytest + +from stanza.tests import compare_ignoring_whitespace + +pytestmark = [pytest.mark.travis, pytest.mark.client] + +from stanza.utils.conll import CoNLL +import stanza.server.ssurgeon as ssurgeon + +SAMPLE_DOC_INPUT = """ +# sent_id = 271 +# text = Hers is easy to clean. +# previous = What did the dealer like about Alex's car? 
+# comment = extraction/raising via "tough extraction" and clausal subject +1 Hers hers PRON PRP Gender=Fem|Number=Sing|Person=3|Poss=Yes|PronType=Prs 3 nsubj _ _ +2 is be AUX VBZ Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin 3 cop _ _ +3 easy easy ADJ JJ Degree=Pos 0 root _ _ +4 to to PART TO _ 5 mark _ _ +5 clean clean VERB VB VerbForm=Inf 3 csubj _ SpaceAfter=No +6 . . PUNCT . _ 5 punct _ _ +""" + +SAMPLE_DOC_EXPECTED = """ +# sent_id = 271 +# text = Hers is easy to clean. +# previous = What did the dealer like about Alex's car? +# comment = extraction/raising via "tough extraction" and clausal subject +1 Hers hers PRON PRP Gender=Fem|Number=Sing|Person=3|Poss=Yes|PronType=Prs 3 nsubj _ _ +2 is be AUX VBZ Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin 3 cop _ _ +3 easy easy ADJ JJ Degree=Pos 0 root _ _ +4 to to PART TO _ 5 mark _ _ +5 clean clean VERB VB VerbForm=Inf 3 advcl _ SpaceAfter=No +6 . . PUNCT . _ 5 punct _ _ +""" + + +def test_ssurgeon_same_length(): + semgrex_pattern = "{}=source >nsubj {} >csubj=bad {}" + ssurgeon_edits = ["relabelNamedEdge -edge bad -reln advcl"] + + doc = CoNLL.conll2doc(input_str=SAMPLE_DOC_INPUT) + + ssurgeon_response = ssurgeon.process_doc_one_operation(doc, semgrex_pattern, ssurgeon_edits) + updated_doc = ssurgeon.convert_response_to_doc(doc, ssurgeon_response) + + result = "{:C}".format(updated_doc) + #print(result) + #print(SAMPLE_DOC_EXPECTED) + compare_ignoring_whitespace(result, SAMPLE_DOC_EXPECTED) + + +ADD_WORD_DOC_INPUT = """ +# text = Jennifer has lovely antennae. +# sent_id = 12 +# comment = if you're in to that kind of thing +1 Jennifer Jennifer PROPN NNP Number=Sing 2 nsubj _ start_char=0|end_char=8|ner=S-PERSON +2 has have VERB VBZ Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin 0 root _ start_char=9|end_char=12|ner=O +3 lovely lovely ADJ JJ Degree=Pos 4 amod _ start_char=13|end_char=19|ner=O +4 antennae antenna NOUN NNS Number=Plur 2 obj _ start_char=20|end_char=28|ner=O|SpaceAfter=No +5 . . PUNCT . _ 2 punct _ start_char=28|end_char=29|ner=O +""" + +ADD_WORD_DOC_EXPECTED = """ +# text = Jennifer has lovely blue antennae. +# sent_id = 12 +# comment = if you're in to that kind of thing +1 Jennifer Jennifer PROPN NNP Number=Sing 2 nsubj _ ner=S-PERSON +2 has have VERB VBZ Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin 0 root _ ner=O +3 lovely lovely ADJ JJ Degree=Pos 5 amod _ ner=O +4 blue blue ADJ JJ _ 5 amod _ ner=O +5 antennae antenna NOUN NNS Number=Plur 2 obj _ SpaceAfter=No|ner=O +6 . . PUNCT . _ 2 punct _ ner=O +""" + + +def test_ssurgeon_different_length(): + semgrex_pattern = "{word:antennae}=antennae !> {word:blue}" + ssurgeon_edits = ["addDep -gov antennae -reln amod -word blue -lemma blue -cpos ADJ -pos JJ -ner O -position -antennae -after \" \""] + + doc = CoNLL.conll2doc(input_str=ADD_WORD_DOC_INPUT) + #print() + #print("{:C}".format(doc)) + + ssurgeon_response = ssurgeon.process_doc_one_operation(doc, semgrex_pattern, ssurgeon_edits) + updated_doc = ssurgeon.convert_response_to_doc(doc, ssurgeon_response) + + result = "{:C}".format(updated_doc) + #print(result) + #print(ADD_WORD_DOC_EXPECTED) + + compare_ignoring_whitespace(result, ADD_WORD_DOC_EXPECTED) + +BECOME_MWT_DOC_INPUT = """ +# sent_id = 25 +# text = It's not yours! 
+# comment = negation +1 It it PRON PRP Number=Sing|Person=2|PronType=Prs 4 nsubj _ SpaceAfter=No +2 's be AUX VBZ Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin 4 cop _ _ +3 not not PART RB Polarity=Neg 4 advmod _ _ +4 yours yours PRON PRP Gender=Neut|Number=Sing|Person=2|Poss=Yes|PronType=Prs 0 root _ SpaceAfter=No +5 ! ! PUNCT . _ 4 punct _ _ +""" + +BECOME_MWT_DOC_EXPECTED = """ +# sent_id = 25 +# text = It's not yours! +# comment = negation +1-2 It's _ _ _ _ _ _ _ _ +1 It it PRON PRP Number=Sing|Person=2|PronType=Prs 4 nsubj _ _ +2 's be AUX VBZ Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin 4 cop _ _ +3 not not PART RB Polarity=Neg 4 advmod _ _ +4 yours yours PRON PRP Gender=Neut|Number=Sing|Person=2|Poss=Yes|PronType=Prs 0 root _ SpaceAfter=No +5 ! ! PUNCT . _ 4 punct _ _ +""" + +def test_ssurgeon_become_mwt(): + """ + Test that converting a document, adding a new MWT, works as expected + """ + semgrex_pattern = "{word:It}=it . {word:/'s/}=s" + ssurgeon_edits = ["EditNode -node it -is_mwt true -is_first_mwt true -mwt_text It's", + "EditNode -node s -is_mwt true -is_first_mwt false -mwt_text It's"] + + doc = CoNLL.conll2doc(input_str=BECOME_MWT_DOC_INPUT) + + ssurgeon_response = ssurgeon.process_doc_one_operation(doc, semgrex_pattern, ssurgeon_edits) + updated_doc = ssurgeon.convert_response_to_doc(doc, ssurgeon_response) + + result = "{:C}".format(updated_doc) + compare_ignoring_whitespace(result, BECOME_MWT_DOC_EXPECTED) + +EXISTING_MWT_DOC_INPUT = """ +# sent_id = newsgroup-groups.google.com_GayMarriage_0ccbb50b41a5830b_ENG_20050321_181500-0005 +# text = One of “NCRC4ME’s” +1 One one NUM CD NumType=Card 0 root 0:root _ +2 of of ADP IN _ 4 case 4:case _ +3 “ " PUNCT `` _ 4 punct 4:punct SpaceAfter=No +4-5 NCRC4ME’s _ _ _ _ _ _ _ SpaceAfter=No +4 NCRC4ME NCRC4ME PROPN NNP Number=Sing 1 compound 1:compound _ +5 ’s 's PART POS _ 4 case 4:case _ +6 ” " PUNCT '' _ 4 punct 4:punct _ +""" + +# TODO: also, we shouldn't lose the enhanced dependencies... +EXISTING_MWT_DOC_EXPECTED = """ +# sent_id = newsgroup-groups.google.com_GayMarriage_0ccbb50b41a5830b_ENG_20050321_181500-0005 +# text = One of “NCRC4ME’s” +1 One one NUM CD NumType=Card 0 root _ _ +2 of of ADP IN _ 4 case _ _ +3 “ " PUNCT `` _ 4 punct _ SpaceAfter=No +4-5 NCRC4ME’s _ _ _ _ _ _ _ SpaceAfter=No +4 NCRC4ME NCRC4ME PROPN NNP Number=Sing 1 compound _ _ +5 ’s 's PART POS _ 4 case _ _ +6 ” " PUNCT '' _ 4 punct _ _ +""" + +def test_ssurgeon_existing_mwt_no_change(): + """ + Test that converting a document with an MWT works as expected + + Note regarding this test: + Currently it works because ssurgeon.py doesn't look at the + "changed" flag because of a bug in EditNode in CoreNLP 4.5.3 + If that is fixed, but the enhanced dependencies aren't fixed, + this test will fail because the enhanced dependencies *aren't* + removed. Fixing the enhanced dependencies as well will fix + that, though. + """ + semgrex_pattern = "{word:It}=it . 
{word:/'s/}=s" + ssurgeon_edits = ["EditNode -node it -is_mwt true -is_first_mwt true -mwt_text It's", + "EditNode -node s -is_mwt true -is_first_mwt false -mwt_text It's"] + + doc = CoNLL.conll2doc(input_str=EXISTING_MWT_DOC_INPUT) + + ssurgeon_response = ssurgeon.process_doc_one_operation(doc, semgrex_pattern, ssurgeon_edits) + updated_doc = ssurgeon.convert_response_to_doc(doc, ssurgeon_response) + + result = "{:C}".format(updated_doc) + compare_ignoring_whitespace(result, EXISTING_MWT_DOC_EXPECTED) + +def check_empty_test(input_text, expected=None, echo=False): + if expected is None: + expected = input_text + + doc = CoNLL.conll2doc(input_str=input_text) + + # we don't want to edit this, just test the to/from conversion + ssurgeon_response = ssurgeon.process_doc(doc, []) + updated_doc = ssurgeon.convert_response_to_doc(doc, ssurgeon_response) + + result = "{:C}".format(updated_doc) + if echo: + print("INPUT") + print(input_text) + print("EXPECTED") + print(expected) + print("RESULT") + print(result) + compare_ignoring_whitespace(result, expected) + +ITALIAN_MWT_INPUT = """ +# sent_id = train_78 +# text = @user dovrebbe fare pace col cervello +# twittiro = IMPLICIT ANALOGY +1 @user @user SYM SYM _ 3 nsubj _ _ +2 dovrebbe dovere AUX VM Mood=Cnd|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin 3 aux _ _ +3 fare fare VERB V VerbForm=Inf 0 root _ _ +4 pace pace NOUN S Gender=Fem|Number=Sing 3 obj _ _ +5-6 col _ _ _ _ _ _ _ _ +5 con con ADP E _ 7 case _ _ +6 il il DET RD Definite=Def|Gender=Masc|Number=Sing|PronType=Art 7 det _ _ +7 cervello cervello NOUN S Gender=Masc|Number=Sing 3 obl _ _ +""" + +def test_ssurgeon_mwt_text(): + """ + Test that an MWT which is split into pieces which don't make up + the original token results in a correct #text annotation + + For example, in Italian, "col" splits into "con il", and we want + the #text to contain "col" + """ + check_empty_test(ITALIAN_MWT_INPUT) + +ITALIAN_SPACES_AFTER_INPUT=""" +# sent_id = train_1114 +# text = ““““ buona scuola ““““ +# twittiro = EXPLICIT OTHER +1 “ “ PUNCT FB _ 6 punct _ SpaceAfter=No +2 “ “ PUNCT FB _ 6 punct _ SpaceAfter=No +3 “ “ PUNCT FB _ 6 punct _ SpaceAfter=No +4 “ “ PUNCT FB _ 6 punct _ _ +5 buona buono ADJ A Gender=Fem|Number=Sing 6 amod _ _ +6 scuola scuola NOUN S Gender=Fem|Number=Sing 0 root _ _ +7 “ “ PUNCT FB _ 6 punct _ SpaceAfter=No +8 “ “ PUNCT FB _ 6 punct _ SpaceAfter=No +9 “ “ PUNCT FB _ 6 punct _ SpaceAfter=No +10 “ “ PUNCT FB _ 6 punct _ SpacesAfter=\\n +""" + +ITALIAN_SPACES_AFTER_YES_INPUT=""" +# sent_id = train_1114 +# text = ““““ buona scuola ““““ +# twittiro = EXPLICIT OTHER +1 “ “ PUNCT FB _ 6 punct _ SpaceAfter=No +2 “ “ PUNCT FB _ 6 punct _ SpaceAfter=No +3 “ “ PUNCT FB _ 6 punct _ SpaceAfter=No +4 “ “ PUNCT FB _ 6 punct _ SpaceAfter=Yes +5 buona buono ADJ A Gender=Fem|Number=Sing 6 amod _ _ +6 scuola scuola NOUN S Gender=Fem|Number=Sing 0 root _ _ +7 “ “ PUNCT FB _ 6 punct _ SpaceAfter=No +8 “ “ PUNCT FB _ 6 punct _ SpaceAfter=No +9 “ “ PUNCT FB _ 6 punct _ SpaceAfter=No +10 “ “ PUNCT FB _ 6 punct _ SpacesAfter=\\n +""" + + +def test_ssurgeon_spaces_after_text(): + """ + Test that SpacesAfter goes and comes back the same way + + Tested using some random example from the UD_Italian-TWITTIRO dataset + """ + check_empty_test(ITALIAN_SPACES_AFTER_INPUT) + +def test_ssurgeon_spaces_after_yes(): + """ + Test that an unnecessary SpaceAfter=Yes is eliminated + """ + check_empty_test(ITALIAN_SPACES_AFTER_YES_INPUT, ITALIAN_SPACES_AFTER_INPUT) + +EMPTY_VALUES_INPUT = """ +# text = Jennifer has lovely 
antennae. +# sent_id = 12 +# comment = if you're in to that kind of thing +1 Jennifer _ _ _ Number=Sing 2 nsubj _ ner=S-PERSON +2 has _ _ _ Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin 0 root _ ner=O +3 lovely _ _ _ Degree=Pos 4 amod _ ner=O +4 antennae _ _ _ Number=Plur 2 obj _ SpaceAfter=No|ner=O +5 . _ _ _ _ 2 punct _ ner=O +""" + +def test_ssurgeon_blank_values(): + """ + Check that various None fields such as lemma & xpos are not turned into blanks + + Tests, like regulations, are often written in blood + """ + check_empty_test(EMPTY_VALUES_INPUT) + +# first couple sentences of UD_Cantonese-HK +# we change the order of the misc column in word 3 to make sure the +# pieces don't get unnecessarily reordered by ssurgeon +CANTONESE_MISC_WORDS_INPUT = """ +# sent_id = 1 +# text = 你喺度搵乜嘢呀? +1 你 你 PRON _ _ 3 nsubj _ Translit=nei5|Gloss=2SG|SpaceAfter=No +2 喺度 喺度 ADV _ _ 3 advmod _ Translit=hai2dou6|Gloss=PROG|SpaceAfter=No +3 搵 搵 VERB _ _ 0 root _ Translit=wan2|Gloss=find|SpaceAfter=No +4 乜嘢 乜嘢 PRON _ _ 3 obj _ Translit=mat1je5|Gloss=what|SpaceAfter=No +5 呀 呀 PART _ _ 3 discourse:sp _ Translit=aa3|Gloss=SFP|SpaceAfter=No +6 ? ? PUNCT _ _ 3 punct _ SpaceAfter=No + +# sent_id = 2 +# text = 咪執返啲嘢去阿哥個新屋度囖。 +1 咪 咪 ADV _ _ 2 advmod _ SpaceAfter=No +2 執 執 VERB _ _ 0 root _ SpaceAfter=No +3 返 返 VERB _ _ 2 compound:dir _ SpaceAfter=No +4 啲 啲 NOUN _ NounType=Clf 5 clf:det _ SpaceAfter=No +5 嘢 嘢 NOUN _ _ 3 obj _ SpaceAfter=No +6 去 去 VERB _ _ 2 conj _ SpaceAfter=No +7 阿哥 阿哥 NOUN _ _ 10 nmod _ SpaceAfter=No +8 個 個 NOUN _ NounType=Clf 10 clf:det _ SpaceAfter=No +9 新 新 ADJ _ _ 10 amod _ SpaceAfter=No +10 屋 屋 NOUN _ _ 6 obj _ SpaceAfter=No +11 度 度 ADP _ _ 10 case:loc _ SpaceAfter=No +12 囖 囖 PART _ _ 2 discourse:sp _ SpaceAfter=No +13 。 。 PUNCT _ _ 2 punct _ SpaceAfter=No +""" + +def test_ssurgeon_misc_words(): + """ + Check that various None fields such as lemma & xpos are not turned into blanks + + Tests, like regulations, are often written in blood + """ + check_empty_test(CANTONESE_MISC_WORDS_INPUT) + +ITALIAN_MWT_SPACE_AFTER_INPUT = """ +# sent_id = train_78 +# text = @user dovrebbe fare pace colcervello +# twittiro = IMPLICIT ANALOGY +1 @user @user SYM SYM _ 3 nsubj _ _ +2 dovrebbe dovere AUX VM Mood=Cnd|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin 3 aux _ _ +3 fare fare VERB V VerbForm=Inf 0 root _ _ +4 pace pace NOUN S Gender=Fem|Number=Sing 3 obj _ _ +5-6 col _ _ _ _ _ _ _ SpaceAfter=No +5 con con ADP E _ 7 case _ _ +6 il il DET RD Definite=Def|Gender=Masc|Number=Sing|PronType=Art 7 det _ _ +7 cervello cervello NOUN S Gender=Masc|Number=Sing 3 obl _ RandomFeature=foo +""" + +def test_ssurgeon_mwt_space_after(): + """ + Check the SpaceAfter=No on an MWT (rather than a word) + + the RandomFeature=foo is on account of a silly bug in the initial + version of passing in MWT misc features + """ + check_empty_test(ITALIAN_MWT_SPACE_AFTER_INPUT) + +ITALIAN_MWT_MISC_INPUT = """ +# sent_id = train_78 +# text = @user dovrebbe farepacecolcervello +# twittiro = IMPLICIT ANALOGY +1 @user @user SYM SYM _ 3 nsubj _ _ +2 dovrebbe dovere AUX VM Mood=Cnd|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin 3 aux _ _ +3-4 farepace _ _ _ _ _ _ _ Players=GonnaPlay|SpaceAfter=No +3 fare fare VERB V VerbForm=Inf 0 root _ _ +4 pace pace NOUN S Gender=Fem|Number=Sing 3 obj _ _ +5-6 col _ _ _ _ _ _ _ Haters=GonnaHate|SpaceAfter=No +5 con con ADP E _ 7 case _ _ +6 il il DET RD Definite=Def|Gender=Masc|Number=Sing|PronType=Art 7 det _ _ +7 cervello cervello NOUN S Gender=Masc|Number=Sing 3 obl _ RandomFeature=foo +""" + +def 
test_ssurgeon_mwt_misc(): + """ + Check the SpaceAfter=No on an MWT (rather than a word) + + the RandomFeature=foo is on account of a silly bug in the initial + version of passing in MWT misc features + """ + check_empty_test(ITALIAN_MWT_MISC_INPUT) + +SINDHI_ROOT_EXAMPLE = """ +# sent_id = 1 +# text = غلام رهڻ سان ماڻهو منافق ٿئي ٿو . +1 غلام غلام NOUN NN__اسم Case=Acc|Gender=Masc|Number=Sing|Person=3 2 compound _ _ +2 رهڻ ره VERB VB__فعل Number=Sing 6 advcl _ _ +3 سان سان ADP IN__حرفِ_جر Number=Sing 2 mark _ _ +4 ماڻهو ماڻهو NOUN NN__اسم Case=Nom|Gender=Masc|Number=Sing|Person=3 6 nsubj _ _ +5 منافق منافق ADJ JJ__صفت Case=Acc|Number=Sing|Person=3 6 xcomp _ _ +6 ٿئي ٿي VERB VB__فعل Number=Sing _ _ _ _ +7 ٿو ٿو AUX VB__فعل Number=Sing 6 aux _ _ +8 . . PUNCT -__پورو_دم _ 6 punct _ _ +""".lstrip() + +SINDHI_ROOT_EXPECTED = """ +# sent_id = 1 +# text = غلام رهڻ سان ماڻهو منافق ٿئي ٿو . +1 غلام غلام NOUN NN__اسم Case=Acc|Gender=Masc|Number=Sing|Person=3 2 compound _ _ +2 رهڻ ره VERB VB__فعل Number=Sing 6 advcl _ _ +3 سان سان ADP IN__حرفِ_جر Number=Sing 2 mark _ _ +4 ماڻهو ماڻهو NOUN NN__اسم Case=Nom|Gender=Masc|Number=Sing|Person=3 6 nsubj _ _ +5 منافق منافق ADJ JJ__صفت Case=Acc|Number=Sing|Person=3 6 xcomp _ _ +6 ٿئي ٿي VERB VB__فعل Number=Sing 0 root _ _ +7 ٿو ٿو AUX VB__فعل Number=Sing 6 aux _ _ +8 . . PUNCT -__پورو_دم _ 6 punct _ _ +""".strip() + +SINDHI_EDIT = """ +{}=root !< {} +setRoots root +""" + +def test_ssurgeon_rewrite_sindhi_roots(): + """ + A user / contributor sent a dependency file with blank roots + """ + edits = ssurgeon.parse_ssurgeon_edits(SINDHI_EDIT) + expected_edits = [ssurgeon.SsurgeonEdit(semgrex_pattern='{}=root !< {}', + ssurgeon_edits=['setRoots root'], + ssurgeon_id='1', notes='', language='UniversalEnglish')] + assert edits == expected_edits + + blank_dep_doc = CoNLL.conll2doc(input_str=SINDHI_ROOT_EXAMPLE) + # test that the conversion will work w/o crashing, such as because of a missing root edge + request = ssurgeon.build_request(blank_dep_doc, edits) + + response = ssurgeon.process_doc(blank_dep_doc, edits) + updated_doc = ssurgeon.convert_response_to_doc(blank_dep_doc, response) + + result = "{:C}".format(updated_doc) + assert result == SINDHI_ROOT_EXPECTED diff --git a/stanza/stanza/tests/server/test_tokensregex.py b/stanza/stanza/tests/server/test_tokensregex.py new file mode 100644 index 0000000000000000000000000000000000000000..e5780107a4808a01de2107907f41b39e84e71347 --- /dev/null +++ b/stanza/stanza/tests/server/test_tokensregex.py @@ -0,0 +1,48 @@ +import pytest +from stanza.tests import * + +from stanza.models.common.doc import Document +import stanza.server.tokensregex as tokensregex + +pytestmark = [pytest.mark.travis, pytest.mark.client] + +from stanza.tests.server.test_semgrex import ONE_SENTENCE_DOC, TWO_SENTENCE_DOC + +def test_single_sentence(): + #expected: + #match { + # sentence: 0 + # match { + # text: "Opal" + # begin: 2 + # end: 3 + # } + #} + + response = tokensregex.process_doc(ONE_SENTENCE_DOC, "Opal") + assert len(response.match) == 1 + assert len(response.match[0].match) == 1 + assert response.match[0].match[0].sentence == 0 + assert response.match[0].match[0].match.text == "Opal" + assert response.match[0].match[0].match.begin == 2 + assert response.match[0].match[0].match.end == 3 + + +def test_ner_sentence(): + #expected: + #match { + # sentence: 0 + # match { + # text: "Opal" + # begin: 2 + # end: 3 + # } + #} + + response = tokensregex.process_doc(ONE_SENTENCE_DOC, "[ner: GEM]") + assert len(response.match) == 1 + assert 
len(response.match[0].match) == 1 + assert response.match[0].match[0].sentence == 0 + assert response.match[0].match[0].match.text == "Opal" + assert response.match[0].match[0].match.begin == 2 + assert response.match[0].match[0].match.end == 3 diff --git a/stanza/stanza/tests/server/test_ud_enhancer.py b/stanza/stanza/tests/server/test_ud_enhancer.py new file mode 100644 index 0000000000000000000000000000000000000000..f67cf5d40d4ac7690115d6e7c6f5ede98d1301a4 --- /dev/null +++ b/stanza/stanza/tests/server/test_ud_enhancer.py @@ -0,0 +1,35 @@ +import pytest +import stanza +from stanza.tests import * + +from stanza.models.common.doc import Document +import stanza.server.ud_enhancer as ud_enhancer + +pytestmark = [pytest.mark.pipeline] + +def check_edges(graph, source, target, num, isExtra=None): + edges = [edge for edge in graph.edge if edge.source == source and edge.target == target] + assert len(edges) == num + if num == 1: + assert edges[0].isExtra == isExtra + +def test_one_sentence(): + nlp = stanza.Pipeline(dir=TEST_MODELS_DIR, processors="tokenize,pos,lemma,depparse") + doc = nlp("This is the car that I bought") + result = ud_enhancer.process_doc(doc, language="en", pronouns_pattern=None) + + assert len(result.sentence) == 1 + sentence = result.sentence[0] + + basic = sentence.basicDependencies + assert len(basic.node) == 7 + assert len(basic.edge) == 6 + check_edges(basic, 4, 7, 1, False) + check_edges(basic, 7, 4, 0) + + enhanced = sentence.enhancedDependencies + assert len(enhanced.node) == 7 + assert len(enhanced.edge) == 7 + check_edges(enhanced, 4, 7, 1, False) + # this is the new edge + check_edges(enhanced, 7, 4, 1, True) diff --git a/stanza/stanza/tests/tokenization/test_tokenize_files.py b/stanza/stanza/tests/tokenization/test_tokenize_files.py new file mode 100644 index 0000000000000000000000000000000000000000..b9604351fed847490af02d02a309f58d50c2f15e --- /dev/null +++ b/stanza/stanza/tests/tokenization/test_tokenize_files.py @@ -0,0 +1,24 @@ +import pytest + +from stanza.models.tokenization import tokenize_files +from stanza.tests import TEST_MODELS_DIR + +pytestmark = [pytest.mark.pipeline, pytest.mark.travis] + +EXPECTED = """ +This is a test . This is a second sentence . +I took my daughter ice skating +""".lstrip() + +def test_tokenize_files(tmp_path): + input_file = tmp_path / "input.txt" + with open(input_file, "w") as fout: + fout.write("This is a test. 
This is a second sentence.\n\nI took my daughter ice skating")
+
+    output_file = tmp_path / "output.txt"
+    tokenize_files.main([str(input_file), "--lang", "en", "--output_file", str(output_file), "--model_dir", TEST_MODELS_DIR])
+
+    with open(output_file) as fin:
+        text = fin.read()
+
+    assert EXPECTED == text
diff --git a/stanza/stanza/utils/avg_sent_len.py b/stanza/stanza/utils/avg_sent_len.py
new file mode 100644
index 0000000000000000000000000000000000000000..268fc9e35783938d5cf14eecf23d5be80310d75c
--- /dev/null
+++ b/stanza/stanza/utils/avg_sent_len.py
@@ -0,0 +1,20 @@
+import sys
+import json
+
+def avg_sent_len(toklabels):
+    if toklabels.endswith('.json'):
+        with open(toklabels, 'r') as f:
+            l = json.load(f)
+
+        l = [''.join([str(x[1]) for x in para]) for para in l]
+    else:
+        with open(toklabels, 'r') as f:
+            l = ''.join(f.readlines())
+
+        l = l.split('\n\n')
+
+    sentlen = [len(x) + 1 for para in l for x in para.split('2')]
+    return sum(sentlen) / len(sentlen)
+
+if __name__ == '__main__':
+    print(avg_sent_len(sys.argv[1]))
diff --git a/stanza/stanza/utils/default_paths.py b/stanza/stanza/utils/default_paths.py
new file mode 100644
index 0000000000000000000000000000000000000000..ef87cc14f6aef21b0f5857706b72112adeb8980f
--- /dev/null
+++ b/stanza/stanza/utils/default_paths.py
@@ -0,0 +1,55 @@
+import os
+
+def get_default_paths():
+    """
+    Gets base paths for the data directories
+
+    If DATA_ROOT is set in the environment, use that as the root,
+    otherwise use "./data".
+    Individual paths can also be set in the environment.
+    """
+    DATA_ROOT = os.environ.get("DATA_ROOT", "data")
+    defaults = {
+        "TOKENIZE_DATA_DIR": DATA_ROOT + "/tokenize",
+        "MWT_DATA_DIR": DATA_ROOT + "/mwt",
+        "LEMMA_DATA_DIR": DATA_ROOT + "/lemma",
+        "POS_DATA_DIR": DATA_ROOT + "/pos",
+        "DEPPARSE_DATA_DIR": DATA_ROOT + "/depparse",
+        "ETE_DATA_DIR": DATA_ROOT + "/ete",
+        "NER_DATA_DIR": DATA_ROOT + "/ner",
+        "CHARLM_DATA_DIR": DATA_ROOT + "/charlm",
+        "SENTIMENT_DATA_DIR": DATA_ROOT + "/sentiment",
+        "CONSTITUENCY_DATA_DIR": DATA_ROOT + "/constituency",
+        "COREF_DATA_DIR": DATA_ROOT + "/coref",
+        "LEMMA_CLASSIFIER_DATA_DIR": DATA_ROOT + "/lemma_classifier",
+
+        # Set directories to store external word vector data
+        "WORDVEC_DIR": "extern_data/wordvec",
+
+        # TODO: not sure what other people actually have
+        # TODO: also, could make this automatically update to the latest
+        "UDBASE": "extern_data/ud2/ud-treebanks-v2.11",
+        "UDBASE_GIT": "extern_data/ud2/git",
+
+        "NERBASE": "extern_data/ner",
+        "CONSTITUENCY_BASE": "extern_data/constituency",
+        "SENTIMENT_BASE": "extern_data/sentiment",
+        "COREF_BASE": "extern_data/coref",
+
+        # there's a stanford github, stanfordnlp/handparsed-treebank,
+        # with some data for different languages
+        "HANDPARSED_DIR": "extern_data/handparsed-treebank",
+
+        # directory with the contents of https://nlp.stanford.edu/projects/stanza/bio/
+        # on the cluster, for example, /u/nlp/software/stanza/bio_ud
+        "BIO_UD_DIR": "extern_data/bio",
+
+        # data root for other general input files, such as VI_VLSP
+        "STANZA_EXTERN_DIR": "extern_data",
+    }
+
+    paths = { "DATA_ROOT" : DATA_ROOT }
+    for k, v in defaults.items():
+        paths[k] = os.environ.get(k, v)
+
+    return paths
diff --git a/stanza/stanza/utils/get_tqdm.py b/stanza/stanza/utils/get_tqdm.py
new file mode 100644
index 0000000000000000000000000000000000000000..94c911c200d87024594b820195fc0b0da4a703e1
--- /dev/null
+++ b/stanza/stanza/utils/get_tqdm.py
@@ -0,0 +1,46 @@
+import sys
+
+def get_tqdm():
+    """
+    Return a tqdm appropriate for the situation
+
+    Imports tqdm depending on whether we are at a console, redirected to a file, in a notebook, etc.
+
+    From @tcrimi at https://github.com/tqdm/tqdm/issues/506
+
+    This replaces `import tqdm`, so for example, you do this:
+      from stanza.utils.get_tqdm import get_tqdm
+      tqdm = get_tqdm()
+    then do this when you want a scroll bar or a regular iterator depending on context:
+      tqdm(list)
+
+    If there is no tty, the returned tqdm will always be disabled
+    unless disable=False is specifically set.
+    """
+    ipy_str = ""
+    try:
+        from IPython import get_ipython
+        ipy_str = str(type(get_ipython()))
+    except ImportError:
+        pass
+
+    if 'zmqshell' in ipy_str:
+        from tqdm import tqdm_notebook as tqdm
+        return tqdm
+    if 'terminal' in ipy_str:
+        from tqdm import tqdm
+        return tqdm
+
+    if sys.stderr is not None and hasattr(sys.stderr, "isatty") and sys.stderr.isatty():
+        from tqdm import tqdm
+        return tqdm
+
+    from tqdm import tqdm
+    def hidden_tqdm(*args, **kwargs):
+        if "disable" in kwargs:
+            return tqdm(*args, **kwargs)
+        kwargs["disable"] = True
+        return tqdm(*args, **kwargs)
+
+    return hidden_tqdm
+
diff --git a/stanza/stanza/utils/max_mwt_length.py b/stanza/stanza/utils/max_mwt_length.py
new file mode 100644
index 0000000000000000000000000000000000000000..5e24d043f3e352d9ff76463a419df31b3c021bdb
--- /dev/null
+++ b/stanza/stanza/utils/max_mwt_length.py
@@ -0,0 +1,14 @@
+import sys
+
+import json
+
+def max_mwt_length(filenames):
+    max_len = 0
+    for filename in filenames:
+        with open(filename) as f:
+            d = json.load(f)
+        max_len = max([max_len] + [len(" ".join(x[0][1])) for x in d])
+    return max_len
+
+if __name__ == '__main__':
+    print(max_mwt_length(sys.argv[1:]))
diff --git a/stanza/stanza/utils/select_backoff.py b/stanza/stanza/utils/select_backoff.py
new file mode 100644
index 0000000000000000000000000000000000000000..60d70cec2b665030a43396f02afba76673f24945
--- /dev/null
+++ b/stanza/stanza/utils/select_backoff.py
@@ -0,0 +1,13 @@
+import sys
+
+backoff_models = { "UD_Breton-KEB": "ga_idt",
+                   "UD_Czech-PUD": "cs_pdt",
+                   "UD_English-PUD": "en_ewt",
+                   "UD_Faroese-OFT": "nn_nynorsk",
+                   "UD_Finnish-PUD": "fi_tdt",
+                   "UD_Japanese-Modern": "ja_gsd",
+                   "UD_Naija-NSC": "en_ewt",
+                   "UD_Swedish-PUD": "sv_talbanken"
+                   }
+
+print(backoff_models[sys.argv[1]])
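To make the intended use of the last two utility modules concrete, here is a short usage sketch. It is not part of the diff: the `DATA_ROOT` value and the file names are invented for illustration, and only `get_default_paths()` and `get_tqdm()` come from the files added above.

```python
import os

# Hypothetical override for the example; any of the *_DATA_DIR keys
# can be overridden individually in the same way.
os.environ["DATA_ROOT"] = "/tmp/stanza_data"

from stanza.utils.default_paths import get_default_paths
from stanza.utils.get_tqdm import get_tqdm

tqdm = get_tqdm()  # silently disabled when stderr is not a tty

paths = get_default_paths()
print(paths["TOKENIZE_DATA_DIR"])  # -> /tmp/stanza_data/tokenize

# Made-up file names, just to show that tqdm(...) is used like the regular tqdm.
for filename in tqdm(["en_ewt.train.json", "en_ewt.dev.json"]):
    pass  # a real caller would open and process each dataset file here
```

The point of `get_tqdm()` is that scripts can ask for a progress bar unconditionally and still stay quiet when their output is redirected to a log file, while `get_default_paths()` lets all of the data directories be relocated with a single `DATA_ROOT` environment variable.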