Chapter 3 - "Church" in the New Testament
#########################################

Comparing Word Vectors
======================

.. autoclass:: Data_Production.compare_vectors.comparison

The central task of chapter 3 is to discover which of several corpora might best reflect, and thus point to the source for, the semantics of the word ἐκκλησία (ekklesia, English "assembly"). To do this, I calculated the cosine similarity of the cosine similarity vectors for ἐκκλησία in six different corpora: the New Testament, the Septuagint (Greek Old Testament), Philo, Josephus, Plutarch, and a mixed corpus of classical Greek literature extracted from the Perseus Digital Library. Automating these calculations is the purpose of the ``compare_vectors.comparison`` class. The class takes six parameters.
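To make the calculation concrete, here is a minimal sketch of a pairwise cosine similarity between the vectors for a single word in several corpora. It is not the code of ``compare_vectors.comparison``. The function name ``cosine_similarity``, the corpus label 'LXX', and the toy values are all hypothetical, and the sketch assumes that the vectors have already been aligned to a shared vocabulary so that each position refers to the same word.

```python
from itertools import combinations

import numpy as np


def cosine_similarity(u, v):
    """Return the cosine of the angle between two 1-D vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    if denom == 0.0:
        # By convention here, an all-zero vector is similar to nothing.
        return 0.0
    return float(np.dot(u, v) / denom)


# Toy stand-ins for the vectors of one word in three corpora (invented values).
vectors = {
    "NT": np.array([0.9, 0.1, 0.4, 0.0]),
    "LXX": np.array([0.8, 0.2, 0.5, 0.1]),
    "Plutarch": np.array([0.1, 0.9, 0.0, 0.6]),
}

# One similarity score for every unordered pair of corpora.
pairwise = {
    (a, b): cosine_similarity(vectors[a], vectors[b])
    for a, b in combinations(vectors, 2)
}
for (a, b), score in pairwise.items():
    print(f"{a} vs {b}: {score:.3f}")
```

With six corpora, the same loop would yield fifteen corpus pairs.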

The first, ``base``, is a string that represents the top-level folder in which the data for all of the corpora is stored. The folder structure that this class assumes for each corpus depends on the corpora being used, the word under investigation, and the parameters used for each corpus. I will describe this folder structure below, after I have discussed the other parameters.

The second parameter, ``english``, is a string representing the transcription of the word into Latin characters. For my purposes, I used the string 'ekklesia' here. The third parameter, ``greek``, is a string representing the word as it is written in its native alphabet. This need not be in Greek; I simply named the parameter ``greek`` because that is the language I am working with. Again, for my purposes, I set the value for ``greek`` to 'ἐκκλησία'.

The fourth parameter is ``measure``. There are only four accepted values for this: 'CS', 'LL', 'PPMI', or 'cooc'. The value for ``measure`` determines the type of vectors used to calculate the similarity. I used 'CS' because I wanted to find the similarity of the vectors in terms of how similar ἐκκλησία was to other words in each of the corpora. If I had wanted to base the investigation on the co-occurrence patterns of ἐκκλησία, I would have used 'LL' or 'PPMI' to base it on the statistical significance of the co-occurrence counts, or 'cooc' to base it on the raw co-occurrence counts.

The fifth parameter, ``norm``, is a boolean value that determines whether normalization needs to be run on the input matrices. In order to compare the data from different corpora, the matrices need to be normalized using ``sklearn.preprocessing.scale``. If this has already been done, give ``norm`` the value ``False``. If not, give it the value ``True`` and the data will be normalized and saved before being analyzed. This normalization process can take anywhere from a few seconds to a few hours, depending on the size of the corpus and the hardware being used. However, once a corpus has been normalized, it need not be normalized again.

The sixth and final parameter, ``zscore``, is the minimum Z-score to be used when producing the lists of the most similar words between the two corpora in each corpus pair. This parameter and its results require a bit more explanation. First, remember that the point of this class is to find out how similar a single word's vectors are across several different corpora. In order to discover what makes the vectors as similar as they are in each corpus pair, ``comparison`` produces CSV files that show the word scores that are the most similar in the two vectors from the two corpora. Since the similarity scores produced by the cosine similarity metric are based on the similarity of the individual scores in each vector, the scores that are most similar to each other in the two vectors will be most responsible for the similarity between the two vectors. ``comparison``, then, automatically checks the vectors in every corpus pair to find the single word scores that have the smallest difference and saves these words, along with the difference in their Z-scores, to a CSV file in the ``base`` directory called *CORPUS1_CORPUS2_zscore=ZSCORE_top_100_words.txt*, where CORPUS1 is the name of the first corpus, CORPUS2 that of the second, and ZSCORE the value of the ``zscore`` parameter. I implemented the ``zscore`` parameter, instead of automatically including all of the words in this pairwise comparison, because words that have a very low Z-score in whatever measure (whether cooc, LL, PPMI, or CS) are less important to the semantics of the target word than words with a high score. Therefore, in order to determine which of the most semantically important words are similar between the two corpora, one can set the ``zscore`` parameter to the desired magnitude. The trick here is to figure out which ``zscore`` level produces the most reasonable results.

I have found that a Z-score of 1.0 tends to produce interpretable results, but this will always depend on the two corpora under investigation. If the two corpora use the target word in a very similar manner, then a Z-score of 1.0 might produce too many results, many of which reflect a syntactic or grammatical, rather than semantic, relationship. If, however, the two corpora differ significantly in how they use the target word, a Z-score of 1.0 might produce almost no results. You as the user must decide which level produces the best amount of interpretable data and set your ``zscore`` accordingly.
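The selection step described above can be sketched roughly as follows. This is an illustration only, not the code used by ``comparison``: the function name ``most_similar_words`` and the toy words and scores are hypothetical, and the sketch assumes that a word is kept only when its Z-score reaches the threshold in *both* corpora. The real class may apply the threshold differently.

```python
import numpy as np


def most_similar_words(words, vec1, vec2, zscore=1.0, top_n=100):
    """Return (word, difference) pairs with the smallest score difference.

    Assumption: a word is kept only if its Z-score reaches ``zscore`` in
    both corpora; the real class may apply the threshold differently.
    """
    vec1 = np.asarray(vec1, dtype=float)
    vec2 = np.asarray(vec2, dtype=float)
    # Indices of the words that pass the threshold in both vectors.
    keep = np.flatnonzero((vec1 >= zscore) & (vec2 >= zscore))
    diffs = np.abs(vec1[keep] - vec2[keep])
    # Smallest differences first, limited to the top_n entries.
    order = np.argsort(diffs, kind="stable")[:top_n]
    return [(words[keep[i]], float(diffs[i])) for i in order]


# Invented transliterated words and Z-scores for two corpora.
words = ["laos", "synagoge", "polis", "demos", "kai"]
corpus1 = [2.5, 1.8, 0.2, 1.1, 3.0]
corpus2 = [2.4, 0.9, 1.5, 1.6, 3.0]

for word, diff in most_similar_words(words, corpus1, corpus2, zscore=1.0):
    print(f"{word}\t{diff:.2f}")
```

Raising ``zscore`` shortens the list, while lowering it admits words that matter less to the target word in at least one of the corpora.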

Before moving on to the expected folder structure, I should mention one instance variable that needs to be set within the code itself: ``corpora``. ``corpora`` is a list that contains one length-4 tuple for every corpus that you want to compare. Each corpus' tuple is made up of its **name** *(str)*, which should be the name of the folder in which all textual data for this corpus is contained, its **optimal window size** *(str)*, as determined by ``Data_Production.sem_extract_pipeline.SemPipeline``, the **minimum number of occurrences** *(int)* upon which the calculations were based, and **weighted** *(bool)*, representing whether a weighted window type was used or not.
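As an illustration, a ``corpora`` list might look something like the sketch below. Only the 'NT' entry uses settings that appear in this chapter's New Testament example; the other names and values are invented, and the helper ``check_corpora`` is a hypothetical convenience that is not part of the class. In the class itself the list is assigned as an instance variable rather than as a module-level name.

```python
# Each tuple: (name, optimal window size, minimum occurrences, weighted)
corpora = [
    ("NT", "16", 1, True),       # settings from this chapter's NT example
    ("LXX", "13", 1, True),      # hypothetical name and values
    ("Philo", "26", 1, False),   # hypothetical values
]


def check_corpora(corpora):
    """Raise ValueError if an entry does not match the expected layout."""
    for entry in corpora:
        if len(entry) != 4:
            raise ValueError(f"expected 4 items, got {len(entry)}: {entry!r}")
        name, win_size, min_occ, weighted = entry
        types_ok = (
            isinstance(name, str)
            and isinstance(win_size, str)
            # bool is a subclass of int, so exclude it explicitly.
            and isinstance(min_occ, int) and not isinstance(min_occ, bool)
            and isinstance(weighted, bool)
        )
        if not types_ok:
            raise ValueError(f"unexpected types in {entry!r}")
    return True


check_corpora(corpora)
```

Note that the window size is given as a string while the minimum number of occurrences is an integer, which is easy to get wrong when editing the list by hand.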

So, now that I have introduced these parameters and variables, why are they important? Primarily because they determine where the data to be used in the similarity calculations is located. This class assumes that the data being analyzed was produced by the ``Data_Production.sem_extract_pipeline.SemPipeline`` class and, thus, that it follows the folder and naming conventions automatically produced by that class. The first of these assumptions is that the data will be in a folder *BASE/NAME/ENGLISH/WIN_SIZE*. That means that if I want to analyze the data I produced from the New Testament (which I call 'NT') for 'ekklesia' with a context window size of 16, a minimum of 1 occurrence, and a *weighted* context window type, and assuming that *BASE* is the value given to the ``base`` parameter, the ``comparison`` class would expect to find the distributional data in the folder *BASE/NT/ekklesia/16/*. Further, if I wanted to analyze the cosine similarity data that was produced using the Log-likelihood measure, the file name it would look for would be *LL_cosine_16_lems=False_ekklesia_min_occ=1_no_stops=False_weighted=True_NORMED.dat*. As I said, this folder structure and file naming convention should be (almost) automatically created by ``SemPipeline``, assuming that the data was calculated from ``.txt`` files in the folder *BASE/NT/ekklesia*.
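The path from the example above can be assembled as in the following sketch. The function ``expected_ll_cosine_path`` is hypothetical and only reproduces the single file name quoted above; in particular, it hard-codes ``lems=False`` and ``no_stops=False`` because those are the only values shown in this chapter, and it does not cover the file names for the other measures.

```python
import os


def expected_ll_cosine_path(base, name, english, win_size, min_occ, weighted):
    """Build the path of the normalized LL cosine similarity file.

    Hypothetical helper: it mirrors the one example file name given in
    this chapter and is not taken from the ``comparison`` class.
    """
    file_name = (
        f"LL_cosine_{win_size}_lems=False_{english}_min_occ={min_occ}"
        f"_no_stops=False_weighted={weighted}_NORMED.dat"
    )
    # Folder layout: BASE/NAME/ENGLISH/WIN_SIZE
    return os.path.join(base, name, english, str(win_size), file_name)


path = expected_ll_cosine_path("BASE", "NT", "ekklesia", "16", 1, True)
print(path)
```

Checking ``os.path.isfile(path)`` for each corpus before starting a long run is a cheap way to catch a misnamed folder or a matrix that has not yet been normalized.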

The 'almost' in that previous sentence needs a bit of explanation. The only part of that file name that is not produced by ``SemPipeline`` is the ``_NORMED`` suffix. Normalization is not done automatically there because it can be a resource-intensive process that is unnecessary if you don't want to compare different data sets to each other. If you do, however, it is necessary. As mentioned above, if you have not already normalized your data matrices, simply set the value of ``norm`` to ``True`` and each matrix will be normalized during the process. Your normalized data is saved automatically so that normalization will not be necessary the next time you run the class. If you add one or more data matrices to your analysis later, these should be normalized before running ``comparison`` so that previously normalized data need not be re-normalized. This can be done using ``sklearn.preprocessing.scale`` (`documentation <http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.scale.html#sklearn.preprocessing.scale>`_).
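For readers unfamiliar with it, the sketch below shows in plain NumPy what ``sklearn.preprocessing.scale`` does when called with its default arguments: each column is centered on zero and scaled to unit variance. The function name ``standardize_columns`` and the toy matrix are hypothetical, and the sketch deliberately leaves out loading and saving the matrices, since whether ``comparison`` uses these defaults and how it stores the ``_NORMED`` files is not described here.

```python
import numpy as np


def standardize_columns(matrix):
    """Center each column on 0 and scale it to unit variance."""
    matrix = np.asarray(matrix, dtype=float)
    mean = matrix.mean(axis=0)
    std = matrix.std(axis=0)
    # Leave constant columns at 0 instead of dividing by zero.
    std[std == 0.0] = 1.0
    return (matrix - mean) / std


# Invented values standing in for a small data matrix.
raw = np.array([[1.0, 10.0, 5.0],
                [2.0, 20.0, 5.0],
                [3.0, 60.0, 5.0]])
normed = standardize_columns(raw)
print(normed.mean(axis=0))  # each column mean is (numerically) 0
print(normed.std(axis=0))   # 1 for varying columns, 0 for the constant one
```

After this step the values in every matrix are expressed in standard deviations from the column mean, which is what makes scores from corpora of very different sizes comparable.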

Finally, as with ``sem_extract_pipeline``, you can also run ``comparison`` from the command line. Simply type::

    python Data_Production/compare_vectors.py --base BASE --english ENGLISH --greek GREEK --zscore ZSCORE [--norm]
    
The words in all caps represent the values for the parameters listed above. Include the ``--norm`` flag only if you wish to normalize your data; otherwise, simply leave it out. The data produced by ``comparison`` is sufficient to begin investigating the semantic similarities and differences between several corpora for your target word.

Go to:

* :doc:`Index <index>`
* :doc:`Chapter 1 <chapter1>`
* :doc:`Chapter 2 <chapter2>`
* :doc:`Chapter 4 <chapter4>`