Chapter 3 - “Church” in the New Testament

Comparing Word Vectors

class Data_Production.compare_vectors.comparison(base, english, greek, measure, norm=False, zscore=1.0, **kwargs)[source]

Compares the vectors of a single word across the data from several different corpora. Note that the data from the different corpora must be normalized, preferably using sklearn.preprocessing.scale.

Parameters:
  • base (str) – the directory containing the sub-directories that contain the data for the different corpora
  • english (str) – the English transcription of the word being analyzed (used only in file naming)
  • greek (str) – the word in the alphabet of the target language. It must be written exactly as it is present in the corpora!
  • measure (str) – the type of data to use in the comparison: cosine similarity (‘CS’), log-likelihood (‘LL’), positive pointwise mutual information (‘PPMI’), or raw co-occurrence counts (‘cooc’)
  • norm (bool) – whether the data needs to be normalized
Variables:
  • corpora – the parameter information for each corpus to be used. Each corpus is represented by a tuple containing the following information: corpus name (str), which should match the name of the parent folder in which the text files for the corpus are kept; best window size (str), the size of the context window as determined by ParamTester; minimum occurrences (int), the minimum number of times a word had to occur in your corpus before being used to produce data in SemPipeline; and window type (bool), whether the data for that corpus was produced using a weighted (True) or unweighted (False) context window
  • base – passed on from the base parameter
  • ekk_rows – an empty dictionary that will contain the vectors for each corpus
  • english – passed on from the english parameter
  • greek – passed on from the greek parameter
  • prefix – part of the naming convention for the files from which the vectors will be extracted. Determined by the measure parameter
  • svd – part of the naming convention for the files from which the vectors will be extracted. Determined by the measure parameter
  • norm – passed on from the norm parameter

The central task of chapter 3 is to discover which of several corpora might best reflect, and thus point to the source for, the semantics of the word ἐκκλησία (ekklesia, English “assembly”). In order to do this, I calculated the cosine similarity between the cosine similarity vectors for ἐκκλησία in six different corpora: the New Testament, the Septuagint (Greek Old Testament), Philo, Josephus, Plutarch, and a mixed corpus of classical Greek literature extracted from the Perseus Digital Library. Automating these calculations is the purpose of the compare_vectors.comparison class, which takes six parameters.
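To make the core operation concrete, the comparison of one word’s vectors from two corpora boils down to a single cosine similarity calculation between two aligned, normalized vectors. The following sketch is my own illustration, not code from the class, and the file names are placeholders:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical vectors for the same target word in two corpora,
# assumed to be aligned to a shared vocabulary and already normalized.
vec_nt = np.load('nt_ekklesia_vector.npy')    # placeholder file name
vec_lxx = np.load('lxx_ekklesia_vector.npy')  # placeholder file name

# cosine_similarity expects 2-D arrays, so reshape each vector to (1, n).
sim = cosine_similarity(vec_nt.reshape(1, -1), vec_lxx.reshape(1, -1))[0, 0]
print('Similarity of the two vectors: %.4f' % sim)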

The first, base, is a string giving the top-level folder in which the data for all of the corpora are contained. The folder structure this class assumes for each corpus’s data depends on the corpora being used, the word under investigation, and the parameters used for each corpus. I will discuss this folder structure in more detail below, after the other parameters.

The second parameter, english, is a string representing the transcription of the word into Latin characters. For my purposes, I used the string ‘ekklesia’ here. The third parameter, greek, is a string representing the word as it is written in its native alphabet. This need not be Greek; I simply named the parameter greek because that is the language I am working with. Again, for my purposes, I set the value of greek to ‘ἐκκλησία’.

The fourth parameter is measure. There are only four accepted values for this: ‘CS’, ‘LL’, ‘PPMI’, and ‘cooc’. The value of measure determines the type of vectors used to calculate the similarity. I used ‘CS’ because I wanted to compare the vectors in terms of how similar ἐκκλησία was to other words in each of the corpora. If I had wanted to base the investigation on the co-occurrence patterns of ἐκκλησία instead, I would have used ‘LL’ or ‘PPMI’ to base it on the statistical significance of the co-occurrence counts, or ‘cooc’ to base it on the raw co-occurrence counts.

The fifth parameter, norm, is a boolean value that determines whether normalization needs to be run on the input matrices. In order to compare the data from different corpora, the matrices must be normalized using sklearn.preprocessing.scale. If this has already been done, give norm the value False. If not, give it the value True and the data will be normalized and saved before being analyzed. This normalization can take anywhere from a few seconds to a few hours, depending on the size of the corpus and the hardware being used. However, once a corpus has been normalized, it need not be normalized again.
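Putting the parameters discussed so far together with zscore (introduced below), a call to the class might look like the following sketch. The import path follows the class path given above; the base directory is a placeholder:

from Data_Production.compare_vectors import comparison

comp = comparison(base='/path/to/corpora',  # placeholder directory
                  english='ekklesia',
                  greek='ἐκκλησία',
                  measure='CS',
                  norm=False,
                  zscore=1.0)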

The sixth and final parameter, zscore, is the minimum Z-score to be used when producing the lists of the most similar words between the two corpora in each corpus pair. This parameter and its results require a bit more explanation. First, remember that the point of this class is to find out how similar a single word’s vectors are across several different corpora. In order to discover what makes the vectors as similar as they are, comparison produces CSV files listing the word scores that are most similar between the two vectors in each corpus pair. Since the similarity score produced by the cosine similarity metric is based on the similarity of the individual scores in each vector, the scores that are closest to each other in the two vectors are the ones most responsible for the similarity between the vectors. comparison, then, automatically checks the vectors in every corpus pair to find the single-word scores with the smallest difference and saves these words, along with the difference in their Z-scores, to a CSV file in the base directory called CORPUS1_CORPUS2_zscore=ZSCORE_top_100_words.txt, where CORPUS1 is the name of the first corpus, CORPUS2 that of the second, and ZSCORE the value of the zscore parameter.

The reason I have implemented the zscore parameter, instead of automatically including every word in this pairwise comparison, is that words with a very low Z-score, in whatever measure (whether cooc, LL, PPMI, or CS), are less important to the semantics of the target word than words with a high score. Therefore, in order to determine which of the most semantically important words are similar between two corpora, one can set the zscore parameter to the desired magnitude. The trick is to figure out which level produces the most reasonable results. I have found that a Z-score of 1.0 tends to produce interpretable results, but this will always depend on the two corpora under investigation. If the two corpora use the target word in a very similar manner, a Z-score of 1.0 might actually produce too many results, with many reflecting a syntactic or grammatical, rather than semantic, relationship. If, however, the two corpora differ significantly in how they use the target word, a Z-score of 1.0 might produce almost no results. You as the user must decide what produces the best amount of interpretable data and set your zscore accordingly.
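As an illustration of the filtering just described (my own reconstruction, not the class’s internal code), one plausible reading is to keep only words whose Z-score passes the threshold in both vectors and then rank the survivors by the difference between their two scores:

import numpy as np

def closest_shared_words(vec_a, vec_b, vocab, zscore=1.0, top_n=100):
    # vocab is the shared word list aligned with both vectors.
    # Keep words whose Z-score magnitude meets the threshold in both
    # vectors (an assumption about how the cut-off is applied), then
    # rank them by the smallest difference between the two Z-scores.
    vec_a, vec_b = np.asarray(vec_a), np.asarray(vec_b)
    vocab = np.asarray(vocab)
    mask = (np.abs(vec_a) >= zscore) & (np.abs(vec_b) >= zscore)
    diffs = np.abs(vec_a[mask] - vec_b[mask])
    order = np.argsort(diffs)[:top_n]
    return list(zip(vocab[mask][order], diffs[order]))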

Before moving on to the expected folder structure, I should mention one instance variable that needs to be set within the code itself: corpora. corpora is a list containing one 4-tuple for each corpus that you want to compare. Each corpus’s tuple is made up of its name (str), which should be the name of the folder in which all textual data for that corpus is kept; its optimal window size (str), as determined by Data_Production.sem_extract_pipeline.SemPipeline; the minimum number of occurrences (int) upon which the calculations were based; and weighted (bool), representing whether a weighted window type was used. A sketch of what this list might look like follows.
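Based on the New Testament example discussed below, the NT entry would look like the first tuple here; the other entries are placeholders for whichever corpora you are comparing:

# (corpus name, window size, minimum occurrences, weighted window?)
corpora = [
    ('NT', '16', 1, True),      # matches the example below
    ('LXX', '20', 1, True),     # placeholder values
    ('Philo', '20', 1, False),  # placeholder values
]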

So, now that I have introduced these parameters and variables, why are they important? Primarily, they determine the location of the data to be used in the similarity calculations. This class assumes that the data being analyzed was produced by the Data_Production.sem_extract_pipeline.SemPipeline class and, thus, that it follows the folder and naming conventions automatically produced by that class. The first of these conventions is that the data will be in a folder BASE/NAME/ENGLISH/WIN_SIZE. That means that if I want to analyze the data I produced from the New Testament (which I call ‘NT’) for ‘ekklesia’ with a context window size of 16, minimum occurrences of 1, and a weighted context window type, and assuming that BASE is the value given to the base parameter, the comparison class would expect to find the distributional data in the folder BASE/NT/ekklesia/16/. Further, if I wanted to analyze the cosine similarity data that was produced using the log-likelihood measure, the file name it would look for would be LL_cosine_16_lems=False_ekklesia_min_occ=1_no_stops=False_weighted=True_NORMED.dat. As I said, this folder structure and file naming convention should be (almost) automatically created by SemPipeline, assuming that the data was calculated from .txt files in the folder BASE/NT/ekklesia.
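As an illustration of these conventions, the expected folder and file name for the New Testament example can be assembled from the parameters like this. This is my own reconstruction of the pattern spelled out above, with a placeholder base directory:

import os

base, name, english = '/path/to/corpora', 'NT', 'ekklesia'
win_size, min_occ, weighted = '16', 1, True
prefix, svd = 'LL', 'cosine'  # cosine similarity data built on log-likelihood scores

folder = os.path.join(base, name, english, win_size)
file_name = (f'{prefix}_{svd}_{win_size}_lems=False_{english}_'
             f'min_occ={min_occ}_no_stops=False_weighted={weighted}_NORMED.dat')
print(os.path.join(folder, file_name))
# BASE/NT/ekklesia/16/LL_cosine_16_lems=False_ekklesia_min_occ=1_no_stops=False_weighted=True_NORMED.dat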

The ‘almost’ in that previous sentence needs a bit of explanation. The only part of that file name not produced by SemPipeline is the _NORMED portion. This is not done automatically there because normalization can be a resource-intensive process that is unnecessary if you do not want to compare different data sets to each other. If you do, however, it is necessary. As mentioned above, if you have not already normalized your data matrices, simply set the value of norm to True and each matrix will be normalized during the process. The normalized data is automatically saved, so that the next time you run the class, normalization will not be necessary. If you add one or more data matrices to your analysis later, these should be normalized before running comparison so that previously normalized data need not be re-normalized. This can be done with sklearn.preprocessing.scale (see the scikit-learn documentation).
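If you would rather normalize a newly added matrix ahead of time, the step described here amounts to something like this sketch. The file names are placeholders, and you should adjust the loading step to however your .dat matrices were written (np.memmap with the correct dtype and shape is one possibility):

import numpy as np
from sklearn.preprocessing import scale

# Load the un-normalized matrix produced for the new corpus.
mat = np.load('my_corpus_matrix.npy')  # placeholder file name

# scale() centers each column to zero mean and unit variance by default.
normed = scale(mat)

# Save under the _NORMED naming convention described above.
np.save('my_corpus_matrix_NORMED.npy', normed)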

Finally, as with sem_extract_pipeline, you can also run comparison from the command line. Simply type:

python Data_Production/compare_vectors.py --base BASE --english ENGLISH --greek GREEK --zscore ZSCORE [--norm]
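For example, a run using this chapter’s values, with a placeholder base directory, might look like this:

python Data_Production/compare_vectors.py --base /path/to/corpora --english ekklesia --greek ἐκκλησία --zscore 1.0 --norm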

The words in all caps in the template represent the values for the parameters listed above. Include the --norm flag only if you wish to normalize your data; otherwise, simply leave it out. The data produced by comparison is sufficient to begin investigating the semantic similarities and differences between several corpora for your target word.
