Chapter 1 - Finding the Best Parameters for the Corpus

The code for this chapter builds the foundation for the whole study in that it calculates the parameters that are used to produce the data for the rest of the dissertation. The central class in this chapter’s code is the ParamTester class. First comes the basic documentation for this class, followed by a brief explanation of the input parameters.

The ParamTester Class

class Data_Production.sem_extract_pipeline.ParamTester(min_w, max_w, step, c=8, jobs=1, min_count=1, files=None, stops=(), lem_file=None, w_tests='both', l_tests='both', steps=['all'], **kwargs)[source]

Runs parameter testing for the corpus in question. The testing parameters are specified in the self.RunTests function

Parameters:
  • c (int) – the number of cores to use in the co-occurrence calculations
  • jobs (int) – the number of cores to use in the cosine similarity calculations
  • min_count (int) – the minimum occurrence count; words occurring fewer times than this will be ignored. The purpose here is memory management. My tests have shown that using all words produces better results.
  • files (str) – the directory path for the .txt files that make up the corpus
  • stops ((str)) – the stop words to be ignored in the calculations
  • min_w (int) – the minimum context window size to use
  • max_w (int) – the maximum context window size to use
  • step (int) – the size of the steps between min_w and max_w
  • lem_file (str) – the path and filename for the word occurrence dictionary pickle
  • w_tests (str) – whether to use weighted (“True”) or unweighted (“False”) window types or “both”
  • l_tests (str) – whether to use word lemmas (“True”) or inflected forms (“False”) or “both”
  • steps (list) – the steps in the calculation process to perform. Allowed: ‘all’, ‘coocs’, ‘LL’, ‘PPMI’, ‘LL_CS’ (cosine similarity based on an existing Log-likelihood matrix), or ‘PPMI_CS’.
Variables:
  • c – the number of cores to use in the co-occurrence calculations
  • stops – list of stop words to ignore during the calculations
  • min_count – the minimum number of occurrences for a word to be used in the calculations
  • files – the directory path for the .txt files that make up the corpus
  • sim_algo – the similarity algorithm to use in the calculations
  • ind – the indices for the rows and columns of the matrix (i.e., the words) - filled in self.cooc_counter
  • cols (int) – the length of self.ind - filled in self.cooc_counter
  • coll_df – transformed into numpy.memmap and filled in self.cooc_counter
  • LL_df – transformed into numpy.memmap and filled in self.LL
  • PPMI_df – transformed into numpy.memmap and filled in self.PPMI
  • CS_df – transformed into numpy.memmap and filled in self.CS
  • stat_df – filled with either self.PPMI_df or self.LL_df in self.CS
  • param_dict – filled with the scores for each set of parameters in self.RunTests

As you can see, there are a lot of input parameters for this class. I will focus here on the less technical ones that are important for testing your corpus correctly.

Formatting Your Corpus

files (str)

The parameter that is at once the most important, the easiest to understand, and yet in need of the most explanation is files. files is a string that represents the directory path to the .txt files that make up your corpus, e.g., /home/matt/corpus. I would recommend that your corpus be represented by one file for each work. Otherwise the co-occurrence context window (explained more below) will span from one work to another.

So far, so good. But the format of these files is also very important. The formatting is based on TEI markup, though the files need not be valid TEI to be used. In practice, this means that ParamTester expects each individual word in each file to be on its own line and to be enclosed in a <w> tag. For example:

<w>Βίβλος</w>
<w>γενέσεως</w>
<w>Ἰησοῦ</w>
<w>χριστοῦ</w>
<w>υἱοῦ</w>
<w>Δαυὶδ</w>
<w>υἱοῦ</w>
<w>Ἀβραάμ</w>

If you plan on always using only the inflected words in your corpus, then this markup is enough. If, however, you want to be able to use lemmatized versions of the text, there is a special way to format each of these tags, once again in accordance with TEI markup principles without requiring complete TEI compliance. An example:

<w lem="βίβλος">Βίβλος</w>
<w lem="γένεσις">γενέσεως</w>
<w lem="Ἰησοῦς">Ἰησοῦ</w>
<w lem="Χριστός">χριστοῦ</w>
<w lem="υἱός">υἱοῦ</w>
<w lem="Δαυίδ">Δαυὶδ</w>
<w lem="υἱός">υἱοῦ</w>
<w lem="Ἀβραάμ">Ἀβραάμ</w>

The @lem attribute on each <w> tag contains the word lemma information for each word. Note that having the value of each @lem attribute surrounded by double quotes, i.e., ", instead of single quotes, i.e., ' is quite important if you want to be able to use this lemma information. If your texts are in this format, then you will be able to perform your calculations on either the lemmatized or the inflected text.
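As a minimal sketch of how such lines can be read (this is not the pipeline's own parser, just an illustration of the format), a regular expression that only matches double-quoted @lem values also shows why the double quotes matter:

```python
import re

# Matches <w lem="...">...</w> as well as plain <w>...</w>.
# Only double-quoted @lem values are recognized, mirroring the
# requirement described above.
W_TAG = re.compile(r'<w(?:\s+lem="([^"]*)")?>([^<]+)</w>')

def parse_w_line(line):
    """Return (inflected_form, lemma_or_None) for one <w> line, else None."""
    m = W_TAG.search(line)
    if m is None:
        return None
    return m.group(2), m.group(1)

parse_w_line('<w lem="γένεσις">γενέσεως</w>')  # → ('γενέσεως', 'γένεσις')
parse_w_line('<w>Βίβλος</w>')                  # → ('Βίβλος', None)
parse_w_line("<w lem='γένεσις'>γενέσεως</w>")  # → None (single quotes)
```

Note how the single-quoted variant fails to parse entirely, losing both the lemma and the inflected form.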

l_tests (str)

The mention of lemmatized or inflected texts brings us to the next parameter, l_tests. This is a string value representing whether the lemmas in the text should be used for calculation, the inflected words, or whether both tests should be run. A value of “True” will test only the lemmatized text, “False” will test only the inflected words, and “both” will run both tests.

The Parameters that ParamTester Tests

The following sections give a brief introduction of the parameters that ParamTester tests. For more information on these parameters, see the dissertation itself.

min_w (int), max_w (int), and step (int)

These three related parameters give the range for the context-window sizes that will be tested. min_w gives the minimum window size to be tested, max_w the maximum, and step the size of the step to be taken in between min_w and max_w. So, for instance, if you wanted to test the window sizes 5, 10, 15, 20, 25, 30, you would set min_w=5, max_w=30, and step=5.
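The resulting set of window sizes can be thought of as a simple Python range over these three values:

```python
# The window sizes tested for min_w=5, max_w=30, step=5
min_w, max_w, step = 5, 30, 5
window_sizes = list(range(min_w, max_w + 1, step))
print(window_sizes)  # → [5, 10, 15, 20, 25, 30]
```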

w_tests (str)

w_tests takes the same format and the same arguments as l_tests, i.e., True, False, and both. This parameter determines whether to use a weighted context window, an unweighted context window, or to test both types. An unweighted context window counts every word in the context window only one time, no matter its distance from the target word. A weighted context window, on the other hand, counts the words that are closer to the target word more times than the ones that are farther away. For instance, if you were using a 4-word context window, a weighted context window would count the words right before and after the target word 4 times, the words 2 words before and after 3 times, the words 3 words before and after 2 times, and the words 4 words away only once.

My own experience with Greek shows that for some corpora the weighted window is best, while for others the unweighted window is best. So I would certainly suggest testing both of these for all parameters.
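The weighting scheme described above can be sketched with a toy co-occurrence counter (an illustration of the idea, not the pipeline's own implementation, which works on the <w>-tagged files and memory-mapped matrices):

```python
from collections import Counter

def cooc_counts(tokens, window, weighted):
    """Count co-occurrences of every token with its neighbors.

    With weighted=True, a neighbor at distance d from the target
    counts (window - d + 1) times; unweighted, every neighbor in
    the window counts exactly once.
    """
    counts = Counter()
    for i, target in enumerate(tokens):
        for d in range(1, window + 1):
            weight = (window - d + 1) if weighted else 1
            for j in (i - d, i + d):
                if 0 <= j < len(tokens):
                    counts[(target, tokens[j])] += weight
    return counts

toks = ["a", "b", "c", "d", "e", "f"]
w = cooc_counts(toks, window=4, weighted=True)
u = cooc_counts(toks, window=4, weighted=False)
print(w[("a", "b")], u[("a", "b")])  # → 4 1 (adjacent pair)
print(w[("a", "e")], u[("a", "e")])  # → 1 1 (distance 4)
```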

stops ((str))

stops is a tuple that contains the words that you want to designate as stop words for your corpus. The default value is an empty tuple since my own tests with Greek have shown that results are improved by retaining the stop words during the calculations but then ignoring them, if desired, when interpreting the results. Only one set of stop words can be tested at a time. If you wish to test the results of several different stop word lists, you will need to designate and run tests on each of them individually.

sim_algo

The only class instance variable that I will mention here is sim_algo. ParamTester is not set up to take this as a parameter, instead defaulting to Cosine Similarity. If, however, you want to change the similarity algorithm used to compute the similarity of the distributional vectors, sim_algo accepts any str that is a valid metric for the sklearn.metrics.pairwise.pairwise_distances function.
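Keep in mind that pairwise_distances returns distances, not similarities, so for the default metric the cosine similarity is 1 minus the returned value. A small sketch with toy vectors:

```python
import numpy as np
from sklearn.metrics.pairwise import pairwise_distances

# Three toy "significance vectors": rows 0 and 1 point in the same
# direction, row 2 is orthogonal to both.
vecs = np.array([[1.0, 0.0, 1.0],
                 [2.0, 0.0, 2.0],
                 [0.0, 1.0, 0.0]])

# pairwise_distances returns cosine *distances*; similarity = 1 - distance
sims = 1 - pairwise_distances(vecs, metric='cosine', n_jobs=1)
```

Here sims[0, 1] is 1.0 (parallel vectors) and sims[0, 2] is 0.0 (orthogonal vectors).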

The CatSimWin Class

class Chapter_2.LouwNidaCatSim.CatSimWin(algo, rng, lems=False, CS_dir=None, dest_dir=None, sim_algo=None, corpus=('SBL_GNT_books', 1, 1.0, True), lem_file=None)[source]

Calculates the average similarity and the z-score of this similarity for all words that share the same semantic sub-domains in the Louw-Nida lexicon

Parameters:
  • algo (str) – the significance algorithm used to produce the cosine-similarity matrices used
  • rng (list) – the individual window sizes used for the discrete calculations
  • lems (bool) – whether the input matrices were calculated with lemmatized texts or not
  • CS_dir (str) – the directory path where the cosine-similarity matrices are located
  • dest_dir (str) – the directory path to save the results
  • sim_algo (str) – which similarity algorithm was used in the calculations
  • corpus (tuple) – tuple with the name of the corpus (str), the minimum number of occurrences used (int), Caron’s svd exponent (float - 1.0 if none was used), and whether stop words were included (bool)
  • lem_file (str) – the file path and filename of the occurrence dictionary pickle that shows the number of times each word occurs in the corpus

CatSimWin is called from inside the ParamTester class and its parameters are set automatically there. Since it is an integral part of how the data produced using the different parameters is compared, it deserves some attention here.

The module LouwNidaCatSim calculates how closely a set of data mirrors the semantic sub-domains present in the Louw-Nida Lexicon of the New Testament. A Python dictionary object with all of the semantic sub-domains and the words that belong to them is required to perform these calculations. Such a pickled file is included in the repository in the Chapter_2 folder. [1]

The CatSimWin class is actually capable of testing multiple window sizes in the same run, but ParamTester only sends a single window size to it at once. CatSimWin then calculates the mean and the Z-score for all of the other different parameters for that single window size and returns the results to ParamTester, which then writes them out to a tab-delimited CSV file that can be checked later. Calculating the Z-score allows the results from different parameter runs to be compared with each other. The results with the highest Z-score show the largest separation between the average similarity among all words and the similarity of the words that belong to the same semantic sub-domains, and are thus the best at mirroring the semantics represented in the Louw-Nida lexicon.
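The idea behind the comparison can be sketched with toy numbers (the exact statistic CatSimWin computes may differ in detail; this only illustrates the principle):

```python
import numpy as np

# Toy data: cosine similarities between all word pairs, and the subset
# of similarities for pairs sharing a Louw-Nida sub-domain.
all_sims = np.array([0.1, 0.2, 0.3, 0.4, 0.5])
domain_sims = np.array([0.4, 0.5])

# Z-score: how far the sub-domain mean sits above the overall mean,
# in units of the overall standard deviation.
z = (domain_sims.mean() - all_sims.mean()) / all_sims.std()
```

A higher z means the words that share a sub-domain are more clearly separated from the background similarity of the corpus, which is what makes runs with different parameters comparable.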

In order to run all the tests for a certain set of parameters, call the RunTests method on a ParamTester instance to automatically run all of the selected tests for all of the selected parameters. You can also run this from the command line, replacing the ALL_CAPS words below with the appropriate value for each parameter:

$ python Data_Production/sem_extract_pipeline.py \
    --files FILES \
    --c C \
    ParamTester \
    --min_w MIN_W \
    --max_w MAX_W \
    --step STEP \
    --w_tests W_TESTS \
    --l_tests L_TESTS \
    --steps all

Graphing the Results

class Chapter_1.consolidate_test_results.win_tests(orig, corpus, file_pattern='*.csv')[source]

Collects and graphs the results of multiple runs of Data_Production.sem_extract_pipeline.ParamTester

Parameters:
  • orig (str) – the folder in which the .csv files containing the results are located
  • corpus (str) – the corpus that is being analyzed. This should be the same string used in the file names to designate the corpus (e.g., NT)
  • file_pattern (str) – the file extension of the files containing the results

Once you have run ParamTester on your corpus, you can automatically graph the results using the win_tests class. The function build_df consolidates the results into a single pandas.DataFrame object and saves this object as a CSV file in the orig folder. Then, running the graph_it function will produce a simple graphical representation of the data with the maximum value for each set of parameters labelled, which it will also save in the orig folder.
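The consolidation step can be pictured as follows (a sketch of what build_df does; the real method's column handling and file naming may differ, and the file names below are invented for the demo):

```python
import glob
import os
import tempfile

import pandas as pd

def consolidate_results(orig, file_pattern='*.csv'):
    """Gather every results file in `orig` into a single DataFrame.

    ParamTester writes tab-delimited results files, so sep='\t' is used.
    """
    paths = sorted(glob.glob(os.path.join(orig, file_pattern)))
    return pd.concat((pd.read_csv(p, sep='\t') for p in paths),
                     ignore_index=True)

# Demo with two tiny tab-delimited results files in a temp directory
d = tempfile.mkdtemp()
for name, score in [('NT_run1.csv', 1.06), ('NT_run2.csv', 0.91)]:
    with open(os.path.join(d, name), 'w') as f:
        f.write('window\tz_score\n10\t%s\n' % score)

df = consolidate_results(d)
```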

The SemPipeline Class

class Data_Production.sem_extract_pipeline.SemPipeline(win_size=10, lemmata=True, weighted=True, algo='PPMI', sim_algo='cosine', files=None, c=8, occ_dict=None, min_count=1, jobs=1, stops=True, **kwargs)[source]

This class produces matrices representing co-occurrence counts, statistical significance, and similarity data for a corpus

Parameters:
  • win_size (int) – context window size
  • lemmata (bool) – whether to use word lemmata
  • weighted (bool) – whether to use a weighted window type
  • algo (str) – the significance algorithm to use. ‘LL’ and ‘PPMI’ are implemented
  • sim_algo (str) – the similarity algorithm to use. ‘CS’ is implemented
  • files (str) – the directory in which the individual .txt files are held
  • c (int) – the number of cores to use in self.cooc_counter (will be removed in the future)
  • occ_dict (str) – the path and filename for the occurrence dictionary pickle
  • min_count (int) – the minimum occurrence count below which words will not be counted
  • jobs (int) – number of jobs to use during the cosine similarity calculations
  • stops (bool) – whether to include stop words or not (True means to include them)
Variables:
  • w – the context window size
  • lems – whether a lemmatized or unlemmatized text will be used
  • weighted – whether a weighted or unweighted context window will be used (True == weighted)
  • algo – which significance algorithm will be used (PPMI or LL)
  • sim_algo – the similarity algorithm to be used
  • dir – the directory path in which the texts are located
  • c – the number of cores to use during co-occurrence counting
  • occ_dict – the location for the dictionary representing word counts for every word
  • min_count – the minimum threshold of occurrences for the words to be calculated
  • jobs – the value to be used for n_jobs in the cosine similarity calculations
  • stops – a list of stop-words to ignore during the calculations
  • ind – the indices for the rows and columns of the matrix (i.e., the words) - filled in self.cooc_counter
  • cols (int) – the length of self.ind - filled in self.cooc_counter
  • coll_df – transformed into numpy.memmap and filled in self.cooc_counter
  • LL_df – transformed into numpy.memmap and filled in self.LL
  • PPMI_df – transformed into numpy.memmap and filled in self.PPMI
  • CS_df – transformed into numpy.memmap and filled in self.CS
  • stat_df – filled with either self.PPMI_df or self.LL_df in self.CS
  • dest – the destination directory for all files - filled in self.makeFileNames
  • corpus – the name of the corpus under investigation - filled in self.makeFileNames

Once you have determined the best parameters for your corpus, you then use the SemPipeline class to actually produce the data that you can then use for whatever task you wish. The parameters for this class are the same as those for ParamTester described above. The only significant differences between SemPipeline and ParamTester are that SemPipeline only accepts a single set of parameters at a time and that it saves all the data from every step, including the list of words in the corpus, the co-occurrence counts for every word with every other word, the significance scores for all the words with each other, and the similarity scores for every word’s significance vector with every other word’s. These results are not saved in ParamTester because the individual results files can be as large as tens or even hundreds of gigabytes; saving them only during the SemPipeline process avoids taking up disk space with results that will not be used.
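Since SemPipeline stores these large matrices as numpy.memmap files on disk, reading them back later requires knowing the shape and dtype they were written with. A toy round trip (the real files' names, shapes, and dtypes depend on the parameters of your run):

```python
import os
import tempfile

import numpy as np

# Write a small matrix to disk as a memmap, the way the pipeline's
# large matrices are stored.
path = os.path.join(tempfile.mkdtemp(), 'CS_demo.dat')
n = 4

out = np.memmap(path, dtype='float32', mode='w+', shape=(n, n))
out[:] = np.eye(n, dtype='float32')
out.flush()
del out  # release the write handle

# Reading requires the same shape and dtype the file was written with
sims = np.memmap(path, dtype='float32', mode='r', shape=(n, n))
```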

Chapters 2 through 4 will discuss particular use cases for the data that is produced with the classes described here.

Footnotes

[1]You can also create your own such dictionary from the data at http://www.laparola.net/greco/louwnida.php using the Chapter_2.LouwNidaExtract.extractLouwNida class.
