4 ngrams - N-gram models
Copyright 2005, 2006, 2007, 2008, 2009, 2010 by
Damir Ćavar, Petar Garžina, Larisa Grčić, Tanja Gulan, Damir Kero, Robert Paleka, Franjo Pehar, Pavle Valerjev
The nltk/ngrams module provides procedures to generate and process n-gram models from token lists. For linguistic analysis and processing, the tokens are usually strings that represent words or character sequences. The procedures for n-gram model generation and processing in this module are in fact more generic: a sequence of tokens of any data type can be used to generate n-grams of those tokens.
An explanation of n-grams and their use in Computational Linguistics and Natural Language Processing can be found in the following resources:
4.1 General n-gram procedures
(token-sequence->ngrams seq [n ngrams]) → hashtable?
seq : (or/c list? vector? string?)
n : (and/c (>/c 0) integer?) = 2
ngrams : hashtable? = (make-hashtable equal-hash equal?)
Returns an n-gram data structure as a hashtable with n-grams as keys and their absolute frequencies as the corresponding values. The seq parameter must be a sequence type, that is, a list or vector of tokens, or a string. A string is interpreted as a sequence of characters, so the n-grams generated from a string consist of n tokens of the type char. The number of tokens per n-gram is specified by the optional parameter n, which must be a positive integer and defaults to 2.
The procedure has a side effect if the ngrams parameter is provided: the extracted n-grams are added to the model in ngrams. If the ngrams parameter is not specified, it defaults to an empty hashtable.
The resulting n-gram model is represented in a hashtable data structure, with the n-grams as keys of the type vector, and the values of type integer.
The procedure returns an empty hashtable if the optional ngrams parameter is not provided and either the seq parameter is not of a sequence type or the n parameter is < 1. If an ngrams parameter is provided and either the seq parameter is not of a sequence type or the n parameter is < 1, the ngrams hashtable is returned unchanged.
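The counting behavior described above can be sketched in Python. This is an illustrative re-implementation, not the module's Scheme code: the name token_sequence_to_ngrams is hypothetical, a dict stands in for the hashtable, and tuples stand in for the vector keys.

```python
def token_sequence_to_ngrams(seq, n=2, ngrams=None):
    """Count the n-grams in seq; keys are tuples of n tokens, values are counts."""
    if ngrams is None:
        ngrams = {}  # stands in for the default empty hashtable
    # Invalid input: hand the model back unchanged (empty if none was passed).
    if not isinstance(seq, (list, tuple, str)) or not isinstance(n, int) or n < 1:
        return ngrams
    for i in range(len(seq) - n + 1):
        key = tuple(seq[i:i + n])  # a string yields a tuple of characters
        ngrams[key] = ngrams.get(key, 0) + 1
    return ngrams


words = ["the", "cat", "saw", "the", "cat"]
word_bigrams = token_sequence_to_ngrams(words)
char_bigrams = token_sequence_to_ngrams("abab", 2)
```

Passing an existing model as the third argument updates that model in place, which mirrors the side effect described for the ngrams parameter.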
4.2 N-gram model conversion
(filter-ngrams ngrams filter-tokens remove) → hashtable?
ngrams : hashtable?
filter-tokens : list?
remove : boolean?
Returns a new data structure of the type hashtable with n-grams as keys and frequencies as values. If the remove flag is provided as #t, the returned hashtable will not contain any n-gram that contains an element that is a member of the filter-tokens list. If the remove flag is provided as #f, the returned hashtable will contain only n-grams that contain an element that is a member of the filter-tokens list.
(filter-ngrams! ngrams filter-tokens remove) → hashtable?
ngrams : hashtable?
filter-tokens : list?
remove : boolean?
Returns exactly the same data structure of the type hashtable that is specified in the parameter ngrams, with n-grams as keys and the corresponding n-gram frequencies as values. As with filter-ngrams, if the remove flag is set to #t, every n-gram with a token that is a member of the list filter-tokens is removed from the ngrams hashtable. If the remove flag is set to #f, only the n-grams that contain a token that is a member of the list filter-tokens remain in the resulting hashtable. The data structure ngrams is changed in place.
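The difference between the copying and the in-place variant can be sketched in Python. The function names below are hypothetical stand-ins, a dict stands in for the hashtable, and the tokens are assumed to be hashable so that they can be collected in a set.

```python
def filter_ngrams(ngrams, filter_tokens, remove):
    """Return a new dict; remove=True drops matching n-grams, False keeps only them."""
    listed = set(filter_tokens)
    return {gram: freq for gram, freq in ngrams.items()
            if any(tok in listed for tok in gram) != remove}


def filter_ngrams_in_place(ngrams, filter_tokens, remove):
    """Apply the same filter, but delete entries from the dict that was passed in."""
    listed = set(filter_tokens)
    for gram in list(ngrams):  # iterate over a copy of the keys while deleting
        if any(tok in listed for tok in gram) == remove:
            del ngrams[gram]
    return ngrams


model = {("the", "cat"): 2, ("cat", "saw"): 1, ("saw", "the"): 1}
without_the = filter_ngrams(model, ["the"], True)
only_the = filter_ngrams(model, ["the"], False)
```

After these two calls model still holds all three bigrams, whereas filter_ngrams_in_place would shrink model itself and return the same object.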
(ngrams->bigrams ngrams) → hashtable?
ngrams : hashtable?
Returns a new data structure of the type hashtable with bigrams as keys and frequencies as values. If the length of the n-grams in the parameter ngrams is smaller than 2, or if the ngrams data structure is empty, an empty bigrams hashtable is returned.
The frequencies of the bigrams in the returned data structure are not the true frequencies of these bigrams in the original text source or corpus, since they are derived from the n-gram counts rather than counted from the source.
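Why derived bigram frequencies can differ from the true ones is easiest to see in a small example. The Python sketch below assumes one plausible derivation, which the documentation does not confirm: each n-gram is split into its adjacent token pairs, and the n-gram's frequency is added to every pair. The name ngrams_to_bigrams is hypothetical.

```python
def ngrams_to_bigrams(ngrams):
    """Split each n-gram into adjacent pairs and add its count to every pair."""
    bigrams = {}
    for gram, freq in ngrams.items():
        # zip yields nothing for n-grams shorter than 2, so the result stays empty
        for pair in zip(gram, gram[1:]):
            bigrams[pair] = bigrams.get(pair, 0) + freq
    return bigrams


# Trigram model of the token sequence a b c d
trigrams = {("a", "b", "c"): 1, ("b", "c", "d"): 1}
derived = ngrams_to_bigrams(trigrams)
```

Under this assumption the pair (b, c) receives a count of 2, although it occurs only once in the sequence a b c d, because it is contained in both trigrams.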
(relativize-ngrams ngram) → hashtable?
ngram : hashtable?
Returns a new hashtable with n-grams as keys. The n-grams are of the type vector; the values are relativized by dividing each individual value by the sum of all values. The procedure does not perform any type or value checking. The values are expected to be absolute frequency counts of the integer type; the resulting values are of the type rational.
(relativize-ngrams! ngram) → hashtable?
ngram : hashtable?
Relativizes the absolute frequencies of n-grams in the same way as relativize-ngrams, with the side effect of changing the values within the submitted n-gram model (hashtable).
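Relativization can be sketched in Python as follows. The function names are hypothetical, a dict stands in for the hashtable, and fractions.Fraction is used here to mimic exact rational values.

```python
from fractions import Fraction


def relativize_ngrams(ngrams):
    """Return a new dict in which each count is divided by the sum of all counts."""
    total = sum(ngrams.values())
    return {gram: Fraction(freq, total) for gram, freq in ngrams.items()}


def relativize_ngrams_in_place(ngrams):
    """Overwrite each count in the given dict with its relative frequency."""
    total = sum(ngrams.values())
    for gram in ngrams:  # only values change, so iterating over the keys is safe
        ngrams[gram] = Fraction(ngrams[gram], total)
    return ngrams


counts = {("a", "b"): 2, ("b", "a"): 1, ("b", "c"): 1}
relative = relativize_ngrams(counts)
```

The relative frequencies of a non-empty model sum to 1. Like the documented procedure, the sketch does no type or value checking.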
4.3 Output procedures
(ngrams->dot ngrams [graph-type]) → string?
ngrams : hashtable?
graph-type : symbol? = 'digraph
Returns a string with the DOT representation of a graph with all interconnections between tokens in n-grams. The optional graph-type parameter defaults to 'digraph.
The resulting DOT data (or file) can be processed and visualized using Graphviz, and many other related tools. More information on DOT, Graphviz and other visualization software is available at the Graphviz homepage.
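What such DOT output might look like can be sketched in Python. The exact layout produced by the module is not documented here, so everything below is an assumption: the function name ngrams_to_dot, the graph name ngrams, one edge per pair of adjacent tokens, and no frequency labels.

```python
def ngrams_to_dot(ngrams, graph_type="digraph"):
    """Build DOT text with one edge for each pair of adjacent tokens in an n-gram."""
    edge_op = "->" if graph_type == "digraph" else "--"  # DOT edge operators

    def quote(token):
        # DOT allows quoted string IDs; embedded double quotes must be escaped
        return '"' + str(token).replace('"', '\\"') + '"'

    edges = set()
    for gram in ngrams:
        for left, right in zip(gram, gram[1:]):
            edges.add((quote(left), quote(right)))

    lines = [graph_type + " ngrams {"]
    for left, right in sorted(edges):
        lines.append("  " + left + " " + edge_op + " " + right + ";")
    lines.append("}")
    return "\n".join(lines)


model = {("the", "cat"): 2, ("cat", "saw"): 1}
directed = ngrams_to_dot(model)
undirected = ngrams_to_dot(model, "graph")
```

Either string can be written to a file and rendered with the Graphviz dot command.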
(ngrams->dot-digraph ngrams) → string?
ngrams : hashtable?
Returns a string with the DOT representation of a directed graph (digraph) with all interconnections between tokens in the provided ngrams parameter.
(ngrams->dot-graph ngrams) → string?
ngrams : hashtable?
Returns a string with the DOT representation of an undirected graph with all interconnections between tokens in the provided ngrams parameter.
(ngrams->html-table ngrams [column-titles sorted sort-field]) → string?
ngrams : hashtable?
column-titles : list? = '()
sorted : boolean? = #t
sort-field : symbol? = 'frequency
Returns a string that contains the HTML code representation of n-grams and their frequencies as stored in the hashtable data structure provided in the parameter ngrams. The optional parameter column-titles is a list of strings with the column or header labels for the resulting table. The optional sorted flag defaults to #t, which results in sorted output, in decreasing order for numbers and in increasing order for strings. The optional parameter sort-field is either the symbol 'frequency or any other symbol. If sort-field has the value 'frequency, the resulting table is sorted in decreasing order of the frequency values of the contained n-grams. Otherwise, it is sorted in increasing order of the n-grams themselves.
Each row of the HTML table consists of a cell list with all tokens from the n-gram in one cell, followed by their corresponding value.
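The sorting options and the row layout can be sketched in Python. This is an illustration under assumptions, not the module's output: the function name ngrams_to_html_table is hypothetical, the markup is reduced to bare table, tr, th and td tags without class attributes, and the frequency is placed in a cell of its own.

```python
from html import escape


def ngrams_to_html_table(ngrams, column_titles=(), is_sorted=True,
                         sort_field="frequency"):
    """Render n-grams and their counts as a plain HTML table string."""
    rows = list(ngrams.items())
    if is_sorted:
        if sort_field == "frequency":
            rows.sort(key=lambda item: item[1], reverse=True)  # highest count first
        else:
            rows.sort(key=lambda item: item[0])  # increasing order of the n-grams
    out = ["<table>"]
    if column_titles:
        heads = "".join("<th>" + escape(str(t)) + "</th>" for t in column_titles)
        out.append("<tr>" + heads + "</tr>")
    for gram, freq in rows:
        tokens = escape(" ".join(str(tok) for tok in gram))  # all tokens in one cell
        out.append("<tr><td>" + tokens + "</td><td>" + str(freq) + "</td></tr>")
    out.append("</table>")
    return "\n".join(out)


model = {("a", "b"): 1, ("b", "c"): 3}
by_frequency = ngrams_to_html_table(model, ["N-gram", "Frequency"])
by_ngram = ngrams_to_html_table(model, sort_field="ngram")
```

With the default sort field the row for b c comes first because of its higher count. With any other sort field the rows follow the order of the n-grams themselves.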
The resulting HTML code of the table provides hooks for CSS, JavaScript, or other styling and formatting options. The following class attribute values are used in the three basic tags of the resulting HTML output.