Vocabulary

class matchup.structure.vocabulary.Vocabulary(save, **kwargs)

Bases: object

Core data structure that represents and stores the results of all text processing.

Attributes Summary

idf Get the data structure that represents the IDF weighting
keys Get all keywords present in the vocabulary
sanitizer Sanitizer property getter
tf Get the data structure that represents the TF weighting

Methods Summary

documents_with_keywords(kwds)
generate_idf()
import_collection() Recover a previously generated vocabulary.
import_file(file_path) Given the file path of a document, append the document to an internal structure if the path is correct.
import_folder(folder_path) Generalization of import_file(): receive a folder path and try to append every document in that folder to an internal structure.
index_files() Process the content of all previously imported files, generating the vocabulary data structure ready for use.
maximum_frequencies_per_document()
save() Persist the data structure to disk.

Attributes Documentation

idf
Get the data structure that represents the IDF weighting
Returns:IDF object
keys
Get all keywords present in the vocabulary
Returns:list of all keywords
sanitizer
Sanitizer property getter
Returns:
tf
Get the data structure that represents the TF weighting
Returns:TF object
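
The idf and tf attributes expose the two weighting structures, but this page does not state which formulas or layouts they use. As a minimal sketch, assuming raw term counts for TF and log(N / df) for IDF (both are assumptions, and the helper names term_frequencies and inverse_document_frequencies are hypothetical, not part of matchup), the two tables could be built like this:

```python
import math
from collections import Counter, defaultdict


def term_frequencies(docs):
    """Map each keyword to {document name: raw count} (assumed TF layout)."""
    tf = defaultdict(dict)
    for name, text in docs.items():
        for word, count in Counter(text.lower().split()).items():
            tf[word][name] = count
    return dict(tf)


def inverse_document_frequencies(tf, n_docs):
    """Assumed IDF: log(N / df), with df = number of documents containing the keyword."""
    return {word: math.log(n_docs / len(postings)) for word, postings in tf.items()}


# Two tiny in-memory "documents" used only for illustration.
docs = {
    "a.txt": "the cat sat",
    "b.txt": "the dog sat down",
}
tf = term_frequencies(docs)
idf = inverse_document_frequencies(tf, len(docs))
```

Under this assumed formula, a keyword that appears in every document (such as "the" here) gets an IDF of 0, while a keyword found in only one of the two documents gets log(2).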

Methods Documentation

documents_with_keywords(kwds: Set[str]) → Set[str]
generate_idf()
import_collection() → bool
Recovers a previously generated vocabulary.
Returns:boolean flag that indicates success, or failure if the vocabulary has not been generated yet.
import_file(file_path: str) → bool
Given the file path of a document, this function appends the document to an internal structure, provided the path is correct. Processing of the file can then be started by running index_files().
Parameters:file_path – string that represents a relative or absolute path of a txt file
Returns:boolean flag that indicates if the file has been identified
import_folder(folder_path: str) → bool
Generalization of import_file(). This function receives a folder path and tries to append all documents in that folder to an internal structure. Processing of these files can then be started by running index_files().
Parameters:folder_path – string that represents a relative or absolute path of a folder
Returns:boolean flag that indicates if the folder has been identified
index_files() → None
This function tries to process the content of all previously imported files, generating the vocabulary data structure ready for use.
Returns:None
maximum_frequencies_per_document() → DefaultDict[str, float]
save() → bool
Persist the data structure to disk.
Returns:boolean flag that indicates whether the data structure could be persisted.
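
The methods above suggest a workflow of import, index, query, save, and later recover. The sketch below walks through that order with a hypothetical stand-in class, ToyVocabulary, that mirrors only the documented method names and return types. It is not matchup's implementation: the meaning of the save argument (assumed to be a storage folder), the JSON file name, the whitespace tokenization, the "at least one keyword" matching rule, and the definition of maximum frequency are all assumptions made for illustration.

```python
import json
import tempfile
from collections import Counter, defaultdict
from pathlib import Path


class ToyVocabulary:
    """Hypothetical stand-in; mirrors the documented method names only."""

    def __init__(self, save):
        self.save_path = Path(save)      # assumption: `save` is a storage folder
        self.pending = []                # files imported but not yet indexed
        self.index = defaultdict(dict)   # keyword -> {file name: count}

    def import_file(self, file_path):
        path = Path(file_path)
        if not path.is_file():
            return False                 # the path was not identified
        self.pending.append(path)
        return True

    def import_folder(self, folder_path):
        folder = Path(folder_path)
        if not folder.is_dir():
            return False
        for path in sorted(folder.glob("*.txt")):
            self.import_file(path)
        return True

    def index_files(self):
        # Count whitespace-separated, lowercased words in every pending file.
        for path in self.pending:
            counts = Counter(path.read_text(encoding="utf-8").lower().split())
            for word, count in counts.items():
                self.index[word][path.name] = count
        self.pending.clear()

    def documents_with_keywords(self, kwds):
        # Assumption: a document matches if it contains at least one keyword.
        return {doc for kw in kwds for doc in self.index.get(kw, {})}

    def maximum_frequencies_per_document(self):
        # Assumption: the highest single-keyword count found in each document.
        maxima = defaultdict(float)
        for postings in self.index.values():
            for doc, count in postings.items():
                maxima[doc] = max(maxima[doc], float(count))
        return maxima

    def save(self):
        self.save_path.mkdir(parents=True, exist_ok=True)
        target = self.save_path / "vocabulary.json"
        target.write_text(json.dumps(self.index), encoding="utf-8")
        return True

    def import_collection(self):
        stored = self.save_path / "vocabulary.json"
        if not stored.is_file():
            return False                 # nothing has been generated yet
        loaded = json.loads(stored.read_text(encoding="utf-8"))
        self.index = defaultdict(dict, loaded)
        return True


with tempfile.TemporaryDirectory() as tmp:
    corpus = Path(tmp) / "corpus"
    corpus.mkdir()
    (corpus / "a.txt").write_text("the cat sat", encoding="utf-8")
    (corpus / "b.txt").write_text("the dog sat sat", encoding="utf-8")

    vocabulary = ToyVocabulary(Path(tmp) / "store")
    found = vocabulary.import_folder(corpus)   # queue the files
    vocabulary.index_files()                   # build the index
    hits = vocabulary.documents_with_keywords({"cat", "dog"})
    maxima = dict(vocabulary.maximum_frequencies_per_document())
    saved = vocabulary.save()

    # A fresh instance pointed at the same folder recovers the saved index.
    restored = ToyVocabulary(Path(tmp) / "store")
    recovered = restored.import_collection()
    restored_keys = sorted(restored.index)
```

The point of the sketch is the ordering: importing only queues paths, index_files() does the actual processing, and import_collection() returns False until something has been saved.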