Sanitizer
class matchup.presentation.sanitizer.Sanitizer(*, stopwords_path: str = None, stemming: bool = False)
Bases: object
Responsible for cleaning and processing the text representation.
Attributes Summary
stopwords_path    Property that returns the stopwords file path.
Methods Summary
add_stopwords(stopwords)    Add a set of stopwords manually.
import_stopwords()    Retrieve stopwords from a file.
index_line(words, base_line)    Index one line and return all of its words sanitized.
sanitize_line(line, number_line)    Sanitize one line. The line number is used mainly for presentation.
strip_accents(text)    Strip accents from a text.
Attributes Documentation
stopwords_path
Property that returns the stopwords file path.
Returns: Complete stopwords file path
Methods Documentation
add_stopwords(stopwords: Set[str])
Add a set of stopwords manually.
Parameters: stopwords – set of stopwords.
Returns:
import_stopwords() → Set[str]
Retrieve stopwords from a file.
Returns: set of stopwords
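The reference does not describe the stopwords file format. As a rough illustration only, the sketch below assumes a UTF-8 text file with one stopword per line; `load_stopwords` is a hypothetical helper, not part of `matchup`, and the real `import_stopwords` may parse its file differently.

```python
from pathlib import Path
from typing import Set


def load_stopwords(path: str) -> Set[str]:
    """Hypothetical helper: read one stopword per line (assumed file format)."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    # Trim surrounding whitespace, skip blank lines, and lowercase each
    # word so that later membership checks are case-insensitive.
    return {line.strip().lower() for line in lines if line.strip()}
```

Returning a set mirrors the documented `Set[str]` return type and makes membership checks during sanitization cheap.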
index_line(words: List[str], base_line: matchup.presentation.text.Line) → List[matchup.presentation.text.Term]
Index one line and return all of its words sanitized. The list must be sorted by order of occurrence in the text.
Parameters:
- words – words of the base line, stripped and without stopwords
- base_line – line to be sanitized
Returns: list of indexed words (list of Term)
sanitize_line(line: str, number_line: int) → List[matchup.presentation.text.Term]
Sanitize one line. The line number is used mainly for presentation.
Parameters:
- line – complete line to be processed
- number_line – number of the line
Returns:
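To make the flow of sanitizing a line concrete, here is a minimal sketch under stated assumptions: accents are stripped, the text is lowercased, tokens are matched with `\w+`, and stopwords are dropped. `SketchTerm`, its `position` label format, and `sanitize_line_sketch` are all hypothetical stand-ins; the actual `Term` class and the library's tokenization and stemming steps are not shown in this reference and may differ.

```python
import re
import unicodedata
from typing import List, NamedTuple, Set


class SketchTerm(NamedTuple):
    """Hypothetical stand-in for matchup.presentation.text.Term."""
    word: str
    position: str  # assumed "line-column" label, for presentation only


def sanitize_line_sketch(line: str, number_line: int,
                         stopwords: Set[str]) -> List[SketchTerm]:
    # Decompose accented characters and drop the combining marks,
    # then lowercase the result.
    decomposed = unicodedata.normalize("NFD", line)
    plain = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    ).lower()

    terms = []
    # finditer scans left to right, so terms stay in order of occurrence.
    for match in re.finditer(r"\w+", plain):
        word = match.group()
        if word not in stopwords:
            terms.append(SketchTerm(word, f"{number_line}-{match.start()}"))
    return terms
```

For example, `sanitize_line_sketch("O café é bom", 3, {"o", "e"})` keeps only `cafe` and `bom`, each tagged with line number 3 and its column in the accent-stripped text.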
static strip_accents(text: str) → str
Strip accents from a text.
Parameters: text – original text
Returns: sanitized text
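One common way to strip accents in Python is Unicode decomposition with the standard `unicodedata` module. The sketch below shows that technique; `strip_accents_sketch` is a hypothetical name, and this is not necessarily how `Sanitizer.strip_accents` is implemented.

```python
import unicodedata


def strip_accents_sketch(text: str) -> str:
    """Illustrative only: remove combining marks after NFD decomposition."""
    # NFD splits a character such as "é" into "e" plus a combining accent.
    decomposed = unicodedata.normalize("NFD", text)
    # Category "Mn" (nonspacing mark) covers the combining accents.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
```

With this approach, `strip_accents_sketch("ação")` gives `"acao"`, while plain ASCII text passes through unchanged.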