Sanitizer

class matchup.presentation.sanitizer.Sanitizer(*, stopwords_path: str = None, stemming: bool = False)

Bases: object

Cleans and processes the text representation.

Attributes Summary

stopwords_path Property that gets the stopwords file path.

Methods Summary

add_stopwords(stopwords) Add a set of stopwords manually.
import_stopwords() Retrieve stopwords from a file.
index_line(words, base_line) Index one line and return all of its sanitized words.
is_stemmig()
sanitize_line(line, number_line) Sanitize one line. The line number is used mainly for presentation.
strip_accents(text) Strip the accents from a text.

Attributes Documentation

stopwords_path
Property that gets the stopwords file path.
Returns: Complete stopwords file path

Methods Documentation

add_stopwords(stopwords: Set[str])
Add a set of stopwords manually.
Parameters: stopwords – set of stopwords.
Returns:
import_stopwords() → Set[str]
Retrieve stopwords from a file.
Returns: set of stopwords
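
As an illustration of what loading stopwords from a file can look like, here is a minimal standalone sketch. It does not use matchup's own code: the name `load_stopwords_sketch` is hypothetical, and the UTF-8, one-word-per-line file format is an assumption made for this example.

```python
from pathlib import Path
from typing import Set


def load_stopwords_sketch(path: str) -> Set[str]:
    """Hypothetical helper: read one stopword per line into a set."""
    words: Set[str] = set()
    # Assumed format: UTF-8 text, one stopword per line.
    with Path(path).open(encoding="utf-8") as handle:
        for raw in handle:
            word = raw.strip().lower()
            if word:  # skip blank lines
                words.add(word)
    return words
```

Returning a set mirrors the documented `Set[str]` return type and removes duplicate entries automatically.
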
index_line(words: List[str], base_line: matchup.presentation.text.Line) → List[matchup.presentation.text.Term]
Index one line and return all of its sanitized words. The returned list must be sorted by order of occurrence in the text.
Parameters:
  • words – words of the base line, stripped and with stopwords removed
  • base_line – line to be sanitized
Returns: list of indexed words (list(Term))
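
The ordering requirement can be pictured with a small standalone sketch. It is not matchup's implementation: `TermSketch` is a hypothetical stand-in for `matchup.presentation.text.Term`, the base line is passed as a plain string rather than a `Line`, and locating words with `str.find` is an assumption made purely for illustration.

```python
from typing import Dict, List, NamedTuple


class TermSketch(NamedTuple):
    """Hypothetical stand-in for matchup.presentation.text.Term."""
    word: str
    position: int  # character offset of the word in the base line


def index_line_sketch(words: List[str], base_line: str) -> List[TermSketch]:
    """Pair each word with its offset in base_line, sorted by occurrence."""
    next_start: Dict[str, int] = {}  # per word: where the next search begins
    terms: List[TermSketch] = []
    for word in words:
        # Plain substring search; a real indexer would likely be token-aware.
        found = base_line.find(word, next_start.get(word, 0))
        if found == -1:
            continue  # word not found; skipped in this sketch
        terms.append(TermSketch(word, found))
        next_start[word] = found + len(word)
    # Enforce the documented contract: order by occurrence in the text.
    return sorted(terms, key=lambda term: term.position)
```

Tracking a separate search offset per word lets repeated words map to distinct positions even when `words` arrives out of order.
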

is_stemmig()
sanitize_line(line: str, number_line: int) → List[matchup.presentation.text.Term]
Sanitize one line. The line number is used mainly for presentation.
Parameters:
  • line – Complete line to be processed
  • number_line – number of line
Returns:

static strip_accents(text: str) → str
Strip the accents from a text.
Parameters: text – original text
Returns: sanitized text
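
Accent stripping is commonly done with the standard library's `unicodedata` module. The sketch below shows that general technique under the hypothetical name `strip_accents_sketch`. Whether matchup's `strip_accents` works this way is not stated here, so treat it as an assumption.

```python
import unicodedata


def strip_accents_sketch(text: str) -> str:
    """Hypothetical helper: drop combining marks after NFD decomposition."""
    # NFD splits an accented character into a base character plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() returns a non-zero class for combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```

For example, `strip_accents_sketch("São Paulo")` yields `"Sao Paulo"`, while text without accents passes through unchanged.
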