API
- class stwfsapy.predictor.StwfsapyPredictor(graph, concept_type_uri, sub_thesaurus_type_uri='', thesaurus_relation_type_uri='', thesaurus_relation_is_specialisation=False, remove_deprecated=True, langs=frozenset({}), input='content', use_txt_vec=False, handle_title_case=True, extract_upper_case_from_braces=True, extract_any_case_from_braces=False, expand_ampersand_with_spaces=True, expand_abbreviation_with_punctuation=True, simple_english_plural_rules=False)
Finds labels of thesaurus concepts in texts and assigns them a score.
Creates the predictor.
- Parameters:
graph (Graph) – The SKOS ontology used to extract the labels.
concept_type_uri (Union[str, URIRef]) – The URI of the concept type. It is assumed that for every concept c, there is a triple (c, RDF.type, concept_type_uri) in the graph.
sub_thesaurus_type_uri (Union[str, URIRef]) – The URI of the sub thesaurus type. It is assumed that for every sub thesaurus t, there is a triple (t, RDF.type, sub_thesaurus_type_uri) in the graph.
thesaurus_relation_type_uri (Union[str, URIRef]) – URI of the relation that links concepts to thesauri.
thesaurus_relation_is_specialisation (bool) – Indicates whether the thesaurus relation links thesauri to concepts or the other way round. E.g., for the relation skos:broader it should be False. Conversely, it should be True for skos:narrower.
remove_deprecated (bool) – When True, deprecated subjects are discarded. Deprecation of a subject has to be indicated by a triple (s, OWL.deprecated, Literal(True)) in the graph.
langs (FrozenSet[str]) – Labels are extracted from the graph for each language present in the set. An empty set or None extracts labels regardless of language.
input (str) – The type of input that is presented to the fit method:
    'content': Input is expected to be an array-like of strings.
    'filename': Input is expected to be a list of filenames.
    'file': Input is expected to be a list of file objects.
use_txt_vec (bool) – Whether to use vectorized representations of inputs. This can lead to high memory consumption.
handle_title_case (bool) – When True, labels are also matched in title case, i.e., the first letter of every word in the text can be upper or lower case and will still be matched. When False, only the case of the first word's first letter is adapted. Example:
    Given a label “garbage can” and the title “Oscar Lives in a Garbage Can”:
    When handle_title_case == True, the label will match the text.
    When handle_title_case == False, the label will not match the text. It would, however, still match “Garbage can is home to grouchy neighbor.”
extract_upper_case_from_braces (bool) – Removes the explanation in braces from labels. E.g., GDP (Gross Domestic Product) will be transformed to GDP.
extract_any_case_from_braces (bool) – Extracts the content of braces in labels. E.g., R&D (research and discovery) will be transformed to research and discovery. In contrast to extract_upper_case_from_braces, it extracts the part inside the parentheses and not the part before them.
expand_ampersand_with_spaces (bool) – For labels that contain an ampersand, text containing spaces around that symbol is also matched. E.g., R & D will be matched for the label R&D.
expand_abbreviation_with_punctuation (bool) – For labels containing only uppercase letters, text with punctuation added is also matched. E.g., G.D.P. for the label GDP.
simple_english_plural_rules (bool) – When True, simple English plural forms of labels are also detected.
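The label-related options can be illustrated with plain regular expressions. The following is only a sketch of the behaviour described in the parameter list, not the library's own implementation. The helper names `label_pattern`, `strip_brace_explanation`, and `extract_brace_content` are hypothetical and exist only for this illustration.

```python
import re


def _either_case(ch):
    """Return a pattern matching ch in upper or lower case."""
    if ch.isalpha():
        return "[" + ch.upper() + ch.lower() + "]"
    return re.escape(ch)


def _word_pattern(word, flexible_first_letter, expand_ampersand):
    """Build the pattern for a single word of a label."""
    first = _either_case(word[0]) if flexible_first_letter else re.escape(word[0])
    rest = re.escape(word[1:])
    if expand_ampersand:
        # Allow optional whitespace around an ampersand: "R&D" also matches "R & D".
        rest = rest.replace(re.escape("&"), r"\s?&\s?")
    return first + rest


def label_pattern(label, handle_title_case=True,
                  expand_ampersand_with_spaces=True,
                  expand_abbreviation_with_punctuation=True):
    """Compile a regex imitating the documented matching options (illustration only)."""
    if expand_abbreviation_with_punctuation and re.fullmatch(r"[A-Z]+", label):
        # All-uppercase label: also accept the dotted form, e.g. "G.D.P." for "GDP".
        dotted = "".join(re.escape(ch) + r"\." for ch in label)
        body = "(?:" + re.escape(label) + "|" + dotted + ")"
    else:
        words = label.split()
        # The first word's first letter is always case-flexible; the other
        # words only when handle_title_case is enabled.
        body = r"\s+".join(
            _word_pattern(w, i == 0 or handle_title_case, expand_ampersand_with_spaces)
            for i, w in enumerate(words)
        )
    # Lookarounds instead of \b so that a trailing "." is handled correctly.
    return re.compile(r"(?<!\w)" + body + r"(?!\w)")


def strip_brace_explanation(label):
    """Imitate extract_upper_case_from_braces: keep the upper-case part before the braces."""
    match = re.fullmatch(r"([A-Z&]+)\s*\((.+)\)", label)
    return match.group(1) if match else label


def extract_brace_content(label):
    """Imitate extract_any_case_from_braces: keep the text inside the braces."""
    match = re.fullmatch(r".+?\s*\((.+)\)", label)
    return match.group(1) if match else label
```

With these helpers, `label_pattern("garbage can", handle_title_case=False)` finds no match in “Oscar Lives in a Garbage Can” but does match “Garbage can is home to grouchy neighbor.”, mirroring the example given for handle_title_case.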
- fit(X, y=None, **kwargs)
- static load(path)
- match_and_extend(inputs, truth_refss=None)
Retrieves concepts by their labels from text. If ground truth values are present, it will also return a list of labels for scoring matches. If no ground truth values are present, a list with the number of matched concepts for each document is returned.
- predict(X)
- Return type:
csr_matrix
- predict_proba(X)
- Return type:
csr_matrix
- store(path)
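A typical workflow built from the methods listed above might look like the following sketch. Several details are assumptions and not part of this reference: the two `example.org` URIs are hypothetical placeholders, `graph` is assumed to be an rdflib Graph holding the thesaurus, and `labels` is assumed to hold, for each text, the URIs of its ground-truth concepts. The helper name `train_and_reload` is likewise made up for this illustration.

```python
# Keyword arguments taken from the constructor signature.
# The two example.org URIs are hypothetical placeholders.
PREDICTOR_OPTIONS = {
    "concept_type_uri": "http://example.org/vocab/Concept",
    "sub_thesaurus_type_uri": "http://example.org/vocab/SubThesaurus",
    "thesaurus_relation_type_uri": "http://www.w3.org/2004/02/skos/core#broader",
    # skos:broader is used, so this is False as described in the parameter list.
    "thesaurus_relation_is_specialisation": False,
    "remove_deprecated": True,
    "langs": frozenset({"en"}),
}


def train_and_reload(graph, texts, labels, model_path):
    """Fit a predictor, store it, load it back, and score the texts.

    Assumptions: `graph` is an rdflib Graph with the SKOS thesaurus and
    `labels` contains one collection of concept URIs per text.
    """
    # Imported lazily so the sketch can be imported without stwfsapy installed.
    from stwfsapy.predictor import StwfsapyPredictor

    predictor = StwfsapyPredictor(graph, **PREDICTOR_OPTIONS)
    predictor.fit(texts, y=labels)

    # Persist the fitted predictor and restore it via the static load method.
    predictor.store(model_path)
    restored = StwfsapyPredictor.load(model_path)

    # predict_proba is documented to return a csr_matrix.
    return restored.predict_proba(texts)
```

Keeping the options in a dictionary makes it easy to reuse the same configuration when experimenting with different graphs or training sets.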