Onto.PT: Automatic Construction of a Lexical Ontology for Portuguese

Hugo Gonçalo Oliveira¹, Paulo Gomes²

Abstract. This ongoing research presents an alternative to the manual creation of lexical resources and proposes an approach towards the automatic construction of a lexical ontology for Portuguese. Textual sources are exploited in order to obtain a lexical network based on terms and, after clustering and mapping, a wordnet-like lexical ontology is created. At the end of the paper, current results are shown.

1 INTRODUCTION

In the last decade, besides the increasing amount of Semantic Web [2] applications, we have seen a growing number of systems that perform tasks where understanding the information conveyed by natural language plays an important role. Natural language processing (NLP) tasks, from machine translation or automatic text generation to intelligent search, are becoming more and more common, which demands better access to semantic knowledge.

Knowledge about words and their meanings is structured in lexical ontologies, such as Princeton WordNet [15], which are used in the achievement of the aforementioned tasks. Since this kind of resource is most of the times handcrafted, its creation and maintenance involve time-consuming human effort. So, its automatic construction from text arises as an alternative, providing less intensive labour, easier maintenance and allowing for higher coverage, as a trade-off for lower, but still acceptable, correctness.

This paper presents Onto.PT, an ongoing research project where textual resources, more precisely dictionaries, thesauri and corpora, are being exploited in order to extract lexico-semantic knowledge that will be used in the construction of a public domain lexical ontology for Portuguese. While the first stage of this work deals mainly with information extraction from text, subsequent stages are concerned with the disambiguation of the acquired information and the construction of a structure similar to WordNet. Considering that information is extracted from different sources, one particular point is that we aim to accomplish word sense disambiguation (WSD) [28] based not on the context where information is found, but on knowledge already extracted. Therefore, clustering over extracted synonymy instances is first used to identify groups of synonymous words that will be used as a conceptual base. The rest of the information, consisting of term-based triples, is then mapped to the conceptual base as each term is assigned to a group of synonyms.

After introducing some background concepts and relevant work, we state the goals of this research. Then, we introduce the stages involved in the approach we are following. Before concluding, current results of this project, as well as their evaluation, when available, are shown.

¹ PhD student, CISUC, University of Coimbra, Portugal, hroliv@dei.uc.pt
² CISUC, University of Coimbra, Portugal, pgomes@dei.uc.pt

2 BACKGROUND KNOWLEDGE

Besides recognising words, their structure and their interactions, applications that deal with information in natural language need to understand its meaning, which is usually achieved with the help of knowledge bases assembling lexical and semantic information, such as lexical ontologies.
Despite some terminological issues, lexical ontologies can be seen both as a lexicon and as an ontology [22], and are significantly different from classic ontologies: they are not constructed for a specific domain and are intended to provide knowledge structured on the lexical items (words) of a language by relating them according to their meanings. In this context, Princeton WordNet [15] is the most representative lexico-semantic resource for English and also the most common model for representing a lexical ontology. WordNet is structured on synsets, which are groups of synonymous words describing concepts, and on connections between those groups, denoting semantic relations (e.g. hyponymy, part-of).

The success of the WordNet model led to its adoption by many lexical resources in several different languages, such as the wordnets involved in the EuroWordNet [38] project, or WordNet.PT [26], for Portuguese. However, the creation of a wordnet, as well as the creation of most ontologies, is typically manual, thus involving much human effort [4]. To overcome this problem, some authors [9] propose the translation of a target wordnet into wordnets in other languages. This seems to be a suitable alternative for several applications, but another problem arises because different languages represent different socio-cultural realities: they do not cover exactly the same part of the lexicon and, even where they seem to be common, several concepts are lexicalised differently [22]. Another popular alternative for ontology creation is to extract lexico-semantic knowledge and learn lexical ontologies from text, which can either be unstructured, as in textual corpora, or semi-structured, as in dictionaries or encyclopedias.

Research on the acquisition of lexico-semantic knowledge from corpora is not new, and varied methods, roughly divided into linguistics-based (see [20, 32]), statistics or graph-based (see [36, 25, 14]) or hybrid (see [6, 7, 1, 18, 17]), have been proposed to achieve different steps of this task, such as the extraction of relations like hyponymy [20, 6, 7], meronymy [1, 17], causation [18], or the establishment of sets of similar or synonymous words [32, 25, 36].

Dictionary processing, which became popular during the 1970s [5], is also a good option for the extraction of this kind of knowledge. MindNet [31] is both an extraction methodology and a lexical ontology different from a wordnet, since it was created automatically from a dictionary and its structure is based on such resources. Nevertheless, it still connects sense records with semantic relations (e.g. hyponymy, cause, manner). Most of the research on the automatic creation of lexical resources from electronic dictionaries was done during the 1980s and 1990s, when the advantages and drawbacks of using the latter resources were studied and discussed [23]. Still, there are reports of recent work on the automatic extraction of knowledge from dictionaries (see [29, 19, 27]).
For instance, PAPEL [19] is a lexical resource consisting of a set of triples denoting semantic relations between words found in a Portuguese dictionary.

Besides corpora and dictionary processing, in recent years semi-structured collaborative resources such as Wikipedia or Wiktionary have proved to be important sources of lexico-semantic information and have thus been receiving more and more attention from the research community (see for instance [33, 21, 40, 27]).

On the one hand, there are clear advantages to using dictionaries: they are already structured on words and meanings, they cover the whole language, and they generally use simple and almost predictable vocabulary. On the other hand, dictionaries are static resources with limited knowledge. Therefore, some authors [20, 32] argue that textual corpora should be exploited to extract knowledge that can be found neither in dictionaries nor in lexical ontologies. Also, while language dictionaries are not always available for this kind of research, there is always much text available on the most different subjects, for instance on the Web. The biggest problem concerning lexico-semantic information extraction from corpora is that there are no boundaries on the vocabulary and linguistic constructions used, thus leading to more ambiguity and parsing issues.

Most of the aforementioned works on the extraction of semantic relations from text output related words, identified by their orthographical form. However, since natural language is ambiguous, this representation is not practical for most computational applications, because the same orthographical form might either have completely different meanings (e.g. bank, institution or slope) or closely related meanings (e.g. bank, institution or building). Furthermore, there are words with completely different orthographical forms denoting the same concept (e.g. car and automobile). This might lead to serious inconsistencies, for instance when dealing with inference, as in an example, in Portuguese, reported in [19]: queda SYNONYM OF ruína ∧ queda SYNONYM OF habilidade → ruína SYNONYM OF habilidade. Here, the two words in the inferred relation are almost opposites and not synonyms.

Therefore, another challenge in lexical ontology learning from text, often called ontologising, is concerned with moving from knowledge based on words to knowledge based on concepts. For English, there are works on the assignment of suitable WordNet synsets to the arguments of relational triples extracted from text, or to other term entities, such as Wikipedia entries [33]. Some of the methods for ontologising term-based triples compute the similarity between the context from which each triple was extracted and the terms in synsets, sibling synsets or direct hyponym synsets [35]. Others look for relations established with the argument terms and with the terms of each synset [30], or take advantage of generalisation through hypernymy links [30].

3 RESEARCH GOALS

The main goal of this research is the automatic construction of Onto.PT, a broad-coverage structure of Portuguese words according to their meanings, or, more precisely, a lexical ontology.

Regarding information sparsity, it seems natural to try to create such a resource with knowledge extracted from several sources, as proposed in [39] for creating a lexical ontology for German, but for Portuguese.
Thus, we are using or planning to use the following sources of knowledge: (i) dictionaries, such as Dicionário da Língua Portuguesa [13], through PAPEL; Dicionário Aberto (DA) [34], an open domain electronic version of a Portuguese dictionary from 1913; and the Portuguese Wiktionary³; (ii) encyclopedias, such as the complete entries of the Portuguese Wikipedia⁴ or just their abstracts; (iii) corpora, still to be decided; and (iv) thesauri, such as TeP [12], an electronic thesaurus for Brazilian Portuguese, and OpenThesaurus.PT⁵, a thesaurus for European Portuguese.

³ http://pt.wiktionary.org
⁴ http://pt.wikipedia.org
⁵ http://openthesaurus.caixamagica.pt/

Considering each resource's specificities, such as its organisation or the vocabulary used, the extraction procedures might be significantly different, but they must have one common output: a set of term-based relational triples. Still, considering the limitations of representations based on terms, we are adopting a wordnet-like structure, which enables the establishment of unambiguous semantic relations between synsets. Moving from a lexical network to a lexical ontology requires the application of several WSD techniques. However, our intention is to achieve WSD based only on knowledge already extracted, because we believe this is the best way to harmoniously integrate knowledge coming from different heterogeneous sources. Another point that should be considered is the attribution of confidence weight(s) to each relation, based on its frequency and also on one or several similarity measures, calculated according to the distribution of words in a corpus.

4 PROPOSED APPROACH

In this section, we describe all the stages involved in the creation of Onto.PT, also represented in Figure 1. Furthermore, we give an overview of possible ways to evaluate the results of this work, which, in the future, will be freely available.

[Figure 1. Information flow in the construction of Onto.PT: extraction of relational triples and synonymy instances from electronic dictionaries, textual documents and handmade thesauri, followed by clustering, merging, mapping, weight assignment and organisation, resulting in a wordnet-like lexical ontology.]

4.1 Extraction of relational triples

The first stage in the creation of Onto.PT is the automatic extraction of lexico-semantic knowledge from textual sources. The extracted information is represented as relational triples, t1 R t2, where t1 and t2 are terms and R is the name of a semantic relation held between possible meanings of t1 and t2. These triples establish a lexical network, N = (T, E), with |T| nodes and |E| edges, E ⊂ T², where each node t ∈ T is a term and each edge between nodes a and b, R(a, b), means that a relation of type R between a and b was extracted.

Hence, each sentence is analysed by a parser according to semantic grammars created specifically for each relation to be extracted. Most of the rules in the semantic grammars are based on textual patterns frequently used to denote each semantic relation, such as the ones presented in Table 1 for well-known relations in Portuguese.

Table 1. Examples of patterns indicating semantic relations.

  Relation    Example pattern
  Hypernymy   tipo|género|classe|forma de
  Meronymy    parte|membro de
  Causation   causado|provocado|originado por
  Purpose     usado|utilizado|serve para

Extraction from dictionaries follows very closely the extraction procedure described in [19]. Despite significant differences between dictionary and corpora text, the general extraction procedure works for both, with slight differences in the construction of the grammars. For instance, most of the relations extracted from dictionary definitions are established between a word in the definition and the word being defined. Moreover, dictionaries are important to obtain synonymy instances (t1 SYNONYM OF t2), since many words are defined by (a list of) their synonyms. On the other hand, despite sharing the same neighbourhood most of the times, synonymy instances may not co-occur frequently in corpora text [14].
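To make the use of such patterns concrete, the sketch below matches Table 1-style patterns against dictionary definitions using plain regular expressions. This is only an illustration under our own assumptions: the relation names, the optional-article handling, the argument-direction choices and the toy definitions are ours, and the actual semantic grammars are parser rules rather than regular expressions.

```python
import re

ART = r"(?:(?:o|a|um|uma)\s+)?"   # optional article before the captured word

# Illustrative regex patterns in the spirit of Table 1
PATTERNS = {
    # "X: tipo/género/classe/forma de Y"      -> (Y, HYPERNYM_OF, X)
    "HYPERNYM_OF": re.compile(r"\b(?:tipo|g[eé]nero|classe|forma)\s+de\s+" + ART + r"(\w+)"),
    # "X: parte/membro de Y"                  -> (X, PART_OF, Y)
    "PART_OF": re.compile(r"\b(?:parte|membro)\s+de\s+" + ART + r"(\w+)"),
    # "X: causado/provocado/originado por Y"  -> (Y, CAUSATION_OF, X)
    "CAUSATION_OF": re.compile(r"\b(?:causad|provocad|originad)[oa]s?\s+por\s+" + ART + r"(\w+)"),
}

def extract_triples(headword, definition):
    """Return term-based triples suggested by the patterns found in one definition."""
    triples = []
    for relation, pattern in PATTERNS.items():
        for match in pattern.finditer(definition.lower()):
            other = match.group(1)
            if relation == "PART_OF":
                triples.append((headword, relation, other))
            else:
                # the word found in the definition is the hypernym / the cause
                triples.append((other, relation, headword))
    return triples

# Toy definitions, invented for the example:
print(extract_triples("salva", "tipo de planta aromática"))
# -> [('planta', 'HYPERNYM_OF', 'salva')]
print(extract_triples("assadura", "irritação causada por fricção"))
# -> [('fricção', 'CAUSATION_OF', 'assadura')]
```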
In order to find less intuitive patterns, a pattern discovery algorithm [20] can be applied over a corpus: (i) a relation R is chosen; (ii) several pairs of words known to establish R are looked for in a corpus; (iii) every time both words of the same pair co-occur in a sentence, the text connecting them is collected; (iv) the most frequently collected connecting strings are used as hints for new patterns denoting R.
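A minimal sketch of this discovery loop, under our own assumptions, is given below: the corpus is taken as a list of sentences, the text connecting each seed pair is grabbed with a regular expression, and the connecting strings are simply counted; the seed pairs and sentences are invented for the example.

```python
from collections import Counter
import re

def discover_patterns(corpus_sentences, seed_pairs, top_n=10):
    """Collect the text connecting seed pairs; frequent strings hint at new patterns."""
    connectors = Counter()
    for sentence in corpus_sentences:
        lowered = sentence.lower()
        for a, b in seed_pairs:
            # look for "a ... b" within the same sentence and keep the connecting text
            match = re.search(rf"\b{re.escape(a)}\b(.{{1,60}}?)\b{re.escape(b)}\b", lowered)
            if match:
                connectors[match.group(1).strip()] += 1
    return connectors.most_common(top_n)

# Toy run with invented sentences and seed hypernymy pairs:
sentences = [
    "A salva é um tipo de planta aromática.",
    "O folk é um estilo de música tradicional.",
]
seeds = [("salva", "planta"), ("folk", "música")]
print(discover_patterns(sentences, seeds))
# e.g. [('é um tipo de', 1), ('é um estilo de', 1)]
```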
4.2 Clustering for synsets

Since lexical resources based on the orthographical form of words are inadequate to deal with ambiguity, we are adopting a wordnet-like structure, where concepts are described by synsets and ambiguous words are included in one synset for each of their meanings. Semantic relations can thereby be unambiguously established between two synsets, and concepts, even though described by groups of words, will bring together natural language and knowledge engineering in a suitable representation, for instance, for the Semantic Web. Moreover, this makes it possible to apply inference rules for the discovery of new knowledge.

From a linguistic point of view, word senses are complex and overlapping structures [24, 22]. So, despite word sense divisions in dictionaries and ontologies being most of the times artificial, this trade-off is needed in order to increase the usability of broad-coverage computational lexical resources.

As lexical synonymy networks extracted from dictionaries tend to have a clustered structure [16], clusters are identified in order to establish synsets. A possible way to achieve clustering and deal with ambiguity at the same time is to use a hard-clustering algorithm, such as the Markov Clustering Algorithm (MCL) [37], and extend it to find unstable nodes, which are most of the times ambiguous words. This is the approach of [16], which runs clustering with noise several times, creates a matrix with the probabilities of each node belonging to each cluster and, finally, assigns each word to all the clusters for which its belonging probability is higher than a threshold.

Our procedure is very similar and is described as follows: (i) split the original network into sub-networks, such that there is no path between two elements in different sub-networks, and calculate the frequency-weighted adjacency matrix A of each sub-network; (ii) add stochastic noise to each entry of A, A_ij = A_ij + A_ij × r, where r is a random value; (iii) run MCL over A 30 times; (iv) use the (hard) clustering obtained in each one of the 30 runs to create a new matrix P with the probabilities of each pair of words in A belonging to the same cluster; (v) create the clusters based on P and on a given threshold θ = 0.2: if P_ij > θ, i and j belong to the same cluster; (vi) in order to clean the results, remove: (a) big clusters, C, if there is a group of clusters G = {C_1, C_2, ..., C_n} such that C = C_1 ∪ C_2 ∪ ... ∪ C_n; (b) clusters completely included in other clusters.
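As a rough illustration of steps (ii) to (v), the sketch below assumes a symmetric, frequency-weighted adjacency matrix for one sub-network is already available. The MCL implementation is a bare-bones version written only for this example, the clean-up of step (vi) and the exact construction of clusters from the probability matrix are simplified, and all parameter names are ours.

```python
import numpy as np

def mcl(matrix, inflation=2.0, iterations=50):
    """Bare-bones Markov Clustering: alternate expansion and inflation."""
    m = matrix + np.eye(matrix.shape[0])          # self-loops avoid empty columns
    m = m / m.sum(axis=0, keepdims=True)          # column-stochastic
    for _ in range(iterations):
        m = m @ m                                 # expansion
        m = m ** inflation                        # inflation
        m = m / m.sum(axis=0, keepdims=True)
    clusters = {}
    for node in range(m.shape[0]):
        attractor = int(np.argmax(m[:, node]))    # node follows its strongest attractor
        clusters.setdefault(attractor, set()).add(node)
    return list(clusters.values())

def synset_candidates(adjacency, runs=30, noise=0.2, threshold=0.2, seed=0):
    """Run MCL on noisy copies of the matrix and keep, for each word, the words
    that end up in the same cluster with probability above the threshold."""
    rng = np.random.default_rng(seed)
    n = adjacency.shape[0]
    together = np.zeros((n, n))
    for _ in range(runs):
        noisy = adjacency + adjacency * rng.uniform(0, noise, size=adjacency.shape)
        for cluster in mcl(noisy):
            for i in cluster:
                for j in cluster:
                    together[i, j] += 1
    probability = together / runs
    return [set(np.flatnonzero(probability[i] > threshold)) for i in range(n)]
```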
4.3 Merging with other synset-based resources

In this stage, we take advantage of broad-coverage synset-based resources for Portuguese, such as thesauri, in order to enrich our synset base. Still, we are more interested in manually created resources of that kind, since they can amplify the coverage and improve the precision of our synsets at a significantly low cost.

The following procedure is applied for merging two thesauri: (i) define one thesaurus as the basis B and the other as M; (ii) create a new empty thesaurus T and copy all the synsets in B to T; (iii) for each synset S_M ∈ M, find the synsets S_T ∈ T with the highest Jaccard coefficient⁶ J, and add them to a set of synsets Z ⊂ T; (iv) considering J and Z, do one of the following: (a) if J = 1, the synset is already in T, so nothing is done; (b) if J = 0, S_M is copied to T; (c) if |Z| = 1, remove Z_1 from T and add a new synset S_N = Z_1 ∪ S_M to T; (d) if |Z| > 1, a new synset, S_N = S_M ∪ Z' where Z' = Z_1 ∪ ... ∪ Z_|Z|, is added to T and all synsets in Z are removed from T.

⁶ J(A, B) = |A ∩ B| / |A ∪ B|

4.4 Assigning weights to triples

In this stage, one (or several) weights are assigned to triples based on the number of times they were extracted (frequency) and also on distributional metrics, calculated over a corpus. The latter metrics, typically used to retrieve similar documents, assume that similar or related words tend to co-occur or to occur in similar contexts. Nevertheless, several distributional metrics (e.g. latent semantic analysis (LSA) [11]) have also been adapted to measure the similarity of two words, based on their neighbourhoods [7, 39]. The weights can thus be used to indicate the confidence in each triple, and thresholds can be applied to discard lower-weighted triples and improve precision. For instance, [8] reports high correlations between the manual evaluation of hypernymy and part-of triples and their weights according to some distributional measures computed on a corpus.

4.5 Mapping term-based triples to synsets

After the previous stages, a thesaurus T and a term-based lexical network, N, are available. In order to set up a wordnet, this stage uses the latter to map term-based triples to synset-based triples, or, in other words, to assign each term, a and b, in each triple (a R b) ∈ N to suitable synsets of T. This task, often called ontologising [30], can be seen as WSD, but we explicitly aim to achieve disambiguation by taking advantage of knowledge already extracted, and not of the context from which it was extracted. Having this in mind, we have developed two mapping methods.

In the first method, to assign a to a synset S, b is fixed and all the synsets containing a, S_a ⊂ T, are obtained. If a is not in T, it is assigned to a new synset S = (a). Otherwise, for each synset S_i ∈ S_a, n_i is the number of terms t ∈ S_i such that (t R b) holds. Then, the proportion p_i = n_i / |S_i| is calculated. All the synsets with the highest n_i establish a set C. Finally, (i) if |C| = 1, a is assigned to the only synset in C; (ii) if |C| > 1, C' is the set of elements of C with the highest p_i and, if |C'| = 1, a is assigned to the synset in C', unless p_i < θ⁷; (iii) if it is not possible to assign a synset to a, it remains unassigned. Term b is assigned to a synset using this procedure, but fixing a. In a second phase, we take advantage of hypernymy links already established to help mapping semi-mapped triples, which are triples where one of the arguments is assigned to a synset and the other is not (a R S_b or S_a R b).

The second mapping method starts by creating a term-term matrix, M, based on the adjacencies of the lexical network. Consequently, M is a square matrix with n lines, where n is the total number of nodes (terms) in the lexical network. If the term in index i and the term in index j are connected by some kind of relation, M_ij = 1; otherwise, M_ij = 0. In order to assign synsets to a and b, the first thing to do is, once again, to get all the synsets including the term a, S_a ⊂ T, and also all the synsets including b, S_b ⊂ T. Then, the similarity between each synset A ∈ S_a and each synset B ∈ S_b is given by the average lexical network based similarity of each term in A with each term in B:

  sim(A, B) = ( Σ_{i=1..|A|} Σ_{j=1..|B|} cos(M_i, M_j) ) / ( |A| × |B| )

Here, the similarity of two words, based on their neighbourhoods in the lexical network, is calculated as the cosine of their adjacency vectors, M_i and M_j respectively. To conclude the mapping, the pair of synsets with the highest similarity is chosen.

⁷ θ is a threshold defined to avoid assigning a to a big synset in which a itself is the only term related to b.
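A minimal sketch of this choice, under our own assumptions, follows: M is the 0/1 term-term adjacency matrix as a NumPy array, index maps each term to its row, synsets_a and synsets_b are the candidate synsets (sets of terms) containing a and b, and every candidate term is assumed to be a node of the network; the function and variable names are ours.

```python
import numpy as np

def cosine(u, v):
    """Cosine of two adjacency vectors (0 when one of them is all zeros)."""
    norm = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v) / norm if norm else 0.0

def best_synset_pair(M, index, synsets_a, synsets_b):
    """Pick the pair of candidate synsets whose terms are, on average,
    closest in the lexical network (sim(A, B) from Section 4.5)."""
    best_pair, best_sim = None, -1.0
    for A in synsets_a:
        for B in synsets_b:
            sims = [cosine(M[index[a]], M[index[b]]) for a in A for b in B]
            sim = sum(sims) / (len(A) * len(B))
            if sim > best_sim:
                best_pair, best_sim = (A, B), sim
    return best_pair, best_sim
```

The cosine over adjacency rows is what makes the method independent of the extraction context: two terms get a high similarity when they are related to the same set of words in the network, even if they were never directly related to each other.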
4.6 Knowledge organisation

In this stage, routines for knowledge organisation are applied in order both to make it possible to infer new implicit knowledge and also to remove redundant triples. This is achieved by applying some rules to the synset-based triple set, including:

∙ Transitivity: if R is transitive (e.g. SYNONYMY, HYPERNYMY, ...), (a R b) ∧ (b R c) → (a R c)
∙ Inheritance: if R is not a HYPERNYMY or HYPONYMY relation, (a HYPERNYM OF b) ∧ (a R c) → (b R c)

Therefore, some behavioural properties of the extracted relations, such as transitivity, inheritance or inversion, are predefined. For instance, to deal with inversion, all relations are only stored in the type defined as the direct one but, if needed, the system can invert them.
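These rules can be read as a closure computation over the synset-based triple set. The sketch below is our own illustration: synsets are plain hashable identifiers, the relation names are examples, and, since the rule statement is generic, the inheritance step propagates a relation from a hypernym synset to its hyponym in whichever argument position the hypernym occupies, which is our reading of the rule.

```python
def organise(triples, transitive=("SYNONYM_OF", "HYPERNYM_OF")):
    """Apply the transitivity and inheritance rules until no new triple appears."""
    triples = set(triples)
    while True:
        inferred = set()
        for (a, r1, b) in triples:
            for (c, r2, d) in triples:
                # Transitivity: (a R b) ∧ (b R d) -> (a R d)
                if r1 == r2 and r1 in transitive and b == c and a != d:
                    inferred.add((a, r1, d))
                # Inheritance: a hyponym synset inherits the (non-taxonomic)
                # relations of its hypernym, in either argument position
                if r1 == "HYPERNYM_OF" and r2 not in ("HYPERNYM_OF", "HYPONYM_OF"):
                    if a == c and b != d:
                        inferred.add((b, r2, d))
                    if a == d and b != c:
                        inferred.add((c, r2, b))
        if inferred <= triples:
            return triples
        triples |= inferred

# Toy example with invented synset identifiers:
print(organise({("s_animal", "HYPERNYM_OF", "s_dog"),
                ("s_tail", "PART_OF", "s_animal")}))
# the result now also contains ('s_tail', 'PART_OF', 's_dog')
```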
4.7 Evaluation

Evaluation takes place throughout all the previous stages. Manual evaluation is a reliable kind of evaluation, but it is also time-consuming and difficult to reproduce, so, when possible, we are willing to explore automatic evaluation procedures. Automatic evaluation is typically performed by comparing the results obtained with a gold standard, but the latter is not always available, especially for a broad-coverage ontology, where freely available gold standards are scarce. The validation of relational triples can also be performed using a collection of documents to find hints of them. For instance, triples can be translated to common natural language patterns, such as the ones in Table 1, and searched for in that form, as in [10], to assign probabilities to semantic triples, or in [19], to validate them. Moreover, the quality of the final ontology will also be assessed when using it to perform NLP tasks, such as question answering or automatic generation of text.

5 CURRENT RESULTS

Since the authors of this research are also part of the PAPEL development team, PAPEL can be seen as a seed project. So, in Table 2, we start by presenting the numbers and examples of some of the relations included in PAPEL 2.0, and also the numbers of the relations obtained after applying exactly the same extraction procedure, described in [19], to DA. We have taken advantage of the grammatical information provided by the dictionaries to organise each type of relation according to the grammatical category of its arguments.

Table 2. Relations extracted from dictionaries.

  Relation    Arguments   PAPEL 2.0   DA       Examples
  Synonymy    noun,noun   37,452      20,910   auxílio, contributo
              verb,verb   21,465      8,715    tributar, colectar
              adj,adj     19,073      7,353    flexível, moldável
              adv,adv     1,171       605      após, seguidamente
  Hypernymy   noun,noun   62,591      59,887   planta, salva
  Part-of     noun,noun   2,805       1,795    cauda, cometa
              noun,adj    3,721       4,902    tampa, coberto
  Member-of   noun,noun   5,929       1,564    ervilha, Leguminosas
              adj,noun    883         59       celular, célula
  Causation   noun,noun   1,013       264      fricção, assadura
              adj,noun    498         166      reactivo, reacção
              verb,noun   6,399       5,714    limpar, purgação
  Purpose     noun,noun   2,886       1,760    defesa, armadura
              verb,noun   5,192       3,383    fazer rir, comédia
              verb,adj    260         186      corrigir, correccional

The relations between nouns in a previous version of PAPEL were validated (also in [19]) by searching for natural language sentences denoting the relations in a newspaper corpus. About 20% of the part-of and hypernymy triples were supported by the corpus. On the other hand, these numbers were respectively 10% and 4% for purpose and causation. The results are interesting since there is not as much general knowledge in a newspaper as in a dictionary, and because we have used a small set of patterns when there is a huge amount of possibilities for denoting these semantic relations in corpora text.

Moving on to other kinds of text, around 37,898 sentences of the Portuguese Wikipedia were processed with the grammars for corpora. All the processed sentences were introducing articles which, in the DBpedia [3] taxonomy, had one of the following types: species, anatomical structure, chemical compound, disease, currency, drug, activity, language, music genre, colour, ethnic group or protein. A POS tagger was used in the extraction, but only to identify adjectives. Also, in an additional stage, we have used it to identify the grammatical categories of the arguments of the triples, and we noticed that most of the relations extracted were between nouns. The evaluation of the extracted triples was performed by human judges, who classified samples with triples of each relation as correct or incorrect. The quantities of relations extracted, the proportion of correct triples, as well as the agreement values, are shown in Table 3.

Table 3. Relations extracted from Wikipedia abstracts.

  Relation    Quant.   Example                  Sample   Corr.    Agr.
  Synonymy    11,862   estupro, violação        286      86.1%    91.2%
  Hypernymy   29,563   estilo de música, folk   322      59.1%    93.1%
  Part-of     1,287    jejuno, intestino        268      52.6%    78.4%
  Causation   520      parasita, doença         244      49.6%    79.5%
  Purpose     743      construção, terracota    264      57.0%    82.2%

To test the synset discovery procedure, we have made several experiments using the thesaurus of nouns TeP, OpenThesaurus.PT (OT), and also the noun synonymy instances of PAPEL which, after clustering, became the thesaurus CLIP. We also used TeP as the base thesaurus and merged it, first with OT, and then with CLIP, giving rise to the biggest noun thesaurus, TOP.

Table 4 has information on each one of the thesauri, more precisely the quantity of words, the words belonging to more than one synset (ambiguous), the number of synsets where the most ambiguous word occurs, the quantity of synsets, the average synset size (number of words), and the size of the biggest synset.

Table 4. (Noun) thesauri in numbers.

                           TeP      OT      CLIP     TOP
  Words    Quantity        17,158   5,819   23,741   30,554
           Ambiguous       5,867    442     12,196   13,294
           Most ambiguous  20       4       47       21
  Synsets  Quantity        8,254    1,872   7,468    9,960
           Avg. size       3.51     3.37    12.57    6.6
           Biggest         21       14      103      277

519 synsets of CLIP and 480 of TOP were manually validated, each by two human judges, who had to classify each synset as correct, incorrect, or don't know⁸. Besides the average validation results and the agreement rates, Table 5 also contains the results considering only synsets of ten or fewer words, the less problematic ones (CLIP' and TOP').

⁸ In some context, all the words of a correct synset could have the same meaning, while, for incorrect synsets, at least one word could never have the same meaning as the others.

Table 5. Results of manual synset validation.

          Sample     Correct   Incorrect   N/A    Agreement
  CLIP    519 sets   65.8%     31.7%       2.5%   76.1%
  CLIP'   310 sets   81.1%     16.9%       2.0%   84.2%
  TOP     480 sets   83.2%     15.8%       1.0%   82.3%
  TOP'    448 sets   86.8%     12.3%       0.9%   83.0%

The last set of results presented here regards using the first mapping procedure to map all the hypernym-of, part-of and member-of term-based triples of PAPEL to the synsets of TOP. Table 6 shows the mapping numbers.

Table 6. Results of triples mapping.

                              Hypernym-of   Part-of   Member-of
  Term-based triples          62,591        2,805     5,929
  1st  Mapped                 27,750        1,460     3,962
       Same synset            233           5         12
       Already present        3,970         40        167
       Semi-mapped triples    7,952         262       357
  2nd  Mapped                 88            1         0
       Could be inferred      50            0         0
       Already present        13            0         0
  Synset-based triples        23,572        1,416     3,783

After the first phase, 33,172 triples had both of their terms assigned to a synset, and 10,530 had only one assigned. However, 4,427 were not really added, either because the same synset was assigned to both of the terms or because the triple had already been added after analysing another term-based triple.
In the second phase, where hypernymy links were used, only 89 new triples were mapped and, from those, 13 had previously been added, while another 50 triples were discarded or not attached because they could be inferred. Moreover, 19,638 triples were attached to at least one synset with only one term.

The triples mapping was validated using the Google web search engine to look for evidence on the synset-based triples. Once again, a set of natural language generic patterns, indicative of each relation, was defined. Then, for each triple (S_1 R S_2), each combination of terms a ∈ S_1 and b ∈ S_2 connected by a pattern indicative of R⁹ was searched for. Table 7 shows the results obtained for each validated sample, according to the triple validation score, calculated by the following expression, where found(a, R, b) = 1 if evidence is found for the triple and 0 otherwise:

  score(S_1 R S_2) = ( Σ_{i=1..|S_1|} Σ_{j=1..|S_2|} found(a_i, R, b_j) ) / ( |S_1| × |S_2| )

⁹ The patterns used for part-of and member-of were the same, because these relations can be expressed in very similar ways.

Table 7. Automatic validation of synset-based triples.

  Relation      Sample size   Validation
  Hypernym-of   419 synsets   44.1%
  Member-of     379 synsets   24.3%
  Part-of       290 synsets   24.8%

The second mapping procedure has not been evaluated yet but, besides being more generic, it maps every triple and not only part of the triple set.

6 CONCLUDING REMARKS

This ongoing research is an answer to the growing demand for semantically aware applications and addresses the lack of public domain lexico-semantic resources for Portuguese. The tools for knowledge extraction and the lexical ontology itself might be useful for