Text and Data Quality Mining in CRIS

Otmane Azeroual
German Center for Higher Education Research and Science Studies (DZHW), Schützenstraße 6a, 10117 Berlin, Germany; azeroual@dzhw.eu
Received: 23 September 2019; Accepted: 25 November 2019; Published: 28 November 2019

Abstract: Scientific institutions that document their research information comprehensively and maintain it well in a current research information system (CRIS) have the best prerequisites for implementing text and data mining (TDM) methods. Using TDM helps to identify and eliminate errors, improve processes, develop the business, and make informed decisions. In addition, TDM increases understanding of the data and its context. This improves not only the quality of the data itself but also the institution's handling of the data and, consequently, the analyses based on it. The present paper deploys TDM in CRIS to analyze, quantify, and correct unstructured data and its quality issues. Bad data leads to increased costs or wrong decisions. Ensuring high data quality is therefore an essential requirement when setting up a CRIS project. User acceptance of a CRIS depends, among other things, on data quality: not only the objective data quality is decisive, but also the subjective quality that the individual user attributes to the data.

Keywords: current research information systems (CRIS); research information; text and data mining (TDM); data quality; knowledge exploration; knowledge transfer; decision making; user acceptance

1. Motivation

Different research institutions use research information for different purposes. Data analyses and reports based on current research information systems (CRIS) provide information about research activities and their results. As a rule, management and controlling use the research information from the CRIS for reporting; for example, trend analyses support business strategy decisions, and rapid ad hoc analyses make it possible to respond effectively to short-term developments. Ultimately, the analysis results and the interpretations and decisions derived from them depend directly on the quality of the data. Data quality is most easily defined as "suitability for use". The many manifestations, causes, and effects of poor data quality (such as incorrect management decisions and cost increases) make it clear that, especially in our information age, it is essential to look at methods that increase and maintain data quality [1,2]. Decisions can only be implemented successfully and profitably if the information base, in other words, the research information in the CRIS, is of high quality and therefore reliable.

A CRIS is understood to be a database or a federated information system that collects, manages, and provides information about research activities and research results [3,4]. The information considered here is metadata about research activities (such as persons, projects, third-party funds, patents, partners, awards, publications, doctorates, and habilitations). Further in-depth information on CRIS and data quality can be found in [1-8].

Recently, research activities and their results at universities and academic institutions have been collected, maintained, and published via CRIS in a variety of forms and from heterogeneous data sources [5].
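To make the kind of metadata such a system holds more concrete, the following is a minimal, hypothetical sketch of a CRIS publication record in Python; the field names and types are illustrative assumptions, not an actual CERIF or RCD schema.

    from dataclasses import dataclass, field
    from typing import List, Optional

    # Hypothetical, simplified CRIS metadata record for a publication.
    # Field names are illustrative assumptions, not a real CERIF/RCD schema.
    @dataclass
    class PublicationRecord:
        title: str
        authors: List[str]                 # person names or identifiers (e.g., ORCID iDs)
        year: Optional[int] = None         # publication year
        doi: Optional[str] = None          # e.g., "10.3390/info10120374"
        project: Optional[str] = None      # linked project or third-party funding
        institution: Optional[str] = None  # affiliated organizational unit
        keywords: List[str] = field(default_factory=list)

    # Example record as it might be imported from an external publication database.
    record = PublicationRecord(
        title="Text and Data Quality Mining in CRIS",
        authors=["Otmane Azeroual"],
        year=2019,
        doi="10.3390/info10120374",
        institution="DZHW",
    )
    print(record)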
The introduction of a CRIS into research institutions means that they must provide the required information about research activities and research results in an assured quality [6,8]. Poor data quality means that analyses and evaluations are faulty or difficult to interpret. The quality problems that occur in CRIS include, on the one hand, spelling mistakes, missing data, incorrect data, wrong formatting, duplicates, and contradictory data and, on the other hand, unstructured data formats. These problems can arise when data is captured from various independent information systems (such as external publication databases, identifiers (ORCID, DOI, CrossRef), external project data, etc.) and from different standardized exchange formats (e.g., the CERIF or RCD data model). Low data quality can negatively impact business processes and lead to erroneous decision-making.

Much of the information in CRIS is in the form of text documents; unstructured data includes, for example, personal information, publication data, or project data in Word, PDF, or XML files. Unstructured data therefore presents a major challenge for CRIS administrators, especially for universities and academic institutions that manage their research information from heterogeneous data sources in a CRIS [5]. The information age makes it easy to store huge amounts of data. The proliferation of documents on the internet, in institutional intranets, in news wires, and in blogs is overwhelming. Although the amount of available research information is constantly growing, the possibilities to record and process it remain limited. Search engines additionally aggravate this problem because a few entries in a search mask can make a very large number of documents accessible.

Knowledge about research activities and their results is becoming an increasingly important success factor for an institution and should be extracted from this document base. However, reading and understanding texts in order to gain knowledge is a domain of the human intellect, and its capacity is limited. Software analysis, as a largely automated process of obtaining new and potentially useful knowledge from text documents, can overcome this shortcoming.

Due to the abundance and rapid growth of digital, unstructured data, TDM is becoming increasingly important. TDM is a technique for extracting knowledge from texts that is as yet unknown to the user; unlike classical data mining, it is applied not to condensed, preselected database input but to data captured in text form [9]. TDM creates the opportunity to conduct efficient and structured information and knowledge exploration and, moreover, provides good support in the management of most of the existing data [10,11].

By means of statistical and linguistic analysis, the methods of TDM aim at the detection of hidden and interesting information or patterns in unstructured text documents, on the one hand to be able to process the huge number of words and structures of natural language, and on the other hand to allow the treatment of uncertain and fuzzy data. According to [12], TDM, as a new field of research, is a promising attempt to solve the problem of information overload by using methods such as natural language processing (NLP), information extraction (IE), and clustering.
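As a rough, hypothetical illustration of how such quality problems can be flagged on imported records (the field names, the DOI and ORCID patterns, and the duplicate criterion are assumptions for illustration, not rules from this paper):

    import re
    from collections import Counter

    # Hypothetical CRIS records as dictionaries; field names are illustrative assumptions.
    records = [
        {"title": "Text and Data Quality Mining in CRIS", "year": 2019,
         "doi": "10.3390/info10120374", "orcid": "0000-0002-1825-0097"},
        {"title": "Text and Data Quality Mining in CRIS", "year": 2019,
         "doi": "10.3390/info10120374", "orcid": "0000-0002-1825-0097"},  # exact duplicate
        {"title": "Data Quality in CRIS", "year": None,
         "doi": "10,3390/info", "orcid": "1234"},  # missing year, malformed identifiers
    ]

    DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")                 # rough DOI syntax check
    ORCID_PATTERN = re.compile(r"^\d{4}-\d{4}-\d{4}-\d{3}[\dX]$")  # rough ORCID iD syntax check

    def check_record(rec):
        """Return a list of quality problems found in a single record."""
        problems = []
        for required in ("title", "year", "doi"):
            if not rec.get(required):
                problems.append(f"missing value for '{required}'")
        if rec.get("doi") and not DOI_PATTERN.match(rec["doi"]):
            problems.append(f"malformed DOI: {rec['doi']}")
        if rec.get("orcid") and not ORCID_PATTERN.match(rec["orcid"]):
            problems.append(f"malformed ORCID iD: {rec['orcid']}")
        return problems

    # Flag exact duplicates by (title, year): a deliberately simple duplicate criterion.
    counts = Counter((r["title"], r.get("year")) for r in records)

    for i, rec in enumerate(records):
        issues = check_record(rec)
        if counts[(rec["title"], rec.get("year"))] > 1:
            issues.append("possible duplicate (same title and year)")
        print(f"record {i}: {issues or 'no problems found'}")

In practice, such rule-based checks would only be a first pass; fuzzy matching and source-specific rules would be needed for real CRIS imports.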
The hidden and stored unstructured data and sources in CRIS can play an important role in decision-making [5]. The application of TDM in CRIS is scarcely widespread, although the techniques of TDM and their potential have already been considered and discussed in two articles [5,7]. So far, possible applications have been established only with respect to ensuring data quality. With scalable algorithms from TDM methods, universities and academic institutions can efficiently analyze even very large amounts of data in CRIS and detect inadequate data (e.g., by detecting duplicates, erroneous and incomplete data sets, and by identifying outliers and logical errors). In this way, the quality of the existing data can be significantly and systematically improved.

For this reason, this paper highlights data quality in the context of TDM in CRIS. The purpose of the paper is to present the state of development of TDM technology in CRIS, to discuss possible applications, and to show existing practical applications. The focus is on TDM methods such as NLP, IE, and clustering. To this end, Section 2 gives an overview of data quality in CRIS and of the meaning of TDM; after that, the problems of unstructured data in CRIS are addressed. In Section 3, the methods of TDM are described in the context of CRIS, and a practical example for each method shows how CRIS managers can analyze and improve their unstructured data and derive important insights for their own organization. Finally, the results are summarized in Section 4.

2. Fundamentals

2.1. Data Quality in CRIS

There are many definitions of the term data quality in the literature. Here, data quality is understood as follows:

•  "Data Quality [is] data that [is] fit for use by data consumers." [13]
•  "[Exactly] the right data and information in exactly the right place at the right time and in the right format to complete an operation, serve a customer, make a decision, or set and execute a strategy." [14]

All the definitions mentioned have in common that the data must be "fit for purpose" from the point of view of the person processing it: data that support the intended use can be described as being of good quality. Data in a CRIS are therefore of high quality if they meet the needs of the user. However, this also means that data quality can only be assessed individually. Quality is a relative, not an absolute, property; the quality of data can therefore only be assessed relative to its respective use.

Assessing and determining the quality of data within such a CRIS is nevertheless a challenging task. In order to evaluate data quality, metrics are used which can be derived on the basis of different approaches (theoretical, empirical, or intuitive) and which differ accordingly with regard to the dimensions considered. A selection of derived metrics and dimensions can be found, for example, in [15] or [13]. In the context of CRIS, for example, the following criteria may apply [3,4,8] (a simple operationalization is sketched after the list):

•  Completeness
•  Correctness
•  Consistency
•  Timeliness
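As a rough, hypothetical sketch (not the framework from [3] or the metrics from [13,15]), these four dimensions could be operationalized as simple ratio metrics over CRIS publication records; the field names, plausibility rules, and the two-year timeliness threshold are assumptions for illustration.

    from datetime import date

    # Hypothetical CRIS records; field names and rules are illustrative assumptions.
    records = [
        {"title": "Paper A", "year": 2019, "doi": "10.1000/a", "last_updated": date(2019, 11, 28)},
        {"title": "Paper B", "year": 2035, "doi": None, "last_updated": date(2015, 1, 10)},
    ]

    REQUIRED_FIELDS = ("title", "year", "doi")

    def completeness(rec):
        """Completeness: share of required fields that are actually filled."""
        return sum(1 for f in REQUIRED_FIELDS if rec.get(f)) / len(REQUIRED_FIELDS)

    def correctness(rec):
        """Correctness (rough plausibility rule): publication year must not lie in the future."""
        return 1.0 if rec.get("year") and rec["year"] <= date.today().year else 0.0

    def timeliness(rec, max_age_days=2 * 365):
        """Timeliness: record counts as timely if updated within an assumed maximum age."""
        return 1.0 if (date.today() - rec["last_updated"]).days <= max_age_days else 0.0

    def consistency(rec, all_records):
        """Consistency: no other record shares the DOI but carries a different title."""
        for other in all_records:
            if other is not rec and rec.get("doi") and other.get("doi") == rec["doi"] \
                    and other.get("title") != rec["title"]:
                return 0.0
        return 1.0

    for metric in (completeness, correctness, timeliness):
        score = sum(metric(r) for r in records) / len(records)
        print(f"{metric.__name__}: {score:.2f}")
    print(f"consistency: {sum(consistency(r, records) for r in records) / len(records):.2f}")

In practice, the rules behind each dimension would be derived from the institution's own reporting requirements rather than fixed in code like this.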
Data quality can be measured by these four dimensions and their metrics. Measurements help to express observations as numbers, which makes comparisons possible: objects can be compared with each other, or the development of one object can be tracked over time. The measured values can then serve as the basis for decisions. For this, the measurements must be understandable, reproducible, and expedient.

•  Measurements must be understandable to serve their purpose. The results cannot help in decision-making if no one understands what has been measured and what exactly the results mean. This underlines the importance of metadata that documents the measurements and the results and helps the data consumer understand the context and interpret the results.
•  Measurements must be reproducible. Inconsistent measurements mean that the results have little or no significance. To show whether the quality of a record improves or deteriorates, the same data must be measured using the same methods. As a result, comparisons between different objects become possible.
•  Measurements must be expedient. A measurement should capture what helps to reduce the uncertainty of a decision. Measurements serve a purpose and help with concrete problems.

A comprehensive overview of the topic and a description of various data quality metrics can be found in [3]. The relationship between the quality of the data and the quality of the analyses based on it plays an important role. A framework for measuring data quality in CRIS is presented in [3].

Ensuring a sustainable and effective increase of data quality in CRIS requires continuous data quality management. Punctual data cleansing has only a short-term effect: the resulting improvements are quickly lost, especially with frequently changing data. Therefore, data quality should not be treated as a one-time action. To effectively improve data quality, holistic methods (such as data cleansing) are needed that look at data throughout its lifecycle in order to guarantee a defined level of quality [2]. The goal of data cleansing is to find and correct incorrect, duplicate, inconsistent, incorrectly formatted, inaccurate, and irrelevant data in CRIS. As part of the cleanup, data is, for example, supplemented, deleted, reformatted, or adjusted. After the cleanup, the data is of higher quality and allows organizations to work with greater reliability. The process of data cleansing consists of five steps or phases (parsing, standardization, matching, merging, and enrichment). Depending on the information system and the required target quality, these individual steps must be repeated several times; in many cases, data cleansing is a continuous, periodic process. The application of the data cleansing process in CRIS is described in [2,4].
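A minimal, hypothetical sketch of these five phases on CRIS publication data follows; the example rules (title normalization, matching by DOI, enrichment of a missing year from an assumed lookup table) are illustrative assumptions, not the procedure described in [2,4].

    # Hypothetical illustration of the five cleansing phases: parsing, standardization,
    # matching, merging, and enrichment. Field names and rules are assumptions.

    REFERENCE_YEARS = {"10.3390/info10120374": 2019}  # assumed external lookup source

    def parse(raw):
        """Parsing: split a raw semicolon-separated import line into named fields."""
        title, year, doi = (part.strip() for part in raw.split(";"))
        return {"title": title, "year": int(year) if year.isdigit() else None, "doi": doi or None}

    def standardize(rec):
        """Standardization: normalize capitalization and identifier formatting."""
        rec["title"] = rec["title"].strip().title()
        if rec["doi"]:
            rec["doi"] = rec["doi"].lower().replace("https://doi.org/", "")
        return rec

    def match(records):
        """Matching: group records that refer to the same publication (here: same DOI)."""
        groups = {}
        for rec in records:
            groups.setdefault(rec["doi"], []).append(rec)
        return groups

    def merge(group):
        """Merging: collapse a group of matched records, preferring filled-in values."""
        merged = {}
        for rec in group:
            for key, value in rec.items():
                merged.setdefault(key, value)
                if merged[key] in (None, "") and value:
                    merged[key] = value
        return merged

    def enrich(rec):
        """Enrichment: fill missing values from an external reference source."""
        if rec["year"] is None and rec["doi"] in REFERENCE_YEARS:
            rec["year"] = REFERENCE_YEARS[rec["doi"]]
        return rec

    raw_lines = [
        "text and data quality mining in CRIS; ;https://doi.org/10.3390/INFO10120374",
        "Text And Data Quality Mining In Cris;2019;10.3390/info10120374",
    ]
    parsed = [standardize(parse(line)) for line in raw_lines]
    cleaned = [enrich(merge(group)) for group in match(parsed).values()]
    print(cleaned)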
2.2. Definition of the Term TDM

The approaches to defining TDM are manifold. The authors of [16] compare the differences in the handling of structured and textual data and note that structured data is managed by means of database systems, whereas text data, due to its lack of structure, is typically handled by search engines. Unlike database queries, search engines are queried with keywords. In order to increase the effectiveness and efficiency of search engines, great progress has been achieved within information retrieval in the areas of text clustering, text categorization, text summarization, and recommendation services. Information retrieval is traditionally focused on easy access to information rather than on analysis, which is the primary goal of TDM. While the goal of accessing information is to connect the right information to the right user at the right time, TDM tools go further and help the user analyze and understand that information in order to make appropriate decisions. Other TDM tools have the task of analyzing text data to detect interesting patterns, trends, or outliers without the need for a query [16].

This brief description of the tasks of TDM tools already demonstrates the wide range of TDM tasks, which leads to different definitions of TDM in the literature. The authors of [9] note that "this unified problem description [...] is opposed by competing text mining specifications". This is also reflected in the variety of names in the history of TDM, such as textual data mining, text knowledge engineering, or knowledge discovery in text. The authors of [9,17] introduced the term knowledge discovery in textual databases (derived from Knowledge Discovery in Databases, KDD) in 1995. In 1999, the term text data mining (TDM) was coined, from which the term used today derives. Corresponding to this multiplicity of designations, conflicting task assignments and definition approaches exist [9]. The authors of [9] differentiate four perspectives on TDM; the first view is the approximation of information retrieval as described by [16] and includes improvements through text summaries and information extraction [9].

TDM is interdisciplinary and uses findings from the fields of computer science, mathematics, and statistics for the computer-aided analysis of databases [5]. TDM is the systematic application of computer-aided methods to find patterns, trends, or relationships in existing databases [18]. "TDM, also known as text mining or knowledge discovery from textual databases, refers generally to the process of extracting interesting and non-trivial patterns or knowledge from unstructured text documents. It can be viewed as an extension of data mining or knowledge discovery from (structured) databases. [...] Text mining is a multidisciplinary field, involving information retrieval, text analysis, information extraction, clustering, categorization, visualization, database technology, machine learning, and data mining" [17]. TDM also helps to improve the search for literature in databases; moreover, the analysis, storage, and availability of information on various websites and in search engines are made more efficient and accurate by this technique [5].

2.3. Problems of Unstructured Data in CRIS

The basis of all reporting for decision support in universities and research institutions is the CRIS, which draws its data from various operational and external sources and makes it available in structured form. Due to the huge advances in hardware and software, the use of mobile devices, and the spread of the internet, semi-structured data (such as XML or HTML files) and unstructured data have emerged: text documents, memos, e-mails, RSS feeds, blog entries, short messages such as tweets, forum posts, comments in social networks, and free-text input in forms, but also pictures, video, and audio data. The development of communication technologies allows fast, easy, and also mobile input of this data, which forms a huge repository. Specifically, the internet makes it possible for many users to easily create and store large amounts of text data [16].

Academic institutions are faced with the challenge of finding relevant information in ever-larger databases. About 80% of the research information of a university is not available as machine-processable, structured data, but as unstructured, not directly machine-processable data, i.e., as documents. Unstructured data is data that does not have a formal structure and therefore cannot easily be stored, like structured data, in a database such as a CRIS. Unstructured data must therefore first be prepared or structured before it can be evaluated; its exact content is not known before a data analysis.
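As a rough, hypothetical illustration of such preparation (the free-text snippet and the extraction patterns are assumptions for illustration), a simple rule-based extraction can turn running text into a structured intermediate record:

    import re

    # Hypothetical free-text snippet as it might appear in an imported document.
    text = ("The project report was written by Otmane Azeroual (azeroual@dzhw.eu) at DZHW "
            "and published in 2019, see doi:10.3390/info10120374 for the full article.")

    # Illustrative extraction patterns; real CRIS imports would need far more robust rules.
    patterns = {
        "email": r"[\w.+-]+@[\w-]+\.[\w.]+",
        "doi": r"10\.\d{4,9}/[^\s,]+",
        "year": r"\b(19|20)\d{2}\b",
    }

    structured = {}
    for field, pattern in patterns.items():
        match = re.search(pattern, text)
        structured[field] = match.group(0) if match else None

    print(structured)
    # e.g. {'email': 'azeroual@dzhw.eu', 'doi': '10.3390/info10120374', 'year': '2019'}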
The best solution to the problem of unstructured data can be the implementation of the TDM methodology at universities and research institutes, which helps them search their unstructured data efficiently and effectively. The effort required to collect, store, and evaluate research information must therefore be justified. For research information to become a value-adding component of an academic institution, several aspects have to be considered:

•  The availability of the data must be guaranteed.
•  The quality of the data must be good.
•  Responsibilities in universities and academic institutions must be regulated.
•  Data know-how must be available.

3. Employing TDM Methods in CRIS

To investigate large amounts of text, a manual approach is not sufficient; yet the vast amounts of textual documents that organizations hold must be managed and analyzed. For this purpose, the method of TDM was developed, which can be understood as a special form of data mining. TDM methods can be applied to unstructured data in CRIS. Basically, they are about information search and information retrieval from heterogeneous data sources by finding interesting patterns [19]. The special features of TDM are the structure of the data as well as its origin: the data type to be examined is unstructured or semi-structured text from internal or external sources. In the area of research management, TDM helps to improve the search for literature in CRIS; moreover, the analysis, storage, and availability of research information on various websites and in search engines are made more efficient and accurate by this technique. The following steps are required to obtain information from unstructured data in the context of CRIS [12,20] (a minimal sketch of the three steps follows at the end of this section):

1. Application of pre-processing routines to heterogeneous data sources.
2. Application of algorithms for the discovery of patterns.
3. Presentation and visualization of the results.

For TDM, it is necessary to recognize and filter representative features from the natural-language heterogeneous data sources and thus to create a structured intermediate representation of the texts. There is a variety of different TDM tasks and methods in science. TDM is a seemingly everyday activity that becomes a demanding task in machine processing, since it combines different methods of text preprocessing and analysis. The present paper is limited to the three most frequently mentioned methods, which have already been dealt with in the context of CRIS in [5]. The following methods are discussed in detail in the context of CRIS: natural language processing (NLP), information extraction (IE), and clustering.

Figure 1 below uses this workflow as a guideline for analyzing research information at CRIS institutions with TDM methods and provides a permanent safeguard, since unstructured and erroneous data in a collection present a fundamental challenge to CRIS managers.
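As a rough, hypothetical illustration of the three steps listed above, the following sketch pre-processes a handful of invented publication titles with TF-IDF, discovers groups with k-means clustering, and prints the result; the corpus, the choice of scikit-learn, and the number of clusters are assumptions for illustration, not the workflow of Figure 1.

    # Hypothetical three-step TDM workflow: pre-process, discover patterns, present results.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    documents = [
        "Data quality management in current research information systems",
        "Improving data quality with data cleansing in CRIS",
        "Clustering of publication metadata for research reporting",
        "Topic clusters in research project descriptions",
        "Third-party funding and patents in university research",
        "Research funding analysis for academic institutions",
    ]

    # Step 1: pre-processing - tokenize, remove English stop words, weight terms with TF-IDF.
    vectorizer = TfidfVectorizer(stop_words="english")
    features = vectorizer.fit_transform(documents)

    # Step 2: pattern discovery - group similar documents with k-means (k chosen arbitrarily).
    kmeans = KMeans(n_clusters=3, random_state=0, n_init=10)
    labels = kmeans.fit_predict(features)

    # Step 3: presentation - list the documents per cluster for the CRIS manager.
    for cluster_id in sorted(set(labels)):
        print(f"Cluster {cluster_id}:")
        for doc, label in zip(documents, labels):
            if label == cluster_id:
                print(f"  - {doc}")

Dedicated NLP and IE steps would replace or extend the simple tokenization used here, depending on the sources a CRIS manager needs to analyze.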