Effects of Classification Techniques on Medical Reports Classification

ISSN 2277-3061 | April 09, 2014

Effects of Classification Techniques on Medical Reports Classification

Elfadil A. Mohamed (1), Fathi H. Saad (2), Omer I. E. Mohamed (3)
(1) College of Business Administration, Al Ain University of Science and Technology, UAE, elfadil.mohammed@aau.ac.ae
(2) NHS Oxfordshire, Oxford, UK, fathi.saad@nhs.net
(3) Department of Computer Science, Alkhawarizmi College, UAE, omareldai@gmail.com

ABSTRACT
Text classification is the process of assigning pre-defined category labels to documents based on what a classifier has learned from training examples. This paper investigates the partially supervised classification approach in the medical field. The approaches evaluated include Rocchio, Naïve Bayesian (NB), Spy, Support Vector Machine (SVM), and Expectation Maximization (EM). Combinations of these methods were also evaluated. The experimental results showed that combinations which use EM in step 2 always produce better results than those using SVM when a small set of training samples is used. We also found that reducing the features based on tf-idf values decreases the classification performance dramatically. Moreover, reducing the features based on their frequencies improves the classification performance significantly while also increasing efficiency, although it may require some experimentation.

INDEXING TERMS/KEYWORDS
Document classification, positive-class based learning, partially supervised classification, labelled and unlabeled data, medical text mining, features reduction

Council for Innovative Research, Peer Review Research Publishing System
Journal: INTERNATIONAL JOURNAL OF COMPUTERS & TECHNOLOGY, Vol 13, No. 2
editor@cirworld.com, www.cirworld.com, www.ijctonline.com

1. INTRODUCTION
Classification is a form of data analysis that extracts models describing important data classes [10].
The extracted models are called classifiers and are used to predict categorical class labels. The medical field has recently received great attention regarding the analysis of medical data that is available in electronic form. Medical data are either unstructured or semi-structured, which makes them difficult to analyze using traditional data mining techniques. Medical staff need automatic classification methods to analyze and categorize this huge amount of data. The Gastroenterology unit of a local hospital in the UK had just such a problem: it collected electronic reports on thousands of colonoscopy procedures but could not answer simple questions, such as the percentage of successful colonoscopies undertaken [34]. The aim of colonoscopy is to check for medical problems such as bleeding, colon cancer, polyps, colitis, etc. [6]. Text classification is a two-step process, consisting of a learning step and a classification step. When the class label of each training example is provided, this is called supervised learning. Supervised classification has recently been used in the medical domain with small numbers of training samples; see for example [15]. However, supervised learning suffers from the considerable effort required to manually label a large number of training examples for every class, particularly for multi-class problems. As supervised learning methods, most existing text classification algorithms require sufficient training examples so that the obtained classification model produces accurate results [22]. When the number of training examples in each class decreases, the classification accuracy of traditional text classification algorithms degrades dramatically. In the medical domain this is a serious problem, because labelled documents are often very sparse: manually labelling data is tedious, costly, and time-consuming.
To address this problem, exploiting unlabeled documents in text classification has recently become an active research topic. This has led to a new approach called partially supervised classification. A number of techniques for partially supervised text classification have been reported recently. Most of these techniques require labelled training examples for all classes. It has been reported that those techniques obtain considerable improvement over traditional supervised techniques when the set of training examples is small [22]. The partially supervised approach is therefore very attractive for the medical domain, since medical practitioners are often very busy dealing with patients and cannot be expected to spend large amounts of time labelling data. The approach we used in this paper was recently introduced [1, 2] for binary text classification problems; further evaluation of the approach can be found in [34]. This line of work is based on the use of a large set of unlabeled documents and a small set of labelled documents for every class so as to reduce the labelling effort. The approach we used takes this idea further and uses only positive and unlabeled documents to learn the classifier, based on the theoretical study reported in [35], cutting down the labelling effort even more. This technique thus differs from other partially supervised classification techniques in that it does not require labelling of negative training examples. The approach is a two-step strategy: step 1 identifies a set of reliable negative documents from the unlabeled documents, and step 2 builds the final classifier. There are a number of algorithms that are applicable in step 1 and step 2. Deciding which algorithms should be applied is not a trivial task, but it is required for the effective application of the technique to real-world data.
The main purpose of this paper is to perform a comprehensive practical evaluation of partially supervised classification approaches for classifying real-world medical reports. The approaches that will be investigated are Rocchio (Roc), NB, and Spy for step 1, and SVM and EM for step 2. The methods available in each step will be tested in combination, and the combination that produces the best performance according to the evaluation measures will be recommended. The evaluation will be performed on a real-world medical problem: the classification of a set of colonoscopy reports. For further efficiency, we will also experiment with reducing the set of features used to represent a document. The rest of the paper is organized as follows: section 2 reviews related work. Section 3 explains the methods and algorithms used by the partially supervised classification approach. Section 4 describes the data set used. Document representation is explained in section 5. The performance measures used are presented in section 6. Document pre-processing is outlined in section 7. Our methodology is described in section 8. Section 9 presents and analyzes the results. Finally, section 10 concludes the paper.

2. RELATED WORK
The problem of labelled-document availability has resulted in a new research direction that focuses on exploiting unlabelled documents in text classification. Here is a brief description of some related methods that use unlabelled examples to improve text classification. The Co-Training approach [22] splits the feature set as x = (x1, x2) and trains two classifiers θ1 and θ2, each of which is sufficient for classification. The algorithm initially constructs two classifiers based on the labelled data, and each classifier then selects several confident examples to expand the other's training set. The approach assumes that the two feature sets are conditionally independent.
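The mutual-teaching loop just described can be sketched in a few lines. This is a toy illustration under stated assumptions: NB base classifiers from scikit-learn stand in for whichever learners a given paper uses, the two "views" X1 and X2 are pre-built count matrices, and the function name, round count, and growth size k are invented for this sketch.

```python
# Toy sketch of the Co-Training loop: two classifiers, one per feature
# view, each adding its most confident unlabelled documents (with its
# own predicted labels) to the shared training set.
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def co_train(X1, X2, y, labelled, pool, rounds=5, k=2):
    labelled, pool = list(labelled), list(pool)
    labels = np.asarray(y, dtype=float)   # pool entries are overwritten below
    c1, c2 = MultinomialNB(), MultinomialNB()
    for _ in range(rounds):
        if not pool:
            break
        c1.fit(X1[labelled], labels[labelled])
        c2.fit(X2[labelled], labels[labelled])
        for clf, X in ((c1, X1), (c2, X2)):
            if not pool:
                break
            # pick the k pool documents this view is most confident about
            conf = clf.predict_proba(X[pool]).max(axis=1)
            for j in sorted(np.argsort(conf)[-k:], reverse=True):
                doc = pool.pop(j)
                labels[doc] = clf.predict(X[[doc]])[0]  # classifier's label
                labelled.append(doc)
    c1.fit(X1[labelled], labels[labelled])
    return c1
```

Note the design point the paragraph makes: each classifier teaches the other through the shared labelled set, which only helps when the two views give genuinely independent evidence.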
A similar approach is reported in [24]; instead of using two conditionally independent feature sets, it co-trains two SVM classifiers using two feature spaces: one is the original feature space, and the other is derived from clustering the labelled and unlabeled data. Another co-training based approach [25] used two hybrid algorithms, co-EM and self-training, with two randomly split feature sets in a co-training setting. More reports using the co-training approach can be found in [26, 27]. The Transductive Support Vector Machine (TSVM) approach [23] maximizes the margin over both the labelled and the unlabelled data. It works by finding a labelling of the unlabeled data Du and a hyperplane which separates both Dl and Du with maximum margin. Other works reporting improved classification using unlabelled examples include [28, 29, 30]. All of the above methods use a small set of labelled examples for every class and a large set of unlabelled examples to improve the classification; in the case of binary classification, small labelled sets for both the positive and negative classes are required. A more recent approach focuses on the binary text classification problem. It combines the advantages of the partially supervised classification approach, which uses a small set of labelled examples for every class and a large set of unlabelled examples, with the advantages of the theoretical study [10], which requires labelling only a small set of the positive class. Some studies call this approach positive-class based learning [1, 13], but in many reports it is still known as partially supervised classification. This is a two-step approach: the first step identifies a set of reliable negative documents, and the second step builds the classifier using the labelled positive and identified negative documents. Many methods applying different approaches to both steps have been reported.
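The two-step strategy just outlined can be sketched end to end. This is an illustrative skeleton only: scikit-learn's LinearSVC, the simple "treat U as negative" heuristic for step 1, and both function names are our assumptions for the sketch, not the procedure of any specific technique cited above.

```python
# Illustrative two-step PU-learning skeleton: step 1 extracts reliable
# negatives RN from the unlabelled set U; step 2 retrains an SVM until
# no new negatives are found in the remaining set Q.
import numpy as np
from sklearn.svm import LinearSVC

def step1_reliable_negatives(Xp, Xu):
    # Treat all of U as negative, train once, and keep the documents
    # the classifier still calls negative as the reliable set RN.
    X = np.vstack([Xp, Xu])
    y = np.array([1] * len(Xp) + [-1] * len(Xu))
    pred = LinearSVC().fit(X, y).predict(Xu)
    return Xu[pred == -1], Xu[pred == 1]        # RN, remaining set Q

def step2_iterative_svm(Xp, RN, Q):
    # Retrain while the current classifier keeps finding negatives in Q.
    while True:
        X = np.vstack([Xp, RN])
        y = np.array([1] * len(Xp) + [-1] * len(RN))
        clf = LinearSVC().fit(X, y)
        if len(Q) == 0:
            return clf
        pred = clf.predict(Q)
        if not np.any(pred == -1):              # converged: nothing new
            return clf
        RN = np.vstack([RN, Q[pred == -1]])     # grow the negative set
        Q = Q[pred == 1]
```

The skeleton makes the division of labour visible: step 1 only has to be conservative enough that RN is trustworthy, and step 2's loop then recovers the remaining negatives incrementally.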
One technique, called PEBL [31], identifies strong reliable negative documents on the principle that strong negative documents are those that do not contain any features of the positive data. After a set of strong negative documents is identified, SVM is applied iteratively to build a classifier. Another technique, called S-EM [2], uses a new method called 'spy' in the first step to identify reliable negative documents, and uses EM in the second step. Reference [13] reports another technique, called Roc-SVM, which uses the Rocchio algorithm for step 1 and SVM for step 2. In [32], one-class SVM is proposed; this technique uses only positive examples to build an SVM classifier. More details about the methods used for step 1 and step 2 are discussed in the partially supervised classification section.

3. PARTIALLY SUPERVISED CLASSIFICATION
As mentioned earlier, the approach we used in this paper is partially supervised classification. This approach uses a reduced set of positive documents, P, and a large set of unlabeled documents, U. There is initially no labelling of negative documents. The first step of the text classification is therefore to identify a reliable set of negative documents, RN, from the unlabeled documents. In this section we describe the techniques that we are going to evaluate for both steps 1 and 2. The algorithms that we are going to use for step 1 are:

Naïve Bayesian classifier (NB) – NB is a popular classification technique and has been reported as performing extremely well in practice for text classification [12]. In NB, a document is considered an ordered list of words, and the vocabulary is the set of all words considered for classification. To perform classification, NB computes the posterior probability Pr(cj|di), where cj is a class and di is a document. In classifying a document di, the class with the highest Pr(cj|di) is assigned as the class of the document.
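As a concrete illustration of this decision rule, the following sketch computes Pr(cj|di) and applies the argmax. The toy documents, class names, and the use of scikit-learn are invented for illustration; they are not the paper's data or code.

```python
# Minimal illustration of the NB rule described above: compute the
# posterior Pr(c_j | d_i) for each class and pick the highest.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_docs = [
    "caecum reached procedure complete",
    "caecum reached no polyps seen",
    "procedure abandoned poor bowel preparation",
    "procedure failed patient discomfort",
]
train_labels = ["successful", "successful", "failed", "failed"]

vec = CountVectorizer()                     # documents as word-count vectors
X = vec.fit_transform(train_docs)
nb = MultinomialNB().fit(X, train_labels)

test_doc = vec.transform(["caecum reached polyps removed"])
posteriors = nb.predict_proba(test_doc)[0]  # Pr(c_j | d_i) for each class c_j
label = nb.classes_[posteriors.argmax()]    # the argmax rule from the text
```

Words unseen in training (here "removed") simply drop out of the vocabulary, which is why NB copes reasonably with short, noisy clinical phrases.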
Identifying a set RN of reliable negative documents from the unlabeled set U is done as shown in Figure 1.

Rocchio (Roc) – Rocchio classifiers are well explained in [11]. In this technique, each document d is represented as a vector, and each element in the vector represents a word, weighted using the tf-idf scheme [20]. The unlabelled set U is treated as the negative set; the positive set P and U are then used as training data to build a Rocchio classifier, which is used to classify U. The algorithm that uses Rocchio to identify a set RN of reliable negative documents from U is the same as that in Figure 1, except that a Rocchio classifier is used instead of NB.

Spy technique – this technique was introduced in [2]. The name reflects the fact that some documents randomly selected from P are added to U and act as spies from the positive set, P, in the unlabeled set, U. They form the spy set, S, and behave like the unknown positive documents in U, which allows the classification algorithm to infer the behavior of those unknown positives. The technique then runs the EM algorithm using the set P − S as positive and the set U ∪ S as negative. After EM completes, a threshold t is employed to make the decision: those documents in U with probabilities Pr(1|d) lower than t are the most likely negative documents, RN; those documents in U (spies not included) with probabilities higher than t remain unlabeled. The reader is referred to [2] for details. Figure 2 below shows the Spy algorithm in S-EM.

Step 2 iteratively applies a classification algorithm to the newly labelled data. Since some documents are still in the unlabeled set, U − RN, the chosen classifier is applied repeatedly to the data, with the intention of extracting more possible negative data at each iteration and improving the overall performance of the classifier. The procedure stops when no further negative documents are found in the unlabeled set, U − RN.
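The Spy procedure can be compressed into a short sketch. Note the simplifying assumptions: a single NB fit stands in for the full EM run, the 15% spy rate, the seed, the function name, and the choice of the minimum spy probability as the threshold t are all ours, not S-EM's exact settings.

```python
# Simplified Spy step 1: plant spies from P inside U, fit NB, and use
# the spies' posterior scores to set the threshold t below which
# documents in U are taken as reliable negatives RN.
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def spy_reliable_negatives(P, U, s=0.15, seed=0):
    rng = np.random.RandomState(seed)
    spy_idx = rng.choice(len(P), max(1, int(s * len(P))), replace=False)
    Ps = np.delete(P, spy_idx, axis=0)        # P - S
    Us = np.vstack([U, P[spy_idx]])           # U with the spy set S appended
    X = np.vstack([Ps, Us])
    y = np.array([1] * len(Ps) + [-1] * len(Us))
    nb = MultinomialNB().fit(X, y)
    pos_col = list(nb.classes_).index(1)
    prob = nb.predict_proba(Us)[:, pos_col]   # Pr(1|d) for every doc in Us
    t = prob[len(U):].min()                   # the lowest spy probability
    return U[prob[:len(U)] < t]               # RN: below every spy's score
```

The spies calibrate t: since they are known positives hiding in U, any document scoring below all of them is very unlikely to be positive, which is exactly the "reliable negative" intuition of Figure 2.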
There are two classifiers within this step that will be tested: Expectation-Maximization (EM) and Support Vector Machines (SVM).

Expectation-Maximization (EM) – EM is an iterative algorithm for maximum likelihood estimation in problems with missing data [16, 19]. It consists of two steps, the Expectation step and the Maximization step. The Expectation step essentially fills in the missing data; in our case, it produces and revises the probabilistic labels of the documents in U − RN. The parameters are estimated in the Maximization step after the missing data are filled in. EM converges when its parameters stabilize. Referring to the EM algorithm shown in Figure 3 below, using NB in each iteration, EM employs the same equations as those used in building a NB classifier (line 3 for the Expectation step, and lines 1 and 2 for the Maximization step). The class probability given to each document takes a value between 0 and 1. Basically, EM iteratively runs NB to revise the probabilistic label of each document in the set Q = U − RN. Since each iteration of EM produces a NB classifier, S-EM also has a mechanism to select a good classifier [2, 17].

Support Vector Machines (SVMs) – SVMs are linear functions of the form f(x) = w^T x + b, where w^T x is the inner product between the weight vector w and the input vector x. SVM selects a hyperplane that separates the positive and negative examples while maximizing the smallest margin [1]. Since a set RN of reliable negatives is identified in step 1, SVM uses P, RN and U − RN and runs iteratively to build the classifiers. In each iteration, newly identified negatives are added to RN to build a new classifier. The iteration converges when no document in U − RN is classified as negative. The final classifier is then selected from the set of classifiers. The reader is referred to [13] for a detailed description of the SVM algorithm.

1. Assign each document in P the class label 1;
2. Assign each document in U the class label -1;
3. Build a NB classifier using P and U;
4. Use the classifier to classify U. Those documents in U that are classified as negative form the reliable negative set RN.
Figure 1. The NB method for Step 1.

1. RN = NULL;
2. S = Sample(P, s%);
3. Us = U ∪ S;
4. Ps = P − S;
5. Assign each document in Ps the class label 1;
6. Assign each document in Us the class label -1;
7. EM(Us, Ps); // This produces a NB classifier.
8. Classify each document in Us using the NB classifier;
9. Determine a probability threshold t using S;
10. for each document d ∈ Us
11.   if its probability Pr(1|d) < t then
12.     RN = RN ∪ {d};
Figure 2. The algorithm of the Spy technique in S-EM.

1. Each document in P is assigned the class label 1;
2. Each document in RN is assigned the class label -1;
3. Each document d ∈ Q (= U − RN) is not assigned any label initially. At the end of the first iteration of EM, it is assigned a probabilistic label, Pr(1|d). In subsequent iterations, the set Q participates in EM with its newly assigned probabilistic classes;
4. Run the EM algorithm using the document sets P, RN and Q until it converges.
Figure 3. The EM algorithm with the NB classifier.

4. DATA SET
In this paper we focused on the analysis of colonoscopy procedures, since this is the most pressing task from a medical point of view, given the introduction of colorectal screening in the UK in 2006. Colonoscopy refers to the passage of the colonoscope from the lowest part (anus and rectum) right around the colon to the caecum, and in some cases into the terminal ileum via the ileo-caecal valve. The aim of colonoscopy is to check for medical problems such as bleeding, colon cancer, polyps, colitis, etc.
After each colonoscopy procedure, the endoscopist at the NNUH would generate a detailed report about the current status of the examined part of the body and the result of the procedure itself, using a software package called Endoscribe. Classifying the procedure as successful or failed depends on what is written in the report.

The dataset used contained 4,876 documents for as many colonoscopy procedures. 25% of these documents were selected, using a 1-in-4 sampling strategy, to be used as test documents; the rest (75%) were used to create training sets. We used the dataset in two ways. The first way focuses on classifying the reports into successful or failed procedures. To achieve this, we combined extensive database querying, looking for regular expressions, with the judgment of an expert doctor to create the "gold-standard" classification against which to measure our automatic classification. The "gold-standard" classification assigned 656 documents to the class of failed colonoscopies and 4,220 documents to the class of successful colonoscopies, giving an overall failure rate of 13.45%. This way will be used to evaluate all the approaches under investigation. Table 1 shows the class distribution for each set of documents, according to the "gold-standard" classification. The second way looks at the data from a different medical point of view, namely diagnosis. Here we focused on certain diagnoses: "Diverticulosis", "Polyps", "Sessile Polyps" and "Ulcerative Colitis". We used this way for further investigation of the top three approaches that obtained good results using the first way. Using the same data in these two different ways is equivalent to using different datasets, because the class in each way is completely different. The "gold-standard" classification for the second way was created in the same manner as for the first.
Tables 2, 3, 4 and 5 show the class distribution for the Diverticulosis, Polyps, Sessile Polyps, and Ulcerative Colitis sets of documents, according to the "gold-standard" classification. The "Yes" column in these tables refers to the positive class and the "No" column refers to the negative class for that diagnosis. Figure 4 shows the numbers and percentages of the positive class for the 5 classification problems. The x-axis shows the 5 classification problems; the left and right y-axes show the number and the percentage of the positive class, respectively.

TABLE I. The "gold-standard" Successful/Failed class distribution

Data set   Successful      Failed        Total
Train      3,298           359           3,657 (75%)
Test       1,042           177           1,219 (25%)
Total      4,220 (86.5%)   656 (13.5%)   4,876 (100%)
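The counts in Table I can be cross-checked in a few lines; the variable names are ours, the numbers are those quoted above.

```python
# Sanity-check the Table I totals and the quoted 13.45% failure rate.
successful, failed = 4220, 656
train, test = 3657, 1219
total = successful + failed
assert train + test == total == 4876
failure_rate = 100 * failed / total
print(f"{failure_rate:.2f}%")   # 13.45%
```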