GitHub - baiyan305/NLP-Project


=========================================
README
=========================================
-AUTHOR:BHARATH BOMMANA

*****************************************
Project Introduction:
*****************************************

Our system attempts to create a dictionary from an input file. It automatically clusters the instances according to the sense in which the
target word is used in each instance. Ideally, all the instances in which the target word is used in the same
sense end up in a single cluster. For each cluster, we generate a definition of that sense of the target word along with
an example (picked from the instances in the cluster). So if we have n clusters, we will have n senses and n examples in the output.

Input:  input file containing instances of the target word in semeval2 format
Output: dictionary for this target word
		(lists all the senses of the target word, each with an example)
		
		input file : orange-noun-bomma008.xml

		Suppose our target word is 'orange' and our system has identified 5 clusters among the instances. The output would be as follows.

		Expected Output:
		########
		sense id: 0
		definition: refers to the property or quality similar to  vitamin AND/OR c
		example:
		As mentioned, vitamin C works as an antioxidant, and one of the most important functions of antioxidants is to prevent cancer. Antioxidants keep the DNA of healthy cells from mutating into cancerous cells, so antioxidants like vitamin C are the first line of defense for cancer and other serious diseases. Along with vitamin C, <head>orange</head> juice also contains the antioxidant hesperidin, which has been connected to reducing tumor growth and even stimulating apoptosis, or programmed cell death, in cancerous cells. Although research is ongoing, it has positively been linked to colon cancer prevention, but it is likely that hesperidin is effective in terms of many types of cancer.
		
		sense id: 1
		definition: refers to an action similar to  county AND/OR california
		example:
		<head>orange</head> will continue to expand to the east, where it has a 60-square mile sphere of influence extending to the county line. Preliminary plans call for a variety of developments in the area around Irvine Lake, with much of the area to the north of the lake remaining as open space.
		
		sense id: 2
		definition: refers to the property or quality similar to  juice AND/OR water
		example:
		<head>orange</head> juice helps improve blood circulation, avoiding the formation of clots, and therefore can prevent arteriosclerosis.
		
		sense id: 3
		definition: refers to the property or quality similar to  juice AND/OR sugar
		example:
		Worried about all that sugar you’d be drinking, though? Well, the magnesium that comes in your OJ is actually essential for blood sugar regulation, too, and keeps blood glucose levels normalized. As well as potassium, another key nutrient found plentifully in <head>orange</head> juice. Potassium is involved in regulating both blood sugar and insulin levels. Potassium works by getting sugar into the cell, which regulates cellular metabolism and makes use of that sugar by using it as fuel. Potassium has other benefits, too. It determines cellular hydration, by keeping sodium in circulation, which regulates water retention and blood volume.
		
		sense id: 4
		definition: refers to an action similar to  like AND/OR make
		example:
		But now because of the massive use of High Fructose Corn Syrup — which basically tastes like sugar but is cheaper to make — Fructose makes up a much higher percentage of our diet.  Fructose is in everything. Soft drinks. Candies. Crackers. Fructose, when you consume it in normal proportions, is fine.  But because people now are drinking 2 glasses of <head>orange</head> juice, then a soda with High Fructose Cornsyrup, and some crackers and chips (also with High Fructose Corn Syrup), and even bread that has it, fructose consumption is at an all-time high. In studies, high fructose consumption has been linked to alterations in fat levels, cholesterol changes, as well as other obvious changes that occur with a high energy intake like weight gain, metabolic disorder, and cardiovascular issues.

*********************************
Files and Authors
*********************************

We have divided the project into the following tasks

Task1 : Parsing the input file and storing the useful information for later processing.
Task2 : Generating clusters and the common words among the instances in each cluster.
Task3 : Generating a definition for each cluster.
Task4 : Identifying an example for each cluster.
Task5 : Writing the output into the required files.
Task6 : Calling mechanism for the above classes (Main.py)


The Python classes in the code that correspond to the above tasks, with their authors, are listed below:


1. Task 1 -  	XMLParser.py 		     -Yan Bai
2. Task 2 - 	SenseCluster.py 	     -Yan Bai
3. Task 3 - 	DefinitionGeneration.py     -Bharath Bommana
4. Task 4 -     ExampleGenerator.py         -Swetha Naidu
5. Task 5 - 	Util.py  		     -Swetha Naidu
6. Task 6 -     Main.py 	             -Bharath Bommana
7. Env tasks -  runit.sh, install.sh        -Swetha Naidu



*********************************************************************************************************************************
			Instructions to run
*********************************************************************************************************************************

Our system uses the following programs. install.sh installs these programs for you.
1. scipy
2. csh
3. build-essential
4. Algorithm-Munkres (the Algorithm::Munkres Perl module, version 0.08)


Our system can be run using the runit.sh script in two ways.
Note: runit.sh can be run only from the directory in which it is located.

To run our system on data with more than 10000 instances, at least 4 GB of memory is required; 8 GB is preferred.
We tested our system on a VM with 4 GB on the pmss data (10000 instances) and it took around 25 minutes.
We also tested it on VMs with 1 GB and 2 GB and encountered memory issues.

If the name of input file is input_file_name

	./runit.sh input_file_name

Note: Outputs for team data are kept in ./teamresult directory.

All the output will be stored in ./out directory.
-----------------------------------------
The contents of ./out directory will be: 
------------------------------------------
1.answers.txt
2.semeval2.xml
3.inputkey.key
4.outputkey.key
5.report.out(in ./senseclusters_scorer directory)

We run senseclusters_scorer on the outputkey.key and inputkey.key files; the resulting report.out is written to the ./senseclusters_scorer directory.


1. answers.txt -> dictionary for the target word
	
		orange

		sense id: 0
		definition: refers to the property or quality similar to  vitamin AND/OR c
		example:
		As mentioned, vitamin C works as an antioxidant, and one of the most important functions of antioxidants is to prevent cancer. Antioxidants keep the DNA of healthy cells from mutating into cancerous cells, so antioxidants like vitamin C are the first line of defense for cancer and other serious diseases. Along with vitamin C, <head>orange</head> juice also contains the antioxidant hesperidin, which has been connected to reducing tumor growth and even stimulating apoptosis, or programmed cell death, in cancerous cells. Although research is ongoing, it has positively been linked to colon cancer prevention, but it is likely that hesperidin is effective in terms of many types of cancer.
		
		sense id: 1
		definition: refers to an action similar to  county AND/OR california
		example:
		<head>orange</head> will continue to expand to the east, where it has a 60-square mile sphere of influence extending to the county line. Preliminary plans call for a variety of developments in the area around Irvine Lake, with much of the area to the north of the lake remaining as open space.
		
		sense id: 2
		definition: refers to the property or quality similar to  juice AND/OR water
		example:
		<head>orange</head> juice helps improve blood circulation, avoiding the formation of clots, and therefore can prevent arteriosclerosis.
		
		sense id: 3
		definition: refers to the property or quality similar to  juice AND/OR sugar
		example:
		Worried about all that sugar you’d be drinking, though? Well, the magnesium that comes in your OJ is actually essential for blood sugar regulation, too, and keeps blood glucose levels normalized. As well as potassium, another key nutrient found plentifully in <head>orange</head> juice. Potassium is involved in regulating both blood sugar and insulin levels. Potassium works by getting sugar into the cell, which regulates cellular metabolism and makes use of that sugar by using it as fuel. Potassium has other benefits, too. It determines cellular hydration, by keeping sodium in circulation, which regulates water retention and blood volume.
		
		sense id: 4
		definition: refers to an action similar to  like AND/OR make
		example:
		But now because of the massive use of High Fructose Corn Syrup — which basically tastes like sugar but is cheaper to make — Fructose makes up a much higher percentage of our diet.  Fructose is in everything. Soft drinks. Candies. Crackers. Fructose, when you consume it in normal proportions, is fine.  But because people now are drinking 2 glasses of <head>orange</head> juice, then a soda with High Fructose Cornsyrup, and some crackers and chips (also with High Fructose Corn Syrup), and even bread that has it, fructose consumption is at an all-time high. In studies, high fructose consumption has been linked to alterations in fat levels, cholesterol changes, as well as other obvious changes that occur with a high energy intake like weight gain, metabolic disorder, and cardiovascular issues.
		


2. semeval2.xml -> output (instances clustered based on their sense ids) in semeval2 format

		<corpus lang="english">
		<lexelt item="LEXELT">
		<instance id="0">
		<answer instance="0" senseid="2"/>
		<context>
		are a vegetarian it is certainly worthwhile making sure you include other good sources of iron in your diet. You'll need to get your daily iron from eggs (if you're eating them), beans such as baked beans, soya beans and lentils, leafy green veg, dried fruit, and fortified breakfast cereals (check the labels to see which have iron added). Iron from non meat sources is better absorbed by the body if it is eaten with a source of vitamin C, so try to have your cereal with a glass of fresh <head>orange</head> or your lentils with a tomato based sauce and so on. CAN you tell me which type of milk is suitable for my four-year-old daughter? A LITTLE ray of sunshine she isn't. Cute and sweet are not adjectives that readily spring to mind on surveying the stage wardrobe of Andi, the blonde frontwoman of local cover band Sunshine. But then Andi isn't really your average rock babe. Two years ago she traded her long, barbie doll locks for a
		</context>
		</instance>
		<instance id="1">
		<answer instance="1" senseid="1"/>
		<context>
		one of the biggest exhibitions at the fair with 21 towns and holiday attractions taking part. It is not all work for children at Mount Pleasant primary school in Darlington, where children take a caring approach. Distressed by mindless vandalism that destroys trees and flowers they are keeping a watchful eye on plant life. The animal kingdom gets its fair share of attention as children learn how to care for pets. A runaway hamster called Sophie takes pride of place where the school rat once roamed. And they learn to care for themselves with a sociable cafe selling <head>orange</head> and biscuits before home-time every Friday. Lynne Humes has taken time off work as a staff nurse at Darlington Memorial Hospital to start a family, and her son Anthony was born in January. The NHS is one of the main election issues for Lynne, who says she has seen many changes in the service over the last ten years.' Some were for the better but a lot were for worse. Lack of funds has been a big problem, especially for research, but it always has
		</context>
		</instance>
		.....

3. inputkey.key -> key file generated based on the inputfile(orange-noun-bomma008.xml)
		
		orange orange.0 orange.fruit_tree
		orange orange.1 orange.fruit_tree
		orange orange.2 orange.fruit_tree
		orange orange.3 orange.fruit_tree
		orange orange.4 orange.fruit_tree
		orange orange.5 orange.fruit_tree
		orange orange.6 orange.fruit_tree
		orange orange.7 orange.fruit_tree
		orange orange.8 orange.fruit_tree
		orange orange.9 orange.fruit_tree
		orange orange.10 orange.fruit_tree
		orange orange.11 orange.fruit_tree
		orange orange.12 orange.fruit_tree
		orange orange.13 orange.fruit_tree
		orange orange.14 orange.fruit_tree
		orange orange.15 orange.fruit_tree
		orange orange.16 orange.fruit_tree
		orange orange.17 orange.fruit_tree
		orange orange.18 orange.fruit_tree
		orange orange.19 orange.fruit_tree

4. outputkey.key --> key file generated based on the results of our system

		orange orange.0 orange.2
		orange orange.1 orange.1
		orange orange.2 orange.1
		orange orange.3 orange.2
		orange orange.4 orange.1
		orange orange.5 orange.2
		orange orange.6 orange.2
		orange orange.7 orange.1
		orange orange.8 orange.1
		orange orange.9 orange.2
		orange orange.10 orange.2
		orange orange.11 orange.2
		orange orange.12 orange.1
		orange orange.13 orange.1
		orange orange.14 orange.1
		orange orange.15 orange.1
		orange orange.16 orange.3
		orange orange.17 orange.0

5. report.out(in ./senseclusters_scorer directory)

			            S1	      S0	   TOTAL	
		  C0:       46	       2	      48	(35.29)
		  C1:       27	      50	      77	(56.62)
		  C2:*       3	       0	       3	(2.21)
		  C3:*       7	       0	       7	(5.15)
		  C4:*       1	       0	       1	(0.74)
		 TOTAL      84	      52	     136
		         (61.76)   (38.24)
		Precision = 76.80(96/125)
		Recall = 70.59(96/136+0)
		F-Measure = 73.56
		
		Legend of Sense Tags
		S0 = orange.Location_County
		S1 = orange.fruit_tree
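
The figures in the report above can be re-derived from the table. The sketch below assumes that the F-Measure is the harmonic mean of precision and recall (the reported numbers are consistent with that), that the 96 correct instances are the 46 (C0, S1) plus the 50 (C1, S0), and that the 125 attempted instances are those in the two clusters that were mapped to a sense (48 + 77). The function name f_measure is ours and is not part of senseclusters_scorer.

```python
def f_measure(correct, attempted, total):
    """Return (precision, recall, F-measure) as percentages."""
    precision = 100.0 * correct / attempted
    recall = 100.0 * correct / total
    # Harmonic mean of precision and recall (assumed scoring formula).
    f = 2 * precision * recall / (precision + recall)
    return precision, recall, f


# Counts taken from the report.out sample above.
p, r, f = f_measure(correct=46 + 50, attempted=48 + 77, total=136)
print(f"Precision = {p:.2f}, Recall = {r:.2f}, F-Measure = {f:.2f}")
# prints: Precision = 76.80, Recall = 70.59, F-Measure = 73.56
```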
	
*****************************************************************************************************************
DETAILED EXPLANATION OF THE SOLUTION WITH EXAMPLES
******************************************************************************************************************

The XML input files are read by the XMLParser.py class. The useful information it stores includes the following.
	
    instances_data_old = []            #store instance information including instance id, sense id and instance text
    instances_text_raw = []            #store raw text of instances without cleaning i.e text between <context> **** </context> tags in XML file.
    instances_text_clean = []          #store text of instances without punctuation and stopwords.
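
As a rough illustration of this parsing step, here is a minimal sketch using Python's standard xml.etree.ElementTree. It is not the code in XMLParser.py: the function name parse_instances, the tiny STOPWORDS set, and the choice to drop the <head> markup from the raw text are all assumptions made for the example.

```python
import string
import xml.etree.ElementTree as ET

# Tiny stop-word list for illustration only; the project's real list is not shown here.
STOPWORDS = {"a", "an", "the", "of", "and", "is", "with"}

SAMPLE = """<corpus lang="english">
<lexelt item="LEXELT">
<instance id="0">
<answer instance="0" senseid="2"/>
<context>
a glass of fresh <head>orange</head> juice, with cereal.
</context>
</instance>
</lexelt>
</corpus>"""


def parse_instances(xml_text):
    """Return (instances_data, raw_text, clean_text) for a semeval2-style document."""
    root = ET.fromstring(xml_text)
    instances_data, raw_text, clean_text = [], [], []
    strip_punct = str.maketrans("", "", string.punctuation)
    for inst in root.iter("instance"):
        answer = inst.find("answer")
        sense_id = answer.get("senseid") if answer is not None else None
        # itertext() joins the text around and inside the <head> element.
        raw = "".join(inst.find("context").itertext()).strip()
        tokens = [w for w in raw.lower().translate(strip_punct).split()
                  if w not in STOPWORDS]
        instances_data.append((inst.get("id"), sense_id, raw))
        raw_text.append(raw)
        clean_text.append(tokens)
    return instances_data, raw_text, clean_text


data, raw, clean = parse_instances(SAMPLE)
print(clean[0])  # ['glass', 'fresh', 'orange', 'juice', 'cereal']
```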
    

Once we parse the XML input files and collect the required data, there are 3 main tasks in this project.
1. Clustering of senses
2. Generating Definitions
3. Selecting Example.

---------------------------------------------------------
1. CLUSTERING OF SENSES: 
--------------------------------------------------------
This part is done by the SenseCluster.py class in the code.

The idea of clustering is from the paper "Automatic Word Sense Discrimination" by Hinrich Schütze.

The process can be divided into 3 steps:
    1.Extract dimensions from contexts.
    2.Create context vector based on dimensions.
    3.Cluster contexts with hierarchical clustering.

1.Extract dimensions
    The 100 most frequent neighbors of the target word across the contexts are chosen as the dimensions of the vector space.

2.Create context vector
    Create a context vector for each context based on the dimensions chosen above. For each context, look at the 50 neighbors
    of the target word and count the occurrences of each dimension. The context vectors then look like this:

    context vector 1: [0,0,1,0,2,0,0,0,1,0,0,0...] (100 values)
    context vector 2: [1,0,0,0,0,1,0,3,0,0,0,1...] (100 values)
    ...
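
Steps 1 and 2 can be sketched in a few lines of Python. This is only an illustration, not the code in SenseCluster.py: the function names are hypothetical, and reading "50 neighbors" as a window of 50 words on each side of the target word is an assumption.

```python
from collections import Counter


def neighbors(tokens, target, window):
    """Words within `window` positions of each occurrence of `target`."""
    found = []
    for i, tok in enumerate(tokens):
        if tok == target:
            found.extend(tokens[max(0, i - window):i])
            found.extend(tokens[i + 1:i + 1 + window])
    return [w for w in found if w != target]


def extract_dimensions(contexts, target, window=50, top_n=100):
    """Pick the top_n most frequent neighbors of the target over all contexts."""
    counts = Counter()
    for tokens in contexts:
        counts.update(neighbors(tokens, target, window))
    return [word for word, _ in counts.most_common(top_n)]


def context_vector(tokens, target, dimensions, window=50):
    """Count how often each dimension word occurs near the target in one context."""
    counts = Counter(neighbors(tokens, target, window))
    return [counts[d] for d in dimensions]


contexts = [
    ["fresh", "orange", "juice"],
    ["orange", "county", "california"],
    ["drink", "orange", "juice", "juice"],
]
dims = extract_dimensions(contexts, "orange", top_n=2)
vector = context_vector(contexts[2], "orange", ["juice", "county"])
```

With these toy contexts, "juice" is the most frequent neighbor, and the third context yields the vector [2, 0] over the dimensions ["juice", "county"].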

3.Cluster with hierarchical clustering
    After the context vectors are computed, the program uses the hierarchical clustering provided by SciPy.
    The clustering can be divided into 2 steps:
        1. Use the linkage() function to perform hierarchical/agglomerative clustering on the context vectors created above.
           This step generates a hierarchical clustering encoded as a linkage matrix.
        2. Use the fcluster() function to form flat clusters from the hierarchical clustering.
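
The two SciPy calls can be tried on a toy input as follows. The linkage method ("average"), the Euclidean metric, and the "maxclust" criterion with t=2 are assumptions chosen for this sketch; the README does not state which settings SenseCluster.py uses.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Four toy context vectors over four dimensions: two obvious groups.
vectors = np.array([
    [2, 1, 0, 0],
    [3, 1, 0, 0],
    [0, 0, 2, 1],
    [0, 0, 1, 2],
], dtype=float)

# Step 1: agglomerative clustering, encoded as a linkage matrix.
Z = linkage(vectors, method="average", metric="euclidean")

# Step 2: cut the hierarchy into at most 2 flat clusters.
labels = fcluster(Z, t=2, criterion="maxclust")

# Convert the label array into a list of lists of instance indices.
by_label = {}
for idx, label in enumerate(labels):
    by_label.setdefault(int(label), []).append(idx)
groups = sorted(by_label.values())
print(groups)  # [[0, 1], [2, 3]]
```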

---------------------------------------------------------------------------------------------
 2. Generating Definition
---------------------------------------------------------------------------------------------

Our idea is very loosely based on the idea discussed in this paper: Word Sense Induction Using Graphs of Collocations
link : http://dl.acm.org/citation.cfm?id=1567349

The idea is that the definition of a word in a context depends on its collocated words.

In each cluster, we collect the 5 words on either side of the target word in each context and form a list of all collocated words.

From this list, we pick the 2 most frequent words and use them to generate the definition.

Once we have the top 2 words for each cluster, our initial idea was to POS-tag those two words and then use an appropriate template to generate an
easily readable complete sentence.
We were able to use this feature during the development stage.

Unfortunately, we ran into issues with the NLTK installation on a clean Ubuntu machine, so we are not using POS tagging in our final submission.

So, as a last resort, we choose the template at random and generate the definition from it.

For this we use a pool of 3 templates; one is chosen at random and the top words are appended to it to form a complete sentence.
	
	str1 =  "refers to an action similar to "
	str2 = "refers to name of an entity, place or quality similar to "
	str3 = "refers to the property or quality similar to "

	
	input:  1. clusters - list of lists
	           example : [[1,3,7], [2,6,8]]

	        2. cleaned_instances - list of lists
	           example : [['word1', 'word2','word3'], ['word4','word1', 'word3'], ['word1', 'word2','word3', 'word5']....]

	intermediate results:
	        collocatedWords = [[word1, word2, word2, word1, word3, word4, word5], [word4, word1, word3, word3, word4]]
	        topWords = [[word1, word2], [word3, word4]]

	output: definitions - a list
	        example : [refers to an action similar to word1 AND/OR word2, refers to the name of a place, entity or quality similar to word3 AND/OR word4]
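
The definition step described above can be sketched as follows. This is an illustration under assumptions, not the code in DefinitionGeneration.py: the function names, the explicit target argument, and the optional rng parameter are hypothetical. The three templates are the ones listed above.

```python
import random
from collections import Counter

TEMPLATES = [
    "refers to an action similar to ",
    "refers to name of an entity, place or quality similar to ",
    "refers to the property or quality similar to ",
]


def collocated_words(tokens, target, window=5):
    """Collect up to `window` words on either side of each target occurrence."""
    words = []
    for i, tok in enumerate(tokens):
        if tok == target:
            words.extend(tokens[max(0, i - window):i])
            words.extend(tokens[i + 1:i + 1 + window])
    return [w for w in words if w != target]


def generate_definitions(clusters, cleaned_instances, target, rng=random):
    """Build one definition per cluster from its 2 most frequent collocates."""
    definitions = []
    for cluster in clusters:
        counts = Counter()
        for idx in cluster:
            counts.update(collocated_words(cleaned_instances[idx], target))
        top_words = [word for word, _ in counts.most_common(2)]
        # The template is picked at random, as in the final submission.
        definitions.append(rng.choice(TEMPLATES) + " AND/OR ".join(top_words))
    return definitions


instances = [
    ["fresh", "orange", "juice"],
    ["orange", "juice", "vitamin"],
    ["orange", "county", "california"],
    ["orange", "county", "line"],
]
definitions = generate_definitions([[0, 1], [2, 3]], instances, "orange")
```

Because the template is random, only the word part of each definition is stable between runs: the first cluster's definition mentions "juice" and the second mentions "county".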

----------------------------------------------------------------------------------------------------
3. Picking an Example:
----------------------------------------------------------------------------------------------------
This task is done in the ExampleGenerator.py class.
The approach followed is outlined in the following paper:
		http://www.aclweb.org/anthology/P04-3020
		
First, all the stop words are removed.
Our initial idea was to apply stemming to increase the similarity. Unfortunately, we ran into issues with the NLTK installation, so we had to skip that part.
	
The main approach followed is TextRank, a graph-based approach.
It represents each instance as a vertex, and each connection between two vertices is given an edge weight. The edge weight is calculated with the similarity metric outlined in the paper. For two instances Si = W1 W2 ... Wn and Sj = W1 W2 ... Wm:

    Similarity(Si, Sj) = |{Wk : Wk belongs to both Si and Sj}| / (log(|Si|) + log(|Sj|))
	
This is the lexical overlap between the two instances, divided by a normalizing factor based on their lengths.
It gives the edge weight between two instances i and j.
	
A graph is then constructed in which each edge connects a pair of instances.
Finally, each vertex is given a rank, which indicates its SimilarityWeight. We pick the vertex with the maximum SimilarityWeight as the best example.
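
A minimal sketch of this selection step is shown below. It implements the similarity metric above, but it scores each vertex by the sum of its edge weights, which is a simplification of TextRank's iterative ranking and not necessarily what ExampleGenerator.py does. The function names and the handling of one-word instances (where the denominator would be zero) are assumptions.

```python
import math


def similarity(s_i, s_j):
    """Word overlap of two token lists, normalized by log(|Si|) + log(|Sj|)."""
    denom = math.log(len(s_i)) + math.log(len(s_j)) if s_i and s_j else 0.0
    if denom <= 0:
        return 0.0  # empty or one-word instances: avoid division by zero
    return len(set(s_i) & set(s_j)) / denom


def best_examples(clusters, instances):
    """For each cluster, return the instance index with the highest total edge weight."""
    best = []
    for cluster in clusters:
        scores = {
            i: sum(similarity(instances[i], instances[j]) for j in cluster if j != i)
            for i in cluster
        }
        best.append(max(cluster, key=lambda i: scores[i]))
    return best


instances = [
    ["orange", "juice", "vitamin"],
    ["orange", "juice", "sugar", "vitamin"],
    ["orange", "juice", "county"],
]
print(best_examples([[0, 1, 2]], instances))  # [0]
```

Instance 0 wins here because it shares three words with instance 1 and two with instance 2 while being short, so its summed similarity is the largest.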

We tried to add a frequency_similarity score (the number of top common words present in an instance) to the computed similarity weight of each instance. Due to some issues, we could not accomplish that part.

# input:
#       1. clusters - list of lists
#             example : [[1,3,7], [2,6,8]]
#       2. instances - list of lists
#             example : [['word1', 'word2','word3'], ['word4','word1', 'word3'], ['word1', 'word2','word3', 'word5']....]
# output:
#       list of examples
#             example_list : [12,34,67]    >>>>represents the index of the best example in each cluster

**************************************************************************************************************************

Detailed description of code components with examples:
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
**************************************************************************************************************************
***************************
COMPONENT 1 - XMLParser.py
Author: Yan Bai
***************************
Objective: The main objective of this class is to parse the XML input file and populate the useful info into python data structures.



input:  charged-verb-bomma008.xml file (instances in semeval2 format)

output: instances_data = []         #store instance information including instance id, sense id and instance text
		raw_text = []               #store raw text of instances
		clean_text = []             #store text of instances without punctuation
		  

*****************************************************************
COMPONENT 2 - SenseCluster.py
Author:  Yan Bai
*****************************************************************

Objective: The objective of this class is to categorize the instances into clusters.

input: clean_text[]  #to make clusters

output: groups = []			#store clusters. It is a list of lists. Each list indicates a cluster.
		commonwords = []	#A matrix to store common words of each instance pair.

1. groups[] -> the instances are categorized into clusters based on the degree of similarity between the instances/contexts
	Example: groups = [[0,1,2,3,4,5,6],[7,8,9,10,11,12],[13,14,15,16,17,18],[19,20,21]]
			 In the above example, the contexts with instance ids 0,1,2,3,4,5,6 belong to cluster 1, and the target word is
			 expected to have the same/similar sense in all of them.


2. commonWords[] -> SenseCluster class outputs the list of the words common(repeated) among the contexts in a cluster
	Example: common = [['eat', 'dog', 'eat', 'apple', 'juice', 'taste', 'fruit', 'fruit', 'taste', 'juice', 'eat', 'fruit', 'is', 'was', 'sweet'], ['company', 'laptop', 'ipad', 'ipod', 'ceo', 'innovation', 'expensive', 'laptop', 'company', 'iphone', 'iphone', 'ipad', 'hiring', 'objectiveC'], ['orchard', 'tree', 'farm', 'tree', 'farm', 'grow', 'seed', 'tree', 'grow']]

	In the above example, for cluster 1, the commonly repeated words are eat, dog, eat, apple, juice, taste, fruit, fruit, taste, juice, taste....., sweet

**********************************
Component 3 - DefinitionGeneration.py
Author: Bharath Kumar Bommana
**********************************
Objective: generate definitions

input:  	clusters[]
			example:  [[1,3,7], [2,6,8],[11,12,13]]
		cleaned_instances[]
			example:  [['word1', 'word2','word3'], ['word4','word1', 'word3',], ['word1', 'word2','word3', 'word5']....]
output:		
		definitions[]	 #a list to store one definition for each cluster

sample output (for clusters and cleaned_instances described in component2):
		
		definitions: [refers to an action similar to word1 AND/OR word2, refers to the name of a place, entity or quality similar to word3 AND/OR word4]

*********************************************
COMPONENT 4 - Util.py
Author: Swetha Naidu
*********************************************

The main objective of this class is to process the output generated in all the previous components and format into output files as described in "instructions to run" section.
1.answers.txt
2.semeval2.xml
3.inputkey.key
4.outputkey.key
5.report.out (in ./senseclusters_scorer)

for the file charged-verb-bomma008.xml,

./runit.sh charged-verb-bomma008.xml charged

the following files will be generated.

in ./out
1.charged.answer.txt
2.charged_Semeval2.xml
3.charged.new.key
4.charged.old.key

and report.out in ./senseclusters_scorer
(examples for outputs are shown in "instructions to run" section)

***********************************************
COMPONENT 5 - ExampleGenerator.py
Author: Swetha Naidu
***********************************************
Objective: The main objective of this class is to generate an example that best illustrates the sense of the given word in the cluster.
 
 	input:
 	   clusters - list of lists
               example : [[1,3,7], [2,6,8],[11,12,13]]
           instances - list of lists
                example : [['word1', 'word2','word3'], ['word4','word1', 'word3',], ['word1', 'word2','word3', 'word5']....]
        output:  list of examples
                example_list : [1,6,12]    #represents the index of the best example in our cluster	
		here 1 represents the best example in cluster [1,3,7]
		     6 represents the best example in cluster [2,6,8]
		     12 represents the best example in cluster [11,12,13]