143230/CLTA


#CLTA
This project publishes the source code of two category-correlation-based bilingual topic models, CC-BiLDA and CC-BiBTM, which can be applied to cross-lingual tasks such as cross-lingual taxonomy alignment.

###Requirements:

  1. JDK 1.8.0_111
  2. Maven 3.3.9

###Data you need:

  1. Biterm Documents or Word Documents
  2. Biterm-Category or Document-Category Distribution file

###Biterm Documents content format:
Each line represents a category biterm document, organised as follows:
<category url>@#@#@<category label>@#@#@<category lang>@#@#@<chinese-chinese biterm document>@#@#@<chinese-english biterm document>@#@#@<english-english biterm document>
For example:
http://www.ebay.com/chp/Fins-/16054@#@#@Fins@#@#@en@#@#@[呼吸 手套,...]@#@#@[呼吸 full,...]@#@#@[cheap sailor,...]
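
As a rough illustration of the layout above, the following sketch splits one line on the `@#@#@` separator and unpacks the three bracketed biterm lists. The class `BitermDocumentLine` and its field names are hypothetical and are not part of this project's source; the sketch also assumes that individual biterms contain no commas.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not a class from this repository.
class BitermDocumentLine {
    static final String SEP = "@#@#@";

    final String categoryUrl;
    final String categoryLabel;
    final String categoryLang;
    final List<String> zhZhBiterms; // chinese-chinese biterms
    final List<String> zhEnBiterms; // chinese-english biterms
    final List<String> enEnBiterms; // english-english biterms

    private BitermDocumentLine(String[] f) {
        categoryUrl = f[0];
        categoryLabel = f[1];
        categoryLang = f[2];
        zhZhBiterms = parseList(f[3]);
        zhEnBiterms = parseList(f[4]);
        enEnBiterms = parseList(f[5]);
    }

    // Splits a line into its six fields; the separator is quoted so it is matched literally.
    static BitermDocumentLine parse(String line) {
        String[] fields = line.split(Pattern.quote(SEP), -1);
        if (fields.length != 6) {
            throw new IllegalArgumentException("expected 6 fields, got " + fields.length);
        }
        return new BitermDocumentLine(fields);
    }

    // Turns "[a b,c d]" into ["a b", "c d"]. Assumption: a biterm never contains a comma.
    static List<String> parseList(String raw) {
        String body = raw.trim();
        if (body.startsWith("[") && body.endsWith("]")) {
            body = body.substring(1, body.length() - 1);
        }
        List<String> items = new ArrayList<>();
        for (String item : body.split(",")) {
            String trimmed = item.trim();
            if (!trimmed.isEmpty()) {
                items.add(trimmed);
            }
        }
        return items;
    }
}
```

A Word Documents line could be handled the same way by expecting five fields instead of six.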
###Word Documents content format:
Each line represents a category word document, organised as follows:
<category url>@#@#@<category label>@#@#@<category lang>@#@#@<chinese word document>@#@#@<translated english word document>
For example:
http://conference_en#c-7081035-6117083@#@#@committee@#@#@en@#@#@[任命, 报告...]@#@#@[elect, person...]
###Biterm-Category Distribution file content format:
Each line represents a biterm-category distribution, organised as follows:
<word1>@#@#@<word2>@#@#@<lang1_lang2>\t[<category url>@#@#@<category distribution>,...]
For example:
稿件@#@#@carry@#@#@ZH_EN [http://cmt_cn#c-8430559-8614325@#@#@1.0]

###Document-Category Distribution file content format:
Each line represents a document-category distribution, organised as follows:
<document id>@#@#@<document label>@#@#@<document language>\t[<category url>@#@#@<category distribution>, ...]
For example:
http://cmt_cn#c-1609047-4017692@#@#@合著者@#@#@zh@#@#@ [http://cmt_cn#c-1609047-4017692@#@#@1.0]
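
Both distribution formats share the same shape: a three-field key, whitespace, then a bracketed list of `<category url>@#@#@<probability>` entries. The sketch below reads such a line into a map. `CategoryDistributionLine` is a hypothetical name, not a class from this project, and the code assumes that `[` first appears at the start of the list and that category URLs contain no commas.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not a class from this repository.
class CategoryDistributionLine {
    static final String SEP = "@#@#@";

    // Returns the key fields that precede the list, e.g. word1, word2, language pair.
    // A trailing separator before the list is tolerated: split() drops trailing empty fields.
    static String[] keyFields(String line) {
        String key = line.substring(0, listStart(line)).trim();
        return key.split(Pattern.quote(SEP));
    }

    // Returns category URL -> probability, in file order.
    static Map<String, Double> distribution(String line) {
        int open = listStart(line);
        int close = line.lastIndexOf(']');
        if (close < open) {
            throw new IllegalArgumentException("unterminated list: " + line);
        }
        Map<String, Double> result = new LinkedHashMap<>();
        // Assumption: category URLs contain no commas, so "," only separates entries.
        for (String entry : line.substring(open + 1, close).split(",")) {
            String e = entry.trim();
            if (e.isEmpty()) {
                continue;
            }
            int cut = e.lastIndexOf(SEP);
            if (cut < 0) {
                throw new IllegalArgumentException("missing separator in entry: " + e);
            }
            result.put(e.substring(0, cut), Double.parseDouble(e.substring(cut + SEP.length())));
        }
        return result;
    }

    // Assumption: the first '[' on the line opens the distribution list.
    private static int listStart(String line) {
        int open = line.indexOf('[');
        if (open < 0) {
            throw new IllegalArgumentException("no distribution list found: " + line);
        }
        return open;
    }
}
```

Locating the list by its opening bracket rather than by the tab character keeps the sketch tolerant of lines where the tab has been replaced by a space.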

###Input file organization:
Suppose the dataset name is 'A'. For the CC-BiLDA method, the Word Documents file and the Document-Category Distribution file are located at:
corpus/A/exact matching/CC-BiLDA/TextPairs(for BiLDA)_A
corpus/A/exact matching/CC-BiLDA/TextPairs(for BiLDA)_A.(<avg_pi> or <hier_pi>)
For the CC-BiBTM method, the Biterm Documents file and the Biterm-Category Distribution file are located at:
corpus/A/exact matching/CC-BiBTM/Biterms(for BiBTM)_A
corpus/A/exact matching/CC-BiBTM/Biterms(for BiBTM)_A.(<avg_pi> or <hier_pi>)
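
The directory layout above can be sketched as a small path builder. `CorpusLayout` and its method names are hypothetical, and the suffix of the distribution file is left as a caller-supplied parameter because the README only gives it as the placeholder `<avg_pi>` or `<hier_pi>`.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper for illustration only; not a class from this repository.
class CorpusLayout {

    // corpus/<dataset>/exact matching/<methodDir>/<filePrefix>_<dataset>
    static Path documentsFile(String dataset, String methodDir, String filePrefix) {
        return Paths.get("corpus", dataset, "exact matching", methodDir, filePrefix + "_" + dataset);
    }

    // Same location as the documents file, with "." + piSuffix appended to the file name.
    // Assumption: piSuffix is whatever stands in for <avg_pi> or <hier_pi> in your data.
    static Path distributionFile(String dataset, String methodDir, String filePrefix, String piSuffix) {
        Path docs = documentsFile(dataset, methodDir, filePrefix);
        return docs.resolveSibling(docs.getFileName() + "." + piSuffix);
    }
}
```

For example, `CorpusLayout.documentsFile("A", "CC-BiBTM", "Biterms(for BiBTM)")` yields the first CC-BiBTM path listed above, using the platform's own path separator.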

###Compile Project:
Before running the project, compile it with Maven:
mvn assembly:assembly

###Run Project:
The build generates the jar package 'alignment-1.0-SNAPSHOT.jar' in the target directory.

If this is your first time using this project, run:
java -jar target/alignment-1.0-SNAPSHOT.jar -h
This prints the help options:

usage: Model Run Options
 -alpha <arg>         Hyper Parameter Alpha
 -avg                 Using Average Category Distribution to inference the
                      GibbsSampling.
 -f <arg>             File Name
 -h                   HELP_DESCRIPTION
 -hier                Using Hierarchy Category Distribution to inference
                      the GibbsSampling.
 -iter <arg>          Iteration Number
 -k <arg>             Topic Number
 -m <arg>             Method for training the corpus, one of <CCBiBTM,
                      CCBiLDA>
 -savestep <arg>      Step to Save
 -source_beta <arg>   Source Beta
 -t <arg>             Data Type
 -target_beta <arg>   Target Beta

You can then follow the help options to run this project on your own datasets. For example:
java -jar target/alignment-1.0-SNAPSHOT.jar -m CCBiBTM -f "Biterms(for BiBTM)" -t "product catalogue" -iter 300 -savestep 100 -k 100
Options that are not specified take their default values.
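
If you want to launch several runs from your own Java code, one possible approach is to assemble the command line programmatically. The sketch below is only an illustration: `ModelRunCommand` is a hypothetical class, and it merely builds the argument list from the option names shown in the help output, without validating them.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not a class from this repository.
class ModelRunCommand {

    // Builds: java -jar <jarPath> -<name> [<value>] ...
    // An option mapped to null is emitted as a bare flag (e.g. -avg or -hier).
    // Options left out of the map are not passed at all.
    static List<String> build(String jarPath, Map<String, String> options) {
        List<String> cmd = new ArrayList<>();
        cmd.add("java");
        cmd.add("-jar");
        cmd.add(jarPath);
        for (Map.Entry<String, String> option : options.entrySet()) {
            cmd.add("-" + option.getKey());
            if (option.getValue() != null) {
                cmd.add(option.getValue());
            }
        }
        return cmd;
    }
}
```

The resulting list can be handed to `new ProcessBuilder(cmd).inheritIO().start()`. Passing the options in a `LinkedHashMap` keeps them in insertion order, and because each argument is a separate list element, values with spaces such as "product catalogue" need no extra quoting.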
