GitHub - 143230/CLTA

#CLTA This project publishes the source code of the category-correlation-based bilingual topic models CC-BiLDA and CC-BiBTM, which can be applied to cross-lingual tasks such as cross-lingual taxonomy alignment.

###Requirements:

JDK 1.8.0_111
Maven 3.3.9

###Data you need:

Biterm Documents or Word Documents
Biterm-Category or Document-Category Distribution file

###Biterm Documents content format: each line represents a category biterm document organised as follows:
<category url>@#@#@<category label>@#@#@<category lang>@#@#@<chinese-chinese biterm document>@#@#@<chinese-english biterm document>@#@#@<english-english biterm document>
for example:
http://www.ebay.com/chp/Fins-/16054@#@#@Fins@#@#@en@#@#@[呼吸手套,...]@#@#@[呼吸 full,...]@#@#@[cheap sailor,...]
###Word Documents content format: each line represents a category word document organised as follows:
<category url>@#@#@<category label>@#@#@<category lang>@#@#@<chinese word document>@#@#@<translated english word document>
for example:
http://conference_en#c-7081035-6117083@#@#@committee@#@#@en@#@#@[任命, 报告...]@#@#@[elect, person...]
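Both document formats share three leading fields (url, label, lang) followed by bracketed lists, so a single reader can cover them. The following is a minimal Java sketch, not the project's own parser. It assumes that list items are separated by commas and that no field contains the `@#@#@` separator; the class `CategoryDocumentLine` and its methods are hypothetical names introduced for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the CLTA source code.
class CategoryDocumentLine {
    static final String SEP = "@#@#@";

    final String url;
    final String label;
    final String lang;
    // One list per document field: 3 lists for a biterm line, 2 for a word line.
    final List<List<String>> documents;

    CategoryDocumentLine(String url, String label, String lang,
                         List<List<String>> documents) {
        this.url = url;
        this.label = label;
        this.lang = lang;
        this.documents = documents;
    }

    // Splits one line on the literal separator and parses every field
    // after the third as a bracketed list.
    static CategoryDocumentLine parse(String line) {
        String[] fields = line.split(Pattern.quote(SEP), -1);
        if (fields.length < 4) {
            throw new IllegalArgumentException(
                "expected at least 4 fields, got " + fields.length);
        }
        List<List<String>> docs = new ArrayList<>();
        for (int i = 3; i < fields.length; i++) {
            docs.add(parseList(fields[i]));
        }
        return new CategoryDocumentLine(fields[0], fields[1], fields[2], docs);
    }

    // Assumption: items inside "[...]" are comma-separated.
    static List<String> parseList(String field) {
        String s = field.trim();
        if (s.startsWith("[") && s.endsWith("]")) {
            s = s.substring(1, s.length() - 1);
        }
        List<String> items = new ArrayList<>();
        for (String item : s.split(",")) {
            String t = item.trim();
            if (!t.isEmpty()) {
                items.add(t);
            }
        }
        return items;
    }
}
```

With this sketch, a biterm line yields three lists in `documents` (chinese-chinese, chinese-english, english-english) and a word line yields two (chinese, translated english).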
###Biterm-Category Distribution file content format: each line represents a biterm-category distribution organised as follows:
<word1>@#@#@<word2>@#@#@<lang1_lang2>\t[<category url>@#@#@<category distribution>,...]
for example:
稿件@#@#@carry@#@#@ZH_EN [http://cmt_cn#c-8430559-8614325@#@#@1.0]

###Document-Category Distribution file content format: each line represents a document-category distribution organised as follows:
<document id>@#@#@<document label>@#@#@<document language>\t[<category url>@#@#@<category distribution>, ...]
for example:
http://cmt_cn#c-1609047-4017692@#@#@合著者@#@#@zh@#@#@ [http://cmt_cn#c-1609047-4017692@#@#@1.0]
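The two distribution files have the same shape: a key made of `@#@#@`-separated fields, a tab, and a bracketed list of `<category url>@#@#@<category distribution>` pairs. A hedged Java sketch of a reader for one such line follows. It assumes the key and the list are separated by a real tab character as the format states, that category URLs contain no commas, and that the distribution value parses as a decimal number; `CategoryDistributionLine` is a hypothetical name, not a class from this project.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the CLTA source code.
class CategoryDistributionLine {
    static final String SEP = "@#@#@";

    // Biterm file: word1, word2, lang1_lang2. Document file: id, label, language.
    final String[] keyFields;
    // Category url -> distribution value, in file order.
    final Map<String, Double> distribution;

    CategoryDistributionLine(String[] keyFields, Map<String, Double> distribution) {
        this.keyFields = keyFields;
        this.distribution = distribution;
    }

    static CategoryDistributionLine parse(String line) {
        int tab = line.indexOf('\t');
        if (tab < 0) {
            throw new IllegalArgumentException("missing tab between key and list");
        }
        // split() without a limit drops trailing empty fields, so a key that
        // ends with the separator still yields only its non-empty fields.
        String[] key = line.substring(0, tab).split(Pattern.quote(SEP));

        String body = line.substring(tab + 1).trim();
        if (body.startsWith("[") && body.endsWith("]")) {
            body = body.substring(1, body.length() - 1);
        }
        Map<String, Double> dist = new LinkedHashMap<>();
        for (String entry : body.split(",")) {
            String e = entry.trim();
            if (e.isEmpty()) {
                continue;
            }
            // The value follows the last separator in the entry.
            int i = e.lastIndexOf(SEP);
            if (i < 0) {
                throw new IllegalArgumentException("malformed entry: " + e);
            }
            String url = e.substring(0, i);
            double value = Double.parseDouble(e.substring(i + SEP.length()));
            dist.put(url, value);
        }
        return new CategoryDistributionLine(key, dist);
    }
}
```

For a Biterm-Category line, `keyFields` would hold the two words and the language pair; for a Document-Category line, the document id, label, and language.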

###Input file organization: suppose the dataset name is 'A'. For the CC-BiLDA method, the Word Documents file and the Document-Category Distribution file are located at:
corpus/A/exact matching/CC-BiLDA/TextPairs(for BiLDA)_A
corpus/A/exact matching/CC-BiLDA/TextPairs(for BiLDA)_A.(<avg_pi> or <hier_pi>)
For the CC-BiBTM method, the Biterm Documents file and the Biterm-Category Distribution file are located at:
corpus/A/exact matching/CC-BiBTM/Biterms(for BiBTM)_A
corpus/A/exact matching/CC-BiBTM/Biterms(for BiBTM)_A.(<avg_pi> or <hier_pi>)
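The layout above can be expressed as a small path builder, which may help when preparing a new dataset. This is only a sketch: `CorpusPaths` is a hypothetical class, and the distribution suffix is passed in by the caller because the exact suffix text (shown above as `<avg_pi>` or `<hier_pi>`) is an assumption here.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of the CLTA source code.
class CorpusPaths {

    // File name prefix used for each method's directory, as listed above.
    static String filePrefix(String method) {
        switch (method) {
            case "CC-BiLDA":
                return "TextPairs(for BiLDA)";
            case "CC-BiBTM":
                return "Biterms(for BiBTM)";
            default:
                throw new IllegalArgumentException("unknown method: " + method);
        }
    }

    // corpus/<dataset>/exact matching/<method>/<prefix>_<dataset>
    static Path documents(String dataset, String method) {
        return Paths.get("corpus", dataset, "exact matching", method,
                filePrefix(method) + "_" + dataset);
    }

    // Same location as the documents file, with ".<piSuffix>" appended.
    // Assumption: the caller supplies the suffix text for the avg or hier variant.
    static Path distribution(String dataset, String method, String piSuffix) {
        Path docs = documents(dataset, method);
        return docs.resolveSibling(docs.getFileName() + "." + piSuffix);
    }
}
```

For example, `CorpusPaths.documents("A", "CC-BiBTM")` resolves to the `Biterms(for BiBTM)_A` file under `corpus/A/exact matching/CC-BiBTM`.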

###Compile Project: before running this project, compile it with Maven:
mvn assembly:assembly

###Run Project: the build generates the jar package 'alignment-1.0-SNAPSHOT.jar' in the target directory.

If you are using this project for the first time, run:
java -jar target/alignment-1.0-SNAPSHOT.jar -h
This prints the help options:

usage: Model Run Options
 -alpha <arg>         Hyper Parameter Alpha
 -avg                 Using Average Category Distribution to inference the
                      GibbsSampling.
 -f <arg>             File Name
 -h                   HELP_DESCRIPTION
 -hier                Using Hierarchy Category Distribution to inference
                      the GibbsSampling.
 -iter <arg>          Iteration Number
 -k <arg>             Topic Number
 -m <arg>             Method for training the corpus, one of <CCBiBTM,
                      CCBiLDA>
 -savestep <arg>      Step to Save
 -source_beta <arg>   Source Beta
 -t <arg>             Data Type
 -target_beta <arg>   Target Beta

You can then follow the help options to run this project on your own datasets. For example:
java -jar target/alignment-1.0-SNAPSHOT.jar -m CCBiBTM -f "Biterms(for BiBTM)" -t "product catalogue" -iter 300 -savestep 100 -k 100
Options that are not specified take their default values.
