This project provides a text feature selection method based on the chi-squared test.
The script can run in stand-alone mode, or in cluster mode via Hadoop Streaming.
https://github.com/kn45/Chi-Square
| --- | cat | non-cat | sum over cats |
|---|---|---|---|
| with word | A[] | B[] | A+B |
| without word | C[] | D[] | C+D |
| sum | A+C[] | B+D[] | N |
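
As a rough illustration of how the four cells of the table turn into a score, the sketch below computes the standard 2x2 chi-squared statistic, N(AD-BC)^2 / ((A+B)(C+D)(A+C)(B+D)). The function name `chi2_score` and the sign rule for `pos` (positive when AD >= BC) are assumptions made for this example; they are not taken from `mapred_chi2.py`, which may handle edge cases differently.

```python
def chi2_score(a, b, c, d):
    """Return (chi2, pos) for one (cat, word) contingency table.

    a: passages in cat that contain the word
    b: passages outside cat that contain the word
    c: passages in cat that do not contain the word
    d: passages outside cat that do not contain the word
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        # Degenerate table (an empty row or column): no evidence either way.
        return 0.0, 1
    diff = a * d - b * c
    # float() keeps the division exact under both Python 2 and Python 3.
    chi2 = float(n) * diff * diff / denom
    # Assumed sign rule: the word is over-represented in cat when AD > BC.
    pos = 1 if diff >= 0 else -1
    return chi2, pos
```

For example, `chi2_score(8, 2, 2, 8)` describes a word that appears in 8 of 10 passages of the category and in 2 of 10 passages outside it.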
cat[TAB]segments
cat is the class label as a string, and segments are the space-separated words of a passage.
e.g.:
sport[TAB]well done MSN congrats to Barcelona
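
A minimal sketch of how one input line could be parsed, assuming `[TAB]` stands for a literal tab character. The helper name `parse_passage` is hypothetical, and collapsing the words into a set is an assumption that follows from the counts being per passage rather than per occurrence.

```python
def parse_passage(line):
    """Split one input line into (cat, set of distinct words)."""
    # Split only on the first tab so the segments stay in one piece.
    cat, segments = line.rstrip("\n").split("\t", 1)
    # Assumption: a word counts once per passage, so duplicates are dropped.
    return cat, set(segments.split())


cat, words = parse_passage("sport\twell done MSN congrats to Barcelona\n")
```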
cat[TAB]word[TAB]chi2_value[TAB]A[TAB]B[TAB]C[TAB]D[TAB]pos
pos indicates whether the word is positively (1) or negatively (-1) related to the category
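
One possible way to consume this output is to keep, for each category, the positively related words with the highest chi2_value. The sketch below assumes tab-separated fields in the order given above; the function name `top_words`, the parameter `k`, and the sample rows are all made up for illustration.

```python
from collections import defaultdict


def top_words(lines, k=2):
    """Return {cat: [up to k positive words, highest chi2 first]}."""
    by_cat = defaultdict(list)
    for line in lines:
        cat, word, chi2, a, b, c, d, pos = line.rstrip("\n").split("\t")
        if int(pos) == 1:
            by_cat[cat].append((float(chi2), word))
    return {
        cat: [word for _, word in sorted(items, reverse=True)[:k]]
        for cat, items in by_cat.items()
    }


# Invented sample rows, not real output of the script.
rows = [
    "sport\tgoal\t20.0\t10\t0\t0\t10\t1",
    "sport\tmatch\t7.2\t8\t2\t2\t8\t1",
    "sport\tdress\t20.0\t0\t10\t10\t0\t-1",
]
selected = top_words(rows, k=2)
```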
A file that records the pre-computed number of passages in each category, one category per line, in the format:
cat[TAB]count
e.g.:
fashion[TAB]347882
sport[TAB]2443297
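
The sketch below shows why such a file is useful: with the per-category totals and the overall total N known up front, the remaining cells of the contingency table follow from A and the word's total passage count. It assumes `[TAB]` stands for a literal tab; the names `load_passage_counts` and `fill_table` are hypothetical and not taken from `mapred_chi2.py`.

```python
import io


def load_passage_counts(lines):
    """Read cat<TAB>count lines into a {cat: count} dict."""
    counts = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        cat, count = line.split("\t")
        counts[cat] = int(count)
    return counts


def fill_table(a, word_total, cat_total, n):
    """Derive (A, B, C, D) from A, the word total (A+B), the cat total (A+C) and N."""
    b = word_total - a
    c = cat_total - a
    d = n - a - b - c
    return a, b, c, d


sample = io.StringIO(u"fashion\t347882\nsport\t2443297\n")
counts = load_passage_counts(sample)
n_total = sum(counts.values())  # N in the contingency table
```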
cat input_passage.tst | ./mapred_chi2.py m | sort | ./mapred_chi2.py r passage_cnt_file > output_chi2.tst
Refer to run_chi2_uni.sh