handicR

handicR provides Korean morphological analysis in R using the HanDic dictionary. It relies on the Python package handic and uses reticulate to bridge R and Python.

This allows users to perform Korean tokenization, part-of-speech tagging, and corpus analysis directly from R.


Features

  • Korean tokenization
  • Korean POS tagging
  • Frequency analysis
  • Document-Feature Matrix (DFM) creation
  • Corpus-level analysis
  • Co-occurrence network analysis

The package is designed for researchers working on:

  • Korean linguistics
  • corpus linguistics
  • language education research
  • digital humanities

Requirements

handicR requires the following:

  • R (≥ 4.0 recommended)
  • Python
  • reticulate R package
  • Python packages:
    • handic
    • mecab-python3
    • jamotools

Installation

Install the development version from GitHub using devtools:

devtools::install_github("okikirmui/handicR")

Load the package:

library(handicR)

Initial Setup (first time only)

On the first use, run the setup function to create a Python environment and install the required Python packages.

handicR::ko_setup()

By default, this creates a virtualenv environment named r-handic and installs the required Python modules.

If no suitable Python environment manager is available, reticulate may automatically install Miniconda during this process.


Using Anaconda on Windows

If you are using Anaconda on Windows, explicitly specify the conda method:

handicR::ko_setup(method = "conda")

This will create a conda environment named r-handic and install the required Python packages inside it.

If you already have an existing Anaconda environment and want to use its Python installation, you can explicitly specify the Python interpreter using reticulate::use_python() before running ko_setup():

reticulate::use_python("C:/path/to/anaconda/envs/your-env/python.exe", required = TRUE)
handicR::ko_setup(method = "conda")

This allows handicR to use the Python environment that is already configured in your Anaconda installation.


Subsequent Use

After the initial setup, you can simply load the package:

library(handicR)

Using Anaconda on Windows (subsequent sessions)

When using an Anaconda environment on Windows, select the environment before loading the package:

reticulate::use_python("C:/path/to/anaconda/envs/your-env/python.exe", required = TRUE)
handicR::ko_use_env("r-handic", "conda")
library(handicR)

Notes

  • The setup step (ko_setup()) is required only once.
  • The Python dependencies are managed through the reticulate package.
  • The default configuration uses virtualenv, but conda can be used explicitly if desired.
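To confirm that reticulate has picked up the intended environment, you can inspect the active configuration. This is a quick check using standard reticulate functions; the module names are those of the Python dependencies listed above (mecab-python3 is imported as MeCab):

```r
library(reticulate)

# Show which Python interpreter and environment is currently active
py_config()

# Check that the required Python modules are importable
py_module_available("handic")     # should be TRUE after ko_setup()
py_module_available("MeCab")      # provided by mecab-python3
py_module_available("jamotools")
```

If any of these return FALSE, re-run `handicR::ko_setup()` or install the missing module as described in the Troubleshooting section.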

Tokenization

ko_tokenize()

Splits Korean text into tokens using HanDic.

Example

txt <- c(
  "공기 진짜 좋다.",
  "얼굴이 좋아 보여요."
)

ko_tokenize(txt)

Example output

[[1]]
[1] "공기06" "진짜"   "좋다01" "다06"   "."     

[[2]]
[1] "얼굴01"   "이25"     "좋다01"   "보이다02" "요81"     "." 

Surface-form tokens

ko_tokenize(txt, mode = "surface")

Example output

[[1]]
[1] "공기" "진짜" "좋"   "다"   "."   

[[2]]
[1] "얼굴" "이"   "좋아" "보여" "요"   "." 

POS Tagging

ko_pos()

Performs morphological analysis and returns tokens with POS tags.

Example:

pos_df <- ko_pos(txt)

pos_df

Output example:

   doc_id     i token    pos  
 1      1     1 공기06   NNG  
 2      1     2 진짜     MAG  
 3      1     3 좋다01   VA   
 4      1     4 다06     EF   
 5      1     5 .        SF   
 6      2     1 얼굴01   NNG  
 7      2     2 이25     JKS  
 8      2     3 좋다01   VA   
 9      2     4 보이다02 VV   
10      2     5 요81     EF   
11      2     6 .        SF

Using surface forms

ko_pos(txt, mode = "surface")
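Because ko_pos() returns an ordinary data frame, its output can be filtered with base R. As a small sketch (using the column names doc_id, i, token, and pos shown in the example output above):

```r
# Tag the example sentences, then keep only common and proper nouns
pos_df <- ko_pos(txt)
nouns <- subset(pos_df, pos %in% c("NNG", "NNP"))
nouns$token
```

The same data frame can of course be passed to dplyr or tidytext pipelines.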

Frequency Analysis

ko_count()

Creates frequency tables.

Token frequency (default)

ko_count(txt, by = "token")

POS frequency

ko_count(txt, by = "pos")

Token–POS combinations

ko_count(txt, by = "token_pos")

Document Feature Matrix

ko_dfm()

Creates a quanteda Document‑Feature Matrix (DFM).

library(quanteda)

dfm_mat <- ko_dfm(txt)

dfm_mat

POS filtering

Example: nouns only

dfm_nouns <- ko_dfm(
  txt,
  pos_keep = c("NNG", "NNP")
)

dfm_nouns

Creating a DFM from Text Files

ko_dfm_dir()

Creates a DFM from all .txt files in a directory.

Example directory:

texts/
  doc1.txt
  doc2.txt
  doc3.txt

Example:

dfm_dir <- ko_dfm_dir(
  "texts",
  pos_keep = c("NNG","NNP")
)

dfm_dir

This is useful for corpus analysis of many documents.


Frequency Table from a File

ko_freq_file()

Creates a frequency table from a single text file.

Example:

freq <- ko_freq_file(
  "sample.txt",
  by = "token"
)

freq

Token‑POS frequency

freq <- ko_freq_file(
  "sample.txt",
  by = "token_pos"
)

Typical Workflow Example

Sample texts

The sample texts are located in:

inst/extdata/sample_texts/

These files contain the inaugural speeches of the 18th–21st Presidents of the Republic of Korea and are included solely as small demonstration data for the examples in this README.

Example:

# sample text 01
speech_18th <- system.file("extdata", "sample_texts/sample_01.txt", package = "handicR")

ko_freq_file(speech_18th, by = "token_pos")

Another example:

library(handicR)
library(quanteda)

# sample text 02
speech_sample <- system.file("extdata", "sample_texts/sample_02.txt", package = "handicR")

speech_19th <- scan(speech_sample, what = character(0))

# build dfm
dfm_mat <- ko_dfm(speech_19th, pos_keep = c("NNG","NP"))

# inspect most frequent terms
quanteda::topfeatures(dfm_mat)

Example: Creating a DFM and Visualizing Correspondence Analysis

This example demonstrates how to:

  • Load sample Korean texts included in the package
  • Build a document-feature matrix (DFM) using selected POS tags
  • Trim low-frequency terms
  • Run Correspondence Analysis (CA)
  • Visualize the results as a biplot

# Load required packages
library(handicR)
library(quanteda)

# Get the directory of bundled sample texts
# (located in inst/extdata/sample_texts/)
sample_dir <- system.file("extdata", "sample_texts", package = "handicR")

# Create a document-feature matrix (DFM)
# Keep only common nouns (NNG) and proper nouns (NNP)
dfm_dir <- ko_dfm_dir(sample_dir, pos_keep = c("NNG", "NNP"))

# Remove low-frequency terms to reduce noise
# - min_termfreq: minimum total frequency across all documents
# - min_docfreq: minimum number of documents a term must appear in
dfm_trimmed <- dfm_trim(dfm_dir, min_termfreq = 4, min_docfreq = 3)

# Fit Correspondence Analysis (CA)
library(quanteda.textmodels)

ca_fit <- quanteda.textmodels::textmodel_ca(
  dfm_trimmed,
  nd = 2,        # number of dimensions
  sparse = TRUE  # efficient computation for sparse matrices
)

# Visualize CA results as a biplot
library(FactoMineR)
library(factoextra)

fviz_ca_biplot(
  ca_fit,
  repel = TRUE,  # avoid label overlap
  font.family = "NotoSansCJKkr-Regular"  # ensure proper Korean font rendering
)

Example: Visualizing Co-occurrence Network

A sample script for creating a word co-occurrence network using Korean text is included in this package.

You can run an example script as follows:

example_script <- system.file("extdata", "networkD3_example.R", package = "handicR")
source(example_script)
p  # display the network object created by the script

The script demonstrates how to:

  • perform Korean morphological analysis using ko_pos()
  • filter tokens by part-of-speech
  • compute word co-occurrence
  • visualize the network using packages such as network, igraph, or networkD3
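The bundled script is the reference implementation. As a rough sketch of the same idea, a co-occurrence matrix can also be built from a handicR DFM with quanteda's fcm() and turned into a graph with igraph (this sketch assumes the txt vector from the earlier examples; fcm() and graph_from_adjacency_matrix() are standard quanteda and igraph functions):

```r
library(handicR)
library(quanteda)
library(igraph)

# Build a noun-only DFM, then a feature co-occurrence matrix (FCM)
dfm_mat <- ko_dfm(txt, pos_keep = c("NNG", "NNP"))
cooc <- fcm(dfm_mat)

# Convert the FCM into an undirected, weighted graph and plot it
g <- graph_from_adjacency_matrix(
  as.matrix(cooc),
  mode = "undirected",
  weighted = TRUE
)
plot(g, vertex.label.family = "NotoSansCJKkr-Regular")
```

For larger corpora, trim the DFM first (e.g. with dfm_trim()) so the network stays readable.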

Notes for Windows Users (Anaconda / conda)

When installing handicR in a Windows environment that uses Anaconda / conda together with RStudio, the R session may occasionally terminate with a "session aborted" message during installation.

In many cases:

  • The package installation itself still completes successfully, even if RStudio crashes.
  • After restarting R or RStudio, the package can usually be loaded without reinstalling.

Additionally, when using reticulate with conda environments on Windows, it may be necessary to explicitly specify the Python interpreter in each new R session:

reticulate::use_python("C:/path/to/your/conda/env/python.exe", required = TRUE)
library(handicR)

This behavior is related to how reticulate detects Python environments on Windows systems and is not specific to handicR.

If possible, we recommend:

  • using a dedicated conda environment for handicR
  • launching R from within the conda environment
  • using WSL2 or Linux environments, where Python environment discovery is typically more stable
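For example, launching R from inside a dedicated conda environment might look like this (a shell sketch; the environment name r-handic matches the default created by ko_setup()):

```shell
# Activate the dedicated conda environment, then start R from inside it
conda activate r-handic
R
```

Starting R this way lets reticulate discover the environment's Python interpreter without an explicit use_python() call in most cases.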

Troubleshooting

Python module not found

If you see an error like:

ModuleNotFoundError: No module named 'handic'

install the required Python packages:

reticulate::py_install(c("handic", "mecab-python3", "jamotools"))

Author

  • Yoshinori Sugai (Kindai University)

Copyright

Copyright (c) 2026 Yoshinori Sugai

Released under the MIT License.

About

Korean morphological analysis in R using HanDic (MeCab) with tidytext and quanteda integration.
