handicR provides Korean morphological analysis in R using the HanDic dictionary. The package relies on the Python package handic and uses reticulate to interface with Python.
This allows users to perform Korean tokenization, part-of-speech tagging, and corpus analysis directly from R.
- Korean tokenization
- Korean POS tagging
- Frequency analysis
- Document-Feature Matrix (DFM) creation
- Corpus-level analysis
- Co-occurrence network analysis
The package is designed for researchers working on:
- Korean linguistics
- corpus linguistics
- language education research
- digital humanities
handicR requires the following:
- R (≥ 4.0 recommended)
- Python
- reticulate R package
- Python packages:
  - handic
  - mecab-python3
  - jamotools
Install the development version from GitHub.
```r
devtools::install_github("okikirmui/handicR")
```

Load the package:

```r
library(handicR)
```

On first use, run the setup function to create a Python environment and install the required Python packages.

```r
handicR::ko_setup()
```

By default, this creates a virtualenv environment named r-handic and installs the required Python modules.
If no suitable Python environment manager is available, reticulate may automatically install Miniconda during this process.
If you are using Anaconda on Windows, explicitly specify the conda method:
```r
handicR::ko_setup(method = "conda")
```

This will create a conda environment named r-handic and install the required Python packages inside it.
If you already have an existing Anaconda environment and want to use its Python installation, you can explicitly specify the Python interpreter using reticulate::use_python() before running ko_setup():
```r
reticulate::use_python("C:/path/to/anaconda/envs/your-env/python.exe", required = TRUE)
handicR::ko_setup(method = "conda")
```

This allows handicR to use the Python environment that is already configured in your Anaconda installation.
After the initial setup, you can simply load the package:
```r
library(handicR)
```

When using an Anaconda environment on Windows, select the environment before loading the package:

```r
reticulate::use_python("C:/path/to/anaconda/envs/your-env/python.exe", required = TRUE)
handicR::ko_use_env("r-handic", "conda")
library(handicR)
```

- The setup step (`ko_setup()`) is required only once.
- The Python dependencies are managed through the reticulate package.
- The default configuration uses virtualenv, but conda can be used explicitly if desired.
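After setup, you can confirm that the binding worked using standard reticulate helpers. This is a quick sanity check, not part of handicR itself; note that the importable module name for mecab-python3 is assumed to be `MeCab` here:

```r
library(reticulate)

# Show which Python interpreter reticulate is bound to
py_config()

# Check that the required Python modules can be imported
py_module_available("handic")     # TRUE if setup succeeded
py_module_available("MeCab")      # module provided by mecab-python3 (assumed name)
py_module_available("jamotools")
```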
`ko_tokenize()` splits Korean text into tokens using HanDic.
```r
txt <- c(
  "공기 진짜 좋다.",
  "얼굴이 좋아 보여요."
)

ko_tokenize(txt)
```

Example output:

```
[[1]]
[1] "공기06" "진짜" "좋다01" "다06" "."

[[2]]
[1] "얼굴01" "이25" "좋다01" "보이다02" "요81" "."
```

Surface forms can be requested instead:

```r
ko_tokenize(txt, mode = "surface")
```

Example output:

```
[[1]]
[1] "공기" "진짜" "좋" "다" "."

[[2]]
[1] "얼굴" "이" "좋아" "보여" "요" "."
```

`ko_pos()` performs morphological analysis and returns tokens with POS tags.
Example:
```r
pos_df <- ko_pos(txt)
pos_df
```

Output example:

```
   doc_id i    token pos
1       1 1   공기06 NNG
2       1 2     진짜 MAG
3       1 3   좋다01  VA
4       1 4     다06  EF
5       1 5        .  SF
6       2 1   얼굴01 NNG
7       2 2     이25 JKS
8       2 3   좋다01  VA
9       2 4 보이다02  VV
10      2 5     요81  EF
11      2 6        .  SF
```

Surface forms can be requested as well:

```r
ko_pos(txt, mode = "surface")
```

`ko_count()` creates frequency tables.
```r
ko_count(txt, by = "token")
ko_count(txt, by = "pos")
ko_count(txt, by = "token_pos")
```

`ko_dfm()` creates a quanteda Document-Feature Matrix (DFM).
```r
library(quanteda)

dfm_mat <- ko_dfm(txt)
dfm_mat
```

Example: nouns only

```r
dfm_nouns <- ko_dfm(
  txt,
  pos_keep = c("NNG", "NNP")
)
dfm_nouns
```

`ko_dfm_dir()` creates a DFM from all .txt files in a directory.
Example directory:
```
texts/
  doc1.txt
  doc2.txt
  doc3.txt
```

Example:

```r
dfm_dir <- ko_dfm_dir(
  "texts",
  pos_keep = c("NNG", "NNP")
)
dfm_dir
```

This is useful for corpus analysis of many documents.
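Once a DFM is built, the usual quanteda tools apply directly. A small sketch, assuming the `dfm_dir` object from the example above:

```r
library(quanteda)

ndoc(dfm_dir)                  # number of documents
nfeat(dfm_dir)                 # number of distinct features (terms)
topfeatures(dfm_dir, n = 10)   # most frequent terms across the corpus

# Convert to a plain data frame for use outside quanteda
df <- convert(dfm_dir, to = "data.frame")
```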
`ko_freq_file()` creates a frequency table from a single text file.
Example:
```r
freq <- ko_freq_file(
  "sample.txt",
  by = "token"
)
freq

freq <- ko_freq_file(
  "sample.txt",
  by = "token_pos"
)
```

The sample texts are located in:

```
inst/extdata/sample_texts/
```

These files contain the inaugural speeches of the 18th–21st Presidents of the Republic of Korea, included solely as small demonstration data for the co-occurrence network examples.
Example:
```r
# sample text 01
speech_18th <- system.file("extdata", "sample_texts/sample_01.txt", package = "handicR")
ko_freq_file(speech_18th, by = "token_pos")
```

Another example:

```r
library(handicR)
library(quanteda)

# sample text 02
speech_sample <- system.file("extdata", "sample_texts/sample_02.txt", package = "handicR")
speech_19th <- scan(speech_sample, what = character(0))

# build dfm
dfm_mat <- ko_dfm(speech_19th, pos_keep = c("NNG", "NP"))

# inspect most frequent terms
quanteda::topfeatures(dfm_mat)
```

The following example demonstrates how to:
- Load sample Korean texts included in the package
- Build a document-feature matrix (DFM) using selected POS tags
- Trim low-frequency terms
- Run Correspondence Analysis (CA)
- Visualize the results as a biplot
```r
# Load required packages
library(handicR)
library(quanteda)

# Get the directory of bundled sample texts
# (located in inst/extdata/sample_texts/)
sample_dir <- system.file("extdata", "sample_texts", package = "handicR")

# Create a document-feature matrix (DFM)
# Keep only common nouns (NNG) and proper nouns (NNP)
dfm_dir <- ko_dfm_dir(sample_dir, pos_keep = c("NNG", "NNP"))

# Remove low-frequency terms to reduce noise
# - min_termfreq: minimum total frequency across all documents
# - min_docfreq: minimum number of documents a term must appear in
dfm_trimmed <- dfm_trim(dfm_dir, min_termfreq = 4, min_docfreq = 3)

# Fit Correspondence Analysis (CA)
library(quanteda.textmodels)
ca_fit <- quanteda.textmodels::textmodel_ca(
  dfm_trimmed,
  nd = 2,        # number of dimensions
  sparse = TRUE  # efficient computation for sparse matrices
)

# Visualize CA results as a biplot
library(FactoMineR)
library(factoextra)
fviz_ca_biplot(
  ca_fit,
  repel = TRUE,  # avoid label overlap
  font.family = "NotoSansCJKkr-Regular"  # ensure proper Korean font rendering
)
```

A sample script for creating a word co-occurrence network from Korean text is included in this package.
You can run an example script as follows:
```r
example_script <- system.file("extdata", "networkD3_example.R", package = "handicR")
source(example_script)
p  # display the plot object created by the script
```

The script demonstrates how to:
- perform Korean morphological analysis using `ko_pos()`
- filter tokens by part-of-speech
- compute word co-occurrence
- visualize the network using packages such as network, igraph, or networkD3
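The core idea can be sketched in a few lines without the bundled script. This is a minimal illustration, not the script's actual code: it assumes `ko_pos()` returns a data frame with `doc_id`, `token`, and `pos` columns (as shown above) and treats any two content words in the same document as co-occurring:

```r
library(handicR)
library(igraph)

txt <- c("공기 진짜 좋다.", "얼굴이 좋아 보여요.")

# Morphological analysis; keep content words only
pos_df <- ko_pos(txt)
pos_df <- pos_df[pos_df$pos %in% c("NNG", "NNP", "VV", "VA"), ]

# All unordered token pairs within each document
pairs <- do.call(rbind, lapply(split(pos_df$token, pos_df$doc_id), function(tok) {
  tok <- unique(tok)
  if (length(tok) < 2) return(NULL)
  t(combn(tok, 2))
}))

# Build and plot an undirected co-occurrence network
g <- igraph::graph_from_edgelist(pairs, directed = FALSE)
plot(g)
```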
When installing handicR in a Windows environment that uses Anaconda / conda together with RStudio, the R session may occasionally terminate with a "session aborted" message during installation.
In many cases:
- The package installation itself still completes successfully, even if RStudio crashes.
- After restarting R or RStudio, the package can usually be loaded without reinstalling.
Additionally, when using reticulate with conda environments on Windows, it may be necessary to explicitly specify the Python interpreter in each new R session:
```r
reticulate::use_python("C:/path/to/your/conda/env/python.exe", required = TRUE)
library(handicR)
```

This behavior is related to how reticulate detects Python environments on Windows and is not specific to handicR.
If possible, we recommend:
- using a dedicated conda environment for handicR
- launching R from within the conda environment
- using WSL2 or Linux environments, where Python environment discovery is typically more stable
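One way to make the interpreter choice stick across sessions is reticulate's standard RETICULATE_PYTHON mechanism (the path below is the same placeholder used above; substitute your own environment):

```r
# Option 1: set it in ~/.Renviron so every session picks it up:
# RETICULATE_PYTHON="C:/path/to/your/conda/env/python.exe"

# Option 2: set it at the top of a script, before reticulate initializes Python
Sys.setenv(RETICULATE_PYTHON = "C:/path/to/your/conda/env/python.exe")
library(handicR)
```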
If you see an error like:

```
ModuleNotFoundError: No module named 'handic'
```

install the required Python packages:

```r
reticulate::py_install(c("handic", "mecab-python3", "jamotools"))
```

Author:

- Yoshinori Sugai (Kindai University)
Copyright (c) 2026 Yoshinori Sugai
Released under the MIT License.