handicR

handicR provides Korean morphological analysis in R using the HanDic dictionary. It relies on the Python package handic and uses reticulate to bridge R and Python.

This allows users to perform Korean tokenization, part-of-speech tagging, and corpus analysis directly from R.


Features

  • Korean tokenization
  • Korean POS tagging
  • Frequency analysis
  • Document-Feature Matrix (DFM) creation
  • Corpus-level analysis
  • Co-occurrence network analysis

The package is designed for researchers working on:

  • Korean linguistics
  • corpus linguistics
  • language education research
  • digital humanities

Requirements

handicR requires the following:

  • R (≥ 4.0 recommended)
  • Python
  • reticulate R package
  • Python packages:
    • handic
    • mecab-python3
    • jamotools

Installation

Install the development version from GitHub using devtools:

devtools::install_github("okikirmui/handicR")

Load the package:

library(handicR)

Initial Setup (first time only)

On the first use, run the setup function to create a Python environment and install the required Python packages.

handicR::ko_setup()

By default, this creates a virtualenv environment named r-handic and installs the required Python modules.

If no suitable Python environment manager is available, reticulate may automatically install Miniconda during this process.


Using Anaconda on Windows

If you are using Anaconda on Windows, explicitly specify the conda method:

handicR::ko_setup(method = "conda")

This will create a conda environment named r-handic and install the required Python packages inside it.

If you already have an existing Anaconda environment and want to use its Python installation, you can explicitly specify the Python interpreter using reticulate::use_python() before running ko_setup():

reticulate::use_python("C:/path/to/anaconda/envs/your-env/python.exe", required = TRUE)
handicR::ko_setup(method = "conda")

This allows handicR to use the Python environment that is already configured in your Anaconda installation.


Subsequent Use

After the initial setup, you can simply load the package:

library(handicR)

Using Anaconda on Windows (subsequent sessions)

When using an Anaconda environment on Windows, select the environment before loading the package:

reticulate::use_python("C:/path/to/anaconda/envs/your-env/python.exe", required = TRUE)
handicR::ko_use_env("r-handic", "conda")
library(handicR)

Notes

  • The setup step (ko_setup()) is required only once.
  • The Python dependencies are managed through the reticulate package.
  • The default configuration uses virtualenv, but conda can be used explicitly if desired.
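To confirm that reticulate has picked up the intended environment, you can inspect the active configuration. This is a quick check using standard reticulate functions; the module names are those of the Python dependencies listed above (mecab-python3 is imported as MeCab):

```r
library(reticulate)

# Show which Python interpreter and environment is currently active
py_config()

# Check that the required Python modules are importable
py_module_available("handic")     # should be TRUE after ko_setup()
py_module_available("MeCab")      # provided by mecab-python3
py_module_available("jamotools")
```

If any of these return FALSE, re-run `handicR::ko_setup()` or install the missing module as described in the Troubleshooting section.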

Tokenization

ko_tokenize()

Splits Korean text into tokens using HanDic.

Example

txt <- c(
  "공기 진짜 좋다.",
  "얼굴이 좋아 보여요."
)

ko_tokenize(txt)

Example output

[[1]]
[1] "공기06" "진짜"   "좋다01" "다06"   "."     

[[2]]
[1] "얼굴01"   "이25"     "좋다01"   "보이다02" "요81"     "." 

Surface-form tokens

ko_tokenize(txt, mode = "surface")

Example output

[[1]]
[1] "공기" "진짜" "좋"   "다"   "."   

[[2]]
[1] "얼굴" "이"   "좋아" "보여" "요"   "." 

POS Tagging

ko_pos()

Performs morphological analysis and returns tokens with POS tags.

Example:

pos_df <- ko_pos(txt)

pos_df

Output example:

   doc_id     i token    pos  
 1      1     1 공기06   NNG  
 2      1     2 진짜     MAG  
 3      1     3 좋다01   VA   
 4      1     4 다06     EF   
 5      1     5 .        SF   
 6      2     1 얼굴01   NNG  
 7      2     2 이25     JKS  
 8      2     3 좋다01   VA   
 9      2     4 보이다02 VV   
10      2     5 요81     EF   
11      2     6 .        SF

Using surface forms

ko_pos(txt, mode = "surface")
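Because ko_pos() returns an ordinary data frame, its output can be filtered with base R. As a small sketch (using the column names doc_id, i, token, and pos shown in the example output above):

```r
# Tag the example sentences, then keep only common and proper nouns
pos_df <- ko_pos(txt)
nouns <- subset(pos_df, pos %in% c("NNG", "NNP"))
nouns$token
```

The same data frame can of course be passed to dplyr or tidytext pipelines.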

Frequency Analysis

ko_count()

Creates frequency tables.

Token frequency (default)

ko_count(txt, by = "token")

POS frequency

ko_count(txt, by = "pos")

Token–POS combinations

ko_count(txt, by = "token_pos")

Document Feature Matrix

ko_dfm()

Creates a quanteda Document‑Feature Matrix (DFM).

library(quanteda)

dfm_mat <- ko_dfm(txt)

dfm_mat

POS filtering

Example: nouns only

dfm_nouns <- ko_dfm(
  txt,
  pos_keep = c("NNG", "NNP")
)

dfm_nouns

Creating a DFM from Text Files

ko_dfm_dir()

Creates a DFM from all .txt files in a directory.

Example directory:

texts/
  doc1.txt
  doc2.txt
  doc3.txt

Example:

dfm_dir <- ko_dfm_dir(
  "texts",
  pos_keep = c("NNG","NNP")
)

dfm_dir

This is useful for corpus analysis of many documents.


Frequency Table from a File

ko_freq_file()

Creates a frequency table from a single text file.

Example:

freq <- ko_freq_file(
  "sample.txt",
  by = "token"
)

freq

Token‑POS frequency

freq <- ko_freq_file(
  "sample.txt",
  by = "token_pos"
)

Typical Workflow Example

Sample texts

The sample texts are located in:

inst/extdata/sample_texts/

These files contain the inaugural speeches of the 18th–21st Presidents of the Republic of Korea and are included solely as small demonstration data for the examples in this README.

Example:

# sample text 01
speech_18th <- system.file("extdata", "sample_texts/sample_01.txt", package = "handicR")

ko_freq_file(speech_18th, by = "token_pos")

Another example:

library(handicR)
library(quanteda)

# sample text 02
speech_sample <- system.file("extdata", "sample_texts/sample_02.txt", package = "handicR")

speech_19th <- scan(speech_sample, what = character(0))

# build dfm
dfm_mat <- ko_dfm(speech_19th, pos_keep = c("NNG","NP"))

# inspect most frequent terms
quanteda::topfeatures(dfm_mat)

Example: Creating a DFM and Visualizing Correspondence Analysis

This example demonstrates how to:

  • Load sample Korean texts included in the package
  • Build a document-feature matrix (DFM) using selected POS tags
  • Trim low-frequency terms
  • Run Correspondence Analysis (CA)
  • Visualize the results as a biplot

# Load required packages
library(handicR)
library(quanteda)

# Get the directory of bundled sample texts
# (located in inst/extdata/sample_texts/)
sample_dir <- system.file("extdata", "sample_texts", package = "handicR")

# Create a document-feature matrix (DFM)
# Keep only common nouns (NNG) and proper nouns (NNP)
dfm_dir <- ko_dfm_dir(sample_dir, pos_keep = c("NNG", "NNP"))

# Remove low-frequency terms to reduce noise
# - min_termfreq: minimum total frequency across all documents
# - min_docfreq: minimum number of documents a term must appear in
dfm_trimmed <- dfm_trim(dfm_dir, min_termfreq = 4, min_docfreq = 3)

# Fit Correspondence Analysis (CA)
library(quanteda.textmodels)

ca_fit <- quanteda.textmodels::textmodel_ca(
  dfm_trimmed,
  nd = 2,        # number of dimensions
  sparse = TRUE  # efficient computation for sparse matrices
)

# Visualize CA results as a biplot
library(FactoMineR)
library(factoextra)

fviz_ca_biplot(
  ca_fit,
  repel = TRUE,  # avoid label overlap
  font.family = "NotoSansCJKkr-Regular"  # ensure proper Korean font rendering
)

Example: Visualizing Co-occurrence Network

A sample script for creating a word co-occurrence network using Korean text is included in this package.

You can run an example script as follows:

example_script <- system.file("extdata", "networkD3_example.R", package = "handicR")
source(example_script)
p  # display the network object created by the script

The script demonstrates how to:

  • perform Korean morphological analysis using ko_pos()
  • filter tokens by part-of-speech
  • compute word co-occurrence
  • visualize the network using packages such as network, igraph, or networkD3
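The bundled script is the reference implementation. As a rough sketch of the same idea, a co-occurrence matrix can also be built from a handicR DFM with quanteda's fcm() and turned into a graph with igraph (this sketch assumes the txt vector from the earlier examples; fcm() and graph_from_adjacency_matrix() are standard quanteda and igraph functions):

```r
library(handicR)
library(quanteda)
library(igraph)

# Build a noun-only DFM, then a feature co-occurrence matrix (FCM)
dfm_mat <- ko_dfm(txt, pos_keep = c("NNG", "NNP"))
cooc <- fcm(dfm_mat)

# Convert the FCM into an undirected, weighted graph and plot it
g <- graph_from_adjacency_matrix(
  as.matrix(cooc),
  mode = "undirected",
  weighted = TRUE
)
plot(g, vertex.label.family = "NotoSansCJKkr-Regular")
```

For larger corpora, trim the DFM first (e.g. with dfm_trim()) so the network stays readable.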

Notes for Windows Users (Anaconda / conda)

When installing handicR in a Windows environment that uses Anaconda / conda together with RStudio, the R session may occasionally terminate with a "session aborted" message during installation.

In many cases:

  • The package installation itself still completes successfully, even if RStudio crashes.
  • After restarting R or RStudio, the package can usually be loaded without reinstalling.

Additionally, when using reticulate with conda environments on Windows, it may be necessary to explicitly specify the Python interpreter in each new R session:

reticulate::use_python("C:/path/to/your/conda/env/python.exe", required = TRUE)
library(handicR)

This behavior is related to how reticulate detects Python environments on Windows systems and is not specific to handicR.

If possible, we recommend:

  • using a dedicated conda environment for handicR
  • launching R from within the conda environment
  • using WSL2 or Linux environments, where Python environment discovery is typically more stable
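For example, launching R from inside a dedicated conda environment might look like this (a shell sketch; the environment name r-handic matches the default created by ko_setup()):

```shell
# Activate the dedicated conda environment, then start R from inside it
conda activate r-handic
R
```

Starting R this way lets reticulate discover the environment's Python interpreter without an explicit use_python() call in most cases.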

Troubleshooting

Python module not found

If you see an error like:

ModuleNotFoundError: No module named 'handic'

install the required Python packages:

reticulate::py_install(c("handic", "mecab-python3", "jamotools"))

Author

  • Yoshinori Sugai (Kindai University)

Copyright

Copyright (c) 2026 Yoshinori Sugai

Released under the MIT License.

About

Korean morphological analysis in R using HanDic (MeCab) with tidytext and quanteda integration.
