This is the replication package for "Plug it and Play on Logs: A Configuration-Free Statistic-Based Log Parser".
PIPLUP's parsing process is shown in the following figure. PIPLUP comprises two core parsing stages: online log clustering and cluster updating. After all log lines are parsed, the results are stored in a CSV file during the template matching stage for further verification.
PIPLUP leverages a novel tree structure that makes no assumptions about the format or position of constant tokens, and enhances the template extraction approach based on template similarity and describability. Further, it uses a set of data-insensitive parameters, enabling users to directly "plug and play" PIPLUP on their log files without excessive configuration.
Online Log Clustering: Inspired by Drain, PIPLUP leverages a similar tree structure as a hashing function to find the most compatible leaf for an incoming log message and conduct further comparisons. Instead of hashing with
Cluster Updating: The matched clusters are dynamically updated to allow online parsing. First, an in-cluster update is triggered to update the cluster (i.e., message sample, log lines, path list, and template list). If the template list is updated, an inter-cluster template-merging process among clusters under the same constant token node is triggered to reduce template redundancy.
Template Matching: PIPLUP allows multiple templates to co-exist in a log cluster. If a cluster contains only one template, all of its log messages are directly matched to this event. Conversely, if multiple templates are inferred, PIPLUP assigns a template to each in-cluster log message using regex matching. The matched results are stored in a CSV file for further analysis.
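To illustrate the regex-matching step, here is a minimal sketch of how a template with `<*>` placeholders can be compiled into a regex and matched against in-cluster messages. The helper, templates, and message below are hypothetical illustrations, not PIPLUP's actual implementation:

```python
import re

def template_to_regex(template):
    """Convert a log template into an anchored regex by escaping the
    constant text and turning each <*> placeholder into a wildcard."""
    pattern = "".join(
        r"(.*?)" if part == "<*>" else re.escape(part)
        for part in re.split(r"(<\*>)", template)
    )
    return re.compile(rf"^{pattern}$")

# A cluster holding two co-existing templates (hypothetical examples).
templates = ["Connected to <*> on port <*>", "Connection to <*> refused"]
message = "Connected to 10.0.0.5 on port 8080"

# Assign the first template whose regex matches the message.
matched = next(t for t in templates if template_to_regex(t).match(message))
```

Here `matched` resolves to the first template, with the IP address and port absorbed by the wildcards.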
- python>=3.8
- chardet==5.1.0
- ipython==8.12.0
- matplotlib==3.7.2
- natsort==8.4.0
- numpy==1.24.4
- pandas==2.0.3
- regex==2022.3.2
- scipy
- tqdm==4.65.0
- rpy2
- spacy
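The pinned dependencies above can be installed with pip, for example as follows (assuming a Python >= 3.8 environment; no requirements file is assumed to ship with the package):

```shell
# Install the dependencies listed above, pinned where versions are given.
pip install chardet==5.1.0 ipython==8.12.0 matplotlib==3.7.2 natsort==8.4.0 \
    numpy==1.24.4 pandas==2.0.3 regex==2022.3.2 scipy tqdm==4.65.0 rpy2 spacy
```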
For the experiments, we use log data from Loghub 2.0; before replicating the experiments, please obtain the original data from the Loghub 2.0 repository. Before starting the experiments, we correct the event templates with the latest rules provided by LOGPAI to ensure the quality of the ground truths. The ground-truth templates can be corrected automatically with template_correction.py.
To replicate the results of RQ1, change into the benchmark folder and run ./run_rq1_ablation.sh, ./run_rq1_hit_thresh.sh, ./run_rq1_br_thresh.sh, and ./run_rq1_sim_thresh.sh; to replicate the overall evaluation of RQ2 and RQ3, run ./run_all_full.sh. To conduct the Scott-Knott effect size difference (ESD) analysis, first follow the ScottKnottESD tutorial to install the package, then run ./sk_analysis.py.
Three parameters are required in PIPLUP's parsing process, namely
Evaluating the impact of
Evaluating the impact of
Evaluating the impact of
PIPLUP leverages a generalizable preprocessing framework to enhance parsing effectiveness and a merging process to reduce redundant templates. We conduct an ablation experiment to understand their impact on PIPLUP’s performance. The results of our ablation study are shown in the following table:
Evaluating the impact of log preprocessing: PIPLUP is sensitive to disabling the preprocessing framework, reflecting a known limitation of statistic-based parsers that apply token-level constant/variable categorization (i.e., they cannot separate constants from variables that coexist within a single token). Specifically, PIPLUP's FTA and FGA decreased to less than 0.010 on the OpenStack and HealthApp datasets. These two datasets contain many messages with high volumes of combined-type tokens (i.e., tokens containing both constants and variables), such as “totalAltitude=240” and “ask=7”. PIPLUP's performance drop highlights the importance of log preprocessing in statistic-based log parsers.
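As an illustration of why such preprocessing matters, a simple rule can mask the variable side of a key=value token before clustering, so the constant key and the variable value are no longer fused in a single token. This is a hypothetical rule for illustration, not PIPLUP's actual preprocessing framework:

```python
import re

# Hypothetical preprocessing rule: replace the value side of key=value
# tokens with a placeholder (e.g. "totalAltitude=240" -> "totalAltitude=<*>").
KV_PATTERN = re.compile(r"\b([A-Za-z]\w*)=(\S+)")

def preprocess(message):
    """Mask the variable part of combined-type key=value tokens."""
    return KV_PATTERN.sub(r"\1=<*>", message)

masked = preprocess("report totalAltitude=240 ask=7")
```

Without such a rule, a token-level parser would treat each distinct value of "totalAltitude=..." as a different constant token.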
Evaluating the impact of template merging: With the merging module disabled, PIPLUP exhibited slight reductions in its average GA (-6.2%), PA (-0.4%), FGA (-5.5%), and FTA (-2.7%). PIPLUP's merging component effectively reduces redundant templates: it eliminated at least one overlapping template in 8 datasets, most notably 32 redundant templates in the HPC dataset. The performance degradation on Hadoop and OpenSSH is due to a single wrong merge in each dataset. The template differences between PIPLUP w/ and w/o the merging module can be found in ./results/RQ1/merge_analysis.
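For intuition, template merging can be sketched as comparing two same-length templates position by position and generalizing disagreements to `<*>`. The similarity metric and threshold below are illustrative assumptions only, not PIPLUP's actual merging criterion:

```python
def similarity(t1, t2):
    """Fraction of positions where two equal-length token lists agree;
    a <*> placeholder counts as agreeing with anything. (Illustrative.)"""
    if len(t1) != len(t2):
        return 0.0
    same = sum(a == b or "<*>" in (a, b) for a, b in zip(t1, t2))
    return same / len(t1)

def merge(t1, t2):
    """Merge two templates by generalizing disagreeing positions to <*>."""
    return [a if a == b else "<*>" for a, b in zip(t1, t2)]

a = "Session opened for user root".split()
b = "Session opened for user admin".split()
merged = merge(a, b) if similarity(a, b) >= 0.8 else None  # hypothetical threshold
```

Two templates differing only in their last token merge into one template ending in `<*>`, which is the kind of redundancy reduction the merging module targets.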
Detailed results can be found under ./results/RQ1/. We inherit the parameter settings from RQ1 and use them to parse all 14 datasets in RQ2 and RQ3.
PIPLUP is compared with seven state-of-the-art log parsers: Drain, XDrain, Preprocessed-Drain, LILAC, LibreLog, LogBatcher, and LUNAR. Due to resource limitations, we did not replicate LibreLog; its parsing results and time consumption are therefore obtained from its original study, and the evaluations are re-run on the corrected ground truths. We also experimented with PILAR, another data-insensitive log parser; PILAR's replication code is stored under the PILAR_implementation folder. The following table shows the parsing effectiveness of PIPLUP alongside the seven benchmark parsers.
According to the table, PIPLUP achieves significantly higher average performance than state-of-the-art statistic-based parsers on all four metrics. Moreover, even with LLM-powered semantic-based parsers included, PIPLUP's performance is statistically optimal or near-optimal on all four metrics.
The datasets are sorted from the smallest (i.e., fewest lines) to the largest (i.e., most lines), and their time consumption is documented in the following table. As shown, all parsers exhibit anomalously high time consumption on the Thunderbird dataset. Therefore, we provide two versions of the statistical time rankings (i.e., all files, and all files excluding Thunderbird) to avoid drawing misleading conclusions.
On average, PIPLUP requires more processing time than Drain but less than XDrain and Preprocessed-Drain. It also has much lower time consumption than the semantic-based parsers. According to the Scott-Knott ESD ranking, PIPLUP is the second most efficient parser, requiring only ~1.5 seconds to ~25 minutes to parse each of the studied datasets. Its time efficiency is statistically comparable to state-of-the-art statistic-based parsers and much better than that of semantic-based ones relying on expensive computing resources (e.g., it uses only ~6% of LUNAR's parsing time).
Theoretically, PIPLUP has the same
Detailed results for RQ2 and RQ3 can be found under ./results/RQ2&RQ3/.
├── 2k_dataset # Loghub-2k
├── PILAR_implementation # PILAR evaluation with Loghub 2.0 evaluation functions
├── benchmark
│   ├── evaluation # Configurations for the parsers
│   ├── logparser # Main code for parsers
│   │   ├── Drain
│   │   ├── PIPLUP
│   │   ├── Preprocessed_Drain
│   │   ├── utils
│   │   ├── XDrain
│   │   └── __init__.py
│   ├── old_benchmark # Default settings for the Drain series
│   ├── run_all_full.sh # Script for running default PIPLUP on all datasets
│   ├── run_rq1_ablation.sh
│   ├── run_rq1_br_thresh.sh
│   ├── run_rq1_hit_thresh.sh
│   ├── run_rq1_sim_thresh.sh
│   └── README.md
├── figures
├── result # Performance results for all parsers in CSV format (including PILAR)
│   ├── RQ1
│   ├── RQ2&RQ3
│   └── result_PIPLUP_no_merge # Templates extracted by PIPLUP w/o merging module
├── sk_analysis.py # Code for Scott-Knott ESD analysis
├── template_correction.py # Code for ground truth template correction
└── README.md