Code for my Diploma thesis in Information and Communication Systems Engineering (University of the Aegean, School of Engineering), titled "Efficient implementation of watermark and watermark detection algorithms for image and video using the graphics processing unit".
This implementation is based on the watermarking algorithms proposed by Irene G. Karybali and Kostas Berberidis in "Efficient Spatial Image Watermarking via New Perceptual Masking and Blind Detection Schemes". The theoretical framework and the mathematical proofs of robustness against attacks are detailed in the original paper. This repository provides a high-performance implementation designed for real-world environments, featuring GPU acceleration, support for images on disk, and native video container support via FFmpeg.
NOTE: This repository features a highly refactored and optimized version of the original Thesis implementation, with improved algorithms, execution times and features. The deprecated original Thesis code is in the archived repository old branch. The original Thesis code supported OpenCL and Eigen, while this implementation adds CUDA support.
This project implements and evaluates the performance (execution speed) of image watermarking algorithms on CPU versus GPU. It provides multiple implementations to enable comparisons between compute backends.
Watermarks are generated as matrices of standard normally distributed values (μ=0, σ=1). For cryptographic robustness, the user password is hashed with SHA-256 and the resulting 256-bit digest is used as the key for the ChaCha20 stream cipher. This CSPRNG ensures bit-exact, cross-platform determinism. The implementation is highly parallelized with OpenMP. The Box-Muller transform is used to produce the normal distribution.
Two watermark masks are used: the proposed Prediction Error mask, which is the main focus of the Thesis, and the NVF (Noise Visibility Function) mask for comparison purposes. The system supports both embedding and detection of watermarks in images on disk and in video streams. Video processing is handled via FFmpeg, enabling broad codec and container support, along with advanced features such as GPU-accelerated video decoding and encoding (CUDA only) and 10-bit/HDR (tonemapped) video support.
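The Box-Muller step can be sketched as follows. This is a minimal illustration under stated assumptions, not the repository's generator: `std::mt19937_64` stands in for the ChaCha20 keystream only to keep the example self-contained (it is not cryptographically secure), and the names `to_unit_interval` and `generate_watermark` are hypothetical.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Maps a 64-bit random word to a double in (0, 1], so std::log never sees 0.
inline double to_unit_interval(std::uint64_t x) {
    return (static_cast<double>(x >> 11) + 1.0) * (1.0 / 9007199254740992.0); // * 2^-53
}

// Hypothetical helper: returns `count` samples of N(0, 1) via Box-Muller.
// ASSUMPTION: std::mt19937_64 replaces the ChaCha20 keystream for brevity.
std::vector<float> generate_watermark(std::uint64_t seed, std::size_t count) {
    constexpr double kTwoPi = 6.283185307179586;
    std::mt19937_64 rng(seed);
    std::vector<float> w(count);
    for (std::size_t i = 0; i < count; i += 2) {
        const double u1 = to_unit_interval(rng());
        const double u2 = to_unit_interval(rng());
        const double r = std::sqrt(-2.0 * std::log(u1));
        // Each pair of uniform values yields two independent normal values.
        w[i] = static_cast<float>(r * std::cos(kTwoPi * u2));
        if (i + 1 < count) {
            w[i + 1] = static_cast<float>(r * std::sin(kTwoPi * u2));
        }
    }
    return w;
}
```

Calling `generate_watermark` twice with the same seed returns the same sequence, which is the property the password-derived key provides in the real generator.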
The repository contains all required source code and dependencies needed to reproduce the benchmarks and experiments.
- Comparative performance analysis between CPU and GPU implementations (check the benchmarks folder for more information). To ensure fair and reproducible comparisons across hardware (for example, discrete GPUs vs iGPUs), an adaptive time budget benchmarking algorithm is implemented. Unlike fixed loop benchmarks (which can be very slow on weak devices or undersaturate fast ones), the system dynamically adjusts the number of iterations based on the coefficient of variation of the execution times, combined with a fixed time budget constraint. This ensures that the reported FPS values are stable and reflect the device's throughput rather than driver latency or OS jitter.
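The idea of an adaptive, time-budgeted loop can be sketched as below. This is an illustrative outline only, not the benchmark's actual code: the names `coefficient_of_variation`, `BenchmarkResult` and `run_adaptive`, as well as the stopping parameters, are assumptions made for this example.

```cpp
#include <chrono>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Coefficient of variation (sample standard deviation / mean) of the timings.
double coefficient_of_variation(const std::vector<double>& samples) {
    if (samples.size() < 2) return std::numeric_limits<double>::infinity();
    double mean = 0.0;
    for (double s : samples) mean += s;
    mean /= static_cast<double>(samples.size());
    if (mean <= 0.0) return std::numeric_limits<double>::infinity();
    double var = 0.0;
    for (double s : samples) var += (s - mean) * (s - mean);
    var /= static_cast<double>(samples.size() - 1);
    return std::sqrt(var) / mean;
}

struct BenchmarkResult {
    double mean_seconds;     // average time of one iteration
    std::size_t iterations;  // how many iterations were actually run
};

// Runs `work` until the timings are stable (CV below the threshold after a
// minimum number of iterations) or the time budget is exhausted.
template <typename Work>
BenchmarkResult run_adaptive(Work&& work, double budget_seconds,
                             double cv_threshold, std::size_t min_iterations) {
    using clock = std::chrono::steady_clock;
    std::vector<double> samples;
    const auto start = clock::now();
    for (;;) {
        const auto t0 = clock::now();
        work();
        const auto t1 = clock::now();
        samples.push_back(std::chrono::duration<double>(t1 - t0).count());
        const double elapsed = std::chrono::duration<double>(t1 - start).count();
        const bool stable = samples.size() >= min_iterations &&
                            coefficient_of_variation(samples) < cv_threshold;
        if (stable || elapsed >= budget_seconds) break;
    }
    double total = 0.0;
    for (double s : samples) total += s;
    return {total / static_cast<double>(samples.size()), samples.size()};
}
```

A throughput figure then follows as `1.0 / result.mean_seconds`. A fast device stops early once its timings settle, while a slow device is cut off by the budget instead of running a fixed number of loops.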
Implementations are optimized for maximum performance:
- CPU implementation: Uses the `Eigen` library for linear algebra operations, combined with efficient use of `OpenMP` multithreading (reductions, parallel loops). The application utilizes all available logical (or physical, specifically for video embedding) CPU cores for maximum performance. The project is configured to use the `clang` compiler (clang-cl toolset) instead of MSVC, because it optimizes the heavily templated Eigen code much better.
- GPU implementation: Provides both OpenCL and CUDA backends. Specifically for CUDA, we use warp shuffle techniques, CUB, Tensor Cores and Grid-Stride reduction loops to improve performance wherever applicable. All CUDA/OpenCL core kernels are 100% custom-built for maximum hardware utilization. To optimize VRAM usage, custom (but simple) memory pools are implemented for both backends.
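As an illustration of the OpenMP reduction pattern mentioned for the CPU backend, the sketch below computes a normalized correlation between two buffers. It is a generic example and not the detector from the paper or from this repository; the function name `normalized_correlation` is hypothetical. Without OpenMP enabled at compile time the pragma is ignored and the loop runs serially.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Generic normalized correlation: dot(a, b) / (|a| * |b|).
// The three running sums are combined across threads by the reduction clause.
double normalized_correlation(const std::vector<float>& a,
                              const std::vector<float>& b) {
    // Signed loop index keeps the loop valid for older OpenMP versions.
    const std::ptrdiff_t n =
        static_cast<std::ptrdiff_t>(std::min(a.size(), b.size()));
    double dot = 0.0, norm_a = 0.0, norm_b = 0.0;
    #pragma omp parallel for reduction(+ : dot, norm_a, norm_b)
    for (std::ptrdiff_t i = 0; i < n; ++i) {
        const double x = a[static_cast<std::size_t>(i)];
        const double y = b[static_cast<std::size_t>(i)];
        dot += x * y;
        norm_a += x * x;
        norm_b += y * y;
    }
    const double denom = std::sqrt(norm_a * norm_b);
    return denom > 0.0 ? dot / denom : 0.0;
}
```

Each thread accumulates private partial sums, so no locking is needed inside the loop; the partial sums are added together once at the end of the parallel region.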
Get the latest binaries here for Eigen, OpenCL or CUDA platform. The binaries contain:
- The CLI (command line) application and a sample config file (settings.ini).
- The embedded CUDA/OpenCL/Eigen implementations of the watermarking algorithms.
- The benchmark application which benchmarks each backend.
- Some sample image and video files
The CLI application:
- Embeds or detects the watermark using the NVF and the proposed Prediction-Error mask for images and videos.
- For image mode only: Supports batched operation: It can embed or detect the watermark for all images under a specified folder. It is highly parallelized for both operations to reduce disk I/O latency.
- Prints FPS/execution time for both operations, and both masks.
The Benchmark application:
- Embeds the watermark using the proposed Prediction-Error mask for a predefined set of images and shows the watermarked result on the fly in a window.
- It uses various values of p (window size) and PSNR to test a broad set of combinations.
- Does not need to be parameterized; it is a standalone application used for image benchmarking only. It does not write to disk (apart from temporary files) and is not used for video.
- Calculates a Total Score using the geometric mean of the two pipelines, scaled by a constant ($C=10$) for readability.
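Read literally, and assuming the constant is applied as a multiplicative factor (the exact form is not stated here), the score can be written as below. The symbols $S_1$ and $S_2$ are introduced only for illustration and stand for the results of the two pipelines.

$$
\text{Total Score} = C \cdot \sqrt{S_1 \cdot S_2}, \qquad C = 10
$$

The geometric mean rewards balanced results: a backend that is fast in one pipeline but slow in the other scores lower than one with the same arithmetic average spread evenly.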
NOTE:
- For video and image batched operations only the proposed mask is used, which is more optimal. The NVF based watermarked image is only saved for single images (explained below) and not in batched mode.
- CPU implementation is built with AVX2 support: `-mavx2 -mfma`. To enable AVX-512 (on CPUs that support it), replace these flags with `-march=native`. The performance gains are negligible, so we use AVX2 by default for much broader compatibility.
The CLI application should be parameterized from the corresponding settings.ini file. Here is a detailed explanation for each parameter:
| Parameter | Description |
|---|---|
| [image]/mode | [single, batch_embed, batch_detect]: (Image mode only) Sets the image mode. If single, the application reads the image file specified at [image]/path, embeds/detects the watermark and prints the results. If batch_embed or batch_detect, it reads the directory specified at [image]/path and either embeds the watermark in all the image files it finds, writing them to a new folder called watermark_output inside the specified folder, or tries to detect the watermark and prints the correlation values. |
| [image]/path | Path to the input image (or directory for batched operations) to embed/detect watermark. This will set the sample application to image mode |
| watermark_password | The watermark password. Used to generate a deterministic and secure (as much as possible) watermark. |
| save_to_disk | [true/false]: (Image mode only) Set to true to save the watermarked NVF and Prediction-Error files to disk, works only if mode is single. |
| display_fps | [true/false]: Set to true to display execution times in FPS. Else, it will display execution time in seconds. |
| p | Window size for masking algorithms. All implementations support values of p=3,5,7 and 9. |
| psnr | PSNR (Peak Signal-to-Noise Ratio). Higher values correspond to less watermark in the image, reducing noise, but making detection harder. |
| benchmark_loops | (Image mode only) Loops the algorithms many times, simulating more work. A value of 100~1000 produces consistent execution times. Works only if mode is single. |
| opencl_device_id | [OpenCL only / Number]: Works only for OpenCL binary. If multiple OpenCL devices are found, then set this to the desired device. Set it to 0 if one device is found. |
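To make the `psnr` parameter concrete, the sketch below uses the standard PSNR definition for 8-bit images (peak value 255). It is not taken from the repository; `psnr_db` and `target_mse` are hypothetical helper names, and treating the peak as 255 is an assumption.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// PSNR in dB between two 8-bit images of equal size:
// PSNR = 10 * log10(255^2 / MSE).
double psnr_db(const std::vector<std::uint8_t>& original,
               const std::vector<std::uint8_t>& watermarked) {
    if (original.empty() || original.size() != watermarked.size()) {
        return std::numeric_limits<double>::quiet_NaN();
    }
    double mse = 0.0;
    for (std::size_t i = 0; i < original.size(); ++i) {
        const double d = static_cast<double>(original[i]) -
                         static_cast<double>(watermarked[i]);
        mse += d * d;
    }
    mse /= static_cast<double>(original.size());
    if (mse == 0.0) return std::numeric_limits<double>::infinity();
    return 10.0 * std::log10(255.0 * 255.0 / mse);
}

// Mean squared error that corresponds to a requested PSNR value.
double target_mse(double psnr) {
    return 255.0 * 255.0 / std::pow(10.0, psnr / 10.0);
}
```

For example, a target of 40 dB allows a mean squared error of about 6.5, while 45 dB allows only about 2.06, which is why higher values leave less watermark energy in the image.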
Video-only settings:
| Parameter | Description |
|---|---|
| mode | [embed/detect]: Sets the video mode. Both options read [video]/path as the input video and either embed the watermark (encoding via FFmpeg) or try to detect the watermark. |
| [video]/path | Path to the video file, if we want to embed or detect the watermark for a video. This will set the sample application to video mode and will read the video-only settings that are described in this section plus the common settings (watermark_password, display_fps, p, psnr and opencl_device_id). |
| watermark_interval | [Number]: Embed or try to detect the watermark every watermark_interval frames. If set to 1 when embedding, the watermark will be embedded in all frames, which degrades video quality. If the current frame index is not divisible by this parameter, then for embedding the frame is passed to the encoder as-is (no watermark), and for detection the frame is decoded and skipped. |
| cuda_hw_decoder | [CUDA only]: Offload decoding to the GPU using NVDEC. This is much more effective on HEVC or AV1 videos (especially 4K and above) and tasks like watermark detection, as software decoders are generally fast for lower resolutions and less complex codecs such as H.264. Valid options are hevc_cuvid, h264_cuvid and av1_cuvid. Other decoders may be available, like vp9_cuvid, vc1_cuvid or mjpeg_cuvid. If HW decoders aren't available, the application will automatically fall back to CPU decoding. |
| cuda_hw_encoder | [CUDA only: true/false]: Offload encoding to the GPU using NVENC. This makes more sense when combined with NVDEC, but it is not necessary. If set, the encode_codec_options setting is ignored, and valid NVENC codec options must be provided in the hw_encode_options setting. |
| encode_output_path | Set this value to a file path, in order to embed watermark on the video from [video]/path parameter and save the watermarked file to disk. This will set the sample application to video embedding mode. If you want to detect the watermark from the video parameter then comment this line, effectively setting the sample application to video detect mode. |
| encode_codec_options | These are FFmpeg options for encoding only. They configure the codec library and its options. Example: -c:v libx265 -preset fast -crf 23 will pass these encoding options to FFmpeg. |
| hw_encode_options | These are FFmpeg options for encoding with NVENC. Only used when cuda_hw_encoder is true, in which case they override the encode_codec_options option. Example: -c:v hevc_nvenc -preset p6 -tune hq -cq 26 -b:v 0 is the NVENC equivalent of the sample used for the CPU encoder. NOTE: Encoding and decoding are separate: we can decode with the CPU and encode with NVENC (and vice versa), and of course we can do both! |
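The `watermark_interval` rule from the table above can be sketched as a small decision function. This is only an illustration: `FrameAction` and `action_for_frame` are hypothetical names, and a 0-based frame counter is assumed.

```cpp
#include <cstdint>

// What to do with a frame: run the watermark algorithm on it, or let it pass
// (as-is to the encoder when embedding, skipped when detecting).
enum class FrameAction { Process, PassThrough };

// ASSUMPTION: frame_index starts at 0. An interval of 0 or 1 processes every frame.
inline FrameAction action_for_frame(std::uint64_t frame_index,
                                    std::uint32_t watermark_interval) {
    if (watermark_interval <= 1) return FrameAction::Process;
    return frame_index % watermark_interval == 0 ? FrameAction::Process
                                                 : FrameAction::PassThrough;
}
```

With an interval of 30, only frames 0, 30, 60 and so on are processed, so most frames reach the encoder untouched and the visible impact on the video is reduced.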
The following FFmpeg command is used to encode a new video while preserving the original input's metadata, subtitles, and audio tracks. It decodes the input video, embeds the watermark, and passes the resulting frames into standard input (stdin) for encoding, while copying audio/subtitles from the original input file as is. You can customize video codec encoding settings (codec, CRF, presets, etc) via the encode_codec_options option as described above.
    ffmpeg -y -f rawvideo
    -pix_fmt <fmt>
    -s <width>x<height>
    -r <frame_rate>
    -i -
    -i <input_video_file>
    <encoder_options>
    -c:s copy -c:a copy
    -map 1:s? -map 0:v -map 1:a?
    -max_interleave_delta 0
    -vf "<rotation>" (OPTIONAL)
    -color_range:v:0 <range>
    <output_file>

- `-f rawvideo -pix_fmt <fmt>`: Specifies the raw pixel format, either `yuv420p` or `yuvj420p` (limited or full range, extracted from the input).
- `-s <width>x<height>`: Specifies the frame size (extracted from the input).
- `-r <frame_rate>`: Frame rate of the video (extracted from the input).
- `-i -`: Accepts raw video from stdin.
- `-i <input_video_file>`: USER SUPPLIED: Original input file.
- `<encoder_options>`: USER SUPPLIED: Encoder options such as codec, preset, and quality options. If CUDA NVENC is requested, the `hw_encode_options` parameter is read from the settings file; otherwise `encode_codec_options` is read.
- `-c:s copy -c:a copy`: Copies subtitle and audio streams without re-encoding.
- `-map 1:s? -map 0:v -map 1:a?`: Maps subtitles/audio from the original input, and video from stdin.
- `-max_interleave_delta 0`: Reduces potential interleaving delay issues.
- `-vf "<rotation>"`: Filter applied to rotate the output video (optional, may not be set, extracted from the input).
- `-color_range:v:0 <range>`: Sets the output color range metadata to help video players (a value of "tv" or "pc" is supplied, extracted from the input).
- `<output_file>`: USER SUPPLIED: Output file path for the final video.
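The way the command above is put together from probed and user-supplied values can be sketched as follows. This is an illustrative helper, not the repository's code: `EncodeParams` and `build_ffmpeg_command` are hypothetical names, and quoting is simplified (paths containing double quotes are not handled).

```cpp
#include <string>

// Values marked "probed" are extracted from the input video; the rest are
// supplied by the user through the settings file.
struct EncodeParams {
    std::string pix_fmt;          // probed: yuv420p or yuvj420p
    int width = 0;                // probed
    int height = 0;               // probed
    std::string frame_rate;       // probed, e.g. "30000/1001"
    std::string input_file;       // user supplied
    std::string encoder_options;  // encode_codec_options or hw_encode_options
    std::string rotation_filter;  // probed, empty when no rotation is needed
    std::string color_range;      // probed: "tv" or "pc"
    std::string output_file;      // user supplied
};

// Assembles the FFmpeg command line in the same order as the template above.
std::string build_ffmpeg_command(const EncodeParams& p) {
    std::string cmd = "ffmpeg -y -f rawvideo";
    cmd += " -pix_fmt " + p.pix_fmt;
    cmd += " -s " + std::to_string(p.width) + "x" + std::to_string(p.height);
    cmd += " -r " + p.frame_rate;
    cmd += " -i -";                          // raw frames arrive on stdin
    cmd += " -i \"" + p.input_file + "\"";   // source of audio and subtitles
    cmd += " " + p.encoder_options;
    cmd += " -c:s copy -c:a copy";
    cmd += " -map 1:s? -map 0:v -map 1:a?";
    cmd += " -max_interleave_delta 0";
    if (!p.rotation_filter.empty()) {
        cmd += " -vf \"" + p.rotation_filter + "\"";
    }
    cmd += " -color_range:v:0 " + p.color_range;
    cmd += " \"" + p.output_file + "\"";
    return cmd;
}
```

The resulting string would typically be used to start FFmpeg as a child process with a writable stdin pipe, into which the watermarked raw frames are written one after another.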
NOTES:
- Only Constant Frame Rate (CFR) works as expected for an input video. If the input video is Variable Frame Rate (VFR) there may be issues with audio/subtitles sync on the output file.
- 10-bit video support is experimental: 10-bit non-HDR is fully supported. HDR 10-bit is tonemapped (mobius) to SDR on the CPU; hardware-accelerated tonemapping is not yet supported by FFmpeg. Encoding is always 8-bit.
This project is built using Visual Studio and consists of a solution with various projects.
- Watermarking-Impl: The Core of this project, implements the algorithms for each backend. It also implements a fast, efficient, secure and deterministic watermark generation with OpenMP (CPU-only based). It is built as a static library.
- Watermarking-CLI: The sample command line application that interacts with the Core project to embed and detect watermark in images and video.
- Watermarking-BenchUI: The benchmarking project. It interacts with the Core project and benchmarks the performance of image watermarking. It uses Qt for UI.
- Watermarking-Util: Common utility methods without dependencies, that may be used by any project. It is built as a static library.
- Watermarking-Tests: Basic tests for the Core project.
The solution provides multiple build configurations, each targeting a specific backend:
- Git must be installed, and Git LFS is required to download the large binary library dependencies. Install it with `git lfs install`.
- Clone this repository: `git clone https://github.com/kar-dim/Watermarking-Accelerated`.
- Open the `.sln` file in Visual Studio 2022 (or later).
- In the Solution Configurations dropdown (top toolbar), select your configuration (e.g. `CUDA_Release`), or select `Batch Build` and choose the configurations you want to build.
- Build the solution via Build > Build Solution.
We bundle all necessary DLLs with the prebuilt binaries so the application runs out-of-the-box.
| Backend | Dependencies |
|---|---|
| All | FFmpeg (all libav*.dll), zlib1.dll, libpng16.dll, jpeg62.dll, tiff.dll, libwebp.lib (static lib) |
| CUDA | cudart_static.lib, cuda.lib (from CUDA toolkit) |
| OpenCL | OpenCL.lib |
| Eigen | libomp.dll (clang's OpenMP) |
NOTES:
- OpenCL implementation: The OpenCL Headers, OpenCL C++ Bindings and OpenCL Library file are already included and configured for this project.
- CUDA implementation: The NVIDIA CUDA Toolkit is required for building. The minimum supported GPUs are those with Compute Capability 7.0 (sm_75) or newer; CUDA Toolkit 12.4 or newer is preferred.
- Image libraries (libjpeg, libpng, libtiff etc) are included and utilized internally by CImg for loading and saving of images for all backends.
- FFmpeg must exist on the system PATH (prebuilt binaries already include the FFmpeg binaries and DLLs).
- Eigen: A C++ template library for linear algebra.
- FFmpeg: A complete, cross-platform solution to record, convert and stream audio and video.
- CImg: A C++ library for image processing.
- inih: A lightweight C++ library for parsing .ini configuration files.
- cub: A lower-level CUDA library designed for speed-of-light parallel algorithms. Used for device-wide, block-wide, and warp-wide reductions.
- Intel VTune Profiler and AMD uProf: Used to profile CPU performance.
- NVIDIA Nsight Systems and NVIDIA Nsight Compute: Used to profile overall system-wide CUDA performance, and to individually profile specific CUDA kernels with detailed performance metrics.
Below we include some comparisons of the original image (left) versus the final watermarked images based on the NVF mask (middle) and the proposed Prediction error mask (right). Images are zoomed for comparison purposes.
Resolution: 512x152, p=5, PSNR=40dB
Resolution: 1280x720, p=5, PSNR=45dB
Resolution: 3840x2160, p=5, PSNR=40dB
This section includes performance comparisons between the three backends: CPU (Eigen), CUDA, and OpenCL. The benchmarks measure the throughput (in Frames Per Second) of the watermarking algorithm across various resolutions (480p to 4K) and window sizes (p=3,5,7,9). Two kinds of benchmarks are run:
- The first set is generated by running the CLI in `single image` mode with `1000` loops to ensure stability, for each image in the `samples` directory (in Releases). It is run on a machine with the following hardware specs:
  - CPU: AMD Ryzen 7 7800X3D (8-Core)
  - GPU: NVIDIA RTX 4070 SUPER (12 GB VRAM)
  - RAM: 32 GB DDR5 @ 6000 MHz (2x16GB)
- The second set uses the dedicated Benchmark GUI application. While currently tested on a limited selection of hardware, we aim to expand this list significantly. Community submissions are of course welcome!
[Figures: p = 3, p = 5, p = 7, p = 9]
[Figures: CUDA, OpenCL, CPU/Eigen]
[Figures: Benchmark screen, Result screen]