0xh3xa/awesome-malware-benign-datasets

Awesome Malware Benign Datasets

A curated collection of high-quality malware and benign datasets for cybersecurity and AI security research, machine learning, and malware analysis.

Windows Datasets

| Dataset | Year | Size / Samples | Labels | Format | Description | Access | Link |
|---|---|---|---|---|---|---|---|
| VirusShare | 2010–Present | Millions | Malware families | Binary | Large malware binary archive (requires access by request) | Required | Link |
| Malimg | 2011 | ~9,458 images | Malware families | Image | Grayscale images for malware classification | Public | Link |
| Android Malware Genome | 2012 | 1,260 malware | Malware Families | APK | Historic dataset of early Android malware | Public | Link |
| Virus-MNIST | 2017 | ~10,000 images | Malware | Image | Dataset for malware detection using image-based methods | Public | Link |
| CICAndMal2017 | 2017 | 10,854 samples | Adware, Ransomware, Scareware, SMS Malware | APK, PCAP, CSV | Real-device collected malware samples with network and behavior data | Public | Link |
| CIC-AAGM2017 | 2017 | 1,900 apps | Adware, General Malware, Benign | .pcap, .csv | Real-device collected network traffic from Android adware and general malware apps | Public | Link |
| Microsoft Malware Prediction | 2019 | ~8M rows | Binary labels | CSV | Contains Windows system telemetry data for predicting malicious or benign files based on system behavior. | Public | Link |
| Malware Bazaar | 2019–Present | 10M+ samples | Malware | Binary | Community malware sample exchange | Public | Link |
| VxHeaven | 2019 | 595 to 2955 files | Malware/Benign | CSV | Static and dynamic features extracted from VxHeaven and VirusTotal datasets, with 1087 features for classification | Public | Link |
| SOREL-20M | 2020 | ~20M samples (8TB) | Malicious/Benign | Binary/Features | Large scale benchmark dataset for malicious PE detection, including malware samples, feature vectors, and models. | Public | Link |
| DikeDataset | 2020 | (PE binaries) | Malware/Benign | Binary | PE binaries | Public | Link |
| Dumpware 10 | 2020 | ~4,294 images | Malware/Benign | Image (RGB) | Malware images | Public | Link |
| MalMem-2022 | 2022 | 29,298 benign, 29,298 malicious | Malware/Benign | Binary | Memory analysis dataset for obfuscated malware detection using memory dumps | Public | Link |
| MalRadar | 2022 | 4,534 malware samples | Malware | Various | A growing Android malware dataset, manually verified, containing 4,534 samples across 121 families | Restricted | Link |
| CIC-Evasive-PDFMal2022 | 2022 | 10,025 records | Malicious/Benign | CSV, PDF | A dataset with 5,557 malicious and 4,468 benign PDF records that attempt to evade common detection techniques. | Public | Link |
| Microsoft BIG 2015 | 2015 | ~20K | Malware types | Binary | PE malware binaries | Public | Link |
| EMBER2017-2018 | 2018 | ~3.2M files | Malware/Benign | Features/metadata | Large public benchmark for malware classifiers | Public | Link |
| BODMAS | 2021 | 57,293 malware, 77,142 benign | Malware/Benign | Binary | Blue Hexagon dataset with malware samples and family info | Required | Link |
| EMBER2024 (New Benchmark) | 2025 | ~3.2M files | Malware/Benign | Features/metadata | Large public benchmark for malware classifiers | Public | Link |
| Android-Malware-2023 (AIM-2023) | 2023 | 250K apps | Malware/Benign | APK, CSV | New Android malware + benign apps with detailed metadata | Public | Link |
| AndroZoo (2022+) | 2022 | 25M+ samples | Malware/Benign | APK | The largest Android dataset; malware + benign apps | Restricted | Link |
| Kronodroid | N/A | 70,000+ samples | Malware/Benign | CSV, APK | A dataset designed to study concept drift and cross-device detection issues, with 289 dynamic and 200 static features | Public | Link |
| ContagioDump | N/A | N/A | Malware | Binary | Collection of malware samples for research | Public | Link |
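
Several entries above (Malimg, Virus-MNIST, Dumpware 10) distribute samples as images rather than binaries. As a rough illustration of what that format means, the sketch below reshapes a byte string into rows of 0-255 pixel intensities. The function name `bytes_to_grayscale`, the fixed `width`, and the zero padding are assumptions made for this example; each dataset documents its own conversion procedure, which may differ.

```python
def bytes_to_grayscale(data: bytes, width: int = 256) -> list[list[int]]:
    """Reshape raw bytes into rows of pixel intensities (0-255).

    Hypothetical helper for illustration only; not taken from any
    of the listed datasets.
    """
    if width <= 0:
        raise ValueError("width must be positive")
    # Pad the final row with zero bytes so every row has the same length.
    padding = (-len(data)) % width
    padded = data + bytes(padding)
    return [list(padded[i:i + width]) for i in range(0, len(padded), width)]


# Harmless stand-in bytes; a real pipeline would read a file's contents.
sample = bytes(range(10))
image = bytes_to_grayscale(sample, width=4)
for row in image:
    print(row)
```

Each inner list is one image row, so the result can be handed to an imaging or array library of your choice for saving or model input.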

Android Datasets

| Dataset | Year | Size / Samples | Labels | Format | Description | Access | Link |
|---|---|---|---|---|---|---|---|
| Android Malware Genome | 2012 | 1,260 malware | Malware Families | APK | Historic dataset of early Android malware | Public | Link |
| Drebin | 2014 | 5,560 malware | Malware | APK, Features | One of the most famous Android malware datasets | Public | Link |
| CICAndMal2017 | 2017 | 10,854 samples | Adware, Ransomware, Scareware, SMS Malware | APK, PCAP, CSV | Real-device collected malware samples with network and behavior data | Public | Link |
| CIC-AAGM2017 | 2017 | 1,900 apps | Adware, General Malware, Benign | .pcap, .csv | Real-device collected network traffic from Android adware and general malware apps | Public | Link |
| PRAGuard Android Dataset | 2017 | 25K apps | Malware/Benign | APK, CFG | Focuses on obfuscation + packed apps | Public | Link |
| Kronodroid | N/A | 70,000+ samples | Malware/Benign | CSV, APK | A dataset designed to study concept drift and cross-device detection issues, with 289 dynamic and 200 static features | Public | Link |
| CICMalDroid-2020 | 2020 | 17,341 samples | Adware, Banking, SMS, Riskware, Benign | APK, CSV | Comprehensive Android malware dataset with dynamic and static features | Public | Link |
| MalNet | 2021 | 1,262,024 images/graphs | Malware/Benign | Image/Graph | Large-scale dataset of Android malware with function call graphs and images | Public | Link |
| MalRadar | 2022 | 4,534 malware samples | Malware | Various | A growing Android malware dataset, manually verified, containing 4,534 samples across 121 families | Restricted | Link |
| CIC-Evasive-PDFMal2022 | 2022 | 10,025 records | Malicious/Benign | CSV, PDF | A dataset with 5,557 malicious and 4,468 benign PDF records that attempt to evade common detection techniques. | Public | Link |
| AMD 2.0 | 2022 | 150K malware | Malware Families | APK, JSON | Updated AMD with modern Android malware families | Public | Link |
| AndroZoo (2022+) | 2022 | 25M+ samples | Malware/Benign | APK | The largest Android dataset; malware + benign apps | Restricted | Link |
| Android-Malware-2023 (AIM-2023) | 2023 | 250K apps | Malware/Benign | APK, CSV | New Android malware + benign apps with detailed metadata | Public | Link |

Document Datasets

| Dataset | Year | Size / Samples | Labels | Format | Description | Access | Link |
|---|---|---|---|---|---|---|---|
| Malicious PDF Generator | 2020 | 10 PDFs (generated) | Malicious | PDF | Generates 10 different malicious PDFs for penetration testing with phone-home functionality. | Public | Link |
| Dike | 2020 | 1,871 documents | Malware/Benign | doc, docx, docm, xls, xlsx, xlsm, ppt, pptx, pptm | A dataset containing various document formats (doc, xls, ppt) for malware detection. | Public | Link |
| CIC-Evasive-PDFMal2022 | 2022 | 10,025 records | Malicious/Benign | CSV, PDF | A dataset with 5,557 malicious and 4,468 benign PDF records that attempt to evade common detection techniques. | Public | Link |
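
A few entries in this list quote per-class counts, which gives a quick sense of label balance before any training is attempted. The sketch below computes class shares from the counts quoted here for MalMem-2022, BODMAS, and CIC-Evasive-PDFMal2022. The helper name `class_balance` and the dictionary layout are assumptions for illustration; the figures are copied from this list and should be checked against each dataset's own documentation.

```python
def class_balance(counts: dict[str, int]) -> dict[str, float]:
    """Return each label's share of the total sample count.

    Hypothetical helper for illustration only.
    """
    total = sum(counts.values())
    if total == 0:
        raise ValueError("counts must contain at least one sample")
    return {label: n / total for label, n in counts.items()}


# Per-class counts as quoted in this list.
datasets = {
    "MalMem-2022": {"benign": 29298, "malicious": 29298},
    "BODMAS": {"malware": 57293, "benign": 77142},
    "CIC-Evasive-PDFMal2022": {"malicious": 5557, "benign": 4468},
}

for name, counts in datasets.items():
    shares = class_balance(counts)
    summary = ", ".join(f"{label}: {share:.1%}" for label, share in shares.items())
    print(f"{name} ({sum(counts.values()):,} samples): {summary}")
```

A noticeably skewed share is a hint to consider stratified splits or class weighting when building a classifier on that dataset.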

Contributing

Contributions are always welcome 🤝

Thank you for helping make this project better! Please review the Contribution Guidelines.

License

This repository is licensed under a Creative Commons Attribution 4.0 International License.

Topic: Malware Dataset