Malware Scanner Service

This repository contains the code to build a pipeline that scans objects uploaded to GCS for malware, moving the documents to a clean or quarantined bucket depending on the malware scan status.

It illustrates how to use Cloud Run and Eventarc to build such a pipeline.

This fork (infusionsoft/docker-clamav-malware-scanner) extends the upstream project (GoogleCloudPlatform/docker-clamav-malware-scanner) with periodic rescans of objects already stored in the clean bucket (see Periodic clean-bucket rescan below). Repository ownership is described in CODEOWNERS.

Architecture diagram

How to use this example

Use the tutorial to understand how to configure your Google Cloud Platform project to use Cloud Run and Eventarc.

Using Environment variables in the configuration

The tutorial above uses a configuration file config.json built into the Docker container for the configuration of the unscanned, clean, quarantined and CVD updater cloud storage buckets.

Environment variables can be used to vary the deployment in 2 ways:

Expansion of environment variables

Any environment variables specified using shell-format within the config.json file will be expanded using envsubst.
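As a rough illustration of what this expansion does, the sketch below mimics envsubst-style substitution in JavaScript. It is not the service's code, which relies on the envsubst tool. The helper name expandShellVars and the variable UNSCANNED_BUCKET are hypothetical and used only for this example.

```javascript
// Hypothetical helper: approximates envsubst for $VAR and ${VAR} references.
// Variables that are not set expand to an empty string.
function expandShellVars(text, env = process.env) {
  const pattern = /\$\{([A-Za-z_][A-Za-z0-9_]*)\}|\$([A-Za-z_][A-Za-z0-9_]*)/g;
  return text.replace(pattern, (match, braced, bare) => env[braced ?? bare] ?? "");
}

// A config.json fragment that refers to an environment variable.
const template = '{"buckets": [{"unscanned": "${UNSCANNED_BUCKET}"}]}';
const expanded = expandShellVars(template, { UNSCANNED_BUCKET: "my-unscanned" });
console.log(expanded);
```

Because the substitution happens on the raw text, the expanded value must still leave the file as valid JSON.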

Passing entire configuration as environment variable

An alternative to building the configuration file into the container is to supply the configuration through an environment variable, so that multiple deployments can share the same container and configuration updates do not require a container rebuild.

To do this, set the environment variable CONFIG_JSON to the JSON configuration. It overrides any configuration in the config.json file.
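The precedence described here can be sketched as follows. This is a minimal illustration under the assumption that a non-empty CONFIG_JSON fully replaces the file contents. The function name resolveConfig is hypothetical and is not taken from the service's source.

```javascript
// Hypothetical resolver: CONFIG_JSON, when set and non-empty, wins over the
// contents of config.json (passed in here as a string).
function resolveConfig(env, fileContents) {
  const raw = env.CONFIG_JSON ? env.CONFIG_JSON : fileContents;
  return JSON.parse(raw);
}

const fromFile = '{"buckets": [], "ClamCvdMirrorBucket": "bucket-from-file"}';
const fromEnv = '{"buckets": [], "ClamCvdMirrorBucket": "bucket-from-env"}';

const overridden = resolveConfig({ CONFIG_JSON: fromEnv }, fromFile);
const fallback = resolveConfig({}, fromFile);
console.log(overridden.ClamCvdMirrorBucket, fallback.ClamCvdMirrorBucket);
```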

If you deploy with the gcloud run deploy command, set this environment variable using the --env-vars-file argument, which specifies a YAML file containing the environment variable definitions. This is necessary because the commas in the JSON would break the parsing of --set-env-vars.

Take care when embedding JSON in YAML: the literal block scalar style (using |) is recommended, because it preserves newlines and quotes.

For example, the CONFIG_JSON environment variable could be defined in a file config-env.yaml as follows:

CONFIG_JSON: |
  {
    "buckets": [
      {
        "unscanned": "unscanned-bucket-name",
        "clean": "clean-bucket-name",
        "quarantined": "quarantined-bucket-name"
      }
    ],
    "ClamCvdMirrorBucket": "cvd-mirror-bucket-name",
    "cleanRescanIntervalDays": 30,
    "cleanRescanMaxObjectsPerRun": 100,
    "cleanRescanMetadataKey": "last-clamav-scan",
    "fileExclusionPatterns": [],
    "ignoreZeroLengthFiles": false
  }

An example command line that uses this file to specify the environment variables:

gcloud beta run deploy "${SERVICE_NAME}" \
  --source . \
  --region "${REGION}" \
  --no-allow-unauthenticated \
  --memory 4Gi \
  --cpu 1 \
  --concurrency 20 \
  --min-instances 1 \
  --max-instances 5 \
  --no-cpu-throttling \
  --cpu-boost \
  --service-account="${SERVICE_ACCOUNT}" \
  --env-vars-file=config-env.yaml

If you are using Terraform to deploy, then the equivalent way to specify the environment variable using the google_cloud_run_v2_service resource is by using the env block and jsonencode:

resource "google_cloud_run_v2_service" "malware-scanner" {
  name = "malware-scanner"
  // other service parameters...
  template {
    // other template parameters...
    containers {
      // other container parameters...
      env {
        name = "CONFIG_JSON"
        value = jsonencode({
          buckets = [
            {
              unscanned   = "unscanned-bucket-name",
              clean       = "clean-bucket-name",
              quarantined = "quarantined-bucket-name"
            }
          ]
          ClamCvdMirrorBucket = "cvd-mirror-bucket-name",
          cleanRescanIntervalDays     = 30,
          cleanRescanMaxObjectsPerRun = 100,
          cleanRescanMetadataKey      = "last-clamav-scan",
          fileExclusionPatterns       = [],
          ignoreZeroLengthFiles       = false
        })
      }
    }
  }
}

Notes on fileExclusionPatterns

The fileExclusionPatterns array in the config file can be used to ignore any uploaded files matching a Regular Expression.

This can be used for example if you have an upload system that creates temporary files, then renames them once the files are fully uploaded.

The elements in the fileExclusionPatterns array can either be simple strings, for example:

"fileExclusionPatterns": [
  "\\.tmp$",
  "^ignore_me.*\\.txt$"
]

or they can be an array of 2 string values, allowing regular expression flags to be specified, for example "i" for case-insensitive matches:

"fileExclusionPatterns": [
  [ "\\.tmp$", "i" ],
  [ "tempfile.*\\.upload$", "i" ]
]

Files matching these patterns are ignored by the scanner and left in the unscanned bucket, and the ignored-files counter is incremented.

Helpful tools for regular expressions include the Regular Expression Cheatsheet, and the Regex101 playground (ensure ECMAScript flavor is selected).

Note that when adding regular expressions into the config file, care must be taken with \ and " characters -- any of these characters in the regular expression must be escaped with another \.
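To make the two entry forms and the escaping rule concrete, here is a small sketch of how such a list could be turned into regular expressions. The helper names compileExclusionPatterns and isExcluded are hypothetical, and the service's own implementation may differ.

```javascript
// Hypothetical helper: each entry is either "pattern" or ["pattern", "flags"].
function compileExclusionPatterns(entries) {
  return entries.map((entry) =>
    Array.isArray(entry) ? new RegExp(entry[0], entry[1]) : new RegExp(entry)
  );
}

// String.raw keeps the backslashes exactly as they would appear in config.json.
// JSON.parse then turns the escaped "\\.tmp$" into the regex source \.tmp$
const parsed = JSON.parse(String.raw`{
  "fileExclusionPatterns": [
    "^ignore_me.*\\.txt$",
    ["\\.tmp$", "i"]
  ]
}`);

const patterns = compileExclusionPatterns(parsed.fileExclusionPatterns);
const isExcluded = (name) => patterns.some((re) => re.test(name));

console.log(isExcluded("upload.TMP"));      // case-insensitive pattern applies
console.log(isExcluded("ignore_me_1.txt")); // plain string pattern applies
console.log(isExcluded("report.pdf"));      // no pattern applies
```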

Notes on quarantine config block

The quarantine block in the config controls automatic quarantine of files in which no malware was detected, based on the following settings:

quarantine.encryptedFiles

(Default: true)

This enables ClamAV's encrypted file detection (AlertEncrypted setting), which treats encrypted archive or document files as containing malware. It is enabled by default because the contents of encrypted files cannot be scanned for malware.

Encrypted files are quarantined as if they were malware, with a log line indicating that they were infected with Heuristics.Encrypted.xxx.

quarantine.fileExtensionAllowList

(Default: [] -- allow all.)

A list of allowed file extensions. Files that do not have one of these extensions are quarantined as if they were malware, with a log line indicating that they were infected with Config.AllowList.Blocked.

If no extensions are specified, all files are allowed.

File extensions are case-insensitive. An empty string ("") matches files with no extension. Double extensions (e.g. "tar.gz") are supported.

Example: [ "doc", "pdf", "jpg" ]

quarantine.fileExtensionDenyList

(Default: [] -- deny nothing.)

A list of denied file extensions. Files that have one of these extensions are quarantined as if they were malware, with a log line indicating that they were infected with Config.DenyList.Blocked.

If no extensions are specified, no files are blocked by this list.

File extensions are case-insensitive. An empty string ("") matches files with no extension. Double extensions (e.g. "tar.gz") are supported.

Example: [ "zip", "tar.gz", "jpg" ]
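The allow-list and deny-list rules can be sketched as below. This is an illustration only: the helper names, the suffix-based matching, and the order in which the two lists are checked are assumptions, not the service's actual code.

```javascript
// Hypothetical helper: does the file name carry one of the listed extensions?
// "" stands for "no extension"; entries such as "tar.gz" match as a suffix.
function hasListedExtension(fileName, extensions) {
  const base = fileName.split("/").pop().toLowerCase();
  return extensions.some((ext) =>
    ext === "" ? !base.includes(".") : base.endsWith("." + ext.toLowerCase())
  );
}

// Returns the log reason for quarantining an otherwise clean file, or null.
// Assumption: the allow list is checked before the deny list.
function quarantineReason(fileName, { fileExtensionAllowList = [], fileExtensionDenyList = [] }) {
  if (fileExtensionAllowList.length > 0 && !hasListedExtension(fileName, fileExtensionAllowList)) {
    return "Config.AllowList.Blocked";
  }
  if (hasListedExtension(fileName, fileExtensionDenyList)) {
    return "Config.DenyList.Blocked";
  }
  return null;
}

console.log(quarantineReason("docs/Report.PDF", { fileExtensionAllowList: ["doc", "pdf", "jpg"] }));
console.log(quarantineReason("backup.tar.gz", { fileExtensionDenyList: ["zip", "tar.gz", "jpg"] }));
```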

Periodic clean-bucket rescan

This fork rescans files that are already in the clean bucket on a schedule, so that objects can be re-checked against newer malware definitions. The service does not use Eventarc on the clean bucket, because a finalize event would also fire when files are moved from unscanned to clean and would duplicate the initial scan.

Instead, Cloud Scheduler (or any client) sends an HTTP POST to the Cloud Run service with body {"kind":"schedule#clean_rescan"}. The service lists each configured clean bucket (up to a per-run cap), selects objects whose last-scan metadata is missing, invalid, or older than the configured interval, streams them through ClamAV, then either:

  • Still clean: updates only the last-scan metadata on the object (no move), or
  • Infected: moves the object to the matching quarantined bucket (same behavior as the unscanned path).
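The selection step described above can be sketched as follows. The function names and the shape of the object records (a name plus a metadata map) are hypothetical, and treating "older than" as a strict comparison is an assumption.

```javascript
const DAY_MS = 24 * 60 * 60 * 1000;

// Hypothetical check: an object is due when its last-scan metadata is
// missing, unparseable, or older than the configured interval.
function isDueForRescan(metadata, now, intervalDays = 30, metadataKey = "last-clamav-scan") {
  const value = metadata ? metadata[metadataKey] : undefined;
  if (!value) return true; // missing
  const lastScan = Date.parse(value);
  if (Number.isNaN(lastScan)) return true; // invalid
  return now - lastScan > intervalDays * DAY_MS; // older than the interval
}

// Collect due objects from a listing, stopping at the per-run cap.
function selectForRescan(objects, now, { intervalDays = 30, maxObjects = 100 } = {}) {
  const due = [];
  for (const obj of objects) {
    if (due.length >= maxObjects) break;
    if (isDueForRescan(obj.metadata, now, intervalDays)) due.push(obj);
  }
  return due;
}

const listing = [
  { name: "a.pdf", metadata: {} },
  { name: "b.pdf", metadata: { "last-clamav-scan": "not-a-date" } },
  { name: "c.pdf", metadata: { "last-clamav-scan": "2024-01-01T00:00:00Z" } },
  { name: "d.pdf", metadata: { "last-clamav-scan": "2024-02-20T00:00:00Z" } },
];
const rescanTime = Date.parse("2024-03-01T00:00:00Z");
const dueNames = selectForRescan(listing, rescanTime).map((o) => o.name);
console.log(dueNames);
```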

Configuration (config.json / CONFIG_JSON)

| Field | Meaning |
| --- | --- |
| cleanRescanIntervalDays | Minimum days between successful scans for an object in clean (default 30). |
| cleanRescanMaxObjectsPerRun | Maximum number of objects to scan in one schedule#clean_rescan invocation, across all bucket groups (default 100). |
| cleanRescanMetadataKey | GCS custom metadata key storing the last successful scan time as ISO 8601 UTC (default last-clamav-scan). In HTTP, this appears as x-goog-meta-<key> (e.g. x-goog-meta-last-clamav-scan for the default key). |

After a successful initial scan, when a file is moved from unscanned to clean, the service sets this metadata on the object in the clean bucket.

Omitting cleanRescanIntervalDays / cleanRescanMaxObjectsPerRun / cleanRescanMetadataKey applies the defaults above. See config.json.tmpl for commented examples.
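As a small illustration of how the defaults and the metadata header name fit together, consider the sketch below. The names CLEAN_RESCAN_DEFAULTS, withCleanRescanDefaults, and metadataHeader are hypothetical; only the field names and default values come from the list of fields above.

```javascript
// Default values as documented for the three clean-rescan fields.
const CLEAN_RESCAN_DEFAULTS = {
  cleanRescanIntervalDays: 30,
  cleanRescanMaxObjectsPerRun: 100,
  cleanRescanMetadataKey: "last-clamav-scan",
};

// Hypothetical helper: fields present in the config override the defaults.
function withCleanRescanDefaults(config) {
  return { ...CLEAN_RESCAN_DEFAULTS, ...config };
}

// GCS custom metadata keys appear in HTTP as x-goog-meta-<key>.
const metadataHeader = (key) => `x-goog-meta-${key}`;

const effective = withCleanRescanDefaults({ cleanRescanIntervalDays: 7 });
console.log(effective.cleanRescanIntervalDays, effective.cleanRescanMaxObjectsPerRun);
console.log(metadataHeader(effective.cleanRescanMetadataKey));
```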

Scheduling

Deploy a Cloud Scheduler HTTP job that sends a POST request to the Cloud Run service URL with the JSON body {"kind":"schedule#clean_rescan"} and OIDC authentication, following the same pattern as the existing CVD mirror update job. The job's service account must be able to invoke the Cloud Run service. Example (adjust the name, region, schedule, and URLs for your project):

gcloud scheduler jobs create http "${PROJECT_ID}-clean-rescan" \
  --location="${REGION}" \
  --schedule="0 6 * * *" \
  --uri="${MALWARE_SCANNER_URL}" \
  --http-method=POST \
  --headers="Content-Type=application/json" \
  --message-body='{"kind":"schedule#clean_rescan"}' \
  --oidc-service-account-email="malware-scanner@${PROJECT_ID}.iam.gserviceaccount.com" \
  --oidc-token-audience="${MALWARE_SCANNER_URL}"

The Terraform module in this repo still creates only the CVD mirror scheduler by default; add the clean-rescan job manually (as above) or extend Terraform if you prefer it as code.

Metrics

OpenTelemetry counters include:

  • .../clean-rescan-clean-files — periodic rescan remained clean (metadata updated).
  • .../clean-rescan-infected-files — periodic rescan quarantined the object.

(Same metric prefix as other scanner metrics, e.g. workload.googleapis.com/googlecloudplatform/gcs-malware-scanning/....)

Change history

See CHANGELOG.md

Upgrading from v2.x to v3.x

In version 3.x, metrics reporting was changed to OpenTelemetry, which uses a different naming convention for metrics, so the metric names have changed from:

custom.googleapis.com/opencensus/malware-scanning/METRIC-NAME

to

workload.googleapis.com/googlecloudplatform/gcs-malware-scanning/METRIC-NAME

Any dashboards or alerts using these metrics must be updated.
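When updating many dashboard or alert definitions, the prefix change can be applied mechanically. The sketch below assumes that only the prefix changed and that the METRIC-NAME suffix is identical in both versions, which you should verify for each metric. The function name migrateMetricName and the metric example-metric are hypothetical.

```javascript
const OLD_PREFIX = "custom.googleapis.com/opencensus/malware-scanning/";
const NEW_PREFIX = "workload.googleapis.com/googlecloudplatform/gcs-malware-scanning/";

// Hypothetical helper: rewrite a v2.x metric name to the v3.x prefix.
// Names that do not start with the old prefix are returned unchanged.
function migrateMetricName(name) {
  return name.startsWith(OLD_PREFIX) ? NEW_PREFIX + name.slice(OLD_PREFIX.length) : name;
}

const migrated = migrateMetricName(OLD_PREFIX + "example-metric");
console.log(migrated);
```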

Upgrading from v1.x to v2.x

Version 2 has a different way of handling ClamAV updates to avoid issues with the ClamAV content distribution network.

See upgrade_from_v1.md for upgrading instructions.

License

Copyright 2022 Google LLC

Licensed under the Apache License, Version 2.0 (the "License"); you may not use
this file except in compliance with the License. You may obtain a copy of the
License at

https://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed
under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR
CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
