Skip to content

Faceted Search with SQL and Impala

Ram Viswanadha edited this page Nov 5, 2017 · 1 revision

Introduction

Recently we had an interesting request from the product organization, the product needs to have an ability to store, search and analyze the log data for 90-180 days. We usually archive up to 15 days of raw data which we use to build our statistical models and generate insights. One of the key requirements of this ability was to be able to do faceted search and faceted queries on top of an SQL engine that we were using to retrieve data from our long term storage.

Building an SQL query for performing faceted search is tricky and not at all obvious. This post introduces a way to get facetted search results from any SQL engine.

Requirement

The logs are stored in 2 different ways:

  1. Indexed in Solr for fast analysis and searching
  2. Stored on the disk for long term storage and archival

Log Data Format

We process various types of logs. One of the logs that we process are web proxy logs. The following is the log pattern for Squid Web Proxy:

Sample log entries:


<135>Sep 26 00:02:01 ip-192-168-12-26 (squid):   192.168.12.21 - - [26/Sep/2016:00:02:01 +0000] "CONNECT iecvlist.microsoft.com:443 HTTP/1.0" 200 243 8649 "-" "Mozilla/5.0 (Windows NT 6.1; Win64; x64; Trident/7.0; rv:11.0) like Gecko" TCP_MISS:DIRECT

<135>Sep 26 01:21:44 ip-192-168-12-26 (squid):   192.168.12.30 - - [26/Sep/2016:01:21:44 +0000] "CONNECT iecvlist.microsoft.com:443 HTTP/1.0" 200 156 8649 "-" "-" TCP_MISS:DIRECT

<135>Sep 26 01:26:42 ip-192-168-12-26 (squid):   192.168.12.30 - - [26/Sep/2016:01:26:42 +0000] "CONNECT r20swj13mr.microsoft.com:443 HTTP/1.0" 200 247 8766 "-" "Mozilla/5.0 (Windows NT 6.1; Win64; x64; Trident/7.0; rv:11.0) like Gecko" TCP_MISS:DIRECT

<135>Sep 26 01:01:13 ip-192-168-12-26 (squid):   192.168.12.47 - - [26/Sep/2016:01:01:13 +0000] "CONNECT e8security.e8security.com:443 HTTP/1.1"" 200 101 3959 "-" "-" TCP_MISS:DIRECT

<135>Sep 26 02:42:34 ip-192-168-12-26 (squid):   192.168.12.30 - - [26/Sep/2016:02:42:34 +0000] "CONNECT iecvlist.microsoft.com:443 HTTP/1.0" 200 243 8649 "-" "Mozilla/5.0 (Windows NT 6.1; Win64; x64; Trident/7.0; rv:11.0) like Gecko" TCP_MISS:DIRECT

Parsed Format

Our parsers parse the logs into a format similar to the following.

  {
    "uuid": "0-61fd80e3-91aa-11e6-b27e-0cc47a593a6c-e082-9206bf0e4ce115a319c2689dd9d6e41f",
    "externallogsourcetype": "Squid",
    "rawlog": "<135>Oct 14 00:04:33 ip-192-168-12-26 (squid):   192.168.12.47 - - [14/Oct/2016:00:04:33 +0000] \"CONNECT e8security.e8security.com:443 HTTP/1.1\" 200 101 3959 \"-\" \"-\" TCP_MISS:DIRECT",
    "starttimeiso": "2016-10-14T00:04:33.000Z",
    "applicationprotocol": "NULL",
    "bytesin": 3959,
    "bytesout": 101,
    "cefsignatureid": 200,
    "destinationaddress": "NULL",
    "destinationdnsdomain": "e8security.com",
    "destinationhostname": "e8security",
    "destinationmacaddress": "NULL",
    "destinationnameorip": "e8security.e8security.com",
    "destinationport": 443,
    "deviceaction": "DIRECT",
    "deviceaddress": "NULL",
    "devicednsdomain": "NULL",
    "deviceeventcategory": "NULL",
    "devicehostname": "ip-192-168-12-26",
    "deviceprocessname": "NULL",
    "devicenameorip": "ip-192-168-12-26",
    "endtime": "NULL",
    "message": "NULL",
    "requestmethod": "CONNECT",
    "requestclientapplication": "-",
    "sourceaddress": "192.168.12.47",
    "sourcednsdomain": "NULL",
    "sourcehostname": "NULL",
    "sourcemacaddress": "NULL",
    "sourcenameorip": "192.168.12.47",
    "sourceport": "NULL",
    "sourceusername": "NULL",
    "starttime": 1476403473000,
    "transportprotocol": "NULL",
    "responsecontenttype": "NULL",
    "reason": "NULL",
    "requestscheme": "NULL",
    "requestpath": "NULL",
    "requestquery": "NULL",
    "requestreferer": "-",
    "devicepolicyaction": "TCP_MISS",
    "dt": "2016-10-14"
  }

The archived logs are in Avro format. Avro provides good format for specifying the schema and storing the data. We needed a format that would support the efficient queries on time series data and store the data in compressed binary format.

Since we are a Cloudera Hadoop based platform our choices were clear:

  1. Store the data in Parquet and expose the data via [Impala] (http://impala.apache.org/) SQL engine
  2. Store the data in Kudu and expose the data via Impala SQL engine

Both Parquet and Kudu are columnar storage formats, in fact Kudu's on-disk storage format closely resembles Parquet's on-disk format with the difference that the on-disk data cannot be queried without a Kudu client. Kudu has better performance for OLAP operations. We chose Parquet for the ease of implementation and the availability of an upgrade path to Kudu if the performance of Parquet turned out to be suboptimal.

Challenge

Faceted search is the ability in a product to search over large amounts of data, to display summaries about various field values and to filter the data based on the field values.

The current version of the product uses Solr to index the parsed logs and display the information in a intuitive fashion. The queries took advantage of the power of Solr's Query language to fetch the data and the facets.

Solr is a document oriented store. The Query Domain Specific Language (DSL) is rich and has built in facilities to query facets. The Query DSL from Solr has to be ported to Impala which did not support such built in functions.

Here is an example of the query on Solr

curl -XGET 'http://solr:8983/solr/log_index/select?facet=true&facet.limit=5&facet.mincount=1&facet.sort=count&facet.field=requestClientApplication&facet.field=sourceUserName&facet.field=responseContentType&facet.field=requestScheme&facet.field=requestMethod&facet.field=devicePolicyAction&facet.field=sourceNameOrIp&facet.field=cefSignatureId&facet.field=destinationNameOrIp&start=0&rows=50&fq=startTimeISO%3A%5B2016-10-16T00%3A00%3A00.000Z+TO+2016-10-29T00%3A00%3A00.000Z%5D&sort=startTimeISO+desc&q=%28sourceNameOrIp%3A%28192.168.12.18%29%29&wt=json&indent=true'

The most important query parameters are facet=true, facet.limit=5, facet.mincount=1, facet.sort=count, facet.field=requestClientApplication, facet.field=sourceUserName, facet.field=responseContentType, facet.field=requestScheme, facet.field=requestMethod, facet.field=devicePolicyAction, facet.field=sourceNameOrIp, facet.field=cefSignatureId, facet.field=destinationNameOrIp. By specifing these parameters, Solr computes the values internally and returns the extra fields in the data returned as shown below

{
 "facet_counts":{
    "facet_queries":{},
    "facet_fields":{
      "requestClientApplication":[
        "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:47.0) Gecko/20100101 Firefox/47.0",96,
        "Notepad++/Plugin-Manager;v1.3.5.0",2],
      "sourceUserName":[],
      "responseContentType":[],
      "requestScheme":[
        "http",50],
      "requestMethod":[
        "CONNECT",48,
        "GET",37,
        "POST",13],
      "devicePolicyAction":[
        "TCP_MISS",97,
        "TCP_MEM_HIT",1],
      "sourceNameOrIp":[
        "192.168.12.18",98],
      "cefSignatureId":[
        "200",73,
        "206",20,
        "302",2,
        "303",2,
        "0",1],
      "destinationNameOrIp":[
        "download.cdn.mozilla.net",20,
        "172.16.1.251",12,
        "ocsp.digicert.com",10,
        "safebrowsing-cache.google.com",10,
        "safebrowsing.google.com",10]},
    "facet_dates":{},
    "facet_ranges":{},
    "facet_intervals":{}}
}

We need to construct SQL queries that reproduce the above data for the same search terms.

Solution

The brute force solution to the problem is to construct SQL queries for each of the facets such as:

SELECT destinationNameOrIp, COUNT(destinationNameOrIp) FROM LOG_DATA WHERE sourceNameOrIp = '192.168.12.18' AND startTimeISO>='2016-10-16T00:00:00.000Z' AND startTimeISO< '2016-10-29T00:00:A00.000Z' GROUP BY destinationNameOrIp;

The issue with the brute force approach is that a different query needs to run for computing the values of each query. The query in the impala-shell takes about 0.58s to complete.

destinationnameorip,count(destinationnameorip)
blocklists.settings.services.mozilla.com,1
ftp.mozilla.org,3
tracking-protection.cdn.mozilla.net,1
self-repair.mozilla.org,2
ctldl.windowsupdate.com,2
Fetched 5 row(s) in 0.57s

The explain plan of the query sheds some light. A full table scan is performed for this query. Even if the all the facet queries are executed in parallel via the API server, the table is scanned the multiple times which is very wasteful. Please read Cloudera's documentation on Explain plan for more information decipher the information.

WARNING: The following tables are missing relevant table and/or column statistics.
e8sec.web_proxy_parquet
""
04:EXCHANGE [UNPARTITIONED]
|  limit: 5
|
03:AGGREGATE [FINALIZE]
|  output: count:merge(destinationNameOrIp)
|  group by: destinationNameOrIp
|  limit: 5
|
02:EXCHANGE [HASH(destinationNameOrIp)]
|
01:AGGREGATE
|  output: count(destinationNameOrIp)
|  group by: destinationNameOrIp
|
00:SCAN HDFS [e8sec.web_proxy_parquet]
   partitions=37/42 files=37 size=1.16MB
"   predicates: sourceNameOrIp = '192.168.12.18', startTimeISO >= '2016-10-16T00:00:00.000Z', startTimeISO < '2016-10-29T00:00:A00.000Z'"
Fetched 21 row(s) in 0.01s

What if only a single query could be run to retrieve the information? What if all the queries can be combined?

SELECT requestClientApplication, COUNT( requestClientApplication) as requestClientApplicationCount, 
sourceUserName, COUNT( sourceUserName) as sourceUserNameCount, responseContentType, COUNT( responseContentType) 
as responseContentTypeCount, requestScheme, COUNT(requestScheme) as requestSchemeCount, requestMethod,
COUNT( requestMethod) as requestMethodCount, devicePolicyAction, COUNT( devicePolicyAction) as devicePolicyActionCount, sourceNameOrIp, COUNT( sourceNameOrIp) as sourceNameOrIpCount, cefSignatureId, 
COUNT( cefSignatureId) as cefSignatureIdCount, destinationNameOrIp, COUNT( destinationNameOrIp) as destinationNameOrIpCount FROM E8SEC.WEB_PROXY_PARQUET WHERE STARTTIMEISO >= '2016-10-16T00:00:00.000Z' 
AND STARTTIMEISO <'2016-10-29T00:00:00.000Z'  AND NOT requestMethod IN ('POST','post','GET','get') 
AND sourceNameOrIp IN  ('192.168.12.18' )  GROUP BY requestClientApplication, sourceUserName, responseContentType, requestScheme, requestMethod, devicePolicyAction, sourceNameOrIp, cefSignatureId, destinationNameOrIp ORDER BY requestClientApplicationCount, sourceUserNameCount, responseContentTypeCount, requestSchemeCount, requestMethodCount, devicePolicyActionCount, sourceNameOrIpCount, cefSignatureIdCount, destinationNameOrIpCount 

This approach would get get us quite far and fetches the required data correctly. Running the above SQL would return the following data.

requestclientapplication,requestclientapplicationcount,sourceusername,sourceusernamecount,responsecontenttype,responsecontenttypecount,requestscheme,requestschemecount,requestmethod,requestmethodcount,devicepolicyaction,devicepolicyactioncount,sourcenameorip,sourcenameoripcount,cefsignatureid,cefsignatureidcount,destinationnameorip,destinationnameoripcount
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:47.0) Gecko/20100101 Firefox/47.0,1,NULL,0,NULL,0,NULL,0,CONNECT,1,TCP_MISS,1,192.168.12.18,1,200,1,tiles.services.mozilla.com,1
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:47.0) Gecko/20100101 Firefox/47.0,1,NULL,0,NULL,0,NULL,0,CONNECT,1,TCP_MISS,1,192.168.12.18,1,200,1,blocklists.settings.services.mozilla.com,1
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:47.0) Gecko/20100101 Firefox/47.0,1,NULL,0,NULL,0,NULL,0,CONNECT,1,TCP_MISS,1,192.168.12.18,1,200,1,blocklist.addons.mozilla.org,1
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:47.0) Gecko/20100101 Firefox/47.0,1,NULL,0,NULL,0,NULL,0,CONNECT,1,TCP_MISS,1,192.168.12.18,1,200,1,snippets.cdn.mozilla.net,1
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:47.0) Gecko/20100101 Firefox/47.0,1,NULL,0,NULL,0,NULL,0,CONNECT,1,TCP_MISS,1,192.168.12.18,1,200,1,services.addons.mozilla.org,1
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:47.0) Gecko/20100101 Firefox/47.0,1,NULL,0,NULL,0,NULL,0,CONNECT,1,TCP_MISS,1,192.168.12.18,1,200,1,quickdraw.splunk.com,1
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:47.0) Gecko/20100101 Firefox/47.0,1,NULL,0,NULL,0,NULL,0,CONNECT,1,TCP_MISS,1,192.168.12.18,1,200,1,tiles-cloudfront.cdn.mozilla.net,1
jupdate,1,NULL,0,NULL,0,NULL,0,CONNECT,1,TCP_MISS,1,192.168.12.18,1,200,1,javadl-esd-secure.oracle.com,1
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:47.0) Gecko/20100101 Firefox/47.0,1,NULL,0,NULL,0,NULL,0,CONNECT,1,TCP_MISS,1,192.168.12.18,1,200,1,tracking-protection.cdn.mozilla.net,1
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:47.0) Gecko/20100101 Firefox/47.0,1,NULL,0,NULL,0,NULL,0,CONNECT,1,TCP_MISS,1,192.168.12.18,1,200,1,addons.cdn.mozilla.net,1
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:47.0) Gecko/20100101 Firefox/47.0,1,NULL,0,NULL,0,NULL,0,CONNECT,1,TCP_MISS,1,192.168.12.18,1,200,1,location.services.mozilla.com,1
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:47.0) Gecko/20100101 Firefox/47.0,1,NULL,0,NULL,0,NULL,0,CONNECT,1,TCP_MISS,1,192.168.12.18,1,200,1,normandy.cdn.mozilla.net,1
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:47.0) Gecko/20100101 Firefox/47.0,2,NULL,0,NULL,0,NULL,0,CONNECT,2,TCP_MISS,2,192.168.12.18,2,200,2,incoming.telemetry.mozilla.org,2
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:47.0) Gecko/20100101 Firefox/47.0,2,NULL,0,NULL,0,NULL,0,CONNECT,2,TCP_MISS,2,192.168.12.18,2,200,2,aus5.mozilla.org,2
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:47.0) Gecko/20100101 Firefox/47.0,2,NULL,0,NULL,0,NULL,0,CONNECT,2,TCP_MISS,2,192.168.12.18,2,200,2,self-repair.mozilla.org,2
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:47.0) Gecko/20100101 Firefox/47.0,3,NULL,0,NULL,0,NULL,0,CONNECT,3,TCP_MISS,3,192.168.12.18,3,200,3,versioncheck-bg.addons.mozilla.org,3
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:47.0) Gecko/20100101 Firefox/47.0,3,NULL,0,NULL,0,NULL,0,CONNECT,3,TCP_MISS,3,192.168.12.18,3,200,3,ftp.mozilla.org,3
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:47.0) Gecko/20100101 Firefox/47.0,5,NULL,0,NULL,0,NULL,0,CONNECT,5,TCP_MISS,5,192.168.12.18,5,200,5,shavar.services.mozilla.com,5
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:47.0) Gecko/20100101 Firefox/47.0,10,NULL,0,NULL,0,NULL,0,CONNECT,10,TCP_MISS,10,192.168.12.18,10,200,10,safebrowsing.google.com,10
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:47.0) Gecko/20100101 Firefox/47.0,10,NULL,0,NULL,0,NULL,0,CONNECT,10,TCP_MISS,10,192.168.12.18,10,200,10,safebrowsing-cache.google.com,10
Fetched 20 row(s) in 0.77s

Compare the the previous query, this query performs a bit worse (0.77s vs 0.54s). Since the table is not scanned multiple time, CPU resources are saved. Explain plan, confirms the hypothesis that the table is scanned only once.

Estimated Per-Host Requirements: Memory=288.00MB VCores=2
WARNING: The following tables are missing relevant table and/or column statistics.
e8sec.web_proxy_parquet
""
05:MERGING-EXCHANGE [UNPARTITIONED]
"|  order by: count(requestClientApplication) ASC, count(sourceUserName) ASC, count(responseContentType) ASC, count(requestScheme) ASC, count(requestMethod) ASC, count(devicePolicyAction) ASC, count(sourceNameOrIp) ASC, count(cefSignatureId) ASC, count(destinationNameOrIp) ASC"
|
02:SORT
"|  order by: count(requestClientApplication) ASC, count(sourceUserName) ASC, count(responseContentType) ASC, count(requestScheme) ASC, count(requestMethod) ASC, count(devicePolicyAction) ASC, count(sourceNameOrIp) ASC, count(cefSignatureId) ASC, count(destinationNameOrIp) ASC"
|
04:AGGREGATE [FINALIZE]
"|  output: count:merge(requestClientApplication), count:merge(sourceUserName), count:merge(responseContentType), count:merge(requestScheme), count:merge(requestMethod), count:merge(devicePolicyAction), count:merge(sourceNameOrIp), count:merge(cefSignatureId), count:merge(destinationNameOrIp)"
"|  group by: requestClientApplication, sourceUserName, responseContentType, requestScheme, requestMethod, devicePolicyAction, sourceNameOrIp, cefSignatureId, destinationNameOrIp"
|
"03:EXCHANGE [HASH(requestClientApplication,sourceUserName,responseContentType,requestScheme,requestMethod,devicePolicyAction,sourceNameOrIp,cefSignatureId,destinationNameOrIp)]"
|
01:AGGREGATE
"|  output: count(requestClientApplication), count(sourceUserName), count(responseContentType), count(requestScheme), count(requestMethod), count(devicePolicyAction), count(sourceNameOrIp), count(cefSignatureId), count(destinationNameOrIp)"
"|  group by: requestClientApplication, sourceUserName, responseContentType, requestScheme, requestMethod, devicePolicyAction, sourceNameOrIp, cefSignatureId, destinationNameOrIp"
|
00:SCAN HDFS [e8sec.web_proxy_parquet]
   partitions=39/44 files=39 size=1.20MB
"   predicates: STARTTIMEISO >= '2016-10-16T00:00:00.000Z', STARTTIMEISO < '2016-10-29T00:00:00.000Z', NOT requestMethod IN ('POST', 'post', 'GET', 'get'), sourceNameOrIp IN ('192.168.12.18')"
Fetched 23 row(s) in 0.02s

Although, results are correct this would need some post processing of the data returned to transform it into the same format as Solr.

What if a query construction that would return the data in the same form as Solr? This could be achieved, if the queries are combined using a union method.

SELECT "requestClientApplication" as facet, requestClientApplication as facet_value, COUNT( requestClientApplication) as facet_count, SUM(COUNT(requestClientApplication)) OVER() AS total_count FROM E8SEC.WEB_PROXY_PARQUET WHERE STARTTIMEISO >= '2016-10-16T00:00:00.000Z' AND STARTTIMEISO <'2016-10-29T00:00:00.000Z'  AND NOT requestMethod IN ('POST','post','GET','get') AND sourceNameOrIp IN  ('192.168.12.18' ) GROUP BY  requestClientApplication
UNION ALL
SELECT "cefsignatureid" as facet,  cefsignatureid as facet_value, COUNT( cefsignatureid) as facet_count, SUM(COUNT(cefsignatureid)) OVER() AS total_count FROM E8SEC.WEB_PROXY_PARQUET WHERE STARTTIMEISO >= '2016-10-16T00:00:00.000Z' AND STARTTIMEISO <'2016-10-29T00:00:00.000Z'  AND NOT cefsignatureid IN ('POST','post','GET','get') AND sourceNameOrIp IN  ('192.168.12.18' )  GROUP BY cefsignatureid

The above SQL produces the data in the format that we are

facet,facet_value,facet_count,total_count
requestClientApplication,Mozilla/5.0 (Windows NT 6.1; WOW64; rv:47.0) Gecko/20100101 Firefox/47.0,48,49
requestClientApplication,jupdate,1,49
cefsignatureid,304,2,103
cefsignatureid,0,1,103
cefsignatureid,302,2,103
cefsignatureid,206,20,103
cefsignatureid,200,76,103
cefsignatureid,303,2,103
Fetched 8 row(s) in 1.05s

There are few things that are going on in the SQL. Lets dissect it:

  1. "requestClientApplication" as facet - The name of the column is used as is in the output
  2. requestClientApplication as facet_value - The values of the column are returned in the output
  3. COUNT( requestClientApplication) as facet_count & GROUP BY requestClientApplication - The different values in the column that match the search query are counted and are grouped by different values.
  4. SUM(COUNT(requestClientApplication)) OVER() AS total_count - The total number of rows that match the query criteria are counted. This number is a sum of all the counts from above
  5. UNION ALL - Now that the query has aliased the names of columns, all subsequent queries with the same aliases can be merged together and returned

Putting it all together to find all the facets, the SQL transforms to

SELECT "requestClientApplication" as facet, requestClientApplication as facet_value, COUNT(requestClientApplication) as facet_count, SUM(COUNT(requestClientApplication)) OVER() AS total_count FROM E8SEC.WEB_PROXY_PARQUET WHERE STARTTIMEISO >= '2016-10-16T00:00:00.000Z' AND STARTTIMEISO <'2016-10-29T00:00:00.000Z'  AND NOT requestMethod IN ('POST','post','GET','get') AND sourceNameOrIp IN  ('192.168.12.18' ) GROUP BY requestClientApplication LIMIT 5
 UNION ALL 
SELECT "sourceUserName" as facet, sourceUserName as facet_value, COUNT(sourceUserName) as facet_count, SUM(COUNT(sourceUserName)) OVER() AS total_count FROM E8SEC.WEB_PROXY_PARQUET WHERE STARTTIMEISO >= '2016-10-16T00:00:00.000Z' AND STARTTIMEISO <'2016-10-29T00:00:00.000Z'  AND NOT requestMethod IN ('POST','post','GET','get') AND sourceNameOrIp IN  ('192.168.12.18' ) GROUP BY sourceUserName LIMIT 5
 UNION ALL 
SELECT "responseContentType" as facet, responseContentType as facet_value, COUNT(responseContentType) as facet_count, SUM(COUNT(responseContentType)) OVER() AS total_count FROM E8SEC.WEB_PROXY_PARQUET WHERE STARTTIMEISO >= '2016-10-16T00:00:00.000Z' AND STARTTIMEISO <'2016-10-29T00:00:00.000Z'  AND NOT requestMethod IN ('POST','post','GET','get') AND sourceNameOrIp IN  ('192.168.12.18' ) GROUP BY responseContentType LIMIT 5
 UNION ALL 
SELECT "requestScheme" as facet, requestScheme as facet_value, COUNT(requestScheme) as facet_count, SUM(COUNT(requestScheme)) OVER() AS total_count FROM E8SEC.WEB_PROXY_PARQUET WHERE STARTTIMEISO >= '2016-10-16T00:00:00.000Z' AND STARTTIMEISO <'2016-10-29T00:00:00.000Z'  AND NOT requestMethod IN ('POST','post','GET','get') AND sourceNameOrIp IN  ('192.168.12.18' ) GROUP BY requestScheme LIMIT 5
 UNION ALL 
SELECT "requestMethod" as facet, requestMethod as facet_value, COUNT(requestMethod) as facet_count, SUM(COUNT(requestMethod)) OVER() AS total_count FROM E8SEC.WEB_PROXY_PARQUET WHERE STARTTIMEISO >= '2016-10-16T00:00:00.000Z' AND STARTTIMEISO <'2016-10-29T00:00:00.000Z'  AND NOT requestMethod IN ('POST','post','GET','get') AND sourceNameOrIp IN  ('192.168.12.18' ) GROUP BY requestMethod LIMIT 5
 UNION ALL 
SELECT "devicePolicyAction" as facet, devicePolicyAction as facet_value, COUNT(devicePolicyAction) as facet_count, SUM(COUNT(devicePolicyAction)) OVER() AS total_count FROM E8SEC.WEB_PROXY_PARQUET WHERE STARTTIMEISO >= '2016-10-16T00:00:00.000Z' AND STARTTIMEISO <'2016-10-29T00:00:00.000Z'  AND NOT requestMethod IN ('POST','post','GET','get') AND sourceNameOrIp IN  ('192.168.12.18' ) GROUP BY devicePolicyAction LIMIT 5
 UNION ALL 
SELECT "sourceNameOrIp" as facet, sourceNameOrIp as facet_value, COUNT(sourceNameOrIp) as facet_count, SUM(COUNT(sourceNameOrIp)) OVER() AS total_count FROM E8SEC.WEB_PROXY_PARQUET WHERE STARTTIMEISO >= '2016-10-16T00:00:00.000Z' AND STARTTIMEISO <'2016-10-29T00:00:00.000Z'  AND NOT requestMethod IN ('POST','post','GET','get') AND sourceNameOrIp IN  ('192.168.12.18' ) GROUP BY sourceNameOrIp LIMIT 5
 UNION ALL 
SELECT "cefSignatureId" as facet, cefSignatureId as facet_value, COUNT(cefSignatureId) as facet_count, SUM(COUNT(cefSignatureId)) OVER() AS total_count FROM E8SEC.WEB_PROXY_PARQUET WHERE STARTTIMEISO >= '2016-10-16T00:00:00.000Z' AND STARTTIMEISO <'2016-10-29T00:00:00.000Z'  AND NOT requestMethod IN ('POST','post','GET','get') AND sourceNameOrIp IN  ('192.168.12.18' ) GROUP BY cefSignatureId LIMIT 5
 UNION ALL 
SELECT "destinationNameOrIp" as facet, destinationNameOrIp as facet_value, COUNT(destinationNameOrIp) as facet_count, SUM(COUNT(destinationNameOrIp)) OVER() AS total_count FROM E8SEC.WEB_PROXY_PARQUET WHERE STARTTIMEISO >= '2016-10-16T00:00:00.000Z' AND STARTTIMEISO <'2016-10-29T00:00:00.000Z'  AND NOT requestMethod IN ('POST','post','GET','get') AND sourceNameOrIp IN  ('192.168.12.18' ) GROUP BY destinationNameOrIp LIMIT 5

The above SQL generates output as follows:

facet,facet_value,facet_count,total_count
requestClientApplication,Mozilla/5.0 (Windows NT 6.1; WOW64; rv:47.0) Gecko/20100101 Firefox/47.0,48,49
requestClientApplication,jupdate,1,49
sourceUserName,NULL,0,0
responseContentType,NULL,0,0
requestScheme,NULL,0,0
requestMethod,CONNECT,49,49
devicePolicyAction,TCP_MISS,49,49
sourceNameOrIp,192.168.12.18,49,49
cefSignatureId,200,49,49
destinationNameOrIp,ftp.mozilla.org,3,49
destinationNameOrIp,blocklists.settings.services.mozilla.com,1,49
destinationNameOrIp,snippets.cdn.mozilla.net,1,49
destinationNameOrIp,self-repair.mozilla.org,2,49
destinationNameOrIp,tiles.services.mozilla.com,1,49
Fetched 14 row(s) in 3.39s

The query although elegant, takes 3x more time to fetch the data. What could be the reason? The hypothesis was that by combining all the queries into one query, the performance would be improved. Explain plan the above query provides further insights. The way the queries are written with a UNION ALL operation result in multiple tables scans.

Estimated Per-Host Requirements: Memory=256.00MB VCores=2
WARNING: The following tables are missing relevant table and/or column statistics.
e8sec.web_proxy_parquet
""
00:UNION
|
|--27:ANALYTIC
|  |  functions: sum(count(destinationNameOrIp))
|  |  limit: 5
|  |
|  54:EXCHANGE [UNPARTITIONED]
|  |
|  53:AGGREGATE [FINALIZE]
|  |  output: count:merge(destinationNameOrIp)
|  |  group by: destinationNameOrIp
|  |
|  52:EXCHANGE [HASH(destinationNameOrIp)]
|  |
|  26:AGGREGATE
|  |  output: count(destinationNameOrIp)
|  |  group by: destinationNameOrIp
|  |
|  25:SCAN HDFS [e8sec.web_proxy_parquet]
|     partitions=39/44 files=39 size=1.20MB
"|     predicates: STARTTIMEISO >= '2016-10-16T00:00:00.000Z', STARTTIMEISO < '2016-10-29T00:00:00.000Z', NOT requestMethod IN ('POST', 'post', 'GET', 'get'), sourceNameOrIp IN ('192.168.12.18')"
|
|--24:ANALYTIC
|  |  functions: sum(count(cefSignatureId))
|  |  limit: 5
|  |
|  51:EXCHANGE [UNPARTITIONED]
|  |
|  50:AGGREGATE [FINALIZE]
|  |  output: count:merge(cefSignatureId)
|  |  group by: cefSignatureId
|  |
|  49:EXCHANGE [HASH(cefSignatureId)]
|  |
|  23:AGGREGATE
|  |  output: count(cefSignatureId)
|  |  group by: cefSignatureId
|  |
|  22:SCAN HDFS [e8sec.web_proxy_parquet]
|     partitions=39/44 files=39 size=1.20MB
"|     predicates: STARTTIMEISO >= '2016-10-16T00:00:00.000Z', STARTTIMEISO < '2016-10-29T00:00:00.000Z', NOT requestMethod IN ('POST', 'post', 'GET', 'get'), sourceNameOrIp IN ('192.168.12.18')"
|
|--21:ANALYTIC
|  |  functions: sum(count(sourceNameOrIp))
|  |  limit: 5
|  |
|  48:EXCHANGE [UNPARTITIONED]
|  |
|  47:AGGREGATE [FINALIZE]
|  |  output: count:merge(sourceNameOrIp)
|  |  group by: sourceNameOrIp
|  |
|  46:EXCHANGE [HASH(sourceNameOrIp)]
|  |
|  20:AGGREGATE
|  |  output: count(sourceNameOrIp)
|  |  group by: sourceNameOrIp
|  |
|  19:SCAN HDFS [e8sec.web_proxy_parquet]
|     partitions=39/44 files=39 size=1.20MB
"|     predicates: STARTTIMEISO >= '2016-10-16T00:00:00.000Z', STARTTIMEISO < '2016-10-29T00:00:00.000Z', NOT requestMethod IN ('POST', 'post', 'GET', 'get'), sourceNameOrIp IN ('192.168.12.18')"
|
|--18:ANALYTIC
|  |  functions: sum(count(devicePolicyAction))
|  |  limit: 5
|  |
|  45:EXCHANGE [UNPARTITIONED]
|  |
|  44:AGGREGATE [FINALIZE]
|  |  output: count:merge(devicePolicyAction)
|  |  group by: devicePolicyAction
|  |
|  43:EXCHANGE [HASH(devicePolicyAction)]
|  |
|  17:AGGREGATE
|  |  output: count(devicePolicyAction)
|  |  group by: devicePolicyAction
|  |
|  16:SCAN HDFS [e8sec.web_proxy_parquet]
|     partitions=39/44 files=39 size=1.20MB
"|     predicates: STARTTIMEISO >= '2016-10-16T00:00:00.000Z', STARTTIMEISO < '2016-10-29T00:00:00.000Z', NOT requestMethod IN ('POST', 'post', 'GET', 'get'), sourceNameOrIp IN ('192.168.12.18')"
|
|--15:ANALYTIC
|  |  functions: sum(count(requestMethod))
|  |  limit: 5
|  |
|  42:EXCHANGE [UNPARTITIONED]
|  |
|  41:AGGREGATE [FINALIZE]
|  |  output: count:merge(requestMethod)
|  |  group by: requestMethod
|  |
|  40:EXCHANGE [HASH(requestMethod)]
|  |
|  14:AGGREGATE
|  |  output: count(requestMethod)
|  |  group by: requestMethod
|  |
|  13:SCAN HDFS [e8sec.web_proxy_parquet]
|     partitions=39/44 files=39 size=1.20MB
"|     predicates: STARTTIMEISO >= '2016-10-16T00:00:00.000Z', STARTTIMEISO < '2016-10-29T00:00:00.000Z', NOT requestMethod IN ('POST', 'post', 'GET', 'get'), sourceNameOrIp IN ('192.168.12.18')"
|
|--12:ANALYTIC
|  |  functions: sum(count(requestScheme))
|  |  limit: 5
|  |
|  39:EXCHANGE [UNPARTITIONED]
|  |
|  38:AGGREGATE [FINALIZE]
|  |  output: count:merge(requestScheme)
|  |  group by: requestScheme
|  |
|  37:EXCHANGE [HASH(requestScheme)]
|  |
|  11:AGGREGATE
|  |  output: count(requestScheme)
|  |  group by: requestScheme
|  |
|  10:SCAN HDFS [e8sec.web_proxy_parquet]
|     partitions=39/44 files=39 size=1.20MB
"|     predicates: STARTTIMEISO >= '2016-10-16T00:00:00.000Z', STARTTIMEISO < '2016-10-29T00:00:00.000Z', NOT requestMethod IN ('POST', 'post', 'GET', 'get'), sourceNameOrIp IN ('192.168.12.18')"
|
|--09:ANALYTIC
|  |  functions: sum(count(responseContentType))
|  |  limit: 5
|  |
|  36:EXCHANGE [UNPARTITIONED]
|  |
|  35:AGGREGATE [FINALIZE]
|  |  output: count:merge(responseContentType)
|  |  group by: responseContentType
|  |
|  34:EXCHANGE [HASH(responseContentType)]
|  |
|  08:AGGREGATE
|  |  output: count(responseContentType)
|  |  group by: responseContentType
|  |
|  07:SCAN HDFS [e8sec.web_proxy_parquet]
|     partitions=39/44 files=39 size=1.20MB
"|     predicates: STARTTIMEISO >= '2016-10-16T00:00:00.000Z', STARTTIMEISO < '2016-10-29T00:00:00.000Z', NOT requestMethod IN ('POST', 'post', 'GET', 'get'), sourceNameOrIp IN ('192.168.12.18')"
|
|--06:ANALYTIC
|  |  functions: sum(count(sourceUserName))
|  |  limit: 5
|  |
|  33:EXCHANGE [UNPARTITIONED]
|  |
|  32:AGGREGATE [FINALIZE]
|  |  output: count:merge(sourceUserName)
|  |  group by: sourceUserName
|  |
|  31:EXCHANGE [HASH(sourceUserName)]
|  |
|  05:AGGREGATE
|  |  output: count(sourceUserName)
|  |  group by: sourceUserName
|  |
|  04:SCAN HDFS [e8sec.web_proxy_parquet]
|     partitions=39/44 files=39 size=1.20MB
"|     predicates: STARTTIMEISO >= '2016-10-16T00:00:00.000Z', STARTTIMEISO < '2016-10-29T00:00:00.000Z', NOT requestMethod IN ('POST', 'post', 'GET', 'get'), sourceNameOrIp IN ('192.168.12.18')"
|
03:ANALYTIC
|  functions: sum(count(requestClientApplication))
|  limit: 5
|
30:EXCHANGE [UNPARTITIONED]
|
29:AGGREGATE [FINALIZE]
|  output: count:merge(requestClientApplication)
|  group by: requestClientApplication
|
28:EXCHANGE [HASH(requestClientApplication)]
|
02:AGGREGATE
|  output: count(requestClientApplication)
|  group by: requestClientApplication
|
01:SCAN HDFS [e8sec.web_proxy_parquet]
   partitions=39/44 files=39 size=1.20MB
"   predicates: STARTTIMEISO >= '2016-10-16T00:00:00.000Z', STARTTIMEISO < '2016-10-29T00:00:00.000Z', NOT requestMethod IN ('POST', 'post', 'GET', 'get'), sourceNameOrIp IN ('192.168.12.18')"
Fetched 185 row(s) in 0.04s

Conclusion

It is certainly possible to fetch all the facets using plain old SQL and replicate the functionality that Solr supports out of the box. However, if the data needs to be computed the same way as Solr, then the performance suffers.

Clone this wiki locally