cLabs: ZqDataDiscovery

see LearningZq

examples use zq version 1.16.0 and jq version 1.7

Note: I started writing this as I prepped some examples for ZqSearchIsFirstClass, but it didn't turn out to offer any immediate “a-ha!” content. But it was still some work to get it together and I figure why not keep it around. The Zed docs have some more thorough examples of how to do this sort of thing.

Let's work with a large set of denormalized JSON. The script below downloads a pinned version of a .json file of English films scraped from Wikipedia:

#!/usr/bin/env bash
# Stop on the first error so a failed download isn't fed to zq and jq.
set -euo pipefail

echo "Downloading..."
curl --fail "https://raw.githubusercontent.com/prust/wikipedia-movie-data/91ae721d/movies.json" \
  >raw.json

echo "Flattening single array into separate lines:"

echo "- Processing to movies.zng..."
zq -f zng 'over this' raw.json >movies.zng

echo "- Processing to movies.json..."
jq -c '.[]' raw.json >movies.json

echo "Removing raw.json"
rm raw.json

The minor bit of processing is a simple transform from a single large array into files with one record (or object) per line. That simplifies both the zq and jq examples, since they don't each have to start with over this or .[] respectively.
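If it helps to see that flattening step without either tool, here is a minimal Python sketch of roughly what jq -c '.[]' does to the file. The function name array_to_lines and the two-row sample are made up for illustration; this is not part of the script above.

```python
import json

def array_to_lines(raw_text):
    """Turn one JSON array into newline-delimited, compact JSON values."""
    records = json.loads(raw_text)
    # Compact separators mimic jq's -c output; ensure_ascii=False keeps
    # non-ASCII titles readable instead of \u-escaping them.
    return "\n".join(
        json.dumps(record, separators=(",", ":"), ensure_ascii=False)
        for record in records
    )

# Made-up two-row sample, not taken from the real dataset.
sample = '[{"title": "A", "year": 1900}, {"title": "B", "year": 1901}]'
print(array_to_lines(sample))
# {"title":"A","year":1900}
# {"title":"B","year":1901}
```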

Let's find out whether the rows in this dataset are normalized, that is, whether they all have the same set of keys. zq has a specialized operator for this: sample.

This doesn't seem too difficult with jq: with its keys function, we can gather up each row's keys and find the unique arrays of them:

❯ jq -c --slurp 'map(keys) | sort | unique | .[]' movies.json
["cast","extract","genres","href","thumbnail","thumbnail_height",↩
 "thumbnail_width","title","year"]
["cast","extract","genres","href","title","year"]
["cast","genres","href","title","year"]
["cast","genres","title","year"]

We did have to --slurp this back into an array after intentionally dumping it to a file as separate lines, which is on me...
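For anyone who wants the logic of that jq pipeline spelled out, here is a rough Python equivalent. It's a sketch under assumptions: unique_key_sets is a hypothetical helper, and the three rows are invented stand-ins for lines of movies.json.

```python
import json

def unique_key_sets(lines):
    """Return the distinct sorted key lists found across JSON lines."""
    seen = set()
    for line in lines:
        if line.strip():
            # Sorting the keys makes field order irrelevant, like jq's keys.
            seen.add(tuple(sorted(json.loads(line))))
    return [list(keys) for keys in sorted(seen)]

# Invented rows; the second has the same keys as the first in another order.
rows = [
    '{"title":"A","year":1900,"cast":[],"genres":[]}',
    '{"year":1901,"title":"B","genres":[],"cast":[]}',
    '{"title":"C","year":1902,"cast":[],"genres":[],"href":"C_(film)"}',
]
for keys in unique_key_sets(rows):
    print(json.dumps(keys, separators=(",", ":")))
# ["cast","genres","href","title","year"]
# ["cast","genres","title","year"]
```

Since the helper takes any iterable of lines, it can read the line-per-record file directly (for example an open file handle on movies.json), so no slurping step is needed.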

But zq's sample leaves us with 18 “shapes”, as zq calls them, not 4. Hmmm.

❯ zq 'sample | collect(this) | len(this)' movies.zng
18

But zq does have a keys equivalent in the fields function, and it does return similar output to jq.

❯ zq 'sample | yield fields(this) | sort | uniq' movies.zng
[["title"],["year"],["cast"],["genres"]]
[["title"],["year"],["cast"],["genres"],["href"]]
[["title"],["year"],["cast"],["genres"],["href"],["extract"]]
[["title"],["year"],["cast"],["genres"],["href"],["extract"],↩
 ["thumbnail"],["thumbnail_width"],["thumbnail_height"]]
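Each field comes back wrapped in its own array because, as far as I can tell, fields returns paths rather than bare names, so a nested field can be spelled out as a list of steps. Here's a small Python sketch of that idea; field_paths and the nested thumb record are hypothetical, and this is my reading of the behavior, not zq's implementation.

```python
def field_paths(record, prefix=()):
    """List the path to every leaf field, descending into nested records."""
    paths = []
    for name, value in record.items():
        path = prefix + (name,)
        if isinstance(value, dict):
            # A nested record contributes one path per leaf field.
            paths.extend(field_paths(value, path))
        else:
            paths.append(list(path))
    return paths

# Flat record, like the rows in this dataset.
print(field_paths({"title": "A", "year": 1900, "cast": []}))
# [['title'], ['year'], ['cast']]

# Hypothetical nested record, to show why a path needs to be an array.
print(field_paths({"title": "A", "thumb": {"w": 1, "h": 2}}))
# [['title'], ['thumb', 'w'], ['thumb', 'h']]
```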

In addition to the fields function, zq has a typeof function, which infers types from the data it finds. This can give us a clue as to why sample is finding 18 different “shapes”:

❯ zq 'sample | typeof(this) | sort | head 4' movies.zng
<{title:string,year:int64,cast:[string],genres:[string]}>
<{title:string,year:int64,cast:[string],genres:[null]}>
<{title:string,year:int64,cast:[null],genres:[string]}>
<{title:string,year:int64,cast:[null],genres:[null]}>

In these first 4 results, the keys are all the same, but the types zq has run across vary. In one way this is more detail than we need to help us understand the variance in the data in our file, but in another it's great to know should we run into arrays of ints vs. strings down the road. For now though, let's just worry about keys.
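To make the keys-versus-shapes difference concrete, here is a rough Python sketch that builds a zq-looking type signature for a few invented rows. The shape helper is hypothetical and only approximates what typeof prints; in particular, treating an empty array as [null] is my assumption about where those [null] types come from.

```python
def shape(value):
    """Build a rough, zq-looking type signature for a decoded JSON value."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "bool"
    if isinstance(value, int):
        return "int64"
    if isinstance(value, float):
        return "float64"
    if isinstance(value, str):
        return "string"
    if value is None:
        return "null"
    if isinstance(value, list):
        # Assumption: an empty array has no element type, so call it [null].
        inner = sorted({shape(v) for v in value}) or ["null"]
        return "[" + ",".join(inner) + "]"
    return "{" + ",".join(f"{k}:{shape(v)}" for k, v in value.items()) + "}"

# Invented rows: identical keys, but some arrays are empty.
rows = [
    {"title": "A", "year": 1900, "cast": ["x"], "genres": ["drama"]},
    {"title": "B", "year": 1901, "cast": [], "genres": ["drama"]},
    {"title": "C", "year": 1902, "cast": [], "genres": []},
]
key_sets = {tuple(row) for row in rows}
shapes = {shape(row) for row in rows}
print(len(key_sets), len(shapes))
# 1 3
```

One set of keys fans out into three shapes here, which is the same effect that turns 4 key sets into 18 shapes in the real file.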

(In a later post, I can share how I implemented a keys function in zq that matches jq.)

Alright, we know the data isn't uniform, but it's not too bad, and extract is the only sometimes-missing field that may hold text data we want to search on.

...and this is where I started getting good enough examples for ZqSearchIsFirstClass and abandoned this post ... I should still clean it up I s'pose