see LearningZq
examples use zq version 1.16.0 and jq version 1.7
Note: I started writing this as I prepped some examples for ZqSearchIsFirstClass, but it didn't turn out to offer any immediate “a-ha!” content. But it was still some work to get it together and I figure why not keep it around. The Zed docs have some more thorough examples of how to do this sort of thing.
Let's work with a large set of denormalized JSON. Download this version of this .json file of English films scraped from Wikipedia:
#!/usr/bin/env bash
echo "Downloading..."
curl "https://raw.githubusercontent.com/prust/wikipedia-movie-data/91ae721d/movies.json" \
>raw.json
echo "Flattening single array into separate lines:"
echo "- Processing to movies.zng..."
zq -f zng 'over this' raw.json >movies.zng
echo "- Processing to movies.json..."
jq -c '.[]' raw.json >movies.json
echo "Removing raw.json"
rm raw.json
The minor bit of processing is a simple transform from a single large array into files with each record or object in its own row. That will simplify both the zq and jq examples by not having to start each one with 'over this' or '.[]', respectively.
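To see what that flattening step does without downloading anything, here is a minimal sketch using jq on a tiny inline array. The two sample films are made up for illustration and are not rows from the real dataset.

```shell
# Hypothetical two-film sample standing in for raw.json (not real data).
raw='[{"title":"A","year":1900},{"title":"B","year":1901}]'

# .[] emits each element of the top-level array;
# -c prints each one compactly on its own line.
flattened=$(printf '%s\n' "$raw" | jq -c '.[]')
printf '%s\n' "$flattened"
```

Each object lands on its own line, which is the shape movies.json has after the script above runs.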
Let's find out if each row in this dataset is normalized, that is, whether every row has the same set of keys. zq has a specialized operator for this: sample. This doesn't seem too difficult with jq either. With its keys function, we can just gather up all the keys and find the unique arrays of them:
❯ jq -c --slurp 'map(keys) | sort | unique | .[]' movies.json
["cast","extract","genres","href","thumbnail","thumbnail_height",↩
"thumbnail_width","title","year"]
["cast","extract","genres","href","title","year"]
["cast","genres","href","title","year"]
["cast","genres","title","year"]
We did have to --slurp this back into an array after intentionally dumping it to file as separate lines, which is on me....
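The same keys approach can be tried on a few hand-written rows to see how it behaves. This is only a sketch: the three rows below are invented, with the third one listing its keys in a different order on purpose. Since jq's unique already returns a sorted array, the sketch leaves out the separate sort step.

```shell
# Hypothetical rows, one JSON object per line, in the style of movies.json.
rows='{"title":"A","year":1900,"cast":[],"genres":[]}
{"title":"B","year":1901,"cast":["X"],"genres":["Drama"],"href":"B"}
{"year":1902,"title":"C","genres":[],"cast":[]}'

# keys returns each object's key names sorted alphabetically, so key order
# in the input does not matter; unique sorts the collected arrays and
# drops the duplicates.
key_sets=$(printf '%s\n' "$rows" | jq -c --slurp 'map(keys) | unique | .[]')
printf '%s\n' "$key_sets"
```

Rows A and C collapse into one key set even though their keys appear in a different order, so three rows produce two distinct sets.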
But zq's sample leaves us with 18 “shapes”, as zq calls them, not 4. Hmmm.
❯ zq 'sample | collect(this) | len(this)' movies.zng
18
But zq does have a keys equivalent in the fields function, and it does return similar output to jq.
❯ zq 'sample | yield fields(this) | sort | uniq' movies.zng
[["title"],["year"],["cast"],["genres"]]
[["title"],["year"],["cast"],["genres"],["href"]]
[["title"],["year"],["cast"],["genres"],["href"],["extract"]]
[["title"],["year"],["cast"],["genres"],["href"],["extract"],↩
["thumbnail"],["thumbnail_width"],["thumbnail_height"]]
In addition to the fields function, zq has a typeof function, which infers types from the data it finds. This can give us a clue as to why sample is finding 18 different “shapes”:
❯ zq 'sample | typeof(this) | sort | head 4' movies.zng
<{title:string,year:int64,cast:[string],genres:[string]}>
<{title:string,year:int64,cast:[string],genres:[null]}>
<{title:string,year:int64,cast:[null],genres:[string]}>
<{title:string,year:int64,cast:[null],genres:[null]}>
In these first 4 results, the keys are all the same, but the types zq has run across vary. In one way this is more detail than we need to help us understand the variance in the data in our file, but in another it's great to know should we run into arrays of ints vs. strings down the road. For now though, let's just worry about keys.
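For comparison, a rough approximation of this shape-by-type view can be put together in jq. It is a sketch, not an equivalent of zq's typeof: jq only knows "number" rather than int64, and the rows below are invented. My assumption is that the [null] element types in the zq output come from empty arrays, which show up here as [].

```shell
# Hypothetical rows: same keys everywhere, but some arrays are empty.
rows='{"title":"A","year":1900,"cast":[],"genres":[]}
{"title":"B","year":1901,"cast":["X"],"genres":[]}
{"title":"C","year":1902,"cast":["Y","Z"],"genres":[]}'

# Replace every value with its jq type name; for arrays, list the
# distinct element types instead of just "array".
shapes=$(printf '%s\n' "$rows" | jq -c --slurp '
  map(map_values(if type == "array" then map(type) | unique else type end))
  | unique | .[]')
printf '%s\n' "$shapes"
```

All three rows share one key set, yet they produce two shapes, because an empty cast array and a populated one are described differently. That is the same effect that pushes the count past the number of key sets.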
(In a later post, I can share how I implemented a keys function in zq that matches jq.)
Alright, we know the data isn't uniform, but it's not too bad, and extract is the only field that's not always present that may have text data we want to search on.
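Before searching on extract, it may be worth knowing how many rows lack it. A minimal jq sketch, again on made-up rows rather than the real file:

```shell
# Hypothetical rows; only B carries an extract field.
rows='{"title":"A","year":1900,"cast":[],"genres":[]}
{"title":"B","year":2001,"cast":["X"],"genres":["Drama"],"href":"B","extract":"B is a 2001 drama."}
{"title":"C","year":2002,"cast":[],"genres":[],"href":"C"}'

# has("extract") is true only when the key exists on the object,
# so negating it keeps the rows that have no extract at all.
missing=$(printf '%s\n' "$rows" | jq --slurp 'map(select(has("extract") | not)) | length')
echo "rows without extract: $missing"
```

Pointing the same filter at movies.json instead of the inline rows would give the count for the whole dataset.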
...and this is where I started getting good enough examples for ZqSearchIsFirstClass and abandoned this post ... I should still clean it up I s'pose