cLabs: ZqSearchIsFirstClass

from LearningZq

Examples use zq version 1.16.0 and jq version 1.7.

I've made this the first post in this series, even though it's about the 5th one I've written, because it really seems to be the foundation of my jq → zq learning curve gotchas.

In service of making search easy, the search Dataflow Operator is not only an Implied Operator but it is the “primary” implied operator, the first one to be evaluated by the parser (see also ZqImpliedOperator).

Let's take it for a spin. For these examples, I'll be working with a dataset of English movies since 1900 scraped from Wikipedia, downloaded like so:

#!/usr/bin/env bash

echo "Downloading..."
curl "https://raw.githubusercontent.com/prust/wikipedia-movie-data/91ae721d/movies.json" \
  >raw.json

echo "Flattening single array into separate lines:"

echo "- Processing to movies.zng..."
zq -f zng 'over this' raw.json >movies.zng

echo "- Processing to movies.json..."
jq -c '.[]' raw.json >movies.json

echo "Removing raw.json"
rm raw.json

The processing extracts each object/record out of the array onto its own line, which saves, in many cases, the step of iterating over it. For zq, it also saves the data into a binary Zed format called ZNG (“zing”) that performs much better (see also ZqWhereDoYouGetTheseNames).

Kevin Bacon seems like a fairly prominent figure in Hollywood, let's go bacon hunting!

❯ zq 'Bacon' movies.zng
...
...[snip a bacon-ton of output]...
...
{title:"Leave the World Behind",year:2023...}

That's a lot of output. How about just titles in the last couple years?

❯ zq 'year >= 2022 | Bacon | yield this.title' movies.zng
"They/Them"
"One Way"
"Smile"
"Space Oddity"
"Leave the World Behind"

That's better.

But wait, “Smile” isn't quite right, is it?

❯ zq 'year >= 2022 | Bacon | ! Kevin | cut title, year, cast' movies.zng
{title:"Smile",year:2022,
 cast:["Sosie Bacon","Jessie T. Usher","Kyle Gallner",↩
       "Robin Weigert","Caitlin Stasey","Kal Penn","Rob Morgan"]}

Right! Sosie is Kevin's daughter.
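
If you think better in a general-purpose language, here is a rough Python model of what that last pipeline does: filter on year, keep records that mention "Bacon" anywhere, drop the ones that mention "Kevin", then project three fields. It is a sketch under assumptions, not how zq is implemented. `contains_term` and `query` are hypothetical names, the search is modeled as a plain case-sensitive substring test over string values (zq's real matching rules may differ), and every record except "Smile" is made up for illustration.

```python
from typing import Any, Dict, Iterable, Iterator


def contains_term(value: Any, term: str) -> bool:
    """Recursively look for `term` as a substring of any string inside `value`.

    Assumption: a simplified stand-in for a bare-word search, not zq's exact rules.
    """
    if isinstance(value, str):
        return term in value
    if isinstance(value, dict):
        return any(contains_term(v, term) for v in value.values())
    if isinstance(value, (list, tuple)):
        return any(contains_term(v, term) for v in value)
    return False  # numbers, booleans, None: no string to search


def query(records: Iterable[Dict[str, Any]]) -> Iterator[Dict[str, Any]]:
    """Model of: year >= 2022 | Bacon | ! Kevin | cut title, year, cast"""
    for rec in records:
        if rec.get("year", 0) < 2022:
            continue  # year >= 2022
        if not contains_term(rec, "Bacon"):
            continue  # Bacon
        if contains_term(rec, "Kevin"):
            continue  # ! Kevin
        # cut title, year, cast
        yield {k: rec[k] for k in ("title", "year", "cast") if k in rec}


RECORDS = [
    # The one real record, copied from the output above.
    {"title": "Smile", "year": 2022,
     "cast": ["Sosie Bacon", "Jessie T. Usher", "Kyle Gallner",
              "Robin Weigert", "Caitlin Stasey", "Kal Penn", "Rob Morgan"]},
    # Hypothetical records, invented for this sketch.
    {"title": "Made-Up Movie A", "year": 2023, "cast": ["Kevin Bacon", "Someone Else"]},
    {"title": "Made-Up Movie B", "year": 1931, "cast": ["Pat Bacon"]},
    {"title": "Made-Up Movie C", "year": 2022, "cast": ["Nobody Relevant"]},
]

for rec in query(RECORDS):
    print(rec["title"])  # prints: Smile
```

The thing to notice is how much of this the zq one-liner gets for free: the recursive walk over every field is exactly the part you never have to write.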

I wonder what other non-Kevin Bacon is in the data?

❯ zq 'Bacon | ! Kevin | cut title, year' movies.zng
...
...[snip another bacon-ton of output]...
...

That's a lot of olde-timey bacon -- can we summarize it?

❯ zq -j 'Bacon | ! Kevin | cut title, year
         | decade:=(year/10)*10 | count() by decade | sort decade' movies.zng
{"decade":1910,"count":7}
{"decade":1920,"count":19}
{"decade":1930,"count":43}
{"decade":1940,"count":31}
{"decade":1950,"count":13}
{"decade":2020,"count":1}

Who are these other Bacons?

'Bacon | ! Kevin | year < 1960 | cut title, cast' is too much output, but scanning it doesn't show many Bacon names at all. It must be in the extract field (unless “Bacon” is a genre? And why shouldn't it be?! mmmm, bacon)

First, the Bacons that are in title and cast:

❯ zq 'year < 1960 | cut title, cast | Bacon' movies.zng
{title:"The Fireman",cast:["Charlie Chaplin","Edna Purviance","Lloyd Bacon"]}
{title:"The Floorwalker",cast:["Charlie Chaplin","Edna Purviance","Eric Campbell","Lloyd Bacon"]}
{title:"The House of Intrigue",cast:["Mignon Anderson","Lloyd Bacon"]}
{title:"The Girl in the Rain",cast:["Anne Cornwall","Lloyd Bacon"]}
{title:"The Greater Profit",cast:["Edith Storey","Pell Trenton","Lloyd Bacon"]}
{title:"Hearts and Masks",cast:["Elinor Field","Lloyd Bacon","Francis McDonald"]}
{title:"Bringin' Home the Bacon",cast:["Jay Wilsey","Jean Arthur"]}
{title:"Branded Men",cast:["Ken Maynard","June Clyde","Irving Bacon"]}

...gives us Lloyd Bacon, Irving Bacon, and the 1924 movie Bringin’ Home the Bacon.

Now just Bacon from the extract field:

❯ zq 'year < 1960 | cut extract | Bacon
      | bacon:=regexp_replace(this.extract, /.*\W(\w+? Bacon).*/, "$1")
      | cut bacon | sort | uniq' movies.zng
{bacon:"Daskam Bacon"}
{bacon:"David Bacon"}
{bacon:"Irving Bacon"}
{bacon:"Lloyd Bacon"}
{bacon:"the Bacon"}

...with “the Bacon” being our previously found title:

❯ zq '"the Bacon" | cut title' movies.zng
{title:"Bringin' Home the Bacon"}
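
The regexp_replace step above is doing more work than it looks like, so here is a hedged Python sketch of the same pattern to show why the results come out as "Daskam Bacon" and "the Bacon". The greedy leading `.*` runs as far as it can, `\W` demands a non-word character, and the capture group grabs only the single word sitting right before " Bacon". Assumptions to keep in mind: `bacon_name` is a hypothetical helper, the sample sentences are invented rather than copied from the extract field, and Python's `re` is standing in for zq's regex engine, which may treat some character classes slightly differently.

```python
import re

# Same pattern as in the zq query, using Python's \1 in place of "$1".
PATTERN = re.compile(r".*\W(\w+? Bacon).*")


def bacon_name(extract: str) -> str:
    """Replace the whole string with the word immediately before ' Bacon'.

    If the pattern does not match, the input comes back unchanged.
    """
    return PATTERN.sub(r"\1", extract)


# Invented sample sentences, for illustration only.
samples = [
    "A film directed by Lloyd Bacon.",
    "Based on a story by Josephine Daskam Bacon.",
    "Bringin' Home the Bacon is a silent western.",
    "Lloyd Bacon directed it.",  # nothing before the name, so no match
]

for text in samples:
    print(repr(bacon_name(text)))
```

The last sample is the edge case worth knowing about: with no non-word character in front of the name, the pattern fails and the full sentence passes through untouched, which is one reason a sort and uniq over real data can still turn up oddities.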

Alrighty. I think I've had my fill. Good breakfast!

The main point here is to illustrate how simple many of these queries are to write. zq -- in addition to its many other features -- is designed to be a search tool, and because search is the primary Implied Operator, you don't need to type out search in contexts where the tool presumes that's what you meant.

But this sets up what was, for me, a fairly consistent gotcha in my zq learning curve. Click on through to ZqImpliedOperatorsCanTrickYou for more on that.

A couple final things for this post:

• There's a lot more tooling in zq for doing data exploration and normalization. The Zed docs offer a couple of great tutorials that go into more detail: Real-World GitHub Data & Zed and Schools Data.

• I think it'd be interesting to do a deeper dive comparison on searching in zq vs jq in a different post, but we can take a peek in that direction here before we sign off.

I'm sure you can do all of these same queries in jq, but even zq's basic search, which looks through all of the values of an object, doesn't seem to be as easy to write in jq. Here's what I've come up with so far:

❯ jq 'with_entries(select(.value | tostring | contains("Bacon")))
      | select(. != {})' movies.json

We can put that into a file and re-use it like this:

❯ cat search.jq
def search(term):
  with_entries(select(.value | tostring | contains(term))) |
    select(. != {});

❯ jq -L . 'include "search"; search("Bacon")' movies.json
...

But more time would need to be spent to do a more thorough side-by-side comparison.
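
As one small data point for that comparison, here is a Python model of what the jq filter above does. It is only a sketch: `jq_tostring` and `search` are hypothetical names, `json.dumps` with compact separators is my approximation of jq's tostring, and the record is the "Smile" entry from earlier in this post. The detail it surfaces is that with_entries(select(...)) keeps only the fields whose value matched, whereas the zq searches above return the whole record.

```python
import json
from typing import Any, Dict, Optional


def jq_tostring(value: Any) -> str:
    """Approximate jq's tostring: strings pass through, everything else becomes compact JSON."""
    if isinstance(value, str):
        return value
    return json.dumps(value, separators=(",", ":"), ensure_ascii=False)


def search(record: Dict[str, Any], term: str) -> Optional[Dict[str, Any]]:
    """Model of: with_entries(select(.value | tostring | contains(term))) | select(. != {})"""
    kept = {k: v for k, v in record.items() if term in jq_tostring(v)}
    return kept or None  # an empty object is dropped, as select(. != {}) does


smile = {
    "title": "Smile",
    "year": 2022,
    "cast": ["Sosie Bacon", "Jessie T. Usher", "Kyle Gallner",
             "Robin Weigert", "Caitlin Stasey", "Kal Penn", "Rob Morgan"],
}

print(search(smile, "Bacon"))  # only the cast field survives
print(search(smile, "Kevin"))  # None: nothing matched, record dropped
print(search(smile, "2022"))   # numbers match too, because of tostring
```

So the jq version answers "which fields mention Bacon?" rather than "which movies mention Bacon?". Getting the whole record back would mean reshaping the filter, which is part of why a proper side-by-side deserves its own post.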

Up: LearningZq | Next: ZqImpliedOperatorsCanTrickYou