ClWiki: Recent

Recently I wondered how zq would compare to jq on one of the most common things I have to do with AWS JSON results: dig out a resource's name from its Tags. This has got to be one of the most annoying decisions AWS has made across the board. I understand names can't be required, but burying the name in Tags instead of offering an optional top-level attribute seems ridiculous to me.

If you're not familiar with the structure, here's a (very) slimmed down version of an EC2 JSON result:

{
  "Reservations": [
    {
      "Groups": [],
      "Instances": [
        {
          "InstanceId": "i-1234567890abcdefa",
          "InstanceType": "t4g.medium",
          "Tags": [
            {
              "Key": "Name",
              "Value": "my-groovy-ec2"
            }
          ]
        },
        {
          "InstanceId": "i-9876543210abcdefa",
          "InstanceType": "t4g.medium",
          "Tags": [
            {
              "Key": "Name",
              "Value": "et-c-tu-brute"
            }
          ]
        }
      ]
    }
  ]
}

How can we pull the Name up from the Tags alongside its InstanceId? Here's a jq way to do it:

❯ aws ec2 describe-instances |
  jq -c -r '
    .Reservations[].Instances[]
    | {name: (.Tags[] | select(.Key == "Name") | .Value), id: .InstanceId}'

{"name":"my-groovy-ec2","id":"i-1234567890abcdefa"}
{"name":"et-c-tu-brute","id":"i-9876543210abcdefa"}
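One caveat worth sketching: the query above emits nothing for an instance that has no Name tag, because the inner select produces no value and so no object gets built. A minimal variation, assuming you'd rather see those instances with a null name, wraps the lookup in first(...) // null. The sample data and the names_with_fallback helper below are made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical sample: the second instance has no Name tag,
# and the third has no Tags array at all.
sample='{"Reservations":[{"Instances":[
  {"InstanceId":"i-aaa","Tags":[{"Key":"Name","Value":"my-groovy-ec2"}]},
  {"InstanceId":"i-bbb","Tags":[{"Key":"Env","Value":"prod"}]},
  {"InstanceId":"i-ccc"}]}]}'

# names_with_fallback is an illustrative helper, not part of jq or the AWS CLI.
names_with_fallback() {
  jq -c '
    .Reservations[].Instances[]
    # .Tags[]? tolerates a missing Tags array; first(...) // null
    # falls back to null when no Name tag is found.
    | {name: (first(.Tags[]? | select(.Key == "Name") | .Value) // null),
       id: .InstanceId}'
}

echo "$sample" | names_with_fallback
# {"name":"my-groovy-ec2","id":"i-aaa"}
# {"name":null,"id":"i-bbb"}
# {"name":null,"id":"i-ccc"}
```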

How do we do this with zq? First, let's add some commenting to the jq version so we're sorta tracking where we're at in the navigation of the original document as we go:

❯ aws ec2 describe-instances |
  jq -c -r '
  .Reservations[]   # for each object in the Reservations array...
    .Instances[]    #   for each object in the Res Instances array...
    | {
       name:        #     make a new object with a name key, and for its value...
                      #    drop down into the Res Instances Tags array...
             (        #    and find the Value of Name
              .Tags[] | select(.Key == "Name") | .Value
             ),
                    #     back up at the new object with the Instance object
                    #     grab the InstanceId as the value of id
       id:   .InstanceId
      }'

{"name":"my-groovy-ec2","id":"i-1234567890abcdefa"}
{"name":"et-c-tu-brute","id":"i-9876543210abcdefa"}

In zq, how do we do the parenthetical bit of dropping down into the Tags to retrieve the Name while still being able to execute something on the parent to grab the Id?

You can do this with a Lateral Subquery.

“Lateral subqueries provide a powerful means to apply a Zed query to each subsequence of values generated from an outer sequence of values. The inner query may be any Zed query and may refer to values from the outer sequence.”

# Lateral Subquery

❯ aws ec2 describe-instances |
  zq -z '
    over Reservations
    | over Instances
    | over Tags with obj=this => ( where Key=="Name"
                                   | {name: this.Value, id:obj.InstanceId} )' -

If we keep reading the docs there, it turns out there's a tighter way to do this, even better!

# Lateral Expression

❯ aws ec2 describe-instances |
  zq -z '
    over Reservations
    | over Instances
    | {name:(over Tags | where Key=="Name" | Value), id:InstanceId}' -

What?! Dataflow Operators inside an Expression?! I thought you said that was impossible? And I quote: “the pipe character is not valid in an expression” from ZqPipeCharacter. Well, I didn't lie ... I just hadn't ... learned about Lateral Expressions yet...

Lateral Expressions

“Lateral subqueries can also appear in expression context using the parenthesized form. Note that the parentheses disambiguate a lateral expression from a lateral dataflow operator.”

This is pretty tidy, and in this case it brings us about equal with jq.

Prev: ZqUserOperatorsNeedParens | Up: LearningZq

from LearningZq

examples use zq version 1.16.0 and jq version 1.7

I've made this the first post in this series, even though it's about the 5th one I've written, because it really seems to be the foundation of my jq → zq learning curve gotchyas.

In service of making search easy, the search Dataflow Operator is not only an Implied Operator but it is the “primary” implied operator, the first one to be evaluated by the parser (see also ZqImpliedOperator).

Let's take it for a spin. For these examples, I'll be working with a dataset of English movies since 1900 scraped from Wikipedia, downloaded like so:

#!/usr/bin/env bash

echo "Downloading..."
curl "https://raw.githubusercontent.com/prust/wikipedia-movie-data/91ae721d/movies.json" \
  >raw.json

echo "Flattening single array into separate lines:"

echo "- Processing to movies.zng..."
zq -f zng 'over this' raw.json >movies.zng

echo "- Processing to movies.json..."
jq -c '.[]' raw.json >movies.json

echo "Removing raw.json"
rm raw.json

The processing extracts each object/record out of the array onto its own line, which saves, in many cases, the step of iterating over it. For zq, it also saves the data into a binary Zed format called ZNG (“zing”) that performs much better (see also ZqWhereDoYouGetTheseNames).

Kevin Bacon seems like a fairly prominent figure in Hollywood, let's go bacon hunting!

❯ zq 'Bacon' movies.zng
...
...[snip a bacon-ton of output]...
...
{title:"Leave the World Behind",year:2023...}

That's a lot of output. How about just titles in the last couple years?

❯ zq 'year >= 2022 | Bacon | yield this.title' movies.zng
"They/Them"
"One Way"
"Smile"
"Space Oddity"
"Leave the World Behind"

That's better.

But wait, “Smile” isn't quite right, is it?

❯ zq 'year >= 2022 | Bacon | ! Kevin | cut title, year, cast' movies.zng
{title:"Smile",year:2022,
 cast:["Sosie Bacon","Jessie T. Usher","Kyle Gallner",↩
       "Robin Weigert","Caitlin Stasey","Kal Penn","Rob Morgan"]}

Right! Sosie is Kevin's daughter.

I wonder what other non-Kevin Bacon is in the data?

❯ zq 'Bacon | ! Kevin | cut title, year' movies.zng
...
...[snip another bacon-ton of output]...
...

That's a lot of olde-timey bacon -- can we summarize it?

❯ zq -j 'Bacon | ! Kevin | cut title, year
         | decade:=(year/10)*10 | count() by decade | sort decade' movies.zng
{"decade":1910,"count":7}
{"decade":1920,"count":19}
{"decade":1930,"count":43}
{"decade":1940,"count":31}
{"decade":1950,"count":13}
{"decade":2020,"count":1}
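For comparison, here is a rough jq sketch of the same decade rollup. jq division produces floating point results (1915/10 is 191.5), so this version subtracts the remainder instead of dividing and multiplying. The four sample records are invented stand-ins for movies.json, and decade_counts is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Invented sample records, one JSON object per line like movies.json.
sample='{"title":"A","year":1915}
{"title":"B","year":1918}
{"title":"C","year":1924}
{"title":"D","year":2022}'

# decade_counts is an illustrative helper name.
decade_counts() {
  # -s slurps the line-delimited records into one array so group_by can see them all.
  jq -s -c '
    map(.year - (.year % 10))      # 1918 -> 1910
    | group_by(.)                  # group_by also sorts the groups
    | map({decade: .[0], count: length})
    | .[]'
}

echo "$sample" | decade_counts
# {"decade":1910,"count":2}
# {"decade":1920,"count":1}
# {"decade":2020,"count":1}
```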

Who are these other Bacons?

'Bacon | ! Kevin | year < 1960 | cut title, cast' is too much output, but scanning it doesn't show many Bacon names at all. It must be in the extract field (unless “Bacon” is a genre? And why shouldn't it be?! mmmm, bacon)

First, the Bacons that are in title and cast:

❯ zq 'year < 1960 | cut title, cast | Bacon' movies.zng
{title:"The Fireman",cast:["Charlie Chaplin","Edna Purviance","Lloyd Bacon"]}
{title:"The Floorwalker",cast:["Charlie Chaplin","Edna Purviance","Eric Campbell","Lloyd Bacon"]}
{title:"The House of Intrigue",cast:["Mignon Anderson","Lloyd Bacon"]}
{title:"The Girl in the Rain",cast:["Anne Cornwall","Lloyd Bacon"]}
{title:"The Greater Profit",cast:["Edith Storey","Pell Trenton","Lloyd Bacon"]}
{title:"Hearts and Masks",cast:["Elinor Field","Lloyd Bacon","Francis McDonald"]}
{title:"Bringin' Home the Bacon",cast:["Jay Wilsey","Jean Arthur"]}
{title:"Branded Men",cast:["Ken Maynard","June Clyde","Irving Bacon"]}

...gives us Lloyd Bacon, Irving Bacon, and the 1924 movie Bringin’ Home the Bacon.

Now just Bacon from the extract field:

❯ zq 'year < 1960 | cut extract | Bacon
      | bacon:=regexp_replace(this.extract, /.*\W(\w+? Bacon).*/, "$1")
      | cut bacon | sort | uniq' movies.zng
{bacon:"Daskam Bacon"}
{bacon:"David Bacon"}
{bacon:"Irving Bacon"}
{bacon:"Lloyd Bacon"}
{bacon:"the Bacon"}

...with “the Bacon” being our previously found title:

❯ zq '"the Bacon" | cut title' movies.zng
{title:"Bringin' Home the Bacon"}

Alrighty. I think I've had my fill. Good breakfast!

The main point here is to illustrate how simple many of these queries are to write. zq, in addition to its many other features, is designed to be a search tool, and because search is the primary Implied Operator, you don't need to type out search in contexts where the tool presumes that's what you meant.

But this sets up what was, for me, a fairly consistent gotchya in my zq learning curve. Click on through to ZqImpliedOperatorsCanTrickYou for more on that.

A couple final things for this post:

• There's a lot more tooling in zq for doing data exploration and normalization. The Zed docs offer a couple of great tutorials that go into more detail: Real-World GitHub Data & Zed and Schools Data.

• I think it'd be interesting to do a deeper dive comparison on searching in zq vs jq in a different post, but we can take a peek in that direction here before we sign off.

I'm sure you can do all of these same queries in jq, but even a basic search through all of the values of an object doesn't seem to be as easy in jq as it is with the zq search operator. Here's what I've come up with so far:

❯ jq 'with_entries(select(.value | tostring | contains("Bacon")))
      | select(. != {})' movies.json

We can put that into a file and re-use it like this:

❯ cat search.jq
def search(term):
  with_entries(select(.value | tostring | contains(term))) |
    select(. != {});

❯ jq 'include "search"; search("Bacon")' movies.json
...
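One difference worth noting: with_entries keeps only the matching key/value pairs, so the filter above returns trimmed-down objects rather than whole records. If the goal is closer to the whole-record output the zq search examples show, a sketch along these lines selects the entire record when any top-level value contains the term. The search_whole_record helper and the second sample record are hypothetical.

```shell
#!/usr/bin/env bash
# First record is abbreviated from the Smile output above; the second is invented.
movies='{"title":"Smile","year":2022,"cast":["Sosie Bacon","Kyle Gallner"]}
{"title":"No Pork Here","year":2022,"cast":["A. Nobody"]}'

# search_whole_record is an illustrative helper name.
search_whole_record() {
  # any(.[]; ...) checks every top-level value; tostring renders arrays and
  # numbers as JSON text so contains can do a plain substring match.
  jq -c --arg term "$1" 'select(any(.[]; tostring | contains($term)))'
}

echo "$movies" | search_whole_record Bacon
# {"title":"Smile","year":2022,"cast":["Sosie Bacon","Kyle Gallner"]}
```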

But more time would need to be spent to do a more thorough side-by-side comparison.

Up: LearningZq | Next: ZqImpliedOperatorsCanTrickYou

from LearningZq

examples use zq version 1.16.0.

In zq you can define your own dataflow operators and functions. All functions need to be called with parens, even argumentless ones, whether built-in or user-defined.

Operators, however, are quirky in that built-in ones cannot be called with parens, but user-defined ones must be.

First off, when do I need to create a function vs. an operator?

The syntax for a user-defined function is:

func <id> ( [<param> [, <param> ...]] ) : ( <expr> )

“where <id> and <param> are identifiers and <expr> is an expression that may refer to parameters but not to runtime state such as this.”

The syntax for a user-defined operator is:

op <id> ( [<param> [, <param> ...]] ) : (
  <sequence>
)

“where <id> is the operator identifier, <param> are the [optional] parameters for the operator, and <sequence> is the chain of operators (e.g., operator | ...) where the operator does its work.” And it can refer to this.

To over-simplify:

User-defined Function	Can only call other functions. Cannot refer to `this`.
User-defined Operator	Can call operators and functions. Can refer to `this`.

Though it's more precise to say the body of a user-defined function is an Expression, and the body of a user-defined operator is a Sequence of Dataflow Operators.

Let's make a user-defined operator.

❯ zq -z \
'op add_ids(): (
  over this
  | yield {id:count(), ...this}
)

yield [{name:"lorem"},{name:"ipsum"}] | add_ids()'

{id:1(uint64),name:"lorem"}
{id:2(uint64),name:"ipsum"}

The user-defined operator add_ids will take an array of records and add an auto-incrementing id integer to each one.
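For folks arriving from jq, a rough equivalent might look like the sketch below. It leans on to_entries, which on an array yields each element alongside its zero-based index. The add_ids name is reused here only for illustration, and the ids come out as plain JSON numbers rather than uint64.

```shell
#!/usr/bin/env bash
# add_ids here is a shell wrapper around jq, named after the zq operator
# purely for illustration.
add_ids() {
  # to_entries on an array gives {key: <index>, value: <element>};
  # merge a 1-based id in front of each original record.
  jq -c 'to_entries | map({id: (.key + 1)} + .value) | .[]'
}

echo '[{"name":"lorem"},{"name":"ipsum"}]' | add_ids
# {"id":1,"name":"lorem"}
# {"id":2,"name":"ipsum"}
```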

There are some other cool zq things going on here, but for now we'll just focus on calling the operator with parens. Call it without them and, again, it falls through to the search ImpliedOperator:

❯ zq -z \
'op add_ids(): (
  over this
  | yield {id:count(), ...this}
)

yield [{name:"lorem"},{name:"ipsum"}] | add_ids'

❯

add_ids without parens is interpreted by zq as search add_ids and the string “add_ids” can't be found in the array of two records.

Prev: ZqFunctionsNeedParens | Up: LearningZq | Next: ZqLateralSubquery

from LearningZq

examples use zq version 1.16.0 and jq version 1.7

Knowing the difference between an operator and a function in Zed can be confusing to me because there's not a hard distinction like that in jq (not to mention “operators” in jq docs refer to things like + and -, and zq “dataflow operators” are things like over, yield, and put).

In addition, functions in zq always need to be called with parentheses, even if they take no arguments (like the now function). But in jq, argumentless functions receive their input via the pipe | operator and don't have parens, like Dataflow Operators in zq.

❯ echo "[1,2,3]" | jq 'length'
3

❯ echo "[1,2,3]" | jq 'length()'
jq: error: syntax error, unexpected ')' ... at <top-level>, line 1:
length()
jq: 1 compile error
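The same convention applies to functions you define yourself in jq. As a small sketch (double and add_n are made-up names), a zero-argument function is defined and called with no parens at all, while one that takes an argument gets them:

```shell
#!/usr/bin/env bash
# double_then_add is an illustrative wrapper; double and add_n are
# hypothetical jq functions defined inline.
double_then_add() {
  jq -c 'def double: . * 2;        # no parameters: no parens, input arrives via the pipe
         def add_n(n): . + n;      # one parameter: parens required
         map(double | add_n(10))'
}

echo '[1,2,3]' | double_then_add
# [12,14,16]
```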

While I don't have any advice on how to “just know” when a built-in zq “thing” is a dataflow operator vs. a function (I generally have to look all those things up anyway with either tool) the rule to keep in mind in zq is if it's a function, it's gotta have parens.

❯ echo '{a:null}' | zq 'now() | {a:this}' -
{a:2024-07-12T14:25:24.70848Z}

If it doesn't have parens then, in many cases, it won't even error out; it'll fall back to the search implied operator (see ZqImpliedOperator).

❯ echo '{a:null}' | zq 'now | {a:this}' -

❯

Prev: ZqPipeCharacter | Up: LearningZq | Next: ZqUserOperatorsNeedParens

from LearningZq

examples use zq version 1.16.0 and jq version 1.7

A subset of zq operators are Implied Operators, which means “Zed allows certain operator names to be optionally omitted when they can be inferred from context.” The operators are evaluated in this order:

Evaluation	Operator	Implied	Explicit
search expression	`search`	`foo`	`search foo`
boolean expression	`where`	`a >= 1`	`where a >= 1`
field assignment	`put`	`a:=x+1,b:=y-1`	`put a:=x+1,b:=y-1`
aggregation	`summarize`	`count()`	`summarize count()`
expression	`yield`	`{a:x+1,b:y-1}`	`yield {a:x+1,b:y-1}`

Note: The -C flag can be passed to zq to output the parsed query with explicit operators.

In many contexts this is really helpful (see ZqSearchIsFirstClass), but as I've been learning zq, it's been confusing at times.

I was experimenting with the example data found in this Brim Data article and put together this query:

❯ zq 'over docs
      | has(author_name)
      | grep(/tuta/, author_name)
      | yield author_key' openlibrary.json
["OL369643A"]
❯

Browsing the docs for some functions, I decided to try out the lower function, and as is my wont given my jq experience, piped our previous output to lower. But this didn't work, and I got no output:

❯ zq 'over docs
    | has(author_name)
    | grep(/tuta/, author_name)
    | yield author_key | over this | lower' openlibrary.json
❯

I realized I made an error in how to call the function (see ZqFunctionsNeedParens and ZqPipeCharacter), and fixed it accordingly:

❯ zq 'over docs
    | has(author_name)
    | grep(/tuta/, author_name)
    | yield author_key | over this | lower(this)' openlibrary.json
"ol369643a"
❯

I'm aware that I wrote both the has and grep functions correctly, but I think because lower doesn't require an additional argument, I just fell into a jq habit.

But I am still curious. Why didn't I get an error or something letting me know I wasn't using lower properly? If I make a mistake by passing a non-string into it, I'll get an error:

❯ zq 'lower(1)'
error({message:"lower: string arg required",on:1})

So, why no error here?

❯ zq 'yield "HEY" | lower'

❯

Our friend the -C flag has the answer:

❯ zq -C 'yield "HEY" | lower'
yield "HEY"
| search lower

Ahh. The real reason this returns nothing, not even an error, is search is now the implied operator for the term ‘lower’! It's not parsed as a built-in function because it's also a valid search expression, and because ZqSearchIsFirstClass, which is a good thing in search contexts, that's given the priority.

While this example is a bit contrived, hopefully it highlights a frequent experience for me while learning zq. Remember, if you make a change, and get no output, an Implied Operator is probably in play.

Prev: ZqSearchIsFirstClass | Up: LearningZq | Next: ZqPipeCharacter

from LearningZq

examples use zq version 1.16.0 and jq version 1.7

The pipe character appears quite frequently in both jq and zq. Here's an example of an equivalent query in both:

jq -c '[.docs[]
        | {title, author_name: .author_name[0], publish_year: .publish_year[0]}
        | select(.author_name!=null and .publish_year!=null)
       ]
       | group_by(.author_name)
       | [.[] | {author_name: .[0].author_name, count: . | length}]
       | sort_by(.count) | reverse | limit(3;.[])' openlibrary.json

zq -j 'over docs
       | {title, author_name: author_name[0], publish_year: publish_year[0]}
       | has(author_name) and has(publish_year)
       | count() by author_name | sort -r count | head 3' openlibrary.json

(Examples taken from this Brim Data article comparing zq and jq).

In jq every program is a series of filters separated by the pipe operator. There are a few places the pipe operator cannot be used, but it's fairly ubiquitous.

zq has a more elaborate and structured syntax. At its highest level, one or more dataflow operators are joined together in a sequence with the pipe character (sequence overview).

operator | operator | ...

But within operators, syntax varies and the pipe character isn't used. Field references and expressions are common, and most functions receive arguments passed in parentheses. Function outputs can only be passed to other functions as nested calls, not via the pipe character.

In this example, the who value is replaced with “me” whenever it's “chrismo”:

❯ zq -j '[{who:"bob"}, {who:"chrismo"}]
         | over this
         | put who:=replace(who, "chrismo", "me")'
{"who":"bob"}
{"who":"me"}

put is an operator that sets a field to the value of an expression. replace is a function taking three string arguments.

If not all records have a who field, we can use the coalesce function to return an empty string if who is missing, but we cannot use the pipe character like we could in jq between two functions:

❯ zq -j '[{}, {who:"bob"}, {who:"chrismo"}]
         | over this
         | put who:=((coalesce(who, "") | replace(who, "chrismo", "me"))'

zq: error parsing Zed at line 3, column 39:
       | put who:=((coalesce(who, "") | replace(who, "chrismo", "me"))
                                  === ^ ===

The syntax for put is

[put] <field>:=<expr>

and the pipe character is not valid in an expression**. This can be accomplished by nesting the function calls:

❯ zq -j '[{}, {who:"bob"}, {who:"chrismo"}]
         | over this
         | put who:=replace(coalesce(who, ""), "chrismo", "me")'
{"who":""}
{"who":"bob"}
{"who":"me"}

**well ... there is an exception to this, but that's covered in ZqLateralSubquery.
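For contrast, here's a sketch of the jq habit that doesn't carry over: piping one function's output into the next inside an assignment. It uses split and join for the substring replacement so it doesn't depend on jq's regex support; replace_who is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# replace_who is an illustrative helper name.
replace_who() {
  # Inside the parens, each | feeds the previous result to the next filter:
  # default a missing who to "", then swap every "chrismo" for "me".
  jq -c '.[] | .who = (.who // "" | split("chrismo") | join("me"))'
}

echo '[{}, {"who":"bob"}, {"who":"chrismo"}]' | replace_who
# {"who":""}
# {"who":"bob"}
# {"who":"me"}
```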

Prev: ZqImpliedOperatorsCanTrickYou | Up: LearningZq | Next: ZqFunctionsNeedParens

Over the past couple of years in my ops work, I've built up a fair amount of jq code. Recently, I've been checking out zq and evaluating it as a jq replacement.

As I've been learning zq in more detail, while it looks similar on the surface, it's pretty different under the hood and I've been tripped up by a few things along the way. The goal with these posts is to try and flatten the learning curve for folks also coming over from jq.

For a more thorough introduction, the Brim team has a wonderful zq tutorial that helps introduce zq to folks familiar with jq. I also want to thank Phil Rzewski from the Brim team, who's been incredibly helpful, both in their support Slack and in the GitHub repo, fielding my questions patiently and thoughtfully.

While I've tried to be accurate in what is written here, these posts are not designed to be a comprehensive comparison of the two tools. I'd love any feedback, corrections, or clarifications you have. Check out my ContactInfo for how to reach me.

Post	tl;dr
ZqSearchIsFirstClass	Searching is an integral part of aggregation and transformation, but compared to `jq`, `zq` search is first class.
ZqImpliedOperatorsCanTrickYou	Did you make a change and now there's no output? It's probably the `search` implied operator hiding the fact that you tried to pipe something to a function like you would in `jq`, but that's not how it works. See previous post.
ZqPipeCharacter	You can't use `|` everywhere, esp. not in or between Expressions. Only in-between Dataflow Operators.
ZqFunctionsNeedParens	Functions always have to be called with parens. If you don't, the `search` implied operator will getchya.
ZqUserOperatorsNeedParens	While built-in dataflow operators do not take parens, user-defined ones must. If you don't, (all together now) the `search` implied operator will getchya.
ZqLateralSubquery	How to drop down a level of JSON while parsing.

So, with all of these quirks coming over from jq, why bother? In addition to wrangling JSON data, zq has a host of additional features that make it not just a decent replacement of jq, but a solid upgrade.

It performs better and streams data by default (jq can too with a special option); it supports multiple input & output formats, from CSV to Parquet and more (including grokking log files a la Logstash); it can naturally join data together in a relational fashion from either files or its own Data Lake; and it supports a form of gradual typing of your data with Shaping.

Inspired by LawsOfUX and this tweet by @viktorcessan, a collection of Laws all managers should know. Some are focused on technology work, but some apply broadly. (As I continued to assemble the list, the categorization of it all is a bit sprawly, but, let's live a little).

Brooks's Law - Adding people to a late software project makes it later.

Lewin's Equation - B = f(P, E). An individual’s behavior (B) is a function (f) of the person (P), including their history, personality and motivation, and their environment (E), which includes both their physical and social surroundings.

Conway's Law - Any organization that designs a system will produce a design whose structure is a copy of the organization's communication structure.

Goodhart's Law - Any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes.

Hyrum's Law - With a sufficient number of users of an API all observable behaviors of your system will be depended on by somebody.

Gall's Law - A complex system that works is invariably found to have evolved from a simple system that worked. The inverse proposition also appears to be true: A complex system designed from scratch never works and cannot be made to work. You have to start over, beginning with a working simple system.

McNamara Fallacy - The first step is to measure whatever can be easily measured. This is OK as far as it goes. The second step is to disregard that which can't be easily measured or to give it an arbitrary quantitative value. This is artificial and misleading. The third step is to presume that what can't be measured easily really isn't important. This is blindness. The fourth step is to say that what can't be easily measured really doesn't exist. This is suicide.

Seven Deadly Diseases of Management - Lack of constancy of purpose; Emphasis on short-term profits; Annual performance reviews; Manager job hopping within an org; Relying solely on measurable metrics (“the most important figures that one needs for management are unknown or unknowable”); Excessive medical costs; Excessive liability costs.

deming.org - Every system is perfectly designed to get the results it gets.

85/15 Rule - So ... I can't actually trace this back to any reliable source...

Psychological Safety - “Drive out fear, so that everyone may work effectively for the company.” -- Deming

Usability Tests only need 5 Users

Laws of Cooperation - A long list of laws assembled by @bartlog.

https://lawsofux.com/

These are great. Some of my favs:

Aesthetic Usability Effect - Users often perceive aesthetically pleasing design as design that’s more usable.

Hick's Law - The time it takes to make a decision increases with the number and complexity of choices.

Jakob's Law - Users spend most of their time on other sites. This means that users prefer your site to work the same way as all the other sites they already know.

Peak-End Rule - People judge an experience largely based on how they felt at its peak and at its end, rather than the total sum or average of every moment of the experience.

Postel's Law - Be liberal in what you accept, and conservative in what you send.

Zeigarnik Effect - People remember uncompleted or interrupted tasks better than completed tasks.

“The value of a company is the sum of all problems solved.”

The last season of Startup, Gimlet's first podcast, discusses their acquisition by Spotify, and it has some interesting behind-the-scenes stuff (maybe not as much as I'd like, but ...)

One of the ongoing themes is the tension between Alex and Matt, the co-founders of Gimlet. Alex cares more about the quality of the product and Matt cares more about the business needs, and Gimlet was struggling, even in the quarter preceding the acquisition. Neither of them was handling the conflict very well.

In this episode, Thanksgiving in Stockholm, Alex is talking with Daniel Ek, CEO of Spotify, after the acquisition has happened to reflect on prior conversations they'd had before.

[36:28] Daniel Ek, Spotify co-founder: It's actually my co-founder's [Martin Lorentzon] saying. He said this thing, I'm not even sure he was aware that he coined what I think is an iconic quote. He said, “the value of a company is the sum of all problems solved.” Even to this day, it's one of those things that I think about. You may think about all the things you guys went through as all the issues that you went through, but you solved them, one by one. And I think the most important thing that you got right is the integrity of the programming and the shows that you make. At the end of the day, that is the value that you're bringing to that and bringing to consumers and it really served you well in the end.

Then Alex sums up his thoughts in a way I don't quite agree with.

[37:20] Alex Blumberg: In other words, Matt's and mine constant fighting had produced something valuable, the fighting itself in fact was the thing that made it better. If I hadn't cared about what I cared about and he hadn't cared about what he cared about and we hadn't each cared enough to fight with each other, the company we built, it wouldn't have worked as well.

It's the caring that matters, not the fighting. It's possible to combine caring with good conflict resolution skills and minimize the fighting. Then things are even better.


cLabs