ZqSearchIsFirstClass
from LearningZq

examples use zq version 1.16.0 and jq version 1.7

I've made this the first post in this series, even though it's about the 5th one I've written, because it really seems to be the foundation of my jqzq learning curve gotchyas.

In service of making search easy, the search Dataflow Operator is not only an Implied Operator but it is the “primary” implied operator, the first one to be evaluated by the parser (see also ZqImpliedOperator).

Let's take it for a spin. For these examples, I'll be working with a dataset of English movies since 1900 scraped from Wikipedia, downloaded like so:
#!/usr/bin/env bash

echo "Downloading..."
curl "https://raw.githubusercontent.com/prust/wikipedia-movie-data/91ae721d/movies.json" \
>raw.json

echo "Flattening single array into separate lines:"

echo "- Processing to movies.zng..."
zq -f zng 'over this' raw.json >movies.zng

echo "- Processing to movies.json..."
jq -c '.[]' raw.json >movies.json

echo "Removing raw.json"
rm raw.json

The processing extracts each object/record out of the array onto its own line, which saves, in many cases, the step of iterating over it. For zq, it also saves the data into a binary Zed format called ZNG (“zing”) that performs much better (see also ZqWhereDidYouGetAllTheseNames).

Kevin Bacon seems like a fairly prominent figure in Hollywood, so let's go bacon hunting!
❯ zq 'Bacon' movies.zng
...
...[snip a bacon-ton of output]...
...
{title:"Leave the World Behind",year:2023...}

That's a lot of output. How about just titles in the last couple years?
❯ zq 'year >= 2022 | Bacon | yield this.title' movies.zng
"They/Them"
"One Way"
"Smile"
"Space Oddity"
"Leave the World Behind"

That's better.

But wait, “Smile” isn't quite right, is it?
❯ zq 'year >= 2022 | Bacon | ! Kevin | cut title, year, cast' movies.zng
{title:"Smile",year:2022,
cast:["Sosie Bacon","Jessie T. Usher","Kyle Gallner",↩
"Robin Weigert","Caitlin Stasey","Kal Penn","Rob Morgan"]}


Right! Sosie is Kevin's daughter.

I wonder what other non-Kevin Bacon is in the data?
❯ zq 'Bacon | ! Kevin | cut title, year' movies.zng
...
...[snip another bacon-ton of output]...
...

That's a lot of olde-timey bacon -- can we summarize it?
❯ zq -j 'Bacon | ! Kevin | cut title, year
| decade:=(year/10)*10 | count() by decade | sort decade' movies.zng
{"decade":1910,"count":7}
{"decade":1920,"count":19}
{"decade":1930,"count":43}
{"decade":1940,"count":31}
{"decade":1950,"count":13}
{"decade":2020,"count":1}
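
The `decade:=(year/10)*10` step appears to rely on integer division: dividing an integer year by 10 drops the last digit, and multiplying by 10 puts a zero back in its place. As a rough illustration of the same bucketing outside of zq, here is a minimal Python sketch. The sample years are made up for the example and are not taken from the movie dataset.

```python
from collections import Counter


def decade(year: int) -> int:
    # Floor-divide to drop the last digit, then scale back up: 1924 -> 1920.
    return (year // 10) * 10


# Made-up sample years, purely for illustration (not from movies.zng).
years = [1916, 1919, 1924, 1924, 1931, 2022]

# Count how many sample years fall into each decade, sorted by decade.
counts = Counter(decade(y) for y in years)
for d in sorted(counts):
    print({"decade": d, "count": counts[d]})
```

Note that Python needs the explicit `//` operator to get this behavior, since `/` would produce a float.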

Who are these other Bacons?

'Bacon | ! Kevin | year < 1960 | cut title, cast' is too much output, but scanning it doesn't show many Bacon names at all. They must be in the extract field (unless “Bacon” is a genre? And why shouldn't it be?! mmmm, bacon)

First, the Bacons that are in title and cast:
❯ zq 'year < 1960 | cut title, cast | Bacon' movies.zng
{title:"The Fireman",cast:["Charlie Chaplin","Edna Purviance","Lloyd Bacon"]}
{title:"The Floorwalker",cast:["Charlie Chaplin","Edna Purviance","Eric Campbell","Lloyd Bacon"]}
{title:"The House of Intrigue",cast:["Mignon Anderson","Lloyd Bacon"]}
{title:"The Girl in the Rain",cast:["Anne Cornwall","Lloyd Bacon"]}
{title:"The Greater Profit",cast:["Edith Storey","Pell Trenton","Lloyd Bacon"]}
{title:"Hearts and Masks",cast:["Elinor Field","Lloyd Bacon","Francis McDonald"]}
{title:"Bringin' Home the Bacon",cast:["Jay Wilsey","Jean Arthur"]}
{title:"Branded Men",cast:["Ken Maynard","June Clyde","Irving Bacon"]}

...gives us Lloyd Bacon, Irving Bacon, and the 1924 movie Bringin’ Home the Bacon.

Now just Bacon from the extract field:
❯ zq 'year < 1960 | cut extract | Bacon
| bacon:=regexp_replace(this.extract, /.*\W(\w+? Bacon).*/, "$1")
| cut bacon | sort | uniq' movies.zng
{bacon:"Daskam Bacon"}
{bacon:"David Bacon"}
{bacon:"Irving Bacon"}
{bacon:"Lloyd Bacon"}
{bacon:"the Bacon"}

...with “the Bacon” being our previously found title:
❯ zq '"the Bacon" | cut title' movies.zng
{title:"Bringin' Home the Bacon"}
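
The regular expression in the `regexp_replace` call above does the heavy lifting: a greedy `.*` skips ahead, `\W` requires a non-word character, and `(\w+? Bacon)` captures the single word directly in front of “Bacon”. Here is a minimal Python sketch of the same pattern using the standard `re` module. The sample sentences are made up to stand in for the extract field, and Python's regex engine is not the one zq uses, so treat this as an illustration of the pattern rather than of zq itself.

```python
import re

# Same pattern as in the zq query: skip ahead greedily, then capture the
# one word that directly precedes " Bacon".
PATTERN = re.compile(r".*\W(\w+? Bacon).*")


def extract_bacon(text: str) -> str:
    # Replace the whole matched line with just the captured group.
    # Text that does not match the pattern is returned unchanged.
    return PATTERN.sub(r"\1", text)


# Made-up sentences standing in for the dataset's extract field.
samples = [
    "A comedy directed by Lloyd Bacon for the studio.",
    "The 1924 movie Bringin' Home the Bacon.",
]
for sample in samples:
    print(extract_bacon(sample))
```

The second sample shows why “the Bacon” turns up in the results: the pattern has no way to tell a first name from any other word sitting in front of “Bacon”.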


Alrighty. I think I've had my fill. Good breakfast!


The main point here is to illustrate how simple many of these queries are to write. zq, in addition to its many other features, is designed to be a search tool, and search is the primary Implied Operator, so you don't need to type out search in contexts where the tool presumes that's what you meant.

But this sets up what was, for me, a fairly consistent gotchya in my zq learning curve. Click on through to ZqImpliedOperatorsCanTrickYou for more on that.

A couple final things for this post:

• There's a lot more tooling in zq for doing data exploration and normalization. The Zed docs offer a couple of great tutorials that go into more detail: Real-World GitHub Data & Zed and Schools Data.

• I think it'd be interesting to do a deeper dive comparison on searching in zq vs jq in a different post, but we can take a peek in that direction here before we sign off.

I'm sure you can do all of these same queries in jq, but even a basic search through all of the values of an object, which zq's search operator gives you for free, doesn't seem to be as easy in jq. Here's what I've come up with so far:
❯ jq 'with_entries(select(.value | tostring | contains("Bacon")))
| select(. != {})' movies.json

We can put that into a file and re-use it like this:
❯ cat search.jq
def search(term):
with_entries(select(.value | tostring | contains(term))) |
select(. != {});

❯ jq 'include "search"; search("Bacon")' movies.json
...


But more time would need to be spent to do a more thorough side-by-side comparison.
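
For anyone who finds the jq filter above hard to read, here is a minimal Python sketch of what I understand it to be doing: keep only the entries of each object whose stringified value contains the term, then drop objects that end up empty. The function name `search` mirrors the jq definition, and the two sample records are hypothetical, shaped like the movie data but not copied from it.

```python
import json


def search(record: dict, term: str) -> dict:
    """Keep only the entries whose stringified value contains term."""

    def to_string(value):
        # Roughly mirror jq's tostring: strings pass through unchanged,
        # everything else is JSON-encoded.
        if isinstance(value, str):
            return value
        return json.dumps(value, ensure_ascii=False)

    return {k: v for k, v in record.items() if term in to_string(v)}


# Hypothetical records shaped like the movie data.
movies = [
    {"title": "Smile", "year": 2022, "cast": ["Sosie Bacon", "Kal Penn"]},
    {"title": "Some Other Film", "year": 2022, "cast": ["Someone Else"]},
]

# Drop records with no matching entries, like `select(. != {})` in jq.
hits = [m for m in (search(r, "Bacon") for r in movies) if m]
print(hits)
```

Like the jq version, this returns only the matching fields rather than the whole record, which is one of the ways it differs from zq's search.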




ZqUserOperatorsNeedParens
from LearningZq

examples use zq version 1.16.0.

In zq you can define your own dataflow operators and functions. All functions need to be called with parens, even argumentless ones, whether built-in or user-defined.

Operators, however, are quirky in that built-in ones cannot be called with parens, but user-defined ones must be.

First off, when do I need to create a function vs. an operator?

The syntax for a user-defined function is:
func <id> ( [<param> [, <param> ...]] ) : ( <expr> )

“where <id> and <param> are identifiers and <expr> is an expression that may refer to parameters but not to runtime state such as this.”

The syntax for a user-defined operator is:
op <id> ( [<param> [, <param> ...]] ) : (
<sequence>
)

“where <id> is the operator identifier, <param> are the [optional] parameters for the operator, and <sequence> is the chain of operators (e.g., operator | ...) where the operator does its work.” And it can refer to this.

To over-simplify:
User-defined Function | Can only call other functions. Cannot refer to this.
User-defined Operator | Can call operators and functions. Can refer to this.

Though it's more precise to say the body of a user-defined function is an Expression, and the body of a user-defined operator is a Sequence of Dataflow Operators.

Let's make a user-defined operator.
❯ zq -z \
'op add_ids(): (
over this
| yield {id:count(), ...this}
)

yield [{name:"lorem"},{name:"ipsum"}] | add_ids()'

{id:1(uint64),name:"lorem"}
{id:2(uint64),name:"ipsum"}

The user-defined operator add_ids will take an array of records and add an auto-incrementing id integer to each one.

There are some other cool zq things going on here, but for now we'll just focus on calling the operator with parens. Call it without them, and once again the query falls through to the search Implied Operator:
❯ zq -z \
'op add_ids(): (
over this
| yield {id:count(), ...this}
)

yield [{name:"lorem"},{name:"ipsum"}] | add_ids'



add_ids without parens is interpreted by zq as search add_ids and the string “add_ids” can't be found in the array of two records.




ZqFunctionsNeedParens
from LearningZq

examples use zq version 1.16.0 and jq version 1.7

Knowing the difference between an operator and a function in Zed can be confusing to me because there's not a hard distinction like that in jq (not to mention “operators” in jq docs refer to things like + and -, and zq “dataflow operators” are things like over, yield, and put).

In addition, functions in zq always need to be called with parentheses, even if they take no arguments (like the now function). But in jq, argumentless functions receive their input via the pipe | operator and don't have parens, like Dataflow Operators in zq.
❯ echo "[1,2,3]" | jq 'length'
3

❯ echo "[1,2,3]" | jq 'length()'
jq: error: syntax error, unexpected ')' ... at <top-level>, line 1:
length()
jq: 1 compile error

While I don't have any advice on how to “just know” when a built-in zq “thing” is a dataflow operator vs. a function (I generally have to look all those things up anyway with either tool), the rule to keep in mind in zq is: if it's a function, it's gotta have parens.
❯ echo '{a:null}' | zq 'now() | {a:this}' -
{a:2024-07-12T14:25:24.70848Z}

If it doesn't have parens, then in many cases it won't even error out; it'll fall back to the search implied operator (see ZqImpliedOperator).
❯ echo '{a:null}' | zq 'now | {a:this}' -







ZqImpliedOperatorsCanTrickYou
from LearningZq

examples use zq version 1.16.0 and jq version 1.7

A subset of zq operators are Implied Operators, which means “Zed allows certain operator names to be optionally omitted when they can be inferred from context.” The operators are evaluated in this order:

Evaluation | Operator | Implied | Explicit
search expression | search | foo | search foo
boolean expression | where | a >= 1 | where a >= 1
field assignment | put | a:=x+1,b:=y-1 | put a:=x+1,b:=y-1
aggregation | summarize | count() | summarize count()
expression | yield | {a:x+1,b:y-1} | yield {a:x+1,b:y-1}


Note: The -C flag can be passed to zq to output the parsed query with explicit operators.

In many contexts this is really helpful (see ZqSearchIsFirstClass), but as I've been learning zq, it's been confusing at times.

I was experimenting with the example data found in this Brim Data article and put together this query:
❯ zq 'over docs
| has(author_name)
| grep(/tuta/, author_name)
| yield author_key' openlibrary.json
["OL369643A"]


Browsing the docs for some functions, I decided to try out the lower function and, as is my wont given my jq experience, piped the previous output to lower. But this didn't work, and I got no output:
❯ zq 'over docs
| has(author_name)
| grep(/tuta/, author_name)
| yield author_key | over this | lower' openlibrary.json


I realized I made an error in how to call the function (see ZqFunctionsNeedParens and ZqPipeCharacter), and fixed it accordingly:
❯ zq 'over docs
| has(author_name)
| grep(/tuta/, author_name)
| yield author_key | over this | lower(this)' openlibrary.json
"ol369643a"


I'm aware that I wrote both the has and grep functions correctly, but I think because lower doesn't require an additional argument, I just fell into a jq habit.

But I am still curious. Why didn't I get an error or something letting me know I wasn't using lower properly? If I make a mistake by passing a non-string into it, I'll get an error:
❯ zq 'lower(1)'
error({message:"lower: string arg required",on:1})

So, why no error here?
❯ zq 'yield "HEY" | lower'



Our friend the -C flag has the answer:
❯ zq -C 'yield "HEY" | lower'
yield "HEY"
| search lower

Ahh. The real reason this returns nothing, not even an error, is search is now the implied operator for the term ‘lower’! It's not parsed as a built-in function because it's also a valid search expression, and because ZqSearchIsFirstClass, which is a good thing in search contexts, that's given the priority.

While this example is a bit contrived, hopefully it highlights a frequent experience for me while learning zq. Remember, if you make a change, and get no output, an Implied Operator is probably in play.




ZqPipeCharacter
from LearningZq

examples use zq version 1.16.0 and jq version 1.7

The pipe character appears quite frequently in both jq and zq. Here's an example of an equivalent query in both:
jq -c '[.docs[]
| {title, author_name: .author_name[0], publish_year: .publish_year[0]}
| select(.author_name!=null and .publish_year!=null)
]
| group_by(.author_name)
| [.[] | {author_name: .[0].author_name, count: . | length}]
| sort_by(.count) | reverse | limit(3;.[])' openlibrary.json

zq -j 'over docs
| {title, author_name: author_name[0], publish_year: publish_year[0]}
| has(author_name) and has(publish_year)
| count() by author_name | sort -r count | head 3' openlibrary.json

(Examples taken from this Brim Data article comparing zq and jq).

In jq every program is a series of filters separated by the pipe operator. There are a few places the pipe operator cannot be used, but it's fairly ubiquitous.

zq has a more elaborate and structured syntax. At its highest level, one or more dataflow operators are joined together in a sequence with the pipe character (sequence overview).
operator | operator | ...
But within operators, syntax varies and the pipe character isn't used. Field references and expressions are common, and most functions receive arguments passed in parentheses. Function outputs can only be passed to other functions as nested calls, not via the pipe character.

In this example, the who value is replaced with “me” whenever it's “chrismo”:
❯ zq -j '[{who:"bob"}, {who:"chrismo"}]
| over this
| put who:=replace(who, "chrismo", "me")'
{"who":"bob"}
{"who":"me"}

put is an operator that sets a field to the value of an expression. replace is a function taking three string arguments.

If not all records have a who field, we can use the coalesce function to return an empty string if who is missing, but we cannot use the pipe character like we could in jq between two functions:
❯ zq -j '[{}, {who:"bob"}, {who:"chrismo"}]
| over this
| put who:=((coalesce(who, "") | replace(who, "chrismo", "me"))'

zq: error parsing Zed at line 3, column 39:
| put who:=((coalesce(who, "") | replace(who, "chrismo", "me"))
=== ^ ===

The syntax for put is
[put] <field>:=<expr>
and the pipe character is not valid in an expression. This can be accomplished by nesting the function calls:
❯ zq -j '[{}, {who:"bob"}, {who:"chrismo"}]
| over this
| put who:=replace(coalesce(who, ""), "chrismo", "me")'
{"who":""}
{"who":"bob"}
{"who":"me"}
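
To make the nesting idea concrete outside of zq, here is a small Python sketch of the same transformation. The `coalesce` helper is a hypothetical stand-in written for this example (Python has no built-in by that name), and a missing key is modeled as `None`. The point is only that the inner call's result is handed straight to the outer call, which is the job the pipe would have done between two jq filters.

```python
def coalesce(*values):
    # Hypothetical helper: return the first argument that is not None.
    for value in values:
        if value is not None:
            return value
    return None


records = [{}, {"who": "bob"}, {"who": "chrismo"}]

# The inner call runs first; its result becomes the input of the outer
# replace, so no pipe is needed between the two steps.
result = [
    {**rec, "who": coalesce(rec.get("who"), "").replace("chrismo", "me")}
    for rec in records
]
print(result)
```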





LearningZq
Over the past couple of years in my ops work, I've built up a fair amount of jq code. Recently, I've been checking out zq and evaluating it as a jq replacement.

As I've been learning zq in more detail, I've found that while it looks similar to jq on the surface, it's pretty different under the hood, and I've been tripped up by a few things along the way. The goal with these posts is to try and flatten the learning curve for folks also coming over from jq.

For a more thorough introduction, the Brim team has a wonderful zq tutorial that helps introduce zq to folks familiar with jq. I also want to thank Phil Rzewski from the Brim team, who's been incredibly helpful, both in their support Slack and in the GitHub repo, fielding my questions patiently and thoughtfully.

While I've tried to be accurate in what is written here, these posts are not designed to be a comprehensive comparison of the two tools. I'd love any feedback, corrections, or clarifications you have. Check out my ContactInfo for how to reach me.

Post | tl;dr
ZqSearchIsFirstClass | Searching is an integral part of aggregation and transformation, but compared to jq, zq search is first class.
ZqImpliedOperatorsCanTrickYou | Did you make a change and now there's no output? It's probably the search implied operator hiding the fact that you tried to pipe something to a function like you would in jq, but that's not how it works. See previous post.
ZqPipeCharacter | You can't use | everywhere, esp. not in or between Expressions. Only in-between Dataflow Operators.
ZqFunctionsNeedParens | Functions always have to be called with parens. If you don't, the search implied operator will getchya.
ZqUserOperatorsNeedParens | While built-in dataflow operators do not take parens, user-defined ones must. If you don't, (all together now) the search implied operator will getchya.


(So, with these hurdles, why bother? There's lots to love about zq: simple search; easier querying than jq (despite some learning curve); more performant; multiple input & output formats: from csv to Parquet and more; flexibly discover, massage, and query unstructured data alongside typed records with the power of relational joins in a file system lake; and more.)





LawsOfManagement
Inspired by LawsOfUX and this tweet by @viktorcessan, a collection of Laws all managers should know. Some are focused on technology work, but some apply broadly. (As I've continued to assemble the list, the categorization of it all has gotten a bit sprawly, but let's live a little.)



LawsOfUX
https://lawsofux.com/

These are great. Some of my favs:


TheValueOfACompany
“The value of a company is the sum of all problems solved.”

The last season of Startup, Gimlet's first podcast, discusses their acquisition by Spotify, and it has some interesting behind-the-scenes stuff (maybe not as much as I'd like, but ...)

One of the ongoing themes is the tension between Alex and Matt, the co-founders of Gimlet. Alex cares more about the quality of the product and Matt cares more about the business needs, and Gimlet was struggling, even in the quarter preceding the acquisition. Neither of them were handling the conflict very well.

In this episode, Thanksgiving in Stockholm, Alex talks with Daniel Ek, CEO of Spotify, after the acquisition has happened, reflecting on conversations they'd had before it.

[36:28] Daniel Ek, Spotify co-founder: It's actually my co-founder's [Martin Lorentzon] saying. He said this thing, I'm not even sure he was aware that he coined what I think is an iconic quote. He said, “the value of a company is the sum of all problems solved.” Even to this day, it's one of those things that I think about. You may think about all the things you guys went through as all the issues that you went through, but you solved them, one by one. And I think the most important thing that you got right is the integrity of the programming and the shows that you make. At the end of the day, that is the value that you're bringing to that and bringing to consumers and it really served you well in the end.

Then Alex sums up his thoughts in a way I don't quite agree with.

[37:20] Alex Blumberg: In other words, Matt's and mine constant fighting had produced something valuable, the fighting itself in fact was the thing that made it better. If I hadn't cared about what I cared about and he hadn't cared about what he cared about and we hadn't each cared enough to fight with each other, the company we built, it wouldn't have worked as well.

It's the caring that matters, not the fighting. It's possible to combine caring with good conflict resolution skills and minimize the fighting. Then things are even better.

RoomToMove
In his book, Slack, Tom DeMarco writes:

Slack is a prescription for building a capacity to change into the modern enterprise. It looks into the heart of the efficiency-flexibility quandary: The more efficient you get, the harder it is to change. The book shows managers how to make their organizations slightly less efficient but enormously more effective. It coaches them on the introduction of slack, the missing ingredient required for all change. It counsels a thoughtful use of slack instead of the mindless obsession with elimination of all slack in the interests of efficiency.

The other day on the job one of the devs on my team posted a PR for code review with a comment like: “This removes X from the codebase. It's not terribly important, but it's been bugging for a long time and I'm glad to be rid of it.”

I realized that the presence of such PRs is an interesting way to gauge how much slack my team has.

If they're too efficient in knocking out cards from stakeholders, I probably won't see any work like this.

ValueOfBuildSpeed
Willem van Bergen, in the devproductivity.io Slack, mentioned a talk given at the @Scale conference of September 2017 that shared an internal study done by Google around the productivity costs of their build speeds. Taking aggregate data from tracks of how individual engineers are spending their time, they arrived at this graph:


The speaker, Collin Winter, had this to say about it (starting 27:44 into the video):
This is a graph from a report that we just published internally a few weeks ago that looks at, for builds of a different duration, how much time are people spending in each of these different cognitive modes. What we can see here is slow builds, builds lasting more than an hour, actually aren't taking that much time in aggregate. Likewise, builds that are really fast, well, they’re really fast, and they’re also not taking that much time in aggregate.

But we can see here that builds in the 2 to 5 minute range and the 5 to 10 minute range, those are actually pretty expensive. … That red bar is time where, as far as we can tell, people were just sitting there twiddling their thumbs staring at the terminal. … We can see time where we simply don’t know what they were doing [yellow].

It looks like the 2 to 10 minutes range is where we should spend our optimization effort and we can use that to focus our efforts more intelligently than a scattershot approach.

I'm of two minds here. Part of me is really interested in this data, but I'm also doubtful. They must be making some possibly large presumptions when interpreting the data of what counts as “untracked” vs. what's “waiting”. I'm also curious if this data is collected from their entire engineer population, or just volunteers? Do they know they're being tracked? How do they deal with the Heisenberg effect?

If the data is trustworthy, and I have a 15 minute build, will I make things worse by getting it into the 2-10 minute range? And how large does my engineering team need to be before any of this matters?