zq version 1.16.0 and jq version 1.7
jq → zq learning curve gotchyas.
The search Dataflow Operator is not only an Implied Operator but it is the “primary” implied operator, the first one to be evaluated by the parser (see also ZqImpliedOperator).
#!/usr/bin/env bash
echo "Downloading..."
curl "https://raw.githubusercontent.com/prust/wikipedia-movie-data/91ae721d/movies.json" \
>raw.json
echo "Flattening single array into separate lines:"
echo "- Processing to movies.zng..."
zq -f zng 'over this' raw.json >movies.zng
echo "- Processing to movies.json..."
jq -c '.[]' raw.json >movies.json
echo "Removing raw.json"
rm raw.json
zq, it also saves the data into a binary Zed format called ZNG (“zing”) that performs much better (see also ZqWhereDidYouGetAllTheseNames).
❯ zq 'Bacon' movies.zng
...
...[snip a bacon-ton of output]...
...
{title:"Leave the World Behind",year:2023...}
❯ zq 'year >= 2022 | Bacon | yield this.title' movies.zng
"They/Them"
"One Way"
"Smile"
"Space Oddity"
"Leave the World Behind"
❯ zq 'year >= 2022 | Bacon | ! Kevin | cut title, year, cast' movies.zng
{title:"Smile",year:2022,
cast:["Sosie Bacon","Jessie T. Usher","Kyle Gallner",↩
"Robin Weigert","Caitlin Stasey","Kal Penn","Rob Morgan"]}
❯ zq 'Bacon | ! Kevin | cut title, year' movies.zng
...
...[snip another bacon-ton of output]...
...
❯ zq -j 'Bacon | ! Kevin | cut title, year
| decade:=(year/10)*10 | count() by decade | sort decade' movies.zng
{"decade":1910,"count":7}
{"decade":1920,"count":19}
{"decade":1930,"count":43}
{"decade":1940,"count":31}
{"decade":1950,"count":13}
{"decade":2020,"count":1}
'Bacon | ! Kevin | year < 1960 | cut title, cast' is too much output, but scanning it doesn't show many Bacon names at all. It must be in the Extract field (unless “Bacon” is a genre? And why shouldn't it be?! mmmm, bacon).
❯ zq 'year < 1960 | cut title, cast | Bacon' movies.zng
{title:"The Fireman",cast:["Charlie Chaplin","Edna Purviance","Lloyd Bacon"]}
{title:"The Floorwalker",cast:["Charlie Chaplin","Edna Purviance","Eric Campbell","Lloyd Bacon"]}
{title:"The House of Intrigue",cast:["Mignon Anderson","Lloyd Bacon"]}
{title:"The Girl in the Rain",cast:["Anne Cornwall","Lloyd Bacon"]}
{title:"The Greater Profit",cast:["Edith Storey","Pell Trenton","Lloyd Bacon"]}
{title:"Hearts and Masks",cast:["Elinor Field","Lloyd Bacon","Francis McDonald"]}
{title:"Bringin' Home the Bacon",cast:["Jay Wilsey","Jean Arthur"]}
{title:"Branded Men",cast:["Ken Maynard","June Clyde","Irving Bacon"]}
extract field:
❯ zq 'year < 1960 | cut extract | Bacon
| bacon:=regexp_replace(this.extract, /.*\W(\w+? Bacon).*/, "$1")
| cut bacon | sort | uniq' movies.zng
{bacon:"Daskam Bacon"}
{bacon:"David Bacon"}
{bacon:"Irving Bacon"}
{bacon:"Lloyd Bacon"}
{bacon:"the Bacon"}
❯ zq '"the Bacon" | cut title' movies.zng
{title:"Bringin' Home the Bacon"}
zq -- in addition to its many other features -- is designed to be a search tool, and because search is the primary Implied Operator, meaning you don't need to type out search in contexts where the tool presumes that's what you meant.
zq learning curve. Click on through to ZqImpliedOperatorsCanTrickYou for more on that.
zq for doing data exploration and normalization. The Zed docs offer a couple of great tutorials that go into more detail: Real-World GitHub Data & Zed and Schools Data.
zq vs jq in a different post, but we can take a peek in that direction here before we sign off.
jq, but even the basic zq search operator doesn't seem to be as easy in jq, to search through all of the values of objects. Here's what I've come up with so far:
❯ jq 'with_entries(select(.value | tostring | contains("Bacon")))
| select(. != {})' movies.json
❯ cat search.jq
def search(term):
with_entries(select(.value | tostring | contains(term))) |
select(. != {});
❯ jq 'include "search"; search("Bacon")' movies.json
...
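One difference worth noting: the with_entries version returns only the matching key/value pairs, while the zq search shown earlier returns the whole record. Here's a minimal jq sketch that keeps the whole record instead. The two sample records are made up for illustration, and search_record is a hypothetical name, not something from the jq docs:

```shell
# Two made-up records in the same one-object-per-line layout as movies.json.
sample='{"title":"Smile","year":2022,"cast":["Sosie Bacon","Kal Penn"]}
{"title":"Barbie","year":2023,"cast":["Margot Robbie"]}'

# search_record (hypothetical name) keeps the whole object when any top-level
# value, rendered as a string, contains the term.
matches=$(printf '%s\n' "$sample" | jq -c '
  def search_record(term): select(any(.[]; tostring | contains(term)));
  search_record("Bacon")')

printf '%s\n' "$matches"
```

Like the with_entries version, this only stringifies top-level values, so it's a rough stand-in for search rather than an equivalent.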
In zq you can define your own dataflow operators and functions. All functions need to be called with parens, even argumentless ones, whether built-in or user-defined.
func <id> ( [<param> [, <param> ...]] ) : ( <expr> )
“<id> and <param> are identifiers and <expr> is an expression that may refer to parameters but not to runtime state such as this.”
op <id> ( [<param> [, <param> ...]] ) : (
  <sequence>
)
“<id> is the operator identifier, <param> are the [optional] parameters for the operator, and <sequence> is the chain of operators (e.g., operator | ...) where the operator does its work.” And it can refer to this.
Though it's more precise to say the body of a user-defined function is an Expression, and the body of a user-defined operator is a Sequence of Dataflow Operators.
User-defined Function | Can only call other functions. Cannot refer to this. |
User-defined Operator | Can call operators and functions. Can refer to this. |
❯ zq -z \
'op add_ids(): (
over this
| yield {id:count(), ...this}
)
yield [{name:"lorem"},{name:"ipsum"}] | add_ids()'
{id:1(uint64),name:"lorem"}
{id:2(uint64),name:"ipsum"}
add_ids will take an array of records and add an auto-incrementing id integer to each one.
zq things going on here, but for now we'll just focus on calling the operator, with parens. Call it without parens, and again this will fall to the search Implied Operator:
❯ zq -z \
'op add_ids(): (
over this
| yield {id:count(), ...this}
)
yield [{name:"lorem"},{name:"ipsum"}] | add_ids'
❯
add_ids without parens is interpreted by zq as search add_ids and the string “add_ids” can't be found in the array of two records.
zq version 1.16.0 and jq version 1.7
jq (not to mention “operators” in jq docs refer to things like + and -, and zq “dataflow operators” are things like over, yield, and put).
Functions in zq always need to be called with parentheses, even if they have no arguments (like the now function). But in jq, argumentless functions receive their input via the pipe | operator and don't have parens, like Dataflow Operators in zq.
❯ echo "[1,2,3]" | jq 'length'
3
❯ echo "[1,2,3]" | jq 'length()'
jq: error: syntax error, unexpected ')' ... at <top-level>, line 1:
length()
jq: 1 compile error
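For contrast, here's a small sketch of the jq calling convention. The inputs and the double and add_n names are made up for illustration. Argumentless functions, built-in or user-defined, are written without parens, and only functions that take arguments get them:

```shell
# Built-in argumentless function: input arrives via the pipe, no parens.
lowered=$(echo '"HEY"' | jq 'ascii_downcase')

# User-defined functions (hypothetical names): "double" takes no arguments,
# so it is defined and called without parens; "add_n" takes one argument.
doubled=$(echo '[1,2,3]' | jq -c 'def double: . * 2; map(double)')
added=$(echo '[1,2,3]' | jq -c 'def add_n(n): . + n; map(add_n(10))')

printf '%s\n' "$lowered" "$doubled" "$added"
```

This is the habit that bites in zq, where the bare name would be read as a search term instead.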
zq “thing” is a dataflow operator vs. a function (I generally have to look all those things up anyway with either tool), the rule to keep in mind in zq is: if it's a function, it's gotta have parens.
❯ echo '{a:null}' | zq 'now() | {a:this}' -
{a:2024-07-12T14:25:24.70848Z}
search implied operator (see ZqImpliedOperator).
❯ echo '{a:null}' | zq 'now | {a:this}' -
❯
zq version 1.16.0 and jq version 1.7
zq operators are Implied Operators, which means “Zed allows certain operator names to be optionally omitted when they can be inferred from context.” The operators are evaluated in this order:
Evaluation | Operator | Implied | Explicit |
search expression | search | foo | search foo |
boolean expression | where | a >= 1 | where a >= 1 |
field assignment | put | a:=x+1,b:=y-1 | put a:=x+1,b:=y-1 |
aggregation | summarize | count() | summarize count() |
expression | yield | {a:x+1,b:y-1} | yield {a:x+1,b:y-1} |
The -C flag can be passed to zq to output the parsed query with explicit operators.
❯ zq 'over docs
| has(author_name)
| grep(/tuta/, author_name)
| yield author_key' openlibrary.json
["OL369643A"]
❯
jq experience, piped our previous output to lower. But this didn't work, and I got no output:
❯ zq 'over docs
| has(author_name)
| grep(/tuta/, author_name)
| yield author_key | over this | lower' openlibrary.json
❯
❯ zq 'over docs
| has(author_name)
| grep(/tuta/, author_name)
| yield author_key | over this | lower(this)' openlibrary.json
"ol369643a"
❯
has and grep functions correctly, but I think because lower doesn't require an additional argument, I just fell into a jq habit.
lower properly? If I make a mistake by passing a non-string into it, I'll get an error:
❯ zq 'lower(1)'
error({message:"lower: string arg required",on:1})
❯ zq 'yield "HEY" | lower'
❯
-C flag has the answer:
❯ zq -C 'yield "HEY" | lower'
yield "HEY"
| search lower
search is now the implied operator for the term ‘lower’! It's not parsed as a built-in function because it's also a valid search expression, and because ZqSearchIsFirstClass, which is a good thing in search contexts, that's given the priority.
zq. Remember, if you make a change, and get no output, an Implied Operator is probably in play.
zq version 1.16.0 and jq version 1.7
jq and zq. Here's an example of an
jq -c '[.docs[]
| {title, author_name: .author_name[0], publish_year: .publish_year[0]}
| select(.author_name!=null and .publish_year!=null)
]
| group_by(.author_name)
| [.[] | {author_name: .[0].author_name, count: . | length}]
| sort_by(.count) | reverse | limit(3;.[])' openlibrary.json
zq -j 'over docs
| {title, author_name: author_name[0], publish_year: publish_year[0]}
| has(author_name) and has(publish_year)
| count() by author_name | sort -r count | head 3' openlibrary.json
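To try the jq side without downloading openlibrary.json, here's the same shape of pipeline over a tiny made-up document. The titles, authors, and years are invented, and limit is dropped since there are only two groups:

```shell
# Invented stand-in for openlibrary.json: four docs, one without an author.
docs='{"docs":[
  {"title":"A","author_name":["Ann"],"publish_year":[2001]},
  {"title":"B","author_name":["Bob"],"publish_year":[1999]},
  {"title":"C","author_name":["Ann"],"publish_year":[2010]},
  {"title":"D","publish_year":[2020]}]}'

# Collect into an array, group by author, count each group, largest first.
top=$(printf '%s\n' "$docs" | jq -c '[.docs[]
  | {title, author_name: .author_name[0], publish_year: .publish_year[0]}
  | select(.author_name != null and .publish_year != null)]
  | group_by(.author_name)
  | map({author_name: .[0].author_name, count: length})
  | sort_by(.count) | reverse | .[]')

printf '%s\n' "$top"
```

The doc with no author_name drops out at the select step, because indexing into the missing field yields null.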
In jq every program is a series of filters separated by the pipe operator. There are a few places the pipe operator cannot be used, but it's fairly ubiquitous.
zq has a more elaborate and structured syntax. At its highest level, one or more dataflow operators are joined together in a sequence with the pipe character (sequence overview). But within operators, syntax varies and the pipe character isn't used. Field references and expressions are common, and most functions receive arguments passed in parentheses. Function outputs can only be passed to other functions as nested calls, not via the pipe character.
operator | operator | ...
who value is replaced with “me” whenever it's “chrismo”:
❯ zq -j '[{who:"bob"}, {who:"chrismo"}]
| over this
| put who:=replace(who, "chrismo", "me")'
{"who":"bob"}
{"who":"me"}
put is an operation that sets a field to an expression. replace is a function taking three string arguments.
who field, we can use the coalesce function to return an empty string if who is missing, but we cannot use the pipe character like we could in jq between two functions:
❯ zq -j '[{}, {who:"bob"}, {who:"chrismo"}]
| over this
| put who:=((coalesce(who, "") | replace(who, "chrismo", "me"))'
zq: error parsing Zed at line 3, column 39:
| put who:=((coalesce(who, "") | replace(who, "chrismo", "me"))
=== ^ ===
put is [put] <field>:=<expr> and the pipe character is not valid in an expression. This can be accomplished by nesting the function calls:
❯ zq -j '[{}, {who:"bob"}, {who:"chrismo"}]
| over this
| put who:=replace(coalesce(who, ""), "chrismo", "me")'
{"who":""}
{"who":"bob"}
{"who":"me"}
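For comparison, here's a rough sketch of what the piped version could look like in jq, where the pipe is allowed inside an expression. The split/join pair is my stand-in for zq's replace (a regex-free substring replacement), not a direct equivalent of it:

```shell
# Default a missing who to "", then pipe that value onward inside the
# assignment's right-hand side.
fixed=$(echo '[{}, {"who":"bob"}, {"who":"chrismo"}]' | jq -c '
  .[] | .who = ((.who // "") | split("chrismo") | join("me"))')

printf '%s\n' "$fixed"
```

The parenthesized right-hand side is itself a small pipeline, which is exactly what zq's put won't accept.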
jq code. Recently, I've been checking out zq and evaluating it as a jq replacement.
zq in more detail, while it looks similar on the surface, it's pretty different under the hood and I've been tripped up by a few things along the way. The goal with these posts is to try and flatten the learning curve for folks also coming over from jq.
zq to folks familiar with jq. I also want to thank Phil Rzewski from the Brim team, who's been incredibly helpful, both in their support Slack and in the GitHub repo, fielding my questions patiently and thoughtfully.
Post | tl;dr |
ZqSearchIsFirstClass | Searching is an integral part of aggregation and transformation, but compared to jq , zq search is first class. |
ZqImpliedOperatorsCanTrickYou | Did you make a change and now there's no output? It's probably the search implied operator hiding the fact that you tried to pipe something to a function like you would in jq , but that's not how it works. See previous post. |
ZqPipeCharacter | You can't use | everywhere, esp. not in or between Expressions. Only in-between Dataflow Operators. |
ZqFunctionsNeedParens | Functions always have to be called with parens. If you don't, the search implied operator will getchya. |
ZqUserOperatorsNeedParens | While built-in dataflow operators do not take parens, user-defined ones must. If you don't, (all together now) the search implied operator will getchya. |
zq: simple search; easier querying than jq (despite some learning curve); more performant; multiple input & output formats: from csv to Parquet and more; flexibly discover, massage, and query unstructured data alongside typed records with the power of relational joins in a file system lake; and more.)
[36:28] Daniel Ek, Spotify co-founder: It's actually my co-founder's [Martin Lorentzon] saying. He said this thing, I'm not even sure he was aware that he coined what I think is an iconic quote. He said, “the value of a company is the sum of all problems solved.” Even to this day, it's one of those things that I think about. You may think about all the things you guys went through as all the issues that you went through, but you solved them, one by one. And I think the most important thing that you got right is the integrity of the programming and the shows that you make. At the end of the day, that is the value that you're bringing to that and bringing to consumers and it really served you well in the end.
[37:20] Alex Blumberg: In other words, Matt's and mine constant fighting had produced something valuable, the fighting itself in fact was the thing that made it better. If I hadn't cared about what I cared about and he hadn't cared about what he cared about and we hadn't each cared enough to fight with each other, the company we built, it wouldn't have worked as well.
Slack is a prescription for building a capacity to change into the modern enterprise. It looks into the heart of the efficiency-flexibility quandary: The more efficient you get, the harder it is to change. The book shows managers how to make their organizations slightly less efficient but enormously more effective. It coaches them on the introduction of slack, the missing ingredient required for all change. It counsels a thoughtful use of slack instead of the mindless obsession with elimination of all slack in the interests of efficiency.
This is a graph from a report that we just published internally a few weeks ago that looks at, for builds of a different duration, how much time are people spending in each of these different cognitive modes. What we can see here is slow builds, builds lasting more than an hour, actually aren't taking that much time in aggregate. Likewise, builds that are really fast, well, they’re really fast, and they’re also not taking that much time in aggregate.
But we can see here that builds in the 2 to 5 minute range and the 5 to 10 minute range, those are actually pretty expensive. … That red bar is time where, as far as we can tell, people were just sitting there twiddling their thumbs staring at the terminal. … We can see time where we simply don’t know what they were doing [yellow].
It looks like the 2 to 10 minutes range is where we should spend our optimization effort and we can use that to focus our efforts more intelligently than a scattershot approach.