I know this dates me, but ... 80% of the problems I'm solving with jq are caused by using JSON at all ... when a simpler format would have been fine.
Repeating every verbose field name in each record, when the schema is flat, is often premature "schema might need to be variable someday" optimization.
When the Rapid7 DNS data was freely available, it was distributed as a one-line-per-stanza JSON file. The first thing I'd do after downloading it was convert it to CSV ... which cut its size by 60%.
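The conversion itself is a few lines in any language. A rough Python sketch, with hypothetical field names standing in for the real schema:

# Flatten one-JSON-object-per-line input into CSV on stdout.
# Field names here are made-up placeholders, not the actual Rapid7 schema.
import csv
import json
import sys

fields = ["name", "type", "value"]
writer = csv.writer(sys.stdout)
writer.writerow(fields)
for line in sys.stdin:
    obj = json.loads(line)
    writer.writerow(obj.get(f, "") for f in fields)

The savings come from writing each field name once, in the header, instead of once per record.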
It's like buying a ten-pound box of individually wrapped grains of rice.
-
Ryan Castellucci :nonbinary_flag: replied to Royce Williams
@tychotithonus This matters a lot less if you compress it; the field-name overhead mostly disappears at that point.
I have a bunch of tooling that automatically detects (via the first few bytes) compressed data and handles it. Mostly using zstandard.
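A minimal sketch of that kind of sniffing in Python, keying off well-known magic bytes (the helper name and structure here are my own illustration, not Ryan's actual tooling):

# Pick a decompressor by sniffing the file's first few bytes.
import bz2
import gzip
import io
import lzma

def open_maybe_compressed(path):
    with open(path, "rb") as fh:
        head = fh.read(6)
    if head.startswith(b"\x28\xb5\x2f\xfd"):  # zstandard frame magic
        import zstandard                      # pip install zstandard
        raw = open(path, "rb")
        return io.TextIOWrapper(zstandard.ZstdDecompressor().stream_reader(raw))
    if head.startswith(b"\x1f\x8b"):          # gzip
        return gzip.open(path, "rt")
    if head.startswith(b"\xfd7zXZ\x00"):      # xz
        return lzma.open(path, "rt")
    if head.startswith(b"BZh"):               # bzip2
        return bz2.open(path, "rt")
    return open(path, "r")                    # plain-text fallback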
Also, CSV is bad and I hate it. TSV is acceptable.
-
Royce Williams replied to Ryan Castellucci :nonbinary_flag:
@ryanc Yeah, definitely pro-TSV! When I say CSV out loud, I actually mean TSV in my head. I need to watch that ...
I'll also have to dig up the post where I grieve for the alternate future where we actually used the dedicated field and record separator characters built into ASCII. So much avoidable pain.
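A quick Python illustration of what that would look like (made-up records, obviously): ASCII reserved 0x1F and 0x1E for exactly this.

US = "\x1f"  # ASCII Unit Separator: between fields
RS = "\x1e"  # ASCII Record Separator: between records

records = [["alice", "alice@example.com"], ["bob", "Doe, Jane"]]
blob = RS.join(US.join(fields) for fields in records)

# No quoting or escaping needed; the separators never occur in ordinary text.
parsed = [rec.split(US) for rec in blob.split(RS)]
assert parsed == records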
-
Aaron Toponce ⚛️:debian: replied to Royce Williams
@tychotithonus @ryanc Doesn't 0x1E solve this problem as a record separator, or am I misunderstanding?
-
Royce Williams replied to Aaron Toponce ⚛️:debian:
-
@tychotithonus CBOR is a very decent option.
BTW I think being able to read the data without consulting the schema is a pretty big plus. But yep, JSON is unnecessarily verbose.
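A rough size comparison, assuming the third-party cbor2 package (the record here is made up):

import json
import cbor2  # pip install cbor2

record = {"hostname": "web01", "user": "alice", "cpu": "x86"}
print(len(json.dumps(record)))   # JSON text encoding
print(len(cbor2.dumps(record)))  # CBOR binary encoding, typically smaller

Note that CBOR maps still carry the key names, so the win over JSON is compact framing and binary value encoding, not schema elision.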
-
Risotto replied to Royce Williams
@tychotithonus @phil I disagree for four reasons:
1. I'm bitter and you're right, JSON is verbose and easily reconstructable. Obviously we should use the fewest bytes in the hardest-to-read files (this is what IDE highlighting of JSON and XML is for; no highlighting of CSV exists outside of Excel)
2. usually API calls filter down to fewer objects
3. compression exists, and more usable/durable formats take priority over network savings (see next point)
4. most shell tools haven't adopted good null-terminated fields that behave well in cut/awk, so comma-laden data fails in weird, unexpected ways (but in jq that noise goes away) ... see the sketch below
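A tiny Python demonstration of point 4 (the sample line is made up):

import csv
import io

line = 'alice,"Doe, Jane",web01\n'

# Naive comma splitting -- effectively what cut -d, or awk -F, does:
print(line.strip().split(","))  # ['alice', '"Doe', ' Jane"', 'web01'] (wrong)

# A quoting-aware CSV parser handles it correctly:
print(next(csv.reader(io.StringIO(line))))  # ['alice', 'Doe, Jane', 'web01']
-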
Whiskers The Great and Terrible replied to Risotto
@risottobias @tychotithonus I feel about JSON the same way I feel about most schemaless databases: they are great right up to the point that you put a schema on them.
-
Risotto replied to Whiskers The Great and Terrible
@phil @tychotithonus fair, once you normalize a DB then that kind of thing can go away.
but cut/awk are not as solid as an SQL terminal.
Unix tools and handmade tools that don't correctly handle comma escaping or quoting or empty fields are super annoying.
querying data attributes by their names is pretty cool (and nothing is stopping people from using a CSV tool that does the same thing)
user, email, asset_tag, cpu, hostname

curl https://somecoolapi/assets/ | csvq '(.cpu="x86")|{.hostname,.user}'

vs

curl https://somecoolapi/assets/ | awk "split , f3=x86" | cut -f5 -f2

(heck, I dunno what I was doing with this in the terminal history. shorter and fancier isn't better)
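For the record, name-based access over CSV is already trivially available, e.g. Python's csv.DictReader (the data below is made up):

import csv
import io

data = "user,email,asset_tag,cpu,hostname\nalice,alice@example.com,A100,x86,web01\n"

for row in csv.DictReader(io.StringIO(data)):
    if row["cpu"] == "x86":
        print(row["hostname"], row["user"])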
-
Ryan Castellucci :nonbinary_flag: replied to Royce Williams
@tychotithonus @atoponce I used to regularly work with TSV where the values were JSON. :lolsob: