I know this dates me, but ... 80% of the problems I'm solving with jq are caused by using JSON at all ... when a simpler format would have been fine.
Repeating every verbose field name in each record, when the schema is flat, is often premature "schema might need to be variable someday" optimization.
When the Rapid7 DNS data was freely available, it was distributed as a one-line-per-stanza JSON file. The first thing I'd do after downloading it was convert it to CSV ... which cut its size by 60%.
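The conversion itself is a few lines in any language. A rough Python sketch, with hypothetical field names standing in for the real schema:

# Flatten one-JSON-object-per-line input into CSV on stdout.
# Field names here are made-up placeholders, not the actual Rapid7 schema.
import csv
import json
import sys

fields = ["name", "type", "value"]
writer = csv.writer(sys.stdout)
writer.writerow(fields)
for line in sys.stdin:
    obj = json.loads(line)
    writer.writerow(obj.get(f, "") for f in fields)

The savings come from writing each field name once, in the header, instead of once per record.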
It's like buying a ten-pound box of individually wrapped grains of rice.
-
Ryan Castellucci :nonbinary_flag: replied to Royce Williams
@tychotithonus This matters a lot less if you compress it; the field-name overhead mostly disappears at that point.
I have a bunch of tooling that automatically detects (via the first few bytes) compressed data and handles it. Mostly using zstandard.
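A minimal sketch of that kind of sniffing in Python, keying off well-known magic bytes (the helper name and structure here are my own illustration, not Ryan's actual tooling):

# Pick a decompressor by sniffing the file's first few bytes.
import bz2
import gzip
import io
import lzma

def open_maybe_compressed(path):
    with open(path, "rb") as fh:
        head = fh.read(6)
    if head.startswith(b"\x28\xb5\x2f\xfd"):  # zstandard frame magic
        import zstandard                      # pip install zstandard
        raw = open(path, "rb")
        return io.TextIOWrapper(zstandard.ZstdDecompressor().stream_reader(raw))
    if head.startswith(b"\x1f\x8b"):          # gzip
        return gzip.open(path, "rt")
    if head.startswith(b"\xfd7zXZ\x00"):      # xz
        return lzma.open(path, "rt")
    if head.startswith(b"BZh"):               # bzip2
        return bz2.open(path, "rt")
    return open(path, "r")                    # plain-text fallback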
Also, CSV is bad and I hate it. TSV is acceptable.
-
Royce Williams replied to Ryan Castellucci :nonbinary_flag:
@ryanc Yeah, definitely pro-TSV! When I say CSV out loud, I actually mean TSV in my head. I need to watch that ...
I'll also have to dig up the post where I grieve for the alternate future where we actually used the dedicated field and record separator characters built into ASCII. So much avoidable pain.
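A quick Python illustration of what that would look like (made-up records, obviously): ASCII reserved 0x1F and 0x1E for exactly this.

US = "\x1f"  # ASCII Unit Separator: between fields
RS = "\x1e"  # ASCII Record Separator: between records

records = [["alice", "alice@example.com"], ["bob", "Doe, Jane"]]
blob = RS.join(US.join(fields) for fields in records)

# No quoting or escaping needed; the separators never occur in ordinary text.
parsed = [rec.split(US) for rec in blob.split(RS)]
assert parsed == records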
-
Aaron Toponce ⚛️:debian: replied to Royce Williams
@tychotithonus @ryanc Doesn't 0x1E solve this problem as a record separator, or am I misunderstanding?
-
Royce Williams replied to Aaron Toponce ⚛️:debian:
-
@tychotithonus CBOR is a very decent option.
BTW I think being able to read the data without consulting the schema is a pretty big plus. But yep, JSON is unnecessarily verbose.
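A rough size comparison, assuming the third-party cbor2 package (the record here is made up):

import json
import cbor2  # pip install cbor2

record = {"hostname": "web01", "user": "alice", "cpu": "x86"}
print(len(json.dumps(record)))   # JSON text encoding
print(len(cbor2.dumps(record)))  # CBOR binary encoding, typically smaller

Note that CBOR maps still carry the key names, so the win over JSON is compact framing and binary value encoding, not schema elision.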
-
Risotto replied to Royce Williams
@tychotithonus @phil I disagree for four reasons:
1. I'm bitter and you're right, JSON is verbose and easily reconstructable. Obviously we should use the fewest bytes in the hardest-to-read files (this is what IDE highlighting of JSON and XML is for; no highlighting of CSV exists outside of Excel)
2. usually API calls filter down to fewer objects
3. compression exists, and more usable/durable formats take priority over network savings (see next point)
4. most shell tools haven't adopted good null-terminated fields that behave well in cut/awk, so comma-laden data fails in weird, unexpected ways (but in jq that noise goes away) ... see the sketch below
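A tiny Python demonstration of point 4 (the sample line is made up):

import csv
import io

line = 'alice,"Doe, Jane",web01\n'

# Naive comma splitting -- effectively what cut -d, or awk -F, does:
print(line.strip().split(","))  # ['alice', '"Doe', ' Jane"', 'web01'] (wrong)

# A quoting-aware CSV parser handles it correctly:
print(next(csv.reader(io.StringIO(line))))  # ['alice', 'Doe, Jane', 'web01']
-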
Whiskers The Great and Terrible replied to Risotto
@risottobias @tychotithonus I feel about JSON the same way I feel about most schemaless databases: they are great right up to the point that you put a schema on them.
-
Risotto replied to Whiskers The Great and Terrible
@phil @tychotithonus fair, once you normalize a DB then that kind of thing can go away.
but cut/awk are not as solid as an SQL terminal.
Unix tools and handmade tools that don't correctly handle comma escaping or quoting or empty fields are super annoying.
querying data attributes by their names is pretty cool (and nothing is stopping people from using a CSV tool that does the same thing)
user, email, asset_tag, cpu, hostname

curl https://somecoolapi/assets/ | csvq '(.cpu="x86")|{.hostname,.user}'

vs

curl https://somecoolapi/assets/ | awk "split , f3=x86" | cut -f5 -f2

(heck, I dunno what I was doing with this in the terminal history. shorter and fancier isn't better)
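For the record, name-based access over CSV is already trivially available, e.g. Python's csv.DictReader (the data below is made up):

import csv
import io

data = "user,email,asset_tag,cpu,hostname\nalice,alice@example.com,A100,x86,web01\n"

for row in csv.DictReader(io.StringIO(data)):
    if row["cpu"] == "x86":
        print(row["hostname"], row["user"])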
-
Ryan Castellucci :nonbinary_flag: replied to Royce Williams
@tychotithonus @atoponce I used to regularly work with TSV where the values were JSON. :lolsob: