The Mysterious Extra Quotes that `jc` Put In My JSON
The Symptom
I try to use jc
to convert a CSV file to JSON for further processing with jq
. When I try to process the JSON document with jq
, it fails, unable to find a value for the first cell of the first row.
When I look at the JSON document, the JSON object key corresponding the first column name in the CSV document has extra quotes around it: instead of "Date"
, the key is "\"Date\""
. Only this key has been changed; the others are exactly as I expect them.
It seems that jc
has put extraneous quotes (or extra quotation marks1 ) around a CSV column header name.
The Cause
The CSV file has a Byte Order Mark (BOM) at the beginning, which is unprintable, so I didn’t notice it. These unprintable characters become part of the name of the first CSV column, which then becomes a key in the resulting JSON objects that jc --csv
parses.
Since "\"Date\""
is not equal to "Date"
, chaos reigns.
The Solution
Upgrade jc
to at least version 1.22.2, at least according to this issue. I was using… 1.20.4. How embarrassing.
$ pip3 install jc
$ jc --version
jc version: 1.25.0
python interpreter version: 3.11.2
python path: /home/jbrains/ThirdParty/python-3.11.2/bin/python3
https://github.com/kellyjonbrazil/jc © 2019-2024 Kelly Brazil
The Gory Details
I reproduce these details in order to give the next poor soul a chance to make it here by reading a single web page. Here are the steps that led me here.
I downloaded a CSV file of transactions from PayPal. (I use them against my will.)
I tried to process the CSV file with jc --csv | jq '.["Date"] | parse_paypal_date'
, because I’ve written a jq
function to parse PayPal’s annoying date format of dd/mm/yyyy
. (If I got to pass one law, it’d require ISO 8601 for all dates all the time everywhere.)
My processing command failed with
jq: error (at <stdin>:1): strptime/1 requires string inputs and arguments
I checked the CSV file and I saw a column named Date
with obvious string values, starting with "02/11/2022"
, which is, of course, 2022-11-02 to people with taste.
After a few iterations of disbelief and confusion, I checked the JSON document.
[
{
"\"Date\"": "02/11/2022",
Extra quotes?!
And, BTW, when I paste the JSON output into Typora to write this article, Typora clearly shows me that there’s an unprintable character at the beginning of that key.
I’ve never seen this before, so let me check another, very similar CSV file from a different source. Was this always broken?!
Nope. Everything looks good from another CSV document that also has "Date"
as the first column name.
[
{
"Date": "11/01/2022",
Spot the Difference!
At least now I can compare these two CSV documents and ignore the rest of the world. That makes diagnosis much simpler.
Since I can’t see a difference in the first column name between these two files, there must be an unprintable difference. How do I see the unprintable characters in a text file on the typical Linux distribution?
Today I Learned about cat -A
.
$ cat -A /path/to/annoying/paypal.csv | head -n 1
M-oM-;M-?"Date","Time","TimeZone","Name","Type","Status","Currency","Gross","Fee","Net","From Email Address","To Email Address","Transaction ID","Shipping Address","Address Status","Item Title","Item ID","Shipping and Handling Amount","Insurance Amount","Sales Tax","Option 1 Name","Option 1 Value","Option 2 Name","Option 2 Value","Reference Txn ID","Invoice Number","Custom Number","Quantity","Receipt ID","Balance","Address Line 1","Address Line 2/District/Neighborhood","Town/City","State/Province/Region/County/Territory/Prefecture/Republic","Zip/Postal Code","Country","Contact Phone Number","Subject","Note","Country Code","Balance Impact"$
TIL: Byte Order Marks
I inhaled quickly, then exhaled slowly.
I searched the web for "M-oM-;M-?"
and found a non-trivial number of resuilts, which led me to learn about Byte Order Marks (BOM), thanks first to a Stack Overflow answer.
I don’t know much about Byte Order Marks, except that now I have better search keywords.
I searched the web for remove "BOM" from csv file
and found a non-trivial number of results, including the authoritative-sounding “CSV and BOM character”. Unfortunately, I’m not writing PHP code using LeagueCSV, so that helps me a bit less to work around the problem. Even so, it gives me confidence that there exists some body of prior art on this subject, so I don’t need to figure this all out on my own.
I Can’t Be the First!
I went back to jc
, hoping to find an issue about this, before I went charging off on my own to fix work around the problem. And that’s when I found this issue. And that’s when I discovered that I’m using an outdated version of jc
and that upgrading might fix the problem.
And that’s when I wrote all this, while it was fresh in my mind.
I add synonyms here, not for clarity, but to attract more hits in web searches.↩︎