The odb import tool can import data in the "wide" format (as produced by the odb sql tool run with the "-f wide" option):
$ ./odb sql select \* -i 2000010106.1.0.odb -f wide -o 2000010106.1.0.csv
$ head -n 1 2000010106.1.0.csv
expver@desc:string andate@desc:integer antime@desc:integer seqno@hdr:integer obstype@hdr:integer obschar@hdr:Bitfield[codetype:9;instype:10;retrtype:6;geoarea:6] subtype@hdr:integer date@hdr:integer time@hdr:integer rdbflag@hdr:Bitfield[lat_humon:1;lat_qcsub:1;lat_override:1;lat_flag:2;lat_hqc_flag:1;lon_humon:1;lon_qcsub:1;lon_override:1;lon_flag:2;lon_hqc_flag:1;date_humon:1;date_qcsub:1;date_override:1;date_flag:2;date_hqc_flag:1;time_humon:1;time_qcsub:1;time_override:1;time_flag:2;time_hqc_flag:1;stalt_humon:1;stalt_qcsub:1;stalt_override:1;stalt_flag:2;stalt_hqc_flag:1] status@hdr:Bitfield[active:1;passive:1;rejected:1;blacklisted:1;monthly:1;constant:1;experimental:1;whitelist:1] ...
The header of the text format is a list of column descriptions, each in the format <column-name>:<type>.
The type can be:
- REAL
- DOUBLE
- INTEGER
- STRING
- BITFIELD
For the last type, BITFIELD, the list of fields and their sizes in bits follows in square brackets, for example:
rdbflag@hdr:Bitfield[lat_humon:1;lat_qcsub:1;lat_override:1;lat_flag:2;lat_hqc_flag:1;lon_humon:1;lon_qcsub:1;lon_override:1;lon_flag:2;lon_hqc_flag:1;date_humon:1;date_qcsub:1;date_override:1;date_flag:2;date_hqc_flag:1;time_humon:1;time_qcsub:1;time_override:1;time_flag:2;time_hqc_flag:1;stalt_humon:1;stalt_qcsub:1;stalt_override:1;stalt_flag:2;stalt_hqc_flag:1]
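For illustration, a minimal wide-format file could look like this (hypothetical column names and values; fields separated by TABs, shown here as spaces):

lat@hdr:REAL    lon@hdr:REAL    obsvalue@body:DOUBLE    nlev@body:INTEGER
51.150          -1.250          287.400                 3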
So, importing CSV text data (TAB-delimited, like the file produced with the odb sql tool in the example above) into ODB can be done as follows:
$ ./odb import -d TAB 2000010106.1.0.csv 2000010106.1.0.imported.odb
The delimiter can be changed with the -d option; by default it is ','.
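For instance, a comma-delimited file (hypothetical file names here) can be imported without the -d option:

$ ./odb import data.csv data.odb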
Regarding the CSV data, keep in mind the current limitation that strings can be at most 8 characters long.
Converting from other binary formats, e.g. NetCDF, to ODB via an intermediate ASCII file should be avoided due to loss of precision (unless the data is printed with full precision).
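To illustrate the precision point with a generic Python sketch (not ODB-specific): a double printed with too few digits cannot be recovered, whereas 17 significant digits are enough to round-trip it.

x = 287.123456789012345   # a double-precision value
print('%.3f' % x)         # '287.123'            -- everything past 3 decimals is lost
print('%.17g' % x)        # '287.12345678901235' -- enough digits to round-trip a double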
6 Comments
Ian Harris
Hi, what is the maximum number of fields that can be imported? I have to convert a PP file with 148 fields per line (plus 3 date fields at the start), and ODB's having none of it.
A version with only eight fields (plus 3) imported fine.
So, given that I'll have to convert the single file into multiples with a portion of the fields in each, what is the maximum safe field count please?
I'd rather not use trial and error!
Cheers
Harry
Peter Kuchta
I am not aware of any fixed limit on the number of fields. Can you please provide the file so I can try to reproduce the problem?
Ian Harris
Thanks Peter - I'll check the file itself first. I think the text editor I was using might have corrupted it. It's good to know that there's no limit - one less thing in the equation!
Have a good weekend.
Cheers, Harry
Ian Harris
I tried with a fresh copy, using a better editor (gedit) to add the INTEGER and REAL tags in the header line. I get exactly the same error, which can be summarised as:
Unknown type: 'REAL
(that's exactly how it displays). That error is coming from Columns.cc.
I will email you a ten-record version as the original has over 50,000 records. However this smaller version does have all the fields (3 + 148) and still gives the error.
I'm out until Tuesday so no rush!
Cheers
Harry
Peter Kuchta
Hi Ian, thanks for the email with the sample CSV file. I had a look and it turned out the problem was caused by the fact that some of the fields (column descriptions) in the header were separated with a tab and some with a space character. I converted the file so that only tabs were used as separators, using the following Python script (redirecting its output to a file converted.csv):
with open('C3-EURO4M-MEDARE_PP.1st10fields.txt') as f:
    for line in f:
        # split on any run of whitespace, then re-join with single tabs
        print('\t'.join(line.split()))
and then successfully imported with
$ odb import -d TAB converted.csv data.odb
Please note that although the import tool was able to create a file with column names like "60100-0", "60115-0", etc., we generally recommend column names that look like valid identifiers in programming languages such as C, Java or Python, i.e. consisting only of alphanumeric characters and starting with a letter. By convention, at ECMWF we use only lower-case letters in column names. We have not tested the ODB SQL engine with identifiers like "60100-0"; I suspect such names can cause problems when working with SQL.
Also, there are several columns with the same name in the provided text file. This will certainly be a problem when using the SQL engine; the columns in a file should have unique names.
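As a purely hypothetical sketch (not part of the ODB tools), a few lines of Python along these lines could rewrite such header tokens into unique, identifier-like names before import:

import re

def sanitize(names):
    # map raw header tokens to unique, identifier-like column names
    seen = {}
    out = []
    for name in names:
        clean = re.sub(r'\W', '_', name).lower()   # keep only alphanumerics/underscores
        if not clean or not clean[0].isalpha():
            clean = 'c_' + clean                   # ensure it starts with a letter
        n = seen.get(clean, 0)
        seen[clean] = n + 1
        out.append(clean if n == 0 else '%s_%d' % (clean, n))
    return out

print(sanitize(['60100-0', '60100-0', '60115-0']))
# -> ['c_60100_0', 'c_60100_0_1', 'c_60115_0']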
Ian Harris
Thank you, Peter - that's very helpful.
I'll get back to Manola and team; they have some reconsidering to do!
Cheers
Harry