Dear ECMWF,
I find the Copernicus data very important; however, the NetCDF files are very difficult to work with. I've been trying different scripts in Python and R to extract climate data from NetCDF files into CSV for further analysis, but it has proved a very difficult task.
I would like to ask for your help.
I need to extract climate data (precipitation, average temperature, etc.) for European countries using the data provided by Copernicus - "Temperature and precipitation climate impact indicators from 1970 to 2100 derived from European climate projections". Ideally I would like to have a panel data set, with columns for year and country, and then separate columns for each climate variable.
I've tried the Python script recommended by this page - "How to convert NetCDF to CSV"
This is the code used:
#this is for reading the .nc in the working folder
import glob
#this is required to read the netCDF4 data
from netCDF4 import Dataset
#required to read and write the csv files
import pandas as pd
#required for using the array functions
import numpy as np
from matplotlib.dates import num2date
data = Dataset('prAdjust_tmean.nc')
This is what the data contents look like:
print(data)

<class 'netCDF4._netCDF4.Dataset'>
root group (NETCDF4 data model, file format HDF5):
    CDI: Climate Data Interface version 1.8.2 (http://mpimet.mpg.de/cdi)
    frequency: year
    CDO: Climate Data Operators version 1.8.2 (http://mpimet.mpg.de/cdo)
    creation_date: 2020-02-12T15:00:49ZCET+0100
    Conventions: CF-1.6
    institution_url: www.smhi.se
    invar_platform_id: -
    invar_rcm_model_driver: MPI-M-MPI-ESM-LR
    time_coverage_start: 1971
    time_coverage_end: 2000
    domain: EUR-11
    geospatial_lat_min: 23.942343
    geospatial_lat_max: 72.641624
    geospatial_lat_resolution: 0.04268074 degree
    geospatial_lon_min: -35.034023
    geospatial_lon_max: 73.937675
    geospatial_lon_resolution: 0.009246826 degree
    geospatial_bounds: -
    NCO: netCDF Operators version 4.7.7 (Homepage = http://nco.sf.net, Code = http://github.com/nco/nco)
    acknowledgements: This work was performed within Copernicus Climate Change Service - C3S_424_SMHI, https://climate.copernicus.eu/operational-service-water-sector, on behalf of ECMWF and EU.
    contact: Hydro.fou@smhi.se
    keywords: precipitation
    license: Copernicus License V1.2
    output_frequency: 30 year average value
    summary: Calculated as the mean annual values of daily precipitation averaged over a 30 year period.
    comment: The Climate Data Operators (CDO) software was used for the calculation of climate impact indicators (https://code.mpimet.mpg.de/projects/cdo/embedded/cdo.pdf, https://code.mpimet.mpg.de/projects/cdo/embedded/cdo_eca.pdf).
    history: CDO commands (last cdo command first and separated with ;): timmean; yearmean
    invar_bc_institution: Swedish Meteorological and Hydrological Institute
    invar_bc_method: TimescaleBC, Description in deliverable C3S_D424.SMHI.1.3b
    invar_bc_method_id: TimescaleBC v1.02
    invar_bc_observation: EFAS-Meteo, https://ec.europa.eu/jrc/en/publication/eur-scientific-and-technical-research-reports/efas-meteo-european-daily-high-resolution-gridded-meteorological-data-set-1990-2011
    invar_bc_observation_id: EFAS-Meteo
    invar_bc_period: 1990-2018
    data_quality: Testing of EURO-CORDEX data performed by ESGF nodes. Additional tests were performed when producing CII and ECVs in C3S_424_SMHI.
    institution: SMHI
    project_id: C3S_424_SMHI
    references:
    source: The RCM data originate from EURO-CORDEX (Coordinated Downscaling Experiment - European Domain, EUR-11) https://euro-cordex.net/.
    invar_experiment_id: rcp45
    invar_realisation_id: r1i1p1
    invar_rcm_model_id: MPI-CSC-REMO2009-v1
    variable_name: prAdjust_tmean
    dimensions(sizes): x(1000), y(950), time(1), bnds(2)
    variables(dimensions): float32 lon(y,x), float32 lat(y,x), float64 time(time), float64 time_bnds(time,bnds), float32 prAdjust_tmean(time,y,x)
    groups:
After that I extract the needed variable:
t2m = data.variables['prAdjust_tmean']
# Get dimensions assuming 3D: time, latitude, longitude
time_dim, lat_dim, lon_dim = t2m.get_dims()
time_var = data.variables[time_dim.name]
times = num2date(time_var[:], time_var.units)
latitudes = data.variables[lat_dim.name][:]
longitudes = data.variables[lon_dim.name][:]
output_dir = './'
And the Error:
OverflowError                             Traceback (most recent call last)
<ipython-input-9-69e10e41e621> in <module>
      2 time_dim, lat_dim, lon_dim = t2m.get_dims()
      3 time_var = data.variables[time_dim.name]
----> 4 times = num2date(time_var[:], time_var.units)
      5 latitudes = data.variables[lat_dim.name][:]
      6 longitudes = data.variables[lon_dim.name][:]

C:\ProgramData\Anaconda3\lib\site-packages\matplotlib\dates.py in num2date(x, tz)
    509     if tz is None:
    510         tz = _get_rc_timezone()
--> 511     return _from_ordinalf_np_vectorized(x, tz).tolist()
    512
    513

C:\ProgramData\Anaconda3\lib\site-packages\numpy\lib\function_base.py in __call__(self, *args, **kwargs)
   2106             vargs.extend([kwargs[_n] for _n in names])
   2107
-> 2108         return self._vectorize_call(func=func, args=vargs)
   2109
   2110     def _get_ufunc_and_otypes(self, func, args):

C:\ProgramData\Anaconda3\lib\site-packages\numpy\lib\function_base.py in _vectorize_call(self, func, args)
   2190                       for a in args]
   2191
-> 2192             outputs = ufunc(*inputs)
   2193
   2194         if ufunc.nout == 1:

C:\ProgramData\Anaconda3\lib\site-packages\matplotlib\dates.py in _from_ordinalf(x, tz)
    329
    330     dt = (np.datetime64(get_epoch()) +
--> 331           np.timedelta64(int(np.round(x * MUSECONDS_PER_DAY)), 'us'))
    332     if dt < np.datetime64('0001-01-01') or dt >= np.datetime64('10000-01-01'):
    333         raise ValueError(f'Date ordinal {x} converts to {dt} (using '

OverflowError: int too big to convert
And this is the last part of the script:
import os
# Path
path = "/home"
# Join various path components
print(os.path.join(path, "User/Desktop", "file.txt"))
# Path
path = "User/Documents"
# Join various path components
print(os.path.join(path, "/home", "file.txt"))
filename = os.path.join(output_dir, 'table.csv')
print(f'Writing data in tabular form to {filename} (this may take some time)...')
times_grid, latitudes_grid, longitudes_grid = [
x.flatten() for x in np.meshgrid(times, latitudes, longitudes, indexing='ij')]
df = pd.DataFrame({
'time': [t.isoformat() for t in times_grid],
'latitude': latitudes_grid,
'longitude': longitudes_grid,
't2m': t2m[:].flatten()})
df.to_csv(filename, index=False)
print('Done')
Once again, I would like to stress that I need to extract these specific columns from the .nc file: year (time period), longitude, latitude, and the climate variables (temperature, precipitation, etc.). I hope you can help me with this task.
Many thanks for your time and help.
Best regards, Marian
2 Comments
Milana Vuckovic
Hi Marian,
You can do what you need using Python libraries pandas and xarray.
So something like this:
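A minimal sketch: xarray opens the NetCDF file and can flatten the (time, y, x) cube straight into a pandas DataFrame, one row per grid cell per time step. The tiny synthetic dataset below just stands in for your prAdjust_tmean.nc so the snippet is self-contained - in practice you would replace it with xr.open_dataset('prAdjust_tmean.nc').

```python
import numpy as np
import pandas as pd
import xarray as xr

# In practice: ds = xr.open_dataset('prAdjust_tmean.nc')
# A tiny synthetic (time, y, x) dataset stands in for the real file here.
ds = xr.Dataset(
    {'prAdjust_tmean': (('time', 'y', 'x'),
                        np.arange(12, dtype='float32').reshape(3, 2, 2))},
    coords={'time': pd.to_datetime(['1971-01-01', '1972-01-01', '1973-01-01']),
            'y': [0.0, 1.0],
            'x': [0.0, 1.0]},
)

# Flatten the cube into tidy rows: one row per grid cell per time step.
df = ds['prAdjust_tmean'].to_dataframe().reset_index()
df.to_csv('table.csv', index=False)
print(df.head())
```

reset_index() turns the (time, y, x) MultiIndex into ordinary columns, so the CSV has exactly the columns you asked about: time, coordinates, and the climate variable.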
If pandas complains that your data is too big, try working with a smaller chunk until you manage to do what you need, and then scale up.
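One way to chunk it is to convert one band of the grid at a time and write each band to its own CSV, so the full cube is never in memory as a DataFrame at once. This sketch again uses a synthetic stand-in dataset (replace it with xr.open_dataset('prAdjust_tmean.nc')), and the band size is arbitrary:

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-in for the real file; in practice:
#   ds = xr.open_dataset('prAdjust_tmean.nc')
ds = xr.Dataset(
    {'prAdjust_tmean': (('time', 'y', 'x'),
                        np.arange(12, dtype='float32').reshape(3, 2, 2))},
    coords={'time': pd.to_datetime(['1971-01-01', '1972-01-01', '1973-01-01']),
            'y': [0.0, 1.0],
            'x': [0.0, 1.0]},
)

# Convert one band of rows (along the 'y' dimension) at a time instead of
# the whole grid; each band goes to its own CSV. The band size is arbitrary.
band = 1
for start in range(0, ds.sizes['y'], band):
    subset = ds['prAdjust_tmean'].isel(y=slice(start, start + band))
    subset.to_dataframe().reset_index().to_csv(
        f'table_{start}.csv', index=False)
```

The partial CSVs can then be concatenated, or loaded one at a time for analysis.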
If you want to learn more, you might find these resources useful:
https://gitlab.com/eu_hidalgo/ecmwf_enccs_training
I hope this helps!
Milana
Marian Dobranschi
Dear Milana,
I've tried your script. It extracts some data, but it fails to extract yearly data; the output covers just one year, which is not helpful at all.
I need the average temperature at a given latitude and longitude for every year stored in the .nc file. The file holds data from 1971-2000, but your script only extracts data for 2000. This issue gets more and more complicated, and the netCDF files are impossible to work with. The climate data are needed in a non-spatial, non-array format so they can be included in regression analysis as standalone variables. These variables should be time series, i.e. yearly (or monthly, or daily) observations per country. Array data cannot be used in multivariate regressions, because the time series must be comparable: for example, GDP per year per country and precipitation per year per country should match in form, shape, and length. Combining array and non-array data goes against the practice and rationale of statistical software, which is why we need a way to extract array data into a non-array format, so it can be used for something other than plotting (which is what array data is best for).