Overview
Teaching: 20 min
Exercises: 10 minQuestions
How can I import data using Pandas?
What is NetCDF?
What tools are available to work with NetCDF?
Objectives
Describe the data formats supported by Pandas.
Show examples of reading/writing data formats.
Describe other data formats used for GIS data.
The Network Common Data Form (NetCDF) is a set of libraries and self-describing machine-independent data formats that support the creation, access, and sharing of array-oriented scientific data. NetCDF was developed and is maintained at Unidata, part of the University Corporation for Atmospheric Research (UCAR) Office of Programs (UOP). Unidata is funded primarily by the National Science Foundation.
Pandas does not have native support for reading/writing NetCDF, however there are a number of additional packages that provide this capability:
The “classic” NetCDF data model consists of the following components:
In addition, the “common data model” introduced in NetCDF-4 includes the following:
NetCDF can be used to store many kinds of data, but it was originally developed for the Earth science community.
NetCDF views the world of scientific data in the same way that an atmospheric scientist might: as sets of related arrays. There are various physical quantities (such as pressure and temperature) located at points at a particular latitude, longitude, vertical level, and time. A scientist might also like to store supporting information, such as the units, or some information about how the data were produced.
The axis information (latitude, longitude, level, and time) would be stored as NetCDF dimensions. Dimensions have a length and a name.
The physical quantities (pressure, temperature) would be stored as NetCDF variables. Variables are N-dimensional arrays of data, with a name and an associated set of NetCDF dimensions.
It is also customary to add one variable for each dimension, to hold the values along that axis. These variables are call “coordinate variables.” The latitude coordinate variable would be a one-dimensional variable (with latitude as its dimension), and it would hold the latitude values at each point along the axis.
The additional bits of metadata would be stored as NetCDF attributes. Attributes are always single values or one-dimensional arrays. (This works out well for a string, which is a one-dimensional array of ASCII characters.)
Although there are a number of packages for reading NetCDF files, we will just be working with netcdf4-python
for this tutorial.
To open a NetCDF file from Python, you simply call the Dataset() constructor as follows (a sample data set is available here):
from netCDF4 import Dataset
dataset = Dataset('sresa1b_ncar_ccsm3-example.nc')
print(dataset.file_format)
The netCDF4 module can read in any NetCDF format. This tutorial will focus exclusively on the NetCDF-
“classic” data model, which is contains only dimensions, variables and attributes only. When you
print out the file_format
attribute, you can see that the file is in NETCDF3_CLASSIC
mode.
You can interrogate dimensions using simple dictionary calls:
print(dataset.dimensions.keys())
print(dataset.dimensions['time'])
This would result in the output:
['lat', 'lon', 'bnds', 'plev', 'time']
<class 'netCDF4._netCDF4.Dimension'> (unlimited): name = 'time', size = 1
You can interrogate variables using simple dictionary calls:
for key in dataset.variables.keys():
var = dataset.variables[key]
print("name=" + var.name + ", shape=" + str(var.shape))
This would result in the output:
name=area, shape=(128, 256)
name=lat, shape=(128,)
name=lat_bnds, shape=(128, 2)
name=lon, shape=(256,)
name=lon_bnds, shape=(256, 2)
name=msk_rgn, shape=(128, 256)
name=plev, shape=(17,)
name=pr, shape=(1, 128, 256)
name=tas, shape=(1, 128, 256)
name=time, shape=(1,)
name=time_bnds, shape=(1, 2)
name=ua, shape=(1, 17, 128, 256)
Notice that you can access a variable by its key:
var = dataset.variables['pr']
print(" dims =", var.dimensions)
print(" shape =", var.shape)
print(" size =", var.size)
print(" ndim =", var.ndim)
print(" datatype =", var.datatype)
Which would result in:
dims = ('time', 'lat', 'lon')
shape = (1, 128, 256)
size = 32768
ndim = 3
datatype = float32
Global attributes are available as attributes of the Dataset
instance (more about classes and attributes later):
# Find all NetCDF global attributes
print(dataset.ncattrs())
for attr in dataset.ncattrs():
print(attr, '=', getattr(dataset, attr))
Results in:
['CVS_Id', 'creation_date', 'prg_ID', 'cmd_ln', 'history', 'table_id', 'title', 'institution', 'source', 'contact', 'project_id', 'Conventions', 'references', 'acknowledgment', 'realization', 'experiment_id', 'comment', 'model_name_english']
CVS_Id = $Id$
creation_date =
prg_ID = Source file unknown Version unknown Date unknown
cmd_ln = bds -x 256 -y 128 -m 23 -o /data/zender/data/dst_T85.nc
...
if you know the name of an attribut, you can access it directly:
print(dataset.title)
Variable attributes are available as attributes of the Variable
instance:
var = dataset.variables['pr']
# Get units attribute
print(var.units)
# Find all variable attributes
for attr in var.ncattrs():
print(attr, '=', getattr(var, attr))
Results in:
kg m-2 s-1
comment = Created using NCL code CCSM_atmm_2cf.ncl on
machine eagle163s
missing_value = 1e+20
_FillValue = 1e+20
cell_methods = time: mean (interval: 1 month)
history = (PRECC+PRECL)*r[h2o]
original_units = m-1 s-1
original_name = PRECC, PRECL
standard_name = precipitation_flux
units = kg m-2 s-1
long_name = precipitation_flux
cell_method = time: mean
Challenge
Write a Python script to open the CMIP5 global emissions data set. Call the script
historical_co2_emissions.py
. Use your knowledge of netCDF4 to answer the following questions:
- How many dimensions does the data set contain? What are the sizes of the dimensions?
- How many variables does the data set contain, and what are their names and dimensions?
- What do you think the purpose of the ‘FF’ variable is?
Once a NetCDF data set is loaded, it is possible to access the data as if it was a NumPy array using array slice notation. We can use the dimensions
attribute to see how the data is stored:
pr = dataset.variables['pr']
print(pr.dimensions)
pr_np = pr[:,:,:]
print(type(pr_np))
print(pr_np)
Results in:
('time', 'lat', 'lon')
<class 'numpy.ndarray'>
[[[ 1.09154621e-06 1.05453501e-06 1.07892288e-06 ..., 1.19990307e-06
1.18260778e-06 1.11406700e-06]
[ 8.50919207e-07 8.01471970e-07 7.74396426e-07 ..., 8.93815013e-07
8.46728483e-07 8.41070630e-07]
[ 7.78378194e-07 7.64248853e-07 7.55447502e-07 ..., 8.13378392e-07
7.91471564e-07 7.89699868e-07]
...,
[ 9.11267216e-06 9.23918469e-06 9.17026045e-06 ..., 9.26296707e-06
9.18914975e-06 9.14246812e-06]
[ 1.10062556e-05 1.11204690e-05 1.11354339e-05 ..., 1.14826926e-05
1.16207366e-05 1.14265431e-05]
[ 1.03569955e-05 1.01802771e-05 9.95330629e-06 ..., 1.03063812e-05
1.03916263e-05 1.05708477e-05]]]
We can use the normal NumPy operations on the resulting array:
print(pr_np.mean(axis=1))
Key Points