Overview
Teaching: 20 min
Exercises: 30 min

Questions
- How can I use NetCDF in Pandas?
- How should I structure my DataFrame?

Objectives
In the last module, we wrote a simple Python program to load the NetCDF data and print some useful information about it. An example of what this script might look like is available here. We also saw that the data is accessible as NumPy arrays. However, it would be much more useful if we could store the data as Pandas DataFrames so that we get the benefit of Pandas features for manipulating the data.
One problem we have with the emissions data is that it is stored as a 3-dimensional array, with a “time” dimension containing the months since 1751, and latitude and longitude dimensions. How do we represent this as a DataFrame, which is an inherently 2-dimensional structure?
In the Introduction to Pandas lessons, we saw how to index rows and columns of a Pandas Series or DataFrame using labels and integers. Pandas uses an Index for this purpose, which is essentially a set of labels that represent the elements of a Series or one of the axes of a DataFrame. An Index is inherently one-dimensional, so it can only represent one set of labels.
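As a quick illustration of a plain (one-level) Index before we move on to hierarchical ones, here is a minimal sketch; the labels and values are made up for the example:

```python
import pandas as pd

# A Series with a single set of string labels as its Index
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
print(s.index)    # the one-dimensional Index holding the labels
print(s['b'])     # look up a value by its label
```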
Pandas also provides the ability to index data hierarchically using a MultiIndex, which makes it possible to store and manipulate data with an arbitrary number of dimensions. You can think of a MultiIndex as an array of tuples, where each tuple is unique. A Series or DataFrame with a hierarchical index can then be manipulated as if it were a multi-dimensional array.
As an example, suppose we have an array that represents the amount of money we earned at the end of each week for the first quarter of 2017:
import numpy as np
earnings = np.array([370.72, 613.96, 98.17, 616.04, 575.00, 266.93, 243.11,
                     467.21, 349.54, 698.09, 842.06, 988.02])
With the data stored this way, it is not easy to answer a question like “what are the mean weekly earnings each month?”.
Let’s instead create a list of pairs of month labels and the dates of all the Fridays of the first quarter as follows:
month_labels = ['Jan'] * 4 + ['Feb'] * 4 + ['Mar'] * 4
date_labels = [6, 13, 20, 27, 3, 10, 17, 24, 3, 10, 17, 24]
month_date_pairs = list(zip(month_labels, date_labels))
print(month_date_pairs)
This generates the list:
[('Jan', 6), ('Jan', 13), ('Jan', 20), ('Jan', 27), ('Feb', 3), ('Feb', 10), ('Feb', 17), ('Feb', 24), ('Mar', 3), ('Mar', 10), ('Mar', 17), ('Mar', 24)]
We can now use this list to create a MultiIndex and add it to a Series representing the earnings data:
import pandas as pd
fridays_index = pd.MultiIndex.from_tuples(month_date_pairs, names=['Month','Date'])
first_quarter_earnings = pd.Series(earnings, index=fridays_index)
print(first_quarter_earnings)
The output will be:
Month Date
Jan 6 370.72
13 613.96
20 98.17
27 616.04
Feb 3 575.00
10 266.93
17 243.11
24 467.21
Mar 3 349.54
10 698.09
17 842.06
24 988.02
dtype: float64
Notice that Pandas only displays the index labels if they are not repeated. Also, we gave each level of the index a name (“Month” and “Date”). Although this is not necessary, it allows the index levels to be manipulated using the names.
Now we can easily answer the question “what are the mean weekly earnings each month?” as follows:
for month in ['Jan', 'Feb', 'Mar']:
    print('Earnings in ' + month + ' were', first_quarter_earnings[month].mean())
Which results in:
Earnings in Jan were 424.7225
Earnings in Feb were 388.0625
Earnings in Mar were 719.4275
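Because we gave the index levels names, the same per-month means can also be computed without an explicit loop, by grouping on the “Month” level. A sketch that re-creates the Series from above:

```python
import numpy as np
import pandas as pd

earnings = np.array([370.72, 613.96, 98.17, 616.04, 575.00, 266.93, 243.11,
                     467.21, 349.54, 698.09, 842.06, 988.02])
month_labels = ['Jan'] * 4 + ['Feb'] * 4 + ['Mar'] * 4
date_labels = [6, 13, 20, 27, 3, 10, 17, 24, 3, 10, 17, 24]
fridays_index = pd.MultiIndex.from_tuples(list(zip(month_labels, date_labels)),
                                          names=['Month', 'Date'])
first_quarter_earnings = pd.Series(earnings, index=fridays_index)

# Group by the "Month" level of the index and average within each group
monthly_means = first_quarter_earnings.groupby(level='Month').mean()
print(monthly_means)
```

Note that groupby sorts the group labels, so the months appear in alphabetical order rather than calendar order.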
Challenge
Answer the following questions:
- What are the total earnings for the first quarter?
- What are the total earnings each month?
- Which month was most profitable?
- Which month had the most variance in weekly earnings?
- Generate a plot of the first quarter earnings using the plot method associated with Series.
We saw that it is possible to create a MultiIndex by providing a list of all the combinations of labels that we want to use. This can become very tedious when there are large numbers of labels at each level, or multiple levels of hierarchy. One possible way to overcome this is to use a package such as itertools, which provides methods for generating combinations of elements from lists.
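For instance, itertools.product generates every combination of elements from two lists, and the resulting tuples can be fed straight to MultiIndex.from_tuples. The labels here are invented for the example:

```python
import itertools
import pandas as pd

quarters = ['Q1', 'Q2']
metrics = ['revenue', 'costs']

# itertools.product yields every (quarter, metric) combination as a tuple
pairs = list(itertools.product(quarters, metrics))
print(pairs)

# Build the MultiIndex from the generated tuples
idx = pd.MultiIndex.from_tuples(pairs, names=['Quarter', 'Metric'])
print(idx)
```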
Another approach is to use the MultiIndex.from_product() method, which generates an index from the Cartesian product of two or more iterables, combining every element of each with every element of the others. For example, suppose we wanted to create a two-level index where the labels are taken from NumPy arrays. We could do this in one statement as follows:
latitude = np.array([-86.5, -85.5, -84.5, -83.5, -82.5, -81.5])
longitude = np.array([-105.5, -104.5, -103.5, -102.5, -101.5, -100.5])
multi_index = pd.MultiIndex.from_product([latitude, longitude], names=['Latitude', 'Longitude'])
print(multi_index)
This would generate the required MultiIndex and the output:
MultiIndex(levels=[[-86.5, -85.5, -84.5, -83.5, -82.5, -81.5], [-105.5, -104.5, -103.5, -102.5, -101.5, -100.5]],
labels=[[0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5], [0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5]],
names=['Latitude', 'Longitude'])
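A quick check confirms that the index contains every latitude/longitude combination, with the first level varying slowest:

```python
import numpy as np
import pandas as pd

latitude = np.array([-86.5, -85.5, -84.5, -83.5, -82.5, -81.5])
longitude = np.array([-105.5, -104.5, -103.5, -102.5, -101.5, -100.5])
multi_index = pd.MultiIndex.from_product([latitude, longitude],
                                         names=['Latitude', 'Longitude'])

# 6 latitudes x 6 longitudes = 36 (latitude, longitude) pairs
print(len(multi_index))
print(multi_index[0])   # each entry is a tuple of one label per level
```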
Challenge
In the case of our CMIP5 global emissions data, we would like to create a MultiIndex that has three levels, one for each dimension of the variable, namely “time_counter”, “latitude”, and “longitude”. We can obtain the values for the labels from the corresponding variables in the data set. Modify the historical_co2_emissions.py script as follows:
- Obtain NumPy arrays for the “Longitude”, “Latitude”, “time_counter”, “FF”, and “AREA” netCDF variables and assign them to Python variables.
- Create a MultiIndex called emissions_index from the product of the “time_counter”, “Latitude”, and “Longitude” arrays. Use ‘Month’, ‘Latitude’, and ‘Longitude’ as the level names.
Although a hierarchical index allows us to access data in a Series or DataFrame as if it was multi-dimensional, we still have to create them using data that is one-dimensional.
Suppose that the quarterly earnings data was actually in a 2-dimensional array, rather than a one-dimensional array, and we try to create the Series the same way as before:
earnings = np.array([[370.72, 613.96, 98.17, 616.04],
                     [575.00, 266.93, 243.11, 467.21],
                     [349.54, 698.09, 842.06, 988.02]])
fridays_index = pd.MultiIndex.from_tuples(month_date_pairs, names=['Month','Date'])
first_quarter_earnings = pd.Series(earnings, index=fridays_index)
print(first_quarter_earnings)
If we run this, we get the result:
---------------------------------------------------------------------------
Exception Traceback (most recent call last)
<ipython-input-81-916f673e41ea> in <module>()
1 fridays_index = pd.MultiIndex.from_tuples(month_date_pairs, names=['Month','Date'])
----> 2 first_quarter_earnings = pd.Series(earnings, index=fridays_index)
anaconda3/lib/python3.6/site-packages/pandas/core/series.py in __init__(self, data, index, dtype, name, copy, fastpath)
246 else:
247 data = _sanitize_array(data, index, dtype, copy,
--> 248 raise_cast_failure=True)
249
250 data = SingleBlockManager(data, index, fastpath=True)
anaconda3/lib/python3.6/site-packages/pandas/core/series.py in _sanitize_array(data, index, dtype, copy, raise_cast_failure)
3025 elif subarr.ndim > 1:
3026 if isinstance(data, np.ndarray):
-> 3027 raise Exception('Data must be 1-dimensional')
3028 else:
3029 subarr = _asarray_tuplesafe(data, dtype=dtype)
Exception: Data must be 1-dimensional
This is Pandas telling us that the data must be 1-dimensional in order to create the Series. So we need to “flatten” the data before it can be wrapped in a Series. Remember, we want to use a MultiIndex to access the data as if it was multi-dimensional, not the indexes of a multi-dimensional array.
Fortunately, NumPy provides a variety of methods for reshaping arrays. One of these is the
reshape()
method,
which takes the new shape as an argument and returns an array with the new shape. As long as
the new shape has the same total number of elements, the result will just be a new “view” of
the original array, not a copy. The examples below show how the array shape can be modified:
print(earnings.reshape((2, 2, 3)))  # reshaped to 3-dimensional
print(earnings.reshape((4, 3)))     # still 2-dimensional, but a different shape
print(earnings.reshape((12,)))      # reshaped to 1-dimensional
print(earnings.reshape(-1))         # one dimension can be -1 to specify "whatever is left"
The output generated would be:
[[[ 370.72 613.96 98.17]
[ 616.04 575. 266.93]]
[[ 243.11 467.21 349.54]
[ 698.09 842.06 988.02]]]
[[ 370.72 613.96 98.17]
[ 616.04 575. 266.93]
[ 243.11 467.21 349.54]
[ 698.09 842.06 988.02]]
[ 370.72 613.96 98.17 616.04 575. 266.93 243.11 467.21 349.54
698.09 842.06 988.02]
[ 370.72 613.96 98.17 616.04 575. 266.93 243.11 467.21 349.54
698.09 842.06 988.02]
Notice that the last two produce the same result, a one-dimensional array. The reshape method allows one of the dimensions to be -1, which indicates that it should be inferred from the length of the array and the remaining dimensions.
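Because reshape() returns a view of the same data rather than a copy (whenever the memory layout allows it), modifying the reshaped array also changes the original. A small sketch with a throwaway array:

```python
import numpy as np

a = np.arange(6)         # [0 1 2 3 4 5]
b = a.reshape((2, 3))    # a 2x3 view of the same underlying data
b[0, 0] = 99             # modify an element through the view
print(a)                 # the original array sees the change
```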
So now we are able to create the Series from the 2-dimensional array as follows:
first_quarter_earnings = pd.Series(earnings.reshape(-1), index=fridays_index)
print(first_quarter_earnings)
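Once the flattened data is wrapped in a Series with the hierarchical index, individual values can be looked up as if the Series were still two-dimensional, by supplying one label per level. A sketch that rebuilds the Series from the data above:

```python
import numpy as np
import pandas as pd

earnings = np.array([[370.72, 613.96, 98.17, 616.04],
                     [575.00, 266.93, 243.11, 467.21],
                     [349.54, 698.09, 842.06, 988.02]])
month_labels = ['Jan'] * 4 + ['Feb'] * 4 + ['Mar'] * 4
date_labels = [6, 13, 20, 27, 3, 10, 17, 24, 3, 10, 17, 24]
fridays_index = pd.MultiIndex.from_tuples(list(zip(month_labels, date_labels)),
                                          names=['Month', 'Date'])
first_quarter_earnings = pd.Series(earnings.reshape(-1), index=fridays_index)

# Look up a single value by giving a label for each level
print(first_quarter_earnings.loc['Feb', 10])

# Selecting only the outer level returns all of that month's weeks
print(first_quarter_earnings.loc['Mar'])
```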
Challenge
Returning to our CMIP5 global emissions data, it should now be possible to create a Series using the NumPy array version of the ‘FF’ variable. Modify the historical_co2_emissions.py script so that it creates a Series called ff_pd using the emissions_index you created previously.
Pandas provides a variety of tools for reading and writing textual and binary file formats. For textual formats, the most common of these are:
CSV
import pandas as pd
df = pd.read_csv(filename, delimiter, header, ...)

JSON
import pandas as pd
df = pd.read_json('https://api.github.com/repos/pydata/pandas/issues?per_page=5')

HTML
import pandas as pd
df = pd.read_html('http://www.fdic.gov/bank/individual/failed/banklist.html')
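As a quick illustration of the textual readers and writers, a DataFrame can be round-tripped through CSV; here the CSV text is kept in memory with io.StringIO to avoid writing a file:

```python
import io
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# to_csv with no path argument returns the CSV text as a string
csv_text = df.to_csv(index=False)

# read_csv accepts any file-like object, including an in-memory buffer
df2 = pd.read_csv(io.StringIO(csv_text))
print(df.equals(df2))
```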
Pandas also supports reading and writing a variety of binary data formats, including:
Excel
import pandas as pd
df = pd.DataFrame({'a':[1,2,3,4], 'b':[5,6,7,8]},
                  index=pd.MultiIndex.from_product([['a','b'],['c','d']]))
df.to_excel('df.xlsx')
xlsx = pd.ExcelFile('df.xlsx')
pd.read_excel(xlsx, 'Sheet1')

HDF5
import pandas as pd
df = pd.DataFrame(dict(A=list(range(5)), B=list(range(5))))
df.to_hdf('df.h5', 'table', append=True)
pd.read_hdf('df.h5', 'table', where=['index>2'])

Feather
import pandas as pd
df = pd.DataFrame(dict(A=list(range(5)), B=list(range(5))))
df.to_feather('df.feather')
pd.read_feather('df.feather')

Msgpack
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.rand(5,2), columns=list('AB'))
df.to_msgpack('df.msg')
pd.read_msgpack('df.msg')

Stata
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(10, 2), columns=list('AB'))
df.to_stata('df.dta')
pd.read_stata('df.dta')

Pickle
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.rand(5,2), columns=list('AB'))
df.to_pickle("df.pkl.gz", compression="gzip")
pd.read_pickle("df.pkl.gz", compression="gzip")
Pandas also provides tools for managing SAS, SQL, and Google BigQuery formats. Please refer to the Pandas documentation for more information.
Key Points