Importing your data¶
To import your data, use pandas. You can also load one of the classic geostatistical datasets, such as Walker Lake and Jura, from our datasets module.

pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python.
The two primary data structures of pandas, Series (1-dimensional) and DataFrame (2-dimensional), handle the vast majority of typical use cases in finance, statistics, social science, and many areas of engineering. Pandas is built on top of NumPy and is intended to integrate well within a scientific computing environment with many other 3rd party libraries.
To import Pandas:
[1]:
import pandas as pd
For additional information on pandas, see the pandas documentation.
Getting Data In/Out¶
Pandas comes with a set of IO functions that return a pandas object. Reader functions are accessed like pd.read_fileformat(), while writer functions are accessed like DataFrame.to_fileformat().
| Format Type | Data Description | Reader | Writer |
|---|---|---|---|
| text | CSV | read_csv | to_csv |
| text | JSON | read_json | to_json |
| text | HTML | read_html | to_html |
| text | Local clipboard | read_clipboard | to_clipboard |
| binary | MS Excel | read_excel | to_excel |
| binary | HDF5 Format | read_hdf | to_hdf |
| binary | Feather Format | read_feather | to_feather |
| binary | Parquet Format | read_parquet | to_parquet |
| binary | Msgpack (removed in pandas 1.0) | read_msgpack | to_msgpack |
| binary | Stata | read_stata | to_stata |
| binary | SAS | read_sas | |
| binary | Python Pickle Format | read_pickle | to_pickle |
| SQL | SQL | read_sql | to_sql |
| SQL | Google Big Query | read_gbq | to_gbq |
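As a minimal sketch of the reader/writer pairing in the table, the example below writes a small DataFrame with to_csv and reads it back with read_csv. The frame `samples` and its values are made up for illustration, and an in-memory buffer stands in for a file on disk.

```python
import io

import pandas as pd

# Small illustrative frame; the values are invented for this example.
samples = pd.DataFrame(
    {"X": [11.0, 8.0, 9.0], "Y": [8.0, 30.0, 48.0], "V": [0.0, 0.0, 224.5]}
)

# Writer: DataFrame.to_csv serialises the frame as text.
# index=False leaves the row index out of the output.
buffer = io.StringIO()
samples.to_csv(buffer, index=False)

# Reader: pd.read_csv parses the text back into a DataFrame.
buffer.seek(0)
restored = pd.read_csv(buffer)
print(restored)
```

The other formats follow the same pattern, although several of them (Excel, Parquet, HDF5, for example) rely on optional dependencies being installed.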
Importing a .csv dataset:
[2]:
path = "data/"
file = "walker.csv"
data = pd.read_csv(path + file, sep=",", na_values=-999)
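The na_values argument tells read_csv which sentinel values to treat as missing. The sketch below shows the effect on a tiny, hypothetical CSV string in which -999 marks an unsampled value; the contents of `raw` are invented and do not come from walker.csv.

```python
import io

import pandas as pd

# Hypothetical file contents: -999 stands for "not sampled".
raw = "Id,X,Y,V,U\n1,11,8,0.0,-999\n2,8,30,0.0,-999\n3,9,48,224.5,15.5\n"

# Without na_values, the sentinel is read as an ordinary number.
as_is = pd.read_csv(io.StringIO(raw), sep=",")

# With na_values, every -999 becomes NaN and is skipped by statistics.
cleaned = pd.read_csv(io.StringIO(raw), sep=",", na_values=-999)

print(as_is["U"].mean())    # distorted by the -999 entries
print(cleaned["U"].mean())  # computed from the valid sample only
```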
Viewing Data¶
df.head(n) shows the first n rows of the DataFrame (n=5 if nothing is passed), while df.tail(n) shows the last n rows.
[3]:
data.head()
[3]:
| | Id | X | Y | V | U | T |
|---|---|---|---|---|---|---|
| 0 | 1.0 | 11.0 | 8.0 | 0.0 | NaN | 2.0 |
| 1 | 2.0 | 8.0 | 30.0 | 0.0 | NaN | 2.0 |
| 2 | 3.0 | 9.0 | 48.0 | 224.4 | NaN | 2.0 |
| 3 | 4.0 | 8.0 | 69.0 | 434.4 | NaN | 2.0 |
| 4 | 5.0 | 9.0 | 90.0 | 412.1 | NaN | 2.0 |
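A short sketch of head and tail on a toy frame, to make the default and the explicit row counts concrete. The frame `frame` is hypothetical and unrelated to the Walker Lake data.

```python
import pandas as pd

# Toy frame with ten rows, indexed 0 to 9.
frame = pd.DataFrame({"V": range(10)})

first_five = frame.head()   # n defaults to 5
first_two = frame.head(2)   # first two rows only
last_three = frame.tail(3)  # last three rows, original index kept

print(last_three)
```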
Filtering¶
Accessing variable columns:
[4]:
data[["U", "V"]].head()
[4]:
| | U | V |
|---|---|---|
| 0 | NaN | 0.0 |
| 1 | NaN | 0.0 |
| 2 | NaN | 224.4 |
| 3 | NaN | 434.4 |
| 4 | NaN | 412.1 |
Accessing the rows of the DataFrame where variable V is greater than 640:
[5]:
df_filter = data['V'] > 640
data[df_filter].head()
[5]:
| | Id | X | Y | V | U | T |
|---|---|---|---|---|---|---|
| 18 | 19.0 | 31.0 | 68.0 | 895.2 | NaN | 2.0 |
| 19 | 20.0 | 28.0 | 88.0 | 702.6 | NaN | 2.0 |
| 30 | 31.0 | 49.0 | 11.0 | 653.3 | NaN | 2.0 |
| 34 | 35.0 | 50.0 | 88.0 | 820.8 | NaN | 2.0 |
| 37 | 38.0 | 49.0 | 151.0 | 773.3 | NaN | 2.0 |
Accessing the rows of the DataFrame where variable T is equal to 1:
[6]:
df_filter = data["T"] == 1
data[df_filter].head()
[6]:
| | Id | X | Y | V | U | T |
|---|---|---|---|---|---|---|
| 11 | 12.0 | 10.0 | 231.0 | 82.1 | NaN | 1.0 |
| 12 | 13.0 | 11.0 | 250.0 | 81.1 | NaN | 1.0 |
| 44 | 45.0 | 51.0 | 290.0 | 159.6 | NaN | 1.0 |
| 55 | 56.0 | 69.0 | 208.0 | 97.4 | NaN | 1.0 |
| 56 | 57.0 | 69.0 | 229.0 | 0.0 | NaN | 1.0 |
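Boolean filters like the ones above can be combined. The sketch below, on a small made-up frame, uses `&` (and), `|` (or) and `~` (not); each comparison needs its own parentheses when written inline, because these operators bind more tightly than `>` or `==`.

```python
import pandas as pd

# Invented values for illustration only.
samples = pd.DataFrame(
    {"V": [0.0, 702.6, 653.3, 82.1, 895.2], "T": [2.0, 2.0, 1.0, 1.0, 2.0]}
)

high_v = samples["V"] > 640
type_two = samples["T"] == 2

both = samples[high_v & type_two]               # V > 640 and T == 2
either = samples[high_v | (samples["T"] == 1)]  # V > 640 or T == 1
not_high = samples[~high_v]                     # V <= 640

print(both)
```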
describe() shows a quick statistical summary of your data:
[7]:
data[["U", "V"]].describe()
[7]:
| | U | V |
|---|---|---|
| count | 275.000000 | 470.000000 |
| mean | 604.081091 | 435.298723 |
| std | 767.405620 | 299.882302 |
| min | 0.000000 | 0.000000 |
| 25% | 82.150000 | 184.600000 |
| 50% | 319.300000 | 424.000000 |
| 75% | 844.550000 | 640.850000 |
| max | 5190.100000 | 1528.100000 |
corr() shows the correlation matrix:
[8]:
data[["U", "V"]].corr()
[8]:
| | U | V |
|---|---|---|
| U | 1.000000 | 0.551482 |
| V | 0.551482 | 1.000000 |
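The different counts reported for U and V come from missing values: describe() counts only the non-missing entries of each column, and corr() excludes missing values pair by pair. A minimal sketch on invented data, assuming NaN marks the unsampled locations:

```python
import numpy as np
import pandas as pd

# Invented data: U is missing at the first two locations.
samples = pd.DataFrame(
    {"U": [np.nan, np.nan, 1.0, 2.0, 4.0], "V": [10.0, 20.0, 2.0, 4.0, 8.0]}
)

# count is reported per column and ignores NaN.
summary = samples.describe()

# corr() uses only the rows where both variables are present,
# so for two columns it matches the correlation of the complete rows.
pairwise = samples.corr()
complete = samples.dropna().corr()

print(summary.loc["count"])
print(pairwise)
```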
Selection¶
Selecting a single column, which yields a Series:
[9]:
V_variable = data.V
# which is the same as:
V_variable = data["V"]
[10]:
type(V_variable)
[10]:
pandas.core.series.Series
To select values:
df.at can only access a single value at a time.
df.loc can select multiple rows and/or columns.
[11]:
data.loc[[3, 4, 5]] #indexes 3, 4 and 5 for all columns
[11]:
| | Id | X | Y | V | U | T |
|---|---|---|---|---|---|---|
| 3 | 4.0 | 8.0 | 69.0 | 434.4 | NaN | 2.0 |
| 4 | 5.0 | 9.0 | 90.0 | 412.1 | NaN | 2.0 |
| 5 | 6.0 | 10.0 | 110.0 | 587.2 | NaN | 2.0 |
[12]:
data.at[4, "V"] #index 4 for variable V
[12]:
412.1
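To complement the two cells above, here is a sketch of df.loc selecting rows and columns together, next to df.at for a single value. The frame is a small hypothetical one; note that label slices with df.loc include both endpoints.

```python
import pandas as pd

# Invented values, default integer index 0..3.
samples = pd.DataFrame(
    {"X": [11.0, 8.0, 9.0, 8.0], "V": [0.0, 0.0, 224.4, 434.4]}
)

block = samples.loc[[1, 2], ["V"]]  # list of rows and columns -> DataFrame
row_slice = samples.loc[1:3, "V"]   # label slice 1..3 inclusive -> Series
single = samples.at[2, "V"]         # exactly one scalar value

print(block)
print(row_slice)
print(single)
```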
Setting¶
First, let's create a capped U variable:
[13]:
import numpy as np
[14]:
U_cap = np.where(data["U"] > 2535, 2535, data["U"])
Now let's add the capped values to the DataFrame as a new column named "U capped":
[15]:
data["U capped"] = U_cap
[16]:
data.tail()
[16]:
| | Id | X | Y | V | U | T | U capped |
|---|---|---|---|---|---|---|---|
| 465 | 466.0 | 214.0 | 19.0 | 242.5 | 15.6 | 2.0 | 15.6 |
| 466 | 467.0 | 245.0 | 231.0 | 161.2 | 26.1 | 2.0 | 26.1 |
| 467 | 468.0 | 233.0 | 220.0 | 626.0 | 959.7 | 2.0 | 959.7 |
| 468 | 469.0 | 226.0 | 221.0 | 800.1 | 1681.5 | 2.0 | 1681.5 |
| 469 | 470.0 | 213.0 | 218.0 | 482.6 | 476.2 | 2.0 | 476.2 |
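As an alternative sketch, the same capping can be written with Series.clip, which returns a pandas Series (keeping the index) instead of a NumPy array. The series `u` below is invented for illustration; the threshold 2535 is the one used above.

```python
import numpy as np
import pandas as pd

# Invented values, including one missing and one above the cap.
u = pd.Series([15.6, 959.7, np.nan, 5190.1], name="U")

# np.where approach: returns a NumPy array.
capped_where = np.where(u > 2535, 2535, u)

# Series.clip approach: returns a Series; NaN stays NaN.
capped_clip = u.clip(upper=2535)

print(capped_clip)
```

Both approaches leave missing values untouched, since a comparison with NaN evaluates to False.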
df.loc and df.at can be used to set values too:
[17]:
data.at[4, "V"] = 0
[18]:
data.at[4, "V"]
[18]:
0.0
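Setting with df.loc works the same way for several rows at once, for instance through a boolean filter. A minimal sketch on made-up values (the replacement value 400.0 is arbitrary):

```python
import pandas as pd

# Invented values for illustration only.
samples = pd.DataFrame(
    {"V": [0.0, 224.4, 434.4, 412.1], "T": [2.0, 2.0, 1.0, 1.0]}
)

# df.at sets a single cell.
samples.at[1, "V"] = 0.0

# df.loc with a boolean filter sets V on every row where T equals 1.
samples.loc[samples["T"] == 1, "V"] = 400.0

print(samples)
```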
Missing data¶
To drop any rows that have missing data, use df.dropna(). This is especially useful for extracting an isotopic dataset, that is, one in which every variable is sampled at every location.
[19]:
data.dropna().head()
[19]:
| | Id | X | Y | V | U | T | U capped |
|---|---|---|---|---|---|---|---|
| 195 | 196.0 | 40.0 | 71.0 | 76.2 | 1.1 | 2.0 | 1.1 |
| 196 | 197.0 | 21.0 | 69.0 | 284.3 | 7.8 | 2.0 | 7.8 |
| 197 | 198.0 | 28.0 | 80.0 | 606.8 | 105.3 | 2.0 | 105.3 |
| 198 | 199.0 | 29.0 | 59.0 | 772.7 | 1512.7 | 2.0 | 1512.7 |
| 199 | 200.0 | 41.0 | 81.0 | 269.5 | 9.8 | 2.0 | 9.8 |
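df.dropna() returns a new DataFrame and accepts arguments that control which rows are dropped. The sketch below, on invented data, contrasts the default behaviour with the subset and how parameters.

```python
import numpy as np
import pandas as pd

# Invented data: U is sparsely sampled, V is missing once.
samples = pd.DataFrame(
    {
        "V": [0.0, 224.4, 434.4, np.nan],
        "U": [np.nan, np.nan, 15.6, np.nan],
        "T": [2.0, 2.0, 1.0, 2.0],
    }
)

all_present = samples.dropna()         # keep rows with every column present
has_v = samples.dropna(subset=["V"])   # only require V to be present
not_empty = samples.dropna(how="all")  # drop rows where all columns are NaN

print(all_present)
```

The original `samples` frame is left unchanged, so the result has to be assigned to a name to be kept.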