Importing your data

To import your data, use pandas. You can also load one of the classic geostatistical datasets, such as Walker Lake and Jura, from our datasets module.

pandas

pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python.

The two primary data structures of pandas, Series (1-dimensional) and DataFrame (2-dimensional), handle the vast majority of typical use cases in finance, statistics, social science, and many areas of engineering. Pandas is built on top of NumPy and is intended to integrate well within a scientific computing environment with many other 3rd party libraries.

To import Pandas:

[1]:
import pandas as pd

For additional information on pandas, see the pandas documentation.

Getting Data In/Out

Pandas ships with a set of I/O functions. Reader functions are top-level functions that return a pandas object and are called like pd.read_fileformat(), while writer functions are DataFrame methods called like DataFrame.to_fileformat().

Format Type   Data Description       Reader           Writer
text          CSV                    read_csv         to_csv
text          JSON                   read_json        to_json
text          HTML                   read_html        to_html
text          Local clipboard        read_clipboard   to_clipboard
binary        MS Excel               read_excel       to_excel
binary        HDF5 Format            read_hdf         to_hdf
binary        Feather Format         read_feather     to_feather
binary        Parquet Format         read_parquet     to_parquet
binary        Msgpack                read_msgpack     to_msgpack      (both removed in pandas 1.0)
binary        Stata                  read_stata       to_stata
binary        SAS                    read_sas         (no writer)
binary        Python Pickle Format   read_pickle      to_pickle
SQL           SQL                    read_sql         to_sql
SQL           Google Big Query       read_gbq         to_gbq
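
The reader/writer pairing can be illustrated with a small round trip. This is only a sketch: the DataFrame below is made up for the example, and an in-memory buffer stands in for a file on disk.

```python
import io

import pandas as pd

# Small made-up table, used only for illustration.
df = pd.DataFrame({"X": [11.0, 8.0], "Y": [8.0, 30.0], "V": [0.0, 224.4]})

# Writer: a DataFrame method. With no path given, to_csv returns the CSV text.
csv_text = df.to_csv(index=False)

# Reader: a top-level pandas function that returns a new DataFrame.
df_back = pd.read_csv(io.StringIO(csv_text))

print(df_back)
```

Passing index=False keeps the row index out of the file, so reading it back does not produce an extra unnamed column.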

Importing a .csv dataset:

[2]:
path = "data/"
file = "walker.csv"

data = pd.read_csv(path + file, sep=",", na_values=-999)
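
The na_values argument tells the reader which values in the file should be treated as missing. A minimal sketch of its effect, assuming a hypothetical CSV in which -999 flags a missing measurement (the text below is invented for the example, not taken from walker.csv):

```python
import io

import pandas as pd

# Hypothetical CSV text in which -999 marks a missing U measurement.
csv_text = "Id,V,U\n1,0.0,-999\n2,224.4,15.6\n"

# Without na_values the sentinel is read as an ordinary number.
raw = pd.read_csv(io.StringIO(csv_text))

# With na_values the sentinel becomes NaN.
clean = pd.read_csv(io.StringIO(csv_text), na_values=-999)

print(raw["U"].tolist())
print(clean["U"].tolist())
```

Leaving the sentinel in place would distort every statistic computed on that column, which is why it is converted at import time.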

Viewing Data

df.head(n) shows the first n rows of the DataFrame (n=5 if nothing is passed), while df.tail(n) shows the last n rows.

[3]:
data.head()
[3]:
Id X Y V U T
0 1.0 11.0 8.0 0.0 NaN 2.0
1 2.0 8.0 30.0 0.0 NaN 2.0
2 3.0 9.0 48.0 224.4 NaN 2.0
3 4.0 8.0 69.0 434.4 NaN 2.0
4 5.0 9.0 90.0 412.1 NaN 2.0
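
As a quick sketch of how n affects both methods, using a made-up ten-row DataFrame rather than the Walker Lake data:

```python
import pandas as pd

# Made-up DataFrame with ten rows, V = 0..9.
df = pd.DataFrame({"V": range(10)})

first_default = df.head()   # no argument: first 5 rows
first_three = df.head(3)    # first 3 rows
last_two = df.tail(2)       # last 2 rows

print(len(first_default), first_three["V"].tolist(), last_two["V"].tolist())
```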

Filtering

Accessing variable columns:

[4]:
data[["U", "V"]].head()
[4]:
U V
0 NaN 0.0
1 NaN 0.0
2 NaN 224.4
3 NaN 434.4
4 NaN 412.1

Selecting the rows where variable V is greater than 640:

[5]:
df_filter = data['V'] > 640

data[df_filter].head()
[5]:
Id X Y V U T
18 19.0 31.0 68.0 895.2 NaN 2.0
19 20.0 28.0 88.0 702.6 NaN 2.0
30 31.0 49.0 11.0 653.3 NaN 2.0
34 35.0 50.0 88.0 820.8 NaN 2.0
37 38.0 49.0 151.0 773.3 NaN 2.0

Selecting the rows where variable T is equal to 1:

[6]:
df_filter = data["T"] == 1

data[df_filter].head()
[6]:
Id X Y V U T
11 12.0 10.0 231.0 82.1 NaN 1.0
12 13.0 11.0 250.0 81.1 NaN 1.0
44 45.0 51.0 290.0 159.6 NaN 1.0
55 56.0 69.0 208.0 97.4 NaN 1.0
56 57.0 69.0 229.0 0.0 NaN 1.0
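
Both filters above are boolean Series (masks), so they can be combined before indexing. A minimal sketch with made-up values, assuming you want rows that satisfy both conditions or at least one of them:

```python
import pandas as pd

# Made-up values, for illustration only.
df = pd.DataFrame({
    "V": [895.2, 82.1, 702.6, 97.4, 100.0],
    "T": [2.0, 1.0, 1.0, 1.0, 2.0],
})

high_v = df["V"] > 640   # boolean mask, one entry per row
type_1 = df["T"] == 1

both = df[high_v & type_1]     # rows where both conditions hold
either = df[high_v | type_1]   # rows where at least one condition holds

print(both)
print(either)
```

Use & and | rather than the Python keywords and / or, and wrap each comparison in parentheses when writing the conditions inline.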

describe() shows a quick statistical summary of your data:

[7]:
data[["U", "V"]].describe()
[7]:
U V
count 275.000000 470.000000
mean 604.081091 435.298723
std 767.405620 299.882302
min 0.000000 0.000000
25% 82.150000 184.600000
50% 319.300000 424.000000
75% 844.550000 640.850000
max 5190.100000 1528.100000
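
The count row differs between U and V because describe() ignores missing values column by column. A small sketch with made-up numbers shows the same behaviour:

```python
import numpy as np
import pandas as pd

# Made-up data: U has one missing value, V has none.
df = pd.DataFrame({"U": [np.nan, 10.0, 30.0], "V": [0.0, 100.0, 200.0]})

summary = df.describe()

# count reports non-missing values only; the other statistics skip NaN too.
print(summary.loc[["count", "mean"]])
```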

corr() shows the correlation matrix:

[8]:
data[["U", "V"]].corr()
[8]:
U V
U 1.000000 0.551482
V 0.551482 1.000000
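
By default corr() computes the Pearson coefficient and leaves out, for each pair of columns, the rows where either value is missing. A sketch with made-up data in which the only row that breaks the linear relation has a missing U:

```python
import numpy as np
import pandas as pd

# Made-up data: V = 2 * U everywhere except the row where U is missing.
df = pd.DataFrame({
    "U": [1.0, 2.0, np.nan, 4.0],
    "V": [2.0, 4.0, 100.0, 8.0],
})

corr = df.corr()  # Pearson by default, NaN pairs excluded

print(corr)
```

For a partially sampled variable such as U, this means the coefficient is based only on the locations where both variables were measured.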

Selection

Selecting a single column, which yields a Series:

[9]:
V_variable = data.V

# which is the same as:
V_variable = data["V"]
[10]:
type(V_variable)
[10]:
pandas.core.series.Series

To select values:

df.at can only access a single value at a time.

df.loc can select multiple rows and/or columns.

[11]:
data.loc[[3, 4, 5]]  # rows labeled 3, 4 and 5, all columns
[11]:
Id X Y V U T
3 4.0 8.0 69.0 434.4 NaN 2.0
4 5.0 9.0 90.0 412.1 NaN 2.0
5 6.0 10.0 110.0 587.2 NaN 2.0
[12]:
data.at[4, "V"] #index 4 for variable V
[12]:
412.1
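
Both accessors work with index and column labels, not positions. The sketch below uses a made-up DataFrame whose index starts at 3 to make that visible:

```python
import pandas as pd

# Made-up DataFrame whose row labels are 3, 4 and 5.
df = pd.DataFrame(
    {"X": [8.0, 9.0, 10.0], "V": [434.4, 412.1, 587.2]},
    index=[3, 4, 5],
)

one_value = df.at[4, "V"]            # a single scalar
subset = df.loc[[3, 5], ["X", "V"]]  # several rows and columns: a DataFrame
label_slice = df.loc[3:4]            # label slices include both endpoints

print(one_value)
print(subset)
print(label_slice)
```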

Setting

First, let's create a capped U variable, replacing every value above 2535 with 2535:

[13]:
import numpy as np
[14]:
U_cap = np.where(data["U"] > 2535, 2535, data["U"])

Now let's add it to the DataFrame as a new column named "U capped":

[15]:
data["U capped"] = U_cap
[16]:
data.tail()
[16]:
Id X Y V U T U capped
465 466.0 214.0 19.0 242.5 15.6 2.0 15.6
466 467.0 245.0 231.0 161.2 26.1 2.0 26.1
467 468.0 233.0 220.0 626.0 959.7 2.0 959.7
468 469.0 226.0 221.0 800.1 1681.5 2.0 1681.5
469 470.0 213.0 218.0 482.6 476.2 2.0 476.2
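
The same capping can also be written with Series.clip, which is offered here as an alternative and is not what the cells above use. A sketch on a made-up Series that includes a missing value:

```python
import numpy as np
import pandas as pd

# Made-up U values, including one above the cap and one missing.
u = pd.Series([15.6, 5190.1, np.nan, 2535.0], name="U")

# Approach used above: np.where returns a NumPy array.
capped_where = pd.Series(np.where(u > 2535, 2535, u), index=u.index)

# Alternative: clip returns a Series and keeps the index and name.
capped_clip = u.clip(upper=2535)

print(capped_clip.tolist())
```

In both versions missing values stay missing, because the comparison NaN > 2535 is False and clip leaves NaN untouched.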

df.loc and df.at can be used to set values too:

[17]:
data.at[4, "V"] = 0
[18]:
data.at[4, "V"]
[18]:
0.0
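
Setting with df.loc follows the same pattern and accepts a boolean mask, so many cells can be changed at once. A minimal sketch on made-up data, assuming negative V values should be reset to zero:

```python
import pandas as pd

# Made-up data with one negative V value.
df = pd.DataFrame({"V": [0.0, 224.4, -5.0], "T": [2.0, 2.0, 1.0]})

df.at[1, "V"] = 200.0            # set a single cell
df.loc[df["V"] < 0, "V"] = 0.0   # set V in every row matching the mask

print(df)
```

Both assignments modify the DataFrame in place.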

Missing data

To drop every row that has missing data, use df.dropna(). This is especially useful for extracting an isotopic dataset, that is, one in which every variable is known at every remaining location.

[19]:
data.dropna().head()
[19]:
Id X Y V U T U capped
195 196.0 40.0 71.0 76.2 1.1 2.0 1.1
196 197.0 21.0 69.0 284.3 7.8 2.0 7.8
197 198.0 28.0 80.0 606.8 105.3 2.0 105.3
198 199.0 29.0 59.0 772.7 1512.7 2.0 1512.7
199 200.0 41.0 81.0 269.5 9.8 2.0 9.8
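
dropna() returns a new DataFrame and leaves the original untouched. Its subset parameter restricts the check to chosen columns, which helps when only some variables must be complete. A sketch with made-up values:

```python
import numpy as np
import pandas as pd

# Made-up data: U is missing in two rows, T in one.
df = pd.DataFrame({
    "V": [0.0, 224.4, 606.8],
    "U": [np.nan, np.nan, 105.3],
    "T": [2.0, np.nan, 2.0],
})

isotopic = df.dropna()                   # keep rows complete in every column
v_and_t = df.dropna(subset=["V", "T"])   # only V and T must be present

print(isotopic)
print(v_and_t)
```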