Importing your data¶

To import your data you should use pandas. It is also possible to load one of the classic geostatistical datasets, such as Walker Lake and Jura, from our datasets module.

pandas

pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python.

The two primary data structures of pandas, Series (1-dimensional) and DataFrame (2-dimensional), handle the vast majority of typical use cases in finance, statistics, social science, and many areas of engineering. Pandas is built on top of NumPy and is intended to integrate well within a scientific computing environment with many other 3rd party libraries.

To import Pandas:

[1]:

import pandas as pd

For additional information on pandas, see the pandas documentation.

Getting Data In/Out¶

pandas ships with a set of I/O functions that each return a pandas object. Reader functions are accessed like pd.read_fileformat() while writer functions are accessed like DataFrame.to_fileformat().

Format Type	Data Description	Reader	Writer
text	CSV	read_csv	to_csv
text	JSON	read_json	to_json
text	HTML	read_html	to_html
text	Local clipboard	read_clipboard	to_clipboard
binary	MS Excel	read_excel	to_excel
binary	HDF5 Format	read_hdf	to_hdf
binary	Feather Format	read_feather	to_feather
binary	Parquet Format	read_parquet	to_parquet
binary	Msgpack (removed in pandas 1.0)	read_msgpack	to_msgpack
binary	Stata	read_stata	to_stata
binary	SAS	read_sas	(none)
binary	Python Pickle Format	read_pickle	to_pickle
SQL	SQL	read_sql	to_sql
SQL	Google Big Query	read_gbq	to_gbq
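The reader/writer pairing can be sketched with a small round trip. This is only an illustration: the table below is made up, and an in-memory buffer stands in for a file on disk.

```python
import io

import pandas as pd

# Small made-up table standing in for a real dataset.
df = pd.DataFrame({"X": [11.0, 8.0, 9.0], "V": [0.0, 0.0, 224.4]})

# Writer: DataFrame.to_csv() returns the CSV text when no path is given.
csv_text = df.to_csv(index=False)

# Reader: pd.read_csv() accepts a path or any file-like object.
round_trip = pd.read_csv(io.StringIO(csv_text))

print(round_trip)
```

The same pattern applies to the other formats in the table, although some of them need optional dependencies to be installed.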

Importing a .csv dataset:

[2]:

path = "data/"
file = "walker.csv"

data = pd.read_csv(path + file, sep=",", na_values=-999)
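To see what na_values does, here is a minimal sketch that reads the same text twice, with and without the option. The CSV text is invented for the example and only mimics the layout of walker.csv; it assumes that -999 is the code used for missing measurements.

```python
import io

import pandas as pd

# Made-up CSV text where -999 marks a missing measurement.
csv_text = "Id,X,Y,V,U\n1,11,8,0.0,-999\n2,8,30,224.4,15.6\n"

# Without na_values the sentinel is read as an ordinary number.
raw = pd.read_csv(io.StringIO(csv_text))

# With na_values=-999 the sentinel becomes NaN.
clean = pd.read_csv(io.StringIO(csv_text), na_values=-999)

print(raw["U"].tolist())
print(clean["U"].isna().tolist())
```

Leaving the sentinel in place would distort every statistic computed on the column, so it is worth declaring it at import time.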

Viewing Data¶

df.head(n) shows the first n rows of the DataFrame (n=5 if nothing is passed), while df.tail(n) shows the last n rows.

[3]:

data.head()

[3]:

	Id	X	Y	V	U	T
0	1.0	11.0	8.0	0.0	NaN	2.0
1	2.0	8.0	30.0	0.0	NaN	2.0
2	3.0	9.0	48.0	224.4	NaN	2.0
3	4.0	8.0	69.0	434.4	NaN	2.0
4	5.0	9.0	90.0	412.1	NaN	2.0
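A short sketch of head() and tail() on a made-up ten-row DataFrame, so the returned rows are easy to check by eye:

```python
import pandas as pd

# Made-up DataFrame with Ids 1 to 10.
df = pd.DataFrame({"Id": range(1, 11)})

first_three = df.head(3)   # rows with Id 1, 2, 3
last_three = df.tail(3)    # rows with Id 8, 9, 10
default_head = df.head()   # no argument: first 5 rows

print(first_three)
print(last_three)
```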

Filtering¶

Accessing variable columns:

[4]:

data[["U", "V"]].head()

[4]:

	U	V
0	NaN	0.0
1	NaN	0.0
2	NaN	224.4
3	NaN	434.4
4	NaN	412.1

Accessing the rows of the DataFrame where variable V is greater than 640:

[5]:

df_filter = data['V'] > 640

data[df_filter].head()

[5]:

	Id	X	Y	V	U	T
18	19.0	31.0	68.0	895.2	NaN	2.0
19	20.0	28.0	88.0	702.6	NaN	2.0
30	31.0	49.0	11.0	653.3	NaN	2.0
34	35.0	50.0	88.0	820.8	NaN	2.0
37	38.0	49.0	151.0	773.3	NaN	2.0

Accessing the rows of the DataFrame where variable T is equal to 1:

[6]:

df_filter = data["T"] == 1

data[df_filter].head()

[6]:

	Id	X	Y	V	U	T
11	12.0	10.0	231.0	82.1	NaN	1.0
12	13.0	11.0	250.0	81.1	NaN	1.0
44	45.0	51.0	290.0	159.6	NaN	1.0
55	56.0	69.0	208.0	97.4	NaN	1.0
56	57.0	69.0	229.0	0.0	NaN	1.0
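The two filters above are boolean Series, so they can be combined. The sketch below uses a small made-up DataFrame with the same column names; & means "and", | means "or".

```python
import pandas as pd

# Made-up values reusing the V and T column names.
df = pd.DataFrame({"V": [0.0, 895.2, 702.6, 82.1], "T": [2.0, 2.0, 1.0, 1.0]})

high_v = df["V"] > 640    # boolean Series, one entry per row
type_one = df["T"] == 1

both = df[high_v & type_one]    # rows satisfying both conditions
either = df[high_v | type_one]  # rows satisfying at least one

print(both)
print(either)
```

When the comparisons are written inline, each one needs its own parentheses, for example df[(df["V"] > 640) & (df["T"] == 1)], because & binds more tightly than > and ==.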

describe() shows a quick statistical summary of your data:

[7]:

data[["U", "V"]].describe()

[7]:

	U	V
count	275.000000	470.000000
mean	604.081091	435.298723
std	767.405620	299.882302
min	0.000000	0.000000
25%	82.150000	184.600000
50%	319.300000	424.000000
75%	844.550000	640.850000
max	5190.100000	1528.100000
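Note that count differs between U and V in the summary above because describe() skips missing values. A minimal sketch with made-up numbers:

```python
import numpy as np
import pandas as pd

# Made-up data: U is missing at two of the four locations.
df = pd.DataFrame({
    "U": [np.nan, np.nan, 15.6, 26.1],
    "V": [0.0, 224.4, 242.5, 161.2],
})

summary = df.describe()

# count only includes the non-missing entries of each column.
print(summary.loc["count"])
print(df["U"].isna().sum())  # number of missing U values
```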

corr() shows the correlation matrix:

[8]:

data[["U", "V"]].corr()

[8]:

	U	V
U	1.000000	0.551482
V	0.551482	1.000000
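By default corr() computes the Pearson coefficient and, for each pair of columns, uses only the rows where both values are present. The sketch below, on made-up data, also shows the method argument for a rank-based (Spearman) coefficient.

```python
import numpy as np
import pandas as pd

# Made-up data: V increases with U but not linearly; one U value is missing.
df = pd.DataFrame({
    "U": [np.nan, 1.0, 2.0, 3.0],
    "V": [10.0, 2.0, 4.0, 60.0],
})

# Pearson (default): the row with missing U is ignored for the U-V pair.
pearson = df.corr().loc["U", "V"]

# Spearman works on ranks, so a monotonic relation gives exactly 1.
spearman = df.corr(method="spearman").loc["U", "V"]

print(pearson, spearman)
```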

Selection¶

Selecting a single column, which yields a Series:

[9]:

V_variable = data.V

# which is the same as:
V_variable = data["V"]

[10]:

type(V_variable)

[10]:

pandas.core.series.Series

To select values:

df.at can only access a single value at a time.

df.loc can select multiple rows and/or columns.

[11]:

data.loc[[3, 4, 5]]  # rows labeled 3, 4 and 5, all columns

[11]:

	Id	X	Y	V	U	T
3	4.0	8.0	69.0	434.4	NaN	2.0
4	5.0	9.0	90.0	412.1	NaN	2.0
5	6.0	10.0	110.0	587.2	NaN	2.0

[12]:

data.at[4, "V"]  # row labeled 4, variable V

[12]:

412.1
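Both df.at and df.loc select by label, not by position. A short sketch on a made-up DataFrame whose index deliberately starts at 3:

```python
import pandas as pd

# Made-up rows; the index labels are 3, 4, 5 rather than 0, 1, 2.
df = pd.DataFrame(
    {"X": [8.0, 9.0, 10.0], "V": [434.4, 412.1, 587.2]},
    index=[3, 4, 5],
)

single = df.at[4, "V"]          # one scalar value
subset = df.loc[[3, 5], ["V"]]  # lists of labels give a DataFrame
sliced = df.loc[3:4, "V"]       # label slices include both endpoints

print(single)
print(subset)
print(sliced)
```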

Setting¶

First, let's create a capped U variable:

[13]:

import numpy as np

[14]:

U_cap = np.where(data["U"] > 2535, 2535, data["U"])

Now let's add the capped values to the DataFrame as a new column named U capped:

[15]:

data["U capped"] = U_cap

[16]:

data.tail()

[16]:

	Id	X	Y	V	U	T	U capped
465	466.0	214.0	19.0	242.5	15.6	2.0	15.6
466	467.0	245.0	231.0	161.2	26.1	2.0	26.1
467	468.0	233.0	220.0	626.0	959.7	2.0	959.7
468	469.0	226.0	221.0	800.1	1681.5	2.0	1681.5
469	470.0	213.0	218.0	482.6	476.2	2.0	476.2
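As an alternative sketch, the same capping can be done with Series.clip(), which stays within pandas and keeps the index. The values below are made up; 2535 is the cap used above.

```python
import numpy as np
import pandas as pd

# Made-up U values, including one above the cap and one missing.
u = pd.Series([15.6, 5190.1, np.nan, 1681.5], name="U")
cap = 2535

capped_where = np.where(u > cap, cap, u)  # NumPy array, as in the cell above
capped_clip = u.clip(upper=cap)           # pandas Series, same values

print(capped_clip)
```

Both approaches leave missing values as NaN, since NaN > 2535 evaluates to False and clip() ignores NaN.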

df.loc and df.at can be used to set values too:

[17]:

data.at[4, "V"] = 0

[18]:

data.at[4, "V"]

[18]:

0.0
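Setting with df.loc works on several rows at once, including rows picked by a condition. A minimal sketch on made-up data (the replacement values are arbitrary):

```python
import pandas as pd

# Made-up values reusing the V and T column names.
df = pd.DataFrame({"V": [434.4, 412.1, 587.2], "T": [2.0, 2.0, 1.0]})

df.at[1, "V"] = 0.0               # one cell
df.loc[df["T"] == 1, "V"] = -1.0  # every row where T equals 1
df.loc[[0, 1], "T"] = 3.0         # a list of row labels

print(df)
```

These assignments modify df in place, so it can be worth working on df.copy() when the original values are still needed.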

Missing data¶

To drop any rows that have missing data, use df.dropna(). This is especially useful for extracting an isotopic dataset, that is, one where every variable is sampled at every location.

[19]:

data.dropna().head()

[19]:

	Id	X	Y	V	U	T	U capped
195	196.0	40.0	71.0	76.2	1.1	2.0	1.1
196	197.0	21.0	69.0	284.3	7.8	2.0	7.8
197	198.0	28.0	80.0	606.8	105.3	2.0	105.3
198	199.0	29.0	59.0	772.7	1512.7	2.0	1512.7
199	200.0	41.0	81.0	269.5	9.8	2.0	9.8
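dropna() can also be restricted to some columns with the subset argument, which helps when only a few variables need to be complete. A sketch on made-up data:

```python
import numpy as np
import pandas as pd

# Made-up data with missing values in U and T.
df = pd.DataFrame({
    "V": [0.0, 224.4, 242.5],
    "U": [np.nan, np.nan, 15.6],
    "T": [2.0, np.nan, 2.0],
})

all_complete = df.dropna()               # rows with no missing value at all
v_and_t = df.dropna(subset=["V", "T"])   # only V and T must be present

print(all_complete)
print(v_and_t)
```

dropna() returns a new DataFrame and leaves the original untouched, so the result has to be assigned to a name to be kept.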