GitHub - nahuelmol/datos_final: Script that applies data-analysis-methods for dimension reduction and developes classification models

This application

For the Data Laboratory final exam I had to apply PCA using the R programming language. This repository contains the same project, but also a version implemented in Python as a CMD tool that extends functionalities, adding more analytical methods.

This application aims for:

data handling
data cleaning
data processing
data visualization through plots and log files

Used libraries

Sklearn: provides machine learning algorithms making possible to build classification and regression models, implementing unsupervised techniques.
Scipy: provides scientific algorithms needed for statistics and signal processing. It makes possible to build polynomials that better fits a set of values.
Numpy: introduce multi dimensional arrays and a set of mathematical functions to operate on these ndarrays.
Pandas: allows to work with external data files in python which becomes very powerful at combining it with numpy functionalities.
Gnuplot: used externally to builds the 3D plots once the .dat files are ready

Implemented methods

For now, the following methods were implemented:

Dimension reduction algorithms

ICA
PCA
tSNE

Classification algorithms

Logistic Regression
Decision Tree as Classifier
Random Forest Classifier
K-Nearest Neighbors Classifier
Support Vector Classifer

Regression algorithms

Linear Regression
Ridge Regression
Decision Tree Regressor
K-nearest Neighbor Regressor
Support Vector Regressor

Polynomial algorithms

Lagrange
Chebyshev
Taylor (in discussion)

Common commands

Information about these commands can be found by using the help command.

cal help

Let's create a new project called heisenberg (walter white)

cal new p:heisenberg

Later, for setting an specific source data file:

cal set d:src

Global variables and labels can be setted for exploratory univariate analysis like building histograms, boxplots or printing dispersion statistics.

cal set g:var

cal set g:lab

cal set g:histo_var

Once a data source is selected, its variables can be observed by using list. For type, n stands for numerical and c for categorical. A white space includes all variables. Once variables are listed, any can be choosen to fill the var or histo_var (which cannot be categorical) field.

cal list vars <type>

Exploratory analysis

For exploratory analysis xp command is used. If any type of analysis is specified it executes everyone as long as variables are setted at the manifest.json. Available types are boxplot, histograms, correlation matrix, dispersion metrics, and cateogorical variables. Use the help command to learn the code that stands for the the mentioned analysis.

cal xp <type>

Dimension reduction algorithms

The following command applies the dimension reduction PCA algorithm taking the source file previously setted:

cal app dr:pca -r <reference>

Alternatively, output filename and reference (which is the target colummn in the dataframe) can be specified in the same command:

cal app dr:<example.csv>:pca -o <pca.png> -r <reference>

Classification models

For classification using Logistic model:

cal app c:l -r <reference>

Regression models

This is analogous to the previous one:

cal app r:svr -r <reference>

Approximation models

Polynomial approximation can be applied using Lagrange or Chebyshev algorithms by indicating them with l or c respectively.

cal app a:l

Besides polynomials, lines can be used as approximations (which is not true because they're are just put together scattered points, that is the reason that not proccesing it carried out, there is not a model to generate)

cal app a:l -lt <linetype>

In should be specfied the variable to plot like 'R' for aparent resistivity, 'C' for aparent conductivity or 'IO' for the Ip and Op component in EMI prospecting data.

cal app a:- -lt <linetype>

- stands for setting. It's used for building EMI profiles from field sheets. Once profiles are ready, plotlines can be built.

cal app a:s -lt <linetype>

s stands for lines. It only builds the mentioned plotlines.

Checking

To check the applied current project's methods or the models built, ch is used (which stands for check) as follows:

cal ch meths
cal ch mods
cal ch exps
cal ch pols

The target mods, meths, exps and pols work under the same logic. From the current project, specific methods can be checked. w stands for where conditioning a type group and is signs an specific method.

cal ch meths w pca

cal ch meths w pca is 1

From the data directory, specific filetypes can be filter out by doing

cal ch file -ft csv

Cleaning

Every method and models can be deleted on a cleaning process by typing the cl command that stands for clean. The use is the same as for ch.

cal cl meths w pca

cal cl meths w pca is 1

Setting datasets

For building machine learning models, train and test data should be specified or the program will use src specified. Operator can set test file globally by doing:

cal set d:tt

where tt stands for test, tn for train and src for source. What follows is a questions, asking for target file's name. Take into account that the file should be on the data directory.

At the moment of buliding the model, user must choose between the setted train and test data or just splitting the source file by using train_test_split() function that applies a test_size=0.2 unless the key -ts is included followed by any test size required.

Projects management

Projects could be also deleted by doing:

cal del p:heisenberg

or if you want to delete it all.

cal del p:all

Also, produced plots as images can be delted:

cal del o:all

The operator can navigates throughout different projects by doing:

cal switch p:pinkman

Other commands will be added here as the project grows

Plotting with gnuplot

Located in the "data" directory, enter the gnuplot assitant:

gnuplot

Then enter the following commands:

set xlabel "Longitud"
set ylabel "Latitud"
set zlabel "REa"
set datafile separator ','

Later, a 3D scatter plot can be built:

splot "Grid" using 6:5:4 with points pt 7 ps 1 notitle

And also, a surface plot:

set hidden3d
set pm3d at s
set view 60, 30
set dgrid3d 50, 50, 2
splot "Grid" using 6:5:4 with pm3d

Installation

This repository should be cloned in your local machine or just download it as a zip. Later unzip it at an specific directory. Then use pip to all install libraries needed.

pip install -r requirements.txt

pyinstaller is used to compile the executable:

pyinstaller cal.py

The following is to paste the path to this executable in the environment variables. This way the application can be used globally from the CMD.

Name		Name	Last commit message	Last commit date
Latest commit History 115 Commits
__r		__r
abss		abss
approximation		approximation
classification		classification
dimreduction		dimreduction
exploratory		exploratory
regression		regression
.gitignore		.gitignore
README.md		README.md
cal.py		cal.py
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

This application

Used libraries

Implemented methods

Dimension reduction algorithms

Classification algorithms

Regression algorithms

Polynomial algorithms

Common commands

Exploratory analysis

Dimension reduction algorithms

Classification models

Regression models

Approximation models

Checking

Cleaning

Setting datasets

Projects management

Plotting with gnuplot

Installation

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

This application

Used libraries

Implemented methods

Dimension reduction algorithms

Classification algorithms

Regression algorithms

Polynomial algorithms

Common commands

Exploratory analysis

Dimension reduction algorithms

Classification models

Regression models

Approximation models

Checking

Cleaning

Setting datasets

Projects management

Plotting with gnuplot

Installation

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages