For the Data Laboratory final exam I had to apply PCA using the R programming language. This repository contains that project, along with a Python version implemented as a CMD tool that extends its functionality with more analytical methods.
This application aims to support:
- data handling
- data cleaning
- data processing
- data visualization through plots and log files
The following libraries are used:
- Sklearn: provides machine learning algorithms for building classification and regression models and for applying unsupervised techniques.
- Scipy: provides scientific algorithms needed for statistics and signal processing. It makes it possible to build polynomials that best fit a set of values.
- Numpy: introduces multidimensional arrays and a set of mathematical functions to operate on these ndarrays.
- Pandas: allows working with external data files in Python, which becomes very powerful when combined with Numpy functionality.
- Gnuplot: used externally to build the 3D plots once the .dat files are ready.
So far, the following methods have been implemented:
- ICA
- PCA
- tSNE
- Logistic Regression
- Decision Tree as Classifier
- Random Forest Classifier
- K-Nearest Neighbors Classifier
- Support Vector Classifier
- Linear Regression
- Ridge Regression
- Decision Tree Regressor
- K-Nearest Neighbors Regressor
- Support Vector Regressor
- Lagrange
- Chebyshev
- Taylor (in discussion)
Information about these commands can be found by using the help command.
cal help
Let's create a new project called heisenberg (Walter White):
cal new p:heisenberg
Then, to set a specific source data file:
cal set d:src
Global variables and labels can be set for exploratory univariate analysis, such as building histograms and boxplots or printing dispersion statistics.
cal set g:var
cal set g:lab
cal set g:histo_var
Once a data source is selected, its variables can be inspected with list. For type, n stands for numerical and c for categorical. Leaving the type blank includes all variables. Once the variables are listed, any of them can be chosen to fill the var or histo_var (which cannot be categorical) field.
cal list vars <type>
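How the n and c types could be told apart is sketched below in plain Python. This is only an illustration, assuming a column counts as numerical when every non-empty value parses as a float; the helper name `classify_columns` and the sample data are hypothetical, and the tool itself may rely on Pandas dtypes instead.

```python
import csv
import io


def classify_columns(csv_text):
    """Label each column 'n' (numerical) or 'c' (categorical).

    Assumption: a column is numerical when every non-empty value parses as a float.
    """
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    kinds = {}
    for name in rows[0].keys():
        values = [row[name] for row in rows if row[name] != ""]
        try:
            for value in values:
                float(value)
            kinds[name] = "n"
        except ValueError:
            kinds[name] = "c"
    return kinds


# Hypothetical sample data, not taken from the project.
sample = "depth,resistivity,zone\n1.0,35.2,north\n2.0,41.7,south\n"
kinds = classify_columns(sample)
numerical = [name for name, kind in kinds.items() if kind == "n"]
print(numerical)
```

Only the columns labelled n would then be valid candidates for a field that cannot be categorical.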
For exploratory analysis, the xp command is used. If no type of analysis is specified, it runs all of them, as long as the variables are set in manifest.json. Available types are boxplots, histograms, correlation matrix, dispersion metrics, and categorical variables. Use the help command to learn the code that stands for each analysis.
cal xp <type>
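As an illustration of what dispersion metrics for a single variable can look like, here is a minimal sketch using only the standard library. Which statistics the tool actually prints is not stated here, so the selection below (range, variance, standard deviation, interquartile range), the function name `dispersion_summary`, and the sample values are all assumptions.

```python
import statistics


def dispersion_summary(values):
    """Return common dispersion statistics for a list of numbers."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return {
        "range": max(values) - min(values),
        "variance": statistics.variance(values),  # sample variance
        "stdev": statistics.stdev(values),        # sample standard deviation
        "iqr": q3 - q1,
    }


# Hypothetical readings, not taken from the project data.
summary = dispersion_summary([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```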
The following command applies the PCA dimension reduction algorithm to the source file previously set:
cal app dr:pca -r <reference>
Alternatively, the source file, the output filename and the reference (the target column in the dataframe) can be specified in the same command:
cal app dr:<example.csv>:pca -o <pca.png> -r <reference>
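To make the computation behind these commands concrete, the sketch below projects a numerical table onto its first two principal components using NumPy's SVD. It is not the tool's implementation (which, given the libraries it relies on, probably uses Sklearn); `pca_project` and the random data are hypothetical, and the reference column is assumed to serve only for labelling points in the resulting plot.

```python
import numpy as np


def pca_project(features, n_components=2):
    """Project rows of `features` onto their first principal components."""
    centered = features - features.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    scores = centered @ vt[:n_components].T
    explained = singular_values**2 / np.sum(singular_values**2)
    return scores, explained[:n_components]


# Hypothetical data: four numerical columns plus a reference (target) column.
rng = np.random.default_rng(0)
features = rng.normal(size=(50, 4))
reference = rng.integers(0, 2, size=50)  # only used to label points in a plot

scores, explained = pca_project(features, n_components=2)
print(scores.shape, explained)
```

The `explained` values are the fraction of total variance carried by each retained component, which helps judge how much information a 2D plot preserves.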
For classification using the logistic regression model:
cal app c:l -r <reference>
Regression is analogous, here with the support vector regressor:
cal app r:svr -r <reference>
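The targets passed to app in the commands above follow a group:method pattern, optionally with a file name in the middle. A minimal sketch of how such a target could be split is shown below. The function `parse_target` and the `GROUPS` table are hypothetical and are not the tool's actual parser.

```python
GROUPS = {
    "dr": "dimension reduction",
    "c": "classification",
    "r": "regression",
}


def parse_target(target):
    """Split 'group:method' or 'group:file:method' into its parts."""
    parts = target.split(":")
    if len(parts) == 2:
        group, method = parts
        source = None  # fall back to the globally set source file
    elif len(parts) == 3:
        group, source, method = parts
    else:
        raise ValueError(f"unexpected target: {target!r}")
    if group not in GROUPS:
        raise ValueError(f"unknown group: {group!r}")
    return {"group": GROUPS[group], "source": source, "method": method}


print(parse_target("dr:pca"))
print(parse_target("dr:example.csv:pca"))
print(parse_target("r:svr"))
```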
Polynomial approximation can be applied using Lagrange or Chebyshev algorithms by indicating them with l or c respectively.
cal app a:l
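For reference, the idea behind Lagrange approximation is shown below in plain Python: the polynomial passes exactly through every sample point. This is a sketch for illustration only; the tool may build its polynomials with Scipy instead, and the function name and sample points are hypothetical.

```python
def lagrange_interpolate(xs, ys, x):
    """Evaluate the Lagrange interpolating polynomial through (xs, ys) at x."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        # Basis polynomial L_i(x): 1 at xi and 0 at every other node.
        basis = 1.0
        for j, xj in enumerate(xs):
            if j != i:
                basis *= (x - xj) / (xi - xj)
        total += yi * basis
    return total


# Hypothetical sample points lying on y = x**2.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.0, 1.0, 4.0, 9.0]
value = lagrange_interpolate(xs, ys, 1.5)
print(value)
```

Because the sample points lie on a parabola, the interpolated value at 1.5 matches 1.5 squared up to floating-point error.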
Besides polynomials, lines can be used as approximations. Strictly speaking they are not approximations, because the lines simply join the scattered points; for that reason no processing is carried out and no model is generated.
cal app a:l -lt <linetype>
The variable to plot should be specified, such as 'R' for apparent resistivity, 'C' for apparent conductivity or 'IO' for the Ip and Op components in EMI prospecting data.
cal app a:- -lt <linetype>
- stands for setting. It's used for building EMI profiles from field sheets. Once profiles are ready, plotlines can be built.
cal app a:s -lt <linetype>
s stands for lines. It only builds the mentioned plotlines.
To check the methods applied or the models built in the current project, use ch (which stands for check) as follows:
cal ch meths
cal ch mods
cal ch exps
cal ch pols
The targets mods, meths, exps and pols work under the same logic. Specific methods from the current project can be checked: w stands for where and restricts the check to a group of a given type, and is selects a specific method.
cal ch meths w pca
cal ch meths w pca is 1
Files in the data directory can be filtered by file type with:
cal ch file -ft csv
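A filter of this kind can be sketched with `pathlib`. The snippet assumes that -ft matches the file extension case-insensitively; `files_of_type` and the file names are hypothetical, and the demonstration runs on a temporary directory rather than the project's data directory.

```python
import tempfile
from pathlib import Path


def files_of_type(directory, filetype):
    """Return the sorted names of files in `directory` with the given extension."""
    suffix = "." + filetype.lower().lstrip(".")
    return sorted(
        path.name
        for path in Path(directory).iterdir()
        if path.is_file() and path.suffix.lower() == suffix
    )


# Demonstration on a temporary directory with hypothetical file names.
with tempfile.TemporaryDirectory() as data_dir:
    for name in ("grid.csv", "profile.dat", "survey.csv"):
        (Path(data_dir) / name).touch()
    csv_files = files_of_type(data_dir, "csv")

print(csv_files)
```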
Methods and models can be deleted with the cl command, which stands for clean. It is used in the same way as ch.
cal cl meths w pca
cal cl meths w pca is 1
For building machine learning models, train and test data should be specified; otherwise the program uses the src file specified. The operator can set the test file globally with:
cal set d:tt
where tt stands for test, tn for train and src for source. A prompt then asks for the target file's name. Note that the file must be in the data directory.
When building the model, the user must choose between the train and test data previously set and splitting the source file with the train_test_split() function, which applies test_size=0.2 unless the -ts key is included, followed by the required test size.
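The effect of the default test size and of the -ts key can be sketched as follows. The plain-Python `split_rows` function is a stand-in for Sklearn's train_test_split() so that the example has no dependencies, and the argument parser is a hypothetical reduction of the tool's command line, not its actual code.

```python
import argparse
import math
import random


def split_rows(rows, test_size=0.2, seed=0):
    """Shuffle rows and split them into train and test parts.

    Stand-in for sklearn's train_test_split, written in plain Python.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = math.ceil(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]


parser = argparse.ArgumentParser()
# -ts overrides the default test size of 0.2.
parser.add_argument("-ts", type=float, default=0.2)

rows = list(range(50))  # hypothetical rows of a source file
default_args = parser.parse_args([])
custom_args = parser.parse_args(["-ts", "0.3"])

train, test = split_rows(rows, default_args.ts)
train_custom, test_custom = split_rows(rows, custom_args.ts)
print(len(train), len(test), len(train_custom), len(test_custom))
```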
Projects can also be deleted with:
cal del p:heisenberg
or, to delete all projects:
cal del p:all
Plots produced as images can also be deleted:
cal del o:all
The operator can switch between projects with:
cal switch p:pinkman
Other commands will be added here as the project grows
From the "data" directory, start the gnuplot interactive prompt:
gnuplot
Then enter the following commands:
set xlabel "Longitud"
set ylabel "Latitud"
set zlabel "REa"
set datafile separator ','
Later, a 3D scatter plot can be built:
splot "Grid" using 6:5:4 with points pt 7 ps 1 notitle
And also, a surface plot:
set hidden3d
set pm3d at s
set view 60, 30
set dgrid3d 50, 50, 2
splot "Grid" using 6:5:4 with pm3d
Clone this repository to your local machine, or download it as a zip and unzip it into a directory of your choice. Then use pip to install all the libraries needed.
pip install -r requirements.txt
pyinstaller is used to compile the executable:
pyinstaller cal.py
Next, add the path to this executable to the environment variables. This way the application can be used globally from the CMD.