Datamining

Thursday, 25 August 2011

Sample R code

I was searching for good programming languages for data analysis. Open source and easy to understand, R seems to be the language for the job. The following is a bit of R code for you to munch on:

## N.B. If you get an error when loading the libraries
## you will need to install them using commands like
##    install.packages("gplots", repos="http://cran.uk.R-project.org/", dependencies=TRUE)
##
## You should only need to do this once for each library.
## You will then need to load them with the
##    library(gplots)
## command  or equivalent each time you use R.

library("party")
library(gplots)
library("vcd")
library("RODBC")
library(e1071)
library("nnet")
library("rpart")
library("mlbench")
library("randomForest")  ## provides randomForest() and varImpPlot(), used below

## Simple Commands
1+1
10*3
c(1,2,3)
c(1,2,3)*10
x <- 5
x*x
exp(1)

## ------------------------------

## Correlation + Scatterplots
colnames(iris)
plot(iris$Sepal.Length, iris$Petal.Length)
cor(iris$Sepal.Length, iris$Petal.Length)
cor(iris$Sepal.Length, iris$Petal.Length)^2
cor(rank(iris$Sepal.Length), rank(iris$Petal.Length))

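## Spearman rank correlation: the Pearson correlation of the ranks
## (same value as cor(x, y, method = "spearman"))
cor.sp <- function(x,y) {
  return(cor(rank(x),rank(y)))
}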


## ------------------------------
## Plot Scatterplot of Iris data
plotIris1 <- function(){
  print(table(iris$Species)) # iris is a data.frame with a 'Species' factor
  iS <- iris$Species == "setosa"
  iV <- iris$Species == "versicolor"
  matplot(c(1, 8), c(0, 4.5), type= "n", xlab = "Length", ylab = "Width",
          main = "Petal and Sepal Dimensions in Iris Blossoms")
  matpoints(iris[iS,c(1,3)], iris[iS,c(2,4)], pch = "sS", col = c(2,4))
  matpoints(iris[iV,c(1,3)], iris[iV,c(2,4)], pch = "vV", col = c(2,4))
  ## pch "s"/"v" mark sepals (columns 1, 2); "S"/"V" mark petals (columns 3, 4)
  legend(1, 4, c("    Setosa Sepals", "    Setosa Petals",
                 "Versicolor Sepals", "Versicolor Petals"),
         pch = "sSvV", col = rep(c(2,4), 2))
}

plotIris1()
## ------------------------------

# Bar Chart Example with confidence intervals and grid
prettyBarChart <- function(){
  ## Source: R Graph Gallery
  hh <- t(VADeaths)[, 5:1]
  mybarcol <- "gray20"
  ci.l <- hh * 0.85
  ci.u <- hh * 1.15
  mp <- barplot2(hh, beside = TRUE,
                 col = c("lightblue", "mistyrose",
                   "lightcyan", "lavender"),
                 legend = colnames(VADeaths), ylim = c(0, 100),
                 main = "Death Rates in Virginia", font.main = 4,
                 sub = "Faked 95 percent error bars", col.sub = mybarcol,
                 cex.names = 1.5, plot.ci = TRUE, ci.l = ci.l, ci.u = ci.u,
                 plot.grid = TRUE)
  mtext(side = 1, at = colMeans(mp), line = 2,
        text = paste("Mean", formatC(colMeans(hh))), col = "red")
  box()
}
prettyBarChart()

## ------------------------------
## Mosaic Plots
data(HairEyeColor)
mosaic(HairEyeColor, shade = TRUE)

## ------------------------------
## Linear Models
plot(iris$Sepal.Length, iris$Petal.Length)
plot(iris$Sepal.Length, iris$Petal.Length, col="blue",pch=19)

## Make a Model of Petals in terms of Sepals
irisModel <- lm(iris$Petal.Length ~ iris$Sepal.Length)

## plot the corresponding line
abline(irisModel)

## Details of the Model
summary(irisModel)

plot(iris$Petal.Length ~ iris$Species, col="cyan")


## ------------------------------

## Regression Tree

## Select data where Ozone level is known
airq <- subset(airquality, !is.na(Ozone))
## Build a regression tree predicting Ozone 
airct <- ctree(Ozone ~ ., data = airq)
## Show the tree structure
plot(airct)
## Compare actual and predicted values
plot(airq$Ozone,predict(airct))


## Classification Tree
irisct <- ctree(Species ~ .,data = iris)
plot(irisct)
table(predict(irisct), iris$Species)

## Ctree Forest
iriscf <- cforest(Species ~ .,data = iris)
table(predict(iriscf), iris$Species)

## Random Forest
irisrf <- randomForest(Species ~ .,data = iris)
table(predict(irisrf), iris$Species)
varImpPlot(irisrf)

## Naive Bayes
irisnb <- naiveBayes(Species ~ .,data = iris)
table(predict(irisnb, iris[,-5]), iris$Species)

## Neural Net
irisnn <- nnet(Species ~ .,data = iris, size=2)
table(predict(irisnn, iris, type="class"), iris$Species)

## ------------------------------
## SQL Interface

library("RODBC")
channel <- odbcConnect("PostgreSQL30w", case="postgresql")
sqlSave(channel,iris, tablename="iris")
myIris <- sqlQuery(channel, "select * from iris")
summary(myIris)

## ------------------------------
demo(graphics)

Thursday, 26 May 2011

The heart of Data Mining - ALGORITHMS!

I finally realised that Data Mining is all about algorithms. Algorithms are the soul of Data Mining. Unfortunately, that is the subject I neglected the most during my engineering degree, and hence I am all at sea in my Data Mining courses. But I have decided that I will learn algorithms. I will make a plan to learn this vast but exciting field. Without a good grasp of the types of algorithms and how they work, you cannot work in Data Mining or, for that matter, in any field of Artificial Intelligence.

To master algorithms it is extremely important to figure out the various areas of the subject. Once we do that, we find the subtopics, and there you are, ready to go! Make your timetable and learn, one by one, all those techniques, which are the main focus of DATA MINING!!

Wednesday, 23 March 2011

Comeback

Here after quite a while. I have nicer things to share. A bit more knowledge. A bit more passion. A bit more focus. All over the internet you'll find more stuff than required on Data Mining. But I promise you that I'll give you slightly different things to ponder.

I am done with the first six months of the Master's program. Results are not out yet, ewww... But anyhow, I am into the second semester. It is going much better, I think: not as many unreasonable homework assignments bludgeoned in your face. It is so much more relaxing, and I am able to go into detail in the courses.

The courses are:

Ontology Engineering and Semantic Web.
Symbolic Machine Learning.
Bayesian Networks.
Visual Datamining (or Visualisation).
Relational Pattern Mining.
Research tools and methodologies (LaTeX mainly to start with).
Case Studies (Association Rule Mining in a Social Network Database; this one's interesting, we'll be analysing a kind of Facebook database).

The next blog post will give a bit more detailed explanation of the subjects.

Best Wishes,
SidMiner

Friday, 7 January 2011

Erasmus Mundus Master Data Mining and Knowledge Management Courses

First Semester Erasmus Mundus Master Data Mining and Knowledge Management (EM-DMKM) Courses

I have a small abbreviation for them: as there are six of them, I use PMSOLD.

They are:
Probability and Statistics (P)
Multidimensional Data Analysis (M)
Software Methodologies (S)
Optimisation (O)
Logic and Knowledge Representation (L)
Advanced Databases (D)

The courses mentioned above are diverse, and all are meant to help the students with future semester subjects pertaining to Data Mining and Knowledge Management (DMKM).

In the coming blog posts I will try to explore these subjects in detail and give a point of view on their importance.

I am four months into the DMKM course and haven't really started well. It's time now to fasten seatbelts and go for mastering all these subjects.
In my view, Probability and Statistics is by far the most difficult subject of the semester to understand.

Software Methodologies is a theoretically rich subject exploring various methods of software engineering. Studying it is mostly going to mean learning various processes pertaining to software engineering.

Multidimensional Data Analysis and Optimisation are method-oriented subjects. The background theory is wide, but as far as the EM-DMKM is concerned they are not that theoretical.

The other subjects have quite a wide background as well.

In the Logic and Knowledge Representation course we are studying a declarative programming language called PROLOG. PROLOG is a very powerful language that relies mostly on recursion. Programs that would be long in procedural languages like C can be written in a few lines!!

In the blog we will try to study the PROLOG language and also give many examples along with it. In the coming blog posts I shall go into detail on the other subjects as well. Things are getting exciting. Till next time, see you and bye!

Intelligent Data Mining

I am a 23-year-old Indian guy by the name of Siddhartha Chatterjee. I have started this blog to give life to my grad school major in Data Mining and Knowledge Management. I am currently an Erasmus Mundus Masters student in Data Mining and Knowledge Management at the Ecole Polytechnique of the University of Nantes in France. I am in the first semester of my course.

For certain reasons the course became boring for me and I started to look for every way to run away from it. Even at this moment my heart is telling me to run away from this field. But at the back of my mind I know that this field is promising and very interesting. So, starting a blog about the subject is a good way to keep learning, sharing and enjoying grad school.

I would like to provide you all with information regarding Datamining.

Datamining is projected to become a highly promising field. The aim of this field is to find patterns in large databases in order to extract hidden knowledge from them. This hidden knowledge can then be used to make better decisions. We all know how important it is to make the right decisions at the right time, whether for an individual or an organisation. Hence, a major aim of Datamining is to help businesses make the right decisions in order to drive them to the zenith.

I am a person who has no interest in money. I am from a well-off family and all my needs are provided for, so money has never been a motivation in my life. One motivation that I do have is to do something valuable that will make me and my family proud.

I have always been a laid-back and fun-loving kind of person, but now that I have hit the age of 23, one fine day I realised that many sportsmen become leaders at this age, and I wanted to start working like a mad horse!!!! I believe that it is work that can drive a man all his life. A person who is madly in love with his or her work is very lucky. I have not yet reached that stage but would like to, and I know the love factor has to come from within!!

So, as my education so far has been in computer science and engineering, and I am doing datamining at the moment, it's a good opportunity to expand my knowledge in this field. I am lucky in a way to be in a specialised course, as that will help me focus my energies on one thing.

To be honest, I really don't have much of an idea about datamining, but slowly but surely I will become a master at it, and so will you :)

In the blog I will try to put up information regarding the field and subfields of Datamining, the software available in the market for datamining, and also research areas in this field. To start with, I'll list the subjects that I am currently studying at grad school and will put up interesting things related to them. The next blog post will cover the subjects that I am currently studying (or not studying :D, that's why I started the blog!!!). We shall try to make the activities here fun, as life is too uninteresting without fun. Even Datamining requires a fun quotient along with it!!!!

I wish you all a very happy new year and hope to see you on a regular basis on this new adventure of mine called Intelligent Data Mining!!!!!