We are living in a remarkable period for machine learning: open almost any article on the subject and you will find dozens of detailed descriptions. The idea behind this blog is to spread knowledge of machine learning more widely. ML offers potential solutions in many different domains and is set to become a pillar of our future civilization. In this post I provide a high-level understanding of the various types of machine learning, along with short descriptions of the main algorithms. These should be sufficient to get your hands dirty.
So what exactly is "machine learning"?
Machine Learning (ML) is coming into its own. It plays a key role in a wide range of critical applications, such as data mining, natural language processing, image recognition, and expert systems. Machine learning is all around us: Apple, Amazon, Microsoft, Uber and many other companies use it.
Generally, there are four approaches in machine learning -:
1) Supervised Learning
2) Unsupervised Learning
3) Semi-Supervised Learning
4) Reinforcement Learning
1) Supervised Learning
In supervised learning, a set of paired input-output training samples is already given, and from that training data set we learn the input-output relationship. An input-output training sample is also called labeled training data, and the output is regarded as the label of the input data, or the supervision. The goal of supervised learning is to build an artificial system that can learn the mapping between the input and the output, and can predict the output of the system given new inputs. If the output takes a finite set of discrete values that indicate the class labels of the input, the task is called classification; if the output takes continuous values, it is called regression.
There are two main approaches within supervised learning -:
a) Regression
b) Classification
Regression -:
Regression is a form of predictive modelling technique. It finds the relationship between a dependent (target) variable and one or more independent (predictor) variables. Regression helps investment and financial managers to value assets and to understand the relationships between variables, such as commodity prices and the stocks of businesses dealing in those commodities.
Linear Regression -:
Linear regression fits a linear equation to observed data in order to investigate the relationship between two variables. One variable is considered the explanatory variable and the other the dependent variable. A linear regression line has an equation of the form Y = a + bX, where X is the explanatory variable and Y is the dependent variable. The slope of the line is b, and a is the intercept (the value of Y when X = 0).
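For illustration, here is a minimal sketch (using NumPy, with made-up hours-vs-score data) of fitting the line Y = a + bX by least squares:

```python
import numpy as np

# Hypothetical observed data: hours studied (X) and exam score (Y)
X = np.array([1, 2, 3, 4, 5], dtype=float)
Y = np.array([52, 57, 61, 68, 74], dtype=float)

# np.polyfit with degree 1 returns the slope b and the intercept a of Y = a + bX
b, a = np.polyfit(X, Y, deg=1)
print(f"intercept a = {a:.2f}, slope b = {b:.2f}")

# Predict Y for a new value of X
x_new = 6.0
print("predicted Y:", a + b * x_new)
```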
Ordinary Least Squares Regression -:
It is mainly use in statistics. Ordinary Least
Squares Regression is use to find out the unknown parameter. OLS chooses the
parameters of a linear function of a set of explanatory variables by minimizing
the sum of the squares of the differences between the observed dependent
variable (values of the variable being predicted) in the given dataset and
those predicted by the linear function. Geometrically this is seen as the sum
of the squared distances, parallel to the axis of the dependent variable,
between each data point in the set and the corresponding point on the
regression line – the smaller the differences, the better the model fits the
data. The resulting estimator can be expressed by a simple formula, especially
in the case of a single regressor on the right-hand side.
OLS regression finds the relation between dependent and independent variables. For example, with scores as the dependent variable and hours as the independent variable(s), the model is Y = β0 + Σ_{j=1..p} βj Xj + ε, where Y is the dependent variable, β0 is the intercept of the model, Xj corresponds to the jth explanatory variable of the model (j = 1 to p), and ε is the random error with expectation 0 and variance σ².
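The same idea extends to several explanatory variables. Below is a minimal sketch (NumPy, hypothetical data values) of estimating the coefficients by solving the least-squares problem directly:

```python
import numpy as np

# Hypothetical data: 6 observations, 2 explanatory variables
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0],
              [5.0, 6.0],
              [6.0, 5.0]])
y = np.array([7.1, 6.9, 13.2, 12.8, 19.1, 18.9])

# Add a column of ones so the first coefficient plays the role of the intercept β0
X_design = np.column_stack([np.ones(len(X)), X])

# Ordinary least squares: minimize ||y - X_design @ beta||^2
beta, residuals, rank, _ = np.linalg.lstsq(X_design, y, rcond=None)
print("estimated coefficients (β0, β1, β2):", beta)
```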
Stepwise regression -:
In stepwise regression, an automatic procedure is used to choose the predictive variables. As the name suggests, there are multiple steps, and in each step a calculation is made based on a predefined criterion. Generally, an addition or removal operation is performed on the set of explanatory variables. The main stepwise approaches are forward selection, backward elimination, and bidirectional elimination; a rough sketch of forward selection follows below.
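As a rough illustration of the forward-selection variant, the sketch below (scikit-learn, synthetic data, with a made-up minimum-improvement stopping criterion rather than the p-value or AIC criteria used in practice) adds one predictor at a time, keeping the candidate that most reduces mean squared error:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                       # 5 candidate predictors
y = 3 * X[:, 0] - 2 * X[:, 2] + rng.normal(scale=0.5, size=100)

selected, remaining = [], list(range(X.shape[1]))
best_err = np.inf
while remaining:
    errs = {}
    for j in remaining:
        cols = selected + [j]
        pred = LinearRegression().fit(X[:, cols], y).predict(X[:, cols])
        errs[j] = mean_squared_error(y, pred)
    j_best = min(errs, key=errs.get)
    if errs[j_best] >= best_err * 0.99:             # stop if improvement is under 1%
        break
    best_err = errs[j_best]
    selected.append(j_best)
    remaining.remove(j_best)

print("selected predictors:", selected)
```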
Multivariate adaptive regression splines -:
Multivariate adaptive regression splines (MARS) was introduced by Jerome H. Friedman in 1991. MARS consists of two phases: a forward pass and a backward pass. In the forward pass, terms are added until the change in residual error is too small to continue or until the maximum number of terms is reached; the maximum number of terms is specified by the user before model building starts. The backward pass has an advantage over the forward pass: at any step it can choose any term to delete, whereas the forward pass at each step can only see the next pair of terms.
LOESS Curve Fitting (Local Polynomial Regression) -:
In LOESS curve fitting, a smooth curve is fitted to the relationship between two variables. LOWESS stands for Locally Weighted Scatter-plot Smoother. Because the linearity assumptions of conventional regression methods are relaxed, it is also called a nonparametric method. The fitted points and their standard errors are estimated with respect to the whole curve rather than a particular parametric estimate, so the overall uncertainty is measured by how well the estimated curve fits the population curve.
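A minimal sketch using the lowess smoother from statsmodels on made-up noisy data (the `frac` parameter, chosen arbitrarily here, controls how local each fit is):

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

# Hypothetical noisy data around a sine curve
x = np.linspace(0, 10, 200)
y = np.sin(x) + np.random.normal(scale=0.3, size=x.size)

# Locally weighted regression; frac is the fraction of the data used for each local fit
smoothed = lowess(y, x, frac=0.3)   # returns (x, fitted y) pairs, sorted by x
print(smoothed[:5])
```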
Logistic regression -:
Logistic regression is a statistical method for analyzing a dataset in which one or more independent variables determine an outcome. The outcome is measured with a dichotomous variable (one in which there are only two possible outcomes).
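A minimal sketch with scikit-learn, using hypothetical hours-studied data and a pass/fail outcome as the dichotomous variable:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: hours studied vs. pass (1) / fail (0)
hours = np.array([[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0]])
passed = np.array([0, 0, 0, 0, 1, 1, 1, 1])

model = LogisticRegression().fit(hours, passed)
print(model.predict([[1.8], [3.2]]))          # predicted class labels
print(model.predict_proba([[1.8], [3.2]]))    # probabilities for the two outcomes
```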
k-nearest neighbors algorithm -:
KNN is used for both classification and regression. The main factor is k, usually taken as an odd number to avoid ties. Suppose we take k = 5: the distance from the query point to all training points is calculated, the 5 nearest neighbours are considered, and the outcome is decided based on those 5 neighbours (for classification, by majority vote). Distances between points are computed with standard distance measures such as the Euclidean distance.
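A minimal sketch of this idea in plain NumPy, using the Euclidean distance and a majority vote over the k nearest neighbours (the points and labels are made up):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=5):
    # Euclidean distance from the query point to every training point
    dists = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]           # indices of the k nearest neighbours
    votes = Counter(y_train[nearest])         # majority vote decides the outcome
    return votes.most_common(1)[0][0]

# Hypothetical 2-D points with two class labels
X_train = np.array([[1, 1], [1, 2], [2, 1], [6, 6], [6, 7], [7, 6]])
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X_train, y_train, np.array([2, 2]), k=5))
```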
Decision Tree:-
1) Iterative Dichotomiser 3 -:
In decision tree learning, ID3 (Iterative Dichotomiser 3) is an algorithm invented by Ross Quinlan used to generate a decision tree from a dataset. ID3 is the precursor to the C4.5 algorithm, and is typically used in the machine learning and natural language processing domains. The ID3 algorithm begins with the original set S as the root node. On each iteration, it runs through every unused attribute of the set S and calculates the entropy H(S) (or the information gain IG(S, A)) for that attribute. It then selects the attribute which has the smallest entropy (or largest information gain). The set S is then split by the selected attribute (e.g. age is less than 50, age is between 50 and 100, age is greater than 100) to produce subsets of the data. The algorithm continues to recurse on each subset, considering only attributes never selected before. A small sketch of the entropy and information-gain calculation appears after the stopping cases below.
Recursion on a subset may stop in one of these cases:
1) Every element in the subset belongs to the same class (+ or -); then the node is turned into a leaf and labelled with the class of the examples.
2) There are no more attributes to be selected, but the examples still do not belong to the same class (some are + and some are -); then the node is turned into a leaf and labelled with the most common class of the examples in the subset.
3) There are no examples in the subset. This happens when no example in the parent set was found to match a specific value of the selected attribute, for example if there was no example with age >= 100. Then a leaf is created and labelled with the most common class of the examples in the parent set.
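To make the entropy and information-gain step concrete, here is a minimal Python sketch on a toy set of labels and one hypothetical attribute (not Quinlan's original implementation):

```python
import numpy as np
from collections import Counter

def entropy(labels):
    # H(S) = -sum p_i * log2(p_i) over the classes present in S
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(labels, attribute_values):
    # IG(S, A) = H(S) - sum_v |S_v|/|S| * H(S_v)
    total, n, remainder = entropy(labels), len(labels), 0.0
    for v in set(attribute_values):
        subset = [lab for lab, a in zip(labels, attribute_values) if a == v]
        remainder += len(subset) / n * entropy(subset)
    return total - remainder

# Hypothetical toy data: class labels and one candidate attribute
labels    = ['+', '+', '-', '-', '+', '-']
attribute = ['a', 'a', 'b', 'b', 'a', 'a']
print(entropy(labels), information_gain(labels, attribute))
```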
2) C4.5 algorithm -:
C4.5 is an algorithm used to generate a decision tree. It is an extension of Quinlan's earlier ID3 algorithm. The decision trees generated by C4.5 can be used for classification, and for this reason C4.5 is often referred to as a statistical classifier.
C4.5 builds decision trees from a set of training data in the same way as ID3, using the concept of information entropy. The training data is a set S = s1, s2, ... of already-classified samples. Each sample si consists of a p-dimensional vector (x1,i, x2,i, ..., xp,i), where the xj represent attribute values or features of the sample, together with the class in which si falls.
This algorithm has a few base cases:
1) All the samples in the list belong to the same class. When this happens, it simply creates a leaf node for the decision tree saying to choose that class.
2) None of the features provide any information gain. In this case, C4.5 creates a decision node higher up the tree using the expected value of the class.
3) An instance of a previously-unseen class is encountered. Again, C4.5 creates a decision node higher up the tree using the expected value.
3)Chi-square
automatic interaction detection-:
Chi-square
automatic interaction detection (CHAID) is a decision tree technique, based on
adjusted significance testing (Bonferroni testing). CHAID can be used for
prediction (in a similar fashion to regression analysis, this version of CHAID
being originally known as XAID) as well as classification, and for detection of
interaction between variables. CHAID is based on a formal extension of the US
AID (Automatic Interaction Detection) and THAID (THeta Automatic Interaction
Detection) procedures.
CHAID is often used in the context of direct
marketing to select groups of consumers and predict how their responses to some
variables affect other variables, although other early applications were in the
field of medical and psychiatric research.
CHAID's advantages are that its output is highly visual and easy to interpret. However, because it uses multiway splits by default, it needs rather large sample sizes to work effectively, since with small sample sizes the respondent groups can quickly become too small for reliable analysis. One important advantage of CHAID over alternatives such as multiple regression is that it is non-parametric.
4) Decision stump -:
A decision stump is a decision tree which uses only a single attribute for splitting. For discrete attributes, this typically means that the tree consists of only a single interior node (i.e., the root has only leaves as successor nodes). If the attribute is numerical, the tree may be more complex. Decision stumps perform surprisingly well on some commonly used benchmark datasets from the UCI repository, which illustrates that learners with high bias and low variance may perform well because they are less prone to overfitting. Decision stumps are also often used as weak learners in ensemble methods such as boosting.
2)Unsupervised Learning
1) k-means clustering algorithm -:
K-means is one of the simplest unsupervised learning algorithms and is used to solve the clustering problem. It classifies a given data set in a simple and easy manner through a certain number of clusters (assume k clusters) fixed a priori. The idea is to define k centers, one for each cluster. These centers should be placed in a cunning way, because different locations lead to different results, so the better choice is to place them as far away from each other as possible. The next step is to take each point belonging to the given data set and associate it with the nearest center. When no point is pending, the first step is completed and an early grouping is done. At this point we re-calculate k new centroids as the barycenters of the clusters resulting from the previous step. After we have these k new centroids, a new binding has to be done between the same data set points and the nearest new center. A loop has been generated; as a result of this loop, we may notice that the k centers change their location step by step until no more changes are made, in other words until the centers do not move any more.
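A minimal sketch with scikit-learn on a handful of made-up 2-D points (k is fixed in advance as described above):

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 2-D points forming two rough groups
X = np.array([[1, 2], [1, 4], [2, 3],
              [8, 8], [9, 10], [8, 9]], dtype=float)

# k is fixed a priori; the algorithm iterates assignment and centroid-update steps
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("cluster labels:", kmeans.labels_)
print("final centroids:", kmeans.cluster_centers_)
```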
2) K-medians clustering -:
K-medians clustering is used in statistics and data mining. It differs from k-means in that, instead of calculating the mean of each cluster to determine its centroid, one calculates the median. This has the effect of minimizing error over all clusters with respect to the 1-norm distance metric, as opposed to the square of the 2-norm distance metric. It relates directly to the k-median problem, which is the problem of finding k centers such that the clusters formed by them are the most compact: formally, given a set of data points x, the k centers ci are to be chosen so as to minimize the sum of the distances from each x to the nearest ci. The criterion function formulated in this way is sometimes a better criterion than that used in the k-means clustering algorithm, in which the sum of the squared distances is used. The sum of distances is widely used in applications such as facility location.
3)Expectation–maximization
algorithm-:
The
EM algorithm can be used to estimate latent variables, like ones that come from
mixture distributions (you know they came from a mixture, but not which
specific distribution).The Expectation-Maximization (EM) algorithm is a way to
find maximum-likelihood estimates for model parameters when your data is incomplete,
has missing data points, or has unobserved (hidden) latent variables. It is an
iterative way to approximate the maximum likelihood function. While maximum
likelihood estimation can find the “best fit” model for a set of data, it
doesn’t work particularly well for incomplete data sets. The more complex EM
algorithm can find model parameters even if you have missing data. It works by
choosing random values for the missing data points, and using those guesses to
estimate a second set of data. The new values are used to create a better guess
for the first set, and the process continues until the algorithm converges on a
fixed point. Although Maximum Likelihood Estimation (MLE) and EM can both find "best-fit" parameters, the way they find the models is very different. MLE accumulates all of the data first and then uses that data to construct the most likely model. EM takes a guess at the parameters first, accounting for the missing data, then tweaks the model to fit the guesses and the observed data.
The basic steps for the algorithm are:
1) An initial guess is made for the model's parameters and a probability distribution is created. This is sometimes called the "E-Step" for the "Expected" distribution.
2) Newly observed data is fed into the model.
3) The probability distribution from the E-step is tweaked to include the new data. This is sometimes called the "M-step."
The last two steps repeat until the algorithm converges. A minimal sketch using a Gaussian mixture appears after the list of applications below.
Applications:-
- Dis-entangling superimposed signals,
- Estimating Gaussian mixture models (GMMs),
- Estimating hidden Markov models (HMMs),
- Estimating parameters for compound Dirichlet distributions,
- Finding optimal mixtures of fixed models.
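As a minimal illustration of the Gaussian-mixture case, the sketch below uses scikit-learn's GaussianMixture, which runs EM internally; the data are drawn from two made-up 1-D Gaussians:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Hypothetical data drawn from two overlapping 1-D Gaussians (the "mixture")
data = np.concatenate([rng.normal(0, 1, 300), rng.normal(5, 1.5, 200)]).reshape(-1, 1)

# GaussianMixture runs EM: the E-step assigns responsibilities to each component,
# the M-step re-estimates means, variances and mixing weights, until convergence.
gmm = GaussianMixture(n_components=2, random_state=0).fit(data)
print("means:", gmm.means_.ravel())
print("weights:", gmm.weights_)
```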
4)Hierarchical Clustering Algorithms-:
There are 2 types of Hierarchical Clustering
1) Agglomerative Hierarchical clustering
algorithm or AGNES (agglomerative nesting) and
2) Divisive Hierarchical clustering
algorithm or DIANA (divisive analysis).
Both these algorithms are exactly the reverse of each other, so we will cover the agglomerative algorithm in detail.
Agglomerative Hierarchical clustering - This algorithm works by grouping the data one by one on the basis of the nearest distance measure among all the pairwise distances between the data points. The distances are then recalculated, but which distance should be considered once groups have been formed? There are many available methods; some of them are:
1) single-nearest distance or single linkage.
2) complete-farthest distance or complete
linkage.
3) average-average distance or average
linkage.
4) centroid distance.
Algorithmic steps for Agglomerative
Hierarchical clustering
Let X
= {x1, x2, x3, ..., xn} be the set of data points.
1) Begin with the disjoint clustering having
level L(0) = 0 and sequence number m = 0.
2) Find the least distance pair of clusters
in the current clustering, say pair (r), (s), according to d[(r), (s)] = min
d[(i),(j)] where the minimum is over
all pairs of clusters in the current clustering.
3) Increment the sequence number: m = m + 1. Merge clusters (r) and (s) into a single cluster to form the next clustering m. Set the level of this clustering to L(m) = d[(r),(s)].
4) Update the distance matrix, D, by deleting
the rows and columns corresponding to clusters (r) and (s) and adding a row and
column corresponding to the newly formed cluster. The distance between the new
cluster, denoted (r,s) and old cluster(k) is defined in this way: d[(k), (r,s)]
= min (d[(k),(r)], d[(k),(s)]).
5) If all the data points are in one cluster then
stop, else repeat from step 2).
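A minimal sketch of these steps using SciPy's hierarchical clustering on a few made-up points (the linkage `method` argument corresponds to the distance criteria listed above):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical 2-D data points
X = np.array([[1.0, 1.0], [1.5, 1.2], [5.0, 5.0], [5.2, 4.8], [9.0, 1.0]])

# 'single' = nearest-distance linkage; 'complete', 'average' and 'centroid'
# correspond to the other criteria listed above.
Z = linkage(X, method='single')      # Z records the sequence of merges and their distances
labels = fcluster(Z, t=3, criterion='maxclust')   # cut the tree into at most 3 clusters
print(labels)
```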
Divisive Hierarchical clustering - It is just
the reverse of Agglomerative Hierarchical approach.
Another linkage criterion is Ward's method, in which the sum of squared Euclidean distances is minimized.
Advantages:-
1) No a priori information about the number of clusters is required.
2) Easy to implement and gives best result in
some cases.
Disadvantages:-
1) Algorithm can never undo what was done
previously.
2) Time complexity of at least O(n² log n) is required, where 'n' is the number of data points.
Association:-
Apriori Algorithm-:
The Apriori algorithm is an influential algorithm for mining frequent itemsets for Boolean association rules.
Key concepts :-
• Frequent itemsets: the sets of items which have minimum support (denoted by Lk for the k-itemsets).
• Apriori property: any subset of a frequent itemset must be frequent.
• Join operation: to find Lk, a set of candidate k-itemsets is generated by joining Lk-1 with itself.
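A rough Python sketch of the frequent-itemset part of Apriori (the transactions and minimum support count are made up, and the candidate-pruning step is omitted for brevity):

```python
# Hypothetical transactions and a minimum support count of 2
transactions = [{'bread', 'milk'},
                {'bread', 'diapers', 'beer'},
                {'milk', 'diapers', 'beer'},
                {'bread', 'milk', 'diapers'},
                {'bread', 'milk', 'beer'}]
min_support = 2

def frequent_itemsets(transactions, min_support):
    items = {i for t in transactions for i in t}
    frequent, k_sets = [], [frozenset([i]) for i in items]
    while k_sets:
        # Count support and keep only itemsets meeting the minimum (L_k)
        counts = {s: sum(1 for t in transactions if s <= t) for s in k_sets}
        L_k = [s for s, c in counts.items() if c >= min_support]
        frequent.extend(L_k)
        # Join step: candidate (k+1)-itemsets come from joining L_k with itself
        k_sets = list({a | b for a in L_k for b in L_k if len(a | b) == len(a) + 1})
    return frequent

print(frequent_itemsets(transactions, min_support))
```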
3)Semi-supervised Learning
Yarowsky algorithm-:
The Yarowsky algorithm is a well-known
bootstrapping algorithm, but it is not mathematically well understood.
The Yarowsky bootstrapping algorithm resolves the homograph-level word sense disambiguation (WSD) problem, which is the sense-granularity level required for real natural language processing (NLP) applications. At the same time, it resolves the knowledge-acquisition bottleneck problem affecting most WSD algorithms and can easily be applied to foreign-language corpora.
4)Reinforcement Learning
1)TD-Learning(Temporal Differences)-:
Temporal Difference (TD) learning is mainly used to predict a quantity that depends on future values. It is called temporal difference learning because it uses changes, or differences, in predictions over successive time steps to drive the learning process: the prediction at any given time step is updated to bring it closer to the prediction of the same quantity at the next time step. TD learning is a form of supervised learning in which the training signal for a prediction is a future prediction, and its algorithms are often used in reinforcement learning to estimate value functions. If the value functions were calculated without estimation, the agent would need to wait until the final reward was received before any state-action pair values could be updated; once the final reward was received, the path taken to reach the final state would need to be traced back and each value updated.
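A minimal TD(0) sketch in Python on a made-up 5-state chain (states 0 to 4, always moving right, reward 1 only on reaching the terminal state):

```python
# Hypothetical 5-state chain; the learning rate and discount are arbitrary choices
alpha, gamma = 0.1, 0.9
V = [0.0] * 5                      # value estimate for each state

for episode in range(500):
    s = 0
    while s != 4:
        s_next = s + 1
        r = 1.0 if s_next == 4 else 0.0
        # TD update: move V(s) toward the "future prediction" r + gamma * V(s_next)
        V[s] += alpha * (r + gamma * V[s_next] - V[s])
        s = s_next

print([round(v, 3) for v in V])
```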
2)Actor-critic-:
(a) Actor-only methods work with a parameterized family of policies. The gradient of the performance, with respect to the actor parameters, is directly estimated by simulation, and the parameters are updated in a direction of improvement [4, 5, 8, 13]. A possible drawback of such methods is that the gradient estimators may have a large variance. Furthermore, as the policy changes, a new gradient is estimated independently of past estimates; hence there is no "learning," in the sense of accumulation and consolidation of older information.
(b)Critic-only
methods rely exclusively on value function approximation and aim at learning an
approximate solution to the Bellman equation, which will then hopefully
prescribe a near-optimal policy. Such methods are indirect in the sense that
they do not try to optimize directly over a policy space. A method of this type
may succeed in constructing a "good" approximation of
the
value function, yet lack reliable guarantees in terms of near-optimality of the
resulting policy.
Actor-critic methods implement generalized policy iteration, alternating between a policy evaluation and a policy improvement step. There are two closely related processes: actor improvement, which aims at improving the current policy, and critic evaluation, which evaluates the current policy. If the critic is modeled by a bootstrapping method, it reduces the variance, so learning is more stable than with pure policy gradient methods.
*Value-based methods:
1) estimate the value function
2) the policy is implicit (e.g. ε-greedy)
*Policy-based methods:
1) estimate the policy
2) no value function
*Actor-critic methods:
1) estimate the policy
2) estimate the value function
3)Q-learning-:
The Q-Learning algorithm was introduced as a
way to optimize solutions in Markov decision process problems. The distinctive feature of Q-Learning is in
its capacity to choose between immediate rewards and delayed rewards. At each step of time, an agent observes the
vector of state xt, then chooses and applies an action ut. As the process moves
to state xt+1, the agent receives a reinforcement r(xt, ut). The goal of the training is to find the
sequential order of actions which maximizes the sum of the future
reinforcements, thus leading to the shortest path from start to finish.
The transition rule of Q-learning is a very simple formula:
Q(state, action) = R(state, action) + gamma * Max[Q(next state, all actions)]
The gamma parameter has a range of 0 to 1 (0 <= gamma < 1) and ensures the convergence of the sum. If gamma is closer to zero, the agent will tend to consider only immediate rewards; if gamma is closer to one, the agent will consider future rewards with greater weight, being willing to delay the reward.
The Q-Learning algorithm goes as follows:
1. Set the gamma parameter, and environment rewards in matrix R.
2. Initialize matrix Q to zero.
3. For each episode:
   Select a random initial state.
   Do while the goal state hasn't been reached:
      Select one among all possible actions for the current state.
      Using this possible action, consider going to the next state.
      Get the maximum Q value for this next state based on all possible actions.
      Compute: Q(state, action) = R(state, action) + gamma * Max[Q(next state, all actions)]
      Set the next state as the current state.
   End do
End for
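A minimal Python sketch of these steps on a hypothetical small layout (the reward matrix R below is made up; action index equals next state, and -1 marks an impossible move):

```python
import numpy as np

# Hypothetical reward matrix R[state, action]; state 4 is the goal with reward 100
R = np.array([[-1,  0, -1, -1,  -1],
              [ 0, -1,  0, -1, 100],
              [-1,  0, -1,  0,  -1],
              [-1, -1,  0, -1, 100],
              [-1,  0, -1,  0, 100]], dtype=float)
gamma, goal = 0.8, 4
Q = np.zeros_like(R)

for episode in range(1000):
    state = np.random.randint(0, R.shape[0])          # random initial state
    while state != goal:
        actions = np.where(R[state] >= 0)[0]          # possible actions from this state
        action = np.random.choice(actions)            # pick one at random
        next_state = action                            # here, action index = next state
        # Transition rule: Q(s,a) = R(s,a) + gamma * max_a' Q(s',a')
        Q[state, action] = R[state, action] + gamma * Q[next_state].max()
        state = next_state

print(np.round(Q / Q.max() * 100))                    # normalized learned Q values
```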
4)Dyna(shows that learning and planning are related and integral
to RL)-:
The Dyna algorithm requires about k times the
computation of Q-learning per instance, but this is typically vastly less than
for the naive model-based method. A reasonable value of k can be determined
based on the relative speeds of computation and of taking action.
5)Adaptive real-time dynamic programming(allows an instance
to change its behavior interacting over time.)-:
Adaptive Real-Time Dynamic Programming
(ARTDP) is an algorithm that allows an agent to improve its behavior while
interacting over time with an incompletely known dynamic environment. It can
also be viewed as a heuristic search algorithm for finding shortest paths in
incompletely known stochastic domains. ARTDP is based on Dynamic Programming (DP), but unlike conventional
DP, which consists of off-line algorithms, ARTDP is an on-line algorithm
because it uses agent behavior to guide its computation. ARTDP is adaptive
because it does not need a complete and accurate model of the environment but
learns a model from data collected during agent-environment interaction. When a
good model is available, Real-Time
Dynamic Programming (RTDP) is applicable, which is ARTDP without the
model-learning component.
6)Least square algorithm for temporal difference-:
Least-Squares TD (LSTD) is a temporal-difference algorithm with probability-one convergence when it is used with a function approximator linear in the adjustable parameters. A recursive version of this algorithm is Recursive Least-Squares TD (RLS TD). Although these TD algorithms require more computation per time step than Sutton's TD(λ) algorithms, they are more efficient in a statistical sense because they extract more information from training experiences, and simulation experiments show a substantial improvement in learning rate for RLS TD. In addition to converging more rapidly, LS TD and RLS TD do not have control parameters, such as a learning-rate parameter, thus eliminating the possibility of achieving poor performance through an unlucky choice of parameters.
7)LSPI(Least Squares Policy Iteration)-:
Least-Squares Policy Iteration (LSPI) is a reinforcement learning algorithm designed and used to solve control problems. It uses value function approximation to cope with large state spaces and batch processing for efficient use of training data. LSPI has been used successfully to solve several large-scale problems with relatively little training data; for example, it learns to control the pendulum or the bicycle benchmarks by merely observing a relatively small number of trials in which actions are selected randomly. LSPI has also been compared against Q-learning.
8)The Parti-game Algorithm(The Parti-game Algorithm for
Variable Resolution Reinforcement Learning in Multidimensional State-spaces.)-:
Parti-game is an algorithm for learning feasible trajectories to goal regions in high-dimensional continuous state spaces. In high dimensions it is essential that learning does not plan uniformly over the state space. Parti-game maintains a decision-tree partitioning of the state space and applies techniques from game theory and computational geometry to efficiently and adaptively concentrate high resolution only on critical areas. The current version of the algorithm is designed to find feasible paths or trajectories to goal regions in high-dimensional spaces.
9)prioritized sweeping-:
Although Dyna is a great improvement on previous methods, it suffers from being relatively undirected. It is particularly unhelpful when the goal has just been reached or when the agent is stuck in a dead end: it continues to update random state-action pairs, rather than concentrating on the interesting parts of the state space. These problems are addressed by prioritized sweeping.
The algorithm is similar to Dyna, except that
updates are no longer chosen at random and values are now associated with
states (as in value iteration) instead of state-action pairs (as in
Q-learning). To make appropriate choices, we must store additional information
in the model. Each state remembers its predecessors: the states that have a
non-zero transition probability to it under some action. In addition, each
state has a priority, initially set to zero.
10)sparse sampling algorithm-:
The sparse sampling algorithm is a planning algorithm for Markov decision processes whose running time has no dependence on the number of states. Related algorithms from this area include a near-optimal solution to the exploration-exploitation trade-off in Markov decision processes (a basic problem in reinforcement learning), algorithms for probabilistic inference in Bayesian networks, and algorithms for computing approximate Nash equilibria in large-population games.
11) Policy search -:
Policy search approaches to reinforcement learning represent a promising method for solving POMDPs and large MDPs. In the policy search setting, we assume that we are given some class Π of policies mapping from states to actions, and wish to find a good policy π ∈ Π. A common problem with policy search is that the search through Π can be difficult and computationally expensive, and is thus typically based on local search heuristics that do not come with any performance guarantees. One line of work shows that if the learning agent is given a "base distribution" on states (specifically, one that indicates how often we expect it to be in each state; cf. [5, 4]), then an efficient policy search algorithm can be derived that terminates after a polynomial number of steps. That algorithm outputs a non-stationary policy, and each step in the algorithm requires only a minimization that can be performed or approximated via a call to a standard supervised learning algorithm.
12)Policy gradient Algorithm-:
Policy gradient methods are a type of reinforcement learning technique that relies upon optimizing parametrized policies with respect to the expected return (long-term cumulative reward) by gradient descent. They do not suffer from many of the problems that have been marring traditional reinforcement learning approaches, such as the lack of guarantees of a value function, the intractability problem resulting from uncertain state information, and the complexity arising from continuous states and actions.
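A minimal REINFORCE-style sketch of the idea on a hypothetical 2-armed bandit, with a softmax policy over two parameters (the arm reward means, learning rate and step count are all made up):

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(2)                          # policy parameters (arm preferences)
alpha = 0.1                                  # learning rate
true_means = np.array([1.0, 2.0])            # hypothetical expected rewards of the arms

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

for step in range(2000):
    probs = softmax(theta)
    a = rng.choice(2, p=probs)
    r = rng.normal(true_means[a], 1.0)        # sampled reward
    grad_log = -probs                         # d log pi(a) / d theta for a softmax policy...
    grad_log[a] += 1.0                        # ...is (1[k == a] - pi(k))
    theta += alpha * r * grad_log             # gradient step on expected return

print("learned action probabilities:", softmax(theta))
```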
Conclusion:-
Machine learning approaches are applicable in complex research fields; in quality improvement, for example, they may assist in the title and abstract screening process of literature reviews. From this blog, I hope you now have an idea of the main types of machine learning, and that you have enjoyed this tour of machine learning approaches.