Skip to main content

Machine Learning The Easy Way

We are running in the most quality period of human race. when you open some article about machine learning, you see dozens of detailed descriptions.
The idea behind writing this blog is to get the knowledge about Machine learning across the world. Through this blog, ML provides potential solutions in all different domains and more, and is set to be a pillar of our future civilization.. I am providing a flow level understanding about various machine learning Types along with description. These should be sufficient to get your hands dirty.

So what exactly is “machine learning”
Machine Learning (ML) is coming into its own, It is playing a key role in a wide range of critical applications, such as data mining, natural language processing, image recognition, and expert systems. Machine Learning is all around us. Apple, Amazon, Microsoft, Uber and many more companies are using machine learning.

Generally there are four approaches in Machine Learning -:
1) Supervised Learning
2) Unsupervised Learning
3) Semi-Supervised Learning
4) Reinforcement Learning

1) Supervised Learning

In Supervised Learning which is machine learning set of paired input-output training samples is already given and from that
training samples data set we get the input output relationship information.
Here an input-output training sample is also called labeled training data and output is regarded as the label of the input data or the supervision. The goal of supervised learning is to build an artificial system that can learn the mapping between the input and the output, and can predict the output of the system given new inputs. If the output takes a finite set of discrete values that indicate the class labels of the input.

Mostly there are two ways for Supervised Learning -:
a) Regression
b) Classification


Regression :-

It is a form of predictive modelling technique. Regression find outs the relationship between a dependent (target) and independent variable (s) (predictor).Regression helps investment and financial managers to value assets and understand the relationships between variables, such as commodity prices and the stocks of businesses dealing in those commodities. 

Linear Regression -:
Linear Regression uses linear equation to observed data for investigate the relationship between two variables. There are two variables first one considered as an explanatory variable and second one considered as a dependent variable. A linear regression line has an equation of the form Y = a + bX, where X is the explanatory variable and Y is the dependent variable. The slope of the line is b, and a is the intercept (the value of y when x = 0).

Ordinary Least Squares Regression -:
It is mainly use in statistics. Ordinary Least Squares Regression is use to find out the unknown parameter. OLS chooses the parameters of a linear function of a set of explanatory variables by minimizing the sum of the squares of the differences between the observed dependent variable (values of the variable being predicted) in the given dataset and those predicted by the linear function. Geometrically this is seen as the sum of the squared distances, parallel to the axis of the dependent variable, between each data point in the set and the corresponding point on the regression line – the smaller the differences, the better the model fits the data. The resulting estimator can be expressed by a simple formula, especially in the case of a single regressor on the right-hand side. 

OLS regression -:
find outs the relation between dependent and independent variables.

Example - scores(dependent variable)  and hours(independent variable) equation Y = β0 + Σj=1..p βjXj + ε where Y is the dependent variable,β0, is the intercept of the model,X j corresponds to the jth explanatory variable of the model (j= 1 to p), and e is the random error with expectation 0 and variance σ².

Stepwise regression -:
In Stepwise regression an automatic procedure is used for finding choice of predictive variables. As the name suggest, here are multiple steps are there .In each step some calculation is there based on predefined criteria.

Generally adddition or subtraction operation is performed on the set of explanatory variables. In stepwise regression Forward selection, Backward elimination, Bidirectional elimination basically these approaches are present.

Multivariate adaptive regression splines -:
Multivariate adaptive regression splines was invented by Mr. Jerome H. Friedman in 1991.It is also called as MARS. MARS mainly containing two phases forward pass and backward pass. In forward pass adding terms continues until the change in residual error is too small to continue or until the maximum number of terms is reached. The maximum number of terms is specified by the user before model building starts.The backward pass has an advantage over the forward pass: at any step it can choose any term to delete, whereas the forward pass at each step can only see the next pair of terms.

LOESS Curve Fitting (Local Polynomial Regression) -:
In LOESS Curve Fitting one smooth curve will be fitted into two variables.
LOWESS stands for Locally Weighted Scatter-plot Smoother.In this the linearity assumptions of conventional regression methods have been relaxed that`s why it is also called as nonparametric  method.
The fitted points and their standard errors represent are estimated with respect to the whole curve  rather than a particular estimate. So, the overall uncertainty is measured as how well the estimated  curve fits the population curve.

Logistic regression -:
It determine an outcome .Logistic regression is a statistical method for analyzing a dataset in which there are one or more independent variables that determine an outcome. The outcome is measured with a dichotomous variable (in which there are only two possible outcomes).

k-nearest neighbors algorithm -:
KNN is used for classification and regresion.In this k is the main factor, mainly taken as odd for less complexity.Suppose we have taken k=5 then from that point distance will be calculated to all points.We have taken k=5 therefore 5 earest neighbours will be considered and the outcome will be decided based upon
these 5 neighbours.To calculate distances between points some equations are used

Decision Tree:-

1)Iterative Dichotomiser 3-:
In decision tree learning, ID3 (Iterative Dichotomiser 3) is an algorithm invented by Ross Quinlan[1] used to generate a decision tree from a dataset. ID3 is the precursor to the C4.5 algorithm, and is typically used in the machine learning and natural language processing domains.The ID3 algorithm begins with the original set {displaystyle S} S as the root node. On each iteration of the algorithm, it iterates through every unused attribute of the set {displaystyle S} S and calculates the entropy {displaystyle H(S)} H(S) (or information gain {displaystyle IG(S)} {displaystyle IG(S)}) of that attribute. It then selects he attribute which has the smallest entropy (or largest information gain) value. The set {displaystyle S} S is then split by the selected attribute (e.g. age is less than 50, age is between 50 and 100, age is greater than 100) to produce subsets of the data. The algorithm continues to recurse on each subset, considering only attributes never selected before. 

 Recursion on a subset may stop in one of these cases:-

1)every element in the subset belongs to the same class (+ or -), then the node is turned into a leaf and labelled with the class of the examples
 2)there are no more attributes to be selected, but the examples still do not belong to the same class (some are + and some are-), then the node is turned into a leaf and labelled with the most common class of the examples in the subset
 3)there are no examples in the subset, this happens when no example in the parent set was found to be matching a specific value of the selected attribute, for example if there was no example with age >= 100. Then a leaf is created, and labelled with the most common class of the examples in the parent set.

2)C4.5 algorithm-:
C4.5 is an algorithm used to generate a decision tree.C4.5 is an extension of Quinlan's earlier ID3 algorithm.The decision trees generated by C4.5 can be used for classification, and for this reason, C4.5 is often referred to as a statistical classifier.
C4.5 builds decision trees from a set of training data in the same way as ID3, using the concept of information entropy. The training data is a set S=s1,s2,...of already classified samples. Each sample si consists of a p-dimensional vector {x1,i,x2,i,...,xp,i} where the xj represent attribute values or features of the sample, as well as the class in which si falls.

 This algorithm has a few base cases.

1)All the samples in the list belong to the same class. When this happens, it simply creates a leaf node for the decision tree saying to choose that class.
2)None of the features provide any information gain. In this case, C4.5 creates a decision node higher up the tree using the expected value of the class.
3)Instance of previously-unseen class encountered. Again, C4.5 creates a decision node higher up the tree using the expected value.

3)Chi-square automatic interaction detection-:
Chi-square automatic interaction detection (CHAID) is a decision tree technique, based on adjusted significance testing (Bonferroni testing). CHAID can be used for prediction (in a similar fashion to regression analysis, this version of CHAID being originally known as XAID) as well as classification, and for detection of interaction between variables. CHAID is based on a formal extension of the US AID (Automatic Interaction Detection) and THAID (THeta Automatic Interaction Detection) procedures.
CHAID is often used in the context of direct marketing to select groups of consumers and predict how their responses to some variables affect other variables, although other early applications were in the field of medical and psychiatric research.
CHAID's advantages are that its output is highly visual and easy to interpret. Because it uses multiway splits by default, it needs rather large sample sizes to work effectively, since with small sample sizes the respondent groups can quickly become too small for reliable analysis.One important advantage of CHAID over alternatives such as multiple regression is that it is non-parametric.

4)Decision stump-:
A decision stump is a  Decision Tree. which uses only a single attribute for splitting. For discrete attributes, this typically means that the tree consists only of a single interior node (i.e., the root has only leaves as successor nodes). If the attribute is numerical, the tree may be more complex.Decision stumps perform surprisingly well on some commonly used benchmark datasets from the UCI repository.which illustrates that learners with a high  Bias and low  Variance may perform well because they are less prone to  Overfitting. Decision stumps are also often used as weak learners in  Ensemble Methods such as boosting.

2)Unsupervised Learning

1)k-means clustering algorithm-:
K-means is simplest unsupervised learning algorithms is used to solve the clustering problem.it classify the given data set very simple and easy manner through a certain number of cluster(assume k clusters) Fixed apriori.the idea is to define  K centers,one for each cluster.these centers should be placed in a cunning way because of different location causes different result.So,the better choice is to place them  as  much as possible  far away from each other. The  next  step is to take each point belonging  to a  given data set and associate it to the nearest center. When no point  is  pending,  the first step is completed and an early group age  is done. At this point we need to re-calculate k new centroids as barycenter of  the clusters resulting from the previous step. After we have these k new centroids, a new binding has to be done  between  the same data set points  and  the nearest new center. A loop has been generated. As a result of  this loop we  may  notice that the k centers change their location step by step until no more changes  are done or  in  other words centers do not move any more.

2)K-medians clustering-:
Basically in statistics and data mining,K-medians clustering algorithm is used.its different from K-means ,here instead of calculating the mean for each cluster to determine its centroid,one instead calculates the median.this has the effect of minimizing error over all clusters with to the 1-norm distance metric,as opposed to the square of the 2-norm distance metric.This relates directly to the k-median problem which is the problem of finding k centers such that the clusters formed by them are the most compact. Formally, given a set of data points x, the k centers ci are to be chosen so as to minimize the sum of the distances from each x to the nearest ci.The criterion function formulated in this way is sometimes a better criterion than that used in the k-means clustering algorithm, in which the sum of the squared distances is used. The sum of distances is widely used in applications such as facility location.

3)Expectation–maximization algorithm-:
The EM algorithm can be used to estimate latent variables, like ones that come from mixture distributions (you know they came from a mixture, but not which specific distribution).The Expectation-Maximization (EM) algorithm is a way to find maximum-likelihood estimates for model parameters when your data is incomplete, has missing data points, or has unobserved (hidden) latent variables. It is an iterative way to approximate the maximum likelihood function. While maximum likelihood estimation can find the “best fit” model for a set of data, it doesn’t work particularly well for incomplete data sets. The more complex EM algorithm can find model parameters even if you have missing data. It works by choosing random values for the missing data points, and using those guesses to estimate a second set of data. The new values are used to create a better guess for the first set, and the process continues until the algorithm converges on a fixed point. Although Maximum Likelihood Estimation (MLE) and EM can both find “best-fit” parameters, how they find the models are very different. MLE accumulates all of the data first and then uses that data to construct the most likely model. EM takes a guess at the parameters first — accounting for the missing data — then tweaks the model to fit the guesses and the observed data.

The basic steps for the algorithm are:
 An initial guess is made for the model’s parameters and a probability distribution is created. This is   sometimes called the “E-Step” for the “Expected” distribution.
 Newly observed data is fed into the model.
 The probability distribution from the E-step is tweaked to include the new data. This is sometimes       called the “M-step.”

  Applications:-
  1. Dis-entangling superimposed signals,
  2. Estimating Gaussian mixture models (GMMs),
  3. Estimating hidden Markov models (HMMs),
  4. Estimating parameters for compound Dirichlet distributions,
  5. Finding optimal mixtures of fixed models.

 4)Hierarchical Clustering Algorithms-:
 There are 2 types of Hierarchical Clustering
  
    1) Agglomerative Hierarchical clustering algorithm or AGNES (agglomerative nesting) and
    2) Divisive Hierarchical clustering algorithm or DIANA (divisive analysis).
  
 Both this algorithm are exactly reverse of each other. So we will be covering Agglomerative           Hierarchical clustering algorithm in detail.
 Agglomerative Hierarchical clustering -This algorithm  works by  grouping  the data one by one on   the basis of the  nearest distance measure of all the pairwise distance between the data point. Again   distance between the data point is recalculated but which distance to consider when the groups has     been formed? For this there are many available methods. Some of them are:
             1) single-nearest distance or single linkage.
             2) complete-farthest distance or complete linkage.
             3) average-average distance or average linkage.
             4) centroid distance.
 
  Algorithmic steps for Agglomerative Hierarchical clustering
 
  Let  X = {x1, x2, x3, ..., xn} be the set of data points.

 1) Begin with the disjoint clustering having level L(0) = 0 and sequence number m = 0.
 2) Find the least distance pair of clusters in the current clustering, say pair (r), (s), according to d[(r),       (s)] = min d[(i),(j)]   where the minimum is over all pairs of clusters in the current clustering.
 3) Increment the sequence number: m = m +1.Merge clusters (r) and (s) into a single cluster to form       the next clustering   m. Set the level of this clustering to L(m) = d[(r),(s)].
 4) Update the distance matrix, D, by deleting the rows and columns corresponding to clusters (r) and       (s) and adding a row and column corresponding to the newly formed cluster. The distance                   between  the new cluster, denoted (r,s) and old cluster(k) is defined in this way: d[(k), (r,s)] = min       (d[(k),(r)], d[(k),(s)]).
  5) If all the data points are in one cluster then stop, else repeat from step 2).
      Divisive Hierarchical clustering - It is just the reverse of Agglomerative Hierarchical approach.
  6) ward's method - sum of squared euclidean distance is minimized.

Advantages:-
  1) No apriori information about the number of clusters required.
  2) Easy to implement and gives best result in some cases.

Disadvantages:-
  1) Algorithm can never undo what was done previously.
  2) Time complexity of at least O(n2 log n) is required, where ‘n’ is the number of data points.

Association:-

Apriori Algorithm-:
The Apriori Algorithm is an influential algorithm for mining frequent itemsets for boolean association rules.
Key Concepts :-
  • Frequent Itemsets: The sets of item which has minimum support (denoted by Li for ith-Itemset).
  • Apriori Property: Any subset of frequent itemset must be frequent.
  • Join Operation: To find Lk , a set of candidate k-itemsetsis generated by joining Lk-1 with itself.

3)Semi-supervised Learning 


Yarowsky algorithm-:
The Yarowsky algorithm is a well-known bootstrapping algorithm, but it is not mathematically well understood.
The Yarowsky bootstrapping algorithm resolves the homograph-level word sense disambiguation (WSD) problem, which is the sense granularitylevel required for real natural language processing (NLP) applications. At the same time it resolves the knowledge acquisitionbottleneck problem affecting most WSD algorithms and can be easily applied to foreign language corpora.

4)Reinforcement Learning

1)TD-Learning(Temporal Differences)-:
Temporal Difference (TD) Learning is mainly used in order to predict a quantity and that quantity  depends on the future values.
and the reason why its called as Temporal Difference because its use of changes,or differences,in  predictions over time steps to get the learning process.The prediction at any given time step is  updated to bring it closer to the prediction of the same quantity at the next time step.TD Learning is a supervised learning in which the training signal for a prediction is a future prediction and its algorithms are often used in reinforcement learning. It's method can be used to estimate these value functions. if the value functions were calculated without estimation,the agent would need to wait until the final reward was received before any state-action pair values can be updated.
Once the final reward was received, the path taken to reach the final state would need to be traced back and each value updated.

2)Actor-critic-:
(a)Actor-only methods work with a parameterized family of policies. The gradient of the performance, with respect to the actor parameters, is directly estimated by simulation, and the parameters are updated in a direction of improvement [4, 5, 8, 13J. A possible drawback of such methods is that the gradient estimators may have a large variance. Furthermore, as the policy
changes, a new gradient is estimated independently of past estimates.Hence, there is no "learning," in the sense of accumulation and consolidation of older information.

(b)Critic-only methods rely exclusively on value function approximation and aim at learning an approximate solution to the Bellman equation, which will then hopefully prescribe a near-optimal policy. Such methods are indirect in the sense that they do not try to optimize directly over a policy space. A method of this type may succeed in constructing a "good" approximation of
the value function, yet lack reliable guarantees in terms of near-optimality of the resulting policy.

Actor-critic methods implement generalized policy iteration alternating between a policy evaluation and a policy improvement step.

There are two closely related processes of actor improvement which aims at improving the current policy critic evaluation which evaluates the current policy If the critic is modeled by a bootstrapping method it reduces the variance so the learning is more stable than pure policy gradient methods.

*Value-based methods:
    1)estimate the value function
    2)policy is implicit (eg SI-greedy)
               
*Policy-based methods
    1)estimate the policy
    2) no value function

*Actor-critic methods
   1)estimate the policy
   2)estimate the value function

3)Q-learning-:
The Q-Learning algorithm was introduced as a way to optimize solutions in Markov decision process problems.  The distinctive feature of Q-Learning is in its capacity to choose between immediate rewards and delayed rewards.  At each step of time, an agent observes the vector of state xt, then chooses and applies an action ut. As the process moves to state xt+1, the agent receives a reinforcement r(xt, ut).  The goal of the training is to find the sequential order of actions which maximizes the sum of the future reinforcements, thus leading to the shortest path from start to finish.

The transition rule of Q learning is a very simple formula:

Q(state, action) = R(state, action) + gamma * Max[Q(next state, all actions)]

The gamma parameter has a range of 0 to 1 (0 <= gamma > 1), and ensures the convergence of the sum.  If gamma is closer to zero, the agent will tend to consider only immediate rewards.  If gamma is closer to one, the agent will consider future rewards with greater weight, willing to delay the reward.

The Q-Learning algorithm goes as follows:

1. Set the gamma parameter, and environment rewards in matrix R.

2. Initialize matrix Q to zero.

3. For each episode:

Select a random initial state.

Do While the goal state hasn't been reached.

Select one among all possible actions for the current state.
Using this possible action, consider going to the next state.
Get maximum Q value for this next state based on all possible actions.
Compute: Q(state, action) = R(state, action) + gamma * Max[Q(next state, all actions)]
Set the next state as the current state.

End Do

End For

4)Dyna(shows that learning and planning are related and integral to RL)-:
The Dyna algorithm requires about k times the computation of Q-learning per instance, but this is typically vastly less than for the naive model-based method. A reasonable value of k can be determined based on the relative speeds of computation and of taking action.

  
5)Adaptive real-time dynamic programming(allows an instance to change its behavior interacting over time.)-:
Adaptive Real-Time Dynamic Programming (ARTDP) is an algorithm that allows an agent to improve its behavior while interacting over time with an incompletely known dynamic environment. It can also be viewed as a heuristic search algorithm for finding shortest paths in incompletely known stochastic domains. ARTDP is based on  Dynamic Programming (DP), but unlike conventional DP, which consists of off-line algorithms, ARTDP is an on-line algorithm because it uses agent behavior to guide its computation. ARTDP is adaptive because it does not need a complete and accurate model of the environment but learns a model from data collected during agent-environment interaction. When a good model is available,  Real-Time Dynamic Programming (RTDP) is applicable, which is ARTDP without the model-learning component.

6)Least square algorithm for temporal difference-:
An algorithm we call Least-Squares TD (LS TD) for which we prove probability-one convergence when it is used with a function approximator linear in the adjustable parameters. We then define a recursive version of this algorithm, Recursive Least-Squares TD (RLS TD). Although these new TD algorithms require more computation per time-step than do Sutton's TD(A) algorithms, they are more efficient in a statistical sense because they extract more information from training experiences. We describe a simulation experiment showing the substantial improvement in learning rate achieved by RLS TD.In addition to converging more rapidly, LS TD and RLS TD do not have control parameters, such as a learning rate parameter, thus eliminating the possibility of achieving poor performance by an unlucky choice of parameters.

7)LSPI(Least Squares Policy Iteration)-:
Least-Squares Policy Iteration (LSPI) is a reinforcement learning algorithm designed and used to solve control problems. It uses value function approximation to cope with large state spaces and batch processing for efficient use of training data. LSPI has been used successfully to solve several large scale problems using relatively few training data. LSPI learns to control the pendulum or the bicycle by merely observing a relatively small number of trials where actions are selected randomly. LSPI is also compared against Q-learning.

8)The Parti-game Algorithm(The Parti-game Algorithm for Variable Resolution Reinforcement Learning in Multidimensional State-spaces.)-:
Parti-game is a new algorithm for learning feasible tra jectories to goal regions in high dimensional continuous state-spaces. In high dimensions it is essential that learning does not plan uniformly over a state-space. Parti-game maintains a decision-tree partitioning of state-space and applies techniques from game-theory and computational geometry to esociently and adaptively concentrate high resolution only on critical areas.
The current version of the algorithm is designed to find feasible paths or tra jectories to goal regions in high dimensional spaces.

9)prioritized sweeping-:
Although Dyna is a great improvement on previous methods, it suffers from being relatively undirected. It is particularly unhelpful when the goal has just been reached or when the agent is stuck in a dead end; it continues to update random state-action pairs, rather than concentrating on the interesting parts of the state space. These problems are addressed by prioritized sweeping, which are independently-developed.
The algorithm is similar to Dyna, except that updates are no longer chosen at random and values are now associated with states (as in value iteration) instead of state-action pairs (as in Q-learning). To make appropriate choices, we must store additional information in the model. Each state remembers its predecessors: the states that have a non-zero transition probability to it under some action. In addition, each state has a priority, initially set to zero.

10)sparse sampling algorithm-:
A sparse sampling algorithm for planning in Markov decision processes whose running time has no dependence on the number of states.A new algorithm providing a near-optimal solution to the exploration-exploitation trade-off in Markov decision processes, a basic problem in reinforcement learning.Powerful new algorithms for probabilistic inference in Bayesian networks.
 Algorithms for computing approximate Nash equilibria in large-population games.

11)Policy search-:
Policy search approaches to reinforcement learning represent a promising method for solving
POMDPs and large MDPs. In the policy search setting, we assume that we are given
some class ENO of policies mapping from the states to the actions, and wish to find a good
policy EM 2 ENO. A common problem with policy search is that the search through ENO can be
difficult and computationally expensive, and is thus typically based on local search heuristics
that do not come with any performance guarantees.
In this paper, we show that if we give the learning agent a “base distribution” on states
(specifically, one that indicates how often we expect it to be in each state; cf. [5, 4]), then we
can derive an efficient policy search algorithm that terminates after a polynomial number
of steps. Our algorithm outputs a non-stationary policy, and each step in the algorithm
requires only a minimization that can be performed or approximated via a call to a standard

 12)Policy gradient Algorithm-:
Policy gradient methods are a type of reinforcement learning techniques that rely upon optimizing parametrized policies with respect to the expected return (long-term cumulative reward) by gradient descent. They do not suffer from many of the problems that have been marring traditional reinforcement learning approaches such as the lack of guarantees of a value function, the intractability problem resulting from uncertain state information and the complexity arising from continuous states & actions.

Conclusion:-

Machine learning approaches applicable in complex research fields such as quality improvement may assist in the title and abstract screening process
From this blog, I am sure that you have an idea of Machine learning Types. I hope you have enjoyed Machine Learning approaches as well.









Comments

Popular posts from this blog

Where have you seen Machine Learning in your everyday life?

1 –  Google’s AI-Powered Predictions Using anonymized location data from smartphones, Google Maps (Maps) can analyze the speed of movement of traffic at any given time. And, with its acquisition of crowdsourced traffic app Waze in 2013, Maps can more easily incorporate user-reported traffic incidents like construction and accidents. Access to vast amounts of data being fed to its proprietary algorithms means Maps can reduce  commutes by suggesting the fastest routes to and from work. 2 –  Ridesharing Apps Like Uber and Lyft How do they determine the price of your ride? How do they minimize the wait time once you hail a car? How do these services optimally match you with other passengers to minimize detours? The answer to all these questions is ML. Engineering Lead for Uber ATC Jeff Schneider  discussed in an NPR interview how the company uses ML to predict rider demand to ensure that “surge pricing”(short periods of sharp price increases to decrease rider demand and increase driver supply…

How does a total beginner start to learn machine learning if they have some knowledge of programming languages?

I work with people who write C/C++ programs that generate GBs of data, people who manage TBs of data distributed across giant databases, people who are top notch programmers in SQL, Python, R, and people who have setup an organization wide databases working with Hadoop, Sap, Business Intelligence etc. My inspiration to anyone and everyone would be following: Learn all the basics from Coursera, but if I really have to compare what you would get out of Coursera compared to the vastness of data science, let us say ~ Coursera is as good as eating a burrito at Chipotle Mexican Grill. You certainly can satiate yourself, and you have a few things to eat there.The pathway to value adding data science is really quite deep, and I consider it equivalent to a five star buffet offering 20 cuisines and some 500 different recipes.Coursera is certainly a good starting point, and one should certainly go over these courses, but I personally never paid any money to Coursera, and I could easily learn a v…

2.0

I sometimes see people refer to neural networks as just “another tool in your machine learning toolbox”. They have some pros and cons, they work here or there, and sometimes you can use them to win Kaggle competitions. Unfortunately, this interpretation completely misses the forest for the trees. Neural networks are not just another classifier, they represent the beginning of a fundamental shift in how we write software. They are Software 2.0. The “classical stack” of Software 1.0 is what we’re all familiar with — it is written in languages such as Python, C++, etc. It consists of explicit instructions to the computer written by a programmer. By writing each line of code, the programmer is identifying a specific point in program space with some desirable behavior. In contrast, Software 2.0 is written in neural network weights. No human is involved in writing this code because there are a lot of weights (typical networks might have millions), and coding directly in weights is kind of …