Monday, November 18, 2019

What is the Softmax function?

The softmax function is an activation function that turns a vector of numbers
into probabilities that sum to one.

The softmax function outputs a vector that represents a probability distribution
over a list of potential outcomes.

In deep learning, its input is the last neuron layer of the neural network, also
called the logits, which are the raw, non-normalized scores the network outputs.
We use softmax to map the non-normalized logits of a network to a
probability distribution over the predicted output classes.

The softmax of a vector z is:

softmax(z_i) = e^(z_i) / Σ_j e^(z_j)

where e ≈ 2.71828 is Euler's number and the sum in the denominator runs over all entries of the vector.
Let's write an example:
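A minimal NumPy sketch (the logit values are made up just for illustration):

import numpy as np

def softmax(logits):
    # subtract the max for numerical stability (does not change the result)
    exps = np.exp(logits - np.max(logits))
    return exps / np.sum(exps)

logits = np.array([2.0, 1.0, 0.1])
probabilities = softmax(logits)
print(probabilities)          # approximately [0.659 0.242 0.099]
print(np.sum(probabilities))  # 1.0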



Sunday, October 13, 2019

Mean Squared Error (MSE) - A cost function


Mean Squared Error (MSE) is a single numeric value that represents the error of the model's predictions.

Draw a line and calculate the distance (error) between the line and the data points.
The distance from the line to a data point is also called a "residual".
These errors are the deltas between the predicted line and the data points.
Sum the squares of all the deltas and divide the result by the number of points.


And so we can write:
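MSE = (1/n) · Σ (yᵢ - ŷᵢ)²

where ŷᵢ is the predicted value for point i. A minimal NumPy sketch with made-up numbers:

import numpy as np

y_true = np.array([5, 7, 9, 11, 13])       # actual values (sample data)
y_predicted = np.array([4, 8, 9, 10, 14])  # predicted values (sample data)

mse = np.mean((y_true - y_predicted) ** 2)
print(mse)  # 0.8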








Saturday, October 12, 2019

What is Gradient Descent?


Gradient descent is a very basic concept in Machine Learning (ML) algorithms.
It is an iterative optimization method for finding a local minimum of a function (here, a measurement of the error)
and thus finding the most optimal set of parameters for a given problem.

We pick a starting point at random.
This means picking some set of parameters for the model as an input to start with, and then
we measure the error it produces. Then we keep moving in steps down the curve in a given direction.
At every step we measure the error again; it should get smaller at each step until we reach the minimum of the error, beyond which the next step ahead would increase it.

Gradient descent means pushing the parameters in a given direction in order to minimize the error.



Gradient descent is an algorithm that helps find the best-fit line for a given data set in the lowest number of iterations and in the most efficient way.

As we know, the line formula is: y = mx + b.
How exactly do we implement these small steps in the gradient descent algorithm?
In each iteration we need to find and improve (i.e. reduce the error of) the values of

m - line slope
b - intercept

As we know, Mean Squared Error (MSE) is a cost measurement, so we want to reach the minimum cost over m and b in order for the predicted model to be as accurate as possible with respect to the true values of input x and predicted y.





In order to reach the minimum point we need to reduce the step size as we progress along the curve; if the steps are too big we may miss the minimum point and skip to the other side. Now, how do we do that?





We need to calculate the slope at every point, which gives us the right direction on the curve.
When updating the b value at every step, this slope is the partial derivative of the cost function (MSE) with respect to b. When updating the m value at every step, this slope is the partial derivative of the cost function (MSE) with respect to m.

We also need to use a learning rate, which determines the size of the "step" we take in conjunction with the slope.

To calculate the derivatives we need some math and calculus (sounds scary? :) )

We need to find the partial derivative with respect to b.
We need to find the partial derivative with respect to m.


Once we have the partial derivatives we have a direction, and we need to take a step forward.
The size of this step is set by the learning rate.
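Written out (and matching the code below), the partial derivatives of MSE with respect to m and b are:

∂MSE/∂m = -(2/n) · Σ xᵢ(yᵢ - ŷᵢ)
∂MSE/∂b = -(2/n) · Σ (yᵢ - ŷᵢ)

and every iteration updates:

m = m - learning_rate · ∂MSE/∂m
b = b - learning_rate · ∂MSE/∂b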






Let's implement all this theory in a Python code example:

import numpy as np

def gradient_descent(x, y):
    m_curr = b_curr = 0
    iterations = 10000
    n = len(x)
    learning_rate = 0.008
    for i in range(iterations):
        y_predicted = m_curr * x + b_curr
        cost = (1/n) * sum([val**2 for val in (y-y_predicted)])
        m = -(2/n)*sum(x*(y-y_predicted))
        b = -(2/n)*sum(y-y_predicted)

        m_curr = m_curr - learning_rate * m
        b_curr = b_curr - learning_rate * b

        print("m = {}, b = {}, cost = {}, iterations = {}".format(m_curr, b_curr, cost, i))


x = np.array([1, 2, 3, 4, 5])
y = np.array([5, 7, 9, 11, 13])

gradient_descent(x, y)





































Wednesday, October 2, 2019

Distance and Similarity between 2 number lists using cosine function


Sometimes we want to find the distance or the similarity between two lists of numbers, i.e. two vectors.
We can use SciPy's spatial.distance.cosine() function for that.
For vectors with non-negative values the cosine similarity is between 0 and 1 (1 means the vectors point in exactly the same direction).
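As a reminder, the cosine similarity of two vectors a and b is cos(θ) = (a · b) / (‖a‖ · ‖b‖), and spatial.distance.cosine() returns the distance, which is 1 minus that similarity.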

from scipy import spatial

v1 = [1, 2, 3, 4, 5]
v2 = [1, 2, 3, 4, 5]
distance = spatial.distance.cosine(v1, v2)
similarity = 1 - distance
print(distance)
print(similarity)
# output:0.0
# output:1.0
v3 = [2, 4, 6, 8, 10]
distance = spatial.distance.cosine(v1, v3)
similarity = 1 - distance
print(distance)
print(similarity)
# output:0.0
# output:1.0
v4 = [15, 7, 6, 4, 1]
distance = spatial.distance.cosine(v1, v4)
similarity = 1 - distance
print(distance)
print(similarity)
# output:0.4929466088203224
# output:0.5070533911796776

Sunday, September 29, 2019

K-Nearest Neighbor - supervised technique for classification

  • Used to classify new data points based on their distance from known data.
  • Find the K nearest neighbor points and let those points vote on the classification.
  • The idea is simple!



This example finds the movies most similar to (and therefore recommended for) a particular movie by using the KNN idea. The code computes the distance between movies from their genre
information and their rating information, and then picks the K nearest neighbors.

import pandas as pd
import numpy as np
import operator
from scipy import spatial

# let's define a "distance" function that computes the distance between two movies
# based on the similarity of their genres and popularity
def ComputeDistance(a, b):
    # a[1] and b[1] are arrays of the genres the movie belongs to, e.g. [0,1,1,1,0,...]
    # 1 - belongs, 0 - does not belong
    genresA = a[1]
    genresB = b[1]
    genreDistance = spatial.distance.cosine(genresA, genresB)
    popularityA = a[2]
    popularityB = b[2]
    popularityDistance = abs(popularityA - popularityB)
    return genreDistance + popularityDistance


def getNeighbors(movieID, K):
    distance = []
    for movie in movieDict:
        if(movie != movieID):
            dist = ComputeDistance(movieDict[movieID], movieDict[movie])
            distance.append((movie, dist))
    distance.sort(key = operator.itemgetter(1))
    neighbors = []
    for x in range(K):
        neighbors.append(distance[x][0])
    return neighbors


# define columns names
r_cols = ['user_id', 'movie_id', 'rating']
# u.data is a tab delimited data set that contains every rating of a movie
# take the first 3 columns in the data file and import them to a new data frame
ratings = pd.read_csv('data/u.data', sep='\t', names=r_cols, usecols=range(3))
print(ratings.head())

# group by movie_id and compute the total number of ratings and the average rating
# size means how many people rated the movie
MovieProperties = ratings.groupby('movie_id').agg({'rating': [np.size, np.mean]})
print(MovieProperties.head())

# the raw number of ratings ("size") gives us no real measurement of popularity,
# so we create a new DataFrame that normalizes the number of ratings to a 0-1 range.
# a value of 0 means (almost) no one rated the movie,
# a value of 1 means it is the most rated movie there is.
MovieNumRatings = pd.DataFrame(MovieProperties['rating']['size'])
MovieNormalizeNumRatings = MovieNumRatings.apply(
    lambda x: (x - np.min(x)) / (np.max(x) - np.min(x)))
print(MovieNormalizeNumRatings)

# now let's build a big dictionary called movieDict. each entry will contain:
# 1. the movie name
# 2. a list of genre values the movie belongs to (1 - belongs, 0 - does not belong)
# 3. the normalized popularity score (0 to 1)
# 4. the average rating
movieDict = {}
with open('data/u.item') as f:
    for line in f:
        fields = line.rstrip('\n').split('|')
        movieID = int(fields[0])
        name = fields[1]
        genres = fields[5:25]
        genres = [int(x) for x in genres]
        movieDict[movieID] = (name, genres,
                              MovieNormalizeNumRatings.loc[movieID].get('size'),
                              MovieProperties.loc[movieID].rating.get('mean'))

print(movieDict[1][1])
print(movieDict[2][1])
print(ComputeDistance(movieDict[2], movieDict[4]))


K=10
avgRating = 0
neighbors = getNeighbors(1, K)
for neighbor in neighbors:
    avgRating += movieDict[neighbor][3]
    print(movieDict[neighbor][0] + " " + str(movieDict[neighbor][3]))
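Finally, we can average the accumulated ratings of the K neighbors to get a predicted rating for the movie:

avgRating /= K
print(avgRating)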

Tuesday, September 24, 2019

Covariance and Correlation - are two different attributes related to each other?

  • Covariance measures how two variables vary together around their means.
  • Covariance is the result of a calculation that returns a number indicating whether there is a relation between two attributes, but this number is not on a standard scale. So we use the covariance to calculate the correlation, which gives us a standard measurement (-1 to 1) - see the formulas below.
  • Correlation -1 means perfect inverse correlation.
    Correlation  0 means no correlation.
    Correlation  1 means perfect correlation.
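In formula form (this is what the code below implements):

covariance(x, y) = Σ (xᵢ - x̄)(yᵢ - ȳ) / (n - 1)
correlation(x, y) = covariance(x, y) / (σx · σy)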

Let's calculate covariance and correlation ourselves and also check
the built-in functions in Python's NumPy library:
import numpy as np
import matplotlib.pyplot as plt

def de_mean(x):
    xmean = np.mean(x)
    return [xi - xmean for xi in x]

def covariance(x, y):
    n = len(x)
    return np.dot(de_mean(x), de_mean(y)) / (n-1)

def correlation(x, y):
    # ddof=1 gives the sample standard deviation, matching the n-1 in covariance
    stdx = np.std(x, ddof=1)
    stdy = np.std(y, ddof=1)
    return covariance(x, y) / stdx / stdy

v1 = [1, 2, 3, 4, 5]
v2 = [1, 3, 2, 4, 5]

plt.scatter(v1, v2)
plt.show()

print(de_mean(v1))
print(de_mean(v2))

print(np.std(v1))
print(np.std(v2))

# use our defined covariance function
covar = covariance(v1, v2)
print(covar)

# use numpy covariance function - cov
print(np.cov(v1, v2))

# use our defined correlation function
corr = correlation(v1, v2)
print(corr)

# use numpy's correlation function - corrcoef
print(np.corrcoef(v1, v2))

Sunday, September 22, 2019

Support Vector Machine - supervised learning classification

  • SVM can handle data sets with many features and uses advanced mathematical trickery to cluster data - the "kernel trick" - which we use to find the separation line, or hyperplane.
  • A kernel is actually a mathematical transformation applied to the existing features.
    It helps to draw a clearer line between the different groups we want to classify.
  • SVM is a supervised learning technique and uses train/test data.
  • SVC - support vector classification - classifies data using SVM.
    We can use different "kernels" with SVC, e.g. linear, RBF, polynomial - see the short snippet after this list.
  • SVM actually draws a line (in 2D) or a hyperplane in n-dimensional space such that it maximizes the margins between the classification groups. It maximizes the margins between the support vectors (the nearby data points) and the decision boundary line.
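For example, choosing a kernel in scikit-learn is just a parameter of SVC (the parameter values here are only illustrative):

from sklearn import svm

linear_svc = svm.SVC(kernel='linear', C=1.0)
rbf_svc = svm.SVC(kernel='rbf', C=1.0)
poly_svc = svm.SVC(kernel='poly', degree=3, C=1.0)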

python code example:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler
from sklearn import svm, datasets
from pylab import *

# create fake income/age clusters for N people and k clusters
def createClusterdData(N, k):
    np.random.seed(1234)
    pointsPerCluster = float(N)/k
    X=[]
    y=[]
    for i in range(k):
        # from 20000 to 200000
        incomeCentroid = np.random.uniform(20000, 200000)

        # from 20 to 70
        ageCentroid = np.random.uniform(20, 70)

        for j in range(int(pointsPerCluster)):
            X.append([np.random.normal(incomeCentroid, 10000.0), np.random.normal(ageCentroid, 2.0)])
            y.append(i)

    X = np.array(X)
    y= np.array(y)
    return X, y

def plotPredictions(clf):
    # create a dense grid of points to sample
    xx, yy = np.meshgrid(np.arange(-1, 1, .001), np.arange(-1, 1, .001))

    # convert to numpy arrays
    npx = xx.ravel()
    npy = yy.ravel()

    # Convert to list of 2D (income, age) points
    samplePoints = np.c_[npx, npy]

    # Generate predictive labels for each point
    Z = clf.predict(samplePoints)

    plt.figure(figsize=(8, 6))
    # Reshape results to match xx dimensions
    Z = Z.reshape(xx.shape)
    plt.contourf(xx, yy, Z, cmap=plt.cm.Paired, alpha=0.8)
    plt.scatter(X[:, 0], X[:, 1], c=y.astype(np.float))
    plt.show()


(X, y) = createClusterdData(100, 5)

plt.figure(figsize=(8, 6))
# scatter column 0, column 1 ,c = color correlated by 
#label so each cluster will have different label
plt.scatter(X[:, 0], X[:, 1], c=y.astype(np.float))
plt.show()

scaling = MinMaxScaler(feature_range=(-1, 1)).fit(X)
X = scaling.transform(X)

plt.figure(figsize=(8, 6))
plt.scatter(X[:, 0], X[:, 1], c=y.astype(np.float))
plt.show()

# use linear SVC to partition our graph into clusters
C = 1.0
svc = svm.SVC(kernel='linear', C=C).fit(X, y)


plotPredictions(svc)

# the model was trained on scaled data, so we scale the point before predicting
print(svc.predict(scaling.transform([[5000, 65]])))

Wednesday, September 18, 2019

Decision Trees - another method of supervised learning

  • Given a bunch of attributes or variables, decide a classification by constructing a flowchart.
  • The algorithm predicts a decision based on the given attributes. Our goal is to reach concrete decisions in the early stages.
  • At each step, find the attribute we can use to partition the data set so as to
    minimize the entropy of the data at the next step and reach
    a definitive answer in as few steps as possible.
  • Decision trees are very susceptible to overfitting. That's why we use a technique called Random Forests.
    A Random Forest is basically a "forest" of alternate decision trees
    that vote on the final classification - see the short example below.
    Random Forests randomly re-sample the input data for each tree and also randomize
    the subset of attributes each split is allowed to choose from.
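A minimal scikit-learn sketch of the Random Forests idea (the Iris data set and parameter values are only for illustration):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# a small labeled data set
X, y = load_iris(return_X_y=True)

# random forests are a supervised technique, so we use a train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# 100 decision trees, each trained on a random re-sample of the data and
# allowed to choose from a random subset of features at each split;
# the trees vote on the final classification
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

print(clf.score(X_test, y_test))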

Overfitting and Underfitting

Overfitting is a fundamental problem in statistics and machine learning where
the model is overly adapted to a particular collection of data
(e.g. the collection available for training) and is therefore less successful at forecasting.
One of the things that can cause overfitting is a model with more parameters
than it should have. The excess parameters allow the model to learn the statistical noise
as if it represented real behavior.

How do we prevent overfitting?
Use techniques like: cross-validation, removing features, regularization, and training with more data.
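For example, cross-validation in scikit-learn can be as short as this sketch (the data set and parameters are only illustrative):

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# 5-fold cross-validation: train on 4/5 of the data, validate on the
# remaining 1/5, repeat 5 times and look at the average score
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
print(scores)
print(scores.mean())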

Underfitting, on the other hand, occurs when the statistical model is too simple
to properly represent the basic structure of the data,
for example because too few parameters define the model.
An example is an attempt to use a linear model to describe nonlinear behavior.

Tuesday, September 17, 2019

Basic math - the concept of log

In machine learning we use mathematical formulas to give precise definitions for some methods.
The following is a reminder of the concept of "log".

The log of a number is the exponent you have to raise the base to in order to get that number - it's that simple.
Let's look at some examples (log base 2):
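log2(8) = 3     because 2^3 = 8
log2(16) = 4    because 2^4 = 16
log2(1) = 0     because 2^0 = 1
log2(1/4) = -2  because 2^(-2) = 1/4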


The log of a fraction (a number between 0 and 1) is negative.

For example, in a decision tree we use the entropy calculation:
entropy = -Σᵢ pᵢ · log(pᵢ)

pᵢ is the proportion of the data labeled with class i.
pᵢ is a fraction, so each log(pᵢ) is negative and the leading minus sign makes the entropy non-negative.
With two classes (and log base 2), entropy is a value between 0 and 1.
0 means no entropy - all the data is the same.







Monday, September 16, 2019

K-Means clustering - unsupervised learning


  • We have a collection of data that we want to divide into meaningful groups, or clusters.
  • The method attempts to split the data into K groups, where each group is closest to a centroid point; hence there are K centroid points.
  • Can discover interesting groups of things - people, behaviors, etc.
  • How does the algorithm work?
  1. Randomly pick K centroids (the "K means").
  2. Assign each data point to the centroid it is closest to.
  3. Recompute each centroid based on the average position of its points.
  4. Iterate until you reach a threshold and the centroids stop changing.


split data into 3 clusters



python code example using sklearn.cluster  KMeans:

from sklearn.cluster import KMeans
from sklearn.preprocessing import scale
import numpy as np
import matplotlib.pyplot as plt

# create fake data of income/age for N people and k clusters
def createClusterdData(Npeople, k):
    np.random.seed(10)
    pointsPerCluster = float(Npeople)/k
    X=[]
    for i in range(k):
        # from 20000 to 200000
        incomeCentroid = np.random.uniform(20000, 200000)

        # from 20 to 70
        ageCentroid = np.random.uniform(20, 70)

        for j in range(int(pointsPerCluster)):
            X.append([np.random.normal(incomeCentroid, 10000.0), np.random.normal(ageCentroid, 2.0)])

    X = np.array(X)
    return X


# 100 people for 5 clusters
data = createClusterdData(100, 5)

# create a model
model = KMeans(n_clusters=4)

# scale the data to normalize it because there is a very large difference
# between the magnitudes of income and age - the scale function fits the values
# to the same scale
model = model.fit(scale(data))

print(model.labels_)

plt.figure(figsize=(8, 6))
# 2 columns income,age
plt.scatter(data[:, 0], data[:, 1],  c=model.labels_.astype(np.float))
plt.show()


4-means cluster




Sunday, September 15, 2019

Linear regression is a special case of polynomial regression


  • The formula of linear regression is y = mx + b. This formula represents a straight line, but not all relationships are linear.
  • Linear regression is just one member of a whole class of regressions. It is actually a first-degree polynomial regression.
  • Polynomial regressions:
    first-degree polynomial regression (linear regression): y = mx + b

    second-degree polynomial regression: y = ax² + bx + c
    third-degree polynomial regression: y = ax³ + bx² + cx + d

  • python code example
    import numpy as np
    import matplotlib.pyplot as plt
    
    np.random.seed(2)
    numberOfVisitors = np.random.normal(3, 1, 1000)
    numberOfClicks = np.random.normal(50, 10, 1000) / numberOfVisitors
    
    x = np.array(numberOfVisitors)
    y = np.array(numberOfClicks)
    
    # create a line from x=0 to x=7 with 100 evenly spaced values
    xp = np.linspace(0, 7, 100)
    
    # 4 degree polynomial fit
    p4 = np.poly1d(np.polyfit(x, y, 4))
    plt.scatter(x, y)
    plt.plot(xp, p4(xp), c='r')
    plt.show()
4th-degree polynomial fit