1. One nearest neighbor
Input:
Query article: Xq
Corpus of documents (N docs): (X1, X2, X3,... ,XN)
output :
XNN = argmin over i of distance(Xq, Xi), i.e. the corpus document with the smallest distance to the query
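A minimal runnable sketch of this in Python (documents as NumPy count vectors; the name nearest_neighbor is illustrative, not from the course):

import numpy as np

def nearest_neighbor(query, corpus):
    """Return the index of the corpus document closest to the query.

    query:  1-d array of length d (e.g. a word-count vector)
    corpus: 2-d array of shape (N, d), one document per row
    """
    # Euclidean distance from the query to every document
    distances = np.linalg.norm(corpus - query, axis=1)
    return int(np.argmin(distances))

# toy example: 3 documents over a 4-word vocabulary
corpus = np.array([[1, 0, 2, 0],
                   [0, 3, 0, 1],
                   [1, 1, 1, 0]], dtype=float)
query = np.array([1, 0, 1, 0], dtype=float)
print(nearest_neighbor(query, corpus))   # -> 0 (the first document is closest)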
2. K-NN Algorithm
# initialization: the first k articles, sorted by distance to the query
smallest_distances = sort(σ1, σ2, ..., σk)            # sorted list of the k distances
closest_articles = the corresponding (book1, ..., bookk), in the same order
for i = k+1, ..., N:
    compute σ = distance(booki, bookq)
    if σ < smallest_distances[k]:
        find j such that smallest_distances[j-1] < σ < smallest_distances[j]
        # remove the furthest article and shift the queue:
        closest_articles[j+1:k] = closest_articles[j:k-1]
        smallest_distances[j+1:k] = smallest_distances[j:k-1]
        set smallest_distances[j] = σ and closest_articles[j] = booki
return the k most similar articles
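The same idea as a runnable Python sketch; instead of manually shifting a sorted queue, it keeps the current k best in a max-heap via heapq, which is one way to get the O(N log k) behaviour mentioned in section 5 (function and variable names are illustrative):

import heapq
import numpy as np

def k_nearest_neighbors(query, corpus, k):
    """Return the indices of the k corpus rows closest to the query."""
    heap = []  # max-heap via negated distances: entries are (-distance, index)
    for i, doc in enumerate(corpus):
        dist = np.linalg.norm(doc - query)
        if len(heap) < k:
            heapq.heappush(heap, (-dist, i))
        elif dist < -heap[0][0]:           # closer than the current furthest of the k
            heapq.heapreplace(heap, (-dist, i))
    # return the k survivors, ordered from closest to furthest
    return [i for _, i in sorted((-d, i) for d, i in heap)]

corpus = np.random.rand(100, 5)
query = np.random.rand(5)
print(k_nearest_neighbors(query, corpus, k=3))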
3. Document representation
i. Bag of words model
Issue: raw word counts are dominated by frequent common words (e.g. "the", "and"), which drowns out the rare words that actually distinguish a document.
ii. TF-IDF document representation
The TF-IDF weight of a word is defined as tf * idf, where:
(tf) term frequency = the word's count in the document
(idf) inverse document frequency = log(number of docs / (1 + number of docs which contain this word))
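A small plain-Python sketch of these two formulas, following the idf definition above (the toy corpus and the helper name tfidf_vector are made up for illustration):

import math
from collections import Counter

docs = [
    "the quick brown fox",
    "the lazy dog",
    "the quick dog barks",
]
tokenized = [d.split() for d in docs]
N = len(tokenized)

# document frequency: number of docs containing each word
doc_freq = Counter()
for tokens in tokenized:
    doc_freq.update(set(tokens))

def tfidf_vector(tokens):
    """Map one tokenized document to {word: tf * idf}."""
    tf = Counter(tokens)   # term frequency = word counts
    # words appearing in every document get an idf at or below zero, i.e. they are downweighted
    return {w: tf[w] * math.log(N / (1 + doc_freq[w])) for w in tf}

print(tfidf_vector(tokenized[0]))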
4. Distance metric
i. Weight different features:
Specify weights as a function of the feature's spread.
For feature j:
aj = 1 / (max(xi[j]) - min(xi[j])), taking the max and min over all data points i
Scaled Euclidean distance:
distance(xi, xq) = sqrt((xi - xq)^T A (xi - xq)), where A is the diagonal matrix with a1, a2, ..., ad on the diagonal
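A NumPy sketch of the spread-based weights and the scaled Euclidean distance as defined above (spread_weights and scaled_euclidean are illustrative names; it assumes no feature has zero spread):

import numpy as np

def spread_weights(X):
    """a_j = 1 / (max_i X[i, j] - min_i X[i, j]) for each feature j."""
    return 1.0 / (X.max(axis=0) - X.min(axis=0))

def scaled_euclidean(xi, xq, a):
    """sqrt((xi - xq)^T A (xi - xq)) with A = diag(a)."""
    diff = xi - xq
    return np.sqrt(diff @ np.diag(a) @ diff)

X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 200.0]])
a = spread_weights(X)                    # [0.5, 0.005]: the wide feature is downweighted
print(scaled_euclidean(X[0], X[1], a))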
ii. Similarity
Similarity is the inner product between two article vectors: xi^T xq.
Normalized similarity is defined as:
(xi^T xq) / (||xi|| * ||xq||) = cos(theta),
where theta is the angle between the two vectors.
We normalize the similarity to rule out the effect of an article's length,
but normalizing can make dissimilar objects appear more similar (e.g. a tweet vs. a long article); capping the maximum word counts is one way to cure this.
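A short NumPy sketch contrasting the raw inner-product similarity with the normalized (cosine) version; the toy vectors are assumptions for illustration:

import numpy as np

def similarity(xi, xq):
    return xi @ xq                      # raw inner product

def cosine_similarity(xi, xq):
    # divide out the vector lengths, leaving cos(theta)
    return (xi @ xq) / (np.linalg.norm(xi) * np.linalg.norm(xq))

tweet        = np.array([1.0, 1.0, 0.0])       # short document
long_article = np.array([10.0, 10.0, 0.0])     # same word proportions, 10x longer
print(similarity(tweet, long_article))          # 20.0, grows with document length
print(cosine_similarity(tweet, long_article))   # 1.0, length no longer matters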
5. Complexity of brute-force search of KNN
O(N) distance computations per 1-NN query
O(N log k) per k-NN query (maintain a priority queue and insert into it in an efficient manner)
6. KD-trees
Which dimension do we split along? (e.g. the widest-spread dimension, or simply alternate through the dimensions)
Which value do we split at? (e.g. the median value, or the center point of the box)
When do we stop? (e.g. when fewer than a minimum number of points are left in a node, or the box hits a minimum width)
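A compact Python sketch of one common set of answers to these three questions (split along the widest-spread dimension, split at the median, stop at a small leaf size); KDNode and build_kdtree are illustrative names, not the course's reference implementation:

import numpy as np

class KDNode:
    def __init__(self, points=None, dim=None, value=None, left=None, right=None):
        self.points = points      # only set for leaf nodes
        self.dim = dim            # splitting dimension (internal nodes)
        self.value = value        # splitting value     (internal nodes)
        self.left = left
        self.right = right

def build_kdtree(X, leaf_size=2):
    """X: array of shape (n, d). Returns the root KDNode."""
    if len(X) <= leaf_size:                      # stop: few enough points
        return KDNode(points=X)
    spreads = X.max(axis=0) - X.min(axis=0)
    dim = int(np.argmax(spreads))                # split along the widest dimension
    value = float(np.median(X[:, dim]))          # split at the median
    left_mask = X[:, dim] <= value
    # guard against all points landing on one side (duplicate coordinates)
    if left_mask.all() or (~left_mask).all():
        return KDNode(points=X)
    return KDNode(dim=dim, value=value,
                  left=build_kdtree(X[left_mask], leaf_size),
                  right=build_kdtree(X[~left_mask], leaf_size))

root = build_kdtree(np.random.rand(20, 2))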
7. KNN with KD-trees
Complexity of this algorithm
Construction:
size = 2N - 1 nodes if there is 1 datapoint at each leaf → O(N)
depth = O(log N)
finding the median + sending points left/right = O(N) at every level of the tree
Construction time = depth * cost at each level = O(N log N)
1-NN Query:
Traverse down the tree to a starting point = O(log N)
Maximum backtrack and traverse = O(N) in the worst case
Complexity range: O(log N) to O(N), but exponential in d (dimensions)
For N queries, the original brute-force KNN method = O(N^2); using KD-trees = O(N log N) to O(N^2)
In high dimensions: the bounding-box pruning rarely triggers, so the search ends up inspecting most of the points and KD-trees lose their advantage.
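In practice one would usually not hand-roll the query step; a usage sketch with SciPy's scipy.spatial.cKDTree, which implements the same construct-then-query idea (the corpus here is random toy data):

import numpy as np
from scipy.spatial import cKDTree

corpus = np.random.rand(10000, 3)        # N documents, d features (d kept small)
tree = cKDTree(corpus)                   # construction: O(N log N)

query = np.random.rand(3)
dist, idx = tree.query(query, k=5)       # the 5 nearest neighbours of one query
print(idx, dist)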
8. Approximate K-NN with KD-trees
Before: Prune when distance to bounding box > r
Now: Prune when distance to bounding box > r/α
so we prune more aggressively than exact search allows; the guarantee is that no pruned point can be closer than r/α, i.e. the returned neighbor is within a factor of α of the true nearest neighbor.
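The only change to the search loop is the pruning test; a tiny sketch, where prune_subtree is an illustrative helper and r is the distance to the best neighbor found so far:

def prune_subtree(dist_to_bounding_box, r, alpha=1.0):
    """alpha = 1.0 reproduces exact search; alpha > 1.0 prunes more aggressively."""
    return dist_to_bounding_box > r / alpha

# exact:        prune_subtree(d_box, r)
# approximate:  prune_subtree(d_box, r, alpha=2.0)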
9. Locality sensitive hashing
simple "binning" of data into multiple bins, and we only search points within the bin, fewer points in the bin
Each line can split points, so we are sacrificing accuray for speed.
but we can search neighboring bins(filp 1 bit, until computational budget is reached).
In high dimensions, we just draw random planes and per daa point, need d multiplies to determine bin index per plane.
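A runnable sketch of random-hyperplane binning (the names hash_bin and num_planes and the toy data are assumptions for illustration): each plane contributes one bit of the bin index, and candidate neighbors are taken only from the query's bin.

import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)
d, num_planes = 50, 8
planes = rng.normal(size=(num_planes, d))    # one random hyperplane per row

def hash_bin(x):
    """One bit per plane (d multiplies each); bits are concatenated into a bin index."""
    bits = (planes @ x) >= 0
    return int("".join("1" if b else "0" for b in bits), 2)

corpus = rng.normal(size=(1000, d))
bins = defaultdict(list)
for i, doc in enumerate(corpus):
    bins[hash_bin(doc)].append(i)

query = rng.normal(size=d)
candidates = bins[hash_bin(query)]           # search only the query's bin
print(len(candidates), "candidates instead of", len(corpus))

# to trade some speed back for accuracy, also visit bins whose index differs by 1 bit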
Machine learning notes (Washington University) - Clustering Specialization - week one & week two
Original source: http://www.cnblogs.com/climberclimb/p/6891713.html