I attended a talk on Exploring Linguistic Features for Web Spam and Using String Distance Metrics for Lemmatization of Polish Names, so I will write down some key points from both topics.
Exploring Linguistic Features for Web Spam:
The task is tough due to:
- Complexity
- Scale, and
- Adaptive nature of spam
They were also using the General Inquirer from Harvard University, but I am not much interested in it.
Using String Distance Metrics for Lemmatization of Polish Names:
The task was about synonym and homonym names.
The important thing that I liked in this talk was the string distance metrics, so I will just write a bit about these. They fall into several categories:
- Edit distance metrics: these include Levenshtein distance, bag distance, Needleman-Wunsch, Smith-Waterman, and Smith-Waterman with affine gaps
- q-Grams
- Longest Common Substring, Jaro, Jaro-Winkler
- Recursive string distance metrics: these include Monge-Elkan, Sorted Tokens, Permuted Tokens
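
To make a few of these concrete, here is a minimal Python sketch of one measure from three of the categories: Levenshtein distance, a q-gram distance, and Monge-Elkan. This is my own illustration, not code from the talk. The function names, the choice of q = 2, and the use of a length-normalized Levenshtein similarity as the inner measure for Monge-Elkan are all assumptions made for the example.

```python
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def qgram_distance(a: str, b: str, q: int = 2) -> int:
    """Sum of absolute differences between the q-gram counts of a and b."""
    ca = Counter(a[i:i + q] for i in range(len(a) - q + 1))
    cb = Counter(b[i:i + q] for i in range(len(b) - q + 1))
    return sum(abs(ca[g] - cb[g]) for g in ca.keys() | cb.keys())


def lev_similarity(a: str, b: str) -> float:
    """Levenshtein distance rescaled to [0, 1]; an assumed inner measure."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def monge_elkan(a: str, b: str) -> float:
    """Average, over tokens of a, of the best inner similarity to any token of b."""
    tokens_a, tokens_b = a.split(), b.split()
    if not tokens_a or not tokens_b:
        return 0.0
    best = [max(lev_similarity(ta, tb) for tb in tokens_b) for ta in tokens_a]
    return sum(best) / len(best)


print(levenshtein("Kowalski", "Kowalskiego"))       # 3 (the suffix "ego" is inserted)
print(monge_elkan("Jan Kowalski", "Kowalski Jan"))  # 1.0 (token order is ignored)
```

The last line shows why a recursive metric is attractive for names: swapping first name and surname costs nothing, while plain Levenshtein on the whole string would penalize it heavily. Note that Monge-Elkan as written is not symmetric, since it averages over the tokens of the first argument only.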