Fixes and more documentation

Dale Weiler 2013-01-06 04:06:38 +00:00
parent 44a7154f58
commit d98cc564b1

@@ -44,9 +44,20 @@
  * out of all possible corrections that maximizes the probability of C
  * for the original identifier I.
  *
- * Bayes' Therom suggests something of the following:
+ * Thankfully there exist some theories for probabilistic interpretations
+ * of data. Since we're operating on two distinct interpretations, the
+ * transposition from I to C, we need something that can express to what
+ * degree I should rationally change to become C. This is called the
+ * Bayesian interpretation. You can read more about it here:
+ * http://www.celiagreen.com/charlesmccreery/statistics/bayestutorial.pdf
+ * (which is probably the only good online documentation for Bayes' theory,
+ * no lie. Everything else just sucks ..)
+ *
+ * Bayes' Theorem suggests something like the following:
  *     AC P(I|C) P(C) / P(I)
- * Since P(I) is the same for every possibly I, we can ignore it giving
+ *
+ * However, since P(I) is the same for every possible correction C, we can
+ * completely ignore it, giving just:
  *     AC P(I|C) P(C)
  *
  * This greatly helps visualize how the parts of the expression are performed
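The ranking described in the comment above can be sketched in a few lines of C. This is only an illustration, not code from this commit: `candidate_t`, `error_model` and `best_candidate` are hypothetical names, the prior P(C) is assumed to be a simple occurrence count divided by a total, and the error model P(I|C) is assumed to shrink by a factor of ten per edit. P(I) is left out because it does not depend on C.

```c
#include <stddef.h>

/* Hypothetical candidate record; this is not a type from the commit. */
typedef struct {
    const char *name;     /* candidate correction C                  */
    unsigned    count;    /* occurrences of C, used to estimate P(C) */
    unsigned    distance; /* edit distance between I and C           */
} candidate_t;

/*
 * Assumed error model for P(I|C): every additional edit makes the
 * misspelling ten times less likely. The factor 0.1 is an arbitrary
 * choice made for this sketch.
 */
static double error_model(unsigned distance) {
    double p = 1.0;
    while (distance--)
        p *= 0.1;
    return p;
}

/*
 * Return the candidate that maximizes P(I|C) * P(C), or NULL when the
 * list is empty. `total` is the sum of all counts and must be nonzero.
 */
static const candidate_t *best_candidate(const candidate_t *list, size_t n, unsigned total) {
    const candidate_t *best       = NULL;
    double             best_score = -1.0;
    size_t             i;

    for (i = 0; i < n; i++) {
        double prior = (double)list[i].count / (double)total;
        double score = error_model(list[i].distance) * prior;
        if (score > best_score) {
            best_score = score;
            best       = &list[i];
        }
    }
    return best;
}
```

With these assumptions a very common identifier two edits away can still outrank a rare identifier one edit away, which is the trade-off between the two factors that the expression is meant to capture.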
@@ -111,6 +122,7 @@
  * Our control mechanism could use a limit, i.e. limit the number of
  * sets of edits for distance X. This would also increase execution
  * speed considerably.
+ *
  */
@@ -163,12 +175,12 @@ static GMQCC_INLINE char *correct_pool_claim(const char *data) {
 }
 /*
- * A fast space efficent trie for a disctonary of identifiers. This is
+ * A fast, space-efficient trie for a dictionary of identifiers. This is
  * faster than a hashtable for one reason. A hashtable itself may have
  * fast constant lookup time, but the hash itself must be very fast. We
  * have one of the fastest hash functions for strings, but if you do a
  * lot of hashing (which we do, almost 3 million hashes per identifier)
- * a hashtable becomes slow. Very Very Slow.
+ * a hashtable becomes slow.
  */
 correct_trie_t* correct_trie_new() {
     correct_trie_t *t = (correct_trie_t*)mem_a(sizeof(correct_trie_t));
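To make the argument in the comment above concrete, here is a minimal sketch of a byte-indexed trie in C. It is not the `correct_trie_t` implementation from this file: the `trie_node` layout and the `trie_*` function names are assumptions made for illustration, and a fixed 256-way fan-out is used for simplicity rather than for space efficiency. The point it shows is that a lookup walks one node per byte and never computes a hash.

```c
#include <stdlib.h>

/* Hypothetical node layout; the real correct_trie_t may differ. */
typedef struct trie_node {
    struct trie_node *child[256]; /* one slot per possible byte value */
    int               terminal;   /* nonzero if a stored word ends here */
} trie_node;

/* Allocate an empty node, or return NULL if allocation fails. */
static trie_node *trie_new(void) {
    trie_node *t = (trie_node *)malloc(sizeof(trie_node));
    size_t     i;
    if (!t)
        return NULL;
    for (i = 0; i < 256; i++)
        t->child[i] = NULL;
    t->terminal = 0;
    return t;
}

/* Insert a NUL-terminated word; returns 0 on allocation failure. */
static int trie_insert(trie_node *t, const char *word) {
    for (; *word; word++) {
        unsigned char ch = (unsigned char)*word;
        if (!t->child[ch] && !(t->child[ch] = trie_new()))
            return 0;
        t = t->child[ch];
    }
    t->terminal = 1;
    return 1;
}

/* Return nonzero if the exact word was inserted earlier. */
static int trie_find(const trie_node *t, const char *word) {
    for (; *word; word++) {
        t = t->child[(unsigned char)*word];
        if (!t)
            return 0;
    }
    return t->terminal;
}

/* Release a node and everything below it. */
static void trie_free(trie_node *t) {
    size_t i;
    if (!t)
        return;
    for (i = 0; i < 256; i++)
        trie_free(t->child[i]);
    free(t);
}
```

A lookup costs at most one pointer dereference per byte of the identifier, so when millions of candidate strings are tested, most of which fail within the first few bytes, the walk can stop early instead of hashing each whole string.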
@@ -440,7 +452,8 @@ static char **correct_known(correct_trie_t* table, char **array, size_t rows, si
         end = correct_edit(array[itr]);
         row = correct_size(array[itr]);
-        for (; jtr < row; jtr++) {
+        /* removing jtr=0 here speeds it up by 100ms O_o */
+        for (jtr = 0; jtr < row; jtr++) {
             if (correct_find(table, end[jtr]) && !correct_exist(res, len, end[jtr])) {
                 res = correct_known_resize(res, &nxt, len+1);
                 res[len++] = end[jtr];