Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share
menu search
person
Welcome To Ask or Share your Answers For Others

Categories

I have a list of cities that have numerous incorrect spelling for the same city. One city is misspelled 18 times! I am trying to clean this up but its taking hours. Is there some algorithm that might "guess" at the valid city name for each of these misspelled ones? Some form of weighting? The data is in MySQL and I do have a table of the correct spelling as well to compare against.

Any ideas on this? A PHP example would help if possible.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
thumb_up_alt 0 like thumb_down_alt 0 dislike
391 views
Welcome To Ask or Share your Answers For Others

1 Answer

you could use a damerau-levenstein function to get the string distance between two strings. (This also checks for transposition)

http://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance

If your tables are big then you might need to optimise the algo a bit to break once the string distance exceeds a threashold.

Also if you can assume that the first letter of the city is typed correctly that should reduce the number of comparisons to.

Its not PHP, but if its of any help heres a java version I wrote:

class LevinshteinDistance{
    public static void main(String args[]){
        if(args.length != 2){
            System.out.println("Displays the Levenshtein distance between 2 strings");
            System.out.println("Usage: LevenshteinDistance stringA stringB");
        }else{
            int distance = getLevenshteinDistance(args[0], args[1]);
            System.out.print(getLevenshteinMatrix(args[0], args[1]));
            System.out.println("Distance: "+distance);
        }
    }   

    /**
     * @param a first string for comparison
     * @param b second string for comparison
     * @param caseSensitive whether or not to use case sensitive matching
     * @return a levenshtein string distance
     */  
    public static int getLevenshteinDistance(String a, String b, boolean caseSensitive){
        if(! caseSensitive){
        a = a.toUpperCase();
        b = b.toUpperCase();
        }
        int[][] matrix = generateLevenshteinMatrix(a, b);
        return matrix[a.length()][b.length()];
    }

    /**
     * @param a first string for comparison
     * @param b second string for comparison
     * @return a case sensitive levenshtein string distance
     */  
    public static int getLevenshteinDistance(String a, String b){
        int[][] matrix = generateLevenshteinMatrix(a, b);
        return matrix[a.length()][b.length()];
    }

    /**
     * @param a first string for comparison
     * @param b second string for comparison
     * @return a  case sensitive string representation of the search matrix
     */  
    public static String getLevenshteinMatrix(String a, String b){
        int[][] matrix = generateLevenshteinMatrix(a, b);
        StringBuilder result = new StringBuilder();
        final int ROWS = a.length()+1;
        final int COLS = b.length()+1;

        result.append(rowSeperator(COLS-1, false));
        result.append("|    "+b+" |
");
        result.append(rowSeperator(COLS-1, true));  

        for(int r=0; r<ROWS; r++){
            result.append('|');
            if(r > 0){
                result.append(a.charAt(r-1));
            }else{
                result.append(' ');
            }
            result.append(" |");
            for(int c=0; c<COLS; c++){
                result.append(matrix[r][c]);
            }
            result.append(" |
");
        }       
        result.append(rowSeperator(COLS-1, false));
        return result.toString();   
    }   


    private static String rowSeperator(final int LEN, boolean hasGap){
        StringBuilder result = new StringBuilder();
        if(hasGap){
            result.append("+  +-");
        }else{
            result.append("+----");
        }
        for(int i=0; i<LEN; i++) 
            result.append('-');
        result.append("-+
");
        return result.toString();
    }

    private static int[][] generateLevenshteinMatrix(String a, String b){
        final int ROWS = a.length()+1;
        final int COLS = b.length()+1;
        int matrix[][] = new int[ROWS][COLS];

        for(int r=0; r<ROWS; r++){
            matrix[r][0]=r;
        }
        for(int c=0; c<COLS; c++){ 
            matrix[0][c]=c;
        }

        for(int r=1; r<ROWS; r++){
            char cA = a.charAt(r-1);
            for(int c=1; c<COLS; c++){
                    char cB = b.charAt(c-1);
                int cost = (cA == cB)?0:1;

                int deletion =  matrix[r-1][c]+1; 
                int insertion = matrix[r][c-1]+1;
                int substitution = matrix[r-1][c-1]+cost;
                int minimum = Math.min(Math.min(deletion, insertion), substitution);    

                if( (r > 1 && c > 1) && a.charAt(r-2) == cB && cA == b.charAt(c-2) ){
                    int transposition = matrix[r-2][c-2]+cost;
                    minimum = Math.min(minimum, transposition);
                }
                matrix[r][c] = minimum;
            }
        }   
        return matrix;
    }
}

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
thumb_up_alt 0 like thumb_down_alt 0 dislike
Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share
...