Tuesday, August 30, 2011

The Methodology Behind Words of Loss and Words of Win

Last month I posted a list of words that, when used in a Prosper listing that successfully became a loan, were associated with loans that did better than average, along with a list of words associated with loans that did worse than average.

A comment from havastat on the prospers.org forum made me realize that I had neglected to talk about the methodology used to find these words. It is as follows:


For every loan created successfully on Prosper before the end of 2007, I created a list of all of the words used in the Title and Description of the loan. For every instance of a word in a loan that was Paid, I added 1 to that word's running total of PaidInstances. For every instance of a word in a loan that had any other status, I added 1 to that word's running total of UnpaidInstances.

I then calculated the Paid percentage for each word with the formula:
PaidInstances / (PaidInstances + UnpaidInstances)
(Which is to say: PaidInstances / TotalWordUsage)

I reduced the list to words which had been used at least 1000 times in the loan set and sorted it from words that were most often in Paid loans to words that were least often in Paid loans. I then compared that list to the overall likelihood of any loan being paid back.
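The steps above can be sketched in Python roughly as follows. This is an illustration, not the code actually used for the analysis: the loan records are hypothetical dicts with `title`, `description`, and `status` keys, the tokenizer is an assumed simple one (lowercase runs of letters and apostrophes), and `min_uses` stands in for the 1000-use cutoff.

```python
import re
from collections import Counter

def tokenize(text):
    # Assumed tokenizer: lowercase runs of letters and apostrophes.
    return re.findall(r"[a-z']+", text.lower())

def word_paid_rates(loans, min_uses=1000):
    """Return (word, paid_fraction) pairs, most-often-Paid first."""
    paid_instances = Counter()
    unpaid_instances = Counter()
    for loan in loans:
        words = tokenize(loan["title"] + " " + loan["description"])
        # Every instance of a word counts, including repeats within one loan.
        if loan["status"] == "Paid":
            paid_instances.update(words)
        else:
            unpaid_instances.update(words)

    rates = {}
    for word in set(paid_instances) | set(unpaid_instances):
        total = paid_instances[word] + unpaid_instances[word]
        if total >= min_uses:
            # PaidInstances / (PaidInstances + UnpaidInstances)
            rates[word] = paid_instances[word] / total
    return sorted(rates.items(), key=lambda item: item[1], reverse=True)

def overall_paid_rate(loans):
    # Baseline to compare against: fraction of all loans that were Paid.
    return sum(1 for loan in loans if loan["status"] == "Paid") / len(loans)
```

Calling `word_paid_rates(loans)` and comparing each fraction with `overall_paid_rate(loans)` gives the kind of above-average and below-average split described here.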

I found the word 'lender' at the top, with loans containing the word having been Paid 68.96% of the time. I found the word 'payday' at the bottom, with loans containing the word having been Paid only 38.89% of the time. (This compared to an average Paid percentage, across all loans, of about 61% for this time period.)

Now, I think there is an argument to be made that it would have been better to count each word at most once per listing. What I did measures use of the word itself more than it measures whether the word appears in a listing ("help, help, help, help!" in one listing counts 4 times instead of just once). But I think the best choice really depends on what you're trying to do with the information.
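For comparison, here is a rough Python sketch of the once-per-listing alternative. It is not something I ran for the original lists. The loan records are hypothetical dicts with `title`, `description`, and `status` keys, the word splitting is an assumed simple regex, and applying a `min_listings` cutoff to listings rather than to word instances is also an assumption.

```python
import re
from collections import Counter

def listing_paid_rates(loans, min_listings=1000):
    """Fraction of listings containing each word that were Paid."""
    paid_listings = Counter()
    unpaid_listings = Counter()
    for loan in loans:
        text = (loan["title"] + " " + loan["description"]).lower()
        # A set collapses repeats, so "help, help, help, help!" counts once.
        unique_words = set(re.findall(r"[a-z']+", text))
        if loan["status"] == "Paid":
            paid_listings.update(unique_words)
        else:
            unpaid_listings.update(unique_words)

    rates = {}
    for word in set(paid_listings) | set(unpaid_listings):
        total = paid_listings[word] + unpaid_listings[word]
        if total >= min_listings:
            rates[word] = paid_listings[word] / total
    return rates
```

With this version, a single listing that repeats a word many times can no longer pull that word's percentage up or down on its own.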
