CRUCIVERB.COM

Constructing => Software / Technical => Topic started by: michaeladamkatz on September 16, 2011, 01:26:36 AM

Title: what are my word list options for "highly recognizable" names?
Post by: michaeladamkatz on September 16, 2011, 01:26:36 AM
(I'm absolutely new on this forum. I did do a quick scan, but I apologize if the answer to this question is already posted.)

I am looking for a word list, and it may be that the CRUCIVERB word list is what I need, in which case I'm happy to pay the membership fee to get it. But I just wanted to make sure it would work for my purpose.

I'm a programmer and I'm trying to make puzzles for a new game. They are actually not traditional crossword puzzles, but they do use crossword-style clues and answers. The puzzles are meant to be solvable by a "casual player", like maybe the level of an airplane magazine puzzle. The way the game works is that the player sees a quotation, and from that quotation I draw strings of letters that combine to form answers. So if part of the quotation were "FOR A MAN WHO MAKES", I might take the "FOR" and the "AKES" and throw in an "s" (which is allowed in this particular game) and make the answer "FORsAKES", and then write a clue like "Abandons". I've been doing that process by hand, and I want to use a word list to automate it, or at least to make a combined manual/automated process. To be clear, I don't want to automate the clue generation, just this piecemeal composition of answers from strings within the quote such that a reasonably easy clue can be written for them.
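
To make the composition step concrete, here is a minimal Python sketch of how a word list could be checked against a quotation. The exact game rules are not spelled out above, so everything here is an assumption: fragments are taken to be runs of two or more letters from inside a single word of the quote, at most one extra "S" may be inserted, and the names `fragments` and `can_compose` are hypothetical.

```python
def fragments(quote, min_len=2):
    """Collect every run of min_len or more consecutive letters
    found inside a single word of the quote (assumed rule)."""
    frags = set()
    for word in quote.upper().split():
        letters = "".join(ch for ch in word if ch.isalpha())
        for i in range(len(letters)):
            for j in range(i + min_len, len(letters) + 1):
                frags.add(letters[i:j])
    return frags


def can_compose(answer, frags, max_extra=1, extra_letters="S"):
    """Return True if `answer` can be built left to right from fragments,
    inserting at most `max_extra` letters drawn from `extra_letters`."""
    answer = answer.upper()

    def build(pos, extras_left):
        if pos == len(answer):
            return True
        # Try every fragment that matches at the current position.
        for end in range(pos + 1, len(answer) + 1):
            if answer[pos:end] in frags and build(end, extras_left):
                return True
        # Otherwise spend one of the allowed inserted letters, if it fits.
        if extras_left > 0 and answer[pos] in extra_letters:
            return build(pos + 1, extras_left - 1)
        return False

    return build(0, max_extra)


quote_frags = fragments("FOR A MAN WHO MAKES")
candidates = ["FORSAKES", "MANOR", "FORMAT"]
usable = [w for w in candidates if can_compose(w, quote_frags)]
print(usable)
```

Running the same filter over a full word list would give the pool of candidate answers for a quote; writing the clues stays a manual step.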

I have a lower-case word list (i.e., Scrabble-allowed words) that is appropriately thinned to match the vocabulary level I'm going for. I'm looking for a list to throw in lots of capitalized words: place names, people names, names of movies and books, etc., etc.

What I want to understand is whether, with the CRUCIVERB list, I'll have the option to automatically select only the "highly recognizable" place names, people names, names of movies and books, etc.

I guess what I'm asking is whether the entries in the CRUCIVERB list are somehow rated to indicate their level of obscurity, so that, roughly speaking, I could distinguish a proper noun that is recognizable by, say, 99% of the 30-year-olds on the street, given the proper clue (e.g., the name Jackson), from a proper noun that is recognizable by a much smaller percentage, regardless of the clue (e.g., the name Reyes, or any number of even moderately obscure names, terms, acronyms, or other words where no matter what the clue it's not going to be recognized by a very "casual" player).

Really it's like I just want a list of the 10,000 (? -- I don't know if that's the appropriate number but it sounds reasonable) most recognized celebrity names, movie titles, book titles, historical figures, fictional characters, geographical names, and so on. I can piece together a list from lots of different sites around the internet, but I was hoping to just find or purchase something.

Thanks for any help.
Title: Re: what are my word list options for "highly recognizable" names?
Post by: Todd G on September 17, 2011, 06:17:07 PM
As far as I know, no one has created lists of the most recognizable people/places/movies, etc.  Even if someone did once, it would be out of date in a couple of years.

The CRUCIVERB lists, as far as I know, are unranked lists of entries that have appeared in puzzles before.  So ROBERTREDFORD and ROBERTSHAW get the same rank, though the first is clearly better known than the second.

You might try sites like www.whoismorefamous.com, but this is clearly a limited sample set.

If you do find such a list somewhere, we constructors would be really happy to hear about it!

—Todd
Title: Re: what are my word list options for "highly recognizable" names?
Post by: michaeladamkatz on September 17, 2011, 11:06:29 PM
Yes, I agree the list would get outdated quickly. Nonetheless, it seems like a worthy goal to build such a list, and I'll see what I can put together (or find).

Michael
Title: Re: what are my word list options for "highly recognizable" names?
Post by: Alex Boisvert on September 23, 2011, 02:42:32 PM
If I were to do this, I'd use the Wikipedia data dumps to build such a list.  I've actually done bits and pieces of this before.  Here would be my general strategy:

1. Get the latest pages-articles.xml and pagelinks.sql from http://dumps.wikimedia.org/enwiki/latest/ (these downloads take absolutely forever).

2. Make a perl script to sift through them.  Great examples are found here:
http://en.wikipedia.org/wiki/Wikipedia:Computer_help_desk/ParseMediaWikiDump

3. I would order the results by number of links to a given page. To use Todd's example, Robert Redford's page would presumably have a lot more incoming links than Robert Shaw, so Redford would rank higher.  (Just checking ... last time I ran this Redford had 842 incoming links and Shaw had 116).  It's not a perfect metric, but it's pretty good.  You can run through the sql file to make those rankings -- you'd probably have to do this for all pages, since you can't tell who's a person just yet.

4. Then you go through the .xml file to pull out only people.  One way to do that is to look for "births" in the list of category names.  Then associate each "birth" with a number from step 3, and you're done.

EDIT: I notice it's not just people you're after. You can add other categories to the list, in that case.

I can help you out with the perl, if you're interested in this strategy.
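
Steps 3 and 4 above can be sketched roughly as follows. This is Python rather than perl, and it works on small in-memory data instead of the real dumps. The function name `rank_pages`, the input shapes, and the toy link data are all assumptions made for illustration; extracting the link targets and category names from the dump files is not shown.

```python
from collections import Counter


def rank_pages(link_targets, categories_by_title, keyword="births"):
    """Rank pages by inbound-link count, keeping only pages that have
    at least one category name containing `keyword`."""
    # Step 3: one entry in link_targets per link, so counting the
    # entries gives the number of incoming links for each page.
    inbound = Counter(link_targets)
    # Step 4: keep only pages whose categories mention the keyword.
    kept = [
        (title, count)
        for title, count in inbound.items()
        if any(keyword in cat for cat in categories_by_title.get(title, ()))
    ]
    # Most-linked first; ties broken alphabetically.
    return sorted(kept, key=lambda item: (-item[1], item[0]))


# Made-up toy data, not real link counts.
links = ["Robert Redford", "Robert Shaw", "Robert Redford",
         "Utah", "Robert Redford", "Utah"]
categories = {
    "Robert Redford": ["1936 births", "American film actors"],
    "Robert Shaw": ["1927 births", "English film actors"],
    "Utah": ["States of the United States"],
}
ranking = rank_pages(links, categories)
print(ranking)
```

Passing a different `keyword` (or extending the test to several keywords) would cover categories other than people.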
Title: Re: what are my word list options for "highly recognizable" names?
Post by: michaeladamkatz on September 24, 2011, 05:55:57 PM
Alex, thanks for that good idea, and the offer to share code. Those are some massive files. Have you downloaded them recently?

Yeah, it might not be necessary to filter at all. I just want any recognizable terms, regardless of category.

If I simply want to count links, do I even need to download the pages-articles file? I.e., does the pagelinks file contain the article title?

Michael
Title: Re: what are my word list options for "highly recognizable" names?
Post by: Alex Boisvert on September 26, 2011, 02:46:38 PM
Yes, in that case the pagelinks file would be enough, and it would be a substantially easier process.
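
A minimal Python sketch of counting links from the pagelinks file alone might look like this. It assumes each INSERT row has the three-column layout `(pl_from, pl_namespace, 'pl_title')`; newer dumps may use a different layout, so check the CREATE TABLE statement at the top of the file first. The function name `count_inbound_links` and the sample line are made up for illustration.

```python
import re
from collections import Counter

# One row of an INSERT statement: (pl_from, pl_namespace, 'pl_title').
# The title part allows backslash-escaped characters such as \'.
ROW = re.compile(r"\((\d+),(-?\d+),'((?:[^'\\]|\\.)*)'\)")


def count_inbound_links(lines, namespace=0):
    """Count how often each title in `namespace` appears as a link target."""
    counts = Counter()
    for line in lines:
        if not line.startswith("INSERT INTO"):
            continue  # skip schema definitions and comments
        for _from_id, ns, title in ROW.findall(line):
            if int(ns) == namespace:
                # Undo SQL quote escaping and turn underscores into spaces.
                counts[title.replace("\\'", "'").replace("_", " ")] += 1
    return counts


# Made-up sample line in the assumed format.
sample = [
    "INSERT INTO `pagelinks` VALUES (1,0,'Robert_Redford'),"
    "(2,0,'Robert_Redford'),(3,0,'Robert_Shaw'),(4,1,'Robert_Shaw');"
]
counts = count_inbound_links(sample)
print(counts.most_common())
```

For the real file, an open file handle can be passed in place of `sample`, since the function only needs an iterable of lines. Namespace 0 is the main article namespace, so links to talk pages and the like are left out.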

Good luck!