2008-02-21 22:39:23
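Yesterday I was impressed by how neatly Python swaps the rows and columns of a multidimensional array, but today I was disappointed by how clumsy it is at building a frequency list. Why is there nothing like PHP's array_count_values? I use it all the time in PHP, so doing without it is a real nuisance. Also, in PHP an associative array can be sorted easily by either key or value, so why does it take so much effort in Python? I am starting to dislike Python. I looked into how to count which words appear, and how often, in the following passage taken from a newspaper article about Kosovo.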
#!/usr/bin/python
import re

sentences = "Earlier, Albanian police officers, part of Kosovo's \
multiethnic police force, were forced out of the neighboring \
Serbian village, where they were patrolling with fellow Serbs. \
It was the latest sign that Serbs in Kosovo, incensed by the \
declaration of independence Sunday, are trying to assert control \
over the northern part of Kosovo in an attempt to force partition."

# Turn newlines and punctuation into spaces, then drop any leftover marks.
noLine = re.sub('(\n)+', ' ', sentences)
sLine = re.sub('[.!:?]', ' ', noLine)
sLine2 = re.sub("[,\-']", ' ', sLine)
sLine3 = re.sub("[,';:]", '', sLine2)
sLine3 = re.sub('"', '', sLine3)

# Split on spaces and discard the empty strings left by repeated spaces.
words = sLine3.split(' ')
words = [w for w in words if len(w) > 0]

# Count the occurrences of each word.
wDict = {}
for w in words:
    if w in wDict:
        wDict[w] += 1
    else:
        wDict[w] = 1

# Print the words in alphabetical order with their counts.
keys = wDict.keys()
keys.sort()
for key in keys:
    print key, wDict[key]
print 'There are a total of ' + str(len(wDict)) + ' words in this text.'
The result looks like this.
Albanian 1
Earlier 1
It 1
Kosovo 3
Serbian 1
Serbs 2
Sunday 1
an 1
are 1
assert 1
attempt 1
by 1
control 1
declaration 1
fellow 1
force 2
forced 1
in 2
incensed 1
independence 1
latest 1
multiethnic 1
neighboring 1
northern 1
of 4
officers 1
out 1
over 1
part 2
partition 1
patrolling 1
police 2
s 1
sign 1
that 1
the 4
they 1
to 2
trying 1
village 1
was 1
were 2
where 1
with 1
There are a total of 44 words in this text.
Counting up the contents of a list is tedious. This list is in alphabetical order, but how do I sort it by frequency?
#!/usr/bin/python
import re

sentences = "Earlier, Albanian police officers, part of Kosovo's \
multiethnic police force, were forced out of the neighboring \
Serbian village, where they were patrolling with fellow Serbs. \
It was the latest sign that Serbs in Kosovo, incensed by the \
declaration of independence Sunday, are trying to assert control \
over the northern part of Kosovo in an attempt to force partition."

# Turn newlines and punctuation into spaces, then drop any leftover marks.
noLine = re.sub('(\n)+', ' ', sentences)
sLine = re.sub('[.!:?]', ' ', noLine)
sLine2 = re.sub("[,\-']", ' ', sLine)
sLine3 = re.sub("[,';:]", '', sLine2)
sLine3 = re.sub('"', '', sLine3)

# Split on spaces and discard the empty strings left by repeated spaces.
words = sLine3.split(' ')
words = [w for w in words if len(w) > 0]

# Count the occurrences of each word.
wDict = {}
for w in words:
    if w in wDict:
        wDict[w] += 1
    else:
        wDict[w] = 1

# (word, count) pairs sorted by word.
alph = sorted(wDict.items())
print 'Sorted by alphabet:'
for word in alph:
    print word[0], ":", word[1]

# (word, count) pairs sorted by count (ascending), then by word.
freq = sorted(wDict.items(), key=lambda kv: (kv[1], kv[0]))
print 'Sorted by frequency:'
for word in freq:
    print word[0], ":", word[1]
print 'There are a total of ' + str(len(wDict)) + ' words in this text.'
The result looks like this.
Sorted by alphabet:
Albanian : 1
Earlier : 1
It : 1
Kosovo : 3
Serbian : 1
Serbs : 2
Sunday : 1
an : 1
are : 1
assert : 1
attempt : 1
by : 1
control : 1
declaration : 1
fellow : 1
force : 2
forced : 1
in : 2
incensed : 1
independence : 1
latest : 1
multiethnic : 1
neighboring : 1
northern : 1
of : 4
officers : 1
out : 1
over : 1
part : 2
partition : 1
patrolling : 1
police : 2
s : 1
sign : 1
that : 1
the : 4
they : 1
to : 2
trying : 1
village : 1
was : 1
were : 2
where : 1
with : 1
Sorted by frequency:
Albanian : 1
Earlier : 1
It : 1
Serbian : 1
Sunday : 1
an : 1
are : 1
assert : 1
attempt : 1
by : 1
control : 1
declaration : 1
fellow : 1
forced : 1
incensed : 1
independence : 1
latest : 1
multiethnic : 1
neighboring : 1
northern : 1
officers : 1
out : 1
over : 1
partition : 1
patrolling : 1
s : 1
sign : 1
that : 1
they : 1
trying : 1
village : 1
was : 1
where : 1
with : 1
Serbs : 2
force : 2
in : 2
part : 2
police : 2
to : 2
were : 2
Kosovo : 3
of : 4
the : 4
There are a total of 44 words in this text.
This prints both the alphabetical list and the frequency list, but... hold on, the frequency list runs from least to most frequent. How do I get it from most to least frequent? It would be easy in PHP. For reference, here is the same task in PHP. Simple, isn't it? The results are the same, so I won't show them again.
<?php
$sentences = "Earlier, Albanian police officers, part of Kosovo's
multiethnic police force, were forced out of the neighboring
Serbian village, where they were patrolling with fellow Serbs.
It was the latest sign that Serbs in Kosovo, incensed by the
declaration of independence Sunday, are trying to assert control
over the northern part of Kosovo in an attempt to force partition.";
// Turn newlines and punctuation into spaces, then drop any leftover marks.
$noLine = preg_replace("/(\n)+/", " ", $sentences);
$sLine = preg_replace("/[.!:?]/", " ", $noLine);
$sLine = preg_replace("/[,\-']/", " ", $sLine);
$sLine = preg_replace("/[,';:]/", "", $sLine);
$sLine = preg_replace("/\"/", "", $sLine);
$words = explode(" ", $sLine);
// array_count_values does the counting in a single call.
$wDict = array_count_values($words);
// Drop the empty string produced by repeated spaces.
unset($wDict[""]);
// Sort by key (alphabetical order).
ksort($wDict);
echo "Sorted by alphabet:\n";
foreach ($wDict as $key => $val) {
    echo $key.":".$val."\n";
}
// Sort by value in descending order, keeping the keys.
arsort($wDict);
echo "\nSorted by frequency:\n";
foreach ($wDict as $key => $val) {
    echo $key.":".$val."\n";
}
echo "There are a total of ".count($wDict)." words in this text.\n";
?>
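As for the question left open before the PHP listing, how to get the Python frequency list from most to least frequent: `sorted` accepts `reverse=True`, and alternatively the count can be negated inside the key so that words with the same count still come out in alphabetical order. The following is only a minimal sketch, written in Python 3 syntax (the listings above are Python 2); the helper name `by_frequency_desc` and the variable `w_dict` are made up for illustration, and the small dictionary holds just a few of the counts shown earlier.

```python
def by_frequency_desc(w_dict):
    """Return (word, count) pairs, most frequent first; ties in alphabetical order."""
    # Negating the count puts high counts first while the word itself
    # still sorts ascending, which reverse=True alone would not do.
    return sorted(w_dict.items(), key=lambda kv: (-kv[1], kv[0]))


# A few of the counts from the Kosovo passage, for illustration only.
w_dict = {"the": 4, "of": 4, "Kosovo": 3, "police": 2, "force": 2, "Albanian": 1}

for word, count in by_frequency_desc(w_dict):
    print(word, ":", count)
```

Plain `sorted(w_dict.items(), key=lambda kv: (kv[1], kv[0]), reverse=True)` also gives descending counts, but it reverses the alphabetical tie-break along with them.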
Today I referred to the following two sites, along with the O'Reilly Python book and the like.
http://handasse.blogspot.com/2008/01/python.html
http://mainline.brynmawr.edu/Courses/cs325/fall2005/PythonIntro.html
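A note on the missing array_count_values: Python releases later than the one used in this post (2.7 and 3.1 onward) include `collections.Counter` in the standard library, which counts the items of a list in one call, and its `most_common()` method returns the pairs in descending order of count. Below is a minimal sketch in Python 3 syntax, assuming such a version is available. The function name `count_words`, the shortened sample sentence, and the simplified regular expression (every run of ASCII letters is a word, so "Kosovo's" still splits into "Kosovo" and "s") are illustrative choices and are not taken from the listings above.

```python
import re
from collections import Counter


def count_words(text):
    """Count how often each word occurs in text."""
    # Treat every run of ASCII letters as a word; punctuation, apostrophes
    # and whitespace all act as separators.
    words = re.findall(r"[A-Za-z]+", text)
    return Counter(words)


# A shortened fragment of the passage, for illustration only.
sentence = "Serbs in Kosovo are trying to assert control over the northern part of Kosovo."
counts = count_words(sentence)

# Alphabetical order, comparable to ksort() in the PHP listing.
for word in sorted(counts):
    print(word, ":", counts[word])

# Most frequent first, comparable to arsort() in the PHP listing.
for word, count in counts.most_common():
    print(word, ":", count)
```

A `Counter` is a dictionary subclass, so the `sorted(...)` calls used in the earlier listings work on it unchanged.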