Tolle et Lege Diary

2008-02-21 22:39:23

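Yesterday I was impressed by how neatly Python transposes the rows and columns of a multidimensional array, but today I was let down by how clumsy it is at building a frequency list. Why is there nothing like PHP's array_count_values? I rely on it constantly in PHP, so doing without it is a real nuisance. Also, in PHP an associative array can be sorted easily by key or by value, so why does the same thing take so much effort in Python? I am starting to dislike Python. I looked into how to tally which words occur, and how often, in the following passage taken from a newspaper article about Kosovo.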

#!/usr/bin/python

import re

sentences = "Earlier, Albanian police officers, part of Kosovo's \
multiethnic police force, were forced out of the neighboring \
Serbian village, where they were patrolling with fellow Serbs. \
It was the latest sign that Serbs in Kosovo, incensed by the \
declaration of independence Sunday, are trying to assert control \
over the northern part of Kosovo in an attempt to force partition."

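# The literal above contains no newlines (the backslashes join the lines),
# so this first substitution only matters for text read from elsewhere.
noLine = re.sub('(\n)+', ' ', sentences)
# Sentence-ending punctuation becomes a space.
sLine = re.sub('[.!:?]', ' ', noLine)
# Commas, hyphens and apostrophes become spaces, which is why a lone "s" shows up in the output.
sLine2 = re.sub("[,\-']", ' ', sLine)
sLine3 = re.sub("[,';:]", '', sLine2)
sLine3 = re.sub('"', '', sLine3)
words = sLine3.split(' ')
# Drop the empty strings produced by consecutive spaces.
words = [w for w in words if len(w) > 0]
# Tally each word in a dictionary.
wDict = {}
for w in words:
    if wDict.has_key(w):
        wDict[w] += 1
    else:
        wDict[w] = 1
# Print the words in alphabetical order (uppercase sorts before lowercase).
keys = wDict.keys()
keys.sort()
for key in keys:
    print key, wDict[key]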
print 'There are a total of ' + str(len(wDict)) + ' words in this text.'

The result looks like this.

Albanian 1
Earlier 1
It 1
Kosovo 3
Serbian 1
Serbs 2
Sunday 1
an 1
are 1
assert 1
attempt 1
by 1
control 1
declaration 1
fellow 1
force 2
forced 1
in 2
incensed 1
independence 1
latest 1
multiethnic 1
neighboring 1
northern 1
of 4
officers 1
out 1
over 1
part 2
partition 1
patrolling 1
police 2
s 1
sign 1
that 1
the 4
they 1
to 2
trying 1
village 1
was 1
were 2
where 1
with 1
There are a total of 44 words in this text.
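The counting loop in the script above can be wrapped in a small helper in the spirit of PHP's array_count_values. The following is only a minimal sketch: the function name array_count_values and the sample phrase are made up for illustration, and collections.Counter is an assumption about the Python version, since it was only added in Python 2.7 and 3.1, after this entry was written.

```python
from collections import Counter


def array_count_values(items):
    """Hypothetical helper named after PHP's array_count_values.

    Returns a dict mapping each distinct value to the number of times it occurs.
    """
    counts = {}
    for item in items:
        # dict.get supplies 0 for a word not seen yet, replacing the has_key branch.
        counts[item] = counts.get(item, 0) + 1
    return counts


# Illustrative sample phrase (not the newspaper text); split() breaks on whitespace.
words = "the police force of the village".split()

counts = array_count_values(words)

# Counter does the same tally in one call where it is available.
counter = Counter(words)
top = counter.most_common(1)  # list with the single most frequent (word, count) pair
```

The plain-dict helper needs nothing beyond the built-in dict, so it should also work on the older interpreter used for the script above; Counter is simply the shorter route on newer versions.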

Tallying the contents of a list by hand like this is tedious. The output is in alphabetical order, but how do I sort it by frequency?

#!/usr/bin/python

import re

sentences = "Earlier, Albanian police officers, part of Kosovo's \
multiethnic police force, were forced out of the neighboring \
Serbian village, where they were patrolling with fellow Serbs. \
It was the latest sign that Serbs in Kosovo, incensed by the \
declaration of independence Sunday, are trying to assert control \
over the northern part of Kosovo in an attempt to force partition."

noLine = re.sub('(\n)+', ' ', sentences)
sLine = re.sub('[.!:?]', ' ', noLine)
sLine2 = re.sub("[,\-']", ' ', sLine)
sLine3 = re.sub("[,';:]", '', sLine2)
sLine3 = re.sub('"', '', sLine3)
words = sLine3.split(' ')
words = [w for w in words if len(w) > 0]
wDict = {} 
for w in words:
    if wDict.has_key(w):
        wDict[w] += 1
    else:
        wDict[w] = 1
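# items() yields (word, count) pairs; sorted() orders them by word by default.
alph = sorted(wDict.items())
print 'Sorted by alphabet:'
for word in alph:
    print word[0], ":", word[1]
# Sort by count first, then by word; ascending, so the rarest words come first.
freq = sorted(wDict.items(), key=lambda kv: (kv[1], kv[0]))
print 'Sorted by frequency:'
for word in freq:
    print word[0], ":", word[1]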
print 'There are a total of ' + str(len(wDict)) + ' words in this text.'

The result looks like this.

Sorted by alphabet:
Albanian : 1
Earlier : 1
It : 1
Kosovo : 3
Serbian : 1
Serbs : 2
Sunday : 1
an : 1
are : 1
assert : 1
attempt : 1
by : 1
control : 1
declaration : 1
fellow : 1
force : 2
forced : 1
in : 2
incensed : 1
independence : 1
latest : 1
multiethnic : 1
neighboring : 1
northern : 1
of : 4
officers : 1
out : 1
over : 1
part : 2
partition : 1
patrolling : 1
police : 2
s : 1
sign : 1
that : 1
the : 4
they : 1
to : 2
trying : 1
village : 1
was : 1
were : 2
where : 1
with : 1
Sorted by frequency:
Albanian : 1
Earlier : 1
It : 1
Serbian : 1
Sunday : 1
an : 1
are : 1
assert : 1
attempt : 1
by : 1
control : 1
declaration : 1
fellow : 1
forced : 1
incensed : 1
independence : 1
latest : 1
multiethnic : 1
neighboring : 1
northern : 1
officers : 1
out : 1
over : 1
partition : 1
patrolling : 1
s : 1
sign : 1
that : 1
they : 1
trying : 1
village : 1
was : 1
where : 1
with : 1
Serbs : 2
force : 2
in : 2
part : 2
police : 2
to : 2
were : 2
Kosovo : 3
of : 4
the : 4
There are a total of 44 words in this text.

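This gives me both the alphabetical list and the frequency list, but... hold on, the frequency list runs from least frequent to most frequent. How do I get the most frequent words first? It would be easy in PHP. For comparison, here is the same task in PHP. Simple, isn't it? The counts are the same, so I won't show the output again.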

<?php

$sentences = "Earlier, Albanian police officers, part of Kosovo's
multiethnic police force, were forced out of the neighboring
Serbian village, where they were patrolling with fellow Serbs.
It was the latest sign that Serbs in Kosovo, incensed by the
declaration of independence Sunday, are trying to assert control
over the northern part of Kosovo in an attempt to force partition.";

// Line breaks inside the string become spaces.
$noLine = preg_replace("/(\n)+/", " ", $sentences);
$sLine = preg_replace("/[.!:?]/", " ", $noLine);
// As in the Python version, commas, hyphens and apostrophes become spaces.
$sLine = preg_replace("/[,\-']/", " ", $sLine);
$sLine = preg_replace("/[,';:]/", "", $sLine);
$sLine = preg_replace("/\"/", "", $sLine);
$words = explode(" ",$sLine);
$wDict = array_count_values($words);
unset($wDict[""]);
ksort($wDict);
echo "Sorted by alphabet:\n";
foreach($wDict as $key=>$val){
  echo $key.":".$val."\n";
}
arsort($wDict);
echo "\nSorted by frequency:\n";
foreach($wDict as $key=>$val){
  echo $key.":".$val."\n";
}
echo "There are a total of ".count($wDict)." words in this text.\n";
?>
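As for the question of getting the most frequent words first in Python, one possible approach is to negate the count in the sort key, or to pass reverse=True to sorted(). This is a minimal sketch rather than the method from the sites consulted; the small counts dictionary is a hypothetical stand-in for wDict.

```python
# Hypothetical sample tallies standing in for wDict from the script.
counts = {"the": 4, "of": 4, "Kosovo": 3, "Serbs": 2, "force": 2, "an": 1}

# Negating the count sorts from high to low, while ties stay in alphabetical order.
by_freq_desc = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

# reverse=True flips the whole ordering, so tied words come out in reverse
# alphabetical order as well.
by_freq_rev = sorted(counts.items(), key=lambda kv: (kv[1], kv[0]), reverse=True)

for word, n in by_freq_desc:
    print("%s : %d" % (word, n))
```

The two variants differ only in how ties are ordered: the negated key keeps tied words alphabetical, which is usually easier to read in a frequency list.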

Today I referred to the following two sites, along with O'Reilly's Python book and a few other sources.
http://handasse.blogspot.com/2008/01/python.html
http://mainline.brynmawr.edu/Courses/cs325/fall2005/PythonIntro.html
