Tolle et Lege Diary

2008-08-26 22:18:01

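For the past three weeks or so I have been struggling with Google App Engine and Amazon EC2. I just can't make sense of them, and I can't use them well...

Managing my home server is a chore, so I have been thinking of making active use of services like these, yet I also feel that having the actual machine at hand is easier to work with in many ways... so which is better? Not that it's a serious worry.

I wanted to try building something, so I tried this: run a search through the Yahoo! Search Web Service and display the results as KWIC (keyword in context). I had built one before, but it didn't turn out very well and I wanted to redo it. That version was written in PHP; since this is Google App Engine, this time it's Python. For now, here is what I came up with.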

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import cgi
import wsgiref.handlers

from google.appengine.ext import webapp
from google.appengine.api import urlfetch
import urllib
import xml.etree.cElementTree as ET

class MainPage(webapp.RequestHandler):
  def get(self):
    self.response.out.write("""
      <html>
        <body>
        <h2>KWIC-search</h2>
          <form action="/kwic" method="post">
            <div><input type="text"  name="word" ></div>
            <div>
            <select name="lang">
              <option value="ja" selected>Japanese</option>
              <option value="ko">Korean</option>
              <option value="szh">Chinese (Simplified)</option>
              <option value="tzh">Chinese (Traditional)</option>
            </select>
            <select name="resnum">
              <option value="20" selected>20</option>
              <option value="50">50</option>
            </select>
            </div>
            <div><input type="submit" value="Search"></div>
          </form>
        </body>
      </html>""")

def kwic(word,lang,resnum):
  # 'xxxxxx' is a placeholder for the Yahoo! application ID.
  query = {'appid':'xxxxxx','query':word.encode('utf-8'),
    'language':lang,'results':resnum}
  url = ('http://search.yahooapis.jp/WebSearchService/V1/webSearch?'
    + urllib.urlencode(query))
  xmlns='{urn:yahoo:jp:srch}'
  xml = urlfetch.fetch(url).content
  dom = ET.fromstring(xml)
  results = []
  for item in dom.findall(xmlns + 'Result'):
    summary = item.findtext(xmlns + 'Summary')
    title = item.findtext(xmlns + 'Title')
    clickurl = item.findtext(xmlns + 'ClickUrl')
    res = {'summary':summary,'title':title,'clickurl':clickurl}
    results.append(res)
  return results

class Result(webapp.RequestHandler):
  def post(self):
    word = self.request.get('word')
    lang = self.request.get('lang')
    resnum = self.request.get('resnum')
    results = kwic(word,lang,resnum)
    table = "<table border='0'>"
    for r_item in results:
      prts = r_item['summary'].split('...')
      for part in prts:
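        # "in" also matches a keyword at the very start of the fragment,
        # which the earlier find(word) > 0 test skipped.
        if word in part:
          lr = part.split(word)
          # Escape the keyword here too, as is done for the heading below.
          trow = ("<tr><td align='right'>"+lr[0][-20:]
            +"</td><td><strong>"+cgi.escape(word)+"</strong></td><td>"+lr[1][0:20]
            +"</td><td><a href='"+r_item['clickurl']+"' target='_blank'>"
            +r_item['title'][0:10]+"</a></td></tr>")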
          table += trow
    table += "</table>"
    self.response.out.write('<html><body><h2>KWIC-yahoo</h2>\n'
      '<p>Keyword = <strong>')
    self.response.out.write(cgi.escape(word))
    self.response.out.write('</strong></p>\n')
    self.response.out.write(table)
    self.response.out.write('<hr /></body></html>')

def main():
  application = webapp.WSGIApplication([('/', MainPage),
                                        ('/kwic', Result)],
                                       debug=True)
  wsgiref.handlers.CGIHandler().run(application)

if __name__ == '__main__':
  main()

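The part I couldn't figure out was the line for item in dom.findall(xmlns + 'Result'):, and I agonized over it for days, even weeks. Nobody tells you this sort of thing. How does everyone else just know? Anyway, with this I can finally process the results. A returned summary sometimes contains the search term more than once. The occurrences are usually separated by "...", so I tried to reflect that as far as I could. I don't have much time today, so only four languages can be selected, but I'd like to add more later.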
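To spell out why that line matters, here is a minimal sketch. The response declares a default namespace, so ElementTree only matches a tag when it is written with the namespace URI in braces in front of it. The sample XML below is a made-up, trimmed stand-in for a real response (only the namespace URI and the element names come from the code above), and it uses xml.etree.ElementTree so that it also runs on current Python, where cElementTree no longer exists.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed response: only the namespace URI and the element
# names are taken from the listing above; the content is invented.
SAMPLE = """<ResultSet xmlns="urn:yahoo:jp:srch">
  <Result>
    <Title>First page</Title>
    <Summary>alpha keyword beta ... gamma keyword delta</Summary>
    <ClickUrl>http://example.com/1</ClickUrl>
  </Result>
  <Result>
    <Title>Second page</Title>
    <Summary>no match here</Summary>
    <ClickUrl>http://example.com/2</ClickUrl>
  </Result>
</ResultSet>"""

NS = '{urn:yahoo:jp:srch}'
root = ET.fromstring(SAMPLE)

# The bare tag name finds nothing, because every element is namespaced.
plain = root.findall('Result')

# Prefixing the tag with "{namespace-uri}" finds both results.
qualified = root.findall(NS + 'Result')
titles = [r.findtext(NS + 'Title') for r in qualified]

print(len(plain), len(qualified), titles)
# prints: 0 2 ['First page', 'Second page']
```

The same prefix is needed for the child lookups, which is why findtext in the listing is also called with xmlns + 'Summary' and so on.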

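I've made it available at http://yahookwic.appspot.com/. It would be cool to collect a lot of results and sort them by, say, the frequency of the words just before and after the keyword, but I don't have the energy for that right now. With Japanese you would have to do morphological analysis and so on. I have managed it once before, with some effort.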
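As a rough idea of what that sorting might look like, here is a small sketch. It is not what the app does: the function names (kwic_rows, first_token, sort_by_next_word) and the sample summaries are invented for illustration, and splitting on whitespace stands in for the morphological analysis that Japanese text would actually need.

```python
from collections import Counter

def kwic_rows(summaries, word, width=20):
    """Collect (left, right) context pairs for each fragment containing word."""
    rows = []
    for summary in summaries:
        # Summaries usually separate excerpts with "...", as noted above.
        for part in summary.split('...'):
            if word in part:
                left, right = part.split(word, 1)
                rows.append((left[-width:], right[:width]))
    return rows

def first_token(text):
    """First whitespace-separated token (assumption: a real tokenizer goes here)."""
    tokens = text.split()
    return tokens[0] if tokens else ''

def sort_by_next_word(rows):
    """Order rows so the most frequent following word comes first."""
    counts = Counter(first_token(right) for _, right in rows)
    return sorted(rows, key=lambda row: (-counts[first_token(row[1])],
                                         first_token(row[1])))

summaries = ['red apple pie ... green apple tree',
             'an apple pie recipe',
             'apple juice']
rows = kwic_rows(summaries, 'apple')
ordered = sort_by_next_word(rows)
print([first_token(right) for _, right in ordered])
# prints: ['pie', 'pie', 'juice', 'tree']
```

Sorting by the preceding word would work the same way, using the last token of the left context instead.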
