Sunday, July 15, 2007

Python 與 CSV

許多資料,像通訊錄或試算表之類的,很適合列表呈現。而 comma-separated values, CSV是微軟牌視窗軟體存放表格資料常用的檔案格式。這種純文字的檔案格式是以逗號(comma)來為每筆(record)資料的欄位(field)作分隔。

舉個實際的例子,不久前我因論文需要,由 Davis 那取得了 1999 年美國千大企業的董事會成員資料。內容包括這些董事(directors)的公司、職稱、年齡等等。

由於我只關心每間公司的董事有哪些,所以就輕快地以 Python 語寫了一個 function ,要電腦讀入這個 CSV 檔後,順便吐出各公司的董事們:

def LoadBoards_v0(fn='direct99.csv'):
    """Loads directors of companies from a CSV file and
    returns a dictionary to lookup directors for a company board (version 0).
    Field Names:
        Company name, Director name, Title, Age, Salary, Boards, HQ city, HQ state
    """
    lines = open(fn).readlines()
    field_names = lines[0].split(',')
    records = [dict(zip(field_names, line.split(',')))  for line in lines[1:]]
    boards = {}
    for record in records:
        boards.setdefault(record['Company name'], []).append(record['Director name'])
    return boards

這段 code 只用到 Python 最標準的開檔讀檔 functions 及內定的資料結構,短短幾行就把事情搞定!什麼?這 code 竟然無法正確執行?哎呀,原來董事的 second name 及 first name 間竟然也出現逗號(至少在這個例子中,我們不想把名字拆成兩個欄位)。微軟應付這件事情的方法是把整個 second name, first name 用引號(")括起來。

還好在咒罵完微軟害人要寫煩人的「引號配對碰」程式後,我想起了 Python 也提供了 CSV 模組,於是將程式改寫如下:

def LoadBoards_v1(fn='direct99.csv'):
    """Loads directors of companies from an Excel CSV file and
    returns a dictionary to lookup directors for a company board (version 1).
    Field Names:
        Company name, Director name, Title, Age, Salary, Boards, HQ city, HQ state
    """
    import csv
    reader = csv.reader(file(fn), dialect="excel")
    reader.next() # cast away the field-name tuple
    boards = {}
    for tuple in reader:
        boards.setdefault(tuple[0], []).append(tuple[1])
    return boards

如果有人覺得還要去算欄位順序是一件很蠢的事,可以改採 CSV 的 DictReader:

def LoadBoards_v2(fn='direct99.csv'):
    """Loads directors of companies from an Excel CSV file and
    returns a dictionary to lookup directors for a company board (version 2).
    Field Names:
        Company name, Director name, Title, Age, Salary, Boards, HQ city, HQ state
    """
    import csv
    boards = {}
    for record in csv.DictReader(file(fn), dialect="excel"):
        boards.setdefault(record['Company name'], []).append(record['Director name'])
    return boards
Tags: [] [] []

0 comments: