許多資料,像通訊錄或試算表之類的,很適合列表呈現。而 comma-separated values, CSV是微軟牌視窗軟體存放表格資料常用的檔案格式。這種純文字的檔案格式是以逗號(comma)來為每筆(record)資料的欄位(field)作分隔。
舉個實際的例子,不久前我因論文需要,由 Davis 那取得了 1999 年美國千大企業的董事會成員資料。內容包括這些董事(directors)的公司、職稱、年齡等等。
由於我只關心每間公司的董事有哪些,所以就輕快地以 Python 語寫了一個 function ,要電腦讀入這個 CSV 檔後,順便吐出各公司的董事們:
def LoadBoards_v0(fn='direct99.csv'): """Loads directors of companies from a CSV file and returns a dictionary to lookup directors for a company board (version 0). Field Names: Company name, Director name, Title, Age, Salary, Boards, HQ city, HQ state """ lines = open(fn).readlines() field_names = lines[0].split(',') records = [dict(zip(field_names, line.split(','))) for line in lines[1:]] boards = {} for record in records: boards.setdefault(record['Company name'], []).append(record['Director name']) return boards
這段 code 只用到 Python 最標準的開檔讀檔 functions 及內定的資料結構,短短幾行就把事情搞定!什麼?這 code 竟然無法正確執行?哎呀,原來董事的 second name 及 first name 間竟然也出現逗號(至少在這個例子中,我們不想把名字拆成兩個欄位)。微軟應付這件事情的方法是把整個 second name, first name 用引號(")括起來。
還好在咒罵完微軟害人要寫煩人的「引號配對碰」程式後,我想起了 Python 也提供了 CSV 模組,於是將程式改寫如下:
def LoadBoards_v1(fn='direct99.csv'): """Loads directors of companies from an Excel CSV file and returns a dictionary to lookup directors for a company board (version 1). Field Names: Company name, Director name, Title, Age, Salary, Boards, HQ city, HQ state """ import csv reader = csv.reader(file(fn), dialect="excel") reader.next() # cast away the field-name tuple boards = {} for tuple in reader: boards.setdefault(tuple[0], []).append(tuple[1]) return boards
如果有人覺得還要去算欄位順序是一件很蠢的事,可以改採 CSV 的 DictReader:
def LoadBoards_v2(fn='direct99.csv'): """Loads directors of companies from an Excel CSV file and returns a dictionary to lookup directors for a company board (version 2). Field Names: Company name, Director name, Title, Age, Salary, Boards, HQ city, HQ state """ import csv boards = {} for record in csv.DictReader(file(fn), dialect="excel"): boards.setdefault(record['Company name'], []).append(record['Director name']) return boards
0 comments:
Post a Comment