Software Cinema: 2007/2/1

SPARS-J

SPARS-J is a web-based software component search engine built by the SPARS project at the Inoue Laboratory (Software Engineering Laboratory, Osaka University) in Japan. For source code it currently supports only Java, plus partial document search (according to paper [1]). Actually using the search engine shows that it also offers XML and JSP search.

The basic idea of SPARS-J is to weight components with the component rank model proposed by the authors, backed by a keyword index built for each component through text-based token analysis. The whole process is divided into two phases.

Phase 1 is the source input phase: source supplied by a source provider is automatically analyzed, component-ranked, and stored. Phase 2 is the search phase: the user searches by keyword, and the results are split into several component groups ordered by component rank. See [1] for details.
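To make the two phases a bit more concrete, here is a minimal sketch. It assumes a PageRank-style propagation of weight along "uses" relations, which is my own simplification and not the exact formulation in [1]; the function names, the damping value, and the toy data are all hypothetical.

```python
def component_rank(uses, damping=0.85, iterations=100):
    """Phase 1 (sketch): PageRank-style weights over a 'uses' graph.

    uses maps a component to the list of components it uses.
    This is an illustrative stand-in, not the model defined in [1].
    """
    nodes = sorted(set(uses) | {t for targets in uses.values() for t in targets})
    n = len(nodes)
    rank = {c: 1.0 / n for c in nodes}
    for _ in range(iterations):
        nxt = {c: (1.0 - damping) / n for c in nodes}
        for c in nodes:
            targets = uses.get(c, [])
            if targets:
                # a component passes its weight on to what it uses
                share = damping * rank[c] / len(targets)
                for t in targets:
                    nxt[t] += share
            else:
                # a component that uses nothing spreads its weight evenly
                for t in nodes:
                    nxt[t] += damping * rank[c] / n
        rank = nxt
    return rank


def search(index, rank, keyword):
    """Phase 2 (sketch): look up a keyword, order hits by rank."""
    hits = index.get(keyword, [])
    return sorted(hits, key=lambda c: rank[c], reverse=True)


# Hypothetical toy data: App uses Util and Parser, Parser uses Util.
uses = {"App": ["Util", "Parser"], "Parser": ["Util"], "Util": []}
index = {"qsort": ["App", "Util"]}
rank = component_rank(uses)
print(search(index, rank, "qsort"))
```

In this toy graph the heavily used Util ends up with the largest weight, so it is listed before App for the same keyword.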

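A component group consists of several similar components. In most cases the components in a group are different versions of the same software source package; for example, searching for "qsort" "string" shows different versions of Util.java from Eclipse collected in one group. In the figure below, the search returns 13 groups (green), the first group has rank order 1 (blue), and the same file actually appears in different versions (red).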

How component similarity is decided depends on SPARS-J's built-in similarity comparison algorithm, which can be adjusted. At present it appears to use the classification method described in [2].

[1] also compares its performance against Google and Namazu; unfortunately there is no comparison with Google Code.

References

[1] K. Inoue, R. Yokomori, T. Yamamoto, M. Matsushita, and S. Kusumoto, "Ranking Significance of Software Components Based on Use Relations," IEEE Transactions on Software Engineering, Vol. 31, No. 3, pp. 213-225, March 2005

[2] K. Kobori, T. Yamamoto, M. Matsushita, and K. Inoue, "Classification of Java Programs in SPARS-J," Proceedings of International Workshop on Community-Driven Evolution of Knowledge Artifacts, 2003

NeoWORX

I just came across NeoWORX, a provider of blog visit-counting services. The services it offers are interesting enough, but what caught my interest even more was the few words under its title:

Quality tools for a world of blogs

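Strictly speaking, the four services it offers do not really live up to the description "quality tools", but it is still an interesting idea. How do we measure the quality of a blog? And before asking that, a premise has to be settled: what is the quality of a blog, and does a blog need quality at all?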

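There are already papers on website quality, but the quality they define mostly has no direct counterpart for blogs. A blog, after all, differs from an ordinary website: it is closer to an expression of personality, so it has a larger element of art than an ordinary website and is that much harder to measure accurately.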

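NeoWORX's tools are one angle, but they end up using the surface signs of browsing behavior as the quality indication. More visits may hint that a blog's quality is better, but I am not comfortable with such an indirect inference, just as we cannot claim that an open source project is successful purely from the number of times it has been downloaded on S.F.net. Judging a blog's quality should take more attributes.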

When I have time I should think this through with the GQM (Goal-Question-Metric) method.

Concept Explorer

Concept Explorer was the first concept analysis tool I used. I picked it up out of interest after a senior labmate presented a concept analysis paper [1] at a lab group meeting. Later I also used it to do some categorization of the open source software on sourceforge.net (S.F.net).

As I recall, I went to S.F.net, grabbed each project's summary, and ran concept analysis on the values in it, that is (taking the OS attribute values as an example):

(Object, {Attributes}) = (Software, {OS1, OS2, ... OSn})
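As a rough illustration of what concept analysis does with such (Object, {Attributes}) pairs, the sketch below enumerates formal concepts by brute force. The project names and OS values are hypothetical, and the enumeration is exponential in the number of objects, so it is only suitable for tiny examples and is not how Concept Explorer itself works internally.

```python
from itertools import combinations


def concepts(context):
    """Enumerate formal concepts of a small context by brute force.

    context maps each object to the set of attributes it has.
    Returns a set of (extent, intent) pairs as frozensets.
    """
    objects = sorted(context)
    all_attrs = set().union(*context.values())
    found = set()
    for r in range(len(objects) + 1):
        for group in combinations(objects, r):
            # intent: attributes shared by every object in the group
            intent = set(all_attrs)
            for o in group:
                intent &= context[o]
            # extent: every object that has all attributes of the intent
            extent = frozenset(o for o in objects if intent <= context[o])
            found.add((extent, frozenset(intent)))
    return found


# Hypothetical sample: software projects and the OSes they support.
context = {
    "ProjA": {"Linux", "Windows"},
    "ProjB": {"Linux"},
    "ProjC": {"Linux", "FreeBSD"},
}
for extent, intent in sorted(concepts(context), key=lambda c: len(c[0])):
    print(sorted(extent), sorted(intent))
```

Each printed pair is one node of the concept lattice: a set of projects together with exactly the OSes they all share.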

Below is the resulting concept matrix (version 1.2):

Of course the matrix above was not produced by hand (that would kill me). Concept Explorer reads its own cxt file format, which is quite simple:

B

Number of Rows (Objects)
Number of Columns (Attributes)

List of Row Entities (Objects)
List of Column Entities (Attributes)
Concept Matrix ( . = false, x = true)
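The layout above is simple enough to generate from a script. The following is a minimal sketch that follows the description in this post line by line (a "B" header, a blank line, the two counts, a blank line, the names, then one matrix row per object); the function name and sample data are hypothetical, and it is worth comparing the output against a file saved by Concept Explorer itself, for instance to confirm the casing of the "x" marker.

```python
def to_cxt(context, attributes):
    """Render a context as cxt text, following the layout described above.

    context maps each object to the set of attributes it has;
    attributes fixes the column order.
    """
    objects = sorted(context)
    lines = ["B", "", str(len(objects)), str(len(attributes)), ""]
    lines += objects       # row entities
    lines += attributes    # column entities
    for o in objects:
        # one character per column: x = true, . = false
        lines.append("".join("x" if a in context[o] else "." for a in attributes))
    return "\n".join(lines) + "\n"


# Hypothetical sample data.
sample = {"ProjA": {"Linux"}, "ProjB": {"Linux", "Windows"}}
print(to_cxt(sample, ["Linux", "Windows"]))
```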

So it is easy to convert from all kinds of data, or to save as XML. The format above is for version 1.2; the current version seems to be 1.3, but it should still apply. The resulting concept lattice looks roughly like this:

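When a particular node is selected, the unrelated parts of the lattice are hidden. Note that FreeBSD is at 0%; this is probably due to the software I sampled at the time, but I have forgotten how I sampled to get this concept lattice. It was two years ago, after all :p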

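Concept Explorer itself is fairly easy to use, but it has fewer features, its visualization is somewhat weaker, and its flexibility is lower. I have seen some conference papers use it as their analysis tool. I later came across fancier, more powerful concept analysis tools such as Galicia and ToscanaJ, but have not tried them in depth; I will write about them separately once I have tried them in a week or two.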

References

[1] Paolo Tonella, "Using a Concept Lattice of Decomposition Slices for Program Understanding and Impact Analysis", IEEE Transactions on Software Engineering, Vol. 29, No. 6, June 2003

Bunch Tool and REportal

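Bunch Tool [1] is a reverse engineering tool built by the Software Engineering Research Group (SERG), and is available for download. Its basic principle is graph-based source transformation: starting from a module dependency graph (MDG), it applies a chosen clustering algorithm to cluster the graph at different levels. It computes an MQ value for each candidate clustering based on coupling / cohesion and picks the best clustering as its best guess at the software architecture. Taking the figure below as an example, the MDG on the left has 4 nodes C1, C2, C3, and C4. The dependencies between them are as shown, each annotated with a weight; the calculation gives an MQ value for every grouping, and the grouping with the highest MQ is chosen. See [1] for the details. Note that the result is a partial view, not a complete view of the architecture.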
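The MQ-based selection can be sketched for a four-node graph like the one described. The code assumes the cluster-factor form of MQ as I understand it from [1] (each cluster contributes 2μ / (2μ + ε), where μ is its internal edge weight and ε the weight of edges crossing its boundary, and contributes 0 when μ = 0); treat that formula as an assumption and check it against the paper. The edge weights are made up, since the figure's actual values are not given here, and the function names are hypothetical.

```python
def mq(edges, clusters):
    """MQ of one partition (assumed cluster-factor form, see lead-in).

    edges maps (src, dst) to a weight; clusters is a list of sets of node ids.
    """
    where = {n: i for i, c in enumerate(clusters) for n in c}
    intra = [0.0] * len(clusters)
    inter = [0.0] * len(clusters)
    for (src, dst), w in edges.items():
        a, b = where[src], where[dst]
        if a == b:
            intra[a] += w      # cohesion: edge stays inside the cluster
        else:
            inter[a] += w      # coupling: edge counted for both endpoints
            inter[b] += w
    return sum(2 * mu / (2 * mu + eps) for mu, eps in zip(intra, inter) if mu > 0)


def partitions(nodes):
    """Yield every way to split nodes into non-empty clusters."""
    if not nodes:
        yield []
        return
    first, rest = nodes[0], nodes[1:]
    for part in partitions(rest):
        for i in range(len(part)):
            yield part[:i] + [part[i] | {first}] + part[i + 1:]
        yield part + [{first}]


def best_clustering(edges, nodes):
    """Exhaustively pick the partition with the highest MQ."""
    return max(partitions(nodes), key=lambda p: mq(edges, p))


# Hypothetical weights for the four-node example.
edges = {("C1", "C4"): 3, ("C4", "C1"): 1, ("C2", "C3"): 2, ("C1", "C2"): 1}
best = best_clustering(edges, ["C1", "C2", "C3", "C4"])
print(sorted(sorted(c) for c in best), round(mq(edges, best), 3))
```

Exhaustive enumeration only works for toy graphs: the number of partitions explodes with the node count, so a real tool has to search the space heuristically instead.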

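In use, Bunch Tool provides a Swing-based GUI and is fairly easy to use. Without understanding the principle or the algorithm you may not know how to tune the results, but this basically does not get in the way. A drawback is that it does not (perhaps cannot) integrate front-end source code analysis components, so you have to find those yourself. SERG also has a web-based tool that offers a reverse engineering service built on Bunch Tool, called REportal [2]. According to its web page it currently supports only Java-based source code analysis, but I quietly tried it and C++ works too. REportal has a decompiler, code analyzer, Bunch Tool, Graphviz dot viewer and so on built in; in other words it is a complete Bunch Tool environment. If you cannot be bothered to find the related tools yourself, you can just use REportal, and the analysis results can be packaged for download.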

For front-end code analysis and module dependency graph generation to go with Bunch Tool, I have tried the following two tools so far:

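depgraph : for Python
cinclude2dot : for C / C++. It is easy to operate and produces .dot files. Since the .dot output is very close to the .mdg format, converting it to an mdg file is easy, whether with a small program or by hand (I just did some simple manual editing / replace), and the downloaded Bunch Tool can then analyze it. With this plus Bunch Tool I successfully analyzed FileZilla.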
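The .dot to .mdg conversion mentioned above could be scripted along these lines. This is a sketch under two assumptions that should be checked against real files: that the .dot output has one `"a" -> "b"` edge per line, and that an MDG file is plain text with one `source target` pair per line. The function name and sample input are hypothetical.

```python
import re

# Matches one edge per line, with or without quotes, ignoring trailing
# attributes such as [color=red] and an optional semicolon.
EDGE = re.compile(r'^\s*"?([^"\s]+)"?\s*->\s*"?([^"\s;\[]+)"?')


def dot_to_mdg(dot_text):
    """Convert simple one-edge-per-line dot text to 'source target' lines."""
    lines = []
    for line in dot_text.splitlines():
        m = EDGE.match(line)
        if m:
            lines.append(f"{m.group(1)} {m.group(2)}")
    return "\n".join(lines) + "\n"


# Hypothetical sample input.
sample = '''digraph "source tree" {
    "main.cpp" -> "util.h"
    "util.h" -> "config.h" [color=red];
}'''
print(dot_to_mdg(sample))
```

Node names containing spaces, or several edges chained on one line, are not handled; for anything beyond simple include graphs a proper dot parser would be the safer choice.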

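For Java I simply used REportal. The MDG format is actually quite simple: if you have some other format that can be turned into a graph, writing a small converter yourself is not hard, so many code analysis tools can be hooked up to Bunch Tool. The real question is whether analyzing with Bunch Tool is meaningful.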

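The back-end visualization is a partitioned MDG graph in dot format, so any viewer that reads dot should do, such as the dot viewer bundled with Graphviz. It does not work very well for large programs, though. Strangely, I could not find any other good dot visualization tools. Below is part of the analysis result for jChecs; the original image is too large. Software really is a complicated thing.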

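Another drawback of Bunch Tool is that no source is provided. In theory its clustering algorithms are easy to extend, but you have to decompile the bytecode to do so, and there are license considerations as well, so modifying it yourself is not easy. Not long ago, for research needs, I quietly did a bit of hacking and found that the algorithm strategy architecture design inside Bunch Tool is somewhat messy. Comparing their earlier conference papers with the journal paper, this is probably because different maintainers modified it along the way; in any case it is not easy to trace.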

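Bunch Tool also lets you specify which nodes must be in the same cluster. Its documentation mentions this but does not say what the format of the configuration file should be. I worked backwards from the source code and found one syntax that works; I am not sure it is the only one, but it works :p. It looks like this: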

1.(Cluster1)=C1,C4
2.(Cluster2)=C2,C3

The leading 1. and 2. are required; they indicate two separate equations. C1, C2, C3, and C4 are all node ids.
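If the groupings come from a script, the file can be generated rather than typed. This small sketch only reproduces the syntax shown above, which was itself found by trial and error, so it carries the same caveat; the function name is hypothetical.

```python
def cluster_config(groups):
    """Render (cluster name, node ids) pairs in the syntax shown above.

    Lines are numbered from 1 because the leading index is required.
    """
    return "\n".join(
        f"{i}.({name})={','.join(nodes)}"
        for i, (name, nodes) in enumerate(groups, start=1)
    ) + "\n"


print(cluster_config([("Cluster1", ["C1", "C4"]), ("Cluster2", ["C2", "C3"])]))
```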

That is about what I know of this tool for now. Since I have a topic that will use it, I expect to keep digging, and perhaps even reimplement it.

References

[1] B. S. Mitchell and S. Mancoridis, "On the Automatic Modularization of Software Systems Using the Bunch Tool", IEEE Transactions on Software Engineering, Vol. 32, No. 3, March 2006

[2] S. Mancoridis, T. S. Souder, Y-F. Chen, E. R. Gansner, J. L. Korn, "REportal: A Web-based Portal Site for Reverse Engineering," IEEE Proceedings of the 2001 Working Conference on Reverse Engineering (WCRE'01), pp. 221-230, October 2001

Is Computer Science Science ?

(Let me open this blog with an essay I wrote a while ago.)

Reading Denning's article "Is Computer Science Science?" [1] did not actually make me very comfortable, mainly because my view of Science differs from that of the author, or of the author's imaginary interlocutor.

Although to some extent I agree that

Science deals with fundamental laws of nature. [1]

this is not really suitable as an explanation of "What is Science".

To me, Science is discovering the kernel of nature, or, put another way, the mystery of nature. (Rather than "kernel", what I really want to say is "本質" (roughly, the essential nature of a thing), but I do not know which word to use; esse?)

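Of course, defining it this way is rather irresponsible, because so far nobody knows what the essence of nature is, or might be, so there is no way to truly determine what kind of thing or activity is Science. But I think Science is a word that only has meaning for humans. So while the essence of nature is unknown, if people hold that some activity or thing is moving toward discovering nature, then under my definition it is Science. Obviously, I consider Physics a Science, Chemistry a Science, Philosophy a Science. But Computer Science? YES, and NO.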

If Computer Science means things related to computers or computing, then my answer is NO. For those I would rather use names like Computer Technology, Information Technology, Computer Industry, or Information Industry.

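Setting aside the interpretive limits that the word "Computer" places on the term "Computer Science", I think the root of Computer Science lies in the Signal, and Computer Science works toward explaining, finding, or modeling the rules of How Signals Cooperate. To some extent I think Computer Science is built on many existing Sciences, yet approaches the mystery of nature faster than they do. The hardware and software that appear along the way are only by-products, or tools; what truly makes Computer Science worth calling a Science is the real purpose hidden behind it.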

Computer Science is not Science? Computer Science has not even started yet.

The above is admittedly a bit of a ramble, with a touch of what Wilson calls the "Ionian Enchantment" [2]. Back to the article: I think its approach is to use what is already accepted as Science, and what stands opposed to Science, to support the claim that Computer Science is Science, just as its subtitle says: Computer science meets every criterion for being a science, but it has a self-inflicted credibility problem. [1]

That is a very practical tactic, but I do not think it really explains why Computer Science must be a Science. What does it mean if Computer Science is a Science? What does it mean if it is not? Does it matter whether Computer Science is a science or not?

The article draws its conclusion from certain points of view, but I don't actually buy it.

References

[1] Peter J. Denning, "Is Computer Science Science?" Communications of the ACM, Vol. 48, No. 4, April 2005
[2] Edward O. Wilson, Consilience: The Unity of Knowledge, Vintage, 1999
