Blaž Fortuna

Archive 2010/2011

Knowledge management and semantic web

 

Task: Using OntoGen, develop one ontology of around 20 to 40 concepts from each of the two files (.desc.lndoc and .full.lndoc) in the dataset assigned to you. Analyse the differences between the two resulting ontologies. Use additional stop-words where needed.

Present the results in a 5-10 page report and a 5-10 slide presentation (both in English).


Material:

Astronomy – Luka Bradesko, Matevz Vucnik

Physics – Ales Jurca, Mario Karlovcec

Own data – Janez Kranjc, Jasmina Smailovic


Deadlines:

Submit written report and ontology by 2-2-2010

Presentation and oral exam on 9-2-2010+

 

 

Text, web and multimedia mining

 

Task: Generate a bag-of-words file from the provided news articles using the Txt2Bow utility. Analyse how stop-word removal, stemming and the length of n-grams influence the resulting vectors. To do so, work with the following parameters:

  • -stopword:none (no stop-word removal)
  • -stopword:en523 (use pre-defined list of 523 stop-words)
  • -stemmer:none (no stemming)
  • -stemmer:porter (stemming using Porter stemmer)
  • -ngramlen:1 (no n-grams)
  • -ngramlen:5 (n-grams of length 5)
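
To get a feeling for what these switches change, the following minimal Python sketch builds bag-of-words counts by hand. It does not reproduce how Txt2Bow works internally: the tiny stop-word list, the function name `bag_of_words` and the reading of n-gram length as "all n-grams up to that length" are assumptions made for illustration only. Stemming is left out because the Porter stemmer is not part of the Python standard library.

```python
from collections import Counter
import re

# Tiny illustrative stop-word list (an assumption; NOT the en523 list).
STOP_WORDS = {"the", "a", "of", "in", "and", "to", "is"}

def bag_of_words(text, remove_stop_words=False, max_ngram_len=1):
    """Count all n-grams of length 1..max_ngram_len in the text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    counts = Counter()
    for n in range(1, max_ngram_len + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts

text = "The olympic games opened in the city of Vancouver"
plain = bag_of_words(text)
filtered = bag_of_words(text, remove_stop_words=True)
bigrams = bag_of_words(text, remove_stop_words=True, max_ngram_len=2)

# Number of distinct features (vector dimensions) in each setting.
print(len(plain), len(filtered), len(bigrams))
```

In this toy setting, stop-word removal shrinks the vocabulary while longer n-grams enlarge it; the report should check whether the same trend holds for the real Txt2Bow output.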

 

Perform k-means clustering for two different values of k using the BowKMeans utility and analyse the results.
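
As a rough illustration of what such a comparison involves, here is a small pure-Python sketch of generic k-means over sparse bag-of-words vectors with cosine similarity. The algorithm used inside BowKMeans is not described here, so the similarity measure, the random initialisation and the four toy documents are all assumptions.

```python
import math
import random
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def centroid(vectors):
    """Mean of a list of sparse vectors."""
    total = Counter()
    for v in vectors:
        total.update(v)  # Counter.update adds the counts
    return {t: w / len(vectors) for t, w in total.items()}

def kmeans(docs, k, iterations=10, seed=0):
    """Assign each document to the most similar of k centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(docs, k)  # start from k randomly chosen documents
    assignment = [0] * len(docs)
    for _ in range(iterations):
        assignment = [
            max(range(k), key=lambda c: cosine(d, centroids[c])) for d in docs
        ]
        for c in range(k):
            members = [d for d, a in zip(docs, assignment) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = centroid(members)
    return assignment

# Toy documents (hypothetical, not taken from the news dataset).
docs = [Counter(s.split()) for s in [
    "olympic games medal", "olympic medal skiing",
    "stock market shares", "market shares bank",
]]
for k in (2, 3):
    print(k, kmeans(docs, k))
```

Comparing the assignments for the two values of k shows which groups stay together and which are split when more clusters are allowed.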

 

Perform classification using the BowTrainBinSVM and BowClassify utilities for two frequent and two rare categories. For each selected category, find an article on the internet that is classified as positive.

 

Present the results in a 5-10 page report and a 5-10 slide presentation (both in English).

 

Example from lectures:

>Txt2Bow.exe -inlndoc:news.txt -o:news.bow -stopword:none -stemmer:none -ngramlen:1
>BowKMeans.exe -i:news.bow -clusts:5
>BowTrainBinSVM.exe -i:news.bow -o:news.bowmd -cat:GSPO
>BowClassify.exe -ibow:news.bow -imd:news.bowmd -qs:"olympic games"
>BowClassify.exe -ibow:news.bow -imd:news.bowmd -qh:article1.txt
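
Since the task asks for every combination of the stop-word, stemmer and n-gram settings, it can help to generate the eight Txt2Bow command lines rather than type them by hand. The sketch below only builds the command strings from the switches shown above and does not run anything; the output file naming pattern and the function name `txt2bow_commands` are hypothetical.

```python
from itertools import product

# Parameter values taken from the task description.
STOPWORDS = ["none", "en523"]
STEMMERS = ["none", "porter"]
NGRAM_LENGTHS = [1, 5]

def txt2bow_commands(input_file="news.txt"):
    """Build one Txt2Bow command line per parameter combination."""
    commands = []
    for sw, st, n in product(STOPWORDS, STEMMERS, NGRAM_LENGTHS):
        # Output file name pattern is hypothetical; pick any naming you like.
        out = f"news_{sw}_{st}_{n}.bow"
        commands.append(
            f"Txt2Bow.exe -inlndoc:{input_file} -o:{out} "
            f"-stopword:{sw} -stemmer:{st} -ngramlen:{n}"
        )
    return commands

commands = txt2bow_commands()
for c in commands:
    print(c)
```

The printed lines can be pasted into a command prompt or saved as a batch file, giving one .bow file per setting to compare in the report.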

Material:

 

Deadlines:

Submit written report on 9-03-2010
Presentations on 16-03-2011