
Blaž Fortuna

ICT2 - Data Mining and Knowledge Discovery
 
 
Task: Same as ICT3.
 

ICT3 - Module Knowledge Technologies - Knowledge management and semantic web
 
 
Task: Using OntoGen, develop one ontology of around 20 to 40 concepts from each of the two files (.desc.lndoc and .full.lndoc) in the dataset assigned to you. Analyse the differences between the two resulting ontologies. Use additional stop-words when needed.
Present the results in a 5-10 page report and a 5-10 slide presentation (both in English).
 

Material:

Astronomy – Božidara Cvetković, Hristjan Gjoreski

Physics – Janez Starc, Jasna Škrbec

Chemistry – Jovan Tanevski, Nikola Simidjievski

Own data – Anže Vavpetič, Nejc Trdin, Alexandra Moraru, Tomaž Kompara

 

Deadlines:
Submit written report and ontology by 31st March 2012
Presentation and oral exam on 11th April 2012
 

ICT3 - Module Knowledge Technologies - Text, web and multimedia mining
 
 
Task: Using the provided data, perform the following operations:
  • Generate a bag-of-words file using the Txt2Bow utility.
  • Perform k-means clustering for two different values of k using the BowKMeans utility and analyse the results.
  • Perform classification using the BowTrainBinSVM and BowClassify utilities for two frequent and two rare categories.
  • For each of the selected categories, find an article on the internet that is classified as positive.
  • Extract a list of entities (people, locations, organizations) from news articles.
  • Generate a text corpus in which each document corresponds to one entity. The content of a document is assembled by concatenating all the sentences (or paragraphs) in which the entity occurs.
  • Train a classifier that predicts the type of an entity (e.g. person, politician).
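
The corpus-building step above can be sketched in Java as follows. This is only a minimal illustration, not a prescribed solution: the `EntityCorpus` class and the `Mention` type are hypothetical names introduced here, and the example sentences are made up. It assumes that entity mentions have already been extracted together with the sentence in which each one occurs.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class EntityCorpus {

    /** Hypothetical helper type: one entity mention and the sentence it occurs in. */
    static final class Mention {
        final String entity;
        final String sentence;

        Mention(String entity, String sentence) {
            this.entity = entity;
            this.sentence = sentence;
        }
    }

    /**
     * Builds one pseudo-document per entity by concatenating, in order of first
     * appearance, all distinct sentences in which the entity is mentioned.
     */
    static Map<String, String> build(List<Mention> mentions) {
        Map<String, Set<String>> sentencesByEntity = new LinkedHashMap<>();
        for (Mention m : mentions) {
            // LinkedHashSet keeps insertion order and drops repeated sentences
            sentencesByEntity
                .computeIfAbsent(m.entity, k -> new LinkedHashSet<>())
                .add(m.sentence);
        }
        Map<String, String> corpus = new LinkedHashMap<>();
        for (Map.Entry<String, Set<String>> e : sentencesByEntity.entrySet()) {
            corpus.put(e.getKey(), String.join(" ", e.getValue()));
        }
        return corpus;
    }

    public static void main(String[] args) {
        // Made-up mentions, for illustration only
        List<Mention> mentions = new ArrayList<>();
        mentions.add(new Mention("Tiger Woods", "Tiger Woods shot a six-under 66."));
        mentions.add(new Mention("Melbourne", "The tournament was played in Melbourne."));
        mentions.add(new Mention("Tiger Woods", "Fans followed Tiger Woods all day."));

        // One line per entity: name, tab, concatenated sentences
        for (Map.Entry<String, String> e : build(mentions).entrySet()) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

Each resulting entry can then be written out as one document of the corpus. Whether to concatenate sentences or whole paragraphs, and whether to drop repeated sentences as done here, is a design choice left to the task.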

 

Text Garden example from lectures:

> Txt2Bow.exe -inlndoc:news.txt -o:news.bow -stopword:none -stemmer:none -ngramlen:1
> BowKMeans.exe -i:news.bow -clusts:5
> BowTrainBinSVM.exe -i:news.bow -o:news.bowmd -cat:GSPO
> BowClassify.exe -ibow:news.bow -imd:news.bowmd -qs:"olympic games"
> BowClassify.exe -ibow:news.bow -imd:news.bowmd -qh:article1.txt
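
To make the bag-of-words idea behind the first command more concrete, the following Java sketch counts unigrams in a text without stop-word removal or stemming, roughly matching the options passed to Txt2Bow above. It is not Text Garden's implementation: the `BagOfWords` class is a hypothetical name, and the lowercasing and tokenization rules are assumptions made for this illustration.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

class BagOfWords {

    /**
     * Counts unigram tokens in the text. Assumed rules: the text is lowercased,
     * and a token is a maximal run of letters or digits.
     */
    static Map<String, Integer> count(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (token.isEmpty()) {
                continue; // split can produce a leading empty string
            }
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // prints {flame=1, games=1, olympic=2, the=1}
        System.out.println(count("Olympic games: the Olympic flame"));
    }
}
```

A real bag-of-words file additionally stores a shared vocabulary and usually TF-IDF weights per document; this sketch stops at raw term counts for a single text.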

Enrycher example from lectures:

public static void main(String[] args) throws Exception {
  // input document
  String docString = "Tiger Woods emerged from a traffic jam of his " +
    "own making to thrill thousands of fans with a six-under 66 at " +
    "the $1.4 million Australian Masters on Thursday.";
  // URL of Enrycher web service
  URL pipelineUrl = new URL("http://enrycher.ijs.si/run");
  // convert input document to input stream
  InputStream docStream = new ByteArrayInputStream(docString.getBytes());
  // call Enrycher web service
  Document doc = EnrycherWebExecuter.processSync(pipelineUrl, docStream);
  // iterate over all the annotations
  for (Annotation ann : doc.getAnnotations()) {
    // list all persons
    if (ann.isPerson()) {
      System.out.println("Person: " + ann.getDisplayName());
      // get sentences in which it occurs
      for (Instance inst : ann.getInstances()) {
        int sentenceId = inst.getSentenceId(0);
        Paragraph paragraph = doc.getParagraph(sentenceId);
        Sentence sentence = paragraph.getSentence(sentenceId);
        System.out.println(inst.getDisplayName() + ": " + sentence.getPlainText());
      }

      // list all attributes
      for (Attribute attr : ann.getAttributes()) {
        if (attr.isLiteral()) {
          System.out.println(attr.getType() + " : " + attr.getLiteral());
        } else if (attr.isResource()){
          System.out.println(attr.getType() + " : " + attr.getResource());
        }
      }
    }
  }
}

 

Material:

 

Deadlines:

Submit written report on 8th of April 2012
Presentations on 18th of April 2012