ICT2 - Data Mining and Knowledge Discovery
Task: Same as ICT3.
ICT3 - Module Knowledge Technologies - Knowledge management and semantic web
Task: Using OntoGen, develop one ontology of around 20 to 40 concepts from each of the two files (.desc.lndoc and .full.lndoc) in the dataset assigned to you. Analyse the differences between the two resulting ontologies. Use additional stop-words when needed.
Present the results in a 5-10 page report and a 5-10 slide presentation (both in English).
Material:
Astronomy – Božidara Cvetković, Hristjan Gjoreski
Physics – Janez Starc, Jasna Škrbec
Chemistry – Jovan Tanevski, Nikola Simidjievski
Own data – Anže Vavpetič, Nejc Trdin, Alexandra Moraru, Tomaž Kompara
Deadlines:
Submit written report and ontology by 31st March 2012
Presentation and oral exam on 11th April 2012
ICT3 - Module Knowledge Technologies - Text, web and multimedia mining
Task: Using the provided data, perform the following operations:
- Generate a bag-of-words file using the Txt2Bow utility.
- Perform k-means clustering for two different values of k using the BowKMeans utility and analyse the results.
- Perform classification using the BowTrainBinSVM and BowClassify utilities for two frequent and two rare categories.
- For each of the selected categories, find an article on the internet that is classified as positive.
- Extract a list of entities (people, locations, organizations) from news articles.
- Generate a text corpus with one document per entity. The content of each document is assembled by concatenating all the sentences (or paragraphs) in which the entity occurs.
- Train a classifier that can predict the type of an entity (e.g. person, politician).
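The corpus-generation step in the list above can be sketched as follows. This is only a minimal illustration, not a prescribed solution: the class name EntityCorpus, the method build and the sample sentences are made up for the example, and entities are matched by plain substring search rather than by the annotations an entity extractor would return.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: one pseudo-document per entity.
class EntityCorpus {

    // Concatenates, for each entity, every sentence that mentions it.
    // Assumption: a case-sensitive substring match stands in for the
    // entity occurrences reported by a real annotator.
    static Map<String, String> build(List<String> sentences, List<String> entities) {
        Map<String, StringBuilder> parts = new LinkedHashMap<>();
        for (String entity : entities) {
            parts.put(entity, new StringBuilder());
        }
        for (String sentence : sentences) {
            for (String entity : parts.keySet()) {
                if (sentence.contains(entity)) {
                    StringBuilder sb = parts.get(entity);
                    if (sb.length() > 0) {
                        sb.append(' ');
                    }
                    sb.append(sentence);
                }
            }
        }
        Map<String, String> corpus = new LinkedHashMap<>();
        for (Map.Entry<String, StringBuilder> e : parts.entrySet()) {
            corpus.put(e.getKey(), e.getValue().toString());
        }
        return corpus;
    }

    public static void main(String[] args) {
        // Made-up sample sentences, for illustration only.
        List<String> sentences = Arrays.asList(
            "Tiger Woods shot a six-under 66 at the Australian Masters.",
            "The Australian Masters is played in Melbourne.",
            "Fans followed Tiger Woods around the course.");
        List<String> entities = Arrays.asList("Tiger Woods", "Australian Masters");
        // Print one entity document per line: name, tab, concatenated sentences.
        build(sentences, entities).forEach(
            (entity, text) -> System.out.println(entity + "\t" + text));
    }
}
```

Printing one entity per line is just a convenient choice for the sketch; check which input format the bag-of-words tool expects before feeding it the result.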
Text Garden example from lectures:
> Txt2Bow.exe -inlndoc:news.txt -o:news.bow -stopword:none -stemmer:none -ngramlen:1
> BowKMeans.exe -i:news.bow -clusts:5
> BowTrainBinSVM.exe -i:news.bow -o:news.bowmd -cat:GSPO
> BowClassify.exe -ibow:news.bow -imd:news.bowmd -qs:"olympic games"
> BowClassify.exe -ibow:news.bow -imd:news.bowmd -qh:article1.txt
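Judging by the flag names, the Txt2Bow call above asks for single-word terms with no stop-word removal and no stemming. The sketch below shows what such a unigram bag-of-words looks like for one piece of text. It is not Txt2Bow's implementation: the class name UnigramBow, the tokenisation rule and the lower-casing are assumptions made for the illustration.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of a unigram bag-of-words.
class UnigramBow {

    // Counts lower-cased word tokens; no stop-word removal, no stemming.
    // Assumption: a token is a maximal run of letters or digits.
    static Map<String, Integer> count(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        String[] tokens = text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+");
        for (String token : tokens) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // prints {begin=1, games=2, olympic=1, the=1}
        System.out.println(count("Olympic games: the games begin."));
    }
}
```

Note that "the" is kept as a term here; with a stop-word list enabled it would typically be dropped before counting.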
Enrycher example from lectures:
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.net.URL;
// plus the Enrycher client classes (EnrycherWebExecuter, Document, Annotation, ...)

public static void main(String[] args) throws Exception {
    // input document
    String docString = "Tiger Woods emerged from a traffic jam of his "
        + "own making to thrill thousands of fans with a six-under 66 at "
        + "the $1.4 million Australian Masters on Thursday.";
    // URL of Enrycher web service
    URL pipelineUrl = new URL("http://enrycher.ijs.si/run");
    // convert input document to input stream
    InputStream docStream = new ByteArrayInputStream(docString.getBytes());
    // call Enrycher web service
    Document doc = EnrycherWebExecuter.processSync(pipelineUrl, docStream);
    // iterate over all the annotations
    for (Annotation ann : doc.getAnnotations()) {
        // list all persons
        if (ann.isPerson()) {
            System.out.println("Person: " + ann.getDisplayName());
            // get sentences in which it occurs
            for (Instance inst : ann.getInstances()) {
                int sentenceId = inst.getSentenceId(0);
                Paragraph paragraph = doc.getParagraph(sentenceId);
                Sentence sentence = paragraph.getSentence(sentenceId);
                System.out.println(inst.getDisplayName() + ": " + sentence.getPlainText());
            }
            // list all attributes
            for (Attribute attr : ann.getAttributes()) {
                if (attr.isLiteral()) {
                    System.out.println(attr.getType() + " : " + attr.getLiteral());
                } else if (attr.isResource()) {
                    System.out.println(attr.getType() + " : " + attr.getResource());
                }
            }
        }
    }
}
Material:
Deadlines:
Submit written report on 8th of April 2012
Presentations on 18th of April 2012