Project

·         Students will work in teams of two on each project. Each project will be presented in class at the end of the quarter.

·         The project submission and presentation schedule is in the syllabus.

 

·         Alternative project: With the instructor’s approval, students may work on another project with different tasks and competition data. Other data mining competitions can be found on the web. If the competition is still running, you can submit your forecast to its organizers.



 

Part 1

 

·         Find two data mining software packages on the web (freeware and commercial).

·         Learn the capabilities of each package.

·         Fill in the table below for each package, following the example given for Tanagra.

·         Write a Word file Part1.docs.

·         Post your report in the directory "Project" on your class account (U drive).

 

Table columns:

Software, web link, type (e.g., freeware, commercial)
List of functional capabilities (e.g., data entry, editing, classification, clustering)
Names of the classes of algorithms implemented (e.g., decision tree, neural networks, k-nearest neighbors, linear discrimination)
Brief description of the ideas of one of the implemented algorithms (1-4 sentences)
Software and hardware platform (e.g., Windows, PC)

1. Tanagra, freeware

Functional capabilities: Exploratory data analysis, statistical learning, machine learning and databases, supervised learning algorithms, data management. An architecture that lets users easily add their own data mining methods and compare their performance. Free access to the source code for learning DM programming techniques. Stream diagram. Data source, Visualization, Descriptive statistics, Instance selection, Feature selection, Feature construction, Regression, Factorial analysis, Clustering, Supervised learning, Meta-spv learning, Learning assessment, Association. Interactive and visual construction of decision trees.

Classes of algorithms: Binary logistic regression, K-nearest neighbor, Multi-layer perceptron, Decision trees, Naive Bayes, Radial basis function, Prototype-NN.

Algorithm description: The decision tree implementation is the ID3 algorithm with the following steps <describe steps>. ID3: J.R. Quinlan, "Discovering rules by induction from large collections of examples", in D. Michie (ed.), Expert Systems in the Microelectronic Age, pp. 168-201, 1979.

Platform: Windows, PC.

2.

3.


Tip: Check the data mining software offered by major vendors such as IBM, Microsoft, Silicon Graphics, SAS, SPSS and Oracle.

 

Useful links to start

Data Mining Software Tools (review)

Tools for learning Computational Intelligence (download)

Real world examples and software (COIL 2000 links)

Data mining in finance

Additional Information

 



 

Part 2

 

·         Write a summary of the task using the sources below:

o   CoILWeb - Case studies (see the section on data)

o   Winner’s paper: C. Elkan, "Magical Thinking in Data Mining: Lessons From CoIL Challenge 2000" (postscript) (pdf). Proceedings of the Seventh International Conference on Knowledge Discovery and Data Mining (KDD'01).

·         Build a model and a forecast using data located at \neve\cs456\data\competition.

·         Post your report, including the task description, as a Word file Project2.docs in your Project2 directory.


Details

1.  Data (see Insurance company problem. File neve\cs456\data\COIL competition\Analysis\analysispositive.xls contains some preliminary data analysis.)

1.1. Use the first half of the data for training (file neve\cs456\data\COIL competition\COIL-Data\training data 5822 with target & header\ticdata2000h.txt).
1.2. Use the second half of these data as testing data.

1.3. Use file ticeval2000ht.txt (neve\cs456\data\COIL competition\COIL-Data\validation data with target and header\ticeval2000ht.txt) to forecast the target for each record in this file. Generate the forecast as a single column; see the instructions in CoIL Challenge 2000.
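
Steps 1.1-1.3 can be sketched in Python as follows. This is only an illustration: it assumes the data files are tab-delimited text with one header line (check the actual files), and the function names and any output file name are hypothetical, not part of the assignment.

```python
import csv


def read_records(path, delimiter="\t"):
    """Read a delimited text file; the first line is treated as the header.

    The tab delimiter is an assumption about the file format.
    """
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter=delimiter)
        header = next(reader)
        rows = [row for row in reader]
    return header, rows


def split_halves(rows):
    """Return (first half, second half): training and testing records."""
    middle = len(rows) // 2
    return rows[:middle], rows[middle:]


def write_forecast_column(path, forecasts):
    """Write one forecast value per line, i.e. a single column."""
    with open(path, "w", newline="") as handle:
        for value in forecasts:
            handle.write(f"{value}\n")
```

With the 5822 training records named in the file path, this split would give 2911 records for training and 2911 for testing.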

2. Building Data Mining Model

Develop a data mining model using Tanagra or other software. See the references on successful algorithms for this task in CoIL Challenge 2000.
Write a Word file Part2.docs as a report with:

Accuracy of the model for training data (include confusion matrix)
Accuracy of the model for testing data (include confusion matrix)
Comparison of forecast for 1.3 with the result of the CoIL winner
Description of the model (the complete model as a file model.txt should be in your directory too)
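
If your software does not report them directly, the accuracy and confusion matrix asked for above can be computed from the actual and forecast target values. The sketch below assumes a binary target coded 0/1 (an assumption; adjust `labels` to your data), and the function names are illustrative only.

```python
from collections import Counter


def confusion_matrix(actual, predicted, labels=(0, 1)):
    """Rows are actual classes, columns are predicted classes."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in labels] for a in labels]


def accuracy(actual, predicted):
    """Fraction of records whose forecast equals the actual target."""
    if not actual:
        raise ValueError("no records to evaluate")
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)


# Small made-up example, not real project data.
actual = [0, 0, 0, 1, 1, 0]
predicted = [0, 1, 0, 1, 0, 0]
print(confusion_matrix(actual, predicted))  # [[3, 1], [1, 1]]
print(accuracy(actual, predicted))          # 4 of 6 correct
```

The confusion matrix shows how the errors are distributed across the classes, which the accuracy figure alone does not.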

Post your report as a Word file Part2.docs on your class U drive.

Present your results in class.



 

Part 3

1. Develop your data mining model for the CWU AL data (data location neve\cs456\data\cwudata\).

2. Make a presentation.

3. Write a report on your forecast, with a brief description of the model, as an executive summary for the CWU AL association. It may later be submitted to the foundation, based on peer evaluation and the instructor's judgment.

 

·         Develop a data mining model using Tanagra or other data mining software.

·         Write a report with:
    model accuracy for training data
    model accuracy for testing data
    comparison of your forecast for 1.3 with that of at least one other student in the class
    description of the model (the complete model as a file model.txt should be in your directory)

·         Post your report (include a link to your model.txt file and other related info).

·         Present your model to the class.
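
One simple way to compare two forecasts is to measure how often they agree and to list the records where they differ. The sketch below assumes both forecasts are single columns covering the same records in the same order; the function name is hypothetical.

```python
def compare_forecasts(mine, other):
    """Return (agreement rate, indices of records where forecasts differ)."""
    if len(mine) != len(other):
        raise ValueError("forecasts must cover the same records")
    if not mine:
        raise ValueError("forecasts are empty")
    differing = [i for i, (a, b) in enumerate(zip(mine, other)) if a != b]
    agreement = 1 - len(differing) / len(mine)
    return agreement, differing


# Made-up forecasts for four records.
agreement, differing = compare_forecasts([0, 1, 0, 0], [0, 1, 1, 0])
print(agreement)  # 0.75
print(differing)  # [2]
```

The list of differing records is a useful starting point for discussing why the two models disagree.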

 



 Evaluation metrics

Criterion

Completeness of the report

Accuracy on training (% of correct forecasts, compared with our own)

Accuracy on testing (% of correct forecasts, compared with our own)

Quality of explanation of the model (the user should be convinced that the model reasonably explains the forecast)

Complexity of the model

Clarity and readability of the Report

Quality of presentation