Data Mining


Terminology
  1. DBMS (Database Management System): A DBMS is a software system needed to manage a database in order to ensure
    1. Independence of data and program
    2. Reduction of data redundancy
    3. Data shareability
    4. Improved integrity
    5. Security
    6. Performance 
    7. Centralized control
  2.  Schema: A schema is the systematic and schematic organization of data. A conceptual schema (logical schema, or simply the schema) is the overall formal description of the whole data. A subschema (external schema) is the way individual users see the data (at the application level). An internal schema (or physical schema) is the way data is stored physically. User requests and results are mapped between these levels by using appropriate structures.
  3. Data Independence: It is the capacity to change the schema at one level without changing the schema at another level. Logical data independence is the capacity to change the schema without changing the subschemas, and physical data independence is the capacity to change the internal schema without changing the schema or subschemas. The mapping needs to be changed when a change occurs at one level, and implementing this mapping is an overhead of providing data independence.
  4.  Data Model: A data model is a way of structuring the data and defining the set of operations for this data. A data model can be a file-based model (like a flat file containing related data), a traditional model (like the hierarchical or relational model) or a semantic model. In hierarchical data models, data is represented in the form of hierarchical relationships. These have the advantages of simplicity, fast search and efficiency, but are marred by programming and implementation complexity and lack of data independence. A relational model stores the data in terms of interrelated tables.
  5.  Keys: A candidate key is a set of attributes that uniquely determines a record. A primary key is one of the candidate keys, chosen to uniquely determine the record. A foreign key is a reference to a key in another table. A super key is a set of attributes that contains a key.

Codd's 12 rules to qualify as a fully relational DBMS:

(Ref: Codd, E. (1985). "Is Your DBMS Really Relational?" and "Does Your DBMS Run By the Rules?" Computer World, October 14 and October 21)

Dr. E.F. Codd, an IBM researcher, first developed the relational data model in 1970. In 1985, Dr. Codd published a list of 12 rules that concisely define an ideal relational database, and these have provided a guideline for the design of all relational database systems ever since. The rules are:
  1. Rule 0: A relational database management system must manage its stored data using only its relational capabilities.
  2. Information Rule: All information in the database should be represented in one and only one way - as values in a table
  3. Guaranteed Access Rule: Every datum must be logically accessible by using a combination of table name, primary key value and column name.
  4. Systematic Treatment of NULL values: Null values for representing missing information, are supported in a systematic way, independent of data type.
  5. Dynamic online catalog (Data Dictionary): The database description is represented at the logical level in the same way as ordinary data and is accessible to authorized users using the same relational language.
  6. Comprehensive data sub language: There must be at least one language (like SQL) with well-defined syntax and support for all of the following
    1. data definition
    2. view definition
    3. data manipulation
    4. integrity constraints
    5. authorization
    6. transaction boundaries (start, commit, and rollback).
  7. View updating rule: All views that are theoretically updateable are also updateable by the system
  8. High-level insert, update and delete: Insert, update, and delete operations should be supported for any retrievable set rather than just for a single row in a single table
  9. Physical Data Independence: Changes can be made to the underlying architecture (hardware, disk storage methods) without affecting application programs, terminal activities, or how the user accesses the data.
  10. Logical Data Independence: Application programs, terminal activities and how a user views data should not change when the logical structure (tables structure) of the database changes while preserving the information. This rule is particularly difficult to satisfy because most DBMSs rely on strong ties between the user view of the data and the actual structure of the underlying tables.
  11. Integrity Independence: The database language (like SQL) should support constraints on user input that maintain database integrity. This rule is not fully implemented by most DBMSs. At a minimum, all DBMSs preserve two constraints through SQL, namely entity integrity (no component of a primary key can have a null value) and referential integrity (if a foreign key is defined in one table, any value in it must exist as a primary key in another table).
  12. Distribution Independence: A user should be totally unaware of whether or not the database is distributed (parts of the database at multiple locations). This rule is difficult to implement.
  13. Non subversion Rule: There should be no way to modify the database structure other than through the comprehensive data sub language  (like SQL). Most DBMSs support administrative tools that allow some direct manipulation of the data stored in the database. 
Most RDBMSs satisfy at least 8 or 9 of these rules; Oracle satisfies roughly eleven and a half. A semantic data model, such as an entity-relationship (ER) data model, is often constructed for relational databases. An ER data model represents the database as a set of entities and their relationships.

Normalization: This is the process used to improve the design of a database and to remove insertion, update and delete anomalies. There can be infinitely many normal forms in theory, but usually 5 NFs (1 NF to 5 NF) are recognized. A database is considered "normalized" if it satisfies the first three normal forms. A 0 NF exists, but it is, for the most part, only the initial attempt at a design.
  1. 0 NF: Each column must only contain a single piece of data (This is simply rule 2 of Codd's rules)
  2. 1 NF: A table will have no repeating groups (Functional dependency)
  3. 2 NF: All non-key columns must be functionally dependent on the entire primary key (Full functional dependency)
  4. 3 NF: No non-key columns can be dependent on another non-key column (Transitive dependency)
  5. BCNF (Boyce-Codd Normal Form) and 4 NF: BCNF is 3NF with respect to all candidate keys, not just the primary key; 4 NF additionally requires that only related data entries are in a table (no independent multi-valued dependencies).
  6. 5 NF: Any table that has been normalized into separate tables must be recreate-able by using one or more joins (Join dependency)
There are two trends beyond traditional DBMSs.
  1. Advanced data base systems:
    • Advanced data models: extended-relational, object relational etc
    • Managing complex data: spatial, temporal, multimedia, scientific etc.
    • Data streams
    • Web-based databases (XML, semantic web)
    • Managing uncertain data and data cleaning
    • Heterogeneous and distributed databases
    • Information retrieval from Text database systems
    • Extremely large data management
    • Adaptive systems
    • Advanced queries: ranking, skyline, etc.
    • Cloud computing and concurrent data processing
    • Managing data privacy and security
  2. Advanced data analysis:
    • Data warehouse, Data cube and OLAP
    • Data mining and knowledge discovery: techniques like classification, clustering, outlier analysis,
      association and correlation, comparative summary, discrimination analysis, pattern discovery, trend and deviation analysis, etc.
    • Mining complex types of data: streams, sequence, text, spatial, temporal, multimedia, Web etc.
    • Sundry data mining applications: privacy-preserving data mining, mining social and information networks
The integration of these two trends will give rise to the development of future information systems.

In Memory Data Bases (IMDB): An In-Memory Database (IMDB), also called a Main Memory Database (MMDB) or fully cached database, is a database management system that uses main memory for data storage. These are faster than disk-optimized databases since the internal optimization algorithms are simpler and execute fewer CPU instructions. Accessing data in memory reduces the I/O reading activity when querying the data, which provides faster and more predictable performance than disk. They are used in systems where response time is critical, like telecommunications network equipment and other mission-critical real-time systems. Since the database is stored in volatile memory, it lacks the durability property of ACID (atomicity, consistency, isolation and durability). Durability is added using snapshot files, checkpoint images, transaction logging, database replication, hybrid databases, etc. RDBMSs like MySQL, Oracle and SQLite can be configured to run in main memory. In hybrid databases, in-memory databases are fully integrated with the on-disk databases, sharing the same drivers, SQL language and administration tools. This means that an application can access both an in-memory database and an on-disk database with virtually no code changes, allowing IT departments to quickly deploy in-memory database technology with minimal cost and complexity. In-memory databases are supported as embedded or standalone databases.
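As a concrete illustration of the last point, SQLite (mentioned above) accepts the special database name ":memory:" to run entirely in RAM. A minimal sketch, with a hypothetical table and values:

```python
import sqlite3

# Open an in-memory database: all tables live in RAM and vanish on close,
# illustrating the missing durability that snapshot/hybrid schemes add back.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE calls (id INTEGER PRIMARY KEY, duration REAL)")
cur.executemany("INSERT INTO calls (duration) VALUES (?)",
                [(1.5,), (2.0,), (0.5,)])
total = cur.execute("SELECT SUM(duration) FROM calls").fetchone()[0]
print(total)  # 4.0
conn.close()
```

The same code against a file path instead of ":memory:" would persist the data, which is how hybrid deployments keep the application unchanged.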


Data mining (also known as knowledge discovery from data (KDD)):
 (Ref : Data Mining: Concepts and Techniques by Han, Kamber and the PPT slides associated with that book available on-line)

It is the entire process of using computer-based technology to generate new and useful knowledge from existing data. It goes much beyond traditional data analysis, with its classic aims of knowledge discovery and knowledge prediction (forecasting future trends and events, like stock market prediction), by using some intelligence over simple grouping and partitioning of data. It starts from data dredging, which is the process of scanning a data set for relations and then coming up with a hypothesis for the existence of those relations. Applications of data dredging include market and risk analysis in business and disaster prediction in science. Data mining uses an intersection of sophisticated algorithms like association rules, decision tree classification and clustering, other AI techniques like machine learning, genetic algorithms, support vector machines, fuzzy logic and neural networks, along with traditional statistics and database technologies. Data mining turns a large collection of data into knowledge. The applications are:
  1. AI/Machine Learning(Combinatorial/Game Data Mining)
    Good for analyzing winning strategies to games, and thus developing intelligent AI opponents e.g. Chess
  2. Business Strategies(Market Basket Analysis)
    Identify customer demographics, preferences, and purchasing patterns
  3. Risk Analysis(Product Defect Analysis)
    Analyze product defect rates for given plants and predict possible complications (read: lawsuits) down the line.
  4. User Behavior Validation(Fraud Detection)
    In the realm of cell phones, comparing phone activity to calling records can help detect calls made on cloned phones. Similarly, with credit cards, comparing purchases with historical purchases can detect activity with stolen cards.
  5. Health and Science(Protein Folding)
    Predicting protein interactions and functionality within biological cells. Applications of this research include determining causes and possible cures for Alzheimer's, Parkinson's, and some cancers (caused by protein "misfolds")
  6. Extra-Terrestrial Intelligence
    Scanning Satellite receptions for possible transmissions from other planets.
Three major contemporary challenges for data mining are dirty data, difficult access to data, and explaining data mining to others.

The process of Data Mining: 
  1. Data cleaning: Removal of  noise and inconsistent data
  2. Data integration: Multiple data sources are integrated
  3. Data selection: Data relevant to the analysis is retrieved from the database
  4. Data transformation: Data is transformed and consolidated into forms appropriate for mining by performing aggregation operations
  5. Data mining: Intelligent methods are used to extract data patterns
  6. Knowledge evaluation: Identification of interesting patterns representing knowledge
  7. Knowledge presentation: Visualization and knowledge representation methods are used to present the  mined knowledge

Relational data can be accessed by a relational query language (like SQL) or with the assistance of graphical user interfaces, which use SQL in the background. A given query is mapped onto a set of relational operations, such as join, selection, and projection, and the query is then optimized for efficient processing. The query retrieves specified subsets of the data. When mining relational databases, we can go further by searching for trends or patterns. For example, data mining systems can analyze research data to predict the failure risk of new research projects based on the researchers' experience, qualifications, and previous research records. Data mining systems can also detect deviations (outstanding successes or failures). Such deviations can then be further investigated to find their causes, and the knowledge generated can be fed back to improve the research process.

A data warehouse is a repository of information collected from multiple sources, stored under a unified
schema, and usually residing at a single site. Data warehouses are constructed via a process of data cleaning, data integration, data transformation, data loading, and periodic refreshing of data. A data warehouse is usually modeled by a multidimensional data structure, called a data cube, in which each dimension corresponds to an attribute or a set of attributes in the schema, and each cell stores the value of some aggregate measure. A data cube provides a multidimensional view of data and allows the precomputation and fast access of summarized data, which is used by OLAP (online analytical processing) operators like drill-down and roll-up to view the data at differing degrees of summarization. Mining different kinds of data gives different kinds of knowledge: mining temporal data gives trends; network data gives anomalous behavior like intrusions; spatial data gives spatial patterns; video data gives patterns of particular events; web data gives the distribution of the WWW and relationships among different web pages, users, communities, and web-based activities.

Types of patterns that can be mined:
  1. Data Characterization and summarization using graphs, crosstabs etc.
  2. Data Discrimination:  Contrasting the features of target class with one or more contrasting classes using discrimination rules e.g. profiling outstanding researchers from mediocre ones.
  3. Frequent Patterns: Patterns which occur frequently in data. There are many kinds of frequent patterns, including frequent itemsets, frequent subsequences (or sequential patterns), and frequent substructures. A frequent itemset is a set of items that often appear together in a transactional data set; for example, we can find what kinds of research tools are used together by expert researchers. A frequently occurring subsequence, such as the pattern that people tend to purchase first a computer, followed by a webcam, then a memory card, and then a printer, is a (frequent) sequential pattern. A substructure combines different structural forms (e.g., graphs, trees, etc.) with itemsets or subsequences.
    • Association Analysis: the discovery of association rules, i.e., attribute-value conditions that occur frequently together in the data (detailed under the techniques below)

Techniques of Data mining:
  1. Frequent patterns: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set, e.g. what products were often purchased together? What kinds of DNA are sensitive to this new drug? The applications include sale campaign analysis, web log (click stream) analysis, and DNA sequence analysis. Given itemset X = {x1, …, xk}, find all the rules X ⇒ Y with minimum support and confidence, where
    1. support, s, is the probability that a transaction contains X ∪ Y
    2. confidence, c, is the conditional probability that a transaction containing X also contains Y
      Transaction-id   Items bought
      10               A, B, D
      20               A, C, D
      30               A, D, E
      40               B, E, F
      50               B, C, D, E, F

      Let sup_min = 50%, conf_min = 50%
      Frequent patterns: {A:3, B:3, D:4, E:3, AD:3}
      Association rules:
        A ⇒ D (60%, 100%)
        D ⇒ A (60%, 75%)
       
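The support and confidence figures above can be checked with a short sketch over the same transaction table (function names are illustrative):

```python
# Transaction table from the example above.
transactions = {
    10: {"A", "B", "D"},
    20: {"A", "C", "D"},
    30: {"A", "D", "E"},
    40: {"B", "E", "F"},
    50: {"B", "C", "D", "E", "F"},
}

def support(itemset):
    """Fraction of transactions containing every item in itemset."""
    hits = sum(1 for t in transactions.values() if itemset <= t)
    return hits / len(transactions)

def confidence(X, Y):
    """P(Y in transaction | X in transaction) = sup(X u Y) / sup(X)."""
    return support(X | Y) / support(X)

print(support({"A", "D"}))       # 0.6  -> 60% support for A => D
print(confidence({"A"}, {"D"}))  # 1.0  -> 100% confidence for A => D
print(confidence({"D"}, {"A"}))  # ~0.75 -> 75% confidence for D => A
```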
    3. A long pattern contains a combinatorial number of sub-patterns, so we should mine closed patterns and max-patterns instead. An itemset X is closed if X is frequent and there exists no super-pattern Y ⊃ X with the same support as X. An itemset X is a max-pattern if X is frequent and there exists no frequent super-pattern Y ⊃ X. A closed pattern is a lossless compression of frequent patterns.
    4. Any subset of a frequent itemset must be frequent
    5. Scalable mining methods: 
      1. Apriori: If there is any itemset which is infrequent, its superset should not be generated/tested! So the steps are
        1. Initially, scan DB once to get frequent 1-itemset 
        2. Generate length (k+1) candidate itemsets from length k frequent itemsets 
        3. Test the candidates against DB 
        4. Terminate when no frequent or candidate set can be generated. Counting support of candidates is done through a hash tree.  
        5. Apriori algorithm is improved by 
          1. Reducing passes of DB scan (partitioning into halves => scan DB only twice)
          2. Shrink number of candidates -A k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequent  
          3. Facilitate support counting of candidates. 
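The Apriori steps above can be sketched as follows; this is a minimal, unoptimized implementation (no hash tree for support counting), and all names are illustrative:

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Sketch of Apriori: L1 from one scan, then join frequent k-itemsets
    into (k+1)-candidates, prune candidates with any infrequent subset,
    and count survivors against the DB until nothing new is frequent."""
    n = len(transactions)

    def frequent(cands):
        return {c for c in cands
                if sum(1 for t in transactions if c <= t) / n >= min_sup}

    Lk = frequent({frozenset([i]) for t in transactions for i in t})
    result = set()
    while Lk:
        result |= Lk
        k = len(next(iter(Lk)))
        # join step: (k+1)-candidates from unions of frequent k-itemsets
        cands = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
        # prune step: every k-subset of a candidate must itself be frequent
        cands = {c for c in cands
                 if all(frozenset(s) in Lk for s in combinations(c, k))}
        Lk = frequent(cands)
    return result

db = [{"A", "B", "D"}, {"A", "C", "D"}, {"A", "D", "E"},
      {"B", "E", "F"}, {"B", "C", "D", "E", "F"}]
print(sorted("".join(sorted(s)) for s in apriori(db, 0.5)))
# ['A', 'AD', 'B', 'D', 'E'] -- matching {A:3, B:3, D:4, E:3, AD:3} above
```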
      2. Frequent Pattern Growth: Multiple database scans are costly, and mining long patterns needs many passes of scanning and generates lots of candidates. We can get frequent itemsets without candidate generation. The FP-tree is generated by
        1. Scan DB once, find frequent 1-itemset (single item pattern)
        2. Sort frequent items in frequency descending order, f-list
        3. Scan DB again, construct FP-tree
        4. Scaling FP-growth by DB Projection
          1. FP-tree cannot fit in memory?—DB projection
          2. First partition a database into a set of projected DBs, Then construct and mine FP-tree for each projected DB
        5. Advantages of FP-Tree
          1. Divide-and-conquer:decompose both the mining task and DB according to the frequent patterns obtained so far 
          2. no candidate generation, no candidate test
          3. compressed database: FP-tree structure
          4. no repeated scan of entire database  
      3. CHARM: vertical data format approach
      4. Max-Miner: mining max-patterns
      5. CLOSET: mining frequent closed patterns
      6. CLOSET+: mining frequent closed itemsets by pattern growth
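The first two FP-tree construction steps (count items in one scan, then sort the frequent ones into the f-list) can be sketched on the same transaction table; alphabetical tie-breaking among equal counts is an assumption here:

```python
from collections import Counter

# One DB scan to count single items, then the f-list in descending frequency.
db = [{"A", "B", "D"}, {"A", "C", "D"}, {"A", "D", "E"},
      {"B", "E", "F"}, {"B", "C", "D", "E", "F"}]
min_count = 3  # 50% of 5 transactions

counts = Counter(item for t in db for item in t)
f_list = sorted((i for i, c in counts.items() if c >= min_count),
                key=lambda i: (-counts[i], i))  # ties broken alphabetically
print(f_list)  # ['D', 'A', 'B', 'E']
```

The second scan would then insert each transaction, reordered by this f-list, as a path into the tree so that frequent prefixes are shared.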
  2. Association Rules (AR):  
    1. Multiple level AR: Items often form hierarchies, Flexible support settings, Items at the lower level are expected to have lower support
      1. A rule is redundant if its support is close to the “expected” value, based on the rule’s ancestor
    2. Multi dimensional AR 
      1. Inter-dimension association rules (no repeated predicates), e.g. age(X, "19-25") ∧ occupation(X, "student") ⇒ buys(X, "coke")
      2. Hybrid-dimension association rules (repeated predicates), e.g. age(X, "19-25") ∧ buys(X, "popcorn") ⇒ buys(X, "coke")
      3. Quantitative Associations:  
        1. Static discretization based on predefined concept hierarchies (data cube methods): Numerical values replaced by ranges and cube dimensions correspond to predicate sets.
        2. Static discretization and clustering method
        3. Dynamic discretization based on data distribution (quantitative rules): numeric attributes are dynamically discretized such that the confidence or compactness of the rules mined is maximized
        4. Clustering: distance-based association: 2-D quantitative association rules Aquan1 ∧ Aquan2 ⇒ Acat. Cluster adjacent association rules to form general rules using a 2-D grid. For example,
          age(X, "34-35") ∧ income(X, "30-50K") ⇒ buys(X, "high-resolution TV")
  3.  Correlations:  
    1. play basketball ⇒ eat cereal [40%, 66.7%] is misleading since the overall percentage of students eating cereal is 75%, which is higher than 66.7%.

                   Basketball   Not basketball   Sum (row)
      Cereal           2000         1750           3750
      Not cereal       1000          250           1250
      Sum (col.)       3000         2000           5000

    2. play basketball ⇒ not eat cereal [20%, 33.3%] is more accurate, although with lower support and confidence
    3. Measures of dependent/correlated events: support and confidence are not good for representing correlations; lift, all_conf or coherence are the proper metrics
    4. lift and χ² are not good measures for correlations in large transactional DBs; all_conf or coherence could be good measures and have the downward closure property
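Lift for the basketball/cereal example can be computed directly from the contingency table above; lift = P(B, C) / (P(B)·P(C)), and a value below 1 indicates negative correlation:

```python
# Cell counts taken from the 2x2 table above.
n = 5000
p_b_and_c = 2000 / n   # basketball and cereal
p_b = 3000 / n         # basketball
p_c = 3750 / n         # cereal

lift = p_b_and_c / (p_b * p_c)
print(round(lift, 3))  # 0.889, < 1: playing basketball and eating cereal
                       # are negatively correlated, as the text argues
```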
  4. Constraint based Mining: Interactive Data Mining using query language similar to SQL for Relational database.
    1. Constraint types
      1. Knowledge type constraint: classification, association, etc.
      2. Data constraint — using SQL-like queries eg find product pairs sold together in stores in Mumbai in Dec.’12
      3. Dimension/level constraint: in relevance to region, price, brand, customer category 
      4. Rule (or pattern) constraint: small sales (price  < $10) triggers big sales (sum > $200)
      5. Interestingness constraint: strong rules: min_support ≥ 3%, min_confidence ≥ 60%
    2. Constraint properties
      1. Anti-monotonicity: when an itemset S violates the constraint, so does any of its supersets, e.g. sum(S.price) ≤ v is anti-monotone; sum(S.price) ≥ v is not anti-monotone
      2. Monotonicity: when an itemset S satisfies the constraint, so does any of its supersets, e.g. sum(S.price) ≥ v is monotone; min(S.price) ≤ v is monotone
      3. Succinctness: given A1, the set of items satisfying a succinctness constraint C, any set S satisfying C is based on A1, i.e., S contains a subset belonging to A1. Whether an itemset S satisfies constraint C can be determined based on the selection of items, without looking at the transaction database, e.g. min(S.price) ≤ v is succinct; sum(S.price) ≥ v is not succinct. If C is succinct, C is pre-counting pushable in Apriori etc.
      4. Convertibility: convert tough constraints into anti-monotone or monotone by properly ordering items, e.g. C: avg(S.profit) ≥ 25. Order items in value-descending order <a, f, g, d, b, h, c, e>; if an itemset afb violates C, so do afbh and afb*, so C becomes anti-monotone! Thus avg(X) ≥ 25 is convertible anti-monotone w.r.t. item-value descending order and convertible monotone w.r.t. item-value ascending order, i.e., avg(X) ≥ 25 is strongly convertible. A convertible constraint that is neither monotone, anti-monotone nor succinct cannot be pushed deep into an Apriori mining algorithm, but it can be pushed into the frequent-pattern growth framework!
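Anti-monotonicity can be illustrated with hypothetical item prices: once sum(S.price) ≤ v fails for an itemset, it fails for every superset, which is exactly what lets Apriori prune whole branches of the search:

```python
# Hypothetical prices and budget v (illustrative values only).
prices = {"a": 8, "b": 3, "c": 6}
v = 10

def within_budget(itemset):
    """The anti-monotone constraint sum(S.price) <= v."""
    return sum(prices[i] for i in itemset) <= v

print(within_budget({"b", "c"}))       # True:  9 <= 10
print(within_budget({"a", "c"}))       # False: 14 > 10, so prune...
print(within_budget({"a", "b", "c"}))  # False: ...every superset fails too
```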
  5. Classification: Classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute, and uses it in classifying new data. Also called supervised learning, since new data is classified based on a training set whose labels are known. This is a three-step process.
    1. Process
      1. Model construction: describing a set of predetermined classes
        1.  Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
        2. The set of tuples used for model construction is training set
        3.  The model is represented as classification rules (eg IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’ ), decision trees, or mathematical formulae
      2.  Model usage: for classifying future or unknown objects
        1.  Estimate accuracy of the model
        2. The known label of test sample is compared with the classified result from the model
        3. Accuracy rate is the percentage of test set samples that are correctly classified by the model  
        4. Test set is independent of training set, otherwise over-fitting will occur 
      3. If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known 
    2. Issues: 
      1. Data Preparation: 
          1. Data Cleaning (reducing noise and missing values)
      2. Relevance analysis (features selection)
      3. Data Transformation: Normalization or generalization
    3. Evaluation of classification methods:
      1. Accuracy (Classifier and predictor)
      2. Time (Training time, classification / prediction time)
      3. Robustness (handling noise and missing values)
      4. scalability: efficient in disk resident DBs
      5. Interpretability
    4. Classification Methods
      1. Decision Tree: 
        1. Basic algorithm (a greedy algorithm)
          1.  Tree is constructed in a top-down recursive divide-and-conquer manner. At start, all the training examples are at the root. Attributes are categorical (if continuous-valued, they are discretized in advance). Examples are partitioned recursively based on selected attributes.
          2. Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
        2.  Conditions for stopping partitioning
          1. All samples for a given node belong to the same class; or there are no remaining attributes for further partitioning (majority voting is employed for classifying the leaf); or there are no samples left
        3.  Attributes Selection Measures
          1. Information Gain (ID3 ) 
            1. Select the attribute with the highest information gain. Let pi be the probability that an arbitrary tuple in D belongs to class Ci, estimated by |Ci,D|/|D|. Then Info(D) = −Σi pi log2(pi) is the expected information (entropy) needed to classify a tuple in D; InfoA(D) = Σj (|Dj|/|D|) × Info(Dj) is the information needed (after using A to split D into v partitions) to classify D; and Gain(A) = Info(D) − InfoA(D) is the information gained by branching on attribute A.
          2.  Gain Ratio: (C 4.5): Information gain measure is biased towards attributes with a large number of values. C4.5 (a successor of ID3) uses gain ratio to overcome the problem (normalization to information gain)
          3. Gain Ratio(A) = Gain(A) / SplitInfoA(D), where SplitInfoA(D) = −Σj (|Dj|/|D|) log2(|Dj|/|D|)
            For example, Gain Ratio(income) = 0.029 / 0.926 = 0.031
            The attribute with the maximum gain ratio is selected as the splitting attribute.
          4. Gini Index (CART, IBM Intelligent Miner): If a data set D contains examples from n classes, the gini index is gini(D) = 1 − Σj pj², where pj is the relative frequency of class j in D. If D is split on A into two subsets D1 and D2, then giniA(D) = (|D1|/|D|)·gini(D1) + (|D2|/|D|)·gini(D2), and the reduction in impurity is Δgini(A) = gini(D) − giniA(D). The attribute that provides the smallest ginisplit(D) (or the largest reduction in impurity) is chosen to split the node (we need to enumerate all the possible splitting points for each attribute)
          5. Here the attribute partitions D into 10 tuples in D1: {low, medium} and 4 in D2: {high}; among the candidate splits, gini{medium,high} = 0.30 is the lowest and thus the best
          6. Comparing attributes selection measures: 
            1. Information gain: biased towards multivalued attributes
            2. Gain ratio: tends to prefer unbalanced splits in which one partition is much smaller than the others
            3. Gini index: biased to multivalued attributes; has difficulty when the number of classes is large; tends to favor tests that result in equal-sized partitions and purity in both partitions
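The three measures can be sketched in a few lines; the class counts below (a 9/5 split partitioned three ways by some attribute A) are a hypothetical example, not taken from the text:

```python
from math import log2

def entropy(counts):
    """Info(D) = -sum p_i * log2(p_i) over the class counts."""
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c)

def gini(counts):
    """gini(D) = 1 - sum p_j^2."""
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def info_gain(parent, partitions):
    """Gain(A) = Info(D) - sum (|Dj|/|D|) * Info(Dj)."""
    n = sum(parent)
    return entropy(parent) - sum(sum(p) / n * entropy(p) for p in partitions)

# Hypothetical node: 9 positive / 5 negative tuples, split by attribute A
# into three partitions with class counts (2,3), (4,0) and (3,2).
print(round(entropy([9, 5]), 3))                              # 0.94
print(round(info_gain([9, 5], [[2, 3], [4, 0], [3, 2]]), 3))  # 0.247
print(round(gini([9, 5]), 3))                                 # 0.459
```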
          7. Other measures:
            1. Chi-square test for independence - CHAID (a popular decision tree method)
            2. C-SEP
            3. G-statistic
            4. MDL (Minimal Description Length) principle - the tree requiring the fewest bits to encode is preferred
            5. Multivariate splits - CART
          8. Approaches to deal with over-fitting (i.e. too many branches and poor accuracy for unseen samples)
            1.  Prepruning: Halt tree construction early
              1. Stop if all instances belong to the same class
              2. Stop if all the attribute values are the same
              3. Stop if number of instances is less than some user-specified threshold
              4. Stop if the class distribution of instances is independent of the available features (e.g., using the chi-square test)
              5. Stop if expanding the current node does not improve impurity measures (e.g., Gini or information gain)
            2. Postpruning:
              1. Grow decision tree to its entirety
              2. Trim the nodes of the decision tree in a bottom-up fashion
              3. If generalization error improves after trimming, replace sub-tree by a leaf node.
              4. Class label of leaf node is determined from majority class of instances in the sub-tree
              5. Can use MDL for post-pruning
          9. Enhancements 
            1. Allow continuous valued attributes
            2. Handle missing value
            3. create new attributes based on existing ones
            4. The decision boundary is parallel to the axes because each test condition involves a single attribute at a time. Test conditions may involve multiple attributes, which gives a more expressive representation, but finding the optimal test condition is then computationally expensive
          10. Scalable methods in large databases
            1. SLIQ, SPRINT, PUBLIC
            2. RAINFOREST - based on AVC (Attribute-Value-Class label) sets, which aggregate the tuples per attribute value and class label
            3. BOAT (Bootstrapped Optimistic Algorithm for Tree construction) - uses a statistical technique called bootstrapping to create several smaller samples (subsets), each of which fits in memory. Each subset is used to create a tree, resulting in several trees. These trees are examined and used to construct a new tree T', which turns out to be very close to the tree that would be generated using the whole data set. The advantage is that it requires only two scans of the DB, so it acts as an incremental algorithm
          11.  Model Evaluation: 
            1. Metrics for performance evaluation

              PREDICTED CLASS

              ACTUAL
              CLASS

              Class=Yes
              Class=No
              Class=Yes
              a
              b
              Class=No
              c
              d

              1. a: TP (true positive), b: FN (false negative), c: FP (false positive), d: TN (true negative) 
              2. Accuracy = (a+d)/(a+b+c+d)
              3. Cost of classification: assign cost weights wa, wb, wc, wd to the four cells
              4. Weighted accuracy = (wa·a + wd·d) / (wa·a + wb·b + wc·c + wd·d)
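A sketch of these metrics, with hypothetical cell counts and weights (equal weights reduce weighted accuracy to plain accuracy):

```python
# Confusion-matrix cells as above: a = TP, b = FN, c = FP, d = TN.
a, b, c, d = 40, 10, 5, 45     # hypothetical counts
wa, wb, wc, wd = 1, 1, 1, 1    # hypothetical (equal) cost weights

accuracy = (a + d) / (a + b + c + d)
weighted = (wa * a + wd * d) / (wa * a + wb * b + wc * c + wd * d)
print(accuracy, weighted)  # 0.85 0.85
```

Raising, say, wb would penalize false negatives more, which is how cost-sensitive evaluation differs from plain accuracy.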
            2. Methods of estimation
              1. Holdout: reserve 2/3 for training and 1/3 for testing
              2. Random subsampling: repeated holdout
              3. Cross validation: partition data into k disjoint subsets; k-fold: train on k−1 partitions, test on the remaining one; leave-one-out: k = n
              4. Stratified sampling: oversampling vs. undersampling
              5. Bootstrap: sampling with replacement
            3. Methods for model comparison
              1. Prediction can be regarded as a Bernoulli trial. A Bernoulli trial has 2 possible outcomes, and the possible outcomes for a prediction are correct or wrong; a collection of Bernoulli trials has a binomial distribution
              2. Given x (# of correct predictions) or equivalently, acc=x/N, and N (# of test instances), Can we predict p (true accuracy of model)?
              3. For large test sets (N > 30), acc has a normal distribution with mean p and variance p(1−p)/N, i.e. P(−Zα/2 ≤ (acc − p)/√(p(1−p)/N) ≤ Zα/2) = 1 − α; solving this inequality for p gives the confidence interval
              4. Consider a model that produces an accuracy of 80% when evaluated on 100 test instances: N = 100, acc = 0.8. Let 1 − α = 0.95 (95% confidence). From the probability table, Zα/2 = 1.96 ⇒ CI = (0.711, 0.866)
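Solving the normal-approximation inequality for p yields a quadratic whose roots are the interval endpoints; this sketch reproduces the interval quoted above (the upper endpoint rounds to 0.867 rather than 0.866):

```python
from math import sqrt

# Solve (acc - p)^2 = z^2 * p(1-p)/N for p, a quadratic in p.
N, acc, z = 100, 0.8, 1.96  # z = Z_{alpha/2} for 1 - alpha = 0.95

center = (2 * N * acc + z * z) / (2 * (N + z * z))
half = z * sqrt(z * z + 4 * N * acc - 4 * N * acc * acc) / (2 * (N + z * z))
print(round(center - half, 3), round(center + half, 3))  # 0.711 0.867
```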
              5.  Given two models, say M1 and M2, which is better?
                1. M1 is tested on D1 (size=n1), found error rate = e1
                2. M2 is tested on D2 (size=n2), found error rate = e2
                3. Assume D1 and D2 are independent. If n1 and n2 are sufficiently large, then e1 ∼ N(μ1, σ1) and e2 ∼ N(μ2, σ2), with variance approximated by σi² = ei(1 − ei)/ni
                4. To test if the performance difference is statistically significant: d = |e1 − e2|, d ∼ N(dt, σt), where dt is the true difference. Since D1 and D2 are independent, their variances add up: σt² = σ1² + σ2² = e1(1 − e1)/n1 + e2(1 − e2)/n2
                5. At (1-α) confidence level,  dt = d ± Zα/2 σt
                6. Given M1: n1 = 30, e1 = 0.15 and M2: n2 = 5000, e2 = 0.25, we have d = |e2 − e1| = 0.1. At the 95% confidence level, Zα/2 = 1.96, so dt = 0.1 ± 1.96·√0.0043 ⇒ the interval contains 0 ⇒ the difference may not be statistically significant
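The arithmetic of this example can be checked directly:

```python
from math import sqrt

# The numeric example above: M1 (n1=30, e1=0.15) vs M2 (n2=5000, e2=0.25).
n1, e1, n2, e2, z = 30, 0.15, 5000, 0.25, 1.96

d = abs(e2 - e1)
var_t = e1 * (1 - e1) / n1 + e2 * (1 - e2) / n2  # variances add (independent sets)
half = z * sqrt(var_t)
print(round(var_t, 4))          # 0.0043
print(d - half < 0 < d + half)  # True: the interval contains 0
```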
                   
      2. Bayesian Classification: statistical classifiers based on Bayes' theorem; the naive Bayesian classifier assumes class-conditional independence of attributes
  6. Prediction: Models continuous-valued functions, i.e., predicts unknown or missing values
  7. Clustering: Also called unsupervised learning since the class labels of training data are unknown
