
Understand the data

Posted: 2017-02-26 23:56:11

A new data set (problem) is a wrapped gift. It’s full of promise and anticipation at the miracles you can wreak once you’ve solved it. But it remains a mystery until you’ve opened it. This chapter is about opening up your new data set so you can see what’s inside, get an appreciation for what you’ll be able to do with the data, and start thinking about how you’ll approach model building with it.

 

Attributes (the variables being used to make predictions) are also known as the following:
■ Predictors
■ Features
■ Independent variables
■ Inputs

Labels are also known as the following:
■ Outcomes
■ Targets
■ Dependent variables
■ Responses

 

Different Types of Attributes and Labels Drive Modeling Choices

The attributes come in two different types: numeric variables and categorical (or factor) variables. Attribute 1 (height) is a numeric variable, which is the most common type of attribute. Attribute 2 is gender and is indicated by the entry Male or Female. This type of attribute is called a categorical or factor variable. Categorical variables have the property that there’s no order relation between the various values. There’s no sense to Male < Female (despite centuries of squabbling). Categorical variables can be two‐valued, like Male/Female, or multivalued, like states (AL, AK, AR . . . WY). Other distinctions can be drawn regarding attributes (integer versus float, for example), but they do not have the same impact on machine learning algorithms. The numeric versus categorical distinction matters because many machine learning algorithms take numeric attributes only; they cannot handle categorical or factor variables. Penalized regression algorithms deal only with numeric attributes. The same is true for support vector machines, kernel methods, and K‐nearest neighbors.
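One common way to feed a categorical attribute to a numeric‐only algorithm is to replace it with one 0/1 indicator column per category (often called one‐hot encoding). The notes above do not name this technique, so treat the following as a minimal sketch under that assumption; the `one_hot_encode` helper and the sample rows are hypothetical.

```python
def one_hot_encode(rows, column):
    """Replace a categorical column with one 0/1 indicator column per category."""
    # Collect the distinct category values; sorting keeps the column order stable.
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        # Copy every attribute except the categorical one.
        new_row = {key: value for key, value in row.items() if key != column}
        # Add one indicator per category: 1.0 if this row has it, else 0.0.
        for category in categories:
            new_row[f"{column}={category}"] = 1.0 if row[column] == category else 0.0
        encoded.append(new_row)
    return encoded


# Hypothetical rows: attribute 1 is height (numeric), attribute 2 is gender (categorical).
people = [
    {"height": 1.80, "gender": "Male"},
    {"height": 1.65, "gender": "Female"},
    {"height": 1.72, "gender": "Female"},
]

encoded_people = one_hot_encode(people, "gender")
for row in encoded_people:
    print(row)
```

The indicator columns carry no ordering between Male and Female, which matches the point that categorical values have no order relation.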

 

When the labels are numeric, the problem is called a regression problem. When the labels are categorical, the problem is called a classification problem. If the categorical target takes only two values, the problem is called a binary classification problem. If it takes more than two values, the problem is called a multiclass classification problem.
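The naming rule above can be written as a small helper. This is only a sketch: `problem_type` is a hypothetical function, and it assumes that numeric labels are stored as Python numbers and categorical labels as strings. Class labels coded as integers (0/1, for example) would be misread as a regression target under this assumption.

```python
def problem_type(labels):
    """Name the modeling problem implied by a list of labels."""
    # Assumption: numeric labels are int/float, categorical labels are anything else.
    numeric = all(
        isinstance(value, (int, float)) and not isinstance(value, bool)
        for value in labels
    )
    if numeric:
        return "regression"
    n_classes = len(set(labels))
    if n_classes < 2:
        raise ValueError("a classification problem needs at least two label values")
    if n_classes == 2:
        return "binary classification"
    return "multiclass classification"


print(problem_type([3.2, 4.1, 5.0]))       # numeric labels
print(problem_type(["M", "R", "M", "R"]))  # two label values
print(problem_type(["AL", "AK", "AR"]))    # more than two label values
```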

 

The classification problem might also be simpler than the regression problem. Consider, for instance, the difference in complexity between a topographic map with a single contour line (say the 100‐foot contour line) and a topographic map with contour lines every 10 feet. The single contour divides the map into the areas that are higher than 100 feet and those that are lower and contains considerably less information than the more detailed contour map. A classifier is trying to compute a single dividing contour without regard for behavior distant from the decision boundary, whereas regression is trying to draw the whole map. (Note: I don’t fully understand this part yet.)
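The contour analogy can be made concrete with a tiny grid of elevations. The numbers below are made up for illustration. A regression target is the elevation itself at every point, while the classification target only records which side of the 100‐foot contour each point lies on.

```python
# Hypothetical elevation grid in feet (values invented for illustration).
elevation = [
    [62, 85, 104, 131],
    [70, 98, 127, 158],
    [81, 112, 149, 190],
]

# Classification target: is each point above the single 100-foot contour?
above_100 = [[height > 100 for height in row] for row in elevation]

# Regression has to reproduce every distinct elevation; the classifier only
# has to reproduce two values (True or False).
distinct_regression_targets = len({height for row in elevation for height in row})
distinct_class_targets = len({flag for row in above_100 for flag in row})

for row in above_100:
    print(row)
print(distinct_regression_targets, distinct_class_targets)
```

Points at 104 feet and at 190 feet get the same class label, so the classifier never needs to know how high the terrain rises far from the boundary. That is the sense in which the single contour carries less information than the full map.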

 

Items to Check:
Number of rows and columns
Number of categorical variables and number of unique values for each
Missing values
Summary statistics for attributes and labels
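The checklist above can be sketched with the Python standard library alone. The inline data set, its column names, and the rule that an empty field means a missing value are all assumptions made for this example; a real data set would be read from a file and may mark missing values differently.

```python
import csv
import io
import statistics

# Hypothetical data set; an empty field marks a missing value.
raw = """height,gender,weight
1.80,Male,81.5
1.65,Female,
1.72,Female,63.0
1.90,Male,92.3
"""

rows = list(csv.DictReader(io.StringIO(raw)))
columns = list(rows[0].keys())


def is_number(text):
    """Return True if the text parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


# Number of rows and columns.
summary = {"n_rows": len(rows), "n_columns": len(columns), "columns": {}}

for name in columns:
    present = [row[name] for row in rows if row[name] != ""]
    info = {"missing": len(rows) - len(present)}  # missing values per column
    if present and all(is_number(value) for value in present):
        # Numeric column: basic summary statistics.
        values = [float(value) for value in present]
        info.update(
            type="numeric",
            mean=statistics.mean(values),
            stdev=statistics.stdev(values),
            min=min(values),
            max=max(values),
        )
    else:
        # Categorical column: list the unique values.
        info.update(type="categorical", unique=sorted(set(present)))
    summary["columns"][name] = info

print(summary["n_rows"], "rows,", summary["n_columns"], "columns")
for name, info in summary["columns"].items():
    print(name, info)
```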

 

Classification Problems: Detecting Unexploded Mines Using Sonar

To be continued

 


Original post: http://www.cnblogs.com/hyqxln/p/6451508.html
