EPIB 698C lecture 7 - PowerPoint PPT Presentation

Provided by: Gua114
Learn more at: http://www.brac.umd.edu

1
EPIB 698C lecture 7
  • Raul Cruz-Cano

2
Sorting, Printing and Summarizing Your Data
  • SAS procedures (PROCs) perform a specific
    analysis or function and produce results or
    reports
  • E.g., PROC PRINT DATA=new; RUN;
  • All procedures have required statements, and most
    have optional statements
  • All procedures start with the keyword PROC,
    followed by the name of the procedure, such as
    PRINT or CONTENTS
  • Options, if there are any, follow the procedure
    name
  • The DATA=data_name option tells SAS which data set
    to use as input for the procedure. NOTE: if you
    omit it, SAS uses the most recently created
    data set, which is not necessarily the same as
    the most recently used one.

3
BY statement
  • The BY statement is required for only one
    procedure, PROC SORT
  • PROC SORT DATA=new;
  • BY gender;
  • RUN;
  • For all other procedures, BY is an optional
    statement that tells SAS to perform the analysis
    separately for each level of the variable named
    in the BY statement, instead of treating all
    subjects as one group
  • PROC PRINT DATA=new;
  • BY gender;
  • RUN;
  • All procedures except PROC SORT assume your
    data are already sorted by the variables in your
    BY statement
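
The two steps above can be combined into one small program. This is a minimal sketch: the data set name new and the variable gender come from the slide, while the id variable, the data values, and the output name new_sorted are assumptions made only for illustration.

```sas
* Hypothetical data, for illustration only;
DATA new;
   INPUT id gender $;
   DATALINES;
1 M
2 F
3 M
4 F
;
RUN;

* Sort first so that each level of gender forms one contiguous block;
PROC SORT DATA=new OUT=new_sorted;
   BY gender;
RUN;

* Print a separate listing for each level of gender;
PROC PRINT DATA=new_sorted;
   BY gender;
RUN;
```

If the PROC PRINT step were run on the unsorted data set new instead, SAS would stop with an error saying the BY variables are not properly sorted.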

4
PROC Sort
  • Syntax:
  • PROC SORT DATA=input_data_name
    OUT=out_data_name;
  • BY variable-1 ... variable-n;
  • The variables in the BY statement are called BY
    variables.
  • With one BY variable, SAS sorts the data based on
    the values of that variable
  • With more than one BY variable, SAS sorts
    observations by the first variable, then by the
    second variable within the categories of the
    first variable, and so on
  • The DATA= and OUT= options specify the input and
    output data sets. Without the DATA= option, SAS
    will use the most recently created data set.
    Without the OUT= option, SAS will replace the
    original data set with the newly sorted version
5
PROC Sort
  • By default, SAS sorts data in ascending order,
    from the lowest to the highest value or from A to
    Z. To reverse the order, add the keyword
    DESCENDING before each variable that should be
    sorted from highest to lowest, or from Z to A
  • The NODUPKEY option, placed in the PROC SORT
    statement, tells SAS to eliminate any duplicate
    observations that have the same values for the
    BY variables

6
PROC Sort
  • Example: The file sealife.txt contains information
    on the average length in feet of selected whales
    and sharks. We want to sort the data by family
    and length
  • Name     Family  Length
    beluga   whale   15
    whale    shark   40
    basking  shark   30
    gray     whale   50
    mako     shark   12
    sperm    whale   60
    dwarf    shark   .5
    whale    shark   40
    humpback .       50
    blue     whale   100
    killer   whale   30

8
PROC Sort
  • DATA marine;
  • INFILE 'F:\sealife.txt';
  • INPUT Name $ Family $ Length;
  • RUN;
  • * Sort the data;
  • PROC SORT DATA=marine OUT=seasort
  • NODUPKEY;
  • BY Family DESCENDING Length;
  • RUN;

9
Title and Footnote statement
  • TITLE and FOOTNOTE statements are global
    statements and are not technically part of any
    step.
  • You can put them anywhere in your program, but
    since they apply to the procedure output, it
    usually makes sense to put them with the procedure
  • Syntax:
  • TITLE 'This is a title for this procedure';
  • FOOTNOTE 'This is the footnote for this
    procedure';
  • To cancel the current title or footnote, use the
    following null statements:
  • TITLE;
  • FOOTNOTE;

10
Label Statement
  • The LABEL statement can create descriptive
    labels, up to 256 characters long, for each
    variable
  • E.g.:
  • LABEL Shipdate = 'Date merchandise was
    shipped'
  • ID = 'Identification number of subject';
  • When a LABEL statement is used in a DATA step,
    the labels become part of the data set, but when
    used in a PROC step, the labels stay in effect
    only for the duration of that step
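
The difference between a stored label and a step-only label can be sketched as follows. The variables Shipdate and ID and their labels come from the slide. The data set name orders, the data values, and the replacement label 'Subject ID' are assumptions for illustration. Note that PROC PRINT shows labels only when its LABEL option is given.

```sas
* Hypothetical data, for illustration only;
DATA orders;
   INPUT ID Shipdate :MMDDYY10.;
   FORMAT Shipdate MMDDYY10.;
   * Labels assigned in a DATA step are stored with the data set;
   LABEL Shipdate = 'Date merchandise was shipped'
         ID       = 'Identification number of subject';
   DATALINES;
101 05/04/2001
102 05/14/2001
;
RUN;

* A label assigned in a PROC step applies to this step only;
PROC PRINT DATA=orders LABEL;
   LABEL ID = 'Subject ID';
RUN;

* The stored labels are used again in any later step;
PROC PRINT DATA=orders LABEL;
RUN;
```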

11
PROC Format statement
  • The FORMAT procedure allows you to create
    your own formats. It is useful when you use coded
    data.
  • PROC FORMAT creates formats that
    will later be associated with variables in a
    FORMAT statement
  • Syntax of PROC FORMAT:
  • PROC FORMAT;
  • VALUE name range-1 = 'formatted-text-1'
  • range-2 = 'formatted-text-2'
  • ...
  • range-n = 'formatted-text-n';
  • name is the name of the format you are creating.
    If the format is for character data, then you need
    to use $name instead of name. In addition, the
    name cannot be the name of an existing format

12
PROC Format statement
  • Each range is the value (or set of values) of the
    variable that is assigned to the text given in
    the quotation marks
  • The text can be up to 32,767 characters long, but
    some procedures print only the first 8 to 16
    characters
  • The following are some examples of valid range
    specifications:
  • 'A' = 'Asian': character values must be put
    in quotation marks
  • 1,3,5,7,9 = 'ODD': with more than one value in the
    range, separate them with a comma or a
    hyphen (-)
  • 5000-HIGH = 'high price': the keywords LOW and
    HIGH can be used in ranges to indicate the
    lowest and highest non-missing values for the
    variable

13
PROC Format statement
  • Here is a survey about subjects' preferences for
    car colors. The data contain each subject's age,
    sex (coded as 1 for male and 2 for female), annual
    income, and preferred car color (yellow, green,
    blue, or white). Here are the data:
  • age sex income color
    19  1   14000  Y
    45  1   65000  G
    72  2   35000  B
    31  1   44000  Y
    58  2   83000  W

14
  • DATA carsurvey;
  • INFILE 'C:\car.txt';
  • INPUT Age Sex Income Color $;
  • RUN;
  • PROC FORMAT;
  • VALUE gender 1 = 'Male'
  • 2 = 'Female';
  • VALUE agegroup 13 -< 20 = 'Teen'
  • 20 -< 65 = 'Adult'
  • 65 - HIGH = 'Senior';
  • VALUE $col 'W' = 'Moon White'
  • 'B' = 'Sky Blue'
  • 'Y' = 'Sunburst Yellow'
  • 'G' = 'Green';
  • PROC PRINT DATA=carsurvey;
  • FORMAT Sex gender. Age agegroup.
  • Color $col. Income DOLLAR8.;
  • RUN;

15
Subsetting in procedures with a where statement
  • The WHERE statement tells a procedure to use a
    subset of the data
  • It is an optional statement for any PROC step
  • Unlike subsetting in the DATA step, using a WHERE
    statement in a procedure does not create a new
    data set
  • The basic form is:
  • WHERE condition; (e.g., WHERE gender = 'female';)
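
The contrast with DATA step subsetting can be sketched as follows. The condition on gender comes from the slide. The data set names new and females, the other variables, and the data values are assumptions for illustration.

```sas
* Hypothetical data, for illustration only;
DATA new;
   INPUT id gender $ age;
   DATALINES;
1 female 34
2 male 51
3 female 27
;
RUN;

* WHERE in a procedure: only matching rows are used, no new data set;
PROC PRINT DATA=new;
   WHERE gender = 'female';
RUN;

* Subsetting in a DATA step creates a new data set, here called females;
DATA females;
   SET new;
   IF gender = 'female';
RUN;
```

Character comparisons are case-sensitive, so the quoted value has to match the stored value exactly.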

16
Subsetting in procedures with a where statement
  • A data set contains information about well-known
    painters:
  • Name                  Style              Nation of origin
    Mary Cassatt          Impressionism      U
    Paul Cezanne          Post-impressionism F
    Edgar Degas           Impressionism      F
    Paul Gauguin          Post-impressionism F
    Claude Monet          Impressionism      F
    Pierre Auguste Renoir Impressionism      F
    Vincent van Gogh      Post-impressionism N
  • Goal: we want a list of impressionist painters

17
  • DATA style;
  • INFILE 'C:\style.txt';
  • INPUT Name $ 1-21 Style $ 23-40 Origin $ 42;
  • RUN;
  • PROC PRINT DATA=style;
  • WHERE Style = 'Impressionism';
  • TITLE 'Major Impressionist Painters';
  • FOOTNOTE 'F = France N = Netherlands U = US';
  • RUN;

18
Summarizing your data with PROC MEANS
  • The MEANS procedure provides simple
    statistics on numeric variables. Syntax: PROC
    MEANS options;
  • Simple statistics that PROC MEANS can
    produce:
  • MAX: the maximum value
  • MIN: the minimum value
  • MEAN: the mean
  • N: number of non-missing values
  • STDDEV: the standard deviation
  • NMISS: number of missing values
  • RANGE: the range of the data
  • SUM: the sum
  • MEDIAN: the median

  • DEFAULT: if no statistics are requested, PROC
    MEANS reports N, MEAN, STDDEV, MIN, and MAX
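
A minimal sketch of requesting specific statistics: the keywords are listed as options in the PROC MEANS statement. The data set new, its variables age and BMI, and the data values are assumptions for illustration.

```sas
* Hypothetical data, for illustration only (one BMI value is missing);
DATA new;
   INPUT age BMI;
   DATALINES;
34 22.5
51 27.1
27 .
45 24.8
;
RUN;

* Request only the listed statistics instead of the default set;
PROC MEANS DATA=new N NMISS MEAN MEDIAN RANGE;
   VAR age BMI;
RUN;
```
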
19
Proc means
  • Optional statements of PROC MEANS:
  • BY variable-list; performs the analysis for each
    level of the variables in the list. The data need
    to be sorted first
  • VAR variable-list; specifies which variables to
    use in the analysis

20
Proc means
  • A wholesale nursery sells garden flowers and
    wants to summarize its sales figures by
    month. The data are as follows:
  • ID     Date       Lily SnapDragon Marigold
    756-01 05/04/2001 120  80         110
    756-01 05/14/2001 130  90         120
    834-01 05/12/2001 90   160        60
    834-01 05/14/2001 80   60         70
    901-02 05/18/2001 50   100        75
    834-01 06/01/2001 80   60         100
    756-01 06/11/2001 100  160        75
    901-02 06/19/2001 60   60         60
    756-01 06/25/2001 85   110        100

21
  • DATA sales;
  • INFILE 'C:\Flowers.txt';
  • INPUT CustomerID $ @9 SaleDate MMDDYY10. Lily
    SnapDragon Marigold;
  • Month = MONTH(SaleDate);
  • PROC SORT DATA=sales;
  • BY Month;
  • * Calculate means by Month for flower sales;
  • PROC MEANS DATA=sales; OUTPUT OUT=values;
  • BY Month;
  • VAR Lily SnapDragon Marigold;
  • TITLE 'Summary of Flower Sales by Month';
  • RUN;

22
OUTPUT statement
  • We can use the OUTPUT statement to write summary
    statistics to a SAS data set
  • Syntax:
  • OUTPUT OUT=data_name output-statistic-list;
  • E.g.:
  • PROC MEANS DATA=new;
  • VAR age BMI;
  • OUTPUT OUT=new1 MEAN(age BMI)=mean_age
    mean_BMI;
  • RUN;
  • The output data set new1 contains the means of
    age and BMI, stored in the variables mean_age
    and mean_BMI, respectively.

24
  • PROC MEANS DATA=sales;
  • BY Month;
  • VAR Lily SnapDragon Marigold;
  • OUTPUT OUT=new1
  • MEAN(Lily SnapDragon
    Marigold)=mean_lily
  • mean_SnapDragon
    mean_Marigold
  • SUM(Lily SnapDragon
    Marigold)=sum_lily
  • sum_SnapDragon
    sum_Marigold;
  • TITLE 'Summary of Flower Sales by Month';
  • RUN;

25
OUTPUT statement
  • The SAS data set created by the OUTPUT statement
    will contain all the variables defined in the
    output statistic list, any variables in a BY or
    CLASS statement, plus two new variables, _TYPE_
    and _FREQ_
  • Without a BY or CLASS statement, the data set
    will have just one observation
  • If there is a BY statement, the data set will have
    one observation for each level of the BY group
  • CLASS statements produce one observation for each
    level of interaction of the class variables
  • The value of _TYPE_ depends on the level of
    interaction of the CLASS variables.
  • _TYPE_ = 0 is the grand total
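
The slides show a BY example but no CLASS example, so here is a minimal sketch of one. The data set names new and summary, the variables gender, smoker and age, and the data values are all assumptions for illustration.

```sas
* Hypothetical data, for illustration only;
DATA new;
   INPUT gender $ smoker $ age;
   DATALINES;
F yes 34
F no 41
M yes 29
M no 52
M no 47
;
RUN;

* Unlike BY, CLASS does not require the data to be sorted;
* NOPRINT suppresses the printed report so only the data set is created;
PROC MEANS DATA=new NOPRINT;
   CLASS gender smoker;
   VAR age;
   OUTPUT OUT=summary MEAN(age)=mean_age;
RUN;

* Inspect _TYPE_ and _FREQ_ in the output data set;
PROC PRINT DATA=summary;
RUN;
```

With two CLASS variables, _TYPE_ = 0 is the grand total, _TYPE_ = 1 summarizes by smoker only, _TYPE_ = 2 by gender only, and _TYPE_ = 3 by each combination of gender and smoker. _FREQ_ gives the number of observations in each row's group.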