TRANSCRIPT
Best Practice for Creation and Maintenance of a
SAS® Infrastructure
Paul Thomas, [email protected]
June 2015
SAS is great because…
• Unparalleled ability to manipulate data
• Data step
• PROC SQL
• PROC TRANSPOSE, etc.
• All kinds of products
• Data quality
• ETL
• Statistical – Forecasting, Modelling, etc.
• BI, Visual Analytics
• You can export to Excel
• And more…
SAS is (not so) great because…
• Unparalleled ability to manipulate data
• This means you lose control of your data
• All kinds of products
• These can be hard to co-ordinate
• So, the point of the SAS metadata infrastructure is to give the administrator
the ability to control what happens on the estate.
• But this takes careful management
So, metadata…
• If you are here – you probably run SAS via a metadata server
• Does that help or hinder you when you are doing your job?
• Can you see all the data you need?
• Can you see data that you don’t need?
• If you are an administrator…
• Do you have the ability to control access to SAS?
• Do you have the ability to limit who can see a specific data item?
• Do you know what data is on your estate?
• Do you know what jobs are running on your estate?
• Do you have the right level of control over the environments in the estate?
Is this a problem?
• SAS Estates can be BIG
• Lots of users
• Lots of data going in
• Lots of information coming out
• SAS code gets complicated
Complicated!
• Approx. 150 lines of code, 2 macros
More complicated!
• Approx. 1,900 lines of code
• 4 macros
• 1 spreadsheet (DDE)
• The owning team says this is one of the more straightforward programs
Is this a problem?
• SAS Estates can be BIG
• Lots of users
• Lots of data going in
• Lots of information coming out
• SAS code gets complicated
• At some stage there will be a project!
• Project managers will appear and ask questions
• Business analysts will get involved and expect answers
• You may be required to do some work
• You may have to read other people’s SAS code
• You may have to explain why SAS and not Access / Excel / R / etc.
Let’s get it organised then!
• The day to day problems are going to be…
• Users in the same team
• Slightly different users in the same team
• Teams within teams
i.e. any user admin
Let’s get it organised then!
• The day to day problems are going to be…
• User admin
• Metadata permissions
• Operating system permissions
• Metadata vs. operating system permissions
• Metadata vs. operating system vs. 3rd party database permissions
• i.e. - Libnames
Let’s get it organised then!
• The day to day problems are going to be…
• User admin
• Libnames
• Finding content
• Not finding content
• Avoiding replication / duplication
How do we avoid problems?
• Lots of different types of metadata.
• Folders
• Users, Groups and ACTs
• Libnames
• Data sets
• Environments
• Promotions
Metadata folders
• Have an Obvious Filing System
• Considered, and predefined
• Minimise clicks to find items
• Actively maintained folder structure
• Project based / Data source based /
Client based?
• Mirror operating system folders
• Use groups or ACTs to control rights to
specific access points
• Minimise these access points
• Use the same structure for non-OS
folders
• User folders
• BI content
Users, Groups and ACTs
• No one is an individual
• Users need to be clustered into groups
• Never assign metadata rights to an individual
user
• Assign rights to Groups (maybe with an Access
Control Template) to Metadata folders
• Never assign rights to anything other than a
metadata folder
• If you have shared log-in details, these need to
be held on a Group
• Passwords can be called from metadata using
Authdomain
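As a concrete illustration of the Authdomain point above, here is a minimal sketch of a database libname that picks its credentials up from metadata. The engine, connection details and the authentication domain name are placeholders for your own setup:

```
/* Sketch: shared database credentials are stored in metadata    */
/* against an authentication domain, so no user ID or password   */
/* ever appears in code. "OracleAuth", the path and the schema   */
/* are placeholders for your own environment.                    */
libname corpdb oracle path=orcl schema=corp authdomain="OracleAuth";
```

The group holding the shared log-in is given the Authdomain credentials in metadata; individual users never see the password.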
Libnames
• Don’t use pre-assigned
libraries when any production
data is being written
• Where pre-assigned libraries
are required, use metadata
bound libraries
Metadata-bound libraries
http://support.sas.com/documentation/cdl/en/bisecag/67045/HTML/default/viewer.htm#n1nesjvtxu7783n1pveml8ct4txk.htm
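A hedged sketch of how a metadata-bound library is created with PROC AUTHLIB (available from SAS 9.3M2; see the documentation linked above for the full syntax). The folder name, library name and password below are placeholders:

```
/* Sketch: bind a physical library to metadata so that access    */
/* is enforced even when the data is reached outside the         */
/* metadata server. Names and password are placeholders.         */
libname salesdat "/data/sales";

proc authlib library=salesdat;
   create securedfolder="Sales Data"
          securedlibrary="SalesLib"
          pw=secretpw;
run;
```

Once bound, pre-assigned access to the physical path no longer bypasses the metadata permissions.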
Libnames
• Don’t use pre-assigned libraries when
any production data is being written
• Where pre-assigned libraries are
required, use metadata bound
libraries
• Apply access rights to metadata
folders, put libraries inside the folders
• Avoid shared passwords for database
libnames
• Where you can’t avoid them, use a
specific group to hold the sign on
details
Data sets
• Make data set names unique in metadata
• Never have more than one metadata
representation of a physical data set
• Easier promotion
• Use the same variable names in different
data sets where they join.
• Makes life much easier in drag and drop SQL
queries in EG
• Do some data modelling
• 3rd Normal Form
• Star Schema
[Star schema example, one line per table. Tables marked (+SCD) also carry: row start timestamp, row end timestamp, current row flag, batch timestamp]
• Fact (transactions): dd_trans_ref_no, product sk, channel sk, original date sk, last updated date sk, final date sk, original action sk, current action sk, final action sk, has refer flag, has pending flag, primary decline sk, primary refer sk (+SCD) "... need much more"
• Fact (declines): dd_trans_ref_no, product sk, channel sk, decline date sk, decline sk, primary flag (+SCD)
• Fact (referrals): dd_trans_ref_no, product sk, channel sk, referral sk, referral date sk, primary flag (+SCD)
• Fact (summary): product sk, channel sk, date sk, action sk, decline sk, referral sk, apps, apps with pend, apps with refer
• DIM DATE: date sk, date variables ...
• DIM CHANNEL: channel sk, channel desc, source code, store, store code, channel code (+SCD)
• DIM DEC REASON: decline sk, decline code, decline description (+SCD)
• DIM REF REASON: referral sk, referral code, referral description (+SCD)
• DIM ACTION: action sk, original/current/final action code, short and description, description (+SCD)
• DIM PRODUCT: product sk, org, logo, client description, product description, staff scheme (+SCD)
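Because the surrogate-key columns share names across the tables, joins fall out naturally in drag-and-drop EG queries and in hand-written PROC SQL alike. A sketch, using the example schema above (library and underscored column names are assumptions, since the diagram shows display names):

```
/* Sketch: fact-to-dimension joins on shared surrogate-key       */
/* names. "mart", table names and underscored columns are        */
/* placeholders for your own data model.                         */
proc sql;
   create table work.apps_by_product as
   select p.product_description,
          c.channel_desc,
          sum(f.apps) as total_apps
   from mart.fact_summary  f
        inner join mart.dim_product p on f.product_sk = p.product_sk
        inner join mart.dim_channel c on f.channel_sk = c.channel_sk
   where p.current_row_flag = 1 and c.current_row_flag = 1
   group by p.product_description, c.channel_desc;
quit;
```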
10 DI Job Standards
• Planned
• All jobs should be planned and coordinated before build starts
• All detail and mart data sets must use surrogate keys not business keys
• Reasonable size < 10 nodes
• Controlled
• All jobs will start and end with control macros
• All permanent data sets to have a load date/time column applied by the loading transform
• All type 2 tables to additionally have start and end date/times, a row version number, current flag and unique
index on the unique row number field.
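The "controlled" standards above can be sketched as follows. %job_start and %job_end are hypothetical site macros (they would set paths and date parameters and update a job-tracking table, as described under "Environment agnostic" below); the load timestamp column is applied in the loading step:

```
/* Sketch: every job starts and ends with control macros, and    */
/* the loading step stamps each row. Macro and table names are   */
/* placeholders.                                                 */
%job_start(jobname=load_dim_product);

data mart.dim_product;
   set stage.dim_product_new;
   batch_timestamp = datetime();   /* load date/time column */
   format batch_timestamp datetime20.;
run;

%job_end(jobname=load_dim_product);
```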
10 DI Job Standards
• Self documenting
• All nodes will have meaningful descriptions
• Environment agnostic
• Paths and environment variables will be set up in the control macros and should not be created in the code.
• This includes all date parameters. These macros will maintain any job tracking tables
• Well formed
• All non-work tables or input/output files will be created as metadata objects and used in standard transformations
• Automatically generated
• Keep user-written code to a minimum
• Where it is necessary always use the provided &input and &output variables created by the transform
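A minimal sketch of the &input/&output rule for user-written code inside a DI Studio transform (the filter shown is purely illustrative):

```
/* Sketch: user-written transform code reads only from &input    */
/* and writes only to &output, so the generated wrapper can      */
/* remap tables between environments at deployment time.         */
data &output;
   set &input;
   /* keep hand-written logic minimal and self-describing */
   if not missing(dd_trans_ref_no);
run;
```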
10 DI Job Standards
• No Nesting
• Jobs should never be nested – this is what schedules are for!
• No Loops
• Jobs should not use loops, this should be handled by separate jobs or well formed data
• (Unless you are doing some very clever grid processing)
• Re-runnable
• It should be possible to re-run a job on the same day and get exactly the same output; however, dates in control
tables may have to be re-aligned if the job was initially successful
• Specific
• Jobs should have no permanent tables that are created and re-used
• This helps with re-running when errors occur
Multiple Environments
• How many environments do you need?
• How many environments can I have?
[Diagram: Sandbox | Dev → Test → Production, with code held in SVN]
Promotions
• Ensure that environments are as close as possible
• Metadata names have to match between environments
• Use the metadata behind the names to control environment changes
• File names
• Server names
• Environment variables
• Use macro variables and data where necessary to differentiate between environments
• Packages and code should have the same deployment process in test as in production
• Use metadata to perform promotions, and code deployments
• Check code deployments between versions
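One way to realise the macro-variable point above: drive every environment difference from a single value set in the control macros or autoexec, rather than editing code during promotion. The path layout is a placeholder:

```
/* Sketch: one macro variable differentiates environments, so    */
/* identical code promotes unchanged between Dev, Test and       */
/* Production. Paths are placeholders.                           */
%let env = test;              /* dev | test | prod */
libname mart "/sas/&env./marts";
filename extract "/sas/&env./extracts/daily.csv";
```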
Scheduled code & logs
• Code:
• Have one area for production code
• Keep it clean
• Logs
• Have one area for logs
• Zip up to an archive folder regularly
• Manage your schedule
• Be able to see what is happening
• Look for bottlenecks
• Don’t be afraid to change the default Platform (LSF) parameters
Summary
• Have an obvious folder system
• No individuals
• Mirror meta and operating system
• Be strict with libnames
• Promote with care
• Work hard at being lazy