TRANSCRIPT
Best Practice for Creation and Maintenance of a
SAS® Infrastructure
Paul Thomas, [email protected]
June 2015
SAS is great because…
• Unparalleled ability to manipulate data
• Data step
• PROC SQL
• PROC TRANSPOSE, etc.
• All kinds of products
• Data quality
• ETL
• Statistical – Forecasting, Modelling, etc.
• BI, Visual Analytics
• You can export to Excel
• And more…
SAS is (not so) great because…
• Unparalleled ability to manipulate data
• This means you lose control of your data
• All kinds of products
• These can be hard to co-ordinate
• So, the point of the SAS metadata infrastructure is to give the administrator
the ability to control what happens on the estate.
• But this takes careful management
So, metadata…
• If you are here – you probably run SAS via a metadata server
• Does that help or hinder you when you are doing your job?
• Can you see all the data you need?
• Can you see data that you don’t need?
• If you are an administrator…
• Do you have the ability to control access to SAS?
• Do you have the ability to limit who can see a specific data item?
• Do you know what data is on your estate?
• Do you know what jobs are running on your estate?
• Do you have the right level of control over the environments in the estate?
Is this a problem?
• SAS Estates can be BIG
• Lots of users
• Lots of data going in
• Lots of information coming out
• SAS code gets complicated
Complicated!
• Approx. 150 lines of code, 2 macros
More complicated!
• Approx. 1,900 lines of code
• 4 macros
• 1 spreadsheet (DDE)
• The owning team says this is one of the more straightforward programs
Is this a problem?
• SAS Estates can be BIG
• Lots of users
• Lots of data going in
• Lots of information coming out
• SAS code gets complicated
• At some stage there will be a project!
• Project managers will appear and ask questions
• Business analysts will get involved and expect answers
• You may be required to do some work
• You may have to read other people’s SAS code
• You may have to explain why SAS and not Access / Excel / R / etc.
Let’s get it organised then!
• The day to day problems are going to be…
• Users in the same team
• Slightly different users in the same team
• Teams within teams
i.e. any user admin
Let’s get it organised then!
• The day to day problems are going to be…
• User admin
• Metadata permissions
• Operating system permissions
• Metadata vs. operating system permissions
• Metadata vs. operating system vs. 3rd party database permissions
• i.e. - Libnames
Let’s get it organised then!
• The day to day problems are going to be…
• User admin
• Libnames
• Finding content
• Not finding content
• Avoiding replication / duplication
How do we avoid problems?
• Lots of different types of metadata.
• Folders
• Users, Groups and ACTs
• Libnames
• Data sets
• Environments
• Promotions
Metadata folders
• Have an Obvious Filing System
• Considered, and predefined
• Minimise clicks to find items
• Actively maintained folder structure
• Project based / Data source based /
Client based?
• Mirror operating system folders
• Use groups or ACTs to control rights to
specific access points
• Minimise these access points
• Use the same structure for non-OS
folders
• User folders
• BI content
Users, Groups and ACTs
• No one is an individual
• Users need to be clustered into groups
• Never assign metadata rights to an individual
user
• Assign rights to Groups (maybe with an Access
Control Template) to Metadata folders
• Never assign rights to anything other than a
metadata folder
• If you have shared log-in details, these need to
be held on a Group
• Passwords can be called from metadata using
Authdomain
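As a concrete illustration of the Authdomain point above, here is a minimal sketch of a database libname that picks its credentials up from metadata. The engine, connection details and the authentication domain name are placeholders for your own setup:

```
/* Sketch: shared database credentials are stored in metadata    */
/* against an authentication domain, so no user ID or password   */
/* ever appears in code. "OracleAuth", the path and the schema   */
/* are placeholders for your own environment.                    */
libname corpdb oracle path=orcl schema=corp authdomain="OracleAuth";
```

The group holding the shared log-in is given the Authdomain credentials in metadata; individual users never see the password.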
Libnames
• Don’t use pre-assigned
libraries when any production
data is being written
• Where pre-assigned libraries
are required, use metadata
bound libraries
Metadata-bound libraries
http://support.sas.com/documentation/cdl/en/bisecag/67045/HTML/default/viewer.htm#n1nesjvtxu7783n1pveml8ct4txk.htm
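A hedged sketch of how a metadata-bound library is created with PROC AUTHLIB (available from SAS 9.3M2; see the documentation linked above for the full syntax). The folder name, library name and password below are placeholders:

```
/* Sketch: bind a physical library to metadata so that access    */
/* is enforced even when the data is reached outside the         */
/* metadata server. Names and password are placeholders.         */
libname salesdat "/data/sales";

proc authlib library=salesdat;
   create securedfolder="Sales Data"
          securedlibrary="SalesLib"
          pw=secretpw;
run;
```

Once bound, pre-assigned access to the physical path no longer bypasses the metadata permissions.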
Libnames
• Don’t use pre-assigned libraries when
any production data is being written
• Where pre-assigned libraries are
required, use metadata bound
libraries
• Apply access rights to metadata
folders, put libraries inside the folders
• Avoid shared passwords for database
libnames
• Where you can’t avoid them, use a
specific group to hold the sign on
details
Data sets
• Make data set names unique in metadata
• Never have more than one metadata
representation of a physical data set
• Easier promotion
• Use the same variable names in different
data sets where they join.
• Makes life much easier in drag and drop SQL
queries in EG
• Do some data modelling
• 3rd Normal Form
• Star Schema
[Star schema example, one line per table. Tables marked (+SCD) also carry: row start timestamp, row end timestamp, current row flag, batch timestamp]
• Fact (transactions): dd_trans_ref_no, product sk, channel sk, original date sk, last updated date sk, final date sk, original action sk, current action sk, final action sk, has refer flag, has pending flag, primary decline sk, primary refer sk (+SCD) "... need much more"
• Fact (declines): dd_trans_ref_no, product sk, channel sk, decline date sk, decline sk, primary flag (+SCD)
• Fact (referrals): dd_trans_ref_no, product sk, channel sk, referral sk, referral date sk, primary flag (+SCD)
• Fact (summary): product sk, channel sk, date sk, action sk, decline sk, referral sk, apps, apps with pend, apps with refer
• DIM DATE: date sk, date variables ...
• DIM CHANNEL: channel sk, channel desc, source code, store, store code, channel code (+SCD)
• DIM DEC REASON: decline sk, decline code, decline description (+SCD)
• DIM REF REASON: referral sk, referral code, referral description (+SCD)
• DIM ACTION: action sk, original/current/final action code, short and description, description (+SCD)
• DIM PRODUCT: product sk, org, logo, client description, product description, staff scheme (+SCD)
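Because the surrogate-key columns share names across the tables, joins fall out naturally in drag-and-drop EG queries and in hand-written PROC SQL alike. A sketch, using the example schema above (library and underscored column names are assumptions, since the diagram shows display names):

```
/* Sketch: fact-to-dimension joins on shared surrogate-key       */
/* names. "mart", table names and underscored columns are        */
/* placeholders for your own data model.                         */
proc sql;
   create table work.apps_by_product as
   select p.product_description,
          c.channel_desc,
          sum(f.apps) as total_apps
   from mart.fact_summary  f
        inner join mart.dim_product p on f.product_sk = p.product_sk
        inner join mart.dim_channel c on f.channel_sk = c.channel_sk
   where p.current_row_flag = 1 and c.current_row_flag = 1
   group by p.product_description, c.channel_desc;
quit;
```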
10 DI Job Standards
• Planned
• All jobs should be planned and coordinated before build starts
• All detail and mart data sets must use surrogate keys not business keys
• Reasonable size < 10 nodes
• Controlled
• All jobs will start and end with control macros
• All permanent data sets to have a load date/time column applied by the loading transform
• All type 2 tables to additionally have start and end date/times, a row version number, current flag and unique
index on the unique row number field.
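The "controlled" standards above can be sketched as follows. %job_start and %job_end are hypothetical site macros (they would set paths and date parameters and update a job-tracking table, as described under "Environment agnostic" below); the load timestamp column is applied in the loading step:

```
/* Sketch: every job starts and ends with control macros, and    */
/* the loading step stamps each row. Macro and table names are   */
/* placeholders.                                                 */
%job_start(jobname=load_dim_product);

data mart.dim_product;
   set stage.dim_product_new;
   batch_timestamp = datetime();   /* load date/time column */
   format batch_timestamp datetime20.;
run;

%job_end(jobname=load_dim_product);
```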
10 DI Job Standards
• Self documenting
• All nodes will have meaningful descriptions
• Environment agnostic
• Paths and environment variables will be set up in the control macros and should not be created in the code.
• This includes all date parameters. These macros will maintain any job tracking tables
• Well formed
• All non-work tables or input/output files will be created as metadata objects and used in standard transformations
• Automatically generated
• Keep user-written code to a minimum
• Where it is necessary always use the provided &input and &output variables created by the transform
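A minimal sketch of the &input/&output rule for user-written code inside a DI Studio transform (the filter shown is purely illustrative):

```
/* Sketch: user-written transform code reads only from &input    */
/* and writes only to &output, so the generated wrapper can      */
/* remap tables between environments at deployment time.         */
data &output;
   set &input;
   /* keep hand-written logic minimal and self-describing */
   if not missing(dd_trans_ref_no);
run;
```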
10 DI Job Standards
• No Nesting
• Jobs should never be nested – this is what schedules are for!
• No Loops
• Jobs should not use loops, this should be handled by separate jobs or well formed data
• (Unless you are doing some very clever grid processing)
• Re-runnable
• It should be possible to re-run a job on the same day and get exactly the same output; however, dates in control
tables may have to be re-aligned if the job was initially successful
• Specific
• Jobs should have no permanent tables that are created and re-used
• This helps with re-running when errors occur
Multiple Environments
• How many environments do you need?
• How many environments can I have?
[Diagram: Sandbox | Dev → Test → Production, with code held in SVN]
Promotions
• Ensure that environments are as close as possible
• Metadata names have to match between environments
• Use the metadata behind the names to control environment changes
• File names
• Server names
• Environment variables
• Use macro variables and data where necessary to differentiate between environments
• Packages and code should have the same deployment process in test as in production
• Use metadata to perform promotions, and code deployments
• Check code deployments between versions
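One way to realise the macro-variable point above: drive every environment difference from a single value set in the control macros or autoexec, rather than editing code during promotion. The path layout is a placeholder:

```
/* Sketch: one macro variable differentiates environments, so    */
/* identical code promotes unchanged between Dev, Test and       */
/* Production. Paths are placeholders.                           */
%let env = test;              /* dev | test | prod */
libname mart "/sas/&env./marts";
filename extract "/sas/&env./extracts/daily.csv";
```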
Scheduled code & logs
• Code:
• Have one area for production code
• Keep it clean
• Logs
• Have one area for logs
• Zip up to an archive folder regularly
• Manage your schedule
• Be able to see what is happening
• Look for bottlenecks
• Don’t be afraid to change the default Platform (LSF) parameters
Summary
• Have an obvious folder system
• No individuals
• Mirror meta and operating system
• Be strict with libnames
• Promote with care
• Work hard at being lazy