Problem Statement / Opportunity
- Set up a big data analytics platform on the cloud to enable the bank to make better data-driven decisions.
- The platform will be able to collect, store, and process myriad data feeds (transactional, semi-structured, unstructured, batch, stream, etc.) to enable advanced analytics.
- Frameworks to ensure safe data practices and quicker model implementation and deployment.
Oneture has been a development partner in this journey almost from the start; the following are the use cases/features we have partnered with the bank to implement:
- Building the big data lake for executing various types of big data applications, including data science and automation
- Customer 360 data for searches and analysis
- Identification of suspicious transactions in the Aadhaar Enabled Payment System (AePS)
- Automation of quarterly and half-yearly credit card loss reports
- Customer segmentation for cross-sell based on transaction behavior
- Identification of transaction and micro-ATM breaches
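As a rough illustration of the suspicious-transaction use case, the sketch below flags transactions that breach simple per-customer limits. The thresholds and fields are hypothetical; the bank's actual rules, and the AePS-specific checks, are not described here.

```python
from dataclasses import dataclass

@dataclass
class Txn:
    customer_id: str
    amount: float
    txns_today: int  # count of this customer's transactions today

# Hypothetical thresholds, for illustration only.
MAX_AMOUNT = 10_000.0
MAX_DAILY_TXNS = 20

def is_suspicious(t: Txn) -> bool:
    """Flag a transaction that breaches simple per-customer limits."""
    return t.amount > MAX_AMOUNT or t.txns_today > MAX_DAILY_TXNS

# Two of these three sample transactions breach a limit.
flagged = [t for t in [
    Txn("C1", 500.0, 3),
    Txn("C2", 25_000.0, 1),
    Txn("C3", 200.0, 45),
] if is_suspicious(t)]
```

In the real pipeline such rules would run over the data lake as a Spark job rather than over in-memory lists, but the flagging logic takes the same shape.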
Proposed Solution & Architecture
The following considerations were taken into account in designing the solution.
Cloud Infrastructure
- Create S3 directory structure for dev/prod
- Setup encryption/decryption on S3
- Setup S3 lockdown rules
- Setup EMR
- Setup encryption/decryption on EMR
- Setup data purge policies
- CloudFormation templates/Service Portal
- Data Governance
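The data purge policies above are typically realised as S3 lifecycle rules. Below is a minimal sketch assuming a `prod/raw/` prefix and 365-day retention (both illustrative, not the bank's actual settings), which would be applied with `aws s3api put-bucket-lifecycle-configuration`:

```json
{
  "Rules": [
    {
      "ID": "purge-raw-after-365-days",
      "Filter": { "Prefix": "prod/raw/" },
      "Status": "Enabled",
      "Expiration": { "Days": 365 }
    }
  ]
}
```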
Data Ingestion
- Identify data sources that need to be ingested and obtain small data samples
- Identify data ingest cadence for each data set
- Decide upon file paths (directory structure)
- Identify, provision, and set up the staging area (provisioning, installation, ports, etc.)
- Create ETL pipelines to push data to S3 at specified cadence
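To illustrate the directory-structure and cadence decisions above, here is a minimal sketch of a key builder that lays out dev/prod data by source, dataset, and date partition. The layout itself is an assumption for illustration; the bank's actual convention is not spelled out here.

```python
from datetime import date

def s3_key(env: str, source: str, dataset: str, run_date: date) -> str:
    """Build a partitioned S3 key using an env/source/dataset/date layout.

    The layout is illustrative only; date partitions use the Hive-style
    year=/month=/day= convention that Spark and Glue can prune on.
    """
    if env not in {"dev", "prod"}:
        raise ValueError(f"unknown environment: {env}")
    return (f"{env}/{source}/{dataset}/"
            f"year={run_date:%Y}/month={run_date:%m}/day={run_date:%d}/")

# A daily-cadence feed lands under a date-partitioned prefix:
key = s3_key("prod", "core-banking", "transactions", date(2020, 3, 15))
```

An ETL pipeline run at the agreed cadence would write each batch under the key for its run date, keeping dev and prod data strictly separated at the top level.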
Data Storage and Discovery
- S3 covered as part of the cloud infrastructure setup above
- Data Discovery using Glue/ES
- Metadata management
Data Processing and Applications
- Create applications for the various required use cases
- Fine-tune performance
- Schedule jobs
- Application dashboards
- Application reports
- DB or shared S3 bucket for application results
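The discovery and metadata-management steps can be illustrated with a deliberately simplified in-memory catalogue. The actual solution uses AWS Glue and Elasticsearch for this, so the sketch below (with invented dataset names and fields) only shows the idea: registering metadata about each dataset and searching it to discover what is available.

```python
# Simplified stand-in for a metadata catalogue; dataset names, owners,
# and columns are invented for illustration.
catalog = {
    "transactions": {"owner": "payments", "format": "parquet",
                     "columns": ["txn_id", "customer_id", "amount"]},
    "customers": {"owner": "crm", "format": "parquet",
                  "columns": ["customer_id", "name", "segment"]},
}

def find_tables_with_column(column: str) -> list[str]:
    """Discover which registered datasets expose a given column."""
    return sorted(name for name, meta in catalog.items()
                  if column in meta["columns"])

# An analyst can discover that both datasets join on customer_id:
joinable = find_tables_with_column("customer_id")
```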
Code Integration and Deployment
- Code commit
- Test and Build on Laptop
- Deploy on UAT
- Deploy on Prod
- Identify user groups (data warehouse, big data developers, analysts/data scientists, business users, ad-hoc users, external) and their access roles
- Data governance
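User-group access roles of this kind are usually expressed as IAM policies. Below is a hedged sketch granting an analyst group read-only access to a curated prefix; the bucket name and prefix are hypothetical, not the bank's actual resources:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AnalystsReadOnlyProdCurated",
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::example-datalake-bucket",
        "arn:aws:s3:::example-datalake-bucket/prod/curated/*"
      ]
    }
  ]
}
```

Attaching a policy like this per user group keeps the S3 lockdown rules and the governance model in one auditable place.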
Tools and Technologies Used
| Category | Tools |
| --- | --- |
| Front end | Apache Zeppelin |
| Build tools | SBT, Apache Maven, Apache Ant/Apache Ivy |
| Big data tools | Apache Spark, Apache Hadoop, Apache Hive, Apache Oozie |
| SDKs | Oracle Java Development Kit |
| Development environments | Scala IDE for Eclipse, JetBrains IntelliJ IDEA |
| Others | Jenkins, JFrog Artifactory, Apache NiFi |
| Code commit | Apache Subversion |
| Amazon Web Services | EC2, EMR, Glue, KMS, S3, IAM, CloudWatch, CloudTrail |
Benefits
- This solution has given the customer its own development and deployment platform for big data analytics, helping them address many of their existing problems and challenges.
- Faster and cheaper implementation of new analytical models to improve customer experience by offering the right product to the right customer at the right time
- Better visibility of risk/fraud by continuously looking for anomalies in the data. In one use case, this saved the customer many man-hours previously spent manually identifying suspicious transactions and related details.
- Increased operational efficiency by using templates in respective data processes
Lessons Learned
- Focus on data management: determine where the data and applications will reside, on an on-premises system or together in a cloud implementation. Most of the value of big data comes from co-locating it with knowledgeable end users, at the edges of the organization, where they can tinker with and glean insights from their own data.
- Scale and speed: Hadoop can process a lot of data, but MapReduce is a batch process. Using Apache Spark and NoSQL stores helps scale up and improve speed drastically.
- Data visualization : Front line professionals and others who are expected to be able to take action based on Big Data insights need an easily digestible delivery mechanism.
- Big data implementations: data and applications should be accessible via a platform-as-a-service (PaaS) approach. POCs can be created on-premises, but the production environment is recommended on the cloud. Advantages of the cloud include robustness, flexibility, scalability, availability, durability, and reduced maintenance cost.
- Data and application togetherness: creating a data lake to support the big data platform makes data and tools readily available for tasks, avoids a lot of network delays, and greatly increases productivity.
- Integration of compliance and security: secure the channels to and from the data lake, as well as access within it, to make the platform more secure and easier to govern.
- On your marks, get set, go: numerous factors affect application performance, and with big data applications handling massive loads of data, performance tuning becomes a very important concern. Once POCs prove a working solution, calculating and planning the number of executors/workers helps handle jobs with varying loads. This also aids memory optimization, which keeps such jobs running smoothly without excess load on the system, at lower cost.
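The executor/worker planning mentioned above can be sketched as a back-of-the-envelope calculator. The rules of thumb used here (reserve one core and 1 GB per node for OS/Hadoop daemons, about five cores per executor, roughly 10% memory set aside for off-heap overhead) are common community guidance, not the bank's actual tuning, and any result should be validated against the real workload.

```python
def executor_plan(node_cores: int, node_mem_gb: int, num_nodes: int,
                  cores_per_executor: int = 5, overhead_frac: float = 0.10):
    """Back-of-the-envelope Spark executor sizing from cluster shape.

    Reserves 1 core and 1 GB per node for the OS/daemons, caps executors
    at cores_per_executor cores each, subtracts one executor slot for the
    driver, and carves overhead_frac out of each executor's memory.
    """
    usable_cores = node_cores - 1
    usable_mem = node_mem_gb - 1
    execs_per_node = usable_cores // cores_per_executor
    total_executors = execs_per_node * num_nodes - 1  # one slot for the driver
    mem_per_exec = usable_mem // execs_per_node
    heap_gb = int(mem_per_exec * (1 - overhead_frac))
    return {"num_executors": total_executors,
            "executor_cores": cores_per_executor,
            "executor_memory_gb": heap_gb}

# Hypothetical cluster: 10 nodes of 16 cores / 64 GB each.
plan = executor_plan(node_cores=16, node_mem_gb=64, num_nodes=10)
```

The resulting numbers map directly onto `--num-executors`, `--executor-cores`, and `--executor-memory` in `spark-submit`.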