Phoenix Data Science Platform
February 2021 – July 2024
Phoenix Data Science Platform is a web-based application used by teams within Bank of America to build and deploy data science models and applications quickly and securely. Phoenix provides a central location for accessing platforms such as RStudio and Jupyter Notebooks, and integrates with data storage systems such as Oracle and SQL Server. The team provided several services to assist users in building, testing, and deploying their models and applications to production. I began on the Client Support Team and later moved to the Core Platform Support Team, where I was responsible for developing and refactoring several key components of the application while also assisting clients with the questions and issues they encountered day to day.
Skills:
Python | Technical Writing | gRPC | Streamlit | Django | AutoSys | etcd | Git | Conda | Pip | SQL | MongoDB | OCR | Angular | Redis | Bash | Linux | Tornado | Pytest
Virtual Environment Service
One of the key features of Phoenix is the virtual environment service, which allows users to create conda virtual environments through a user-friendly interface and access them automatically in Jupyter notebooks. The service served two essential purposes:
Allow users to create virtual environments themselves without needing to request server access.
Automatically install all dependencies for utilizing other Phoenix features such as OCR and Data Storage.
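A minimal sketch of what that flow can look like, assuming the service shells out to conda and registers the new environment with ipykernel so it shows up as a notebook kernel; the function name, Python version, and packages here are illustrative, not Phoenix's actual implementation:

```python
import subprocess

def create_environment(name: str, packages: list[str]) -> None:
    # Solve for python, ipykernel, and all requested packages in one step,
    # rather than installing packages one at a time after the env exists.
    subprocess.run(
        ["conda", "create", "--yes", "--name", name,
         "python=3.10", "ipykernel", *packages],
        check=True,
    )
    # Register the environment as a Jupyter kernel so it appears in
    # notebooks automatically.
    subprocess.run(
        ["conda", "run", "--name", name, "python", "-m", "ipykernel",
         "install", "--user", "--name", name, "--display-name", name],
        check=True,
    )

create_environment("model-dev", ["pandas", "scikit-learn"])
```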
This was the first service I worked on after joining the Core Platform Support Team, and I ultimately became the team's primary resource for questions and development work related to it.
| Challenge | Resolution |
|---|---|
| When I first joined the team, the process flow of the service was to first create an environment with only Python installed, and then install the other packages one by one. This process was not only extremely slow but often led to conflicts that would cause the virtual environment creation to fail. In addition to the packages requested by the user, we had a long list of packages installed by default, many of which turned out to be redundant. | First, I worked with the developers who managed related services to create an updated list of packages required for their utilities to run. After initial testing, I discovered that the main source of conflict was that one of our utilities was written against a very outdated version of a package. After discussing with the developers, we made the decision to refactor the code to eliminate this dependency entirely. |
| The original service API code did not properly take advantage of the asynchronous structure that Phoenix uses. The previous developer was making a call to an entirely different service to run the steps needed to create a virtual environment synchronously. This was not only a waste of resources, but it meant that if any stage of the process failed, the entire process was impacted. | I refactored the entire service to properly utilize the benefits of asynchronous programming and eliminate the need for the separate service. Along the way, I also cleaned up the code by creating reusable functions and adding helpful log statements so we could easily track issues with the service in the future. A sketch of the refactored pattern follows the table. |
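A minimal sketch of the shape that refactor takes, assuming a Tornado request handler (Tornado appears in the skills list above); the handler, step names, and log messages are hypothetical rather than Phoenix's actual code:

```python
import asyncio
import logging

import tornado.web

logger = logging.getLogger("venv_service")

async def run_step(step: str, *cmd: str) -> None:
    # Await each subprocess so the event loop stays free for other requests,
    # and log both ends of the step so failures are easy to trace.
    logger.info("step started: %s", step)
    proc = await asyncio.create_subprocess_exec(*cmd)
    if await proc.wait() != 0:
        raise RuntimeError(f"step failed: {step}")
    logger.info("step finished: %s", step)

class CreateEnvHandler(tornado.web.RequestHandler):
    async def post(self) -> None:
        name = self.get_body_argument("name")
        # Run the long-running creation in-process as a background task and
        # respond immediately, instead of calling a separate service
        # synchronously.
        asyncio.create_task(self._create(name))
        self.set_status(202)
        self.write({"status": "creating", "name": name})

    async def _create(self, name: str) -> None:
        try:
            await run_step("create env", "conda", "create", "-y", "-n",
                           name, "python=3.10", "ipykernel")
            await run_step("register kernel", "conda", "run", "-n", name,
                           "python", "-m", "ipykernel", "install",
                           "--user", "--name", name)
        except RuntimeError:
            logger.exception("environment %s failed to create", name)
```

Returning 202 and awaiting each step in-process means a failure surfaces and is logged at the step that caused it, rather than being hidden behind a synchronous call to another service.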
Additional Contributions
Updated the API for creating virtual environments to allow users to opt out of installing the default packages in cases where they solely want to do development work without using other Phoenix features. This was also my first experience working with the frontend, as I updated the UI to add a checkbox supporting this option.
Added new APIs for soft deleting virtual environments, restoring previously deleted virtual environments, and viewing log history for a specific virtual environment.
Configured AutoSys jobs to routinely remove virtual environments that have been in “deleted” status for over 90 days and to refresh our available package list on a weekly basis.
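A minimal sketch of the soft-delete pattern behind the items above, assuming environment records are tracked in MongoDB (which appears in the skills list); the database, collection, and field names are hypothetical:

```python
from datetime import datetime, timedelta, timezone

from pymongo import MongoClient

envs = MongoClient()["phoenix"]["virtual_environments"]

def soft_delete(env_id: str) -> None:
    # Mark the environment deleted instead of dropping the record, so it
    # can be restored and its history stays queryable.
    envs.update_one(
        {"_id": env_id},
        {"$set": {"status": "deleted",
                  "deleted_at": datetime.now(timezone.utc)}},
    )

def restore(env_id: str) -> None:
    envs.update_one(
        {"_id": env_id},
        {"$set": {"status": "active"}, "$unset": {"deleted_at": ""}},
    )

def purge_expired() -> int:
    # The scheduled cleanup job permanently removes anything that has sat
    # in "deleted" status for more than 90 days.
    cutoff = datetime.now(timezone.utc) - timedelta(days=90)
    result = envs.delete_many({"status": "deleted",
                               "deleted_at": {"$lt": cutoff}})
    return result.deleted_count
```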
OCR Service
One of the most popular features of Phoenix is an out-of-the-box OCR (optical character recognition) service. The current OCR process is built on Tesseract OCR. Users provide documents either via the UI or a REST API call, and the service processes these documents and extracts all text. While this model works well on most documents, accuracy declines sharply on handwritten data. One of my current projects has been experimenting with other OCR solutions, namely TrOCR and EasyOCR, to see whether another option can handle these rarer use cases.
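The extraction step itself can be sketched with the pytesseract wrapper around Tesseract; the service's document-handling and REST layers are omitted, and the file name is illustrative:

```python
from PIL import Image
import pytesseract

def extract_text(path: str) -> str:
    # Tesseract operates on page images; PDFs would be rasterized to
    # images before reaching this step.
    return pytesseract.image_to_string(Image.open(path))

print(extract_text("scanned_page.png"))
```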
Digital Mail
I first joined the Phoenix Data Science Platform team within the ADS Data Science organization at Bank of America in February 2021. I spent the majority of my first year in this organization working as a member of the Client Support Team, which worked directly with clients who used the Phoenix platform to manage their projects, troubleshoot issues with the platform, and even assist with development work. After completing onboarding and training, I was assigned to the Digital Mail project, an application that uses OCR to scan mail and envelopes and then email a digital copy to the intended recipient.
My main objective was creating group mailbox functionality. I developed APIs that allow admins to create, update, and delete group mailboxes, and updated existing APIs to allow sending to groups as well as individuals. I worked with my team lead, a senior developer, and the client to determine what information needed to be captured during API calls and stored in our database.
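A hypothetical sketch of the kind of record and fan-out logic this involves; the field names and helper are illustrative, not the production schema:

```python
from datetime import datetime, timezone

# Shape of a group mailbox record captured by the admin APIs (illustrative).
group_mailbox = {
    "name": "wire-transfers",
    "display_name": "Wire Transfers Team",
    "members": ["alice@example.com", "bob@example.com"],
    "created_by": "admin_id",
    "created_at": datetime.now(timezone.utc),
}

def resolve_recipients(recipient: str, groups: dict[str, dict]) -> list[str]:
    # A group address fans out to its member addresses; an individual
    # address passes through unchanged, so the existing send path needs
    # no special case.
    if recipient in groups:
        return groups[recipient]["members"]
    return [recipient]
```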
The second objective I worked on was building an admin reporting process. Per the bank's risk policy, all data science models must submit monthly metric reports so that model performance can be monitored. Each time an envelope is processed, metric data is stored in the database. I created APIs for pulling and summarizing this data and displaying it in the UI. Reports can be filtered by date range, recipient, sender, and recipient type (group or individual).
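A sketch of what such a report query can look like, under the assumption that per-envelope metrics live in MongoDB; the collection, field names, and summary statistics are hypothetical:

```python
from datetime import datetime

from pymongo import MongoClient

metrics = MongoClient()["digital_mail"]["envelope_metrics"]

def summarize(start: datetime, end: datetime, recipient: str | None = None,
              sender: str | None = None,
              recipient_type: str | None = None) -> list[dict]:
    # Apply the filters the UI exposes: date range, recipient, sender, and
    # recipient type (group or individual).
    match: dict = {"processed_at": {"$gte": start, "$lt": end}}
    for field, value in (("recipient", recipient), ("sender", sender),
                         ("recipient_type", recipient_type)):
        if value is not None:
            match[field] = value
    # Summarize per recipient: envelope count and average OCR confidence.
    return list(metrics.aggregate([
        {"$match": match},
        {"$group": {"_id": "$recipient",
                    "envelopes": {"$sum": 1},
                    "avg_confidence": {"$avg": "$ocr_confidence"}}},
    ]))
```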