Practicing Data Engineering
“How many cases are there at my college right now? How fast are they rising? Is it worse than last month? How does it compare to similar institutions in the state?” These are just some of the questions I found myself musing over while struggling through whatever “online learning” is. I decided to treat them as an opportunity to finally practice some data engineering and pipeline development.
For this project, I built and managed a live data pipeline, along with some simple analysis.
This project evolved over time, but it was fun to start from ideation and end with a fully working pipeline that scraped, organized, stored, and displayed data. I’m blown away that I can get my own dedicated Linux server in the cloud for just $5 a month, and I definitely plan on leveraging these services more in the future!
Overall, the biggest things I learned in this project were cloud computing, better Linux/Bash skills, cron, and MySQL data management.
Like I mentioned above, I did this project to get better at data engineering and developing pipelines. I knew I wouldn’t be doing any groundbreaking analysis. However, I could generate some interesting visuals, and I actually got the attention of a few professors in the state, as well as my ex-employer, Intermountain Healthcare. Both groups were given access to my repository for this project for research into the spread of COVID-19 on college campuses in the state.
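For a sense of what the "store" step of a pipeline like this can look like, here's a minimal sketch. It is not the project's actual code: it uses Python's built-in sqlite3 as a stand-in for MySQL so it runs anywhere, and the table, column, and function names (cases, store_count, daily_change) are made up for illustration, as are the school name and numbers in the demo.

```python
import sqlite3
from datetime import date

# Hypothetical schema: one row per school per day.
# The composite primary key lets a rerun overwrite a day instead of duplicating it.
SCHEMA = """
CREATE TABLE IF NOT EXISTS cases (
    school TEXT NOT NULL,
    report_date TEXT NOT NULL,
    active_cases INTEGER NOT NULL,
    PRIMARY KEY (school, report_date)
)
"""


def store_count(conn, school, report_date, active_cases):
    """Insert or overwrite one day's count, so repeated cron runs are harmless."""
    conn.execute(SCHEMA)
    conn.execute(
        "INSERT OR REPLACE INTO cases (school, report_date, active_cases) "
        "VALUES (?, ?, ?)",
        (school, report_date.isoformat(), active_cases),
    )
    conn.commit()


def daily_change(conn, school):
    """Return (report_date, active_cases, change_from_previous_row) tuples."""
    rows = conn.execute(
        "SELECT report_date, active_cases FROM cases "
        "WHERE school = ? ORDER BY report_date",
        (school,),
    ).fetchall()
    result = []
    previous = None
    for report_date, count in rows:
        change = None if previous is None else count - previous
        result.append((report_date, count, change))
        previous = count
    return result


if __name__ == "__main__":
    # Placeholder values for illustration only, not real case counts.
    conn = sqlite3.connect(":memory:")
    store_count(conn, "Example College", date(2020, 10, 1), 12)
    store_count(conn, "Example College", date(2020, 10, 2), 15)
    for row in daily_change(conn, "Example College"):
        print(row)
```

A script like this could be scheduled with a crontab entry such as `0 6 * * * /usr/bin/python3 /path/to/scrape.py` (the path is a placeholder), which would run it every day at 6:00 a.m. server time. Keying each row on school and date is a deliberate choice: if cron fires twice or a run is retried, the day's row is replaced rather than duplicated.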
Failure & Change (And more failure)
During this project, I failed a lot. I struggled with adding Linux accounts, properly managing environments, and getting cron to work properly. While I had done all of these things to various degrees before, I knew I needed to get better at developing full-stack applications.
At one point I also accidentally wiped all my data, and I can honestly say I was heartbroken. While practicing with MySQL, I installed a test database that pushed me over my data limit, and because of how I had structured my code, it deleted my COVID data to compensate. Luckily I had a backup, but it was missing a few weeks’ worth.
Ultimately, I had to accept that I make mistakes. I fail, a lot. But in the end I have to remember that this is why I’m doing this on my own time, on my own projects, and on a $5-a-month commitment. Also, I just like making things work. As weird as it sounds, I actually slept better the first few days when I knew that my code was scraping and storing data on a server 24/7 with no intervention. Definitely a cool feeling!
For fun, I’ve also included a whiteboard sketch of how I initially saw the pipeline working. Obviously it changed over time, but it was a good place to start. Also, it’s the only real photo I had on my phone of this project.