COVID-19 Monitor

Practicing Data Engineering

“What are the current cases at my college? How fast are they rising? Is it worse than last month? How do they compare to similar institutions in the state?” These are just some of the questions I found myself musing over as I struggled through whatever “online learning” is. Regardless, I viewed this as an opportunity to finally practice some data engineering and pipeline development.

For this project, I built and ran a live data pipeline, with some simple analysis on top.

The Process

This project evolved over time, but it was fun to start from ideation and end with a fully working pipeline that scraped, organized, stored, and displayed data. I’m blown away that I can get my own dedicated Linux server in the cloud for just $5 a month, and I definitely plan on leveraging these services more in the future!

Overall, the biggest things I learned from this project were cloud computing, better Linux/bash skills, cron, and MySQL data management.

  • Brainstorm Project

    First, I worked to break down my thought process about what I wanted to build. My initial idea was to scrape and store COVID numbers as they were reported at various colleges around the state, and then send PDF reports to those who signed up on a Google Form. I chose a Google Form because I don’t have experience with more complicated web apps (yet) and knew I could access my private Google Forms responses via API (a rough sketch of one way to do that follows this list).

  • Build Out Our Tools

    I built a Python project that consisted of various scrapers, error handling, config files, a requirements.txt, and more. Since I wanted this running 24/7, I bought a small Linode server to host the project. This is really where the learning took place: getting a server, managing accounts, and installing and using MySQL server were all new to me. The final script runs the scraper every 4 hours and uploads the new tallies into the database, and errors are caught and logged so one bad scrape can’t take the whole pipeline down, keeping uptime high. (A sketch of one scrape-and-store cycle, along with the cron entry that drives it, follows this list.)

  • Generate Insights & Graphs

    With the data in a database, it was trivial to access with a simple query. From there I could generate various plots and statistics looking at 5 colleges in my home state of Utah (the plotting sketch below this list shows the gist). While I could have done more advanced analysis, it would have required more data, and like I mentioned previously, what I really set out to learn with this project was developing a pipeline and a rudimentary product end to end.

  • Future Work

    My next possible step? Connect to Twitter and automatically post updated graphics and statistics on a daily basis. I actually already have the Twitter account and API access set up; I’ve just gotten caught up in studying for finals and other projects. (A sketch of what the posting code could look like is below as well.) However, now that the server is running and the pipeline is working beautifully, I’m in a great place to add a Streamlit web app, a Twitter bot, or some other front-end application on top of my process.
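
Before the graphics, here are a few rough sketches of the pieces above, in order. First, the Google Form signups: one common route is to link the form to a Google Sheet and read it with gspread under a service account. This is a minimal sketch of that approach rather than my exact code; the credentials file, sheet title, and column name are all hypothetical placeholders.

```python
# Minimal sketch: read form signups from the form's linked Google Sheet.
# "creds.json", the sheet title, and the column name are placeholders.
import gspread

gc = gspread.service_account(filename="creds.json")  # service-account credentials
sheet = gc.open("COVID Monitor Signups").sheet1      # the form's response sheet

# Each row comes back as a dict keyed by the form's question headers.
for response in sheet.get_all_records():
    print(response["Email Address"])
```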
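
Next, the scrape-and-store cycle. This sketch assumes a dashboard page with the case total in a known element and a MySQL table cases(college, scraped_at, total_cases); the URL, CSS selector, schema, and credentials are hypothetical, and my real scrapers handle several sites with messier parsing.

```python
# Minimal sketch of one scrape-and-store cycle. The URL, selector, table
# schema, and credentials are all hypothetical placeholders.
import datetime

import mysql.connector
import requests
from bs4 import BeautifulSoup

def scrape_count(url: str, selector: str) -> int:
    """Fetch a college's dashboard and parse the reported case total."""
    html = requests.get(url, timeout=30).text
    tag = BeautifulSoup(html, "html.parser").select_one(selector)
    return int(tag.get_text(strip=True).replace(",", ""))

def store(college: str, count: int) -> None:
    """Insert one observation into the cases table."""
    conn = mysql.connector.connect(
        host="localhost", user="covid", password="***", database="covid_monitor"
    )
    try:
        cur = conn.cursor()
        cur.execute(
            "INSERT INTO cases (college, scraped_at, total_cases) VALUES (%s, %s, %s)",
            (college, datetime.datetime.utcnow(), count),
        )
        conn.commit()
    finally:
        conn.close()

if __name__ == "__main__":
    try:
        store("Example University",
              scrape_count("https://example.edu/covid", "#total-cases"))
    except Exception as exc:
        # Log the failure and exit nonzero; one bad run shouldn't need a human.
        print(f"scrape failed: {exc}")
        raise SystemExit(1)
```

Cron drives the every-4-hours schedule. The crontab entry looks roughly like this (paths are placeholders):

```
0 */4 * * * /usr/bin/python3 /home/covid/monitor/scrape.py >> /home/covid/monitor/cron.log 2>&1
```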
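
Then the insights step: with everything in one table, a query plus pandas and matplotlib gets you a time series per college. Same hypothetical schema and credentials as above.

```python
# Minimal sketch: pull the series back out of MySQL and plot one line per
# college. Uses the same hypothetical schema and credentials as above.
import matplotlib.pyplot as plt
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("mysql+mysqlconnector://covid:***@localhost/covid_monitor")
df = pd.read_sql(
    "SELECT college, scraped_at, total_cases FROM cases ORDER BY scraped_at",
    engine,
)

# One cumulative-cases line per college.
for college, grp in df.groupby("college"):
    plt.plot(grp["scraped_at"], grp["total_cases"], label=college)
plt.legend()
plt.xlabel("Date")
plt.ylabel("Reported cases")
plt.title("Reported COVID-19 cases at Utah colleges")
plt.savefig("cases.png", dpi=150)
```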
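
Finally, the future Twitter bot. It isn’t live yet, so this is only a sketch of tweepy’s classic v1.1 flow with placeholder keys; a daily cron entry like the one above would trigger it.

```python
# Minimal sketch of the planned daily post; all four keys are placeholders.
import tweepy

auth = tweepy.OAuth1UserHandler(
    "API_KEY", "API_SECRET", "ACCESS_TOKEN", "ACCESS_SECRET"
)
api = tweepy.API(auth)

# Upload the latest graphic, then attach it to the day's status update.
media = api.media_upload("cases.png")
api.update_status(
    status="Daily COVID-19 update for Utah colleges",
    media_ids=[media.media_id],
)
```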

Interesting Graphics

Like I mentioned above, I did this project to get better at data engineering and developing pipelines, and I knew I wouldn’t be doing any groundbreaking analysis. However, I could still generate some interesting visuals, and I actually got the attention of a few professors in the state, as well as my former employer, Intermountain Healthcare. Both were given access to my repository for this project for research into the spread of COVID-19 on college campuses in the state.

Graphic 1

Graphic 2

Failure & Change (And more failure)

During this project, I failed a lot. I struggled with adding Linux accounts, managing environments properly, and getting cron to run reliably. While I had done all of these things to various degrees before, I knew I needed to get better at developing full-stack applications.

At one point I also accidentally wiped all my data, and I can honestly say I was heartbroken. While practicing with MySQL, I installed a test database that pushed me over my storage limit, and because of how I had structured my code, it deleted my COVID data to compensate. Luckily I had a backup, but it was missing a few weeks’ worth of data.
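
The safeguard to have from day one is an automated dump. A nightly crontab entry along these lines (paths and schema are the hypothetical ones from the sketches above, with MySQL credentials read from a ~/.my.cnf file) would cap the worst-case loss at a single day of scrapes:

```
# Nightly dump of the cases table; note that % must be escaped in crontab.
0 3 * * * mysqldump covid_monitor cases > /home/covid/backups/cases_$(date +\%F).sql
```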

Ultimately, I had to accept that I make mistakes. I fail, a lot. But in the end I have to remember that this is why I’m doing this on my own time, on my own projects, and on a $5-a-month commitment. Also, as weird as it sounds, I just like making things work. I actually slept better the first few days knowing that, 24/7, my code was scraping and storing data on a server with no intervention. Definitely a cool feeling!

For fun, I also include an early whiteboard sketch of how I initially saw the pipeline working. Obviously it changed over time, but it was a good place to start. Also, it’s the only real photo of this project I had on my phone!