Hi everyone.
I'm currently working on my final project for a Data Science degree and after a month of literature review, exploratory analysis and model testing, I'm not sure if the questions I set out to answer are suitable for the data I have.
This is a very broad question I'm asking here, as it's more guidance than anything else, so if this is not the place to ask, I would appreciate it if you could redirect me to the right place.
You can find the data sets and code on my GitHub here. The code is messy but working; I only picked up programming last year.
The data
Indoor Air Quality data recorded hourly by 4 sensors (Kitchen, Bedroom, Living Room, Bathroom) for 7 days in each of 3 houses. For 6 of those days, each sensor was in a different room, and on the last one all sensors were placed together so we could see how far apart their signals were and account for that. This gives me 9 continuous variables: Temperature, Relative Humidity, CO, CO2, TVOC, PM2.5, NO2, Ozone and Air Pressure.
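For illustration only, a minimal sketch of how the day with all sensors together could be used to correct for the spread between them. The function names and the toy readings are made up for the example; they are not taken from the data set or from the code in the repo.

```python
from statistics import mean

def sensor_offsets(colocated):
    """Estimate each sensor's offset from the hour-by-hour mean of all sensors.

    colocated: dict mapping sensor name -> list of readings taken side by side,
    one reading per hour, all lists the same length.
    """
    n_hours = len(next(iter(colocated.values())))
    # Reference signal: mean across sensors at each hour.
    reference = [
        mean(readings[h] for readings in colocated.values())
        for h in range(n_hours)
    ]
    # Offset: average difference between a sensor and the reference.
    return {
        name: mean(r - ref for r, ref in zip(readings, reference))
        for name, readings in colocated.items()
    }

def apply_offset(readings, offset):
    """Subtract a sensor's estimated offset from its readings."""
    return [r - offset for r in readings]

# Toy CO2 readings (ppm) for three co-located hours; invented values.
colocated = {
    "kitchen": [410.0, 420.0, 430.0],
    "bedroom": [400.0, 410.0, 420.0],
    "living_room": [405.0, 415.0, 425.0],
    "bathroom": [405.0, 415.0, 425.0],
}
offsets = sensor_offsets(colocated)
corrected_kitchen = apply_offset(colocated["kitchen"], offsets["kitchen"])
```

This assumes the spread between sensors is a constant additive offset, which may not hold for every pollutant.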
I then got 3 manually filled questionnaires on Occupant Activity, one for each house, covering things such as "Door open/closed", "Window open/closed", "Heating On/off", "Frying", "Boiling", "Hoovering", "Mopping", etc.
These questionnaires were a mess: the logs were missing a lot of data and many of the missing values had to be imputed. The data is reported in binary format, such as "Did Activity X occur at hour Y?" - Yes (1) / No (0).
For this project I've chosen to predict a sensor variable (in this case CO2) from the activities.
Models
Just to get a feel for the data, I ran Linear Regression, Decision Tree and Random Forest models on individual rooms of each house, with two sets of predictors: Occupant Activity only, and Occupant Activity plus the other sensor variables. The results are atrocious in every case. Cross-validation shows the models' performance to be all over the place, and checking features for statistical significance gives me different significant features in every room of every house; it's like I'm playing feature roulette. The problem with features such as Mopping, Frying, Boiling and Hoovering is that, by their nature, they have far more "0"s than "1"s, so one or two "1"s in the wrong place are enough to produce a misleading correlation.
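To make the "0"s versus "1"s problem concrete, here is a rough sketch of a check for activity features that are too rare to trust. The feature names, the toy log and the cut-off of 10 positive hours are all assumptions for illustration, not values from my data.

```python
def positive_counts(rows, features):
    """Count the hours in which each binary activity was logged as 1."""
    return {f: sum(row[f] for row in rows) for f in features}

def too_rare(rows, features, min_positives=10):
    """Return the features with fewer than min_positives hours logged as 1.

    min_positives is an arbitrary cut-off chosen for this sketch.
    """
    counts = positive_counts(rows, features)
    return sorted(f for f, c in counts.items() if c < min_positives)

# Toy one-day log (24 hourly rows); invented values.
features = ["window_open", "frying", "mopping"]
rows = [
    {
        "window_open": 1 if 8 <= hour < 20 else 0,  # open during the day
        "frying": 1 if hour == 18 else 0,           # a single hour of frying
        "mopping": 0,                               # never logged
    }
    for hour in range(24)
]

counts = positive_counts(rows, features)
rare = too_rare(rows, features)
```

Features flagged this way could be dropped or merged into a broader category (for example a single "cooking" flag) before fitting anything.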
As you can probably tell, I'm still a Data Scientist in training, having only built a few models in the past and being fairly new to programming (1 year of experience).
What I'm looking for
I suppose that more than anything, I'm asking for guidance on whether pursuing this as a Regression problem is feasible or not.
I'm very short on time, but if this won't work, I can look into alternatives. For instance, air pollutants have safety thresholds. I could create a class feature for whether the value is over the threshold and turn this into a classification problem, or even a clustering one to identify the room based on activities and air pollutants.
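A minimal sketch of the threshold idea, just to show the shape of it. The 1000 ppm cut-off is a placeholder chosen for illustration, not an official limit, and the readings are invented.

```python
# Placeholder cut-off for illustration only; NOT an official safety limit.
CO2_THRESHOLD_PPM = 1000.0

def label_exceedances(values, threshold):
    """Turn continuous readings into a binary class: 1 if above threshold, else 0."""
    return [1 if v > threshold else 0 for v in values]

def positive_rate(labels):
    """Share of hours labelled 1; a quick check of class balance."""
    return sum(labels) / len(labels) if labels else 0.0

# Toy hourly CO2 readings (ppm); invented values.
co2 = [450.0, 800.0, 1200.0, 1500.0, 600.0]
labels = label_exceedances(co2, CO2_THRESHOLD_PPM)
rate = positive_rate(labels)
```

Checking the positive rate first would show whether the classes are balanced enough to model, since a threshold that is almost never exceeded would bring back the same rare-event problem as the activity features.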
The bottom line is that I have a 12,500-word paper to deliver in a month and I've been at this for a month already with nothing to show for it, so I'm hoping someone with more experience under their belt can tell me if I'm chasing a dead end. Any guidance would be very much appreciated; I've run out of ideas here.
Thanks,
Tom