Data Analysis and Visualization Assignment
Objective:
The goal of this assignment is to perform an in-depth analysis of a dataset, create meaningful visualizations, and provide insights based on your analysis. This assignment is designed to help you develop skills in exploratory data analysis, data visualization, and descriptive statistics.
Instructions:
Dataset Selection:
Choose a dataset that interests you.
To get you started, I've collected a number of datasets that you may be interested in exploring: https://drive.google.com/drive/folders/1QPozjOiBxfx8iACBNkkFdMkb2cTb504C?usp=drive_link
Alternatively, feel free to browse online repositories such as Kaggle to find a dataset that speaks to something you're interested in: https://www.kaggle.com/datasets
Whichever dataset you choose, it should have:
- A large number of rows.
- A good mix of categorical and numerical data.
- The potential to say something interesting.
Load the dataset into a pandas DataFrame. Perform any necessary data cleaning, such as handling missing values, removing duplicates, and correcting data types.
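The loading and cleaning steps above can be sketched as follows. This is a minimal illustration, not a required recipe: the inline CSV, its column names (city, price, bedrooms, listed), and the choice to fill missing prices with the median are all assumptions made up for the example. With a real dataset you would call pd.read_csv on your own file and pick a cleaning strategy that fits your data.

```python
import io

import pandas as pd

# Hypothetical stand-in for a real file. In practice you would write
# something like pd.read_csv("your_dataset.csv") with your own path.
raw_csv = io.StringIO(
    "city,price,bedrooms,listed\n"
    "Albany,250000,3,2023-01-15\n"
    "Albany,250000,3,2023-01-15\n"
    "Buffalo,,2,2023-02-01\n"
    "Ithaca,310000,4,2023-03-20\n"
)
df = pd.read_csv(raw_csv)

# Remove exact duplicate rows (the second Albany row here).
df = df.drop_duplicates()

# Handle missing values. Filling with the median is one option among many;
# dropping the row with df.dropna() is another.
df["price"] = df["price"].fillna(df["price"].median())

# Correct data types: parse dates and mark text labels as categorical.
df["listed"] = pd.to_datetime(df["listed"])
df["city"] = df["city"].astype("category")

print(df.dtypes)
print(df)
```

Whatever strategy you choose, it helps to say in a text cell why you chose it, since filling and dropping can lead to different conclusions.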
Exploratory Data Analysis:
Use a mix of text and code blocks to provide a summary of the dataset, including basic statistics and data types.
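A summary like the one described above might start from a few standard pandas calls. The small DataFrame below is invented for illustration (the species, weight_kg, and age_years columns are assumptions); the same calls apply to whatever dataset you load.

```python
import pandas as pd

# Hypothetical toy data standing in for your own dataset.
df = pd.DataFrame({
    "species": ["cat", "dog", "dog", "cat", "bird"],
    "weight_kg": [4.1, 20.5, 18.0, 3.8, 0.3],
    "age_years": [2, 5, 3, 7, 1],
})

# Number of rows and columns.
print(df.shape)

# Column names, data types, and non-null counts.
df.info()

# Basic statistics (count, mean, std, min, quartiles, max) for numeric columns.
numeric_summary = df.describe()
print(numeric_summary)

# Frequency of each value in a categorical column.
category_counts = df["species"].value_counts()
print(category_counts)
```

Each output is a natural place for a short text cell explaining what the numbers suggest.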
Create at least two, and ideally three, different visualizations using matplotlib or seaborn to illustrate interesting aspects of the data. Examples of visualizations include histograms, bar plots, scatter plots, box plots, and heatmaps.
Describe any trends, patterns, or anomalies you observe in the data. Discuss any interesting findings or insights.
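As a rough sketch of the visualization step, the example below draws a histogram, a bar plot, and a scatter plot side by side with matplotlib. The data and column names are made up for illustration, and seaborn would work just as well for any of these plots.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy data standing in for your own dataset.
df = pd.DataFrame({
    "species": ["cat", "dog", "dog", "cat", "bird", "dog"],
    "weight_kg": [4.1, 20.5, 18.0, 3.8, 0.3, 25.2],
    "age_years": [2, 5, 3, 7, 1, 9],
})

fig, axes = plt.subplots(1, 3, figsize=(12, 3.5))

# Histogram: distribution of one numeric column.
axes[0].hist(df["weight_kg"], bins=5)
axes[0].set_title("Weight distribution")
axes[0].set_xlabel("weight_kg")

# Bar plot: number of rows in each category.
counts = df["species"].value_counts()
axes[1].bar(counts.index, counts.values)
axes[1].set_title("Rows per species")

# Scatter plot: relationship between two numeric columns.
axes[2].scatter(df["age_years"], df["weight_kg"])
axes[2].set_title("Weight vs. age")
axes[2].set_xlabel("age_years")
axes[2].set_ylabel("weight_kg")

# In a Colab cell the figure is displayed when the cell finishes running.
fig.tight_layout()
```

Titles and axis labels matter here: a plot that a reader can interpret without reading the code makes the accompanying discussion much easier to follow.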
Descriptive Analysis:
Write a detailed description of the dataset. Include information about the context of the data, its features, and any initial observations.
Formulate at least three interesting questions or hypotheses about the dataset.
Attempt to answer the questions or explain how you believe the dataset could answer these questions. If you find one question to be the most interesting, feel free to focus all of your attention and energy on that one.
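Many questions of this kind come down to comparing groups. As one possible sketch, the example below takes the hypothetical question "Do dogs weigh more than cats on average?" and answers it with a grouped summary. The data, column names, and question are all invented for illustration.

```python
import pandas as pd

# Hypothetical toy data standing in for your own dataset.
df = pd.DataFrame({
    "species": ["cat", "dog", "dog", "cat", "dog", "cat"],
    "weight_kg": [4.0, 20.0, 18.0, 5.0, 25.0, 3.0],
})

# Summarize the numeric column separately for each category.
by_species = df.groupby("species")["weight_kg"].agg(["mean", "median", "count"])
print(by_species)

# Difference in average weight between the two groups.
difference = by_species.loc["dog", "mean"] - by_species.loc["cat", "mean"]
print(f"Dogs weigh {difference:.1f} kg more than cats on average.")
```

Including the count alongside the mean is a deliberate choice: a large difference based on a handful of rows is much weaker evidence than the same difference based on thousands.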
Potential Predictive Models:
Discuss potential predictive models that could be built using this dataset. What kind of predictions or classifications could be made? For example, you might consider regression models for continuous targets or classification models for categorical targets.
Prediction (Optional):
If you're interested, attempt to make a prediction from the dataset using the basics we talked about in class.
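One way such a prediction could look is sketched below with scikit-learn. It assumes a simple linear regression on a single feature, which may or may not match what was covered in class. The data is synthetic, and the column names (size_m2, price) are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic data for illustration: price grows with size, plus random noise.
rng = np.random.default_rng(0)
size = rng.uniform(50, 200, 200)
price = 1000 * size + rng.normal(0, 5000, 200)
df = pd.DataFrame({"size_m2": size, "price": price})

X = df[["size_m2"]]  # features must be 2-dimensional (a DataFrame)
y = df["price"]      # continuous target, so this is a regression problem

# Hold out 25% of the rows to check the model on data it has not seen.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LinearRegression()
model.fit(X_train, y_train)

predictions = model.predict(X_test)
mae = mean_absolute_error(y_test, predictions)
print(f"Estimated price per square meter: {model.coef_[0]:.0f}")
print(f"Mean absolute error on held-out data: {mae:.0f}")
```

For a categorical target, the same split-fit-predict pattern applies with a classifier (for example, LogisticRegression) and a classification metric in place of the mean absolute error.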
NLP (Optional):
If your dataset includes text data, consider performing simple NLP such as TF-IDF or sentiment analysis using spaCy and/or nltk.
Summary:
Summarize your findings and reflect on the analysis process. What did you learn from the data? What challenges did you encounter? How would you improve your analysis in the future?
Submission:
To submit your notebook, use Colab's share feature:
- Click "Share" in the top right corner.
- Enter my email address, dp3305@columbia.edu.
- "Send" the notebook to me.
(If you used a non-provided dataset, please include a link to that dataset in the message.)
Grading:
There is no grade for this assignment! This is purely a chance for you to push yourself and an opportunity for me to provide feedback on your work!
Resources:
I encourage you to explore the books and documentation for the tools we've used in class.
- A Byte of Python
- Python Data Science Handbook
- Pandas Documentation
- Matplotlib Documentation
- Seaborn Documentation
- scikit-learn Documentation
- spaCy Documentation
There may be things you want to use that we did not cover or explanations that are better than my own.