Problem description
Date,Locality,District,New Cases,Hospitalizations,Deaths
5/21/2020,Accomack,Eastern Shore,709,40,11
5/21/2020,Albemarle,Thomas Jefferson,142,19,4
5/21/2020,Alleghany,Alleghany,9,4,0
5/21/2020,Amelia,Piedmont,22,7,1
5/21/2020,Amherst,Central Virginia,25,3,0
5/21/2020,Appomattox,Central Virginia,25,1,0
5/21/2020,Arlington,Arlington,1763,346,89
... // skipped down to the next day
5/20/2020,Accomack,Eastern Shore,709,39,11
5/20/2020,Albemarle,Thomas Jefferson,142,18,4
5/20/2020,Alleghany,Alleghany,10,4,0
5/20/2020,Amelia,Piedmont,21,7,1
5/20/2020,Amherst,Central Virginia,25,3,0
5/20/2020,Appomattox,Central Virginia,24,1,0
5/20/2020,Arlington,Arlington,1728,334,81
5/20/2020,Augusta,Central Shenandoah,88,4,1
... // continued
I have data for a US state in a CSV like the one above, and I would like to do some data analysis on it so that I can send the results through a REST API. The analysis consists of various aggregations, such as: total cases across the state by date, total cases for the entire state, total cases grouped by district, total cases for a district by date, total cases for a county by date, etc. In short, all the basic group-bys that one could do with this data.
Now, my problem is figuring out how to properly store this data in Java without a database. I have one working implementation that uses a list of Row objects, where each Row object holds exactly one row of the CSV. Using Java's Stream API, I have been able to filter the list and compute some of these statistics. I then package the statistics into a single Row object or a List<Row> and send it to the API to be serialized to JSON. This has worked OK, but I feel that it is not the best way.
Is there a more object-oriented way to make use of the Date, District, County, and Cases columns?
I was thinking of doing something like this:
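For reference, the flat approach described above might look roughly like the sketch below. The names `Row`, `RowParser`, and `casesFor` are hypothetical, the field names are guesses based on the CSV header, and the parser assumes no field contains a quoted comma.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;

// Hypothetical Row type: one instance per CSV line (field names are assumptions).
final class Row {
    final LocalDate date;
    final String locality;
    final String district;
    final int cases;
    final int hospitalizations;
    final int deaths;

    Row(LocalDate date, String locality, String district,
        int cases, int hospitalizations, int deaths) {
        this.date = date;
        this.locality = locality;
        this.district = district;
        this.cases = cases;
        this.hospitalizations = hospitalizations;
        this.deaths = deaths;
    }
}

final class RowParser {
    // The sample dates look like 5/21/2020: month/day/year without zero padding.
    private static final DateTimeFormatter DATE_FORMAT =
            DateTimeFormatter.ofPattern("M/d/uuuu");

    // Naive comma split; assumes no quoted fields containing commas.
    static Row parseLine(String line) {
        String[] f = line.split(",");
        return new Row(
                LocalDate.parse(f[0].trim(), DATE_FORMAT),
                f[1].trim(),
                f[2].trim(),
                Integer.parseInt(f[3].trim()),
                Integer.parseInt(f[4].trim()),
                Integer.parseInt(f[5].trim()));
    }

    // Parses all lines after the header (index 0), skipping blank lines.
    static List<Row> parse(List<String> lines) {
        List<Row> rows = new ArrayList<>();
        for (int i = 1; i < lines.size(); i++) {
            String line = lines.get(i);
            if (!line.trim().isEmpty()) {
                rows.add(parseLine(line));
            }
        }
        return rows;
    }

    // Filter-then-aggregate with streams: cases for one locality on one date.
    static int casesFor(List<Row> rows, String locality, LocalDate date) {
        return rows.stream()
                .filter(r -> r.locality.equals(locality) && r.date.equals(date))
                .mapToInt(r -> r.cases)
                .sum();
    }
}
```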
class State {
    List<District> districtList;
    String name;
}

class District {
    List<County> countyList;
    String name;
}

class County {
    LocalDate date;
    String name;
    int cases;
    // more stuff
}
Then I would create one State object holding a list of District objects, each of which holds a list of County objects, one per county per date.
Does this seem like overkill? Is there some other clean way to read this dataset into a data structure that makes it easy to aggregate summary information?
The way I'm doing it now works, but I am looking for a better way!
Recommended answer
From your description, your approach seems sound and properly object-oriented. However, absent additional information (e.g. specific aggregations that would dictate otherwise), it seems odd to have multiple "duplicate" County objects in each District object, one per date. For example:
[{"date":"5/21/2020","name":"Accomack"}, {"date":"5/20/2020","name":"Accomack"}]
From an object-oriented point of view, it seems you'd want an additional level of aggregation by date, with each date containing a list of County rows.
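One way to picture that extra level is a District that indexes its county entries by date. This is only a minimal sketch: `CountyCount`, `countiesByDate`, `add`, and `totalOn` are hypothetical names, and a sorted map is just one possible choice of container.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical per-county entry; the date now lives one level up, in District.
final class CountyCount {
    final String name;
    final int cases;

    CountyCount(String name, int cases) {
        this.name = name;
        this.cases = cases;
    }
}

final class District {
    final String name;
    // The date level sits between district and county; TreeMap keeps dates sorted.
    final Map<LocalDate, List<CountyCount>> countiesByDate = new TreeMap<>();

    District(String name) {
        this.name = name;
    }

    // Files a county entry under its date, creating the date bucket on first use.
    void add(LocalDate date, CountyCount county) {
        countiesByDate.computeIfAbsent(date, d -> new ArrayList<>()).add(county);
    }

    // Sums the cases of every county recorded for the given date (0 if none).
    int totalOn(LocalDate date) {
        return countiesByDate
                .getOrDefault(date, Collections.emptyList())
                .stream()
                .mapToInt(c -> c.cases)
                .sum();
    }
}
```

With this shape each county appears once per date bucket rather than as repeated siblings in one flat list, and a per-date district total becomes a lookup followed by a sum.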
One consideration: if your aggregations align better with a database-style approach, I would keep each row from the source data as-is and query the rows directly, filtering and sorting via Stream lambdas.
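As a rough illustration of that database-style option, the group-bys can be expressed with `Collectors.groupingBy` over the flat rows. `CaseRow` and `Aggregations` are hypothetical names, and the row type is trimmed to the fields the two queries need.

```java
import java.time.LocalDate;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical flat row, reduced to the columns used below.
final class CaseRow {
    final LocalDate date;
    final String locality;
    final String district;
    final int cases;

    CaseRow(LocalDate date, String locality, String district, int cases) {
        this.date = date;
        this.locality = locality;
        this.district = district;
        this.cases = cases;
    }
}

final class Aggregations {
    // Statewide total per date, similar to SELECT date, SUM(cases) ... GROUP BY date.
    static Map<LocalDate, Integer> totalByDate(List<CaseRow> rows) {
        return rows.stream().collect(Collectors.groupingBy(
                r -> r.date,
                TreeMap::new, // sorted by date
                Collectors.summingInt(r -> r.cases)));
    }

    // Per-district total per date, i.e. a two-level GROUP BY district, date.
    static Map<String, Map<LocalDate, Integer>> totalByDistrictAndDate(List<CaseRow> rows) {
        return rows.stream().collect(Collectors.groupingBy(
                r -> r.district,
                Collectors.groupingBy(
                        r -> r.date,
                        Collectors.summingInt(r -> r.cases))));
    }
}
```

Both queries group within a single date, so they give sensible results whether the case column holds daily or running counts. The resulting maps can be handed to the REST layer directly, without packing totals back into Row objects.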